ai-evals
Systematic evaluation framework for AI products using practitioner-driven methodologies.
- Guides users through understanding what "good" looks like, designing rubrics and test cases, and implementing scoring criteria aligned with actual user needs
- Emphasizes manual review and error analysis as prerequisites to building meaningful evals, with structured workflows for clustering failure patterns
- Flags common pitfalls including vague criteria, LLM-as-judge without validation, and Likert scales; recommends binary Pass/Fail decisions instead
- Positions evals as core product specifications rather than optional quality checks, essential for product builders and non-ML roles alike
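One way to act on the "LLM-as-judge without validation" pitfall is to compare the judge's binary verdicts against human Pass/Fail labels on the same outputs before trusting it. The following is a minimal sketch under stated assumptions: the function name `judge_agreement`, the metric names, and the sample labels are all hypothetical illustrations and are not part of the skill itself.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare an automated judge's Pass/Fail verdicts with human labels.

    True means Pass, False means Fail. Both lists must cover the same
    outputs in the same order.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    pairs = list(zip(human, judge))
    # Judge verdicts, split by what the human reviewer decided.
    on_human_pass = [j for h, j in pairs if h]
    on_human_fail = [j for h, j in pairs if not h]

    return {
        # Share of outputs where judge and human gave the same verdict.
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        # Of the outputs humans passed, how many the judge also passed.
        "pass_recall": (sum(on_human_pass) / len(on_human_pass)
                        if on_human_pass else float("nan")),
        # Of the outputs humans failed, how many the judge also failed.
        "fail_recall": (sum(not j for j in on_human_fail) / len(on_human_fail)
                        if on_human_fail else float("nan")),
    }


# Illustrative labels only (made up for this sketch, not real results).
human = [True, True, False, False, True]
judge = [True, False, False, True, True]
stats = judge_agreement(human, judge)
```

Reporting pass and fail recall separately matters because a judge that passes almost everything can still show high overall agreement when most outputs are good.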
AI Evals
Help the user create systematic evaluations for AI products using insights from AI practitioners.
How to Help
When the user asks for help with AI evals:
- Understand what they're evaluating: Ask what AI feature or model they're testing and what "good" looks like
- Help design the eval approach: Suggest rubrics, test cases, and measurement methods
- Guide implementation: Help them think through edge cases, scoring criteria, and iteration cycles
- Connect to product requirements: Ensure evals align with actual user needs, not just technical metrics
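The steps above can be sketched as a small harness in which each test case carries named binary checks, so every criterion is an explicit Pass/Fail decision. This is a minimal sketch, not a prescribed implementation: `EvalCase`, `run_evals`, `pass_rate`, the stub `fake_model`, and the sample prompts and checks are hypothetical names and data introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    # Each check is one binary criterion: True means Pass, False means Fail.
    checks: dict[str, Callable[[str], bool]]


def run_evals(model: Callable[[str], str],
              cases: list[EvalCase]) -> dict[str, dict[str, bool]]:
    """Run every case through the model and record each check's verdict."""
    results: dict[str, dict[str, bool]] = {}
    for case in cases:
        output = model(case.prompt)
        results[case.name] = {
            label: check(output) for label, check in case.checks.items()
        }
    return results


def pass_rate(results: dict[str, dict[str, bool]]) -> float:
    """A case passes only if all of its checks pass."""
    if not results:
        return 0.0
    passed = sum(all(verdicts.values()) for verdicts in results.values())
    return passed / len(results)


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption for this sketch).
    return "Refunds are available within 30 days of purchase."


cases = [
    EvalCase(
        name="refund_policy",
        prompt="What is your refund policy?",
        checks={
            "mentions_time_window": lambda out: "30 days" in out,
            "no_apology_filler": lambda out: "sorry" not in out.lower(),
        },
    ),
    EvalCase(
        name="empty_input",  # edge case: the user sends nothing
        prompt="",
        checks={"non_empty_reply": lambda out: bool(out.strip())},
    ),
]

results = run_evals(fake_model, cases)
```

Keeping per-check verdicts, rather than a single score per case, makes it easier to see which criterion fails most often when iterating on the rubric.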
Core Principles
Evals are the new PRD
Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products. They are core specifications, not optional quality checks.
Evals are a core product skill