AI System Evaluation
Evaluating AI systems end-to-end.
Evaluation Criteria
1. Domain-Specific Capability
| Domain | Benchmarks |
|---|---|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |
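For benchmarks with a single reference answer (GSM-8K final answers, multiple-choice items in MMLU or ARC), scoring often reduces to exact-match accuracy. The sketch below is one minimal way to compute it; the `normalize` rule and both function names are assumptions made for illustration, not part of any benchmark's official harness.

```python
def normalize(answer: str) -> str:
    """Trim, lowercase, and drop thousands separators so '1,200 ' matches '1200'."""
    return answer.strip().lower().replace(",", "")


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical model outputs and gold answers.
    print(exact_match_accuracy(["1,200", "B", "7"], ["1200", "b", "8"]))
```

Official harnesses usually apply stricter answer extraction, so scores from a sketch like this are only comparable to each other, not to published leaderboard numbers.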
2. Generation Quality
| Criterion | Measurement |
|---|---|
| Factual Consistency | NLI, SAFE, SelfCheckGPT |
| Coherence | AI judge rubric |
| Relevance | Semantic similarity |
| Fluency | Perplexity |
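Two of the measurements in the table have simple closed forms. The sketch below computes perplexity from per-token natural-log probabilities and cosine similarity between two embedding vectors. It assumes you already have log-probabilities and embeddings from your own model or provider; how those are obtained is outside this example, and the function names are illustrative.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # A model that assigns probability 0.5 to every token has perplexity of about 2.
    print(perplexity([math.log(0.5)] * 4))
    # Orthogonal toy vectors have similarity 0.
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Lower perplexity indicates text the scoring model finds more predictable, which is only a proxy for fluency; similarity scores depend heavily on the embedding model used.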
3. Cost & Latency
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds per token)
    throughput: float  # Output tokens per second

    def cost(self, input_tokens, output_tokens, prices):
        # prices maps "input" and "output" to a per-token price
        return input_tokens * prices["input"] + output_tokens * prices["output"]
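The three latency figures above can be derived from timestamps recorded while streaming a response. The sketch below assumes you log the request start time and the arrival time of each output token; `summarize_stream` is a hypothetical helper, and TPOT is taken here as the mean gap between consecutive tokens after the first, which is one common convention among several.

```python
def summarize_stream(request_time: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, TPOT, and throughput from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft = token_times[0] - request_time
    # Mean inter-token gap over the decode phase; undefined for a single token, so use 0.0.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_time
    throughput = n / total if total > 0 else 0.0
    return {"ttft": ttft, "tpot": tpot, "throughput": throughput}


if __name__ == "__main__":
    # Hypothetical stream: first token after 0.5 s, then one token every 0.25 s.
    print(summarize_stream(0.0, [0.5, 0.75, 1.0]))
```

In practice these values should be aggregated over many requests (for example as medians and high percentiles), since a single stream says little about tail latency.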
Model Selection Workflow
1. Define Requirements
├── Task type
├── Quality threshold
├── Latency requirements (e.g., <2s TTFT)
├── Cost budget
└── Deployment constraints
2. Filter Options
├── API vs Self-hosted
├── Open source vs Proprietary
└── Size constraints
3. Benchmark on Your Data
├── Create eval dataset (100+ examples)
├── Run experiments
└── Analyze results
4. Make Decision
└── Balance quality, cost, latency
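The workflow above can be sketched as a filter-then-rank step over measured candidates. Everything in this example is an assumption for illustration: the `Candidate` fields, the thresholds, and the choice to pick the cheapest model that clears every requirement. Other tie-breaking rules, such as maximizing quality within budget, are equally reasonable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    quality: float               # score on your own eval set, in [0, 1]
    ttft_seconds: float          # measured time to first token
    cost_per_1k_requests: float  # measured or estimated cost


def select_model(
    candidates: list[Candidate],
    min_quality: float,
    max_ttft: float,
    max_cost: float,
) -> Optional[Candidate]:
    """Return the cheapest candidate meeting all requirements, or None if none qualify."""
    viable = [
        c for c in candidates
        if c.quality >= min_quality
        and c.ttft_seconds <= max_ttft
        and c.cost_per_1k_requests <= max_cost
    ]
    if not viable:
        return None
    # Cheapest first; among equally priced models prefer higher quality.
    return min(viable, key=lambda c: (c.cost_per_1k_requests, -c.quality))


if __name__ == "__main__":
    # Hypothetical measurements from step 3.
    pool = [
        Candidate("model-a", 0.90, 1.0, 30.0),
        Candidate("model-b", 0.85, 0.8, 10.0),
        Candidate("model-c", 0.60, 0.3, 2.0),
    ]
    choice = select_model(pool, min_quality=0.8, max_ttft=2.0, max_cost=50.0)
    print(choice.name if choice else "no model meets the requirements")
```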
Build vs Buy
| Factor | API | Self-Host |
|---|---|---|
| Data Privacy | Less control | Full control |
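| Model Quality | Access to the strongest proprietary models | Open-weight models, often slightly behind |
| Cost at Scale | Per-token fees grow with volume | Infrastructure cost amortized over volume |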
| Customization | Limited | Full control |
| Maintenance | Minimal (provider-managed) | Significant |
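The "Cost at Scale" row can be made concrete with a break-even estimate: the monthly token volume at which a fixed self-hosting cost equals the API bill. This is a rough sketch under simplifying assumptions (one blended per-million-token price, a flat monthly infrastructure cost, no engineering time); the function name and all numbers are hypothetical.

```python
from typing import Optional


def break_even_tokens(
    fixed_monthly_cost: float,
    api_price_per_million: float,
    self_host_price_per_million: float = 0.0,
) -> Optional[float]:
    """Monthly tokens at which self-hosting matches API cost; None if it never does."""
    saving_per_million = api_price_per_million - self_host_price_per_million
    if saving_per_million <= 0:
        # Self-hosting costs at least as much per token, so it never breaks even.
        return None
    return fixed_monthly_cost / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical: $5,000/month of infrastructure vs. $10 per million API tokens.
    print(break_even_tokens(5000.0, 10.0))
```

Below the break-even volume the API is cheaper; above it, self-hosting starts to pay off, provided the maintenance burden in the table is accounted for separately.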
Public Benchmarks
| Benchmark | Focus |
|---|---|
| MMLU | Knowledge (57 subjects) |
| HumanEval | Code generation |
| GSM-8K | Math reasoning |
| TruthfulQA | Factuality |
| MT-Bench | Multi-turn chat |
Caution: Public benchmark scores can be gamed, and data contamination (benchmark items leaking into training data) is common. Always evaluate on your own data.
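One simple way to probe for contamination is to measure n-gram overlap between an evaluation example and text you suspect the model was trained on. The sketch below is a minimal illustration using whitespace tokenization; the function names and the default `n=8` are assumptions, and a low overlap does not prove the example is clean.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in text, using whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_example: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in corpus_text."""
    example_grams = ngrams(eval_example, n)
    if not example_grams:
        # Example is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(example_grams & ngrams(corpus_text, n)) / len(example_grams)


if __name__ == "__main__":
    # Toy strings; a real check would scan a large corpus sample.
    example = "what is the sum of two and three"
    corpus = "quiz: what is the sum of two and three ? answer: five"
    print(overlap_ratio(example, corpus, n=4))
```

Examples with a high ratio are candidates for removal from the eval set or for replacement with freshly written items.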
Best Practices
- Test on domain-specific data
- Measure both quality and cost
- Consider latency requirements
- Plan for fallback models
- Re-evaluate periodically
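Planning for fallback models can be as simple as trying an ordered list of callables until one succeeds. The sketch below assumes each model is wrapped as a function from prompt to text that raises on failure; `generate_with_fallback` and the stand-in models are hypothetical and not tied to any provider SDK.

```python
from typing import Callable, Sequence


def generate_with_fallback(prompt: str, models: Sequence[Callable[[str], str]]) -> str:
    """Try each model in order and return the first successful response."""
    errors: list[Exception] = []
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the next model
            errors.append(exc)
    raise RuntimeError(f"all {len(models)} models failed: {errors}")


if __name__ == "__main__":
    def primary(prompt: str) -> str:
        # Stand-in for a model call that times out.
        raise TimeoutError("primary model timed out")

    def backup(prompt: str) -> str:
        # Stand-in for a cheaper or self-hosted fallback.
        return f"backup answer to: {prompt}"

    print(generate_with_fallback("hello", [primary, backup]))
```

Fallback models should be included in the periodic re-evaluation as well, since their quality on your data may differ noticeably from the primary model's.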
Repository: doanchienthangdev/omgkit