
Evaluation & Monitoring

Evaluation measures how well an agent performs (correctness, helpfulness, safety), usually against a test dataset. Monitoring tracks how the system behaves in a live environment (latency, errors, cost). Both are essential to the lifecycle management of AI systems.

When to Use

  • CI/CD: Rejecting code changes if they drop accuracy below a threshold.
  • A/B Testing: Comparing Prompt A vs. Prompt B to see which users prefer.
  • Cost Auditing: Understanding which agents or tools are driving up the bill.
  • Drift Detection: Noticing if the model starts hallucinating more often on new data.
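
The CI/CD case above can be sketched as a small gate that fails the build when accuracy drops below a threshold. This is a minimal sketch: `ci_gate` and the 0.90 threshold are illustrative, and in a real pipeline the accuracy would come from an eval run rather than a command-line argument.

```python
import sys

ACCURACY_THRESHOLD = 0.90  # illustrative gate; tune per project


def ci_gate(accuracy: float, threshold: float = ACCURACY_THRESHOLD) -> int:
    """Return a process exit code: 0 passes the gate, 1 fails it."""
    if accuracy < threshold:
        print(f"FAIL: accuracy {accuracy:.2%} is below threshold {threshold:.2%}")
        return 1
    print(f"PASS: accuracy {accuracy:.2%}")
    return 0


if __name__ == "__main__":
    # e.g. `python gate.py 0.93` as a CI step; a non-zero exit fails the job
    if len(sys.argv) > 1:
        sys.exit(ci_gate(float(sys.argv[1])))
```

Returning a non-zero exit code is what lets CI systems reject the change automatically.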

Use Cases

  • LLM-as-a-Judge: Using GPT-4 to grade the answers of a smaller model.
  • Latency Tracking: Measuring the time-to-first-token (TTFT) and total generation time.
  • Topic Clustering: Analyzing user queries to see what topics are trending or failing.
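
The latency-tracking case can be sketched as a wrapper around any streaming token iterator. This is an assumption-laden sketch, not a vendor API: `stream` stands in for whatever iterable your LLM client yields tokens from.

```python
import time


def track_latency(stream):
    """Measure time-to-first-token (TTFT) and total generation time.

    `stream` is any iterator yielding tokens as they are generated
    (a hypothetical streaming LLM response).
    """
    start = time.monotonic()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        tokens.append(token)
    total = time.monotonic() - start
    return {"ttft": ttft, "total": total, "tokens": tokens}
```

TTFT captures perceived responsiveness, while total time drives throughput and cost, so it is worth recording both per request.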

Implementation Pattern

def evaluate_agent(agent, test_set):
    """Return a score in [0, 1] averaged over the test set."""
    score = 0.0
    total = len(test_set)

    for case in test_set:
        # Run the agent on the test input
        prediction = agent.run(case.input)

        # First try a cheap exact or fuzzy match against the golden answer
        if is_correct(prediction, case.expected):
            score += 1
        else:
            # Fall back to semantic evaluation with an LLM judge,
            # assumed here to return a partial-credit score in [0, 1]
            judge_score = llm_judge.evaluate(
                prediction,
                case.expected,
            )
            score += judge_score

    return score / total
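A minimal, runnable sketch of the pattern, stubbing out the agent, matcher, and judge. All names here (`EchoAgent`, `StubJudge`, `Case`) are illustrative stand-ins so the loop can be exercised end to end; a real judge would prompt an LLM instead of returning a fixed partial credit.

```python
from dataclasses import dataclass


@dataclass
class Case:
    input: str
    expected: str


class EchoAgent:
    """Trivial stand-in for a real agent: returns its input unchanged."""
    def run(self, text: str) -> str:
        return text


def is_correct(prediction: str, expected: str) -> bool:
    # Simple exact match, normalized for case and whitespace
    return prediction.strip().lower() == expected.strip().lower()


class StubJudge:
    """Stand-in for an LLM judge; always grants 0.5 partial credit."""
    def evaluate(self, prediction: str, expected: str) -> float:
        return 0.5


llm_judge = StubJudge()


def evaluate_agent(agent, test_set):
    score = 0.0
    for case in test_set:
        prediction = agent.run(case.input)
        if is_correct(prediction, case.expected):
            score += 1
        else:
            score += llm_judge.evaluate(prediction, case.expected)
    return score / len(test_set)


tests = [Case("Paris", "paris"), Case("Rome", "Madrid")]
print(evaluate_agent(EchoAgent(), tests))  # 0.75: one exact match, one judged at 0.5
```

Swapping `EchoAgent` and `StubJudge` for a real agent and a judge backed by a stronger model turns this into the CI evaluation described above.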