# Prompt Testing

Skill for testing, comparing, and measuring prompt performance.
## Documentation

- `metrics.md` - Definitions of the performance metrics
- `methodology.md` - A/B testing protocol
## Testing Workflow

```
1. DEFINE
   ├── Test objective
   ├── Metrics to measure
   └── Success criteria

2. PREPARE
   ├── Variants A and B
   ├── Test dataset
   └── Baseline (if one exists)

3. EXECUTE
   ├── Run on dataset
   ├── Collect results
   └── Document observations

4. ANALYZE
   ├── Calculate metrics
   ├── Compare variants
   └── Identify patterns

5. DECIDE
   ├── Recommendation
   ├── Statistical confidence
   └── Next iterations
```
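The EXECUTE and ANALYZE steps reduce to a loop over the dataset. A minimal sketch in Python, where `run_prompt` is a caller-supplied model-call helper (an assumption, not part of this skill):

```python
import json
from typing import Callable

def run_ab_test(prompt_a: str, prompt_b: str, dataset_path: str,
                run_prompt: Callable[[str, str], str]) -> dict:
    """Run both variants over every case and collect pass/fail results.

    run_prompt(prompt, case_input) -> model output text; how it calls
    the model is left to the caller (hypothetical helper).
    """
    with open(dataset_path) as f:
        dataset = json.load(f)

    results = {"A": [], "B": []}
    for case in dataset["cases"]:
        for label, prompt in (("A", prompt_a), ("B", prompt_b)):
            output = run_prompt(prompt, case["input"])
            results[label].append({
                "id": case["id"],
                "type": case["type"],
                "correct": output.strip() == case["expected"].strip(),
            })
    return results
```

Exact string comparison is the simplest grader; swap in fuzzy or rubric-based grading for open-ended outputs.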
## Performance Metrics

### Quality
| Metric | Description | Calculation |
|---|---|---|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Adherence to the requested format | Compliant / Total |
| Consistency | Stability across repeated runs | 1 - Variance of scores |
| Relevance | Fit to the user's need | Mean rating (1-5) |
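A sketch of how the Quality rows can be computed from collected results; the `correct`, `compliant`, and `relevance` fields are assumptions about how each result was scored:

```python
from statistics import pvariance

def quality_metrics(results: list[dict]) -> dict:
    """Compute the Quality table from per-case results (fields are assumed)."""
    n = len(results)
    ratings = [r["relevance"] for r in results]  # 1-5 ratings
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "compliance": sum(r["compliant"] for r in results) / n,
        # Consistency = 1 - variance, computed on ratings rescaled to [0, 1]
        "consistency": 1 - pvariance([(r - 1) / 4 for r in ratings]),
        "relevance": sum(ratings) / n,
    }
```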
### Efficiency
| Metric | Description | Calculation |
|---|---|---|
| Input tokens | Prompt size | Token count |
| Output tokens | Response size | Token count |
| Latency | Response time | Milliseconds per request |
| Cost | Price per request | Tokens × per-token price |
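Worked example of the Cost row; the per-token prices below are illustrative, not actual rates:

```python
# Illustrative prices in dollars per token (assumed rates, not real ones)
PRICE_IN = 3.00 / 1_000_000    # e.g. $3 per million input tokens
PRICE_OUT = 15.00 / 1_000_000  # e.g. $15 per million output tokens

def request_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost = Tokens x Price, applied separately to input and output."""
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

# 1,200 input tokens + 400 output tokens -> $0.0096 per request
```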
### Robustness
| Metric | Description | Calculation |
|---|---|---|
| Edge Cases | Handling of atypical inputs | Passed / Total |
| Jailbreak Resistance | Resistance to instruction bypass | Blocked / Attempts |
| Error Recovery | Recovery from malformed input or failures | Recovered / Errors |
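These ratios are most useful broken out per case type, which also feeds the "Detail by Case Type" table in the report template below. A minimal grouping sketch, reusing the assumed `type` and `correct` result fields:

```python
from collections import defaultdict

def pass_rate_by_type(results: list[dict]) -> dict[str, float]:
    """Passed / Total per case type (standard, edge_case, ...)."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in results:
        total[r["type"]] += 1
        passed[r["type"]] += int(r["correct"])
    return {t: passed[t] / total[t] for t in total}
```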
## Test Format

### Test Dataset
```json
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
```
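A small loader that validates this structure before a run; the required keys mirror the example above and are an assumption about the full schema:

```python
import json

REQUIRED_KEYS = {"id", "type", "input", "expected", "tags"}

def load_dataset(path: str) -> dict:
    """Load a test dataset and fail fast on malformed cases."""
    with open(path) as f:
        data = json.load(f)
    for case in data.get("cases", []):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
    return data
```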
### Test Report
```markdown
# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A
**Confidence**: High / Medium / Low
**Justification**:
[Explanation of recommendation]

## Next Steps
1. [Action 1]
2. [Action 2]
```
## Commands

```bash
# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002
```
## Decision Criteria

**When to adopt variant B?**

```
IF:
  - Accuracy B >= Accuracy A
  - AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  - AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  - AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
```
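The same rule as a checkable function, a direct translation of the pseudocode above (accuracies as fractions, token counts as integers):

```python
def decide(acc_a: float, acc_b: float, tok_a: int, tok_b: int,
           edge_regression: bool) -> str:
    """Apply the adoption rule to one A/B comparison."""
    improvement = acc_b - acc_a
    if (acc_b >= acc_a
            and (tok_b <= tok_a * 1.1 or improvement > 0.05)
            and not edge_regression):
        return "Adopt B"
    if improvement > 0.10 and tok_b < tok_a * 1.2:
        return "Consider B (acceptable trade-off)"
    return "Keep A or iterate"
```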
## Best Practices

- Use at least 20 test cases for meaningful results (a significance sketch follows this list)
- Include edge cases (15-20% of the dataset)
- Run each variant multiple times to measure consistency
- Document hypotheses before testing
- Version the prompts under test
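To put a number on "statistical confidence", one common option is a two-proportion z-test on the accuracy counts. A sketch (the use of scipy here is an assumption, not a dependency of this skill):

```python
from math import sqrt
from scipy.stats import norm

def accuracy_p_value(correct_a: int, correct_b: int, n: int) -> float:
    """Two-sided p-value that A and B differ in accuracy over n cases each."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # degenerate: identical all-pass or all-fail results
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))
```

With only ~20 cases this test is underpowered, which is why the minimum above is a floor, not a target.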