testing-validator
Testing Validator
Overview
testing-validator provides comprehensive functional testing for Claude Code skills, validating that skills actually work correctly in practice through systematic testing operations.
Purpose: Functional validation - ensure skills work correctly, not just look good
The 5 Testing Operations:
- Functional Testing - Core skill functionality works as intended
- Example Validation - All code/command examples execute successfully
- Integration Testing - Skills work correctly with dependencies and compositions
- Regression Testing - Updates don't break existing functionality
- Edge Case Testing - Handles unusual scenarios and boundary conditions
Complement to review-multi:
- review-multi: Quality assessment (structure, content, patterns, usability) - "Is it good?"
- testing-validator: Functional validation (does it work, examples execute, integrations function) - "Does it work?"
- Together: Complete validation (quality + functionality)
Key Benefits:
- Automated example execution (catch broken examples)
- Integration validation (ensure skills compose correctly)
- Regression prevention (detect breaks from updates)
- Edge case coverage (handle unusual scenarios)
- Systematic testing (consistent, repeatable)
When to Use
Use testing-validator when:
- Pre-Deployment Testing - Validate functionality before release
- Example Validation - Ensure all examples execute correctly
- Integration Validation - Test workflow skills and dependencies
- Post-Update Testing - Regression testing after changes
- Comprehensive QA - Combined with review-multi for complete validation
- CI/CD Integration - Automated testing in pipelines
- Edge Case Validation - Test boundary conditions and unusual scenarios
- Functional Certification - Certify skills work correctly in practice
Prerequisites
- Skill to test
- Ability to execute examples (appropriate environment)
- Time allocation:
- Quick Check: 15-30 minutes
- Single Operation: 20-90 minutes
- Comprehensive Testing: 2-4 hours
Operations
Operation 1: Functional Testing
Purpose: Validate core skill functionality works as intended
When to Use This Operation:
- Testing if skill achieves stated purpose
- Validating core functionality
- Checking if instructions lead to successful outcomes
- Pre-deployment functional validation
Automation Level: 30% automated (script checks), 70% manual (scenario execution)
Process:
1. Select Test Scenarios
   - Choose 2-3 scenarios from "When to Use" section
   - Prioritize: primary use case + common case + edge case
   - Ensure scenarios cover main functionality
2. Execute Scenarios
   - Actually follow skill instructions
   - Complete the intended task
   - Document results (success/partial/failure); see the result-recording sketch after this list
   - Note any issues encountered
3. Validate Outputs
   - Does skill produce expected outputs?
   - Are outputs useful and correct?
   - Do outputs match documentation?
4. Check Error Handling
   - What happens with errors?
   - Are error messages helpful?
   - Can users recover from errors?
5. Assess Functionality
   - Does skill achieve stated purpose?
   - Is functionality complete?
   - Are there functional gaps?
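A minimal sketch of how scenario outcomes could be recorded and rolled up into an overall result. The ScenarioResult class, field names, and aggregation rules are illustrative assumptions, not part of testing-validator's scripts:

```python
from dataclasses import dataclass

# Hypothetical result recorder; names and thresholds are illustrative only.
@dataclass
class ScenarioResult:
    name: str
    outcome: str   # "success", "partial", or "failure"
    notes: str = ""

def aggregate(results: list[ScenarioResult]) -> str:
    """Roll individual scenario outcomes up into an overall test result."""
    outcomes = [r.outcome for r in results]
    if all(o == "success" for o in outcomes):
        return "PASS"
    if any(o == "failure" for o in outcomes):
        return "FAIL"
    return "PARTIAL"  # some scenarios succeeded, minor issues remain

if __name__ == "__main__":
    results = [
        ScenarioResult("Primary: GitHub API research", "success"),
        ScenarioResult("Common: skill development research", "success"),
        ScenarioResult("Edge: no results found", "success", "helpful guidance shown"),
    ]
    print(aggregate(results))  # PASS
```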
Validation Checklist:
- Primary use case tested (from "When to Use")
- Common use case tested
- Edge case tested (if applicable)
- All scenarios completed successfully
- Outputs correct and useful
- Error handling works (if errors encountered)
- Functionality complete (no gaps)
- Skill achieves stated purpose
Test Results:
- PASS: All scenarios succeed, functionality complete
- PARTIAL: Some scenarios succeed, minor issues
- FAIL: Scenarios fail, functionality broken
Outputs:
- Test result (PASS/PARTIAL/FAIL)
- Scenario execution results
- Functional issues identified (if any)
- Recommendations for fixes
Time Estimate: 30-90 minutes
Example:
Functional Testing: skill-researcher
====================================
Test Scenarios:
1. Primary: Research GitHub API integration patterns
2. Common: Research for skill development planning
3. Edge: Research with no results found
Scenario 1: GitHub API Integration Research
- Executed: Operation 2 (GitHub Repository Research)
- Result: ✅ SUCCESS
- Time: 25 minutes
- Output: Found 5 repositories, extracted patterns
- Functionality: Achieved purpose (research complete)
Scenario 2: Skill Development Research
- Executed: All 5 operations (Web, GitHub, Docs, Synthesis)
- Result: ✅ SUCCESS
- Time: 60 minutes
- Output: Research synthesis with 4 sources, 3 patterns
- Functionality: Fully achieved purpose
Scenario 3: No Results Edge Case
- Executed: Web search for obscure topic
- Result: ✅ HANDLED
- Time: 10 minutes
- Output: "No results found" with guidance to adjust search
- Error Handling: Good (helpful message, suggests alternatives)
Overall Functional Test: ✅ PASS
- All scenarios succeeded
- Functionality complete
- Error handling works
- Achieves stated purpose
Operation 2: Example Validation
Purpose: Verify all code/command examples in skill documentation execute correctly
When to Use This Operation:
- Validating documentation accuracy
- Ensuring examples are current and working
- Preventing broken example deployment
- Post-update example regression testing
Automation Level: 80% automated (example extraction and execution)
Process:
1. Extract All Examples
   - Scan SKILL.md for code blocks (```)
   - Extract examples with language tags
   - Identify executable vs informational examples
   - Count total examples
2. Categorize Examples
   - Shell/bash commands
   - Python code snippets
   - YAML/config samples
   - Informational (not executable)
3. Execute Examples Automatically
   - Run: python3 scripts/validate-examples.py /path/to/skill
   - Executes all bash/python examples
   - Captures output and errors
   - Compares to expected output (if documented)
   - Reports success/failure per example (see the extraction-and-execution sketch after this list)
4. Manual Validation (for non-automatable examples)
   - Configuration examples (check syntax)
   - Conceptual examples (check accuracy)
   - Workflow examples (check logic)
5. Generate Example Report
   - Total examples: X
   - Executable: Y (Z%)
   - Passed: A
   - Failed: B
   - Success rate: A/(A+B) × 100%
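A minimal sketch of the extract-and-execute idea behind automated example validation. It assumes SKILL.md uses standard fenced code blocks with language tags; the regex, function names, and pass criteria are illustrative and may differ from the actual validate-examples.py:

```python
import re
import subprocess
from pathlib import Path

TICKS = "`" * 3  # triple-backtick fence marker
FENCE = re.compile(TICKS + r"(\w+)\n(.*?)" + TICKS, re.DOTALL)

def extract_examples(skill_md: Path) -> list[tuple[str, str]]:
    """Return (language, code) pairs for every fenced, language-tagged block."""
    return FENCE.findall(skill_md.read_text())

def run_example(lang: str, code: str) -> bool:
    """Execute a bash or python example; report whether it exited cleanly."""
    cmd = ["bash", "-c", code] if lang in ("bash", "sh") else ["python3", "-c", code]
    try:
        return subprocess.run(cmd, capture_output=True, timeout=60).returncode == 0
    except subprocess.TimeoutExpired:
        return False

if __name__ == "__main__":
    executable = [(lang, code) for lang, code in extract_examples(Path("SKILL.md"))
                  if lang in ("bash", "sh", "python", "python3")]
    passed = sum(run_example(lang, code) for lang, code in executable)
    rate = 100 * passed / len(executable) if executable else 100
    print(f"{passed}/{len(executable)} executable examples passed ({rate:.0f}%)")
```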
Validation Checklist:
- All examples extracted and counted
- Executable examples identified
- Automated validation run (bash/python examples)
- Non-executable examples checked manually
- All examples execute successfully OR expected failures documented
- Broken examples identified with fixes
- Success rate ≥90% (for production)
Test Results:
- PASS: ≥90% of executable examples work correctly
- PARTIAL: 70-89% examples work, some broken
- FAIL: <70% examples work, many broken
Outputs:
- Example inventory (total, executable, non-executable)
- Execution results per example
- Success rate
- Broken examples list with error messages
- Recommendations for fixes
Time Estimate: 20-45 minutes (mostly automated)
Example:
Example Validation: review-multi
=================================
Extraction Results:
- Total examples: 18
- Executable (bash): 12
- Executable (python): 3
- Informational (YAML): 3
Automated Execution:
Bash Examples (12 total):
✅ PASS: python3 scripts/validate-structure.py <path> (3 instances)
✅ PASS: python3 scripts/check-patterns.py <path>
✅ PASS: python3 scripts/generate-review-report.py <file>
✅ PASS: python3 scripts/review-runner.py <path>
⚠️ WARNING: Example uses placeholder <path> - works with substitution
- Success Rate: 12/12 (100%)
Python Examples (3 total):
✅ PASS: All 3 syntax-valid, execute correctly
- Success Rate: 3/3 (100%)
Manual Validation (3 YAML examples):
✅ PASS: All YAML examples valid syntax
✅ PASS: Frontmatter examples follow standards
Overall Example Validation: ✅ PASS
- Success Rate: 100% (18/18 examples work)
- Minor Note: Some examples use placeholders (acceptable with clear notes)
Recommendation: Examples excellent, all functional
Operation 3: Integration Testing
Purpose: Test skills work correctly with other skills, especially in workflows and compositions
When to Use This Operation:
- Testing workflow skills (that compose others)
- Validating dependencies work correctly
- Checking skill integration points
- Testing data flow between skills
Automation Level: 20% automated (dependency checking), 80% manual (actual integration testing)
Process:
1. Identify Integration Points
   - Does skill depend on other skills?
   - Does skill compose with others (workflow)?
   - Are there data flows between skills?
   - Integration examples provided?
2. Test Skill Dependencies
   - Load required skills (can they be loaded?)
   - Execute dependent functionality
   - Verify dependency works as expected
   - Check version compatibility (if applicable)
3. Test Workflow Compositions
   - For workflow skills: execute multi-skill workflow
   - Verify data flows correctly between steps
   - Check each component skill integration
   - Validate output-to-input transitions (see the data-flow sketch after this list)
4. Test Integration Examples
   - Execute documented integration examples
   - Verify skills compose as documented
   - Check integration instructions accurate
5. Assess Integration Quality
   - Integrations smooth or problematic?
   - Data flows correctly?
   - Clear integration guidance?
   - Error handling across skill boundaries?
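Most integration testing is manual, but output-to-input handoffs can be checked mechanically. The sketch below assumes the artifact names from the development-workflow example later in this operation; adjust the mapping to the workflow actually under test:

```python
from pathlib import Path

# Hypothetical step -> artifact mapping based on the development-workflow
# example; the names are assumptions for illustration only.
HANDOFFS = [
    ("skill-researcher", "research-synthesis.md"),
    ("planning-architect", "skill-architecture-plan.md"),
    ("task-development", "task-breakdown.md"),
    ("prompt-builder", "prompts-collection.md"),
]

def check_data_flow(workdir: Path) -> list[str]:
    """Flag any step whose output artifact is missing or empty."""
    issues = []
    for step, artifact in HANDOFFS:
        path = workdir / artifact
        if not path.exists() or path.stat().st_size == 0:
            issues.append(f"{step}: expected output {artifact} missing or empty")
    return issues

if __name__ == "__main__":
    for line in check_data_flow(Path(".")) or ["All output-to-input handoffs present"]:
        print(line)
```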
Validation Checklist:
- Integration points identified
- Dependencies tested (if applicable)
- Workflow composition tested (if workflow skill)
- Data flow validated (inputs/outputs correct)
- Integration examples execute successfully
- Cross-skill error handling works
- Integration guidance accurate
- No integration issues found
Test Results:
- PASS: All integrations work smoothly
- PARTIAL: Integrations work with minor issues
- FAIL: Integration broken or major issues
- N/A: Standalone skill with no integrations
Outputs:
- Integration test results
- Workflow execution results (if applicable)
- Data flow validation
- Integration issues (if any)
- Recommendations
Time Estimate: 30-90 minutes (varies by integration complexity, N/A for standalone skills)
Example:
Integration Testing: development-workflow
==========================================
Integration Type: Workflow Composition (5 component skills)
Dependencies Identified:
1. skill-researcher (Step 1)
2. planning-architect (Step 2)
3. task-development (Step 3, optional)
4. prompt-builder (Step 4)
5. todo-management (Step 5)
Integration Test Execution:
Step 1 → Step 2 Integration:
- Input to Step 2: research-synthesis.md from Step 1
- Test: Create research synthesis, feed to planning-architect
- Result: ✅ PASS (planning-architect correctly uses research findings)
- Data Flow: Smooth (outputs match expected inputs)
Step 2 → Step 3 Integration:
- Input to Step 3: skill-architecture-plan.md from Step 2
- Test: Create architecture plan, feed to task-development
- Result: ✅ PASS (task-development breaks down plan correctly)
- Data Flow: Smooth
Step 3 → Step 4 Integration:
- Input to Step 4: task-breakdown.md from Step 3
- Test: Create task breakdown, feed to prompt-builder
- Result: ✅ PASS (prompt-builder creates prompts for tasks)
- Data Flow: Smooth
Step 4 → Step 5 Integration:
- Input to Step 5: prompts-collection.md from Step 4
- Test: Create prompts, feed to todo-management
- Result: ✅ PASS (todo-management creates todos from tasks)
- Data Flow: Smooth
Workflow Execution Test:
- Executed: Complete workflow (all 5 steps)
- Result: ✅ SUCCESS (produced complete skill planning artifacts)
- Time: 4.5 hours (as documented)
- Quality: High (artifacts complete and usable)
Overall Integration Test: ✅ PASS
- All 5 integrations work smoothly
- Data flows correctly between steps
- Workflow achieves stated purpose
- No integration issues found
Operation 4: Regression Testing
Purpose: Ensure updates don't break existing functionality
When to Use This Operation:
- After skill updates or improvements
- Before deploying changes
- Validating skill-updater changes
- Post-auto-updater verification
Automation Level: 60% automated (comparison, example re-execution), 40% manual
Process:
1. Establish Baseline
   - Before changes: run tests, document results
   - Save baseline test results
   - Note which examples/scenarios work
2. Apply Changes
   - Make updates to skill
   - Document what changed
3. Re-Run Tests
   - Re-execute same tests as baseline
   - Run example validation again
   - Test same scenarios
4. Compare Results
   - Before vs after comparison (see the comparison sketch after this list)
   - Which tests changed status?
   - New failures? (regressions)
   - New successes? (improvements)
   - Unchanged? (stable)
5. Identify Regressions
   - Tests that passed before but fail now
   - Functionality that worked but now broken
   - Examples that executed but now error
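The before/after comparison in step 4 can be automated with a simple status diff. The sketch below assumes per-test statuses have been recorded as "pass"/"fail" dictionaries; the names and sample data are illustrative:

```python
def compare_runs(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare per-test status before and after a change."""
    return {
        "regressions": [t for t, s in baseline.items()
                        if s == "pass" and current.get(t) == "fail"],
        "improvements": [t for t, s in baseline.items()
                         if s == "fail" and current.get(t) == "pass"],
        "new_tests": [t for t in current if t not in baseline],
    }

if __name__ == "__main__":
    baseline = {"structure": "pass", "examples": "pass", "scenario-1": "pass"}
    current = {"structure": "pass", "examples": "pass", "scenario-1": "pass",
               "quick-reference": "pass"}
    report = compare_runs(baseline, current)
    status = "REGRESSION" if report["regressions"] else "PASS"
    print(status, report)
```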
Validation Checklist:
- Baseline tests documented (before changes)
- Changes applied and documented
- All baseline tests re-executed
- Results compared (before vs after)
- No new failures (no regressions)
- If failures: identified and documented
- Regression fixes applied (if needed)
- Final validation: all tests pass
Test Results:
- PASS: No regressions (all baseline tests still pass)
- REGRESSION: Some tests failed that previously passed
- IMPROVED: Some tests pass that previously failed (plus no regressions)
Outputs:
- Regression test report
- Before/after comparison
- Identified regressions (if any)
- Regression fixes (if applicable)
- Final test status
Time Estimate: 30-60 minutes
Example:
Regression Testing: planning-architect (after Quick Ref addition)
==================================================================
Baseline (Before Quick Reference):
- Structure validation: 5/5 (PASS)
- Example count: 8 examples
- All examples: Execute successfully
- Scenarios tested: 2 scenarios (both PASS)
Changes Applied:
- Added Quick Reference section (96 lines)
- Added tables, checklists, decision tree
Re-Run Tests (After Quick Reference):
- Structure validation: 5/5 (PASS) ✅ No regression
- Example count: 8 examples ✅ No change
- All examples: Execute successfully ✅ No regression
- Scenarios tested: 2 scenarios (both PASS) ✅ No regression
- NEW: Quick Reference detected ✅ Improvement
Comparison:
✅ All baseline tests still pass (no regressions)
✅ New functionality added (Quick Reference)
✅ Quality maintained (5/5 score)
Overall Regression Test: ✅ PASS (No Regressions)
Additional: ✅ IMPROVEMENT (Quick Reference added)
Recommendation: Changes safe to deploy
Operation 5: Edge Case Testing
Purpose: Test skill handles unusual scenarios, boundary conditions, and edge cases correctly
When to Use This Operation:
- Testing robustness
- Validating error handling
- Checking boundary conditions
- Ensuring graceful degradation
Automation Level: 30% automated (known edge case checks), 70% manual (scenario thinking)
Process:
1. Identify Edge Cases
   - Empty inputs (what if no data?)
   - Maximum inputs (what if too much data?)
   - Invalid inputs (what if wrong format?)
   - Missing dependencies (what if skill not found?)
   - Boundary conditions (limits, thresholds)
2. Design Edge Case Tests
   - Create test scenarios for each edge case
   - Define expected behavior
   - Document pass criteria (see the edge-case inventory sketch after this list)
3. Execute Edge Case Tests
   - Test with empty/minimal inputs
   - Test with maximum/excessive inputs
   - Test with invalid/malformed inputs
   - Test with missing dependencies
   - Test boundary conditions
4. Evaluate Handling
   - Does skill handle edge case gracefully?
   - Error messages clear and helpful?
   - No crashes or undefined behavior?
   - Appropriate fallbacks or defaults?
5. Document Edge Case Behavior
   - Which edge cases handled well?
   - Which edge cases cause issues?
   - Expected vs actual behavior
   - Recommendations for improvement
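A lightweight way to keep edge-case designs and pass criteria in one place is a small inventory that the tester fills in after each run. The cases below mirror the todo-management example later in this operation and are assumptions, not required tests:

```python
# Illustrative edge-case inventory; cases and expected behaviours are assumptions
# modeled on the todo-management example in this operation.
EDGE_CASES = [
    ("empty task list", "shows empty state, no error"),
    ("single task", "full workflow completes correctly"),
    ("100+ tasks", "report generates with acceptable performance"),
    ("start non-existent task", "clear 'not found' error message"),
    ("complete an already completed task", "informative 'already completed' message"),
]

def record(case: str, expected: str, actual: str, passed: bool) -> str:
    """Format one edge-case result for the test report."""
    mark = "PASS" if passed else "FAIL"
    return f"[{mark}] {case}\n  expected: {expected}\n  actual:   {actual}"

if __name__ == "__main__":
    # The tester fills in 'actual' and 'passed' after executing each case.
    case, expected = EDGE_CASES[0]
    print(record(case, expected, actual="empty state shown with guidance", passed=True))
```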
Validation Checklist:
- Edge cases identified (minimum 3-5)
- Each edge case tested
- Error handling assessed
- No crashes or undefined behavior
- Error messages helpful (if applicable)
- Graceful degradation (if applicable)
- Edge case handling documented
- Critical edge cases handled correctly
Test Results:
- PASS: All critical edge cases handled correctly
- PARTIAL: Most edge cases handled, some issues
- FAIL: Critical edge cases cause errors or crashes
Outputs:
- Edge case test results
- Handling quality assessment
- Issues identified
- Recommendations for robustness
Time Estimate: 30-90 minutes
Example:
Edge Case Testing: todo-management
===================================
Edge Cases Identified:
1. Empty task list (initialize with 0 tasks)
2. Single task (minimal usage)
3. 100+ tasks (maximum usage)
4. Starting non-existent task
5. Completing already completed task
Edge Case Tests:
Test 1: Empty Task List
- Scenario: Initialize with empty list
- Execution: todo-management Operation 1 with 0 tasks
- Result: ✅ PASS (handles gracefully, shows empty state)
- Error: None
Test 2: Single Task
- Scenario: List with 1 task only
- Execution: Complete workflow on 1 task
- Result: ✅ PASS (works correctly, minimal case handled)
Test 3: 100 Tasks
- Scenario: Large task list
- Execution: Report progress on 100-task list
- Result: ✅ PASS (handles large lists, performance acceptable)
- Note: Report generation ~5 seconds (good)
Test 4: Non-Existent Task
- Scenario: Start task #999 (doesn't exist)
- Execution: Operation 2 (Start Task 999)
- Result: ✅ PASS (clear error: "Task 999 not found")
- Error Handling: Excellent (specific error message)
Test 5: Double Complete
- Scenario: Complete task #5 twice
- Execution: Operation 3 twice on same task
- Result: ✅ PASS (second attempt shows "Already completed")
- Error Handling: Good (informative message)
Overall Edge Case Test: ✅ PASS
- All critical edge cases handled correctly
- Error messages clear and helpful
- No crashes or undefined behavior
- Graceful handling of unusual scenarios
Recommendation: Edge case handling excellent
Testing Modes
Comprehensive Testing Mode
Purpose: Complete functional validation across all 5 operations
When to Use:
- Pre-deployment (ensure everything works)
- Major updates (comprehensive regression testing)
- Quality certification (complete functional validation)
Process:
- Run all 5 testing operations
- Aggregate results
- Generate comprehensive test report
- Make deployment decision
Time Estimate: 2-4 hours
Output: Complete test report with PASS/FAIL for deployment
Quick Check Mode
Purpose: Fast functional validation (examples only)
When to Use:
- During development (continuous testing)
- Quick validation (examples work?)
- Pre-commit checks
Process:
- Run Operation 2 only (Example Validation)
- Automated execution of all examples
- Quick pass/fail report
Time Estimate: 15-30 minutes (automated)
Output: Example validation results
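Quick Check mode fits naturally into a pre-commit hook or CI job. The wrapper below is a sketch that assumes validate-examples.py exits non-zero when any example fails; adapt paths to your repository layout:

```python
import subprocess
import sys

# Minimal pre-commit / CI wrapper for Quick Check mode.
# Assumption: scripts/validate-examples.py exits non-zero on any failing example.
def quick_check(skill_path: str) -> int:
    result = subprocess.run(
        ["python3", "scripts/validate-examples.py", skill_path],
        capture_output=True, text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print("Quick Check FAILED: fix broken examples before committing", file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(quick_check(sys.argv[1] if len(sys.argv) > 1 else "."))
```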
Custom Testing Mode
Purpose: Select specific operations based on needs
When to Use:
- Targeted testing (only certain aspects)
- Time constraints (can't do comprehensive)
- Specific concerns (e.g., only integration testing)
Process:
- Select operations to run (1-5)
- Execute selected tests
- Generate targeted report
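Custom Testing Mode amounts to a small dispatcher over the five operations. The sketch below uses placeholder runner entries; in practice each would invoke the corresponding manual or scripted operation:

```python
# Hypothetical dispatcher for Custom Testing Mode; operation keys and the
# placeholder runners are illustrative, not testing-validator internals.
OPERATIONS = {
    "functional":  lambda skill: "manual: execute scenarios from 'When to Use'",
    "examples":    lambda skill: "automated: run scripts/validate-examples.py",
    "integration": lambda skill: "manual: execute workflow compositions",
    "regression":  lambda skill: "semi-automated: compare against baseline",
    "edge-cases":  lambda skill: "manual: run edge-case inventory",
}

def run_custom(skill: str, selected: list[str]) -> dict[str, str]:
    """Run only the selected operations and return a targeted report."""
    return {name: OPERATIONS[name](skill) for name in selected}

if __name__ == "__main__":
    print(run_custom("skills/testing-validator", ["examples", "regression"]))
```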
Best Practices
1. Test Early and Often
Practice: Run Quick Check during development, Comprehensive before deployment
Rationale: Early testing catches issues before they compound
Application: Quick Check daily, Comprehensive pre-deploy
2. Automate Example Validation
Practice: Use automated example validation (validate-examples.py)
Rationale: 80% automated, fast, catches broken examples instantly
Application: Run after any example changes
3. Test Real Scenarios
Practice: Use actual use cases for functional testing
Rationale: Real scenarios reveal issues documentation review misses
Application: Test scenarios from "When to Use" section
4. Regression Test After Updates
Practice: Always run regression tests after skill changes
Rationale: Prevents breaking existing functionality with improvements
Application: Before/after comparison for all updates
5. Document Test Results
Practice: Save test reports for comparison over time
Rationale: Track testing trends, identify patterns
Application: Generate test report for each comprehensive test
6. Fix Broken Examples Immediately
Practice: Don't deploy with broken examples
Rationale: Broken examples destroy user confidence
Application: Example validation must PASS before deploy
Common Mistakes
Mistake 1: Skipping Example Validation
Symptom: Users report broken examples after deployment
Cause: Not testing examples before release
Fix: Run Operation 2 (Example Validation) before every deployment
Prevention: Make example validation mandatory in deployment checklist
Mistake 2: Only Testing Happy Path
Symptom: Skills break with unusual inputs or edge cases
Cause: Not testing edge cases
Fix: Run Operation 5 (Edge Case Testing)
Prevention: Include edge case testing in comprehensive mode
Mistake 3: No Regression Testing
Symptom: Updates break previously working functionality
Cause: Not testing before/after updates
Fix: Run Operation 4 (Regression Testing) after changes
Prevention: Make regression testing mandatory for all updates
Mistake 4: Not Testing Integrations
Symptom: Workflow skills break when actually composing other skills
Cause: Testing skills individually, not integrated
Fix: Run Operation 3 (Integration Testing) for workflow skills
Prevention: Always test integrations for workflow/composition skills
Mistake 5: Manual Testing Only
Symptom: Testing takes too long, often skipped
Cause: Not using automation
Fix: Use validate-examples.py for automated example checking
Prevention: Automate where possible (examples, scripts, structure)
Quick Reference
The 5 Testing Operations
| Operation | Focus | Automation | Time | Pass Criteria |
|---|---|---|---|---|
| Functional | Core functionality works | 30% | 30-90m | Scenarios succeed |
| Example Validation | Examples execute correctly | 80% | 20-45m | ≥90% examples work |
| Integration | Skills work together | 20% | 30-90m | Integrations smooth |
| Regression | Updates don't break functionality | 60% | 30-60m | No new failures |
| Edge Case | Handles unusual scenarios | 30% | 30-90m | Critical edge cases handled |
Testing Modes
| Mode | Time | Operations | Use Case |
|---|---|---|---|
| Comprehensive | 2-4h | All 5 operations | Pre-deployment, certification |
| Quick Check | 15-30m | Example validation only | During development |
| Custom | Variable | Selected operations | Targeted testing |
Test Results
| Result | Meaning | Action |
|---|---|---|
| PASS | All tests successful | Deploy with confidence |
| PARTIAL | Some issues, not critical | Fix issues, re-test, then deploy |
| FAIL | Critical issues | Fix before deployment |
Integration with review-multi
Use Both for Complete Validation:
review-multi (quality) + testing-validator (functionality) = Complete Validation
review-multi: Is it good? (structure, content, patterns, usability)
testing-validator: Does it work? (functional, examples, integration)
Together: Ready to deploy? (quality + functionality validated)
Automation Scripts
```bash
# Validate all examples automatically
python3 scripts/validate-examples.py /path/to/skill

# Run comprehensive test suite
python3 scripts/test-runner.py /path/to/skill --mode comprehensive

# Generate test report
python3 scripts/generate-test-report.py test-results.json --output report.md
```
For More Information
- Functional testing: references/functional-testing-guide.md
- Example validation: references/example-validation-guide.md
- Integration testing: references/integration-testing-guide.md
- Regression testing: references/regression-testing-guide.md
- Edge case testing: references/edge-case-testing-guide.md
- Test reports: references/test-report-template.md
testing-validator ensures skills work correctly through comprehensive functional testing, example validation, integration testing, regression testing, and edge case validation.