sensei
Sensei
"A true master teaches not by telling, but by refining."
Automates skill frontmatter improvement using the Ralph loop pattern - iteratively improving skills until they reach Medium-High compliance with passing tests.
Help
When user says "sensei help" or asks how to use sensei:
╔══════════════════════════════════════════════════════════════════╗
║ SENSEI - Skill Frontmatter Compliance Improver ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ USAGE: ║
║ Run sensei on <skill-name> # Single skill ║
║ Run sensei on <skill-name> --fast # Skip tests ║
║ Run sensei on <skill1>, <skill2> # Multiple skills ║
║ Run sensei on all Low-adherence skills # Batch by score ║
║ Run sensei on all skills # All skills ║
║ ║
║ WHAT IT DOES: ║
║ 1. READ - Load skill's SKILL.md and count tokens ║
║ 2. SCORE - Check compliance (Low/Medium/Medium-High/High) ║
║ 3. SCAFFOLD- Create tests from template if missing ║
║ 4. IMPROVE - Add WHEN: triggers (cross-model optimized) ║
║ 5. TEST - Run tests, fix if needed ║
║ 6. TOKENS - Check token budget ║
║ 7. SUMMARY - Show before/after comparison ║
║ 8. PROMPT - Ask: Commit, Create Issue, or Skip? ║
║ 9. REPEAT - Until Medium-High score achieved ║
║ ║
║ TARGET SCORE: Medium-High ║
║ ✓ Description ≥ 150 chars, ≤ 60 words ║
║ ✓ Has "WHEN:" trigger phrases (preferred) ║
║ ✓ No "DO NOT USE FOR:" (risky in multi-skill envs) ║
║ ✓ Has "INVOKES:" for tool relationships (optional) ║
║ ✓ SKILL.md < 500 tokens (soft limit) ║
║ ║
║ MCP INTEGRATION (when INVOKES present): ║
║ ✓ Has "MCP Tools Used" table ║
║ ✓ Has Prerequisites section ║
║ ✓ Has CLI fallback pattern ║
║ ✓ No skill-tool name collision ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
Configuration
Sensei uses these defaults (override by specifying in your prompt):
| Setting | Default | Description |
|---|---|---|
| Skills directory | skills/ or .github/skills/ | Where SKILL.md files live |
| Tests directory | tests/ | Where test files live |
| Token soft limit | 500 | Target for SKILL.md |
| Token hard limit | 5000 | Maximum for SKILL.md |
| Target score | Medium-High | Minimum compliance level |
| Max iterations | 5 | Per-skill loop limit |
Auto-detect skills directory by checking (in order):
- skills/ in project root
- .github/skills/
- User-specified path
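The lookup order above can be sketched as follows. This is a minimal illustration, not Sensei's actual implementation: the function name `detect_skills_dir` and its parameters are hypothetical, and it assumes a user-specified path is resolved relative to the project root.

```python
from pathlib import Path
from typing import Optional

# Candidate locations, in the order the list above gives them.
DEFAULT_CANDIDATES = ("skills", ".github/skills")


def detect_skills_dir(project_root: Path, user_path: Optional[str] = None) -> Optional[Path]:
    """Return the first existing skills directory, or None if none exists.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    candidates = [project_root / name for name in DEFAULT_CANDIDATES]
    if user_path is not None:
        # An absolute user_path replaces project_root; a relative one is joined to it.
        candidates.append(project_root / user_path)
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

If a project keeps skills somewhere else entirely, passing `user_path` is the only way this sketch will find them.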
Invocation Modes
Single Skill
Run sensei on my-skill-name
Multiple Skills
Run sensei on skill-a, skill-b, skill-c
By Adherence Level
Run sensei on all Low-adherence skills
Run sensei on all Medium-adherence skills
All Skills
Run sensei on all skills
Fast Mode (Skip Tests)
Run sensei on my-skill --fast
GEPA Mode (Deep Optimization)
Run sensei on my-skill --gepa
Run sensei on my-skill --gepa --fast
Run sensei on all skills --gepa
When --gepa is used, Step 5 (IMPROVE) is replaced with GEPA evolutionary optimization.
Instead of template-based improvements, GEPA uses the existing test harness as a fitness
function and an LLM to propose and evaluate many candidate improvements automatically.
GEPA score-only mode (no LLM calls, just evaluate current quality):
Run sensei score my-skill
Run sensei score all skills
The Ralph Loop
For each skill, execute this loop until score >= Medium-High:
Step 1: READ
Load the skill's current state:
{skills-dir}/{skill-name}/SKILL.md
{tests-dir}/{skill-name}/ (if exists)
Run token count:
npm run tokens -- count {skills-dir}/{skill-name}/SKILL.md
Step 2: SCORE
Assess compliance by checking the frontmatter for:
- Description length (>= 150 chars, ≤ 60 words)
- "WHEN:" trigger phrases (preferred) or "USE FOR:"
- Routing clarity ("INVOKES:", "FOR SINGLE OPERATIONS:")
- No "DO NOT USE FOR:" anti-triggers (risky in multi-skill environments)
See references/scoring.md for detailed criteria.
Step 3: CHECK
If score >= Medium-High AND tests pass → go to SUMMARY step.
Step 4: SCAFFOLD (if needed)
If {tests-dir}/{skill-name}/ doesn't exist, create test scaffolding using templates from references/test-templates/.
Step 5: IMPROVE FRONTMATTER
Enhance the SKILL.md description to include:
- Lead with action verb - First sentence: unique action verb + domain
- Trigger phrases - "WHEN:" (preferred) or "USE FOR:" with 3-5 distinctive quoted phrases
- Keep the description to at most 60 words and 1024 characters
⚠️ "DO NOT USE FOR:" carries context-dependent risk. In multi-skill environments (10+ skills with overlapping domains), anti-trigger clauses introduce the very keywords that cause wrong-skill activation on Claude Sonnet and fast-pattern-matching models (evidence). For small, isolated skill sets (1-5 skills), the risk is low. When in doubt, use positive routing with WHEN: and distinctive quoted phrases.
Template (cross-model optimized):
---
name: skill-name
description: "[ACTION VERB] [UNIQUE_DOMAIN]. [One clarifying sentence]. WHEN: \"[phrase1]\", \"[phrase2]\", \"[phrase3]\", \"[phrase4]\", \"[phrase5]\"."
---
Template (with routing clarity for High score):
---
name: skill-name
description: "**WORKFLOW SKILL** — [ACTION VERB] [UNIQUE_DOMAIN]. [Clarifying sentence]. WHEN: \"[phrase1]\", \"[phrase2]\", \"[phrase3]\". INVOKES: [tools/MCP servers used]. FOR SINGLE OPERATIONS: [when to bypass this skill]."
---
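As a rough illustration of the first template, the sketch below assembles a description and renders the frontmatter block. The helper names `build_description` and `render_frontmatter` are hypothetical, and using `json.dumps` to produce the double-quoted, backslash-escaped string is an assumption that happens to match the escaping shown in the template.

```python
import json
from typing import Sequence


def build_description(action: str, clarifier: str, triggers: Sequence[str]) -> str:
    """Compose '[ACTION VERB] [DOMAIN]. [Clarifier]. WHEN: "p1", "p2", ...'."""
    if not 3 <= len(triggers) <= 5:
        # Step 5 asks for 3-5 distinctive quoted phrases.
        raise ValueError("expected 3-5 trigger phrases")
    quoted = ", ".join(f'"{phrase}"' for phrase in triggers)
    return f"{action}. {clarifier}. WHEN: {quoted}."


def render_frontmatter(name: str, description: str) -> str:
    """Render the frontmatter block with the description as a quoted string."""
    return "\n".join([
        "---",
        f"name: {name}",
        # json.dumps escapes inner quotes as \" like the template above.
        f"description: {json.dumps(description, ensure_ascii=False)}",
        "---",
    ])
```

The result still needs to pass the length and word-count checks from Step 2; this sketch only handles assembly and quoting.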
Step 5-GEPA: IMPROVE WITH GEPA (when --gepa flag is set)
Replaces Step 5 with automated evolutionary optimization. Step 6 (IMPROVE TESTS) still runs normally.
- Auto-discover test harness: Read {tests-dir}/{skill-name}/triggers.test.ts and extract the shouldTriggerPrompts and shouldNotTriggerPrompts arrays automatically.
- Build evaluator: Construct a GEPA evaluator that scores candidates on:
  - Content quality (has ## Triggers, ## Rules, ## Steps, USE FOR, WHEN)
  - Frontmatter description compliance (length, trigger phrases)
  - Trigger accuracy (keywords extracted from description match test prompts correctly)
- Run optimization: Call the GEPA auto-evaluator script:
  python scripts/src/gepa/auto_evaluator.py optimize \
    --skill {skill-name} \
    --skills-dir {skills-dir} \
    --tests-dir {tests-dir} \
    --iterations 80
- Review output: GEPA produces an optimized SKILL.md body. Show the diff to the user. The GEPA evaluator is generated automatically from existing tests; no manual configuration is needed.
Key: GEPA wraps existing tests as its fitness function. It does NOT replace or modify tests. The LLM proposes improved SKILL.md text, and the evaluator scores each candidate against the same test prompts the CI already uses. Only improvements that score higher are kept.
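The "keep only what scores higher" idea can be illustrated with a small greedy loop. This is a simplified sketch and not GEPA's actual algorithm or API: `optimize`, `propose`, and `trigger_accuracy` are hypothetical names, the keyword matcher is a toy stand-in for the real evaluator, and in practice `propose` would call an LLM.

```python
import re
from typing import Callable, Sequence, Tuple


def trigger_accuracy(text: str, should_trigger: Sequence[str], should_not_trigger: Sequence[str]) -> float:
    """Toy fitness: fraction of test prompts routed correctly by keyword overlap."""
    keywords = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", text)}

    def fires(prompt: str) -> bool:
        return any(w.lower() in keywords for w in re.findall(r"[A-Za-z]{4,}", prompt))

    total = len(should_trigger) + len(should_not_trigger)
    correct = sum(fires(p) for p in should_trigger) + sum(not fires(p) for p in should_not_trigger)
    return correct / total if total else 0.0


def optimize(seed: str,
             propose: Callable[[str, int], str],
             evaluate: Callable[[str], float],
             iterations: int = 80) -> Tuple[str, float]:
    """Greedy loop: a candidate replaces the current best only if it scores strictly higher."""
    best, best_score = seed, evaluate(seed)
    for step in range(iterations):
        candidate = propose(best, step)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Because the tests are only read by `evaluate`, they are never modified, which mirrors the point above that the test prompts act purely as the fitness function.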
Step 6: IMPROVE TESTS
Update test prompts to match new frontmatter:
- shouldTriggerPrompts - 5+ prompts matching "WHEN:" or "USE FOR:" phrases
- shouldNotTriggerPrompts - 5+ prompts for unrelated topics and different-skill scenarios
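One way to check that the test prompts track the new frontmatter is to pull the quoted phrases out of the description and see which ones no prompt covers. The sketch below assumes phrases are double-quoted and follow "WHEN:" or "USE FOR:"; both function names are hypothetical.

```python
import re
from typing import List, Sequence

# Capture text after WHEN: / USE FOR: up to the next routing keyword or end of string.
# The lookbehind keeps "DO NOT USE FOR:" from being read as "USE FOR:".
_TRIGGER_SECTION = re.compile(
    r"(?<!DO NOT )(?:WHEN|USE FOR):\s*(.*?)"
    r"(?=\s+(?:INVOKES|FOR SINGLE OPERATIONS|DO NOT USE FOR):|$)",
    re.S,
)


def extract_trigger_phrases(description: str) -> List[str]:
    """Return the quoted phrases listed after WHEN: or USE FOR:."""
    match = _TRIGGER_SECTION.search(description)
    if not match:
        return []
    return re.findall(r'"([^"]+)"', match.group(1))


def uncovered_phrases(description: str, should_trigger_prompts: Sequence[str]) -> List[str]:
    """List trigger phrases that appear in none of the shouldTriggerPrompts."""
    prompts = [p.lower() for p in should_trigger_prompts]
    return [
        phrase for phrase in extract_trigger_phrases(description)
        if not any(phrase.lower() in prompt for prompt in prompts)
    ]
```

An empty result from `uncovered_phrases` suggests every trigger phrase has at least one matching prompt; it says nothing about the quality of the shouldNotTriggerPrompts.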
Step 7: VERIFY
Run tests (skip if --fast flag):
# Framework-specific command based on project
npm test -- --testPathPattern={skill-name} # Jest
pytest tests/{skill-name}/ # pytest
waza run tests/{skill-name}/trigger_tests.yaml # Waza
Step 8: TOKENS
Check token budget:
npm run tokens -- check {skills-dir}/{skill-name}/SKILL.md
Budget guidelines:
- SKILL.md: < 500 tokens (soft), < 5000 (hard)
- references/*.md: < 1000 tokens each
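The soft and hard limits can be read as a three-way classification. The sketch below is illustrative only: `budget_status` is a hypothetical name, and `estimate_tokens` uses a rough four-characters-per-token heuristic that is an assumption, not how `npm run tokens` counts.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token); an assumption, not the real counter."""
    return (len(text) + 3) // 4


def budget_status(tokens: int, soft: int = 500, hard: int = 5000) -> str:
    """Classify a token count against the '< soft' and '< hard' guidelines."""
    if tokens >= hard:
        return "over hard limit"
    if tokens >= soft:
        return "over soft limit"
    return "ok"
```

For a references file, the same function could be called with `soft=1000`, since the guideline lists a single limit for those.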
Step 8b: MCP INTEGRATION (if INVOKES present)
When description contains INVOKES:, check:
- MCP Tools Used table - Does skill body have the table?
- Prerequisites section - Are requirements documented?
- CLI fallback - Is there a fallback when MCP unavailable?
- Name collision - Does skill name match an MCP tool?
If checks fail, add missing sections using patterns from references/mcp-integration.md.
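The four checks could be approximated with simple text matching, as in the sketch below. The detection rules are assumptions made for illustration (a "MCP Tools Used" label plus a pipe-table row, a "Prerequisites" heading, the word "fallback", and an exact-name comparison); the real criteria live in the reference documentation. The function name is hypothetical.

```python
import re
from typing import Dict, Iterable


def mcp_integration_checks(skill_name: str, description: str, body: str,
                           mcp_tool_names: Iterable[str]) -> Dict[str, bool]:
    """Return check name -> passed; empty when the description has no INVOKES:."""
    if "INVOKES:" not in description:
        return {}
    has_table_row = bool(re.search(r"^\s*\|.*\|\s*$", body, re.M))
    return {
        "MCP Tools Used table": "MCP Tools Used" in body and has_table_row,
        "Prerequisites section": bool(re.search(r"^#*\s*Prerequisites\b", body, re.M)),
        # Assumption: a documented fallback mentions the word "fallback".
        "CLI fallback pattern": bool(re.search(r"\bfallback\b", body, re.I)),
        "No name collision": skill_name not in set(mcp_tool_names),
    }
```

Any entry that comes back `False` marks a section to add before the summary step.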
Step 9: SUMMARY
Display before/after comparison:
╔══════════════════════════════════════════════════════════════════╗
║ SENSEI SUMMARY: {skill-name} ║
╠══════════════════════════════════════════════════════════════════╣
║ BEFORE AFTER ║
║ ────── ───── ║
║ Score: Low Score: Medium-High ║
║ Tokens: 142 Tokens: 385 ║
║ Triggers: 0 Triggers: 5 ║
║ Anti-triggers: 0 Anti-triggers: 0 ║
╚══════════════════════════════════════════════════════════════════╝
Step 10: PROMPT USER
Ask how to proceed:
- [C] Commit - Save with message sensei: improve {skill-name} frontmatter
- [I] Create Issue - Open issue with summary and suggestions
- [S] Skip - Discard changes, move to next skill
Step 11: REPEAT or EXIT
- If score < Medium-High AND iterations < 5 → go to Step 2
- If iterations >= 5 → timeout, show summary, move to next skill
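The exit conditions can be sketched as a bounded loop. This is a simplified control-flow illustration, not the full workflow: `run_loop` and its callback parameters are hypothetical, and scaffolding, token checks, and the user prompt are left out.

```python
from typing import Callable, Tuple

# Compliance levels from lowest to highest, per the scoring table.
LEVELS = ["Invalid", "Low", "Medium", "Medium-High", "High"]


def run_loop(score_fn: Callable[[], str],
             improve_fn: Callable[[], None],
             tests_pass_fn: Callable[[], bool],
             target: str = "Medium-High",
             max_iterations: int = 5) -> Tuple[str, int, str]:
    """Repeat score -> check -> improve until the target is met or iterations run out."""
    for iteration in range(1, max_iterations + 1):
        score = score_fn()
        if LEVELS.index(score) >= LEVELS.index(target) and tests_pass_fn():
            return ("done", iteration, score)
        improve_fn()
    # Iteration budget exhausted: report a timeout with the latest score.
    return ("timeout", max_iterations, score_fn())
```

The default of 5 matches the "Max iterations" setting; a caller overriding that setting would pass a different `max_iterations`.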
Scoring Quick Reference
| Score | Requirements |
|---|---|
| Invalid | Description > 1024 chars (exceeds spec hard limit) |
| Low | Description < 150 chars OR no triggers |
| Medium | Description >= 150 chars AND has triggers but >60 words |
| Medium-High | Has "WHEN:" (preferred) or "USE FOR:" with ≤60 words |
| High | Medium-High + routing clarity (INVOKES/FOR SINGLE OPERATIONS) |
⚠️ "DO NOT USE FOR:" is risky in multi-skill environments (10+ overlapping skills) because it causes keyword contamination on fast-pattern-matching models. It is safe for small, isolated skill sets. Use positive routing with WHEN: for cross-model safety.
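The table above translates fairly directly into a function. The sketch below is one possible reading, assuming words are whitespace-separated and that triggers and routing are detected by literal keyword presence; `score_description` is a hypothetical name, and the detailed algorithm in the reference documentation may differ.

```python
def score_description(description: str) -> str:
    """Map a frontmatter description to a compliance level from the quick reference."""
    chars = len(description)
    words = len(description.split())
    has_triggers = "WHEN:" in description or "USE FOR:" in description
    has_routing = "INVOKES:" in description or "FOR SINGLE OPERATIONS:" in description

    if chars > 1024:
        return "Invalid"      # exceeds the spec hard limit
    if chars < 150 or not has_triggers:
        return "Low"
    if words > 60:
        return "Medium"       # long enough and has triggers, but too wordy
    return "High" if has_routing else "Medium-High"
```

For instance, the short description `Process PDF files` falls under 150 characters and has no triggers, so this sketch rates it Low.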
MCP Integration Score (when INVOKES present)
| Check | Status |
|---|---|
| MCP Tools Used table | ✓/✗ |
| Prerequisites section | ✓/✗ |
| CLI fallback pattern | ✓/✗ |
| No name collision | ✓/✗ |
See references/scoring.md for full criteria. See references/mcp-integration.md for MCP patterns.
Frontmatter Patterns
Skill Classification Prefix
Add a prefix to clarify the skill type:
- **WORKFLOW SKILL** - Multi-step orchestration
- **UTILITY SKILL** - Single-purpose helper
- **ANALYSIS SKILL** - Read-only analysis/reporting
Routing Clarity (for High score)
When skills interact with MCP tools or other skills, add:
- INVOKES: - What tools/skills this skill calls
- FOR SINGLE OPERATIONS: - When to bypass this skill
Quick Example
Before (Low):
description: 'Process PDF files'
After (High with routing, cross-model optimized):
description: "**WORKFLOW SKILL** — Extract, rotate, merge, and split PDF files. WHEN: \"extract PDF text\", \"rotate PDF pages\", \"merge PDFs\", \"split PDF\". INVOKES: pdf-tools MCP for extraction, file-system for I/O. FOR SINGLE OPERATIONS: Use pdf-tools MCP directly for simple extractions."
See references/examples.md for more before/after transformations.
Commit Messages
sensei: improve {skill-name} frontmatter
Reference Documentation
- scoring.md - Detailed scoring criteria and algorithm
- mcp-integration.md - MCP tool integration patterns
- loop.md - Ralph loop workflow details
- examples.md - Before/after transformation examples
- configuration.md - Project setup patterns
- test-templates/ - Test scaffolding templates
- test-templates/waza.md - Waza trigger test format
Built-in Scripts
Run npm run tokens help for full usage.
Token Commands
npm run tokens count # Count all markdown files
npm run tokens check # Check against token limits
npm run tokens suggest # Get optimization suggestions
npm run tokens compare # Compare with git history
GEPA Commands
Requires: pip install gepa (or uv pip install gepa). See requirements.
# Score a single skill (no LLM calls, instant)
python scripts/src/gepa/auto_evaluator.py score --skill azure-deploy --skills-dir skills --tests-dir tests
# Score all skills
python scripts/src/gepa/auto_evaluator.py score-all --skills-dir skills --tests-dir tests
# Optimize a skill (requires LLM API — uses GitHub Models via gh auth token)
python scripts/src/gepa/auto_evaluator.py optimize --skill azure-deploy --skills-dir skills --tests-dir tests
# JSON output (for CI pipelines)
python scripts/src/gepa/auto_evaluator.py score-all --skills-dir skills --tests-dir tests --json
Configuration
Create .token-limits.json to customize limits:
{
"defaults": { "SKILL.md": 500, "references/**/*.md": 1000 },
"overrides": { "README.md": 3000 }
}
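To make the shape of this file concrete, the sketch below resolves a limit for a given path. The resolution rules are assumptions, not documented behavior of the token script: overrides are checked before defaults, patterns may match at any directory depth, `**/` matches zero or more directories, and `*` stays within one path segment. Both function names are hypothetical.

```python
import json
import re
from typing import Optional


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a small glob subset ('**/' and '*') into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more directories
            i += 3
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    # Optional leading directories let 'SKILL.md' match 'skills/x/SKILL.md'.
    return re.compile("(?:.*/)?" + "".join(parts) + "$")


def limit_for(path: str, config: dict) -> Optional[int]:
    """Return the token limit for path, checking overrides before defaults."""
    for section in ("overrides", "defaults"):
        for pattern, limit in config.get(section, {}).items():
            if glob_to_regex(pattern).match(path):
                return limit
    return None


# Usage with the example configuration shown above.
config = json.loads(
    '{"defaults": {"SKILL.md": 500, "references/**/*.md": 1000},'
    ' "overrides": {"README.md": 3000}}'
)
skill_limit = limit_for("skills/pdf/SKILL.md", config)
```

A path that matches no pattern returns `None`, which a caller could treat as "no limit configured".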