skill-creator-thepexcel
Skill Creator (ThepExcel Edition)
Based on Anthropic's official skill-creator (Apache 2.0). Enhanced with ThepExcel deployment workflow and structured audit/enhancement pipeline.
Quick Start
| Need | Mode | Go to |
|---|---|---|
| Create a new skill | Create | → Creating a Skill |
| Test a skill with evals | Eval | → Running and Evaluating |
| Improve based on feedback | Improve | → Improving the Skill |
| Audit quality + enhance in a structured way | Enhance | → Enhancement Mode |
| Measure regressions / compare versions | Benchmark | → Benchmark & Description Opt. |
| Deploy to ThepExcel infrastructure | Deploy | → Deploy |
Figure out where the user is in this process and jump in. If they already have a draft, go straight to eval/iterate. If they say "just vibe with me", do that.
Core Principles
Concise is key: the context window is a shared resource, so every line has to earn its token cost.
Explain the why: explain why, not just what. Claude is smart enough to generalize from reasoning better than from rigid rules. If you are about to write ALWAYS/NEVER in all caps, treat it as a yellow flag and reframe it as reasoning instead.
Generalize, don't overfit: test cases are examples, not the whole spec. Look for patterns that work broadly.
Keep the prompt lean: read the actual transcripts. If the skill makes the model waste time, cut that part.
Degrees of Freedom:
| Level | When | Example |
|---|---|---|
| High (text) | Several approaches are valid | Code review guidelines |
| Medium (pseudocode) | There is a preferred pattern | Report template |
| Low (scripts) | Consistency is required | Database migrations |
Skill Structure
skill-name/
├── SKILL.md (required) ← < 500 lines ideal
│ ├── YAML frontmatter ← name + description (+ compatibility optional)
│ └── Markdown body
├── agents/ ← subagent instructions (grader, comparator, analyzer)
├── eval-viewer/ ← generate_review.py + viewer.html
├── assets/ ← eval_review.html template
├── scripts/ ← deterministic code (execute without loading into context)
├── references/ ← loaded on demand (one level deep only)
└── evals/ ← evals.json + test files
Progressive Disclosure (3 levels):
- Metadata (name + description) — always in context (~100 words)
- SKILL.md body — when triggered (< 500 lines)
- Bundled resources — as needed
Frontmatter: name is kebab-case, max 64 chars; description is max 1024 chars, written in third person, stating what the skill does and when to use it.
"When to use" in body = useless — Claude sees description only when deciding to trigger. Put all trigger context there.
Description tip: Make it slightly "pushy" to combat undertriggering — include specific contexts even if not explicitly named.
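The frontmatter limits above are easy to check mechanically. The following is a minimal sketch, not the bundled validator: the function name `check_frontmatter` is hypothetical, and the exact kebab-case pattern (lowercase letters and digits separated by single hyphens) is an assumption.

```python
import re

# Limits taken from the frontmatter rule above; names here are hypothetical.
NAME_MAX = 64
DESCRIPTION_MAX = 1024
# Assumed definition of kebab-case: lowercase alphanumerics joined by single hyphens
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(name) > NAME_MAX:
        problems.append(f"name is {len(name)} chars (max {NAME_MAX})")
    if not KEBAB_CASE.fullmatch(name):
        problems.append("name is not kebab-case")
    if not description.strip():
        problems.append("description is empty")
    if len(description) > DESCRIPTION_MAX:
        problems.append(f"description is {len(description)} chars (max {DESCRIPTION_MAX})")
    return problems
```

A check like this only covers length and shape; whether the description is "pushy" enough to trigger reliably still needs the trigger evals described later.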
For anti-patterns: anti-patterns.md
Creating a Skill
Step 1: Capture Intent
Ask if unclear, or extract from the conversation if the answers are already there:
- What should the skill do?
- When should it trigger? (user phrases/contexts)
- Expected output format?
- Are test cases needed? Skills with objective output should have them (file transforms, data extraction, fixed workflows); subjective skills (writing style, art) don't need them.
Step 2: Interview & Research
Ask about edge cases, input/output formats, example files, dependencies
Check available MCPs; research via subagents if possible
Use /extract-expertise for complex domains
Step 3: Initialize
scripts/init_skill.py <skill-name> --path <output-directory>
Step 4: Write SKILL.md
Include: name, description (primary trigger mechanism; all "when to use" goes here), instructions
Writing style: imperative form, explain why behind each instruction, not rigid rules. Use theory of mind.
Design references:
- workflows.md — Sequential, conditional, loops
- output-patterns.md — Templates, formatting
- anti-patterns.md — Common mistakes
Step 5: Write Test Cases
2-3 realistic prompts — the kind of thing a real user would actually say. Share with user for confirmation. Save to evals/evals.json (don't add assertions yet):
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": [],
"expectations": []
}
]
}
See references/schemas.md for full schema including assertions field.
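As an illustration, the file shown above could be generated rather than typed by hand. This sketch only reproduces the fields visible in the example; the helper names `make_eval` and `make_eval_set` are hypothetical, and the full schema in references/schemas.md may include more fields.

```python
import json


def make_eval(eval_id: int, prompt: str, expected_output: str, files=None) -> dict:
    """Build one entry with the fields shown in the example above."""
    return {
        "id": eval_id,
        "prompt": prompt,
        "expected_output": expected_output,
        "files": list(files or []),
        "expectations": [],  # left empty at this stage, as in the example above
    }


def make_eval_set(skill_name: str, cases: list[tuple[str, str]]) -> str:
    """Serialize (prompt, expected_output) pairs as evals.json text."""
    evals = [make_eval(i, p, e) for i, (p, e) in enumerate(cases, start=1)]
    return json.dumps({"skill_name": skill_name, "evals": evals}, indent=2, ensure_ascii=False)
```

Writing the returned string to evals/evals.json keeps ids sequential and the empty fields consistent across test cases.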
Running and Evaluating Test Cases
This is one continuous sequence — don't stop partway. Do NOT use any other testing skill.
Put results in <skill-name>-workspace/ (sibling to skill directory), organized by iteration-N/eval-name/.
Step 1: Spawn All Runs in the Same Turn
For each test case, launch two subagents simultaneously — one with-skill, one baseline. Don't launch with-skill first and come back for baselines later.
- New skill: baseline = no skill at all, save to without_skill/outputs/
- Improving existing skill: baseline = old version snapshot (cp -r <skill-path> <workspace>/skill-snapshot/), save to old_skill/outputs/
Write eval_metadata.json for each eval:
{
"eval_id": 0,
"eval_name": "descriptive-name",
"prompt": "The task prompt",
"assertions": []
}
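The workspace layout and metadata file described above can be prepared before the subagents are launched. This is a sketch under assumptions: `prepare_eval_dir` is a hypothetical helper, and the `with_skill` directory name is assumed by analogy with `without_skill` and `old_skill`.

```python
import json
from pathlib import Path


def prepare_eval_dir(workspace: Path, iteration: int, eval_id: int,
                     eval_name: str, prompt: str, improving: bool = False) -> Path:
    """Create the run directories for one eval and write eval_metadata.json."""
    eval_dir = workspace / f"iteration-{iteration}" / eval_name
    # Baseline directory names come from Step 1; "with_skill" is an assumed name.
    baseline = "old_skill" if improving else "without_skill"
    for config in ("with_skill", baseline):
        (eval_dir / config / "outputs").mkdir(parents=True, exist_ok=True)
    metadata = {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "prompt": prompt,
        "assertions": [],  # filled in later, once assertions are drafted
    }
    (eval_dir / "eval_metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return eval_dir
```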
Step 2: While Runs Are in Progress, Draft Assertions
Don't wait idle. Draft objectively verifiable assertions with descriptive names (they appear in the viewer). Explain them to the user. Subjective skills → qualitative only, skip assertions.
Update eval_metadata.json and evals/evals.json with assertions.
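To make "objectively verifiable" concrete, an assertion can be paired with a small check function. The sketch below is illustrative only: the two example checks and the result shape are hypothetical and are not the grading.json schema from references/schemas.md.

```python
from typing import Callable

# Each assertion pairs a descriptive name (the kind shown in the viewer) with
# a check that returns True or False for a given output text. Both checks
# below are hypothetical examples, not part of the skill-creator scripts.
Assertion = tuple[str, Callable[[str], bool]]

ASSERTIONS: list[Assertion] = [
    ("output_has_header_row", lambda text: text.splitlines()[:1] == ["id,name,total"]),
    ("output_has_no_empty_lines", lambda text: all(line.strip() for line in text.splitlines())),
]


def run_assertions(output_text: str, assertions: list[Assertion]) -> list[dict]:
    """Evaluate every assertion and record a pass/fail result per name."""
    return [{"name": name, "passed": bool(check(output_text))} for name, check in assertions]
```

Names like `output_has_header_row` read well in a review UI because they say what was checked without opening the script.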
Step 3: Capture Timing Data
When each subagent completes, save immediately — this data exists only in the task notification:
{"total_tokens": 84852, "duration_ms": 23332, "total_duration_seconds": 23.3}
Save to timing.json in the run directory.
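Saving the timing data can be a one-call helper so it is not forgotten when a notification arrives. In this sketch `save_timing` is a hypothetical name, and deriving `total_duration_seconds` from `duration_ms` with one-decimal rounding is an assumption based on the sample values above.

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> dict:
    """Write timing.json into run_dir using the three fields shown above."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Derived from duration_ms; rounding to one decimal is an assumption
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "timing.json").write_text(json.dumps(timing), encoding="utf-8")
    return timing
```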
Step 4: Grade → Aggregate → Analyze → Launch Viewer
- Grade: spawn grader subagent using agents/grader.md, save to grading.json. For assertions checkable programmatically, write a script rather than eyeballing.
- Aggregate: run from skill-creator directory:
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
Produces benchmark.json and benchmark.md with pass_rate, time, tokens (mean ± stddev + delta).
- Analyst pass: read agents/analyzer.md (Analyzing Benchmark Results section) to surface patterns aggregate stats hide: non-discriminating assertions, high-variance evals, time/token tradeoffs.
- Launch viewer:
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
Iteration 2+: add --previous-workspace <workspace>/iteration-<N-1>
Cowork/headless: use --static <output_path> instead. Feedback downloads as feedback.json.
⚠️ GENERATE THE EVAL VIEWER BEFORE evaluating inputs yourself. Get results in front of the human first.
- Tell the user: "The results are now open in your browser. The Outputs tab shows each test case's output, and the Benchmark tab shows the metrics. When you're done reviewing, come back and let me know."
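For intuition about the "mean ± stddev + delta" figures that the Aggregate step reports, here is a small sketch of the arithmetic. It is not the implementation of scripts.aggregate_benchmark: the function names are hypothetical, and the use of sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(values: list[float]) -> dict:
    """Mean and sample standard deviation; stddev is 0.0 for a single run."""
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
    }


def compare(with_skill: list[float], baseline: list[float]) -> dict:
    """Summarize both configurations and report delta = with_skill - baseline."""
    a, b = summarize(with_skill), summarize(baseline)
    return {"with_skill": a, "baseline": b, "delta": a["mean"] - b["mean"]}
```

A positive delta on pass rate with a large stddev is the kind of result the analyst pass is meant to question, since a few high-variance evals can produce it.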
Step 5: Read Feedback
Read feedback.json. Empty feedback = satisfied. Focus on test cases with specific complaints.
Kill viewer: kill $VIEWER_PID 2>/dev/null
Improving the Skill
How to Think About Improvements
- Generalize: look for patterns that work broadly rather than fixing one specific test case
- Keep lean: read the actual transcripts; if the skill makes the model waste time, cut that part
- Explain the why: if you are about to write ALWAYS/NEVER, treat it as a yellow flag and reframe it as reasoning
- Bundle repeated work: if all 3 test cases end up writing their own create_docx.py, bundle it in scripts/
Iteration Loop
- Apply improvements to skill
- Rerun all test cases into iteration-<N+1>/ (including baselines)
- Launch viewer with --previous-workspace pointing at previous iteration
- Wait for user review → read feedback → repeat
Stop when: user satisfied / all feedback empty / not making meaningful progress.
For quick fixes (typo, small gap): direct edit → skip eval loop.
Blind Comparison (Advanced)
For rigorous version comparison, read agents/comparator.md + agents/analyzer.md for blind A/B judgment. Optional — the human review loop is usually sufficient.
Enhancement Mode
Use this to improve an existing skill in a structured way. It fits when you don't yet know where the skill's problems are, or when you want to raise quality systematically.
Route
| Condition | Path |
|---|---|
| Quick fix (typo, small gap) | Direct edit → done |
| Significant upgrade | Full pipeline below |
Full Pipeline: AUDIT → RESEARCH → INTEGRATE → OPTIMIZE → VALIDATE
AUDIT
Read the entire target skill → score it with audit-rubric.md:
SKILL: [name]
SCORES: Coverage [?] | Depth [?] | Structure [?] | Actionability [?] | Examples [?]
TOTAL: [?]/25 → [Draft/Working/Solid/Production]
Areas to improve:
1. [problem + impact]
→ Ask the user first: "Fix everything, or only selected items?"
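The audit block above is five scores of 1 to 5 summed to a total out of 25. A rough sketch of that arithmetic follows; the tier cut-offs in `TIERS` are hypothetical placeholders, since the real thresholds are defined in audit-rubric.md and are not shown here.

```python
DIMENSIONS = ("Coverage", "Depth", "Structure", "Actionability", "Examples")

# Hypothetical cut-offs: (minimum total, tier). The real thresholds live in
# audit-rubric.md and may differ.
TIERS = ((22, "Production"), (17, "Solid"), (11, "Working"), (0, "Draft"))


def audit_summary(skill: str, scores: dict[str, int]) -> str:
    """Format the audit block from five 1-5 scores."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be between 1 and 5")
    total = sum(scores[dim] for dim in DIMENSIONS)
    # First tier whose minimum the total reaches (TIERS is sorted high to low)
    tier = next(label for minimum, label in TIERS if total >= minimum)
    line = " | ".join(f"{dim} {scores[dim]}" for dim in DIMENSIONS)
    return f"SKILL: {skill}\nSCORES: {line}\nTOTAL: {total}/25 → {tier}"
```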
RESEARCH
Use /deep-research or /extract-expertise to fill knowledge gaps
INTEGRATE
Classify findings → prioritize by impact → merge using integration-patterns.md
OPTIMIZE
Apply skill-creator standards: progressive disclosure, conciseness, references/
VALIDATE
Before/after comparison:
| Dimension | Before | After | What changed |
|-----------|--------|-------|------------|
Log in enhancement-log.md → then run the Eval loop to confirm the improvement is real
Benchmark & Description Optimization
Benchmark
Rerun all evals 3x per configuration with aggregate_benchmark.py. Track pass rate, time, tokens across iterations/models. Use this for regression detection when the model is updated or after an enhancement.
Description Optimization
Once the skill is finished, offer to optimize the description for better triggering accuracy.
Step 1: Generate 20 trigger eval queries (mix should/should-not trigger). Be realistic — personal context, file paths, casual speech, typos. Near-miss negatives are the most valuable test cases.
Present via HTML template:
- Read assets/eval_review.html, fill: __EVAL_DATA_PLACEHOLDER__, __SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__
- Write to /tmp/eval_review_<skill-name>.html → open it
- User edits, clicks "Export Eval Set" → ~/Downloads/eval_set.json
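Filling the template comes down to three string substitutions. The sketch below assumes the template expects the eval data as a JSON literal and that each query is a dict with `query` and `should_trigger` keys; both are assumptions, as is the helper name `fill_eval_review`.

```python
import json


def fill_eval_review(template: str, eval_data: list[dict],
                     skill_name: str, description: str) -> str:
    """Substitute the three placeholders named above into the template text."""
    replacements = {
        # Assumption: the template expects the eval data as a JSON literal
        "__EVAL_DATA_PLACEHOLDER__": json.dumps(eval_data, ensure_ascii=False),
        "__SKILL_NAME_PLACEHOLDER__": skill_name,
        "__SKILL_DESCRIPTION_PLACEHOLDER__": description,
    }
    for placeholder, value in replacements.items():
        if placeholder not in template:
            raise ValueError(f"template is missing {placeholder}")
        template = template.replace(placeholder, value)
    return template
```

Failing loudly on a missing placeholder is a deliberate choice here: a silently half-filled review page is harder to notice than an exception.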
Step 2: Run optimization loop (background):
python -m scripts.run_loop \
--eval-set <path-to-trigger-eval.json> \
--skill-path <path-to-skill> \
--model <model-id-powering-this-session> \
--max-iterations 5 \
--verbose
Splits 60/40 train/test, iterates up to 5x, returns best_description selected by test score to avoid overfitting.
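The idea behind selecting by test score can be sketched in a few lines. This is not how scripts.run_loop is implemented; the function names, the seeded shuffle, and the `test_score` key are all hypothetical and only illustrate a 60/40 split followed by held-out selection.

```python
import random


def split_eval_set(queries: list[dict], train_fraction: float = 0.6, seed: int = 0):
    """Shuffle deterministically and split into (train, test)."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_best(candidates: list[dict]) -> dict:
    """Pick the candidate with the highest held-out test score."""
    return max(candidates, key=lambda c: c["test_score"])
```

Choosing on the held-out 40% means a description that merely memorizes the training queries does not win over one that generalizes.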
Step 3: Apply best_description to SKILL.md frontmatter. Show before/after + scores.
Note: requires claude -p CLI → Claude Code only, not Claude.ai
Deploy (ThepExcel)
┌─ New skill
│ ├─ Public? → /mnt/d/agent-skills/[skill-name]/
│ └─ Private? → /mnt/d/claude-private/skills/[skill-name]/
│
├─ Symlink
│ └─ ln -s /mnt/d/[repo]/[skill-name] ~/.claude/skills/[skill-name]
│
├─ Update registry
│ └─ /mnt/d/claude-master/CLAUDE.md → Skills Inventory
│
└─ Commit & Push
    └─ git add → commit → push (both the skill repo and claude-master)
Validate before deploy:
scripts/quick_validate.py <path/to/skill-folder>
scripts/package_skill.py <path/to/skill-folder>
Platform Notes
| Platform | Subagents | Viewer | Description Opt |
|---|---|---|---|
| Claude Code | ✅ Parallel | ✅ Browser | ✅ |
| Claude.ai | ❌ → run serially | ❌ → show inline | ❌ |
| Cowork | ✅ | --static flag | ✅ |
Claude.ai: skip baselines + benchmarking, run test cases yourself one at a time, show results + ask feedback inline.
References
| File | Content |
|---|---|
| references/schemas.md | JSON structures: evals.json, grading.json, benchmark.json, timing.json, etc. |
| references/progressive-disclosure.md | Loading patterns (high-level, domain, conditional) |
| references/workflows.md | Sequential, conditional, feedback loops |
| references/output-patterns.md | Templates, formatting, terminology |
| references/anti-patterns.md | Common mistakes to avoid |
| references/audit-rubric.md | Quality scoring 5 dimensions × 1-5 |
| references/integration-patterns.md | How to merge findings into skills |
| references/enhancement-log.md | History of skill enhancements |
| agents/grader.md | Evaluate assertions against outputs |
| agents/comparator.md | Blind A/B comparison between two outputs |
| agents/analyzer.md | Analyze benchmark patterns + why one version beat another |
Related Skills
- /extract-expertise: Extract expert knowledge to inform skill content
- /deep-research: Research domain before building or enhancing skill
- /optimize-prompt: Optimize skill descriptions and system prompts