# Prompt Engineer Toolkit
## Overview
Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when:

- launching a new LLM feature that needs reliable outputs
- prompt quality degrades after model or instruction changes
- multiple team members edit prompts and need history/diffs
- you need an evidence-based prompt choice for production rollout
- you want consistent prompt governance across environments
## Core Capabilities
- A/B prompt evaluation against structured test cases
- Quantitative scoring for adherence, relevance, and safety checks
- Prompt version tracking with immutable history and changelog
- Prompt diffs to review behavior-impacting edits
- Reusable prompt templates and selection guidance
- Regression-friendly workflows for model/prompt updates
## Key Workflows

### 1. Run Prompt A/B Test
Prepare JSON test cases and run:
```bash
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text
```
Input can also be supplied via stdin or an `--input` JSON payload.
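As a starting point, a test-case file might be generated like this. The field names follow the Evaluation Design section below, and the sample values are invented for illustration; check the script's `--help` for the exact schema it accepts.

```python
import json

# Hypothetical test cases for a support-ticket classifier prompt.
# Field names mirror the Evaluation Design section; values are invented.
cases = [
    {
        "input": "My card was charged twice for the same order.",
        "expected_contains": ["billing"],          # required markers
        "forbidden_contains": ["legal advice"],    # disallowed content
        "expected_regex": "(?i)^category:",        # structural pattern
    },
    {
        "input": "How do I reset my password?",
        "expected_contains": ["account"],
        "forbidden_contains": [],
        "expected_regex": "(?i)^category:",
    },
]

with open("testcases.json", "w") as f:
    json.dump(cases, f, indent=2)
```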
### 2. Choose Winner With Evidence
The tester scores outputs per case and aggregates:
- expected content coverage
- forbidden content violations
- regex/format compliance
- output length sanity
Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
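To make the aggregation concrete, here is a minimal per-case scoring sketch along those four axes. It is illustrative only: the real scoring logic lives in `scripts/prompt_tester.py`, and the equal weights below are an assumption.

```python
import re

def score_case(output: str, case: dict) -> float:
    """Illustrative per-case score in [0, 1]; not the tester's actual formula."""
    expected = case.get("expected_contains", [])
    forbidden = case.get("forbidden_contains", [])

    # Expected content coverage: fraction of required markers present.
    coverage = (sum(m in output for m in expected) / len(expected)) if expected else 1.0

    # Forbidden content violations: any hit zeroes the safety component.
    safe = 0.0 if any(m in output for m in forbidden) else 1.0

    # Regex/format compliance.
    pattern = case.get("expected_regex")
    fmt = 1.0 if (pattern is None or re.search(pattern, output)) else 0.0

    # Output length sanity (assumed bounds; tune per task).
    length_ok = 1.0 if 1 <= len(output) <= 4000 else 0.0

    # Equal weights, purely for illustration.
    return 0.25 * coverage + 0.25 * safe + 0.25 * fmt + 0.25 * length_ok
```

Averaging `score_case` over the whole suite gives the per-prompt number you compare between variants A and B.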
### 3. Version Prompts

```bash
# Add version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
```
### 4. Regression Loop
- Store baseline version.
- Propose prompt edits.
- Re-run A/B test.
- Promote only if the score improves and safety constraints hold (see the gate sketch below).
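A promotion gate can then be a few lines of code. The sketch below assumes average scores and violation counts have already been pulled from the tester's metrics output; the variable names and numbers are illustrative, not the tester's actual JSON fields.

```python
def should_promote(candidate_avg: float, baseline_avg: float,
                   candidate_violations: int) -> bool:
    """Illustrative promotion gate: score must improve, violations must be zero."""
    return candidate_avg > baseline_avg and candidate_violations == 0

# Example with invented placeholder numbers.
if should_promote(candidate_avg=0.91, baseline_avg=0.87, candidate_violations=0):
    print("Promote candidate prompt to baseline")
else:
    print("Keep current baseline")
```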
## Script Interfaces

`python3 scripts/prompt_tester.py --help`

- Reads prompts/cases from stdin or `--input`
- Optional external runner command
- Emits text or JSON metrics

`python3 scripts/prompt_versioner.py --help`

- Manages prompt history (`add`, `list`, `diff`, `changelog`)
- Stores metadata and content snapshots locally
## Pitfalls, Best Practices & Review Checklist
Avoid these mistakes:
- Picking prompts from single-case outputs — use a realistic, edge-case-rich test suite.
- Changing prompt and model simultaneously — always isolate variables.
- Missing `must_not_contain` (forbidden-content) checks in the evaluation criteria.
- Editing prompts without version metadata, author, or change rationale.
- Skipping semantic diffs before deploying a new prompt version.
- Optimizing one benchmark while harming edge cases — track the full suite.
- Swapping models without rerunning the baseline A/B suite.
Before promoting any prompt, confirm:
- Task intent is explicit and unambiguous.
- Output schema/format is explicit.
- Safety and exclusion constraints are explicit.
- No contradictory instructions.
- No unnecessary verbosity or filler tokens.
- A/B score improves and violation count stays at zero.
## References
- references/prompt-templates.md
- references/technique-guide.md
- references/evaluation-rubric.md
- README.md
## Evaluation Design
Each test case should define:
- `input`: realistic production-like input
- `expected_contains`: required markers/content
- `forbidden_contains`: disallowed phrases or unsafe content
- `expected_regex`: required structural patterns
This enables deterministic grading across prompt variants.
## Versioning Policy
- Use semantic prompt identifiers per feature (`support_classifier`, `ad_copy_shortform`).
- Record author + change note for every revision.
- Never overwrite historical versions.
- Diff before promoting a new prompt to production.
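A version record following this policy would carry at least the fields below. This is a hypothetical structure for illustration; the actual on-disk schema belongs to `scripts/prompt_versioner.py` and may differ.

```python
from dataclasses import dataclass

# Hypothetical version record; the real storage format may differ.
@dataclass(frozen=True)  # frozen: historical versions are never overwritten
class PromptVersion:
    name: str          # e.g. "support_classifier"
    version: int       # monotonically increasing, never reused
    author: str        # who made the change
    change_note: str   # rationale for the revision
    content: str       # full prompt snapshot, enabling diffs
```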
## Rollout Strategy
- Create baseline prompt version.
- Propose candidate prompt.
- Run A/B suite against same cases.
- Promote only if winner improves average and keeps violation count at zero.
- Track post-release feedback and feed new failure cases back into test suite.
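Closing the loop can be as simple as appending each production failure to the test-case file. This sketch assumes the `testcases.json` schema shown under Evaluation Design; the sample values are invented.

```python
import json

# Minimal sketch: turn a production failure into a permanent regression case.
with open("testcases.json") as f:
    cases = json.load(f)

cases.append({
    "input": "the production input that triggered the bad output",
    "forbidden_contains": ["the unwanted phrase observed in production"],
})

with open("testcases.json", "w") as f:
    json.dump(cases, f, indent=2)
```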