# Optimize

`/optimize` — Autonomous Optimization v2
Run an autonomous optimization loop against any target. Two modes:
- Metric mode — code targets scored by a shell command that produces a number (the original mode)
- Eval mode — skills, prompts, agents, or any text target judged by LLM-as-judge binary evals
The agent modifies the target, measures the result, keeps improvements, discards failures, and repeats.
Inspired by Karpathy's autoresearch and extended with LLM-as-judge evaluation.
## Invocation

### Metric Mode (code targets)

```bash
/optimize --metric "lighthouse_score" --higher-is-better \
  --measure "npx lighthouse http://localhost:3000 --output=json" \
  --extract "jq '.categories.performance.score * 100' lighthouse.json" \
  --files "src/**/*.tsx,src/**/*.css" \
  --budget 120

/optimize --resume    # Resume a previous optimization loop
/optimize --status    # Show results summary from the last/current run
```
### Eval Mode (skill/prompt/agent targets)

```bash
/optimize --target "~/.claude/skills/ExtractWisdom"
/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md"
/optimize --target "prompts/my-prompt.md"
/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 20
```
In eval mode, the system automatically:
- Detects the target type (skill, prompt, agent, code, function)
- Reads the target to understand its purpose and constraints
- Generates 3-6 binary eval criteria and 3-5 test inputs (illustrated after this list)
- Presents criteria + inputs for your approval before starting
- Runs the optimization loop using LLM-as-judge scoring
- Presents a recommendation (apply/reject/partial) when done
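Purely as an illustration of the shape of that approval step (the actual format is defined by the skill; the criteria and inputs here are lifted from the examples further down):

```
Proposed eval criteria (binary, judged per run):
  1. Does the output contain specific facts with sources?
  2. Is the output structured with clear sections?
  3. Does the output avoid generic filler?
Proposed test inputs:
  1. research quantum computing breakthroughs 2025
  2. quick research on supply chain security
  3. find recent developments in AI agents
Approve, edit, or regenerate before the loop starts.
```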
## What Happens

This skill triggers the PAI Algorithm in `mode: optimize`:
- OBSERVE — Define or auto-detect the target, set eval_mode
- THINK — Analyze codebase/skill, generate hypothesis queue
- PLAN — Prioritize hypotheses by expected impact
- BUILD — Phase 0: TARGET ANALYSIS (see `optimize-loop.md`): detect the target type, auto-generate eval criteria (eval mode), set up the sandbox, record the baseline
- EXECUTE — the autonomous loop (`optimize-loop.md`), sketched after this list: Hypothesize → Modify target → Measure (metric or eval) → Keep/Revert → Repeat
  - Metric mode: ~12 experiments/hour (at a 5-minute budget)
  - Eval mode: ~6-8 experiments/hour (multi-run judging is slower)
- VERIFY — Phase 9: RECOMMEND — diff, summary, apply/reject/partial options
- LEARN — Phase 10: EXTRACT LEARNINGS — what worked, what didn't, structured insights
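A minimal sketch of the EXECUTE keep/revert loop in shell terms. This is illustrative only: `measure`, `next_hypothesis`, `apply_hypothesis`, and `improved` are hypothetical stand-ins for steps the Algorithm performs, and the real protocol (budgets, guard rails, multi-run judging) lives in `optimize-loop.md`.

```bash
# Illustrative hill-climbing loop, not the skill's actual implementation.
best=$(measure)                       # establish the baseline score
while next_hypothesis; do             # pull from the hypothesis queue
  apply_hypothesis                    # modify the target in the sandbox
  score=$(measure)                    # shell metric or LLM-as-judge evals
  if improved "$score" "$best"; then
    git commit -am "keep: $score"     # keep the improvement
    best=$score
  else
    git checkout -- .                 # revert the failed experiment
  fi
done
```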
## Arguments — Metric Mode

| Argument | Required | Default | Description |
|---|---|---|---|
| `--metric NAME` | yes | | Human-readable metric name |
| `--measure COMMAND` | yes | | Shell command that produces the metric |
| `--files GLOB` | yes | | Files the agent may modify (comma-separated) |
| `--higher-is-better` | | (default) | Higher metric values are better |
| `--lower-is-better` | | | Lower metric values are better |
| `--extract COMMAND` | | last number in stdout | Extract the metric from the measure output |
| `--budget SECONDS` | | 300 | Time budget per experiment |
| `--target VALUE` | | none | Stop when the metric reaches this value |
| `--max-experiments N` | | none | Stop after N experiments |
| `--locked GLOB` | | none | Files the agent must NOT modify |
| `--constraints TEXT` | | none | Additional rules (e.g., "tests must pass") |
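When `--extract` is omitted, the score is the last number the measure command prints to stdout. A rough shell equivalent of that default (the exact regex is an assumption, and `$MEASURE_COMMAND` is a hypothetical placeholder, not a real variable the skill exposes):

```bash
# Hypothetical equivalent of the default extraction: take the last
# integer or decimal that the measure command prints to stdout.
bash -c "$MEASURE_COMMAND" | grep -Eo '[0-9]+(\.[0-9]+)?' | tail -1
```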
## Arguments — Eval Mode

| Argument | Required | Default | Description |
|---|---|---|---|
| `--target PATH` | yes | | Path to the skill directory, prompt file, or agent definition |
| `--max-experiments N` | | none | Stop after N experiments |
| `--runs N` | | 3 | Runs per experiment (more = more reliable, slower) |
| `--criteria "Q1" "Q2"` | | auto-generated | Override auto-generated eval criteria |
| `--inputs "I1" "I2"` | | auto-generated | Override auto-generated test inputs |
| `--budget SECONDS` | | 300 | Time budget per experiment |
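With binary criteria judged over multiple runs, one plausible aggregate score is the pass rate across all verdicts. This is an assumption for illustration, not the skill's documented scoring rule (see `optimize-loop.md` and `eval-guide.md` for the real one):

```bash
# Hypothetical pass-rate aggregation: 4 binary criteria judged over
# --runs 3 yields 12 verdicts; the experiment score is the pass rate.
passes=9; runs=3; criteria=4
echo "scale=2; $passes / ($runs * $criteria)" | bc    # prints .75
```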
## Shared Arguments

| Argument | Description |
|---|---|
| `--resume` | Resume a previous optimization run |
| `--status` | Show results summary |
## Algorithm Integration

When `/optimize` is invoked, the Algorithm enters with `mode: optimize` in the ISA frontmatter. `eval_mode` is set from the arguments:

- `--measure` provided → `eval_mode: metric` (git branch sandbox)
- `--target` provided → `eval_mode: eval` (directory sandbox)
ISC criteria become guard rails: assertions that must remain satisfied across ALL experiments, not just the current one. A violation triggers an automatic revert regardless of score improvement.
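A hypothetical guard-rail check for a "tests must pass" constraint (a sketch of the idea, not the skill's actual mechanism):

```bash
# Hypothetical guard-rail check: the change is discarded when the
# assertion stops holding, even if the primary metric improved.
if ! bun test > /dev/null 2>&1; then
  git checkout -- .    # guard rail violated → automatic revert
fi
```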
Reference files:

- `~/.claude/PAI/ALGORITHM/optimize-loop.md` — the full loop protocol
- `~/.claude/PAI/ALGORITHM/eval-guide.md` — how to write good eval criteria
- `~/.claude/PAI/ALGORITHM/target-types.md` — target detection and ISC generation
## Examples

### Metric Mode

Optimize page load time:

```bash
/optimize --metric "lighthouse_perf" --higher-is-better \
  --measure "npx lighthouse http://localhost:3000 --output=json --output-path=lh.json" \
  --extract "jq '.categories.performance.score * 100' lh.json" \
  --files "src/**/*.tsx,src/**/*.css" \
  --target 95 --budget 120
```
Optimize bundle size:

```bash
/optimize --metric "bundle_bytes" --lower-is-better \
  --measure "bun run build 2>&1 && du -sb dist/ | cut -f1" \
  --files "src/**/*.ts" \
  --constraints "all tests must pass"
```
ML training (Karpathy-style):

```bash
/optimize --metric "val_bpb" --lower-is-better \
  --measure "uv run train.py > run.log 2>&1 && grep '^val_bpb:' run.log | cut -d' ' -f2" \
  --files "train.py" \
  --locked "prepare.py" \
  --budget 300
```
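This measure command assumes `train.py` logs a line of the form `val_bpb: <number>` (for example `val_bpb: 0.8421`, a made-up value) so that `grep`/`cut` can pull the second field out of `run.log`.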
### Eval Mode

Optimize a skill's Extract workflow:

```bash
/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 15
```

Optimize a standalone prompt:

```bash
/optimize --target "prompts/summarize-article.md" --runs 5
```

Optimize with custom criteria:

```bash
/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md" \
  --criteria "Does the output contain specific facts with sources?" \
             "Is the output structured with clear sections?" \
             "Does the output avoid generic filler?" \
  --inputs "research quantum computing breakthroughs 2025" \
           "quick research on supply chain security" \
           "find recent developments in AI agents"
```
## Gotchas

- Hill-climbing can get stuck in local optima. If the score plateaus, consider restarting with different initial conditions.
- Eval mode vs. metric mode: use metric mode for quantifiable targets (latency, size); use eval mode for qualitative targets (skill quality, prompt effectiveness).
- Regression tolerance prevents catastrophic changes. Don't set it to 0 — some regression in secondary metrics is acceptable if the primary metric improves significantly.