maptasks — Route Engineering Work to the Right Model

Overview

Given a specified engineering task (or set of tasks), produce a routing plan that:

  1. Checks whether the task is broken down enough to route. If not, proposes a finer breakdown first.
  2. Categorizes each leaf work unit using a 30-category task taxonomy.
  3. Assigns each unit to the best-suited model based on practitioner evidence (April 2026).
  4. Explains why this model is best for this category — citing the rule or evidence.
  5. Specifies the context briefing each model needs to do its job.
  6. Flags model-specific failure modes to watch for.

The skill is evidence-driven, not vibes-driven. Every recommendation traces to a category lookup or a numbered decision rule.

The Three-Tier Strategy (always implicit)

Every routing plan should distribute work across three tiers in roughly these proportions:

| Tier | Share of work | Models |
|---|---|---|
| Flagship reasoning | 5–10% | Opus 4.6, GPT-5.4 Pro, Gemini 3.1 Pro |
| Balanced mid-tier | 40–50% | Sonnet 4.6, GPT-5.4, GPT-5.3-Codex, Gemini 2.5 Pro |
| Cheap / fast | 30–40% | Haiku 4.5, GPT-5.4 mini/nano, Gemini 3 Flash |

A routing plan that uses only one tier (all Opus, or all Sonnet) is almost always wrong. The two failure modes to avoid:

  1. Verbose model on shallow task → cache thrashing, quota exhaustion (Opus on boilerplate).
  2. Speed model on hard reasoning → "fast failure" — race conditions and memory leaks discovered after merge.

The skill prevents both by forcing each unit through category lookup → tier check → model selection.
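
A minimal sketch of that gate, as code. The category names and tier mapping below are illustrative excerpts of the Quick Reference and roster tables, not the full taxonomy:

```python
# Illustrative routing gate: category lookup -> tier check -> model selection.
# Only three of the 30 categories are shown; see the Quick Reference table.

CATEGORY_TO_TIER = {
    "architecture": "flagship",     # category 1
    "backend_endpoint": "mid",      # category 11
    "boilerplate": "fast",          # category 8
}

TIER_TO_DEFAULT_MODEL = {
    "flagship": "Claude Opus 4.6",
    "mid": "Claude Sonnet 4.6",
    "fast": "Claude Haiku 4.5",
}

def route(category: str) -> str:
    """Refuse to name a model until the category has fixed the tier."""
    tier = CATEGORY_TO_TIER[category]      # lookup first
    return TIER_TO_DEFAULT_MODEL[tier]     # then select within the tier

assert route("boilerplate") == "Claude Haiku 4.5"  # never Opus on boilerplate
```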

When to Use

  • You have a fully or partially specified task and are about to execute it with AI assistance.
  • You want to assemble a multi-model team (e.g. Architect + Implementer + Reviewer).
  • You're choosing between two or more models for a specific piece of work and want the comparison.
  • You want a routing plan before spinning up subagents or parallel sessions.
  • You want to know whether a task description is decomposed enough to be routable at all.

Do NOT use this skill for:

  • Picking a chat model for casual conversation.
  • Reviewing the quality of code (use kiss, solid-principles, beyond-solid-principles).
  • Estimating timelines or effort in person-hours.

Inputs

The user provides one of:

  1. A task description (e.g. "Implement OAuth2 login with Google, update schema, add tests, update docs").
  2. A pre-broken-down task list (markdown or bullets).
  3. A spec / PRD / GitHub issue / file path.

If anything critical is missing, ask for it before routing:

  • Scope: How many files? What's the language/framework/stack?
  • Constraints: Privacy (can data leave your infra?), ecosystem (Google/JVM/.NET), latency.
  • Cost posture: cost-optimized, balanced, or quality-first?
  • Done criteria: How will you know each unit is finished?

If the user describes a team type or role context (solo MVP dev, senior in production codebase, frontend startup, backend team, migration team, docs team, agentic CLI workflow, privacy-constrained/air-gapped), pre-load the matching default stack from references/rules.md Part 3 — Persona-Based Routing Scenarios — and use it as the starting policy before applying category lookups.

Workflow

1. INGEST          → Read task / spec. Identify scope and constraints.
2. SUFFICIENCY     → Run Breakdown Sufficiency Check on each unit.
3. DECOMPOSE       → If any unit fails, propose finer breakdown BEFORE routing.
4. CATEGORIZE      → Map each leaf unit to one of the 30 task categories.
5. SELECT          → Apply category lookup, then decision rules to pick a model.
6. BRIEF           → Build the context briefing for each unit.
7. EMIT            → Output the routing plan in the standard format.
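
The record each unit carries through steps 4–7 might look like the following sketch. Field names are hypothetical; they mirror the Output Format section, which is the authoritative shape:

```python
# Hypothetical per-unit record for the INGEST -> EMIT pipeline.
# Field names mirror the Output Format section but are not prescribed.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkUnit:
    name: str                        # short imperative name (INGEST)
    category: int                    # one of the 30 categories (CATEGORIZE)
    model: str = ""                  # primary model (SELECT)
    secondary: Optional[str] = None  # set only for dual-model patterns
    rationale: str = ""              # must cite category # AND rule # (SELECT)
    briefing: dict = field(default_factory=dict)  # files, priors, constraints (BRIEF)
    done_criteria: str = ""          # measurable "done" (Sufficiency criterion 5)
```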

Step 2: Breakdown Sufficiency Check

A work unit is routable only if all five hold:

| # | Criterion | Failing example |
|---|---|---|
| 1 | Belongs to exactly one task category | "Implement and test" — split |
| 2 | Requires exactly one model tier | "Plan and write docs" mixes flagship + mid — split |
| 3 | Produces a single artifact (one file, one plan, one review) | "Build OAuth flow" — split |
| 4 | Fits in one context budget (no mid-task re-injection) | "Migrate the whole repo" — chunk by module |
| 5 | Has measurable done criteria | "Make it better" — define done |

If any criterion fails, propose a finer breakdown before routing. Don't route an under-decomposed unit; you'll just push the routing problem onto the model.
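
As a sketch, the gate is a conjunction of five booleans supplied by the planner; any False sends the unit back to DECOMPOSE (the names below are illustrative, not part of the skill):

```python
from typing import NamedTuple

class Sufficiency(NamedTuple):
    single_category: bool       # 1: exactly one task category
    single_tier: bool           # 2: exactly one model tier
    single_artifact: bool       # 3: one file / one plan / one review
    fits_context_budget: bool   # 4: no mid-task re-injection
    has_done_criteria: bool     # 5: measurable "done"

def is_routable(c: Sufficiency) -> bool:
    # Any failing criterion routes the unit back to DECOMPOSE, not SELECT.
    return all(c)
```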

Step 5: Selecting the Model

  1. Category lookup (Quick Reference table below) gives you a primary + runner-up.
  2. Apply decision rules (Rules section below) to resolve ties or special cases.
  3. Check constraints: privacy → local model; ecosystem → Gemini for Google stack; long context → Gemini 2.5 Pro.
  4. Sanity check: cite the category number AND the rule number in the rationale.
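
One way to sketch the precedence: hard constraints act as overrides, and the category lookup supplies the primary otherwise (the lookup entry and signature here are illustrative):

```python
# Illustrative Step 5 selection. Only category 27 (SQL generation) is shown.

LOOKUP = {27: ("Sonnet 4.6", "GPT-5.4 mini")}  # category -> (primary, runner-up)

def select(category: int, *, private: bool = False,
           google_stack: bool = False, tokens: int = 0) -> str:
    if private:               # Rule 13: data cannot leave your infrastructure
        return "Qwen3-Coder-Next"
    if tokens > 200_000:      # Rule 8: long-context analysis
        return "Gemini 2.5 Pro"
    if google_stack:          # ecosystem constraint
        return "Gemini 3.1 Pro"
    primary, _runner_up = LOOKUP[category]
    return primary
```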

Model Roster

| Model | Tier | Sweet Spot | Context (usable) | $/M in | $/M out |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Flagship | Architecture, multi-file refactor, complex debug (race / cross-boundary), security audit, 7h+ autonomous | 1M (~400K reliable) | $5 | $25 |
| GPT-5.4 Pro | Flagship | Computer use (75% OSWorld), mid-task steering, native tool chaining | 1M | $30 | $180 |
| Gemini 3.1 Pro | Flagship | Frontier reasoning, frontend/UI, Google ecosystem (Go/Dart/Flutter/Angular) | 1M+ | $2-4 | $12-18 |
| Claude Sonnet 4.6 | Mid | The 80% model: daily features, bug fixes, tests, docs, API/SQL design, frontend components | 1M (~400K reliable) | $3 | $15 |
| GPT-5.4 | Mid | Structured output (~100% strict), tool/function calling, greenfield iteration | 1.05M | $2.50 | $15 |
| GPT-5.3-Codex | Mid | Terminal/DevOps/IaC (77.3% Terminal-Bench), code review, long-horizon autonomous (25h) | 400K | $2.50 | $15 |
| Gemini 2.5 Pro | Mid | Long-context analysis, whole-codebase summarization, data analysis, R workflows | 1-2M | $1.25 | $10 |
| Claude Haiku 4.5 | Fast | Sub-agent workhorse, classification, boilerplate, strict JSON adherence | 200K | $0.80 | $5 |
| GPT-5.4 mini | Fast | Routine code, sub-agent, "almost indistinguishable from full" on simple tasks | 400K | $0.75 | $4.50 |
| GPT-5.4 nano | Fast | Classification, extraction, guardrails, high-volume pipelines | 400K | $0.20 | $1.25 |
| Gemini 3 Flash | Fast | Cost king for coding (78% SWE-bench at $0.50/MTok), default in JetBrains/Replit/Warp | 1M | $0.50 | $3 |
| Codex-Spark | Speed | Rapid prototyping drafter (1,000+ tok/s, 15x faster than Codex). Pair with Opus reviewer (Pattern B) | varies | varies | varies |
| Qwen3-Coder-Next | Local | Privacy-bound or offline coding (consumer hardware), Apache 2.0 | 128K | free | free |
| Gemma 4 | Local | Local function calling, structured output, RAG (Apache 2.0) | varies | free | free |

Full profiles, weaknesses, and failure modes → references/models.md.

Context Decay (the "1M Token Myth")

Advertised context windows ≠ usable context. Empirical retrieval accuracy:

| Model | 256K accuracy | 1M accuracy | Decay rate | Practical max |
|---|---|---|---|---|
| Claude Opus 4.6 | 92.1% | 78.3% | 15% | ~400K reliable |
| GPT-5.4 | 79.5% | 36.6% | 54% | ~7,500 LoC reliable |
| Gemini 3.1 Pro | 35.0% | 25.9% | 26% | ~15,000 LoC (no cross-file reasoning) |

Routing implication: Long-context jobs should be routed by task type, not by raw window size:

  • Retrieval / summarization >200K → Gemini 2.5 Pro (1–2M, generous free tier)
  • Cross-file reasoning >200K → Claude Opus 4.6, but chunk to stay under 400K
  • Whole-codebase analysis with no precision needed → Gemini 2.5 Pro
  • Anything past Opus 400K threshold → switch to Gemini 2.5 Pro for that sub-task
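
The handoff logic, sketched with the practical-max figures above (thresholds illustrative):

```python
# Long-context routing sketch using the decay table's practical maxima.

OPUS_RELIABLE_MAX = 400_000  # Rule 9: Claude quality degrades past ~400K

def route_long_context(tokens: int, cross_file_precision: bool) -> str:
    if tokens <= 200_000:
        return "normal category lookup"
    if cross_file_precision and tokens <= OPUS_RELIABLE_MAX:
        return "Claude Opus 4.6 (chunk to stay under 400K)"
    return "Gemini 2.5 Pro"  # retrieval/summarization, or past the Opus threshold
```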

Mixed-Model Workflow Patterns

When more than one model collaborates on a task, use a named pattern. Full descriptions in references/rules.md Part 2.

| Pattern | Composition | Best for |
|---|---|---|
| A. Architect / Implementer / Reviewer | Opus → Sonnet → (Codex+Opus dual review) | New features in production codebases (the dominant 3-tier strategy) |
| B. Drafter / Reviewer | Gemini 3 Flash or Codex-Spark or GPT-5.4 mini → Opus 4.6 patch | Frontend iteration, prototyping, "show me 30 variants" — 3x faster than reasoning-only with no correctness loss |
| C. Claude / Gemini Pipeline | Opus plan → Gemini 2.5/3 implement → Opus review | Frontend-heavy work where Gemini's UI strengths matter |
| D. Multi-Tier Subagent Team | Sonnet lead + Haiku workers + Opus/Codex specialists | High-volume agentic pipelines (Augment Code style) |
| E. Cost-Capped Default Stack | Sonnet default + Opus escalation + Haiku downshift + GPT-5.4 strict for structured output | Sustainable production AI workflows |
| F. Single-Vendor Lock-In | Opus→Sonnet→Haiku OR GPT-5.4→mini→nano OR Gemini 3.1→Flash→Flash-Lite | Org policy / billing simplification |

When a unit's rationale invokes a pattern, name it explicitly: "Pattern A — Architect/Implementer/Reviewer".
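
For orientation, Pattern A reduces to a handful of sequential calls. The `call` helper below is hypothetical (wire it to your provider SDKs); only the model sequence comes from the pattern:

```python
def call(model: str, prompt: str) -> str:
    """Hypothetical provider-agnostic helper: send prompt, return reply."""
    raise NotImplementedError  # wire to your provider SDKs

def pattern_a(feature_spec: str) -> str:
    plan = call("Claude Opus 4.6", f"Plan the architecture for: {feature_spec}")
    code = call("Claude Sonnet 4.6", f"Implement exactly this plan:\n{plan}")
    # Rule 4 dual review: different models have different blind spots.
    codex_review = call("GPT-5.3-Codex", f"Review for defects:\n{code}")
    opus_review = call("Claude Opus 4.6", f"Review for defects:\n{code}")
    return call("Claude Sonnet 4.6",
                f"Apply these findings:\n{codex_review}\n{opus_review}\nto:\n{code}")
```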

Task Categories (Quick Reference)

| # | Category | Primary | Runner-up | Avoid |
|---|---|---|---|---|
| 1 | Architecture / system design | Opus 4.6 | GPT-5.4 Pro | |
| 2 | Task decomposition / planning | Opus 4.6 | GPT-5.4 | Haiku |
| 3 | Tech-stack / tradeoff analysis | Opus 4.6 | GPT-5.2 | |
| 4 | Spec writing (business → tech) | Opus 4.6 | Sonnet 4.6 | Cursor Composer-2 |
| 5 | API contract design (OpenAPI/protobuf) | Sonnet 4.6 | GPT-5.4 | |
| 6 | Database schema design | Sonnet 4.6 | Gemini 2.5 Pro | |
| 7 | Greenfield feature implementation | Sonnet 4.6 | GPT-5.4 | |
| 8 | Boilerplate / scaffolding | Haiku 4.5 | GPT-5.4 nano | Opus (10x cost waste) |
| 9 | Algorithm implementation | Opus 4.6 | Codex | |
| 10 | Frontend components (React/Vue/CSS) | Gemini 3.1 Pro | Sonnet 4.6, Gemini 3 Flash (budget) | Codex (weak on FE) |
| 11 | Backend service / REST endpoints | Sonnet 4.6 | GPT-5.4 | |
| 12 | Infrastructure-as-code / terminal | GPT-5.3-Codex | Sonnet 4.6 | |
| 13 | One-shot script (bash, python) | Sonnet 4.6 | Gemini 3 Flash | |
| 14 | Refactoring (cross-file) | Opus 4.6 | GPT-5.2-Codex | |
| 15 | Dependency upgrade / migration | GPT-5.2-Codex | Opus 4.6 | |
| 16 | Stack-trace bug fix | Any frontier (cheapest) | | |
| 17 | Complex debug (race / cross-boundary) | Opus 4.6 | GPT-5.4 | |
| 18 | Performance optimization | GPT-5.3-Codex | Opus 4.6 | |
| 19 | Unit test generation | Sonnet 4.6 | Gemini 2.5 Pro | |
| 20 | Integration / edge-case test design | Opus 4.6 | Codex | |
| 21 | Code review | Codex + Opus (dual) | | Single-model review |
| 22 | Security audit | Opus + Codex (dual) | | GPT-4o (10% clean rate) |
| 23 | Code docs / README | Sonnet 4.6 | GPT-5.4 | Codex (too terse) |
| 24 | ADR / trade-off documentation | Opus 4.6 | GPT-5.2 | |
| 25 | Structured output / function calling | GPT-5.4 strict | Sonnet 4.6 | |
| 26 | Long-context analysis (>200K) | Gemini 2.5 Pro | Opus 4.6 | Haiku |
| 27 | SQL generation | Sonnet 4.6 (24/25) | GPT-5.4 mini (22/25, ¼ cost) | Codex (slow, 21/25) |
| 28 | Classification / extraction | GPT-5.4 nano | Haiku 4.5 | Opus |
| 29 | Long-horizon autonomous (>1h) | Opus (general) / Codex (terminal) | | Gemini (spiral risk) |
| 30 | Data analysis / ETL | Gemini 2.5 Pro | Sonnet 4.6 | |

Detailed criteria, edge cases, and confidence levels → references/categories.md.

Decision Rules

These rules resolve ties and special cases. Apply them after the category lookup. Each cites the source evidence.

Rule 1 — The 80/20 Escalation

Default to Sonnet 4.6. Escalate to Opus 4.6 only when the unit involves architecture, multi-file refactor (>5 files), complex debug (race / cross-boundary), security audit, or long-horizon autonomous (>1h). Opus produces identical results to Sonnet on single-file work — paying 5x more is waste. Source: NxCode 6-task comparison; Faros.ai Reddit synthesis.
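
Rule 1 as a predicate (field names hypothetical):

```python
# Escalation triggers from Rule 1; everything else defaults to Sonnet.

def needs_opus(unit: dict) -> bool:
    return any([
        unit.get("architecture", False),
        unit.get("files_touched", 0) > 5,           # multi-file refactor
        unit.get("race_or_cross_boundary", False),  # complex debug
        unit.get("security_audit", False),
        unit.get("autonomous_hours", 0) > 1,        # long-horizon autonomous
    ])

def pick(unit: dict) -> str:
    return "Claude Opus 4.6" if needs_opus(unit) else "Claude Sonnet 4.6"
```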

Rule 2 — Stack-Trace Bugs Are Equal

All frontier models are equivalent on stack-trace debugging: all five models tested correctly identified the root cause on identical FastAPI traces. Use the cheapest you have access to. Source: jdhodges.com 5-model test, April 2026.

Rule 3 — Race Conditions Justify Opus

Race conditions and cross-boundary bugs → Opus 4.6. In a 90-day practitioner test, Opus diagnosed a race in 2 prompts that Sonnet spent 47 files chasing. The 5x premium is recouped on the first hard bug. Source: alirezarezvani.medium.com 90-day comparison.

Rule 4 — Code Review Uses TWO Models

Always use Codex + Opus (or Codex + Sonnet) for code review and security audits. Different models have different blind spots: Codex catches LIKE injection, race conditions, and logical errors that Claude misses; Claude's explanations are more actionable. Single-model review misses entire bug classes. Source: J.D. Hodges head-to-head, April 2026.

Rule 5 — Terminal / DevOps → Codex

GPT-5.3-Codex: 77.3% Terminal-Bench vs Claude's 65.4%. For bash, Docker, K8s, Terraform, CI/CD, infrastructure scripts: Codex wins. Exception: if the task also involves frontend, switch to Sonnet (Codex is documented as weak on FE). Source: OpenAI Terminal-Bench, corroborated.

Rule 6 — Structured Output → GPT-5.4 Strict

Any task requiring guaranteed schema compliance (JSON, YAML, function calling) → GPT-5.4 with strict mode. CFG-based enforcement makes non-conforming responses structurally impossible (~100%). Claude's native structured outputs hit 99.2%. For production microservices, GPT-5.4 is the safest bet. Source: OpenAI docs; OpenAIToolsHub 300-call test.
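
A hedged sketch of what the strict call might look like, assuming an OpenAI-style structured-outputs request (the SDK surface and model name may differ from what is shown; verify against current docs):

```python
# Assumes the OpenAI Python SDK's json_schema response_format.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5.4",  # roster's strict-mode pick (illustrative name)
    messages=[{"role": "user", "content": "List the /users endpoints as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "endpoints",
            "strict": True,  # non-conforming output becomes structurally impossible
            "schema": {
                "type": "object",
                "properties": {
                    "paths": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["paths"],
                "additionalProperties": False,
            },
        },
    },
)
```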

Rule 7 — Frontend → Gemini 3.1 Pro, Sonnet, or Gemini 3 Flash

Frontend components: Gemini 3.1 Pro (flagship, "vibe coding" champion), Sonnet 4.6 (mid-tier default), or Gemini 3 Flash (budget). GPT-5.3-Codex is documented as weak on frontend. Gemini 3 Flash is 6x cheaper than Gemini 3.0 Pro and scores higher on SWE-bench — adopted by JetBrains, Replit, Warp. Source: Kilo Code evaluation; JetBrains adoption data; all three research synthesis reports name Gemini 3.1 Pro as the frontend leader.

Rule 8 — Long Context → Gemini 2.5 Pro

If the task needs >200K tokens AND is primarily analysis/summarization, route to Gemini 2.5 Pro. Its 1–2M window with a generous free tier makes it ideal for "dump the whole codebase and ask". BUT: Gemini slows past 300–400K tokens, and Opus still wins for cross-file reasoning that requires precision. Source: Reddit 15K-line refactor anecdote; Murat Firebase debugging case.

Rule 9 — Claude Context Degrades Past 400K

Even though Opus/Sonnet advertise 1M context, usable quality degrades past ~400K tokens. For tasks pushing the limit, chunk explicitly OR switch to Gemini 2.5 Pro for the analysis sub-task. Source: practitioner reports, Chroma Research 18-model context test.

Rule 10 — Haiku Is the Sub-Agent Default

For classification, extraction, boilerplate, or any work a junior would do: Haiku 4.5. ~90% of Sonnet 4.5 quality at one-third the cost, with rigid JSON adherence. Use as default worker in any Claude-led agent team. Adopted by Augment Code for sub-agent routing. Source: Augment Code adoption; practitioner consensus.

Rule 11 — Gemini 3 Flash Is the Cost King

78% SWE-bench Verified at $0.50/MTok input. Outperforms Gemini 3 Pro on coding while being 6x cheaper. Default in JetBrains AI Chat, Junie, Replit, Warp, Figma, Amp. BUT: still has Gemini failure modes (hallucination spirals, state contamination). Never run unattended. Source: Kilo Code; Google adoption announcements.

Rule 12 — GPT-5 Silent Failures Are Real

GPT-5 variants can produce code that runs but is wrong — removing safety checks, fabricating outputs to avoid crashes (IEEE Spectrum, April 2026). Never trust "no errors" as success. On safety-critical code, prefer Claude. Always run independent tests. Source: IEEE Spectrum, April 2026.

Rule 13 — Privacy → Local

If data cannot leave your infrastructure, route to local models. High complexity → gpt-oss-120b (single 80GB GPU). Medium → Qwen3-Coder-Next (3B active, consumer hardware). Function-calling/RAG → Gemma 4 (Apache 2.0, 8B). Frontier-class is unavailable locally; expect a quality drop on architecture and complex debug. Source: J.D. Hodges April 2026 head-to-head.

Rule 14 — Planning Quality > Model Choice

Once you are at or above a baseline model tier, output quality is determined mostly by how clearly the task is planned. Spending more time in Step 2 (Sufficiency Check) and Step 4 (Categorize) beats upgrading the model. Source: ShopBack engineer Hoang Nguyen, Jan 2026.

Rule 15 — Drafter-Reviewer Pattern (Speed + Quality)

For rapid iteration where final correctness still matters: use a fast model (Codex-Spark, Gemini 3 Flash, GPT-5.4 mini) to generate drafts, then a flagship (Opus 4.6) to review/patch. 3x faster than reasoning-only, with no correctness loss. Apply when iteration speed dominates. Source: Codex-Spark drafter-reviewer benchmark.
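
Sketched as two stages, reusing the hypothetical `call` helper from the Pattern A sketch above:

```python
def call(model: str, prompt: str) -> str:
    """Hypothetical provider-agnostic helper: send prompt, return reply."""
    raise NotImplementedError  # wire to your provider SDKs

def draft_then_review(task: str, n_drafts: int = 3) -> str:
    drafts = [call("Gemini 3 Flash", f"Draft attempt {i + 1}: {task}")
              for i in range(n_drafts)]
    return call("Claude Opus 4.6",
                "Pick the strongest draft, then patch its defects:\n\n"
                + "\n---\n".join(drafts))
```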

Extended rules with full source citations → references/rules.md.

Output Format

Emit a markdown document with this structure:

# Routing Plan: <Task Name>

## Breakdown Assessment
- Input was [sufficiently / insufficiently] decomposed.
- [If insufficient: list which units failed which Sufficiency criterion and the proposed finer breakdown.]

## Global Context
- **Scope:** <single/multi-file, files touched, languages, frameworks>
- **Constraints:** <privacy, ecosystem, latency, framework versions>
- **Cost posture:** <quality-first / balanced / cost-optimized>

## Work Units

### Unit 1: <short imperative name>
- **Category:** #<n> <category name>
- **Assigned model:** <primary> [+ <secondary> if dual-model pattern]
- **Rationale:** <2–4 sentences. Cite the category number AND the rule number. Explain the type of work and why this model is best for that type.>
- **Context briefing:**
  - *Files to read:* <explicit paths or globs>
  - *Prior outputs:* <which earlier units this depends on>
  - *Constraints:* <style, framework version, libs allowed/banned, conventions>
  - *Acceptance criteria:* <measurable done>
- **Failure modes to watch:** <2–3 model-specific risks from `references/models.md`>
- **Budget hint:** <rough token range if known>

### Unit 2: ...

## Execution Order
1. Unit 1 (entry point)
2. Units 2 & 3 (parallel after Unit 1)
3. ...

## Cost Summary
- Flagship calls: N
- Mid-tier calls: M
- Fast calls: K
- Rough envelope: <ballpark or "unknown; depends on token counts">

Worked Example

Input:

"Migrate the /users REST API from Express+Mongoose to Fastify+Prisma. Preserve endpoints and response shapes. Add tests. Update OpenAPI spec."

Output:

# Routing Plan: Express → Fastify+Prisma Migration for /users API

## Breakdown Assessment
Input was **insufficiently decomposed** — single sentence spans planning, schema, implementation, tests, docs across 3 model tiers. Failed Sufficiency criteria 1, 2, 3, 5. Proposed 8-unit breakdown below.

## Global Context
- **Scope:** Multi-file, TypeScript, Node backend, ~15 files in `src/routes/users/`.
- **Constraints:** Preserve API contract — no breaking changes to response shapes.
- **Cost posture:** Balanced.

## Work Units

### Unit 1: Migration plan + risk analysis
- **Category:** #15 Dependency upgrade / migration (also #2 planning)
- **Assigned model:** GPT-5.3-Codex (primary), Opus 4.6 (review)
- **Rationale:** Migration planning category — Codex's "touch the minimum necessary" approach reduces risk in large-repo upgrades (Rule 5). Opus reviews the plan for dependency-chain gaps (Rule 1: multi-file refactor escalation justified). The work type is sequencing-with-rollback, where conservative scoping matters more than cleverness.
- **Context briefing:**
  - *Files to read:* `src/routes/users/**/*.ts`, `src/models/User.ts`, `package.json`, `openapi.yaml`
  - *Prior outputs:* none (entry point)
  - *Constraints:* No endpoint removals; preserve response shapes; phased rollout.
  - *Acceptance criteria:* Phased plan with rollback points and risk-per-phase.
- **Failure modes:** Codex over-suggests scope ("Should I also do X?") — add an explicit "do not suggest extras" instruction. Opus silently falls back to Sonnet under rate limits — verify the model header.

### Unit 2: Prisma schema from Mongoose model
- **Category:** #6 Database schema design
- **Assigned model:** Sonnet 4.6
- **Rationale:** Single-artifact schema translation, well-scoped, single file. The 80/20 default applies (Rule 1) — Opus would produce identical output for this kind of structural translation.
- **Context briefing:**
  - *Files to read:* `src/models/User.ts`
  - *Prior outputs:* Unit 1 plan
  - *Constraints:* Prisma 5+, PostgreSQL target, preserve all indexes and unique constraints.
  - *Acceptance criteria:* `schema.prisma` compiles; index parity with source.
- **Failure modes:** Sonnet greenfield bias — may drop legacy constraints. Provide explicit "preserve all" instruction.

### Unit 3: Fastify route handlers (one per endpoint)
- **Category:** #11 Backend service / REST endpoints
- **Assigned model:** Sonnet 4.6
- **Rationale:** Standard REST handler implementation, where Sonnet has High-confidence first-try success. The work type is well-scoped translation between two REST frameworks, not novel logic.
- **Context briefing:**
  - *Files to read:* current Express handler, Prisma schema (Unit 2 output), OpenAPI contract.
  - *Prior outputs:* Units 1, 2.
  - *Constraints:* Preserve status codes, error shapes, query params; one handler per call.
  - *Acceptance criteria:* Existing integration tests still pass.
- **Failure modes:** Sonnet "haphazard multi-file changes" — constrain to one handler per call.

### Unit 4: Unit tests for each handler
- **Category:** #19 Unit test generation
- **Assigned model:** Sonnet 4.6
- **Rationale:** Test generation is Sonnet's strongest area. Opus would add marginal edge cases at 5x cost — not worth it for unit-level tests (Rule 1).
- **Context briefing:** Handler code (Unit 3), test framework config, existing test style guide.
- **Failure modes:** ~40% mutation kill rate across all models — don't trust generated tests as exhaustive coverage.

### Unit 5: Integration test for migration parity
- **Category:** #20 Integration / edge-case test design
- **Assigned model:** Opus 4.6
- **Rationale:** Cross-boundary coverage (request → handler → Prisma → DB) is exactly where Opus catches edge cases others miss ("webhook replay, partial refund rounding"). The work type is *contract verification across two stacks* — flagship reasoning earns its premium here.
- **Context briefing:** Both old and new routes, sample payloads, prior bug reports.
- **Failure modes:** Opus may suggest unrelated refactors — explicit negative constraint.

### Unit 6: OpenAPI spec update
- **Category:** #25 Structured output
- **Assigned model:** GPT-5.4 (strict mode)
- **Rationale:** OpenAPI is schema-bound output where strict mode guarantees validity (Rule 6). Claude could do this at 99.2% but GPT-5.4 strict is the production-safe choice.
- **Context briefing:** New handler signatures, old OpenAPI spec, target file path.
- **Failure modes:** GPT-5.4 loses long system-prompt constraints — keep prompt lean and re-inject schema.

### Unit 7: Security review of auth paths
- **Category:** #22 Security audit
- **Assigned model:** **Opus 4.6 + GPT-5.3-Codex** (dual)
- **Rationale:** Rule 4 — different blind spots. Opus catches auth-flow logic issues; Codex catches concrete injection and race vulnerabilities. Merge findings.
- **Context briefing:** All new handler files, schema, middleware.
- **Failure modes:** Iterating on security review worsens outcomes (37.6% increase in critical vulns after 5 iterations) — one-shot review, do not iterate.

### Unit 8: CHANGELOG and migration guide
- **Category:** #23 Code docs / README
- **Assigned model:** Sonnet 4.6
- **Rationale:** Natural prose, low cost; Codex would be too terse for migration guides.
- **Context briefing:** Diff summary, Units 1–7 outputs.
- **Failure modes:** — (routine task)

## Execution Order
1. Unit 1 (planning)
2. Unit 2 (schema) — after Unit 1
3. Units 3 & 4 (handlers + tests) — parallel per endpoint, after Unit 2
4. Units 5 & 6 (integration test + OpenAPI) — parallel after Unit 3
5. Unit 7 (security review) — after Unit 3 completes
6. Unit 8 (docs) — last

## Cost Summary
- Flagship calls (Opus): 1 review (Unit 1) + Unit 5 + Unit 7 → ~3
- Mid-tier calls (Sonnet/Codex/GPT-5.4): ~8
- Fast calls: 0
- Rough envelope: budget 4–8h of agentic developer time; flagship dominates token cost.

Common Mistakes

| Mistake | Fix |
|---|---|
| Routing "implement feature X" to one model | Run Sufficiency Check first; break into plan → schema → impl → tests → review → docs |
| Using Opus for boilerplate | Rule 10: Haiku or GPT-5.4 nano. Opus is 10x cost waste here. |
| Single-model code review | Rule 4: dual-model is mandatory for review and security |
| Ignoring Claude context degradation past 400K | Rule 9: chunk explicitly or hand off to Gemini 2.5 Pro |
| Trusting "no errors" from GPT-5 as success | Rule 12: always run independent tests |
| Picking by vibes instead of category lookup | Category lookup first, decision rules second |
| Omitting failure-mode warnings in briefings | Include 2–3 per unit from references/models.md |
| Routing private code to API models | Rule 13: check privacy in Step 1; route to local (Qwen3 / Gemma 4) |
| Routing frontend to Codex | Codex is documented as weak on frontend; use Sonnet or Gemini 3 Flash |
| Rationale doesn't cite a category # or rule # | Not evidence-based — re-do |

Red Flags — STOP and Re-Route

  • The same model is selected for every unit → you're defaulting, not routing. Re-apply categories.
  • No rationale cites a category number or rule number → not evidence-based.
  • Context briefing says "the whole repo" → chunk it or use long-context model.
  • Any unit estimated >1h autonomous runtime without a circuit breaker → add iteration limit and time cap.
  • Privacy constraint exists but model is API-hosted → re-route to local.

Data Freshness

This skill reflects practitioner evidence as of April 2026. The model landscape shifts monthly. Re-validate quarterly.

Models covered:

  • Anthropic: Claude Opus 4.6 (Feb 5 2026), Sonnet 4.6 (Feb 17 2026), Haiku 4.5 (Oct 2025)
  • OpenAI: GPT-5.4 / 5.4 Pro / 5.4 mini / 5.4 nano (Mar 5 2026), GPT-5.3-Codex, GPT-5.2-Codex (retiring Jun 5 2026), GPT-4.1
  • Google: Gemini 3.1 Pro (Feb 19 2026), Gemini 3 Flash, Gemini 2.5 Pro/Flash/Flash-Lite
  • Open-weight: Qwen3-Coder-Next, Gemma 4, gpt-oss-120b/20b, MiniMax M2.5

References

  • references/models.md — Full model profiles: pricing, context, sweet spots, weaknesses, failure modes.
  • references/categories.md — All 30 task categories with confidence levels and edge cases.
  • references/rules.md — Extended decision rules with full source citations.