maptasks — Route Engineering Work to the Right Model
Overview
Given a specified engineering task (or set of tasks), produce a routing plan that:
- Checks whether the task is broken down enough to route. If not, proposes a finer breakdown first.
- Categorizes each leaf work unit using a 30-category task taxonomy.
- Assigns each unit to the best-suited model based on practitioner evidence (April 2026).
- Explains why this model is best for this category — citing the rule or evidence.
- Specifies the context briefing each model needs to do its job.
- Flags model-specific failure modes to watch for.
The skill is evidence-driven, not vibes-driven. Every recommendation traces to a category lookup or a numbered decision rule.
The Three-Tier Strategy (always implicit)
Every routing plan should distribute work across three tiers in roughly these proportions:
| Tier | Share of work | Models |
|---|---|---|
| Flagship reasoning | 5–10% | Opus 4.6, GPT-5.4 Pro, Gemini 3.1 Pro |
| Balanced mid-tier | 40–50% | Sonnet 4.6, GPT-5.4, GPT-5.3-Codex, Gemini 2.5 Pro |
| Cheap / fast | 30–40% | Haiku 4.5, GPT-5.4 mini/nano, Gemini 3 Flash |
A routing plan that uses only one tier (all Opus, or all Sonnet) is almost always wrong. The two failure modes to avoid:
- Verbose model on shallow task → cache thrashing, quota exhaustion (Opus on boilerplate).
- Speed model on hard reasoning → "fast failure" — race conditions and memory leaks discovered after merge.
The skill prevents both by forcing each unit through category lookup → tier check → model selection.
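To make the tier check concrete, here is a minimal sketch that flags single-tier plans and off-target tier shares. It assumes share is measured by count of work units; measuring by expected tokens would be equally valid. The `RoutedUnit` shape is an assumption, not an existing schema.

```typescript
type Tier = "flagship" | "mid" | "fast";

interface RoutedUnit {
  name: string;
  tier: Tier;
}

// Target share bands from the table above (advisory, not hard limits;
// small plans will legitimately fall outside them).
const TARGET_SHARE: Record<Tier, [number, number]> = {
  flagship: [0.05, 0.10],
  mid: [0.40, 0.50],
  fast: [0.30, 0.40],
};

function tierWarnings(plan: RoutedUnit[]): string[] {
  const warnings: string[] = [];
  const tiersUsed = new Set(plan.map((u) => u.tier));
  if (tiersUsed.size === 1) {
    warnings.push("single-tier plan: defaulting, not routing");
  }
  for (const tier of Object.keys(TARGET_SHARE) as Tier[]) {
    const share = plan.filter((u) => u.tier === tier).length / plan.length;
    const [lo, hi] = TARGET_SHARE[tier];
    if (share < lo || share > hi) {
      warnings.push(`${tier} share ${(share * 100).toFixed(0)}% is outside the ${lo * 100}-${hi * 100}% band`);
    }
  }
  return warnings;
}
```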
When to Use
- You have a fully or partially specified task and are about to execute it with AI assistance.
- You want to assemble a multi-model team (e.g. Architect + Implementer + Reviewer).
- You're choosing between two or more models for a specific piece of work and want the comparison.
- You want a routing plan before spinning up subagents or parallel sessions.
- You want to know whether a task description is decomposed enough to be routable at all.
Do NOT use this skill for:
- Picking a chat model for casual conversation.
- Reviewing the quality of code (use `kiss`, `solid-principles`, `beyond-solid-principles`).
- Estimating timelines or effort in person-hours.
Inputs
The user provides one of:
- A task description (e.g. "Implement OAuth2 login with Google, update schema, add tests, update docs").
- A pre-broken-down task list (markdown or bullets).
- A spec / PRD / GitHub issue / file path.
If anything critical is missing, ask for it before routing:
- Scope: How many files? What's the language/framework/stack?
- Constraints: Privacy (can data leave your infra?), ecosystem (Google/JVM/.NET), latency.
- Cost posture: cost-optimized, balanced, or quality-first?
- Done criteria: How will you know each unit is finished?
If the user describes a team type or role context (solo MVP dev, senior in a production codebase, frontend startup, backend team, migration team, docs team, agentic CLI workflow, privacy-constrained/air-gapped), pre-load the matching default stack from references/rules.md Part 3 (Persona-Based Routing Scenarios) and use it as the starting policy before applying category lookups.
Workflow
1. INGEST → Read task / spec. Identify scope and constraints.
2. SUFFICIENCY → Run Breakdown Sufficiency Check on each unit.
3. DECOMPOSE → If any unit fails, propose finer breakdown BEFORE routing.
4. CATEGORIZE → Map each leaf unit to one of the 30 task categories.
5. SELECT → Apply category lookup, then decision rules to pick a model.
6. BRIEF → Build the context briefing for each unit.
7. EMIT → Output the routing plan in the standard format.
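As a sketch, the workflow wires together like this. Every type and helper signature below is illustrative, not an existing API; later sections flesh out `isRoutable` and `selectModel`.

```typescript
interface WorkUnit { name: string; description: string; }
interface RoutedUnit extends WorkUnit { category: number; model: string; briefing: string; }

declare function ingest(spec: string): WorkUnit[];                      // 1. INGEST
declare function isRoutable(u: WorkUnit): boolean;                      // 2. SUFFICIENCY
declare function decompose(u: WorkUnit): WorkUnit[];                    // 3. DECOMPOSE
declare function categorize(u: WorkUnit): number;                       // 4. CATEGORIZE (1..30)
declare function selectModel(u: WorkUnit, category: number): string;    // 5. SELECT
declare function buildBriefing(u: WorkUnit, category: number): string;  // 6. BRIEF
declare function emitPlan(units: RoutedUnit[]): string;                 // 7. EMIT

function routeTask(spec: string): string {
  const units = ingest(spec);
  // Decomposed units should be re-checked in practice; flattened here for brevity.
  const leaves = units.flatMap((u) => (isRoutable(u) ? [u] : decompose(u)));
  const routed = leaves.map((u) => {
    const category = categorize(u);
    return {
      ...u,
      category,
      model: selectModel(u, category),
      briefing: buildBriefing(u, category),
    };
  });
  return emitPlan(routed);
}
```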
Step 2: Breakdown Sufficiency Check
A work unit is routable only if all five criteria hold:
| # | Criterion | Failing example |
|---|---|---|
| 1 | Belongs to exactly one task category | "Implement and test" — split |
| 2 | Requires exactly one model tier | "Plan and write docs" mixes flagship + mid — split |
| 3 | Produces a single artifact (one file, one plan, one review) | "Build OAuth flow" — split |
| 4 | Fits in one context budget (no mid-task re-injection) | "Migrate the whole repo" — chunk by module |
| 5 | Has measurable done criteria | "Make it better" — define done |
If any criterion fails, propose a finer breakdown before routing. Don't route an under-decomposed unit; you'll just push the routing problem onto the model.
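A hedged sketch of the check as code, assuming units are annotated with the relevant fields during INGEST (the `AnnotatedUnit` shape is an assumption):

```typescript
interface AnnotatedUnit {
  categories: number[];   // task categories the unit spans
  tiers: string[];        // model tiers the unit would need
  artifacts: string[];    // artifacts produced (file, plan, review)
  estTokens: number;      // estimated context required
  doneCriteria: string[]; // measurable completion checks
}

function failedCriteria(u: AnnotatedUnit, contextBudget = 400_000): number[] {
  const failed: number[] = [];
  if (u.categories.length !== 1) failed.push(1);   // exactly one category
  if (u.tiers.length !== 1) failed.push(2);        // exactly one tier
  if (u.artifacts.length !== 1) failed.push(3);    // single artifact
  if (u.estTokens > contextBudget) failed.push(4); // fits one context budget
  if (u.doneCriteria.length === 0) failed.push(5); // measurable done criteria
  return failed; // empty => routable; otherwise decompose first
}
```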
Step 5: Selecting the Model
- Category lookup (Quick Reference table below) gives you a primary + runner-up.
- Apply decision rules (Rules section below) to resolve ties or special cases.
- Check constraints: privacy → local model; ecosystem → Gemini for Google stack; long context → Gemini 2.5 Pro.
- Sanity check: cite the category number AND the rule number in the rationale.
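A minimal sketch of this selection logic, stubbing a few rows of the Quick Reference table. The override ordering (privacy first, then context size, then lookup) is an assumption:

```typescript
interface Constraints {
  privateData: boolean; // data may not leave your infra
  estTokens: number;
}

const CATEGORY_PRIMARY: Record<number, string> = {
  7: "Sonnet 4.6",      // greenfield feature
  8: "Haiku 4.5",       // boilerplate
  10: "Gemini 3.1 Pro", // frontend components
  12: "GPT-5.3-Codex",  // IaC / terminal
  // ...remaining rows from the Quick Reference table
};

function selectModel(category: number, c: Constraints): string {
  if (c.privateData) return "Qwen3-Coder-Next";       // Rule 13: privacy -> local
  if (c.estTokens > 200_000) return "Gemini 2.5 Pro"; // coarse; see Context Decay below for the cross-file nuance
  // Ecosystem preference (Google stack -> Gemini) acts as a tie-breaker
  // between primary and runner-up; omitted here for brevity.
  return CATEGORY_PRIMARY[category] ?? "Sonnet 4.6";  // Rule 1: 80/20 default
}
```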
Model Roster
| Model | Tier | Sweet Spot | Context (usable) | $/M in | $/M out |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Flagship | Architecture, multi-file refactor, complex debug (race / cross-boundary), security audit, 7h+ autonomous | 1M (~400K reliable) | $5 | $25 |
| GPT-5.4 Pro | Flagship | Computer use (75% OSWorld), mid-task steering, native tool chaining | 1M | $30 | $180 |
| Gemini 3.1 Pro | Flagship | Frontier reasoning, frontend/UI, Google ecosystem (Go/Dart/Flutter/Angular) | 1M+ | $2–4 | $12–18 |
| Claude Sonnet 4.6 | Mid | The 80% model: daily features, bug fixes, tests, docs, API/SQL design, frontend components | 1M (~400K reliable) | $3 | $15 |
| GPT-5.4 | Mid | Structured output (~100% strict), tool/function calling, greenfield iteration | 1.05M | $2.50 | $15 |
| GPT-5.3-Codex | Mid | Terminal/DevOps/IaC (77.3% Terminal-Bench), code review, long-horizon autonomous (25h) | 400K | $2.50 | $15 |
| Gemini 2.5 Pro | Mid | Long-context analysis, whole-codebase summarization, data analysis, R workflows | 1–2M | $1.25 | $10 |
| Claude Haiku 4.5 | Fast | Sub-agent workhorse, classification, boilerplate, strict JSON adherence | 200K | $0.80 | $5 |
| GPT-5.4 mini | Fast | Routine code, sub-agent, "almost indistinguishable from full" on simple tasks | 400K | $0.75 | $4.50 |
| GPT-5.4 nano | Fast | Classification, extraction, guardrails, high-volume pipelines | 400K | $0.20 | $1.25 |
| Gemini 3 Flash | Fast | Cost king for coding (78% SWE-bench at $0.50/MTok), default in JetBrains/Replit/Warp | 1M | $0.50 | $3 |
| Codex-Spark | Speed | Rapid prototyping drafter (1,000+ tok/s, 15x faster than Codex). Pair with Opus reviewer (Pattern B) | varies | varies | varies |
| Qwen3-Coder-Next | Local | Privacy-bound or offline coding (consumer hardware), Apache 2.0 | 128K | free | free |
| Gemma 4 | Local | Local function calling, structured output, RAG (Apache 2.0) | varies | free | free |
Full profiles, weaknesses, and failure modes → references/models.md.
Context Decay (the "1M Token Myth")
Advertised context windows ≠ usable context. Empirical retrieval accuracy:
| Model | 256K accuracy | 1M accuracy | Decay rate | Practical max |
|---|---|---|---|---|
| Claude Opus 4.6 | 92.1% | 78.3% | 15% | ~400K reliable |
| GPT-5.4 | 79.5% | 36.6% | 54% | ~7,500 LoC reliable |
| Gemini 3.1 Pro | 35.0% | 25.9% | 26% | ~15,000 LoC (no cross-file reasoning) |
Routing implication: Long-context jobs should be routed by task type, not by raw window size:
- Retrieval / summarization >200K → Gemini 2.5 Pro (1–2M, generous free tier)
- Cross-file reasoning >200K → Claude Opus 4.6, but chunk to stay under 400K
- Whole-codebase analysis with no precision needed → Gemini 2.5 Pro
- Anything past Opus 400K threshold → switch to Gemini 2.5 Pro for that sub-task
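These rules collapse into a small helper. The thresholds come from the decay table; the `crossFile` flag is an assumed annotation on the unit:

```typescript
function pickLongContextModel(estTokens: number, crossFile: boolean): string {
  if (estTokens <= 200_000) return "no override: use the category lookup";
  if (!crossFile) return "Gemini 2.5 Pro";            // retrieval / summarization
  if (estTokens <= 400_000) return "Claude Opus 4.6"; // precise cross-file reasoning
  return "Gemini 2.5 Pro (or chunk for Opus)";        // past the reliable Opus window
}
```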
Mixed-Model Workflow Patterns
When more than one model collaborates on a task, use a named pattern. Full descriptions in references/rules.md Part 2.
| Pattern | Composition | Best for |
|---|---|---|
| A. Architect / Implementer / Reviewer | Opus → Sonnet → (Codex+Opus dual review) | New features in production codebases (the dominant 3-tier strategy) |
| B. Drafter / Reviewer | Gemini 3 Flash or Codex-Spark or GPT-5.4 mini → Opus 4.6 patch | Frontend iteration, prototyping, "show me 30 variants" — 3x faster than reasoning-only with no correctness loss |
| C. Claude / Gemini Pipeline | Opus plan → Gemini 2.5/3 implement → Opus review | Frontend-heavy work where Gemini's UI strengths matter |
| D. Multi-Tier Subagent Team | Sonnet lead + Haiku workers + Opus/Codex specialists | High-volume agentic pipelines (Augment Code style) |
| E. Cost-Capped Default Stack | Sonnet default + Opus escalation + Haiku downshift + GPT-5.4 strict for structured output | Sustainable production AI workflows |
| F. Single-Vendor Lock-In | Opus→Sonnet→Haiku OR GPT-5.4→mini→nano OR Gemini 3.1→Flash→Flash-Lite | Org policy / billing simplification |
When a unit's rationale invokes a pattern, name it explicitly: "Pattern A — Architect/Implementer/Reviewer".
Task Categories (Quick Reference)
| # | Category | Primary | Runner-up | Avoid |
|---|---|---|---|---|
| 1 | Architecture / system design | Opus 4.6 | GPT-5.4 Pro | — |
| 2 | Task decomposition / planning | Opus 4.6 | GPT-5.4 | Haiku |
| 3 | Tech-stack / tradeoff analysis | Opus 4.6 | GPT-5.2 | — |
| 4 | Spec writing (business → tech) | Opus 4.6 | Sonnet 4.6 | Cursor Composer-2 |
| 5 | API contract design (OpenAPI/protobuf) | Sonnet 4.6 | GPT-5.4 | — |
| 6 | Database schema design | Sonnet 4.6 | Gemini 2.5 Pro | — |
| 7 | Greenfield feature implementation | Sonnet 4.6 | GPT-5.4 | — |
| 8 | Boilerplate / scaffolding | Haiku 4.5 | GPT-5.4 nano | Opus (10x cost waste) |
| 9 | Algorithm implementation | Opus 4.6 | Codex | — |
| 10 | Frontend components (React/Vue/CSS) | Gemini 3.1 Pro | Sonnet 4.6, Gemini 3 Flash (budget) | Codex (weak on FE) |
| 11 | Backend service / REST endpoints | Sonnet 4.6 | GPT-5.4 | — |
| 12 | Infrastructure-as-code / terminal | GPT-5.3-Codex | Sonnet 4.6 | — |
| 13 | One-shot script (bash, python) | Sonnet 4.6 | Gemini 3 Flash | — |
| 14 | Refactoring (cross-file) | Opus 4.6 | GPT-5.2-Codex | — |
| 15 | Dependency upgrade / migration | GPT-5.2-Codex | Opus 4.6 | — |
| 16 | Stack-trace bug fix | Any frontier (cheapest) | — | — |
| 17 | Complex debug (race / cross-boundary) | Opus 4.6 | GPT-5.4 | — |
| 18 | Performance optimization | GPT-5.3-Codex | Opus 4.6 | — |
| 19 | Unit test generation | Sonnet 4.6 | Gemini 2.5 Pro | — |
| 20 | Integration / edge-case test design | Opus 4.6 | Codex | — |
| 21 | Code review | Codex + Opus (dual) | — | Single-model review |
| 22 | Security audit | Opus + Codex (dual) | — | GPT-4o (10% clean rate) |
| 23 | Code docs / README | Sonnet 4.6 | GPT-5.4 | Codex (too terse) |
| 24 | ADR / trade-off documentation | Opus 4.6 | GPT-5.2 | — |
| 25 | Structured output / function calling | GPT-5.4 strict | Sonnet 4.6 | — |
| 26 | Long-context analysis (>200K) | Gemini 2.5 Pro | Opus 4.6 | Haiku |
| 27 | SQL generation | Sonnet 4.6 (24/25) | GPT-5.4 mini (22/25, ¼ cost) | Codex (slow, 21/25) |
| 28 | Classification / extraction | GPT-5.4 nano | Haiku 4.5 | Opus |
| 29 | Long-horizon autonomous (>1h) | Opus (general) / Codex (terminal) | — | Gemini (spiral risk) |
| 30 | Data analysis / ETL | Gemini 2.5 Pro | Sonnet 4.6 | — |
Detailed criteria, edge cases, and confidence levels → references/categories.md.
Decision Rules
These rules resolve ties and special cases. Apply them after the category lookup. Each cites the source evidence.
Rule 1 — The 80/20 Escalation
Default to Sonnet 4.6. Escalate to Opus 4.6 only when the unit involves architecture, multi-file refactor (>5 files), complex debug (race / cross-boundary), security audit, or long-horizon autonomous (>1h). Opus produces identical results to Sonnet on single-file work — paying 5x more is waste. Source: NxCode 6-task comparison; Faros.ai Reddit synthesis.
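As a predicate, assuming the unit carries the relevant flags:

```typescript
interface Unit {
  isArchitecture: boolean;
  filesTouched: number;
  complexDebug: boolean;  // race / cross-boundary
  securityAudit: boolean;
  autonomousHours: number;
}

function escalateToOpus(u: Unit): boolean {
  return (
    u.isArchitecture ||
    u.filesTouched > 5 ||   // multi-file refactor
    u.complexDebug ||
    u.securityAudit ||
    u.autonomousHours > 1   // long-horizon autonomous
  );
}
// false => stay on Sonnet 4.6: identical results on single-file work.
```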
Rule 2 — Stack-Trace Bugs Are Equal
All frontier models are equivalent on stack-trace debugging: all five models tested correctly identified the root cause on identical FastAPI traces. Use the cheapest you have access to. Source: jdhodges.com 5-model test, April 2026.
Rule 3 — Race Conditions Justify Opus
Race conditions and cross-boundary bugs → Opus 4.6. In a 90-day practitioner test, Opus diagnosed a race in 2 prompts that Sonnet spent 47 files chasing. The 5x premium is recouped on the first hard bug. Source: alirezarezvani.medium.com 90-day comparison.
Rule 4 — Code Review Uses TWO Models
Always use Codex + Opus (or Codex + Sonnet) for code review and security audits. Different models have different blind spots: Codex catches LIKE injection, race conditions, and logical errors that Claude misses; Claude's explanations are more actionable. Single-model review misses entire bug classes. Source: J.D. Hodges head-to-head, April 2026.
Rule 5 — Terminal / DevOps → Codex
GPT-5.3-Codex: 77.3% Terminal-Bench vs Claude's 65.4%. For bash, Docker, K8s, Terraform, CI/CD, infrastructure scripts: Codex wins. Exception: if the task also involves frontend, switch to Sonnet (Codex is documented as weak on FE). Source: OpenAI Terminal-Bench, corroborated.
Rule 6 — Structured Output → GPT-5.4 Strict
Any task requiring guaranteed schema compliance (JSON, YAML, function calling) → GPT-5.4 with strict mode. CFG-based enforcement makes non-conforming responses structurally impossible (~100%). Claude's native structured outputs hit 99.2%. For production microservices, GPT-5.4 is the safest bet. Source: OpenAI docs; OpenAIToolsHub 300-call test.
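A hedged sketch of a strict-mode call. The request shape mirrors OpenAI's current JSON-schema `response_format`; whether GPT-5.4 strict mode keeps this exact shape is an assumption, and the schema itself is a placeholder:

```typescript
const body = {
  model: "gpt-5.4", // placeholder model id from the roster
  messages: [{ role: "user", content: "List the endpoints in this spec: ..." }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "endpoint_list",
      strict: true, // non-conforming output becomes structurally impossible
      schema: {
        type: "object",
        properties: {
          endpoints: { type: "array", items: { type: "string" } },
        },
        required: ["endpoints"],
        additionalProperties: false,
      },
    },
  },
};

const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(body),
});
const json = await res.json();
const parsed = JSON.parse(json.choices[0].message.content); // matches the schema by construction
```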
Rule 7 — Frontend → Gemini 3.1 Pro, Sonnet, or Gemini 3 Flash
Frontend components: Gemini 3.1 Pro (flagship, "vibe coding" champion), Sonnet 4.6 (mid-tier default), or Gemini 3 Flash (budget). GPT-5.3-Codex is documented as weak on frontend. Gemini 3 Flash is 6x cheaper than Gemini 3.0 Pro and scores higher on SWE-bench — adopted by JetBrains, Replit, Warp. Source: Kilo Code evaluation; JetBrains adoption data; all three research synthesis reports name Gemini 3.1 Pro as the frontend leader.
Rule 8 — Long Context → Gemini 2.5 Pro
If the task needs >200K tokens AND is primarily analysis/summarization, route to Gemini 2.5 Pro. Its 1–2M window with a generous free tier makes it ideal for "dump the whole codebase and ask". BUT: Gemini slows past 300–400K tokens, and Opus still wins for cross-file reasoning that requires precision. Source: Reddit 15K-line refactor anecdote; Murat Firebase debugging case.
Rule 9 — Claude Context Degrades Past 400K
Even though Opus/Sonnet advertise 1M context, usable quality degrades past ~400K tokens. For tasks pushing the limit, chunk explicitly OR switch to Gemini 2.5 Pro for the analysis sub-task. Source: practitioner reports, Chroma Research 18-model context test.
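A chunking sketch, assuming a rough 4-characters-per-token estimate and per-file granularity (tune both to your tokenizer and codebase):

```typescript
interface SourceFile { path: string; content: string; }

function chunkUnderBudget(files: SourceFile[], maxTokens = 400_000): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const f of files) {
    const est = Math.ceil(f.content.length / 4); // rough token estimate
    if (used + est > maxTokens && current.length > 0) {
      chunks.push(current); // close the chunk before it crosses the budget
      current = [];
      used = 0;
    }
    current.push(f.path);
    used += est;
  }
  if (current.length > 0) chunks.push(current);
  return chunks; // one Opus pass per chunk, or hand the full set to Gemini 2.5 Pro
}
```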
Rule 10 — Haiku Is the Sub-Agent Default
For classification, extraction, boilerplate, or any work a junior would do: Haiku 4.5. ~90% of Sonnet 4.5 quality at one-third the cost, with rigid JSON adherence. Use as default worker in any Claude-led agent team. Adopted by Augment Code for sub-agent routing. Source: Augment Code adoption; practitioner consensus.
Rule 11 — Gemini 3 Flash Is the Cost King
78% SWE-bench Verified at $0.50/MTok input. Outperforms Gemini 3 Pro on coding while being 6x cheaper. Default in JetBrains AI Chat, Junie, Replit, Warp, Figma, Amp. BUT: still has Gemini failure modes (hallucination spirals, state contamination). Never run unattended. Source: Kilo Code; Google adoption announcements.
Rule 12 — GPT-5 Silent Failures Are Real
GPT-5 variants can produce code that runs but is wrong — removing safety checks, fabricating outputs to avoid crashes (IEEE Spectrum, April 2026). Never trust "no errors" as success. On safety-critical code, prefer Claude. Always run independent tests. Source: IEEE Spectrum, April 2026.
Rule 13 — Privacy → Local
If data cannot leave your infrastructure, route to local models. High complexity → gpt-oss-120b (single 80GB GPU). Medium → Qwen3-Coder-Next (3B active, consumer hardware). Function-calling/RAG → Gemma 4 (Apache 2.0, 8B). Frontier-class is unavailable locally; expect a quality drop on architecture and complex debug. Source: J.D. Hodges April 2026 head-to-head.
Rule 14 — Planning Quality > Model Choice
Past a baseline level of model capability, output quality is mostly determined by how clearly the task is planned. Spending more time in Step 2 (Sufficiency Check) and Step 4 (Categorize) beats upgrading the model. Source: ShopBack engineer Hoang Nguyen, Jan 2026.
Rule 15 — Drafter-Reviewer Pattern (Speed + Quality)
For rapid iteration where final correctness still matters: use a fast model (Codex-Spark, Gemini 3 Flash, GPT-5.4 mini) to generate drafts, then a flagship (Opus 4.6) to review/patch. 3x faster than reasoning-only, with no correctness loss. Apply when iteration speed dominates. Source: Codex-Spark drafter-reviewer benchmark.
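A minimal sketch of the pattern; `callModel` is an assumed helper and the model ids are placeholders for any fast/flagship pairing from the roster:

```typescript
declare function callModel(model: string, prompt: string): Promise<string>;

async function draftThenReview(task: string): Promise<string> {
  // Fast drafter: cheap, high-throughput first pass.
  const draft = await callModel("gemini-3-flash", `Draft an implementation:\n${task}`);
  // Flagship reviewer: patches the draft for correctness.
  return callModel(
    "claude-opus-4.6",
    `Review this draft for correctness and return a patched version:\n${draft}`
  );
}
```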
Extended rules with full source citations → references/rules.md.
Output Format
Emit a markdown document with this structure:
# Routing Plan: <Task Name>
## Breakdown Assessment
- Input was [sufficiently / insufficiently] decomposed.
- [If insufficient: list which units failed which Sufficiency criterion and the proposed finer breakdown.]
## Global Context
- **Scope:** <single/multi-file, files touched, languages, frameworks>
- **Constraints:** <privacy, ecosystem, latency, framework versions>
- **Cost posture:** <quality-first / balanced / cost-optimized>
## Work Units
### Unit 1: <short imperative name>
- **Category:** #<n> <category name>
- **Assigned model:** <primary> [+ <secondary> if dual-model pattern]
- **Rationale:** <2–4 sentences. Cite the category number AND the rule number. Explain the type of work and why this model is best for that type.>
- **Context briefing:**
- *Files to read:* <explicit paths or globs>
- *Prior outputs:* <which earlier units this depends on>
- *Constraints:* <style, framework version, libs allowed/banned, conventions>
- *Acceptance criteria:* <measurable done>
- **Failure modes to watch:** <2–3 model-specific risks from `references/models.md`>
- **Budget hint:** <rough token range if known>
### Unit 2: ...
## Execution Order
1. Unit 1 (entry point)
2. Units 2 & 3 (parallel after Unit 1)
3. ...
## Cost Summary
- Flagship calls: N
- Mid-tier calls: M
- Fast calls: K
- Rough envelope: <ballpark or "unknown — depends on token counts">
Worked Example
Input:
"Migrate the
/usersREST API from Express+Mongoose to Fastify+Prisma. Preserve endpoints and response shapes. Add tests. Update OpenAPI spec."
Output:
# Routing Plan: Express → Fastify+Prisma Migration for /users API
## Breakdown Assessment
Input was **insufficiently decomposed** — single sentence spans planning, schema, implementation, tests, docs across 3 model tiers. Failed Sufficiency criteria 1, 2, 3, 5. Proposed 8-unit breakdown below.
## Global Context
- **Scope:** Multi-file, TypeScript, Node backend, ~15 files in `src/routes/users/`.
- **Constraints:** Preserve API contract — no breaking changes to response shapes.
- **Cost posture:** Balanced.
## Work Units
### Unit 1: Migration plan + risk analysis
- **Category:** #15 Dependency upgrade / migration (also #2 planning)
- **Assigned model:** GPT-5.3-Codex (primary), Opus 4.6 (review)
- **Rationale:** Migration planning category — Codex's "touch the minimum necessary" approach reduces risk in large-repo upgrades (Rule 5). Opus reviews the plan for dependency-chain gaps (Rule 1: multi-file refactor escalation justified). The work type is sequencing-with-rollback, where conservative scoping matters more than cleverness.
- **Context briefing:**
- *Files to read:* `src/routes/users/**/*.ts`, `src/models/User.ts`, `package.json`, `openapi.yaml`
- *Prior outputs:* none (entry point)
- *Constraints:* No endpoint removals; preserve response shapes; phased rollout.
- *Acceptance criteria:* Phased plan with rollback points and risk-per-phase.
- **Failure modes:** Codex over-suggests scope ("Should I also do X?"); add an explicit "do not suggest extras" instruction. Opus can silently fall back to Sonnet under rate limits; verify the model header.
### Unit 2: Prisma schema from Mongoose model
- **Category:** #6 Database schema design
- **Assigned model:** Sonnet 4.6
- **Rationale:** Single-artifact schema translation, well-scoped, single file. The 80/20 default applies (Rule 1) — Opus would produce identical output for this kind of structural translation.
- **Context briefing:**
- *Files to read:* `src/models/User.ts`
- *Prior outputs:* Unit 1 plan
- *Constraints:* Prisma 5+, PostgreSQL target, preserve all indexes and unique constraints.
- *Acceptance criteria:* `schema.prisma` compiles; index parity with source.
- **Failure modes:** Sonnet greenfield bias; may drop legacy constraints. Provide an explicit "preserve all" instruction.
### Unit 3: Fastify route handlers (one per endpoint)
- **Category:** #11 Backend service / REST endpoints
- **Assigned model:** Sonnet 4.6
- **Rationale:** Standard REST handler implementation, where Sonnet has high-confidence first-try success. The work type is well-scoped translation between two REST frameworks, not novel logic.
- **Context briefing:**
- *Files to read:* current Express handler, Prisma schema (Unit 2 output), OpenAPI contract.
- *Prior outputs:* Units 1, 2.
- *Constraints:* Preserve status codes, error shapes, query params; one handler per call.
- *Acceptance criteria:* Existing integration tests still pass.
- **Failure modes:** Sonnet "haphazard multi-file changes" — constrain to one handler per call.
### Unit 4: Unit tests for each handler
- **Category:** #19 Unit test generation
- **Assigned model:** Sonnet 4.6
- **Rationale:** Test generation is Sonnet's strongest area. Opus would add marginal edge cases at 5x cost — not worth it for unit-level tests (Rule 1).
- **Context briefing:** Handler code (Unit 3), test framework config, existing test style guide.
- **Failure modes:** ~40% mutation kill rate across all models — don't trust generated tests as exhaustive coverage.
### Unit 5: Integration test for migration parity
- **Category:** #20 Integration / edge-case test design
- **Assigned model:** Opus 4.6
- **Rationale:** Cross-boundary coverage (request → handler → Prisma → DB) is exactly where Opus catches edge cases others miss ("webhook replay, partial refund rounding"). The work type is *contract verification across two stacks* — flagship reasoning earns its premium here.
- **Context briefing:** Both old and new routes, sample payloads, prior bug reports.
- **Failure modes:** Opus may suggest unrelated refactors — explicit negative constraint.
### Unit 6: OpenAPI spec update
- **Category:** #25 Structured output
- **Assigned model:** GPT-5.4 (strict mode)
- **Rationale:** OpenAPI is schema-bound output where strict mode guarantees validity (Rule 6). Claude could do this at 99.2% but GPT-5.4 strict is the production-safe choice.
- **Context briefing:** New handler signatures, old OpenAPI spec, target file path.
- **Failure modes:** GPT-5.4 loses long system-prompt constraints — keep prompt lean and re-inject schema.
### Unit 7: Security review of auth paths
- **Category:** #22 Security audit
- **Assigned model:** **Opus 4.6 + GPT-5.3-Codex** (dual)
- **Rationale:** Rule 4 — different blind spots. Opus catches auth-flow logic issues; Codex catches concrete injection and race vulnerabilities. Merge findings.
- **Context briefing:** All new handler files, schema, middleware.
- **Failure modes:** Iterating on security review worsens outcomes (37.6% increase in critical vulns after 5 iterations) — one-shot review, do not iterate.
### Unit 8: CHANGELOG and migration guide
- **Category:** #23 Code docs / README
- **Assigned model:** Sonnet 4.6
- **Rationale:** Natural prose, low cost; Codex would be too terse for migration guides.
- **Context briefing:** Diff summary, Units 1–7 outputs.
- **Failure modes:** — (routine task)
## Execution Order
1. Unit 1 (planning)
2. Unit 2 (schema) — after Unit 1
3. Units 3 & 4 (handlers + tests) — parallel per endpoint, after Unit 2
4. Units 5 & 6 (integration test + OpenAPI) — parallel after Unit 3
5. Unit 7 (security review) — after Unit 3 completes
6. Unit 8 (docs) — last
## Cost Summary
- Flagship calls (Opus): 1 review (Unit 1) + Unit 5 + Unit 7 → ~3
- Mid-tier calls (Sonnet/Codex/GPT-5.4): ~8
- Fast calls: 0
- Rough envelope: budget 4–8h of agentic developer time; flagship dominates token cost.
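For a sanity check on the envelope, the roster prices give a rough floor. The per-call token counts below are pure assumptions, not measurements, and the mid tier is priced as Sonnet (Codex and GPT-5.4 are within cents of it):

```typescript
const PRICE = { "Opus 4.6": [5, 25], "Sonnet 4.6": [3, 15] } as const; // $ per 1M tokens [in, out]

function callCost(model: keyof typeof PRICE, inTok: number, outTok: number): number {
  const [pIn, pOut] = PRICE[model];
  return (inTok / 1e6) * pIn + (outTok / 1e6) * pOut;
}

// Assume ~200K in / 10K out per flagship call and ~50K in / 8K out per mid call:
const envelope =
  3 * callCost("Opus 4.6", 200_000, 10_000) +  // ~$3.75
  8 * callCost("Sonnet 4.6", 50_000, 8_000);   // ~$2.16
// => roughly $6 of tokens per clean pass; flagship dominates, and
// agentic iteration typically multiplies this several-fold.
```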
Common Mistakes
| Mistake | Fix |
|---|---|
| Routing "implement feature X" to one model | Run Sufficiency Check first; break into plan → schema → impl → tests → review → docs |
| Using Opus for boilerplate | Rule 10: Haiku or GPT-5.4 nano. Opus is 10x cost waste here. |
| Single-model code review | Rule 4: dual-model is mandatory for review and security |
| Ignoring Claude context degradation past 400K | Rule 9: chunk explicitly or hand off to Gemini 2.5 Pro |
| Trusting "no errors" from GPT-5 as success | Rule 12: always run independent tests |
| Picking by vibes instead of category lookup | Category lookup first, decision rules second |
| Omitting failure-mode warnings in briefings | Include 2–3 per unit from references/models.md |
| Routing private code to API models | Rule 13: check privacy in Step 1; route to local (Qwen3 / Gemma 4) |
| Routing frontend to Codex | Codex is documented as weak on frontend; use Gemini 3.1 Pro, Sonnet, or Gemini 3 Flash (Rule 7) |
| Rationale doesn't cite a category # or rule # | Not evidence-based — re-do |
Red Flags — STOP and Re-Route
- The same model is selected for every unit → you're defaulting, not routing. Re-apply categories.
- No rationale cites a category number or rule number → not evidence-based.
- Context briefing says "the whole repo" → chunk it or use a long-context model.
- Any unit estimated >1h autonomous runtime without a circuit breaker → add iteration limit and time cap.
- Privacy constraint exists but model is API-hosted → re-route to local.
Data Freshness
This skill reflects practitioner evidence as of April 2026. The model landscape shifts monthly. Re-validate quarterly.
Models covered:
- Anthropic: Claude Opus 4.6 (Feb 5 2026), Sonnet 4.6 (Feb 17 2026), Haiku 4.5 (Oct 2025)
- OpenAI: GPT-5.4 / 5.4 Pro / 5.4 mini / 5.4 nano (Mar 5 2026), GPT-5.3-Codex, GPT-5.2-Codex (retiring Jun 5 2026), GPT-4.1
- Google: Gemini 3.1 Pro (Feb 19 2026), Gemini 3 Flash, Gemini 2.5 Pro/Flash/Flash-Lite
- Open-weight: Qwen3-Coder-Next, Gemma 4, gpt-oss-120b/20b, MiniMax M2.5
References
- `references/models.md` — Full model profiles: pricing, context, sweet spots, weaknesses, failure modes.
- `references/categories.md` — All 30 task categories with confidence levels and edge cases.
- `references/rules.md` — Extended decision rules with full source citations.