Cross-Model Code Review

Cross-model validation: the authoring model writes code, a different model reviews it. Different architectures, different training distributions, no self-approval bias.

Core insight: Single-model self-review is systematically biased. The same blind spots that let bugs through during writing let them through during review. Cross-model review catches different bug classes because the reviewer has fundamentally different failure modes.

Host Detection & Direction

Identify the current host, then invoke the other model's CLI.

Current Host	Reviewer CLI	Direction
Claude Code	`codex`	Claude writes → Codex reviews
Codex	`claude`	Codex writes → Claude reviews

Verify the reviewer CLI is installed and authenticated before starting:

codex --help       # For Claude-hosted sessions
claude --version   # For Codex-hosted sessions

User defaults are authoritative. Both CLIs read configured defaults (~/.codex/config.toml, ~/.claude/settings.json). Never specify --model, -m, --effort, or -c model= in invocations unless the user explicitly asks to override.

Invocation Cheat Sheet

Capability	Claude → Codex	Codex → Claude
Structured diff review	`codex review --base main`	Use piped diff or tool access (no equivalent)
Commit review	`codex review --commit <SHA>`	`git show <SHA> \| claude -p "PROMPT"`
Uncommitted WIP	`codex review --uncommitted`	`git diff \| claude -p "PROMPT"`
Freeform prompt	`codex exec "PROMPT"`	`claude -p "PROMPT"`
Pipe diff in	`git diff main...HEAD \| codex exec "PROMPT"`	`git diff main...HEAD \| claude -p "PROMPT"`
Read-only exploration	`codex exec -s read-only "PROMPT"`	`claude -p --allowedTools "Read,Glob,Grep,Bash(git *)" "PROMPT"`
JSON output	`codex exec --json "PROMPT"`	`claude -p --output-format json "PROMPT"`

Codex has a dedicated review subcommand with structured output; Claude reviews go through print mode (-p) with a prompt.

Review Patterns

Each pattern shows both directions. Pick the one matching your host.

Pattern 1: Pre-PR Full Review

Standard review before opening a PR.

Claude → Codex:

codex review --base main

Codex → Claude (fast, piped diff):

git diff main...HEAD | claude -p \
  "Review this diff. Prioritize correctness, security, and performance.
   For each finding: cite file and line, explain the risk, suggest a fix.
   Rate confidence 0.0-1.0. Skip formatting and naming style.
   Verdict: patch is correct / patch is incorrect."

Codex → Claude (deep, full context):

claude -p --allowedTools "Read,Glob,Grep,Bash(git *)" \
  "Review the changes between main and HEAD in this repository.
   Prioritize correctness, security, and performance.
   For each finding: cite file and line, explain the risk, suggest a fix.
   Rate confidence 0.0-1.0. Skip formatting and naming style."

Pattern 2: Commit-Level Review

Quick check after a meaningful commit.

Claude → Codex:

codex review --commit <SHA>

Codex → Claude:

git show <SHA> | claude -p \
  "Review this commit. Flag bugs, security issues, and logic errors.
   Cite file:line for each finding. Confidence threshold: 0.7."

Pattern 3: WIP Check

Review uncommitted work mid-development.

Claude → Codex:

codex review --uncommitted

Codex → Claude:

git diff | claude -p \
  "Review these uncommitted changes. Flag anything that looks wrong.
   Cite file:line for each finding. Skip incomplete code paths."

Pattern 4: Focused Investigation

Surgical deep-dive on a specific concern (security, performance, concurrency).

Claude → Codex:

codex exec \
  "You are a senior [DOMAIN] engineer. Analyze [CONCERN] in the changes
   between main and HEAD. For each issue: cite file and line, explain the
   risk, suggest a concrete fix. Confidence threshold: 0.7."

Codex → Claude:

claude -p --allowedTools "Read,Glob,Grep,Bash(git *)" \
  "You are a senior [DOMAIN] engineer. Analyze [CONCERN] in the changes
   between main and HEAD. For each issue: cite file and line, explain the
   risk, suggest a concrete fix. Confidence threshold: 0.7."

Replace [DOMAIN] with the review focus and [CONCERN] with the specific worry.

Pattern 5: Ralph Loop (Implement-Review-Fix)

Iterative quality enforcement. Max 3 iterations.

Iteration 1:
  Host -> implement feature
  Reviewer CLI -> findings
  Host -> fix critical/high findings

Iteration 2:
  Reviewer CLI -> verify fixes + catch remaining
  Host -> fix remaining issues

Iteration 3 (final):
  Reviewer CLI -> clean or accept trade-offs

STOP after 3 iterations. Diminishing returns beyond this.

Invoke the reviewer with Pattern 1 commands each iteration.

Piped Diff vs Tool Access (Codex → Claude)

For Codex-hosted sessions reviewing with Claude, choose based on depth needed:

Approach	Command Shape	When to Use
Piped diff	`git diff ... \| claude -p "PROMPT"`	Quick reviews; reviewer sees only the diff
Tool access	`claude -p --allowedTools "Read,Glob,Grep,Bash(git *)" "PROMPT"`	Architecture/security deep-dives; reviewer can trace data flow across files

Piped diff is faster and cheaper. Tool access costs more tokens but catches bugs that require surrounding context (function signatures defined elsewhere, downstream consumers, similar patterns in the codebase).

Multi-Pass Strategy

For thorough reviews, run multiple focused passes. Each pass gets a specific persona and concern domain.

Pass	Focus	Approach
Correctness	Bugs, logic, edge cases, race conditions	Structured review (`codex review`) or piped diff with general prompt
Security	OWASP Top 10:2025, injection, auth, secrets	Focused investigation with security persona
Architecture	Coupling, abstractions, API consistency	Tool-access mode for full file context
Performance	O(n^2), N+1 queries, memory leaks	Focused investigation with performance persona

When to go multi-pass:

Change Size	Strategy
< 50 lines, single concern	Single review pass
50-300 lines, feature work	Review + security pass
300+ lines or architecture change	Full 4-pass
Security-sensitive (auth, payments, crypto)	Always include security pass

Run passes sequentially. Fix critical findings between passes to avoid noise compounding.

Decision Tree: Which Pattern?

digraph review_decision {
    rankdir=TB;
    node [shape=diamond];

    "What stage?" -> "Pre-commit" [label="writing code"];
    "What stage?" -> "Pre-PR" [label="ready to submit"];
    "What stage?" -> "Post-commit" [label="just committed"];
    "What stage?" -> "Investigating" [label="specific concern"];

    node [shape=box];
    "Pre-commit" -> "Pattern 3: WIP Check";
    "Pre-PR" -> "How big?";
    "Post-commit" -> "Pattern 2: Commit Review";
    "Investigating" -> "Pattern 4: Focused Investigation";

    "How big?" [shape=diamond];
    "How big?" -> "Pattern 1: Pre-PR Review" [label="< 300 lines"];
    "How big?" -> "Full Multi-Pass" [label=">= 300 lines"];
}

Prompt Engineering Rules

These apply to both directions — the prompts are model-agnostic.

Assign a persona — "senior security engineer" beats "review for security"
Specify what to skip — "Skip formatting, naming style, minor docs gaps"
Require confidence scores — Only act on findings >= 0.7
Demand file:line citations — Vague findings aren't actionable
Ask for concrete fixes — "Suggest a specific fix"
One domain per pass — Security-only, architecture-only

Ready-to-use prompt templates for security, architecture, performance, error handling, and concurrency are in references/prompts.md.

Anti-Patterns

Anti-Pattern	Why It Fails	Fix
Self-review (model reviews its own code)	Systematic bias — same blind spots	Cross-model: author and reviewer are different models
"Review this code" (no specifics)	Too vague, produces bikeshedding	Domain-specific prompts with persona
Single pass for everything	Context dilution	Multi-pass, one concern per pass
No confidence threshold	Noise floods signal	Only act on >= 0.7
> 3 review iterations	Diminishing returns	Stop at 3, accept trade-offs
Hardcoding model names in commands	Overrides user config, goes stale fast	Omit model/effort flags; use configured defaults
Style comments in review	LLMs default to bikeshedding	"Skip: formatting, naming, minor docs"
Piped diff for architecture review	Diff lacks surrounding context	Use tool-access mode for architecture passes
Using an MCP wrapper	Unnecessary indirection over a CLI binary	Call the reviewer CLI directly via Bash
Review without project context	Generic advice disconnected from codebase	Run from repo root so project memory and source files are visible

What This Skill is NOT

Not a replacement for human review — can't evaluate product direction or UX
Not a linter — use linters for formatting and style
Not infallible — 5-15% false positive rate is normal; triage findings
Not for self-approval — the entire point is cross-model validation

References

For ready-to-use prompt templates (security, architecture, performance, error handling, concurrency), see references/prompts.md.

cross-model-review

Cross-Model Code Review

Host Detection & Direction

Invocation Cheat Sheet

Review Patterns

Pattern 1: Pre-PR Full Review

Pattern 2: Commit-Level Review

Pattern 3: WIP Check

Pattern 4: Focused Investigation

Pattern 5: Ralph Loop (Implement-Review-Fix)

Piped Diff vs Tool Access (Codex → Claude)

Multi-Pass Strategy

Decision Tree: Which Pattern?

Prompt Engineering Rules

Anti-Patterns

What This Skill is NOT

References