research
Deep Research
General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).
Vocabulary
| Term | Definition |
|---|---|
| query | The user's research question or topic; the unit of investigation |
| claim | A discrete assertion to be verified; extracted from sources or user input |
| source | A specific origin of information: URL, document, database record, or API response |
| evidence | A source-backed datum supporting or contradicting a claim; always has provenance |
| provenance | The chain from evidence to source: tool used, URL, access timestamp, excerpt |
| confidence | Score 0.0-1.0 per claim; based on evidence strength and cross-validation |
| cross-validation | Verifying a claim across 2+ independent sources; the core anti-hallucination mechanism |
| triangulation | Confirming a finding using 3+ methodologically diverse sources |
| contradiction | When two credible sources assert incompatible claims; must be surfaced explicitly |
| synthesis | The final research product: not a summary but a novel integration of evidence with analysis |
| journal | The saved markdown record of a research session, stored in ~/.claude/research/ |
| sweep | Wave 1: broad parallel search across multiple tools and sources |
| deep dive | Wave 2: targeted follow-up on specific leads from the sweep |
| lead | A promising source or thread identified during the sweep, warranting deeper investigation |
| tier | Complexity classification: Quick (0-2), Standard (3-5), Deep (6-8), Exhaustive (9-10) |
| finding | A verified claim with evidence chain, confidence score, and provenance; the atomic unit of output |
| gap | An identified area where evidence is insufficient, contradictory, or absent |
| bias marker | An explicit flag on a finding indicating potential bias (recency, authority, LLM prior, etc.) |
| degraded mode | Operation when research tools are unavailable; confidence ceilings applied |
Dispatch
| $ARGUMENTS | Action |
|---|---|
| Question or topic text (has verb or ?) | Investigate — classify complexity, execute wave pipeline |
| Vague input (<5 words, no verb, no ?) | Intake — ask 2-3 clarifying questions, then classify |
| `check <claim>` or `verify <claim>` | Fact-check — verify claim against 3+ search engines |
| `compare <A> vs <B> [vs <C>...]` | Compare — structured comparison with decision matrix output |
| `survey <field or topic>` | Survey — landscape mapping, annotated bibliography |
| `track <topic>` | Track — load prior journal, search for updates since last session |
| `resume [number or keyword]` | Resume — resume a saved research session |
| `list [active\|domain\|tier]` | List — show journal metadata table |
| `archive` | Archive — move journals older than 90 days |
| `delete <N>` | Delete — delete journal N with confirmation |
| `export [N]` | Export — render HTML dashboard for journal N (default: current) |
| Empty | Gallery — show topic examples + "ask me anything" prompt |
Auto-Detection Heuristic
If no mode keyword matches:
- Ends with `?` or starts with a question word (who/what/when/where/why/how/is/are/can/does/should/will) → Investigate
- Contains `vs`, `versus`, `compared to`, or `or` between noun phrases → Compare
- Declarative statement with factual claim, no question syntax → Fact-check
- Broad field name with no specific question → ask: "Investigate a specific question, or survey the entire field?"
- Ambiguous → ask: "Would you like me to investigate this question, verify this claim, or survey this field?"
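The heuristic above can be sketched as a small classifier. This is an illustrative sketch, not the dispatch implementation: the function name and the <5-word vagueness cutoff are assumptions, and the "`or` between noun phrases", broad-field, and ambiguity checks are left to judgment rather than coded.

```python
import re

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how",
                  "is", "are", "can", "does", "should", "will"}

def detect_mode(text: str) -> str:
    """Classify raw input when no explicit mode keyword matched."""
    t = text.strip().lower()
    if not t:
        return "gallery"
    if t.endswith("?") or t.split()[0] in QUESTION_WORDS:
        return "investigate"
    # "`or` between noun phrases" needs NLP; only explicit markers here
    if re.search(r"\b(vs\.?|versus|compared to)\b", t):
        return "compare"
    if len(t.split()) < 5:           # vague input → intake questions
        return "intake"
    return "fact-check"              # declarative factual claim
```

A usage note: ambiguous results from this sketch should still fall through to the clarifying question rather than being dispatched silently.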
Gallery (Empty Arguments)
Present research examples spanning domains:
| # | Domain | Example | Likely Tier |
|---|---|---|---|
| 1 | Technology | "What are the current best practices for LLM agent architectures?" | Deep |
| 2 | Academic | "What is the state of evidence on intermittent fasting for longevity?" | Standard |
| 3 | Market | "How does the competitive landscape for vector databases compare?" | Deep |
| 4 | Fact-check | "Is it true that 90% of startups fail within the first year?" | Standard |
| 5 | Architecture | "When should you choose event sourcing over CRUD?" | Standard |
| 6 | Trends | "What emerging programming languages gained traction in 2025-2026?" | Standard |
Pick a number, paste your own question, or type `guide me`.
Skill Awareness
Before starting research, check if another skill is a better fit:
| Signal | Redirect |
|---|---|
| Code review, PR review, diff analysis | Suggest /honest-review |
| Strategic decision with adversaries, game theory | Suggest /wargame |
| Multi-perspective expert debate | Suggest /host-panel |
| Prompt optimization, model-specific prompting | Suggest /prompt-engineer |
If the user confirms they want general research, proceed.
Complexity Classification
Score the query on 5 dimensions (0-2 each, total 0-10):
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Scope breadth | Single fact/definition | Multi-faceted, 2-3 domains | Cross-disciplinary, 4+ domains |
| Source difficulty | Top search results suffice | Specialized databases or multiple source types | Paywalled, fragmented, or conflicting sources |
| Temporal sensitivity | Stable/historical | Evolving field (months matter) | Fast-moving (days/weeks matter), active controversy |
| Verification complexity | Easily verifiable (official docs) | 2-3 independent sources needed | Contested claims, expert disagreement, no consensus |
| Synthesis demand | Answer is a fact or list | Compare/contrast viewpoints | Novel integration of conflicting threads |
| Total | Tier | Strategy |
|---|---|---|
| 0-2 | Quick | Inline, 1-2 searches, fire-and-forget |
| 3-5 | Standard | Subagent wave, 3-5 parallel searchers, report delivered |
| 6-8 | Deep | Agent team (TeamCreate), 3-5 teammates, interactive session |
| 9-10 | Exhaustive | Agent team, 4-6 teammates + nested subagent waves, interactive |
Present the scoring to the user. User can override tier with --depth <tier>.
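The rubric-to-tier mapping can be expressed as a small function. This is an illustrative sketch; the dimension keys are shorthand for the rubric rows above, not a fixed schema.

```python
def classify_tier(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the five 0-2 dimension scores and map the total to a tier."""
    if len(scores) != 5 or not all(0 <= s <= 2 for s in scores.values()):
        raise ValueError("expected five dimension scores in 0-2")
    total = sum(scores.values())
    if total <= 2:
        tier = "Quick"        # inline, 1-2 searches
    elif total <= 5:
        tier = "Standard"     # subagent wave
    elif total <= 8:
        tier = "Deep"         # agent team
    else:
        tier = "Exhaustive"   # team + nested subagent waves
    return total, tier
```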
Wave Pipeline
All non-Quick research follows this 5-wave pipeline. Quick merges Waves 0+1+4 inline.
Wave 0: Triage (always inline, never parallelized)
- Run `!uv run python skills/research/scripts/research-scanner.py "$ARGUMENTS"` for a deterministic pre-scan
- Decompose query into 2-5 sub-questions
- Score complexity on the 5-dimension rubric
- Check tool availability — probe key MCP tools; set degraded mode flags and confidence ceilings per references/source-selection.md
- Select tools per domain signals — read references/source-selection.md
- Check for existing journals — if `track` or `resume`, load prior state
- Present triage to user — show: complexity score, sub-questions, planned strategy, estimated tier. User may override.
Wave 1: Broad Sweep (parallel)
Scale by tier:
Quick (inline): 1-2 tool calls sequentially. No subagents.
Standard (subagent wave): Dispatch 3-5 parallel subagents via Task tool:
Subagent A → brave-search + duckduckgo-search for sub-question 1
Subagent B → exa + g-search for sub-question 2
Subagent C → context7 / deepwiki / arxiv / semantic-scholar for technical specifics
Subagent D → wikipedia / wikidata for factual grounding
[Subagent E → PubMed / openalex if academic domain detected]
Deep (agent team): TeamCreate "research-{slug}":
Lead: triage (Wave 0), orchestrate, judge reconcile (Wave 3), synthesize (Wave 4)
|-- web-researcher: brave-search, duckduckgo-search, exa, g-search
|-- tech-researcher: context7, deepwiki, arxiv, semantic-scholar, package-version
|-- content-extractor: fetcher, trafilatura, docling, wikipedia, wayback
|-- [academic-researcher: arxiv, semantic-scholar, openalex, crossref, PubMed]
|-- [adversarial-reviewer: devil's advocate — counter-search all emerging findings]
Spawn academic-researcher if domain signals include academic/scientific. Spawn adversarial-reviewer for Exhaustive tier or if verification complexity >= 2.
Exhaustive: Deep team + each teammate runs nested subagent waves internally.
Each subagent/teammate returns structured findings:
{
"sub_question": "...",
"findings": [{"claim": "...", "source_url": "...", "source_tool": "...", "excerpt": "...", "confidence_raw": 0.6}],
"leads": ["url1", "url2"],
"gaps": ["could not find data on X"]
}
Wave 1.5: Perspective Expansion (Deep/Exhaustive only)
STORM-style perspective-guided conversation. Spawn 2-4 perspective subagents:
| Perspective | Focus | Question Style |
|---|---|---|
| Skeptic | What could be wrong? What's missing? | "What evidence would disprove this?" |
| Domain Expert | Technical depth, nuance, edge cases | "What do practitioners actually encounter?" |
| Practitioner | Real-world applicability, trade-offs | "What matters when you actually build this?" |
| Theorist | First principles, abstractions, frameworks | "What underlying model explains this?" |
Each perspective agent reviews Wave 1 findings and generates 2-3 additional sub-questions from their viewpoint. These sub-questions feed into Wave 2.
Wave 2: Deep Dive (parallel, targeted)
- Rank leads from Wave 1 by potential value (citation frequency, source authority, relevance)
- Dispatch deep-read subagents — use fetcher/trafilatura/docling to extract full content from top leads
- Follow citation chains — if a source cites another, fetch the original
- Fill gaps — for each gap identified in Wave 1, dispatch targeted searches
- Use thinking MCPs:
  - `cascade-thinking` for multi-perspective analysis of complex findings
  - `structured-thinking` for tracking evidence chains and contradictions
  - `think-strategies` for complex question decomposition (Standard+ only)
Wave 3: Cross-Validation (parallel)
The anti-hallucination wave. Read references/confidence-rubric.md and references/self-verification.md.
For every claim surviving Waves 1-2:
- Independence check — are supporting sources truly independent? Sources citing each other are NOT independent.
- Counter-search — explicitly search for evidence AGAINST each major claim using a different search engine
- Freshness check — verify sources are current (flag if >1 year old for time-sensitive topics)
- Contradiction scan — read references/contradiction-protocol.md, identify and classify disagreements
- Confidence scoring — assign 0.0-1.0 per references/confidence-rubric.md
- Bias sweep — check each finding against 10 bias categories (7 core + 3 LLM-specific) per references/bias-detection.md
Self-Verification (when 3+ findings survive): Spawn a devil's advocate subagent per references/self-verification.md:
For each finding, attempt to disprove it. Search for counterarguments. Check if evidence is outdated. Verify claims actually follow from cited evidence. Flag LLM confabulations.
Adjust confidence: Survives +0.05, Weakened -0.10, Disproven set to 0.0. Adjustments are subject to hard caps — single-source claims remain capped at 0.60 even after survival adjustment.
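The adjustment-plus-cap rule can be sketched as follows; the function and argument names are illustrative, not part of the skill's scripts.

```python
SINGLE_SOURCE_CAP = 0.60  # hard cap from the confidence rules

def adjust_confidence(conf: float, outcome: str, independent_sources: int) -> float:
    """Apply the devil's-advocate outcome, then re-apply hard caps."""
    if outcome == "disproven":
        return 0.0
    conf += 0.05 if outcome == "survives" else -0.10  # "weakened"
    if independent_sources < 2:
        conf = min(conf, SINGLE_SOURCE_CAP)  # cap holds even after survival
    return max(0.0, min(conf, 1.0))
```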
Wave 4: Synthesis (always inline, lead only)
Produce the final research product. Read references/output-formats.md for templates.
The synthesis is NOT a summary. It must:
- Answer directly — answer the user's question clearly
- Map evidence — all verified findings with confidence and citations
- Surface contradictions — where sources disagree, with analysis of why
- Show confidence landscape — what is known confidently, what is uncertain, what is unknown
- Audit biases — biases detected during research
- Identify gaps — what evidence is missing, what further research would help
- Distill takeaways — 3-7 numbered key findings
- Cite sources — full bibliography with provenance
Output format adapts to mode:
- Investigate → Research Brief (Standard) or Deep Report (Deep/Exhaustive)
- Fact-check → Quick Answer with verdict + evidence
- Compare → Decision Matrix
- Survey → Annotated Bibliography
- User can override with `--format brief|deep|bib|matrix`
Confidence Scoring
| Score | Basis |
|---|---|
| 0.9-1.0 | Official docs + 2 independent sources agree, no contradictions |
| 0.7-0.8 | 2+ independent sources agree, minor qualifications |
| 0.5-0.6 | Single authoritative source, or 2 sources with partial agreement |
| 0.3-0.4 | Single non-authoritative source, or conflicting evidence |
| 0.2-0.3 | Multiple non-authoritative sources with partial agreement, or single source with significant caveats |
| 0.1-0.2 | LLM reasoning only, no external evidence found |
| 0.0 | Actively contradicted by evidence |
Hard rules:
- No claim reported at >= 0.7 unless supported by 2+ independent sources
- Single-source claims cap at 0.6 regardless of source authority
- Degraded mode (all research tools unavailable): max confidence 0.4, all findings labeled "unverified"
Merged confidence (for claims supported by multiple sources):
c_merged = 1 - (1-c1)(1-c2)...(1-cN) capped at 0.99
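A minimal sketch of the merge formula (assuming Python 3.8+ for `math.prod`; the function name is illustrative):

```python
from math import prod

def merged_confidence(confidences: list[float]) -> float:
    """Combine per-source confidences for one claim:
    c_merged = 1 - prod(1 - c_i), capped at 0.99."""
    return min(0.99, 1 - prod(1 - c for c in confidences))
```

Note the cap: no number of agreeing sources yields certainty, so merged scores saturate at 0.99.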
Evidence Chain Structure
Every finding carries this structure:
FINDING RR-{seq:03d}: [claim statement]
CONFIDENCE: [0.0-1.0]
EVIDENCE:
1. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
2. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
CROSS-VALIDATION: [agrees|contradicts|partial] across [N] independent sources
BIAS MARKERS: [none | list of detected biases with category]
GAPS: [none | what additional evidence would strengthen this finding]
Use `!uv run python skills/research/scripts/finding-formatter.py --format markdown` to normalize.
Source Selection
Read references/source-selection.md during Wave 0 for the full tool-to-domain mapping. Summary:
| Domain Signal | Primary Tools | Secondary Tools |
|---|---|---|
| Library/API docs | context7, deepwiki, package-version | brave-search |
| Academic/scientific | arxiv, semantic-scholar, PubMed, openalex | crossref, brave-search |
| Current events/trends | brave-search, exa, duckduckgo-search, g-search | fetcher, trafilatura |
| GitHub repos/OSS | deepwiki, repomix | brave-search |
| General knowledge | wikipedia, wikidata, brave-search | fetcher |
| Historical content | wayback, brave-search | fetcher |
| Fact-checking | 3+ search engines mandatory | wikidata for structured claims |
| PDF/document analysis | docling | trafilatura |
Multi-engine protocol: For any claim requiring verification, use minimum 2 different search engines. Different engines have different indices and biases. Agreement across engines increases confidence.
Bias Detection
Check every finding against the 10 bias categories; the table below summarizes the most common. Read references/bias-detection.md for the full set of detection signals and mitigation strategies.
| Bias | Detection Signal | Mitigation |
|---|---|---|
| LLM prior | Matches common training patterns, lacks fresh evidence | Flag; require fresh source confirmation |
| Recency | Overweighting recent results, ignoring historical context | Search for historical perspective |
| Authority | Uncritically accepting prestigious sources | Cross-validate even authoritative claims |
| Confirmation | Queries constructed to confirm initial hypothesis | Use neutral queries; search for counterarguments |
| Survivorship | Only finding successful examples | Search for failures/counterexamples |
| Selection | Search engine bubble, English-only | Use multiple engines; note coverage limitations |
| Anchoring | First source disproportionately shapes interpretation | Document first source separately; seek contrast |
State Management
- Journal path: `~/.claude/research/`
- Archive path: `~/.claude/research/archive/`
- Filename convention: `{YYYY-MM-DD}-{domain}-{slug}.md`
  - `{domain}`: tech, academic, market, policy, factcheck, compare, survey, track, general
  - `{slug}`: 3-5 word semantic summary, kebab-case
  - Collision: append `-v2`, `-v3`
- Format: YAML frontmatter + markdown body + `<!-- STATE -->` blocks
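The naming convention can be sketched as follows. This is illustrative only — journal-store.py owns the real logic — and the slug here is built from caller-supplied words rather than derived automatically from the query.

```python
import re
from datetime import date
from pathlib import Path

def journal_path(domain: str, slug_words: list[str],
                 root: Path = Path.home() / ".claude/research") -> Path:
    """Build {YYYY-MM-DD}-{domain}-{slug}.md, appending -v2, -v3 on collision."""
    slug = "-".join(re.sub(r"[^a-z0-9]+", "", w.lower()) for w in slug_words[:5])
    base = f"{date.today():%Y-%m-%d}-{domain}-{slug}"
    path = root / f"{base}.md"
    version = 2
    while path.exists():              # collision → versioned filename
        path = root / f"{base}-v{version}.md"
        version += 1
    return path
```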
Save protocol:
- Quick: save once at end with `status: Complete`
- Standard/Deep/Exhaustive: save after Wave 1 with `status: In Progress`, update after each wave, finalize after synthesis
Resume protocol:
- `resume` (no args): find `status: In Progress` journals. One → auto-resume. Multiple → show list.
- `resume N`: Nth journal from `list` output (reverse chronological).
- `resume keyword`: search frontmatter `query` and `domain_tags` for a match.
Use `!uv run python skills/research/scripts/journal-store.py` for all journal operations.
State snapshot (appended after each wave save):
<!-- STATE
wave_completed: 2
findings_count: 12
leads_pending: ["url1", "url2"]
gaps: ["topic X needs more sources"]
contradictions: 1
next_action: "Wave 3: cross-validate top 8 findings"
-->
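Resuming requires recovering the last snapshot from a journal. A minimal parsing sketch, assuming simple `key: value` lines; list-valued fields such as leads_pending are kept as raw strings rather than parsed as YAML.

```python
import re

def parse_state(journal_text: str) -> dict:
    """Extract the last <!-- STATE ... --> snapshot from a journal body."""
    blocks = re.findall(r"<!-- STATE\n(.*?)\n-->", journal_text, re.DOTALL)
    if not blocks:
        return {}
    state = {}
    for line in blocks[-1].splitlines():   # last snapshot wins
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state
```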
In-Session Commands (Deep/Exhaustive)
Available during active research sessions:
| Command | Effect |
|---|---|
| `drill <finding #>` | Deep dive into a specific finding with more sources |
| `pivot <new angle>` | Redirect research to a new sub-question |
| `counter <finding #>` | Explicitly search for evidence against a finding |
| `export` | Render HTML dashboard |
| `status` | Show current research state without advancing |
| `sources` | List all sources consulted so far |
| `confidence` | Show confidence distribution across findings |
| `gaps` | List identified knowledge gaps |
| `?` | Show command menu |
Read references/session-commands.md for full protocols.
Reference File Index
| File | Content | Read When |
|---|---|---|
| `references/source-selection.md` | Tool-to-domain mapping, multi-engine protocol, degraded mode | Wave 0 (selecting tools) |
| `references/confidence-rubric.md` | Scoring rubric, cross-validation rules, independence checks | Wave 3 (assigning confidence) |
| `references/evidence-chain.md` | Finding template, provenance format, citation standards | Any wave (structuring evidence) |
| `references/bias-detection.md` | 10 bias categories (7 core + 3 LLM-specific), detection signals, mitigation strategies | Wave 3 (bias audit) |
| `references/contradiction-protocol.md` | 4 contradiction types, resolution framework | Wave 3 (contradiction detection) |
| `references/self-verification.md` | Devil's advocate protocol, hallucination detection | Wave 3 (self-verification) |
| `references/output-formats.md` | Templates for all 5 output formats | Wave 4 (formatting output) |
| `references/team-templates.md` | Team archetypes, subagent prompts, perspective agents | Wave 0 (designing team) |
| `references/session-commands.md` | In-session command protocols | When user issues in-session command |
| `references/dashboard-schema.md` | JSON data contract for HTML dashboard | `export` command |
Loading rule: Load ONE reference at a time per the "Read When" column. Do not preload.
Critical Rules
- No claim >= 0.7 unless supported by 2+ independent sources — single-source claims cap at 0.6
- Never fabricate citations — if URL, author, title, or date cannot be verified, use vague attribution ("a study in this tradition") rather than inventing specifics
- Always surface contradictions explicitly — never silently resolve disagreements; present both sides with evidence
- Always present triage scoring before executing research — user must see and can override complexity tier
- Save journal after every wave in Deep/Exhaustive mode — enables resume after interruption
- Never skip Wave 3 (cross-validation) for Standard/Deep/Exhaustive tiers — this is the anti-hallucination mechanism
- Multi-engine search is mandatory for fact-checking — use minimum 2 different search tools (e.g., brave-search + duckduckgo-search)
- Apply the Accounting Rule after every parallel dispatch — N dispatched = N accounted for before proceeding to next wave
- Distinguish facts from interpretations in all output — factual claims carry evidence; interpretive claims are explicitly labeled as analysis
- Flag all LLM-prior findings — claims matching common training data but lacking fresh evidence must be flagged with bias marker
- Max confidence 0.4 in degraded mode — when all research tools are unavailable, report all findings as "unverified — based on training knowledge"
- Load ONE reference file at a time — do not preload all references into context
- Track mode must load prior journal before searching — avoid re-researching what is already known
- The synthesis is not a summary — it must integrate findings into novel analysis, identify patterns across sources, and surface emergent insights not present in any single source
- PreToolUse Edit hook is non-negotiable — the research skill never modifies source files; it only creates/updates journals in `~/.claude/research/`