skills/whynowlab/stack-skills/cross-verified-research

cross-verified-research

SKILL.md

Cross-Verified Research

Systematic research engine with anti-hallucination safeguards and source quality tiering.

Rules (Absolute)

  1. Never fabricate sources. No fake URLs, no invented papers, no hallucinated statistics.
  2. Source-traceability gate. Every factual claim must be traceable to a specific, citable source. If a claim cannot be traced to any source, mark it as Unverified (internal knowledge only) and state what verification would be needed. Never present untraced claims as findings.
  3. No speculation as fact. Do not present unverified claims using hedging language as if they were findings. Banned patterns: "아마도", "~인 것 같습니다", "~로 보입니다", "~수도 있습니다", "probably", "I think", "seems like", "appears to be", "likely". If a claim is not verified, label it explicitly as Unverified or Contested — do not soften it with hedging.
  4. BLUF output. Lead with conclusion, follow with evidence. Never bury the answer.
  5. Scaled effort. Match research depth to question scope:
    • Narrow factual (single claim, date, specification): 2-3 queries, 2+ sources
    • Technology comparison (A vs B): 5+ queries, 5+ sources
    • Broad landscape (market analysis, state-of-art): 8+ queries, 8+ sources Default to the higher tier when scope is ambiguous.
  6. Cross-verify. Every key claim must appear in 2+ independent sources before presenting as fact. "Independent" means the sources conducted their own analysis or reporting — two articles that both cite the same original source (press release, blog post, study) count as ONE source, not two. Trace claims back to their origin.
  7. Scope before search. If the research question is ambiguous or overly broad, decompose it into specific sub-questions in Stage 1 and present them to the user for confirmation before proceeding to Stage 2. Do not research a vague question — sharpen it first.

Pipeline

Execute these 4 stages sequentially. Do NOT skip stages.

Stage 1: Deconstruct

Break the research question into atomic sub-questions.

Input: "Should we use Bun or Node.js for our backend?"
Decomposed:
  1. Runtime performance benchmarks (CPU, memory, startup)
  2. Ecosystem maturity (npm compatibility, native modules)
  3. Production stability (known issues, enterprise adoption)
  4. Developer experience (tooling, debugging, testing)
  5. Long-term viability (funding, community, roadmap)
  • Identify what requires external verification vs. internal knowledge
  • If the original question is vague or overly broad, present the decomposed sub-questions to the user for confirmation before proceeding (Rule 7)
  • For each sub-question, note what a traceable source would look like

Stage 2: Search & Collect

For each sub-question requiring verification:

  1. Formulate diverse queries — vary keywords, include year filters, try both English and Korean
  2. Use WebSearch for broad discovery, WebFetch for specific page analysis
  3. Classify every source by tier immediately (see Source Tiers below)
  4. Extract specific data points — numbers, dates, versions, quotes with attribution
  5. Record contradictions — when sources disagree, note both positions
  6. Trace origin — when multiple sources cite the same underlying source, identify the original

Search pattern (scale per Rule 5):

Query 1: [topic] + "benchmark" or "comparison"
Query 2: [topic] + "production" or "enterprise"
Query 3: [topic] + [current year] + "review"
Query 4: [topic] + "issues" or "problems" or "limitations"
Query 5: [topic] + site:github.com (issues, discussions)

Fallback when WebSearch is unavailable or returns no results:

  1. Use WebFetch to directly access known authoritative URLs (official docs, GitHub repos, Wikipedia)
  2. Rely on internal knowledge but label all claims as Unverified (no external search available)
  3. Ask the user to provide source URLs or documents for verification
  4. Reduce the minimum source requirement but maintain cross-verification where possible

Stage 3: Cross-Verify

For each key finding:

  • Does it appear in 2+ independent Tier S/A sources? → Verified
  • Does it appear in only 1 source? → Unverified (label it)
  • Do sources contradict? → Contested (present both sides with tier labels)

Remember: "independent" means each source did its own analysis. Two articles both citing the same benchmark study = 1 source.

Build a verification matrix:

| Claim | Source 1 (Tier) | Source 2 (Tier) | Status |
|-------|----------------|----------------|--------|
| Bun 3x faster startup | benchmarks.dev (A) | bun.sh/blog (B) | Verified (note: Bun's own blog = biased) |

Stage 4: Synthesize

Produce the final report in BLUF format.

Output Format

## Research: [Topic]

### Conclusion (BLUF)
[1-3 sentence definitive answer or recommendation]

### Key Findings
[Numbered findings, each with inline source tier labels]

1. **[Finding]** — [evidence summary]
   Sources: 🏛️ [source1], 🛡️ [source2]

2. **[Finding]** — [evidence summary]
   Sources: 🛡️ [source1], 🛡️ [source2]

### Contested / Uncertain
[Any claims that couldn't be cross-verified or where sources conflict]
- ⚠️ [claim] — Source A says X, Source B says Y

### Verification Matrix
| Claim | Sources | Tier | Status |
|-------|---------|------|--------|
| ... | ... | ... | Verified/Unverified/Contested |

### Sources
[All sources, grouped by tier]

#### 🏛️ Tier S — Academic & Primary Research
- [Title](URL) — Journal/Org (Year)

#### 🛡️ Tier A — Trusted Official
- [Title](URL) — Source (Year)

#### ⚠️ Tier B — Community / Caution
- [Title](URL) — Platform (Year)

#### Tier C — General
- [Title](URL)

Quality Calibration

BAD Example — What to Avoid

## Research: Is Rust faster than Go for web servers?

### Conclusion (BLUF)
Rust is generally faster than Go for web servers due to zero-cost abstractions.

### Key Findings
1. **Rust is 2-5x faster than Go** — Rust's ownership model eliminates GC pauses.
   Sources: 🛡️ https://rust-performance-comparison.example.com
2. **Rust uses less memory** — Typically 50% less memory in production.
   Sources: 🛡️ https://memory-benchmarks.example.com
3. **Go is easier to learn** — Most developers pick up Go in a week.
   Sources: 🏛️ https://developer-survey.example.com

### Verification Matrix
| Claim | Sources | Tier | Status |
|-------|---------|------|--------|
| 2-5x faster | 1 benchmark site | A | Verified |
| 50% less memory | 1 benchmark site | A | Verified |

Why this is bad:

  • Source URLs are fabricated (nonexistent domains)
  • "2-5x faster" and "50% less memory" are presented as Verified with only 1 source each
  • No contested claims section despite this being a nuanced topic
  • Claims are restated internal knowledge dressed up with fake citations
  • No origin tracing — where did "2-5x" come from?
  • The "Verified" labels are false — nothing was actually cross-verified

GOOD Example — What to Aim For

## Research: Is Rust faster than Go for web servers?

### Conclusion (BLUF)
Rust outperforms Go in raw throughput benchmarks (typically 1.5-3x in TechEmpower), but the gap narrows significantly with real-world I/O workloads. Go's GC pauses (sub-millisecond since Go 1.19) are rarely a bottleneck for typical web services. Choose based on your latency tail requirements, not averages.

### Key Findings
1. **Rust frameworks lead TechEmpower benchmarks** — Actix-web and Axum consistently rank in the top 10; Go's stdlib and Gin rank 20-40 range in plaintext/JSON tests.
   Sources: 🏛️ TechEmpower Round 22 (2024), 🛡️ Axum GitHub benchmarks
2. **Go's GC latency is sub-millisecond since 1.19** — p99 GC pause < 500μs confirmed by the Go team.
   Sources: 🛡️ Go Blog "Getting to Go" (2022), 🛡️ Go 1.19 Release Notes
3. **Real-world gap is smaller than microbenchmarks suggest** — Discord's 2020 migration (Go→Rust) showed tail latency improvements, but their workload (millions of concurrent connections) is atypical.
   Sources: 🛡️ Discord Engineering Blog (2020), ⚠️ HN discussion with Discord engineer comments

### Contested / Uncertain
- ⚠️ **"Rust uses 50% less memory than Go"** — Frequently repeated on Reddit/HN but no independent benchmark reproduces a consistent figure. Memory usage depends heavily on allocator choice (jemalloc vs system) and workload. **Unverified.**
- ⚠️ **Developer productivity trade-off** — Go advocates claim 2-3x faster development time. No peer-reviewed study supports a specific multiplier. **Unverified (internal knowledge only)** — would need controlled study to verify.

### Verification Matrix
| Claim | Sources | Tier | Status |
|-------|---------|------|--------|
| Rust 1.5-3x faster (synthetic) | TechEmpower R22 (S), Axum bench (A) | S+A | Verified |
| Go GC < 500μs p99 | Go Blog (A), Release Notes (A) | A+A | Verified |
| Discord latency improvement | Discord Blog (A), HN thread (B) | A+B | Verified (single case study) |
| Rust 50% less memory | Reddit threads (B) only | B | Unverified |
| Go 2-3x dev speed | No source found || Unverified (internal knowledge only) |

### Sources

#### 🏛️ Tier S — Academic & Primary Research
- [TechEmpower Framework Benchmarks Round 22](https://www.techempower.com/benchmarks/) — TechEmpower (2024)

#### 🛡️ Tier A — Trusted Official
- [Getting to Go: The Journey of Go's Garbage Collector](https://go.dev/blog/ismmkeynote) — Go Blog (2022)
- [Go 1.19 Release Notes](https://go.dev/doc/go1.19) — Go Team (2022)
- [Why Discord is Switching from Go to Rust](https://discord.com/blog/why-discord-is-switching-from-go-to-rust) — Discord Engineering (2020)
- [Axum Benchmarks](https://github.com/tokio-rs/axum) — Tokio Project

#### ⚠️ Tier B — Community / Caution
- [HN Discussion on Discord migration](https://news.ycombinator.com/item?id=22238289) — Hacker News (2020)

Why this is good:

  • Every URL is a real, verifiable page
  • Claims that lack sources are explicitly labeled Unverified
  • The "50% less memory" myth is called out rather than repeated
  • Verification matrix honestly shows what's verified vs. not
  • Sources are independent (TechEmpower did their own benchmarks, not citing each other)
  • Nuance preserved: "the gap narrows with real-world I/O"

Source Tiers

Classify every source on discovery.

Tier Label Trust Level Examples
S 🏛️ Academic, peer-reviewed, primary research, official specs Google Scholar, arXiv, PubMed, W3C/IETF RFCs, language specs (ECMAScript, PEPs)
A 🛡️ Government, .edu, major press, official docs .gov/.edu, Reuters/AP/BBC, official framework docs, company engineering blogs (Google AI, Netflix Tech)
B ⚠️ Social media, forums, personal blogs, wikis — flag to user Twitter/X, Reddit, StackOverflow, Medium, dev.to, Wikipedia, 나무위키
C (none) General websites not fitting above categories Corporate marketing, press releases, SEO content, news aggregators

Tier Classification Rules

  • Company's own content about their product:
    • Official docs → Tier A
    • Feature announcements → Tier A (existence), Tier B (performance claims)
    • Marketing pages → Tier C
  • GitHub:
    • Official repos (e.g., facebook/react) → Tier A
    • Issues/Discussions with reproduction → Tier A (for bug existence)
    • Random user repos → Tier B
  • Benchmarks:
    • Independent, reproducible, methodology disclosed → Tier S
    • Official by neutral party → Tier A
    • Vendor's own benchmarks → Tier B (note bias)
  • StackOverflow: Accepted answers with high votes = borderline Tier A; non-accepted = Tier B
  • Tier B sources must never be cited alone — corroborate with Tier S or A

When to Use

  • Technology evaluation or comparison
  • Fact-checking specific claims
  • Architecture decision research
  • Market/competitor analysis
  • "Is X true?" verification tasks
  • Any question where accuracy matters more than speed

When NOT to Use

  • Creative writing or brainstorming (use creativity-sampler)
  • Code implementation (use search-first for library discovery)
  • Simple questions answerable from internal knowledge with high confidence
  • Opinion-based questions with no verifiable answer

Integration Notes

  • With brainstorming: Can be invoked during brainstorming's "Explore context" phase for fact-based inputs
  • With search-first: search-first finds tools/libraries to USE; this skill VERIFIES factual claims. Different purposes.
  • With adversarial-review: Research findings can feed into adversarial review for stress-testing conclusions
Weekly Installs
22
GitHub Stars
14
First Seen
10 days ago
Installed on
opencode22
github-copilot22
codex22
kimi-cli22
gemini-cli22
amp22