swain-search

Collect, normalize, and cache source materials into reusable troves that swain-design artifacts can reference.

Script invocation convention

Scripts live under this skill's scripts/ directory. Use the <SKILL_DIR> placeholder to mean the folder holding this SKILL.md. Resolve it at run time. In an installed skill, that is .claude/skills/swain-search/. In the swain repo, it is skills/swain-search/.

Run the bootstrap once per session before the media or X-thread flows:

bash "<SKILL_DIR>/scripts/bootstrap.sh"

The script checks that uv is on PATH. After the first run, a marker file at ~/.local/share/swain-search/.bootstrapped short-circuits later runs. If it exits non-zero, stop and tell the operator what is missing.
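
A guarded invocation might look like this (a minimal sketch; the error message is illustrative):

bash "<SKILL_DIR>/scripts/bootstrap.sh" || echo "Bootstrap failed: report the missing tool (e.g., uv) to the operator and stop."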

Mode detection

| Signal | Mode |
| --- | --- |
| No trove exists for the topic, or user says "research X" / "gather sources" | Create — new trove |
| Trove exists and user provides new sources or says "add to" / "extend" | Extend — add sources to existing trove |
| Trove exists and user says "refresh" or sources are past TTL | Refresh — re-fetch stale sources |
| User asks "what troves do we have" or "find sources about X" | Discover — search existing troves by tag |

Prior art check

Before creating a new trove or running web searches, scan existing troves for relevant content. This avoids duplicating research and surfaces connections to prior work.

Phase 1 — Literal keyword match

Search for the source name, URL fragments, and author name:

# Search trove manifests by tag
grep -rl "<keyword>" docs/troves/*/manifest.yaml 2>/dev/null

# Search trove source content
grep -rl "<keyword>" docs/troves/*/sources/**/*.md 2>/dev/null

# Search trove syntheses
grep -rl "<keyword>" docs/troves/*/synthesis.md 2>/dev/null

Phase 2 — Semantic topic match

After fetching the source and understanding what it's about, extract 3-5 topic keywords from the source's content (not just its name or URL). Then search existing troves by topic:

# Search trove tags for topic keywords
grep -l "<topic-keyword-1>\|<topic-keyword-2>\|<topic-keyword-3>" docs/troves/*/manifest.yaml 2>/dev/null

# Search synthesis summaries for topic keywords
grep -l "<topic-keyword-1>\|<topic-keyword-2>\|<topic-keyword-3>" docs/troves/*/synthesis.md 2>/dev/null

Topic keywords should describe what the source is about, not what it's called. For example, a repo named "Cog" that implements a memory system for Claude Code should generate topic keywords like agent-memory, memory-architecture, claude-code, persistent-memory — not cog or marciopuga.

If the source has not been fetched yet (URL-only invocation), use whatever topic information is available from the URL or title and defer full topic matching until after the source is fetched.

Decision gate

Before proceeding to Create or Extend mode, output a visible routing decision:

Prior art check: Phase 1 found [N matches / no matches]. Phase 2 found [N matches / no matches]: [trove-id (tags: x, y), ...]. Decision: Extending [trove-id] / Creating new trove [slug] because [reason].

This makes the trove routing decision auditable. If any trove matches on 2+ topic keywords, default to Extend mode unless the topic is genuinely distinct (adjacent but different subject matter).

Action on matches

If existing troves contain relevant sources:

  1. Report what was found — show the trove ID, matching source titles, and relevant excerpts
  2. Suggest extend over create — if an existing trove covers the same topic, extend it rather than creating a parallel trove
  3. Cross-link — if the topic is adjacent but distinct, create a new trove but note the related trove in synthesis.md

This step runs in all modes (Create, Extend, Discover) and before any web searches. Existing trove content is always checked first.

Snapshot evidence gate (SPEC-220)

Before a remote source can be treated as collected evidence, the run must produce a raw snapshot and a metadata ledger entry in .agents/search-snapshots/metadata.jsonl.

Required flow for remote sources:

  1. Export/download the raw snapshot first:
    • bash "<SKILL_DIR>/scripts/export-snapshot.sh" --url "<source-url>" --out-dir ".agents/search-snapshots/raw"
  2. Normalize the downloaded file using writing-skills or skill-creator (never summary-only browser notes).
  3. Log metadata:
    • bash "<SKILL_DIR>/scripts/log-snapshot-metadata.sh" --source-url "<source-url>" --export-mode "<mode>" --raw-path "<raw-path>" --normalized-path "<normalized-path>" --normalization-skill "<writing-skills|skill-creator>"
  4. Verify before publication:
    • bash "<SKILL_DIR>/scripts/verify-snapshot-evidence.sh" --source-url "<source-url>"

If verification fails, mark the source unverified, do not publish it downstream, and report the warning to the operator.
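
Chained end to end, the flow looks roughly like this (a minimal sketch; the export mode and paths are illustrative):

SOURCE_URL="https://docs.google.com/document/d/<id>"   # placeholder

# 1. Export the raw snapshot
bash "<SKILL_DIR>/scripts/export-snapshot.sh" --url "$SOURCE_URL" --out-dir ".agents/search-snapshots/raw"

# 2. Normalize the downloaded file with writing-skills or skill-creator (agent step, not a script)

# 3. Log the metadata ledger entry
bash "<SKILL_DIR>/scripts/log-snapshot-metadata.sh" \
  --source-url "$SOURCE_URL" \
  --export-mode "google-doc-export" \
  --raw-path ".agents/search-snapshots/raw/<file>" \
  --normalized-path "sources/<source-id>/<source-id>.md" \
  --normalization-skill "writing-skills"

# 4. Verify; a non-zero exit means the source stays unverified
bash "<SKILL_DIR>/scripts/verify-snapshot-evidence.sh" --source-url "$SOURCE_URL"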

Create mode

Build a new trove from scratch.

Step 1 — Gather inputs

Ask the user (or infer from context) for:

  1. Trove ID — a slug for the topic (e.g., websocket-vs-sse). Suggest one if the context is clear.
  2. Tags — keywords for discovery (e.g., real-time, websocket, sse)
  3. Sources — any combination of:
    • Web search queries ("search for WebSocket vs SSE comparisons")
    • URLs (web pages, forum threads, docs)
    • Video/audio URLs
    • Local file paths
  4. Freshness TTL overrides — optional, defaults are fine for most troves

If invoked from swain-design (e.g., spike entering Active), the artifact context provides the topic, tags, and sometimes initial sources.

Step 2 — Collect and normalize

For each source, use the appropriate capability. Read references/normalization-formats.md for the exact markdown structure per source type.

Web search queries:

  1. Use a web search capability to find relevant results
  2. Select the top 3-5 most relevant results
  3. For each: fetch the page, normalize to markdown per the web page format
  4. If no web search capability is available, tell the user and skip

Web page URLs:

  1. Fetch the page using a browser or page-fetching capability
  2. Strip boilerplate (nav, ads, sidebars, cookie banners)
  3. Normalize to markdown per the web page format
  4. If fetch fails, record the URL in manifest with a failed: true flag and move on

Google Docs / Drive-like documents:

  1. Export raw content first (required):
    • bash "<SKILL_DIR>/scripts/export-snapshot.sh" --url "<source-url>" --out-dir ".agents/search-snapshots/raw"
  2. Prefer API export modes (google-doc-export, google-slides-export, google-drive-download).
  3. If API export fails, use a browser helper fallback only when available.
  4. Normalize the exported file with writing-skills or skill-creator.
  5. Log metadata in .agents/search-snapshots/metadata.jsonl.
  6. Verify with verify-snapshot-evidence.sh before including the source in trove outputs.

Paywall proxy fallback:

After fetching a web page, check if a paywall proxy is available for the URL's domain:

  1. Run bash "<SKILL_DIR>/scripts/resolve-proxy.sh" <url>
    • Exit 1: no proxy configured — use the direct fetch content as-is
    • Exit 0: outputs PROXY:<name>:<proxy-url> and SIGNAL:<text> lines
  2. If exit 0, check the fetched content for each SIGNAL text (case-sensitive literal match)
  3. If any signal matches (or the article body is under ~200 words):
    • Log: "Paywall detected for <url> — trying proxy fallback"
    • Try each PROXY URL in order, fetching via the same page-fetching capability used for web pages
    • First proxy that returns substantive content (more than the truncated original) wins
    • Set proxy-used: <name> and notes: "Full article retrieved via <name> proxy" in the manifest entry
  4. If no signals match: use the direct fetch content as-is (no proxy needed)
  5. If all proxies fail: keep the original truncated content, set notes: "Paywalled; proxies exhausted — content from direct fetch only"

The registry lives at references/paywall-proxies.yaml. Add new domains or proxies there — no skill file changes needed.
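
A minimal sketch of the signal check, assuming the direct fetch was saved to /tmp/page.md:

url="https://example.com/article"   # placeholder

if out=$(bash "<SKILL_DIR>/scripts/resolve-proxy.sh" "$url"); then
  # Exit 0: output contains PROXY:<name>:<proxy-url> and SIGNAL:<text> lines
  hit=false
  while IFS= read -r signal; do
    grep -qF "$signal" /tmp/page.md && hit=true
  done < <(printf '%s\n' "$out" | sed -n 's/^SIGNAL://p')
  if [ "$hit" = true ]; then
    echo "Paywall detected for $url — trying proxy fallback"
    printf '%s\n' "$out" | sed -n 's/^PROXY://p'   # <name>:<proxy-url> entries to fetch in order
  fi
fi
# Exit 1 from resolve-proxy.sh means no proxy is configured; use the direct fetch as-is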

Video/audio URLs (YouTube, Instagram, podcasts):

Follow the tiered chain below. Whichever tier succeeds writes /tmp/swain_search_media_transcript.txt; that file is then normalized per the media format in references/normalization-formats.md, and the output goes to sources/<source-id>/<source-id>.md.

  1. Fetch subs and metadata via a single yt-dlp call:

    bash "<SKILL_DIR>/scripts/yt-dlp.sh" --write-auto-sub --sub-lang en --write-info-json --skip-download --sub-format vtt -o "/tmp/swain_search_media" "<URL>"
    

    For Instagram URLs, add --cookies-from-browser <browser> (chrome/safari/firefox/brave). For podcast or conference URLs not already on YouTube, resolve the title to a YouTube link first via a web-search capability.

  2. VTT path (preferred). If /tmp/swain_search_media.en.vtt exists and is non-empty:

    test -s /tmp/swain_search_media.en.vtt && uv run "<SKILL_DIR>/scripts/parse_vtt.py"
    

    The script writes [HH:MM:SS] line segments. Set transcript-source: vtt in the source frontmatter.

  3. Caption fallback. If no VTT file exists, read /tmp/swain_search_media.info.json and take the description field. Strip #\w+ tags and leading/trailing whitespace. If more than 100 characters remain, write the text to /tmp/swain_search_media_transcript.txt (one paragraph per line). Set transcript-source: caption. Omit timestamps. (A sketch follows this list.)

  4. Frame-extraction fallback (needs operator approval). If the caption is too short (100 chars or fewer), stop and ask:

    No subtitles or usable caption found. I can extract frames and read on-screen text to build a transcript. This needs opencv-python-headless (~30MB, via uv). Proceed?

    On approval:

    • Download the video: bash "<SKILL_DIR>/scripts/yt-dlp.sh" -o "/tmp/swain_search_video.mp4" "<URL>". Add --cookies-from-browser <browser> for Instagram.
    • Extract frames: uv run --with opencv-python-headless python3 "<SKILL_DIR>/scripts/extract_frames.py" /tmp/swain_search_video.mp4. Saves /tmp/swain_search_frame_000.png and up, via scene-change detection.
    • Probe vision. Use the Read tool on /tmp/swain_search_frame_000.png. Try to read the visible text. If that works, go to step 5. If not, go to step 6.
  5. Vision OCR (preferred). Use the Read tool on each remaining frame. Extract all visible text. Dedupe adjacent-frame repeats. Write the text to /tmp/swain_search_media_transcript.txt. Set transcript-source: vision-ocr. No timestamps.

  6. Local OCR fallback. If vision fails, ask:

    Vision not available. Falling back to local OCR via EasyOCR (~400MB first-run download). Proceed?

    On approval:

    uv run --with "easyocr,opencv-python-headless" python3 "<SKILL_DIR>/scripts/ocr_frames.py"
    

    Set transcript-source: local-ocr. No timestamps.

  7. Normalize and write the source. Derive the source ID slug from the video title. Use lowercase, numbers, and hyphens only. Write sources/<source-id>/<source-id>.md per the media format. Add transcript-source to the frontmatter. Add duration, speakers, and YouTube deep-links only when step 2 ran.

If no tier succeeds, record the source in the manifest with failed: true and reason: <tier>.
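
A minimal sketch of the caption fallback (step 3), assuming jq is available:

# Pull the description, strip #hashtags, trim whitespace
desc=$(jq -r '.description // ""' /tmp/swain_search_media.info.json \
  | sed -E 's/#[[:alnum:]_]+//g' \
  | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//')

# Use it only if more than 100 characters remain (one paragraph per line)
if [ "${#desc}" -gt 100 ]; then
  printf '%s\n' "$desc" > /tmp/swain_search_media_transcript.txt
fi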

X/Twitter threads:

URL pattern: (x|twitter|fxtwitter|fixupx).com/.+/status/\d+. Unrolled via the public fxtwitter API. No auth needed.

  1. Fetch the thread:

    uv run "<SKILL_DIR>/scripts/fetch_x_thread.py" "<URL>"
    

    The script writes two files: /tmp/swain_search_thread.json holds the raw response, and /tmp/swain_search_thread_transcript.txt holds the stitched transcript with cited-post blockquotes. It also prints a metadata JSON object to stdout. Capture these fields: author_name, author_handle, author_url, published_date, tweet_count, title_guess, source_url, post_urls, cited_posts. (A capture sketch follows this list.)

  2. Unrollable thread. If the script exits with "only 1 post returned" on a thread-opener, the upstream deployment lacks an account proxy. Record the entry with failed: true and reason: x-thread-unrollable. Move on.

  3. Derive the source ID as <author_handle>-<first-few-title-words>. Sanitize to lowercase, numbers, and hyphens only. Strip any @.

  4. Normalize per the x-thread format in references/normalization-formats.md:

    • Frontmatter: add author-handle, author-name, published-date, and tweet-count to the common fields.
    • Body: render every post verbatim as a numbered list. Hyperlink each number back to its tweet URL. Strip leading auto-mention chains. These are the @handle prefixes X adds to replies. Hyperlink inline @mentions as [@handle](https://x.com/handle). Hyperlink hashtags as [#tag](https://x.com/hashtag/tag).
    • Cited posts: render each cited_posts entry as an indented blockquote under the citing post. Append up to 3 substantive self-replies as continuation. Skip bare-URL self-replies; they already appear in external_links. Link out if more than 3 self-replies exist.
  5. Save to sources/<source-id>/<source-id>.md.
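
A minimal capture sketch for step 1, assuming the metadata JSON on stdout uses the field names listed above and jq is available:

url="https://x.com/someone/status/1234567890"   # placeholder

meta=$(uv run "<SKILL_DIR>/scripts/fetch_x_thread.py" "$url")
author=$(printf '%s' "$meta" | jq -r '.author_handle' | tr -d '@')
title=$(printf '%s' "$meta" | jq -r '.title_guess' | cut -d' ' -f1-4)

# <author_handle>-<first-few-title-words>, lowercase/numbers/hyphens only
source_id=$(printf '%s-%s' "$author" "$title" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-*//; s/-*$//')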

Local files:

  1. Use a document conversion capability (PDF, DOCX, etc.) or read directly if already markdown
  2. Normalize per the document format using writing-skills or skill-creator
  3. For markdown files: add frontmatter only, preserve content

Forum threads / discussions:

  1. Fetch and normalize per the forum format (chronological, author-attributed)
  2. Flatten nested threads to chronological order with reply-to context

Repositories:

  1. Clone or read the repository contents
  2. Mirror the original directory tree under sources/<source-id>/
  3. Default: mirror the full tree. For large repositories (thousands of files), ingest selectively and set selective: true in the manifest entry
  4. Populate the highlights array with paths to the most important files (relative to the source-id directory)
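
A minimal sketch of the mirror step, assuming a git URL and rsync on PATH (paths are placeholders):

git clone --depth 1 "<repo-url>" /tmp/swain_search_repo
mkdir -p "docs/troves/<trove-id>/sources/<source-id>"
rsync -a --exclude '.git' /tmp/swain_search_repo/ "docs/troves/<trove-id>/sources/<source-id>/"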

Documentation sites:

  1. Crawl or fetch the documentation site
  2. Mirror the section hierarchy under sources/<source-id>/
  3. Default: mirror the full site. For large sites, ingest selectively and set selective: true
  4. Populate the highlights array with paths to the most important pages
  5. Preserve internal link structure where possible

CLI tools:

First, detect if the target is a CLI tool. Check these criteria:

  • The target exists in PATH (run command -v <tool-name>)
  • The name matches CLI patterns (lowercase, hyphens, no spaces)
  • The context indicates a command-line tool

If the target is a CLI tool, run the capture sequence:

  1. Manpage capture:

    • Run man <tool-name>
    • If successful, save as cli-manpage type
    • If no manpage, skip to help capture
  2. Primary help capture:

    • Run <tool-name> --help
    • If that fails, try <tool-name> -h
    • Save as cli-help type
  3. Subcommand discovery:

    • Look for these patterns in help output:
      • "Commands:" or "Subcommands:" headings
      • Indented command lists under "Usage:" sections
      • Command patterns like <tool> <command> [options]
    • Filter out non-commands:
      • Keep single words or hyphenated strings
      • Keep positional arguments from usage lines
      • Skip anything starting with - (those are flags)
  4. Recursive subcommand capture:

    • For each subcommand, run <tool-name> <subcommand> --help
    • Save as cli-subcommand-help with depth: 1
    • If subcommand has its own subcommands, go one level deeper
    • Maximum depth: 2 levels
  5. Failure handling:

    • No manpage? Use help capture only
    • Both --help and -h fail? Note in manifest
    • All captures fail? Mark as failed: true and continue

Each capture becomes a separate source:

  • sources/<tool>-manpage/<tool>-manpage.md (type: cli-manpage)
  • sources/<tool>-help/<tool>-help.md (type: cli-help)
  • sources/<tool>-<subcommand>-help/<tool>-<subcommand>-help.md (type: cli-subcommand-help)
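
A minimal sketch of the capture sequence, assuming a POSIX shell and that subcommand names were already extracted from the help output (the tool name is hypothetical):

tool=mytool   # hypothetical

# Manpage capture (skip to help capture if none exists)
man "$tool" > "/tmp/swain_search_${tool}_manpage.txt" 2>/dev/null || echo "no manpage for $tool"

# Primary help capture: --help, then -h; note in the manifest if both fail
"$tool" --help > "/tmp/swain_search_${tool}_help.txt" 2>&1 \
  || "$tool" -h > "/tmp/swain_search_${tool}_help.txt" 2>&1 \
  || echo "no help output for $tool"

# Depth-1 subcommand capture (depth 2 repeats the same pattern; maximum depth 2)
for sub in $subcommands; do
  "$tool" "$sub" --help > "/tmp/swain_search_${tool}_${sub}_help.txt" 2>&1 || true
done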

Each normalized source gets a slug-based source ID and lives in a directory-per-source layout:

  • Flat sources (web, forum, media, document, local): sources/<source-id>/<source-id>.md
  • Hierarchical sources (repository, documentation-site): sources/<source-id>/ with the original tree mirrored inside

Source ID generation:

  • Derive the source ID as a slug from the source title or URL (e.g., mdn-websocket-api, strangeloop-2025-realtime)
  • When a slug collides with an existing source ID: append __word1-word2 using two random words from references/wordlist.txt
  • If the wordlist is missing, append __ followed by 4 hex characters (e.g., __a3f8) as a fallback
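
A minimal sketch of the collision suffix, assuming shuf and xxd are available:

if [ -f "<SKILL_DIR>/references/wordlist.txt" ]; then
  suffix=$(shuf -n 2 "<SKILL_DIR>/references/wordlist.txt" | paste -sd- -)
else
  suffix=$(head -c 2 /dev/urandom | xxd -p)   # 4 hex characters
fi
source_id="${source_id}__${suffix}"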

Step 3 — Generate manifest

Create manifest.yaml following the schema in references/manifest-schema.md. Include:

  • Trove metadata (id, created date, tags)
  • Default freshness TTL per source type
  • One entry per source with provenance (URL/path, fetch date, content hash, type)

Compute content hashes as bare hex SHA-256 digests (no prefix) of the normalized markdown content:

shasum -a 256 sources/mdn-websocket-api/mdn-websocket-api.md | cut -d' ' -f1

Step 4 — Generate synthesis

Create synthesis.md — a structured distillation of key findings across all sources.

Structure the synthesis by theme, not by source. Group related findings together, cite sources by ID, and surface:

  • Key findings — what the sources collectively say about the topic
  • Points of agreement — where sources converge
  • Points of disagreement — where sources conflict or present alternatives
  • Gaps — what the sources don't cover that might matter

Keep it concise. The synthesis is a starting point, not a comprehensive report — the user or artifact author will refine it.

Step 5 — Commit and stamp

Use the dual-commit pattern (same as swain-design lifecycle stamps) to give the trove a reachable commit hash.

Before Commit A — append a history entry to manifest.yaml with a -- placeholder for the commit hash:

history:
  - event: created
    date: 2026-03-09
    commit: "--"
    sources: 3

Commit A — commit the trove content:

git add docs/troves/<trove-id>/
git commit -m "research(<trove-id>): create trove with N sources

Co-Authored-By: <model-name-from-system-prompt> <noreply@unknown>"
TROVE_HASH=$(git rev-parse HEAD)

Commit B — back-fill the commit hash into the history entry, then update the referencing artifact's frontmatter (if one exists):

# Replace "--" with the real hash in the history entry
# Update artifact frontmatter: trove: <trove-id>@<TROVE_HASH>
git add docs/troves/<trove-id>/manifest.yaml
git add docs/<artifact-type>/<phase>/<artifact-dir>/   # if artifact exists
git commit -m "docs(<trove-id>): stamp history hash ${TROVE_HASH:0:7}

Co-Authored-By: <model-name-from-system-prompt> <noreply@unknown>"
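
A minimal back-fill sketch (GNU sed shown; the history entry just written is the only one still carrying the placeholder):

sed -i "s/commit: \"--\"/commit: \"${TROVE_HASH}\"/" docs/troves/<trove-id>/manifest.yaml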

If no referencing artifact exists yet (standalone research), Commit B still stamps the history entry — report the hash so it can be referenced later.

Push — after Commit B, ALWAYS push to origin/trunk so the trove is immediately available to other agents and sessions. This is mandatory, not optional:

git push origin trunk

Derive synthesis URL — construct a stable permalink to the synthesis file for the final report:

REMOTE_URL=$(git remote get-url origin | sed 's/\.git$//' | sed 's/git@github.com:/https:\/\/github.com\//')
SYNTHESIS_URL="${REMOTE_URL}/blob/${TROVE_HASH}/docs/troves/<trove-id>/synthesis.md"

Step 6 — Report

Tell the user what was created:

Trove <trove-id> created with N sources — committed as <TROVE_HASH:0:7>.

  • docs/troves/<trove-id>/manifest.yaml — provenance and metadata
  • docs/troves/<trove-id>/sources/ — N normalized source files
  • docs/troves/<trove-id>/synthesis.md — thematic distillation: <SYNTHESIS_URL>

Reference from artifacts with: trove: <trove-id>@<TROVE_HASH:0:7>

Always include the synthesis file URL in the report. For multiple troves created in a single run, list each synthesis URL.

Extend mode

Add new sources to an existing trove.

  1. Read the existing manifest.yaml
  2. Collect and normalize new sources (same as Create step 2)
  3. Assign slug-based source IDs to new sources (following the same ID generation rules)
  4. Append new entries to manifest.yaml
  5. Update refreshed date
  6. Regenerate synthesis.md incorporating all sources (old + new)
  7. Append a history entry with event: extended and commit: "--" placeholder
  8. Commit and stamp (same dual-commit pattern as Create step 5):
    • Commit A: `git commit -m "research(<trove-id>): extend with N new sources

Co-Authored-By: <model-name-from-system-prompt> <noreply@unknown>"`
    • Capture TROVE_HASH=$(git rev-parse HEAD)
    • Commit B: back-fill the hash in the history entry, update the referencing artifact frontmatter (if an artifact exists)
    • Push (mandatory): git push origin trunk
    • Derive SYNTHESIS_URL (same method as Create step 5)
  9. Report what was added, including the new commit hash and the synthesis file URL

Refresh mode

Re-fetch stale sources and update changed content.

  1. Read manifest.yaml
  2. For each source, check if fetched date + freshness-ttl has elapsed
  3. For stale sources:
    • Re-fetch the raw content
    • Re-normalize to markdown
    • Compute new content hash
    • If hash changed: replace the source file, update manifest entry
    • If hash unchanged: update only fetched date
  4. Update refreshed date in manifest
  5. If any content changed, regenerate synthesis.md
  6. Append a history entry with event: refreshed, sources-changed: M, and commit: "--" placeholder
  7. Commit and stamp (same dual-commit pattern as Create step 5):
    • Commit A: `git commit -m "research(<trove-id>): refresh N sources (M changed)

Co-Authored-By: <model-name-from-system-prompt> <noreply@unknown>"`
    • Capture TROVE_HASH=$(git rev-parse HEAD)
    • Commit B: back-fill the hash in the history entry, update the referencing artifact(s) frontmatter — check referenced-by in the manifest for all dependents
    • Push (mandatory): git push origin trunk
    • Derive SYNTHESIS_URL (same method as Create step 5)
  8. Report: "Refreshed N sources. M had changed content, K were unchanged. New hash: <TROVE_HASH:0:7>. Synthesis: <SYNTHESIS_URL>"

For sources with freshness-ttl: never, skip them during refresh.
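
A minimal staleness check for step 2, assuming fetched is an ISO date, freshness-ttl is a number of days, and GNU date is available:

if [ "$ttl" != "never" ]; then
  expiry=$(date -d "$fetched + $ttl days" +%Y-%m-%d)
  if [ "$expiry" \< "$(date +%Y-%m-%d)" ]; then
    echo "stale: $source_id"
  fi
fi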

Discover mode

Help the user find existing troves relevant to their topic.

  1. Scan docs/troves/*/manifest.yaml for all troves
  2. Match against the user's query by:
    • Tag match — trove tags contain query keywords
    • Title match — trove ID slug contains query keywords
  3. For each match, show: trove ID, tags, source count, last refreshed date, referenced-by list
  4. If no matches, suggest creating a new trove
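
A minimal discovery sketch over manifest tags and trove ID slugs (the query keyword is illustrative):

query="websocket"   # illustrative

for m in docs/troves/*/manifest.yaml; do
  trove=$(basename "$(dirname "$m")")
  if grep -qi "$query" "$m" || printf '%s' "$trove" | grep -qi "$query"; then
    echo "match: $trove"
  fi
done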

Graceful degradation

The skill references capabilities generically. When a capability isn't available:

| Capability | Fallback |
| --- | --- |
| Web search | Skip search-based sources. Tell user: "No web search capability available — provide URLs directly or add a search MCP." |
| Browser / page fetcher | Try basic URL fetch. If that fails: "Can't fetch this URL — paste the content or provide a local file." |
| Snapshot export for remote docs | If export fails and no helper exists: mark source unverified, do not publish downstream, report exact URL and failure mode. |
| Media transcription | "No transcription capability available — provide a pre-made transcript file, or add a media conversion tool." |
| Document conversion | "Can't convert this file type — provide a markdown version, or add a document conversion tool." |
| Paywall proxy | Keep truncated content. Note in manifest: "Paywalled; proxies exhausted." Suggest user provide content manually. |

Never fail the entire run because one capability is missing. Collect what you can, skip what you can't, and report clearly.

Capability detection

Before collecting sources, check what's available. Look for tools matching these patterns — the exact tool names vary by installation:

  • Web search: tools with "search" in the name (e.g., brave_web_search, bing-search-to-markdown)
  • Page fetching: tools with "fetch", "webpage", "browser" in the name (e.g., fetch_content, webpage-to-markdown, browser_navigate)
  • Media transcription: tools with "audio", "video", "youtube" in the name (e.g., audio-to-markdown, youtube-to-markdown)
  • Document conversion: tools with "pdf", "docx", "pptx", "xlsx" in the name (e.g., pdf-to-markdown, docx-to-markdown)
  • CLI tool capture: built-in bash capabilities (man, command execution) — always available on POSIX systems

Report available capabilities at the start of collection so the user knows what will and won't work.

Linking from artifacts

Artifacts reference troves in frontmatter:

trove: websocket-vs-sse@abc1234

The format is <trove-id>@<commit-hash>. The commit hash pins the trove to a specific version — troves evolve over time as sources are added or refreshed, and the hash ensures reproducibility.

The dual-commit workflow in Create step 5, Extend step 8, and Refresh step 7 handles this automatically — Commit A records the trove content and Commit B stamps the hash into the history entry and referencing artifact's frontmatter. Do not defer this to the operator.
