/bookstrap-research - Execute Research Tasks

Execute research tasks autonomously to fill knowledge gaps identified during planning. Search the web for sources, evaluate their reliability, ingest relevant content, extract entities and relationships, and mark gaps as resolved.

Purpose

Fill knowledge gaps by conducting autonomous web research. This command loads pending research tasks from the database, searches for high-quality sources, ingests them into the corpus, and resolves the associated knowledge gaps.

This command delegates the research work to the researcher agent, which operates in research mode with full web access.

Input Arguments

None. This command reads pending research tasks from the database:

Unresolved knowledge gaps with their questions and context
Priority levels (high, medium, low)
Chapters/sections blocked by each gap
Previous research attempts (if any)

Processing Workflow

1. Load Pending Research Tasks

Query the database to retrieve unresolved knowledge gaps:

# Query pending research tasks
surreal sql --conn http://localhost:2665 \
  --user root --pass root \
  --ns bookstrap --db <database-name> \
  --query "SELECT * FROM knowledge_gap WHERE resolved = false ORDER BY priority DESC, created_at ASC;"

Filter and prioritize:

High priority tasks first (blocking core content)
Tasks with existing context from BRD
Tasks with clear, actionable research questions
Skip tasks that have already reached the retry limit (max_retries_per_task)
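The filtering and ordering rules above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the command's actual implementation: the record fields `priority`, `created_at`, and `attempts`, and the helper name `order_pending_gaps`, are hypothetical. One thing worth checking in a real schema: if `priority` is stored as a string, `ORDER BY priority DESC` sorts alphabetically ("medium", "low", "high"), so an explicit rank like the one below may be needed.

```python
from typing import Any

# Explicit rank so that "high" sorts first regardless of alphabetical order.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def order_pending_gaps(
    gaps: list[dict[str, Any]], max_retries: int = 3
) -> list[dict[str, Any]]:
    """Drop gaps that reached the retry limit, then sort by priority and age."""
    # "attempts" is an assumed field counting previous failed research attempts.
    eligible = [g for g in gaps if g.get("attempts", 0) < max_retries]
    return sorted(
        eligible,
        key=lambda g: (
            # Unknown priority labels sort after all known ones.
            PRIORITY_RANK.get(g["priority"], len(PRIORITY_RANK)),
            g["created_at"],  # oldest first within the same priority
        ),
    )
```

Sorting on a `(rank, created_at)` tuple keeps the ordering stable and mirrors the `priority DESC, created_at ASC` intent of the query.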

2. For Each Research Task

Execute research tasks one at a time, committing progress after each successful ingestion.

2.1 Web Search for Sources

Use the configured search provider (Tavily, Brave, Serper, or Google) to find relevant sources:

# Perform web search based on research question
# Apply domain filters from bookstrap.config.json
# Respect rate limits
# Target max_sources_per_task from config (default: 5)

Search strategy:

Start with the exact research question
Broaden search if too few results
Narrow search if too many irrelevant results
Look for multiple source types (academic, primary, journalism)

2.2 Evaluate Source Reliability

For each search result, assess reliability:

Source Type	Reliability	Use Case
Academic/peer-reviewed journals	High	Factual claims, scholarly arguments
Primary sources (archives, documents)	High	Historical facts, original evidence
Reputable journalism (NYT, BBC, etc.)	Medium-High	Recent events, context
Government/institutional sites	Medium-High	Official data, statistics
Expert blogs/substacks	Medium	Opinion, analysis, niche expertise
Wikipedia	Low (for leads only)	Starting point, bibliography mining
Random blogs/forums	Very Low	Avoid unless unique perspective needed

Check reliability indicators:

Author credentials and affiliations
Publication date (prefer recent for current topics)
Citations and references provided
Domain authority and reputation
Fact-checking and editorial standards
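As a rough illustration of how the rubric could be applied against `min_reliability`, the sketch below maps source types to the three labels accepted by the `--reliability` flag. The type keys, the collapsing of "Medium-High" into `medium` and "Very Low" into `low`, and the function name `meets_threshold` are all assumptions made for this example.

```python
# Ordered from least to most reliable; labels follow the --reliability flag.
LEVELS = ["low", "medium", "high"]

# Assumed mapping from rubric source types to ingest-level labels.
RUBRIC = {
    "academic": "high",
    "primary": "high",
    "journalism": "medium",   # "Medium-High" in the rubric, collapsed here
    "government": "medium",   # "Medium-High" in the rubric, collapsed here
    "expert_blog": "medium",
    "wikipedia": "low",       # leads only
    "forum": "low",           # "Very Low" in the rubric, collapsed here
}


def meets_threshold(source_type: str, min_reliability: str = "medium") -> bool:
    """Return True if the source type's rating is at or above the minimum."""
    rating = RUBRIC.get(source_type, "low")  # unknown types are treated as low
    return LEVELS.index(rating) >= LEVELS.index(min_reliability)
```

A real evaluation would also weigh the indicators listed above (author credentials, citations, editorial standards); this sketch covers only the type-based baseline.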

2.3 Ingest Relevant Content

For approved sources, ingest into the database:

# Use ingest-file.py for web content
python ./scripts/ingest-file.py \
  --url <source-url> \
  --source-type <primary|secondary|web> \
  --reliability <high|medium|low> \
  --context "ingested_during:research" \
  --gap-id <knowledge-gap-id>

Ingestion performs:

Fetch content: Download and extract text
Semantic chunking: LLM identifies natural breakpoints
Generate embeddings: Via configured provider
Extract entities: Characters, locations, events, concepts, dates
Auto-create relationships: Build graph edges
Update timeline: Order events chronologically
Link to knowledge gap: Track which gap this resolves

2.4 Extract Entities and Relationships

The ingestion script uses extract-entities.py with LLM-based extraction:

# Extract entities from ingested content
python ./scripts/extract-entities.py \
  --source-id <source-id> \
  --content <chunk-content> \
  --context <surrounding-context>

Extracts:

Characters/People: Names, roles, descriptions, relationships
Locations: Places, settings, geographic details
Events: Occurrences, actions, dates, sequences
Concepts: Ideas, theories, terminology, themes
Dates/Times: Temporal markers for timeline

Creates relationships:

-- Link source to extracted entities
RELATE source:<id>->supports->concept:<id>;
RELATE event:<id>->precedes->event:<next-id>;
RELATE character:<id>->knows->character:<other-id>;
RELATE location:<id>->contains->location:<sub-id>;

2.5 Mark Knowledge Gap Resolved

After successful ingestion, update the knowledge gap:

UPDATE knowledge_gap:<gap-id> SET
  resolved = true,
  resolved_by = source:<source-id>,
  resolved_at = time::now()
;

2.6 Commit Progress

After each successfully resolved gap:

git add .
git commit -m "[bookstrap] Research: Resolved gap '<gap-question>' with <N> sources

Sources ingested:
- <source-1-title> (<reliability>)
- <source-2-title> (<reliability>)
...

Entities extracted: <count>
Relationships created: <count>

Co-Authored-By: Claude <noreply@anthropic.com>"
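The commit message template can also be assembled programmatically. The sketch below is one possible way to do that; the function name `build_commit_message` and its parameters are hypothetical and not part of the documented scripts.

```python
def build_commit_message(
    question: str,
    sources: list[tuple[str, str]],
    entities: int,
    relationships: int,
) -> str:
    """Render the commit message template for one resolved gap.

    sources is a list of (title, reliability) pairs.
    """
    lines = [
        f"[bookstrap] Research: Resolved gap '{question}' with {len(sources)} sources",
        "",
        "Sources ingested:",
    ]
    lines += [f"- {title} ({reliability})" for title, reliability in sources]
    lines += [
        "",
        f"Entities extracted: {entities}",
        f"Relationships created: {relationships}",
        "",
        "Co-Authored-By: Claude <noreply@anthropic.com>",
    ]
    return "\n".join(lines)
```

Passing the result to git as an argument list, for example `subprocess.run(["git", "commit", "-m", message], check=True)`, avoids shell quoting problems when the gap question itself contains quotation marks.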

3. Delegate to Researcher Agent

Invoke the researcher agent to carry out the per-task steps described in section 2:

# Load researcher agent with context
# Agent will:
# 1. Load next pending research task
# 2. Search web for sources
# 3. Evaluate source reliability
# 4. Ingest approved sources
# 5. Extract entities and relationships
# 6. Mark gap resolved
# 7. Commit progress
# 8. Continue to next task or exit if blocked

The researcher agent has read+write database access, web search capabilities, and uses the research and surrealdb skills.

4. Continue Until Complete or Blocked

The researcher agent continues processing tasks until:

All tasks complete: All knowledge gaps resolved
Rate limited: Search API rate limit reached (pause and resume)
No results found: Research question is too specific or no sources are available
Quality threshold not met: No sources meet reliability criteria
Manual intervention needed: Gap requires human expertise or access

Output Format

Report research progress to the user:

RESEARCH EXECUTION
==================

Configuration:
- Search provider: Tavily
- Max sources per task: 5
- Rate limit: 10 requests/min
- Database: bookstrap/my_book

TASK 1/12: High Priority
-------------------------
Question: "SOE wireless operator training protocols 1942-1943"
Context: Core to protagonist's role, zero sources
Blocks: Chapters 2, 4, 6

Web search: 15 results found
Evaluating sources...
  ✓ [HIGH] "SOE Training at Beaulieu" - Historical Society Journal
  ✓ [HIGH] "Wireless Operator Manual 1943" - National Archives
  ✓ [MEDIUM] "Special Operations Executive Training" - Imperial War Museum
  ✗ [LOW] Wikipedia article (using for bibliography only)

Ingesting 3 sources...
  → Source 1: 2,340 words, 4 chunks, 15 entities extracted
  → Source 2: 1,850 words, 3 chunks, 8 entities extracted
  → Source 3: 3,120 words, 5 chunks, 12 entities extracted

Entities extracted:
  - Characters: 5 (trainers, notable agents)
  - Locations: 3 (Beaulieu, training facilities)
  - Events: 12 (training procedures, protocols)
  - Concepts: 10 (wireless techniques, codes)
  - Dates: 8 (timeline entries)

Relationships created: 24 graph edges

Gap resolved ✓
Committed: [abc1234]

---

TASK 2/12: High Priority
-------------------------
Question: "Lyon Resistance network structure and key figures"
Context: Primary setting, only 2 tangential sources
Blocks: Chapters 3, 5, 7

Web search: 22 results found
Evaluating sources...
  ✓ [HIGH] "Combat: Resistance Movements in Lyon" - French Archives
  ✓ [MEDIUM] "The Lyon Resistance 1940-1944" - BBC History
  ✓ [MEDIUM] "Jean Moulin and Lyon Networks" - Resistance Museum
  ✗ [LOW] Blog post (insufficient citations)

Ingesting 3 sources...
[...]

---

PROGRESS SUMMARY
================

Tasks completed: 2/12
Tasks remaining: 10 (8 high priority, 2 medium priority)
Sources ingested: 6
Entities extracted: 35
Relationships created: 48
Knowledge gaps resolved: 2
Commits: 2

Time elapsed: 8 minutes
Rate limit status: 18/60 requests used this hour

NEXT STEPS
----------
Continue research: /bookstrap-research (auto-resumes)
Check progress: /bookstrap-status
View corpus: /bookstrap-query "What sources do we have about wireless training?"

Behavior Characteristics

Research Mode (Web Access Enabled)

This command operates in research mode:

✓ Can access web via search APIs
✓ Can fetch external URLs
✓ Fills gaps in corpus
✓ Exploratory and messy (casts a wide net)
✓ Commits after each successful ingestion
✗ Does NOT write manuscript content
✗ Does NOT run during write mode

Autonomous Execution

Runs fully autonomously:

No human approval needed per source (uses reliability rubric)
Processes tasks sequentially (easier to debug and resume)
Commits atomic progress (recoverable if interrupted)
Respects rate limits (pauses and resumes)
Logs all decisions (source acceptance/rejection reasons)

Quality Over Quantity

Prioritizes source quality:

Better to skip a gap than ingest poor sources
Prefers fewer high-quality sources over many weak ones
Flags gaps as "needs human review" if no quality sources found
Logs rejected sources with reasons for transparency

Rate Limiting and Resumption

Handles rate limits gracefully:

{
  "research": {
    "provider": "tavily",
    "rate_limit": {
      "requests_per_minute": 10,
      "retry_after_seconds": 60,
      "max_retries": 3
    }
  }
}

If rate limited:

Log current progress
Wait for rate limit reset
Resume from next pending task
Continue until complete

Re-running /bookstrap-research after interruption:

Skips already resolved gaps
Resumes from next pending task
Maintains priority order
Idempotent (safe to run multiple times)
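The pause-and-resume behavior can be illustrated with a small retry wrapper driven by the `retry_after_seconds` and `max_retries` settings. This is a sketch only: the `RateLimited` exception and `call_with_retry` helper are hypothetical, and a real search client would signal throttling in its own way (for example, an HTTP 429 response).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by a search client when the API throttles."""


def call_with_retry(
    request: Callable[[], T],
    retry_after_seconds: float = 60,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run request(), pausing and retrying when it is rate limited."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller flag the task
            sleep(retry_after_seconds)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the wait can be replaced in tests; in normal use the default `time.sleep` applies.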

Configuration

Research behavior is configured in bookstrap.config.json:

{
  "research": {
    "provider": "tavily",
    "api_key_env": "TAVILY_API_KEY",
    "rate_limit": {
      "requests_per_minute": 10
    },
    "blocked_domains": ["example-spam-site.com"],
    "allowed_domains": [],
    "max_sources_per_task": 5,
    "min_reliability": "medium",
    "auto_commit": true,
    "max_retries_per_task": 3
  }
}

Settings:

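provider: Search API (tavily, brave, serper, google)
api_key_env: Name of the environment variable that holds the search API key
rate_limit.requests_per_minute: Maximum search requests per minute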
max_sources_per_task: Limit sources per research question (prevents over-research)
min_reliability: Minimum acceptable source quality (low, medium, high)
blocked_domains: Never fetch from these domains
allowed_domains: If set, only fetch from these domains
auto_commit: Commit after each gap resolved (recommended)
max_retries_per_task: Abandon task after N failed attempts
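The interaction between `blocked_domains` and `allowed_domains` could work roughly as in the sketch below. Treating subdomains as matches for their parent domain is an assumption of this example, as is the helper name `is_domain_allowed`.

```python
from urllib.parse import urlparse


def is_domain_allowed(url: str, blocked: list[str], allowed: list[str]) -> bool:
    """Apply blocked_domains first, then allowed_domains if it is non-empty."""
    host = (urlparse(url).hostname or "").lower()

    def matches(domain: str) -> bool:
        # Assumption: a listed domain also covers its subdomains.
        domain = domain.lower()
        return host == domain or host.endswith("." + domain)

    if any(matches(d) for d in blocked):
        return False
    if allowed:
        # A non-empty allow list restricts fetching to those domains only.
        return any(matches(d) for d in allowed)
    return True
```

With the sample configuration above (one blocked domain, empty allow list), every URL is accepted except those on `example-spam-site.com`.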

Error Handling

Error	Recovery
Search API unavailable	Skip to next task, flag for manual research
Rate limit exceeded	Pause, wait for reset, resume
No sources found	Flag gap as "no sources available", continue
All sources rejected (low quality)	Flag gap as "needs human review", continue
Ingestion failure	Log error, skip source, try next source
Database write failure	Abort, report error, preserve uncommitted work
Network timeout	Retry up to max_retries, then skip

Pre-requisites

Before running /bookstrap-research:

BRD created: /bookstrap-init must have been run
SurrealDB running: Database must be accessible
Research tasks generated: /bookstrap-plan-research must have identified gaps
Search API configured: API key in environment, provider configured
Embedding provider configured: For ingesting new content

Related Commands

/bookstrap-plan-research - Generate research tasks (run this first)
/bookstrap-ingest - Ingest user-provided sources (manual alternative)
/bookstrap-status - Monitor research progress and gap resolution
/bookstrap-query - Query ingested sources
/bookstrap-plan-write - Generate writing tasks (run after research complete)

Supporting Agents

Agent	Role
`researcher`	Executes web research, evaluates sources, ingests content

Supporting Skills

Skill	Purpose
`research/`	Source evaluation, web search strategies, entity extraction patterns
`surrealdb/`	Database query patterns for gap tracking and entity storage

Supporting Scripts

Script	Purpose
`ingest-file.py`	Ingest content from URLs or files
`generate-embedding.py`	Generate embeddings via configured provider
`extract-entities.py`	LLM-based entity extraction
`chunk.py`	Semantic chunking strategies

Example Usage

# After identifying gaps, execute research
/bookstrap-plan-research
/bookstrap-research

# Research interrupted? Resume where you left off
/bookstrap-research

# Check progress
/bookstrap-status

# Continue research if new gaps found during writing
/bookstrap-write
# (discovers gaps)
/bookstrap-plan-research
/bookstrap-research
/bookstrap-write

Integration with Research Loop

This command is part of the research-write cycle:

init → ingest → plan-research → research
                     ↑              │
                     │              ▼
                     │         plan-write → write → edit
                     │              │
                     └──── gaps ────┘

When writing discovers knowledge gaps:

Writing task marks gap and continues to next task
User runs /bookstrap-plan-research to generate research tasks
User runs /bookstrap-research to fill gaps
User runs /bookstrap-write to resume writing (now unblocked)

Statistics to Track

Calculate and report:

Tasks completed vs. remaining
Sources ingested per task (average)
Entities extracted per source (average)
Reliability distribution (high/medium/low sources)
Time per task (estimate remaining time)
Rate limit usage (requests per hour)
Gaps flagged for human review
Success rate (gaps resolved / gaps attempted)
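A few of these statistics can be derived from per-task results as sketched below. The input shape (a `status` of "resolved", "needs_review", or "pending", plus a `sources` list with an `entities` count per source) is assumed for illustration and does not reflect a documented schema.

```python
def summarize(tasks: list[dict]) -> dict:
    """Compute summary statistics from per-task research results."""
    attempted = [t for t in tasks if t["status"] != "pending"]
    resolved = [t for t in attempted if t["status"] == "resolved"]
    sources = [s for t in attempted for s in t["sources"]]
    entities = sum(s["entities"] for s in sources)

    def ratio(numerator: float, denominator: float) -> float:
        # Guard against division by zero when nothing has been attempted yet.
        return numerator / denominator if denominator else 0.0

    return {
        "completed": len(resolved),
        "remaining": len(tasks) - len(attempted),
        "flagged_for_review": len(attempted) - len(resolved),
        "avg_sources_per_task": ratio(len(sources), len(attempted)),
        "avg_entities_per_source": ratio(entities, len(sources)),
        "success_rate": ratio(len(resolved), len(attempted)),
    }
```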

Logging

Detailed logging for transparency:

[2024-01-15 14:23:45] [RESEARCH] Task 1/12 started: "SOE wireless protocols"
[2024-01-15 14:23:47] [SEARCH] Query: "SOE wireless operator training 1942-1943"
[2024-01-15 14:23:49] [SEARCH] Found 15 results
[2024-01-15 14:23:50] [EVAL] Source 1: "SOE Training at Beaulieu" - HIGH (academic journal)
[2024-01-15 14:23:51] [EVAL] Source 2: "Wireless Manual 1943" - HIGH (primary source)
[2024-01-15 14:23:52] [EVAL] Source 3: "Random blog post" - REJECTED (insufficient citations)
[2024-01-15 14:23:55] [INGEST] Source 1: 2,340 words, 4 chunks, 15 entities
[2024-01-15 14:24:12] [INGEST] Source 2: 1,850 words, 3 chunks, 8 entities
[2024-01-15 14:24:28] [GRAPH] Created 24 relationships
[2024-01-15 14:24:30] [DB] Gap resolved: knowledge_gap:gap-001
[2024-01-15 14:24:32] [GIT] Committed: abc1234
[2024-01-15 14:24:33] [RESEARCH] Task 1/12 complete (48s)
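The per-task duration reported in the final log line can be derived from the timestamps of the first and last entries. The helper below is a small sketch that assumes every line begins with a bracketed `YYYY-MM-DD HH:MM:SS` timestamp, as in the sample above; the name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(first_line: str, last_line: str) -> int:
    """Return whole seconds between the timestamps of two log lines."""
    # Each line starts with "[", followed by a 19-character timestamp.
    start = datetime.strptime(first_line[1:20], TIMESTAMP_FORMAT)
    end = datetime.strptime(last_line[1:20], TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())
```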

Implementation Notes

Agent Delegation

This command is a thin wrapper that:

Verifies database connection
Checks that research tasks exist
Loads research configuration
Invokes the researcher agent
Displays the agent's output
Reports final statistics

The actual research logic lives in the researcher agent to keep concerns separated.

Idempotency

Re-running /bookstrap-research is safe:

Only processes unresolved gaps
Skips already-ingested sources (via URL deduplication)
Maintains priority order
Can be interrupted and resumed
No duplicate ingestion
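URL deduplication depends on comparing URLs in a canonical form. The sketch below shows one plausible normalization (lowercased scheme and host, no trailing slash, no fragment); the actual rules used by `ingest-file.py` are not specified here, and both helper names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form for duplicate detection."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string, since it can select content.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def is_new_source(url: str, seen: set[str]) -> bool:
    """Record the URL and report whether it had not been seen before."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

In practice the `seen` set would be backed by the source table in the database rather than held in memory, so that deduplication survives across runs.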

Separation of Concerns

Research mode vs. Write mode:

Research mode: Web access, fills gaps, commits per task
Write mode: Database only, flags gaps, commits per section

This command operates exclusively in research mode and is never invoked during writing. Writing discovers gaps but never fills them.

Advanced Features

Bibliography Mining

When ingesting academic sources:

Extract citations and references
Add cited works to research queue
Build citation graph
Track source quality via citation chains

Iterative Deepening

For complex topics:

First pass: Broad overview sources
Identify sub-topics needing depth
Second pass: Targeted deep dives
Mark topic as "exhausted" when a pass yields no new information

Cross-referencing

Validate facts across sources:

Flag contradictions between sources
Prefer higher-reliability sources in conflicts
Store conflicting claims for human review
Track consensus vs. outlier claims
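The cross-referencing rules above could be sketched as follows. The claim shape (`fact`, `value`, `reliability`) and the function name `cross_reference` are assumptions for illustration; how claims are actually keyed and compared would depend on the entity extraction output.

```python
from collections import defaultdict

RANK = {"low": 0, "medium": 1, "high": 2}


def cross_reference(claims: list[dict]) -> dict[str, dict]:
    """Group claims by fact key and flag facts whose sources disagree.

    Each claim is assumed to look like
    {"fact": "<key>", "value": "<claimed value>", "reliability": "high"}.
    """
    by_fact: dict[str, list[dict]] = defaultdict(list)
    for claim in claims:
        by_fact[claim["fact"]].append(claim)

    report = {}
    for fact, group in by_fact.items():
        values = {c["value"] for c in group}
        # Prefer the claim from the most reliable source (first one on ties).
        best = max(group, key=lambda c: RANK[c["reliability"]])
        report[fact] = {
            "preferred": best["value"],
            "contradiction": len(values) > 1,
            "all_values": sorted(values),  # kept for human review
        }
    return report
```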

Troubleshooting

No sources found

Gap: "Obscure technical detail from 1943"
Status: No sources found after 3 searches
Action: Flagged for human review

Possible solutions:

Rephrase research question
Broaden search terms
Lower the reliability threshold (min_reliability in bookstrap.config.json)
Provide sources manually via /bookstrap-ingest

All sources rejected

Gap: "Controversial historical claim"
Status: 8 sources found, all rejected (reliability too low)
Action: Flagged for human review