seo-team-the-researcher
SEO Keyword Researcher
Transforms a starting point — a topic, domain, competitor, or keyword list — into a structured, prioritized keyword map that downstream skills (seo-team-the-writer, seo-team-the-doctor, seo-team-the-general) can act on.
Prerequisites
- DataForSEO API key: this skill uses seocli, which sources all its data from the DataForSEO API. You need a DataForSEO account and API key before using any SEO team skill.
Pipeline Overview
CHECK STATE → SEED → EXPAND → ENRICH → CLUSTER → PRIORITIZE → MAP
Each stage feeds the next. The full pipeline produces a keyword map with clusters, opportunity scores, and recommended actions. For a quick pass on user-provided keywords, skip SEED and EXPAND.
Input Classification
Parse the user's request to determine seeding strategy:
| Input Type | Detection | Seeding Path |
|---|---|---|
| Topic | Subject without a domain ("keyword research for home brewing") | Path A: LLM brainstorming |
| Domain | URL the user controls ("research keywords for mysite.com") | Path B: Domain API calls |
| Competitors | One or more competitor domains | Path C: Competitor ranked keywords |
| Keyword list | User provides specific keywords | Path D: Skip seeding, go to EXPAND or ENRICH |
Inputs combine: "research home brewing for mysite.com vs competitor.com" uses A + B + C.
Configuration
Before any API calls, resolve location and language:
- Check workspace/seo/config.yaml for saved defaults
- Check if the user specified them in the request ("keywords in the UK")
- If neither: ask the user. Default suggestion: --location-code 2840 (US), --language-code en
Save resolved config:
# workspace/seo/config.yaml
domain: example.com
location_code: 2840
location_name: "United States"
language_code: "en"
language_name: "English"
competitors:
- competitor1.com
All seocli commands below require --location-code and --language-code (plus --location-name and --language-name for dataforseo-labs commands). Omitted for brevity — always include them.
For the complete command reference with all flags and batch limits, see reference/seocli-commands.md.
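As a minimal sketch of the "always include them" rule, the helper below builds the location and language flags from the saved config. The function name locale_flags and the in-memory dict are illustrative assumptions; only the flag names and config keys come from this document.

```python
def locale_flags(config: dict, labs: bool = False) -> list[str]:
    """Return the location/language flags to append to a seocli command.

    labs=True adds the name flags that dataforseo-labs commands also need.
    """
    flags = [
        "--location-code", str(config["location_code"]),
        "--language-code", config["language_code"],
    ]
    if labs:
        flags += [
            "--location-name", config["location_name"],
            "--language-name", config["language_name"],
        ]
    return flags


# Values mirror the example workspace/seo/config.yaml above.
config = {
    "location_code": 2840,
    "location_name": "United States",
    "language_code": "en",
    "language_name": "English",
}
print(locale_flags(config))
print(locale_flags(config, labs=True))
```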
Step 0: Check Shared State
Before running any pipeline stages, check for existing data:
- Keyword map: Load workspace/seo/keyword-map.json, extract relevant seeds, and skip re-researching existing keywords
- Competitor gaps: Check workspace/seo/competitor-gaps/{competitor}.json and reuse gap keywords instead of re-running domain-intersection
- Audit history: Check workspace/seo/audit-history/ for domain authority data, used later in PRIORITIZE for Personal Keyword Difficulty
Decision logic:
- Existing keyword data found → merge into seeds, skip to EXPAND
- Existing gap data found → load gap keywords, tag as "gap"
- No existing data → full pipeline from Stage 1
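The decision logic above could be sketched roughly as follows. This only checks whether the files exist; the function name and the shape of the returned plan are assumptions made for illustration, not part of the skill.

```python
from pathlib import Path


def plan_from_shared_state(workspace: Path, competitors: list[str]) -> dict:
    """Decide where the pipeline starts based on files already in the workspace."""
    has_map = (workspace / "keyword-map.json").is_file()
    # Competitors whose gap analysis is already cached on disk.
    cached_gaps = [
        c for c in competitors
        if (workspace / "competitor-gaps" / f"{c}.json").is_file()
    ]
    return {
        # Existing keyword data: merge into seeds and skip to EXPAND.
        "start_stage": "EXPAND" if has_map else "SEED",
        "merge_existing_keywords": has_map,
        # Gap keywords for these competitors are loaded and tagged "gap".
        "cached_gap_competitors": cached_gaps,
    }


print(plan_from_shared_state(Path("workspace/seo"), ["competitor1.com"]))
```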
Stage 1: SEED
Goal: Generate 30–100 initial seed keywords.
Path A: Topic-Based (no API calls)
Brainstorm seeds across these angles:
- Core terms — head keywords and spelling variants
- Problem-focused — what problems does this solve?
- Solution-focused — what solutions does it offer?
- Audience segments — who searches for this?
- Modifiers — append to core terms: best, top, how to, guide, tutorial, vs, alternative, [current year], for beginners, for [audience]
- Question variants — who/what/where/when/why/how for each core term
Target: 50–100 seeds. Don't filter for quality yet.
Path B: Domain-Based (3 API calls)
# Keywords associated with the domain
seocli keywords-data google-ads keywords-for-site live \
--target example.com --sort-by search_volume --limit 200
# Topical footprint
seocli dataforseo-labs google categories-for-domain \
--target example.com --include-subcategories --limit 20
# Organic competitors (feed into Path C)
seocli dataforseo-labs google competitors-domain \
--target example.com --limit 10 --exclude-top-domains
Use top keywords from call 1 as seeds. Use categories from call 2 to brainstorm adjacent topics. Use competitors from call 3 as input to Path C.
Path C: Competitor-Based
First: Check workspace/seo/competitor-gaps/{competitor}.json. If gap data exists, load directly — skip the API calls below.
If no existing data:
# Per competitor: top ranked keywords
seocli dataforseo-labs google ranked-keywords \
--target competitor1.com --limit 200 \
--order-by "keyword_data.keyword_info.search_volume,desc"
# Gap analysis: what they rank for that you don't
seocli dataforseo-labs google domain-intersection \
--target1 competitor1.com --target2 example.com --limit 200 \
--order-by "keyword_data.keyword_info.search_volume,desc"
Filter intersection results for keywords where the competitor ranks and the user doesn't. Tag these as "gap" keywords.
Save results to workspace/seo/competitor-gaps/{competitor}.json for reuse by seo-team-the-general.
API cost: 1–2 calls per competitor (up to 5 competitors).
Path D: User-Provided Keywords
Pass directly to EXPAND or ENRICH depending on whether the user wants expansion.
Seed Output
Deduplicated list tagged with source:
[
{
"keyword": "home brewing kit",
"source": "brainstorm",
"angle": "solution"
},
{
"keyword": "ipa recipe home brew",
"source": "gap",
"competitor": "competitor1.com"
}
]
Stage 2: EXPAND
Goal: Turn 50–100 seeds into 200–1,000 unique candidates.
Method 1: Related keywords (primary engine)
seocli keywords-data google-ads keywords-for-keywords live \
--keywords "seed1" --keywords "seed2" --keywords "seed3" \
--sort-by search_volume
Batch up to ~10 keywords per call. Returns Google Ads keyword suggestions.
Method 2: Category-level ideas
seocli dataforseo-labs google keyword-ideas \
--keywords "seed1" --keywords "seed2" \
--include-serp-info --include-clickstream-data --limit 500
Broader discovery. --include-serp-info captures SERP feature data early (reuse in CLUSTER).
Method 3: SERP mining (5–10 representative seeds)
seocli serp google organic live \
--keyword "seed keyword" --depth 10 --device desktop
Extract People Also Ask questions and related searches as additional candidates. Note SERP features for later use.
Method 4: Programmatic long-tail (zero API cost)
For every core seed, generate variants by prepending question prefixes ("what is", "how to", "why does") and appending commercial modifiers, specificity terms, temporal modifiers, and format terms.
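A small sketch of this expansion, assuming one suffix list. The question prefixes are the ones named above; the suffixes are illustrative picks from the Path A modifier list, and the exact commercial, specificity, temporal, and format terms are left to the operator.

```python
from datetime import date
from typing import Optional

QUESTION_PREFIXES = ("what is", "how to", "why does")
# Illustrative suffixes only; swap in the modifier sets that fit the topic.
ILLUSTRATIVE_SUFFIXES = ("guide", "tutorial", "for beginners", "alternative")


def long_tail_variants(seed: str, year: Optional[int] = None) -> list[str]:
    """Generate long-tail candidates for one core seed without any API call."""
    year = year if year is not None else date.today().year
    variants = [f"{prefix} {seed}" for prefix in QUESTION_PREFIXES]
    variants += [f"{seed} {suffix}" for suffix in ILLUSTRATIVE_SUFFIXES]
    variants.append(f"{seed} {year}")  # temporal modifier
    return variants


for candidate in long_tail_variants("home brewing", year=2024):
    print(candidate)
```

Some generated variants will be ungrammatical ("how to home brewing"). They are kept here because the ENRICH stage filters by volume later.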
Deduplication
- Lowercase all keywords
- Remove exact duplicates
- Normalize near-duplicates (whitespace, hyphens, compound forms) — keep the form with highest volume if known
- Remove obviously irrelevant results (seed topic words absent AND not from competitor gap data)
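The first three deduplication rules might look like the sketch below. The comparison key (lowercase, with hyphens and whitespace removed) is one possible heuristic for near-duplicates, not a rule taken from the reference files, and the volumes mapping is an assumed input that may be empty before ENRICH has run.

```python
import re
from typing import Optional


def normalize_key(keyword: str) -> str:
    """Comparison key: lowercase with hyphens and whitespace stripped out."""
    return re.sub(r"[\s\-]+", "", keyword.lower())


def dedupe(keywords: list[str], volumes: Optional[dict] = None) -> list[str]:
    """Collapse exact and near-duplicates, keeping the highest-volume form."""
    volumes = volumes or {}
    best: dict[str, str] = {}
    for raw in keywords:
        kw = " ".join(raw.lower().split())  # lowercase, collapse whitespace
        key = normalize_key(kw)
        current = best.get(key)
        # Unknown volume counts as -1, so the first-seen form wins ties.
        if current is None or volumes.get(kw, -1) > volumes.get(current, -1):
            best[key] = kw
    return list(best.values())


print(dedupe(
    ["Home Brew", "home-brew", "homebrew", "IPA recipe", "ipa  recipe"],
    volumes={"homebrew": 900, "home brew": 400},
))
```

The relevance filter (seed topic words absent and not from gap data) needs the seed list and source tags, so it is left out of this sketch.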
API cost: ~10–20 calls total.
Stage 3: ENRICH
Goal: Add volume, difficulty, CPC, intent, and funnel-stage data to every keyword.
Volume and CPC
# Up to 700 keywords per call
seocli keywords-data google-ads search-volume live \
--keywords "kw1" --keywords "kw2" ... --sort-by search_volume
Extract per keyword: search_volume, cpc, competition, competition_level, monthly_searches (12-month array).
Keyword Difficulty
# Up to 1,000 keywords per call
seocli dataforseo-labs google bulk-keyword-difficulty \
--keywords "kw1" --keywords "kw2" ...
Returns keyword_difficulty (0–100). Note: DataForSEO KD runs higher than Ahrefs/Semrush — a "30" here ≈ "20" in Ahrefs.
Intent, Funnel Stage, Trends, and Zero-Click Risk
For detailed classification rules, scoring formulas, and trend detection logic, see reference/scoring-and-classification.md.
Summary:
- Intent: Rule-based first (questions → informational, "buy/price" → transactional, "best/top" → commercial, brands → navigational). Verify ambiguous cases against SERP data.
- Funnel stage: Informational → ToFu, Commercial → MoFu, Transactional → BoFu, Navigational → navigational.
- Trends: Compare last 3 months avg to previous 3 months avg from monthly_searches. >20% change → rising/declining.
- Zero-click risk: Flag keywords where AI Overviews or featured snippets fully answer the query. Apply 0.5× volume multiplier in opportunity scoring.
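The summary rules above can be approximated as below. Several choices are assumptions: the rule precedence (brand, then transactional, then commercial, then question), the "ambiguous" fallback for cases that need SERP verification, and treating monthly_searches as a plain list of volumes ordered oldest to newest. The authoritative rules live in reference/scoring-and-classification.md.

```python
QUESTION_WORDS = {"who", "what", "where", "when", "why", "how"}
TRANSACTIONAL_WORDS = {"buy", "price"}
COMMERCIAL_WORDS = {"best", "top"}
FUNNEL_STAGE = {
    "informational": "ToFu",
    "commercial": "MoFu",
    "transactional": "BoFu",
    "navigational": "navigational",
}


def classify_intent(keyword: str, brands: frozenset = frozenset()) -> str:
    """Rule-based intent; 'ambiguous' means verify against SERP data."""
    words = keyword.lower().split()
    if any(w in brands for w in words):  # brands: lowercase brand tokens
        return "navigational"
    if any(w in TRANSACTIONAL_WORDS for w in words):
        return "transactional"
    if any(w in COMMERCIAL_WORDS for w in words):
        return "commercial"
    if words and words[0] in QUESTION_WORDS:
        return "informational"
    return "ambiguous"


def classify_trend(monthly_searches: list, threshold: float = 0.20) -> str:
    """Compare the last 3 months' average with the previous 3 months' average."""
    if len(monthly_searches) < 6:
        return "unknown"
    recent = sum(monthly_searches[-3:]) / 3
    previous = sum(monthly_searches[-6:-3]) / 3
    if previous == 0:
        return "rising" if recent > 0 else "stable"
    change = (recent - previous) / previous
    if change > threshold:
        return "rising"
    if change < -threshold:
        return "declining"
    return "stable"


intent = classify_intent("best home brewing kit")
print(intent, FUNNEL_STAGE[intent], classify_trend([100] * 9 + [130] * 3))
```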
Post-Enrichment Filtering
Remove only: zero-volume keywords with no trend signal (unless gap keywords the user wants). Do NOT aggressively filter — low-volume keywords can be valuable as cluster supporting content.
API cost: 2–4 calls for a typical 500-keyword list.
Stage 4: CLUSTER
Goal: Group keywords into content clusters — sets of keywords a single page should target. Prevents cannibalization and maximizes per-page keyword coverage.
For the full clustering algorithm (SERP similarity method, completeness scoring formula, content format inference table), see reference/clustering-guide.md.
Algorithm Summary
- Select candidates: Sort by volume descending, take top 30–50 as cluster candidates
- Tentative assignment: Assign remaining keywords to nearest candidate by textual similarity
- SERP similarity check: For candidate pairs with textual overlap, pull SERPs and compare top-10 URLs
  - 3+ shared URLs → same cluster
  - 2 shared URLs → likely same cluster if textually similar
  - 0–1 shared → different clusters
- Merge and assign: Merge overlapping candidates, assign remaining keywords to clusters
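The SERP similarity thresholds can be expressed as a short check. This sketch assumes each SERP is already reduced to an ordered list of result URLs, and that textual similarity has been decided elsewhere and is passed in as a boolean.

```python
def serp_overlap(urls_a: list[str], urls_b: list[str]) -> int:
    """Count URLs shared between the top-10 results of two SERPs."""
    return len(set(urls_a[:10]) & set(urls_b[:10]))


def same_cluster(urls_a: list[str], urls_b: list[str], textually_similar: bool) -> bool:
    """Apply the 3+ / 2 / 0-1 shared-URL thresholds."""
    shared = serp_overlap(urls_a, urls_b)
    if shared >= 3:
        return True
    if shared == 2:
        return textually_similar
    return False


serp_a = [f"https://site{i}.example/" for i in range(10)]
serp_b = serp_a[:3] + [f"https://other{i}.example/" for i in range(7)]
print(serp_overlap(serp_a, serp_b), same_cluster(serp_a, serp_b, False))
```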
Cluster Metadata
Each cluster gets: pillar keyword (highest volume), supporting keywords, total volume, average difficulty, dominant intent/funnel stage, recommended content format (inferred from SERP), SERP features, keyword count, and a completeness score (0–1).
Completeness status flags:
- needs_expansion (<3 keywords)
- ready_for_content (5+ keywords, mixed difficulty, good volume)
- monitor (between states)
API cost: 20–50 SERP calls. Control cost by capping at 50 SERP calls, reusing cached SERP data from Stage 2, and stopping pairwise comparison when clusters stabilize.
Stage 5: PRIORITIZE
Goal: Score and rank clusters so the user knows what to work on first.
Opportunity Score
Opportunity = (total_cluster_volume × intent_weight × zero_click_adj) / (avg_difficulty × pkd_ratio) × relevance
| Component | Values |
|---|---|
| Intent weights | Informational: 1.0, Commercial: 2.0, Transactional: 3.0, Navigational: 0.5 |
| Zero-click adjustment | 0.5 if AI Overview fully answers, else 1.0 |
| PKD ratio | user_DR / avg_DR_of_top_10 if domain authority known, else 1.0 |
| Relevance | Default 1.0. Ask user if they have priority topics to boost. |
For full formula details including Personal Keyword Difficulty, see reference/scoring-and-classification.md.
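A direct transcription of the formula, as a sketch. It evaluates the expression left to right, (numerator / denominator) × relevance, exactly as written above. The guard against a non-positive denominator is an added assumption, since the formula does not say how to treat an average difficulty of 0.

```python
INTENT_WEIGHTS = {
    "informational": 1.0,
    "commercial": 2.0,
    "transactional": 3.0,
    "navigational": 0.5,
}


def opportunity_score(
    total_cluster_volume: float,
    intent: str,
    avg_difficulty: float,
    ai_overview_answers: bool = False,
    pkd_ratio: float = 1.0,  # 1.0 when domain authority is unknown
    relevance: float = 1.0,
) -> float:
    """Opportunity score for one cluster, following the table above."""
    zero_click_adj = 0.5 if ai_overview_answers else 1.0
    denominator = avg_difficulty * pkd_ratio
    if denominator <= 0:
        raise ValueError("avg_difficulty * pkd_ratio must be positive")
    numerator = total_cluster_volume * INTENT_WEIGHTS[intent] * zero_click_adj
    return numerator / denominator * relevance


# 4,000 searches/month, commercial intent, average KD 25.
print(opportunity_score(4000, "commercial", 25))                            # 320.0
print(opportunity_score(4000, "commercial", 25, ai_overview_answers=True))  # 160.0
```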
Tier Assignment
| Tier | Criteria | Timeline |
|---|---|---|
| Quick Wins | KD < 30, volume > 100/mo | Weeks |
| Growth | KD 30–60, volume > 500/mo | 1–3 months |
| Long-term Bets | KD > 60, volume > 2,000/mo | 6+ months |
| Low Priority | KD > 60, volume < 500/mo | Deprioritize |
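The tier table maps to a simple lookup. Note that the table does not cover every combination (for example KD 45 with 300 searches/month, or KD above 60 with 500 to 2,000 searches/month). The "Unassigned" fallback here is an assumption for illustration, not a tier defined by the skill.

```python
def assign_tier(kd: float, volume: float) -> str:
    """Map keyword difficulty and monthly volume to a tier from the table."""
    if kd < 30 and volume > 100:
        return "Quick Wins"
    if 30 <= kd <= 60 and volume > 500:
        return "Growth"
    if kd > 60 and volume > 2000:
        return "Long-term Bets"
    if kd > 60 and volume < 500:
        return "Low Priority"
    return "Unassigned"  # combination not covered by the table


for kd, volume in [(20, 400), (45, 1200), (75, 5000), (75, 200), (45, 300)]:
    print(kd, volume, assign_tier(kd, volume))
```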
Special Flags
For each top-20 cluster, check and flag:
- AI Overview opportunity: Run seocli serp google ai-mode live --keyword "[pillar]" and note format and cited sources
- Video opportunity: Video results in SERP top 10
- Featured snippet: Structure content for snippet capture
- PAA presence: Include FAQ section addressing those questions
- Existing ranking: Cross-reference user's domain rankings via seocli dataforseo-labs google ranked-keywords --target example.com --limit 500
  - Positions 1–3: Defend
  - Positions 4–20: Optimize (high-ROI striking distance)
  - Positions 21+: Evaluate for rewrite
  - Not ranking: Create new content
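The position bands for existing rankings reduce to a small function. This sketch assumes the user's best organic position for the pillar keyword has already been extracted from the ranked-keywords output, with None meaning the domain does not rank.

```python
from typing import Optional


def ranking_action(position: Optional[int]) -> str:
    """Translate the user's current position into the recommended action."""
    if position is None:
        return "Create new content"
    if position <= 3:
        return "Defend"
    if position <= 20:
        return "Optimize"  # striking distance
    return "Evaluate for rewrite"


for pos in (2, 11, 34, None):
    print(pos, ranking_action(pos))
```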
API cost: 10–25 calls.
Stage 6: MAP
Goal: Produce the final keyword map — the actionable output.
Keyword Map Table
| Cluster | Pillar KW | Supporting KWs | Intent | Total Vol | Avg KD | Tier | Format | Target URL | Action | Score | Flags |
|---|---|---|---|---|---|---|---|---|---|---|---|
- Target URL: Existing page on user's domain ranking for cluster keywords. "—" if none.
- Action: "optimize" (page exists), "create" (no page), "consolidate" (multiple pages compete = cannibalization)
- Flags: AI Overview, Video, Snippet, PAA, Shopping, Seasonal, Rising, Gap
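One way to derive the Action column, assuming the input is the list of the user's own URLs that rank for keywords in the cluster (a hypothetical intermediate, not a field defined by the skill):

```python
def map_action(ranking_urls: list[str]) -> str:
    """Choose the keyword-map action from the user's pages ranking for a cluster."""
    unique_pages = set(ranking_urls)
    if not unique_pages:
        return "create"       # no page targets this cluster yet
    if len(unique_pages) == 1:
        return "optimize"     # exactly one page exists
    return "consolidate"      # several pages compete (cannibalization)


print(map_action([]))
print(map_action(["https://example.com/kits"]))
print(map_action(["https://example.com/kits", "https://example.com/blog/kits"]))
```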
Supporting Outputs
- Keyword Universe Spreadsheet — every keyword with all enrichment data, flat
- Cluster Architecture — visual tree showing pillar → cluster → sub-cluster relationships
- Opportunity Brief — top 10 Quick Wins, top 10 Growth, top 5 Long-term, top 5 AI Overview opportunities
- Competitor Gap Report (if competitors analyzed) — gap keywords with volume, KD, competitor ranking URL, user status
- Content Calendar Suggestion — Week 1–2: Quick wins, Week 3–4: First growth piece, Month 2: Growth + optimize striking-distance, Month 3+: Long-term pillar content
Next Actions
Include explicit handoff directives:
{
"next_actions": [
{
"skill": "seo-team-the-writer",
"action": "Create content for cluster C-002 (Quick Win)",
"priority": 1
},
{
"skill": "seo-team-the-doctor",
"action": "Audit striking-distance pages for clusters C-001, C-005",
"priority": 2
},
{
"skill": "seo-team-the-general",
"action": "Analyze competitor gaps — 30 gap keywords identified",
"priority": 3
}
]
}
Data Persistence
All outputs persist to workspace/seo/:
workspace/seo/
├── config.yaml # domain, location, language, competitors
├── keyword-map.json # master keyword map (Stage 6 output)
├── keyword-universe.json # all keywords with enrichment data
├── clusters.json # cluster definitions with metadata
├── research-runs/
│ └── YYYY-MM-DD-{topic-slug}.json # timestamped run metadata
└── competitor-gaps/
└── {competitor-domain}.json # per-competitor gap analysis
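For the research-runs/ naming pattern, a filename could be built like this. The slug rule (lowercase, non-alphanumerics collapsed to single hyphens) is an assumption; the skill only specifies the YYYY-MM-DD-{topic-slug}.json shape.

```python
import re
from datetime import date


def run_filename(topic: str, run_date: date) -> str:
    """Build a research-runs/ filename in the YYYY-MM-DD-{topic-slug}.json form."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"{run_date.isoformat()}-{slug}.json"


print(run_filename("Home Brewing!", date(2024, 5, 1)))  # 2024-05-01-home-brewing.json
```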
Incremental Updates
The keyword map is a living document. On subsequent runs:
- Load existing keyword-map.json
- Merge new keywords into existing clusters (don't create duplicates)
- Update volume/difficulty data
- Add new clusters for genuinely new topics
- Preserve user annotations (relevance overrides, priority boosts)
- Timestamp in research-runs/
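A rough sketch of the merge step. The schema of keyword-map.json is not given in this document, so the structure here is hypothetical: a dict of cluster id → {"keywords": {keyword: metrics}, "annotations": {...}}. Adapt the field names to the real file.

```python
def merge_run(existing: dict, new: dict) -> dict:
    """Merge a new research run into the existing map without mutating either."""
    merged = {
        cid: {**cluster, "keywords": dict(cluster.get("keywords", {}))}
        for cid, cluster in existing.items()
    }
    for cid, cluster in new.items():
        if cid not in merged:
            # Genuinely new topic: add the cluster as-is.
            merged[cid] = {**cluster, "keywords": dict(cluster.get("keywords", {}))}
            continue
        # Keyed by keyword, so re-seen keywords refresh their metrics
        # instead of creating duplicates; new keywords are appended.
        merged[cid]["keywords"].update(cluster.get("keywords", {}))
        # "annotations" on the existing cluster is left untouched.
    return merged


existing = {
    "C-001": {
        "keywords": {"home brewing kit": {"volume": 1000, "difficulty": 30}},
        "annotations": {"relevance": 1.5},
    }
}
new = {
    "C-001": {"keywords": {
        "home brewing kit": {"volume": 1200, "difficulty": 28},
        "beer brewing kit": {"volume": 300, "difficulty": 20},
    }},
    "C-002": {"keywords": {"ipa recipe": {"volume": 500, "difficulty": 15}}},
}
print(merge_run(existing, new))
```

Writing the timestamped run record is a separate step and is not shown.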
Cross-Skill Consumption
- seo-team-the-writer reads keyword-map.json for clusters needing content, clusters.json for brief data
- seo-team-the-doctor reads keyword-map.json to cross-reference pages against target keywords
- seo-team-the-general reads everything: keyword map, competitor gaps, cluster architecture
Cost Control
| Rule | Detail |
|---|---|
| Never re-research | Check keyword-universe.json before expanding |
| Batch aggressively | search-volume: 700/call, bulk-difficulty: 1,000/call |
| Reuse SERP data | Cache Stage 2 SERPs for Stage 4 clustering |
| Confirm large runs | If expanded list > 500 keywords, show estimated cost before enrichment |
| Cap SERP sampling | Max 50 SERP calls for clustering; use textual similarity for remainder |
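The batching and confirmation rules can be sketched as below, using the per-call limits from the table. The helper names are illustrative, and the call count covers only the two ENRICH endpoints.

```python
import math

# Per-call keyword limits from the table above.
BATCH_LIMITS = {"search-volume": 700, "bulk-keyword-difficulty": 1000}
CONFIRM_THRESHOLD = 500  # expanded lists larger than this need user confirmation


def chunk(keywords: list[str], size: int) -> list[list[str]]:
    """Split a keyword list into batches of at most `size` items."""
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]


def enrich_call_count(n_keywords: int) -> int:
    """Estimated ENRICH calls: one pass per endpoint at its batch limit."""
    return sum(math.ceil(n_keywords / limit) for limit in BATCH_LIMITS.values())


def needs_confirmation(n_keywords: int) -> bool:
    return n_keywords > CONFIRM_THRESHOLD


keywords = [f"keyword {i}" for i in range(1500)]
print(len(chunk(keywords, BATCH_LIMITS["search-volume"])))  # 3 batches
print(enrich_call_count(len(keywords)), needs_confirmation(len(keywords)))
```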
Typical Cost
| Stage | Calls |
|---|---|
| Seed (domain) | 3 |
| Seed (competitors) | 2–10 |
| Expand | 10–20 |
| Enrich | 2–4 |
| Cluster | 20–50 |
| Prioritize | 10–25 |
| Total | ~50–100 |
Error Handling
| Error | Response |
|---|---|
| API rate limit | Wait, retry with backoff, inform user |
| Keywords return 0 volume | Keep in list, flag "low-data" |
| SERP returns empty | Skip SERP clustering for that keyword, fall back to textual |
| Location/language unsupported | Suggest nearest supported alternative |
| Keyword list > 2,000 | Warn about cost, suggest filtering to top 1,000 first |
| Corrupt/missing keyword-map.json | Start fresh |
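Two of these rows lend themselves to a short sketch: starting fresh when keyword-map.json is missing or corrupt, and retrying with backoff on rate limits. RateLimitError is a hypothetical exception standing in for however the seocli wrapper reports a rate limit, and an empty dict is assumed to represent a fresh map.

```python
import json
import time
from pathlib import Path


class RateLimitError(Exception):
    """Hypothetical: raised by the caller's seocli wrapper on a rate limit."""


def load_keyword_map(path: Path) -> dict:
    """Load the keyword map, starting fresh if it is missing or corrupt."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def with_backoff(call, retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Run `call`, retrying on RateLimitError with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == retries:
                raise  # out of retries: surface the error to the user
            sleep(base_delay * 2 ** attempt)


print(load_keyword_map(Path("workspace/seo/keyword-map.json")))
```

Passing sleep as a parameter keeps the retry loop testable without real delays.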