SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

Robots.txt: Review Disallow/Allow; avoid blocking important pages
Crawler access: Ensure crawlers (including AI crawlers) can access key pages
Indexing: Misconfigured robots.txt can block indexing; verify no accidental blocks

Initial Assessment

Check for product marketing context first: If .claude/product-marketing-context.md or .cursor/product-marketing-context.md exists, read it for site URL and indexing goals.

Identify:

Site URL: Base domain (e.g., https://example.com)
Indexing scope: Full site, partial, or specific paths to exclude
AI crawler strategy: Allow search/indexing vs. block training data crawlers

Best Practices

Purpose and Limitations

Point	Note
Purpose	Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet)
No-index	Use noindex meta or auth for sensitive content; robots.txt is publicly readable
Indexed vs non-indexed	Not all content should be indexed. robots.txt and noindex complement each other: robots for path-level crawl control, noindex for page-level indexing. See seo-technical-indexing
Advisory	Rules are advisory; malicious crawlers may ignore

Location and Format

Item	Requirement
Path	Site root: `https://example.com/robots.txt`
Encoding	UTF-8 plain text
Standard	RFC 9309 (Robots Exclusion Protocol)

Core Directives

Directive	Purpose	Example
`User-agent:`	Target crawler	`User-agent: Googlebot`, `User-agent: *`
`Disallow:`	Block path prefix	`Disallow: /admin/`
`Allow:`	Allow path (can override Disallow)	`Allow: /public/`
`Sitemap:`	Declare sitemap absolute URL	`Sitemap: https://example.com/sitemap.xml`
`Clean-param:`	Strip query params (Yandex)	See below

Critical: Do Not Block Rendering Resources

Do not block CSS, JS, images; Google needs them to render pages
Only block paths that don't need crawling: admin, API, temp files

AI Crawler Strategy

User-agent	Purpose	Typical
OAI-SearchBot	ChatGPT search	Allow
GPTBot	OpenAI training	Disallow
Claude-SearchBot	Claude search	Allow
ClaudeBot	Anthropic training	Disallow
PerplexityBot	Perplexity search	Allow
Google-Extended	Gemini training	Disallow
CCBot	Common Crawl	Disallow

Clean-param (Yandex)

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid

Output Format

Current state (if auditing)
Recommended robots.txt (full file)
Compliance checklist
References: Google robots.txt

Related Skills

seo-technical-sitemap: Sitemap URL to reference in robots.txt
seo-technical-crawlability: Broader crawl and structure guidance

seo-technical-robots