---
name: ai-content-discovery
---

# AI Content Discovery

Fixes Category 1 (AI Content Discovery, 30% weight) issues from IsAgentReady.com. This category checks whether AI systems can find, crawl, and understand your website's content. It evaluates 7 checkpoints worth 100 points total.
## When to Use
- Fixing robots.txt issues (missing, wrong content type, blocking bots)
- Adding or updating AI crawler directives (GPTBot, ClaudeBot, etc.)
- Creating or fixing XML sitemaps
- Creating llms.txt or llms-full.txt files
- Removing restrictive meta robots tags (noindex, noai)
- Fixing WAF/CDN bot blocking issues
- Adding content freshness signals (dateModified, article:modified_time)
- Any task to "improve AI discoverability" or "make site crawlable by AI"
## When NOT to Use

- Adding structured data / JSON-LD (use the `structured-data` skill)
- Fixing semantic HTML or heading hierarchy (use the `content-semantics` skill)
- Setting up agent protocols like WebMCP or A2A (use the `agent-protocols` skill)
- Configuring security headers like CSP or HSTS (use the `security-trust` skill)
## Checkpoints Overview
| ID | Checkpoint | Max Points | What It Tests |
|---|---|---|---|
| 1.8 | HTTP bot accessibility | 15 | Page returns HTTP 200-299 (not 401/403 from WAF) |
| 1.1 | robots.txt present | 15 | /robots.txt returns 200 with text/plain Content-Type |
| 1.2 | AI crawler directives | 15 | Allow/Disallow rules for 13 AI user-agents in robots.txt |
| 1.3 | XML Sitemap | 15 | Valid XML sitemap with <urlset> or <sitemapindex> |
| 1.4 | llms.txt | 15 | /llms.txt with markdown heading + URLs; bonus for /llms-full.txt |
| 1.5 | Meta robots / X-Robots-Tag | 15 | No restrictive directives (noindex, noai, noimageai) |
| 1.6 | Content freshness signals | 10 | dateModified in JSON-LD, article:modified_time, or Last-Modified |
## Checkpoint 1.8: HTTP Bot Accessibility (15 pts)

**What passes:** HTTP status 200-299.
**What fails:** HTTP 401 or 403 (WAF/CDN blocking bots).
### Fix Workflow

1. **Diagnose** — test with an AI crawler user-agent:

   ```bash
   curl -sI -A "Mozilla/5.0 (compatible; GPTBot/1.0)" https://example.com/
   curl -sI -A "Mozilla/5.0 (compatible; ClaudeBot/1.0)" https://example.com/
   ```

2. **If blocked by Cloudflare** — create a WAF exception:

   ```
   # Dashboard -> Security -> WAF -> Custom Rules -> Create rule:
   #   Field: User Agent | Operator: contains | Value: GPTBot
   #   Action: Skip remaining rules
   #
   # Repeat for ClaudeBot, Amazonbot, ChatGPT-User, etc.
   ```

3. **If blocked by Nginx rate limiting** — allow AI user-agents. An `if` block cannot skip `limit_req`; the idiomatic fix is to rate-limit on a key that is empty for AI bots, since nginx skips limiting when the key is empty:

   ```nginx
   # Classify AI crawler user-agents
   map $http_user_agent $is_ai_bot {
       default       0;
       "~*GPTBot"    1;
       "~*ClaudeBot" 1;
       "~*Amazonbot" 1;
       "~*ChatGPT"   1;
   }

   # Empty key for AI bots -> they bypass the rate limit
   map $is_ai_bot $limit_key {
       0 $binary_remote_addr;
       1 "";
   }

   limit_req_zone $limit_key zone=general:10m rate=10r/s;

   server {
       location / {
           limit_req zone=general burst=20;
       }
   }
   ```

4. **If blocked by Apache** — allow in `.htaccess`:

   ```apache
   RewriteEngine On
   RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Amazonbot) [NC]
   RewriteRule ^ - [L]
   ```

5. **Verify** — re-test with curl to confirm a 200 response; a loop that checks several agents at once is sketched after this list.
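A minimal verification sketch, assuming `curl` is available; the agent strings are illustrative, and the agent list can be extended to all 13 crawlers:

```bash
# Print the HTTP status code each AI user-agent receives for a URL.
URL="https://example.com/"
for UA in GPTBot ClaudeBot Amazonbot ChatGPT-User PerplexityBot; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -A "Mozilla/5.0 (compatible; ${UA}/1.0)" "$URL")
  echo "${UA}: ${STATUS}"
done
```

Anything outside 200-299 (especially 401 or 403) points at a WAF or rate-limit rule that needs an exception.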
## Checkpoint 1.1: robots.txt Present (15 pts)

**What passes:** `/robots.txt` returns HTTP 200 with `Content-Type: text/plain`.
**What fails:** Missing file (404), HTML error page served, or wrong Content-Type.
### Fix Workflow

1. **Check current state:**

   ```bash
   curl -sI https://example.com/robots.txt | head -20
   ```

2. **Create `/robots.txt` at your web root** (a scripted version is sketched after this list):

   ```
   User-agent: *
   Allow: /

   User-agent: GPTBot
   Allow: /

   User-agent: ClaudeBot
   Allow: /

   User-agent: Google-Extended
   Allow: /

   Sitemap: https://example.com/sitemap.xml
   ```

3. **Ensure correct Content-Type** — must return `text/plain`. Nginx: `default_type text/plain;` in the location block. Apache: `ForceType text/plain` in a `<Files>` directive.

4. **Verify:**

   ```bash
   curl -sI https://example.com/robots.txt | grep -i content-type
   # Expected: Content-Type: text/plain
   ```
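If you manage the server directly, a sketch that writes the starter file; the web root `/var/www/html` is an assumption, so substitute your own path:

```bash
# Write a minimal robots.txt to the web root (adjust the path for your server).
cat > /var/www/html/robots.txt <<'EOF'
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
EOF
```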
See references/robots-txt-guide.md for complete robots.txt syntax and rules.
## Checkpoint 1.2: AI Crawler Directives (15 pts)

**What passes (15 pts):** all 13 AI crawlers explicitly allowed, some allowed with none blocked, or a wildcard `Allow: /` with none blocked.
**Partial credit:** no AI crawlers mentioned but the default allow applies (10 pts); mixed policies with some blocked (7 pts).
**What fails (0 pts):** all AI crawlers explicitly disallowed.
### The 13 AI User-Agents
| User-Agent | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data crawling |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT |
| OAI-SearchBot | OpenAI | SearchGPT results |
| ClaudeBot | Anthropic | Training data crawling |
| Claude-User | Anthropic | Real-time browsing in Claude |
| Claude-SearchBot | Anthropic | Claude search results |
| Google-Extended | Google | Gemini AI training |
| Amazonbot | Amazon | Alexa/AI training |
| Bytespider | ByteDance | TikTok/AI training |
| CCBot | Common Crawl | Open dataset crawling |
| PerplexityBot | Perplexity | AI search results |
| Applebot-Extended | Apple | Apple Intelligence training |
| meta-externalagent | Meta | Meta AI training |
### Fix Workflow

1. **Check current directives:**

   ```bash
   curl -s https://example.com/robots.txt
   ```

2. **Add explicit Allow directives** for each AI crawler to your robots.txt (a loop that generates these blocks is sketched after this list):

   ```
   # AI Crawlers — explicitly allow (one block per agent)
   User-agent: GPTBot
   Allow: /

   User-agent: ChatGPT-User
   Allow: /

   User-agent: OAI-SearchBot
   Allow: /

   User-agent: ClaudeBot
   Allow: /

   User-agent: Claude-User
   Allow: /

   User-agent: Claude-SearchBot
   Allow: /

   User-agent: Google-Extended
   Allow: /

   User-agent: Amazonbot
   Allow: /

   User-agent: Bytespider
   Allow: /

   User-agent: CCBot
   Allow: /

   User-agent: PerplexityBot
   Allow: /

   User-agent: Applebot-Extended
   Allow: /

   User-agent: meta-externalagent
   Allow: /
   ```

3. **If you want to allow all crawlers** — a simple wildcard also works:

   ```
   User-agent: *
   Allow: /
   ```

4. **If you want selective control** — allow some, block others:

   ```
   # Allow search-oriented AI crawlers
   User-agent: ChatGPT-User
   Allow: /

   User-agent: PerplexityBot
   Allow: /

   # Block training-oriented crawlers
   User-agent: GPTBot
   Disallow: /

   User-agent: CCBot
   Disallow: /
   ```
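Rather than typing the 13 blocks by hand, a small sketch can generate them; it assumes a POSIX shell and appends to a `robots.txt` in the current directory:

```bash
# Emit an "Allow: /" block for each of the 13 AI user-agents.
AGENTS="GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Claude-User Claude-SearchBot
Google-Extended Amazonbot Bytespider CCBot PerplexityBot Applebot-Extended
meta-externalagent"
for UA in $AGENTS; do
  printf 'User-agent: %s\nAllow: /\n\n' "$UA"
done >> robots.txt
```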
See references/robots-txt-guide.md for full syntax and AI user-agent details.
## Checkpoint 1.3: XML Sitemap (15 pts)

**What passes:** valid XML sitemap found at a discoverable URL with `<urlset>` or `<sitemapindex>`.
**What fails:** no sitemap found, or sitemap is not valid XML.

The scanner checks these locations in order:

1. URLs from `Sitemap:` directives in robots.txt
2. `/sitemap.xml`
3. `/sitemap_index.xml`
### Fix Workflow

1. **Check if a sitemap exists:**

   ```bash
   curl -sI https://example.com/sitemap.xml | head -5
   curl -s https://example.com/robots.txt | grep -i sitemap
   ```

2. **Create `/sitemap.xml`:**

   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
       <loc>https://example.com/</loc>
       <lastmod>2025-01-15</lastmod>
       <changefreq>weekly</changefreq>
       <priority>1.0</priority>
     </url>
     <url>
       <loc>https://example.com/about</loc>
       <lastmod>2025-01-10</lastmod>
     </url>
   </urlset>
   ```

3. **For large sites, use a sitemap index:**

   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <sitemap>
       <loc>https://example.com/sitemap-pages.xml</loc>
       <lastmod>2025-01-15</lastmod>
     </sitemap>
     <sitemap>
       <loc>https://example.com/sitemap-blog.xml</loc>
       <lastmod>2025-01-14</lastmod>
     </sitemap>
   </sitemapindex>
   ```

4. **Add the Sitemap directive to robots.txt:**

   ```
   Sitemap: https://example.com/sitemap.xml
   ```

5. **Verify the sitemap is valid XML** (a stricter check is sketched after this list):

   ```bash
   curl -s https://example.com/sitemap.xml | head -5
   # Should start with <?xml and contain <urlset or <sitemapindex
   ```
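For a stricter check than eyeballing the first lines, a sketch using `xmllint` (it ships with libxml2; that it is installed is an assumption):

```bash
# Fail if the sitemap is not well-formed XML.
curl -s https://example.com/sitemap.xml | xmllint --noout - \
  && echo "sitemap: valid XML" \
  || echo "sitemap: INVALID XML"
```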
### Framework-Specific Generation

Most frameworks have sitemap plugins — prefer automated generation over manual files; if none fits, see the sketch after this list:

- **WordPress:** Yoast SEO or built-in (`/wp-sitemap.xml`)
- **Next.js:** `next-sitemap` package or App Router `sitemap.ts`
- **Rails:** `sitemap_generator` gem
- **Django:** `django.contrib.sitemaps`
- **Laravel:** `spatie/laravel-sitemap`
- **Phoenix/Elixir:** custom plug or controller route
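A minimal sketch that assembles a sitemap from a plain list of URLs; `urls.txt` (one URL per line) is a hypothetical input file:

```bash
# Build a simple sitemap.xml from urls.txt, stamping today's date as lastmod.
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while IFS= read -r url; do
    printf '  <url><loc>%s</loc><lastmod>%s</lastmod></url>\n' \
      "$url" "$(date -u +%Y-%m-%d)"
  done < urls.txt
  echo '</urlset>'
} > sitemap.xml
```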
## Checkpoint 1.4: llms.txt (15 pts)

**What passes:** `/llms.txt` returns HTTP 200 with `text/plain` or `text/markdown`, starts with a `#` heading, and contains at least one URL. Bonus: `/llms-full.txt` companion found.
**What fails:** missing file, wrong content type, no heading, or no URLs.
### Fix Workflow

1. **Create `/llms.txt`** — a markdown-formatted overview of your site for LLMs:

   ```markdown
   # Your Company Name

   > Brief description of what your company does.

   ## Docs

   - [Getting Started](https://example.com/docs/getting-started)
   - [API Reference](https://example.com/docs/api)
   - [Tutorials](https://example.com/docs/tutorials)

   ## Products

   - [Product Overview](https://example.com/products)
   - [Pricing](https://example.com/pricing)

   ## Optional

   - [Blog](https://example.com/blog)
   - [Changelog](https://example.com/changelog)
   - [Status Page](https://status.example.com)
   ```

2. **Optionally create `/llms-full.txt`** — an expanded version with more detail:

   ```markdown
   # Your Company Name

   > Detailed description of your company, products, and services.

   ## Getting Started

   Full getting started content here, not just a link. Include setup
   instructions, prerequisites, etc.

   ## API Reference

   Inline API documentation or detailed summaries of endpoints.
   ```

3. **Ensure correct Content-Type** — must be `text/plain` or `text/markdown`. Same server config as robots.txt (see checkpoint 1.1).

4. **Verify** (a combined check is sketched after this list):

   ```bash
   curl -sI https://example.com/llms.txt | grep -i content-type
   curl -s https://example.com/llms.txt | head -5
   # First line must start with #
   # Must contain at least one https:// URL
   ```
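A sketch that checks the scanner's three requirements in one pass, assuming `curl` and `grep` are available:

```bash
# Check llms.txt: markdown heading on line 1, at least one URL, correct type.
BODY=$(curl -s https://example.com/llms.txt)
echo "$BODY" | head -1 | grep -q '^#' && echo "heading: ok" || echo "heading: MISSING"
echo "$BODY" | grep -qE 'https?://'  && echo "url: ok"     || echo "url: MISSING"
curl -sI https://example.com/llms.txt \
  | grep -qiE '^content-type: *text/(plain|markdown)' \
  && echo "content-type: ok" || echo "content-type: WRONG"
```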
See references/llms-txt-guide.md for the full specification and examples for different site types.
## Checkpoint 1.5: Meta Robots / X-Robots-Tag (15 pts)

**What passes:** no restrictive directives found (15 pts).
**Partial credit:** restrictive directives other than `noindex` found (8 pts) — e.g., `nofollow`, `nosnippet`, `noai`, `noimageai`.
**What fails:** `noindex` directive found (0 pts).

The scanner checks both:

- `<meta name="robots" content="...">` in HTML
- `X-Robots-Tag` HTTP response header
### Restrictive Directives

| Directive | Effect |
|---|---|
| `noindex` | Prevents indexing entirely (worst for AI) |
| `nofollow` | Prevents following links on the page |
| `nosnippet` | Prevents showing snippets in search results |
| `noai` | Signals no AI usage (some crawlers respect it) |
| `noimageai` | Signals no AI usage of images |
### Fix Workflow

1. **Check current meta robots:**

   ```bash
   curl -s https://example.com/ | grep -i 'name="robots"'
   curl -sI https://example.com/ | grep -i x-robots-tag
   ```

2. **Remove or replace restrictive tags** in your HTML `<head>`:

   ```html
   <!-- WRONG: blocks AI indexing -->
   <meta name="robots" content="noindex, nofollow">

   <!-- CORRECT: allows AI indexing -->
   <meta name="robots" content="index, follow">

   <!-- ALSO CORRECT: omit entirely (default is index, follow) -->
   ```

3. **Remove restrictive X-Robots-Tag headers.** Nginx:

   ```nginx
   # Remove if present:
   # add_header X-Robots-Tag "noindex";

   # Replace with (or remove entirely):
   add_header X-Robots-Tag "index, follow";
   ```

4. **For specific AI directives** — if you have `noai` or `noimageai` and want to allow AI:

   ```html
   <!-- Remove noai/noimageai to allow AI systems -->
   <meta name="robots" content="index, follow">
   ```

5. **Verify** (a combined scan is sketched after this list):

   ```bash
   curl -s https://example.com/ | grep -i 'name="robots"'
   # Should show: content="index, follow" or no meta robots tag at all
   ```
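A sketch that flags any restrictive directive from either source for one page; the directive list mirrors the table above:

```bash
# Report restrictive robots directives from the header and the meta tag.
URL="https://example.com/"
curl -sI "$URL" \
  | grep -iE '^x-robots-tag:.*(noindex|nofollow|nosnippet|noai|noimageai)' \
  && echo "restrictive X-Robots-Tag header found"
curl -s "$URL" | grep -oiE '<meta[^>]+name="robots"[^>]*>' \
  | grep -iE 'noindex|nofollow|nosnippet|noai|noimageai' \
  && echo "restrictive meta robots tag found"
```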
## Checkpoint 1.6: Content Freshness Signals (10 pts)

**What passes (10 pts):** `dateModified` in JSON-LD or an `article:modified_time` meta tag.
**Partial credit:** only `datePublished`, `article:published_time`, or `<time datetime>` (7 pts); only a `Last-Modified` header (5 pts).
**What fails (0 pts):** no freshness signals detected.

**Why it matters:** ChatGPT shows a 3.2x preference for content with fresh date signals. AI systems use dates to prioritize recent, authoritative content.
### Fix Workflow

1. **Check current signals:**

   ```bash
   curl -sI https://example.com/ | grep -i last-modified
   curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
   ```

2. **Add `dateModified` and `datePublished` to JSON-LD** (best — 10 pts; a timestamp helper is sketched after this list):

   ```json
   {
     "@context": "https://schema.org",
     "@type": "Article",
     "datePublished": "2024-01-15T09:00:00Z",
     "dateModified": "2024-03-01T14:30:00Z"
   }
   ```

3. **Add Open Graph meta tags** (also 10 pts):

   ```html
   <meta property="article:published_time" content="2024-01-15T09:00:00Z">
   <meta property="article:modified_time" content="2024-03-01T14:30:00Z">
   ```

4. The `Last-Modified` HTTP header scores only 5 pts — use JSON-LD or meta tags for full credit.

5. A `<time datetime>` element scores 7 pts as a fallback:

   ```html
   <time datetime="2024-03-01T14:30:00Z">March 1, 2024</time>
   ```

6. **Verify:**

   ```bash
   curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
   ```
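When bumping `dateModified` from a build script, a one-line sketch that emits the ISO-8601 UTC timestamp format used in the examples above:

```bash
# Current UTC time in the ISO-8601 format the date fields expect.
date -u +"%Y-%m-%dT%H:%M:%SZ"
# e.g. 2024-03-01T14:30:00Z
```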
## Key Gotchas

- **robots.txt returns HTML** — a 404 page is served instead of a real robots.txt file
- **Wrong Content-Type** — robots.txt or llms.txt served as `text/html` instead of `text/plain`
- **Wildcard Disallow blocks everything** — `User-agent: *` followed by `Disallow: /` blocks all AI crawlers
- **Sitemap is not valid XML** — a JSON or HTML page is served at `/sitemap.xml`
- **llms.txt has no URLs** — the file exists but is just plain text without any links
- **noai vs noindex confusion** — `noai` costs 7 points; `noindex` costs all 15
- **Last-Modified header only** — scores 5/10; add `dateModified` in JSON-LD for the full 10 points
See references/gotchas.md for detailed correct vs incorrect examples of each.
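To spot-check all seven checkpoints before re-scanning, a quick audit sketch; pass the site as the first argument, and treat the counts as rough signals rather than the scanner's scoring:

```bash
# Rough one-page audit of the seven Category 1 checkpoints.
SITE="${1:-https://example.com}"
echo "1.8 page status:  $(curl -s -o /dev/null -w '%{http_code}' -A 'Mozilla/5.0 (compatible; GPTBot/1.0)' "$SITE/")"
echo "1.1 robots.txt:   $(curl -s -o /dev/null -w '%{http_code} %{content_type}' "$SITE/robots.txt")"
echo "1.2 AI agents:    $(curl -s "$SITE/robots.txt" | grep -ciE 'gptbot|claudebot|perplexitybot') mention(s)"
echo "1.3 sitemap:      $(curl -s -o /dev/null -w '%{http_code}' "$SITE/sitemap.xml")"
echo "1.4 llms.txt:     $(curl -s -o /dev/null -w '%{http_code}' "$SITE/llms.txt")"
echo "1.5 x-robots-tag: $(curl -sI "$SITE/" | grep -ci '^x-robots-tag') header(s)"
echo "1.6 freshness:    $(curl -s "$SITE/" | grep -ciE 'dateModified|article:modified_time') signal(s)"
```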
## References

- `references/robots-txt-guide.md` — complete robots.txt syntax, AI user-agents, and testing
- `references/llms-txt-guide.md` — llms.txt specification, content structure, and site-type examples
- `references/gotchas.md` — common pitfalls with wrong vs correct examples
## Instructions

- Identify failing checkpoints from the IsAgentReady.com scan results
- Follow the fix workflow for each failing checkpoint above
- Apply the code examples — adapt URLs, domain names, and content to the user's site
- Verify each fix using the curl commands provided in each workflow
- Re-scan at isagentready.com to confirm improvements

If `$ARGUMENTS` is provided, interpret it as the URL to fix or the specific checkpoint to address.