# AI Content Discovery
Fixes Category 1 (AI Content Discovery, 30% weight) issues from IsAgentReady.com. This category checks whether AI systems can find, crawl, and understand your website's content. It evaluates 7 checkpoints worth 100 points total.
## When to Use
- Fixing robots.txt issues (missing, wrong content type, blocking bots)
- Adding or updating AI crawler directives (GPTBot, ClaudeBot, etc.)
- Creating or fixing XML sitemaps
- Creating llms.txt or llms-full.txt files
- Removing restrictive meta robots tags (noindex, noai)
- Fixing WAF/CDN bot blocking issues
- Adding content freshness signals (dateModified, article:modified_time)
- Any task to "improve AI discoverability" or "make site crawlable by AI"
## When NOT to Use

- Adding structured data / JSON-LD (use the `structured-data` skill)
- Fixing semantic HTML or heading hierarchy (use the `content-semantics` skill)
- Setting up agent protocols like WebMCP or A2A (use the `agent-protocols` skill)
- Configuring security headers like CSP or HSTS (use the `security-trust` skill)
## Checkpoints Overview
| ID | Checkpoint | Max Points | What It Tests |
|---|---|---|---|
| 1.8 | HTTP bot accessibility | 15 | Page returns HTTP 200-299 (not 401/403 from WAF) |
| 1.1 | robots.txt present | 15 | /robots.txt returns 200 with text/plain Content-Type |
| 1.2 | AI crawler directives | 15 | Allow/Disallow rules for 13 AI user-agents in robots.txt |
| 1.3 | XML Sitemap | 15 | Valid XML sitemap with <urlset> or <sitemapindex> |
| 1.4 | llms.txt | 15 | /llms.txt with markdown heading + URLs; bonus for /llms-full.txt |
| 1.5 | Meta robots / X-Robots-Tag | 15 | No restrictive directives (noindex, noai, noimageai) |
| 1.6 | Content freshness signals | 10 | dateModified in JSON-LD, article:modified_time, or Last-Modified |
## Checkpoint 1.8: HTTP Bot Accessibility (15 pts)

**What passes:** HTTP status 200-299. **What fails:** HTTP 401 or 403 (WAF/CDN blocking bots).

### Fix Workflow
1. **Diagnose** — test with an AI crawler user-agent:

   ```sh
   curl -sI -A "Mozilla/5.0 (compatible; GPTBot/1.0)" https://example.com/
   curl -sI -A "Mozilla/5.0 (compatible; ClaudeBot/1.0)" https://example.com/
   ```

2. **If blocked by Cloudflare** — create a WAF exception:

   ```
   # Dashboard -> Security -> WAF -> Custom Rules -> Create rule:
   #   Field: User Agent | Operator: contains | Value: GPTBot
   #   Action: Skip remaining rules
   #
   # Repeat for ClaudeBot, Amazonbot, ChatGPT-User, etc.
   ```

3. **If blocked by Nginx rate limiting** — exempt AI user-agents. Map AI bots to an empty rate-limit key; nginx does not rate-limit requests whose key is empty (an empty `if` block inside the location would do nothing):

   ```nginx
   map $http_user_agent $limit_key {
       default       $binary_remote_addr;
       "~*GPTBot"    "";
       "~*ClaudeBot" "";
       "~*Amazonbot" "";
       "~*ChatGPT"   "";
   }

   # Requests with an empty key are not counted against the rate limit
   limit_req_zone $limit_key zone=general:10m rate=10r/s;

   server {
       location / {
           limit_req zone=general burst=20;
       }
   }
   ```

4. **If blocked by Apache** — allow in `.htaccess` (this rule must appear before any blocking rules):

   ```apache
   RewriteEngine On
   RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Amazonbot) [NC]
   RewriteRule ^ - [L]
   ```

5. **Verify** — re-test with curl to confirm a 200 response.
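The diagnose-and-verify steps above can be scripted. A minimal sketch, assuming POSIX sh and `curl`; the `classify_status` and `probe` helper names and the short UA list are illustrative, not part of the scanner:

```shell
# Probe a URL as several AI crawler user-agents and classify the result.
# 2xx passes checkpoint 1.8; 401/403 indicate WAF/CDN blocking.
classify_status() {
  case "$1" in
    2??)     echo "PASS" ;;
    401|403) echo "BLOCKED" ;;
    *)       echo "CHECK ($1)" ;;
  esac
}

probe() {  # probe <url> <user-agent>
  code=$(curl -s -o /dev/null -w '%{http_code}' -A "$2" "$1")
  printf '%-22s %s\n' "$2" "$(classify_status "$code")"
}

# Example run (network required):
# for ua in GPTBot ClaudeBot Amazonbot PerplexityBot; do
#   probe https://example.com/ "$ua"
# done
```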
## Checkpoint 1.1: robots.txt Present (15 pts)

**What passes:** `/robots.txt` returns HTTP 200 with `Content-Type: text/plain`.
**What fails:** Missing file (404), an HTML error page served instead, or a wrong Content-Type.
### Fix Workflow

1. **Check current state:**

   ```sh
   curl -sI https://example.com/robots.txt | head -20
   ```

2. **Create `/robots.txt`** at your web root:

   ```
   User-agent: *
   Allow: /

   User-agent: GPTBot
   Allow: /

   User-agent: ClaudeBot
   Allow: /

   User-agent: Google-Extended
   Allow: /

   Sitemap: https://example.com/sitemap.xml
   ```

3. **Ensure correct Content-Type** — must return `text/plain`. Nginx: `default_type text/plain;` in the location block. Apache: `ForceType text/plain` in a `<Files>` directive.

4. **Verify:**

   ```sh
   curl -sI https://example.com/robots.txt | grep -i content-type
   # Expected: Content-Type: text/plain
   ```
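A common failure mode is an HTML 404 page being served at `/robots.txt`. A small stdin filter can catch this locally; the `robots_is_html` helper name is illustrative:

```shell
# Detect the common failure where robots.txt is actually an HTML page.
robots_is_html() {  # reads the robots.txt body on stdin
  if head -c 512 | grep -Eqi '<html|<!doctype'; then
    echo "FAIL: HTML served instead of robots.txt"
  else
    echo "OK: body does not look like HTML"
  fi
}

# Example (network required):
# curl -s https://example.com/robots.txt | robots_is_html
```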
See references/robots-txt-guide.md for complete robots.txt syntax and rules.
## Checkpoint 1.2: AI Crawler Directives (15 pts)

**What passes:** All 13 AI crawlers explicitly allowed, some allowed with none blocked, or a wildcard `Allow: /` with none blocked (15 pts each).
**Partial credit:** No AI crawlers mentioned but the default allow applies (10 pts); mixed policies with some blocked (7 pts).
**What fails:** All AI crawlers explicitly disallowed (0 pts).
### The 13 AI User-Agents
| User-Agent | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data crawling |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT |
| OAI-SearchBot | OpenAI | SearchGPT results |
| ClaudeBot | Anthropic | Training data crawling |
| Claude-User | Anthropic | Real-time browsing in Claude |
| Claude-SearchBot | Anthropic | Claude search results |
| Google-Extended | Google | Gemini AI training |
| Amazonbot | Amazon | Alexa/AI training |
| Bytespider | ByteDance | TikTok/AI training |
| CCBot | Common Crawl | Open dataset crawling |
| PerplexityBot | Perplexity | AI search results |
| Applebot-Extended | Apple | Apple Intelligence training |
| meta-externalagent | Meta | Meta AI training |
### Fix Workflow

1. **Check current directives:**

   ```sh
   curl -s https://example.com/robots.txt
   ```

2. **Add explicit Allow directives** for each AI crawler to your robots.txt:

   ```
   # AI Crawlers — explicitly allow (one block per agent)
   User-agent: GPTBot
   Allow: /

   User-agent: ChatGPT-User
   Allow: /

   User-agent: OAI-SearchBot
   Allow: /

   User-agent: ClaudeBot
   Allow: /

   User-agent: Claude-User
   Allow: /

   User-agent: Claude-SearchBot
   Allow: /

   User-agent: Google-Extended
   Allow: /

   User-agent: Amazonbot
   Allow: /

   User-agent: Bytespider
   Allow: /

   User-agent: CCBot
   Allow: /

   User-agent: PerplexityBot
   Allow: /

   User-agent: Applebot-Extended
   Allow: /

   User-agent: meta-externalagent
   Allow: /
   ```

3. **If you want to allow all crawlers** — a simple wildcard also works:

   ```
   User-agent: *
   Allow: /
   ```

4. **If you want selective control** — allow some, block others:

   ```
   # Allow search-oriented AI crawlers
   User-agent: ChatGPT-User
   Allow: /

   User-agent: PerplexityBot
   Allow: /

   # Block training-oriented crawlers
   User-agent: GPTBot
   Disallow: /

   User-agent: CCBot
   Disallow: /
   ```
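To audit a downloaded robots.txt against the full list, a sketch like the following works. The `audit_robots` helper is illustrative only; it checks that a `User-agent` line names each crawler, not what rules follow it:

```shell
# Report which of the 13 AI user-agents a robots.txt file mentions.
AI_AGENTS="GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Claude-User \
Claude-SearchBot Google-Extended Amazonbot Bytespider CCBot \
PerplexityBot Applebot-Extended meta-externalagent"

audit_robots() {  # audit_robots <path-to-robots.txt>
  for agent in $AI_AGENTS; do
    # Anchored match so e.g. GPTBot does not match ChatGPT-User's line
    if grep -qi "^User-agent: *$agent" "$1"; then
      echo "$agent: mentioned"
    else
      echo "$agent: not mentioned"
    fi
  done
}

# Example (network required):
# curl -s https://example.com/robots.txt > /tmp/robots.txt
# audit_robots /tmp/robots.txt
```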
See references/robots-txt-guide.md for full syntax and AI user-agent details.
## Checkpoint 1.3: XML Sitemap (15 pts)

**What passes:** A valid XML sitemap found at a discoverable URL with `<urlset>` or `<sitemapindex>`.
**What fails:** No sitemap found, or the sitemap is not valid XML.

The scanner checks these locations in order:

1. URLs from `Sitemap:` directives in robots.txt
2. `/sitemap.xml`
3. `/sitemap_index.xml`
### Fix Workflow

1. **Check if a sitemap exists:**

   ```sh
   curl -sI https://example.com/sitemap.xml | head -5
   curl -s https://example.com/robots.txt | grep -i sitemap
   ```

2. **Create `/sitemap.xml`:**

   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
       <loc>https://example.com/</loc>
       <lastmod>2025-01-15</lastmod>
       <changefreq>weekly</changefreq>
       <priority>1.0</priority>
     </url>
     <url>
       <loc>https://example.com/about</loc>
       <lastmod>2025-01-10</lastmod>
     </url>
   </urlset>
   ```

3. **For large sites, use a sitemap index:**

   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <sitemap>
       <loc>https://example.com/sitemap-pages.xml</loc>
       <lastmod>2025-01-15</lastmod>
     </sitemap>
     <sitemap>
       <loc>https://example.com/sitemap-blog.xml</loc>
       <lastmod>2025-01-14</lastmod>
     </sitemap>
   </sitemapindex>
   ```

4. **Add the Sitemap directive to robots.txt:**

   ```
   Sitemap: https://example.com/sitemap.xml
   ```

5. **Verify the sitemap is valid XML:**

   ```sh
   curl -s https://example.com/sitemap.xml | head -5
   # Should start with <?xml and contain <urlset or <sitemapindex
   ```
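Before uploading, the structural checks can be approximated locally. The `check_sitemap` helper below is an illustrative sketch, not the scanner's actual logic, and it only sanity-checks the markers rather than fully validating the XML:

```shell
# Rough local sanity check for a sitemap file.
check_sitemap() {  # check_sitemap <path-to-sitemap.xml>
  head -c 200 "$1" | grep -q '<?xml' \
    || { echo "missing XML declaration"; return 1; }
  grep -Eq '<(urlset|sitemapindex)' "$1" \
    || { echo "missing <urlset>/<sitemapindex>"; return 1; }
  echo "looks valid: $(grep -c '<loc>' "$1") <loc> entries"
}
```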
### Framework-Specific Generation

Most frameworks have sitemap plugins — prefer automated generation over manual files:

- **WordPress:** Yoast SEO or the built-in `/wp-sitemap.xml`
- **Next.js:** the `next-sitemap` package or an App Router `sitemap.ts`
- **Rails:** the `sitemap_generator` gem
- **Django:** `django.contrib.sitemaps`
- **Laravel:** `spatie/laravel-sitemap`
- **Phoenix/Elixir:** a custom plug or controller route
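If no plugin fits, a throwaway generator is easy to sketch in shell. The `generate_sitemap` name and the choice of today's date for `lastmod` are assumptions for illustration:

```shell
# Emit a minimal sitemap.xml from a newline-separated URL list on stdin.
generate_sitemap() {
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
  while IFS= read -r url; do
    # Skip blank lines; stamp each entry with today's UTC date
    [ -n "$url" ] && printf '  <url><loc>%s</loc><lastmod>%s</lastmod></url>\n' \
      "$url" "$(date -u +%F)"
  done
  printf '</urlset>\n'
}

# Example:
# printf 'https://example.com/\nhttps://example.com/about\n' | generate_sitemap > sitemap.xml
```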
## Checkpoint 1.4: llms.txt (15 pts)

**What passes:** `/llms.txt` returns HTTP 200 with `text/plain` or `text/markdown`, starts with a `#` heading, and contains at least one URL. Bonus: an `/llms-full.txt` companion found.
**What fails:** Missing file, wrong content type, no heading, or no URLs.
### Fix Workflow

1. **Create `/llms.txt`** — a markdown-formatted overview of your site for LLMs:

   ```markdown
   # Your Company Name

   > Brief description of what your company does.

   ## Docs

   - [Getting Started](https://example.com/docs/getting-started)
   - [API Reference](https://example.com/docs/api)
   - [Tutorials](https://example.com/docs/tutorials)

   ## Products

   - [Product Overview](https://example.com/products)
   - [Pricing](https://example.com/pricing)

   ## Optional

   - [Blog](https://example.com/blog)
   - [Changelog](https://example.com/changelog)
   - [Status Page](https://status.example.com)
   ```

2. **Optionally create `/llms-full.txt`** — an expanded version with more detail:

   ```markdown
   # Your Company Name

   > Detailed description of your company, products, and services.

   ## Getting Started

   Full getting started content here, not just a link. Include setup
   instructions, prerequisites, etc.

   ## API Reference

   Inline API documentation or detailed summaries of endpoints.
   ```

3. **Ensure correct Content-Type** — must be `text/plain` or `text/markdown`. Same server config as robots.txt (see checkpoint 1.1).

4. **Verify:**

   ```sh
   curl -sI https://example.com/llms.txt | grep -i content-type
   curl -s https://example.com/llms.txt | head -5
   # First line must start with #
   # Must contain at least one https:// URL
   ```
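The pass criteria (leading `#` heading, at least one URL) can also be checked locally before deploying. The `check_llms` helper is an illustrative sketch:

```shell
# Check an llms.txt file against the two content requirements.
check_llms() {  # check_llms <path-to-llms.txt>
  head -n1 "$1" | grep -q '^# ' \
    || { echo "FAIL: first line must be a # heading"; return 1; }
  grep -q 'https://' "$1" \
    || { echo "FAIL: no https:// URL found"; return 1; }
  echo "PASS"
}
```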
See references/llms-txt-guide.md for the full specification and examples for different site types.
## Checkpoint 1.5: Meta Robots / X-Robots-Tag (15 pts)

**What passes:** No restrictive directives found (15 pts).
**Partial credit:** Restrictive directives other than `noindex` found (8 pts) — e.g., `nofollow`, `nosnippet`, `noai`, `noimageai`.
**What fails:** A `noindex` directive found (0 pts).

The scanner checks both:

- `<meta name="robots" content="...">` in the HTML
- the `X-Robots-Tag` HTTP response header
### Restrictive Directives

| Directive | Effect |
|---|---|
| `noindex` | Prevents indexing entirely (worst for AI) |
| `nofollow` | Prevents following links on the page |
| `nosnippet` | Prevents showing snippets in search results |
| `noai` | Signals no AI usage (some crawlers respect it) |
| `noimageai` | Signals no AI usage of images |
### Fix Workflow

1. **Check current meta robots:**

   ```sh
   curl -s https://example.com/ | grep -i 'name="robots"'
   curl -sI https://example.com/ | grep -i x-robots-tag
   ```

2. **Remove or replace restrictive tags** in your HTML `<head>`:

   ```html
   <!-- WRONG: blocks AI indexing -->
   <meta name="robots" content="noindex, nofollow">

   <!-- CORRECT: allows AI indexing -->
   <meta name="robots" content="index, follow">

   <!-- ALSO CORRECT: omit the tag entirely (default is index, follow) -->
   ```

3. **Remove restrictive X-Robots-Tag headers** (Nginx):

   ```nginx
   # Remove if present:
   # add_header X-Robots-Tag "noindex";

   # Replace with (or remove entirely):
   add_header X-Robots-Tag "index, follow";
   ```

4. **For specific AI directives** — if you have `noai` or `noimageai` and want to allow AI:

   ```html
   <!-- Remove noai/noimageai to allow AI systems -->
   <meta name="robots" content="index, follow">
   ```

5. **Verify:**

   ```sh
   curl -s https://example.com/ | grep -i 'name="robots"'
   # Should show: content="index, follow" or no meta robots tag at all
   ```
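Saved HTML can be scanned for restrictive directives with a rough grep-based check. This is illustrative only: a real scanner parses the DOM and also inspects `X-Robots-Tag` headers, and the `check_meta_robots` name is a hypothetical helper:

```shell
# Classify a saved HTML file by its meta robots directives.
check_meta_robots() {  # check_meta_robots <html-file>
  if grep -Eiq 'name="robots"[^>]*content="[^"]*noindex' "$1"; then
    echo "FAIL: noindex found"
  elif grep -Eiq 'name="robots"[^>]*content="[^"]*(nofollow|nosnippet|noai|noimageai)' "$1"; then
    echo "PARTIAL: restrictive directive found"
  else
    echo "PASS: no restrictive directives"
  fi
}

# Example (network required):
# curl -s https://example.com/ > /tmp/page.html && check_meta_robots /tmp/page.html
```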
## Checkpoint 1.6: Content Freshness Signals (10 pts)

**What passes:** `dateModified` in JSON-LD or an `article:modified_time` meta tag (10 pts).
**Partial credit:** Only `datePublished`, `article:published_time`, or `<time datetime>` (7 pts); only a `Last-Modified` header (5 pts).
**What fails:** No freshness signals detected (0 pts).

**Why it matters:** ChatGPT shows a 3.2x preference for content with fresh date signals. AI systems use dates to prioritize recent, authoritative content.
### Fix Workflow

1. **Check current signals:**

   ```sh
   curl -sI https://example.com/ | grep -i last-modified
   curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
   ```

2. **Add `dateModified` and `datePublished` to JSON-LD** (best — 10 pts):

   ```json
   {
     "@context": "https://schema.org",
     "@type": "Article",
     "datePublished": "2024-01-15T09:00:00Z",
     "dateModified": "2024-03-01T14:30:00Z"
   }
   ```

3. **Add Open Graph meta tags** (also 10 pts):

   ```html
   <meta property="article:published_time" content="2024-01-15T09:00:00Z">
   <meta property="article:modified_time" content="2024-03-01T14:30:00Z">
   ```

4. The `Last-Modified` HTTP header scores only 5 pts — use JSON-LD or meta tags for full credit.

5. A `<time datetime>` element scores 7 pts as a fallback:

   ```html
   <time datetime="2024-03-01T14:30:00Z">March 1, 2024</time>
   ```

6. **Verify:**

   ```sh
   curl -s https://example.com/ | grep -iE 'dateModified|article:modified_time'
   ```
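Stamping `dateModified` with the current time can be automated at build time. This sketch assumes a hypothetical `emit_freshness_jsonld` helper and UTC ISO 8601 timestamps:

```shell
# Emit an Article JSON-LD fragment with dateModified set to "now" (UTC).
emit_freshness_jsonld() {  # emit_freshness_jsonld <published-iso8601>
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{\n  "@context": "https://schema.org",\n  "@type": "Article",\n  "datePublished": "%s",\n  "dateModified": "%s"\n}\n' \
    "$1" "$now"
}

# Example:
# emit_freshness_jsonld "2024-01-15T09:00:00Z"
```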
## Key Gotchas

- **robots.txt returns HTML** — a 404 page is served instead of a real robots.txt file
- **Wrong Content-Type** — robots.txt or llms.txt served as `text/html` instead of `text/plain`
- **Wildcard Disallow blocks everything** — `User-agent: *` followed by `Disallow: /` blocks all AI crawlers
- **Sitemap is not valid XML** — a JSON or HTML page is served at `/sitemap.xml`
- **llms.txt has no URLs** — the file exists but is plain text without any links
- **noai vs noindex confusion** — `noai` costs 7 points; `noindex` costs all 15
- **Last-Modified header only** — scores 5/10; add `dateModified` in JSON-LD for the full 10 points
See references/gotchas.md for detailed correct vs incorrect examples of each.
## References

- robots-txt-guide.md — Complete robots.txt syntax, AI user-agents, and testing
- llms-txt-guide.md — llms.txt specification, content structure, and site-type examples
- gotchas.md — Common pitfalls with wrong vs correct examples
## Instructions

- Identify failing checkpoints from the IsAgentReady.com scan results
- Follow the fix workflow for each failing checkpoint above
- Apply the code examples — adapt URLs, domain names, and content to the user's site
- Verify each fix using the curl commands provided in each workflow
- Re-scan at isagentready.com to confirm improvements

If `$ARGUMENTS` is provided, interpret it as the URL to fix or the specific checkpoint to address.