seo-engine
SEO Engine
Use this skill to run deterministic checks on HTML, headers, robots.txt, and related resources. It returns pass/fail outcomes with minimal heuristics and clear remediation steps.
When to use
- Auditing a page's indexability and crawlability
- Verifying content structure (title, headings, links, images)
- Flagging spam policy violations (cloaking, hidden text, keyword stuffing)
- Sanity-checking redirect behavior and HTTP status
- Verifying dashboard filters/metrics for SEO reporting
- Analyzing website URLs provided by users (extracts HTML, robots.txt, and sitemap automatically)
- Analyzing existing HTML files, robots.txt files, sitemap.xml files
Input Preparation
When provided with a website URL, use the input preparation scripts to extract all required data:
- fetch_html.py - Extracts HTML source code from the webpage
- fetch_robots_txt.py - Downloads robots.txt crawling permissions
- fetch_sitemap.py - Finds and downloads sitemap.xml structure
See scripts/prepare_input/README.md for detailed usage instructions.
Quick Reference
Analyze Website URL
Provide a website URL and the agent will extract all required data and run SEO analysis:
"Analyze https://example.com for SEO issues"
"Run SEO audit on example.com"
"Check this website for technical SEO problems: https://site.com"
Analyze Local Files
Provide local files and the agent will apply appropriate SEO rules:
"Analyze this HTML file for SEO compliance: page.html"
"Check these files: page.html, robots.txt"
"What SEO issues exist in this webpage file?"
Rule Information Lookup
Ask about specific rules or categories:
"What does the FAVICON_DIMENSIONS rule check?"
"Show me all Content Basics rules"
"Explain the PAGE_EXPERIENCE_DIVERSITY requirement"
"List all Critical priority SEO rules"
File Requirements by Rule
Each rule specifies its required inputs in the YAML frontmatter:
- inputFields.html - Requires HTML file content
- inputFields.robotsTxt - Requires robots.txt file content
- inputFields.sitemap - Requires sitemap.xml content
The agent automatically reads rule definitions and applies appropriate checks based on your provided files.
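As an illustration, a rule definition's frontmatter might look like the following. The exact field names other than inputFields are assumptions for the sketch; consult the actual rule files for the authoritative schema:

```yaml
---
# Illustrative frontmatter; real rule files may use different field names
id: PAGE_TITLE_EXISTS
title: Page title tag present and non-empty
priority: Low
category: Content Basics
inputFields:
  html: true          # this rule only needs HTML file content
---
```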
Included Rules
| Rule ID | Title | Priority | Category | Description |
|---|---|---|---|---|
| PAGE_TITLE_EXISTS | Page title tag present and non-empty | Low | Content Basics | Ensures the <title> tag exists and is not empty. |
| MAIN_HEADING_EXISTS | Main heading (<h1>) present and non-empty | Low | Content Basics | At least one <h1> element exists with non-empty text. |
| IMAGE_ALT_TEXT | All images have non-empty alt attributes | Medium | Content Basics | Every <img> element includes a non-empty alt attribute. |
| CRAWLABLE_LINKS | All anchor links are crawlable (have valid href) | Low | Content Basics | Every <a> element has a non-empty href that does not start with "javascript:". |
| GOOGLEBOT_NOT_BLOCKED | Googlebot is not blocked by robots.txt | High | Technical Requirements | No Disallow rule for Googlebot (or *) matches the page URL. |
| PAGE_HTTP_200_STATUS | Page returns HTTP 200 status | Critical | Technical Requirements | Page responds with HTTP 200, not error or redirect. |
| PAGE_INDEXABLE_CONTENT | Page has indexable textual content | Medium | Technical Requirements | Body contains at least one alphanumeric character after stripping markup. |
| HEAD_SECTION_VALID_HTML | Head section must be valid HTML | High | Technical Requirements | The page contains exactly one <head> element and the HTML parses without syntax errors. |
| CLOAKING_DETECTION | Detect cloaking by comparing bot vs user content | High | Spam Policies | Content served to Googlebot and regular users is substantially identical (≥ 90% similarity). |
| HIDDEN_TEXT_DETECTION | Detect hidden text or links intended for search | Medium | Spam Policies | No elements with hidden styles contain visible text or links. |
| KEYWORD_STUFFING_DETECTION | Detect excessive repetition of keywords | Medium | Spam Policies | No single keyword exceeds a density of 5% of total words. |
| SNEAKY_REDIRECT_DETECTION | Detect sneaky redirects for bots vs users | High | Spam Policies | Both HTTP status codes and final URLs are identical for Googlebot and regular users. |
| TITLE_DESCRIPTIVE | Page title is descriptive, specific, and accurate | Medium | Content Optimization | Title text contains at least two words. |
| META_DESCRIPTION_PRESENT | Meta description tag is present and descriptive | Medium | Content Optimization | Meta description is present and its length is at least 50 characters. |
| IMAGE_ALT_ATTRIBUTES | All images have descriptive alt attributes | Medium | Content Optimization | Every <img> element has a non-empty alt attribute. |
| HEADING_HIERARCHY | Page uses heading elements for hierarchy | Low | Content Optimization | At least one <h1> element is present. |
| GA_FILTER_SOURCE_MEDIUM | GA data filtered to source=google, medium=organic | Medium | Dashboard Setup | Both source=google and medium=organic filter conditions are present in dashboard config. |
| DASHBOARD_METRICS_PRESENT | Dashboard includes required five metrics | Medium | Dashboard Setup | All five required metrics are present in the dashboard configuration. |
| DASHBOARD_DATA_SOURCES_CONNECTED | Dashboard connects to GA and SC | Medium | Dashboard Setup | Both Search Console and Google Analytics data sources are referenced in dashboard config. |
| AMP_PAGE_MUST_FOLLOW_SPEC | AMP page must follow AMP HTML specification | Critical | AMP Validation | Ensures the page complies with the AMP HTML specification for Google Search features. |
| BANNER_DATA_NOSNIPPET_PRESENT | Ensure banner or popup uses data-nosnippet attribute | Medium | Site Functionality | Prevents banner or popup content from being shown in search result snippets. |
| ROBOTS_TXT_NOT_503 | robots.txt must not return HTTP 503 status | High | Site Availability | A 503 response for robots.txt blocks all crawling, preventing indexing. |
| RETRY_AFTER_HEADER_PRESENT_ON_503 | 503 error pages must include a Retry-After header | Medium | Site Availability | Provides crawlers with guidance on when to retry, reducing unnecessary load. |
| NO_URL_FRAGMENTS | Avoid URL fragments that change content | High | URL Structure | Google Search may not crawl URLs where fragments are used to change content. |
| HYPHENS_IN_PATH | Use hyphens to separate words in URL path | Medium | URL Structure | Hyphens improve readability for users and search engines, aiding crawlability. |
| PERCENT_ENCODING_NECESSARY | Percent‑encode non‑ASCII characters in URLs | Medium | URL Structure | Percent‑encoding ensures URLs are valid, crawlable, and correctly interpreted. |
| CHECK_REL_CANONICAL_PRESENT | Presence of rel="canonical" link element | Medium | Canonicalization | rel="canonical" link annotations influence how Google determines the canonical URL. |
| CHECK_URL_IN_SITEMAP | URL presence in sitemap.xml | Medium | Canonicalization | Presence of the URL in a sitemap is a factor influencing canonical selection. |
| CHECK_HTTP_HTTPS_CONSISTENCY | Consistent use of HTTPS scheme | Low | Canonicalization | The page's protocol (HTTP vs HTTPS) is a factor that influences canonicalization. |
| DATA_NOSNIPPET_VALID_HTML | Ensure HTML containing data-nosnippet attribute is well‑formed | High | Data Nosnippet | HTML section must be valid HTML for data‑nosnippet to be machine‑readable. |
| ROBOTS_TXT_ALLOW_INDEXING_RULES | URLs containing robots meta or X‑Robots‑Tag must not be disallowed | Critical | Robots.txt Rules | URLs with indexing/serving rules cannot be disallowed from crawling via robots.txt. |
| NO_CLOAKING_DETECTED | Ensure no cloaking between Googlebot and users | High | A/B Testing | Cloaking violates spam policies and can cause demotion or removal from search results. |
| REL_CANONICAL_PRESENT | Use rel="canonical" on test variant URLs | Medium | A/B Testing | rel=canonical signals the preferred URL, preventing duplicate indexing of test variants. |
| TEMPORARY_REDIRECT_302 | Use 302 redirects for temporary test redirects | Medium | A/B Testing | A 302 redirect signals a temporary change, ensuring the original URL remains indexed. |
| CANONICAL_LINK_IN_HEAD | rel="canonical" link element must be placed in <head> | High | Canonicalization | The rel="canonical" link element is only accepted if it appears in the <head> section. |
| CANONICAL_LINK_ABSOLUTE_URL | rel="canonical" link element must use an absolute URL | Medium | Canonicalization | Documentation recommends using absolute URLs for rel="canonical" link elements. |
| CANONICAL_HEADER_ABSOLUTE_URL | rel="canonical" HTTP header must use an absolute URL | Medium | Canonicalization | Documentation states that absolute URLs must be used in the rel="canonical" HTTP header. |
| AVOID_ROBOTS_TXT_FOR_CANONICAL | Do not use robots.txt for canonicalization | Low | Canonicalization | Documentation explicitly advises against using robots.txt for canonicalization. |
| CONSISTENT_CANONICAL_METHOD | Do not specify different canonical URLs using different methods | High | Canonicalization | Specifying different canonical URLs via different techniques can cause conflicts. |
| PHP_HEADERS_BEFORE_OUTPUT | Ensure HTTP redirect headers are sent before any body content in PHP redirects | High | Server-side redirects | The documentation states "You must set the headers before sending anything to the screen" for PHP redirects, requiring headers to precede any output. |
| NOINDEX_ON_LOGIN_PAGE | Ensure login pages include a noindex robots meta tag | High | Edit or remove unwanted text before moving to a public file format | Login pages may expose redacted content; a noindex meta tag prevents search engines from indexing them. |
| URL_NO_EMAIL | URLs must not contain email addresses | Medium | Edit or remove unwanted text before moving to a public file format | Email addresses in URLs can be indexed and expose personal information. |
| IMAGE_NON_VECTOR_FORMAT | Ensure exported images are in non-vector formats (PNG or WEBP) | Medium | Edit and export images before embedding them | Vector formats may retain hidden layers or metadata that can be indexed. |
| NOINDEX_META_TAG_PRESENT | Presence of noindex meta tag in HTML head | Low | Implementing noindex | A <meta name="robots" content="noindex"> tag placed in the <head> prevents search engines that support the noindex rule from indexing the page. |
| FILETYPE_INDEXABLE_CHECK | File extension is indexable by Google | Low | File types indexable by Google | Google can index the content of the listed text‑based and media file types; resources with other extensions may not be indexed. |
| REDIRECT_USES_PERMANENT_STATUS | Redirect uses permanent HTTP status | High | Redirects | Permanent redirects (301/308) preserve link equity and signal the move to Google. |
| REDIRECT_CHAIN_MAX_LENGTH | Redirect chain length limit | Medium | Redirects | Long redirect chains add latency and may exceed Googlebot's limit. |
| CANONICAL_SELF_REFERENCING | Self-referencing rel=canonical tag present | Medium | Canonical Tags | Self-referencing canonical informs Google of the preferred URL for the content. |
| HEAD_ALLOWED_ELEMENTS_MUST | Only allowed elements in <head> | High | Page Metadata | Google processes only the allowed elements in the <head>; any invalid element causes the rest of the metadata to be ignored. |
| HEAD_INVALID_ELEMENTS_ORDER_SHOULD | Place invalid <head> elements after allowed elements | Medium | Page Metadata | If an invalid element appears before allowed elements, Google stops reading further elements, causing later metadata to be ignored. |
| REL_ATTRIBUTE_ALLOWED_VALUES | Validate allowed rel attribute values on outbound links | Low | Qualify your outbound links to Google | Ensures that rel attributes on <a> elements use only the values documented (sponsored, ugc, nofollow) so Google can interpret link qualifications correctly. |
| ROBOTS_TXT_IMAGE_BLOCK | Block image URLs via robots.txt Disallow rule | Low | robots.txt | Robots.txt Disallow rules prevent Googlebot-Image from indexing specified image URLs, removing them from search results. |
| NOINDEX_HEADER_IMAGE_BLOCK | Block image URLs via noindex X-Robots-Tag header | High | noindex X-Robots-Tag | The noindex X-Robots-Tag header tells Googlebot not to index the image, but the URL must be crawlable for the header to be read. |
| NOINDEX_RULE_ABSENT | Ensure noindex robots rule is not present on new site pages | Medium | Prepare the new hosting infrastructure | Prevents accidental indexing of the test site before it goes live. |
| TEMPORARY_BLOCKS_REMOVED | Verify temporary crawling blocks are removed before launch | High | Start the move | Ensure the site is fully crawlable by Googlebot after the move. |
| SEARCH_CONSOLE_VERIFICATION_PRESENT | Verify Search Console verification assets are present on the new site | Medium | Prepare the new hosting infrastructure | Ownership verification must continue to work after the hosting move. |
| GOOGLEBOT_ACCESSIBLE | Confirm Googlebot can access the new site (HTTP 200) | Critical | Check that Googlebot is able to access the new hosting infrastructure | Googlebot must be able to retrieve pages to index them after the move. |
| RESOURCES_NOT_BLOCKED_BY_ROBOTS_TXT | Ensure resources are not blocked by robots.txt | High | Resources | Resources such as images, CSS, and JavaScript must be accessible to Google; if they are blocked by robots.txt Google cannot crawl the page properly. |
| HREFLANG_TAGS_PRESENT | hreflang annotations present for multilingual pages | Medium | Internationalized or multi-lingual sites | hreflang tags tell Google which language or regional version of a page to serve, preventing duplicate content issues across locales. |
| SITE_USES_HTTPS | Site should be served over HTTPS | High | Manage the user experience | HTTPS provides security for users and is recommended by Google as a ranking signal. |
| TRUE_404_FOR_NOT_FOUND | Return proper 404 status for missing pages | High | Migrating a single URL | A true 404 response signals to Google that a page is permanently unavailable; soft 404s can mislead indexing. |
| SEO_SHOULD_NOT_LINK_TO_SEO | Avoid linking to SEO provider | Medium | Helpful guidelines | Linking to an SEO provider can be considered a link scheme and may violate Google's policies. |
| SEO_SHOULD_EXPLAIN_FTP_CHANGES | SEO with FTP access must explain changes | Low | Helpful guidelines | Transparency about changes made via FTP ensures the site owner can verify compliance and avoid hidden manipulations. |
| IMG_ALT_TEXT | Ensure all images have descriptive alt text | Medium | Add images to your site, and optimize them | Alt text helps search engines understand image content and improves accessibility. |
| TITLE_ELEMENT_PRESENT | Verify presence of a <title> element | Medium | Influence your title links | The <title> element is used by Google to generate title links in search results. |
| META_DESCRIPTION_PRESENT | Ensure presence of a meta description tag | Low | Control your snippets | Meta description often supplies the snippet shown in search results. |
| LINK_TEXT_DESCRIPTIVE | Verify that anchor text is non‑empty and descriptive | Medium | Link to relevant resources | Descriptive anchor text helps users and search engines understand linked content. |
| ROBOTS_TXT_DISALLOW_CRAWL | Ensure page is not disallowed by robots.txt | High | Crawling | Pages disallowed by robots.txt cannot be crawled, which prevents Google from discovering and indexing them. |
| HTTP_STATUS_NOT_5XX | Verify page does not return a server error status | High | Crawling | HTTP 5xx responses indicate server errors that prevent Googlebot from successfully crawling the page. |
| META_ROBOTS_NOINDEX | Ensure meta robots tag does not block indexing | High | Indexing | A meta robots tag containing "noindex" tells Google not to index the page, preventing it from appearing in search results. |
| PAGE_EXPERIENCE_DIVERSITY | Avoid focusing on only one or two aspects of page experience | Medium | Provide a great page experience | Google advises site owners not to focus on only one or two aspects of page experience, but to provide an overall great experience across many signals. |
| AI_GENERATED_IMAGE_METADATA_MUST_CONTAIN_IPTC_DIGITAL_SOURCE_TYPE | AI-generated images must include IPTC DigitalSourceType metadata | Low | AI-generated image metadata | Ensures AI‑generated images are identifiable and comply with Google Merchant Center policies. |
| AI_GENERATED_PRODUCT_DATA_MUST_BE_LABELED | AI-generated product titles and descriptions must be labeled as AI-generated | Low | AI-generated product data | Guarantees transparency for users and compliance with Google Merchant Center AI content policies. |
| FAVICON_CRAWLABILITY | Ensure favicon and home page are crawlable by Googlebot | High | Guidelines | Googlebot-Image and Googlebot must be able to crawl the favicon file and the home page; blocking them prevents the favicon from appearing in search results. |
| FAVICON_DIMENSIONS | Verify favicon is square and at least 8x8 pixels | Medium | Guidelines | Google requires the favicon to be a square image with a minimum size of 8x8 px to be eligible for display in search results. |
| FAVICON_URL_STABILITY | Ensure favicon URL is stable and not frequently changed | Low | Guidelines | A stable favicon URL prevents Google from losing the association between the site and its favicon, ensuring consistent display in search results. |
| CANONICAL_SELF_LINK | Web Story must have self-referential canonical link | High | Check if the Web Story is indexed | A self‑referential canonical link tells Google the definitive URL for the story, enabling correct indexing and avoiding duplicate content issues. |
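These rules are deterministic enough to sketch directly. As a minimal, standard-library-only illustration (not the actual rule implementation, which may differ), the PAGE_TITLE_EXISTS check could look like this:

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text content of <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page_title_exists(html: str) -> bool:
    """PAGE_TITLE_EXISTS: pass only if <title> exists and is non-empty."""
    parser = TitleExtractor()
    parser.feed(html)
    return bool(parser.title.strip())


print(check_page_title_exists("<html><head><title>Home</title></head></html>"))  # True
print(check_page_title_exists("<html><head><title>  </title></head></html>"))    # False
```

A whitespace-only title fails the check, matching the "non-empty" wording in the rule description.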
Rule Categories and Included Rules
Technical Requirements
- PAGE_HTTP_200_STATUS (Critical): Page returns HTTP 200 status. Ensures the page responds with HTTP 200, not error or redirect.
- GOOGLEBOT_NOT_BLOCKED (High): Googlebot is not blocked by robots.txt. No Disallow rule for Googlebot (or *) matches the page URL.
- PAGE_INDEXABLE_CONTENT (Medium): Page has indexable textual content. Body contains at least one alphanumeric character after stripping markup.
- HEAD_SECTION_VALID_HTML (High): Head section must be valid HTML. The page contains exactly one <head> element and the HTML parses without syntax errors.
Spam Policies
- CLOAKING_DETECTION (High): Detect cloaking by comparing bot vs user content. Content served to Googlebot and regular users is substantially identical (≥ 90% similarity).
- HIDDEN_TEXT_DETECTION (Medium): Detect hidden text or links intended for search. No elements with hidden styles contain visible text or links.
- KEYWORD_STUFFING_DETECTION (Medium): Detect excessive repetition of keywords. No single keyword exceeds a density of 5% of total words.
- SNEAKY_REDIRECT_DETECTION (High): Detect sneaky redirects for bots vs users. Both HTTP status codes and final URLs are identical for Googlebot and regular users.
Content Basics
- PAGE_TITLE_EXISTS (Low): Page title tag present and non-empty. Ensures the <title> tag exists and is not empty.
- MAIN_HEADING_EXISTS (Low): Main heading (<h1>) present and non-empty. At least one <h1> element exists with non-empty text.
- CRAWLABLE_LINKS (Low): All anchor links are crawlable (have valid href). Every <a> element has a non-empty href that does not start with "javascript:".
- IMAGE_ALT_TEXT (Medium): All images have non-empty alt attributes. Every <img> element includes a non-empty alt attribute.
Content Optimization
- TITLE_DESCRIPTIVE (Medium): Page title is descriptive, specific, and accurate. Title text contains at least two words.
- META_DESCRIPTION_PRESENT (Medium): Meta description tag is present and descriptive. Meta description is present and its length is at least 50 characters.
- IMAGE_ALT_ATTRIBUTES (Medium): All images have descriptive alt attributes. Every <img> element has a non-empty alt attribute.
- HEADING_HIERARCHY (Low): Page uses heading elements for hierarchy. At least one <h1> element is present.
Dashboard Setup
- GA_FILTER_SOURCE_MEDIUM (Medium): GA data filtered to source=google, medium=organic. Both source=google and medium=organic filter conditions are present in dashboard config.
- DASHBOARD_METRICS_PRESENT (Medium): Dashboard includes required five metrics. All five required metrics are present in the dashboard configuration.
- DASHBOARD_DATA_SOURCES_CONNECTED (Medium): Dashboard connects to GA and SC. Both Search Console and Google Analytics data sources are referenced in dashboard config.
AMP Validation
- AMP_PAGE_MUST_FOLLOW_SPEC (Critical): AMP page must follow AMP HTML specification. Ensures the page complies with the AMP HTML specification for Google Search features.
Site Functionality
- BANNER_DATA_NOSNIPPET_PRESENT (Medium): Ensure banner or popup uses data-nosnippet attribute. Prevents banner or popup content from being shown in search result snippets.
Site Availability
- ROBOTS_TXT_NOT_503 (High): robots.txt must not return HTTP 503 status. A 503 response for robots.txt blocks all crawling, preventing indexing.
- RETRY_AFTER_HEADER_PRESENT_ON_503 (Medium): 503 error pages must include a Retry-After header. Provides crawlers with guidance on when to retry, reducing unnecessary load.
URL Structure
- NO_URL_FRAGMENTS (High): Avoid URL fragments that change content. Google Search may not crawl URLs where fragments are used to change content.
- HYPHENS_IN_PATH (Medium): Use hyphens to separate words in URL path. Hyphens improve readability for users and search engines, aiding crawlability.
- PERCENT_ENCODING_NECESSARY (Medium): Percent‑encode non‑ASCII characters in URLs. Percent‑encoding ensures URLs are valid, crawlable, and correctly interpreted.
Canonicalization
- CHECK_REL_CANONICAL_PRESENT (Medium): Presence of rel="canonical" link element. rel="canonical" link annotations influence how Google determines the canonical URL.
- CHECK_URL_IN_SITEMAP (Medium): URL presence in sitemap.xml. Presence of the URL in a sitemap is a factor influencing canonical selection.
- CHECK_HTTP_HTTPS_CONSISTENCY (Low): Consistent use of HTTPS scheme. The page's protocol (HTTP vs HTTPS) is a factor that influences canonicalization.
- CANONICAL_LINK_IN_HEAD (High): rel="canonical" link element must be placed in <head>. The rel="canonical" link element is only accepted if it appears in the <head> section.
- CANONICAL_LINK_ABSOLUTE_URL (Medium): rel="canonical" link element must use an absolute URL. Documentation recommends using absolute URLs for rel="canonical" link elements.
- CANONICAL_HEADER_ABSOLUTE_URL (Medium): rel="canonical" HTTP header must use an absolute URL. Documentation states that absolute URLs must be used in the rel="canonical" HTTP header.
- AVOID_ROBOTS_TXT_FOR_CANONICAL (Low): Do not use robots.txt for canonicalization. Documentation explicitly advises against using robots.txt for canonicalization.
- CONSISTENT_CANONICAL_METHOD (High): Do not specify different canonical URLs using different methods. Specifying different canonical URLs via different techniques can cause conflicts.
Data Nosnippet
- DATA_NOSNIPPET_VALID_HTML (High): Ensure HTML containing data-nosnippet attribute is well‑formed. HTML section must be valid HTML for data‑nosnippet to be machine‑readable.
Robots.txt Rules
- ROBOTS_TXT_ALLOW_INDEXING_RULES (Critical): URLs containing robots meta or X‑Robots‑Tag must not be disallowed. URLs with indexing/serving rules cannot be disallowed from crawling via robots.txt.
A/B Testing
- NO_CLOAKING_DETECTED (High): Ensure no cloaking between Googlebot and users. Cloaking violates spam policies and can cause demotion or removal from search results.
- REL_CANONICAL_PRESENT (Medium): Use rel="canonical" on test variant URLs. rel=canonical signals the preferred URL, preventing duplicate indexing of test variants.
- TEMPORARY_REDIRECT_302 (Medium): Use 302 redirects for temporary test redirects. A 302 redirect signals a temporary change, ensuring the original URL remains indexed.
Server-side redirects
- PHP_HEADERS_BEFORE_OUTPUT (High): Ensure HTTP redirect headers are sent before any body content in PHP redirects. The documentation states "You must set the headers before sending anything to the screen" for PHP redirects, requiring headers to precede any output.
Edit or remove unwanted text before moving to a public file format
- NOINDEX_ON_LOGIN_PAGE (High): Ensure login pages include a noindex robots meta tag. Login pages may expose redacted content; a noindex meta tag prevents search engines from indexing them.
- URL_NO_EMAIL (Medium): URLs must not contain email addresses. Email addresses in URLs can be indexed and expose personal information.
Edit and export images before embedding them
- IMAGE_NON_VECTOR_FORMAT (Medium): Ensure exported images are in non-vector formats (PNG or WEBP). Vector formats may retain hidden layers or metadata that can be indexed.
Implementing noindex
- NOINDEX_META_TAG_PRESENT (Low): Presence of noindex meta tag in HTML head. A <meta name="robots" content="noindex"> tag placed in the <head> prevents search engines that support the noindex rule from indexing the page.
File types indexable by Google
- FILETYPE_INDEXABLE_CHECK (Low): File extension is indexable by Google. Google can index the content of the listed text‑based and media file types; resources with other extensions may not be indexed.
Redirects
- REDIRECT_USES_PERMANENT_STATUS (High): Redirect uses permanent HTTP status. Permanent redirects (301/308) preserve link equity and signal the move to Google.
- REDIRECT_CHAIN_MAX_LENGTH (Medium): Redirect chain length limit. Long redirect chains add latency and may exceed Googlebot's limit.
Canonical Tags
- CANONICAL_SELF_REFERENCING (Medium): Self-referencing rel=canonical tag present. Self-referencing canonical informs Google of the preferred URL for the content.
Page Metadata
- HEAD_ALLOWED_ELEMENTS_MUST (High): Only allowed elements in <head>. Google processes only the allowed elements in the <head>; any invalid element causes the rest of the metadata to be ignored.
- HEAD_INVALID_ELEMENTS_ORDER_SHOULD (Medium): Place invalid <head> elements after allowed elements. If an invalid element appears before allowed elements, Google stops reading further elements, causing later metadata to be ignored.
Qualify your outbound links to Google
- REL_ATTRIBUTE_ALLOWED_VALUES (Low): Validate allowed rel attribute values on outbound links. Ensures that rel attributes on <a> elements use only the values documented (sponsored, ugc, nofollow) so Google can interpret link qualifications correctly.
robots.txt
- ROBOTS_TXT_IMAGE_BLOCK (Low): Block image URLs via robots.txt Disallow rule. Robots.txt Disallow rules prevent Googlebot-Image from indexing specified image URLs, removing them from search results.
noindex X-Robots-Tag
- NOINDEX_HEADER_IMAGE_BLOCK (High): Block image URLs via noindex X-Robots-Tag header. The noindex X-Robots-Tag header tells Googlebot not to index the image, but the URL must be crawlable for the header to be read.
Prepare the new hosting infrastructure
- NOINDEX_RULE_ABSENT (Medium): Ensure noindex robots rule is not present on new site pages. Prevents accidental indexing of the test site before it goes live.
- SEARCH_CONSOLE_VERIFICATION_PRESENT (Medium): Verify Search Console verification assets are present on the new site. Ownership verification must continue to work after the hosting move.
Start the move
- TEMPORARY_BLOCKS_REMOVED (High): Verify temporary crawling blocks are removed before launch. Ensure the site is fully crawlable by Googlebot after the move.
Check that Googlebot is able to access the new hosting infrastructure
- GOOGLEBOT_ACCESSIBLE (Critical): Confirm Googlebot can access the new site (HTTP 200). Googlebot must be able to retrieve pages to index them after the move.
Resources
- RESOURCES_NOT_BLOCKED_BY_ROBOTS_TXT (High): Ensure resources are not blocked by robots.txt. Resources such as images, CSS, and JavaScript must be accessible to Google; if they are blocked by robots.txt Google cannot crawl the page properly.
Internationalized or multi-lingual sites
- HREFLANG_TAGS_PRESENT (Medium): hreflang annotations present for multilingual pages. hreflang tags tell Google which language or regional version of a page to serve, preventing duplicate content issues across locales.
Manage the user experience
- SITE_USES_HTTPS (High): Site should be served over HTTPS. HTTPS provides security for users and is recommended by Google as a ranking signal.
Migrating a single URL
- TRUE_404_FOR_NOT_FOUND (High): Return proper 404 status for missing pages. A true 404 response signals to Google that a page is permanently unavailable; soft 404s can mislead indexing.
Helpful guidelines
- SEO_SHOULD_NOT_LINK_TO_SEO (Medium): Avoid linking to SEO provider. Linking to an SEO provider can be considered a link scheme and may violate Google's policies.
- SEO_SHOULD_EXPLAIN_FTP_CHANGES (Low): SEO with FTP access must explain changes. Transparency about changes made via FTP ensures the site owner can verify compliance and avoid hidden manipulations.
Add images to your site, and optimize them
- IMG_ALT_TEXT (Medium): Ensure all images have descriptive alt text. Alt text helps search engines understand image content and improves accessibility.
Influence your title links
- TITLE_ELEMENT_PRESENT (Medium): Verify presence of a <title> element. The <title> element is used by Google to generate title links in search results.
Control your snippets
- META_DESCRIPTION_PRESENT (Low): Ensure presence of a meta description tag. Meta description often supplies the snippet shown in search results.
Link to relevant resources
- LINK_TEXT_DESCRIPTIVE (Medium): Verify that anchor text is non‑empty and descriptive. Descriptive anchor text helps users and search engines understand linked content.
Crawling
- ROBOTS_TXT_DISALLOW_CRAWL (High): Ensure page is not disallowed by robots.txt. Pages disallowed by robots.txt cannot be crawled, which prevents Google from discovering and indexing them.
- HTTP_STATUS_NOT_5XX (High): Verify page does not return a server error status. HTTP 5xx responses indicate server errors that prevent Googlebot from successfully crawling the page.
Indexing
- META_ROBOTS_NOINDEX (High): Ensure meta robots tag does not block indexing. A meta robots tag containing "noindex" tells Google not to index the page, preventing it from appearing in search results.
Provide a great page experience
- PAGE_EXPERIENCE_DIVERSITY (Medium): Avoid focusing on only one or two aspects of page experience. Google advises site owners not to focus on only one or two aspects of page experience, but to provide an overall great experience across many signals.
AI-generated image metadata
- AI_GENERATED_IMAGE_METADATA_MUST_CONTAIN_IPTC_DIGITAL_SOURCE_TYPE (Low): AI-generated images must include IPTC DigitalSourceType metadata. Ensures AI‑generated images are identifiable and comply with Google Merchant Center policies.
AI-generated product data
- AI_GENERATED_PRODUCT_DATA_MUST_BE_LABELED (Low): AI-generated product titles and descriptions must be labeled as AI-generated. Guarantees transparency for users and compliance with Google Merchant Center AI content policies.
Guidelines
- FAVICON_CRAWLABILITY (High): Ensure favicon and home page are crawlable by Googlebot. Googlebot-Image and Googlebot must be able to crawl the favicon file and the home page; blocking them prevents the favicon from appearing in search results.
- FAVICON_DIMENSIONS (Medium): Verify favicon is square and at least 8x8 pixels. Google requires the favicon to be a square image with a minimum size of 8x8 px to be eligible for display in search results.
- FAVICON_URL_STABILITY (Low): Ensure favicon URL is stable and not frequently changed. A stable favicon URL prevents Google from losing the association between the site and its favicon, ensuring consistent display in search results.
Check if the Web Story is indexed
- CANONICAL_SELF_LINK (High): Web Story must have self-referential canonical link. A self‑referential canonical link tells Google the definitive URL for the story, enabling correct indexing and avoiding duplicate content issues.
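The thresholds quoted above are concrete enough to demonstrate. A minimal sketch of the KEYWORD_STUFFING_DETECTION density check (no single keyword above 5% of total words) might be the following; the tokenization is a simplifying assumption, and the real rule may normalize text differently:

```python
import re
from collections import Counter


def keyword_density_ok(text: str, max_density: float = 0.05) -> bool:
    """KEYWORD_STUFFING_DETECTION sketch: fail if any single word
    exceeds max_density of total words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return True  # nothing to measure
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) <= max_density


# "cheap" is 3 of 12 words (25%), far above the 5% limit -> False
print(keyword_density_ok(
    "cheap shoes cheap boots cheap sandals for sale in our online store"))
```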
Analysis Workflows
The SEO engine supports two main analysis workflows:
Workflow A: Analyze Website URL
When provided with a website URL, extract all required data and apply SEO rules:
1. Prepare Input Data
Use the input preparation scripts to extract all required data from the target website:
cd scripts/prepare_input/
# Set your target URL
URL="https://example.com"
# Extract all required data (takes 1-3 minutes)
python fetch_html.py "$URL"        # Gets HTML source code
python fetch_robots_txt.py "$URL"  # Gets crawling permissions
python fetch_sitemap.py "$URL"     # Gets site structure
This creates input files:
- example.com.html - Page source for content analysis
- example.com_robots.txt - Crawling rules and restrictions
- example.com_sitemap.xml - Site URL inventory
2. Apply SEO Rules
With the input data prepared, check the relevant SEO rules in the rules/ directory. The agent will:
- Analyze the downloaded files - Read and understand the content structure
- Select relevant rules - Choose applicable rules based on available file types
- Apply rule logic - Execute the checks described in each rule's documentation, going through all rules in the rules directory and applying every rule whose required input files are available, unless it is specifically excluded
- Report findings - Provide pass/fail results with actionable recommendations
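The rule-selection step above can be sketched as a lookup against each rule's declared input requirements. The mapping below is a hypothetical in-memory stand-in for the `inputFields` YAML frontmatter that the real rule files carry:

```python
# Illustrative subset: real requirements live in each rules/*.md frontmatter.
RULE_INPUTS = {
    "PAGE_TITLE_EXISTS": {"html"},
    "META_ROBOTS_NOINDEX": {"html"},
    "ROBOTS_TXT_DISALLOW_CRAWL": {"robotsTxt"},
    "CHECK_URL_IN_SITEMAP": {"html", "sitemap"},
}

def select_applicable_rules(available_inputs: set, excluded: set = frozenset()) -> list:
    """Return every rule whose required inputs are all available, unless excluded."""
    return sorted(
        name for name, required in RULE_INPUTS.items()
        if required <= available_inputs and name not in excluded
    )
```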
Workflow B: Analyze Existing Files
When you already have HTML files, a robots.txt file, or a sitemap.xml file, the agent will directly apply SEO rules by reading the rule definitions and checking the provided files.
1. File Requirements
The SEO engine accepts these file types:
- HTML files (.html, .htm) - Webpage source code for content analysis
- robots.txt files - Crawling permissions and restrictions
- Sitemap files (.xml) - Site URL structure and priorities
2. Intelligent Rule Application
The agent will:
- Analyze the provided files - Read and understand the content structure
- Select relevant rules - Choose applicable rules based on available file types
- Apply rule logic - Execute the checks described in each rule's documentation, going through all rules in the rules directory and applying every rule whose required input files are available, unless it is specifically excluded
- Report findings - Provide pass/fail results with actionable recommendations
- Fix issues when possible - For certain rules, the agent can suggest or implement fixes directly in the files
3. Supported Analysis Types
Single HTML File Analysis:
Provide an HTML file and the agent will check:
- Page title existence and quality
- Heading hierarchy (h1, h2, etc.)
- Image alt attributes
- Link crawlability
- Favicon dimensions and format
- Content structure and indexability
- And other HTML-based rules...
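Several of the single-file checks above can be collected in one parsing pass. This sketch uses the standard-library `html.parser` module; the class and function names are illustrative, not the skill's actual code:

```python
from html.parser import HTMLParser

class BasicSEOAudit(HTMLParser):
    """Collect the title, h1 count, and images missing alt text in one pass."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.h1_count = 0
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "img" and not dict(attrs).get("alt"):
            self.images_missing_alt += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def audit_html(html: str) -> dict:
    """Map a few rule names to pass/fail booleans for one HTML document."""
    parser = BasicSEOAudit()
    parser.feed(html)
    return {
        "PAGE_TITLE_EXISTS": bool(parser.title.strip()),
        "MAIN_HEADING_EXISTS": parser.h1_count >= 1,
        "IMAGE_ALT_TEXT": parser.images_missing_alt == 0,
    }
```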
HTML + robots.txt Analysis:
Provide both files and the agent will additionally check:
- Resource blocking by robots.txt
- Page crawlability permissions
- robots.txt syntax and rules
- And other crawling-related rules...
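The crawlability side of this combined analysis can lean on the standard-library `urllib.robotparser` module. A minimal sketch (function name illustrative):

```python
from urllib.robotparser import RobotFileParser

def check_googlebot_allowed(robots_txt: str, url: str) -> dict:
    """Pass if Googlebot may fetch the given URL under these robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch("Googlebot", url)
    return {"rule": "GOOGLEBOT_NOT_BLOCKED", "passed": allowed,
            "detail": f"Googlebot {'allowed' if allowed else 'blocked'} for {url}"}
```

Note that `can_fetch` applies the most specific matching user-agent group, so a dedicated `User-agent: Googlebot` section would override the `*` rules.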
Individual Rule Documentation
Each rule is documented in detail in the rules/ directory. The agent can access and explain any rule:
rules/PAGE_TITLE_EXISTS.md
rules/MAIN_HEADING_EXISTS.md
rules/IMAGE_ALT_TEXT.md
rules/CRAWLABLE_LINKS.md
rules/GOOGLEBOT_NOT_BLOCKED.md
rules/PAGE_HTTP_200_STATUS.md
rules/PAGE_INDEXABLE_CONTENT.md
rules/CLOAKING_DETECTION.md
rules/HIDDEN_TEXT_DETECTION.md
rules/KEYWORD_STUFFING_DETECTION.md
rules/SNEAKY_REDIRECT_DETECTION.md
rules/TITLE_DESCRIPTIVE.md
rules/META_DESCRIPTION_PRESENT.md
rules/IMAGE_ALT_ATTRIBUTES.md
rules/HEADING_HIERARCHY.md
rules/GA_FILTER_SOURCE_MEDIUM.md
rules/DASHBOARD_METRICS_PRESENT.md
rules/DASHBOARD_DATA_SOURCES_CONNECTED.md
rules/HEAD_SECTION_VALID_HTML.md
rules/AMP_PAGE_MUST_FOLLOW_SPEC.md
rules/BANNER_DATA_NOSNIPPET_PRESENT.md
rules/ROBOTS_TXT_NOT_503.md
rules/RETRY_AFTER_HEADER_PRESENT_ON_503.md
rules/NO_URL_FRAGMENTS.md
rules/HYPHENS_IN_PATH.md
rules/PERCENT_ENCODING_NECESSARY.md
rules/CHECK_REL_CANONICAL_PRESENT.md
rules/CHECK_URL_IN_SITEMAP.md
rules/CHECK_HTTP_HTTPS_CONSISTENCY.md
rules/DATA_NOSNIPPET_VALID_HTML.md
rules/ROBOTS_TXT_ALLOW_INDEXING_RULES.md
rules/NO_CLOAKING_DETECTED.md
rules/REL_CANONICAL_PRESENT.md
rules/TEMPORARY_REDIRECT_302.md
rules/CANONICAL_LINK_IN_HEAD.md
rules/CANONICAL_LINK_ABSOLUTE_URL.md
rules/CANONICAL_HEADER_ABSOLUTE_URL.md
rules/AVOID_ROBOTS_TXT_FOR_CANONICAL.md
rules/CONSISTENT_CANONICAL_METHOD.md
rules/PHP_HEADERS_BEFORE_OUTPUT.md
rules/NOINDEX_ON_LOGIN_PAGE.md
rules/URL_NO_EMAIL.md
rules/IMAGE_NON_VECTOR_FORMAT.md
rules/NOINDEX_META_TAG_PRESENT.md
rules/FILETYPE_INDEXABLE_CHECK.md
rules/REDIRECT_USES_PERMANENT_STATUS.md
rules/REDIRECT_CHAIN_MAX_LENGTH.md
rules/CANONICAL_SELF_REFERENCING.md
rules/HEAD_ALLOWED_ELEMENTS_MUST.md
rules/HEAD_INVALID_ELEMENTS_ORDER_SHOULD.md
rules/REL_ATTRIBUTE_ALLOWED_VALUES.md
rules/ROBOTS_TXT_IMAGE_BLOCK.md
rules/NOINDEX_HEADER_IMAGE_BLOCK.md
rules/NOINDEX_RULE_ABSENT.md
rules/TEMPORARY_BLOCKS_REMOVED.md
rules/SEARCH_CONSOLE_VERIFICATION_PRESENT.md
rules/GOOGLEBOT_ACCESSIBLE.md
rules/RESOURCES_NOT_BLOCKED_BY_ROBOTS_TXT.md
rules/HREFLANG_TAGS_PRESENT.md
rules/SITE_USES_HTTPS.md
rules/TRUE_404_FOR_NOT_FOUND.md
rules/SEO_SHOULD_NOT_LINK_TO_SEO.md
rules/SEO_SHOULD_EXPLAIN_FTP_CHANGES.md
rules/IMG_ALT_TEXT.md
rules/TITLE_ELEMENT_PRESENT.md
rules/LINK_TEXT_DESCRIPTIVE.md
rules/ROBOTS_TXT_DISALLOW_CRAWL.md
rules/HTTP_STATUS_NOT_5XX.md
rules/META_ROBOTS_NOINDEX.md
rules/PAGE_EXPERIENCE_DIVERSITY.md
rules/AI_GENERATED_IMAGE_METADATA_MUST_CONTAIN_IPTC_DIGITAL_SOURCE_TYPE.md
rules/AI_GENERATED_PRODUCT_DATA_MUST_BE_LABELED.md
rules/FAVICON_CRAWLABILITY.md
rules/FAVICON_DIMENSIONS.md
rules/FAVICON_URL_STABILITY.md
rules/CANONICAL_SELF_LINK.md
Each rule file contains:
- Brief explanation of why it matters
- Incorrect example with explanation
- Correct example with explanation
- Additional context and references
Ask the agent about any specific rule for detailed information, examples, and guidance.
Complete SEO Analysis
The agent provides comprehensive SEO analysis by:
- Reading all rule definitions from the rules directory
- Understanding each rule's logic and requirements
- Applying appropriate rules based on your input files
- Providing detailed pass/fail results with actionable recommendations
- Explaining rule violations with examples and fixes
Simply provide your files or website URL and ask for SEO analysis!