# pSEO Data Architecture
Design and implement the structured data layer that feeds all programmatic SEO pages. This is the foundation every other pSEO skill depends on.
## Core Principles
- Single source of truth: All page data flows from one data layer
- SEO-complete models: Every content model includes all fields needed for metadata, schema markup, and linking
- Unique slugs by construction: Slug generation enforces uniqueness at the data level
- Type safety: All data models are fully typed (TypeScript interfaces/types)
- Separation of concerns: Data fetching is decoupled from page rendering
## Implementation Steps

### 1. Define Content Models
Create TypeScript interfaces for each page type using a two-tier model. The lightweight index tier is safe to hold in memory for all pages; the heavy full tier is loaded per-page only.
```typescript
// Index tier: safe to load all at once (~1KB per page)
interface PageIndex {
  slug: string;            // unique, URL-safe
  title: string;           // page title (50-60 chars target)
  metaDescription: string; // meta description (150-160 chars target)
  h1: string;              // primary heading (can differ from title)
  canonicalPath: string;   // canonical URL path
  category: string;        // for hub-spoke and breadcrumbs
  lastModified: string;    // ISO date for sitemap
}

// Full tier: extends PageIndex with heavy fields (~50-500KB per page)
interface BaseSEOContent extends PageIndex {
  introText: string;
  bodyContent: string;
  faqs?: FAQ[];
  relatedSlugs?: string[];
  featuredImage?: SEOImage;
}
```
Extend `BaseSEOContent` for each page type with domain-specific fields. The interfaces above show the minimum required fields. See `references/content-models.md` for the full definitions (which add `subcategory`, `tags`, `publishedDate`, `status`, and more) and extended type examples (`LocationPage`, `ProductPage`, `ComparisonPage`, `CategoryPage`).
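For illustration, a location page type might extend the base model as sketched below; these fields are invented for the example, not the canonical definitions from `references/content-models.md`:

```typescript
// Hypothetical domain extension - field names are illustrative only.
interface LocationPage extends BaseSEOContent {
  city: string;
  region: string;
  country: string;
  latitude: number;
  longitude: number;
  nearbySlugs: string[]; // feeds geo-adjacent internal links
}
```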
### 2. Build the Data-Fetching Layer
Create a centralized data module (e.g., `lib/data.ts` or `src/data/index.ts`) that exports:
- `getAllSlugs()` - Returns all valid slugs for static generation. Must handle pagination internally when the data source has 1000+ records (fetch in batches, return the complete list).
- `getPageData(slug)` - Returns full content for a single page
- `getPagesByCategory(category, opts?)` - Returns pages in a category for hub pages. Accepts optional `limit` and `offset` for paginated hub pages.
- `getRelatedPages(slug, limit?)` - Returns related pages for internal linking
- `getAllCategories()` - Returns all categories for navigation and hubs
- `getPageCount()` - Returns total page count (useful for sitemap splitting and build diagnostics)
All functions must be:
- Cached or memoized during build to avoid redundant reads
- Typed with explicit return types
- Guarded against missing or malformed data
- Internally paginated when the data source imposes limits (e.g., CMS APIs with 100-item pages). The consumer should never need to handle pagination — the data layer abstracts it.
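A minimal sketch of this module's shape, assuming in-memory index loading (fine below ~10K pages; see Scale Limits); `fetchIndexBatch()` is a placeholder for the real data-source call:

```typescript
// lib/data/index.ts - illustrative shape only, not a drop-in implementation.
import type { PageIndex } from './types';

// Placeholder: replace with the actual data-source read (JSON, CMS, SQL).
declare function fetchIndexBatch(offset: number, limit: number): Promise<PageIndex[]>;

// Memoize the load as a promise so concurrent callers share one fetch.
let indexCache: Promise<PageIndex[]> | null = null;

function loadIndex(): Promise<PageIndex[]> {
  indexCache ??= (async () => {
    const all: PageIndex[] = [];
    // Paginate internally; consumers never see batches.
    for (let offset = 0; ; offset += 250) {
      const batch = await fetchIndexBatch(offset, 250);
      all.push(...batch);
      if (batch.length < 250) break;
    }
    return all;
  })();
  return indexCache;
}

export async function getAllSlugs(): Promise<string[]> {
  return (await loadIndex()).map((p) => p.slug);
}

export async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  const matches = (await loadIndex()).filter((p) => p.category === category);
  const start = opts?.offset ?? 0;
  return matches.slice(start, opts?.limit !== undefined ? start + opts.limit : undefined);
}
```

The remaining exports follow the same pattern: answer from the memoized index where possible, and hit the source only for full-tier content in `getPageData()`.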
### 3. Implement Slug Generation
Design a slug strategy that:
- Produces URL-safe, lowercase, hyphenated strings
- Guarantees uniqueness across the entire dataset
- Is deterministic (same input always produces same slug)
- Includes a collision detection mechanism
- Follows a consistent URL hierarchy (e.g., `/category/page-slug`)
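A sketch of one way to meet these constraints; the normalization rules (NFKD fold, hyphen collapsing) are a reasonable default rather than a mandated recipe:

```typescript
// lib/data/slugs.ts - deterministic slug generation with collision detection.
export function slugify(input: string): string {
  return input
    .toLowerCase()
    .normalize('NFKD')                // split accented chars into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop combining diacritics
    .replace(/[^a-z0-9]+/g, '-')      // collapse non-alphanumerics to hyphens
    .replace(/^-+|-+$/g, '');         // trim leading/trailing hyphens
}

// Fail the build on collision rather than silently appending suffixes,
// so duplicates get fixed at the data source.
export function buildSlugMap(titles: string[]): Map<string, string> {
  const bySlug = new Map<string, string>();
  for (const title of titles) {
    const slug = slugify(title);
    const existing = bySlug.get(slug);
    if (existing !== undefined) {
      throw new Error(`Slug collision: "${title}" and "${existing}" both map to "${slug}"`);
    }
    bySlug.set(slug, title);
  }
  return bySlug;
}
```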
### 4. Validate Data Integrity
Build a validation function or script that checks:
- No duplicate slugs exist
- All required fields are present and non-empty
- Title and description lengths are within SEO targets
- All category references resolve to valid categories
- No orphan pages (pages not reachable through any category)
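A sketch of such a check, assuming the index-tier records and category list come from the data layer; the length checks mirror the targets from step 1:

```typescript
// scripts/validate-data.ts - illustrative integrity checks over index data.
import type { PageIndex } from '../lib/data/types';

export function validateDataset(pages: PageIndex[], categories: string[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  const known = new Set(categories);

  for (const p of pages) {
    if (seen.has(p.slug)) errors.push(`duplicate slug: ${p.slug}`);
    seen.add(p.slug);

    if (!p.title.trim() || !p.metaDescription.trim() || !p.h1.trim())
      errors.push(`${p.slug}: required field is empty`);
    if (p.title.length > 60)
      errors.push(`${p.slug}: title is ${p.title.length} chars (target 50-60)`);
    if (p.metaDescription.length > 160)
      errors.push(`${p.slug}: description is ${p.metaDescription.length} chars (target 150-160)`);

    // A page whose category doesn't resolve is unreachable from any hub.
    if (!known.has(p.category))
      errors.push(`${p.slug}: category "${p.category}" does not resolve (orphan page)`);
  }
  return errors;
}
```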
### 5. Set Up Data Source Integration
Based on the data source ($ARGUMENTS or detected):
JSON files: Create a data/ directory with typed JSON, a loader, and build-time validation.
CMS (headless): Create API client with typed responses, implement caching, handle pagination for 1000+ items.
Database: Create a query layer with connection pooling, implement cursor-based pagination, add query caching.
MDX files: Set up frontmatter schema validation, create a content loader with gray-matter parsing.
API: Create a typed API client, implement rate limiting and retry logic, add response caching.
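As a concrete example of one integration, here is a minimal MDX loader sketch using gray-matter; the required-field list mirrors `PageIndex` from step 1, and treating `introText` as a frontmatter field is an assumption of this sketch:

```typescript
// lib/data/fetcher.ts (MDX variant) - frontmatter validation sketch.
import { readFile } from 'node:fs/promises';
import path from 'node:path';
import matter from 'gray-matter';
import type { BaseSEOContent } from './types';

const REQUIRED = [
  'slug', 'title', 'metaDescription', 'h1',
  'canonicalPath', 'category', 'lastModified', 'introText',
] as const;

export async function loadMdxPage(filePath: string): Promise<BaseSEOContent> {
  const raw = await readFile(filePath, 'utf8');
  const { data, content } = matter(raw);

  // Reject files with missing or empty frontmatter fields at build time.
  for (const field of REQUIRED) {
    if (typeof data[field] !== 'string' || data[field].trim() === '') {
      throw new Error(`${path.basename(filePath)}: missing frontmatter field "${field}"`);
    }
  }

  // The MDX body becomes bodyContent; everything else rides in frontmatter.
  return { ...(data as Omit<BaseSEOContent, 'bodyContent'>), bodyContent: content };
}
```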
## Scale Limits
The in-memory and file-based patterns in this skill work up to ~10K pages. Beyond that:
- 10K-50K pages: Requires a database (PostgreSQL, MySQL). In-memory index tier becomes borderline at 50K (~50MB). File-based data sources are too slow.
- 50K-100K+ pages: Requires database + cache layer (Redis) + cursor-based pagination. `getAllSlugs()` must use cursor iteration, not array return. Data sufficiency gating prevents generating thin pages.
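A sketch of what cursor iteration can look like as an async generator; the table, columns, and Postgres-style `query()` helper are placeholders for your actual database client:

```typescript
// Yield slugs in keyset-paginated batches instead of one giant array.
declare function query(sql: string, params: unknown[]): Promise<{ slug: string }[]>;

export async function* iterateSlugs(batchSize = 1000): AsyncGenerator<string> {
  let cursor: string | null = null;
  for (;;) {
    const rows = await query(
      'SELECT slug FROM pages WHERE ($1::text IS NULL OR slug > $1) ORDER BY slug LIMIT $2',
      [cursor, batchSize]
    );
    for (const row of rows) yield row.slug;
    if (rows.length < batchSize) return; // final partial batch
    cursor = rows[rows.length - 1].slug; // keyset cursor, not OFFSET
  }
}
```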
See pseo-scale for the complete database-backed data layer, sufficiency scoring, and scale-specific patterns.
## Memory-Conscious Data Patterns
At 1000+ pages, how data is loaded matters more than what is loaded. A full content model with body text, FAQs, and images can be 50-500KB per page. Loading all pages into memory simultaneously will OOM.
Two-tier data model:
Split the data layer into lightweight index data and full page data. The `PageIndex` and `BaseSEOContent` interfaces from section 1 define the two tiers:
- `getAllSlugs()`, `getRelatedPages()`, `getPagesByCategory()`: return `PageIndex[]` (lightweight, ~1KB per page)
- `getPageData()`: returns `BaseSEOContent` (or an extended type) for a single page (heavy, ~50-500KB per page, only one at a time)
Never do this:
```typescript
// Loads ALL full content into memory — will OOM at scale
const allPages = await Promise.all(slugs.map(s => getPageData(s)));
```
Instead:
```typescript
// Process pages one at a time or in small batches
for (const slug of slugs) {
  const page = await getPageData(slug);
  await processPage(page);
  // page is GC'd after each iteration
}
```
CMS/API pagination:
- Fetch in batches of 100-250 records
- Yield or push to an array incrementally — don't hold all API responses in memory simultaneously
- If using GraphQL, only request index fields in list queries, full fields in single-item queries
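A sketch combining these three points; `cmsList()` is a placeholder for your CMS client's list endpoint:

```typescript
import type { PageIndex } from './types';

// Placeholder for the real CMS list call, restricted to the given fields.
declare function cmsList(opts: {
  offset: number;
  limit: number;
  fields: string[];
}): Promise<PageIndex[]>;

// Yield each batch so it can be processed and released before the next fetch.
export async function* fetchIndexBatches(limit = 250): AsyncGenerator<PageIndex[]> {
  for (let offset = 0; ; offset += limit) {
    const batch = await cmsList({
      offset,
      limit,
      // Index-tier fields only; body content is fetched per page later.
      fields: ['slug', 'title', 'metaDescription', 'h1', 'canonicalPath', 'category', 'lastModified'],
    });
    if (batch.length === 0) return;
    yield batch;
    if (batch.length < limit) return;
  }
}
```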
## File Organization
```
lib/
  data/
    index.ts       # public API (re-exports)
    types.ts       # TypeScript interfaces
    fetcher.ts     # data source integration
    slugs.ts       # slug generation and validation
    validation.ts  # data integrity checks
    cache.ts       # build-time caching utilities
```
## Quality Checks
Before considering this complete:
- All content models extend `BaseSEOContent` (which extends `PageIndex`)
- `getAllSlugs()` returns 0 duplicates
- Data validation passes with zero errors
- Data layer exports are fully typed with no `any`
- Fetching is memoized for build performance
- A test or script can validate the full dataset
- Two-tier data model implemented (index data vs. full page data)
- No function loads all full page content into memory simultaneously
- CMS/API fetching uses batched pagination internally
## Relationship to Other Skills
This skill provides the data foundation for:
- pseo-templates: Consumes `getPageData()` and `getAllSlugs()`
- pseo-metadata: Reads title, description, and canonical path from content models
- pseo-schema: Uses structured fields for JSON-LD generation
- pseo-linking: Uses `getRelatedPages()` and category data
- pseo-quality-guard: Validates against the content models