database-schema-design
Database Schema Design
Overview
Guide the design, implementation, and optimization of database schemas with sound data modeling, safe migrations, effective indexing, and appropriate query patterns. This skill covers the full lifecycle from conceptual modeling through physical optimization, ensuring schemas that are normalized, performant, and safely evolvable.
Announce at start: "I'm using the database-schema-design skill to design the database schema."
Phase 1: Discovery and Conceptual Model
Ask these questions to understand the data requirements:
| # | Question | What It Determines |
|---|---|---|
| 1 | What entities does the system manage? | Table names |
| 2 | What are the relationships between entities? | Foreign keys, join tables |
| 3 | What are the key attributes of each entity? | Column definitions |
| 4 | What are the primary query patterns? | Index strategy |
| 5 | What is the expected data volume? (rows, growth rate) | Partitioning, scaling |
| 6 | What is the read/write ratio? | Normalization vs denormalization |
| 7 | SQL or NoSQL? (or both?) | Storage engine selection |
Storage Engine Decision Table
| Factor | Choose SQL (PostgreSQL, MySQL) | Choose Document (MongoDB) | Choose Key-Value (Redis) |
|---|---|---|---|
| Data shape | Structured, relational | Semi-structured, nested | Simple lookups, caching |
| Query complexity | Complex joins, aggregations | Document-level queries | Key-based access only |
| Consistency needs | ACID required | Eventual consistency OK | Ephemeral or cached data |
| Schema evolution | Migrations manageable | Schema-free flexibility | No schema |
| Scale pattern | Vertical first, then read replicas | Horizontal sharding | In-memory, limited size |
STOP after discovery — present the conceptual model (entities, relationships, cardinality) for confirmation.
Phase 2: Logical Model Design
Translate the conceptual model into tables, columns, types, and constraints.
Column Design Rules
| Decision | Guidance |
|---|---|
| Primary keys | UUIDs for distributed systems, auto-increment for single-node |
| Column types | Use the most specific type (timestamptz not varchar for dates) |
| Nullability | Default NOT NULL; allow NULL only when absence is meaningful |
| Defaults | Set sensible defaults (created_at DEFAULT now()) |
| Constraints | Add CHECK, UNIQUE, and FK constraints at the schema level |
| Naming | snake_case, singular table names or plural — be consistent |
Normalization Guide
| Normal Form | Rule | Violation Example | Fix |
|---|---|---|---|
| 1NF | Atomic values, no repeating groups | tags VARCHAR "urgent,priority,vip" |
Separate order_tags table |
| 2NF | All non-key columns depend on entire PK | product_name in order_items (composite PK) |
Move to products table |
| 3NF | No transitive dependencies | city depends on zip_code, not user_id |
Separate zip_codes table |
Rule: Always start normalized. Denormalize only with measured evidence.
Denormalization Decision Table
| Scenario | Pattern | When to Apply |
|---|---|---|
| Read-heavy dashboards | Materialized views or summary tables | Measured slow query |
| Frequently joined data | Embed as JSONB column | Join is >80% of query time |
| Reporting / analytics | Separate denormalized reporting tables | OLAP workload |
| Caching layer | Computed columns refreshed on write | High-frequency reads |
Relationship Patterns
| Relationship | Implementation | Index Needed |
|---|---|---|
| One-to-One | FK with UNIQUE constraint on child | On FK column |
| One-to-Many | FK on the "many" side | On FK column |
| Many-to-Many | Junction/join table with composite PK | On both FK columns |
| Polymorphic | Separate FK columns with CHECK constraint (preferred) or type+id pattern | On type+id or each FK |
| Self-referential (trees) | parent_id FK to same table; or ltree/materialized path |
On parent_id or path |
STOP after logical model — present the table definitions for review.
Phase 3: Physical Model and Indexing
Index Type Decision Table
| Index Type | Best For | Example |
|---|---|---|
| B-tree (default) | Equality and range queries | CREATE INDEX idx_users_email ON users(email) |
| GIN | Full-text search, JSONB, arrays | CREATE INDEX idx_posts_search ON posts USING GIN(to_tsvector('english', body)) |
| Partial | Subset of rows matching condition | CREATE INDEX idx_active_users ON users(email) WHERE active = true |
| Covering (INCLUDE) | Index-only scans avoiding table lookup | CREATE INDEX idx_users_email ON users(email) INCLUDE (name) |
| Composite | Multi-column queries | CREATE INDEX idx_orders ON orders(tenant_id, status) |
Composite Index Column Order
| Position | Column Type | Reason |
|---|---|---|
| First | High-cardinality equality columns | Most selective filter first |
| Middle | Additional equality columns | Further narrows results |
| Last | Range columns (dates, numbers) | Range scan on remaining rows |
Rule: A composite index on (A, B, C) supports queries on A, A+B, A+B+C — but NOT B alone or C alone.
Query Optimization Checklist
| Signal in EXPLAIN ANALYZE | Problem | Fix |
|---|---|---|
| Seq Scan on large table | Missing index | Add appropriate index |
| Nested Loop with large outer table | Inefficient join | Add index or restructure query |
| High actual vs estimated rows | Stale statistics | Run ANALYZE on table |
| Hash Join high memory | work_mem too low |
Tune work_mem or restructure |
N+1 Detection and Prevention
-- N+1 problem (bad):
SELECT * FROM users;
-- Then for EACH user: SELECT * FROM orders WHERE user_id = ?;
-- Fixed with join:
SELECT u.*, o.* FROM users u LEFT JOIN orders o ON o.user_id = u.id;
-- Fixed with batch load:
SELECT * FROM orders WHERE user_id = ANY($1);
STOP after physical model — present indexes and optimization strategy for review.
Phase 4: Migration Strategy
Zero-Downtime Migration (Expand-Contract)
Never make a breaking change in a single migration. Use two phases:
Expand phase (backward compatible):
- Add new column/table (nullable or with default)
- Deploy code that writes to both old and new
- Backfill existing data in batches
- Deploy code that reads from new
Contract phase (after all code uses new schema):
- Remove code that writes to old
- Drop old column/table
Migration Safety Rules
| Rule | Rationale |
|---|---|
| Every migration has a corresponding rollback | Safe to revert |
| Test rollback in staging before production | Verify reversibility |
| Data-destructive rollbacks need explicit approval | Prevent accidental data loss |
| Keep migration files immutable once applied | Reproducible state |
| Backfill large tables in batches (1000 rows) | Avoid table locks |
Backfill Pattern
-- Backfill in chunks of 1000
UPDATE users SET display_name = username
WHERE display_name IS NULL
AND id IN (SELECT id FROM users WHERE display_name IS NULL LIMIT 1000);
Migration Type Decision Table
| Change Type | Safe Approach | Dangerous Approach |
|---|---|---|
| Add column | Add nullable or with default | Add NOT NULL without default |
| Remove column | Expand-contract (two deploys) | Drop column directly |
| Rename column | Add new, copy data, drop old | ALTER RENAME (breaks queries) |
| Add index | CREATE INDEX CONCURRENTLY |
CREATE INDEX (locks table) |
| Change column type | Add new column, migrate data | ALTER COLUMN TYPE (locks table) |
STOP after migration plan — confirm rollback strategy before finalizing.
Phase 5: Save and Transition
After explicit approval:
- Save schema design to
docs/database/or generate migration files - Commit with message:
docs(db): add schema design for <feature>
Transition Decision Table
| User Intent | Next Skill | Rationale |
|---|---|---|
| "Create the migrations" | planning |
Plan migration implementation |
| "Write specs for this" | spec-writing |
Behavioral specs for data operations |
| "Implement the schema" | test-driven-development |
TDD with migration tests |
| "Just save the design" | None | Schema design is the deliverable |
| "Review for performance" | performance-optimization |
Analyze query patterns |
ORM Guidance
| ORM | Language | Strength | Watch Out For |
|---|---|---|---|
| Prisma | TypeScript | Type-safe schema, migrations | N+1 in nested queries, limited raw SQL |
| Drizzle | TypeScript | SQL-like API, lightweight | Newer ecosystem, fewer guides |
| SQLAlchemy | Python | Mature, flexible, raw SQL support | Complex session management |
| GORM | Go | Convention-based, auto-migrate | Silent failures, implicit behavior |
ORM Best Practices
- Always review generated SQL (enable query logging in development)
- Use eager loading to prevent N+1 queries
- Write raw SQL for complex queries rather than fighting the ORM
- Use ORM migrations, not auto-sync in production
- Test query performance with realistic data volumes
Connection Pooling
- Use a connection pooler (PgBouncer, built-in pool)
- Pool size formula:
connections = (CPU cores * 2) + disk spindles - Use transaction-level pooling for most workloads
- Application servers should not open raw connections
Anti-Patterns / Common Mistakes
| Mistake | Why It Is Wrong | What To Do Instead |
|---|---|---|
| No foreign key constraints | Orphaned data, broken relationships | Always define FK constraints |
| VARCHAR for everything | Loses type safety, wastes storage | Use specific types (timestamptz, int, uuid) |
| No indexes on FK columns | Slow joins on related tables | Index every FK column |
| Premature denormalization | Complexity without measured benefit | Start normalized, denormalize with evidence |
| Dropping columns directly | Breaks running application code | Use expand-contract pattern |
CREATE INDEX without CONCURRENTLY |
Locks table during index creation | Always use CONCURRENTLY in production |
| Auto-sync schema in production | Unpredictable destructive changes | Use explicit migration files |
| No rollback plan for migrations | Cannot recover from failed deploy | Write down migration for every up migration |
| Nullable columns everywhere | Loses data integrity guarantees | Default NOT NULL, allow NULL intentionally |
Anti-Rationalization Guards
- Do NOT skip the conceptual model — understand entities and relationships first
- Do NOT add indexes speculatively — measure query patterns first
- Do NOT denormalize without measured evidence of a performance problem
- Do NOT create migrations without rollback plans
- Do NOT skip the discovery phase — understand query patterns and data volume
- Do NOT drop columns or tables without expand-contract pattern in production
Documentation Lookup (Context7)
Use mcp__context7__resolve-library-id then mcp__context7__query-docs for up-to-date docs. Returned docs override memorized knowledge.
prisma— for schema syntax, relations, or migration APItypeorm— for entity decorators, repository patterns, or query builderknex— for query builder syntax, migrations, or seed files
Integration Points
| Skill | Relationship |
|---|---|
api-design |
Upstream: API resources map to database entities |
spec-writing |
Upstream: specs define data persistence requirements |
planning |
Downstream: schema design informs implementation plan |
test-driven-development |
Downstream: migration tests written before migration code |
performance-optimization |
Downstream: query optimization after schema is live |
reverse-engineering-specs |
Upstream: reverse-engineer existing schema behavior |
senior-backend |
Parallel: backend specialist for ORM and query patterns |
Verification Gate
Before claiming the schema design is complete:
- VERIFY all entities and relationships are modeled
- VERIFY normalization is at least 3NF (or denormalization is justified)
- VERIFY indexes are defined for all query patterns and FK columns
- VERIFY migration strategy includes rollback for every step
- VERIFY the user has approved the schema design
- VERIFY connection pooling strategy is defined for production
Skill Type
Flexible — Adapt storage engine, normalization level, and index strategy to project needs while preserving the conceptual-to-physical modeling progression, migration safety rules, and measured-evidence-before-denormalization principle.