database-architect
Database Architect Skill
Step 1: Understand Data Requirements
Gather requirements:
- Entities: What data needs to be stored?
- Relationships: How do entities relate (1:1, 1:N, N:M)?
- Access Patterns: How will data be queried?
- Volume: Expected data size and growth rate
- Consistency: ACID requirements vs eventual consistency
Step 2: Design Schema
For Relational Databases:
- Normalize: Start with 3NF to reduce redundancy
- Define Primary Keys: Use surrogate keys (UUID/SERIAL) or natural keys
- Define Foreign Keys: Establish referential integrity
- Consider Denormalization: Only for proven performance needs
For NoSQL Databases:
- Model for Queries: Design documents/collections around access patterns
- Embed vs Reference: Embed for 1:1/1:few, reference for 1:many
- Shard Key Selection: Choose keys that distribute evenly
Step 3: Plan Indexes
Index strategy based on query patterns:
-- Example: Users table with common queries
CREATE INDEX idx_users_email ON users(email); -- Exact match
CREATE INDEX idx_users_name ON users(last_name, first_name); -- Range/sort
CREATE INDEX idx_users_created ON users(created_at DESC); -- Ordering
Index Guidelines:
- Index columns used in WHERE, JOIN, ORDER BY
- Consider composite indexes for multi-column queries
- Avoid over-indexing (slows writes)
- Use covering indexes for read-heavy queries
Step 4: Plan Migrations
Create versioned migrations:
migrations/
001_create_users.sql
002_add_email_index.sql
003_create_orders.sql
Migration Best Practices:
- Always include up and down migrations
- Test migrations on production-like data
- Plan for zero-downtime migrations
- Backup before running migrations
Step 5: Optimize Queries
Analyze and improve slow queries:
- Use EXPLAIN ANALYZE: Understand execution plans
- Identify Table Scans: Replace with index scans
- Optimize JOINs: Ensure indexes on join columns
- Batch Operations: Use bulk inserts/updates
- Connection Pooling: Reduce connection overhead
Step 6: PostgreSQL 17 Features (2024–2026)
Leverage PostgreSQL 17 capabilities where applicable:
Performance improvements:
- New
VACUUMmemory management — up to 20x lower memory footprint; vacuum now runs faster on busy systems - Streaming I/O interface accelerates sequential scans on large datasets
BRINindexes support parallel builds- B-tree indexes are more efficient for
INclause queries - Optimized CTE (Common Table Expression) planning
SQL/JSON enhancements (PG 17):
JSON_TABLE()— converts JSON data into relational table representation- JSON constructors and identity functions (
JSON(),JSON_SCALAR(),JSON_ARRAY(),JSON_OBJECT()) - Use
jsonpathfor expressive path-based queries over JSONB columns
Incremental backups:
pg_basebackupsupports incremental backup; combine withpg_upgradefor zero-data-loss major version upgrades
Logical replication improvements:
- Failover control for logical replication slots
pg_createsubscribercreates logical replicas from physical standbyspg_upgradenow preserves logical replication slots across major version upgrades
Security:
- New
MAINTAINprivilege — grants targeted maintenance rights without full superuser access sslnegotiation=directclient option for direct TLS handshake (avoids round-trip)
COPY improvements:
COPY ... ON_ERROR ignore— continues import on row-level errors instead of aborting
Step 7: pgvector for AI Embeddings
Store and query vector embeddings alongside relational data to avoid a separate vector database:
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Table with embedding column (1536 dims for OpenAI text-embedding-3-small)
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- IVFFlat index for approximate nearest neighbor (ANN) search
-- lists = sqrt(row_count) is a good starting value
CREATE INDEX idx_documents_embedding ON documents
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- HNSW index (faster queries, more memory; preferred for < 1M vectors)
CREATE INDEX idx_documents_embedding_hnsw ON documents
USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
-- Similarity search (cosine distance)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;
When to use pgvector vs. dedicated vector DB:
- Up to ~10M vectors: pgvector is sufficient (sub-50ms queries with HNSW index)
- Above 10M vectors or requiring specialized ANN algorithms: consider Pinecone, Weaviate, or Qdrant
- pgvector advantage: same backups, replication, and connection pooling as the rest of PostgreSQL
Step 8: Table Partitioning Strategies
Use declarative partitioning for tables expected to exceed available RAM:
-- Range partitioning by date (common for time-series / logs)
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
event_type TEXT NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
-- Monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Hash partitioning for even distribution (e.g., multi-tenant)
CREATE TABLE orders (
id UUID NOT NULL,
tenant_id UUID NOT NULL,
total DECIMAL(12,2)
) PARTITION BY HASH (tenant_id);
CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 3);
Partition pruning: PostgreSQL automatically skips irrelevant partitions when the partition key appears in WHERE. Always include the partition key in queries.
Index on partitioned tables: Indexes created on the parent table are automatically created on all child partitions.
Step 9: JSONB Patterns at Scale
-- Generated columns promote hot JSONB fields to indexed native columns
CREATE TABLE customers (
id BIGSERIAL PRIMARY KEY,
data JSONB NOT NULL,
-- Promote frequently filtered fields to B-tree indexed generated columns
country TEXT GENERATED ALWAYS AS (data->>'country') STORED,
signup_date DATE GENERATED ALWAYS AS ((data->>'signup_date')::DATE) STORED
);
CREATE INDEX idx_customers_country ON customers (country);
CREATE INDEX idx_customers_signup ON customers (signup_date);
-- GIN index for containment / key-existence queries
CREATE INDEX idx_customers_data_gin ON customers USING GIN (data);
-- Partial GIN index for large tables (index only active records)
CREATE INDEX idx_customers_data_active ON customers
USING GIN (data) WHERE (data->>'status') = 'active';
-- jsonpath query example (PG 17)
SELECT * FROM customers
WHERE data @? '$.tags[*] ? (@ == "premium")';
Step 10: Connection Pooling
Use a connection pooler in front of PostgreSQL for all production deployments:
PgBouncer (lightweight, battle-tested):
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction ; transaction pooling for stateless apps
max_client_conn = 1000
default_pool_size = 25
server_pool_size = 5
Supavisor (cloud-native, multi-tenant, Elixir-based):
- Designed for serverless / edge functions with thousands of short-lived connections
- Supports both session and transaction pooling modes
- Used by default in Supabase deployments; available as self-hosted
Pooling modes:
| Mode | Use Case | Notes |
|---|---|---|
| Session | Long-running connections, LISTEN/NOTIFY |
1 client = 1 server connection |
| Transaction | Stateless APIs (recommended default) | Most efficient; breaks SET / prepared statements |
| Statement | Rarely needed | Each statement can use a different server connection |
</execution_process>
<best_practices>
- Normalize First: Optimize later based on data
- Index Thoughtfully: Based on actual query patterns
- Use Migrations: Never modify schema directly
- Monitor Performance: Use database profiling tools
- Plan for Scale: Consider partitioning for large tables
- Upgrade to PostgreSQL 17: Benefit from new VACUUM memory manager, SQL/JSON functions, and incremental backup support
- Use pgvector for AI: Store embeddings in PostgreSQL with HNSW or IVFFlat indexes before reaching for a dedicated vector database
- JSONB at Scale: Promote hot JSONB fields to generated columns with B-tree indexes; use GIN for containment queries
- Always Pool Connections: Use PgBouncer (transaction mode) or Supavisor for all production PostgreSQL deployments
- Partition Large Tables: Apply range (time-series), list (category), or hash (multi-tenant) partitioning for tables projected to exceed server RAM
</best_practices>
Design a schema for an e-commerce orders system
Example Response:
-- Users table
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
-- Orders table
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
status VARCHAR(20) DEFAULT 'pending',
total_amount DECIMAL(10,2),
created_at TIMESTAMP DEFAULT NOW()
);
-- Order items (1:N relationship)
CREATE TABLE order_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
product_id UUID NOT NULL,
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2)
);
-- Indexes for common queries
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
</usage_example>
Rules
- Always justify denormalization with performance data
- Include rollback strategy for all migrations
- Document relationships and constraints
Related Workflow
This skill has a corresponding workflow for complex multi-agent scenarios:
- Workflow:
.claude/workflows/database-architect-skill-workflow.md - When to use workflow: For comprehensive database design including requirements analysis, schema design, query optimization, migration planning, and testing (multi-phase, multi-agent)
- When to use skill directly: For quick schema reviews or single-agent database tasks
Iron Laws
- NEVER make schema changes without versioned migrations that include both UP and DOWN scripts — manual DDL in production is not recoverable.
- ALWAYS normalize to at least 3NF before considering denormalization — never prematurely optimize without measured performance evidence.
- ALWAYS plan indexes based on actual query patterns from EXPLAIN ANALYZE — never add indexes speculatively before profiling real workloads.
- NEVER test or deploy a migration without running it against production-like data first — schema issues surface under realistic volume, not on empty tables.
- ALWAYS use connection pooling (Supavisor or PgBouncer) in production — direct connections from serverless functions exhaust the database connection limit under load.
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Manual DDL directly on production | No rollback path; breaks migration history | Always use versioned migrations with DOWN scripts |
| Premature denormalization | Adds complexity before profiling; often no measurable gain | Normalize first, denormalize only after EXPLAIN ANALYZE reveals bottleneck |
| Indexing every column | Slows writes; wastes storage; misleads query planner | Index only columns that appear in WHERE, JOIN, and ORDER BY clauses |
| Adding NOT NULL column without default | Locks entire table during migration on large datasets | Add nullable column, backfill in batches, then add NOT NULL constraint |
| Direct connections from serverless functions | Connection limit exhausted under load spike | Use PgBouncer or Supavisor in transaction-pooling mode |
Memory Protocol (MANDATORY)
Before starting:
cat .claude/context/memory/learnings.md
After completing:
- New pattern ->
.claude/context/memory/learnings.md - Issue found ->
.claude/context/memory/issues.md - Decision made ->
.claude/context/memory/decisions.md
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.