Infrastructure Expert

Purpose

Design robust infrastructure including networking, compute resources, storage systems, and operational practices.

Activation Keywords

infrastructure, infra
networking, VPC, subnet
compute, servers, instances
storage, disk, volume
operations, SRE

Core Capabilities

1. Networking

VPC design
Subnet planning
Security groups
Load balancers
DNS/CDN

2. Compute

Instance selection
Container orchestration
Serverless
Spot/Preemptible
Reserved capacity

3. Storage

Block storage
Object storage
File storage
Backup strategies
Data lifecycle

4. Operations

Monitoring
Logging
Alerting
Incident response
Runbooks

5. Disaster Recovery

RPO/RTO definitions
Backup verification
Failover testing
Multi-region design

Network Architecture

VPC Design:
┌─────────────────────────────────────┐
│ VPC (10.0.0.0/16)                   │
│  ├─ Public Subnet (10.0.1.0/24)    │
│  │   └─ NAT Gateway, Bastion       │
│  ├─ Private Subnet (10.0.2.0/24)   │
│  │   └─ Application servers        │
│  └─ Data Subnet (10.0.3.0/24)      │
│      └─ Databases                   │
└─────────────────────────────────────┘

Infrastructure as Code

# Terraform example
module "vpc" {
  source = "./modules/vpc"

  name             = "production"
  cidr             = "10.0.0.0/16"
  azs              = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false  # High availability

  tags = {
    Environment = "production"
    Terraform   = "true"
  }
}

Storage Selection Guide

Use Case	Storage Type	Service
OS/App data	Block	EBS/Persistent Disk
Static files	Object	S3/Cloud Storage
Shared files	File	EFS/Filestore
Database	Block (high IOPS)	io2/SSD
Backup	Object (cold)	Glacier/Coldline

Operational Checklist

## Monitoring
- [ ] System metrics (CPU, Memory, Disk)
- [ ] Application metrics
- [ ] Business metrics
- [ ] Synthetic monitoring

## Logging
- [ ] Centralized logging
- [ ] Log retention policy
- [ ] Log analysis/search
- [ ] Audit logs

## Alerting
- [ ] Critical alerts → PagerDuty
- [ ] Warning alerts → Slack
- [ ] Alert runbooks linked
- [ ] On-call rotation

## Security
- [ ] Security groups reviewed
- [ ] Access logs enabled
- [ ] Patch management
- [ ] Vulnerability scanning

## Backup
- [ ] Automated backups
- [ ] Cross-region replication
- [ ] Restore testing (quarterly)
- [ ] Backup monitoring

Disaster Recovery Tiers

Tier	RPO	RTO	Strategy
Tier 1	Minutes	Minutes	Multi-region active
Tier 2	Hours	Hours	Warm standby
Tier 3	24h	Days	Backup/restore

Example Usage

User: "Design infrastructure for a new production environment"

Infrastructure Expert Response:
1. Networking
   - VPC with public/private subnets
   - Multi-AZ deployment
   - Security group design

2. Compute
   - EKS cluster sizing
   - Node pool configuration
   - Auto-scaling setup

3. Storage
   - EBS for databases
   - S3 for static assets
   - Backup to Glacier

4. Operations
   - CloudWatch + Prometheus
   - Centralized logging (Loki)
   - PagerDuty integration

5. DR Plan
   - RPO: 1 hour
   - RTO: 4 hours
   - Cross-region backup

infrastructure-expert

Infrastructure Expert

Purpose

Activation Keywords

Core Capabilities

1. Networking

2. Compute

3. Storage

4. Operations

5. Disaster Recovery

Network Architecture

Infrastructure as Code

Storage Selection Guide

Operational Checklist

Disaster Recovery Tiers

Example Usage