Infrastructure as Code

Provision and manage cloud infrastructure using code-based automation tools. This skill covers tool selection, state management, module design, and operational patterns across Terraform/OpenTofu, Pulumi, and AWS CDK.

When to Use

Use this skill when:

Provisioning cloud infrastructure (compute, networking, databases, storage)
Migrating from manual infrastructure to code-based workflows
Designing reusable infrastructure modules
Implementing multi-cloud or hybrid-cloud deployments
Establishing state management and drift detection patterns
Integrating infrastructure provisioning into CI/CD pipelines
Evaluating IaC tools (Terraform vs Pulumi vs CDK)

Common requests:

"Create a Terraform module for VPC provisioning"
"Set up remote state with locking for team collaboration"
"Compare Pulumi vs Terraform for our use case"
"Design composable infrastructure modules"
"Implement drift detection for existing infrastructure"

Core Concepts

Infrastructure as Code Fundamentals

Key Principles:

Declarative vs Imperative - Describe desired state (Terraform) or program infrastructure (Pulumi)
Idempotency - Same input produces same output, safe to re-run
Version Control - Infrastructure changes tracked in Git
State Management - Track actual infrastructure state
Module Composition - Reusable, versioned infrastructure components

Benefits:

Reproducibility (same code = same infrastructure)
Auditability (Git history shows all changes)
Collaboration (code reviews for infrastructure changes)
Automation (CI/CD deploys infrastructure)
Disaster recovery (rebuild from code)

Tool Selection Framework

Choose IaC tools based on team composition and cloud strategy:

Terraform/OpenTofu - Declarative, HCL-based

Multi-cloud and hybrid-cloud deployments
Operations/SRE teams prefer declarative approach
Largest provider ecosystem (AWS, GCP, Azure, 3000+ providers)
Mature module registry and community

Pulumi - Imperative, programming language-based

Developer-centric teams familiar with TypeScript/Python/Go
Complex logic requires programming constructs (loops, conditionals, functions)
Native unit testing using familiar test frameworks
Strong typing and IDE support

AWS CDK - AWS-native, programming language-based

AWS-only infrastructure
Tight integration with AWS services
L1/L2/L3 construct abstractions
CloudFormation under the hood

Decision Tree:

Multi-cloud required?
├─ YES → Team composition?
│  ├─ Ops/SRE focused → Terraform/OpenTofu
│  └─ Developer focused → Pulumi
└─ NO → AWS only?
   ├─ YES → Language preference?
   │  ├─ HCL/declarative → Terraform
   │  ├─ TypeScript/Python → AWS CDK
   │  └─ YAML/simple → CloudFormation
   └─ NO → GCP/Azure only?
      └─ Terraform or Pulumi

State Management Architecture

Remote state with locking enables team collaboration:

Backend Selection:

Cloud Provider	Recommended Backend	Locking Mechanism
AWS	S3 + DynamoDB	DynamoDB table
GCP	Google Cloud Storage	Native
Azure	Azure Blob Storage	Lease-based
Multi-cloud	Terraform Cloud/Enterprise	Built-in
Pulumi	Pulumi Service	Built-in

State Isolation Strategies:

Directory Separation (recommended for most teams)
- Separate directories per environment (prod/, staging/, dev/)
- Complete state file isolation
- No risk of cross-environment contamination
Workspaces
- Single codebase, multiple environments
- Shared state backend, environment namespacing
- Risk: accidental cross-environment operations
Layered Architecture
- Separate state files for networking, compute, data layers
- Blast radius reduction
- Cross-layer references via remote state data sources

Critical State Management Rules:

Always use remote state for team environments
Enable state file encryption at rest
Enable versioning on state storage
Use state locking to prevent concurrent modifications
Never commit state files to Git
Mark sensitive outputs as sensitive = true

Module Design Patterns

Composable Module Structure:

modules/
├── vpc/              # Network foundation
├── security-group/   # Reusable security group patterns
├── rds/              # Database with backups, encryption
├── ecs-cluster/      # Container orchestration base
├── ecs-service/      # Individual microservice
└── alb/              # Application load balancer

Module Versioning:

Pin module versions in production (version = "5.1.0")
Use semantic versioning for internal modules
Test module updates in non-prod first
Maintain CHANGELOG for module releases

Module Design Principles:

Clear input contract (required vs optional variables)
Documented outputs (what consumers can reference)
Sane defaults where possible
Validation rules for inputs
Examples directory showing usage

When to Create a Module:

Resource group is reused 3+ times
Clear boundaries and responsibilities
Stable interface contract
Team has module maintenance capacity

When to Keep Monolithic:

One-off infrastructure
Rapid prototyping phase
High coupling between resources
Small team, simple infrastructure

Quick Reference

Terraform/OpenTofu Commands

# Initialize providers and backend
terraform init

# Plan changes (preview)
terraform plan

# Apply changes
terraform apply

# Destroy infrastructure
terraform destroy

# Format HCL files
terraform fmt

# Validate syntax
terraform validate

# Show state
terraform state list
terraform state show <resource>

# Import existing resources
terraform import <resource.name> <id>

# Workspace management
terraform workspace list
terraform workspace new staging
terraform workspace select prod

Pulumi Commands

# Initialize new project
pulumi new aws-typescript

# Preview changes
pulumi preview

# Apply changes
pulumi up

# Destroy infrastructure
pulumi destroy

# Show stack outputs
pulumi stack output

# Manage stacks
pulumi stack ls
pulumi stack select prod

# Import existing resources
pulumi import <type> <name> <id>

# Export/import state
pulumi stack export > state.json
pulumi stack import < state.json

AWS CDK Commands

# Initialize new app
cdk init app --language typescript

# Synthesize CloudFormation
cdk synth

# Preview changes
cdk diff

# Deploy stack
cdk deploy

# Destroy stack
cdk destroy

# Bootstrap account/region
cdk bootstrap

# List stacks
cdk list

Common Patterns Checklist

Infrastructure Provisioning:

Remote state configured with locking
State file encryption enabled
Provider versions pinned
Module versions pinned (production)
Variables have descriptions and types
Sensitive outputs marked as sensitive
Tagging strategy implemented
Cost allocation tags applied

Module Development:

Clear README with usage examples
Required vs optional variables documented
Outputs documented with descriptions
Validation rules for critical inputs
Examples directory with working code
Tests for module behavior (Terratest/CDK assertions)
CHANGELOG for version tracking
Semantic versioning followed

Operational Readiness:

Drift detection scheduled
CI/CD pipeline for plan/apply
State backup strategy
Disaster recovery documented
Team access controls configured (IAM/RBAC)
Cost estimation integrated (Infracost)
Security scanning integrated (Checkov/tfsec)
Documentation kept current

Detailed Documentation

For comprehensive patterns and implementation details:

Tool-Specific Patterns:

references/terraform-patterns.md - Terraform/OpenTofu best practices, HCL patterns
references/pulumi-patterns.md - Pulumi across TypeScript/Python/Go

Architecture and Design:

references/state-management.md - Remote state, locking, isolation strategies
references/module-design.md - Composable modules, versioning, registries

Operations:

references/drift-detection.md - Detecting and remediating infrastructure drift

Working Examples

Practical implementations demonstrating IaC patterns:

Terraform Examples:

examples/terraform/vpc-module/ - Multi-AZ VPC with public/private subnets
examples/terraform/ecs-service/ - ECS service with ALB, autoscaling
examples/terraform/rds-cluster/ - Aurora cluster with backups, encryption
examples/terraform/state-backend/ - S3 + DynamoDB backend setup

Pulumi Examples:

examples/pulumi/typescript/vpc/ - TypeScript VPC component
examples/pulumi/python/ecs-service/ - Python ECS service
examples/pulumi/go/rds-cluster/ - Go RDS cluster
examples/pulumi/testing/ - Unit tests for Pulumi programs

AWS CDK Examples:

examples/cdk/typescript/vpc-stack/ - VPC using L2 constructs
examples/cdk/typescript/ecs-fargate/ - Fargate service with ALB
examples/cdk/typescript/pipeline-stack/ - Self-mutating CDK pipeline
examples/cdk/testing/ - CDK assertions and snapshot tests

Utility Scripts

Automated validation and operational tools:

scripts/validate-terraform.sh - Terraform fmt, validate, tflint
scripts/cost-estimate.sh - Infracost wrapper for cost analysis
scripts/drift-check.sh - Scheduled drift detection
scripts/security-scan.sh - Checkov/tfsec security scanning
scripts/state-backup.sh - State file backup automation
scripts/module-release.sh - Module versioning and publishing

Integration with Other Skills

Deployment Pipeline:

building-ci-pipelines - Automate terraform plan/apply in CI/CD
gitops-workflows - GitOps-based infrastructure deployment

Platform Engineering:

kubernetes-operations - Provision EKS, GKE, AKS clusters
platform-engineering - Internal developer platform infrastructure

Security:

secret-management - Provision Vault, External Secrets Operator
security-hardening - Implement infrastructure security controls
compliance-frameworks - Policy-as-code for compliance

Operations:

observability - Provision monitoring infrastructure (Prometheus, Grafana)
disaster-recovery - Infrastructure rebuild procedures
cost-optimization - Implement cost controls via IaC

Data Platform:

data-architecture - Provision data lakes, warehouses
streaming-data - Provision Kafka, Kinesis infrastructure

Best Practices

Development Workflow:

Write infrastructure code in feature branches
Run terraform plan / pulumi preview locally
Submit pull request with plan output
Code review focuses on security, cost, blast radius
CI runs automated tests and security scans
Apply only after approval and CI passes
Monitor for drift post-deployment

State Management:

Use remote state from day one (never local state for teams)
Separate state files per environment
Enable state locking to prevent concurrent modifications
Version state storage for rollback capability
Encrypt state at rest (contains sensitive data)
Regular state backups to separate location

Module Development:

Start with monolithic code, extract modules when patterns emerge
Design for reusability but avoid premature abstraction
Document all inputs and outputs
Provide working examples in examples/ directory
Pin provider versions in modules
Test modules before publishing
Use semantic versioning for releases

Security:

Scan IaC for security issues before apply (Checkov, tfsec)
Never commit secrets to code (use secret references)
Mark sensitive outputs as sensitive = true
Implement least-privilege IAM policies
Enable resource encryption by default
Use private module registries for internal modules

Cost Management:

Estimate costs before applying changes (Infracost)
Tag all resources for cost allocation
Review cost impact in pull requests
Set up cost alerts for drift
Rightsize resources based on usage

Operational Excellence:

Schedule regular drift detection
Document disaster recovery procedures
Maintain runbooks for common operations
Monitor state file access logs
Practice infrastructure rebuilds periodically
Keep provider versions current with testing

Common Pitfalls

State File Issues:

Manual state editing - Use terraform state commands, not direct edits
No state locking - Race conditions corrupt state
Local state for teams - State divergence across team members
Large state files - Break into multiple state files by layer

Module Design:

Over-abstraction - Too generic, hard to understand
Under-abstraction - Copy-paste code everywhere
No version pinning - Unexpected breaking changes
No examples - Users don't know how to consume module

Operations:

No drift detection - Manual changes go unnoticed
Direct resource modification - Bypassing IaC creates drift
No rollback plan - Can't recover from failed apply
Ignoring plan output - Surprises during apply

Security:

Secrets in code - Hard-coded credentials
No security scanning - Vulnerabilities in production
Overly permissive IAM - Excessive privileges
No state encryption - Sensitive data exposed

Troubleshooting Guide

State Lock Issues:

terraform force-unlock <lock-id>  # Use only if certain no other process running

Import Existing Resources:

terraform import aws_vpc.main vpc-12345678
pulumi import aws:ec2/vpc:Vpc main vpc-12345678

Drift Detection:

terraform plan -detailed-exitcode  # Exit 2 = drift detected
pulumi preview --diff

For detailed drift remediation, see references/drift-detection.md.

State Recovery:

# Terraform: Restore from S3 versioning
aws s3 cp s3://bucket/backup/terraform.tfstate terraform.tfstate

# Pulumi: Restore from checkpoint
pulumi stack export --version <timestamp> | pulumi stack import

Related Skills

For cloud-specific implementations:

aws-patterns - AWS-specific resource patterns
gcp-patterns - GCP-specific resource patterns
azure-patterns - Azure-specific resource patterns

For infrastructure operations:

kubernetes-operations - Manage Kubernetes clusters provisioned via IaC
gitops-workflows - GitOps-based infrastructure deployment
platform-engineering - Internal developer platforms

For security and compliance:

security-hardening - Infrastructure security controls
secret-management - Secret injection and rotation
compliance-frameworks - Policy-as-code for compliance

For deployment automation:

building-ci-pipelines - CI/CD for infrastructure code
deploying-applications - Application deployment to provisioned infrastructure

For cost and observability:

cost-optimization - FinOps practices for infrastructure
observability - Monitoring infrastructure health

writing-infrastructure-code

Infrastructure as Code

When to Use

Core Concepts

Infrastructure as Code Fundamentals

Tool Selection Framework

State Management Architecture

Module Design Patterns

Quick Reference

Terraform/OpenTofu Commands

Pulumi Commands

AWS CDK Commands

Common Patterns Checklist

Detailed Documentation

Working Examples

Utility Scripts

Integration with Other Skills

Best Practices

Common Pitfalls

Troubleshooting Guide

Related Skills