Infrastructure as Code
Overview
Tool-agnostic IaC best practices covering the tool-selection decision framework, the testing pyramid, CI/CD integration, state management, and policy enforcement. Examples use Terraform, the most common of these tools.
Tool Selection Decision Matrix
| Factor | Terraform | CloudFormation | CDK | Pulumi |
|---|---|---|---|---|
| Multi-cloud | Excellent | AWS only | AWS (multi via constructs) | Excellent |
| Language | HCL | JSON/YAML | TypeScript/Python/Java | TypeScript/Python/Go |
| State | External (S3, etc.) | AWS-managed | AWS-managed | Pulumi Cloud or self-hosted |
| Ecosystem | Largest provider registry | AWS-native | Growing | Growing |
| Learning curve | Moderate (HCL) | Low (declarative) | Low (familiar lang) | Low (familiar lang) |
| Testing | Native tests + Terratest | TaskCat, cfn-lint | CDK assertions | Pulumi testing |
| Best for | Multi-cloud, large teams | AWS-only shops | AWS devs who prefer code | Devs who dislike DSLs |
Recommendation: Default to Terraform for multi-cloud setups or large teams. Use CDK or Pulumi if the team strongly prefers general-purpose languages. Use CloudFormation if the team is AWS-only and already knows it.
Project Structure
Terraform Layout
infrastructure/
├── modules/ # Reusable modules
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── README.md
│ ├── eks/
│ ├── rds/
│ └── lambda/
├── environments/ # Per-environment configs
│ ├── dev/
│ │ ├── main.tf # Module composition
│ │ ├── variables.tf
│ │ ├── terraform.tfvars
│ │ ├── backend.tf # S3 + DynamoDB state config
│ │ └── providers.tf
│ ├── staging/
│ └── prod/
├── policies/ # OPA/Sentinel policies
│ └── deny-public-s3.rego
└── tests/ # Integration tests
    └── vpc_test.go
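Key rules:
- Separate state per environment (each environment has its own backend.tf)
- Shared modules live in modules/ and are consumed by environments/
- Pin module versions: registry modules take version = "~> 2.0", Git sources pin a tag with ?ref=<tag>, and local paths such as source = "../modules/vpc" are versioned with the repository itself (the version argument does not apply to them)
- Pin provider versions in providers.tf
- Never commit .tfstate files, or .tfvars files that contain secrets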
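The last rule can be enforced mechanically in a pre-commit hook. The following is a minimal sketch in POSIX shell, not a complete secret scanner: the function names are hypothetical, and the secret-file naming patterns are an assumed team convention rather than anything Terraform defines.

```shell
#!/bin/sh
# Return 0 (true) when a path should never be committed.
# State files are matched by extension; the *secret*.tfvars pattern is
# an assumed naming convention for variable files that hold secrets.
is_forbidden_path() {
  case "$1" in
    *.tfstate | *.tfstate.*) return 0 ;;
    *secret*.tfvars) return 0 ;;
    *) return 1 ;;
  esac
}

# Read newline-separated paths on stdin, print each forbidden one,
# and return non-zero if any were found.
check_staged_paths() {
  found=0
  while IFS= read -r path; do
    if is_forbidden_path "$path"; then
      echo "forbidden: $path"
      found=1
    fi
  done
  return "$found"
}
```

In a hook this would typically be fed from Git, for example: git diff --cached --name-only | check_staged_paths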
Module Design
# modules/vpc/variables.tf
variable "name" {
  description = "VPC name"
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]*$", var.name))
    error_message = "Name must be lowercase alphanumeric with hyphens."
  }
}

variable "cidr_block" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "Must be a valid CIDR block."
  }
}

variable "availability_zones" {
  description = "List of AZs to use"
  type        = list(string)
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
Module rules:
- One concern per module (VPC, EKS, RDS, not "everything")
- All inputs as variables with descriptions, types, and validations
- All outputs documented
- Use for_each over count for stable resource identity
- Use locals for computed values
- Generate docs with terraform-docs
- Semantic versioning when publishing
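For the semantic-versioning rule, a release script might compute the next module version as sketched below. This is a minimal POSIX shell sketch that assumes plain X.Y.Z versions without pre-release suffixes; bump_version is a hypothetical helper, and the mapping in the comments follows the usual SemVer convention rather than a rule stated in this guide.

```shell
#!/bin/sh
# bump_version PART VERSION
# PART is major, minor, or patch; VERSION is X.Y.Z.
#   major: breaking change (for example a removed or renamed variable)
#   minor: backwards-compatible addition (for example a new optional variable)
#   patch: backwards-compatible fix
bump_version() {
  part=$1
  version=$2
  major=${version%%.*}   # text before the first dot
  rest=${version#*.}     # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "usage: bump_version major|minor|patch X.Y.Z" >&2; return 1 ;;
  esac
  echo "$major.$minor.$patch"
}
```

Consumers that pin version = "~> 2.0" would pick up the minor and patch releases produced this way, but not the next major.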
IaC Testing Pyramid
Run in order, fail fast:
Layer 1: Format & Validate (seconds)
# Pre-commit hook or CI step 1
terraform fmt -check -recursive
terraform validate
Layer 2: Lint (seconds)
# Enforce best practices
tflint --recursive
# Check for deprecated syntax, naming conventions, unused variables
Layer 3: Security Scan (seconds)
# Detect misconfigurations
checkov -d .
# or
tfsec . # tfsec has been folded into Trivy; trivy config . is its successor
# Catches: public S3 buckets, open security groups, unencrypted resources, missing tags
Layer 4: Plan & Cost (minutes)
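# Preview changes: mandatory before apply
terraform plan -out=tfplan
# Export the plan as JSON for cost and policy tooling
terraform show -json tfplan > plan.json
# Estimate cost impact (Infracost)
infracost breakdown --path=plan.json --format=json --out-file=infracost.json
# Show the cost difference this plan introduces (post it to the PR with infracost comment)
infracost diff --path=plan.json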
Layer 5: Policy-as-Code (seconds)
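# OPA (Open Policy Agent): custom guardrails
# --stdin-input reads the plan JSON from the pipe
terraform show -json tfplan | opa eval -d policies/ --stdin-input 'data.terraform.deny'
# Example policy: deny public S3 buckets
# policies/deny-public-s3.rego
package terraform

import rego.v1

# The ACL lives on aws_s3_bucket in older AWS provider versions and on aws_s3_bucket_acl since v4
deny contains msg if {
    some resource in input.resource_changes
    resource.type in {"aws_s3_bucket", "aws_s3_bucket_acl"}
    resource.change.after.acl in {"public-read", "public-read-write"}
    msg := sprintf("S3 bucket %s must not be public", [resource.address])
}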
Layer 6: Integration Tests (minutes)
// tests/vpc_test.go (Terratest)
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"name":               "test-vpc",
			"cidr_block":         "10.0.0.0/16",
			"availability_zones": []string{"us-east-1a", "us-east-1b"},
		},
	}
	// Destroy runs even if the apply or an assertion fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)
	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcID)
}
Or use native Terraform tests (1.6+):
# modules/vpc/tests/vpc.tftest.hcl (run terraform test from modules/vpc)
run "create_vpc" {
  command = apply

  variables {
    name               = "test-vpc"
    cidr_block         = "10.0.0.0/16"
    availability_zones = ["us-east-1a", "us-east-1b"]
  }

  assert {
    condition     = output.vpc_id != ""
    error_message = "VPC ID must not be empty"
  }
}
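The "run in order, fail fast" rule for the layers above can be expressed as a small wrapper. This is a sketch in POSIX shell under the assumption that each layer is a single command line; run_layers is a hypothetical helper, not part of Terraform or any of the tools named here.

```shell
#!/bin/sh
# Run each command string in order and stop at the first failure.
# Prints a header per layer. Returns the failing command's exit
# status, or 0 if every layer passed.
run_layers() {
  for layer in "$@"; do
    echo "==> $layer"
    sh -c "$layer" || return $?
  done
}

# Example invocation (requires the tools to be installed):
# run_layers \
#   "terraform fmt -check -recursive" \
#   "terraform validate" \
#   "tflint --recursive" \
#   "checkov -d ." \
#   "terraform plan -out=tfplan"
```

Because the cheap checks come first, a formatting or lint error is reported in seconds and the slower plan and integration layers never start.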
CI/CD for Infrastructure
GitHub Actions Workflow
name: Infrastructure
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']
permissions:
  contents: read
  pull-requests: write # Lets the cost step comment on the PR
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: terraform-linters/setup-tflint@v4
      - name: Format Check
        run: terraform fmt -check -recursive
        working-directory: infrastructure
      - name: Validate
        run: |
          for env in infrastructure/environments/*/; do
            terraform -chdir="$env" init -backend=false
            terraform -chdir="$env" validate
          done
      - name: Lint
        run: tflint --recursive
        working-directory: infrastructure
      - name: Security Scan
        run: |
          pip install checkov
          checkov -d infrastructure/ --quiet
  plan:
    needs: validate
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    environment: plan
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # Keep terraform show output clean for redirection
      - name: Plan
        run: |
          terraform -chdir=infrastructure/environments/prod init
          terraform -chdir=infrastructure/environments/prod plan -out=tfplan -no-color
          terraform -chdir=infrastructure/environments/prod show -json tfplan > /tmp/plan.json
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Cost Estimate
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }} # Assumes this secret is configured
      - run: infracost diff --path=/tmp/plan.json --format=json --out-file=/tmp/infracost.json
      - name: Post Cost Comment
        run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ github.token }} \
            --behavior=update
  apply:
    # plan is skipped on push events, and a job that needs a skipped job is skipped too,
    # so apply depends on validate instead
    needs: validate
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Apply
        run: |
          terraform -chdir=infrastructure/environments/prod init
          terraform -chdir=infrastructure/environments/prod apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GitOps Flow
Developer → PR with infra changes
→ CI: fmt + validate + lint + security scan (automated)
→ CI: terraform plan + cost estimate (automated, posted as PR comment)
→ Review: team reviews plan output and cost impact
→ Approve: merge to main
→ CD: terraform apply (automated with manual approval gate for prod)
Rules:
- Never terraform apply from local machines in production
- Plan output must be reviewed by a human before apply
- Production apply requires explicit approval (GitHub environment protection)
- All state operations logged for audit
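The first rule can be backed by a guard in whatever wrapper script the team uses to run Terraform. A minimal sketch in POSIX shell: both function names are hypothetical, and treating CI=true (which GitHub Actions sets) as the signal for "running in the pipeline" is an assumption of this example, not a security boundary.

```shell
#!/bin/sh
# Succeed only when running in CI. GitHub Actions exports CI=true;
# relying on that variable is an assumption of this sketch.
require_ci() {
  if [ "${CI:-}" != "true" ]; then
    echo "refusing to apply to '$1' outside CI" >&2
    return 1
  fi
}

# Only prod is guarded here; dev and staging may still be applied locally.
guarded_apply_allowed() {
  env_name=$1
  if [ "$env_name" = "prod" ]; then
    require_ci "$env_name"
  fi
}

# Example use in a wrapper script:
# guarded_apply_allowed "$ENV" && terraform -chdir="environments/$ENV" apply
```

A guard like this only prevents accidents. Real enforcement comes from not giving developer machines production credentials in the first place.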
State Management
Remote State (Terraform + AWS)
# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # Terraform 1.10+ can use use_lockfile = true instead
    encrypt        = true
  }
}
State rules:
- Always use a remote backend (never local state in production)
- Enable encryption at rest
- Enable state locking (on AWS, a DynamoDB table or, from Terraform 1.10, the S3 backend's own lockfile via use_lockfile = true; the GCS backend locks natively)
- Enable versioning on the state bucket (disaster recovery)
- Separate state per component AND per environment
- State key pattern: <env>/<component>/terraform.tfstate
- Never store state in Git
- Use terraform state mv for refactoring, with a backup
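The key pattern is easy to get wrong by hand, so it can help to generate it. A minimal sketch in POSIX shell, assuming the three environment names from the project layout; state_key is a hypothetical helper.

```shell
#!/bin/sh
# Build the state key "<env>/<component>/terraform.tfstate".
# The allowed environment names are an assumption taken from the
# environments/ directory layout.
state_key() {
  env_name=${1:-}
  component=${2:-}
  case "$env_name" in
    dev | staging | prod) ;;
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac
  if [ -z "$component" ]; then
    echo "component is required" >&2
    return 1
  fi
  printf '%s/%s/terraform.tfstate\n' "$env_name" "$component"
}
```

The result can be passed to Terraform's partial backend configuration, for example: terraform init -backend-config="key=$(state_key prod vpc)"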
Drift Detection
# Schedule in CI (weekly or daily)
terraform plan -detailed-exitcode
# Exit code 0: no changes
# Exit code 1: error
# Exit code 2: changes detected (drift!)
# Alert on exit code 2 → investigate manual changes
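A scheduled job needs to tell drift apart from a failed plan, since both are non-zero exits. The sketch below wraps the command above in POSIX shell; classify_plan_exit and check_drift are hypothetical names, and how the alert is delivered is left out.

```shell
#!/bin/sh
# Map the exit status of "terraform plan -detailed-exitcode" to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Plan the given directory and report the result.
# Returns 0 when in sync, 2 on drift, 1 on any error.
check_drift() {
  dir=$1
  status=0
  terraform -chdir="$dir" plan -detailed-exitcode -input=false >/dev/null 2>&1 || status=$?
  result=$(classify_plan_exit "$status")
  echo "$dir: $result"
  case "$result" in
    in-sync) return 0 ;;
    drift) return 2 ;;
    *) return 1 ;;
  esac
}

# Example: check_drift environments/prod
```

Capturing the status with || status=$? keeps the script alive under set -e, which would otherwise abort on exit code 2 before the drift branch is reached.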
Common AWS Patterns
VPC with Public/Private Subnets
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.project}-${var.environment}"
  cidr = var.vpc_cidr

  azs             = var.availability_zones
  private_subnets = var.private_subnet_cidrs
  public_subnets  = var.public_subnet_cidrs

  enable_nat_gateway     = true
  single_nat_gateway     = var.environment != "prod" # Cost: single for dev, per-AZ for prod
  one_nat_gateway_per_az = var.environment == "prod" # Without this, prod gets one NAT gateway per private subnet
  enable_dns_hostnames   = true

  tags = merge(var.tags, {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  })

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
    "karpenter.sh/discovery"          = var.cluster_name
  }

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }
}
Tagging Strategy
# providers.tf: apply default tags to ALL resources
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment = var.environment
      Project     = var.project
      ManagedBy   = "terraform"
      Team        = var.team
      CostCenter  = var.cost_center
    }
  }
}
Required tags (enforce via OPA/Config Rules):
- Environment: dev/staging/prod
- Project: application name
- ManagedBy: terraform/manual/cdk
- Team: owning team
- CostCenter: billing allocation
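The required-tag check itself is a simple set difference. The sketch below shows it in POSIX shell, independent of how the tag keys are obtained; missing_tags and REQUIRED_TAGS are hypothetical names, and in a real pipeline the same logic would more likely live in an OPA policy or an AWS Config rule, as noted above.

```shell
#!/bin/sh
# The five required tag keys from the list above.
REQUIRED_TAGS="Environment Project ManagedBy Team CostCenter"

# missing_tags KEY...
# Prints the required keys that are absent from the arguments,
# space-separated, and returns 1 if any are missing.
missing_tags() {
  missing=""
  for required in $REQUIRED_TAGS; do
    found=no
    for key in "$@"; do
      if [ "$key" = "$required" ]; then
        found=yes
      fi
    done
    if [ "$found" = "no" ]; then
      missing="${missing:+$missing }$required"
    fi
  done
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}

# Example: missing_tags Environment Project Team
```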
Anti-Patterns
- Monolithic state file for entire infrastructure (one blast radius)
- terraform apply from laptops (no audit trail, no review)
- No state locking (corruption from concurrent runs)
- Hardcoded values (regions, account IDs, CIDRs)
- No security scanning in pipeline
- Using count for resources with side effects on index changes
- terraform destroy without confirming scope
- Committing .tfvars with secrets
- No drift detection (manual changes go unnoticed)
- Skipping terraform plan review before apply
References
- Anton Babenko's terraform-skill: comprehensive Terraform-specific patterns
- HashiCorp official agent skills: Terraform and Packer best practices
- Pulumi agent skills: Pulumi-specific patterns and migration guides