terraform
Terraform & OpenTofu: Production Infrastructure-as-Code
Write, review, and architect Terraform/OpenTofu infrastructure - from individual resources to multi-account, PCI-compliant platform architectures. The goal is reproducible, drift-free, auditable infrastructure that passes both peer review and QSA assessment.
Target versions (April 2026): Terraform 1.14.9 (IBM/HashiCorp, BSL; 1.15.0-rc2 in progress), OpenTofu 1.11.6 (Linux Foundation, MPL). Helm provider v3.1+, K8s provider v3.0+, AWS provider v6.x, Azure v4.x, GCP v7.x.
This skill covers four domains depending on context:
- HCL - resource configs, variables, outputs, data sources, expressions, lifecycle rules
- Modules - structure, versioning, testing, registry patterns, reusable components
- Operations - state management, backends, workspaces, import, migration, CI/CD
- Compliance - PCI-DSS 4.0 controls, policy-as-code, audit trails, drift detection, CDE isolation
Terraform vs OpenTofu (2026)
IBM acquired HashiCorp for $6.4B (closed Feb 2025). Terraform stays BSL 1.1. The codebases have meaningfully diverged.
Choose Terraform if: already on HCP Terraform/TFE, need Stacks for multi-component orchestration, want vendor support.
Choose OpenTofu if: need client-side state encryption (Terraform never shipped this), BSL is a legal concern, want enabled meta-argument on resources, want OCI registry for providers/modules, need Linux Foundation governance.
Both share the provider plugin protocol - most providers work on both. For now.
CDKTF is dead. Deprecated Dec 2025, archived. Migrate to HCL or AWS CDK.
When to use
- Writing or reviewing Terraform/OpenTofu configurations
- Designing module architecture or registry patterns
- Planning state management, backend strategy, or migration
- Setting up CI/CD pipelines for IaC (plan/apply workflows)
- Implementing policy-as-code gates (Checkov, OPA, Sentinel)
- PCI-DSS 4.0 compliance for infrastructure provisioning
- Multi-account/multi-cloud architecture with blast radius controls
- Reviewing AI-generated Terraform for security and correctness
When NOT to use
- Kubernetes manifests or Helm charts (use kubernetes)
- Ansible playbooks or configuration management (use ansible)
- Docker/container optimization (use docker)
- CI/CD pipeline design (use ci-cd)
- Database engine configuration, schema design, or migrations (use databases)
- Security auditing application code (use security-audit)
AI Self-Check
AI tools consistently produce the same Terraform mistakes. Before returning any generated HCL, verify against this list:
- No hardcoded values - regions, AMI IDs, CIDR blocks, account IDs must be variables
- No overly permissive IAM - no
"Action": "*"or"Resource": "*"unless explicitly requested - No
0.0.0.0/0ingress on security groups (except port 443 for public ALBs, justified) - S3 buckets:
aws_s3_bucket_public_access_blockwith all four settingstrue(unless public access is explicitly required and justified), plus SSE-KMS encryption (aws_s3_bucket_server_side_encryption_configuration), versioning enabled, access logging (aws_s3_bucket_logging), and no overly permissive bucket policy (reviewaws_s3_bucket_policyfor broadPrincipal: "*"grants) - Provider versions pinned in
required_providerswith~>constraints - Backend config present (not local) with encryption and locking
-
lifecycleblocks where needed (create_before_destroy,prevent_destroyon stateful resources) -
sensitive = trueon variables/outputs containing secrets - Tags on every taggable resource (at minimum: Name, Environment, Owner, pci_scope if applicable)
- No deprecated resource arguments (check provider changelog - AI trains on old syntax)
- No
provisionerblocks - use Ansible or user_data instead - State file does NOT contain plaintext secrets (use ephemeral resources on TF 1.10+ or data sources for runtime secret lookup)
-
terraform fmtandterraform validatepass
AI should never own terraform apply. In March 2026, an AI-assisted Terraform workflow deleted production infrastructure through escalating cleanup logic. Plan output is reviewed by a human. Always.
Workflow
Step 1: Determine the domain
Based on the request:
- "Create a VPC/RDS/EC2/resource" -> HCL
- "Create a reusable module" -> Modules
- "Set up state backend" / "migrate state" -> Operations
- "Make this PCI compliant" / "policy gates" -> Compliance
- "Review this Terraform" -> Apply production checklist + critical rules + AI self-check
- "Review S3 buckets" -> S3 hardening review (see below) + AI self-check
Step 2: Gather requirements
Before writing HCL, determine:
- Cloud provider(s) and account/project structure
- Resource type and its dependencies
- Environment (dev/staging/prod) and promotion strategy
- State backend and locking mechanism
- Compliance scope: PCI CDE? Regulated? What tags/policies apply?
- Existing modules: reuse before creating new ones
- Secrets: how are they injected? (Vault, SSM, Secrets Manager - never tfvars)
Step 3: Build
Follow the domain-specific section below. Always terraform fmt + terraform validate + run Checkov before finishing.
Step 4: Validate
terraform fmt -check -recursive # Format check
terraform validate # Syntax + provider validation
tflint --recursive # Provider-specific linting
checkov -d . --framework terraform # Security/compliance scan
terraform plan -out=plan.tfplan # Review the plan
terraform show -json plan.tfplan | conftest test - # Policy-as-code gate (OPA)
HCL Patterns
Resource structure
resource "aws_instance" "web" {
ami = var.ami_id
instance_type = var.instance_type
subnet_id = var.private_subnet_id
root_block_device {
encrypted = true
kms_key_id = var.kms_key_arn
volume_size = 20
}
metadata_options {
http_endpoint = "enabled"
http_tokens = "required" # IMDSv2 - enforce this always
}
tags = merge(var.common_tags, {
Name = "${var.project}-web-${var.environment}"
})
lifecycle {
create_before_destroy = true
}
}
Key patterns
Variables: type them. Default non-sensitive ones. Mark secrets sensitive. Use validation blocks for constraints.
variable "environment" {
type = string
description = "Deployment environment"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Must be dev, staging, or prod."
}
}
variable "db_password" {
type = string
sensitive = true # prevents logging in plan output
}
Locals: extract repeated expressions. Name descriptively.
locals {
name_prefix = "${var.project}-${var.environment}"
is_prod = var.environment == "prod"
common_tags = {
Project = var.project
Environment = var.environment
ManagedBy = "terraform"
pci_scope = var.pci_scope
}
}
Data sources: for runtime lookups. Never hardcode AMI IDs, AZ lists, or account IDs.
data "aws_caller_identity" "current" {}
data "aws_availability_zones" "available" { state = "available" }
data "aws_ami" "al2023" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-*-x86_64"]
}
}
Ephemeral resources (TF 1.10+ / OT 1.11+): secrets that never persist in state. Ephemeral values can only flow into write_only arguments, provider configs, provisioners, or other ephemeral contexts - not into regular resource arguments.
ephemeral "aws_secretsmanager_secret_version" "db_password" {
secret_id = "prod/db/master-password"
}
resource "aws_db_instance" "main" {
password = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string # must be write_only in provider
}
Lifecycle rules: use deliberately, not defensively.
create_before_destroy- for zero-downtime replacements (LBs, ASGs, DNS)prevent_destroy- for stateful resources (databases, S3 buckets with data)ignore_changes- for attributes managed outside Terraform (ASG desired_count managed by HPA)replace_triggered_by- force recreation when a dependency changes
Import blocks (TF 1.5+): declarative imports, no state surgery.
import {
to = aws_s3_bucket.existing
id = "my-existing-bucket"
}
Moved blocks (TF 1.8+): declarative intra-state refactoring. Rename resources or move into/out of modules within the same state file. Reviewed in PRs, applied automatically on terraform apply. Does NOT work across state files.
moved {
from = aws_instance.web
to = module.compute.aws_instance.web
}
Cross-state resource move (state surgery - when moved blocks can't help):
# 1. Back up source state, then remove the resource
terraform state pull > backup.tfstate # safety backup only
terraform state rm aws_instance.web # removes from source backend directly
# 2. In the destination workspace, import the resource
terraform import aws_instance.web i-0abc1234def56789
# Then add the matching resource block in HCL to avoid drift
Verify both states with terraform plan before and after. state rm writes directly to the backend - do not state push the backup afterward (that would undo the removal). Ensure no other runs hold the state lock before starting (check terraform force-unlock only as a last resort with a known-stale lock ID) and block concurrent apply in CI for both source and destination during the migration.
What NOT to write
provisioner "local-exec"orprovisioner "remote-exec"- use Ansibledepends_onwhen Terraform already infers the dependency from attribute referencescountfor conditional resources whenfor_eachwith a set is clearer (OpenTofu: useenabled)- String interpolation for simple references:
"${var.name}"->var.name terraform.tfvarscommitted to Git with real valuesterraform.workspacefor environment separation (use separate state files or workspaces with distinct backends)- Inline
provisionerblocks of any kind
S3 bucket review checklist
When reviewing or writing S3 bucket configurations, verify every bucket has all six companion resources. AI-generated HCL routinely omits several of these.
| # | Resource | Why | Checkov |
|---|---|---|---|
| 1 | aws_s3_bucket_public_access_block |
Block all public access (all four settings true) |
CKV_AWS_53 |
| 2 | aws_s3_bucket_server_side_encryption_configuration |
SSE-KMS with customer-managed key | CKV_AWS_145 |
| 3 | aws_s3_bucket_versioning |
Rollback + tamper evidence | CKV_AWS_21 |
| 4 | aws_s3_bucket_logging |
Access audit trail (target a dedicated logging bucket) | CKV_AWS_18 |
| 5 | aws_s3_bucket_lifecycle_configuration |
Expiration/transition rules for cost and compliance retention | - |
| 6 | aws_s3_bucket_policy |
Explicit deny on non-SSL requests (aws:SecureTransport = false); no Principal: "*" grants unless public access is justified |
CKV_AWS_70 |
Also verify the account-level safety net: aws_s3_account_public_access_block with all four settings true. This catches any bucket that accidentally ships without its own block.
For PCI CDE buckets, add aws_s3_bucket_object_lock_configuration with COMPLIANCE mode retention for immutable audit storage (Req 10.5).
See references/compliance.md for full S3 hardening HCL examples including account-level blocks and object lock.
Modules
Read references/module-patterns.md for detailed module structure, testing patterns, and registry strategies.
Structure
modules/<provider>/<resource-type>/
main.tf # Resources
variables.tf # Inputs
outputs.tf # Outputs
versions.tf # Required providers + terraform version
README.md # Usage examples
examples/ # Working example configs
tests/ # .tftest.hcl files
Versioning
- Production: pin exact versions (
= 2.1.3) or use dependency lock file - Dev/staging: allow minor updates (
~> 2.1) - Every module gets semantic versioning and a CHANGELOG
- Provider versions: pin with
~>inrequired_providers. The.terraform.lock.hclfile pins exact hashes - commit it.
Testing (2026 standard)
terraform test(native, GA): HCL-based unit tests for every module. Fast, runs in CI on every PR.- Terratest (Go): integration tests that spin up real infrastructure. Run nightly or pre-release.
- Both complement each other.
terraform testfor fast validation, Terratest for real-world proof.
Anti-patterns
- Modules wrapping a single resource with no added logic (just use the resource directly)
- Modules with more variables than the resource they wrap has arguments
module "vpc"that just passes through all variables toaws_vpc- Not pinning module versions in production
- Using Git refs for module sources in production (use a registry or exact tags)
Operations
Read references/state-and-security.md for state backends, locking, encryption, OIDC federation, CI/CD pipeline patterns, and state surgery (cross-state resource migration).
State management
See references/state-and-security.md for full backend config examples, OIDC federation patterns, CI/CD pipeline flows, and cross-state migration workflows.
S3 + native locking (TF 1.10+): DynamoDB-based locking is deprecated. Use use_lockfile = true. Encrypt with KMS. Enable versioning and CloudTrail data events on the bucket.
OpenTofu: add client-side state encryption on top (AES-GCM, AWS KMS, GCP KMS, or OpenBao) - encrypts before upload, even a compromised backend can't read state.
State file splitting (blast radius)
Split by risk and ownership:
states/
network/cde/ # CDE VPC - separate IAM role, separate approval
network/non-cde/ # Everything else
compute/cde/ # Payment processing
compute/non-cde/ # App tier
data/cde/ # RDS with cardholder data
iam/ # IAM is high-risk - own state, own approval
monitoring/ # CloudTrail, GuardDuty, Config
CDE state files get their own backend, IAM role, and approval workflow. A terraform apply on non-CDE infra must never touch CDE resources.
CI/CD credentials: OIDC federation
No static credentials in CI. Use OIDC federation (GitHub Actions, GitLab CI):
- CI generates a signed JWT per pipeline run
- Cloud provider validates JWT against CI platform's OIDC endpoint
- Short-lived credentials issued, scoped to that execution
- Separate roles: read-only for
plan(any branch), write forapply(main only) - Lock subject claims to specific repos AND branches
Supply chain integrity
The Terraform ecosystem has real supply chain risks (March 2026):
- Pin GitHub Actions to commit SHAs -
tj-actions/changed-fileswas compromised March 2025 via upstream reviewdog/action-setup (CVE-2025-30154) (~12 hours of credential theft). Same pattern as the Trivy compromise a year later. - Module supply chain is weak - modules have no hash verification (unlike the provider lock file). Typosquatting on the public registry is a demonstrated attack vector (NDC Oslo 2025).
- Terrascan: dead. Archived Nov 2025. Migrate to Checkov or Trivy.
- tfsec: merged into Trivy. Still works standalone but no new development.
- Trivy IaC scanning: pin to a verified version in CI. Check release notes before updating - supply chain attacks on CI tools are real. Pin to SHA digest, not mutable tag.
- CDKTF: dead. Deprecated Dec 2025, archived. Migrate to HCL.
Architecture
Multi-account strategy
Organization root
+-- Security OU
| +-- Log Archive account (CloudTrail, Config, audit logs)
| +-- Security Tooling account (GuardDuty, Security Hub)
+-- Infrastructure OU
| +-- Shared Services account (Transit Gateway, DNS, CI/CD)
+-- Workloads OU
| +-- Dev account
| +-- Staging account
| +-- Production account
+-- CDE OU (PCI)
+-- CDE Production account (payment processing - isolated)
+-- CDE Staging account
Terraform manages cross-account via provider aliases with assume_role:
provider "aws" {
alias = "cde"
region = var.region
assume_role {
role_arn = "arn:aws:iam::CDE_ACCOUNT:role/TerraformDeployRole"
}
}
Security scanning stack
| Tool | Role | Status |
|---|---|---|
| Checkov | Static HCL + plan analysis, 750+ checks, PCI/CIS/NIST frameworks | ๐ข Active, recommended |
| Trivy (absorbed tfsec) | IaC + container + repo scanning, single binary | ๐ข Active (v0.69.3 safe, v0.69.4-6 COMPROMISED) |
| TFLint | Provider-specific linting, catches misconfigs linters miss | ๐ข Active |
| OPA / Conftest | Custom policy-as-code on JSON plan output | ๐ข Active (CNCF) |
| Sentinel | Native TFC/TFE policy engine | ๐ข Active (proprietary) |
| tfsec | Security scanner | ๐ก Deprecated (merged into Trivy) |
| Terrascan | IaC scanner | ๐ด Archived Nov 2025 - migrate off |
| CDKTF | TypeScript/Python IaC | ๐ด Deprecated Dec 2025 - migrate off |
Recommended CI pipeline: terraform fmt -> terraform validate -> tflint -> checkov -> terraform plan -> conftest test (OPA) -> human review -> terraform apply
Compliance
Read references/compliance.md for the full PCI-DSS 4.0 requirements mapping, drift detection strategy, audit trail architecture, and OIDC patterns.
Quick reference: PCI-DSS 4.0 and IaC
PCI DSS 4.0 explicitly puts IaC repos in scope (Req 6). Your Terraform repo needs the same controls as any CDE system - access controls, audit logging, change management.
Critical requirements (most commonly cited in QSA findings):
- Req 6 + 8.6.2: IaC repo in scope - PR reviews, policy-as-code gates, no hardcoded secrets (use ephemeral resources TF 1.10+ or Vault/SSM)
- Req 10: Audit trail - archived plan/apply JSON + CloudTrail + immutable S3 with object lock (COMPLIANCE mode)
- Req 11.5: Drift detection satisfies FIM - schedule
terraform planruns and alert on unexpected changes
See references/compliance.md for full Req 1/3/7 mapping, drift detection strategy, and audit trail architecture.
State file security: state contains secrets (even with sensitive). Encrypt at rest (S3 SSE-KMS), restrict access (IAM policy), enable versioning, log all access (CloudTrail data events), retain 1+ year.
QSA expectations (2026): operational proof, not policy intent. Git history showing reviewed PRs, archived plan outputs, policy scan results per deployment, drift reports proving continuous compliance.
Production Checklist
Read references/production-checklist.md for the full pre-deploy checklist covering HCL quality, module standards, operations, PCI-DSS 4.0, and PCI MPoC compliance.
Reference Files
references/module-patterns.md- module design and testing patternsreferences/state-and-security.md- state backend, locking, encryption, OIDC patterns, and state surgery (cross-state migration)references/compliance.md- compliance and audit-oriented Terraform guidancereferences/production-checklist.md- pre-deploy verification checklist (HCL, modules, operations, PCI-DSS, MPoC)
Related Skills
- ansible - for day-2 configuration of provisioned resources. Terraform provisions the VM;
Ansible configures what runs on it. No
provisionerblocks - use Ansible instead. - kubernetes - K8s manifests and Helm charts. Terraform provisions the cluster; kubernetes configures what runs on it.
- databases - engine tuning and operations. Terraform provisions managed databases; databases skill tunes the engine.
- ci-cd - pipeline design that runs
terraform plan/apply. Terraform covers HCL; ci-cd covers the pipeline stages. - docker - container image patterns. Terraform provisions container infrastructure but Dockerfile design belongs in docker.
Rules
These are non-negotiable. Violating any of these is a bug.
terraform fmt+terraform validateon every change. Non-negotiable.- Pin provider versions.
required_providerswith~>constraints. Commit the lock file. - Never commit secrets. Not in
.tf, not in.tfvars, not in state. Use ephemeral resources, Vault, or SSM. - No
provisionerblocks. Use Ansible or user_data. - No
"Action": "*"in IAM policies. Least-privilege only. - Encrypt everything. Storage, transit, state backend. No exceptions.
- State backend with locking and encryption. Never local state in production.
- Separate CDE state files. Own backend, own IAM role, own approval workflow.
- OIDC federation for CI/CD. No static cloud credentials.
- Pin CI actions to commit SHAs. Mutable tags are compromised supply chain vectors (tj-actions March 2025, Trivy March 2026).
terraform planbefore everyapply. Archive the plan output.- AI never owns
terraform apply. Plan output is reviewed by a human. Always. - Run the AI self-check. Every generated HCL gets verified against the checklist above before returning.