refactor-module
Terraform Module Refactoring - Decision Expert
Assumption: You know Terraform syntax. This covers when to modularize vs keep inline.
Load this skill when
- the user is deciding whether Terraform code should become a reusable module
- module boundaries, state migration risk, or abstraction pressure are the hard part
- the goal is to prevent module sprawl while extracting the right seams
Do NOT load this skill when
- the user only needs Terraform syntax or provider help
- the modularization decision is already made and only implementation remains
- the code in question is not Terraform infrastructure
Before Refactoring to Module: Strategic Assessment
Ask yourself these questions BEFORE extracting a module:
1. Duplication Pain Analysis
- How many usages?: 1-2 usages → Keep inline (wait for third), 3+ → Module candidate
- How identical?: <50% same config → Use locals, >80% same → Module makes sense
- Change frequency: Changes weekly → DON'T modularize (unstable API), Quarterly+ → Module safe
- Scope of changes: Entire resource → Module, Just variables → Locals sufficient
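A rough duplication signal can be pulled straight from the repo before doing any analysis by hand. A minimal sketch (the directory layout and the 3+ threshold are assumptions from this skill's heuristics, not a standard tool):

```shell
# Count how many times each resource type appears under a directory.
# Sorted counts give a rough duplication signal; 3+ of the same type
# is worth a closer look per the thresholds above.
count_resource_types() {
  grep -rhoE 'resource "[a-z0-9_]+"' --include='*.tf' "${1:-.}" \
    | sort | uniq -c | sort -rn
}
```

Run it as `count_resource_types infrastructure/` and read the top of the list.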
2. Team Readiness Assessment
- Test coverage: <50% coverage → TOO RISKY (breaking changes won't be caught)
- State backup: Is state in remote backend? Can we roll back if migration fails?
- Consumer count: 1-3 consumers → Easy to update, 10+ → Module change = 10 PRs
- Documentation: Does the team know how to use terraform state mv? Need training first?
3. Cost/Benefit Analysis
- Module overhead: Versioning strategy + automated tests + migration docs + state migration risk
- Duplication pain: Inconsistent configs + bug fixes need 3+ PRs + compliance drift risk
- Break-even point: Module worth it when 4+ identical usages + stable API + compliance need
- Time to value: Simple module = 2 hours, Complex with state = 2 days planning + 4 hours execution
Critical Decision: Module vs Inline
Considering creating a module?
│
├─ Pattern used once → DON'T modularize (keep inline)
│ └─ Wait for second usage
│ WHY: Premature abstraction harder to change
│
├─ Pattern used 2-3 times → MAYBE modularize
│ ├─ Identical config → Modularize
│ ├─ Similar but different → DON'T (use locals instead)
│ └─ Different teams using → DON'T (coordination overhead)
│
├─ Pattern used 4+ times → Modularize
│ └─ IF config is stable (not changing every sprint)
│ WHY: Module changes require updating all consumers
│
└─ Compliance/security requirement → Modularize
└─ Enforce standards across teams
WHY: Module = single source of truth for compliance
The trap: Over-modularization. Modules add indirection, versioning, testing overhead.
Rule of thumb: Resist creating modules until pain of duplication > pain of abstraction.
Anti-Patterns
❌ #1: Leaky Abstractions
Problem: Module exposes 50 variables, just wrapping resources
// ❌ WRONG - 1:1 variable mapping
module "vpc" {
source = "./modules/vpc"
cidr_block = var.cidr_block
enable_dns_hostnames = var.enable_dns_hostnames
enable_dns_support = var.enable_dns_support
instance_tenancy = var.instance_tenancy
enable_classiclink = var.enable_classiclink
enable_classiclink_dns_support = var.enable_classiclink_dns_support
assign_generated_ipv6_cidr_block = var.assign_generated_ipv6_cidr_block
// ... 43 more variables
}
// WHY IT'S BAD: Not abstracting anything, just moving code
Fix: High-level interface
// ✅ CORRECT - meaningful abstraction
module "vpc" {
source = "./modules/vpc"
network_config = {
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b"]
public_subnets = 2
private_subnets = 2
}
// 4 variables instead of 50
}
Test: If module variables match resource arguments 1:1, it's not a real abstraction.
Why this is deceptively hard to debug: Module works perfectly—tests pass, terraform apply succeeds. The problem emerges slowly over months: every resource argument becomes a module variable (+50 variables), documentation becomes unwieldy (which of 50 variables do I actually need?), consumers still need to know AWS VPC internals to use the module (defeating abstraction purpose). Takes 3-6 months of maintenance hell—adding new AWS features requires module updates, version bumps, consumer migrations—before team realizes the module isn't abstracting anything, just adding indirection. By then, you have 10+ consumers and unwinding the module is more painful than living with it.
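The 1:1 test above can be approximated mechanically. A sketch, assuming the module follows the common convention of declaring variables in top-level `variable` blocks inside the module directory:

```shell
# Heuristic leaky-abstraction check: count the variables a module
# exposes. If the count rivals the wrapped resource's argument count,
# the module is likely just moving code, not abstracting it.
module_variable_count() {
  grep -h '^variable ' "$1"/*.tf 2>/dev/null | wc -l
}
```

For example, `module_variable_count modules/vpc` printing 50 when `aws_vpc` plus its satellites take roughly 50 arguments is a strong sign of a 1:1 wrapper.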
❌ #2: Premature Modularization
Problem: Create module after first usage, becomes wrong abstraction
// First usage: Simple S3 bucket
resource "aws_s3_bucket" "data" {
bucket = "company-data"
}
// Developer: "Let's make this a module!"
// Creates module with 1 user
// Second usage: Needs versioning
// Third usage: Needs replication
// Fourth usage: Needs lifecycle rules
// Fifth usage: Needs encryption
// Result: Module has 30 variables, nobody uses all of them
Fix: Wait for 2-3 real usages, then extract common pattern
// After seeing 3 usages, common pattern emerges:
module "s3_bucket" {
source = "./modules/s3"
name = "company-data"
type = "data" // one of "data", "logs", "artifacts" (3 known patterns)
// Module handles appropriate defaults per type
}
Rule: First time: inline. Second time: copy-paste. Third time: abstract.
Why this is deceptively hard to debug: First usage looks clean (simple S3 module, 3 variables). Second usage needs versioning (add 2 variables). Third needs replication (add 5 variables). Fourth needs lifecycle rules (add 8 variables). Each addition seems reasonable in isolation. After 6 months, module has 30 variables, 20 boolean feature flags, complex conditional logic. Nobody uses all features (each consumer uses 30% of variables). Every change risks breaking someone. The wrong abstraction was baked in early (before patterns emerged) and now refactoring requires coordinating 10+ teams. Takes 3-6 months of feature requests to realize the module grew wrong, but reversing it requires migrating all consumers—a 2-month project nobody has time for.
❌ #3: State Migration Nightmare
Problem: Refactor to module without planning state migration
// Before: Inline resources
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private" { ... }
// After: Module
module "network" {
source = "./modules/network"
}
// Result: Terraform wants to destroy old resources, create new ones
// State addresses changed: aws_vpc.main → module.network.aws_vpc.main
Fix: Move state before deploying
# Plan the move
terraform state list # See current addresses
# Move resources into module
terraform state mv aws_vpc.main module.network.aws_vpc.main
terraform state mv aws_subnet.private module.network.aws_subnet.private
# Verify no changes
terraform plan # Should show: No changes
WARNING: State moves are dangerous. Always backup state first:
terraform state pull > backup-$(date +%s).tfstate
Why this is deceptively hard to debug: NO ERROR MESSAGE. terraform apply shows plan to destroy VPC, create "new" VPC with identical config. Looks like Terraform bug or state corruption. You spend 20-30 minutes checking: Terraform version? Backend config? State file corrupted? The error is invisible—state addresses changed (aws_vpc.main → module.network.aws_vpc.main) but Terraform doesn't say "you moved code without moving state." It just says "resource doesn't exist in state, will create new one." Developers approve the plan thinking it's safe (identical config), then production VPC gets destroyed and recreated, taking down all services. The fix (state mv) is 2 minutes, but discovering that's the problem takes 20-30 minutes of debugging—and by then you may have already applied and caused an outage.
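One cheap guardrail is to parse the plan summary before anyone approves. A sketch, assuming the standard "Plan: X to add, Y to change, Z to destroy." summary line from `terraform plan -no-color`:

```shell
# Extract the destroy count from saved plan text so CI (or a human)
# can refuse to apply a refactor that tears anything down.
destroy_count() {
  grep -oE '[0-9]+ to destroy' "$1" | grep -oE '^[0-9]+'
}
```

For example, run `terraform plan -no-color > plan.txt` and fail the pipeline whenever `destroy_count plan.txt` is nonzero; a pure state-address refactor should always show zero.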
❌ #4: Module Version Hell
Problem: 20 consumers on different module versions, breaking changes
// app1
module "vpc" { source = "git::...?ref=v1.0.0" }
// app2
module "vpc" { source = "git::...?ref=v1.2.0" }
// app3
module "vpc" { source = "git::...?ref=v2.0.0" } // Breaking changes
// Need bugfix in v1.x but already on v2.x
// Result: Maintain 3 versions or force painful upgrades
Fix: Semantic versioning + deprecation period
// v1.x: Stable, bug fixes only
// v2.x: New features, maintains v1 compatibility mode
// v3.x: Remove deprecated features
// v2 module supports both:
variable "legacy_mode" {
default = false // New consumers get new behavior
}
// Gives consumers 6 months to migrate
Rule: Breaking changes require major version + migration guide.
Why this is deceptively hard to debug: Version divergence happens slowly over weeks/months. App1 upgrades to v2.0 (works fine). App2 stays on v1.2 (no time to upgrade). App3 needs bugfix only in v1.x but you're maintaining v2.x now. Takes 2-3 weeks before pattern emerges: every module change requires checking "which versions need this fix?" You're backporting fixes to 3 versions, testing each, cutting multiple releases. CI builds slow down (testing v1.x, v2.x, v3.x). Documentation fragments (README has v1 vs v2 sections). Consumers report bugs fixed in v2 still present in v1—but they can't upgrade due to breaking changes. After 3-6 months you have 5 major versions to support, or you force painful migrations that break production apps.
Decision Frameworks
When to Extract Common Code
Have repeated Terraform code?
│
├─ Repeated data sources → Use locals (NOT module)
│ └─ data "aws_ami" { ... } appears 3 times
│ WHY: Data sources don't benefit from modules
│
├─ Repeated resource patterns → Check usage count
│ ├─ 1 usage → Keep inline
│ ├─ 2-3 usages → Use locals or workspaces
│ └─ 4+ usages → Consider module
│
└─ Compliance requirement → Module immediately
└─ Must enforce security standards
WHY: Module = single compliance enforcement point
Module Boundaries
Where to draw module boundary?
│
├─ By lifecycle → Good boundary
│ └─ VPC (rarely changes) vs EC2 (often changes)
│ WHY: Separate change frequencies
│
├─ By team ownership → Good boundary
│ └─ Team A owns networking, Team B owns compute
│ WHY: Clear ownership and responsibility
│
├─ By technology → Bad boundary
│ └─ "Database module", "Compute module", "Network module"
│ WHY: Real apps need cross-cutting concerns
│
└─ By Terraform resource type → Bad boundary
└─ aws_vpc module, aws_subnet module, aws_route_table module
WHY: Too granular, loses cohesion
Good module: VPC + subnets + route tables + NAT gateway (cohesive networking unit)
Bad module: Just VPC (consumer has to wire up subnets manually)
Module Design Patterns
Pattern 1: Composition (Preferred)
// Small, focused modules
module "vpc" { ... }
module "eks" {
vpc_id = module.vpc.id // Compose modules
}
module "rds" {
subnet_ids = module.vpc.private_subnet_ids
}
Benefit: Mix and match, replace parts independently.
Pattern 2: Monolithic (Avoid)
// One module does everything
module "infrastructure" {
// Creates VPC + EKS + RDS + everything
}
Problem: Can't replace just VPC, all-or-nothing.
When monolithic is OK: Compliance module that must enforce standards together.
Refactoring Process
Step 1: Identify Duplication (Don't Rush)
# Find repeated patterns
grep -r "resource \"aws_s3_bucket\"" .
# If 1-2 results: keep inline
# If 3+ results: analyze differences
Step 2: Validate True Duplication
# Are configs actually identical?
diff app1/s3.tf app2/s3.tf
# If > 50% different: NOT true duplication
# Solution: locals or data sources, not module
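The "> 50% different" check can be made roughly quantitative. A sketch using line overlap as a crude similarity score (file names are placeholders; requires bash for process substitution, and both files must be non-empty):

```shell
# Percentage of unique lines two .tf files share. Crude, but enough
# to separate true duplication from superficially similar configs.
config_similarity() {
  common=$(comm -12 <(sort "$1") <(sort "$2") | wc -l)
  total=$(sort -u "$1" "$2" | wc -l)
  echo "$((100 * common / total))"
}
```

For example, `config_similarity app1/s3.tf app2/s3.tf` printing under 50 suggests locals, not a module.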
Step 3: Design Module Interface
// ❌ WRONG - expose every resource argument
variable "acl" {}
variable "versioning" {}
variable "lifecycle_rules" {}
variable "replication" {}
// ... 20 more variables
// ✅ CORRECT - expose intent
variable "bucket_type" {
type = string
validation {
condition = contains(["data", "logs", "artifacts"], var.bucket_type)
error_message = "bucket_type must be one of: data, logs, artifacts."
}
}
// Module maps type to appropriate settings
Step 4: Extract + Test
# Create module
mkdir -p modules/s3
mv s3.tf modules/s3/main.tf
# Test in isolation
cd modules/s3
terraform init
terraform plan -var bucket_type=data
Step 5: Migrate State (Critical)
# Backup state
terraform state pull > backup.tfstate
# Move resources
terraform state mv aws_s3_bucket.data module.s3.aws_s3_bucket.main
# Verify
terraform plan # Should show: No changes
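For more than a handful of resources, hand-typing the mv commands invites typos. A sketch that generates them from a saved `terraform state list`; the target module name is an argument, and the generated script should be reviewed before running:

```shell
# Emit `terraform state mv` commands for each address on stdin,
# prefixing the given module path. Review the output before executing.
gen_state_moves() {
  while read -r addr; do
    printf "terraform state mv '%s' '%s.%s'\n" "$addr" "$1" "$addr"
  done
}
```

For example: `terraform state list | grep -v '^module\.' | gen_state_moves module.s3 > moves.sh`, then inspect and run moves.sh.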
Error Recovery Procedures
When State Migration Causes Destroy/Create Plan
Recovery steps:
- STOP: Do NOT apply. Run terraform state pull > emergency-backup.tfstate immediately
- Diagnose: Run terraform state list to see current addresses vs expected module addresses
- Fix state: Run terraform state mv commands for each resource that moved into the module
- Fallback: If you already applied and destroyed resources, restore from backup with terraform state push emergency-backup.tfstate, then import destroyed resources: terraform import module.network.aws_vpc.main vpc-xxxxx
When Module Has Too Many Variables (Leaky Abstraction)
Recovery steps:
- Audit usage: Survey all consumers to learn which variables they actually use (often only ~20% of them)
- Identify patterns: Group consumers by usage pattern (data buckets, log buckets, artifact buckets)
- Redesign interface: Replace 50 variables with a bucket_type enum plus sensible defaults per type
- Fallback: If the redesign is too risky, create a v2 module with a clean interface and deprecate v1 over 6 months
When Premature Module Needs Different Features Per Consumer
Recovery steps:
- Count feature flags: If >10 boolean toggles, module is wrong abstraction
- Split by usage: Create separate modules per use case (simple-bucket, versioned-bucket, replicated-bucket)
- Migrate incrementally: New consumers use new modules, old consumers stay on v1 (deprecate over time)
- Fallback: If splitting is too complex, add an advanced_config escape hatch allowing raw HCL passthrough for edge cases
When Multiple Module Versions Cause Maintenance Hell
Recovery steps:
- Audit versions: Run grep -r 'source.*?ref=' . | sort | uniq -c to see the version distribution
- Create migration path: Write an automated migration script (sed/awk) to update HCL from v1 to v2
- Coordinate upgrades: Schedule a "module upgrade week" where all teams migrate together
- Fallback: If coordination is impossible, use a monorepo with relative source paths (source = "../../modules/vpc") to eliminate versions entirely: all consumers share the same code, so versions cannot diverge
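The "automated migration script" step can be sketched as a bulk ref bump. The `?ref=` pin format and in-place sed are assumptions; commit first and review the diff before pushing:

```shell
# Rewrite every v1.x module pin under a directory to a target ref.
# usage: bump_module_refs <dir> <new_ref>
bump_module_refs() {
  grep -rl 'ref=v1\.' --include='*.tf' "$1" \
    | xargs -r sed -i "s/ref=v1\.[0-9.]*/ref=$2/g"
}
```

For example, `bump_module_refs . v2.0.0` updates every v1.x pin in the tree; teams still need to review each repo's plan output afterward.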
When to Load Full Reference
MANDATORY - READ ENTIRE FILE: references/state-migration.md when:
- Migrating 5+ resources into module with complex dependencies
- Encountering 3+ state-related errors during migration (resource not found, duplicate)
- Setting up automated state migration for 10+ similar refactorings
- Need rollback procedures for failed state migration in production
MANDATORY - READ ENTIRE FILE: references/module-testing.md when:
- Module has 3+ consumers and needs automated testing
- Setting up CI/CD pipeline for module with 5+ test scenarios
- Implementing contract tests between module versions
- Need to test 10+ module input combinations
Do NOT load references for:
- Basic refactoring decisions (use Strategic Assessment section)
- Single resource state moves (use Error Recovery section above)
- Deciding whether to create module (use Critical Decision section)
Resources
- Official Docs: https://developer.hashicorp.com/terraform/language/modules (for syntax)
- This Skill: When to modularize, module boundaries, anti-patterns
Arguments
$ARGUMENTS: Terraform path, resource group, or module decision prompt
- Example: /refactor-module infrastructure/networking
- Example: /refactor-module should these 8 resources become a module
- If empty: ask which Terraform slice is under consideration