refactor-module

SKILL.md

Terraform Module Refactoring - Decision Expert

Assumption: You know Terraform syntax. This covers when to modularize vs keep inline.

Load this skill when

  • the user is deciding whether Terraform code should become a reusable module
  • module boundaries, state migration risk, or abstraction pressure are the hard part
  • the goal is to prevent module sprawl while extracting the right seams

Do NOT load this skill when

  • the user only needs Terraform syntax or provider help
  • the modularization decision is already made and only implementation remains
  • the code in question is not Terraform infrastructure

Before Refactoring to Module: Strategic Assessment

Ask yourself these questions BEFORE extracting a module:

1. Duplication Pain Analysis

  • How many usages?: 1-2 usages → Keep inline (wait for third), 3+ → Module candidate
  • How identical?: <50% same config → Use locals, >80% same → Module makes sense
  • Change frequency: Changes weekly → DON'T modularize (unstable API), Quarterly+ → Module safe
  • Scope of changes: Entire resource → Module, Just variables → Locals sufficient

2. Team Readiness Assessment

  • Test coverage: <50% coverage → TOO RISKY (breaking changes won't be caught)
  • State backup: Is state in remote backend? Can we roll back if migration fails?
  • Consumer count: 1-3 consumers → Easy to update, 10+ → Module change = 10 PRs
  • Documentation: Does team know how to use terraform state mv? Need training first?

3. Cost/Benefit Analysis

  • Module overhead: Versioning strategy + automated tests + migration docs + state migration risk
  • Duplication pain: Inconsistent configs + bug fixes need 3+ PRs + compliance drift risk
  • Break-even point: Module worth it when 4+ identical usages + stable API + compliance need
  • Time to value: Simple module = 2 hours, Complex with state = 2 days planning + 4 hours execution

Critical Decision: Module vs Inline

Considering creating a module?
├─ Pattern used once → DON'T modularize (keep inline)
│   └─ Wait for second usage
│      WHY: Premature abstraction harder to change
├─ Pattern used 2-3 times → MAYBE modularize
│   ├─ Identical config → Modularize
│   ├─ Similar but different → DON'T (use locals instead)
│   └─ Different teams using → DON'T (coordination overhead)
├─ Pattern used 4+ times → Modularize
│   └─ IF config is stable (not changing every sprint)
│      WHY: Module changes require updating all consumers
└─ Compliance/security requirement → Modularize
    └─ Enforce standards across teams
       WHY: Module = single source of truth for compliance

The trap: Over-modularization. Modules add indirection, versioning, testing overhead.

Rule of thumb: Resist creating modules until pain of duplication > pain of abstraction.


Anti-Patterns

❌ #1: Leaky Abstractions

Problem: Module exposes 50 variables, just wrapping resources

// ❌ WRONG - 1:1 variable mapping
module "vpc" {
  source = "./modules/vpc"

  cidr_block                           = var.cidr_block
  enable_dns_hostnames                 = var.enable_dns_hostnames
  enable_dns_support                   = var.enable_dns_support
  instance_tenancy                     = var.instance_tenancy
  enable_classiclink                   = var.enable_classiclink
  enable_classiclink_dns_support       = var.enable_classiclink_dns_support
  assign_generated_ipv6_cidr_block     = var.assign_generated_ipv6_cidr_block
  // ... 43 more variables
}

// WHY IT'S BAD: Not abstracting anything, just moving code

Fix: High-level interface

// ✅ CORRECT - meaningful abstraction
module "vpc" {
  source = "./modules/vpc"

  network_config = {
    cidr           = "10.0.0.0/16"
    azs            = ["us-east-1a", "us-east-1b"]
    public_subnets = 2
    private_subnets = 2
  }

  // 4 variables instead of 50
}

Test: If module variables match resource arguments 1:1, it's not a real abstraction.

Why this is deceptively hard to debug: Module works perfectly—tests pass, terraform apply succeeds. The problem emerges slowly over months: every resource argument becomes a module variable (+50 variables), documentation becomes unwieldy (which of 50 variables do I actually need?), consumers still need to know AWS VPC internals to use the module (defeating abstraction purpose). Takes 3-6 months of maintenance hell—adding new AWS features requires module updates, version bumps, consumer migrations—before team realizes the module isn't abstracting anything, just adding indirection. By then, you have 10+ consumers and unwinding the module is more painful than living with it.

❌ #2: Premature Modularization

Problem: Create module after first usage, becomes wrong abstraction

// First usage: Simple S3 bucket
resource "aws_s3_bucket" "data" {
  bucket = "company-data"
}

// Developer: "Let's make this a module!"
// Creates module with 1 user

// Second usage: Needs versioning
// Third usage: Needs replication
// Fourth usage: Needs lifecycle rules
// Fifth usage: Needs encryption

// Result: Module has 30 variables, nobody uses all of them

Fix: Wait for 2-3 real usages, then extract common pattern

// After seeing 3 usages, common pattern emerges:
module "s3_bucket" {
  source = "./modules/s3"

  name = "company-data"
  type = "data" | "logs" | "artifacts"  // 3 known patterns

  // Module handles appropriate defaults per type
}

Rule: First time: inline. Second time: copy-paste. Third time: abstract.

Why this is deceptively hard to debug: First usage looks clean (simple S3 module, 3 variables). Second usage needs versioning (add 2 variables). Third needs replication (add 5 variables). Fourth needs lifecycle rules (add 8 variables). Each addition seems reasonable in isolation. After 6 months, module has 30 variables, 20 boolean feature flags, complex conditional logic. Nobody uses all features (each consumer uses 30% of variables). Every change risks breaking someone. The wrong abstraction was baked in early (before patterns emerged) and now refactoring requires coordinating 10+ teams. Takes 3-6 months of feature requests to realize the module grew wrong, but reversing it requires migrating all consumers—a 2-month project nobody has time for.

❌ #3: State Migration Nightmare

Problem: Refactor to module without planning state migration

// Before: Inline resources
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private" { ... }

// After: Module
module "network" {
  source = "./modules/network"
}

// Result: Terraform wants to destroy old resources, create new ones
// State addresses changed: aws_vpc.main → module.network.aws_vpc.main

Fix: Move state before deploying

# Plan the move
terraform state list  # See current addresses

# Move resources into module
terraform state mv aws_vpc.main module.network.aws_vpc.main
terraform state mv aws_subnet.private module.network.aws_subnet.private

# Verify no changes
terraform plan  # Should show: No changes

WARNING: State moves are dangerous. Always backup state first:

terraform state pull > backup-$(date +%s).tfstate

Why this is deceptively hard to debug: NO ERROR MESSAGE. terraform apply shows plan to destroy VPC, create "new" VPC with identical config. Looks like Terraform bug or state corruption. You spend 20-30 minutes checking: Terraform version? Backend config? State file corrupted? The error is invisible—state addresses changed (aws_vpc.main → module.network.aws_vpc.main) but Terraform doesn't say "you moved code without moving state." It just says "resource doesn't exist in state, will create new one." Developers approve the plan thinking it's safe (identical config), then production VPC gets destroyed and recreated, taking down all services. The fix (state mv) is 2 minutes, but discovering that's the problem takes 20-30 minutes of debugging—and by then you may have already applied and caused an outage.

❌ #4: Module Version Hell

Problem: 20 consumers on different module versions, breaking changes

// app1
module "vpc" { source = "git::...?ref=v1.0.0" }

// app2
module "vpc" { source = "git::...?ref=v1.2.0" }

// app3
module "vpc" { source = "git::...?ref=v2.0.0" }  // Breaking changes

// Need bugfix in v1.x but already on v2.x
// Result: Maintain 3 versions or force painful upgrades

Fix: Semantic versioning + deprecation period

// v1.x: Stable, bug fixes only
// v2.x: New features, maintains v1 compatibility mode
// v3.x: Remove deprecated features

// v2 module supports both:
variable "legacy_mode" {
  default = false  // New consumers get new behavior
}

// Gives consumers 6 months to migrate

Rule: Breaking changes require major version + migration guide.

Why this is deceptively hard to debug: Version divergence happens slowly over weeks/months. App1 upgrades to v2.0 (works fine). App2 stays on v1.2 (no time to upgrade). App3 needs bugfix only in v1.x but you're maintaining v2.x now. Takes 2-3 weeks before pattern emerges: every module change requires checking "which versions need this fix?" You're backporting fixes to 3 versions, testing each, cutting multiple releases. CI builds slow down (testing v1.x, v2.x, v3.x). Documentation fragments (README has v1 vs v2 sections). Consumers report bugs fixed in v2 still present in v1—but they can't upgrade due to breaking changes. After 3-6 months you have 5 major versions to support, or you force painful migrations that break production apps.


Decision Frameworks

When to Extract Common Code

Have repeated Terraform code?
├─ Repeated data sources → Use locals (NOT module)
│   └─ data "aws_ami" { ... } appears 3 times
│      WHY: Data sources don't benefit from modules
├─ Repeated resource patterns → Check usage count
│   ├─ 1 usage → Keep inline
│   ├─ 2-3 usages → Use locals or workspaces
│   └─ 4+ usages → Consider module
└─ Compliance requirement → Module immediately
    └─ Must enforce security standards
       WHY: Module = single compliance enforcement point

Module Boundaries

Where to draw module boundary?
├─ By lifecycle → Good boundary
│   └─ VPC (rarely changes) vs EC2 (often changes)
│      WHY: Separate change frequencies
├─ By team ownership → Good boundary
│   └─ Team A owns networking, Team B owns compute
│      WHY: Clear ownership and responsibility
├─ By technology → Bad boundary
│   └─ "Database module", "Compute module", "Network module"
│      WHY: Real apps need cross-cutting concerns
└─ By Terraform resource type → Bad boundary
    └─ aws_vpc module, aws_subnet module, aws_route_table module
       WHY: Too granular, loses cohesion

Good module: VPC + subnets + route tables + NAT gateway (cohesive networking unit) Bad module: Just VPC (consumer has to wire up subnets manually)


Module Design Patterns

Pattern 1: Composition (Preferred)

// Small, focused modules
module "vpc" { ... }
module "eks" {
  vpc_id = module.vpc.id  // Compose modules
}
module "rds" {
  subnet_ids = module.vpc.private_subnet_ids
}

Benefit: Mix and match, replace parts independently.

Pattern 2: Monolithic (Avoid)

// One module does everything
module "infrastructure" {
  // Creates VPC + EKS + RDS + everything
}

Problem: Can't replace just VPC, all-or-nothing.

When monolithic is OK: Compliance module that must enforce standards together.


Refactoring Process

Step 1: Identify Duplication (Don't Rush)

# Find repeated patterns
grep -r "resource \"aws_s3_bucket\"" .
# If 1-2 results: keep inline
# If 3+ results: analyze differences

Step 2: Validate True Duplication

# Are configs actually identical?
diff app1/s3.tf app2/s3.tf

# If > 50% different: NOT true duplication
# Solution: locals or data sources, not module

Step 3: Design Module Interface

// ❌ WRONG - expose every resource argument
variable "acl" {}
variable "versioning" {}
variable "lifecycle_rules" {}
variable "replication" {}
// ... 20 more variables

// ✅ CORRECT - expose intent
variable "bucket_type" {
  type = string
  validation {
    condition = contains(["data", "logs", "artifacts"], var.bucket_type)
  }
}

// Module maps type to appropriate settings

Step 4: Extract + Test

# Create module
mkdir -p modules/s3
mv s3.tf modules/s3/main.tf

# Test in isolation
cd modules/s3
terraform init
terraform plan -var bucket_type=data

Step 5: Migrate State (Critical)

# Backup state
terraform state pull > backup.tfstate

# Move resources
terraform state mv aws_s3_bucket.data module.s3.aws_s3_bucket.main

# Verify
terraform plan  # Should show: No changes

Error Recovery Procedures

When State Migration Causes Destroy/Create Plan

Recovery steps:

  1. STOP: Do NOT apply. Run terraform state pull > emergency-backup.tfstate immediately
  2. Diagnose: Run terraform state list to see current addresses vs expected module addresses
  3. Fix state: Run terraform state mv commands for each resource that moved into module
  4. Fallback: If you already applied and destroyed resources, restore from backup: terraform state push emergency-backup.tfstate, then import destroyed resources: terraform import module.network.aws_vpc.main vpc-xxxxx

When Module Has Too Many Variables (Leaky Abstraction)

Recovery steps:

  1. Audit usage: Survey all consumers—which variables do they actually use? (Often 20% of variables)
  2. Identify patterns: Group consumers by usage pattern (data buckets, log buckets, artifact buckets)
  3. Redesign interface: Replace 50 variables with bucket_type enum + sensible defaults per type
  4. Fallback: If redesign too risky, create v2 module with clean interface, deprecate v1 over 6 months

When Premature Module Needs Different Features Per Consumer

Recovery steps:

  1. Count feature flags: If >10 boolean toggles, module is wrong abstraction
  2. Split by usage: Create separate modules per use case (simple-bucket, versioned-bucket, replicated-bucket)
  3. Migrate incrementally: New consumers use new modules, old consumers stay on v1 (deprecate over time)
  4. Fallback: If splitting too complex, add advanced_config escape hatch allowing raw HCL passthrough for edge cases

When Multiple Module Versions Cause Maintenance Hell

Recovery steps:

  1. Audit versions: Run grep -r 'source.*?ref=' . | sort | uniq -c to see version distribution
  2. Create migration path: Write automated migration script (sed/awk) to update HCL from v1 → v2
  3. Coordinate upgrades: Schedule "module upgrade week" where all teams migrate together
  4. Fallback: If coordination impossible, use monorepo with workspace protocol (source = "../../modules/vpc") to eliminate versions—all consumers use same code, breaking changes impossible

When to Load Full Reference

MANDATORY - READ ENTIRE FILE: references/state-migration.md when:

  • Migrating 5+ resources into module with complex dependencies
  • Encountering 3+ state-related errors during migration (resource not found, duplicate)
  • Setting up automated state migration for 10+ similar refactorings
  • Need rollback procedures for failed state migration in production

MANDATORY - READ ENTIRE FILE: references/module-testing.md when:

  • Module has 3+ consumers and needs automated testing
  • Setting up CI/CD pipeline for module with 5+ test scenarios
  • Implementing contract tests between module versions
  • Need to test 10+ module input combinations

Do NOT load references for:

  • Basic refactoring decisions (use Strategic Assessment section)
  • Single resource state moves (use Error Recovery section above)
  • Deciding whether to create module (use Critical Decision section)

Resources

Arguments

  • $ARGUMENTS: Terraform path, resource group, or module decision prompt
    • Example: /refactor-module infrastructure/networking
    • Example: /refactor-module should these 8 resources become a module
    • If empty: ask which Terraform slice is under consideration
Weekly Installs
11
GitHub Stars
8
First Seen
Feb 14, 2026
Installed on
codex11
opencode10
gemini-cli10
github-copilot10
kimi-cli10
amp10