kafka-iac-deployment
Kafka Infrastructure as Code (IaC) Deployment
Expert guidance for deploying Apache Kafka using Terraform across multiple platforms.
When to Use This Skill
I activate when you need help with:
- Terraform deployments: "Deploy Kafka with Terraform", "provision Kafka cluster"
- Platform selection: "Should I use AWS MSK or self-hosted Kafka?", "compare Kafka platforms"
- Infrastructure planning: "How to size Kafka infrastructure", "Kafka on AWS vs Azure"
- IaC automation: "Automate Kafka deployment", "CI/CD for Kafka infrastructure"
What I Know
Available Terraform Modules
This plugin provides 3 production-ready Terraform modules:
1. Apache Kafka (Self-Hosted, KRaft Mode)
- Location: `plugins/specweave-kafka/terraform/apache-kafka/`
- Platform: AWS EC2 (can adapt to other clouds)
- Architecture: KRaft mode (no ZooKeeper dependency)
- Features:
- Multi-broker cluster (3-5 brokers recommended)
- Security groups with SASL_SSL
- IAM roles for S3 backups
- CloudWatch metrics and alarms
- Auto-scaling group support
- Custom VPC and subnet configuration
- Use When:
- ✅ You need full control over Kafka configuration
- ✅ Running Kafka 3.6+ (KRaft mode)
- ✅ Want to avoid ZooKeeper operational overhead
- ✅ Multi-cloud or hybrid deployments
- Variables:

module "kafka" {
  source            = "../../plugins/specweave-kafka/terraform/apache-kafka"
  environment       = "production"
  broker_count      = 3
  kafka_version     = "3.7.0"
  instance_type     = "m5.xlarge"
  vpc_id            = var.vpc_id
  subnet_ids        = var.subnet_ids
  domain            = "example.com"
  enable_s3_backups = true
  enable_monitoring = true
}
2. AWS MSK (Managed Streaming for Kafka)
- Location: `plugins/specweave-kafka/terraform/aws-msk/`
- Platform: AWS Managed Service
- Features:
- Fully managed Kafka service
- IAM authentication + SASL/SCRAM
- Auto-scaling (provisioned throughput)
- Built-in monitoring (CloudWatch)
- Multi-AZ deployment
- Encryption in transit and at rest
- Use When:
- ✅ You want AWS to manage Kafka operations
- ✅ Need tight AWS integration (IAM, KMS, CloudWatch)
- ✅ Prefer operational simplicity over cost
- ✅ Running in AWS VPC
- Variables:

module "msk" {
  source                    = "../../plugins/specweave-kafka/terraform/aws-msk"
  cluster_name              = "my-kafka-cluster"
  kafka_version             = "3.6.0"
  number_of_broker_nodes    = 3
  broker_node_instance_type = "kafka.m5.large"
  vpc_id                    = var.vpc_id
  subnet_ids                = var.private_subnet_ids
  enable_iam_auth           = true
  enable_scram_auth         = false
  enable_auto_scaling       = true
}
3. Azure Event Hubs (Kafka API)
- Location: `plugins/specweave-kafka/terraform/azure-event-hubs/`
- Platform: Azure Managed Service
- Features:
- Kafka 1.0+ protocol support
- Auto-inflate (elastic scaling)
- Premium SKU for high throughput
- Zone redundancy
- Private endpoints (VNet integration)
- Event capture to Azure Storage
- Use When:
- ✅ Running on Azure cloud
- ✅ Need Kafka-compatible API without Kafka operations
- ✅ Want serverless scaling (auto-inflate)
- ✅ Integrating with Azure ecosystem
- Variables:

module "event_hubs" {
  source                   = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
  namespace_name           = "my-event-hub-ns"
  resource_group_name      = var.resource_group_name
  location                 = "eastus"
  sku                      = "Premium"
  capacity                 = 1
  kafka_enabled            = true
  auto_inflate_enabled     = true
  maximum_throughput_units = 20
}
Platform Selection Decision Tree
Need Kafka deployment? START HERE:
├─ Running on AWS?
│ ├─ YES → Want managed service?
│ │ ├─ YES → Use AWS MSK module (terraform/aws-msk)
│ │ └─ NO → Use Apache Kafka module (terraform/apache-kafka)
│ └─ NO → Continue...
│
├─ Running on Azure?
│ ├─ YES → Use Azure Event Hubs module (terraform/azure-event-hubs)
│ └─ NO → Continue...
│
├─ Multi-cloud or hybrid?
│ └─ YES → Use Apache Kafka module (most portable)
│
├─ Need maximum control?
│ └─ YES → Use Apache Kafka module
│
└─ Default → Use Apache Kafka module (self-hosted, KRaft mode)
Deployment Workflows
Workflow 1: Deploy Self-Hosted Kafka (Apache Kafka Module)
Scenario: You want full control over Kafka on AWS EC2
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "kafka_cluster" {
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
environment = "production"
broker_count = 3
kafka_version = "3.7.0"
instance_type = "m5.xlarge"
vpc_id = "vpc-12345678"
subnet_ids = ["subnet-abc", "subnet-def", "subnet-ghi"]
domain = "kafka.example.com"
enable_s3_backups = true
enable_monitoring = true
tags = {
Project = "MyApp"
Environment = "Production"
}
}
output "broker_endpoints" {
value = module.kafka_cluster.broker_endpoints
}
EOF
# 2. Initialize Terraform
terraform init
# 3. Plan deployment (review what will be created)
terraform plan
# 4. Apply (create infrastructure)
terraform apply
# 5. Get broker endpoints
terraform output broker_endpoints
# Output: ["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093", ...]
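Clients then connect through the SASL_SSL listener. A minimal `client.properties` sketch follows; the SASL mechanism and the credentials shown are assumptions, so check what the module actually configures on the listener:

```properties
# Sketch only: mechanism and credentials are assumptions, not module outputs
bootstrap.servers=kafka-0.kafka.example.com:9093
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="app-user" \
  password="change-me";
```

A quick smoke test with the stock CLI: `kafka-topics.sh --bootstrap-server kafka-0.kafka.example.com:9093 --command-config client.properties --list`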
Workflow 2: Deploy AWS MSK (Managed Service)
Scenario: You want AWS to manage Kafka operations
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "msk_cluster" {
source = "../../plugins/specweave-kafka/terraform/aws-msk"
cluster_name = "my-msk-cluster"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
enable_iam_auth = true
enable_auto_scaling = true
tags = {
Project = "MyApp"
}
}
output "bootstrap_brokers" {
value = module.msk_cluster.bootstrap_brokers_sasl_iam
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Configure IAM authentication
# (module outputs IAM policy, attach to your application role)
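That attachment can live in Terraform next to the cluster. A sketch follows; the output name `client_iam_policy_arn` is hypothetical, so check the module's `outputs.tf` for the real name:

```hcl
# Hypothetical output name "client_iam_policy_arn" -- verify in the module's outputs.tf
resource "aws_iam_role_policy_attachment" "app_msk_access" {
  role       = aws_iam_role.app.name
  policy_arn = module.msk_cluster.client_iam_policy_arn
}
```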
Workflow 3: Deploy Azure Event Hubs (Kafka API)
Scenario: You're on Azure and want Kafka-compatible API
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "event_hubs" {
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
namespace_name = "my-kafka-namespace"
resource_group_name = "my-resource-group"
location = "eastus"
sku = "Premium"
capacity = 1
kafka_enabled = true
auto_inflate_enabled = true
maximum_throughput_units = 20
# Create hubs (topics) for your use case
hubs = [
{ name = "user-events", partitions = 12 },
{ name = "order-events", partitions = 6 },
{ name = "payment-events", partitions = 3 }
]
}
output "connection_string" {
value = module.event_hubs.connection_string
sensitive = true
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Get connection details
terraform output connection_string
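Kafka clients reach Event Hubs on port 9093 using SASL PLAIN, with the literal username `$ConnectionString` and the namespace connection string as the password. A `client.properties` sketch, reusing the namespace name from the example above:

```properties
bootstrap.servers=my-kafka-namespace.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="$ConnectionString" \
  password="<value of terraform output connection_string>";
```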
Infrastructure Sizing Recommendations
Small Environment (Dev/Test)
# Self-hosted: 1 broker, m5.large
broker_count = 1
instance_type = "m5.large"
# AWS MSK: 1 broker per AZ, kafka.m5.large
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
# Azure Event Hubs: Basic SKU
sku = "Basic"
capacity = 1
Medium Environment (Staging/Production)
# Self-hosted: 3 brokers, m5.xlarge
broker_count = 3
instance_type = "m5.xlarge"
# AWS MSK: 3 brokers, kafka.m5.xlarge
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.xlarge"
# Azure Event Hubs: Standard SKU with auto-inflate
sku = "Standard"
capacity = 2
auto_inflate_enabled = true
maximum_throughput_units = 10
Large Environment (High-Throughput Production)
# Self-hosted: 5+ brokers, m5.2xlarge or m5.4xlarge
broker_count = 5
instance_type = "m5.2xlarge"
# AWS MSK: 6+ brokers, kafka.m5.2xlarge, auto-scaling
number_of_broker_nodes = 6
broker_node_instance_type = "kafka.m5.2xlarge"
enable_auto_scaling = true
# Azure Event Hubs: Premium SKU with zone redundancy
sku = "Premium"
capacity = 4
zone_redundant = true
maximum_throughput_units = 20
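These tiers can be folded into one lookup table so every environment uses consistent values. A sketch for the self-hosted module, with sizes copied from the tiers above; `var.size` is a hypothetical variable holding `small`, `medium`, or `large`, and the remaining module variables are elided:

```hcl
locals {
  sizing = {
    small  = { broker_count = 1, instance_type = "m5.large" }
    medium = { broker_count = 3, instance_type = "m5.xlarge" }
    large  = { broker_count = 5, instance_type = "m5.2xlarge" }
  }
}

module "kafka" {
  source        = "../../plugins/specweave-kafka/terraform/apache-kafka"
  environment   = var.environment
  broker_count  = local.sizing[var.size].broker_count
  instance_type = local.sizing[var.size].instance_type
  vpc_id        = var.vpc_id
  subnet_ids    = var.subnet_ids
}
```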
Best Practices
Security Best Practices
- Always use encryption in transit
  - Self-hosted: Enable the SASL_SSL listener
  - AWS MSK: Set `encryption_in_transit_client_broker = "TLS"`
  - Azure Event Hubs: HTTPS/TLS enabled by default
- Use IAM authentication (when possible)
  - AWS MSK: `enable_iam_auth = true`
  - Azure Event Hubs: Managed identities
- Network isolation
  - Deploy in private subnets
  - Use security groups/NSGs restrictively
  - Azure: Enable private endpoints for the Premium SKU
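For the self-hosted module, "restrictive" typically means opening the SASL_SSL port only to the application tier. A sketch follows; the security group resource names and the 9093 listener port are assumptions:

```hcl
# Sketch: allow the SASL_SSL listener (9093 assumed) only from the app tier's SG
resource "aws_security_group_rule" "kafka_from_app" {
  type                     = "ingress"
  from_port                = 9093
  to_port                  = 9093
  protocol                 = "tcp"
  security_group_id        = aws_security_group.kafka.id
  source_security_group_id = aws_security_group.app.id
}
```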
High Availability Best Practices
- Multi-AZ deployment
  - Self-hosted: Distribute brokers across 3+ AZs
  - AWS MSK: Automatically multi-AZ
  - Azure Event Hubs: Enable `zone_redundant = true` (Premium)
- Replication factor = 3
  - Self-hosted: `default.replication.factor=3`
  - AWS MSK: Configured automatically
  - Azure Event Hubs: N/A (fully managed)
- `min.insync.replicas = 2`
  - Ensures durability even if 1 broker fails
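With replication factor 3 and `min.insync.replicas = 2`, one broker can fail without losing acknowledged writes, as long as producers request acknowledgement from all in-sync replicas. The matching self-hosted broker settings:

```properties
# server.properties on each self-hosted broker
default.replication.factor=3
min.insync.replicas=2
```

Producers must also set `acks=all` for the durability guarantee to hold.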
Cost Optimization
- Right-size instances
  - Use the ClusterSizingCalculator utility (in the kafka-architecture skill)
  - Start small, scale up based on metrics
- Auto-scaling (where available)
  - AWS MSK: `enable_auto_scaling = true`
  - Azure Event Hubs: `auto_inflate_enabled = true`
- Retention policies
  - Set `log.retention.hours` based on actual needs (default: 168 hours = 7 days)
  - Shorter retention = lower storage costs
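As an illustration, dropping the broker-wide default from 7 days to 3 shrinks retained-log storage roughly proportionally (actual savings depend on topic-level overrides and compaction):

```properties
# server.properties: 72h retention instead of the 168h default
log.retention.hours=72
```

Note that a per-topic `retention.ms` setting takes precedence over this broker default.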
Monitoring Integration
All modules integrate with monitoring:
Self-Hosted Kafka
- CloudWatch metrics (via JMX Exporter)
- Prometheus + Grafana dashboards (see kafka-observability skill)
- Custom CloudWatch alarms
AWS MSK
- Built-in CloudWatch metrics
- Enhanced monitoring available
- Integration with CloudWatch Alarms
Azure Event Hubs
- Built-in Azure Monitor metrics
- Diagnostic logs to Log Analytics
- Integration with Azure Alerts
Troubleshooting
"Terraform destroy fails on security groups"
Cause: Resources that use the security groups still exist
Fix:
# 1. Find dependent resources
aws ec2 describe-network-interfaces --filters "Name=group-id,Values=sg-12345678"
# 2. Delete dependent resources first
# 3. Retry terraform destroy
"AWS MSK cluster takes 20+ minutes to create"
Cause: MSK provisioning is inherently slow (expected AWS behavior)
Fix: This is normal; no action is needed. For unattended runs, skip the confirmation prompt with `-auto-approve`:
terraform apply -auto-approve
"Azure Event Hubs: Connection refused"
Cause: Kafka protocol not enabled OR incorrect connection string
Fix:
- Verify `kafka_enabled = true` in Terraform
- Use the Kafka connection string (not the Event Hubs connection string)
- Check firewall rules (the Premium SKU supports private endpoints)
Integration with Other Skills
- kafka-architecture: For cluster sizing and partitioning strategy
- kafka-observability: For Prometheus + Grafana setup after deployment
- kafka-kubernetes: For deploying Kafka on Kubernetes (alternative to Terraform)
- kafka-cli-tools: For testing deployed clusters with kcat
Quick Reference Commands
# Terraform workflow
terraform init # Initialize modules
terraform plan # Preview changes
terraform apply # Create infrastructure
terraform output # Get outputs (endpoints, etc.)
terraform destroy # Delete infrastructure
# AWS MSK specific
aws kafka list-clusters # List MSK clusters
aws kafka describe-cluster --cluster-arn <arn> # Get cluster details
# Azure Event Hubs specific
az eventhubs namespace list # List namespaces
az eventhubs eventhub list --namespace-name <name> --resource-group <rg> # List hubs
Next Steps After Deployment:
- Use kafka-observability skill to set up Prometheus + Grafana monitoring
- Use kafka-cli-tools skill to test cluster with kcat
- Deploy your producer/consumer applications
- Monitor cluster health and performance