Deployment Troubleshooting Guide
When a deployment fails, the error message is your best starting point — but Terraform and cloud provider error messages are often cryptic or misleading. This guide maps common error patterns to their root causes and fixes, organized by the deployment phase where they typically occur.
When helping a user debug a deployment issue, start by identifying which phase failed, then match the error text against the patterns below.
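The triage step above can be sketched as a small shell function. This is only an illustration: the patterns are copied from the tables in this guide, the phase labels (`init`, `auth`, `create`, `state`) are made-up names rather than RedC terminology, and real error text may need looser matching.

```shell
# Map one line of error output to the deployment phase it most likely belongs to.
# Patterns come from the tables in this guide; the labels are illustrative.
classify_phase() {
  local msg="$1"
  case "$msg" in
    *"Failed to install provider"*|*"Could not load plugin"*|\
    *"Backend initialization required"*|*"Failed to query available provider packages"*)
      echo "init" ;;
    *NoCredentialProviders*|*AuthorizationFailed*|*"googleapi: Error 403"*|\
    *InvalidAccessKeyId*|*AuthFailure*)
      echo "auth" ;;
    *InstanceLimitExceeded*|*VPCLimitExceeded*|*InvalidParameterValue*|\
    *InsufficientInstanceCapacity*|*"Insufficient balance"*)
      echo "create" ;;
    *"Error acquiring the state lock"*|*"already exists"*|*"Unsupported attribute"*)
      echo "state" ;;
    *)
      # Network and cloud-init problems usually show up as symptoms, not error text.
      echo "unknown" ;;
  esac
}
```

Calling `classify_phase "$(tail -n 1 apply.log)"` (with `apply.log` as a hypothetical saved log) gives a first guess at which section of this guide to read.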
Phase 1: Terraform Init Errors
Init failures happen before any infrastructure is created. They're usually about provider plugins or backend configuration.
| Error Pattern | Root Cause | Fix |
|---|---|---|
| `Failed to install provider` | No internet, proxy blocking registry.terraform.io, or provider name typo | Check connectivity: `curl -I https://registry.terraform.io`. If behind a proxy, set `HTTPS_PROXY`. Verify the provider source string |
| `Could not load plugin` | Plugin cache corrupted or provider version mismatch | Run `terraform init -upgrade` to re-download. Delete `.terraform/` and retry if the error persists |
| `Backend initialization required` | Remote state bucket doesn't exist or credentials are wrong | Create the bucket first and verify the credentials have access to it. Check that the region matches |
| `Failed to query available provider packages` | DNS resolution failure or firewall blocking | Try `nslookup registry.terraform.io`. Consider using `terraform init -plugin-dir` with pre-downloaded providers |
Phase 2: Authentication Errors
These surface during terraform plan when the provider tries to validate credentials against the cloud API.
| Error Pattern | Provider | Fix |
|---|---|---|
| `NoCredentialProviders` | AWS | `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` not set or expired. Re-export them or check the RedC profile |
| `AuthorizationFailed` | Azure | Service principal lacks permissions on the subscription. Verify `ARM_SUBSCRIPTION_ID` matches, check role assignments |
| `googleapi: Error 403` | GCP | Service account doesn't have the required permissions. Check IAM roles in the GCP Console, verify the `GOOGLE_APPLICATION_CREDENTIALS` path |
| `InvalidAccessKeyId` | AWS/Alibaba | Access key deleted or rotated. Generate a new key pair in the console |
| `AuthFailure` | Tencent | `TENCENTCLOUD_SECRET_ID` wrong (note: it's "SecretId" not "AccessKey") |
Debugging tip: When a user reports an auth error, ask them to verify the environment variables are set in the current shell session. A common mistake is setting them in one terminal and running RedC in another.
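A quick way to run that check is a helper that lists which variables are missing from the current session. The sketch below is an assumption about how one might do it, not part of RedC; it uses `printenv`, so it only counts variables that are exported, which is what a child process such as Terraform actually sees.

```shell
# Print the names of any listed variables that are unset, empty, or not
# exported in THIS shell. Returns 0 when all are present, 1 otherwise.
missing_env_vars() {
  local name missing=0
  for name in "$@"; do
    # printenv only sees exported variables, the same view a child process gets.
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

For AWS, the user could run `missing_env_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY` in the same terminal where the deployment is started; any name it prints needs to be exported again.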
Phase 3: Resource Creation Failures
These happen during terraform apply when the cloud provider rejects a resource creation request.
| Error Pattern | Root Cause | Fix |
|---|---|---|
| `InstanceLimitExceeded` | Account quota reached for this instance type | Request a quota increase via support ticket, or use a different instance type/region |
| `VPCLimitExceeded` | Default limit is 5 VPCs per region | Clean up unused VPCs in the console, or request a limit increase |
| `InvalidParameterValue` for instance type | Instance type not available in the selected AZ | Check availability with `aws ec2 describe-instance-type-offerings`, try a different AZ or type |
| `InsufficientInstanceCapacity` | AWS capacity constraints in that AZ | Retry in a different AZ (-a, -b, -c), or try a different instance family |
| `Insufficient balance` | Prepaid account ran out of credit | Top up the account. Use `get_balances` to check the current balance |
Phase 4: Network and Connectivity Issues
These typically appear after instances are created but the user can't reach them.
| Symptom | Likely Cause | Investigation Steps |
|---|---|---|
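| SSH connection refused | The instance is reachable but nothing is listening on the SSH port: sshd has not started yet (instance still booting) or listens on a non-default port. A blocking security group usually produces a timeout, not a refusal | Wait for boot to finish and retry. Check the instance console log and confirm the SSH port |
| SSH connection timed out | Security group doesn't allow inbound SSH from the user's IP, the instance has no public IP, or it is in a private subnet without NAT | Check the security group ingress rules and verify the user's current public IP matches the allowed CIDR. Verify the instance has a public IP in the console. Check that the subnet route table has an internet gateway |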
| `timeout awaiting response` during apply | Security group blocks outbound HTTPS (443) | The instance needs outbound access to download packages. Check egress rules |
| Instance created but tools don't work | user_data script failed silently | SSH in and check /var/log/cloud-init-output.log for errors |
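Checking whether the user's public IP falls inside the CIDR allowed by the security group is simple bit arithmetic. The helper below is a minimal sketch with hypothetical function names; it handles IPv4 only and expects the CIDR to include an explicit prefix length such as `/24`.

```shell
# Convert a dotted IPv4 address to a single integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Succeed (exit 0) when IPv4 address $1 lies inside CIDR block $2.
ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  # Build the netmask: the top $bits bits set, the rest cleared.
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

With the allowed CIDR copied from the security group rule, `ip_in_cidr "$MY_IP" 203.0.113.0/24 && echo allowed` tells the user whether their current address is covered; `MY_IP` here stands for whatever address the user looked up for themselves.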
Phase 5: State Issues
State problems are dangerous because they can cause Terraform to lose track of real infrastructure, leading to orphaned resources you're still paying for.
| Error Pattern | Root Cause | Fix |
|---|---|---|
| `Error acquiring the state lock` | Another `terraform apply` is running, or a previous run crashed without releasing the lock | Wait for the other process to finish. If it crashed, force-unlock: `terraform force-unlock <LOCK_ID>` |
| `Resource already exists` | Resource was created outside Terraform (e.g., manually in the console) | Import it: `terraform import <resource_address> <resource_id>` |
| `Unsupported attribute` | Provider version upgraded and the attribute name changed | Pin the provider version in `required_providers`, or update your `.tf` to use the new attribute name |
| Drift between state and reality | Manual changes in the cloud console | Run `terraform plan` to see the diff, then decide: `terraform apply` to overwrite the manual changes, or `terraform apply -refresh-only` (the replacement for the deprecated `terraform refresh`) to update state |
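The `<LOCK_ID>` needed for `terraform force-unlock` is normally printed in the lock error itself. The sketch below pulls it out of saved output; it assumes the error contains a `Lock Info:` block with an `ID:` line, which is how the message usually looks but may differ between Terraform versions and backends. The function name is hypothetical.

```shell
# Read Terraform output on stdin and print the lock ID from its "Lock Info:"
# block, if any. Prints nothing when no lock information is present.
extract_lock_id() {
  awk '
    /Lock Info:/ { in_lock = 1; next }
    in_lock {
      # The ID may be preceded by indentation or a box-drawing prefix,
      # so scan every field on the line for the "ID:" label.
      for (i = 1; i < NF; i++) {
        if ($i == "ID:") { print $(i + 1); exit }
      }
    }
  '
}
```

Only force-unlock after confirming that no other run is still active; releasing a lock that is genuinely in use can corrupt the state.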
Phase 6: User Data and Provisioning
Cloud-init runs on first boot and its failures are silent from Terraform's perspective — the instance is "created" but not properly configured.
| Symptom | Investigation | Fix |
|---|---|---|
| Packages not installed | Check `/var/log/cloud-init-output.log` | Usually DNS or proxy issues. Add `apt update` retry logic to the script |
| Script didn't run at all | Check `/var/log/cloud-init.log` for YAML parse errors | Validate the cloud-init YAML syntax. Common issue: wrong indentation in `write_files` |
| Script timed out | Long-running operations (compiling, large downloads) | Break into smaller scripts, or increase the timeout. Consider using RedC's `exec_command` for post-deploy setup instead |
| Wrong permissions on files | `write_files` defaults to root ownership | Set `owner` and `permissions` explicitly in the cloud-init config |
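The retry logic suggested for package installs could look like the helper below. It is a minimal sketch: the function name, attempt count, and delay are illustrative choices, not values taken from RedC or cloud-init.

```shell
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# In a user_data script this might be used as:
#   retry 5 10 apt-get update
```

A fixed delay keeps the example short; a longer or growing delay may suit instances whose network takes a while to come up on first boot.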