ecs-troubleshooting
ECS Troubleshooting Guide
Complete guide to diagnosing and resolving common ECS issues.
Quick Diagnostic Commands
# Check service status
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'
# List stopped tasks (failures)
aws ecs list-tasks \
--cluster production \
--service-name my-service \
--desired-status STOPPED
# Describe stopped task
aws ecs describe-tasks \
--cluster production \
--tasks <task-arn> \
--query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'
# View recent logs
aws logs tail /ecs/my-app --since 1h --follow
# Execute into container (debug)
aws ecs execute-command \
--cluster production \
--task <task-id> \
--container my-app \
--interactive \
--command "/bin/sh"
Task Failures
Task Status: STOPPED
Symptom
Tasks immediately stop after starting or fail to start.
Diagnostic Steps
import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):
    """Diagnose why a task stopped"""
    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    task = response['tasks'][0]
    # Task-level stop information
    print(f"Task Status: {task['lastStatus']}")
    print(f"Stop Code: {task.get('stopCode', 'N/A')}")
    print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")
    # Per-container exit details
    for container in task['containers']:
        print(f"\nContainer: {container['name']}")
        print(f"  Status: {container['lastStatus']}")
        print(f"  Exit Code: {container.get('exitCode', 'N/A')}")
        print(f"  Reason: {container.get('reason', 'N/A')}")
Common Causes & Solutions
1. Essential container failed
stoppedReason: "Essential container in task exited"
Solution: Check container logs for application errors
aws logs tail /ecs/my-app --since 30m
2. Task failed to start
stoppedReason: "Task failed to start"
Solution: Check execution role permissions
# Verify execution role can pull image
aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name ecr-access
3. CannotPullContainerError
reason: "CannotPullContainerError: Error response from daemon"
Solutions:
- Check ECR permissions in execution role
- Verify image exists:
aws ecr describe-images --repository-name my-app
- Check VPC endpoints or NAT gateway for private subnets
4. OutOfMemoryError
reason: "OutOfMemoryError: Container killed due to memory usage"
exitCode: 137
Solution: Increase memory in task definition
memory = 2048 # Increase from current value
5. Exit Code 1 (Application Error)
exitCode: 1
Solution: Check application logs for errors
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "ERROR"
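The causes listed above can be told apart mechanically from the fields that describe_tasks returns. The following is a minimal sketch, not a definitive classifier: the function name classify_stopped_task and the category labels are made up for illustration, and it assumes a task dict shaped like one element of response['tasks'].

```python
def classify_stopped_task(task: dict) -> str:
    """Return a rough category for why a task stopped.

    `task` is assumed to look like one entry of describe_tasks()['tasks'].
    The category strings are illustrative labels, not ECS values.
    """
    stopped_reason = task.get("stoppedReason", "")

    # Container-level reasons are usually more specific than the task-level one
    for container in task.get("containers", []):
        reason = container.get("reason", "")
        exit_code = container.get("exitCode")
        if "CannotPullContainerError" in reason:
            return "image-pull"
        if "OutOfMemoryError" in reason:
            return "out-of-memory"
        if exit_code == 137:
            # 137 means the process received SIGKILL; often, but not always, an OOM kill
            return "sigkill-possible-oom"
        if exit_code not in (None, 0):
            return "application-error"

    # Fall back to the task-level stoppedReason
    if "failed to start" in stopped_reason.lower():
        return "failed-to-start"
    if "Essential container" in stopped_reason:
        return "essential-container-exited"
    return "unknown"


if __name__ == "__main__":
    sample = {
        "stoppedReason": "Essential container in task exited",
        "containers": [{"name": "my-app", "exitCode": 1}],
    }
    print(classify_stopped_task(sample))
```

Checking container-level fields first is a deliberate choice here: the task-level stoppedReason is often the generic "Essential container in task exited" even when a container reports a more precise cause.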
Task Status: PENDING
Symptom
Tasks stuck in PENDING state, not transitioning to RUNNING.
Diagnostic Steps
def diagnose_pending_tasks(cluster: str, service: str):
    """Check why tasks are stuck in PENDING"""
    # Reuses the ecs client created in the STOPPED diagnostic.
    # PENDING tasks have desiredStatus RUNNING, so list those and filter on lastStatus.
    listed = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus='RUNNING'
    )
    if not listed['taskArns']:
        return  # describe_tasks rejects an empty task list
    # One describe_tasks call covers the whole page (up to 100 ARNs)
    tasks = ecs.describe_tasks(cluster=cluster, tasks=listed['taskArns'])['tasks']
    for task in tasks:
        if task['lastStatus'] != 'PENDING':
            continue
        print(f"Task {task['taskArn'].split('/')[-1]} is PENDING")
        # Check attachments for ENI issues
        for attachment in task.get('attachments', []):
            print(f"  Attachment: {attachment['type']} - {attachment['status']}")
            for detail in attachment.get('details', []):
                print(f"    {detail['name']}: {detail['value']}")
Common Causes & Solutions
1. No available capacity
Service my-service was unable to place a task because no container instance met all of its requirements
Solutions for Fargate:
- Check capacity provider limits
- Verify subnet has available IPs
- Check if region/AZ has Fargate capacity
2. ENI provisioning issues
Attachment status: PRECREATED
Solutions:
- Check security group allows required traffic
- Verify subnet has available IPs
- Check ENI limits for EC2 instances
3. Image pull taking too long
Container image: pulling
Solutions:
- Check image size (use smaller base images)
- Verify network connectivity to ECR
- Use VPC endpoints for faster pulls
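"Verify subnet has available IPs" matters because each task in awsvpc network mode gets its own ENI and therefore consumes one private IP from the subnet. The sketch below shows one way to flag subnets that are running low. It works on a dict shaped like the output of the EC2 describe_subnets call; the function name subnets_low_on_ips and the min_free threshold of 8 are illustrative assumptions, not AWS recommendations.

```python
def subnets_low_on_ips(describe_subnets_response: dict, min_free: int = 8) -> list:
    """Return (subnet_id, free_ip_count) for subnets below `min_free` free addresses.

    The input is assumed to look like the response of EC2 describe_subnets,
    which reports AvailableIpAddressCount per subnet.
    """
    low = []
    for subnet in describe_subnets_response.get("Subnets", []):
        free = subnet["AvailableIpAddressCount"]
        if free < min_free:
            low.append((subnet["SubnetId"], free))
    return low


if __name__ == "__main__":
    # Hypothetical response with made-up subnet IDs
    sample = {
        "Subnets": [
            {"SubnetId": "subnet-aaa", "AvailableIpAddressCount": 120},
            {"SubnetId": "subnet-bbb", "AvailableIpAddressCount": 2},
        ]
    }
    for subnet_id, free in subnets_low_on_ips(sample):
        print(f"{subnet_id} has only {free} free IPs")
```

In practice the response would come from something like boto3.client('ec2').describe_subnets(SubnetIds=[...]) for the subnets listed in the service's network configuration.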
Service Issues
Service Not Starting Tasks
Diagnostic
# Check service events
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].events[:10]'
Common Events & Solutions
1. "service my-service is unable to place a task"
Check task placement constraints and capacity.
2. "service my-service has reached a steady state"
Service is healthy - tasks are running as expected.
3. "service my-service was unable to place a task because no container instance met all requirements"
For Fargate: Check CPU/memory configurations are valid combinations.
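To make "valid combinations" concrete, here is a small sketch that checks a task-level CPU/memory pair against the Fargate size table. The table reflects the combinations AWS documents for Linux tasks at the time of writing; treat it as an assumption and confirm against the current Fargate documentation. The names FARGATE_MEMORY_MIB and is_valid_fargate_size are hypothetical.

```python
# CPU units -> allowed memory values in MiB (assumed from the Fargate task size table)
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if (cpu units, memory MiB) is a listed Fargate combination."""
    return memory in FARGATE_MEMORY_MIB.get(cpu, [])


if __name__ == "__main__":
    for cpu, memory in [(512, 1024), (256, 4096), (1024, 3072)]:
        verdict = "valid" if is_valid_fargate_size(cpu, memory) else "invalid"
        print(f"cpu={cpu} memory={memory}: {verdict}")
```

Task definitions store cpu and memory as strings, so convert with int() before calling the check.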
Deployment Stuck
Symptom
Deployment never reaches COMPLETED state.
Diagnostic
def check_deployment_status(cluster: str, service: str):
    """Check deployment progress"""
    response = ecs.describe_services(cluster=cluster, services=[service])
    svc = response['services'][0]
    for deployment in svc['deployments']:
        # rolloutState is only returned for the rolling update (ECS) deployment type
        rollout_state = deployment.get('rolloutState', 'N/A')
        print(f"\nDeployment: {deployment['id']}")
        print(f"  Status: {deployment['status']}")
        print(f"  Rollout State: {rollout_state}")
        print(f"  Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")
        # The reason is useful both while the rollout is running and after it fails
        if rollout_state in ('IN_PROGRESS', 'FAILED'):
            reason = deployment.get('rolloutStateReason', '')
            print(f"  Reason: {reason}")
Common Causes
1. Health check failures
rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"
Solutions:
- Check target group health check settings
- Increase healthCheckGracePeriodSeconds
- Verify application responds on health check path
2. Insufficient capacity
rolloutStateReason: "Service my-service was unable to place a task"
Solutions:
- Check subnet IP availability
- Reduce maximumPercent so the deployment launches fewer extra tasks at once and needs less spare capacity
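The effect of maximumPercent (and its counterpart minimumHealthyPercent) on capacity is easier to see with numbers. The sketch below computes the task-count bounds a rolling update works within, assuming the documented rounding: the minimum healthy count rounds up and the maximum count rounds down. The function names are hypothetical and the model is simplified (it ignores task health and draining).

```python
def deployment_task_bounds(desired_count: int,
                           minimum_healthy_percent: int,
                           maximum_percent: int) -> tuple:
    """Return (min_running, max_total) task counts during a rolling update."""
    # Ceiling division for the lower bound, floor division for the upper bound;
    # integer arithmetic avoids floating point rounding surprises.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_total = desired_count * maximum_percent // 100
    return min_running, max_total


def can_make_progress(desired_count: int,
                      minimum_healthy_percent: int,
                      maximum_percent: int) -> bool:
    """True if the scheduler may either start a new task or stop an old one."""
    min_running, max_total = deployment_task_bounds(
        desired_count, minimum_healthy_percent, maximum_percent)
    return max_total > desired_count or min_running < desired_count


if __name__ == "__main__":
    for settings in [(4, 100, 200), (3, 50, 150), (1, 100, 100)]:
        print(settings, deployment_task_bounds(*settings), can_make_progress(*settings))
```

With 4 desired tasks at 100/200 the deployment may briefly need capacity for 8 tasks, which is where subnet IPs or instance capacity can run out. A combination such as 100/100 leaves no room to start or stop anything, so a rolling update under this model cannot advance.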
Networking Issues
Tasks Cannot Connect to Internet
Symptoms
- Cannot pull images
- Cannot reach external APIs
- Timeouts on external calls
Solutions
For private subnets:
# Option 1: NAT Gateway
resource "aws_nat_gateway" "main" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public.id
}
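# Option 2: VPC Endpoints (recommended)
# Image pulls also need the ecr.dkr interface endpoint and an S3 gateway endpoint
# (ECR stores layers in S3); the awslogs driver needs the logs interface endpoint.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  private_dns_enabled = true # lets tasks reach ECR at its default hostname
}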
Tasks Cannot Connect to Each Other
Symptom
Service-to-service communication fails.
Diagnostic
# Check security group rules
aws ec2 describe-security-groups \
--group-ids sg-12345 \
--query 'SecurityGroups[0].IpPermissions'
Solutions
# Allow traffic between ECS tasks
resource "aws_security_group_rule" "ecs_to_ecs" {
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
security_group_id = aws_security_group.ecs_tasks.id
source_security_group_id = aws_security_group.ecs_tasks.id
}
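The IpPermissions output from the diagnostic command can be checked programmatically for the rule shown above. This is a minimal sketch with a hypothetical function name, allows_from_group. It assumes the input is the IpPermissions list from describe-security-groups and only considers rules whose source is another security group, not CIDR ranges or prefix lists.

```python
def allows_from_group(ip_permissions: list, port: int,
                      source_group_id: str, protocol: str = "tcp") -> bool:
    """Return True if some ingress rule admits `port` from `source_group_id`."""
    for perm in ip_permissions:
        rule_protocol = perm.get("IpProtocol")
        # "-1" means all protocols and all ports
        if rule_protocol not in (protocol, "-1"):
            continue
        if rule_protocol != "-1":
            from_port = perm.get("FromPort", 0)
            to_port = perm.get("ToPort", 65535)
            if not from_port <= port <= to_port:
                continue
        # Only security-group sources are inspected in this sketch
        for pair in perm.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == source_group_id:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical rule set: port 8080 open to tasks in the same group
    rules = [{
        "IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
        "UserIdGroupPairs": [{"GroupId": "sg-12345"}], "IpRanges": [],
    }]
    print(allows_from_group(rules, 8080, "sg-12345"))
```

Remember that the destination's ingress rules are what matter for task-to-task traffic; outbound rules on the caller only need checking if the default allow-all egress has been removed.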
Load Balancer Health Checks Failing
Symptom
Target group app-tg: 0 healthy, 3 unhealthy
Diagnostic
# Check target health
aws elbv2 describe-target-health \
--target-group-arn <target-group-arn>
Common Causes & Solutions
1. Wrong health check path
health_check {
path = "/health" # Must match application endpoint
}
2. Container not listening on expected port
# Verify inside container
aws ecs execute-command --cluster production --task <task-id> \
--container my-app --interactive --command "netstat -tlnp"
3. Security group blocking ALB
# Allow ALB to reach ECS tasks
resource "aws_security_group_rule" "alb_to_ecs" {
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
security_group_id = aws_security_group.ecs_tasks.id
source_security_group_id = aws_security_group.alb.id
}
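When many targets are unhealthy, the Reason and Description fields in the describe-target-health output usually point at which of the causes above applies. Here is a rough sketch that summarizes that output; the function name summarize_target_health is hypothetical and the input is assumed to be the parsed JSON (or boto3 elbv2 response) of describe_target_health.

```python
from collections import Counter


def summarize_target_health(response: dict) -> tuple:
    """Return (state_counts, problems) from a describe_target_health response.

    `problems` lists (target_id, port, reason, description) for every
    target whose state is not "healthy".
    """
    states = Counter()
    problems = []
    for desc in response.get("TargetHealthDescriptions", []):
        health = desc["TargetHealth"]
        states[health["State"]] += 1
        if health["State"] != "healthy":
            target = desc["Target"]
            problems.append((
                target["Id"],
                target.get("Port"),
                health.get("Reason", "N/A"),
                health.get("Description", ""),
            ))
    return dict(states), problems


if __name__ == "__main__":
    # Hypothetical response with made-up IP targets
    sample = {"TargetHealthDescriptions": [
        {"Target": {"Id": "10.0.1.10", "Port": 8080},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "10.0.1.11", "Port": 8080},
         "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout",
                          "Description": "Request timed out"}},
    ]}
    counts, problems = summarize_target_health(sample)
    print(counts)
    for problem in problems:
        print(problem)
```

As a rule of thumb, a timeout reason tends to indicate a security group or port problem, while a response code mismatch suggests the health check path is reachable but returns an unexpected status.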
IAM & Permissions Issues
CannotPullContainerError
Symptom
CannotPullContainerError: Error response from daemon: pull access denied
Solution: Task Execution Role
resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
role = aws_iam_role.ecs_task_execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# For cross-account ECR
# The repository policy in the other account must also grant pull access to this role.
resource "aws_iam_role_policy" "cross_account_ecr" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "arn:aws:ecr:*:OTHER_ACCOUNT:repository/*"
    }]
  })
}
Secrets Access Denied
Symptom
ResourceInitializationError: unable to pull secrets
Solution
resource "aws_iam_role_policy" "secrets_access" {
role = aws_iam_role.ecs_task_execution.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = "arn:aws:secretsmanager:*:*:secret:my-app/*"
},
{
Effect = "Allow"
Action = ["ssm:GetParameters"]
Resource = "arn:aws:ssm:*:*:parameter/my-app/*"
},
{
Effect = "Allow"
Action = ["kms:Decrypt"]
Resource = aws_kms_key.secrets.arn
}
]
})
}
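To scope the Resource entries in a policy like the one above, it helps to list exactly which secrets a task definition references. The sketch below groups the valueFrom entries by service. The function name group_secret_references is hypothetical, the input is assumed to be shaped like the taskDefinition object from describe_task_definition, and a valueFrom that is not an ARN is treated as an SSM parameter name, which ECS accepts for parameters in the same Region.

```python
def group_secret_references(task_definition: dict) -> dict:
    """Group the secrets' valueFrom entries by AWS service.

    Returns e.g. {"secretsmanager": [...], "ssm": [...]}.
    """
    groups = {}
    for container in task_definition.get("containerDefinitions", []):
        for secret in container.get("secrets", []):
            value_from = secret["valueFrom"]
            if value_from.startswith("arn:"):
                # ARN layout: arn:partition:service:region:account:resource
                service = value_from.split(":")[2]
            else:
                # Assumption: a bare name refers to an SSM parameter
                service = "ssm"
            groups.setdefault(service, []).append(value_from)
    return groups


if __name__ == "__main__":
    # Hypothetical task definition fragment with made-up ARNs
    sample = {"containerDefinitions": [{
        "name": "my-app",
        "secrets": [
            {"name": "DB_PASSWORD",
             "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:my-app/db-AbCdEf"},
            {"name": "API_KEY",
             "valueFrom": "arn:aws:ssm:us-east-1:111122223333:parameter/my-app/api-key"},
            {"name": "LOG_LEVEL", "valueFrom": "/my-app/log-level"},
        ],
    }]}
    for service, refs in group_secret_references(sample).items():
        print(service, refs)
```

Comparing this list against the execution role's policy is a quick way to spot a secret that falls outside the my-app/* prefix or is encrypted with a different KMS key.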
Execute Command Not Working
Symptom
SessionManagerPlugin is not found
or
Execute command is disabled
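Solutions
"SessionManagerPlugin is not found" is a client-side error: install the Session Manager plugin on the machine running the AWS CLI. The steps below address "Execute command is disabled".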
1. Enable execute command on service
resource "aws_ecs_service" "app" {
  # ... other service arguments omitted
  enable_execute_command = true
}
2. Add SSM permissions to task role
resource "aws_iam_role_policy" "ssm_exec" {
role = aws_iam_role.ecs_task.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel"
]
Resource = "*"
}]
})
}
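After applying both changes, a task still has to be launched with the setting enabled and its exec agent has to be running before execute-command works. The sketch below checks those two things on a task dict shaped like one entry of describe_tasks()['tasks']; the function name exec_readiness_problems is hypothetical.

```python
def exec_readiness_problems(task: dict) -> list:
    """Return human-readable reasons why ECS Exec may not work for `task`."""
    problems = []

    # Set at launch time; tasks started before the service change keep the old value
    if not task.get("enableExecuteCommand", False):
        problems.append("enableExecuteCommand is false for this task")

    for container in task.get("containers", []):
        agents = {
            agent.get("name"): agent.get("lastStatus")
            for agent in container.get("managedAgents", [])
        }
        status = agents.get("ExecuteCommandAgent")
        if status != "RUNNING":
            problems.append(
                f"{container.get('name')}: ExecuteCommandAgent is {status or 'missing'}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical task launched before execute command was enabled
    sample = {"enableExecuteCommand": False,
              "containers": [{"name": "my-app", "managedAgents": []}]}
    for problem in exec_readiness_problems(sample):
        print(problem)
```

If the flag is false on running tasks even though the service has it enabled, forcing a new deployment replaces them with tasks that pick up the setting.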
Performance Issues
High CPU/Memory Usage
Diagnostic
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):
    """Get CPU and memory metrics"""
    end = datetime.now(timezone.utc)
    # Query both service-level utilization metrics over the last hour
    for metric in ('CPUUtilization', 'MemoryUtilization'):
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/ECS',
            MetricName=metric,
            Dimensions=[
                {'Name': 'ClusterName', 'Value': cluster},
                {'Name': 'ServiceName', 'Value': service}
            ],
            StartTime=end - timedelta(hours=1),
            EndTime=end,
            Period=300,
            Statistics=['Average', 'Maximum']
        )
        print(metric)
        # Datapoints are returned unordered, so sort by time before printing
        for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):
            print(f"{point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")
Solutions
1. Right-size tasks
# Increase resources
cpu = "1024" # from 512
memory = "2048" # from 1024
2. Enable auto-scaling
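resource "aws_appautoscaling_policy" "cpu" {
  # name, resource_id, scalable_dimension, and service_namespace omitted for brevity
  policy_type = "TargetTrackingScaling"
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}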
Slow Task Startup
Causes & Solutions
1. Large container image
- Use smaller base images (alpine, distroless)
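- Fargate pulls the image on every task launch (there is no layer cache); on platform version 1.4.0, consider lazy loading with Seekable OCI (SOCI) indexes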
2. Slow application startup
- Increase startPeriod in health check
- Optimize application initialization
3. Slow secret/config loading
- Use VPC endpoints for faster access
- Cache configuration at startup
Log Analysis
CloudWatch Logs Queries
# Find errors in last hour (date -d is GNU date syntax)
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--start-time $(date -d '-1 hour' +%s000) \
--filter-pattern "ERROR"
# Find OOM kills
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "OutOfMemory"
# Find slow requests
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "[timestamp, level, duration>1000, ...]"
CloudWatch Insights
# Top errors by count
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10
# Average response time
fields @timestamp, responseTime
| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)
Related Skills
- boto3-ecs: SDK patterns
- terraform-ecs: Infrastructure as Code
- ecs-fargate: Fargate specifics
- ecs-deployment: Deployment strategies
Quick Reference
| Symptom | First Check | Common Cause |
|---|---|---|
| Task STOPPED | stoppedReason | Container crash, OOM |
| Task PENDING | Attachments | ENI/network issues |
| Deployment stuck | Health checks | ALB health check failing |
| Cannot pull image | Execution role | Missing ECR permissions |
| Cannot connect | Security groups | Wrong SG rules |