ECS Troubleshooting Guide

Complete guide to diagnosing and resolving common ECS issues.

Quick Diagnostic Commands

# Check service status
aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'

# List stopped tasks (failures)
aws ecs list-tasks \
  --cluster production \
  --service-name my-service \
  --desired-status STOPPED

# Describe stopped task
aws ecs describe-tasks \
  --cluster production \
  --tasks <task-arn> \
  --query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'

# View recent logs
aws logs tail /ecs/my-app --since 1h --follow

# Execute into container (debug)
aws ecs execute-command \
  --cluster production \
  --task <task-id> \
  --container my-app \
  --interactive \
  --command "/bin/sh"

Task Failures

Task Status: STOPPED

Symptom

Tasks immediately stop after starting or fail to start.

Diagnostic Steps

import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):
    """Diagnose why a task stopped"""

    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    task = response['tasks'][0]

    print(f"Task Status: {task['lastStatus']}")
    print(f"Stop Code: {task.get('stopCode', 'N/A')}")
    print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")

    for container in task['containers']:
        print(f"\nContainer: {container['name']}")
        print(f"  Status: {container['lastStatus']}")
        print(f"  Exit Code: {container.get('exitCode', 'N/A')}")
        print(f"  Reason: {container.get('reason', 'N/A')}")

Common Causes & Solutions

1. Essential container failed

stoppedReason: "Essential container in task exited"

Solution: Check container logs for application errors

aws logs tail /ecs/my-app --since 30m

2. Task failed to start

stoppedReason: "Task failed to start"

Solution: Check execution role permissions

# Verify the execution role has ECR pull permissions (usually via the managed policy)
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole

3. CannotPullContainerError

reason: "CannotPullContainerError: Error response from daemon"

Solutions:

  • Check ECR permissions in execution role
  • Verify image exists: aws ecr describe-images --repository-name my-app
  • Check VPC endpoints or NAT gateway for private subnets

4. OutOfMemoryError

reason: "OutOfMemoryError: Container killed due to memory usage"
exitCode: 137

Solution: Increase memory in task definition

memory = 2048  # Increase from current value

5. Exit Code 1 (Application Error)

exitCode: 1

Solution: Check application logs for errors

aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "ERROR"

Task Status: PENDING

Symptom

Tasks stuck in PENDING state, not transitioning to RUNNING.

Diagnostic Steps

def diagnose_pending_tasks(cluster: str, service: str):
    """Check why tasks are stuck in PENDING"""

    # Tasks with desired status RUNNING include those still stuck in PENDING
    pending = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus='RUNNING'
    )

    for task_arn in pending['taskArns']:
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])['tasks'][0]

        if task['lastStatus'] == 'PENDING':
            print(f"Task {task_arn.split('/')[-1]} is PENDING")

            # Check attachments for ENI issues
            for attachment in task.get('attachments', []):
                print(f"  Attachment: {attachment['type']} - {attachment['status']}")
                for detail in attachment.get('details', []):
                    print(f"    {detail['name']}: {detail['value']}")

Common Causes & Solutions

1. No available capacity

Service my-service was unable to place a task because no container instance met all of its requirements

Solutions for Fargate:

  • Check capacity provider limits
  • Verify subnet has available IPs
  • Check if region/AZ has Fargate capacity

2. ENI provisioning issues

Attachment status: PRECREATED

Solutions:

  • Check security group allows required traffic
  • Verify subnet has available IPs
  • Check ENI limits for EC2 instances

3. Image pull taking too long

Container image: pulling

Solutions:

  • Check image size (use smaller base images)
  • Verify network connectivity to ECR
  • Use VPC endpoints for faster pulls

Service Issues

Service Not Starting Tasks

Diagnostic

# Check service events
aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query 'services[0].events[:10]'

Common Events & Solutions

1. "service my-service is unable to place a task"

Check task placement constraints and capacity.

2. "service my-service has reached a steady state"

Service is healthy - tasks are running as expected.

3. "service my-service was unable to place a task because no container instance met all requirements"

This wording refers to the EC2 launch type (container instances): check instance capacity, placement constraints, and host port conflicts. On Fargate, the analogous failure is usually an invalid CPU/memory combination or insufficient Fargate capacity in the AZ; verify the task size is a supported pairing (see the sketch below).
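
A hedged task definition sketch listing the commonly supported Fargate CPU/memory pairings (resource names are illustrative; the larger 8 and 16 vCPU sizes have their own ranges):

# Supported pairings (CPU units -> memory MB):
#   256  -> 512, 1024, 2048
#   512  -> 1024-4096 (1 GB steps)
#   1024 -> 2048-8192 (1 GB steps)
#   2048 -> 4096-16384 (1 GB steps)
#   4096 -> 8192-30720 (1 GB steps)
resource "aws_ecs_task_definition" "app" {
  family                   = "my-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "1024"   # 1 vCPU
  memory                   = "2048"   # valid pairing for 1 vCPU
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  container_definitions = jsonencode([{
    name      = "my-app"
    image     = "my-app:latest"
    essential = true
  }])
}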

Deployment Stuck

Symptom

Deployment never reaches COMPLETED state.

Diagnostic

def check_deployment_status(cluster: str, service: str):
    """Check deployment progress"""

    response = ecs.describe_services(cluster=cluster, services=[service])
    svc = response['services'][0]

    for deployment in svc['deployments']:
        print(f"\nDeployment: {deployment['id']}")
        print(f"  Status: {deployment['status']}")
        print(f"  Rollout State: {deployment['rolloutState']}")
        print(f"  Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")

        if deployment['rolloutState'] == 'IN_PROGRESS':
            reason = deployment.get('rolloutStateReason', '')
            print(f"  Reason: {reason}")

Common Causes

1. Health check failures

rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"

Solutions:

  • Check target group health check settings
  • Increase healthCheckGracePeriodSeconds (see the Terraform sketch below)
  • Verify application responds on health check path
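
If health checks are the cause, two service-level settings usually unblock it: a longer grace period and circuit-breaker rollback. A minimal Terraform sketch (resource names such as aws_ecs_cluster.main, aws_ecs_task_definition.app, and aws_lb_target_group.app are assumptions):

resource "aws_ecs_service" "app" {
  name            = "my-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  # Give slow-starting containers time before ALB health checks count against them
  health_check_grace_period_seconds = 120

  # Roll back automatically instead of leaving the deployment stuck
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  network_configuration {
    subnets         = aws_subnet.private[*].id
    security_groups = [aws_security_group.ecs_tasks.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "my-app"
    container_port   = 8080
  }
}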

2. Insufficient capacity

rolloutStateReason: "Service my-service was unable to place a task"

Solutions:

  • Check subnet IP availability
  • Lower maximumPercent so fewer extra tasks run during the deployment, or lower minimumHealthyPercent so old tasks can stop before new ones start

Networking Issues

Tasks Cannot Connect to Internet

Symptoms

  • Cannot pull images
  • Cannot reach external APIs
  • Timeouts on external calls

Solutions

For private subnets:

# Option 1: NAT Gateway
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Option 2: VPC Endpoints (recommended)
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type = "Interface"
  subnet_ids        = aws_subnet.private[*].id
}
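
Image pulls from ECR over VPC endpoints also need the ecr.dkr interface endpoint and an S3 gateway endpoint (image layers are served from S3); CloudWatch Logs needs its own endpoint as well. A minimal sketch (aws_route_table.private is an assumption, and interface endpoints typically also get a security group):

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}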

Tasks Cannot Connect to Each Other

Symptom

Service-to-service communication fails.

Diagnostic

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids sg-12345 \
  --query 'SecurityGroups[0].IpPermissions'

Solutions

# Allow traffic between ECS tasks
resource "aws_security_group_rule" "ecs_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.ecs_tasks.id
}

Load Balancer Health Checks Failing

Symptom

Target group app-tg: 0 healthy, 3 unhealthy

Diagnostic

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>

Common Causes & Solutions

1. Wrong health check path

health_check {
  path = "/health"  # Must match application endpoint
}
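
A more complete target group sketch with relaxed thresholds for slow-starting apps (values are illustrative; target_type must be "ip" for awsvpc/Fargate tasks):

resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  health_check {
    path                = "/health"
    matcher             = "200-299"   # accept any 2xx
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 5
  }
}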

2. Container not listening on expected port

# Verify inside container
aws ecs execute-command --cluster production --task <task-id> \
  --container my-app --interactive --command "netstat -tlnp"

3. Security group blocking ALB

# Allow ALB to reach ECS tasks
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.alb.id
}

IAM & Permissions Issues

CannotPullContainerError

Symptom

CannotPullContainerError: Error response from daemon: pull access denied

Solution: Task Execution Role

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# For cross-account ECR
resource "aws_iam_role_policy" "cross_account_ecr" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "arn:aws:ecr:*:OTHER_ACCOUNT:repository/*"
    }]
  })
}
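
The IAM policy above is only half of a cross-account pull: the repository in the other account also needs a resource policy that trusts the pulling role. A hedged sketch, applied in the account that owns the repository (the principal ARN is an assumption):

resource "aws_ecr_repository_policy" "allow_pull" {
  repository = aws_ecr_repository.my_app.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "AllowCrossAccountPull"
      Effect = "Allow"
      Principal = {
        AWS = "arn:aws:iam::PULLING_ACCOUNT_ID:role/ecsTaskExecutionRole"
      }
      Action = [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }]
  })
}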

Secrets Access Denied

Symptom

ResourceInitializationError: unable to pull secrets

Solution

resource "aws_iam_role_policy" "secrets_access" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:my-app/*"
      },
      {
        Effect = "Allow"
        Action = ["ssm:GetParameters"]
        Resource = "arn:aws:ssm:*:*:parameter/my-app/*"
      },
      {
        Effect = "Allow"
        Action = ["kms:Decrypt"]
        Resource = aws_kms_key.secrets.arn
      }
    ]
  })
}

Execute Command Not Working

Symptom

SessionManagerPlugin is not found

or

Execute command is disabled

Solutions

1. Enable execute command on service

resource "aws_ecs_service" "app" {
  enable_execute_command = true
}

2. Add SSM permissions to task role

resource "aws_iam_role_policy" "ssm_exec" {
  role = aws_iam_role.ecs_task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}
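
To verify after redeploying (the flag only applies to tasks started after it was enabled, so force a new deployment first):

# Should return true for tasks launched after the change
aws ecs describe-tasks \
  --cluster production \
  --tasks <task-id> \
  --query 'tasks[0].enableExecuteCommand'

# The ExecuteCommandAgent should report RUNNING
aws ecs describe-tasks \
  --cluster production \
  --tasks <task-id> \
  --query 'tasks[0].containers[*].managedAgents'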

Performance Issues

High CPU/Memory Usage

Diagnostic

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):
    """Get CPU and memory metrics"""

    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/ECS',
        MetricName='CPUUtilization',
        Dimensions=[
            {'Name': 'ClusterName', 'Value': cluster},
            {'Name': 'ServiceName', 'Value': service}
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average', 'Maximum']
    )

    for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):
        print(f"{point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")

Solutions

1. Right-size tasks

# Increase resources
cpu    = "1024"  # from 512
memory = "2048"  # from 1024

2. Enable auto-scaling

resource "aws_appautoscaling_policy" "cpu" {
  target_tracking_scaling_policy_configuration {
    target_value = 70.0
  }
}
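
The snippet above is abbreviated; a fuller hedged sketch including the scaling target it attaches to (min/max and cooldowns are illustrative):

resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/production/my-service"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}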

Slow Task Startup

Causes & Solutions

1. Large container image

  • Use smaller base images (alpine, distroless)
  • On EC2, images are cached on the instance; on Fargate, consider Seekable OCI (SOCI) indexes to lazy-load large images

2. Slow application startup

  • Increase startPeriod in the container health check (see the sketch after this list)
  • Optimize application initialization

3. Slow secret/config loading

  • Use VPC endpoints for faster access
  • Cache configuration at startup
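
For cause 2, the container-level health check's startPeriod gives the application time to initialize before failed checks count toward retries. A hedged sketch of the relevant container definition fields (camelCase keys follow the ECS container definition schema; values are illustrative):

container_definitions = jsonencode([{
  name      = "my-app"
  image     = "my-app:latest"
  essential = true
  healthCheck = {
    command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
    interval    = 30
    timeout     = 5
    retries     = 3
    startPeriod = 120   # seconds before failures count toward retries
  }
}])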

Log Analysis

CloudWatch Logs Queries

# Find errors in last hour
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --start-time $(date -d '-1 hour' +%s000) \
  --filter-pattern "ERROR"

# Find OOM kills
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "OutOfMemory"

# Find slow requests
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "[timestamp, level, duration>1000, ...]"

CloudWatch Insights

# Top errors by count
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10

# Average response time
fields @timestamp, responseTime
| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)

Related Skills

  • boto3-ecs: SDK patterns
  • terraform-ecs: Infrastructure as Code
  • ecs-fargate: Fargate specifics
  • ecs-deployment: Deployment strategies

Quick Reference

Symptom             First Check        Common Cause
Task STOPPED        stoppedReason      Container crash, OOM
Task PENDING        Attachments        ENI/network issues
Deployment stuck    Health checks      ALB health check failing
Cannot pull image   Execution role     Missing ECR permissions
Cannot connect      Security groups    Wrong SG rules