eks-observability
EKS Observability
Overview
Complete observability solution for Amazon EKS using AWS-native managed services and open-source tools. This skill implements the three-pillar approach (metrics, logs, traces) with 2025 best practices including ADOT, Amazon Managed Prometheus, Fluent Bit, and OpenTelemetry.
Keywords: EKS monitoring, CloudWatch Container Insights, Prometheus, Grafana, ADOT, Fluent Bit, X-Ray, OpenTelemetry, distributed tracing, log aggregation, metrics collection, observability stack
Status: Production-ready with 2025 best practices
When to Use This Skill
- Setting up monitoring for EKS clusters
- Implementing centralized logging pipelines
- Configuring distributed tracing
- Building production dashboards in Grafana
- Troubleshooting application performance
- Establishing SLOs and error budgets
- Optimizing observability costs
- Migrating from X-Ray SDKs to OpenTelemetry
- Correlating metrics, logs, and traces
- Setting up alerting and on-call runbooks
The Three-Pillar Approach (2025 Recommendation)
1. Metrics
CloudWatch Container Insights + Amazon Managed Prometheus (AMP)
- Dual monitoring provides complete visibility
- CloudWatch for AWS-native integration and quick setup
- Prometheus for advanced queries and community dashboards
- Amazon Managed Grafana for visualization
2. Logs
Fluent Bit → CloudWatch Logs
- Lightweight log forwarder (AWS deprecated FluentD in Feb 2025)
- DaemonSet deployment for automatic collection
- Structured logging with JSON parsing
- Optional aggregation to OpenSearch for analytics
3. Traces
ADOT → AWS X-Ray
- OpenTelemetry standard (X-Ray SDKs entering maintenance mode 2026)
- ADOT Collector converts OTLP to X-Ray format
- Distributed tracing across microservices
- Integration with CloudWatch ServiceLens
Quick Start Workflow
Step 1: Enable CloudWatch Container Insights
Using EKS Add-on (Recommended):
# No custom policy is needed - the AWS-managed CloudWatchAgentServerPolicy is attached below
# Create IRSA for CloudWatch
eksctl create iamserviceaccount \
--name cloudwatch-agent \
--namespace amazon-cloudwatch \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve \
--override-existing-serviceaccounts
# Install Container Insights add-on
aws eks create-addon \
--cluster-name my-cluster \
--addon-name amazon-cloudwatch-observability \
--service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/CloudWatchAgentRole
Verify Installation:
# Check add-on status
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name amazon-cloudwatch-observability
# Verify pods running
kubectl get pods -n amazon-cloudwatch
What You Get:
- Node-level metrics (CPU, memory, disk, network)
- Pod-level metrics (resource usage, restart counts)
- Namespace-level aggregations
- Automatic CloudWatch Logs integration
- Pre-built CloudWatch dashboards
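To confirm that Container Insights data is arriving, you can list the metric names published under the ContainerInsights namespace; a minimal check, assuming the cluster name my-cluster and region us-west-2 used throughout this guide:
# List Container Insights metric names for the cluster
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --region us-west-2 \
  --query 'Metrics[].MetricName' \
  --output text | tr '\t' '\n' | sort -u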
Step 2: Deploy Amazon Managed Prometheus
Create AMP Workspace:
# Create workspace
aws amp create-workspace \
--alias my-cluster-metrics \
--region us-west-2
# Get workspace ID
WORKSPACE_ID=$(aws amp list-workspaces \
--alias my-cluster-metrics \
--query 'workspaces[0].workspaceId' \
--output text)
# Create IRSA for AMP ingestion
eksctl create iamserviceaccount \
--name amp-ingest \
--namespace prometheus \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
--approve
Deploy kube-prometheus-stack:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install with AMP remote write
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write \
--set prometheus.prometheusSpec.remoteWrite[0].sigv4.region=us-west-2 \
--set prometheus.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::ACCOUNT_ID:role/AMPIngestRole"
What You Get:
- Prometheus Operator for CRD-based monitoring
- Node Exporter for hardware metrics
- kube-state-metrics for cluster state
- Alertmanager for alert routing
- 100+ pre-built Grafana dashboards
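To put the CRD-based monitoring to work, applications are scraped via ServiceMonitor resources. A minimal sketch, assuming an application Service labeled app: my-app in the default namespace that exposes a port named http-metrics (both placeholders); the release label must match the Helm release name so the operator picks it up:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: prometheus
  labels:
    release: kube-prometheus-stack   # matches the chart's default serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s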
Step 3: Deploy Fluent Bit for Logging
Create IRSA for Fluent Bit:
eksctl create iamserviceaccount \
--name fluent-bit \
--namespace logging \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve
Deploy Fluent Bit:
helm repo add eks https://aws.github.io/eks-charts
helm repo update
# Reuse the eksctl-created service account (already annotated with the IAM role)
helm install aws-for-fluent-bit eks/aws-for-fluent-bit \
  --namespace logging \
  --create-namespace \
  --set serviceAccount.create=false \
  --set serviceAccount.name=fluent-bit \
  --set cloudWatch.enabled=true \
  --set cloudWatch.region=us-west-2 \
  --set cloudWatch.logGroupName=/aws/eks/my-cluster/logs \
  --set cloudWatch.autoCreateGroup=true
What You Get:
- Automatic log collection from all pods
- Structured JSON log parsing
- CloudWatch Logs integration
- Multi-line log support
- Kubernetes metadata enrichment
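To verify the pipeline end to end, check that the DaemonSet is healthy and that log streams are reaching the log group; a quick sketch, assuming the release name aws-for-fluent-bit (the DaemonSet name follows the release) and the log group from the install command above:
# Check forwarder health
kubectl get pods -n logging
kubectl logs -n logging daemonset/aws-for-fluent-bit --tail=20
# Confirm log streams are being written
aws logs describe-log-streams \
  --log-group-name /aws/eks/my-cluster/logs \
  --order-by LastEventTime \
  --descending \
  --max-items 5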
Step 4: Deploy ADOT for Distributed Tracing
Install ADOT Operator:
# Create IRSA for ADOT
eksctl create iamserviceaccount \
--name adot-collector \
--namespace adot \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve
# Install ADOT add-on
aws eks create-addon \
--cluster-name my-cluster \
--addon-name adot \
--service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/ADOTCollectorRole
Deploy ADOT Collector:
# adot-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: adot
spec:
  mode: deployment
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 30s
        send_batch_size: 50
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      awsxray:
        region: us-west-2
      awsemf:
        region: us-west-2
        namespace: EKS/Observability
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsemf]
kubectl apply -f adot-collector.yaml
What You Get:
- OTLP receiver for OpenTelemetry traces
- Automatic X-Ray integration
- Service map visualization
- Trace sampling and filtering
- CloudWatch ServiceLens integration
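Applications reach the collector through its OTLP service. A sketch of the Deployment environment variables an OpenTelemetry SDK typically reads; the service name adot-collector-collector assumes the operator's <name>-collector naming convention (verify with kubectl get svc -n adot), and the xray propagator requires the SDK's AWS X-Ray propagator package:
# Excerpt from an application Deployment spec (illustrative values)
env:
  - name: OTEL_SERVICE_NAME
    value: "my-app"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://adot-collector-collector.adot.svc.cluster.local:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_PROPAGATORS
    value: "tracecontext,baggage,xray"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production"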
Step 5: Setup Amazon Managed Grafana
Create AMG Workspace:
# Create workspace (via AWS Console recommended)
# Or use AWS CLI:
aws grafana create-workspace \
--workspace-name my-cluster-grafana \
--account-access-type CURRENT_ACCOUNT \
--authentication-providers AWS_SSO \
--permission-type SERVICE_MANAGED
Add Data Sources:
- Navigate to AMG workspace URL
- Configuration → Data Sources → Add data source
- Add Amazon Managed Service for Prometheus
  - Region: us-west-2
  - Workspace: Select your AMP workspace
- Add CloudWatch
  - Default region: us-west-2
  - Namespaces: ContainerInsights, EKS/Observability
- Add AWS X-Ray
  - Default region: us-west-2
Import Dashboards:
# EKS Container Insights Dashboard
Dashboard ID: 16028
# Node Exporter Full Dashboard
Dashboard ID: 1860
# Kubernetes Cluster Monitoring
Dashboard ID: 15760
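Once the workspace is ACTIVE, its endpoint can be retrieved from the CLI rather than the console; a minimal lookup, assuming the workspace name created above:
# Get the workspace ID, URL, and status
aws grafana list-workspaces \
  --query "workspaces[?name=='my-cluster-grafana'].[id,endpoint,status]" \
  --output table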
Production Deployment Checklist
Infrastructure
- CloudWatch Container Insights enabled (EKS add-on)
- Amazon Managed Prometheus workspace created
- kube-prometheus-stack deployed with remote write
- Fluent Bit DaemonSet running on all nodes
- ADOT Collector deployed (deployment or daemonset)
- Amazon Managed Grafana workspace created
- All IRSA roles configured with least-privilege policies
Configuration
- Prometheus scrape configs include all targets
- Fluent Bit log groups created and structured
- ADOT sampling configured (5-10% for high traffic)
- Grafana data sources connected (AMP, CloudWatch, X-Ray)
- Log retention policies set (7-90 days typical)
- Metric retention configured (AMP default 150 days)
Dashboards
- Cluster overview dashboard (nodes, pods, namespaces)
- Application performance dashboard (latency, errors, throughput)
- Resource utilization dashboard (CPU, memory, disk)
- Cost monitoring dashboard (resource waste, right-sizing)
- Network performance dashboard (VPC CNI metrics)
Alerting
- Critical alerts: Pod crash loops, node not ready
- Performance alerts: High latency, error rate spikes
- Resource alerts: CPU/memory pressure, disk full
- Cost alerts: Budget thresholds, waste detection
- SNS topics configured for notifications
- PagerDuty/Opsgenie integration (optional)
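For the SNS notification path, the Alertmanager bundled with kube-prometheus-stack can publish directly to a topic via sns_configs. A minimal sketch, assuming a pre-created topic (the ARN below is a placeholder) and an IRSA role on the Alertmanager pod that allows sns:Publish:
# alertmanager.yaml excerpt (illustrative)
route:
  receiver: sns-critical
  group_by: ['alertname', 'namespace']
receivers:
  - name: sns-critical
    sns_configs:
      - topic_arn: arn:aws:sns:us-west-2:ACCOUNT_ID:eks-critical-alerts
        sigv4:
          region: us-west-2
        subject: '{{ .CommonLabels.alertname }}'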
Application Instrumentation
- OpenTelemetry SDK integrated in applications
- Trace context propagation configured
- Custom metrics exported via OTLP
- Structured logging with JSON format
- Log correlation with trace IDs
Modern Observability Stack (2025)
┌─────────────────────────────────────────────────────────────┐
│ EKS Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │ │ Application │ │ Application │ │
│ │ + OTel SDK │ │ + OTel SDK │ │ + OTel SDK │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ ADOT Collector │ │
│ │ (OTel) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐ │
│ │Prometheus│ │Fluent Bit│ │Container │ │
│ │ (local) │ │DaemonSet │ │ Insights │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└─────────┼──────────────────┼──────────────────┼────────────┘
│ │ │
│ │ │
┌─────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
│ AMP │ │CloudWatch│ │ X-Ray │
│(Managed │ │ Logs │ │ │
│Prometheus)│ └────┬─────┘ └────┬─────┘
└─────┬─────┘ │ │
│ │ │
└─────────────────┴──────────────────┘
│
┌────────▼────────┐
│Amazon Managed │
│ Grafana │
└─────────────────┘
Detailed Documentation
For comprehensive guides on each observability component:
- Metrics Collection: references/metrics.md
- CloudWatch Container Insights setup
- Amazon Managed Prometheus configuration
- kube-prometheus-stack deployment
- Custom metrics and ServiceMonitors
- Cost optimization strategies
- Centralized Logging: references/logging.md
- Fluent Bit configuration and parsers
- CloudWatch Logs integration
- OpenSearch aggregation (optional)
- Log retention and lifecycle policies
- Troubleshooting log collection
- Distributed Tracing: references/tracing.md
- ADOT Collector deployment patterns
- OpenTelemetry SDK instrumentation
- X-Ray integration and migration
- Trace sampling strategies
- ServiceLens and trace analysis
Cost Optimization
Metrics
- Reduce high-cardinality metrics (unbounded labels such as pod UIDs or request IDs)
- Use metric relabeling to drop unnecessary labels
- Aggregate metrics before remote write to AMP
- Set appropriate retention periods (30-90 days typical)
Logs
- Implement log sampling for verbose applications
- Use CloudWatch Logs Insights instead of exporting to S3
- Set aggressive retention for debug logs (7 days)
- Keep audit logs longer (90+ days)
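Retention is set per log group; a one-line sketch applying a 7-day policy to the application log group created in Step 3:
aws logs put-retention-policy \
  --log-group-name /aws/eks/my-cluster/logs \
  --retention-in-days 7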
Traces
- Sample traces based on traffic (5-10% default)
- Increase sampling for errors (100%)
- Use tail-based sampling for important transactions
- Clean up old X-Ray traces (default 30 days)
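Head-based sampling can be applied centrally in the ADOT Collector with the probabilistic_sampler processor rather than in each SDK; a sketch of the fragment to merge into the collector config from Step 4 (10% is an assumed starting point, not a recommendation for every workload):
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [awsxray]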
Typical Monthly Costs:
- Small cluster (10 nodes): $50-150/month
- Medium cluster (50 nodes): $200-500/month
- Large cluster (200+ nodes): $1000-2000/month
Integration Patterns
Correlation Between Pillars
Metrics → Logs:
# Find pods with high error rates
rate(http_requests_total{status=~"5.."}[5m]) > 0.1
# Then search CloudWatch Logs for those pod names
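The follow-up log search can be scripted with CloudWatch Logs Insights; a sketch assuming the Fluent Bit log group from Step 3, the kubernetes.pod_name field added by the Kubernetes metadata filter, and GNU date for the epoch timestamps:
aws logs start-query \
  --log-group-name /aws/eks/my-cluster/logs \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, log, kubernetes.pod_name | filter kubernetes.pod_name like /my-app/ | sort @timestamp desc | limit 50'
# Retrieve results with: aws logs get-query-results --query-id <id>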
Logs → Traces:
// Include trace_id in structured logs
{
  "timestamp": "2025-01-27T10:30:00Z",
  "level": "error",
  "message": "Database connection failed",
  "trace_id": "1-67a2f3b1-123456789abcdef012345678",
  "span_id": "abcdef0123456789"
}
Traces → Metrics:
- Use trace data to identify slow endpoints
- Create SLIs from trace latency percentiles
- Alert on trace error rates
CloudWatch ServiceLens
Unified view combining:
- X-Ray traces (request flow)
- CloudWatch metrics (performance)
- CloudWatch Logs (detailed context)
ServiceLens is enabled automatically once Container Insights metrics and X-Ray traces are flowing; there is no separate API to call. Open the service map under X-Ray traces in the CloudWatch console to pivot from a service to its traces, metrics, and logs.
Troubleshooting Quick Reference
| Issue | Cause | Fix |
|---|---|---|
| No metrics in AMP | Missing IRSA or remote write config | Check Prometheus pod logs, verify IAM role |
| Logs not appearing | Fluent Bit not running or wrong IAM | Check Fluent Bit pod logs in the logging namespace, verify IRSA role |
| Traces not in X-Ray | ADOT not deployed or app not instrumented | Verify ADOT pods, check OTel SDK setup |
| High costs | Too much data ingestion | Enable sampling, reduce log verbosity |
| Missing pod metrics | kube-state-metrics not running | Check kube-prometheus-stack installation |
| Grafana can't connect | Data source IAM permissions | Add CloudWatch/AMP read policies to AMG role |
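A quick triage sequence covering the most common failures above, assuming the namespaces and cluster name used in this guide:
# One-shot health check of the observability components
for ns in amazon-cloudwatch prometheus logging adot; do
  echo "== $ns =="
  kubectl get pods -n "$ns" -o wide
done
# Confirm the EKS add-ons are ACTIVE
aws eks describe-addon --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability --query 'addon.status'
aws eks describe-addon --cluster-name my-cluster \
  --addon-name adot --query 'addon.status'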
Production Runbooks
Incident Response
- Check Grafana overview dashboard - Identify affected services
- Review X-Ray service map - Find bottleneck in request flow
- Query CloudWatch Logs Insights - Get detailed error messages
- Correlate with metrics spike - Understand timeline and scope
- Execute remediation - Scale, restart, or rollback
Performance Investigation
- Start with RED metrics (Rate, Errors, Duration)
- Check USE metrics (Utilization, Saturation, Errors) for infrastructure
- Analyze trace percentiles (p50, p95, p99)
- Review log patterns during slow periods
- Identify optimization opportunities
SLO Implementation
Define SLIs (Service Level Indicators):
# Availability SLI
- metric: probe_success
  target: 99.9%
  window: 30d
# Latency SLI
- metric: http_request_duration_seconds
  percentile: p99
  target: < 500ms
  window: 30d
# Error Rate SLI
- metric: http_requests_total{status=~"5.."}
  target: < 0.1%
  window: 30d
Calculate Error Budget:
Error Budget = 100% - SLO Target
Example: 99.9% SLO = 0.1% error budget
= 43.2 minutes downtime/month
Burn Rate Alerts:
# Fast burn: 14.4x burn rate over 1h (consumes ~2% of a 30-day budget per hour)
(1 - slo:availability:ratio_rate_1h) > (14.4 * 0.001)
# Slow burn: 6x burn rate over 6h (consumes ~5% of the budget in 6 hours)
(1 - slo:availability:ratio_rate_6h) > (6 * 0.001)
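The slo:availability:* series used above are recording rules, not built-in metrics. A minimal PrometheusRule sketch that derives them from http_requests_total (assumed here to be the SLI source metric); the release label must match the kube-prometheus-stack release so the operator loads it:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-availability
  namespace: prometheus
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo.rules
      rules:
        - record: slo:availability:ratio_rate_1h
          expr: |
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
        - record: slo:availability:ratio_rate_6h
          expr: |
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))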
Best Practices Summary
- Use Dual Monitoring: CloudWatch Container Insights + Prometheus
- Standardize on OpenTelemetry: Future-proof instrumentation
- Enable IRSA for Everything: No node IAM roles
- Deploy ADOT Collector: Vendor-neutral observability
- Sample Intelligently: 5-10% traces, 100% errors
- Structure Your Logs: JSON format with trace correlation
- Set Retention Policies: Balance cost and compliance
- Build Actionable Dashboards: Focus on SLIs and anomalies
- Implement Progressive Alerting: Warn before critical
- Regularly Review Costs: Optimize based on actual usage
Stack: CloudWatch Container Insights, AMP, Fluent Bit, ADOT, AMG, X-Ray
Standards: OpenTelemetry, IRSA, EKS Add-ons
Last Updated: January 2025 (2025 best practices)