eks-app-log-analysis
EKS App Log Analysis
Analyze EKS application logs during FIS fault injection experiments to understand how applications respond to infrastructure failures. Supports real-time monitoring and post-hoc analysis modes.
Output Language Rule
Detect the language of the user's conversation and use the same language for all output.
- Chinese input -> Chinese output
- English input -> English output
Prerequisites
Required tools:
- kubectl — configured with access to target EKS cluster
- AWS CLI — for querying FIS experiment status
- A prepared/executed FIS experiment directory (from aws-fis-experiment-prepare or aws-fis-experiment-execute)
Workflow
digraph log_analysis_flow {
"Receive input path" [shape=box];
"Detect mode" [shape=diamond];
"Real-time mode" [shape=box];
"Post-hoc mode" [shape=box];
"Read service list" [shape=box];
"Auto-discover + confirm app dependencies" [shape=box];
"Start background log collection" [shape=box];
"Batch fetch historical logs" [shape=box];
"Frontend polling + insight display" [shape=box];
"Experiment complete?" [shape=diamond];
"Generate analysis report" [shape=box];
"Receive input path" -> "Detect mode";
"Detect mode" -> "Real-time mode" [label="directory with README"];
"Detect mode" -> "Post-hoc mode" [label="*-experiment-results.md"];
"Real-time mode" -> "Read service list";
"Post-hoc mode" -> "Read service list";
"Read service list" -> "Auto-discover + confirm app dependencies";
"Auto-discover + confirm app dependencies" -> "Start background log collection" [label="real-time"];
"Auto-discover + confirm app dependencies" -> "Batch fetch historical logs" [label="post-hoc"];
"Start background log collection" -> "Frontend polling + insight display";
"Frontend polling + insight display" -> "Experiment complete?";
"Experiment complete?" -> "Frontend polling + insight display" [label="No, continue"];
"Experiment complete?" -> "Generate analysis report" [label="Yes"];
"Batch fetch historical logs" -> "Generate analysis report";
}
Step 1: Detect Mode and Load Context
The user provides either:
- Directory path (e.g., `./2026-03-31-14-30-22-az-power-interruption-my-cluster/`) → Real-time mode
- Report file path (e.g., `./2026-03-31-...-experiment-results.md`) → Post-hoc mode
Real-time mode: The directory contains a README.md from the prepare skill.
Extract the experiment template ID and region from it.
Post-hoc mode: The file is an experiment results report (contains "FIS Experiment Results"). Extract experiment ID, start time, end time, and region from it.
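The mode detection above can be sketched as a small shell function (the path names in the test are hypothetical; the suffix and README checks follow the rules stated here):

```shell
# Classify the user-supplied path:
#   *-experiment-results.md file          -> post-hoc mode
#   directory containing a README.md      -> real-time mode
detect_mode() {
  local input="$1"
  case "$input" in
    *-experiment-results.md) echo "post-hoc"; return 0 ;;
  esac
  if [ -d "$input" ] && [ -f "$input/README.md" ]; then
    echo "real-time"
  else
    echo "unknown"; return 1
  fi
}
```

For example, `detect_mode ./run-experiment-results.md` prints `post-hoc`, while a prepared experiment directory with a `README.md` inside prints `real-time`.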
Step 2: Read Service List
Extract affected AWS services from:
- `expected-behavior.md` in the experiment directory (real-time mode), or
- the experiment results report (post-hoc mode)
Look for service name headings (e.g., "### RDS (cluster-xxx)") to build the list. Present the detected service list to the user.
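As a sketch, assuming the `### Service (resource-id)` heading convention shown above, the service list can be pulled out with `sed`:

```shell
# Extract "### <Service> (<resource-id>)" headings from expected-behavior.md
# (or the results report) on stdin, printing one service entry per line.
extract_services() {
  sed -n 's/^### \(.*\)$/\1/p'
}
```

Usage: `extract_services < expected-behavior.md` yields lines like `RDS (cluster-xxx)` to present to the user.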
Step 3: Collect Application Dependencies
3a. Auto-Discover Potential Dependencies
For each affected AWS service, automatically discover EKS applications that may depend on it:
- Get the service's endpoint (e.g., RDS cluster endpoint, ElastiCache primary endpoint, EC2 private IP/DNS) via AWS CLI
- Search all pod environment variables across namespaces for references to that endpoint
- Search ConfigMaps across namespaces for references to that endpoint
- Present discovered `namespace/deployment` candidates to the user, noting where the match was found (env var name, ConfigMap name)
3b. User Confirmation and Manual Supplement
Ask the user to confirm the auto-discovered dependencies and add any that were
missed. Store the final mapping as SERVICE_APP_MAP (service → list of
namespace/deployment pairs).
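A minimal sketch of the endpoint search, with the matching factored out so it works on any dump of `namespace/deployment<TAB>value` lines (in practice the dump would be produced from pod env vars or ConfigMap data via `kubectl get ... -o jsonpath`; the dump format here is an illustrative assumption):

```shell
# Given "namespace/deployment<TAB>value" lines on stdin, print the
# workloads whose value references the given service endpoint.
find_dependents() {
  local endpoint="$1"
  grep -F "$endpoint" | cut -f1 | sort -u
}
```

The fixed-string match (`grep -F`) avoids regex surprises from dots in DNS endpoints.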
Step 4: Log Collection
Shell scripting rule: Use multi-line scripts. Do NOT chain commands with
`&&` on a single line — variables get lost after background `&` processes.
All logs should be saved to a temp directory, `/tmp/{timestamp}-fis-app-logs/`,
organized into service-name subdirectories.
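A minimal sketch of the rule in practice, using `sleep` as a stand-in for a real `kubectl logs -f` stream (the `rds-cluster-xxx` subdirectory name is illustrative):

```shell
# Multi-line script: export LOG_DIR up front so it is still set for every
# later command, including ones that run after background jobs are launched.
export LOG_DIR="/tmp/$(date +%Y-%m-%d-%H-%M-%S)-fis-app-logs"
mkdir -p "$LOG_DIR/rds-cluster-xxx"

# Stand-in for: kubectl logs -f --selector=... >> "$LOG_DIR/.../app.log" &
sleep 30 >> "$LOG_DIR/rds-cluster-xxx/app-backend.log" &

# Record the background PID so Step 8 can clean it up.
echo $! >> "$LOG_DIR/.pids"
```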
Real-time Mode: Background Collection
For each application in SERVICE_APP_MAP, start background kubectl logs -f processes
for regular containers only (excluding FIS-injected ephemeral containers):
- Resolve the deployment's pod label selector from `.spec.selector.matchLabels`
- Get the list of regular container names from the deployment spec:
  `kubectl get deployment {DEPLOYMENT} -n {NAMESPACE} -o jsonpath='{.spec.template.spec.containers[*].name}'`
  Do NOT use `--all-containers=true` — FIS pod-level fault injection (e.g., `pod-network-latency`, `pod-cpu-stress`) injects ephemeral containers into target pods. Using `--all-containers` would pull in FIS agent logs (noise) alongside application logs. Always use `--container={name}` to collect only regular containers.
- For each regular container, start a background log stream:
  `kubectl logs -f --selector={labels} -n {NAMESPACE} --container={CONTAINER_NAME} --timestamps --prefix=true --max-log-requests=20 >> {LOG_DIR}/{service-name}/{deployment}.log &`
  Use `--selector={labels}` (NOT `deployment/xxx`) — this captures logs from all matching pods, including those recreated during the experiment.
- Record each background PID to `{LOG_DIR}/.pids` for cleanup
Post-hoc Mode: Batch Fetch
In post-hoc mode, pods may have been terminated during the experiment. First detect whether Container Insights is available, then choose the log source accordingly.
Step 4a: Detect Container Insights
Check whether the EKS cluster has Container Insights enabled:
- Look for the `amazon-cloudwatch-observability` EKS addon (via `aws eks describe-addon`)
- Or check for a CloudWatch agent / Fluent Bit daemonset in the `amazon-cloudwatch` namespace
Step 4b: CloudWatch Logs (preferred, if Container Insights is enabled)
Query CloudWatch Logs Insights against the log group
/aws/containerinsights/{CLUSTER_NAME}/application for the experiment time window
(START_TIME to END_TIME). Filter by kubernetes.namespace_name and
kubernetes.labels.app (or pod name pattern) for each deployment. This captures
complete logs including from pods that no longer exist.
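As a sketch, a Logs Insights query for one deployment could look like the following (the `kubernetes.labels.app` field path is a common Fluent Bit convention; verify it against the actual records in your log group):

```
fields @timestamp, kubernetes.namespace_name, kubernetes.pod_name, log
| filter kubernetes.namespace_name = "my-namespace"
| filter kubernetes.labels.app = "my-app"
| sort @timestamp asc
| limit 10000
```

Run it with `aws logs start-query` against the log group above (passing the experiment window as `--start-time`/`--end-time` in epoch seconds), then poll `aws logs get-query-results` until the status is `Complete`.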
Step 4c: kubectl logs (fallback, no Container Insights)
Use `kubectl logs --selector={labels} --since-time={START_TIME}` with
`--container={CONTAINER_NAME} --timestamps --prefix=true` for each regular container
(same container discovery as in the real-time mode above). Do NOT use `--all-containers`.
Note: this only retrieves logs from currently running pods — logs from pods terminated
during the experiment are lost.
Step 5: Real-time Monitoring Display
Poll every 30 seconds while the experiment is running. For each service group and each application:
- Read the last 30 seconds of collected logs from the log file
- Count error-level entries (match: `error`, `exception`, `fail`, `refused`, `timeout`) and warning-level entries (match: `warn`, `retry`)
- Display a per-app summary: error count, warning count, last 5 error lines
- Detect recovery signals (`connected`, `restored`, `success`, `recovered`) in recent lines and report if found
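The per-app counting can be sketched as follows (the keyword lists come straight from the bullets above):

```shell
# Count error- and warning-level lines in a slice of collected logs.
# grep -c exits non-zero when there are no matches but still prints 0,
# so "|| true" keeps the function safe under "set -e".
count_levels() {
  local file="$1"
  local errors warnings
  errors=$(grep -c -i -E 'error|exception|fail|refused|timeout' "$file" || true)
  warnings=$(grep -c -i -E 'warn|retry' "$file" || true)
  echo "errors=$errors warnings=$warnings"
}
```

In the polling loop this would run against the last-30-second slice of each `{LOG_DIR}/{service}/{deployment}.log` file.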
Step 6: Check Experiment Status (Real-time Mode)
Use aws fis list-experiments to check if the experiment with the matching
template ID is still in running state. When the experiment completes (or is not
found), proceed to report generation.
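A sketch of the check, with the JSON parsing factored out so it can be exercised without an AWS call (the `state.status` field path matches the FIS experiment response shape; verify it against your CLI version):

```shell
# In practice the JSON comes from:
#   aws fis list-experiments --region "$REGION"
# filtered to the experiment whose experimentTemplateId matches.
experiment_running() {
  # Reads experiment JSON on stdin; prints "yes" if any experiment
  # reports a state status of "running", otherwise "no".
  if grep -q '"status": *"running"'; then echo yes; else echo no; fi
}
```

In real use the loop would be `aws fis list-experiments | experiment_running` every poll interval, stopping when it prints `no`.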
Step 7: Generate Analysis Report
After the experiment completes (or immediately in post-hoc mode), generate the report:
TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
# Save the report in the experiment directory (EXPERIMENT_DIR)
REPORT_FILE="${EXPERIMENT_DIR}/${TIMESTAMP}-app-log-analysis.md"
Report structure:
# Application Log Analysis Report
**Experiment ID:** {EXPERIMENT_ID}
**Analysis Time:** {TIMESTAMP}
**Time Range:** {START_TIME} - {END_TIME}
**Duration:** {DURATION}
## Summary
| Service | Application | Total Errors | Peak Error Rate | Recovery Time |
|---------|-------------|--------------|-----------------|---------------|
| {service} | {app} | {count} | {rate}/min | {time} |
## Per-Service Application Analysis
### {Service Name} ({resource_id})
#### {Application Name} ({namespace}/{deployment})
**Error Timeline:**
| Time (UTC) | Level | Message |
|------------|-------|---------|
| {HH:MM:SS} | ERROR | {truncated message} |
| ... | ... | ... |
**Key Error Patterns:**
| Pattern | Count | First Occurrence | Last Occurrence |
|---------|-------|------------------|-----------------|
| Connection refused | {n} | {time} | {time} |
| Timeout | {n} | {time} | {time} |
**Log Sample (Critical Errors):**
{5-10 lines of actual error logs}
**Insights:**
- {insight_1}: Error spike at {time}, correlates with {service} failover
- {insight_2}: Recovery detected at {time}, {duration} after fault injection ended
- {insight_3}: Application retry mechanism worked/failed because...
(Repeat for each application)
## Cross-Service Correlation
| Time | Event | RDS Impact | ElastiCache Impact | Application Response |
|------|-------|------------|--------------------|--------------------|
| {time} | Fault injection start | - | - | First errors appear |
| {time} | {service} failover | Connection errors | - | Retrying... |
| {time} | Recovery | Connections restored | - | Normal operation |
## Recommendations
1. **{Issue}:** {description}
- **Impact:** {what happened}
- **Recommendation:** {what to improve}
## Appendix: Log File Locations
**Raw log directory:** `{LOG_DIR}`
To view raw logs after the analysis, navigate to the temp directory shown above.
These files will persist until the system clears `/tmp`.
| Application | Log File |
|-------------|----------|
| {app} | `{LOG_DIR}/{service}/{app}.log` |
Step 8: Cleanup (Real-time Mode)
Kill all background kubectl logs processes recorded in {LOG_DIR}/.pids.
Remove the PID file after cleanup.
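The cleanup above can be sketched as:

```shell
# Kill every background log-collection process recorded in .pids, then
# remove the PID file. Some PIDs may already have exited; ignore those.
cleanup_log_streams() {
  local log_dir="$1"
  if [ -f "$log_dir/.pids" ]; then
    while read -r pid; do
      kill "$pid" 2>/dev/null || true
    done < "$log_dir/.pids"
    rm -f "$log_dir/.pids"
  fi
}
```

Usage: `cleanup_log_streams "$LOG_DIR"` once the report has been written.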
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| `/.pids: Permission denied` | `LOG_DIR` variable empty due to `&&` chain — path resolves to `/.pids` | Use `export LOG_DIR=...` with a multi-line script, NOT `&&` chains. See Step 4 notes. |
| `kubectl: command not found` | kubectl not installed | Install kubectl and configure kubeconfig |
| `error: You must be logged in` | kubeconfig not configured | Run `aws eks update-kubeconfig --name {cluster}` |
| `No resources found` | Deployment/pod doesn't exist | Verify deployment name and namespace |
| `Unable to retrieve logs` | Pod not running or restarted | Check pod status; may need to fetch from CloudWatch Logs |
| Template ID not found | README format changed | Manually provide template ID |
Output Files
{EXPERIMENT_DIR}/ # Experiment directory
└── {timestamp}-app-log-analysis.md # Analysis report
/tmp/{timestamp}-fis-app-logs/ # Temp directory for raw logs
├── rds-cluster-xxx/
│ ├── app-backend.log
│ └── api-server.log
├── elasticache-redis-xxx/
│ └── cache-layer.log
└── .pids (temporary, cleaned up)
Usage Examples
# Real-time monitoring (during experiment)
"Analyze app logs for ./2026-03-31-14-30-22-az-power-interruption-my-cluster/"
"Monitor application behavior in the experiment directory"
"实时监控应用日志"
# Post-hoc analysis (after experiment)
"Analyze app logs using ./2026-03-31-14-35-00-az-power-interruption-my-cluster-experiment-results.md"
"分析实验报告中的应用表现"
"Check what happened to applications during the experiment"
Integration with Other Skills
- aws-fis-experiment-prepare — Reads `README.md` and `expected-behavior.md` for context
- aws-fis-experiment-execute — Reads `*-experiment-results.md` for time range and service list
- Does NOT modify any files from other skills