debug-buttercup
Debug Buttercup
When to Use
- Pods in the crs namespace are in CrashLoopBackOff, OOMKilled, or restarting
- Multiple services restart simultaneously (cascade failure)
- Redis is unresponsive or showing AOF warnings
- Queues are growing but tasks are not progressing
- Nodes show DiskPressure, MemoryPressure, or PID pressure
- Build-bot cannot reach the Docker daemon (DinD failures)
- Scheduler is stuck and not advancing task state
- Health check probes are failing unexpectedly
- Deployed Helm values don't match actual pod configuration
When NOT to Use
- Deploying or upgrading Buttercup (use Helm and deployment guides)
- Debugging issues outside the crs Kubernetes namespace
- Performance tuning that doesn't involve a failure symptom
Namespace and Services
All pods run in namespace crs. Key services:
| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |
Triage Workflow
Always start with triage. Run these three commands first:
# 1. Pod status - look for restarts, CrashLoopBackOff, OOMKilled
kubectl get pods -n crs -o wide
# 2. Events - the timeline of what went wrong
kubectl get events -n crs --sort-by='.lastTimestamp'
# 3. Warnings only - filter the noise
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'
Then narrow down:
# Why did a specific pod restart? Check Last State Reason (OOMKilled, Error, Completed)
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'
# Check actual resource limits vs intended
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
# Crashed container's logs (--previous = the container that died)
kubectl logs -n crs <pod-name> --previous --tail=200
# Current logs
kubectl logs -n crs <pod-name> --tail=200
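A quick way to rank pods by restart count before digging into any single one (a small sketch that relies on RESTARTS being the fourth column of the default kubectl get pods output):
# Sort pods by restart count to spot the worst offenders first
kubectl get pods -n crs --no-headers | awk '{print $4, $1}' | sort -rn | head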
Historical vs Ongoing Issues
High restart counts don't necessarily mean an issue is ongoing -- restarts accumulate over a pod's lifetime. Always distinguish:
- --tail shows the end of the log buffer, which may contain old messages. Use --since=300s to confirm issues are actively happening now.
- --timestamps on log output helps correlate events across services.
- Check Last State timestamps in describe pod to see when the most recent crash actually occurred.
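For example, to confirm a problem is happening right now rather than left over from an old crash (the 300-second window is an illustrative choice, not a fixed rule):
# Only messages from the last 5 minutes, timestamped -- silence here suggests the issue is historical
kubectl logs -n crs <pod-name> --since=300s --timestamps --tail=100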
Cascade Detection
When many pods restart around the same time, check for a shared-dependency failure before investigating individual pods. The most common cascade: Redis goes down -> every service gets ConnectionError/ConnectionRefusedError -> mass restarts. Look for the same error across multiple --previous logs -- if they all say redis.exceptions.ConnectionError, debug Redis, not the individual services.
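A minimal sketch for spotting that shared signature, assuming the Redis connection error is the suspect (the restart-count filter and the error string are illustrative and depend on default kubectl column layout):
# Count Redis connection errors in the previous logs of every pod that has restarted
for pod in $(kubectl get pods -n crs --no-headers | awk '$4 > 0 {print $1}'); do
  echo "=== $pod ==="
  kubectl logs -n crs "$pod" --previous --tail=100 2>/dev/null | grep -c 'redis.exceptions.ConnectionError'
done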
Log Analysis
# All replicas of a service at once
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix
# Stream live
kubectl logs -n crs -l app.kubernetes.io/name=redis -f
# Collect all logs to disk (existing script)
bash deployment/collect-logs.sh
Resource Pressure
# Per-pod CPU/memory
kubectl top pods -n crs
# Node-level
kubectl top nodes
# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node> | grep -A5 Conditions
# Disk usage inside a pod
kubectl exec -n crs <pod> -- df -h
# What's eating disk
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'
Redis Debugging
Redis is the backbone. When it goes down, everything cascades.
# Redis pod status
kubectl get pods -n crs -l app.kubernetes.io/name=redis
# Redis logs (AOF warnings, OOM, connection issues)
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200
# Connect to Redis CLI
kubectl exec -n crs <redis-pod> -- redis-cli
# Inside redis-cli: key diagnostics
INFO memory # used_memory_human, maxmemory
INFO persistence # aof_enabled, aof_last_bgrewrite_status, aof_delayed_fsync
INFO clients # connected_clients, blocked_clients
INFO stats # total_connections_received, rejected_connections
CLIENT LIST # see who's connected
DBSIZE # total keys
# AOF configuration
CONFIG GET appendonly # is AOF enabled?
CONFIG GET appendfsync # fsync policy: everysec, always, or no
# What is /data mounted on? (disk vs tmpfs matters for AOF performance)
kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/
Queue Inspection
Buttercup uses Redis streams with consumer groups. Queue names:
| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |
# Check stream length (pending messages)
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue
# Check consumer group lag
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue
# Check pending messages per consumer
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10
# Task registry size
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry
# Task state counts
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasks
Consumer groups: build_bot_consumers, orchestrator_group, patcher_group, index_group, tracer_bot_group.
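For a one-shot view of every queue's depth, a sketch that loops over the stream keys from the table above (the Redis pod lookup reuses the app.kubernetes.io/name=redis label from earlier in this guide):
# Dump XLEN for every known stream in one pass
REDIS_POD=$(kubectl get pods -n crs -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}')
for q in fuzzer_build_queue fuzzer_build_output_queue fuzzer_crash_queue confirmed_vulnerabilities_queue \
         orchestrator_download_tasks_queue tasks_ready_queue patches_queue index_queue index_output_queue \
         traced_vulnerabilities_queue pov_reproducer_requests_queue pov_reproducer_responses_queue \
         orchestrator_delete_task_queue; do
  printf '%-40s %s\n' "$q" "$(kubectl exec -n crs "$REDIS_POD" -- redis-cli XLEN "$q")"
done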
Health Checks
Pods write timestamps to /tmp/health_check_alive. The liveness probe checks file freshness.
# Check health file freshness
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_alive
If a pod is restart-looping, the health check file is likely going stale because the main process is blocked (e.g. waiting on Redis, stuck on I/O).
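To measure exactly how stale the file is, a sketch (the actual freshness threshold the probe enforces comes from the chart, not from this command; stat -c %Y assumes GNU or busybox stat in the container image):
# Print seconds since the health file was last touched
kubectl exec -n crs <pod> -- sh -c 'echo "$(( $(date +%s) - $(stat -c %Y /tmp/health_check_alive) ))s since last heartbeat"'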
Telemetry (OpenTelemetry / Signoz)
All services export traces and metrics via OpenTelemetry. If Signoz is deployed (global.signoz.deployed: true), use its UI for distributed tracing across services.
# Check if OTEL is configured
kubectl exec -n crs <pod> -- env | grep OTEL
# Verify Signoz pods are running (if deployed)
kubectl get pods -n platform -l app.kubernetes.io/name=signoz
Traces are especially useful for diagnosing slow task processing, identifying which service in a pipeline is the bottleneck, and correlating events across the scheduler -> build-bot -> fuzzer-bot chain.
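To reach the Signoz UI from a workstation, a hedged sketch (the frontend service name and port depend on the Signoz chart version; take both from the first command's output):
# Find the Signoz UI service, then forward it locally
kubectl get svc -n platform | grep -i signoz
kubectl port-forward -n platform svc/<signoz-frontend-service> <local-port>:<ui-port>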
Volume and Storage
# PVC status
kubectl get pvc -n crs
# Check if corpus tmpfs is mounted, its size, and backing type
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null
# Check if CORPUS_TMPFS_PATH is set
kubectl exec -n crs <pod> -- env | grep CORPUS
# Full disk layout - what's on real disk vs tmpfs
kubectl exec -n crs <pod> -- df -h
CORPUS_TMPFS_PATH is set when global.volumes.corpusTmpfs.enabled: true. This affects fuzzer-bot, coverage-bot, seed-gen, and merger-bot.
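A sketch that checks all four affected services in one pass (the app=<service> label selector follows the convention used elsewhere in this guide):
# Verify CORPUS_TMPFS_PATH and the tmpfs mount on each service that uses it
for svc in fuzzer-bot coverage-bot seed-gen merger-bot; do
  pod=$(kubectl get pods -n crs -l app=$svc -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  [ -n "$pod" ] || continue
  echo "=== $svc ($pod) ==="
  kubectl exec -n crs "$pod" -- sh -c 'echo "CORPUS_TMPFS_PATH=$CORPUS_TMPFS_PATH"; mount | grep corpus_tmpfs'
done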
Deployment Config Verification
When behavior doesn't match expectations, verify Helm values actually took effect:
# Check a pod's actual resource limits
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
# Check a pod's actual volume definitions
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'
A typo in a Helm values key (e.g. a wrong key name) makes Helm silently fall back to the chart default. If deployed resources don't match the values template, check for key-name mismatches.
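To compare intent against reality end to end, a sketch that assumes the release is named buttercup and installed in crs (confirm with helm list -n crs first):
# Values Helm recorded for the release, the rendered manifest, and the live pod spec
helm get values buttercup -n crs
helm get manifest buttercup -n crs | grep -B2 -A6 'resources:'
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'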
Service-Specific Debugging
For detailed per-service symptoms, root causes, and fixes, see references/failure-patterns.md.
Quick reference:
- DinD: kubectl logs -n crs -l app=dind --tail=100 -- look for docker daemon crashes, storage driver errors
- Build-bot: check build queue depth, DinD connectivity, OOM during compilation
- Fuzzer-bot: corpus disk usage, CPU throttling, crash queue backlog
- Patcher: LiteLLM connectivity, LLM timeout, patch queue depth
- Scheduler: the central brain -- kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"
Diagnostic Script
Run the automated triage snapshot:
bash {baseDir}/scripts/diagnose.sh
Pass --full to also dump recent logs from all pods:
bash {baseDir}/scripts/diagnose.sh --full
This collects pod status, events, resource usage, Redis health, and queue depths in one pass.