# troubleshoot-ssi

Troubleshoot APM SSI on Kubernetes.
## Triggers

Invoke this skill when the user expresses intent to:

- Debug why a pod is not being instrumented
- Investigate why traces are not appearing in Datadog
- Diagnose admission webhook or init container injection failures
- Follow up on failed checks from `verify-ssi`
- Report that a specific service or pod has no traces
Do NOT invoke this skill if:

- SSI has not been enabled yet — run `enable-ssi` first
## Prerequisites

- kubectl configured to target the cluster — `kubectl config current-context`

### pup-cli: check, install, and authenticate

Claude runs:

```sh
pup --version
```

If not found, Claude runs:

```sh
brew tap datadog-labs/pack
brew install pup
```

Check auth:

```sh
pup auth status
```

If not authenticated — **what you need to do in a terminal:**

```sh
pup auth login
```

Confirm with `pup auth status`. If no browser is available: `export DD_APP_KEY=<your-app-key>`.
## Context to resolve before acting

| Variable | How to resolve |
|---|---|
| `AGENT_NAMESPACE` | Namespace where the Datadog Agent is installed |
| `APP_NAMESPACE` | Namespace of the application with missing traces |
| `CLUSTER_NAME` | `kubectl config current-context`, or `spec.global.clusterName` in `datadog-agent.yaml` |
| `SERVICE_NAME` | `tags.datadoghq.com/service` label on the Deployment, or ask the user |
| `ENV` | `tags.datadoghq.com/env` label on the Deployment, or ask the user |
| `POD_NAME` | `kubectl get pods -n <APP_NAMESPACE>` — use the specific pod the user mentioned |
| `DEPLOYMENT_NAME` | Check `metadata.name` in the Deployment manifest, or ask the user |
| `APP_LABEL` | Check `spec.selector.matchLabels.app` in the Deployment manifest |
## How SSI Works — Domain Knowledge
Read this before investigating. It gives you the mental model to reason about novel failures, not just known ones.
Injection chain:

- Admission webhook (registered by the Cluster Agent) intercepts pod creation
- Webhook mutates the pod spec — adds a `datadog-lib-<language>-init` init container
- Init container downloads the tracer library onto a shared volume
- `LD_PRELOAD` env var is set pointing to the library `.so` file
- Application process loads the library automatically on startup via `LD_PRELOAD`
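To make the chain concrete, a webhook-mutated pod spec ends up looking roughly like the sketch below. This is illustrative only — actual image names, paths, and the init container command vary by language and SSI version:

```yaml
# Sketch of a mutated pod spec (names/paths illustrative; Python assumed)
spec:
  initContainers:
    - name: datadog-lib-python-init
      image: gcr.io/datadoghq/dd-lib-python-init   # assumption: exact image varies
      command: ["sh", "copy-lib.sh", "/datadog-lib"]
      volumeMounts:
        - name: datadog-auto-instrumentation
          mountPath: /datadog-lib                  # shared volume for the tracer
  containers:
    - name: app
      env:
        - name: LD_PRELOAD
          value: /datadog-lib/launcher.preload.so  # loaded at process startup
      volumeMounts:
        - name: datadog-auto-instrumentation
          mountPath: /datadog-lib
  volumes:
    - name: datadog-auto-instrumentation
      emptyDir: {}
```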
What each diagnostic layer can see:
- pup — sees what Datadog's backend received. Blind to cluster-side injection failures. If pup shows no instrumented pods, the problem is in the cluster.
- kubectl — sees cluster state. Blind to whether data reached Datadog. If kubectl shows the init container but pup shows no traces, the problem is post-injection.
What healthy looks like:

- `pup fleet instrumented-pods list` shows the pod with correct language/version
- `pup fleet tracers list` shows the service as active
- `kubectl get pod -o jsonpath='{.spec.initContainers[*].name}'` includes `datadog-lib-<language>-init`
Known silent failures — SSI produces no error when these occur:

- Alpine/musl libc — `LD_PRELOAD` fails silently. SSI's `.so` is compiled against glibc; musl (Alpine Linux) is ABI-incompatible
- Existing ddtrace or OTel instrumentation — SSI detects it and silently disables itself
- Unsupported runtime version — silently skipped
- `admission.datadoghq.com/enabled: "false"` annotation — webhook skips the pod entirely
- Pod not restarted after SSI enabled — injection happens at startup; existing pods keep running uninstrumented
- Pod in Agent namespace — SSI never instruments its own namespace
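Most of these can be ruled out quickly — a minimal sketch, assuming the pod is running and its image ships a shell:

```sh
# Quick checks for the common silent failures (sketch; placeholders from the context table)
# 1. Opt-out annotation set? (empty output = no annotation)
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> \
  -o jsonpath='{.metadata.annotations.admission\.datadoghq\.com/enabled}'; echo
# 2. Init container injected? (empty output = webhook never fired)
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> \
  -o jsonpath='{.spec.initContainers[*].name}'; echo
# 3. glibc or musl? (musl/Alpine = LD_PRELOAD fails silently)
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- sh -c 'ldd --version 2>&1 | head -1'
```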
Reasoning shortcuts:
- No init container → webhook didn't fire → check: namespace targeting, pod-selector, opt-out annotation, webhook registration, pod not restarted
- Init container present + no traces → injection attempted but failed or tracer not reporting → check: libc compatibility, existing ddtrace, runtime version, Agent connectivity, DD_SITE mismatch
## Step 1: Triage
Run all four simultaneously. Everything after this is driven by what you find here.
Claude runs:

```sh
pup traces search --query "service:<SERVICE_NAME>" --from 1h --limit 5
pup fleet instrumented-pods list <CLUSTER_NAME>
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> \
  -o jsonpath='{.spec.initContainers[*].name}'
kubectl describe pod <POD_NAME> -n <APP_NAMESPACE> | grep -A 10 "Events:"
```
## Step 2: State Your Hypotheses
Before investigating, explicitly state your ranked hypotheses based on triage output. Do not skip this step.
| Triage signal | Strong hypothesis |
|---|---|
| Traces arriving + pod in instrumented list | Not a real problem — likely a UI filter or time window. Tell the user and stop |
| No traces + pod NOT in instrumented list + no init container | Injection never happened — investigate: namespace targeting, webhook, pod-selector, opt-out annotation, pod not restarted |
| No traces + pod NOT in instrumented list + init container present | Injection attempted but failed — check `pup apm troubleshooting list` for injection errors |
| No traces + pod in instrumented list + init container present | Tracer injected but not reporting — investigate: connectivity, DD_SITE, API key |
| Pod events show CrashLoopBackOff or init container errors | Init container failure — check libc (Alpine/musl), existing ddtrace, runtime version |
| Traces arriving but wrong service/env | UST labels missing or misconfigured on the Deployment — see the sketch below |
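For the last row: correctly configured Unified Service Tagging labels on the Deployment's pod template look roughly like this (a sketch; the version label is optional and its value illustrative):

```yaml
# UST labels on the pod template drive service/env attribution (sketch)
spec:
  template:
    metadata:
      labels:
        tags.datadoghq.com/service: <SERVICE_NAME>
        tags.datadoghq.com/env: <ENV>
        tags.datadoghq.com/version: "1.0"   # optional
```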
State your top 1-3 hypotheses explicitly: "Based on triage, I think the most likely cause is X because Y."
## Step 3: Investigate
Use only the tools relevant to your hypotheses. Each observation informs your next action.
### Cluster-side investigation tools
**Is the pod in the Agent namespace?** SSI never instruments pods in the same namespace as the Datadog Agent.

```sh
kubectl get pods -n <AGENT_NAMESPACE>
```
**Were pods restarted after SSI was enabled?**

```sh
kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <APP_NAMESPACE>
kubectl wait --for=condition=Ready pod -l app=<APP_LABEL> -n <APP_NAMESPACE> --timeout=120s
pup fleet instrumented-pods list <CLUSTER_NAME>
```
**Is namespace targeting filtering the pod out?**

```sh
kubectl get datadogagent datadog -n <AGENT_NAMESPACE> -o yaml | grep -A 15 instrumentation
```

Fix: update `enabledNamespaces` in `datadog-agent.yaml`.
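The relevant block in `datadog-agent.yaml` looks roughly like this — a sketch against the DatadogAgent CRD; your surrounding fields may differ:

```yaml
# Sketch: SSI namespace targeting in the DatadogAgent spec
spec:
  features:
    apm:
      instrumentation:
        enabled: true
        enabledNamespaces:
          - <APP_NAMESPACE>   # the app namespace must be listed here
```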
Claude runs:

```sh
kubectl apply -f datadog-agent.yaml
```
**Is a podSelector target filtering the pod out?** If `targets` with `podSelector` is configured, only pods whose labels match the selector are instrumented. Check whether the app pod's labels match any target:

```sh
kubectl get datadogagent datadog -n <AGENT_NAMESPACE> -o yaml | grep -A 20 targets
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> --show-labels
```

Fix: add a matching label to the pod template, or broaden the `podSelector`, then apply and restart.
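For reference, a `targets` entry with a `podSelector` looks roughly like this (a sketch; the target name and label matcher are illustrative):

```yaml
# Sketch: only pods matching the selector get instrumented
spec:
  features:
    apm:
      instrumentation:
        enabled: true
        targets:
          - name: app-target            # illustrative name
            podSelector:
              matchLabels:
                app: <APP_LABEL>        # must match the pod's labels
```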
**Is a pod annotation opting it out?** `admission.datadoghq.com/enabled: "false"` tells the webhook to skip this pod.

```sh
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> -o yaml | grep -A 5 annotations
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> --show-labels
```

Fix: remove the annotation from the Deployment pod template, then apply and restart.
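The annotation sits on the Deployment's pod template, not on the Deployment itself — a sketch of what to delete:

```yaml
# Remove this annotation from the pod template to re-enable injection
spec:
  template:
    metadata:
      annotations:
        admission.datadoghq.com/enabled: "false"   # delete this line
```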
Claude runs:

```sh
kubectl apply -f <your-app-deployment.yaml>
kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <APP_NAMESPACE>
```
**Does the app have existing custom instrumentation?** SSI silently disables itself when it detects existing tracer code. Scan source files for:

- Python: `import ddtrace`, `ddtrace.patch_all()`
- Node.js: `require('dd-trace')`, `DD.init()`
- Java: `GlobalTracer.register(`, `dd-java-agent`
- .NET: `Tracer.Instance`, `DD.Trace`
- Ruby: `require 'ddtrace'`, `Datadog.configure`
- PHP: `DDTrace\`

Also check dependency manifests: `requirements.txt`, `package.json`, `Gemfile`, `pom.xml`.
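A quick way to run that scan — a sketch; `<APP_SOURCE_DIR>` is a placeholder for the source checkout, and the patterns are heuristic, not exhaustive:

```sh
# Heuristic scan of source for existing tracer code (illustrative patterns)
grep -rnE "import ddtrace|dd-trace|GlobalTracer|Tracer.Instance|Datadog.configure|DDTrace" <APP_SOURCE_DIR>
# Check dependency manifests too; suppress errors for files that don't exist
grep -nE "ddtrace|dd-trace|datadog" requirements.txt package.json Gemfile pom.xml 2>/dev/null
```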
Fix: remove the import/package, rebuild image, reload into cluster, restart pod.
**Is the base image Alpine (musl libc)?** SSI's injected library requires glibc. Alpine uses musl — ABI-incompatible, fails silently.

```sh
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- sh -c "ldd --version 2>&1 | head -1"
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- sh -c "cat /etc/os-release | grep -i 'ID\|NAME' | head -3"
```

Fix: rebuild with a glibc-based image (`python:3.x-slim`, `node:x-bookworm`, `eclipse-temurin`).
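A minimal sketch of the base-image swap, assuming a Python app (tags and build steps are illustrative):

```dockerfile
# Before (musl/Alpine — SSI's glibc .so fails silently):
#   FROM python:3.12-alpine
# After (glibc-based, compatible with the injected library):
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
```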
**Is the runtime version supported?**

```sh
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- python --version
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- node --version
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- java -version
```

Verify against the SSI compatibility matrix.
**Is the admission webhook registered?**

```sh
kubectl get mutatingwebhookconfigurations | grep datadog
kubectl get pods -n <AGENT_NAMESPACE> -l app=datadog-cluster-agent
kubectl logs -n <AGENT_NAMESPACE> -l app=datadog-cluster-agent --tail=100
```
**Did injection produce errors?** Get the node hostname first, then query Datadog for injection errors:

```sh
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> -o jsonpath='{.spec.nodeName}'
pup apm troubleshooting list --hostname <NODE_HOSTNAME> --timeframe 1h
```
**Is the Agent sending data to Datadog?**

```sh
kubectl exec -n <AGENT_NAMESPACE> \
  $(kubectl get pod -n <AGENT_NAMESPACE> -l app=datadog-agent -o name | head -1) \
  -- agent status | grep -A 5 "APM Agent"
```
### Datadog-side investigation tools
**Is the tracer reporting?**

```sh
pup fleet tracers list --filter "service:<SERVICE_NAME>"
```

**Does APM recognise the service?**

```sh
pup apm services list --env <ENV>
```

**Are traces arriving?**

```sh
pup traces search --query "service:<SERVICE_NAME>" --from 1h --limit 10
```

**Which Agent is the tracer connected to?** Use this if connectivity between the tracer and the Agent is suspected.

```sh
pup fleet agents list --filter "hostname:<NODE_HOSTNAME>"
pup fleet agents tracers <AGENT_KEY> --filter "service:<SERVICE_NAME>"
```
## Step 4: Reflect Before Concluding
Before applying any fix, answer:
- What evidence confirms my hypothesis?
- What evidence would contradict it — and have I checked?
- Is there a simpler explanation I haven't considered?
If the conclusion doesn't hold up, return to Step 2 with new hypotheses. Keep iterating until you can defend the conclusion against all three questions.
## Step 5: Fix
Apply the fix for the confirmed root cause. If the fix requires a code or Dockerfile change, rebuild and reload:
Claude runs:

```sh
docker build -f <DOCKERFILE_PATH> -t <IMAGE_NAME> <BUILD_CONTEXT>
```
[DECISION: cluster type]

- kind (local): load the image into the cluster

  Claude runs:

  ```sh
  kind load docker-image <IMAGE_NAME> --name <CLUSTER_NAME>
  ```

- Registry-based: skip — the image will be pulled on the next deployment
Claude runs:

```sh
kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <APP_NAMESPACE>
kubectl wait --for=condition=Ready pod -l app=<APP_LABEL> -n <APP_NAMESPACE> --timeout=120s
```
## Step 6: Verify
Re-run triage to confirm the fix worked:

Claude runs:

```sh
pup traces search --query "service:<SERVICE_NAME>" --from 1h --limit 5
pup fleet instrumented-pods list <CLUSTER_NAME>
```
If traces are arriving and the pod is in the instrumented list — resolved. Automatically proceed to `onboarding-summary` now — do not ask the user for permission.
ERROR: Still not resolved — return to Step 2 with the new triage data and form updated hypotheses.
## Security constraints
- Never write a raw API key into any file or chat message
- Never run `kubectl delete` without user confirmation
- Never modify `admissionController` settings directly
- `docker push` to a registry always requires user confirmation