Troubleshoot APM SSI on Kubernetes

Triggers

Invoke this skill when the user expresses intent to:

  • Debug why a pod is not being instrumented
  • Investigate why traces are not appearing in Datadog
  • Diagnose admission webhook or init container injection failures
  • Follow up on failed checks from verify-ssi
  • Report that a specific service or pod has no traces

Do NOT invoke this skill if:

  • SSI has not been enabled yet — run enable-ssi first

Prerequisites

  • kubectl configured to target cluster — kubectl config current-context

pup-cli: check, install, and authenticate

Claude runs

pup --version

If not found:

Claude runs

brew tap datadog-labs/pack
brew install pup

Check auth:

pup auth status

If not authenticated:

What you need to do in a terminal

pup auth login

Confirm with pup auth status. If no browser is available, set export DD_APP_KEY=<your-app-key> instead.


Context to resolve before acting

  • AGENT_NAMESPACE: namespace where the Datadog Agent is installed
  • APP_NAMESPACE: namespace of the application with missing traces
  • CLUSTER_NAME: kubectl config current-context, or spec.global.clusterName in datadog-agent.yaml
  • SERVICE_NAME: tags.datadoghq.com/service label on the Deployment, or ask the user
  • ENV: tags.datadoghq.com/env label on the Deployment, or ask the user
  • POD_NAME: kubectl get pods -n <APP_NAMESPACE>; use the specific pod the user mentioned
  • DEPLOYMENT_NAME: metadata.name in the Deployment manifest, or ask the user
  • APP_LABEL: spec.selector.matchLabels.app in the Deployment manifest

How SSI Works — Domain Knowledge

Read this before investigating. It gives you the mental model to reason about novel failures, not just known ones.

Injection chain:

  1. Admission webhook (registered by Cluster Agent) intercepts pod creation
  2. Webhook mutates the pod spec — adds a datadog-lib-<language>-init init container
  3. Init container downloads the tracer library onto a shared volume
  4. LD_PRELOAD env var is set pointing to the library .so file
  5. Application process loads the library automatically on startup via LD_PRELOAD
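
Concretely, a successfully mutated pod spec ends up with a shape like the sketch below. The names here are illustrative only; the exact image, volume, and library paths vary by language and SSI version.

spec:
  initContainers:
    - name: datadog-lib-python-init              # step 2: injected by the webhook
      image: gcr.io/datadoghq/dd-lib-python-init # illustrative image name
      volumeMounts:
        - name: datadog-auto-instrumentation     # step 3: tracer lands on this shared volume
          mountPath: /datadog-lib
  containers:
    - name: app
      env:
        - name: LD_PRELOAD                       # step 4: points the loader at the tracer .so
          value: /datadog-lib/launcher.preload.so  # illustrative path
      volumeMounts:
        - name: datadog-auto-instrumentation
          mountPath: /datadog-lib
  volumes:
    - name: datadog-auto-instrumentation
      emptyDir: {}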

What each diagnostic layer can see:

  • pup — sees what Datadog's backend received. Blind to cluster-side injection failures. If pup shows no instrumented pods, the problem is in the cluster.
  • kubectl — sees cluster state. Blind to whether data reached Datadog. If kubectl shows the init container but pup shows no traces, the problem is post-injection.

What healthy looks like:

  • pup fleet instrumented-pods list shows the pod with correct language/version
  • pup fleet tracers list shows the service as active
  • kubectl get pod -o jsonpath='{.spec.initContainers[*].name}' includes datadog-lib-<language>-init

Known silent failures — SSI produces no error when these occur:

  • Alpine/musl libc — LD_PRELOAD fails silently. SSI's .so is compiled against glibc; musl (Alpine Linux) is ABI-incompatible
  • Existing ddtrace or OTel instrumentation — SSI detects it and silently disables itself
  • Unsupported runtime version — silently skipped
  • admission.datadoghq.com/enabled: "false" annotation — webhook skips the pod entirely
  • Pod not restarted after SSI enabled — injection happens at startup; existing pods keep running uninstrumented
  • Pod in Agent namespace — SSI never instruments its own namespace

Reasoning shortcuts:

  • No init container → webhook didn't fire → check: namespace targeting, pod-selector, opt-out annotation, webhook registration, pod not restarted
  • Init container present + no traces → injection attempted but failed or tracer not reporting → check: libc compatibility, existing ddtrace, runtime version, Agent connectivity, DD_SITE mismatch

Step 1: Triage

Run all four simultaneously. Everything after this is driven by what you find here.

Claude runs

pup traces search --query "service:<SERVICE_NAME>" --from 1h --limit 5
pup fleet instrumented-pods list <CLUSTER_NAME>
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> \
  -o jsonpath='{.spec.initContainers[*].name}'
kubectl describe pod <POD_NAME> -n <APP_NAMESPACE> | grep -A 10 "Events:"

Step 2: State Your Hypotheses

Before investigating, explicitly state your ranked hypotheses based on triage output. Do not skip this step.

  • Traces arriving + pod in instrumented list → not a real problem; likely a UI filter or time window. Tell the user and stop
  • No traces + pod NOT in instrumented list + no init container → injection never happened → investigate: namespace targeting, webhook, pod-selector, opt-out annotation, pod not restarted
  • No traces + pod NOT in instrumented list + init container present → injection attempted but failed → check pup apm troubleshooting list for injection errors
  • No traces + pod in instrumented list + init container present → tracer injected but not reporting → investigate: connectivity, DD_SITE, API key
  • Pod events show CrashLoopBackOff or init container errors → init container failure → check libc (Alpine/musl), existing ddtrace, runtime version
  • Traces arriving but wrong service/env → UST labels missing or misconfigured on the Deployment (see the label sketch below)
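
For the last signal, Unified Service Tagging labels belong on the Deployment's pod template. A minimal sketch, with placeholder values:

spec:
  template:
    metadata:
      labels:
        tags.datadoghq.com/service: <SERVICE_NAME>
        tags.datadoghq.com/env: <ENV>
        tags.datadoghq.com/version: "1.0.0"   # optional but recommended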

State your top 1-3 hypotheses explicitly: "Based on triage, I think the most likely cause is X because Y."


Step 3: Investigate

Use only the tools relevant to your hypotheses. Each observation informs your next action.


Cluster-side investigation tools

Is the pod in the Agent namespace? SSI never instruments pods in the same namespace as the Datadog Agent.

kubectl get pods -n <AGENT_NAMESPACE>

Were pods restarted after SSI was enabled?

kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <APP_NAMESPACE>
kubectl wait --for=condition=Ready pod -l app=<APP_LABEL> -n <APP_NAMESPACE> --timeout=120s
pup fleet instrumented-pods list <CLUSTER_NAME>

Is namespace targeting filtering the pod out?

kubectl get datadogagent datadog -n <AGENT_NAMESPACE> -o yaml | grep -A 15 instrumentation

Fix: update enabledNamespaces in datadog-agent.yaml.
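
A minimal sketch of the relevant section, assuming the Operator's DatadogAgent (v2alpha1) layout; match the structure that already exists in your datadog-agent.yaml:

spec:
  features:
    apm:
      instrumentation:
        enabled: true
        enabledNamespaces:
          - <APP_NAMESPACE>   # add the namespace being filtered out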

Claude runs

kubectl apply -f datadog-agent.yaml

Is a podSelector target filtering the pod out? If targets with a podSelector are configured, only pods whose labels match a selector are instrumented. Check whether the app pod's labels match any target:

kubectl get datadogagent datadog -n <AGENT_NAMESPACE> -o yaml | grep -A 20 targets
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> --show-labels

Fix: add a matching label to the pod template, or broaden the podSelector, then apply and restart.
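
A minimal sketch of a targets entry whose podSelector matches the app; the target name is a placeholder, and the field layout should be checked against your datadog-agent.yaml:

spec:
  features:
    apm:
      instrumentation:
        enabled: true
        targets:
          - name: app-target          # placeholder
            podSelector:
              matchLabels:
                app: <APP_LABEL>      # must match the pod's labels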

Is a pod annotation opting it out? admission.datadoghq.com/enabled: "false" tells the webhook to skip this pod.

kubectl get pod <POD_NAME> -n <APP_NAMESPACE> -o yaml | grep -A 5 annotations
kubectl get pod <POD_NAME> -n <APP_NAMESPACE> --show-labels

Fix: remove the annotation from the Deployment pod template, then apply and restart.
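
For reference, the opt-out annotation sits on the Deployment's pod template; deleting the annotation line below re-enables injection for this pod:

spec:
  template:
    metadata:
      annotations:
        admission.datadoghq.com/enabled: "false"   # remove this line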

Claude runs

kubectl apply -f <your-app-deployment.yaml>
kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <APP_NAMESPACE>

Does the app have existing custom instrumentation? SSI silently disables itself when it detects existing tracer code. Scan source files for:

  • Python: import ddtrace, ddtrace.patch_all()
  • Node.js: require('dd-trace'), require('dd-trace').init()
  • Java: GlobalTracer.register(, dd-java-agent
  • .NET: Tracer.Instance, Datadog.Trace
  • Ruby: require 'ddtrace', Datadog.configure
  • PHP: DDTrace\

Also check dependency manifests: requirements.txt, package.json, Gemfile, pom.xml.

Fix: remove the import/package, rebuild image, reload into cluster, restart pod.

Is the base image Alpine (musl libc)? SSI's injected library requires glibc. Alpine uses musl — ABI-incompatible, fails silently.

kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- sh -c "ldd --version 2>&1 | head -1"
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- sh -c "cat /etc/os-release | grep -i 'ID\|NAME' | head -3"

Fix: rebuild with a glibc-based image (python:3.x-slim, node:x-bookworm, eclipse-temurin).

Is the runtime version supported?

kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- python --version
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- node --version
kubectl exec -n <APP_NAMESPACE> <POD_NAME> -- java -version

Verify against the SSI compatibility matrix.

Is the admission webhook registered?

kubectl get mutatingwebhookconfigurations | grep datadog
kubectl get pods -n <AGENT_NAMESPACE> -l app=datadog-cluster-agent
kubectl logs -n <AGENT_NAMESPACE> -l app=datadog-cluster-agent --tail=100

Did injection produce errors? Get the node hostname first, then query Datadog for injection errors:

kubectl get pod <POD_NAME> -n <APP_NAMESPACE> -o jsonpath='{.spec.nodeName}'
pup apm troubleshooting list --hostname <NODE_HOSTNAME> --timeframe 1h

Is the Agent sending data to Datadog?

kubectl exec -n <AGENT_NAMESPACE> \
  $(kubectl get pod -n <AGENT_NAMESPACE> -l app=datadog-agent -o name | head -1) \
  -- agent status | grep -A 5 "APM Agent"

Datadog-side investigation tools

Is the tracer reporting?

pup fleet tracers list --filter "service:<SERVICE_NAME>"

Does APM recognise the service?

pup apm services list --env <ENV>

Are traces arriving?

pup traces search --query "service:<SERVICE_NAME>" --from 1h --limit 10

Which Agent is the tracer connected to? Use this when connectivity between the tracer and the Agent is suspected.

pup fleet agents list --filter "hostname:<NODE_HOSTNAME>"
pup fleet agents tracers <AGENT_KEY> --filter "service:<SERVICE_NAME>"

Step 4: Reflect Before Concluding

Before applying any fix, answer:

  1. What evidence confirms my hypothesis?
  2. What evidence would contradict it — and have I checked?
  3. Is there a simpler explanation I haven't considered?

If the conclusion doesn't hold up, return to Step 2 with new hypotheses. Keep iterating until you can defend the conclusion against all three questions.


Step 5: Fix

Apply the fix for the confirmed root cause. If the fix requires a code or Dockerfile change, rebuild and reload:

Claude runs

docker build -f <DOCKERFILE_PATH> -t <IMAGE_NAME> <BUILD_CONTEXT>

[DECISION: cluster type]

  • kind (local): load the image into the cluster

Claude runs

kind load docker-image <IMAGE_NAME> --name <CLUSTER_NAME>

  • Registry-based: skip — the image will be pulled on the next deployment

Claude runs

kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <APP_NAMESPACE>
kubectl wait --for=condition=Ready pod -l app=<APP_LABEL> -n <APP_NAMESPACE> --timeout=120s

Step 6: Verify

Re-run triage to confirm the fix worked:

Claude runs

pup traces search --query "service:<SERVICE_NAME>" --from 1h --limit 5
pup fleet instrumented-pods list <CLUSTER_NAME>

If traces are arriving and the pod is in the instrumented list — resolved. Automatically proceed to onboarding-summary now — do not ask the user for permission.

ERROR: Still not resolved — return to Step 2 with the new triage data and form updated hypotheses.


Security constraints

  • Never write a raw API key into any file or chat message
  • Never run kubectl delete without user confirmation
  • Never modify admissionController settings directly
  • docker push to a registry always requires user confirmation