dd-apm

Datadog APM

Distributed tracing, service maps, and performance analysis.

Requirements

Install the Datadog Labs pup CLI with Homebrew:

brew tap datadog-labs/pack
brew install pup

Quick Start

pup auth login
pup apm services list --env production
pup traces search --query="service:api-gateway" --from="1h"

Services

List Services

--env is required for all apm services commands.

pup apm services list --env production
pup apm services list --env staging
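
Since every apm services command needs --env, a small wrapper can make the requirement hard to forget. The sketch below is not part of pup: the function name apm_services and the PUP_BIN variable are assumptions made for illustration, and it only uses the subcommands and flags shown in this document.

```shell
# Hypothetical wrapper: refuse to call `pup apm services ...` without an environment.
# PUP_BIN is an assumption of this sketch; it lets the binary be swapped out
# (for example with `echo` for a dry run). It defaults to `pup`.
apm_services() {
  subcommand="$1"
  env_name="$2"
  if [ -z "$subcommand" ] || [ -z "$env_name" ]; then
    echo "usage: apm_services <list|stats|operations|resources> <env> [extra flags...]" >&2
    return 2
  fi
  shift 2
  # Always inject --env, then pass any remaining flags through unchanged.
  "${PUP_BIN:-pup}" apm services "$subcommand" --env "$env_name" "$@"
}
```

For example, `apm_services stats production --from 4h` would run the same command as in the Service Statistics section, while `apm_services list` with no environment fails before pup is invoked.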

Service Statistics

pup apm services stats --env production
pup apm services stats --env production --from 4h

Service Operations and Resources

# List operations for a service
pup apm services operations --env production --service api-gateway

# List resources (endpoints) for an operation
pup apm services resources --env production --service api-gateway --operation http.request

Service Dependencies

pup apm dependencies list --env production

Flow Map

# View service flow map (--query and --env required)
pup apm flow-map --query "service:api-gateway" --env production

Traces

Traces are searched via the top-level traces command (not under apm).

Important: APM durations are in nanoseconds: 1 second = 1,000,000,000 ns.
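
Because thresholds are easier to think about in milliseconds, it can help to convert before building a @duration filter. The helpers below are a minimal sketch; the names ms_to_ns, ns_to_ms, and slow_query are hypothetical and not part of pup.

```shell
# Convert milliseconds to the nanosecond value used in @duration filters.
ms_to_ns() {
  echo $(( $1 * 1000000 ))
}

# Convert a nanosecond duration back to whole milliseconds when reading results.
ns_to_ms() {
  echo $(( $1 / 1000000 ))
}

# Build a "slower than N ms" query string for a service.
# Usage: slow_query <service> <threshold_ms>
slow_query() {
  echo "service:$1 @duration:>$(ms_to_ns "$2")"
}
```

The result can be passed straight to the search command, for instance `pup traces search --query="$(slow_query api-gateway 250)" --from="1h"`.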

Search Traces

# By service
pup traces search --query="service:api-gateway" --from="1h"

# Errors only
pup traces search --query="service:api-gateway status:error" --from="1h"

# Slow traces (>1 second = 1000000000 ns)
pup traces search --query="service:api-gateway @duration:>1000000000" --from="1h"

# With specific tag
pup traces search --query="service:api @http.url:/api/users" --from="1h"

Aggregate Traces

# Average duration by resource
pup traces aggregate \
  --query="service:api-gateway" \
  --compute="avg(@duration)" \
  --group-by="resource_name" \
  --from="1h"

# Error count by service
pup traces aggregate \
  --query="status:error" \
  --compute="count" \
  --group-by="service" \
  --from="1h"

# p99 latency
pup traces aggregate \
  --query="service:api-gateway" \
  --compute="percentile(@duration, 99)" \
  --from="1h"

Key Metrics

Metric                        What It Measures
trace.http.request.hits       Request count
trace.http.request.duration   Latency
trace.http.request.errors     Error count
trace.http.request.apdex      User satisfaction

⚠️ Trace Sampling

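Not all traces are kept. Sampling and retention determine which traces are available to search:

Mode         What's Kept
Head-based   A random percentage of traces, decided when the trace starts
Error/Slow   All errors and slow traces
Retention    Only what is indexed (and billed)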
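
As a rough illustration of head-based sampling, the sketch below assumes the service is instrumented with a Datadog tracing library that reads the DD_TRACE_SAMPLE_RATE environment variable. The 0.1 rate is an arbitrary example, and the expected_kept helper is hypothetical; it only does the arithmetic and says nothing about error or slow traces that are kept separately.

```shell
# Assumption: the tracer in use honors DD_TRACE_SAMPLE_RATE (a value from 0.0 to 1.0).
# 0.1 is an example value, not a recommendation.
export DD_TRACE_SAMPLE_RATE=0.1   # keep roughly 10% of traces, decided at the start

# Rough arithmetic for what a rate means over a given traffic volume.
# Usage: expected_kept <total_traces> <percent_kept>
expected_kept() {
  echo $(( $1 * $2 / 100 ))
}
```

With these numbers, `expected_kept 1000000 10` gives an estimate of how many of one million traces survive a 10% head-based rate.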

Trace Retention Costs

Span Type        Relative Cost
Indexed spans    $$$ per million
Ingested spans   $ per GB

Best practice: Only index what you need for search.

Service Level Objectives

Link APM to SLOs:

pup slos create --file slo.json
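
The contents of slo.json are not shown above. The sketch below writes one possible definition, assuming pup accepts the same body as the Datadog SLO API for a metric-based SLO; that assumption, the SLO name, the 99.9% target, and the 30d timeframe are all illustrative. The queries reuse the trace.http.request.hits and trace.http.request.errors metrics from Key Metrics.

```shell
# Write an example metric-based SLO definition to slo.json.
# Assumption: the file format mirrors the Datadog SLO API request body.
# The name, tags, target, and timeframe below are placeholders.
cat > slo.json <<'EOF'
{
  "name": "api-gateway availability",
  "type": "metric",
  "description": "Share of api-gateway requests that complete without error",
  "query": {
    "numerator": "sum:trace.http.request.hits{service:api-gateway,env:production}.as_count() - sum:trace.http.request.errors{service:api-gateway,env:production}.as_count()",
    "denominator": "sum:trace.http.request.hits{service:api-gateway,env:production}.as_count()"
  },
  "thresholds": [
    { "timeframe": "30d", "target": 99.9 }
  ],
  "tags": ["service:api-gateway", "env:production"]
}
EOF
```

The numerator counts successful requests (hits minus errors) and the denominator counts all requests, so the SLO tracks the success ratio over the chosen timeframe.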

Common Queries

Goal                     Query
Slowest endpoints        pup traces aggregate --query="service:api" --compute="avg(@duration)" --group-by="resource_name" --from="1h"
Error count by service   pup traces aggregate --query="status:error" --compute="count" --group-by="service" --from="1h"
Throughput               pup traces aggregate --query="service:api" --compute="count" --group-by="resource_name" --from="1h"

Troubleshooting

Problem                Fix
No traces              Check that ddtrace is installed and DD_TRACE_ENABLED=true
Missing service        Verify that the DD_SERVICE env var is set
Traces not linked      Check that trace headers are propagated between services
High cardinality       Don't tag spans with user_id or request_id
--env required error   Always pass --env to apm services commands
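
The first two rows can be checked quickly from the shell that launches the service. This is a minimal sketch; check_dd_env is a hypothetical helper, and it only inspects the two environment variables named in the table, not whether ddtrace itself is installed.

```shell
# Report tracer settings from the table that look wrong in the current environment.
# Returns non-zero when at least one check fails.
check_dd_env() {
  status=0
  if [ -z "${DD_SERVICE:-}" ]; then
    echo "DD_SERVICE is not set"
    status=1
  fi
  if [ "${DD_TRACE_ENABLED:-}" != "true" ]; then
    echo "DD_TRACE_ENABLED is not set to true"
    status=1
  fi
  return "$status"
}
```

Run `check_dd_env` in the same shell or container entrypoint that starts the instrumented process, since variables set elsewhere will not be visible to it.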
