# Datadog Observability

## Overview
Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.
Core Capabilities:
- APM: Distributed tracing with automatic instrumentation for 8+ languages
- Infrastructure: Host, container, and cloud service monitoring
- Logs: Centralized collection with processing pipelines and 15-month retention
- Metrics: Custom metrics via DogStatsD with cardinality management
- Synthetics: Proactive API and browser testing from 29+ global locations
- RUM: Frontend performance with Core Web Vitals and session replay
## When to Use This Skill
Activate when:
- Setting up production monitoring and observability
- Implementing distributed tracing across microservices
- Configuring log aggregation and analysis pipelines
- Creating custom metrics and dashboards
- Setting up alerting and anomaly detection
- Optimizing Datadog costs
Do not use when:
- Building with open-source stack (use Prometheus/Grafana instead)
- Cost is primary concern and budget is limited
- Need maximum customization over managed solution
## Quick Start

### 1. Install Datadog Agent
Docker (simplest):

```bash
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
```
Kubernetes (Helm):

```bash
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true
```
### 2. Instrument Your Application
Python:

```python
from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries
patch_all()

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here
```
Node.js:

```javascript
// Must be first import
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});
```
### 3. Verify in Datadog UI
- Go to Infrastructure > Host Map to verify the Agent is reporting
- Go to APM > Services to see traced services
- Go to Logs > Search to verify log collection
## Core Concepts

### Tagging Strategy
Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.
Required Tags:

| Tag | Purpose | Example |
|---|---|---|
| env | Environment | env:production |
| service | Service name | service:api-gateway |
| version | Deployment version | version:1.2.3 |
| team | Owning team | team:platform |
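A minimal sketch of applying these, assuming the Python tracer from the Quick Start: `env`, `service`, and `version` are usually supplied through the `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` environment variables so they stay identical across traces, logs, and metrics, while extra tags such as `team` can be attached to every span programmatically.

```python
from ddtrace import tracer

# env, service and version are read from DD_ENV, DD_SERVICE and DD_VERSION
# (exported before the process starts), keeping them consistent everywhere.
# set_tags attaches additional low-cardinality tags to every span emitted
# by this process; team:platform mirrors the table above.
tracer.set_tags({"team": "platform"})
```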
Avoid High-Cardinality Tags:
- User IDs, request IDs, timestamps
- Pod IDs in Kubernetes
- Build numbers, commit hashes
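These values are still worth recording — put them on individual spans or log events rather than on metric tags, where every unique value becomes another custom metric. A minimal sketch (the function, `order_id`, and metric name are hypothetical):

```python
from datadog import statsd  # DogStatsD client; defaults to the local Agent on UDP 8125
from ddtrace import tracer


def handle_order(order_id: str, region: str) -> None:
    with tracer.trace("order.handle", service="my-service") as span:
        # High-cardinality value: acceptable as a span tag, searchable per trace.
        span.set_tag("order.id", order_id)
        # Metric tags stay low-cardinality (env, service, region, ...).
        statsd.increment("orders.handled", tags=[f"region:{region}"])
```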
### Unified Observability
Datadog correlates metrics, traces, and logs automatically:
- Traces include span tags that link to metrics
- Logs inject trace IDs for correlation (sketched below)
- Dashboards combine all data sources
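A minimal sketch of the log side in Python, assuming the `ddtrace` logging integration (running under `ddtrace-run` with `DD_LOGS_INJECTION=true` achieves the same without code changes):

```python
import logging

from ddtrace import patch

# Patch the standard logging module so records carry dd.trace_id / dd.span_id.
patch(logging=True)

FORMAT = (
    "%(asctime)s %(levelname)s [%(name)s] "
    "[dd.service=%(dd.service)s dd.env=%(dd.env)s dd.version=%(dd.version)s "
    "dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] %(message)s"
)
logging.basicConfig(format=FORMAT)

logging.getLogger(__name__).warning("payment retry scheduled")  # linked to the active trace in Logs
```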
## Best Practices

### Start Simple
- Install Agent with basic configuration
- Enable automatic instrumentation
- Verify data in Datadog UI
- Add custom spans and metrics as needed (see the DogStatsD sketch below)
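When custom metrics become useful, a minimal DogStatsD sketch with the official `datadog` Python package (metric names are illustrative; the Agent listens for DogStatsD on UDP port 8125 by default):

```python
from datadog import initialize, statsd

# Point the client at the local Agent's DogStatsD listener.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

TAGS = ["env:production", "service:api-gateway", "team:platform"]

statsd.increment("checkout.completed", tags=TAGS)                # counter
statsd.histogram("checkout.duration_seconds", 0.248, tags=TAGS)  # timing distribution (avg, p95, max, ...)
statsd.gauge("checkout.queue_depth", 12, tags=TAGS)              # point-in-time value
```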
### Progressive Enhancement
Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM
### Key Instrumentation Points
- HTTP entry/exit points
- Database queries
- External service calls
- Message queue operations
- Business-critical flows
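Any of these can be wrapped in a custom span when auto-instrumentation does not cover it; a minimal sketch using the `ddtrace` decorator form (the function and resource name are hypothetical):

```python
from ddtrace import tracer


@tracer.wrap(service="my-service", resource="orders.process")
def process_order(order: dict) -> None:
    # Database queries and outbound HTTP calls made here show up as child
    # spans when their client libraries are patched via patch_all().
    ...
```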
## Common Mistakes
- High-cardinality tags: Using user IDs or request IDs as tags creates millions of unique metrics
- Missing log index quotas: Leads to unexpected bills from log volume spikes
- Over-alerting: Creates alert fatigue; alert on symptoms, not causes
- Missing service tags: Prevents correlation between metrics, traces, and logs
- No sampling for high-volume traces: Ingesting every trace drives cost explosions; configure head-based sampling (sketched below)
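A sketch of that head-based sampling, assuming the rule-based sampler exposed by `ddtrace` 1.x (`DatadogSampler`/`SamplingRule`); newer releases prefer the `DD_TRACE_SAMPLE_RATE` and `DD_TRACE_SAMPLING_RULES` environment variables instead:

```python
from ddtrace import tracer
from ddtrace.sampler import DatadogSampler, SamplingRule

# Keep ~10% of traces by default, but everything from the checkout service.
tracer.configure(sampler=DatadogSampler(rules=[
    SamplingRule(sample_rate=1.0, service="checkout"),
    SamplingRule(sample_rate=0.1),  # catch-all rule
]))
```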
## Navigation
For detailed implementation:
- Agent Installation: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
- APM Instrumentation: Python, Node.js, Go, Java instrumentation with code examples
- Log Management: Pipelines, Grok parsing, standard attributes, archives
- Custom Metrics: DogStatsD patterns, metric types, tagging best practices
- Alerting: Monitor types, anomaly detection, alert hygiene
- Cost Optimization: Metrics without Limits, sampling, index quotas
- Kubernetes: DaemonSet, Cluster Agent, autodiscovery
## Complementary Skills
When using this skill, consider these related skills (if deployed):
- docker: Container instrumentation patterns
- kubernetes: K8s-native monitoring patterns
- python/nodejs/go: Language-specific APM setup
## Resources
Official Documentation:
- APM: https://docs.datadoghq.com/tracing/
- Logs: https://docs.datadoghq.com/logs/
- Metrics: https://docs.datadoghq.com/metrics/
- DogStatsD: https://docs.datadoghq.com/developers/dogstatsd/
Cost Management: