datadog-operations
Datadog Operations
Datadog automation: query APIs, create infrastructure, manage incidents, and automate responses. Covers 73% of the platform with 17 working scripts.
What This Skill Does
Investigation & Analysis:
- Query APM traces to identify performance bottlenecks
- Search logs for error patterns and anomalies
- Detect security threats and attack attempts
- Analyze Watchdog anomaly detection alerts
- Query metrics with statistical analysis
- Analyze Datadog usage and costs (FinOps)
- Monitor LLM observability for GenAI applications
- Query SLO status and error budgets
- List services from service catalog
- Analyze database query performance
- Track frontend performance with RUM
Automation & Creation:
- Create and manage monitors with alert thresholds
- Generate dashboards for APM, security, costs, and LLM observability
- Trigger Datadog workflows for incident response
- Create and update incidents
- Mute/unmute monitors during maintenance
- Create synthetic uptime checks and browser tests
Prerequisites
Set environment variables:
export DD_API_KEY=your_api_key
export DD_APP_KEY=your_application_key
export DD_SITE=datadoghq.com # or datadoghq.eu, us3.datadoghq.com, etc.
Get keys from Datadog: Organization Settings > API Keys and Application Keys
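Before running any script, it can help to confirm that the variables are present and to see which API host they resolve to. The sketch below is a hypothetical helper, not part of the skill's scripts: the function names `dd_missing_env` and `dd_api_base` are illustrative, and it assumes the usual `https://api.<site>` host pattern for the value of DD_SITE.

```bash
# Print the names of required variables that are unset or empty.
# Returns non-zero when at least one is missing.
dd_missing_env() {
  local missing=""
  [ -n "${DD_API_KEY:-}" ] || missing="$missing DD_API_KEY"
  [ -n "${DD_APP_KEY:-}" ] || missing="$missing DD_APP_KEY"
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  return 0
}

# Derive the API base URL from DD_SITE, defaulting to datadoghq.com.
dd_api_base() {
  echo "https://api.${DD_SITE:-datadoghq.com}"
}
```

Calling `dd_missing_env || exit 1` at the top of a wrapper script stops early with a readable message instead of failing on the first API call.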
Working Scripts
1. Query APM Performance
Find slow endpoints and performance issues:
bash scripts/query-apm.sh --service my-service --duration 1h --limit 20
Returns:
- Endpoints sorted by P95 latency
- Request counts per endpoint
- P50, P95, P99 latency
- Problem endpoints (P95 > 500ms)
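To make the P95 threshold concrete, here is a minimal sketch of a nearest-rank percentile over raw latencies. It is an illustration only and not how query-apm.sh computes its figures; the function names are hypothetical, and the input is assumed to be one latency in milliseconds per line.

```bash
# Nearest-rank percentile of newline-separated numbers read from stdin.
# Usage: percentile 95 < latencies_ms.txt
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p/100 * NR) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Succeeds when the P95 of stdin exceeds a threshold in ms (default 500).
is_problem_endpoint() {
  local p95
  p95=$(percentile 95) || return 2
  awk -v x="$p95" -v t="${1:-500}" 'BEGIN { exit !(x > t) }'
}
```

The default of 500 mirrors the "P95 > 500ms" rule above; pass a different number as the first argument to tighten or relax it.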
2. Query Security Signals
Find security threats and attack attempts:
bash scripts/query-security-signals.sh --service my-service --duration 24h
Returns:
- Security signals by severity (critical, high, medium, low)
- Attack types (SQL injection, XSS, auth failures)
- Affected services and hosts
- Recent security events with details
3. Query Watchdog Anomalies
Automated anomaly detection from Datadog Watchdog:
bash scripts/query-watchdog.sh --service my-service --type latency --duration 7d
Returns:
- Anomalies by type (latency, error_rate, traffic)
- Affected services and resources
- Start timestamps and severity
- Baseline vs observed values
4. Search Logs
Search logs for error patterns:
bash scripts/search-logs.sh --query "status:error service:my-service" --duration 1h
Returns:
- Error messages grouped by frequency
- Associated trace IDs for investigation
- Service and host breakdowns
- Common error patterns
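The --duration values used throughout (30m, 1h, 24h, 7d) have to become a concrete time window before any API call. The following sketch shows one way to do that conversion; the helper names are hypothetical, and the scripts themselves may parse durations differently.

```bash
# Convert a duration such as 30m, 1h, 24h, or 7d into seconds.
duration_to_seconds() {
  local value unit
  value="${1%?}"        # everything except the last character
  unit="${1#"$value"}"  # the last character
  case "$value" in
    ''|0*|*[!0-9]*) echo "invalid duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    m) echo $(( value * 60 )) ;;
    h) echo $(( value * 3600 )) ;;
    d) echo $(( value * 86400 )) ;;
    *) echo "invalid duration: $1" >&2; return 1 ;;
  esac
}

# Print "from to" as Unix epoch seconds for a window ending now.
time_window() {
  local span now
  span=$(duration_to_seconds "$1") || return 1
  now=$(date +%s)
  echo "$(( now - span )) $now"
}
```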
5. Query Metrics
Fetch metric data with statistical analysis:
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 24h
Returns:
- Time series data
- Statistics (min, max, avg, p50, p95, p99)
- Trend analysis (increasing, decreasing, stable)
- Anomaly detection (values more than 2 standard deviations from the mean)
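The standard-deviation rule can be sketched in a few lines of awk. This is an assumption about the general technique, not the script's actual implementation: it uses the population standard deviation, and `flag_outliers` is a hypothetical name.

```bash
# Read one numeric value per line, print the mean and standard deviation,
# then print every value more than k standard deviations from the mean.
flag_outliers() {
  awk -v k="${1:-2}" '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      mean = sum / NR
      for (i = 1; i <= NR; i++) ss += (v[i] - mean) * (v[i] - mean)
      sd = sqrt(ss / NR)
      printf "mean=%.2f stddev=%.2f\n", mean, sd
      for (i = 1; i <= NR; i++) {
        d = v[i] - mean
        if (d < 0) d = -d
        if (sd > 0 && d > k * sd) print "outlier " v[i]
      }
    }'
}
```

A single large spike inflates the standard deviation itself, so on short series this rule only catches pronounced outliers.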
6. Analyze Usage and Costs
FinOps cost analysis and optimization:
bash scripts/analyze-usage-cost.sh --duration 30d --product all
Returns:
- APM span ingestion (indexed vs ingested)
- Log volume breakdown
- Infrastructure hosts and container hours
- Custom metrics count
- Estimated monthly costs by product
- Cost optimization recommendations
7. Analyze LLM Performance
For GenAI applications, analyze LLM observability data:
bash scripts/analyze-llm.sh --service my-llm-app --duration 24h
Returns:
- Token usage statistics (prompt + completion)
- Cost estimates based on model pricing
- Model latency (P50, P95, P99)
- Error rates by model
- Most expensive operations
- Token usage trends
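A cost estimate of this kind reduces to token counts multiplied by per-token prices. The sketch below shows the arithmetic under the assumption that prices are quoted per 1,000 tokens; `estimate_llm_cost` is a hypothetical helper, and no real model prices are built in.

```bash
# Estimate cost in USD from token counts and per-1,000-token prices.
# Arguments: prompt_tokens completion_tokens prompt_price completion_price
# All four values are supplied by the caller.
estimate_llm_cost() {
  awk -v pt="$1" -v ct="$2" -v pp="$3" -v cp="$4" \
    'BEGIN { printf "%.4f\n", (pt / 1000) * pp + (ct / 1000) * cp }'
}
```

For example, `estimate_llm_cost 2000 1000 0.01 0.03` uses made-up prices of 0.01 and 0.03 per 1,000 tokens and prints 0.0500.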
8. Manage Monitors
Create, list, mute, and manage Datadog monitors:
# List all monitors
bash scripts/manage-monitors.sh list
# Create error rate monitor
bash scripts/manage-monitors.sh create \
--name "High Error Rate" \
--query "avg(last_5m):sum:trace.express.request.errors{service:my-service}.as_count() > 10" \
--message "Error rate is high @slack-alerts"
# Mute monitor for 2 hours
bash scripts/manage-monitors.sh mute --id 12345 --duration 2
# Unmute monitor
bash scripts/manage-monitors.sh unmute --id 12345
Returns:
- Monitor list with states (alert, warn, OK)
- Created monitor ID and details
- Mute/unmute confirmations
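The --query strings follow the metric monitor layout: a time aggregation over a window, a space aggregation, the metric with its tag scope, then a comparison. A small builder, sketched here with a hypothetical function name, keeps those pieces explicit when several monitors are generated from a script.

```bash
# Assemble a metric monitor query of the form
#   time_aggr(last_WINDOW):space_aggr:metric{scope}MODIFIER OP threshold
# MODIFIER is optional, for example ".as_count()".
build_monitor_query() {
  local time_aggr="$1" window="$2" space_aggr="$3" metric="$4"
  local scope="$5" op="$6" threshold="$7" modifier="${8:-}"
  printf '%s(last_%s):%s:%s{%s}%s %s %s\n' \
    "$time_aggr" "$window" "$space_aggr" "$metric" "$scope" \
    "$modifier" "$op" "$threshold"
}
```

Calling `build_monitor_query avg 5m sum trace.express.request.errors service:my-service '>' 10 '.as_count()'` reproduces the query used in the create example above.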
9. Create Dashboards
Generate dashboards from templates:
# Create APM performance dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm
# Create security monitoring dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Security Dashboard" --type security
# Create cost analysis dashboard
bash scripts/create-dashboard.sh --title "Infrastructure Costs" --type cost
# Create LLM observability dashboard
bash scripts/create-dashboard.sh --service my-genai-app --title "LLM Performance" --type llm
Dashboard types:
- apm: Latency, errors, throughput by endpoint
- logs: Log volume and error analysis
- security: Security threats and attack patterns
- cost: APM, logs, infrastructure costs
- llm: Token usage, costs, model performance
10. Query SLOs
Check Service Level Objectives and error budgets:
# List all SLOs
bash scripts/query-slos.sh
# List SLOs for service
bash scripts/query-slos.sh --service payment-api
# List SLOs with tag
bash scripts/query-slos.sh --tag team:backend
Returns:
- SLO status (breaching, warning, OK)
- Current value vs target threshold
- Error budget remaining
- Error budget status (exhausted, low, healthy)
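The error budget figures follow from the SLO target and the current value. As a rough sketch, the remaining budget is the share of allowed unreliability not yet consumed. The helper names below are hypothetical, and the 20% cut-off for "low" is an assumption rather than a documented threshold of query-slos.sh.

```bash
# Percentage of error budget remaining for an SLO.
# Arguments: current and target, both percentages such as 99.95 and 99.9
error_budget_remaining() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN {
    if (tgt >= 100) exit 1               # a 100% target leaves no budget
    printf "%.1f\n", (cur - tgt) / (100 - tgt) * 100
  }'
}

# Classify the remaining budget; the 20% boundary is an assumed value.
budget_status() {
  awk -v r="$1" 'BEGIN {
    if (r <= 0) print "exhausted"
    else if (r < 20) print "low"
    else print "healthy"
  }'
}
```

With a 99.9% target, a current value of 99.95% has used half of the allowed 0.1% of failures, so half of the budget remains.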
11. Trigger Workflows
Execute Datadog workflow automation:
# List available workflows
bash scripts/trigger-workflow.sh list
# Trigger workflow
bash scripts/trigger-workflow.sh run --id abc123
# Trigger with input data
bash scripts/trigger-workflow.sh run --id abc123 --input '{"service": "payment-api", "severity": "high"}'
Returns:
- Workflow list with IDs and descriptions
- Workflow instance ID when triggered
- Execution status
12. Manage Incidents
Create and manage incident response:
# List active incidents
bash scripts/manage-incidents.sh list --status active
# Create critical incident
bash scripts/manage-incidents.sh create \
--title "Payment API Down" \
--service payment-api \
--severity SEV-1
# Update incident status
bash scripts/manage-incidents.sh update --id abc123 --status resolved
# Get incident details
bash scripts/manage-incidents.sh get --id abc123
Returns:
- Incident list with status and severity
- Created incident ID and details
- Incident timeline and updates
13. Query Service Catalog
List services and ownership metadata:
# List all services
bash scripts/query-service-catalog.sh list
# List services for team
bash scripts/query-service-catalog.sh list --team backend
# Get service details
bash scripts/query-service-catalog.sh get --service payment-api
Returns:
- Service metadata (kind, tier, lifecycle)
- Team ownership and contacts
- Repository links
- Integration details
14. Manage Synthetic Tests
Create uptime checks, API tests, and browser tests:
# List all synthetic tests
bash scripts/manage-synthetics.sh list
# Create API uptime check
bash scripts/manage-synthetics.sh create-api \
--name "Payment API Uptime" \
--url "https://api.example.com/health" \
--method GET
# Create browser test
bash scripts/manage-synthetics.sh create-browser \
--name "Login Flow" \
--url "https://app.example.com/login"
# Get test results
bash scripts/manage-synthetics.sh get --id abc-123-def
Returns:
- Test list with status (active, paused)
- Created test ID and configuration
- Test results and uptime status
15. Query Database Performance
Analyze database queries and performance:
# Query a PostgreSQL host over the last hour
bash scripts/query-database.sh --host postgres-prod --duration 1h
# Query a MySQL host over the last 24 hours to surface slow queries
bash scripts/query-database.sh --host mysql-01 --duration 24h
Returns:
- Slow query patterns
- P95/avg query duration
- Connection metrics
- Top queries by latency
16. Query RUM (Real User Monitoring)
Analyze frontend performance and user experience:
# Query RUM data for an application over the last hour
bash scripts/query-rum.sh --application abc-123-def --duration 1h
# Widen the window to 24 hours for page load performance
bash scripts/query-rum.sh --application abc-123-def --duration 24h
Returns:
- Page load times (avg, P95)
- Frontend errors
- Top pages by traffic
- Error rate and types
17. Verify Setup
Validate Datadog configuration:
bash scripts/verify-setup.sh
Returns:
- Environment variable validation
- Agent connectivity check
- Tracer installation detection
Incident Investigation Workflow
When investigating production issues:
1. Identify scope
# Check for security threats
bash scripts/query-security-signals.sh --severity critical --duration 1h
# Check for anomalies
bash scripts/query-watchdog.sh --service affected-service --duration 24h
2. Find performance issues
# Find slow endpoints
bash scripts/query-apm.sh --service affected-service --duration 1h
# Check error patterns
bash scripts/search-logs.sh --query "service:affected-service status:error" --duration 1h
3. Analyze metrics
# Check latency trends
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service affected-service --duration 24h
# Check error rate trends
bash scripts/query-metrics.sh --metric "trace.express.request.errors" --service affected-service --duration 24h
4. Get specific traces
# Get error traces
bash scripts/query-apm.sh --service affected-service --status error --limit 10
# Search logs for trace context
bash scripts/search-logs.sh --query "trace_id:abc123def456"
Security Analysis Workflow
Monitor and investigate security threats:
# Check critical security signals
bash scripts/query-security-signals.sh --severity critical --duration 7d
# Analyze specific service
bash scripts/query-security-signals.sh --service payment-api --duration 24h
# Search for attack patterns in logs
bash scripts/search-logs.sh --query '"sql injection" OR xss OR "authentication failed"' --duration 24h
Cost Optimization Workflow
Analyze and optimize Datadog costs:
# Get full cost breakdown
bash scripts/analyze-usage-cost.sh --duration 30d --product all
# Focus on APM costs
bash scripts/analyze-usage-cost.sh --duration 30d --product apm
# Extract high-priority recommendations
bash scripts/analyze-usage-cost.sh --duration 30d --product all | jq '.recommendations[] | select(.priority == "high")'
# Track weekly trends
bash scripts/analyze-usage-cost.sh --duration 7d --product all | jq '.cost_summary'
LLM Observability Workflow
For GenAI applications, monitor token usage and costs:
# Analyze LLM performance
bash scripts/analyze-llm.sh --service my-genai-app --duration 24h
# Filter by specific model
bash scripts/analyze-llm.sh --service my-genai-app --model gpt-4 --duration 7d
# Find most expensive operations
bash scripts/analyze-llm.sh --service my-genai-app --duration 30d | jq '.operations | sort_by(.total_cost_usd) | reverse | .[0:5]'
# Track token usage trends
bash scripts/analyze-llm.sh --service my-genai-app --duration 7d | jq '.summary.token_usage'
Deployment Impact Analysis
Compare metrics before/after deployment:
# Before deployment
bash scripts/query-apm.sh --service my-service --duration 1h > before.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h > before_metrics.json
# Deploy...
# After deployment
bash scripts/query-apm.sh --service my-service --duration 1h > after.json
bash scripts/query-metrics.sh --metric "trace.express.request.duration" --service my-service --duration 1h > after_metrics.json
# Compare latency (before minus after: a negative result means P95 rose after the deployment)
jq -s '.[0].summary.avg_p95_ms - .[1].summary.avg_p95_ms' before.json after.json
# Check for new errors
bash scripts/search-logs.sh --query "service:my-service status:error" --duration 30m
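An absolute difference in milliseconds is hard to judge without the baseline, so a relative change is often easier to read. The following is a minimal sketch with a hypothetical helper name; it takes two plain numbers, which could come from the `.summary.avg_p95_ms` field used above.

```bash
# Percentage change from a "before" value to an "after" value.
# A positive result means the metric grew (for latency, a regression).
pct_change() {
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b == 0) exit 1                   # undefined relative to zero
    printf "%+.1f\n", (a - b) / b * 100
  }'
}

# Example wiring, assuming before.json and after.json exist:
#   pct_change "$(jq '.summary.avg_p95_ms' before.json)" \
#              "$(jq '.summary.avg_p95_ms' after.json)"
```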
Monitor Creation Workflow
Set up monitoring for new services:
# Create latency monitor
bash scripts/manage-monitors.sh create \
--name "Payment API - High Latency" \
--query "avg(last_5m):avg:trace.express.request.duration{service:payment-api} > 500" \
--message "Average latency above 500ms @slack-ops"
# Create error rate monitor
bash scripts/manage-monitors.sh create \
--name "Payment API - Error Rate" \
--query "avg(last_5m):sum:trace.express.request.errors{service:payment-api}.as_count() / sum:trace.express.request.hits{service:payment-api}.as_count() > 0.05" \
--message "Error rate above 5% @pagerduty"
# Create APM dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Performance" --type apm
# Create security dashboard
bash scripts/create-dashboard.sh --service payment-api --title "Payment API Security" --type security
Incident Response Workflow
Automated incident management:
# Check for SLO breaches
bash scripts/query-slos.sh --service payment-api | jq '.slos[] | select(.status == "breaching")'
# Create incident if SLO breached
bash scripts/manage-incidents.sh create \
--title "Payment API SLO Breach" \
--service payment-api \
--severity SEV-2
# Trigger remediation workflow
bash scripts/trigger-workflow.sh run --id remediation-workflow-123 --input '{"service": "payment-api"}'
# Mute non-critical monitors during incident
bash scripts/manage-monitors.sh list --service payment-api | \
jq '.monitors[] | select(.name | contains("non-critical")) | .id' | \
xargs -I {} bash scripts/manage-monitors.sh mute --id {} --duration 2
# Update incident when resolved
bash scripts/manage-incidents.sh update --id abc123 --status resolved
SLO Monitoring Workflow
Track service level objectives:
# Check all SLOs
bash scripts/query-slos.sh
# Alert if error budget exhausted
EXHAUSTED=$(bash scripts/query-slos.sh | jq -r '.summary.budget_exhausted // 0')
if [ "${EXHAUSTED:-0}" -gt 0 ]; then
bash scripts/manage-incidents.sh create \
--title "Error Budget Exhausted" \
--service affected-service \
--severity SEV-3
fi
# Weekly SLO report
bash scripts/query-slos.sh | jq '{
total: .total_slos,
breaching: .summary.breaching,
low_budget: .summary.budget_low,
at_risk: [.slos[] | select(.error_budget_remaining < 20) | {name, budget: .error_budget_remaining}]
}'
Output Format
All scripts return structured JSON for programmatic parsing:
{
"status": "ok|warning|critical|error",
"summary": {
"...": "aggregated metrics"
},
"data": [...],
"recommendations": [...]
}
Status messages go to stderr, JSON to stdout. This allows:
# Silent execution, capture JSON
bash scripts/query-apm.sh --service my-service --duration 1h 2>/dev/null | jq '.summary'
# Log messages only
bash scripts/query-apm.sh --service my-service --duration 1h >/dev/null
# Both
bash scripts/query-apm.sh --service my-service --duration 1h
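Wrapper scripts that call these tools can follow the same convention. The sketch below shows the pattern in isolation: `log` and `emit_report` are hypothetical names, and the JSON is a placeholder in the shape described above rather than real output.

```bash
# Progress messages go to stderr so they never mix with the JSON.
log() {
  echo "[info] $*" >&2
}

# Hypothetical report function following the stdout/stderr split.
emit_report() {
  log "querying..."
  printf '{"status":"ok","summary":{},"data":[],"recommendations":[]}\n'
  log "done"
}
```

Because only the JSON reaches stdout, `emit_report 2>/dev/null` can be piped straight into jq, while `emit_report >/dev/null` shows just the progress messages.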
Best Practices
Query Optimization
- Use specific time ranges to reduce API calls
- Filter by service/environment early
- Paginate large result sets
- Cache results when appropriate
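Caching can be as simple as storing a command's output on disk for a short time. This is one possible sketch, not a feature of the scripts: `cached_run` and the `CACHE_DIR` variable are hypothetical, and entries are keyed by a checksum of the full command line.

```bash
# Directory for cached output; override CACHE_DIR to relocate it.
CACHE_DIR="${CACHE_DIR:-${TMPDIR:-/tmp}/dd-cache}"

# Run a command, reusing its stdout if a copy younger than ttl seconds exists.
# Usage: cached_run 300 bash scripts/query-slos.sh --service payment-api
cached_run() {
  local ttl="$1"
  shift
  local key file now stamp
  key=$(printf '%s\0' "$@" | cksum | cut -d ' ' -f 1)
  file="$CACHE_DIR/$key.json"
  now=$(date +%s)
  mkdir -p "$CACHE_DIR"
  if [ -f "$file" ] && [ -f "$file.ts" ]; then
    stamp=$(cat "$file.ts")
    if [ $(( now - stamp )) -lt "$ttl" ]; then
      cat "$file"
      return 0
    fi
  fi
  # Cache miss or stale entry: run the command and keep its stdout.
  if "$@" > "$file.tmp"; then
    mv "$file.tmp" "$file"
    printf '%s\n' "$now" > "$file.ts"
    cat "$file"
  else
    rm -f "$file.tmp"
    return 1
  fi
}
```

Failed runs are never cached, so a transient API error does not get replayed until the entry expires. This suits read-only queries; commands that create or change resources should not be wrapped.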
Alert Investigation
- Start with Watchdog anomalies (automated detection)
- Correlate security signals with application errors
- Check metrics for trend confirmation
- Search logs for detailed context
Cost Control
- Run analyze-usage-cost.sh monthly
- Implement high-priority recommendations first
- Monitor sampling rates for high-volume services
- Track custom metric growth
Security Monitoring
- Query security signals daily (automated check)
- Filter by critical severity for alerting
- Correlate with log patterns
- Track attack trends over time
Limitations
- API rate limits apply (varies by endpoint)
- Historical data retention depends on Datadog plan
- Recently ingested data may not appear in query results immediately (eventual consistency)
- Requires live Datadog data (APM, logs, security monitoring)
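One common way to live with rate limits is to retry with an increasing delay. The sketch below is an assumption about how a caller might wrap the scripts, not built-in behavior: `retry_with_backoff` is a hypothetical helper, and it retries on any non-zero exit code because it cannot see the underlying HTTP status.

```bash
# Run a command up to max_attempts times, doubling the delay each time.
# Usage: retry_with_backoff 5 1 bash scripts/query-apm.sh --service my-service
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Retrying is safest for read-only queries. Wrapping a create command this way could produce duplicates if the first attempt succeeded on the server but failed on the client.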
Notes
This skill provides Datadog automation in two directions: it queries live data to investigate issues, and it creates infrastructure (monitors, dashboards, incidents) for ongoing operations. It does not handle installation or initial setup; use the Datadog documentation for that.
Investigation: Query APM, logs, metrics, security signals, SLOs, costs, and LLM usage to debug production issues.
Automation: Create monitors, generate dashboards, trigger workflows, manage incidents, and mute alerts during maintenance.
All scripts return structured JSON for integration with CI/CD pipelines, ChatOps workflows, and automation platforms.