# Design Datadog Dashboard

Skill: `datadog-design-dashboard`
Design a dashboard layout that tells a clear story — from high-level health signals down to granular diagnostics — using proper widget types, group organization, and template variables for reusability.
**Important:** Always check for existing dashboards first with `pup dashboards list --agent`. Do not create a new dashboard if one already exists for the same service or purpose — update the existing one instead. Only create a new dashboard when no relevant one exists or the user explicitly asks for a new one.
**Philosophy:** The frameworks, layouts, and widget guides in this skill are starting points — not rigid rules. Every product and business is different. Understand the domain first, then adapt the frameworks to fit. The best dashboards reflect how the business actually works, not how a generic template says they should.
## Interview
First, determine the mode:
- **Design mode** — user wants to create or redesign a dashboard → ask design questions, then run the full workflow
- **Audit mode** — user wants to review an existing dashboard with no intent to redesign → skip to Audit
### Design Questions
Skip if ALL of these are already specified: dashboard purpose, target audience, data sources, template variable needs, dashboard strategy.
- Purpose — "What is this dashboard for? Service overview, infrastructure, executive KPIs, debugging, or SLO tracking?"
- Audience — "Who will use this? On-call engineers, platform team, leadership, or mixed?"
- Data Sources — "Which Datadog products are involved? Metrics only, APM + Metrics, Logs + Metrics, or full stack?"
- Scope — "Is this for a single service, a group of services, or infrastructure-wide?"
- Dashboard Strategy — "One dashboard per service, or a consolidated view?" — share the trade-offs from references/layouts.md to help them decide. If unsure, ask: "During an outage, does your team investigate one service at a time, or do they need to see all services simultaneously?"
- Existing Dashboard — "Is there an existing dashboard to audit or redesign?" If yes, fetch it with `pup dashboards get <id> --agent` before designing.
### Audit Questions
Skip if ALL of these are already specified: dashboard ID or URL, service name or team context.
Always interview if no dashboard ID is provided or multiple dashboards may be relevant.
- Dashboard — "Which dashboard should I review? Provide a dashboard ID, URL, or service name to search for."
- Business Context — "Can you tell me what this service does for customers? Are there codebases or docs I can read to understand the product?"
  - Impact: understanding the domain lets the review focus on whether the right metrics are being tracked, not just whether generic rules are followed.
- Focus — "Is there anything specific you want me to focus on? (A) Full review, (B) Alert thresholds only, (C) Business section, (D) Layout and readability"
  - Default to full review if unspecified.
## Domain Discovery
Applies to design mode. Skip if auditing only.
Before designing, understand what you are building observability for. The metrics that matter depend entirely on the product and business context.
Ask the user:
- "Can you tell me about the product and what this service does for the business? What does a customer experience when they interact with it?"
- "What does a bad day look like for this service? What breaks, and how do customers feel it?"
- "Are there codebases, architecture docs, or README files I can read to understand the service and its dependencies?"
**If the user points you to a codebase:** Read it. Look at the entry points, the API routes, the database models, the queue consumers, the external service calls. Understanding the code gives you the context to choose metrics that actually matter — not just generic RED/USE signals.
**If the user describes the business:** Use that context to tailor the Business (B) section. An e-commerce service cares about checkout success rates. A messaging service cares about delivery latency. A payment service cares about transaction completion. Generic "request rate" and "error rate" are a starting point, but the real value comes from metrics that map to customer-visible outcomes.
**Skip domain discovery if:** You already have deep context about the service from prior conversations, or the user has provided detailed specifications.
**Gate:** Before designing the Business group, you must be able to name at least 3 domain outcomes specific to this service — in plain language a product manager would recognize. Examples: "order placed", "payment completed", "message delivered". If you cannot name them, ask the user before proceeding. Do not substitute transport-layer metrics (gRPC error rate, HTTP request rate) as placeholders — those are P, not B. See the B trap in `references/widgets.md`.
## Design
Skip to Audit if the user only wants to review an existing dashboard.
### 1. Gather existing context

```bash
pup dashboards list --agent
pup dashboards get <dashboard-id> --agent
```
If auditing an existing dashboard, fetch its definition first and analyze its current structure before redesigning.
### 2. Explore available telemetry
Before designing widgets, check what metrics and tag values actually exist for the service. This prevents designing around metrics that don't exist or using the wrong tag values in queries.
```bash
# See what metrics are available for the service
pup metrics list --filter="<service-name>.*" --agent

# Verify the service tag is active and see what metrics are flowing
pup metrics list --filter="trace.*" --tag-filter="service:<service-name>" --agent
```
Use the actual metric names and tag values you find here when writing widget queries — do not guess or invent them. If a metric you expect does not appear, flag it to the user before building widgets around it.
This applies to all query types: metric queries, APM span filters (`operation_name`, `resource_name`, span tags), and log filters. The **Configuration** sections in `references/widgets.md` describe JSON structure and field constraints only — they are not prescriptive queries. Always verify the actual filter values with `pup` before using them.
### 3. Choose a framework

Match the dashboard purpose to a framework. Read `references/frameworks.md` for detailed metric mappings and group structures.
| Purpose | Framework |
|---|---|
| Service overview | RED (Rate, Errors, Duration) |
| Infrastructure | USE (Utilization, Saturation, Errors) |
| Executive/business | Golden Signals |
| SLO tracking | SLI/SLO |
| Debugging | Drill-down |
### 4. Design the layout

Using your domain understanding and the chosen framework, design the group structure and select widgets. Read these references before designing:
- `references/layouts.md` — Template variable conventions, group structure patterns, dashboard strategy trade-offs, grid sizing, anti-patterns
- `references/widgets.md` — Widget selection guide, display options, sizing, naming conventions
- `references/thresholds.md` — Alert threshold markers, threshold proximity, Y-axis configuration
### 5. Write the design output

Present the design using this template:

```markdown
# Dashboard Design: [Dashboard Title]

## Purpose
[1-2 sentences: what this monitors, who uses it]

## Template Variables
| Variable | Tag | Default |
|----------|-----|---------|
| ... | ... | `*` |

## Layout
### Group: [Group Title]
| Widget | Type | Query/Metric | Width | Alert Threshold |
|--------|------|-------------|-------|----------------|
| ... | ... | ... | ... | ... |

[Repeat for each group]
```
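If it helps to see where the design lands, here is a minimal sketch of how a group and its template variables map onto Datadog's dashboard JSON. Field names assume the public dashboard API schema (verify against your API version), and the service and metric names (`checkout-service`, `shop.checkout.success`) are hypothetical placeholders — use the real names verified in step 2:

```json
{
  "title": "checkout-service Overview",
  "layout_type": "ordered",
  "template_variables": [
    { "name": "env", "prefix": "env", "default": "production" },
    { "name": "service", "prefix": "service", "default": "checkout-service" }
  ],
  "widgets": [
    {
      "definition": {
        "type": "group",
        "layout_type": "ordered",
        "title": "Business",
        "widgets": [
          {
            "definition": {
              "type": "timeseries",
              "title": "B0: Checkout success rate",
              "requests": [
                { "q": "sum:shop.checkout.success{$env,$service}.as_count()", "display_type": "line" }
              ]
            }
          }
        ]
      }
    }
  ]
}
```

Note how the template variables from the design table become `template_variables` entries and every widget query scopes by `$env` and `$service` rather than hardcoded tag values.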
## Audit
Applies to both modes. Run after design, or directly if auditing an existing dashboard.
The core principles are: graphs should earn their place with alert thresholds, thresholds should sit close to normal traffic, a business section should exist at the top, and the dashboard should be readable by someone with zero service knowledge.
These are guiding principles — not a rigid checklist. Apply judgment based on the product and business context. A context-providing metric (like deployment events) may earn its place without a threshold. A service with unusual traffic patterns may need different proximity rules.
### 1. Fetch the dashboard

```bash
# If given a service name, list all dashboards and identify the relevant one by title
pup dashboards list --agent

# If given a URL, extract the dashboard ID from the path (e.g., /dashboard/abc-def-ghi/...)

# Get the full dashboard definition (includes the dashboard URL in the response)
pup dashboards get <dashboard-id> --agent

# Verify real metric names exist
pup metrics list --filter="trace.http.request.*" --agent
```
Parse the response to build an inventory of all widgets, groups, and their configurations.
### 2. Build widget inventory

Read `references/widgets.md` for the full widget prefix system before cataloging.
Catalog every widget in the dashboard:
| Widget Title | Prefix | Type | Group | Has Alert Threshold | Threshold Value | Notes |
|---|---|---|---|---|---|---|
| ... | I0/P1/D0/B0/— | ... | ... | ... | ... | ... |
Focus on timeseries and query value widgets — these are the primary candidates for alert threshold markers.
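Each inventory column maps to a field in the JSON returned by `pup dashboards get`. A minimal sketch of where to look, assuming the standard Datadog dashboard schema (the widget itself is illustrative):

```json
{
  "definition": {
    "type": "group",
    "title": "Business",
    "widgets": [
      {
        "definition": {
          "type": "timeseries",
          "title": "B0: Checkout success rate",
          "markers": [
            { "value": "y < 99", "display_type": "error dashed" }
          ]
        }
      }
    ]
  }
}
```

The prefix comes from the widget `title`, the group from the enclosing group widget's `title`, and the two threshold columns from the presence and `value` of `markers`.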
### 3. Audit alert thresholds

Read `references/thresholds.md` for threshold marker principles, configuration details, and findings format.
For each timeseries widget, check:
- Does it have a marker/threshold line configured?
- Is the marker colored red for visibility?
- Does the threshold correspond to an actual monitor/alert?
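When a marker is missing, the usual fix is a `markers` entry on the timeseries definition. A minimal sketch, assuming the standard dashboard JSON schema — the query and the 5% threshold are illustrative, and the threshold should mirror the real monitor's condition (`"error dashed"` renders as a red dashed line):

```json
{
  "definition": {
    "type": "timeseries",
    "title": "P0: Error rate",
    "requests": [
      { "q": "sum:trace.http.request.errors{$env,$service}.as_rate()", "display_type": "line" }
    ],
    "markers": [
      { "value": "y > 5", "display_type": "error dashed", "label": "Alert threshold (matches monitor)" }
    ]
  }
}
```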
### 4. Audit threshold proximity

Read `references/thresholds.md` for proximity guidance, Y-axis configuration rules, and findings format.
For each widget with a threshold, check:
- What is the typical (normal) value range?
- Where is the threshold set?
- Is there excessive whitespace between the normal line and the alert line?
- Is the Y-axis auto-scaled or explicitly set?
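Where auto-scaling leaves a large gap between the normal line and the alert line, pinning the Y-axis pulls them together. A sketch assuming a metric that normally sits around 1-2% with a 5% alert line (all values illustrative; `min`/`max` are strings in this schema):

```json
{
  "definition": {
    "type": "timeseries",
    "title": "P0: Error rate",
    "markers": [
      { "value": "y > 5", "display_type": "error dashed" }
    ],
    "yaxis": { "min": "0", "max": "8", "include_zero": true }
  }
}
```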
### 5. Audit business section

**Principle:** A dedicated Business (B) group should exist at the top of the dashboard with 5-8 key metrics for immediate outage identification. Business metrics are customer-visible outcomes — not infrastructure or domain internals. The specific metrics should reflect the product's business transactions, not generic traffic and error rates.
Check:
- Does a Business group exist (named "Business", "B", or equivalent)?
- Is it the first group on the dashboard?
- Do its widgets use the `B0-N:` prefix?
- Does it contain 5-8 metrics covering: customer-visible success rates, key transaction flows, and SLA-impacting latency?
- Can someone determine "are customers affected?" within 5 seconds of opening the dashboard?
- B trap check: For each B-prefixed widget, ask "Can a product manager interpret this without knowing the transport protocol?" If no — gRPC error rate, HTTP request rate, queue depth — it is `P`, not `B`, regardless of where it is placed. Flag and recommend moving to the appropriate platform group.
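As a concrete illustration of the trap, compare these two widgets (both metric names are hypothetical). Either could be placed at the top of a dashboard, but only the first is a business metric; the second is transport-layer and belongs in a platform group:

```json
[
  {
    "definition": {
      "type": "timeseries",
      "title": "B0: Orders completed",
      "requests": [{ "q": "sum:shop.orders.completed{$env}.as_count()" }]
    }
  },
  {
    "definition": {
      "type": "timeseries",
      "title": "P1: gRPC error rate",
      "requests": [{ "q": "sum:grpc.server.errors{$env}.as_rate()" }]
    }
  }
]
```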
Findings format:

```markdown
#### Business Section Audit
**Status**: MISSING / INCOMPLETE / OK
**Current state**: [Description of what exists]
**Recommended metrics** (if missing or incomplete):
1. B0: Key transaction success rate (are critical flows completing?)
2. B0: Customer-facing error rate (are requests failing for customers?)
3. B1: API p99 latency (are responses slow for customers?)
4. B1: Total request rate (are we receiving traffic?)
5. B2: Queue depth or processing lag (is async work backing up?)
6. B2: Key business event throughput (e.g. orders created, payments processed)
```
### 6. Apply zero-knowledge viewer test

**Principle:** Someone with zero knowledge of the service should be able to spot problems by looking for red indicators.
Evaluate:
- Can you identify a problem in under 10 seconds without reading widget titles?
- Are thresholds visible as red lines on every graph?
- Is conditional formatting applied to query value widgets (green/yellow/red)?
- Are group names self-explanatory?
- Is there a note widget with runbook links or team ownership?
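For query value widgets that fail the conditional-formatting check, the fix is a `conditional_formats` block on the request. A minimal sketch, assuming the standard schema — the metric and thresholds are illustrative (rules are evaluated in order, first match wins):

```json
{
  "definition": {
    "type": "query_value",
    "title": "B0: Orders per minute",
    "requests": [
      {
        "q": "sum:shop.orders.created{$env}.as_rate()",
        "conditional_formats": [
          { "comparator": ">", "value": 100, "palette": "white_on_green" },
          { "comparator": ">", "value": 50, "palette": "white_on_yellow" },
          { "comparator": "<=", "value": 50, "palette": "white_on_red" }
        ]
      }
    ],
    "precision": 0
  }
}
```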
Findings format:

```markdown
#### Zero-Knowledge Readability Audit
| Check | Status | Finding |
|-------|--------|---------|
| Problems visible in <10s | FAIL | No red lines on 8 of 12 graphs |
| Conditional formatting on QV widgets | PARTIAL | 2 of 4 QV widgets have thresholds |
| Group names self-explanatory | OK | All groups use clear names |
| Runbook/ownership note | MISSING | No note widget with team info |
```
### 7. Generate audit report

Compile all findings into a structured report:

```markdown
# Dashboard Audit: [Dashboard Title]

**Dashboard ID**: [id]
**URL**: [url]
**Review date**: [date]

## Summary
[2-3 sentence summary: overall health of the dashboard, critical issues count]

## Critical Issues
[List issues that must be fixed before the dashboard is production-ready]

## Alert Threshold Audit
[From step 3]

## Threshold Proximity Audit
[From step 4]

## Business Section Audit
[From step 5]

## Zero-Knowledge Readability Audit
[From step 6]

## Recommended Actions
### Must Fix
1. [Action item with specific widget and group reference]

### Should Fix
1. [Action item]

### Nice to Have
1. [Action item]
```
## Quality Principles

- Widget queries use real metric names verified via `pup metrics list --agent` — no invented metric names
- Dashboard reflects the actual product and business — metrics tailored to the domain
- Dashboard title follows the `[service] Purpose` pattern — no "Dashboard" suffix, no environment in the title
- `title` field updated in the JSON (not just the filename) — redeploy after any title change
- Template variables match the dashboard type — see `references/layouts.md`
- Widget queries use template variable scopes verified via `pup metrics list --agent` — no hardcoded env, service, or host values; use the variable set appropriate for the dashboard type (see `references/layouts.md`)
- Business group with 5-8 `B`-prefixed metrics tailored to the service's customer-visible outcomes
- Groups ordered macro-to-micro (business → overview → details)
- Every widget title uses the layer-priority prefix (`I0:`, `P1:`, `D0:`, `B0:`, etc.) — see `references/widgets.md`
- Widget titles use sentence case and don't repeat the group name
- Timeseries widgets have alert threshold markers (red lines) where the metric is alertable — see `references/thresholds.md`
- Thresholds sit close to normal traffic — no excessive whitespace
- Zero-knowledge readability — someone with no service knowledge can spot problems via red indicators
- Query Value widgets have conditional formatting (green/yellow/red)
- Every metric earns its place — if it spikes, someone can act on it
- All audit findings include specific widget names and group references
- Recommended actions categorized by priority (must/should/nice-to-have)
- Dashboard URL included in audit report
## References

- Observability Frameworks (`references/frameworks.md`) — RED, USE, Golden Signals, SLI/SLO with metric mappings
- Layout & Structure (`references/layouts.md`) — Template variables, group patterns, dashboard strategy, grid sizing, anti-patterns
- Widgets (`references/widgets.md`) — Widget prefix system, types, display options, sizing, naming conventions
- Alert Thresholds (`references/thresholds.md`) — Threshold markers, proximity guide, Y-axis configuration