dt-obs-tracing
Application Tracing Skill
Overview
Distributed traces in Dynatrace consist of spans, the building blocks that represent units of work. With Traces in Grail, every span is accessible via DQL with full-text searchability on all attributes. This skill covers trace fundamentals, common analysis patterns, and span-type specific queries.
Core Concepts
Understanding Traces and Spans
Spans represent logical units of work in distributed traces:
- HTTP requests, RPC calls, database operations
- Messaging system interactions
- Internal function invocations
- Custom instrumentation points
Span kinds:
- span.kind: server - Incoming call to a service
- span.kind: client - Outgoing call from a service
- span.kind: consumer - Incoming message consumption call to a service
- span.kind: producer - Outgoing message production call from a service
- span.kind: internal - Internal operation within a service
Root spans: A request root span (request.is_root_span == true) represents an incoming call to a service. Use this to analyze end-to-end request performance.
Key Trace Attributes
Essential attributes for trace analysis:
| Attribute | Description |
|---|---|
| trace.id | Unique trace identifier |
| span.id | Unique span identifier |
| span.parent_id | Parent span ID (null for root spans) |
| request.is_root_span | Boolean, true for request entry points |
| request.is_failed | Boolean, true if request failed |
| duration | Span duration in nanoseconds |
| span.timing.cpu | Overall CPU time of the span (stable) |
| span.timing.cpu_self | CPU time excluding child spans (stable) |
| dt.smartscape.service | Service Smartscape node ID |
| dt.service.name | Dynatrace service name derived from service detection rules; equal to the Smartscape service node name |
| endpoint.name | Endpoint/route name |
Service Context
Spans reference services via the Smartscape node ID (dt.smartscape.service) and the detected service name (dt.service.name), both of which are present on every span.
fetch spans
| summarize spans=count(), by: { dt.smartscape.service, dt.service.name }
Sampling and Extrapolation
One span can represent multiple real operations due to:
- Aggregation: Multiple operations in one span (aggregation.count)
- ATM (Adaptive Traffic Management): Head-based sampling by agent
- ALR (Adaptive Load Reduction): Server-side sampling
- Read Sampling: Query-time sampling via the samplingRatio parameter
When to extrapolate: Always extrapolate when counting actual operations (not just spans). Use the multiplicity factor:
fetch spans
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1 / sampling.probability
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
* coalesce(aggregation.count, 1)
* dt.system.sampling_ratio
| summarize operation_count = sum(multiplicity)
📘 Learn more: See Sampling and Extrapolation for detailed formulas and examples.
Common Query Patterns
Basic Span Access
Fetch spans and explore by type:
fetch spans | limit 1
Explore spans by function and type:
fetch spans
| summarize count(), by: { span.kind, code.namespace, code.function }
Request Root Filtering
List request root spans (incoming service calls):
fetch spans
| filter request.is_root_span == true
| fields trace.id, span.id, start_time, response_time = duration, endpoint.name
| limit 100
Service Performance Summary
Analyze service performance with error rates:
fetch spans
| filter request.is_root_span == true
| summarize
total_requests = count(),
failed_requests = countIf(request.is_failed == true),
avg_duration = avg(duration),
p95_duration = percentile(duration, 95),
by: {dt.service.name}
| fieldsAdd error_rate = (failed_requests * 100.0) / total_requests
| sort error_rate desc
Trace ID Lookup
Find all spans in a specific trace:
fetch spans
| filter trace.id == toUid("abc123def456")
| fields span.name, duration, dt.service.name
Performance Analysis
Response Time Percentiles
Calculate percentiles by endpoint:
fetch spans
| filter request.is_root_span == true
| summarize {
requests=count(),
avg_duration=avg(duration),
p95=percentile(duration, 95),
p99=percentile(duration, 99)
}, by: { endpoint.name }
| sort p99 desc
💡 Best practice: Use percentiles (p95, p99) over averages for performance insights.
Slow Trace Detection
Find requests exceeding a threshold:
fetch spans, from:now() - 2h
| filter request.is_root_span == true
| filter duration > 5s
| fields trace.id, span.name, dt.service.name, duration
| sort duration desc
| limit 50
Duration Buckets with Exemplars
Group requests into duration buckets with example traces:
fetch spans, from:now() - 24h
| filter http.route == "/api/v1/storage/findByISBN"
| summarize {
spans=count(),
trace=takeAny(record(start_time, trace.id))
}, by: { bin(duration, 10ms) }
| fields `bin(duration, 10ms)`, spans, trace.id=trace[trace.id], start_time=trace[start_time]
Performance Timeseries
Extract response time as timeseries:
fetch spans, from:now() - 24h
| filter request.is_root_span == true
| makeTimeseries {
requests=count(),
avg_duration=avg(duration),
p95=percentile(duration, 95),
p99=percentile(duration, 99)
}, by: { endpoint.name }
📘 Learn more: See Performance Analysis for advanced patterns and timeseries techniques.
Failure Investigation
Failed Request Summary
Summarize failures by service:
fetch spans
| filter request.is_root_span == true
| summarize
total = count(),
failed = countIf(request.is_failed == true),
by: { dt.service.name }
| fieldsAdd failure_rate = (failed * 100.0) / total
| sort failure_rate desc
Failure Reason Analysis
Breakdown by failure detection reason:
fetch spans
| filter request.is_failed == true and isNotNull(dt.failure_detection.results)
| expand dt.failure_detection.results
| summarize count(), by: { dt.failure_detection.results[reason] }
Failure reasons:
- http_code - HTTP response code triggered failure
- grpc_code - gRPC status code triggered failure
- exception - Exception caused failure
- span_status - Span status indicated failure
- custom_rule - Custom failure detection rule matched
HTTP Code Failures
Find failures by HTTP status code:
fetch spans
| filter request.is_failed == true
| filter iAny(dt.failure_detection.results[][reason] == "http_code")
| summarize count(), by: { http.response.status_code, endpoint.name }
| sort `count()` desc
Recent Failed Requests
List recent failures with details:
fetch spans
| filter request.is_root_span == true and request.is_failed == true
| fields
start_time,
trace.id,
endpoint.name,
http.response.status_code,
duration
| sort start_time desc
| limit 100
📘 Learn more: See Failure Detection for exception analysis and custom rule investigation.
Service Dependencies
Service Communication
Analyze incoming and outgoing service communication:
fetch spans, from:now() - 1h
| filter isNotNull(server.address)
| fieldsAdd
remote_side = server.address
| summarize
call_count = count(),
avg_duration = avg(duration),
by: {dt.service.name, remote_side}
| sort call_count desc
Outgoing HTTP Calls
Identify external API dependencies:
fetch spans
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize
calls = count(),
avg_latency = avg(duration),
p99_latency = percentile(duration, 99),
by: { dt.service.name, server.address, server.port }
| sort calls desc
Trace Aggregation
Complete Trace Analysis
Aggregate all spans in a trace to understand full request flow:
fetch spans, from:now() - 30m
| summarize {
spans = count(),
client_spans = countIf(span.kind == "client"),
// Endpoints involved in the trace
endpoints = toString(arrayRemoveNulls(collectDistinct(endpoint.name))),
// Extract the first request root in the trace
trace_root = takeMin(record(
root_detection_helper = coalesce(
if(request.is_root_span, 1),
if(isNull(span.parent_id), 2),
3),
start_time, endpoint.name, duration
))
}, by: { trace.id }
| fieldsFlatten trace_root
| fieldsRemove trace_root.root_detection_helper, trace_root
| fields
start_time = trace_root.start_time,
endpoint = trace_root.endpoint.name,
response_time = trace_root.duration,
spans,
client_spans,
endpoints,
trace.id
| sort start_time
| limit 100
Root detection strategy: Use takeMin(record(...)) with a detection helper to reliably find the root request:
- Priority 1: Spans with request.is_root_span == true
- Priority 2: Spans without a parent (isNull(span.parent_id))
- Priority 3: All other spans
Multi-Service Traces
Find traces spanning multiple services:
fetch spans, from:now() - 1h
| summarize {
services = collectDistinct(dt.service.name),
trace_root = takeMin(record(
root_detection_helper = coalesce(if(request.is_root_span, 1), 2),
endpoint.name
))
}, by: { trace.id }
| fieldsAdd service_count = arraySize(services)
| filter service_count > 1
| fields
endpoint = trace_root[endpoint.name],
service_count,
services = toString(services),
trace.id
| sort service_count desc
| limit 50
Request-Level Analysis
Request Attributes
Access custom request attributes captured by OneAgent on request root spans:
fetch spans
| filter request.is_root_span == true
| filter isNotNull(request_attribute.PaidAmount)
| makeTimeseries sum(request_attribute.PaidAmount)
Field pattern: request_attribute.<name>
For attributes with special characters, use backticks:
fetch spans
| filter isNotNull(`request_attribute.My Customer ID`)
Captured Attributes
Access attributes captured from method parameters (always as arrays):
fetch spans
| filter isNotNull(captured_attribute.BookID_purchased)
| fields trace.id, span.id, code.namespace, code.function, captured_attribute.BookID_purchased
| limit 1
Field pattern: captured_attribute.<name>
Request ID Aggregation
Aggregate all spans belonging to a single request using request.id (OneAgent traces only):
fetch spans
| filter isNotNull(request.id)
| summarize {
spans = count(),
client_spans = countIf(span.kind == "client"),
request_root = takeMin(record(
root_detection_helper = coalesce(if(request.is_root_span, 1), 2),
start_time, endpoint.name, duration
))
}, by: { trace.id, request.id }
| fieldsFlatten request_root
| fields
start_time = request_root.start_time,
endpoint = request_root.endpoint.name,
response_time = request_root.duration,
spans,
client_spans
| limit 100
📘 Learn more: See Request Attributes for complete patterns on request attributes, captured attributes, and request-level aggregation.
Span Types
HTTP Spans
HTTP spans capture web requests and API calls:
Server-side (incoming requests):
fetch spans
| filter span.kind == "server" and isNotNull(http.request.method)
| summarize
requests = count(),
avg_duration = avg(duration),
by: { http.request.method, http.route }
| sort requests desc
Client-side (outgoing calls):
fetch spans
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize
calls = count(),
avg_duration = avg(duration),
by: { server.address, http.request.method }
| sort calls desc
📘 Learn more: See HTTP Span Analysis for status codes, payload analysis, and client IP tracking.
Database Spans
Database operations appear as client spans with db.* attributes:
fetch spans
| filter span.kind == "client" and isNotNull(db.system) and isNotNull(db.namespace)
| summarize {
spans=count(),
avg_duration=avg(duration)
}, by: { dt.service.name, db.system, db.namespace }
| sort spans desc
⚠️ Important: Database spans can be aggregated (one span = multiple calls). Always use extrapolation for accurate counts.
📘 Learn more: See Database Span Analysis for extrapolated counts and slow query detection.
Messaging Spans
Messaging spans capture Kafka, RabbitMQ, SQS operations:
fetch spans
| filter isNotNull(messaging.system)
| summarize
spans = count(),
messages = sum(coalesce(messaging.batch.message_count, 1)),
by: { messaging.system, messaging.destination.name, messaging.operation.type }
| sort messages desc
📘 Learn more: See Messaging Span Analysis for throughput, latency, and failure patterns.
RPC Spans
RPC spans cover gRPC, SOAP, and other RPC frameworks:
fetch spans
| filter isNotNull(rpc.system)
| summarize
calls = count(),
avg_duration = avg(duration),
by: { rpc.system, rpc.service, rpc.method }
| sort calls desc
📘 Learn more: See RPC Span Analysis for gRPC status codes and service dependencies.
Serverless Spans
FaaS spans capture Lambda, Azure Functions, and GCP Cloud Functions:
fetch spans
| filter isNotNull(faas.name) and span.kind == "server"
| summarize
invocations = count(),
avg_duration = avg(duration),
p99_duration = percentile(duration, 99),
by: { faas.name, cloud.provider }
| sort invocations desc
📘 Learn more: See Serverless Span Analysis for cold start analysis and trigger types.
Advanced Topics
Exception Analysis
Exceptions are stored as span.events within spans:
fetch spans
| filter iAny(span.events[][span_event.name] == "exception")
| expand span.events
| fieldsFlatten span.events, fields: { exception.type }
| summarize {
count(),
trace=takeAny(record(start_time, trace.id))
}, by: { exception.type }
| fields exception.type, `count()`, trace.id=trace[trace.id], start_time=trace[start_time]
💡 Tip: Use iAny() to check conditions within span event arrays.
Logs and Traces Correlation
Join logs with traces using trace IDs:
fetch spans, from:now() - 30m
| join [ fetch logs | fieldsAdd trace.id = toUid(trace_id) ]
, on: { trace.id }
, fields: { content, loglevel }
| fields start_time, trace.id, span.id, loglevel, content
| limit 100
📘 Learn more: See Logs Correlation for filtering traces by log content and finding logs for failed requests.
Network Analysis
Analyze IP addresses, DNS resolution, and client geography:
fetch spans, from:now() - 24h
| filter isNotNull(client.ip)
| fieldsAdd client.ip = toIp(client.ip)
| fieldsAdd client.subnet = ipMask(client.ip, 24)
| summarize {
requests=count(),
unique_clients=countDistinct(client.ip)
}, by: { client.subnet, endpoint.name }
| sort requests desc
📘 Learn more: See Network Analysis for server address resolution and communication mapping.
Best Practices
Query Optimization
- Filter early: Apply request.is_root_span == true and endpoint filters first
- Use samplingRatio: Reduce data volume for better performance (e.g., samplingRatio: 100 reads 1%)
- Limit results: Always use limit for exploratory queries
- Percentiles over averages: Use p95/p99 for performance insights
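The optimization tips above combined into one exploratory sketch; the endpoint name is a placeholder, and samplingRatio trades accuracy for speed, so drop it once you narrow the query:

```dql
fetch spans, from: now() - 2h, samplingRatio: 100
// Filter early: root spans for one endpoint only
| filter request.is_root_span == true
| filter endpoint.name == "/api/checkout"
| fields trace.id, start_time, duration
| sort duration desc
| limit 20
```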
Node Lookups
- Use getNodeName(): Simplest way to add service names
- Prefer subqueries: Use Smartscape node filters and traverse for filtering
- Cache node info: Store node lookups in fields for reuse
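A minimal sketch of the first tip, assuming the getNodeName() lookup referenced above is available in your environment:

```dql
fetch spans, from: now() - 1h
// Resolve the Smartscape node ID to a readable service name
| fieldsAdd service_name = getNodeName(dt.smartscape.service)
| summarize spans = count(), by: { service_name }
| sort spans desc
```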
Aggregation Patterns
- Request roots: Use request.is_root_span == true for end-to-end analysis
- Trace-level: Group by trace.id for complete trace metrics
- Request-level: Group by request.id for request metrics (OneAgent traces only)
- Always extrapolate: Use multiplicity for accurate operation counts
Trace Exemplars
Include example traces for drilldown:
| summarize {
count(),
trace=takeAny(record(start_time, trace.id))
}, by: { grouping_field }
| fields ..., trace.id=trace[trace.id], start_time=trace[start_time]
This enables "Open With" functionality in Dynatrace UI.
References
Detailed documentation for specific topics:
- Performance Analysis - Advanced timeseries, duration buckets, endpoint ranking
- Failure Detection - Failure reasons, exception investigation, custom rules
- Sampling and Extrapolation - Multiplicity calculation, database extrapolation
- Request Attributes - Request attributes, captured attributes, request ID aggregation
- Entity Lookups - Advanced node lookups, infrastructure correlation, hardware analysis
- HTTP Span Analysis - Status codes, payload analysis, client IPs
- Database Span Analysis - Extrapolated counts, slow queries, statement analysis
- Messaging Span Analysis - Kafka, RabbitMQ, SQS throughput and latency
- RPC Span Analysis - gRPC, SOAP, service dependencies
- Serverless Span Analysis - Lambda, Azure Functions, cold start analysis
- Logs Correlation - Joining logs and traces, correlation patterns
- Network Analysis - IP addresses, DNS resolution, communication mapping
Related Skills
- dt-dql-essentials - Core DQL syntax for querying trace data
- dt-app-dashboards - Embed trace queries in dashboards
- dt-migration - Smartscape entity model and relationship navigation