Grafana Tempo - Distributed Tracing Backend

Grafana Tempo is an open-source, high-scale distributed tracing backend. It is:

Cost-efficient: only requires object storage (S3, GCS, Azure) to operate
Deeply integrated: with Grafana, Mimir, Prometheus, Loki, and Pyroscope
Protocol-agnostic: accepts OTLP, Jaeger, Zipkin, OpenCensus, Kafka

Quick Reference Links

TraceQL Language Reference - query syntax, operators, examples, metrics functions
Configuration Reference - all YAML config blocks with defaults
Architecture and Operations - components, deployment, tuning
Metrics from Traces - span metrics, service graphs, TraceQL metrics
API Reference - HTTP endpoints, ingestion, search, metrics queries

What is Distributed Tracing?

A trace represents the lifecycle of a request as it passes through multiple services. It consists of:

Spans: Individual units of work with start time, duration, attributes, and status
Trace ID: Shared identifier across all spans in a request
Parent-child relationships: Spans form a tree showing causality

Traces enable:

Root cause analysis for service outages
Understanding service dependencies
Identifying latency bottlenecks
Correlating events across microservices

Architecture Overview

Applications
    |
    | (OTLP 4317/4318, Jaeger 14250/14268, Zipkin 9411)
    v
[Distributor]  ----  hashes traceID, routes to N ingesters
    |
    |---> [Ingester]  (WAL + Parquet block assembly, flush to object store)
    |
    |---> [Metrics Generator]  (optional: derives RED metrics -> Prometheus)
    
Query path:
Grafana  -->  [Query Frontend]  (shards queries)
                    |
              [Querier pool]
              /           \
    [Ingesters]     [Object Storage]
    (recent)        (historical blocks)

Core Components

Component	Role	Default Ports
Distributor	Receives spans, routes by traceID hash	4317 (gRPC), 4318 (HTTP)
Ingester	Buffers in memory, flushes to storage	-
Query Frontend	Query orchestrator, shards across queriers	3200 (HTTP)
Querier	Executes search jobs against storage	-
Compactor	Merges blocks, enforces retention	-
Metrics Generator	Derives RED metrics from spans	-

TraceQL - The Query Language

TraceQL queries filter traces by span properties. Structure: { filters } | pipeline

Attribute Scopes

span.http.status_code        # span-level attribute
resource.service.name        # resource attribute (from SDK)
name                         # intrinsic: span operation name
status                       # intrinsic: ok | error | unset
duration                     # intrinsic: span duration
kind                         # intrinsic: server | client | producer | consumer | internal
traceDuration                # intrinsic: entire trace duration
rootServiceName              # intrinsic: service of the root span
rootName                     # intrinsic: operation name of the root span

Operators

=   !=   >   <   >=   <=      # comparison
=~  !~                         # regex match (Go RE2)
&&  ||  !                      # logical

Essential Examples

# All errors
{ status = error }

# Slow requests from a service
{ resource.service.name = "frontend" && duration > 1s }

# HTTP 5xx errors
{ span.http.status_code >= 500 }

# Count errors per trace (more than 2)
{ status = error } | count() >= 2

# Group by service
{ status = error } | by(resource.service.name)

# P99 latency grouping
{ kind = server } | avg(duration) by(resource.service.name)

# Select specific fields
{ status = error } | select(span.http.url, duration, resource.service.name)

# Structural: server span with downstream error
{ kind = server } >> { status = error }

# Both conditions present (any relationship)
{ span.db.system = "redis" } && { span.db.system = "postgresql" }

# Find most recent (deterministic)
{ resource.service.name = "api" } with (most_recent=true)

TraceQL Metrics

# Error rate per service
{ status = error } | rate() by (resource.service.name)

# P99 latency
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name)

# With exemplars
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name) with (exemplars=true)

Deployment

Quick Start (Docker Compose)

git clone https://github.com/grafana/tempo.git
cd tempo/example/docker-compose/local
mkdir tempo-data
docker compose up -d
# Grafana at http://localhost:3000, Tempo API at http://localhost:3200

Minimal Single-Node Config

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  lifecycler:
    ring:
      replication_factor: 1

compactor:
  compaction:
    block_retention: 336h    # 14 days

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

memberlist:
  abort_if_cluster_join_fails: false
  join_members: []

Production (S3 + Microservices)

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
      # Use IRSA/IAM roles (preferred over access keys)

compactor:
  compaction:
    block_retention: 336h    # Override per-tenant in overrides section

memberlist:
  join_members:
    - tempo-1:7946
    - tempo-2:7946
    - tempo-3:7946

ingester:
  lifecycler:
    ring:
      replication_factor: 3

Kubernetes (Helm)

helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo-distributed \
  --set storage.trace.backend=s3 \
  --set storage.trace.s3.bucket=my-tempo-bucket \
  --set storage.trace.s3.region=us-east-1

Sending Traces to Tempo

Via Grafana Alloy (Recommended)

// alloy.river
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}

Via OpenTelemetry Collector

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
    # For multi-tenancy:
    headers:
      x-scope-orgid: my-tenant

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

Direct HTTP (OTLP)

curl -X POST -H 'Content-Type: application/json' \
  http://localhost:4318/v1/traces \
  -d '{"resourceSpans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-service"}}]}, "scopeSpans": [{"spans": [{"traceId": "5B8EFFF798038103D269B633813FC700", "spanId": "EEE19B7EC3C1B100", "name": "my-op", "startTimeUnixNano": 1689969302000000000, "endTimeUnixNano": 1689969302500000000, "kind": 2}]}]}]}'

Metrics from Traces

Enable Metrics Generator

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]

Processor Types

Service Graphs: Visualizes service topology and latency

Output: traces_service_graph_request_total, traces_service_graph_request_failed_total, duration histograms

Span Metrics: RED metrics per span

Output: traces_spanmetrics_calls_total, traces_spanmetrics_duration_seconds_*
Labels: service, span_name, span_kind, status_code + custom dimensions

Local Blocks: Enables TraceQL metrics queries on recent data

Multi-Tenancy

# Enable in Tempo config
multitenancy_enabled: true

All requests require X-Scope-OrgID header.

# OpenTelemetry Collector
exporters:
  otlp:
    headers:
      x-scope-orgid: tenant-id

# Grafana datasource
jsonData:
  httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
  httpHeaderValue1: "tenant-id"

Grafana Integration

Data Source Configuration

datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      # Link traces to logs
      tracesToLogsV2:
        datasourceUid: loki-uid
        filterByTraceID: true
        tags: [{key: "service.name", value: "app"}]

      # Link traces to metrics
      tracesToMetrics:
        datasourceUid: prometheus-uid
        tags: [{key: "service.name", value: "service"}]
        queries:
          - name: Error Rate
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags, status_code="STATUS_CODE_ERROR"}[5m]))'

      # Link traces to profiles (Pyroscope)
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{key: "service.name", value: "service_name"}]

      # Service map from span metrics
      serviceMap:
        datasourceUid: prometheus-uid

Key Grafana Features

Explore > Tempo: Search by TraceQL, trace ID, or tag filters
Service Graph tab: Visual service topology with RED metrics
Traces Drilldown: /a/grafana-exploretraces-app - no TraceQL required
Exemplars: Click metric spike -> jump directly to responsible trace
Derived fields in Loki: Click trace ID in log -> jump to trace in Tempo

API Quick Reference

# Search traces
GET /api/search?q={status=error}&limit=20&start=<unix>&end=<unix>

# Get trace by ID
GET /api/traces/<traceID>
GET /api/v2/traces/<traceID>

# List all tag names
GET /api/search/tags

# Get values for a tag
GET /api/search/tag/service.name/values

# TraceQL metrics (time series)
GET /api/metrics/query_range?q={status=error}|rate()&start=...&end=...&step=60

# Health check
GET /ready

Performance Tuning Summary

Problem	Solution
Slow searches	Scale queriers horizontally; scale compactors to reduce block count
High memory on queriers	Reduce `max_concurrent_queries`; lower `target_bytes_per_job`
High memory on ingesters	Reduce `max_block_bytes`; lower per-tenant trace limits
Slow attribute queries	Add dedicated Parquet columns for frequent attributes
Cache miss rate high	Increase cache size; tune `cache_min_compaction_level`
Rate limited (429)	Raise `max_outstanding_per_tenant` or increase per-tenant ingestion limits
Memcached connection errors	Increase memcached connection limit (`-c 4096`)

Best Practices

Instrumentation

Follow OpenTelemetry semantic conventions for attribute names
Use span. prefix for span attributes, resource. for process context
Keep attributes meaningful - avoid metrics/logs as span attributes
Limit attributes to max ~128 per span (OTel default)
Use span linking for batch processing (instead of huge fan-out traces)
Create spans for: external calls, significant loops, operations with variable latency
Avoid creating spans for every function call

Deployment

Use replication factor 3 for production HA
Object storage required for distributed deployments (not local)
Enable dedicated attribute columns for your most-queried attributes
Set appropriate block retention per tenant via overrides
Monitor tempo_ingester_live_traces to detect memory pressure early

Querying

Use time bounds (start/end) to limit search scope
Use structural operators for root cause analysis patterns
Prefer attribute != nil for existence checks
Use with (most_recent=true) when you need deterministic recent results
Scope tag discovery with a TraceQL query to reduce noise

Ports Reference

Port	Protocol	Purpose
3200	HTTP	Tempo API (queries, search, health)
9095	gRPC	Internal component communication
4317	gRPC	OTLP trace ingestion
4318	HTTP	OTLP trace ingestion
14268	HTTP	Jaeger Thrift HTTP ingestion
14250	gRPC	Jaeger gRPC ingestion
6831	UDP	Jaeger Thrift Compact
6832	UDP	Jaeger Thrift Binary
9411	HTTP	Zipkin ingestion
7946	TCP/UDP	Memberlist gossip

tempo

Grafana Tempo - Distributed Tracing Backend

Quick Reference Links

What is Distributed Tracing?

Architecture Overview

Core Components

TraceQL - The Query Language

Attribute Scopes

Operators

Essential Examples

TraceQL Metrics

Deployment

Quick Start (Docker Compose)

Minimal Single-Node Config

Production (S3 + Microservices)

Kubernetes (Helm)

Sending Traces to Tempo

Via Grafana Alloy (Recommended)

Via OpenTelemetry Collector

Direct HTTP (OTLP)

Metrics from Traces

Enable Metrics Generator

Processor Types

Multi-Tenancy

Grafana Integration

Data Source Configuration

Key Grafana Features

API Quick Reference

Performance Tuning Summary

Best Practices

Instrumentation

Deployment

Querying

Ports Reference

More from grafana/skills

dashboarding

promql

grafana-oss

prometheus

opentelemetry

loki