distributed-tracing-logs
Distributed Tracing with Logs
Implement distributed tracing using logs by propagating trace context, creating span logs, using correlation IDs, and integrating with OpenTelemetry standards to enable end-to-end request tracing across distributed systems.
When to use me
Use this skill when:
- Building or maintaining distributed systems (microservices, serverless functions)
- Tracing requests across multiple service boundaries
- Debugging issues that span multiple components or services
- Implementing observability for complex workflows
- Correlating logs from different services for a single user request
- Setting up OpenTelemetry or other tracing standards
- Analyzing latency and performance across service boundaries
- Implementing request context propagation
- Building audit trails for business transactions
What I do
1. Trace Context Propagation
- Generate trace and span IDs for request initiation
- Propagate context through HTTP headers across services
- Maintain context through async operations (queues, background jobs, callbacks)
- Handle context in batch processing and streaming systems
- Implement context extraction and injection middleware
- Manage sampling decisions for trace collection
2. Span Logging
- Create span start/end logs with timing information
- Log span attributes and events during execution
- Capture parent-child relationships between spans
- Record span status and errors for failed operations
- Include business context in span logs
- Implement span baggage for custom key-value propagation
3. Correlation & Context Management
- Generate correlation IDs for business transactions
- Link logs to traces through trace_id fields
- Maintain user/session context across service boundaries
- Propagate business identifiers (order_id, transaction_id, etc.)
- Handle context in distributed transactions
- Implement context storage and retrieval for long-running operations
4. OpenTelemetry Integration
- Implement OpenTelemetry SDKs for various languages
- Configure trace exporters (Jaeger, Zipkin, OTEL Collector, etc.)
- Set up automatic instrumentation for common frameworks
- Define custom spans and attributes for business logic
- Configure sampling strategies for production environments
- Integrate with existing logging infrastructure
5. Trace Analysis & Visualization
- Extract trace information from logs for analysis
- Calculate trace duration and latency across services
- Identify critical paths and bottlenecks
- Correlate traces with business metrics
- Create trace visualizations and dependency graphs
- Set up trace-based alerting for performance degradation
Trace Context Propagation
W3C Trace Context Standard
The W3C Trace Context specification defines standard HTTP headers for trace propagation:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Header format:
traceparent: 00-{trace-id}-{span-id}-{trace-flags}
tracestate: Vendor-specific trace state information
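The header layout above can be checked and produced with a few lines of standard-library Python. This is a minimal sketch that only handles version `00` of the header; the names `TraceParent`, `parse_traceparent`, `format_traceparent`, and `new_traceparent` are illustrative and not part of any library.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00" layout: 2 hex - 32 hex - 16 hex - 2 hex, all lowercase.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are treated as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(tp: TraceParent) -> str:
    """Serialize a TraceParent back into the header value."""
    flags = "01" if tp.sampled else "00"
    return f"00-{tp.trace_id}-{tp.span_id}-{flags}"


def new_traceparent(sampled: bool = True) -> TraceParent:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    return TraceParent(secrets.token_hex(16), secrets.token_hex(8), sampled)
```

Returning `None` for a malformed header lets the caller decide whether to start a fresh trace instead of failing the request.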
Propagation Methods
HTTP Headers (Synchronous calls)
GET /api/users HTTP/1.1
Host: api.example.com
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
X-Correlation-Id: tx-123456
X-Request-Id: req-789012
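When a service receives a request like the one above and calls another service, it usually keeps the trace ID, issues a new span ID for the outgoing hop, and forwards the correlation ID. The sketch below illustrates that with plain dictionaries; `outgoing_headers` is a hypothetical helper, and marking a brand-new trace as sampled (`01`) is an assumption made for brevity.

```python
import secrets
from typing import Dict, Mapping


def outgoing_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call from the current request's headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}

    parts = lowered.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, flags = parts[1], parts[3]
    else:
        # No usable context: this service becomes the root of a new trace.
        # Assumption: new traces are always marked as sampled.
        trace_id, flags = secrets.token_hex(16), "01"

    # Each hop gets its own span ID; the trace ID stays the same end to end.
    span_id = secrets.token_hex(8)
    headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

    # Forward vendor state and the business correlation ID as received.
    if "tracestate" in lowered:
        headers["tracestate"] = lowered["tracestate"]
    if "x-correlation-id" in lowered:
        headers["X-Correlation-Id"] = lowered["x-correlation-id"]
    return headers
```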
Message Queues (Asynchronous)
{
  "headers": {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "correlation_id": "tx-123456"
  },
  "body": {
    "order_id": "ord-789",
    "amount": 99.99
  }
}
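A producer and consumer can share this envelope with two small helpers. The following is a sketch assuming messages are serialized as JSON strings; `wrap_message` and `unwrap_message` are illustrative names, and a real broker client may offer native message headers that are a better place for the context.

```python
import json
from typing import Any, Dict, Optional, Tuple


def wrap_message(body: Dict[str, Any], traceparent: str,
                 correlation_id: Optional[str] = None) -> str:
    """Producer side: put the trace context next to the payload."""
    headers = {"traceparent": traceparent}
    if correlation_id is not None:
        headers["correlation_id"] = correlation_id
    return json.dumps({"headers": headers, "body": body})


def unwrap_message(raw: str) -> Tuple[Dict[str, Any], Optional[str], Optional[str]]:
    """Consumer side: return (body, traceparent, correlation_id)."""
    message = json.loads(raw)
    headers = message.get("headers", {})
    # Missing context yields None so the consumer can start a new trace.
    return message["body"], headers.get("traceparent"), headers.get("correlation_id")
```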
Database Operations
-- Include trace context in audit fields
INSERT INTO orders (id, amount, trace_id, span_id, created_at)
VALUES ('ord-789', 99.99, '0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331', NOW());
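From application code, the same insert might look like the sketch below. It uses an in-memory SQLite database purely so the example is self-contained; the table layout follows the statement above, `create_orders_table` and `save_order` are hypothetical helpers, and SQLite's `datetime('now')` stands in for `NOW()`.

```python
import sqlite3


def create_orders_table(conn: sqlite3.Connection) -> None:
    """Create an orders table with trace audit columns (illustrative schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id TEXT PRIMARY KEY, amount REAL, "
        "trace_id TEXT, span_id TEXT, created_at TEXT)"
    )


def save_order(conn: sqlite3.Connection, order_id: str, amount: float,
               traceparent: str) -> None:
    """Store an order together with the trace context of the current request."""
    # traceparent layout: version-trace_id-span_id-flags
    _, trace_id, span_id, _ = traceparent.split("-")
    conn.execute(
        "INSERT INTO orders (id, amount, trace_id, span_id, created_at) "
        "VALUES (?, ?, ?, ?, datetime('now'))",
        (order_id, amount, trace_id, span_id),  # bound parameters, no string formatting
    )
    conn.commit()
```

Storing the IDs as audit columns makes it possible to go from a database row back to the request that wrote it.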
Span Logging Patterns
Basic Span Logging
{
  "timestamp": "2026-02-26T18:00:00Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_start",
  "duration_ms": 0,
  "attributes": {
    "order_id": "ord-789",
    "payment_method": "credit_card",
    "amount": 99.99
  }
}
{
  "timestamp": "2026-02-26T18:00:00.123Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_end",
  "duration_ms": 123,
  "status": "OK",
  "attributes": {
    "order_id": "ord-789",
    "payment_id": "pay-456",
    "gateway_response": "success"
  }
}
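A pair of records like this can be produced by a small context manager that logs on entry and exit. The sketch below uses only the standard library; `log_span` and the `"span"` logger name are illustrative, and the timestamp is written with a `+00:00` offset rather than `Z`.

```python
import json
import logging
import time
from contextlib import contextmanager
from datetime import datetime, timezone

logger = logging.getLogger("span")


def _emit(record: dict) -> None:
    """Write one span record as a single JSON log line."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **record}
    logger.info(json.dumps(record))


@contextmanager
def log_span(trace_id, span_id, span_name, span_kind="INTERNAL", **attributes):
    """Log span_start on entry and span_end (with duration) on exit."""
    base = {"trace_id": trace_id, "span_id": span_id,
            "span_name": span_name, "span_kind": span_kind}
    _emit({"level": "INFO", **base, "event": "span_start",
           "duration_ms": 0, "attributes": dict(attributes)})

    started = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    end_attributes = dict(attributes)
    status = "ERROR"  # pessimistic default, overwritten on success
    try:
        yield end_attributes  # the caller may add attributes for span_end
        status = "OK"
    finally:
        duration_ms = round((time.monotonic() - started) * 1000)
        _emit({"level": "INFO" if status == "OK" else "ERROR", **base,
               "event": "span_end", "duration_ms": duration_ms,
               "status": status, "attributes": end_attributes})
```

Usage would be along the lines of `with log_span(trace_id, span_id, "process_payment", "SERVER", order_id="ord-789") as attrs: attrs["payment_id"] = "pay-456"`, with the logger configured at `INFO` level so the records are actually written.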
Error Span Logging
{
  "timestamp": "2026-02-26T18:00:05.123Z",
  "level": "ERROR",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_end",
  "duration_ms": 5123,
  "status": "ERROR",
  "error_code": "PAYMENT_GATEWAY_TIMEOUT",
  "error_message": "Payment gateway timeout after 5000ms",
  "stack_trace": "...",
  "attributes": {
    "order_id": "ord-789",
    "retry_count": 3,
    "gateway": "stripe"
  }
}
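The error-specific fields in this record can be derived from the caught exception. The following sketch assumes application exceptions may carry a `code` attribute; that convention, the `GatewayTimeout` class, and the `error_fields` helper are all hypothetical.

```python
import traceback
from typing import Any, Dict


class GatewayTimeout(Exception):
    """Example application error carrying a machine-readable code (assumed convention)."""
    code = "PAYMENT_GATEWAY_TIMEOUT"


def error_fields(exc: BaseException, default_code: str = "INTERNAL_ERROR") -> Dict[str, Any]:
    """Build the error-related fields of a span_end record from an exception."""
    return {
        "level": "ERROR",
        "status": "ERROR",
        # Fall back to a generic code when the exception does not define one.
        "error_code": getattr(exc, "code", default_code),
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
```

The returned dictionary is meant to be merged into the span_end record alongside `trace_id`, `span_id`, and `duration_ms`.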
Nested Span Logging
{
  "timestamp": "2026-02-26T18:00:00Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "c8be7c825a934b7d",
  "parent_span_id": "b7ad6b7169203331",
  "span_name": "charge_card",
  "span_kind": "INTERNAL",
  "event": "span_start",
  "duration_ms": 0,
  "attributes": {
    "order_id": "ord-789",
    "card_last4": "4242"
  }
}
OpenTelemetry Integration
Manual Instrumentation
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


def process_payment(order_id, amount):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)
        try:
            # Business logic
            result = charge_credit_card(order_id, amount)
            span.set_status(Status(StatusCode.OK))
            span.set_attribute("payment_id", result.payment_id)
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
Automatic Instrumentation
Configuration for automatic instrumentation of common frameworks:
opentelemetry:
  instrumentations:
    - name: "opentelemetry-instrumentation-flask"
      enabled: true
    - name: "opentelemetry-instrumentation-sqlalchemy"
      enabled: true
    - name: "opentelemetry-instrumentation-requests"
      enabled: true
  sampling:
    type: "parentbased_traceidratio"
    ratio: 0.1  # Sample 10% of traces in production
  exporters:
    - type: "otlp"
      endpoint: "http://otel-collector:4317"
    - type: "logging"  # Also log spans for local debugging
  resource:
    attributes:
      service.name: "payment-service"
      service.version: "1.2.3"
      deployment.environment: "production"
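The idea behind a parent-based, trace-ID-ratio sampler such as the one named in this configuration can be illustrated in a few lines. This is a simplified sketch of the concept and not the OpenTelemetry SDK's implementation; `should_sample` is a hypothetical function.

```python
from typing import Optional

_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: str, ratio: float,
                  parent_sampled: Optional[bool] = None) -> bool:
    """Decide whether to record a span for the given trace."""
    # Parent-based: a decision made upstream always wins, so a trace is
    # either recorded by every service or by none.
    if parent_sampled is not None:
        return parent_sampled
    # Root span: derive the decision from the trace ID itself, so any
    # service evaluating the same ID reaches the same answer.
    low_bits = int(trace_id, 16) & (_ID_LIMIT - 1)
    return low_bits < int(ratio * _ID_LIMIT)
```

Because the decision is a pure function of the trace ID, raising the ratio later keeps every previously sampled trace ID in the sampled set.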
Examples
# Generate trace context for new request
npm run tracing:generate-context -- --service payment-service --output context.json
# Propagate trace context through HTTP call
npm run tracing:propagate -- --trace-id abc123 --span-id def456 --target http://api.example.com
# Analyze trace from logs
npm run tracing:analyze -- --trace-id abc123 --sources "app.log,api.log,db.log" --output trace.json
# Set up OpenTelemetry instrumentation
npm run tracing:setup-otel -- --language nodejs --exporter jaeger --sampling-ratio 0.1
# Extract trace timeline from logs
npm run tracing:timeline -- --trace-id abc123 --output timeline.html
Output format
Trace Context Configuration:
tracing:
  standard: "W3C TraceContext"
  headers:
    traceparent: "traceparent"
    tracestate: "tracestate"
    correlation_id: "X-Correlation-Id"
    request_id: "X-Request-Id"
  propagation:
    http: true
    messaging: true
    database: true
    rpc: true
  sampling:
    strategy: "probability"
    rate: 0.1  # 10% sampling in production
    decision_deferred: false
  span_logging:
    enabled: true
    format: "json"
    include_fields:
      - trace_id
      - span_id
      - parent_span_id
      - span_name
      - span_kind
      - event
      - duration_ms
      - status
    events:
      - span_start
      - span_end
      - span_event
      - span_error
  correlation:
    business_ids:
      - order_id
      - user_id
      - transaction_id
      - session_id
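The business identifiers listed under `correlation` only help if they reach every log line. One possible approach with the standard `logging` module is a filter that copies them from request-scoped context onto each record. In this sketch, `request_context` and `CorrelationFilter` are hypothetical names, and some middleware is assumed to populate the context at the start of each request.

```python
import logging
from contextvars import ContextVar
from typing import Dict

# Request-scoped context; assumed to be set by middleware once per request.
request_context: ContextVar[Dict[str, str]] = ContextVar("request_context", default={})

CORRELATION_FIELDS = ("trace_id", "span_id", "order_id",
                      "user_id", "transaction_id", "session_id")


class CorrelationFilter(logging.Filter):
    """Copy correlation fields from the request context onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        context = request_context.get()
        for field in CORRELATION_FIELDS:
            # "-" keeps format strings working when a field is not set.
            setattr(record, field, context.get(field, "-"))
        return True  # never drop the record, only enrich it


def build_handler() -> logging.Handler:
    """Create a stream handler whose output includes the correlation fields."""
    handler = logging.StreamHandler()
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s order_id=%(order_id)s %(message)s"
    ))
    return handler
```

Attaching the filter to the handler rather than to individual loggers means records from third-party libraries are enriched as well.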
Trace Analysis Report:
Distributed Trace Analysis
─────────────────────────
Trace ID: 0af7651916cd43dd8448eb211c80319c
Start Time: 2026-02-26T18:00:00Z
Duration: 5.890s
Status: ERROR (partial failure)
Services Involved:
1. api-gateway (entry point)
2. auth-service (authentication)
3. payment-service (payment processing)
4. notification-service (notifications)
5. database (persistence)
Span Timeline:
00.000s - api-gateway: request_received (span_start)
00.123s - api-gateway: auth_check (span_start)
00.234s - auth-service: validate_token (span_start)
00.345s - auth-service: validate_token (span_end) [OK]
00.456s - api-gateway: auth_check (span_end) [OK]
00.567s - payment-service: process_payment (span_start)
01.234s - payment-service: charge_card (span_start)
05.678s - payment-service: charge_card (span_end) [ERROR: timeout]
05.789s - payment-service: process_payment (span_end) [ERROR]
05.890s - api-gateway: request_completed (span_end) [ERROR]
Critical Path Analysis:
- Total duration: 5.890s
- Payment processing: 5.222s (89% of total time)
- Card charging: 4.444s (within payment processing)
- Card charging ended in a gateway timeout at 05.678s
Error Analysis:
- Root cause: Payment gateway timeout
- Impact: Payment failed, user notified
- Recovery: Automatic retry scheduled
- Alternative flows: None configured
Performance Insights:
- Slowest service: payment-service (5.222s)
- Fastest service: auth-service (0.111s)
- Bottleneck: External payment gateway call
- Recommendation: Implement circuit breaker for payment gateway
Business Context:
- User ID: user-123
- Order ID: ord-789
- Amount: $99.99
- Payment method: credit_card
- Outcome: Failed (gateway timeout)
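Parts of a report like this can be derived directly from span logs. The sketch below assumes one JSON record per log line with the field names used in the span logging patterns earlier in this document; `summarize_trace` is an illustrative helper and covers only a small subset of the report.

```python
import json
from typing import Any, Dict, Iterable, List


def summarize_trace(log_lines: Iterable[str], trace_id: str) -> Dict[str, Any]:
    """Collect span_end records for one trace and rank spans by duration."""
    spans: List[dict] = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines mixed into the log
        if not isinstance(record, dict):
            continue
        if record.get("trace_id") == trace_id and record.get("event") == "span_end":
            spans.append(record)

    # Longest span first, so the head of the list is the main suspect.
    spans.sort(key=lambda r: r.get("duration_ms", 0), reverse=True)
    failed = [s.get("span_name") for s in spans if s.get("status") == "ERROR"]
    return {
        "trace_id": trace_id,
        "span_count": len(spans),
        "status": "ERROR" if failed else "OK",
        "slowest_span": spans[0].get("span_name") if spans else None,
        "failed_spans": failed,
    }
```

Lines from several services can be concatenated before the call, since records are matched on `trace_id` rather than on their source file.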
Notes
- Trace context should be propagated consistently across all service boundaries
- Sampling is essential in production to manage volume and cost
- Span logs should include business context for meaningful analysis
- Trace visualization requires complete context from all services
- Consider trace storage and retention policies for compliance
- Monitor trace collection and processing for reliability
- Implement trace-based alerting for performance degradation detection
- Test trace propagation in all communication patterns (sync, async, batch)
- Document trace standards for development teams
- Regularly review trace sampling rates based on volume and importance