logging-sucks
This skill is adapted from "Logging sucks. And here's how to make it better." by Boris Tane.
When helping with logging, observability, or debugging strategies, follow these principles:
Core Philosophy
- Logs should be optimized for querying, not writing — always design with debugging in mind
- Context is everything — a log without correlation IDs is useless in distributed systems
- Logs are for humans during incidents, not just for compliance or "just in case"
- If you can't filter and search your logs effectively, they provide zero value
- Mental model shift: Log what happened to this request, not what your code is doing
Wide Events / Canonical Log Lines
Instead of scattering 10-20 log lines throughout a request, emit one comprehensive event per request per service. This is the most important concept for effective logging.
- Build the event object throughout the request lifecycle
- Enrich it with context as you process (user info, business data, feature flags)
- Emit once at the end with all context attached
- Include 30-50+ fields containing everything useful for debugging
Example wide event structure:
{
  "timestamp": "2025-01-15T10:23:45.612Z",
  "request_id": "req_8bf7ec2d",
  "trace_id": "abc123",
  "service": "checkout-service",
  "method": "POST",
  "path": "/api/checkout",
  "status_code": 500,
  "duration_ms": 1247,
  "user": {
    "id": "user_456",
    "subscription": "premium",
    "account_age_days": 847
  },
  "cart": {
    "id": "cart_xyz",
    "item_count": 3,
    "total_cents": 15999
  },
  "error": {
    "type": "PaymentError",
    "code": "card_declined",
    "message": "Card declined by issuer"
  },
  "feature_flags": {
    "new_checkout_flow": true
  }
}
This enables queries like: "Show all checkout failures for premium users where new_checkout_flow was enabled, grouped by error code."
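Below is a minimal sketch of the build-and-emit pattern in TypeScript, assuming an Express-style service; the emit() sink, service name, header, and business fields are placeholders, not a prescribed API:

import express, { Request, Response, NextFunction } from "express";
import { randomUUID } from "crypto";

// Placeholder sink: swap in your logger or log-shipping transport.
function emit(event: Record<string, unknown>): void {
  process.stdout.write(JSON.stringify(event) + "\n");
}

const app = express();

// Create one event per request on entry, enrich it in handlers,
// and emit it exactly once when the response finishes.
app.use((req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();
  res.locals.event = {
    timestamp: new Date().toISOString(),
    request_id: req.header("x-request-id") ?? `req_${randomUUID()}`,
    service: "checkout-service",
    method: req.method,
    path: req.path,
  };
  res.on("finish", () => {
    emit({
      ...res.locals.event,
      status_code: res.statusCode,
      duration_ms: Date.now() - start,
    });
  });
  next();
});

// Handlers attach business context to the same event instead of
// scattering their own log lines.
app.post("/api/checkout", (req: Request, res: Response) => {
  Object.assign(res.locals.event, {
    user: { id: "user_456", subscription: "premium" },
    feature_flags: { new_checkout_flow: true },
  });
  res.status(201).json({ ok: true });
});

Because every field lands on a single event, the query above needs no joins or correlation across scattered log lines.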
Structured Logging Requirements
- Always use key-value pairs (JSON) instead of string interpolation
- Bad: "Payment failed for user 123"
- Good: {"event": "payment_failed", "user_id": "123", "reason": "insufficient_funds", "amount": 99.99}
- Structured logs are machine-parseable, enabling aggregation, alerting, and dashboards
Required Fields for Every Log Event
- timestamp — RFC3339 with timezone (e.g., 2025-01-24T20:00:00Z)
- level — debug, info, warn, error (be consistent, don't invent new levels)
- event — machine-readable event name, snake_case (e.g., user_login_success)
- request_id or trace_id — for correlating logs across a single request
- service — which service/application emitted this log
- environment — prod, staging, dev
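As a sketch, the same contract expressed as a TypeScript type (the type name is illustrative; the fields are the ones listed above):

interface BaseLogEvent {
  timestamp: string;                           // RFC3339 with timezone
  level: "debug" | "info" | "warn" | "error";
  event: string;                               // snake_case event name
  request_id?: string;                         // at least one of these two
  trace_id?: string;
  service: string;
  environment: "prod" | "staging" | "dev";
}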
Examples of High-Cardinality Fields (always include when available)
- user_id, org_id, account_id — who is affected
- request_id, trace_id, span_id — for distributed tracing
- order_id, transaction_id, job_id — domain-specific identifiers
These fields are what make logs actually queryable during incidents. Without them, you're grepping through millions of lines blindly.
Look for opportunities to add high-cardinality fields that help you identify the root cause of an issue quickly.
Context Propagation
- Pass trace/request IDs through all service boundaries (HTTP headers, message queues, etc.)
- Downstream services must inherit correlation IDs from upstream
- Use middleware or interceptors to automatically inject context into every log
- For async jobs, store and restore the original request context
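A minimal propagation sketch using Node's AsyncLocalStorage with Express-style middleware; the x-trace-id header and downstream URL are illustrative (the W3C standard header is traceparent):

import { AsyncLocalStorage } from "async_hooks";
import { randomUUID } from "crypto";
import { Request, Response, NextFunction } from "express";

const traceContext = new AsyncLocalStorage<{ trace_id: string }>();

// Inherit the upstream trace ID or mint a new one, then run the rest
// of the request inside that context.
function withTrace(req: Request, res: Response, next: NextFunction): void {
  const trace_id = req.header("x-trace-id") ?? randomUUID();
  traceContext.run({ trace_id }, next);
}

// Any log call anywhere in the request picks the ID up automatically.
function log(fields: Record<string, unknown>): void {
  process.stdout.write(JSON.stringify({ ...traceContext.getStore(), ...fields }) + "\n");
}

// Forward the same ID on every downstream call so services correlate.
async function callInventory(): Promise<unknown> {
  const trace_id = traceContext.getStore()?.trace_id ?? randomUUID();
  const res = await fetch("https://inventory.internal/stock", {
    headers: { "x-trace-id": trace_id },
  });
  return res.json();
}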
Log Levels — Use Them Correctly
- debug — Verbose details for local development, usually disabled in production
- info — Normal operations worth recording (user actions, job completions, deploys)
- warn — Something unexpected happened but the system handled it (retries, fallbacks)
- error — Something failed and likely needs human attention (exceptions, failed requests)
Don't log errors for expected conditions (e.g., user enters wrong password)
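For example, a wrong password is a normal business outcome; record it at info (or warn if you alert on bursts of failures) and reserve error for genuine faults (values illustrative):

{"level": "info", "event": "login_failed", "user_id": "user_456", "reason": "wrong_password"}
{"level": "error", "event": "login_failed", "user_id": "user_456", "reason": "auth_provider_unreachable"}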
What to Log
- Request entry and exit points (with duration)
- State transitions (order created → paid → shipped)
- External service calls (with latency and response codes)
- Authentication and authorization events
- Background job starts, completions, and failures
- Retry attempts and circuit breaker state changes
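As an example of the external-call case, the latency and response code can be timed at the call site and folded into the request's wide event; the endpoint and field names are placeholders:

// Time the external call and attach latency plus response code to the
// request's wide event instead of emitting a separate log line.
async function chargeCard(event: Record<string, unknown>, payload: unknown): Promise<void> {
  const start = Date.now();
  const res = await fetch("https://payments.example.com/charge", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  });
  event["payment_api"] = {
    status_code: res.status,
    duration_ms: Date.now() - start,
  };
}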
What NOT to Log
- Sensitive data (passwords, tokens, PII, credit card numbers)
- Logs inside tight loops (will generate millions of useless entries)
- Success cases that provide no debugging value
- Redundant information already captured by infrastructure (load balancer logs, etc.)
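A minimal redaction sketch applied at the emit path, assuming you control serialization; the key list is illustrative and should track your own schema:

const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "card_number", "ssn"]);

// Recursively replace sensitive values before the event is serialized.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, val]) =>
        SENSITIVE_KEYS.has(key.toLowerCase()) ? [key, "[REDACTED]"] : [key, redact(val)]
      )
    );
  }
  return value;
}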
Naming Conventions
- Be consistent across all services — agree on field names as a team
- Use snake_case for field names: user_id, not userId or user-id
- Use past-tense verbs for events: payment_completed, not complete_payment
- Prefix events by domain when helpful: auth.login_failed, billing.invoice_created
Performance Considerations
- Avoid logging inside hot paths unless absolutely necessary
- Buffer and batch log writes to reduce I/O overhead
- Consider log levels that can be changed at runtime without redeploying
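A sketch combining the last two points, assuming a single-process Node service; the flush interval and defaults are placeholders:

type Level = "debug" | "info" | "warn" | "error";
const order: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

// Mutable at runtime, e.g. from an admin endpoint or config watcher,
// so verbosity can change without a redeploy.
let minLevel: Level = "info";
export function setLevel(level: Level): void {
  minLevel = level;
}

// Buffer serialized events and flush on a timer to amortize I/O.
const buffer: string[] = [];
setInterval(() => {
  if (buffer.length > 0) {
    process.stdout.write(buffer.join("\n") + "\n");
    buffer.length = 0;
  }
}, 1000);

export function log(level: Level, fields: Record<string, unknown>): void {
  if (order[level] < order[minLevel]) return; // cheap early drop in hot paths
  buffer.push(JSON.stringify({ timestamp: new Date().toISOString(), level, ...fields }));
}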
Sampling Strategy (Tail Sampling)
Use tail sampling — make the sampling decision after the request completes based on its outcome:
- Always keep errors — 100% of 5xx status codes, exceptions, and failures
- Always keep slow requests — anything above your p99 latency threshold
- Always keep specific users — VIP customers, internal testing accounts, flagged sessions
- Randomly sample the rest — happy, fast requests get sampled at 1-5%
This ensures you never lose the events that matter during incidents while keeping costs manageable.
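A minimal tail-sampling decision in TypeScript, run after the wide event is complete; the threshold, allowlist, and sample rate are placeholders:

const P99_MS = 1200;                          // assumed p99 latency threshold
const FLAGGED_USERS = new Set(["user_456"]);  // assumed VIP/test allowlist
const SAMPLE_RATE = 0.05;                     // keep 5% of happy, fast requests

function shouldKeep(event: {
  status_code: number;
  duration_ms: number;
  user?: { id?: string };
}): boolean {
  if (event.status_code >= 500) return true;                           // always keep errors
  if (event.duration_ms > P99_MS) return true;                         // always keep slow requests
  if (event.user?.id && FLAGGED_USERS.has(event.user.id)) return true; // always keep flagged users
  return Math.random() < SAMPLE_RATE;                                  // sample the rest
}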
During Incidents
- Your logs should answer: Who was affected? What failed? When? Why?
- If you can't answer these within 5 minutes of querying, your logging strategy needs work
- Post-incident: add the logs you wished you had
Common Misconceptions
- Structured logging != wide events — JSON logs with 5 fields scattered across 20 lines are still useless. Wide events are a philosophy: one comprehensive event per request.
- OpenTelemetry won't save you — OTel is a delivery mechanism, not a strategy. It doesn't decide what to log or add business context. You still need to deliberately instrument with wide events.
- High cardinality is only expensive on legacy systems — Modern columnar databases (ClickHouse, BigQuery) are designed for high-cardinality, high-dimensionality data.