Structured Logging

Core Philosophy

Logs are optimized for querying, not writing — design with debugging in mind
A log line without a correlation ID is close to useless in a distributed system: it cannot be tied to the request that produced it
If you can't answer "Who was affected? What failed? When? Why?" within 5 minutes, the logging needs work

Structured Format

Always use key-value pairs (JSON), never string interpolation.

{
  "event": "payment_failed",
  "user_id": "123",
  "reason": "insufficient_funds",
  "amount": 99.99,
  "timestamp": "2025-01-24T20:00:00Z",
  "level": "error",
  "service": "billing",
  "environment": "prod",
  "request_id": "req_abc123"
}
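
As a minimal sketch of emitting an event like the one above, the following uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `LEVEL_NAMES` mapping, and the `fields` attribute passed through `extra=` are names made up for this illustration, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the four levels used in this guide.
# Folding CRITICAL into "error" is a choice made for this sketch.
LEVEL_NAMES = {
    logging.DEBUG: "debug",
    logging.INFO: "info",
    logging.WARNING: "warn",
    logging.ERROR: "error",
    logging.CRITICAL: "error",
}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "event": record.getMessage(),
        }
        # `fields` is a hypothetical attribute set through `extra=` below.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("billing")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Avoid: logger.error(f"Payment failed for user {user_id}: {reason}")
# Prefer: a constant event name plus key-value fields.
logger.error(
    "payment_failed",
    extra={"fields": {"user_id": "123", "reason": "insufficient_funds",
                      "amount": 99.99, "service": "billing",
                      "environment": "prod", "request_id": "req_abc123"}},
)
```

Keeping the event name constant and moving the variable parts into fields is what makes the line filterable by `event`, `user_id`, or `request_id` later.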

Required Fields

Every log event MUST include:

Field	Format	Example
`timestamp`	ISO 8601 with timezone	`2025-01-24T20:00:00Z`
`level`	debug, info, warn, error	`info`
`event`	snake_case, past tense	`user_login_succeeded`
`request_id` or `trace_id`	UUID or prefixed ID	`req_abc123`
`service`	Service/app name	`api-gateway`
`environment`	prod, staging, dev	`prod`
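
The required fields in the table can be checked mechanically, for example in a test suite or a log pipeline. The sketch below is one possible check; `validate_event` and the constant names are hypothetical, and the `Z` suffix is rewritten before parsing only because older Python versions do not accept it in `datetime.fromisoformat`.

```python
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "level", "event", "service", "environment")
CORRELATION_FIELDS = ("request_id", "trace_id")  # at least one must be present
ALLOWED_LEVELS = {"debug", "info", "warn", "error"}


def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event passes."""
    problems = [f"missing: {name}" for name in REQUIRED_FIELDS if not event.get(name)]
    if not any(event.get(name) for name in CORRELATION_FIELDS):
        problems.append("missing: request_id or trace_id")

    level = event.get("level")
    if level and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")

    timestamp = event.get("timestamp")
    if timestamp:
        try:
            parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
        else:
            if parsed.tzinfo is None:
                problems.append("timestamp has no timezone")
    return problems
```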

High-Cardinality Fields

Include these when available — they make logs queryable during incidents:

Category	Fields
Identity	`user_id`, `org_id`, `account_id`
Tracing	`request_id`, `trace_id`, `span_id`
Domain	`order_id`, `transaction_id`, `job_id`

Rule: Look for domain-specific identifiers that help isolate issues to specific entities.

Log Levels

Level	When to Use	Example
`debug`	Verbose local dev details, disabled in prod	Variable values, loop iterations
`info`	Normal operations worth recording	User actions, job completions, deploys
`warn`	Unexpected but handled	Retries triggered, fallbacks activated
`error`	Failed, needs attention	Exceptions, failed requests, timeouts

Anti-pattern: Logging expected conditions at `error` level. A wrong password is `info`, not `error`.

Context Propagation

For distributed systems:

Inherit IDs — Downstream services must receive correlation IDs from upstream
Pass through boundaries — HTTP headers, message queues, async jobs
Middleware injection — Auto-inject context into every log via middleware/interceptor

[Client] --request_id--> [API Gateway] --request_id--> [Service A] --request_id--> [Service B]
                              |                              |                          |
                           (logs)                         (logs)                     (logs)
                              ↓                              ↓                          ↓
                     All queryable by single request_id

Async jobs: Store the original request context with the job when it is enqueued, and restore it when the worker processes it.
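
One way to sketch the three steps above, plus the async-job case, is with Python's `contextvars` and a `logging.Filter`. Everything here is illustrative: the `X-Request-ID` header name is an assumption (use whatever your services agree on), and `bind_request_context`, `enqueue_job`, and `run_job` are hypothetical helpers rather than a framework API.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request or job.
request_id_var = contextvars.ContextVar("request_id", default=None)


class RequestContextFilter(logging.Filter):
    """Copy the current request_id onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def bind_request_context(headers: dict) -> str:
    """Middleware step: inherit the upstream ID, or mint one at the edge."""
    # "X-Request-ID" is an assumed header name, not a standard this guide defines.
    request_id = headers.get("X-Request-ID") or f"req_{uuid.uuid4().hex}"
    request_id_var.set(request_id)
    return request_id


def enqueue_job(payload: dict) -> dict:
    """Store the current context alongside the job payload."""
    return {"payload": payload, "request_id": request_id_var.get()}


def run_job(job: dict, work) -> None:
    """Restore the stored context for the duration of the background work."""
    token = request_id_var.set(job["request_id"])
    try:
        work(job["payload"])
    finally:
        # Put back whatever the worker had before this job.
        request_id_var.reset(token)
```

Attaching `RequestContextFilter` to a handler means application code never passes `request_id` by hand; the middleware sets it once and every log call inherits it.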

What to Log

Log These	Skip These
Request entry/exit with duration	Sensitive data (passwords, tokens, PII, cards)
State transitions (created → paid → shipped)	Inside tight loops
External service calls with latency + status	Success cases with no debug value
Auth/authz events	Redundant infra logs (the load balancer already captures them)
Job starts, completions, failures
Retry attempts, circuit breaker changes
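
To keep sensitive data out of logs, fields can be scrubbed before they reach the logger. The following is a rough sketch only: the `SENSITIVE_KEYS` list is a hypothetical starting point, not a complete inventory of PII, and the function only descends into nested dicts (lists are passed through unchanged).

```python
# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = frozenset(
    {"password", "token", "authorization", "card_number", "cvv", "ssn"}
)
REDACTED = "[REDACTED]"


def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values replaced."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            # Recurse so nested payloads are scrubbed too.
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

A deny-list like this fails open when a new sensitive field appears under an unexpected name, so an allow-list of loggable fields may be the safer design where that is practical.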

Naming Conventions

Pattern	Example
Field names: `snake_case`	`user_id`, not `userId` or `user-id`
Events: past tense verbs	`payment_completed`, not `complete_payment`
Domain prefixes when helpful	`auth.login_failed`, `billing.invoice_created`

Team agreement: Define field names once, use consistently across all services.
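
Part of these conventions can be enforced automatically. The sketch below checks `snake_case` field names and optionally dot-prefixed event names; the function names are hypothetical, and the past-tense rule is deliberately not checked because it cannot be verified reliably with a pattern.

```python
import re

# One lowercase word, optionally followed by _word segments.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_fields(event: dict) -> list[str]:
    """Return the field names that break the snake_case convention."""
    return sorted(name for name in event if not SNAKE_CASE.match(name))


def is_valid_event_name(name: str) -> bool:
    """Accept `event_name` or `domain.event_name`, each part in snake_case."""
    return all(SNAKE_CASE.match(part) for part in name.split("."))
```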

Performance

Concern	Solution
High-volume debug logs	Sampling in production
Hot path logging	Avoid or use async appenders
I/O overhead	Buffer and batch writes
Dynamic verbosity	Runtime-configurable log levels

Language-Specific Implementations

Language	Library	Notes
Python	`structlog`	See `majestic-data/etl-core-patterns`
Ruby/Rails	`Rails.event` (8.1+), `semantic_logger`	See `majestic-rails/dhh-coder/structured-events`
Node.js	`pino`, `winston` with JSON formatter
Go	`slog` (stdlib), `zerolog`
Java	`logback` with JSON encoder

Decision Table: Log or Not?

Scenario	Decision	Reason
User enters wrong password	`info`	Expected behavior, not an error
Payment gateway timeout	`error` + retry	Needs attention, affects user
Cache miss	`debug`	Only useful for performance analysis
User created account	`info`	Business event worth recording
Loop iteration 5000 of 10000	Don't log	Creates noise, no debug value
External API returns 500	`warn` or `error`	Depends on retry/fallback behavior
Background job started	`info`	Useful for job debugging
Background job failed after retries	`error`	Needs investigation
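
The "`warn` or `error`" rows depend on whether the failure was absorbed. A minimal sketch of that rule follows: intermediate failures that will be retried log at `warn`, and only the final failure logs at `error`. `call_with_retries` and the event names are hypothetical, the `fields` key under `extra=` is an assumed convention for a JSON formatter to pick up, and real code would catch a narrower exception type and add backoff between attempts.

```python
import logging

logger = logging.getLogger("billing")


def call_with_retries(operation, max_attempts: int = 3):
    """Run `operation`, logging handled failures as warn and the last as error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # broad on purpose for the sketch
            fields = {
                "attempt": attempt,
                "max_attempts": max_attempts,
                "reason": str(exc),
            }
            if attempt < max_attempts:
                # Unexpected but handled: a retry follows.
                logger.warning("external_call_retried", extra={"fields": fields})
            else:
                # Retries exhausted: this needs attention.
                logger.error("external_call_failed", extra={"fields": fields})
                raise
```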

Incident Debugging Checklist

When designing logs, verify you can answer:

Who — Can filter to specific user/org/account?
What — Can identify the exact operation that failed?
When — Can narrow to specific time window?
Why — Is error context captured (reason, upstream cause)?
Where — Can trace across services via correlation ID?

Post-incident: Add the logs you wished you had.
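
A quick way to exercise the checklist is to pull every event for one request out of a sample of JSON-lines logs. The sketch below assumes one JSON object per line and timestamps in a single ISO 8601 format and timezone, so that string order matches time order; `events_for_request` is a hypothetical helper, not a replacement for a real log query tool.

```python
import json


def events_for_request(lines, request_id: str) -> list[dict]:
    """Collect the events for one request_id, ordered by timestamp."""
    matches = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Unstructured lines cannot be queried; skip them.
            continue
        if isinstance(event, dict) and event.get("request_id") == request_id:
            matches.append(event)
    return sorted(matches, key=lambda event: event.get("timestamp", ""))
```

If the result for a known-bad request cannot answer who, what, when, why, and where, that gap is the log to add.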
