# nestjs-lgtm-metrics

NestJS Application Metrics via OpenTelemetry

## What This Skill Covers
This skill adds application-level metrics to a NestJS app using the OpenTelemetry SDK, pushing them through an OTel Collector into Prometheus. It is organized in progressive stages, from foundational to advanced.
Scope boundary: This skill covers metrics only. Not logging, not tracing. Infrastructure metrics (container resources, host stats, database server stats, Redis server stats) come from their respective exporters and community Grafana dashboards. This skill covers what those exporters cannot see: request-level behavior, business events, runtime internals, and domain-specific KPIs as observed from inside your application.
## Prerequisites
- NestJS application
- OpenTelemetry Collector running and reachable from the app
- Prometheus configured as an export target from the Collector
- Grafana connected to Prometheus as a data source
## Architecture
NestJS App (OTel SDK) ──OTLP/gRPC──▶ OTel Collector ──remote_write──▶ Prometheus ──▶ Grafana
The app never exposes a /metrics endpoint. It pushes metrics to the Collector via OTLP. The Collector handles export to Prometheus (via prometheusremotewrite exporter or a Prometheus scrape on the Collector itself).
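The Collector side of that pipeline is small. A minimal sketch, assuming the Collector listens on the default OTLP gRPC port and Prometheus runs at `prometheus:9090` with its remote-write receiver enabled (hostnames and ports are placeholders; the authoritative config lives in references/otel-setup.md):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # the app pushes metrics here via OTLP/gRPC

exporters:
  prometheusremotewrite:
    # Prometheus must be started with --web.enable-remote-write-receiver
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```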
Read `references/otel-setup.md` before writing any application metric code. It covers the SDK bootstrap, Collector pipeline config, and how to verify the pipeline is working end to end.
## Implementation Stages
Each stage builds on the previous. Read the corresponding reference file for full implementation code.
### Stage 1: Foundation (Day 1, any app)

Read: `references/stage-1-foundation.md`
Non-negotiable. Every NestJS app in production needs these from the start.
What you get:
- HTTP RED metrics (Rate, Errors, Duration) via a global interceptor
  - `http_request_duration_seconds` histogram, labeled by method, route, status_code
  - `http_requests_total` counter, labeled by method, route, status_code
  - `http_requests_in_flight` up/down counter
- Node.js runtime metrics (what V8/libuv expose that no infra exporter can see)
  - `nodejs_eventloop_lag_seconds` histogram (not just the mean; percentiles matter)
  - `nodejs_active_handles_total` observable gauge
  - `nodejs_active_requests_total` observable gauge
  - GC duration, heap size, external memory via `@opentelemetry/host-metrics`
- Health and readiness probe metrics
  - `app_health_check_duration_seconds` histogram by check_name
  - `app_health_check_status` gauge (1=healthy, 0=unhealthy) by check_name
Why these first: HTTP RED gives you the ability to detect problems. Runtime metrics tell you whether the Node.js process itself is degraded. Health checks close the loop with your orchestrator. Together they answer: "Is the app working? How fast? For whom is it failing?"
**Histogram bucket strategy:** Use explicit bucket boundaries initially: `[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. Tune after you have real traffic data. Do not over-bucket. More buckets = more time series = more Prometheus storage cost.
### Stage 2: Application Intelligence (Week 1-2, pre-launch)

Read: `references/stage-2-app-intelligence.md`
These metrics expose application behavior that sits between raw HTTP and business logic.
What you get:
- **Database query metrics from the app's perspective**
  - `app_db_query_duration_seconds` histogram by operation (select, insert, update, delete), entity
  - `app_db_query_errors_total` counter by operation, error_type
  - `app_db_connection_pool_wait_seconds` histogram
  - Why this matters: postgres-exporter sees server-side query time. Your app sees ORM overhead + pool wait + serialization + network. The gap between these two is where bugs hide.
- **Redis operation metrics from the app's perspective**
  - `app_redis_operation_duration_seconds` histogram by operation (get, set, del, etc.), key_prefix
  - `app_redis_operation_errors_total` counter
  - `app_cache_hit_total` / `app_cache_miss_total` counters by cache_name
  - Why this matters: redis-exporter sees `keyspace_hits`/`misses` globally. Your app knows WHICH cache strategy worked and which didn't.
- **External API call metrics**
  - `app_external_api_duration_seconds` histogram by service, endpoint, status_code
  - `app_external_api_errors_total` counter by service, error_type
  - `app_circuit_breaker_state` gauge (0=closed, 0.5=half-open, 1=open) by service
- **Authentication and authorization metrics**
  - `app_auth_attempts_total` counter by method (password, oauth, api_key), result (success, failure, locked)
  - `app_auth_token_operations_total` counter by operation (issue, refresh, revoke, expire)
  - `app_authorization_denied_total` counter by resource, action
- **Background job / queue metrics (Bull, BullMQ, or similar)**
  - `app_job_duration_seconds` histogram by queue, job_type
  - `app_job_completed_total` counter by queue, job_type, status (completed, failed, stalled)
  - `app_job_queue_depth` gauge by queue
  - `app_job_attempts_total` counter by queue, job_type (tracks retry behavior)
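As one concrete example of the cache metrics above, a small wrapper can guarantee that every lookup increments exactly one of hit/miss. `withCacheMetrics` is a hypothetical helper, and the `Counter` interface merely mirrors the `@opentelemetry/api` counter shape:

```typescript
// Mirrors the counter shape from @opentelemetry/api (assumption).
interface Counter {
  add(value: number, attrs?: Record<string, string>): void;
}

// Hypothetical wrapper: every lookup increments exactly one of hit/miss.
// `cacheName` must be a bounded label ("user-profile", "session"), never a key.
async function withCacheMetrics<T>(
  hits: Counter,     // app_cache_hit_total
  misses: Counter,   // app_cache_miss_total
  cacheName: string,
  lookup: () => Promise<T | null>,
): Promise<T | null> {
  const value = await lookup();
  (value !== null ? hits : misses).add(1, { cache_name: cacheName });
  return value;
}
```

The same shape works for the DB and external API timers: wrap once, label with a bounded name, and the per-strategy breakdown the exporters cannot see falls out for free.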
### Stage 3: Business Metrics (Launch, real users)

Read: `references/stage-3-business-metrics.md`
Once users arrive, technical metrics aren't enough. You need to observe the business through the same system.
What you get:
- **User lifecycle events**
  - `app_user_signups_total` counter by source, plan
  - `app_user_logins_total` counter by method
  - `app_user_actions_total` counter by action_type (generic, high-cardinality-safe pattern)
- **Transaction / conversion metrics**
  - `app_transactions_total` counter by type, status
  - `app_transaction_value_total` counter (sum of monetary values) by currency
  - `app_conversion_funnel_step_total` counter by funnel, step
- **Feature usage metrics**
  - `app_feature_usage_total` counter by feature_name, variant (useful for A/B tests)
  - `app_feature_errors_total` counter by feature_name
- **SLI metrics for SLO tracking**
  - `app_sli_request_availability` counter (successful requests / total requests, partitioned)
  - `app_sli_request_latency` histogram (tighter buckets around your SLO threshold)
**Cardinality warning:** Business metrics are where cardinality explosions happen. Never use `user_id`, `session_id`, or any unbounded value as an attribute. Use bucketed categories. Read the cardinality section in `references/stage-3-business-metrics.md`.
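One way to honor that warning is to collapse unbounded values into a fixed label set before they ever reach a metric attribute. `bucketLabel` and its thresholds are illustrative, not part of the skill's reference code:

```typescript
// Hypothetical helper: collapse an unbounded value into a fixed label set so
// it is safe to use as a metric attribute. Thresholds are illustrative.
function bucketLabel(value: number, bounds: number[], labels: string[]): string {
  // labels must have bounds.length + 1 entries: one per interval.
  for (let i = 0; i < bounds.length; i++) {
    if (value < bounds[i]) return labels[i];
  }
  return labels[labels.length - 1];
}

// e.g. order value in cents -> a bounded "order_size" attribute
const orderSize = (cents: number) =>
  bucketLabel(cents, [1_000, 10_000, 100_000], ['small', 'medium', 'large', 'xl']);

orderSize(500);     // "small"
orderSize(250_000); // "xl"
```

Four possible label values instead of one per order: the series count stays constant no matter how many orders flow through.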
### Stage 4: Advanced Patterns (Scaled app, mature team)

Read: `references/stage-4-advanced.md`
For apps serving real traffic at scale, handling complex flows, or needing tight operational control.
What you get:
- **WebSocket / real-time connection metrics**
  - `app_ws_connections_active` gauge by namespace
  - `app_ws_messages_total` counter by namespace, direction (in/out), event_type
  - `app_ws_connection_duration_seconds` histogram
- **Rate limiter observability**
  - `app_rate_limit_hits_total` counter by limiter, route
  - `app_rate_limit_remaining` gauge by limiter (sampled, not per-request)
- **Multi-tenant metrics**
  - `app_tenant_requests_total` counter by tenant_tier (not tenant_id; cardinality)
  - `app_tenant_resource_usage` gauge by tenant_tier, resource_type
- **Graceful shutdown and lifecycle**
  - `app_shutdown_duration_seconds` histogram
  - `app_startup_duration_seconds` gauge
- **Deployment version tracking**
  - `app_info` gauge with version, commit_sha, environment attributes for canary analysis
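The `app_info` pattern is simple enough to sketch: a gauge that always reports 1, with build metadata carried as attributes so other series can be compared across versions. `ObservableResult` mirrors the callback argument of an OTel observable gauge (assumption), and the `meta` parameter stands in for values you would read from the environment or build info:

```typescript
// Mirrors the callback argument shape of an OTel observable gauge (assumption).
interface ObservableResult {
  observe(value: number, attrs: Record<string, string>): void;
}

// app_info is always 1; the information lives in the attributes. Dashboards
// can then break request rates down by `version` during a canary rollout.
function observeAppInfo(
  result: ObservableResult,
  meta: { version: string; commitSha: string; environment: string },
): void {
  result.observe(1, {
    version: meta.version,
    commit_sha: meta.commitSha,
    environment: meta.environment,
  });
}
```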
## Grafana Dashboard Templates
Import-ready Grafana dashboard JSON files, one per stage. Each uses `${DS_PROMETHEUS}` as the data source variable, so Grafana will prompt you to select your Prometheus data source on import.

**How to import:** Grafana → Dashboards → New → Import → Upload JSON file → Select your Prometheus data source.
| Template | Panels | Description |
|---|---|---|
| `templates/stage-1-service-overview.json` | 14 | Golden signals (stat), request rate by route, error rate, latency percentiles, P99 by route, event loop lag, heap memory, active handles, health check status and duration. |
| `templates/stage-2-dependency-health.json` | 15 | DB query P95 by operation, pool wait time, DB errors, cache hit ratio by strategy, Redis op latency, Redis errors, external API P95, circuit breaker state, API errors, auth failure rate, token ops, authz denials, queue depth, job duration, job failure rate. |
| `templates/stage-3-business-metrics.json` | 14 | Signups/revenue/success rate big numbers, signups by source, logins by method, revenue rate, transaction failure rate, funnel step totals, feature usage by variant, feature errors, 30-day SLO availability, SLI latency percentiles, error budget burn rate. |
| `templates/stage-4-operational.json` | 10 | Running versions table, startup/shutdown duration, active WS connections, WS message throughput, WS connection duration, rate limit rejections by route, remaining tokens, request volume by tenant tier, resource usage by tier. |
Each dashboard is standalone. Combine panels across dashboards as your needs evolve. All dashboards include `$job` and `$instance` template variables for filtering. Stage 3 additionally includes `$funnel` and `$sli` selectors.
## Grafana Dashboard Strategy

Read `references/grafana-dashboards.md` for dashboard panel definitions and provisioning.
Dashboard hierarchy:
- **Service Overview (Stage 1):** RED metrics, runtime health, up/down status. One dashboard per service.
- **Dependency Health (Stage 2):** DB latency, Redis hit rates, external API status. Shows how your app EXPERIENCES its dependencies.
- **Business Dashboard (Stage 3):** Conversion funnels, feature adoption, revenue metrics. Non-engineers should be able to read this.
- **Operational Dashboard (Stage 4):** Rate limits, circuit breakers, tenant resource usage, deployment comparison.
Alert hierarchy (pair with dashboards):
- Stage 1: Error rate spike, latency P99 breach, event loop lag > 100ms
- Stage 2: DB query latency drift, cache hit rate drop, external API circuit open
- Stage 3: Conversion rate drop, signup anomaly
- Stage 4: Rate limiter saturation, tenant quota breach
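For illustration, the Stage 1 "event loop lag > 100ms" alert could look like the following Prometheus rule. This is a sketch: the `_bucket` suffix assumes Prometheus-style histogram naming after remote write, so verify the exported metric name in your setup first:

```yaml
groups:
  - name: stage-1-foundation
    rules:
      - alert: EventLoopLagP99High
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (rate(nodejs_eventloop_lag_seconds_bucket[5m]))
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Event loop lag P99 above 100ms on {{ $labels.instance }}"
```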
## Infrastructure Exporters and Community Dashboards

Read: `references/infra-exporters.md`
Your app metrics (Stages 1-4) show how the application experiences its dependencies. Infrastructure exporters show how those services perform internally. You need both. The gap between them is where most production issues hide.
This reference covers exporter setup (docker-compose), Prometheus scrape config, and community Grafana dashboard IDs for: PostgreSQL, Redis, RabbitMQ, MongoDB, Qdrant, Elasticsearch, Nginx, MinIO, plus node-exporter and cAdvisor for host/container metrics.
Each entry includes the key metrics the exporter provides and, critically, what it does NOT tell you (which your app-level metrics fill).
## Common Mistakes to Avoid
- **High-cardinality attributes:** Never use `userId`, `requestId`, IP, or full URL path as a metric attribute. Use route patterns. If you need per-user data, that's a log or trace, not a metric.
- **Metric name collisions:** Prefix everything with `app_`. Infra exporters use `node_`, `pg_`, `redis_`, `container_`.
- **Too many buckets:** Every unique attribute combination x every bucket = one time series. 10 routes x 5 methods x 11 buckets = 550 series just for HTTP duration. That's fine. 500 routes x 11 buckets = not fine.
- **Forgetting the MeterProvider:** If your OTel SDK setup doesn't register a MeterProvider, all `meter.create*` calls silently return no-op instruments. Nothing fails, nothing records. Verify by checking the Collector's own metrics or the Prometheus targets page.
- **Measuring inside try/catch only:** Always record duration and count in a `finally` block so errors are measured too.
- **Ignoring pool wait time:** The time your request spends waiting for a DB connection from the pool is invisible to both your query timer and the postgres-exporter. Instrument it separately.
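The bucket arithmetic above can be kept honest with a back-of-envelope helper. Note this is a lower bound: Prometheus also stores a `+Inf` bucket plus `_sum` and `_count` series per label combination. `approxSeries` is a hypothetical name:

```typescript
// Lower-bound estimate of time series produced by one histogram:
// product of label cardinalities x number of explicit buckets.
function approxSeries(labelCardinalities: number[], bucketCount: number): number {
  return labelCardinalities.reduce((a, b) => a * b, 1) * bucketCount;
}

// 10 routes x 5 methods x 11 buckets:
approxSeries([10, 5], 11); // 550 -- fine
// 500 routes x 11 buckets:
approxSeries([500], 11);   // 5500 -- before methods and status codes; not fine
```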
## File Reference
| File | When to read |
|---|---|
| `references/otel-setup.md` | Before starting. OTel SDK bootstrap and Collector config. |
| `references/stage-1-foundation.md` | Implementing foundation metrics. Full code. |
| `references/stage-2-app-intelligence.md` | Adding dependency and internal behavior metrics. |
| `references/stage-3-business-metrics.md` | Adding business event tracking and SLIs. |
| `references/stage-4-advanced.md` | WebSockets, rate limiters, multi-tenancy, lifecycle. |
| `references/infra-exporters.md` | Exporter setup for Postgres, Redis, RabbitMQ, MongoDB, Qdrant, Elasticsearch, Nginx, MinIO, node-exporter, cAdvisor. Docker-compose, scrape configs, community dashboard IDs. |
| `references/grafana-dashboards.md` | App dashboard structure, alert rules, provisioning. |
| `templates/stage-1-service-overview.json` | Import into Grafana for HTTP RED, runtime, health check panels. |
| `templates/stage-2-dependency-health.json` | Import into Grafana for DB, Redis, external API, auth, job panels. |
| `templates/stage-3-business-metrics.json` | Import into Grafana for signups, revenue, funnels, SLO panels. |
| `templates/stage-4-operational.json` | Import into Grafana for WebSocket, rate limiter, multi-tenant panels. |