Observability
Three Pillars
1. Logs
Discrete events with context.
{
  "timestamp": "2024-01-01T12:00:00Z",
  "level": "error",
  "message": "Failed to process order",
  "orderId": "123",
  "error": "Payment declined",
  "traceId": "abc123"
}
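A log entry like the one above can be produced by a small formatter. The following is a minimal sketch, not any particular logging library's API: `formatLogLine`, `LogLevel`, and `LogContext` are hypothetical names, and the `now` parameter exists only to make the output reproducible.

```typescript
// Hypothetical helper for illustration; not part of any logging library.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface LogContext {
  [key: string]: string | number | boolean;
}

// Builds one JSON log line. Fixed fields come first, context fields follow.
// A context key named "timestamp", "level", or "message" would override the fixed field.
function formatLogLine(
  level: LogLevel,
  message: string,
  context: LogContext,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...context,
  });
}

const exampleLine = formatLogLine(
  'error',
  'Failed to process order',
  { orderId: '123', error: 'Payment declined', traceId: 'abc123' },
  new Date('2024-01-01T12:00:00Z'),
);
```

Each line is a single JSON object, so a log pipeline can index fields such as `traceId` without parsing free text. `toISOString()` includes milliseconds, so the timestamp here reads `2024-01-01T12:00:00.000Z`.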
2. Metrics
Numeric measurements over time.
http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.95"} 0.23
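To illustrate how one metric name fans out into labeled series, here is a minimal in-memory counter that renders lines in the style shown above. `LabeledCounter` is a hypothetical class for illustration only; a real service would normally rely on a metrics client library.

```typescript
// Hypothetical in-memory counter; not a real metrics client.
type MetricLabels = Record<string, string>;

class LabeledCounter {
  private readonly metricName: string;
  private readonly series = new Map<string, number>();

  constructor(metricName: string) {
    this.metricName = metricName;
  }

  // Increments the series identified by this exact label set.
  inc(labels: MetricLabels, amount: number = 1): void {
    const key = LabeledCounter.seriesKey(labels);
    this.series.set(key, (this.series.get(key) ?? 0) + amount);
  }

  // Renders one `name{labels} value` line per series, in insertion order.
  render(): string[] {
    return Array.from(this.series.entries()).map(
      ([key, value]) => `${this.metricName}{${key}} ${value}`,
    );
  }

  // Sorts label names so {method, status} and {status, method} share a series.
  private static seriesKey(labels: MetricLabels): string {
    return Object.keys(labels)
      .sort()
      .map((labelName) => `${labelName}="${labels[labelName]}"`)
      .join(',');
  }
}

const httpRequestsTotal = new LabeledCounter('http_requests_total');
httpRequestsTotal.inc({ method: 'GET', status: '200' });
httpRequestsTotal.inc({ status: '200', method: 'GET' });
httpRequestsTotal.inc({ method: 'GET', status: '500' });
```

`httpRequestsTotal.render()` returns one line per distinct label set. The two `GET`/`200` increments land in the same series because label names are sorted before the key is built.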
3. Traces
Request flow through services.
Trace: abc123
├── API Gateway (50ms)
│   ├── Auth Service (10ms)
│   └── Order Service (35ms)
│       └── Database (20ms)
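One thing a trace makes visible is where time is actually spent. The sketch below computes each span's self time (its duration minus its direct children) for the tree above. `SpanNode` is a hypothetical shape for illustration, not the OpenTelemetry data model, and the calculation assumes child spans run one after another.

```typescript
// Hypothetical span shape for illustration; not the OpenTelemetry data model.
interface SpanNode {
  service: string;
  durationMs: number;
  children: SpanNode[];
}

// Self time = a span's duration minus the time spent in its direct children.
// Assumes children run sequentially; overlapping children would need interval merging.
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, child) => sum + child.durationMs, 0);
  return span.durationMs - childTotal;
}

// Flattens the tree into [service, selfTimeMs] pairs, parent before children.
function selfTimes(span: SpanNode): Array<[string, number]> {
  const rows: Array<[string, number]> = [[span.service, selfTimeMs(span)]];
  for (const child of span.children) {
    rows.push(...selfTimes(child));
  }
  return rows;
}

const exampleTrace: SpanNode = {
  service: 'API Gateway',
  durationMs: 50,
  children: [
    { service: 'Auth Service', durationMs: 10, children: [] },
    {
      service: 'Order Service',
      durationMs: 35,
      children: [{ service: 'Database', durationMs: 20, children: [] }],
    },
  ],
};
```

For this trace, `selfTimes(exampleTrace)` yields 5 ms for the API Gateway, 10 ms for the Auth Service, 15 ms for the Order Service, and 20 ms for the Database, which adds up to the 50 ms root duration.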
OpenTelemetry Setup
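import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Export spans over OTLP/HTTP to the collector's trace endpoint.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://collector:4318/v1/traces',
  }),
  // Reported as the service.name resource attribute.
  serviceName: 'my-service',
});

// Start the SDK as early as possible in the process.
sdk.start();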
Key Metrics
RED Method (Request-focused)
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency
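The three RED signals can be derived from a window of request samples. The following is a rough sketch under stated assumptions: `RequestSample`, `summarizeRed`, and `percentile` are hypothetical names, a request counts as failed when its status is 5xx, and duration is summarized with a nearest-rank P95.

```typescript
// Hypothetical shapes for illustration.
interface RequestSample {
  statusCode: number;
  durationSeconds: number;
}

interface RedSummary {
  ratePerSecond: number;
  errorsPerSecond: number;
  p95DurationSeconds: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

// Summarizes all samples observed within a window of `windowSeconds`.
function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  // Assumption: only 5xx responses count as failed requests.
  const failed = samples.filter((s) => s.statusCode >= 500).length;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorsPerSecond: failed / windowSeconds,
    p95DurationSeconds: percentile(samples.map((s) => s.durationSeconds), 95),
  };
}

// Ten requests seen in a 5-second window; the last two failed.
const durations = [0.05, 0.08, 0.1, 0.12, 0.15, 0.18, 0.2, 0.23, 0.3, 1.2];
const exampleSamples: RequestSample[] = durations.map((d, i) => ({
  statusCode: i >= 8 ? 500 : 200,
  durationSeconds: d,
}));
const redSummary = summarizeRed(exampleSamples, 5);
```

Here `redSummary` reports 2 requests per second, 0.4 failed requests per second, and a P95 of 1.2 seconds. With only ten samples the P95 is simply the slowest request, which is one reason percentiles are usually computed over larger windows.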
USE Method (Resource-focused)
- Utilization: Percentage of time the resource is busy
- Saturation: Amount of work waiting for the resource (queue depth)
- Errors: Count of error events
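The USE signals can be sketched the same way from periodic samples of a single resource. `ResourceSample` and `summarizeUse` are hypothetical names, and the sampling model (a busy flag and a queue depth per sample) is an assumption made for illustration.

```typescript
// Hypothetical periodic sample of one resource, e.g. a worker or a disk.
interface ResourceSample {
  busy: boolean; // was the resource doing work at sample time?
  queueDepth: number; // items waiting for the resource
  errors: number; // errors observed since the previous sample
}

interface UseSummary {
  utilizationPercent: number;
  averageQueueDepth: number;
  errorCount: number;
}

function summarizeUse(samples: ResourceSample[]): UseSummary {
  if (samples.length === 0) {
    return { utilizationPercent: 0, averageQueueDepth: 0, errorCount: 0 };
  }
  const busyCount = samples.filter((s) => s.busy).length;
  const queueTotal = samples.reduce((sum, s) => sum + s.queueDepth, 0);
  const errorCount = samples.reduce((sum, s) => sum + s.errors, 0);
  return {
    // Utilization: share of samples in which the resource was busy.
    utilizationPercent: (busyCount / samples.length) * 100,
    // Saturation: average amount of work waiting.
    averageQueueDepth: queueTotal / samples.length,
    errorCount,
  };
}

const exampleUse = summarizeUse([
  { busy: true, queueDepth: 0, errors: 0 },
  { busy: true, queueDepth: 2, errors: 0 },
  { busy: true, queueDepth: 4, errors: 1 },
  { busy: false, queueDepth: 0, errors: 0 },
]);
```

For these four samples, `exampleUse` shows 75% utilization, an average queue depth of 1.5, and one error.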
Alerting
Good Alerts
- Actionable: The responder can do something about it
- Urgent: Needs immediate attention
- Specific: States clearly what is wrong
Alert Template
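alert: HighErrorRate
# Ratio of 5xx requests to all requests, per service.
# Assumes http_requests_total carries a "service" label.
expr: |
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    /
  sum by (service) (rate(http_requests_total[5m]))
    > 0.01
for: 5m
labels:
  severity: critical
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: "Error rate is {{ $value | humanizePercentage }}"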
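A rule like the one above combines a threshold with a `for` duration, so a brief spike does not page anyone. The sketch below is a simplified model of that behavior, not Prometheus's actual implementation: `evaluateAlert`, `RatioPoint`, and `AlertState` are hypothetical names, and the input is assumed to be an error ratio sampled at each rule evaluation.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Hypothetical evaluation result: one error ratio per rule evaluation.
interface RatioPoint {
  timeSeconds: number;
  errorRatio: number; // failed requests / all requests over the lookback window
}

// Returns the alert state after each evaluation.
// The alert is pending while the threshold is breached, and fires once the
// breach has lasted at least `forSeconds` without interruption.
function evaluateAlert(
  points: RatioPoint[],
  threshold: number,
  forSeconds: number,
): AlertState[] {
  let breachStart: number | null = null;
  return points.map((point): AlertState => {
    if (point.errorRatio <= threshold) {
      breachStart = null; // condition cleared, so the timer resets
      return 'inactive';
    }
    if (breachStart === null) breachStart = point.timeSeconds;
    return point.timeSeconds - breachStart >= forSeconds ? 'firing' : 'pending';
  });
}

// One evaluation per minute, 1% threshold, 5-minute hold.
const ratios = [0.005, 0.02, 0.03, 0.02, 0.05, 0.04, 0.03, 0.002];
const alertStates = evaluateAlert(
  ratios.map((errorRatio, i) => ({ timeSeconds: i * 60, errorRatio })),
  0.01,
  300,
);
```

In this run the ratio first exceeds 1% at 60 seconds, the alert stays pending through 300 seconds, fires at 360 seconds, and returns to inactive at 420 seconds when the ratio drops back below the threshold.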
Dashboards
Essential panels:
- Request rate
- Error rate
- Latency (P50, P95, P99)
- Saturation (CPU, memory)
- Active alerts