Observability
Three Pillars
1. Logs
Discrete events with context.
{
  "timestamp": "2024-01-01T12:00:00Z",
  "level": "error",
  "message": "Failed to process order",
  "orderId": "123",
  "error": "Payment declined",
  "traceId": "abc123"
}
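A structured log line like the one above can be produced by a small formatter. The following is a minimal sketch, not a specific logging library: `formatLog`, `LogLevel`, and `LogContext` are hypothetical names chosen for illustration, and the timestamp is injected so the output is reproducible.

```typescript
// Hypothetical helper: the names and field set are illustrative assumptions.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  traceId?: string;
  [key: string]: unknown;
}

// Builds one JSON log line: fixed fields first, then caller-supplied context.
function formatLog(
  level: LogLevel,
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...context,
  });
}

// Usage: carry the traceId in the context so the log can be joined to a trace.
const orderErrorLine = formatLog(
  "error",
  "Failed to process order",
  { orderId: "123", error: "Payment declined", traceId: "abc123" },
  new Date("2024-01-01T12:00:00Z"),
);
```

Including `traceId` on every line is what lets a log search pivot to the matching trace.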
2. Metrics
Numeric measurements over time.
http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.95"} 0.23
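To show how a labelled counter maps to lines like these, here is a minimal in-memory sketch. The `Counter` class is a hypothetical illustration, not the API of any Prometheus client library, and it omits label-value escaping for brevity.

```typescript
type Labels = Record<string, string>;

// Hypothetical in-memory counter; real client libraries add escaping, types, and help text.
class Counter {
  private readonly name: string;
  private readonly values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Serialise labels in sorted key order so one label set always maps to one series.
  private key(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(",");
  }

  inc(labels: Labels, amount = 1): void {
    const k = this.key(labels);
    this.values.set(k, (this.values.get(k) ?? 0) + amount);
  }

  // One text line per series: name{labels} value
  render(): string[] {
    return Array.from(this.values.entries()).map(
      ([k, v]) => `${this.name}{${k}} ${v}`,
    );
  }
}

const httpRequests = new Counter("http_requests_total");
httpRequests.inc({ method: "GET", status: "200" });
httpRequests.inc({ status: "200", method: "GET" }); // same series, different key order
httpRequests.inc({ method: "POST", status: "500" });
const metricLines = httpRequests.render();
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels (user IDs, request IDs) are usually kept out of metrics.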
3. Traces
Request flow through services.
Trace: abc123
├── API Gateway (50ms)
│   ├── Auth Service (10ms)
│   └── Order Service (35ms)
│       └── Database (20ms)
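A trace is a tree of spans, so a common first question is where the time actually went. The sketch below models the example trace and computes each span's self time (its duration minus its direct children). The `Span` shape and `selfTimes` function are hypothetical, and the calculation assumes children run sequentially rather than in parallel.

```typescript
interface Span {
  name: string;
  durationMs: number;
  children: Span[];
}

// Self time = a span's duration minus the time spent in its direct children.
// Assumes children do not overlap; parallel children would need interval merging.
function selfTimes(
  span: Span,
  out: Map<string, number> = new Map(),
): Map<string, number> {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  out.set(span.name, span.durationMs - childTotal);
  for (const child of span.children) {
    selfTimes(child, out);
  }
  return out;
}

const exampleTrace: Span = {
  name: "API Gateway",
  durationMs: 50,
  children: [
    { name: "Auth Service", durationMs: 10, children: [] },
    {
      name: "Order Service",
      durationMs: 35,
      children: [{ name: "Database", durationMs: 20, children: [] }],
    },
  ],
};

const selfTimeBySpan = selfTimes(exampleTrace);
// API Gateway: 5, Auth Service: 10, Order Service: 15, Database: 20
```

Here the database accounts for the largest share of the 50ms request, which is not obvious from the top-level duration alone.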
OpenTelemetry Setup
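import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Export traces over OTLP/HTTP to a collector; adjust the URL for your environment.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://collector:4318/v1/traces',
  }),
  serviceName: 'my-service',
});

// Start the SDK before the rest of the application runs so spans are captured from startup.
sdk.start();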
Key Metrics
RED Method (Request-focused)
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency
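As a rough illustration of the three RED signals, the sketch below derives them from a batch of request records. `RequestRecord` and `summarizeRed` are hypothetical names, the records are assumed to already fall inside the window, a failure is assumed to mean a 5xx status, and duration is reduced to a mean for brevity (percentiles are usually more informative).

```typescript
interface RequestRecord {
  durationMs: number;
  status: number;
}

interface RedSummary {
  ratePerSec: number;
  errorsPerSec: number;
  meanDurationMs: number;
}

// Assumes every record was observed within the given window.
function summarizeRed(
  records: RequestRecord[],
  windowSeconds: number,
): RedSummary {
  const failed = records.filter((r) => r.status >= 500).length;
  const totalDuration = records.reduce((sum, r) => sum + r.durationMs, 0);
  return {
    ratePerSec: records.length / windowSeconds,
    errorsPerSec: failed / windowSeconds,
    meanDurationMs: records.length === 0 ? 0 : totalDuration / records.length,
  };
}

const redSummary = summarizeRed(
  [
    { durationMs: 100, status: 200 },
    { durationMs: 200, status: 200 },
    { durationMs: 300, status: 500 },
    { durationMs: 400, status: 200 },
  ],
  2, // two-second window
);
// ratePerSec: 2, errorsPerSec: 0.5, meanDurationMs: 250
```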
USE Method (Resource-focused)
- Utilization: Percentage of time the resource is busy
- Saturation: Work the resource cannot service yet (e.g. queue depth)
- Errors: Count of error events
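The USE signals can be computed the same way from periodic resource samples. This is a sketch under stated assumptions: `ResourceSample` and `summarizeUse` are hypothetical, utilization is busy time over elapsed time, and saturation is approximated by the average queue depth across samples.

```typescript
interface ResourceSample {
  busyMs: number; // time the resource spent doing work during the interval
  intervalMs: number; // length of the sampling interval
  queueDepth: number; // work items waiting when the sample was taken
  errors: number; // error events during the interval
}

interface UseSummary {
  utilization: number; // fraction in the range 0..1
  saturation: number; // average queue depth
  errors: number; // total error count
}

function summarizeUse(samples: ResourceSample[]): UseSummary {
  const busy = samples.reduce((sum, s) => sum + s.busyMs, 0);
  const elapsed = samples.reduce((sum, s) => sum + s.intervalMs, 0);
  const queued = samples.reduce((sum, s) => sum + s.queueDepth, 0);
  return {
    utilization: elapsed === 0 ? 0 : busy / elapsed,
    saturation: samples.length === 0 ? 0 : queued / samples.length,
    errors: samples.reduce((sum, s) => sum + s.errors, 0),
  };
}

const useSummary = summarizeUse([
  { busyMs: 800, intervalMs: 1000, queueDepth: 2, errors: 0 },
  { busyMs: 600, intervalMs: 1000, queueDepth: 4, errors: 1 },
]);
// utilization: 0.7, saturation: 3, errors: 1
```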
Alerting
Good Alerts
- Actionable: The responder can take a concrete step to fix it
- Urgent: It needs attention now, not next week
- Specific: It is clear what is wrong and where
Alert Template
alert: HighErrorRate
expr: |
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    /
  sum by (service) (rate(http_requests_total[5m])) > 0.01
for: 5m
labels:
  severity: critical
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: "Error rate is {{ $value | humanizePercentage }}"
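The `for: 5m` clause means the condition must hold continuously for five minutes before the alert fires, which filters out short spikes. The sketch below illustrates that idea as a small state machine. `ForAlert` is a hypothetical class and a simplification of how a rule evaluator behaves, not Prometheus code.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// Hypothetical model of a threshold alert with a hold duration.
class ForAlert {
  private readonly threshold: number;
  private readonly forMs: number;
  private breachedSinceMs: number | null = null;

  constructor(threshold: number, forMs: number) {
    this.threshold = threshold;
    this.forMs = forMs;
  }

  // Call once per evaluation interval with the current value of the expression.
  evaluate(value: number, nowMs: number): AlertState {
    if (value <= this.threshold) {
      this.breachedSinceMs = null; // any recovery resets the timer
      return "inactive";
    }
    const since = this.breachedSinceMs ?? nowMs;
    this.breachedSinceMs = since;
    return nowMs - since >= this.forMs ? "firing" : "pending";
  }
}

const highErrorRate = new ForAlert(0.01, 5 * 60 * 1000);
const alertStates: AlertState[] = [
  highErrorRate.evaluate(0.02, 0), // pending: breach just started
  highErrorRate.evaluate(0.03, 4 * 60 * 1000), // pending: only 4 minutes so far
  highErrorRate.evaluate(0.02, 5 * 60 * 1000), // firing: held for 5 minutes
  highErrorRate.evaluate(0.005, 6 * 60 * 1000), // inactive: back under threshold
];
```

A longer hold duration trades detection speed for fewer false pages, so it is worth tuning per alert.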
Dashboards
Essential panels:
- Request rate
- Error rate
- Latency (P50, P95, P99)
- Saturation (CPU, memory)
- Active alerts
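For the latency panel, P50, P95, and P99 are percentiles of the request duration distribution. A minimal sketch using the nearest-rank method is shown below. `percentile` is a hypothetical helper, and production systems typically estimate these values from histograms rather than raw samples.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const latenciesMs = [12, 15, 20, 22, 25, 30, 45, 80, 120, 400];
const p50 = percentile(latenciesMs, 50); // 25
const p95 = percentile(latenciesMs, 95); // 400
const p99 = percentile(latenciesMs, 99); // 400
```

With only ten samples P95 and P99 land on the same value, and the mean (about 77ms) describes neither the typical request nor the slow tail, which is why dashboards show percentiles side by side.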