# Monitoring and Observability

## Prometheus and Grafana Setup

### Prometheus
#### Core Concepts
- Time-series database for metrics
- Pull-based metrics collection
- PromQL query language
- Alerting rules and notifications
#### Best Practices
- Use the appropriate metric type for each signal: Counter, Gauge, Histogram, or Summary (see the sketch after this list)
- Label metrics with relevant dimensions, but keep label cardinality bounded
- Follow Prometheus naming conventions (base units and suffixes such as `_total` and `_seconds`)
- Implement relabeling for metric filtering
- Use federation for multi-cluster setups
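A minimal sketch of the four metric types using the official `prometheus_client` Python library; the metric names and port here are illustrative, not prescriptive:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: monotonically increasing (requests served, errors seen)
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])

# Gauge: can go up and down (in-flight requests, queue depth)
IN_FLIGHT = Gauge("http_requests_in_flight", "In-flight HTTP requests")

# Histogram: server-side buckets; enables histogram_quantile() in PromQL
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

# Summary: client-side aggregated observations
PAYLOAD = Summary("http_request_size_bytes", "Request payload size")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    REQUESTS.labels(method="GET", status="200").inc()
    IN_FLIGHT.set(3)
    with LATENCY.time():  # records elapsed time as one observation
        pass              # ... handle a request here ...
    PAYLOAD.observe(512)
```

Histograms are usually the better default for latency because they can be aggregated across instances at query time; Summaries cannot.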
#### Configuration
- Configure scrape targets for services
- Use service discovery for dynamic targets
- Configure retention policies
- Implement remote write for long-term storage
- Use alert rules for proactive monitoring
#### PromQL Examples

```promql
# CPU usage rate
rate(process_cpu_seconds_total[5m])

# Request error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Memory usage as a percentage of total node memory
process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
```
### Grafana

#### Core Concepts
- Visualization and dashboard platform
- Multiple data source support
- Alerting and notifications
- Plugin ecosystem
#### Best Practices
- Use folder organization for dashboards
- Use dashboard variables for interactivity
- Implement dashboard versioning
- Use annotations for event marking
- Share dashboards via JSON export
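Dashboards exported as JSON can also be provisioned programmatically through Grafana's HTTP API. A minimal sketch, assuming a local Grafana instance and a service-account token; the URL, token, and file name are placeholders:

```python
import json
import requests

GRAFANA_URL = "http://localhost:3000"     # assumed local instance
API_TOKEN = "YOUR_SERVICE_ACCOUNT_TOKEN"  # placeholder

# A dashboard previously exported via "Share > Export" in the UI.
with open("dashboard.json") as f:
    dashboard = json.load(f)
dashboard["id"] = None  # let Grafana assign its own internal id

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
)
resp.raise_for_status()
print(resp.json()["url"])  # where the imported dashboard now lives
```

Committing the exported JSON to version control and pushing it through this endpoint in CI is one straightforward way to get the dashboard versioning recommended above.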
#### Dashboard Design
- Create role-specific dashboards (SRE, developer, business)
- Use appropriate visualization types (graph, gauge, table, stat)
- Implement drill-down capabilities
- Use consistent color schemes
- Include context and descriptions
#### Alerting
- Configure alert rules in Grafana
- Use notification channels (email, Slack, PagerDuty)
- Implement alert grouping and routing
- Use alert templates for clear messages
- Configure alert silencing and downtime
## CloudWatch (AWS) Monitoring

### Core Concepts
- Metrics: Time-series data points
- Dashboards: Visualizations of metrics
- Alarms: Threshold-based alerts
- Logs: Log data collection and analysis
- Events: Event-driven monitoring
### Best Practices

#### Metric Collection
- Use custom metrics for application-specific data (see the sketch after this list)
- Use metric filters for log-based metrics
- Use metric dimensions for filtering
- Implement metric aggregation
- Use metric streams for real-time processing
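For custom metrics specifically, a minimal sketch with `boto3`; the namespace, metric name, and dimension values are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")  # uses the default credential chain

# Publish one data point for an application-specific metric.
cloudwatch.put_metric_data(
    Namespace="Application",  # illustrative custom namespace
    MetricData=[
        {
            "MetricName": "RequestLatency",
            "Dimensions": [
                {"Name": "Service", "Value": "checkout"},
                {"Name": "Environment", "Value": "prod"},
            ],
            "Value": 123.0,
            "Unit": "Milliseconds",
        }
    ],
)
```

Each distinct dimension combination produces a separate metric, so keep dimension values bounded just as you would Prometheus labels.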
#### Dashboard Design
- Create service-specific dashboards
- Use widgets for different visualizations
- Implement dashboard variables
- Use cross-account dashboards
- Share dashboards with teams
#### Alarm Configuration
- Use appropriate alarm thresholds
- Implement alarm actions (SNS, Auto Scaling, EC2 actions)
- Use composite alarms for complex conditions
- Configure alarm states and transitions
- Use alarm tags for organization
### CloudWatch Logs
- Log Groups: Logical containers for logs
- Log Streams: Sequences of log events
- Metric Filters: Extract metrics from logs
- Subscription Filters: Stream logs to other services
- Insights: Query and analyze logs
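Insights queries can also be driven programmatically; a minimal `boto3` sketch, where the log group name and query string are illustrative:

```python
import time
import boto3

logs = boto3.client("logs")

# Start a Logs Insights query over the last hour.
start = logs.start_query(
    logGroupName="/app/api",  # illustrative log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
)

# Poll until the query finishes, then print the matched rows.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```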
### CloudWatch Examples

Alarm configuration:

```json
{
  "AlarmName": "HighCPUUsage",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold"
}
```

Metric filter:

```json
{
  "filterPattern": "[timestamp, request_id, status_code, latency]",
  "metricTransformations": [
    {
      "metricName": "RequestLatency",
      "metricNamespace": "Application",
      "metricValue": "$latency"
    }
  ]
}
```
## Azure Monitor

### Core Concepts
- Metrics: Time-series data
- Logs: Log data collection and analysis
- Alerts: Threshold-based alerts
- Dashboards: Visualizations
- Application Insights: Application monitoring
### Best Practices

#### Metric Collection
- Use custom metrics for application data
- Use metric dimensions for filtering
- Implement metric aggregation
- Use metric alerts for proactive monitoring
- Configure metric collection rules
#### Log Analytics
- Use Kusto Query Language (KQL) for log queries
- Create custom log tables
- Implement log collection rules
- Use log alerts for monitoring
- Configure log retention policies
#### Application Insights
- Enable distributed tracing (see the sketch after this list)
- Use custom telemetry
- Implement dependency tracking
- Configure performance counters
- Use smart detection for anomalies
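A sketch of distributed tracing plus custom telemetry via the `azure-monitor-opentelemetry` distro; the connection string is a placeholder, and exact APIs can shift between package versions:

```python
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Wire the OpenTelemetry SDK to Application Insights.
configure_azure_monitor(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000",  # placeholder
)

tracer = trace.get_tracer(__name__)

# Spans appear in Application Insights as requests/dependencies.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")  # custom telemetry dimension
    with tracer.start_as_current_span("charge_card"):
        pass  # dependency call would go here
```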
### Azure Monitor Examples

```kql
// Error rate over the last hour
requests
| where timestamp > ago(1h)
| summarize total = count(), failed = countif(success == false)
| project error_rate = 100.0 * failed / total

// Top 10 slowest request names (duration in ms)
requests
| where timestamp > ago(1h)
| where duration > 1000
| summarize count() by name
| top 10 by count_

// Top 10 exception types
exceptions
| where timestamp > ago(1h)
| summarize count() by type, problemId
| top 10 by count_
```
## Stackdriver (GCP) Monitoring

Stackdriver has since been rebranded as Google Cloud's operations suite (Cloud Monitoring, Cloud Logging, Cloud Trace); the concepts below are unchanged.

### Core Concepts
- Metrics: Time-series data
- Dashboards: Visualizations
- Alerting: Threshold-based alerts
- Logging: Log data collection
- Tracing: Distributed tracing
### Best Practices

#### Metric Collection
- Use custom metrics for application data (see the sketch after this list)
- Use metric labels for filtering
- Implement metric aggregation
- Use metric-based alerting policies
- Configure metric descriptors
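For the custom-metrics point above, a minimal sketch with the `google-cloud-monitoring` client; the project ID and metric type are illustrative:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # illustrative project

# Build one time series with a single data point.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/error_rate"  # illustrative metric
series.metric.labels["service"] = "checkout"
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 0.03}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```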
#### Dashboard Design
- Create service-specific dashboards
- Use dashboard variables
- Implement dashboard sharing
- Use dashboard templates
- Configure dashboard refresh intervals
#### Logging
- Use log sinks for log routing
- Implement log-based metrics
- Configure log exclusions
- Use log alerts for monitoring
- Configure log retention
### Stackdriver Examples

```yaml
# Alerting policy
displayName: "High Error Rate"
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      filter: 'metric.type="custom.googleapis.com/error_rate"'
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
```
## Logging and Log Aggregation

### ELK Stack (Elasticsearch, Logstash, Kibana)
- Elasticsearch: Search and analytics engine
- Logstash: Data processing pipeline
- Kibana: Visualization platform
- Beats: Data shippers
### Best Practices

#### Log Collection
- Use centralized logging
- Implement log shippers (Filebeat, Fluentd, Logstash)
- Use log parsing and normalization (see the structured-logging sketch after this list)
- Configure log retention policies
- Implement log archiving
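Parsing is much easier when services emit structured logs in the first place. A stdlib-only sketch of a JSON formatter that shippers like Filebeat or Fluentd can forward without custom grok patterns; the field names are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("api").info("request handled")
# {"ts": "...", "level": "INFO", "logger": "api", "message": "request handled"}
```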
#### Log Analysis
- Use index patterns for organization
- Implement log queries and filters
- Use saved searches for common queries
- Create visualizations for log data
- Use dashboards for log monitoring
#### Log Security
- Implement log encryption at rest
- Use secure log transmission (TLS)
- Implement log access controls
- Configure log audit trails
- Use log redaction for sensitive data
### Loki

#### Core Concepts
- Lightweight log aggregation system
- Label-based indexing
- Grafana integration
- PromQL-like query language (LogQL)
#### Best Practices
- Use appropriate low-cardinality log labels (see the push sketch after this list)
- Implement log retention policies
- Use log streams for organization
- Configure log scraping
- Implement log alerting
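A minimal sketch pushing a log line to Loki's HTTP API with a small, bounded label set; the Loki URL and labels are illustrative:

```python
import time
import requests

LOKI_URL = "http://localhost:3100"  # assumed local Loki

# Loki indexes only the labels, so keep them low-cardinality.
payload = {
    "streams": [
        {
            "stream": {"app": "checkout", "env": "prod"},  # illustrative labels
            "values": [
                # [timestamp in nanoseconds as a string, log line]
                [str(time.time_ns()), "order 12345 processed"],
            ],
        }
    ]
}

resp = requests.post(f"{LOKI_URL}/loki/api/v1/push", json=payload)
resp.raise_for_status()  # Loki returns 204 on success
```

In practice an agent such as Promtail or Grafana Alloy handles this scraping and pushing; the sketch just shows the wire format.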
## Alerting Strategies and Incident Response

### Alerting Best Practices

#### Alert Design
- Use meaningful alert names and descriptions
- Include relevant context in alerts
- Use appropriate severity levels
- Configure alert thresholds carefully
- Implement alert deduplication
#### Alert Routing
- Route alerts to appropriate teams
- Use escalation policies
- Configure on-call rotations
- Implement alert grouping
- Use notification channels (email, Slack, PagerDuty)
#### Alert Quality
- Reduce alert noise with proper filtering
- Implement alert suppression
- Use alert correlation
- Configure alert cooldown periods (see the sketch after this list)
- Implement alert auto-resolution
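Deduplication and cooldowns normally live in the alerting system itself (Alertmanager's grouping and `repeat_interval`, for instance), but the core idea is small enough to sketch; this class and its alert keys are hypothetical:

```python
import time

class AlertCooldown:
    """Suppress re-notification of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self.last_fired = {}  # alert key -> last notification time

    def should_notify(self, alert_key: str) -> bool:
        now = time.monotonic()
        if now - self.last_fired.get(alert_key, float("-inf")) < self.cooldown:
            return False  # duplicate within the cooldown window
        self.last_fired[alert_key] = now
        return True

gate = AlertCooldown(cooldown_seconds=300)
if gate.should_notify("HighErrorRate:checkout:prod"):
    print("send notification")  # route to email/Slack/PagerDuty here
```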
### Incident Response

#### Incident Lifecycle
- Detection: Identify incident
- Triage: Assess severity and impact
- Response: Mitigate incident
- Resolution: Restore service
- Post-Mortem: Learn and improve
#### Best Practices
- Use incident severity levels
- Implement incident communication
- Use runbooks for common incidents
- Conduct post-mortems
- Implement follow-up actions
#### Runbooks
- Document common incident scenarios
- Include step-by-step procedures
- Include relevant commands and tools
- Update runbooks based on incidents
- Share runbooks with teams
## SLO/SLI Definitions and Tracking

### SLI (Service Level Indicator)

- Definition: a quantitative measure of service performance
- Common SLIs:
  - Availability: percentage of time the service is operational
  - Latency: response time for requests
  - Error Rate: percentage of failed requests
  - Throughput: requests per second
  - Saturation: resource utilization
### SLO (Service Level Objective)

- Definition: a target value for an SLI
- Best Practices:
  - Set realistic SLOs based on business requirements
  - Use SLOs to drive reliability improvements
  - Monitor SLOs continuously
  - Alert on SLO breaches
  - Use error budgets to balance reliability and innovation
### Error Budget

- Definition: the allowable amount of unreliability
- Calculation: Error Budget = 100% - SLO
- Best Practices:
  - Use the error budget to guide release decisions
  - Freeze deployments when the error budget is exhausted
  - Implement error budget alerts
  - Track error budget consumption
  - Use the error budget for reliability planning
### SLO/SLI Examples

```yaml
# SLO configuration
slo_name: "API Availability"
sli_name: "api_availability"
slo_target: 0.999
slo_window: 30d
alert_threshold: 0.998
```

```
# SLI calculation
api_availability = 1 - (error_count / total_count)

# Error budget
error_budget = 1 - slo_target
error_budget_remaining = current_availability - slo_target  # positive while within budget
```
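Translating these formulas into runnable form, a small Python sketch with illustrative inputs; the burn-rate calculation is an addition the examples above do not include:

```python
def error_budget_status(slo_target, good_events, total_events,
                        window_days=30, elapsed_days=10):
    """Return availability, remaining error budget, and a simple burn rate."""
    availability = good_events / total_events
    budget = 1 - slo_target       # e.g. 0.001 for a 99.9% SLO
    consumed = 1 - availability   # actual unreliability so far
    remaining = budget - consumed # negative means the budget is blown
    # Burn rate 1.0 means spending budget exactly as fast as the window allows.
    burn_rate = (consumed / budget) / (elapsed_days / window_days)
    return {"availability": availability,
            "budget_remaining": remaining,
            "burn_rate": burn_rate}

print(error_budget_status(0.999, good_events=999_400, total_events=1_000_000))
# burn_rate is 1.8 here: at this pace the 30-day budget runs out early
```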
### Monitoring SLOs

- Tools:
  - Prometheus and Grafana
  - CloudWatch SLOs
  - Azure Monitor SLOs
  - Stackdriver SLOs
  - SRE-specific tools (Sloth, sli-exporter)
- Best Practices:
  - Visualize SLOs in dashboards
  - Alert on SLO breaches
  - Track SLO trends over time
  - Compare SLOs across services
  - Use SLOs for capacity planning
## Observability Best Practices

### The Three Pillars of Observability
- Metrics: Quantitative data points
- Logs: Discrete events
- Traces: Request paths through distributed systems
### Distributed Tracing

#### Core Concepts
- Trace: End-to-end request path
- Span: Individual operation
- Trace ID: Unique identifier for trace
- Span ID: Unique identifier for span
#### Best Practices
- Use distributed tracing for microservices
- Implement trace sampling (see the sketch after this list)
- Use trace context propagation
- Configure trace retention
- Analyze traces for performance issues
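A minimal sketch of sampling and context-carrying nested spans with the OpenTelemetry Python SDK; the console exporter and the 10% sampling rate are illustrative, and a production setup would export to Jaeger, Zipkin, or a vendor backend instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces; export finished spans to stdout for this sketch.
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Nested spans share a trace ID; context propagates automatically in-process
# (and via W3C traceparent headers across services with instrumented clients).
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.method", "GET")
    with tracer.start_as_current_span("db_query"):
        pass  # the database call would go here
```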
#### Tools
- Jaeger
- Zipkin
- AWS X-Ray
- Azure Application Insights
- Google Cloud Trace
### Observability Patterns

- RED Method: Rate, Errors, Duration (for request-driven services)
- USE Method: Utilization, Saturation, Errors (for resources)
- Four Golden Signals: Latency, Traffic, Errors, Saturation (from the Google SRE book)
### Observability Maturity Model
- Level 1: Basic metrics and logging
- Level 2: Structured logging and metrics
- Level 3: Distributed tracing
- Level 4: Automated alerting and incident response
- Level 5: SLO-driven development and error budgets