Monitoring and Observability

Prometheus and Grafana Setup

Prometheus

  • Core Concepts

    • Time-series database for metrics
    • Pull-based metrics collection
    • PromQL query language
    • Alerting rules and notifications
  • Best Practices

    • Use appropriate metric types (Counter, Gauge, Histogram, Summary)
    • Label metrics with relevant dimensions
    • Use metric naming conventions
    • Implement relabeling for metric filtering
    • Use federation for multi-cluster setups
  • Configuration

    • Configure scrape targets for services
    • Use service discovery for dynamic targets
    • Configure retention policies
    • Implement remote write for long-term storage
    • Use alert rules for proactive monitoring
  • PromQL Examples

    # CPU usage rate
    rate(process_cpu_seconds_total[5m])
    
    # Request error rate
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    
    # P95 latency
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    
    # Node memory usage (%)
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
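
  • Alerting Rules Example

    A minimal sketch of a Prometheus rules file that turns the error-rate query above into an alert; the job grouping, 5% threshold, and severity value are assumptions rather than values from this skill. The file would be referenced from prometheus.yml via rule_files.

    # rules/api-alerts.yml (hypothetical file name)
    groups:
      - name: api-alerts
        rules:
          - alert: HighErrorRate
            # Fire when more than 5% of requests return 5xx for 10 minutes
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
                / sum by (job) (rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Error rate above 5% for {{ $labels.job }}"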
    

Grafana

  • Core Concepts

    • Visualization and dashboard platform
    • Multiple data source support
    • Alerting and notifications
    • Plugin ecosystem
  • Best Practices

    • Use folder organization for dashboards
    • Use dashboard variables for interactivity
    • Implement dashboard versioning
    • Use annotations for event marking
    • Share dashboards via JSON export
  • Dashboard Design

    • Create role-specific dashboards (SRE, developer, business)
    • Use appropriate visualization types (graph, gauge, table, stat)
    • Implement drill-down capabilities
    • Use consistent color schemes
    • Include context and descriptions
  • Alerting

    • Configure alert rules in Grafana
    • Use notification channels (email, Slack, PagerDuty)
    • Implement alert grouping and routing
    • Use alert templates for clear messages
    • Configure alert silencing and downtime
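  • Provisioning Example

    A minimal sketch of file-based data source provisioning; the Prometheus URL and file path are assumptions about the deployment. Dashboards and alert rules can be provisioned from files under the same provisioning directory.

    # /etc/grafana/provisioning/datasources/prometheus.yml
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true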

CloudWatch (AWS) Monitoring

Core Concepts

  • Metrics: Time-series data points
  • Dashboards: Visualizations of metrics
  • Alarms: Threshold-based alerts
  • Logs: Log data collection and analysis
  • Events: Event-driven monitoring

Best Practices

  • Metric Collection

    • Use custom metrics for application-specific data
    • Use metric filters for log-based metrics
    • Use metric dimensions for filtering
    • Implement metric aggregation
    • Use metric streams for real-time processing
  • Dashboard Design

    • Create service-specific dashboards
    • Use widgets for different visualizations
    • Implement dashboard variables
    • Use cross-account dashboards
    • Share dashboards with teams
  • Alarm Configuration

    • Use appropriate alarm thresholds
    • Implement alarm actions (SNS, Auto Scaling, EC2 actions)
    • Use composite alarms for complex conditions
    • Configure alarm states and transitions
    • Use alarm tags for organization

CloudWatch Logs

  • Log Groups: Logical containers for logs
  • Log Streams: Sequences of log events
  • Metric Filters: Extract metrics from logs
  • Subscription Filters: Stream logs to other services
  • Insights: Query and analyze logs

CloudWatch Examples

// Alarm configuration
{
  "AlarmName": "HighCPUUsage",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold"
}

// Metric filter
{
  "filterPattern": "[timestamp, request_id, status_code, latency]",
  "metricTransformations": [
    {
      "metricName": "RequestLatency",
      "metricNamespace": "Application",
      "metricValue": "$latency"
    }
  ]
}
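
The same alarm can be expressed as infrastructure-as-code. A hedged CloudFormation sketch; the ops-alerts SNS topic and the instance ID are placeholders, not resources defined in this skill.

# CloudFormation template fragment
Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighCPUUsage
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value: i-0123456789abcdef0      # placeholder instance
      AlarmActions:
        # Notify an assumed SNS topic named "ops-alerts" in the same account/region
        - !Sub arn:aws:sns:${AWS::Region}:${AWS::AccountId}:ops-alerts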

Azure Monitor

Core Concepts

  • Metrics: Time-series data
  • Logs: Log data collection and analysis
  • Alerts: Threshold-based alerts
  • Dashboards: Visualizations
  • Application Insights: Application monitoring

Best Practices

  • Metric Collection

    • Use custom metrics for application data
    • Use metric dimensions for filtering
    • Implement metric aggregation
    • Use metric alerts for proactive monitoring
    • Configure metric collection rules
  • Log Analytics

    • Use Kusto Query Language (KQL) for log queries
    • Create custom log tables
    • Implement log collection rules
    • Use log alerts for monitoring
    • Configure log retention policies
  • Application Insights

    • Enable distributed tracing
    • Use custom telemetry
    • Implement dependency tracking
    • Configure performance counters
    • Use smart detection for anomalies

Azure Monitor Examples

// KQL query for error rate
requests
| where timestamp > ago(1h)
| summarize total = count(), failed = countif(success == false)
| project error_rate = 100.0 * failed / total

// Query for slow requests
requests
| where timestamp > ago(1h)
| where duration > 1000
| summarize count() by name
| top 10 by count_

// Query for exceptions
exceptions
| where timestamp > ago(1h)
| summarize count() by type, problemId
| top 10 by count_

Google Cloud Monitoring (formerly Stackdriver)

Core Concepts

  • Metrics: Time-series data
  • Dashboards: Visualizations
  • Alerting: Threshold-based alerts
  • Logging: Log data collection
  • Tracing: Distributed tracing

Best Practices

  • Metric Collection

    • Use custom metrics for application data
    • Use metric labels for filtering
    • Implement metric aggregation
    • Use metric-based alerting policies
    • Configure metric descriptors
  • Dashboard Design

    • Create service-specific dashboards
    • Use dashboard variables
    • Implement dashboard sharing
    • Use dashboard templates
    • Configure dashboard refresh intervals
  • Logging

    • Use log sinks for log routing
    • Implement log-based metrics
    • Configure log exclusions
    • Use log alerts for monitoring
    • Configure log retention

Cloud Monitoring Examples

# Alerting policy
displayName: "High Error Rate"
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      filter: 'metric.type="custom.googleapis.com/error_rate"'
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE

Logging and Log Aggregation

ELK Stack (Elasticsearch, Logstash, Kibana)

  • Elasticsearch: Search and analytics engine
  • Logstash: Data processing pipeline
  • Kibana: Visualization platform
  • Beats: Data shippers

Best Practices

  • Log Collection

    • Use centralized logging
    • Implement log shippers (Filebeat, Fluentd, Logstash)
    • Use log parsing and normalization
    • Configure log retention policies
    • Implement log archiving
  • Log Analysis

    • Use index patterns for organization
    • Implement log queries and filters
    • Use saved searches for common queries
    • Create visualizations for log data
    • Use dashboards for log monitoring
  • Log Security

    • Implement log encryption at rest
    • Use secure log transmission (TLS)
    • Implement log access controls
    • Configure log audit trails
    • Use log redaction for sensitive data
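
ELK Stack Example

A minimal Filebeat sketch shipping application logs to Logstash, as one example of the log shippers mentioned above; the log path, Logstash host, and added fields are placeholder assumptions.

# filebeat.yml
filebeat.inputs:
  - type: filestream
    id: app-logs
    paths:
      - /var/log/app/*.log
    fields:
      service: my-app              # hypothetical service name
      environment: production
output.logstash:
  hosts: ["logstash:5044"]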

Loki

  • Core Concepts

    • Lightweight log aggregation system
    • Label-based indexing
    • Grafana integration
    • PromQL-like query language (LogQL)
  • Best Practices

    • Use appropriate log labels
    • Implement log retention policies
    • Use log streams for organization
    • Configure log scraping
    • Implement log alerting
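  • Promtail Configuration Example

    A minimal sketch of a Promtail configuration that tails local files and pushes them to Loki with a small, low-cardinality label set; the Loki URL and log path are placeholder assumptions.

    # promtail-config.yml
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: app
        static_configs:
          - targets: [localhost]
            labels:
              job: app
              __path__: /var/log/app/*.log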

Alerting Strategies and Incident Response

Alerting Best Practices

  • Alert Design

    • Use meaningful alert names and descriptions
    • Include relevant context in alerts
    • Use appropriate severity levels
    • Configure alert thresholds carefully
    • Implement alert deduplication
  • Alert Routing

    • Route alerts to appropriate teams
    • Use escalation policies
    • Configure on-call rotations
    • Implement alert grouping
    • Use notification channels (email, Slack, PagerDuty)
  • Alert Quality

    • Reduce alert noise with proper filtering
    • Implement alert suppression
    • Use alert correlation
    • Configure alert cooldown periods
    • Implement alert auto-resolution
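  • Alertmanager Example

    A hedged sketch of alert grouping and routing in Prometheus Alertmanager: critical alerts page through PagerDuty while everything else goes to Slack. The receiver names, webhook URL, channel, and integration key are placeholders.

    # alertmanager.yml
    route:
      receiver: slack-default
      group_by: ['alertname', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - matchers:
            - severity="critical"
          receiver: pagerduty-oncall
    receivers:
      - name: slack-default
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
            channel: '#alerts'
            send_resolved: true
      - name: pagerduty-oncall
        pagerduty_configs:
          - routing_key: <pagerduty-integration-key>        # placeholder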

Incident Response

  • Incident Lifecycle

    • Detection: Identify incident
    • Triage: Assess severity and impact
    • Response: Mitigate incident
    • Resolution: Restore service
    • Post-Mortem: Learn and improve
  • Best Practices

    • Use incident severity levels
    • Implement incident communication
    • Use runbooks for common incidents
    • Conduct post-mortems
    • Implement follow-up actions
  • Runbooks

    • Document common incident scenarios
    • Include step-by-step procedures
    • Include relevant commands and tools
    • Update runbooks based on incidents
    • Share runbooks with teams

SLO/SLI Definitions and Tracking

SLI (Service Level Indicator)

  • Definition: Quantitative measure of service performance
  • Common SLIs:
    • Availability: Percentage of time service is operational
    • Latency: Response time for requests
    • Error Rate: Percentage of failed requests
    • Throughput: Requests per second
    • Saturation: Resource utilization

SLO (Service Level Objective)

  • Definition: Target value for SLI
  • Best Practices:
    • Set realistic SLOs based on business requirements
    • Use SLOs to drive reliability improvements
    • Monitor SLOs continuously
    • Alert on SLO breaches
    • Use error budgets for balancing reliability and innovation

Error Budget

  • Definition: Allowable amount of unreliability
  • Calculation: Error Budget = 100% - SLO
  • Best Practices:
    • Use error budget to guide release decisions
    • Freeze deployments when error budget is exhausted
    • Implement error budget alerts
    • Track error budget consumption
    • Use error budget for reliability planning

SLO/SLI Examples

# SLO configuration
slo_name: "API Availability"
sli_name: "api_availability"
slo_target: 0.999
slo_window: 30d
alert_threshold: 0.998

# SLI calculation
api_availability = 1 - (error_count / total_count)

# Error budget
error_budget = 1 - slo_target
error_budget_remaining = error_budget - (1 - current_availability)
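
These formulas translate directly into Prometheus recording rules. A hedged sketch, assuming the http_requests_total metric from the PromQL examples earlier and the 0.999 target above:

# slo-recording-rules.yml
groups:
  - name: api-slo
    rules:
      - record: sli:api_availability:ratio_30d
        # Availability SLI = 1 - (error rate over the 30-day window)
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[30d]))
              / sum(rate(http_requests_total[30d]))
          )
      - record: slo:api_error_budget_remaining:ratio
        # Remaining budget = measured availability - SLO target (0.999 assumed)
        expr: sli:api_availability:ratio_30d - 0.999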

Monitoring SLOs

  • Tools:

    • Prometheus and Grafana
    • CloudWatch SLOs
    • Azure Monitor SLOs
    • Cloud Monitoring SLOs (GCP)
    • SRE-specific tools (Sloth, slo-exporter)
  • Best Practices:

    • Visualize SLOs in dashboards
    • Alert on SLO breaches
    • Track SLO trends over time
    • Compare SLOs across services
    • Use SLOs for capacity planning
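  • SLO Alert Example

    A hedged sketch of alerting on an SLO breach, reusing the sli:api_availability:ratio_30d recording rule sketched under SLO/SLI Examples and the 0.999 target assumed there.

    groups:
      - name: api-slo-alerts
        rules:
          - alert: SLOAvailabilityBreach
            # Fire when 30-day availability drops below the 99.9% objective
            expr: sli:api_availability:ratio_30d < 0.999
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "API availability SLO breached"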

Observability Best Practices

The Three Pillars of Observability

  • Metrics: Quantitative data points
  • Logs: Discrete events
  • Traces: Request paths through distributed systems

Distributed Tracing

  • Core Concepts

    • Trace: End-to-end request path
    • Span: Individual operation
    • Trace ID: Unique identifier for trace
    • Span ID: Unique identifier for span
  • Best Practices

    • Use distributed tracing for microservices
    • Implement trace sampling
    • Use trace context propagation
    • Configure trace retention
    • Analyze traces for performance issues
  • Tools

    • Jaeger
    • Zipkin
    • AWS X-Ray
    • Azure Application Insights
    • Google Cloud Trace
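  • Sampling Configuration Example

    One way to implement trace sampling is an OpenTelemetry Collector in front of one of the backends above; this is a hedged sketch (the Collector is an assumption, not something those tools require) that keeps roughly 10% of traces and forwards them over OTLP to a placeholder Jaeger endpoint.

    # otel-collector-config.yml
    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      probabilistic_sampler:
        sampling_percentage: 10    # keep ~10% of traces
      batch:
    exporters:
      otlp:
        endpoint: jaeger:4317      # placeholder backend address
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [probabilistic_sampler, batch]
          exporters: [otlp]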

Observability Patterns

  • RED Method: Rate, Errors, Duration
  • USE Method: Utilization, Saturation, Errors
  • Four Golden Signals: Latency, Traffic, Errors, Saturation
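
The RED method maps directly onto PromQL. A minimal sketch of recording rules, assuming the http_requests_total and http_request_duration_seconds metrics used in the Prometheus examples earlier:

# red-recording-rules.yml
groups:
  - name: red-method
    rules:
      - record: job:http_requests:rate5m                    # Rate
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_errors:rate5m              # Errors
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_request_duration_seconds:p95       # Duration
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))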

Observability Maturity Model

  • Level 1: Basic metrics and logging
  • Level 2: Structured logging and metrics
  • Level 3: Distributed tracing
  • Level 4: Automated alerting and incident response
  • Level 5: SLO-driven development and error budgets