skills/yonatangross/skillforge-claude-plugin/Observability & Monitoring

Observability & Monitoring

SKILL.md

Observability & Monitoring Skill

Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.

Overview

  • Setting up application monitoring
  • Implementing structured logging
  • Adding metrics and dashboards
  • Configuring distributed tracing
  • Creating alerting rules
  • Debugging production issues

Three Pillars of Observability

+-----------------+-----------------+-----------------+
|     LOGS        |     METRICS     |     TRACES      |
+-----------------+-----------------+-----------------+
| What happened   | How is system   | How do requests |
| at specific     | performing      | flow through    |
| point in time   | over time       | services        |
+-----------------+-----------------+-----------------+

References

Logging Patterns

See: references/logging-patterns.md

Key topics covered:

  • Correlation IDs for cross-service request tracking
  • Log sampling strategies for high-traffic systems
  • LogQL queries for Loki log aggregation
  • OrchestKit structlog configuration example

Metrics Collection

See: references/metrics-collection.md

Key topics covered:

  • Counter, Gauge, Histogram, Summary metric types
  • Cardinality management and limits
  • Custom business metrics (LLM tokens, cache hit rates)
  • LLM cost tracking with Prometheus

Distributed Tracing

See: references/distributed-tracing.md

Key topics covered:

  • OpenTelemetry setup and auto-instrumentation
  • Span relationships (parent/child, parallel)
  • Head-based and tail-based sampling strategies
  • Trace context propagation across services

Alerting and Dashboards

See: references/alerting-dashboards.md

Key topics covered:

  • Alert severity levels and response times
  • Alert grouping and inhibition rules
  • Escalation policies and runbook links
  • Golden Signals dashboard design
  • SLO/SLI definitions and error budgets

Quick Reference

Log Levels

Level Use Case
ERROR Unhandled exceptions, failed operations
WARN Deprecated API, retry attempts
INFO Business events, successful operations
DEBUG Development troubleshooting

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency distribution

Prometheus Buckets

// HTTP request latency
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

Key Alerts

Alert Condition Severity
ServiceDown up == 0 for 1m Critical
HighErrorRate 5xx > 5% for 5m Critical
HighLatency p95 > 2s for 5m High
LowCacheHitRate < 70% for 10m Medium

Health Checks (Kubernetes)

Probe Purpose Endpoint
Liveness Is app running? /health
Readiness Ready for traffic? /ready
Startup Finished starting? /startup

Observability Checklist

Implementation

  • JSON structured logging
  • Request correlation IDs
  • RED metrics (Rate, Errors, Duration)
  • Business metrics
  • Distributed tracing
  • Health check endpoints

Alerting

  • Service outage alerts
  • Error rate thresholds
  • Latency thresholds
  • Resource utilization alerts

Dashboards

  • Service overview
  • Error analysis
  • Performance metrics

Templates Reference

Template Purpose
structured-logging.ts Winston logger with request middleware
prometheus-metrics.ts HTTP, DB, cache metrics with middleware
opentelemetry-tracing.ts Distributed tracing setup
alerting-rules.yml Prometheus alerting rules
health-checks.ts Liveness, readiness, startup probes

Langfuse Integration

For LLM observability, use Langfuse decorators:

from langfuse.decorators import observe, langfuse_context

@observe(name="analyze_content")
async def analyze_content(url: str) -> AnalysisResult:
    langfuse_context.update_current_trace(
        name="content_analysis",
        user_id="system",
        metadata={"url": url}
    )
    # ... workflow implementation

See examples/orchestkit-monitoring-dashboard.md for real-world examples.

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

  • Incident investigation - Correlating logs, metrics, traces
  • Alert tuning - Reducing noise, catching real issues
  • Architecture decisions - Choosing monitoring solutions
  • Performance debugging - Cross-service latency analysis

Related Skills

  • defense-in-depth - Layer 8 observability as part of security architecture
  • devops-deployment - Observability integration with CI/CD and Kubernetes
  • resilience-patterns - Monitoring circuit breakers and failure scenarios
  • fastapi-advanced - FastAPI-specific middleware for logging and metrics

Key Decisions

Decision Choice Rationale
Log format Structured JSON Machine-parseable, supports log aggregation, enables queries
Metric types RED method (Rate, Errors, Duration) Industry standard, covers essential service health indicators
Tracing OpenTelemetry Vendor-neutral, auto-instrumentation, broad ecosystem support
Alerting severity 4 levels (Critical, High, Medium, Low) Clear escalation paths, appropriate response times

Capability Details

structured-logging

Keywords: logging, structured log, json log, correlation id, log level, winston, pino, structlog Solves:

  • How do I set up structured logging?
  • Implement correlation IDs across services
  • JSON logging best practices

correlation-tracking

Keywords: correlation id, request tracking, trace context, distributed logs Solves:

  • How do I track requests across services?
  • Implement correlation IDs in middleware
  • Find all logs for a single request

log-sampling

Keywords: log sampling, high traffic logging, sampling rate, log volume Solves:

  • How do I reduce log volume in production?
  • Sample INFO logs while keeping all errors

prometheus-metrics

Keywords: metrics, prometheus, counter, histogram, gauge, summary, red method Solves:

  • How do I collect application metrics?
  • Implement RED method (Rate, Errors, Duration)
  • Choose between Counter, Gauge, Histogram

metric-types

Keywords: counter, gauge, histogram, summary, bucket, quantile Solves:

  • When to use Counter vs Gauge?
  • Histogram vs Summary for latency
  • Configure histogram buckets

cardinality-management

Keywords: cardinality, label explosion, time series, prometheus performance Solves:

  • How do I prevent label cardinality explosions?
  • Fix unbounded labels (user IDs, request IDs)

distributed-tracing

Keywords: tracing, distributed tracing, opentelemetry, span, trace id, waterfall Solves:

  • How do I implement distributed tracing?
  • OpenTelemetry setup with auto-instrumentation
  • Create manual spans for custom operations

trace-sampling

Keywords: trace sampling, head-based sampling, tail-based sampling Solves:

  • How do I reduce trace volume?
  • Sample 10% of traces but keep all errors

alerting-strategy

Keywords: alert, alerting, notification, threshold, pagerduty, slack, severity Solves:

  • How do I set up effective alerts?
  • Define alert severity levels (P1-P4)

alert-fatigue-prevention

Keywords: alert fatigue, alert grouping, inhibition, escalation Solves:

  • How do I reduce alert noise?
  • Group related alerts together

dashboards

Keywords: dashboard, visualization, grafana, golden signals, red method Solves:

  • How do I create monitoring dashboards?
  • Design Golden Signals dashboard layout
  • Build SLO/SLI dashboards

health-checks

Keywords: health check, liveness, readiness, startup probe, kubernetes Solves:

  • How do I implement health check endpoints?
  • Difference between liveness and readiness

langfuse-observability

Keywords: langfuse, llm observability, llm tracing, token usage, llm cost tracking Solves:

  • How do I monitor LLM calls with Langfuse?
  • Track LLM token usage and cost

llm-cost-tracking

Keywords: llm cost, token tracking, cost optimization, prometheus llm metrics Solves:

  • How do I track LLM costs with Prometheus?
  • Measure token usage by model and operation
Weekly Installs
0
GitHub Stars
90
First Seen
Jan 1, 1970