Deploying Monitoring Stacks
Overview
Deploy production monitoring stacks (Prometheus + Grafana, Datadog, or Victoria Metrics) with metric collection, custom dashboards, and alerting rules. Configure exporters, scrape targets, recording rules, and notification channels for comprehensive infrastructure and application observability.
Prerequisites
- Target infrastructure identified: Kubernetes cluster, Docker hosts, or bare-metal servers
- Metric endpoints accessible from the monitoring platform (application `/metrics`, node exporters)
- Storage backend capacity planned for time-series data (Prometheus TSDB, Thanos, or Cortex for long-term)
- Alert notification channels defined: Slack webhook, PagerDuty integration key, or email SMTP
- Helm 3+ for Kubernetes deployments using kube-prometheus-stack or similar charts
Instructions
- Select the monitoring platform: Prometheus + Grafana for open-source self-hosted, Datadog for managed SaaS, Victoria Metrics for high-cardinality workloads
- Deploy the monitoring stack: `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack` on Kubernetes, or Docker Compose for non-Kubernetes hosts
- Install exporters on monitored systems: node-exporter for host metrics, kube-state-metrics for Kubernetes object states, application-specific exporters
- Configure scrape targets in `prometheus.yml`: define job names, scrape intervals, and relabeling rules for service discovery
- Create recording rules for frequently queried aggregations to reduce dashboard query load
- Define alerting rules with meaningful thresholds: high CPU (>80% for 5m), high memory (>90%), error rate (>1%), latency P99 (>500ms)
- Configure Alertmanager with routing, grouping, and notification channels (Slack, PagerDuty, email)
- Build Grafana dashboards: RED metrics (Rate, Errors, Duration) for services, USE metrics (Utilization, Saturation, Errors) for resources
- Set up data retention: configure TSDB retention period (15-30 days local), set up Thanos/Cortex for long-term storage if needed
- Test the full pipeline: trigger a test alert and verify notification delivery
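The recording and alerting steps above can be sketched as a single rules file. This is a minimal sketch, not a definitive rule set: it assumes plain Prometheus loading the file through `rule_files` (with kube-prometheus-stack the same groups would normally be wrapped in a `PrometheusRule` resource), node-exporter metrics for CPU and memory, and two hypothetical application metrics, `http_requests_total` with a `status` label and the histogram `http_request_duration_seconds`. The group names, rule names, `severity` labels, and the `for: 5m` on every alert other than CPU are also assumptions.

```yaml
# rules.yml (sketch): reference it from prometheus.yml under `rule_files`.
groups:
  - name: example-recording-rules        # hypothetical group name
    rules:
      # Precompute per-instance CPU utilization (0-100) so dashboards and
      # alerts do not re-evaluate the rate() on every query.
      - record: instance:node_cpu_utilization:percent
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

  - name: example-alerts                 # hypothetical group name
    rules:
      - alert: HighCpuUsage
        expr: instance:node_cpu_utilization:percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
        for: 5m                          # assumed duration; the threshold list gives none
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% on {{ $labels.instance }}"

      - alert: HighErrorRate
        # http_requests_total{status=...} is a hypothetical application metric.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 1% of requests are failing"

      - alert: HighP99Latency
        # http_request_duration_seconds is a hypothetical histogram, in seconds.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms"
```

A file like this can be validated before loading with `promtool check rules rules.yml`.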
Output
- Helm values file or Docker Compose for the monitoring stack
- Prometheus configuration with scrape targets, recording rules, and alerting rules
- Alertmanager configuration with routing tree and notification receivers
- Grafana dashboard JSON files for infrastructure and application metrics
- Exporter deployment manifests (node-exporter DaemonSet, application ServiceMonitor)
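As an illustration of the Alertmanager output, the routing tree and receivers might look roughly like the following. It is a sketch under assumptions: the receiver names, the `#alerts` channel, the `severity` label values, and the timing values are hypothetical, and the webhook URL and integration key are placeholders to be replaced with real secrets.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-default          # fallback receiver for anything not matched below
  group_by: ['alertname']          # one notification per alert name
  group_wait: 30s                  # wait briefly to batch alerts that fire together
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Page on-call only for alerts labelled severity="critical" (assumed label).
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall

receivers:
  - name: slack-default            # hypothetical receiver name
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK  # placeholder
        channel: '#alerts'         # hypothetical channel
        send_resolved: true
  - name: pagerduty-oncall         # hypothetical receiver name
    pagerduty_configs:
      - routing_key: REPLACE_WITH_INTEGRATION_KEY                       # placeholder
```

Alerts that match no child route fall through to the top-level receiver, so every alert is delivered somewhere even when a label is missing.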
Error Handling
| Error | Cause | Solution |
|---|---|---|
| No data points in dashboard | Scrape target not reachable or metric name wrong | Check the Targets page in the Prometheus UI; verify service discovery and metric name |
| Too many time series (high cardinality) | Labels with unbounded values (user IDs, request IDs) | Remove high-cardinality labels with `metric_relabel_configs`; use recording rules for aggregation |
| Alert condition met but no notification | Alertmanager routing or receiver misconfigured | Verify the Alertmanager config with `amtool check-config`; check which receiver a label set routes to with `amtool config routes test`; fire a test alert with `amtool alert add` |
| Prometheus OOMKilled | Insufficient memory for series count | Increase memory limits; reduce scrape targets or drop unneeded series and labels (memory use is driven mainly by the active series count) |
| Grafana datasource connection failed | Wrong Prometheus URL or network policy blocking access | Verify datasource URL in Grafana; check Kubernetes service name and port; review network policies |
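For the high-cardinality row, a `metric_relabel_configs` block in a scrape job could look something like this. It is only a sketch: the job name, target address, label names (`user_id`, `request_id`), and the `my_app_debug_` metric prefix are hypothetical.

```yaml
# prometheus.yml (fragment, sketch)
scrape_configs:
  - job_name: my-app                   # hypothetical job name
    scrape_interval: 30s
    static_configs:
      - targets: ['my-app:8080']       # hypothetical target
    metric_relabel_configs:
      # Drop unbounded labels from every series scraped by this job.
      - action: labeldrop
        regex: user_id|request_id
      # Drop an entire metric family that is not needed.
      - source_labels: [__name__]
        regex: my_app_debug_.*
        action: drop
```

Series must still be uniquely labeled after `labeldrop` removes a label, so where dropping a label would leave duplicates, removing it in the application's instrumentation is usually the cleaner fix.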
Examples
- "Deploy kube-prometheus-stack on Kubernetes with alerts for node CPU > 80%, pod restart count > 5, and API error rate > 1%, sending to Slack."
- "Set up Prometheus + Grafana on Docker Compose for monitoring 10 application servers with node-exporter and custom application metrics."
- "Create Grafana dashboards for the four golden signals (latency, traffic, errors, saturation) for a microservices application."
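For the Docker Compose example, a starting point might look like the sketch below. It assumes a `prometheus.yml` next to the compose file that lists the application servers as static targets, with node-exporter installed on each server and listening on its default port 9100. The volume names and the 15-day retention value are assumptions, and the image tags are left unpinned for brevity.

```yaml
# docker-compose.yml (sketch)
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d    # assumed local retention
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus          # persist TSDB across restarts
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    volumes:
      - grafana-data:/var/lib/grafana        # persist dashboards and settings
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:
```

Inside this Compose network, Grafana would reach Prometheus at `http://prometheus:9090`, which is the URL to use when adding the datasource.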
Resources
- Prometheus documentation: https://prometheus.io/docs/
- Grafana documentation: https://grafana.com/docs/grafana/latest/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- Alerting best practices: https://prometheus.io/docs/practices/alerting/
- Datadog documentation: https://docs.datadoghq.com/
Repository
jeremylongshore…s-skills
GitHub Stars
1.6K
First Seen
Feb 18, 2026