Prometheus & Grafana

Collect metrics and visualize system performance with the Prometheus-Grafana stack.

When to Use This Skill

Use this skill when:

  • Setting up metrics collection infrastructure
  • Creating monitoring dashboards
  • Writing PromQL queries for analysis
  • Configuring alerting rules
  • Monitoring Kubernetes clusters

Prerequisites

  • Docker or Kubernetes for deployment
  • Network access to monitored targets
  • Basic understanding of metrics concepts

Prometheus Setup

Docker Deployment

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus-data:
  grafana-data:
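
With both services defined, the stack comes up with one command; the ports below match the mappings in the compose file above.

# Start Prometheus and Grafana in the background
docker compose up -d

# Prometheus UI: http://localhost:9090
# Grafana UI:    http://localhost:3000 (admin / admin, per GF_SECURITY_ADMIN_PASSWORD)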

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'

  - job_name: 'applications'
    static_configs:
      - targets:
          - 'app1:8080'
          - 'app2:8080'
    metrics_path: /metrics
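
Prometheus only re-reads this file on startup (or on SIGHUP), so it is worth validating changes first with promtool, which ships in the prom/prometheus image:

# Check config syntax plus any referenced rule files
promtool check config prometheus.yml

# Or run it inside the container started above
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml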

Kubernetes Deployment

Using Helm

# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin
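
The chart exposes Grafana and Prometheus as ClusterIP services named after the release, so with the release name prometheus used above, port-forwarding is the quickest way in (check kubectl get svc -n monitoring if your names differ):

# Grafana (the service listens on port 80)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Prometheus server
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090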

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default
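
The selector above matches Service labels (not Pod labels), and port: metrics must be a named port on that Service. A minimal matching Service is sketched below; note that kube-prometheus-stack by default only discovers ServiceMonitors carrying the Helm release label (e.g. release: prometheus) unless that filter is disabled.

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
    - name: metrics   # referenced by the ServiceMonitor endpoint
      port: 8080
      targetPort: 8080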

PromQL Queries

Basic Queries

# CPU usage (fraction of non-idle time, per instance)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Rate of HTTP requests per second
rate(http_requests_total[5m])

# Average response time over the last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Aggregations

# Sum requests by status code
sum by (status) (rate(http_requests_total[5m]))

# Average CPU by instance
avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Top 5 endpoints by request count
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# 95th percentile latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Time-Based Queries

# Compare to 1 hour ago
http_requests_total - http_requests_total offset 1h

# Predict disk space in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600)

# Changes in last 5 minutes
changes(up[5m])

# Average available memory over 24 hours (avg_over_time suits gauges, not raw counters)
avg_over_time(node_memory_MemAvailable_bytes[24h])

Alerting Rules

# rules/alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

Alertmanager

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
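
Alertmanager ships with amtool, which validates this file and can show where a given label set would be routed:

# Validate the configuration
amtool check-config alertmanager.yml

# Trace routing for a hypothetical critical alert
amtool config routes test --config.file=alertmanager.yml severity=critical alertname=ServiceDown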

Grafana Dashboards

Dashboard JSON Structure

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status_code)",
            "legendFormat": "{{ status_code }}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Latency P95",
        "type": "gauge",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 8}
      }
    ]
  }
}
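
A payload shaped like this can be created through Grafana's HTTP API; the host and admin credentials below assume the Docker setup from earlier, and adding "overwrite": true next to the "dashboard" key updates an existing dashboard instead of failing.

curl -s -u admin:admin \
  -H 'Content-Type: application/json' \
  -d @dashboard.json \
  http://localhost:3000/api/dashboards/db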

Provisioning Dashboards

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
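
For the Docker setup above, this provider file and the dashboard JSON files both need to be mounted into the container; Grafana reads provisioning config from /etc/grafana/provisioning by default. The extra volume entries for the grafana service would look roughly like this (host paths are illustrative):

  grafana:
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards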

Data Source Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Recording Rules

# rules/recording.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:avg_rate5m
        expr: |
          avg by (instance) (
            rate(node_cpu_seconds_total{mode!="idle"}[5m])
          )

      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
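
Recorded series behave like any other metric, so dashboards and alerts can query the precomputed name instead of re-aggregating raw samples on every evaluation:

# Cheap to evaluate, because the aggregation already ran at record time
topk(3, job:http_requests:rate5m)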

Application Instrumentation

Go Application

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "endpoint", "status"},
)

func init() {
    prometheus.MustRegister(httpRequests)
}

func main() {
    // Expose metrics endpoint
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Node.js Application

const express = require('express');
const client = require('prom-client');

const app = express();

const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

// Middleware
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({
      method: req.method,
      endpoint: req.path,
      status: res.statusCode
    });
  });
  next();
});

// Expose metrics
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
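
prom-client can also export Node.js runtime metrics (heap usage, event loop lag, GC) alongside the custom counter with a single call:

// Register default process/runtime metrics on the global registry
client.collectDefaultMetrics();

app.listen(8080);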

Common Issues

Issue: Targets Not Discovered

Problem: Prometheus is not scraping targets.
Solution: Check network connectivity and verify target labels.

Issue: High Memory Usage

Problem: Prometheus is using excessive memory.
Solution: Reduce retention, use recording rules, and limit label cardinality.

Issue: Slow Queries

Problem: PromQL queries are timing out.
Solution: Use recording rules, limit time ranges, and optimize queries.

Issue: Missing Data Points

Problem: There are gaps in the metrics data.
Solution: Check the scrape interval and verify target availability.

Best Practices

  • Use recording rules for frequently used queries
  • Limit label cardinality to prevent memory issues
  • Set appropriate retention based on storage capacity
  • Use histogram metrics for latency measurement
  • Implement proper alerting thresholds
  • Version control dashboards as code
  • Use federation for large-scale deployments
  • Regularly review and prune unused metrics
