
Setup Prometheus Monitoring

Configure a production-ready Prometheus deployment with scrape targets, recording rules, and federation.

When to Use

  • Setting up centralized metrics collection for microservices or distributed systems
  • Implementing time-series monitoring for application and infrastructure metrics
  • Establishing a foundation for SLO/SLI tracking and alerting
  • Consolidating metrics from multiple Prometheus instances via federation
  • Migrating from legacy monitoring solutions to a modern observability stack

Inputs

  • Required: List of scrape targets (services, exporters, endpoints)
  • Required: Retention period and storage requirements
  • Optional: Existing service discovery mechanism (Kubernetes, Consul, EC2)
  • Optional: Recording rules for pre-aggregated metrics
  • Optional: Federation hierarchy for multi-cluster setups

Procedure

Step 1: Install and Configure Prometheus

Create the base Prometheus configuration with global settings and scrape intervals.

# Create Prometheus directory structure
mkdir -p /etc/prometheus/{rules,file_sd}
mkdir -p /var/lib/prometheus

# Download Prometheus (adjust version as needed)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
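
The systemd unit in Step 4 runs Prometheus as a dedicated prometheus user; a minimal sketch creating that system user and handing it the directories created above:

# Create a non-login system user for the daemon
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus

# Give it ownership of the config and data directories
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus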

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load recording and alerting rules
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

  # Node exporter for host metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
        labels:
          env: 'production'

  # Application metrics with file-based service discovery
  - job_name: 'app-services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/services.json'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [env]
        target_label: environment

Expected: Prometheus starts successfully (via a foreground run or the systemd unit configured in Step 4), web UI accessible at http://localhost:9090, targets listed under Status > Targets.

On failure:

  • Check syntax with promtool check config /etc/prometheus/prometheus.yml
  • Verify file permissions: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
  • Check logs: journalctl -u prometheus -f
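
Beyond the UI, target health can be checked from the command line; a quick sketch assuming jq is installed:

# List each scrape target's job, URL, and health state
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job) \(.scrapeUrl) \(.health)"'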

Step 2: Configure Service Discovery

Set up dynamic target discovery to avoid manual target management.

For Kubernetes environments, add to scrape_configs:

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified (join the pod address with the annotated port)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Add namespace as label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name
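
This job only keeps pods that opt in through the prometheus.io/scrape annotation. A sketch adding the annotations to a deployment's pod template (the web-app name and production namespace are illustrative assumptions):

# Annotate the pod template so new pods are picked up by the 'kubernetes-pods' job
kubectl patch deployment web-app -n production --type merge -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "8080"
  }}}}
}'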

For file-based service discovery, create /etc/prometheus/file_sd/services.json:

[
  {
    "targets": ["web-app-1:8080", "web-app-2:8080"],
    "labels": {
      "job": "web-app",
      "env": "production",
      "team": "platform"
    }
  },
  {
    "targets": ["api-service-1:9090", "api-service-2:9090"],
    "labels": {
      "job": "api-service",
      "env": "production",
      "team": "backend"
    }
  }
]
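
Prometheus re-reads this file on each refresh_interval, and a partially written file can momentarily drop every target. A sketch of a safe update, where updated-services.json is a hypothetical staging copy of the new target list:

# Stage the new list next to the live file, validate it, then rename into
# place -- rename within one filesystem is atomic, so Prometheus never
# sees a half-written file
sudo cp updated-services.json /etc/prometheus/file_sd/.services.json.tmp
python -m json.tool /etc/prometheus/file_sd/.services.json.tmp > /dev/null \
  && sudo mv /etc/prometheus/file_sd/.services.json.tmp /etc/prometheus/file_sd/services.json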

For Consul service discovery:

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []  # Empty list means discover all services
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: '.*,monitoring,.*'
        action: keep
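
The keep rule above retains only services carrying a monitoring tag; a sketch registering such a service with the Consul CLI (service name and port are illustrative assumptions):

# Register a service tagged 'monitoring' so the relabel rule keeps it
consul services register -name=web-app -port=8080 -tag=monitoring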

Expected: Dynamic targets appear in Prometheus UI, automatically updated when services scale or change.

On failure:

  • Kubernetes: Verify RBAC permissions with kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
  • File SD: Validate JSON syntax with python -m json.tool /etc/prometheus/file_sd/services.json
  • Consul: Test connectivity with curl http://consul.example.com:8500/v1/catalog/services

Step 3: Create Recording Rules

Pre-aggregate expensive queries to improve dashboard performance and alerting efficiency. The rule names below follow Prometheus's level:metric:operations naming convention (e.g., job:http_requests:rate5m).

Create /etc/prometheus/rules/recording_rules.yml:

groups:
  - name: api_aggregations
    interval: 30s
    rules:
      # Calculate request rate per endpoint (5m window)
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, endpoint, method) (
            rate(http_requests_total[5m])
          )

      # Calculate error rate percentage
      - record: job:http_errors:rate5m
        expr: |
          sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) * 100

      # P95 latency by endpoint
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: resource_aggregations
    interval: 1m
    rules:
      # CPU usage by instance
      - record: instance:cpu_usage:ratio
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory usage percentage
      - record: instance:memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
          )

      # Disk usage by mount point
      - record: instance:disk_usage:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}
          )

Validate and reload:

# Validate rules syntax
promtool check rules /etc/prometheus/rules/recording_rules.yml

# Reload Prometheus configuration (without restart)
curl -X POST http://localhost:9090/-/reload

# Or send SIGHUP signal
sudo killall -HUP prometheus
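
To confirm a rule is producing samples after the reload, query one of the recorded series directly via the HTTP API:

# An empty result array means the rule has not evaluated yet
# or its source metrics are absent
curl -s 'http://localhost:9090/api/v1/query?query=job:http_requests:rate5m' | jq .data.result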

Expected: Recording rules evaluate successfully, new metrics visible in Prometheus under the job: and instance: prefixes, query performance improved for dashboards.

On failure:

  • Check rule syntax with promtool check rules
  • Verify evaluation interval matches data availability
  • Check for missing source metrics: curl http://localhost:9090/api/v1/targets
  • Review logs for evaluation errors: journalctl -u prometheus | grep -i error

Step 4: Configure Storage and Retention

Optimize storage for retention requirements and query performance.

Edit /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-admin-api

Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Key storage flags:

  • --storage.tsdb.retention.time=30d: Keep 30 days of data
  • --storage.tsdb.retention.size=50GB: Limit storage to 50GB (whichever retention limit is reached first takes effect)
  • --storage.tsdb.wal-compression: Enable WAL compression to reduce disk I/O (on by default since Prometheus v2.20)
  • --web.enable-lifecycle: Allow config reload via HTTP POST
  • --web.enable-admin-api: Enable snapshot and delete APIs
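
A rough sizing rule: disk ≈ retention_seconds × ingested_samples_per_second × bytes_per_sample, with compressed samples typically costing 1-2 bytes each. A sketch that reads the current ingestion rate from Prometheus itself and applies the formula (the 1.5 bytes/sample figure is an assumption):

# Estimate disk needed for 30 days of retention at ~1.5 bytes/sample
RATE=$(curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[1h])' \
  | jq -r '.data.result[0].value[1]')
awk -v r="$RATE" 'BEGIN {printf "%.0f samples/s -> ~%.1f GB for 30d\n", r, r * 86400 * 30 * 1.5 / 1e9}'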

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
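
Once the unit is active, Prometheus's built-in health endpoints give a quick liveness and readiness check:

curl http://localhost:9090/-/healthy   # returns 200 when the process is up
curl http://localhost:9090/-/ready     # returns 200 when ready to serve queries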

Expected: Prometheus retains metrics according to policy, disk usage stays within limits, old data automatically pruned.

On failure:

  • Monitor disk usage: du -sh /var/lib/prometheus
  • Check TSDB stats: curl http://localhost:9090/api/v1/status/tsdb
  • Verify retention settings: curl http://localhost:9090/api/v1/status/runtimeinfo | jq .data.storageRetention
  • Force cleanup (destructive; this selector deletes every series, so narrow match[] in practice): curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}', then curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones to free the disk space

Step 5: Set Up Federation (Multi-Cluster)

Configure hierarchical Prometheus for aggregating metrics across clusters.

On edge Prometheus instances (in each cluster), ensure external labels are set:

global:
  external_labels:
    cluster: 'production-east'
    datacenter: 'us-east-1'

On central Prometheus instance, add federation scrape config:

scrape_configs:
  - job_name: 'federate-production'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Aggregate only pre-computed recording rules
        - '{__name__=~"job:.*"}'
        # Include alert states
        - '{__name__=~"ALERTS.*"}'
        # Include critical infrastructure metrics
        - 'up'
    static_configs:
      - targets:
          - 'prometheus-east.example.com:9090'
          - 'prometheus-west.example.com:9090'
        labels:
          env: 'production'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: 'prometheus-(.*).example.com.*'
        target_label: cluster
        replacement: '$1'

Federation best practices:

  • Use honor_labels: true to preserve original labels
  • Federate only recording rules and aggregates (not raw metrics)
  • Set an appropriate federation scrape interval (no shorter than the edge instances' evaluation interval)
  • Use match[] to filter metrics (avoid federating everything)
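
Before wiring up the central instance, the /federate endpoint on an edge can be exercised by hand; a sketch that lets curl URL-encode the match[] selector:

# Fetch only recording-rule series from one edge instance
curl -sG 'http://prometheus-east.example.com:9090/federate' \
  --data-urlencode 'match[]={__name__=~"job:.*"}' | head -20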

Expected: Central Prometheus shows federated metrics from all clusters, queries can span multiple regions, minimal data duplication.

On failure:

  • Verify federation endpoint accessibility: curl -s 'http://prometheus-east.example.com:9090/federate?match[]={__name__=~"job:.*"}' | head -20
  • Check for label conflicts (central vs edge external labels)
  • Monitor federation lag: compare timestamp differences
  • Review match patterns: curl -s http://localhost:9090/api/v1/label/__name__/values | jq -r '.data[]' | grep '^job:'

Step 6: Implement High Availability (Optional)

Deploy redundant Prometheus instances with identical configurations for failover.

Use Thanos or Cortex for true HA, or a simpler load-balanced setup:

# prometheus-1.yml and prometheus-2.yml (identical configs)
global:
  scrape_interval: 15s
  external_labels:
    prometheus: 'prometheus-1'  # Different per instance
    replica: 'A'

# Use --web.external-url flag for each instance
# prometheus-1: --web.external-url=http://prometheus-1.example.com:9090
# prometheus-2: --web.external-url=http://prometheus-2.example.com:9090

Configure Grafana to query both instances:

{
  "name": "Prometheus-HA",
  "type": "prometheus",
  "url": "http://prometheus-lb.example.com",
  "jsonData": {
    "httpMethod": "POST",
    "timeInterval": "15s"
  }
}

Use HAProxy or nginx for load balancing:

upstream prometheus_backend {
    server prometheus-1.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-2.example.com:9090 max_fails=3 fail_timeout=30s;
}

server {
    listen 9090;
    location / {
        proxy_pass http://prometheus_backend;
        proxy_set_header Host $host;
    }
}
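
A quick failover check is to probe each backend and the load balancer in turn (hostnames match the examples above; the LB listens on 9090 as in the nginx config):

# Each endpoint should report 200 from /-/healthy; stopping one backend
# should leave the load-balanced endpoint answering
for host in prometheus-1.example.com:9090 prometheus-2.example.com:9090 prometheus-lb.example.com:9090; do
  curl -s -o /dev/null -w "$host -> %{http_code}\n" "http://$host/-/healthy"
done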

Expected: Query requests balanced across instances, automatic failover if one instance down, no data loss during single instance failure.

On failure:

  • Verify both instances scraping same targets (slight time skew acceptable)
  • Check for configuration drift between instances
  • Monitor deduplication in queries (Grafana shows duplicate series)
  • Review load balancer health checks

Validation

  • Prometheus web UI accessible at expected endpoint
  • All configured scrape targets showing as UP in Status > Targets
  • Service discovery dynamically adding/removing targets as expected
  • Recording rules evaluating successfully (no errors in logs)
  • Metrics retention matching configured time/size limits
  • Federation (if configured) pulling metrics from edge instances
  • Queries returning expected metric cardinality (not excessive)
  • Disk usage stable and within allocated storage budget
  • Configuration reload working via HTTP endpoint or SIGHUP
  • Prometheus self-monitoring metrics available (up, scrape duration, etc.)

Common Pitfalls

  • High cardinality metrics: Avoid labels with unbounded values (user IDs, timestamps, UUIDs). Use recording rules to aggregate before storage; see the cardinality check after this list.
  • Scrape interval mismatch: Recording rules should evaluate at intervals equal to or greater than scrape intervals to avoid gaps.
  • Federation overload: Federating all metrics creates massive data duplication. Only federate aggregated recording rules.
  • Missing relabel configs: Without proper relabeling, service discovery can create confusing or duplicate labels.
  • Retention too short: Set retention longer than your longest dashboard time window to avoid "no data" gaps.
  • No resource limits: Prometheus can consume excessive memory with high cardinality. Cap memory at the OS or container level (for example, systemd MemoryMax or a container memory limit) and monitor heap usage via process_resident_memory_bytes.
  • Disabled lifecycle endpoint: Without --web.enable-lifecycle, config reloads require full restarts causing scrape gaps.
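
For the high-cardinality pitfall above, the TSDB status endpoint reports which metric names hold the most series; a quick check assuming jq is installed:

# Top metric names by series count -- spikes here usually mean an unbounded label
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'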

Related Skills

  • configure-alerting-rules - Define alerting rules based on Prometheus metrics and route to Alertmanager
  • build-grafana-dashboards - Visualize Prometheus metrics with Grafana dashboards and panels
  • define-slo-sli-sla - Establish SLO/SLI targets using Prometheus recording rules and error budget tracking
  • instrument-distributed-tracing - Complement metrics with distributed tracing for deeper observability