prometheus
Identity
- Binary: `prometheus`
- Unit: `prometheus.service`
- Config: `/etc/prometheus/prometheus.yml`
- Rules dir: `/etc/prometheus/rules/` (glob-referenced from `prometheus.yml`)
- Data dir: `/var/lib/prometheus/`
- Logs: `journalctl -u prometheus`
- Web UI + API: port 9090
- Alertmanager: port 9093
- Distro install: `apt install prometheus` / `dnf install prometheus` (or binary from prometheus.io)
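As a rough illustration of how the paths and ports above fit together, here is a minimal `prometheus.yml` fragment. The interval values are assumptions chosen for the example, not recommendations from this document.

```yaml
# /etc/prometheus/prometheus.yml (fragment, illustrative only)
global:
  scrape_interval: 15s      # assumed value; tune for your targets
  evaluation_interval: 15s  # how often rules are evaluated (assumed value)

# Load every rule file in the rules directory through a glob.
rule_files:
  - /etc/prometheus/rules/*.yml

# Send firing alerts to Alertmanager on port 9093.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

After editing, `promtool check config /etc/prometheus/prometheus.yml` should validate both this file and the rule files it references.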
Key Operations
| Operation | Command |
|---|---|
| Status | `systemctl status prometheus` |
| Check config syntax | `promtool check config /etc/prometheus/prometheus.yml` |
| Check alerting/recording rules | `promtool check rules /etc/prometheus/rules/*.yml` |
| Reload config (no restart) | `curl -X POST http://localhost:9090/-/reload` (requires `--web.enable-lifecycle`) or `sudo systemctl reload prometheus` |
| SIGHUP reload | `sudo kill -HUP $(pidof prometheus)` |
| Instant query via API | `curl 'http://localhost:9090/api/v1/query?query=up'` |
| Range query via API | `curl 'http://localhost:9090/api/v1/query_range?query=up&start=...&end=...&step=60'` |
| List active targets | `curl -s http://localhost:9090/api/v1/targets \| jq '.data.activeTargets[].health'` |
| List active alerts | `curl -s http://localhost:9090/api/v1/alerts \| jq '.data.alerts'` |
| Count total series | `curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series' \| jq '.data.result[0].value[1]'` |
| TSDB stats (cardinality) | `curl -s http://localhost:9090/api/v1/status/tsdb \| jq '.data.seriesCountByMetricName[:10]'` |
| Create TSDB snapshot | `curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot` (requires `--web.enable-admin-api`) |
| Delete series by label | `curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="old_job"}'` (requires admin API; `-g` stops curl from globbing the brackets and braces) |
| Clean tombstones after delete | `curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones` (requires admin API) |
| Verify scrape target reachable | `curl -v http://<target-host>:<port>/metrics` |
| Build info / version | `curl -s http://localhost:9090/api/v1/status/buildinfo \| jq .` |
Expected Ports
- 9090/tcp — Prometheus web UI and HTTP API
- 9093/tcp — Alertmanager (if running)
- 9100/tcp — node_exporter (convention, not enforced)
- Verify: `ss -tlnp | grep -E '9090|9093|9100'`
- Firewall (internal only, do not expose 9090 publicly without auth): `sudo ufw allow from 10.0.0.0/8 to any port 9090`
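The ports listed above typically show up as scrape targets in `prometheus.yml`. A minimal sketch, assuming Prometheus and node_exporter both run on the same host (the job names are illustrative):

```yaml
# scrape_configs fragment (illustrative; job names are assumptions)
scrape_configs:
  # Prometheus scraping its own metrics endpoint.
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # node_exporter on its conventional port.
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
```

Alertmanager on 9093 is usually referenced from the `alerting:` block rather than scraped, although it can also be added as a scrape target.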
Health Checks
- `systemctl is-active prometheus` → `active`
- `promtool check config /etc/prometheus/prometheus.yml` → `SUCCESS`
- `curl -sf http://localhost:9090/-/healthy` → `Prometheus Server is Healthy.`
- `curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'` → non-zero (targets are being scraped)
Common Failures
| Symptom | Likely cause | Check/Fix |
|---|---|---|
| Target shows DOWN in web UI | Exporter not running or wrong port | `curl http://<target>:<port>/metrics`; `systemctl status <exporter>` |
| Target shows DOWN: connection refused | Firewall blocking scrape or wrong IP | `ss -tlnp` on target host; check `static_configs` target address |
| Scrape timeout | Exporter too slow or metrics endpoint overloaded | Increase `scrape_timeout` for that job; profile exporter |
| Config reload fails | YAML syntax error in `prometheus.yml` or rule files | `promtool check config /etc/prometheus/prometheus.yml` (reports the failing line) |
| Rule evaluation error in logs | Bad PromQL expression in alerting/recording rule | `promtool check rules /etc/prometheus/rules/*.yml` |
| "too many samples" error | High cardinality query or series explosion | Check TSDB stats endpoint; identify high-cardinality labels |
| Disk filling up | Retained data outgrowing the disk (default retention is 15 days) | Lower `--storage.tsdb.retention.time` (for example `7d`) or cap usage with `--storage.tsdb.retention.size=50GB` in the systemd unit |
| Alertmanager not receiving alerts | Wrong Alertmanager address in `alerting:` block | Check `alertmanagers` config; `curl http://localhost:9093/-/healthy` |
| Alerts firing but not routing | Alertmanager route/receiver misconfigured | `amtool config routes test` or check `amtool alert` output |
| OOM kill / high memory | Too many active series in TSDB head | Reduce retention, add recording rules, drop high-cardinality labels via `metric_relabel_configs` |
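Two of the fixes in the table, a per-job `scrape_timeout` and dropping labels via `metric_relabel_configs`, are set on the scrape job itself. The fragment below is a sketch only: the job name, target address, durations, and label names are hypothetical.

```yaml
# scrape_configs fragment (illustrative; names and values are assumptions)
scrape_configs:
  - job_name: slow_exporter            # hypothetical job name
    scrape_interval: 60s
    scrape_timeout: 30s                # must not exceed scrape_interval
    static_configs:
      - targets: ['10.0.0.5:9100']     # hypothetical target address
    metric_relabel_configs:
      # Drop every label whose name matches the regex, after the scrape
      # and before the samples are stored.
      - action: labeldrop
        regex: 'user_id|request_id'
```

Dropping a label can make previously distinct series collide, so it is worth checking that the remaining labels still identify each series uniquely.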
Pain Points
- High cardinality kills performance. Each unique combination of label values creates a new time series. Labels like `user_id`, `request_id`, or `url` can create millions of series. Drop them with `metric_relabel_configs` using `action: labeldrop` before they enter the TSDB.
- Default retention is 15 days, not forever. Data older than `--storage.tsdb.retention.time` is deleted automatically. For long-term storage, configure `remote_write` to Thanos, Mimir, or VictoriaMetrics; Prometheus alone is not an archival system.
- No authentication by default. Out of the box, Prometheus's HTTP API and web UI have no access control. Enable TLS and basic auth through `--web.config.file`, or put Prometheus behind nginx or Caddy with basic auth or mTLS, before exposing it on any network interface other than localhost.
- PromQL range vectors vs instant vectors. `http_requests_total` is an instant vector (the most recent sample of each series). `http_requests_total[5m]` is a range vector (each series' samples over the last 5 minutes). Functions like `rate()` and `increase()` require a range vector; arithmetic and comparisons require instant vectors or scalars. Mixing them up is a common PromQL beginner error.
- Recording rules for expensive queries. Dashboard queries that aggregate across thousands of series run on every panel refresh. Pre-compute them with recording rules so dashboards query a single pre-aggregated series instead of triggering a full scan at render time.
- Alertmanager config is separate from Prometheus config. Alerting rules live in Prometheus (which decides when to fire); routing, receivers, and silences live in Alertmanager (`/etc/alertmanager/alertmanager.yml`). `promtool check rules` validates the rule expressions but does not validate Alertmanager routing. Use `amtool check-config /etc/alertmanager/alertmanager.yml` for Alertmanager's config.
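To make the recording-rule and alerting-rule points concrete, here is a small rule file sketch. The file name, group name, rule names, threshold duration, and severity label are all assumptions for illustration; `http_requests_total` and `up` are the metrics already mentioned in this document.

```yaml
# /etc/prometheus/rules/example.yml (hypothetical file name)
groups:
  - name: example                      # hypothetical group name
    rules:
      # Recording rule: rate() takes a range vector ([5m]) and returns an
      # instant vector, which sum by (job) then aggregates per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Alerting rule: Prometheus decides when this fires; routing and
      # notification are handled separately by Alertmanager.
      - alert: TargetDown
        expr: up == 0
        for: 5m                        # assumed duration
        labels:
          severity: warning            # assumed label for routing
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```

A dashboard can then query `job:http_requests:rate5m` directly, and `promtool check rules /etc/prometheus/rules/*.yml` should validate the file before a reload.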
References
See references/ for:
- `prometheus.yml.annotated`: full config with every directive explained, plus an alerting rule file example
- `common-patterns.md`: node monitoring, file SD, alerting rules, PromQL, recording rules, Alertmanager, nginx auth proxy, remote_write, and retention sizing
- `docs.md`: official documentation links