prometheus
Prometheus
PromQL Gotchas
Counter Functions (Critical)
Counters only increase. Never use raw counter values—always use rate functions:
rate(http_requests_total[5m]) # Per-second average rate
irate(http_requests_total[5m]) # Instant rate (last 2 points, spiky)
increase(http_requests_total[1h]) # Total increase over range
rate()handles counter resets automatically- Use
rate()for dashboards,irate()only for high-resolution spikes
Range Vector Required
Rate functions need [duration]:
rate(metric[5m]) # Correct
rate(metric) # Error: expected range vector
Vector Matching
Binary operations require matching labels:
# This fails if label sets differ:
metric_a / metric_b
# Ignore extra labels:
metric_a / ignoring(extra_label) metric_b
# Match on specific labels only:
metric_a / on(common_label) metric_b
Histogram Quantiles
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
- Must use
_bucketmetric withlelabel - Always wrap in
rate()for counters by (le)is required; add other labels as needed:by (le, endpoint)
Common Query Patterns
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# CPU usage (node_exporter)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)
# Memory usage
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Container memory (Kubernetes)
sum by (pod) (container_memory_working_set_bytes{container!=""})
Alerting Rules
groups:
- name: example
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/ sum(rate(http_requests_total[5m])) by (job)
> 0.05
for: 5m # Must be firing for this duration
labels:
severity: warning
annotations:
summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.job }}"
for Clause
- Prevents flapping on brief spikes
- Alert stays "pending" until duration met
- Missing
for= immediate alerting
Recording Rules
Pre-compute expensive queries:
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
Naming convention: level:metric:operations
Staleness
- Samples older than 5 minutes are "stale"
up == 0only fires if target was recently scraped- Use
absent(metric)to detect missing metrics entirely
More from kontrolplane/skills
kyverno
Kyverno Kubernetes policy engine for validation, mutation, and generation. Use when writing ClusterPolicies to enforce security standards, auto-mutate resources with defaults, generate companion resources, or verify container image signatures.
12loki
Grafana Loki log aggregation and LogQL queries. Use when writing LogQL queries for log analysis, configuring Promtail scrape pipelines, debugging log ingestion issues, or creating Loki alerting rules.
3argocd
ArgoCD GitOps continuous delivery for Kubernetes. Use when creating or debugging ArgoCD Application/ApplicationSet manifests, configuring sync policies, troubleshooting OutOfSync or degraded states, or integrating Helm/Kustomize sources.
3grafana
Grafana dashboard JSON configuration and alerting. Use when creating or editing dashboard JSON, configuring panels programmatically, setting up Grafana alerting rules, or troubleshooting visualization issues.
3kubernetes
Kubernetes resource configuration and troubleshooting. Use when debugging pod failures, configuring probes and resource limits, setting up RBAC or NetworkPolicies, or resolving common Kubernetes errors like CrashLoopBackOff or ImagePullBackOff.
3terraform
Terraform infrastructure as code with HCL. Use when writing Terraform configurations, debugging state issues, understanding count vs for_each behavior, managing modules, or troubleshooting plan/apply errors.
3