monitoring-authoring
# Monitoring Resource Authoring
This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.
## Resource Types
| Resource | API Group | Purpose |
|---|---|---|
| PrometheusRule | monitoring.coreos.com/v1 | Alert rules and recording rules |
| ServiceMonitor | monitoring.coreos.com/v1 | Scrape metrics from Services |
| PodMonitor | monitoring.coreos.com/v1 | Scrape metrics from Pods directly |
| ScrapeConfig | monitoring.coreos.com/v1alpha1 | Advanced scrape configuration |
| AlertmanagerConfig | monitoring.coreos.com/v1alpha1 | Routing, receivers, silencing |
| Silence | observability.giantswarm.io/v1alpha2 | Declarative Alertmanager silences |
| Canary | canaries.flanksource.com/v1 | Synthetic health checks (HTTP, TCP, K8s) |
See [references/file-placement.md] for where to put each resource type and naming conventions.
## PrometheusRule Authoring
Every PrometheusRule must include the `release: kube-prometheus-stack` label for Prometheus to discover it.

PrometheusRule template: see [references/alert-patterns.md].
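A minimal sketch of the expected shape (the group name, alert name, expression, and threshold are placeholders, not rules from this repo; add the yaml-language-server schema comment per the repo convention):

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  groups:
    - name: <component>
      rules:
        - alert: <Component>Down
          expr: up{job="<component>"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "<component> has been down for 5 minutes"
```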
### Severity and `for` Duration

| Severity | `for` Duration | Use Case | Routing |
|---|---|---|---|
| critical | 2m-5m | Service down, data loss risk | Discord |
| warning | 5m-15m | Degraded performance, limits | Discord |
| info | 10m-30m | Informational, non-urgent | Silenced by InfoInhibitor |
Guidelines:

- `for: 0m` only for instant failures (e.g., SMART fail).
- Most alerts: 5m by default.
- Flap-prone metrics (error rates, latency): 10m-15m.
- Absence detection: 5m.
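For the absence-detection case, a sketch of a single rule entry (placed inside `spec.groups[].rules`; the metric and labels are illustrative):

```yaml
- alert: <Component>MetricsAbsent
  expr: absent(up{job="<component>"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "No metrics scraped from <component> for 5 minutes"
```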
### Alert Grouping

Group related alerts in named rule groups; the group name affects ordering in the Prometheus UI:
```yaml
spec:
  groups:
    - name: cilium-agent # Agent availability and health
      rules: [...]
    - name: cilium-bpf # BPF subsystem alerts
      rules: [...]
```
See [references/alert-patterns.md] for common alert patterns (down, error rate, latency, capacity, PVC), annotation template functions, and recording rule examples.
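As a quick orientation, a recording rule group sketch (the rule and metric names are illustrative; the referenced file has this repo's actual patterns):

```yaml
- name: <component>-recording
  rules:
    - record: <component>:http_request_error_rate:ratio_rate5m
      expr: |
        sum(rate(http_requests_total{job="<component>", code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{job="<component>"}[5m]))
```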
## ServiceMonitor and PodMonitor

### Via Helm Values (Preferred)
```yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
```
### Manual ServiceMonitor

Place it in the monitoring namespace and use `namespaceSelector` to reach the target namespace. Required label: `release: kube-prometheus-stack`.
```yaml
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/servicemonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED
spec:
  namespaceSelector:
    matchNames: [<target-namespace>]
  selector:
    matchLabels:
      app.kubernetes.io/name: <component>
  endpoints:
    - port: http-monitoring
      path: /metrics
      interval: 30s
```
### Manual PodMonitor

Use when pods expose metrics but don't have a Service (DaemonSets, sidecars). Same pattern as ServiceMonitor, with `podMetricsEndpoints` instead of `endpoints` and numeric ports quoted (`port: "15020"`). For `matchExpressions` selecting multiple values, see any existing Flux PodMonitor in config/monitoring/.
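A minimal PodMonitor sketch following that pattern (names, namespace, and port are placeholders; add the schema comment per the repo convention):

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED
spec:
  namespaceSelector:
    matchNames: [<target-namespace>]
  selector:
    matchLabels:
      app.kubernetes.io/name: <component>
  podMetricsEndpoints:
    - port: http-metrics # named port; quote numeric ports, e.g. "15020"
      path: /metrics
      interval: 30s
```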
See [references/alertmanagerconfig-reference.md] for AlertmanagerConfig routing, Silence CR templates, and matcher reference.
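For orientation only, a hedged AlertmanagerConfig sketch (the receiver name, matcher, and webhook URL are placeholders; the referenced file is authoritative for this repo's routing):

```yaml
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: <component>-routing
  namespace: monitoring
spec:
  route:
    receiver: <component>-receiver
    groupBy: [alertname]
    matchers:
      - name: app
        value: <component>
        matchType: "="
  receivers:
    - name: <component>-receiver
      webhookConfigs:
        - url: https://<webhook-endpoint> # placeholder
```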
## Canary Health Checks
Canary resources live in config/canary-checker/ (platform) or alongside app config.
HTTP health check:
```yaml
---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/canaries.flanksource.com/canary_v1.json
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7
      thresholdMillis: 5000
```
Kubernetes resource check with CEL (preferred over `ready: true`, which penalizes pods with restart history):
```yaml
spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )
```
`canary_check == 1` triggers CanaryCheckFailure (critical, `for: 2m`). No per-canary alert is needed.
## Workflow: Adding Monitoring for a New Component

1. Check whether the chart provides monitoring via Helm values (`kubesearch <chart-name> serviceMonitor`).
2. If available, enable it via values; otherwise create a ServiceMonitor/PodMonitor, PrometheusRule, and Canary manually.
3. Place the resources in the correct directory and register them in the kustomization.
4. Run `task k8s:validate`.
5. Verify after deployment:
```sh
# Check ServiceMonitor is discovered
kubectl --context <cluster> exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' | \
  jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'

# Check alert rules are loaded
kubectl --context <cluster> exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' | \
  jq '.data.groups[] | select(.name | contains("<component>"))'
```
For PrometheusRule validation before committing, see [scripts/validate-rules.sh].
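The script itself is the source of truth; as a rough sketch of the kind of check it performs (assuming `yq` and `promtool` are available):

```sh
# Hypothetical sketch -- see scripts/validate-rules.sh for the real validation.
# Strip the CR wrapper so promtool sees a plain Prometheus rule file.
yq '.spec' <prometheusrule-file>.yaml > /tmp/rules.yaml
promtool check rules /tmp/rules.yaml
```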
## Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Missing `release: kube-prometheus-stack` label | Prometheus ignores the resource | Add it to `metadata.labels` |
| ServiceMonitor selector does not match any Service | No metrics scraped, no error | Verify labels with `kubectl get svc -n <ns> --show-labels` |
| Using `ready: true` in Canary Kubernetes checks | False negatives after pod restarts | Use a CEL `test.expr` |
| Hardcoding domains in canary URLs | Breaks across clusters | Use `${internal_domain}` |
| Very short `for` on flappy metrics | Alert noise | Use 10m+ for error rates and latencies |
| Creating alerts for non-existent metrics | Alert never fires (the expression returns no data) | Verify the metric exists in Prometheus first |
## Keywords
PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat