Prometheus Monitoring
Direct access to Prometheus time-series databases for network infrastructure monitoring: execute PromQL queries (instant and range), discover available metrics, retrieve metric metadata, inspect scrape target health, and perform system health checks. The skill exposes 6 tools through prometheus-mcp-server.
MCP Server
| Property | Value |
|---|---|
| Source | pab1it0/prometheus-mcp-server |
| Transport | stdio (default), SSE, or HTTP |
| Language | Python 3.10+ |
| Tools | 6 (query, range query, list metrics, metadata, targets, health check) |
| Auth | Basic auth (username/password), bearer token, or unauthenticated |
| Install | pip3 install prometheus-mcp-server (PyPI) |
| Run | prometheus-mcp-server (stdio) |
How to Run
# stdio mode (default — used by NetClaw)
PROMETHEUS_URL=http://prometheus:9090 prometheus-mcp-server
# HTTP transport mode
PROMETHEUS_MCP_SERVER_TRANSPORT=http PROMETHEUS_URL=http://prometheus:9090 prometheus-mcp-server
# With basic auth
PROMETHEUS_URL=http://prometheus:9090 PROMETHEUS_USERNAME=admin PROMETHEUS_PASSWORD=secret prometheus-mcp-server
# With bearer token (Grafana Cloud, Thanos, etc.)
PROMETHEUS_URL=https://prom.example.com PROMETHEUS_TOKEN=your_bearer_token prometheus-mcp-server
Environment Variables
| Variable | Required | Example | Description |
|---|---|---|---|
| `PROMETHEUS_URL` | Yes | `http://prometheus:9090` | Prometheus server endpoint |
| `PROMETHEUS_USERNAME` | No | `admin` | Basic auth username |
| `PROMETHEUS_PASSWORD` | No | `changeme` | Basic auth password |
| `PROMETHEUS_TOKEN` | No | `eyJhbG...` | Bearer token (Grafana Cloud, Thanos, Cortex) |
| `PROMETHEUS_URL_SSL_VERIFY` | No | `false` | Disable SSL certificate verification |
| `PROMETHEUS_REQUEST_TIMEOUT` | No | `30` | Request timeout in seconds (default: 30) |
| `PROMETHEUS_DISABLE_LINKS` | No | `true` | Disable Prometheus UI links in responses (saves context) |
| `ORG_ID` | No | `1` | Multi-tenant organization ID (Cortex/Mimir) |
| `PROMETHEUS_CUSTOM_HEADERS` | No | `{"X-Custom":"val"}` | Additional HTTP headers as JSON |
| `PROMETHEUS_MCP_SERVER_TRANSPORT` | No | `stdio` | Transport: stdio (default), http, or sse |
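The variables above can be sanity-checked before the server is launched. The following is a minimal sketch and is not part of prometheus-mcp-server: the function name `check_prometheus_env` is hypothetical, and the rules that basic auth needs both username and password and that the timeout is a positive integer are assumptions made for illustration.

```python
from urllib.parse import urlparse


def check_prometheus_env(env: dict[str, str]) -> list[str]:
    """Return human-readable problems found in an environment mapping."""
    problems = []

    # PROMETHEUS_URL is the only required variable.
    parsed = urlparse(env.get("PROMETHEUS_URL", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("PROMETHEUS_URL must be an http(s) URL")

    # Assumption: a username without a password (or vice versa) is a mistake.
    has_user = bool(env.get("PROMETHEUS_USERNAME"))
    has_pass = bool(env.get("PROMETHEUS_PASSWORD"))
    if has_user != has_pass:
        problems.append("set both PROMETHEUS_USERNAME and PROMETHEUS_PASSWORD")

    # Transport values taken from the table: stdio (default), http, sse.
    transport = env.get("PROMETHEUS_MCP_SERVER_TRANSPORT", "stdio")
    if transport not in ("stdio", "http", "sse"):
        problems.append(f"unknown transport: {transport}")

    # Assumption: the timeout is a positive whole number of seconds.
    timeout = env.get("PROMETHEUS_REQUEST_TIMEOUT", "30")
    if not timeout.isdigit() or int(timeout) <= 0:
        problems.append("PROMETHEUS_REQUEST_TIMEOUT must be a positive integer")

    return problems
```

Passing `dict(os.environ)` checks the current shell environment; an empty list means none of these checks failed.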
Tools
| Tool | Parameters | What It Does |
|---|---|---|
| `execute_query` | `query`, `timeout?` | Execute instant PromQL query at current time |
| `execute_range_query` | `query`, `start`, `end`, `step`, `timeout?` | Execute PromQL range query over time interval |
| `list_metrics` | `page?`, `page_size?` | Browse available metric names with pagination |
| `get_metric_metadata` | `metric?`, `limit?` | Retrieve metric type, help text, and unit info |
| `get_targets` | none | View scrape target details (up/down, labels, last scrape) |
| `health_check` | none | Check Prometheus server availability and readiness |
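For orientation, the `query`, `start`, `end`, and `step` parameters share their names with the parameters of the Prometheus HTTP API endpoints `/api/v1/query` and `/api/v1/query_range`. The sketch below only builds those URLs; it does not send a request, and whether prometheus-mcp-server forwards the parameters unchanged is an assumption. The helper names are hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, query: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def range_query_url(base_url: str, query: str, start: str, end: str, step: str) -> str:
    """Build a Prometheus range-query URL; start/end are RFC 3339 timestamps."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    print(instant_query_url("http://prometheus:9090", "up{job='snmp'}"))
    print(range_query_url(
        "http://prometheus:9090", "up",
        "2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z", "60s",
    ))
```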
Workflow: Network Device Metric Monitoring
When checking Prometheus for network device metrics:
- Health check: run `health_check` to verify Prometheus is reachable
- Discover metrics: run `list_metrics` to find available SNMP/device metrics
- Metric metadata: run `get_metric_metadata(metric="ifHCInOctets")` to check type and description
- Instant query: run `execute_query(query="up{job='snmp'}")` to check which targets are up
- Range query: run `execute_range_query` for trend analysis over time:
  - Interface traffic: `rate(ifHCInOctets{instance="router1"}[5m]) * 8`
  - CPU utilization: `device_cpu_utilization{device="core-rtr-01"}`
  - Interface errors: `increase(ifInErrors{device=~".*"}[1h])`
  - BGP peer state: `bgp_peer_state{peer="10.1.1.2"}`
- Scrape targets: run `get_targets` to verify SNMP exporters and device scrape health
- GAIT: Record all queries in the audit trail
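The `* 8` in the interface traffic query converts octets per second into bits per second. The arithmetic can be sketched as follows. This is a simplified illustration, not how Prometheus computes `rate()`: the real function also handles counter resets and extrapolates to the window edges. The function names and the 1 Gbps link speed are assumptions.

```python
def octets_to_bps(octets_start: int, octets_end: int, seconds: float) -> float:
    """Average bits per second between two counter samples (ignores counter resets)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (octets_end - octets_start) / seconds * 8


def utilization_percent(bps: float, link_speed_bps: float) -> float:
    """Share of the link speed in use, as a percentage."""
    return bps / link_speed_bps * 100


if __name__ == "__main__":
    # 3.75 GB transferred during a 5-minute window
    bps = octets_to_bps(0, 3_750_000_000, 300)
    print(f"{bps / 1e6:.0f} Mbps")                                  # 100 Mbps
    print(f"{utilization_percent(bps, 1_000_000_000):.1f}% of 1 Gbps")  # 10.0%
```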
Example: Interface Utilization Check
health_check()
list_metrics(page=1, page_size=50)
execute_query(query="rate(ifHCInOctets{device='core-rtr-01'}[5m]) * 8")
execute_range_query(query="rate(ifHCOutOctets{device='core-rtr-01'}[5m]) * 8", start="2024-01-01T00:00:00Z", end="2024-01-01T01:00:00Z", step="60s")
get_targets()
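The range query above covers one hour at a 60-second step, which yields 61 samples per series. A quick estimate like the one sketched below helps pick a `step` before querying; Prometheus rejects range queries that would return more than 11,000 points per series. The helper names are hypothetical, and the step is passed as a number of seconds rather than a string such as `"60s"`.

```python
from datetime import datetime

MAX_POINTS = 11_000  # Prometheus limit on points per series in a range query


def parse_rfc3339(ts: str) -> datetime:
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def points_per_series(start: str, end: str, step_seconds: int) -> int:
    """Number of samples a range query returns per series (both ends inclusive)."""
    span = (parse_rfc3339(end) - parse_rfc3339(start)).total_seconds()
    return int(span // step_seconds) + 1


def step_is_safe(start: str, end: str, step_seconds: int) -> bool:
    return points_per_series(start, end, step_seconds) <= MAX_POINTS


if __name__ == "__main__":
    print(points_per_series("2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z", 60))  # 61
    print(step_is_safe("2024-01-01T00:00:00Z", "2024-01-31T00:00:00Z", 60))       # False
```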
Workflow: Alert Threshold Investigation
When investigating whether metrics are crossing alert thresholds:
- Discover metrics: run `list_metrics` to find the metric name
- Check metadata: run `get_metric_metadata` to understand the metric type (counter, gauge, histogram)
- Current value: run `execute_query` to get the current metric value
- Historical trend: run `execute_range_query` to check the trend over the past 1h/6h/24h
- Compare targets: run `get_targets` to check if specific exporters are down
- Report: Metric analysis with current value, trend direction, and recommendation
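One possible way to turn a range-query result into the "current value, trend direction" part of the report is sketched below. It assumes samples arrive as `[timestamp, "value"]` pairs, which is the matrix format of the Prometheus HTTP API; whether `execute_range_query` returns exactly this shape is an assumption. The function name, the half-versus-half comparison, and the 5% tolerance are illustrative choices, not part of the skill.

```python
def summarize_trend(values, threshold: float, tolerance: float = 0.05) -> dict:
    """Compare the mean of the later half of the samples with the earlier half."""
    samples = [float(v) for _, v in values]
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    half = len(samples) // 2
    early = sum(samples[:half]) / half
    late = sum(samples[half:]) / (len(samples) - half)

    # Treat changes smaller than `tolerance` of the early mean as flat.
    delta = late - early
    band = abs(early) * tolerance
    if delta > band:
        direction = "rising"
    elif delta < -band:
        direction = "falling"
    else:
        direction = "flat"

    return {
        "current": samples[-1],
        "direction": direction,
        "above_threshold": samples[-1] > threshold,
    }


if __name__ == "__main__":
    series = [[0, "10"], [60, "12"], [120, "20"], [180, "30"]]
    print(summarize_trend(series, threshold=25))
```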
Workflow: Capacity Planning
When analyzing capacity trends for network infrastructure:
- Discover metrics: run `list_metrics` to find bandwidth/utilization metrics
- Peak analysis: `execute_range_query` with `max_over_time()`:
  - `max_over_time(rate(ifHCInOctets{device="core-rtr-01",ifName="Gi0/0"}[5m])[7d:1h]) * 8`
- 95th percentile: `execute_range_query` with `quantile_over_time()`:
  - `quantile_over_time(0.95, rate(ifHCInOctets{device="core-rtr-01"}[5m])[30d:1h]) * 8`
- Growth rate: Compare weekly/monthly averages
- Report: Utilization summary with capacity headroom and growth projection
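The headroom and growth figures in the report follow from simple arithmetic once the 95th-percentile rate and the weekly averages are known. The sketch below assumes linear growth, which is a simplification; the function names and the sample numbers (600 Mbps on a 1 Gbps link, growing 20 Mbps per week) are made up for illustration.

```python
from typing import Optional


def headroom_percent(p95_bps: float, link_speed_bps: float) -> float:
    """Unused share of the link at the 95th-percentile rate, as a percentage."""
    return (1 - p95_bps / link_speed_bps) * 100


def weeks_until_full(current_bps: float, weekly_growth_bps: float,
                     link_speed_bps: float) -> Optional[float]:
    """Linear projection of when traffic reaches link speed; None if not growing."""
    if weekly_growth_bps <= 0:
        return None
    return max(0.0, (link_speed_bps - current_bps) / weekly_growth_bps)


if __name__ == "__main__":
    link = 1_000_000_000   # 1 Gbps link (assumed)
    p95 = 600_000_000      # value returned by the quantile_over_time query (assumed)
    growth = 20_000_000    # difference between two consecutive weekly averages (assumed)
    print(f"headroom: {headroom_percent(p95, link):.0f}%")
    print(f"weeks until full: {weeks_until_full(p95, growth, link):.0f}")
```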
Integration with Other Skills
| Skill | Integration |
|---|---|
| grafana-observability | Grafana dashboards visualize Prometheus data; use Prometheus skill for direct PromQL when Grafana isn't available or for ad-hoc queries |
| pyats-health-check | Cross-reference pyATS device health with Prometheus time-series metrics |
| pyats-routing | Correlate OSPF/BGP state changes with Prometheus metric timelines |
| gait-session-tracking | Record all Prometheus queries and findings in GAIT audit trail |
| te-network-monitoring | Pair ThousandEyes path data with Prometheus infrastructure metrics |
| sdwan-ops | Correlate SD-WAN vManage alarms with Prometheus device metrics |
| servicenow-change-workflow | Reference Prometheus metrics as evidence in change requests |
Important Rules
- Prefer read-only operations: all 6 tools are read-only; no Prometheus configuration changes
- Use pagination for metric lists: `list_metrics` supports `page` and `page_size` to avoid large responses
- Specify time ranges carefully: overly broad `execute_range_query` time ranges return large result sets
- Disable links for context efficiency: set `PROMETHEUS_DISABLE_LINKS=true` to reduce response size
- GAIT audit mandatory: record all Prometheus queries and metric analysis in the audit trail
- No secrets in queries: never embed credentials or sensitive data in PromQL expressions
- Verify connectivity first: use `health_check` before running queries to confirm Prometheus is reachable
Error Handling
- Auth fails (401/403): Check `PROMETHEUS_URL`, `PROMETHEUS_USERNAME`/`PROMETHEUS_PASSWORD`, or `PROMETHEUS_TOKEN` in `~/.openclaw/.env`. Verify Prometheus allows the configured auth method.
- Connection refused: Verify `PROMETHEUS_URL` is reachable. Use `health_check` to diagnose connectivity.
- PromQL syntax errors: Use `list_metrics` and `get_metric_metadata` to discover valid metric names before querying.
- Empty results: Check `get_targets` to verify scrape targets are up and the expected labels exist.
- Timeout errors: Increase `PROMETHEUS_REQUEST_TIMEOUT` for slow queries or large result sets.
- SSL errors: Set `PROMETHEUS_URL_SSL_VERIFY=false` for self-signed certificates (development only).