grafana-observability
SKILL.md
Grafana Observability
Full access to Grafana instances (self-hosted or Grafana Cloud) for network infrastructure observability: dashboards, Prometheus metrics (PromQL), Loki logs (LogQL), alerting rules, incident management, OnCall schedules, annotations, and panel image rendering. 75+ tools via the official Grafana MCP server.
MCP Server
| Property | Value |
|---|---|
| Source | grafana/mcp-grafana |
| Transport | stdio (default), SSE, or streamable-http |
| Language | Go (runs via uvx mcp-grafana) |
| Tools | 75+ (dashboards, Prometheus, Loki, alerting, incidents, OnCall, annotations, admin) |
| Auth | Service account token (preferred) or username/password |
| Requires | Grafana 9.0+, service account with Editor role or granular RBAC |
How to Run
# stdio mode (default — used by NetClaw)
uvx mcp-grafana
# Read-only mode (prevents dashboard/alert modifications)
uvx mcp-grafana --disable-write
Environment Variables
| Variable | Required | Example | Description |
|---|---|---|---|
| `GRAFANA_URL` | Yes | `http://grafana.example.com:3000` | Grafana instance URL |
| `GRAFANA_SERVICE_ACCOUNT_TOKEN` | Yes* | `glsa_abc123...` | Service account token (preferred auth) |
| `GRAFANA_USERNAME` | Alt | `admin` | Basic auth username (alternative to token) |
| `GRAFANA_PASSWORD` | Alt | `changeme` | Basic auth password |
| `GRAFANA_ORG_ID` | No | `1` | Organization ID for multi-org setups |
*Either a service account token or a username/password pair is required.
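The requirement above (a URL plus either a token or a username/password pair) can be checked before launching the server. This is a minimal sketch, not part of mcp-grafana itself; the function name `check_grafana_env` is hypothetical, and only the variable names come from the table.

```python
import os
from typing import Mapping, Optional


def check_grafana_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the auth mode ("token" or "basic"), or raise ValueError.

    Hypothetical helper: it only inspects the variables listed in the
    table above and does not contact Grafana.
    """
    env = os.environ if env is None else env
    if not env.get("GRAFANA_URL"):
        raise ValueError("GRAFANA_URL is required")
    # The service account token is the preferred auth method.
    if env.get("GRAFANA_SERVICE_ACCOUNT_TOKEN"):
        return "token"
    # Basic auth needs both halves of the credential pair.
    if env.get("GRAFANA_USERNAME") and env.get("GRAFANA_PASSWORD"):
        return "basic"
    raise ValueError(
        "Set GRAFANA_SERVICE_ACCOUNT_TOKEN, or both "
        "GRAFANA_USERNAME and GRAFANA_PASSWORD"
    )
```

Calling `check_grafana_env()` with no arguments reads the current process environment; passing a dict makes the check easy to exercise in isolation.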
Key Tool Categories
Dashboard Operations
| Tool | What It Does |
|---|---|
| `search_dashboards` | Find dashboards by title or metadata |
| `get_dashboard_summary` | Lightweight overview (context-efficient; use this first) |
| `get_dashboard_by_uid` | Full dashboard JSON (large; use sparingly) |
| `get_dashboard_property` | Extract specific fields via JSONPath |
| `get_dashboard_panel_queries` | Extract panel query details |
| `update_dashboard` | Create or modify dashboards |
| `patch_dashboard` | Targeted modifications without full JSON replacement |
Prometheus (PromQL)
| Tool | What It Does |
|---|---|
| `query_prometheus` | Execute instant or range PromQL queries |
| `list_prometheus_metric_names` | Discover available metrics |
| `list_prometheus_label_names` | List labels matching selectors |
| `list_prometheus_label_values` | Retrieve values for a specific label |
| `query_prometheus_histogram` | Calculate percentiles (p50, p90, p95, p99) |
| `list_prometheus_metric_metadata` | Metric type, help text, unit |
Loki (LogQL)
| Tool | What It Does |
|---|---|
| `query_loki_logs` | Execute LogQL queries against log streams |
| `list_loki_label_names` | Discover available log labels |
| `list_loki_label_values` | List values for a specific log label |
| `query_loki_stats` | Stream statistics (volume, rate) |
| `query_loki_patterns` | Detect log structure patterns |
Alerting
| Tool | What It Does |
|---|---|
| `list_alert_rules` | View all Grafana and datasource-managed alert rules |
| `get_alert_rule_by_uid` | Retrieve specific alert rule details |
| `create_alert_rule` | Create new alert rule |
| `update_alert_rule` | Modify existing alert rule |
| `delete_alert_rule` | Remove alert rule |
| `list_contact_points` | View notification endpoints (email, Slack, PagerDuty, etc.) |
Incident Management
| Tool | What It Does |
|---|---|
| `list_incidents` | View Grafana Incidents with filtering |
| `get_incident` | Single incident details |
| `create_incident` | Create a new incident |
| `add_activity_to_incident` | Add timeline entry to incident |
OnCall
| Tool | What It Does |
|---|---|
| `list_oncall_schedules` | View on-call rotation schedules |
| `get_oncall_shift` | Shift details |
| `get_current_oncall_users` | Who is on call right now |
| `list_alert_groups` | OnCall alert groups with filtering |
Annotations & Rendering
| Tool | What It Does |
|---|---|
| `get_annotations` | Query annotations with time/tag filters |
| `create_annotation` | Add annotation to dashboard/panel |
| `get_panel_image` | Render a panel or dashboard as PNG image |
| `generate_deeplink` | Create accurate Grafana URLs for sharing |
Investigation (Sift)
| Tool | What It Does |
|---|---|
| `list_sift_investigations` | List automated investigations |
| `get_sift_investigation` | Investigation details |
| `find_error_pattern_logs` | Detect elevated error patterns in logs |
| `find_slow_requests` | Identify slow requests via Tempo traces |
Workflow: Network Infrastructure Monitoring
When checking network device metrics in Grafana:
- Find dashboards: `search_dashboards` with keyword (e.g., "network", "interface", "BGP")
- Dashboard overview: `get_dashboard_summary` for panel list without full JSON
- Query metrics: `query_prometheus` with PromQL for specific metrics:
  - Interface traffic: `rate(ifHCInOctets{instance="router1"}[5m]) * 8`
  - BGP peer state: `bgp_peer_state{peer="10.1.1.2"}`
  - CPU utilization: `device_cpu_utilization{device="core-rtr-01"}`
  - Interface errors: `increase(ifInErrors{device=~".*"}[1h])`
- Check alerts: `list_alert_rules` to see active alerting thresholds
- Search logs: `query_loki_logs` for syslog or SNMP trap data
- Report: Metrics summary with alert status and log correlation
- GAIT: Record all queries in audit trail
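The PromQL expressions in the workflow above follow one pattern: a metric name, a label selector, a rate window, and a multiply-by-8 to convert octets to bits. A minimal sketch of building such strings programmatically follows; the helper names `promql_selector` and `bits_per_second` are hypothetical, and the metric and label names depend on how your exporter labels devices.

```python
def promql_selector(metric: str, labels: dict) -> str:
    """Build `metric{k="v",...}` with label values escaped for PromQL."""
    def esc(value: str) -> str:
        # Backslashes and double quotes must be escaped inside "..." strings.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    body = ",".join(f'{k}="{esc(v)}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{body}}}"


def bits_per_second(counter: str, labels: dict, window: str = "5m") -> str:
    """Per-second rate of an octet counter, converted to bits."""
    return f"rate({promql_selector(counter, labels)}[{window}]) * 8"


expr = bits_per_second("ifHCInOctets", {"instance": "router1"})
print(expr)  # rate(ifHCInOctets{instance="router1"}[5m]) * 8
```

The resulting string is what would be passed to `query_prometheus` as the expression. Escaping label values keeps a device name containing a quote from breaking the selector.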
Example: Interface Utilization Check
search_dashboards(title="Network Interfaces")
get_dashboard_summary(uid="abc123")
query_prometheus(expr="rate(ifHCInOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
query_prometheus(expr="rate(ifHCOutOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
list_alert_rules(folder="Network")
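The two `query_prometheus` calls above return bits per second; expressing that as a utilization percentage also needs the link speed. A minimal sketch of the arithmetic, assuming the speed is known in Mbps (for example from `ifHighSpeed`, which IF-MIB reports in units of 1,000,000 bits per second). The function name `utilization_pct` is hypothetical.

```python
def utilization_pct(bits_per_second: float, link_speed_mbps: float) -> float:
    """Convert a bits-per-second sample into percent of link capacity."""
    if link_speed_mbps <= 0:
        raise ValueError("link_speed_mbps must be positive")
    capacity_bps = link_speed_mbps * 1_000_000
    return bits_per_second / capacity_bps * 100


# 250 Mbit/s of traffic on a 1 Gbit/s link
print(utilization_pct(250_000_000, 1000))  # 25.0
```

Inbound and outbound directions should be computed separately on full-duplex links, since each direction has its own capacity.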
Workflow: Alert Investigation
When investigating Grafana alerts:
- List alerts: `list_alert_rules` to find firing or pending rules
- Alert details: `get_alert_rule_by_uid` for thresholds, conditions, datasource
- Query metrics: `query_prometheus` to check the metric that triggered the alert
- Search logs: `query_loki_logs` to correlate with log events around alert time
- Check incidents: `list_incidents` to see whether this is already tracked
- Contact points: `list_contact_points` to verify notification routes
- Report: Alert analysis with root cause and metric evidence
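Correlating metrics and logs "around alert time" needs a concrete start and end. A minimal sketch of deriving such a window from the alert timestamp follows; the function name `correlation_window` and the default offsets (15 minutes before, 5 minutes after) are assumptions to tune per environment, not values from Grafana.

```python
from datetime import datetime, timedelta, timezone


def correlation_window(alert_time: datetime,
                       before_min: int = 15,
                       after_min: int = 5) -> tuple:
    """Return (start, end) as UTC ISO 8601 strings around alert_time."""
    if alert_time.tzinfo is None:
        # Naive timestamps are ambiguous; require an explicit timezone.
        raise ValueError("alert_time must be timezone-aware")
    start = alert_time - timedelta(minutes=before_min)
    end = alert_time + timedelta(minutes=after_min)
    return (start.astimezone(timezone.utc).isoformat(),
            end.astimezone(timezone.utc).isoformat())


fired = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(correlation_window(fired))
# ('2024-01-01T11:45:00+00:00', '2024-01-01T12:05:00+00:00')
```

Using the same window for both the Prometheus and Loki queries keeps the metric and log evidence aligned in the final report.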
Workflow: Incident Response
When responding to a Grafana incident:
- List incidents: `list_incidents` to find open incidents
- Incident details: `get_incident` for timeline, severity, labels
- OnCall: `get_current_oncall_users` to see who should be notified
- Correlate metrics: `query_prometheus` to check affected service metrics
- Correlate logs: `query_loki_logs` to find error patterns around incident time
- Investigate: `find_error_pattern_logs` for automated error pattern detection
- Update incident: `add_activity_to_incident` to add findings to the timeline
- Annotate: `create_annotation` to mark the event on relevant dashboards
Workflow: Log Analysis
When investigating network logs stored in Loki:
- Discover labels: `list_loki_label_names` to find available labels (host, severity, facility)
- Label values: `list_loki_label_values` to enumerate hosts, severity levels
- Query logs: `query_loki_logs` with LogQL:
  - By device: `{host="core-rtr-01"}`
  - By severity: `{host="core-rtr-01"} |= "error"`
  - Pattern match: `{job="syslog"} |~ "BGP|OSPF"`
- Patterns: `query_loki_patterns` to detect recurring log structures
- Stats: `query_loki_stats` for log volume and rate analysis
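The LogQL examples above combine a stream selector with optional line filters (`|=` for substring match, `|~` for regex match). A minimal sketch of composing those queries as strings follows; the helper name `logql` is hypothetical, and the label names depend on how your syslog pipeline labels streams.

```python
from typing import Optional


def logql(labels: dict,
          contains: Optional[str] = None,
          regex: Optional[str] = None) -> str:
    """Build a LogQL query: stream selector plus optional line filters."""
    def esc(value: str) -> str:
        # Escape backslashes and double quotes for "..." string literals.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    selector = ",".join(f'{k}="{esc(v)}"' for k, v in sorted(labels.items()))
    query = "{" + selector + "}"
    if contains is not None:
        query += f' |= "{esc(contains)}"'   # substring filter
    if regex is not None:
        query += f' |~ "{esc(regex)}"'      # regular-expression filter
    return query


print(logql({"host": "core-rtr-01"}, contains="error"))
# {host="core-rtr-01"} |= "error"
print(logql({"job": "syslog"}, regex="BGP|OSPF"))
# {job="syslog"} |~ "BGP|OSPF"
```

Keeping the selector as narrow as possible (a specific host or job) before adding line filters limits how much log data Loki has to scan.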
Integration with Other Skills
| Skill | Integration |
|---|---|
| pyats-health-check | Cross-reference pyATS health data with Grafana metrics and dashboards |
| pyats-routing | Correlate OSPF/BGP state changes with Grafana metric timelines |
| gait-session-tracking | Record all Grafana queries and findings in GAIT audit trail |
| slack-network-alerts | Grafana alerts fed through Slack + NetClaw for automated investigation |
| servicenow-change-workflow | Annotate Grafana dashboards during change windows; correlate incidents with CRs |
| te-network-monitoring | Pair ThousandEyes path data with Grafana infrastructure metrics |
| aws-cloud-monitoring | Compare Grafana dashboards with CloudWatch data for hybrid visibility |
| markmap-viz | Visualize Grafana alert rule hierarchies as mind maps |
Context Window Management
Grafana dashboards can be large JSON documents. Use these strategies:
- Always start with `get_dashboard_summary`: a lightweight overview, not full JSON
- Use `get_dashboard_property` with JSONPath for specific fields
- Avoid `get_dashboard_by_uid` unless you need the complete dashboard definition
- Use `get_dashboard_panel_queries` to extract just the query definitions
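To illustrate why extracting only the queries is cheaper than fetching the whole dashboard, the sketch below pulls panel titles and expressions out of a small, made-up dashboard document and compares serialized sizes. The sample data and the function `panel_queries` are hypothetical; the sketch mimics the idea behind `get_dashboard_panel_queries` and is not its implementation.

```python
import json

# Made-up dashboard in the general shape of Grafana dashboard JSON
# (a "panels" list whose entries carry "targets" with an "expr").
SAMPLE_DASHBOARD = {
    "uid": "abc123",
    "title": "Network Interfaces",
    "panels": [
        {
            "id": 1, "title": "Inbound bps", "type": "timeseries",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "fieldConfig": {"defaults": {"unit": "bps"}},
            "targets": [{"refId": "A",
                         "expr": 'rate(ifHCInOctets{device="core-rtr-01"}[5m]) * 8'}],
        },
        {
            "id": 2, "title": "Outbound bps", "type": "timeseries",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "fieldConfig": {"defaults": {"unit": "bps"}},
            "targets": [{"refId": "A",
                         "expr": 'rate(ifHCOutOctets{device="core-rtr-01"}[5m]) * 8'}],
        },
    ],
}


def panel_queries(dashboard: dict) -> list:
    """Keep only the panel title and query expression for each target."""
    return [
        {"panel": panel.get("title", ""), "expr": target["expr"]}
        for panel in dashboard.get("panels", [])
        for target in panel.get("targets", [])
        if "expr" in target
    ]


full_size = len(json.dumps(SAMPLE_DASHBOARD))
trimmed_size = len(json.dumps(panel_queries(SAMPLE_DASHBOARD)))
print(f"full: {full_size} chars, queries only: {trimmed_size} chars")
```

Real dashboards carry far more layout and styling metadata per panel than this sample, so the savings are typically larger in practice.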
Important Rules
- Prefer read-only operations: use `search_dashboards`, `get_dashboard_summary`, `query_prometheus`, `query_loki_logs`, and `list_alert_rules` before any write operations
- Dashboard modifications require a ServiceNow CR, unless in a lab/dev Grafana instance
- Alert rule changes require approval: creating, updating, or deleting alert rules affects production monitoring
- Token-efficient queries: use `get_dashboard_summary` over `get_dashboard_by_uid`, and use time ranges to limit Prometheus/Loki result size
- GAIT audit mandatory: record all Grafana queries, dashboard modifications, alert changes, and incident updates
- No secrets in queries: never embed credentials or sensitive data in PromQL/LogQL expressions
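The read-before-write and approval rules above can be enforced with a simple gate in front of tool calls. A minimal sketch follows, assuming the write-capable tools are the ones this document lists as creating, updating, or deleting resources; the function `check_tool_call` and the `approved` flag are hypothetical, and how approval is obtained (ServiceNow CR, human sign-off) is outside the sketch.

```python
# Tools from this document that modify Grafana state.
WRITE_TOOLS = frozenset({
    "update_dashboard",
    "patch_dashboard",
    "create_alert_rule",
    "update_alert_rule",
    "delete_alert_rule",
    "create_incident",
    "add_activity_to_incident",
    "create_annotation",
})


def check_tool_call(tool: str, approved: bool = False) -> str:
    """Return the tool name if the call may proceed, else raise."""
    if tool in WRITE_TOOLS and not approved:
        raise PermissionError(
            f"{tool} modifies Grafana state and requires prior approval"
        )
    return tool


check_tool_call("query_prometheus")                 # read-only: allowed
check_tool_call("update_dashboard", approved=True)  # write with approval
```

Running the server with `--disable-write` is the stronger control where writes are never needed; a gate like this is for sessions that must write occasionally.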
Error Handling
- Auth fails (401/403): Check `GRAFANA_URL` and `GRAFANA_SERVICE_ACCOUNT_TOKEN` in `~/.openclaw/.env`. Verify the service account has the Editor role or the required RBAC permissions.
- Datasource not found: Use `list_datasources` to discover available datasource UIDs and names.
- PromQL/LogQL errors: Use `list_prometheus_metric_names` or `list_loki_label_names` to discover valid metric/label names before querying.
- Dashboard not found: Use `search_dashboards` to find dashboards by title before using UID-based tools.
- Rate limiting: Grafana may rate-limit API requests; space out large query batches.
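Spacing out a large batch, as the rate-limiting note suggests, can be as simple as pausing between calls. A minimal sketch follows; the helper `run_spaced`, the `run` callback, and the default half-second delay are assumptions, and the appropriate delay depends on the limits of your Grafana deployment.

```python
import time
from typing import Callable, Iterable, List, TypeVar

Q = TypeVar("Q")
R = TypeVar("R")


def run_spaced(queries: Iterable[Q],
               run: Callable[[Q], R],
               delay_s: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> List[R]:
    """Run each query in order, pausing delay_s between consecutive calls."""
    results: List[R] = []
    for index, query in enumerate(queries):
        if index > 0:
            sleep(delay_s)  # no pause before the first query
        results.append(run(query))
    return results
```

Here `run` stands in for whatever executes a single query; the `sleep` parameter is injectable so the pacing logic can be exercised without real delays.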