Grafana Observability

MCP Server

Property	Value
Source	grafana/mcp-grafana
Transport	stdio (default), SSE, or streamable-http
Language	Go (runs via `uvx mcp-grafana`)
Tools	75+ (dashboards, Prometheus, Loki, alerting, incidents, OnCall, annotations, admin)
Auth	Service account token (preferred) or username/password
Requires	Grafana 9.0+, service account with Editor role or granular RBAC

How to Run

# stdio mode (default — used by NetClaw)
uvx mcp-grafana

# Read-only mode (prevents dashboard/alert modifications)
uvx mcp-grafana --disable-write

Environment Variables

Variable	Required	Example	Description
`GRAFANA_URL`	Yes	`http://grafana.example.com:3000`	Grafana instance URL
`GRAFANA_SERVICE_ACCOUNT_TOKEN`	Yes*	`glsa_abc123...`	Service account token (preferred auth)
`GRAFANA_USERNAME`	Alt	`admin`	Basic auth username (alternative to token)
`GRAFANA_PASSWORD`	Alt	`changeme`	Basic auth password
`GRAFANA_ORG_ID`	No	`1`	Organization ID for multi-org setups

*Either a service account token or a username/password pair is required.
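The requirement in the table above can be checked before the server is launched. The following is a minimal pre-flight sketch, not a description of how mcp-grafana itself reads its environment. The function name `resolve_grafana_auth` and the shape of the returned dictionary are assumptions made for illustration; preferring the token over basic auth mirrors the "preferred auth" note in the table.

```python
from typing import Mapping


def resolve_grafana_auth(env: Mapping[str, str]) -> dict:
    """Validate the variables from the table and report which auth method applies.

    Hypothetical helper: raises ValueError when GRAFANA_URL is missing or
    when neither auth method is fully specified.
    """
    url = env.get("GRAFANA_URL", "").strip()
    if not url:
        raise ValueError("GRAFANA_URL is required")

    token = env.get("GRAFANA_SERVICE_ACCOUNT_TOKEN", "").strip()
    username = env.get("GRAFANA_USERNAME", "").strip()
    password = env.get("GRAFANA_PASSWORD", "")

    if token:
        # Token wins when both methods are configured (the preferred auth).
        auth = {"method": "token"}
    elif username and password:
        auth = {"method": "basic", "username": username}
    else:
        raise ValueError(
            "Set GRAFANA_SERVICE_ACCOUNT_TOKEN, or both GRAFANA_USERNAME "
            "and GRAFANA_PASSWORD"
        )

    settings = {"url": url, "auth": auth}
    org_id = env.get("GRAFANA_ORG_ID", "").strip()
    if org_id:
        settings["org_id"] = int(org_id)  # optional, multi-org setups only
    return settings
```

Calling `resolve_grafana_auth(os.environ)` after `import os` would apply the check to the current process environment.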

Key Tool Categories

Dashboard Operations

Tool	What It Does
`search_dashboards`	Find dashboards by title or metadata
`get_dashboard_summary`	Lightweight overview (context-efficient — use this first)
`get_dashboard_by_uid`	Full dashboard JSON (large — use sparingly)
`get_dashboard_property`	Extract specific fields via JSONPath
`get_dashboard_panel_queries`	Extract panel query details
`update_dashboard`	Create or modify dashboards
`patch_dashboard`	Targeted modifications without full JSON replacement

Prometheus (PromQL)

Tool	What It Does
`query_prometheus`	Execute instant or range PromQL queries
`list_prometheus_metric_names`	Discover available metrics
`list_prometheus_label_names`	List labels matching selectors
`list_prometheus_label_values`	Retrieve values for a specific label
`query_prometheus_histogram`	Calculate percentiles (p50, p90, p95, p99)
`list_prometheus_metric_metadata`	Metric type, help text, unit

Loki (LogQL)

Tool	What It Does
`query_loki_logs`	Execute LogQL queries against log streams
`list_loki_label_names`	Discover available log labels
`list_loki_label_values`	List values for a specific log label
`query_loki_stats`	Stream statistics (volume, rate)
`query_loki_patterns`	Detect log structure patterns

Alerting

Tool	What It Does
`list_alert_rules`	View all Grafana and datasource-managed alert rules
`get_alert_rule_by_uid`	Retrieve specific alert rule details
`create_alert_rule`	Create new alert rule
`update_alert_rule`	Modify existing alert rule
`delete_alert_rule`	Remove alert rule
`list_contact_points`	View notification endpoints (email, Slack, PagerDuty, etc.)

Incident Management

Tool	What It Does
`list_incidents`	View Grafana Incidents with filtering
`get_incident`	Single incident details
`create_incident`	Create a new incident
`add_activity_to_incident`	Add timeline entry to incident

OnCall

Tool	What It Does
`list_oncall_schedules`	View on-call rotation schedules
`get_oncall_shift`	Shift details
`get_current_oncall_users`	Who is on call right now
`list_alert_groups`	OnCall alert groups with filtering

Annotations & Rendering

Tool	What It Does
`get_annotations`	Query annotations with time/tag filters
`create_annotation`	Add annotation to dashboard/panel
`get_panel_image`	Render a panel or dashboard as PNG image
`generate_deeplink`	Create accurate Grafana URLs for sharing

Investigation (Sift)

Tool	What It Does
`list_sift_investigations`	List automated investigations
`get_sift_investigation`	Investigation details
`find_error_pattern_logs`	Detect elevated error patterns in logs
`find_slow_requests`	Identify slow requests via Tempo traces

Workflow: Network Infrastructure Monitoring

When checking network device metrics in Grafana:

Find dashboards: search_dashboards with keyword (e.g., "network", "interface", "BGP")
Dashboard overview: get_dashboard_summary for panel list without full JSON
Query metrics: query_prometheus with PromQL for specific metrics:
- Interface traffic: rate(ifHCInOctets{instance="router1"}[5m]) * 8
- BGP peer state: bgp_peer_state{peer="10.1.1.2"}
- CPU utilization: device_cpu_utilization{device="core-rtr-01"}
- Interface errors: increase(ifInErrors{device=~".*"}[1h])
Check alerts: list_alert_rules to see active alerting thresholds
Search logs: query_loki_logs for syslog or SNMP trap data
Report: Metrics summary with alert status and log correlation
GAIT: Record all queries in audit trail
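The PromQL strings in the workflow above follow a small number of shapes, so they can be assembled rather than typed by hand. The sketch below is one way to do that, assuming metric and label names like those in the examples exist in your Prometheus datasource. The helper names (`promql_selector`, `bits_per_second`, `increase_over`) are hypothetical and are not part of mcp-grafana.

```python
def promql_selector(metric: str, **labels: str) -> str:
    """Build `metric{label="value",...}` with label values escaped."""

    def esc(value: str) -> str:
        # Backslashes and double quotes must be escaped inside a PromQL string.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    if not labels:
        return metric
    body = ",".join(f'{name}="{esc(value)}"' for name, value in labels.items())
    return f"{metric}{{{body}}}"


def bits_per_second(counter: str, window: str = "5m", **labels: str) -> str:
    """Per-second rate of an octet counter, converted from bytes to bits."""
    return f"rate({promql_selector(counter, **labels)}[{window}]) * 8"


def increase_over(counter: str, window: str = "1h", **labels: str) -> str:
    """Total increase of a counter over the window."""
    return f"increase({promql_selector(counter, **labels)}[{window}])"
```

For example, `bits_per_second("ifHCInOctets", instance="router1")` produces the interface traffic expression listed in the workflow, and the resulting string is what would be passed as the query expression to `query_prometheus`.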

Example: Interface Utilization Check

search_dashboards(title="Network Interfaces")
get_dashboard_summary(uid="abc123")
query_prometheus(expr="rate(ifHCInOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
query_prometheus(expr="rate(ifHCOutOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h")
list_alert_rules(folder="Network")
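To make the `rate(...) * 8` expressions in this example concrete, the arithmetic can be worked by hand: an octet counter is sampled twice, the difference is divided by the elapsed seconds, and the result is multiplied by 8 to convert bytes to bits. The sketch below is a simplification under stated assumptions. Prometheus's own `rate()` also handles counter resets and extrapolates to the window edges, which this code does not. The function names and the 1 Gbit/s link speed are illustrative only.

```python
def octets_to_bps(first_octets: int, last_octets: int, seconds: float) -> float:
    """Average bits per second between two samples of an octet counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if last_octets < first_octets:
        raise ValueError("counter went backwards (reset or wrap)")
    return (last_octets - first_octets) * 8 / seconds


def utilization_percent(bps: float, link_speed_bps: float) -> float:
    """Share of the link capacity in use, as a percentage."""
    if link_speed_bps <= 0:
        raise ValueError("link_speed_bps must be positive")
    return 100.0 * bps / link_speed_bps


# 3,750,000,000 octets transferred in a 5-minute (300 s) window
bps = octets_to_bps(0, 3_750_000_000, 300)
print(bps)                                      # 100000000.0 (100 Mbit/s)
print(utilization_percent(bps, 1_000_000_000))  # 10.0 (% of a 1 Gbit/s link)
```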

Workflow: Alert Investigation

When investigating Grafana alerts:

List alerts: list_alert_rules — find firing or pending rules
Alert details: get_alert_rule_by_uid — thresholds, conditions, datasource
Query metrics: query_prometheus — check the metric that triggered the alert
Search logs: query_loki_logs — correlate with log events around alert time
Check incidents: list_incidents — is this already tracked?
Contact points: list_contact_points — verify notification routes
Report: Alert analysis with root cause and metric evidence
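Correlating logs "around alert time" needs an explicit start and end. The sketch below computes such a window as UTC timestamps in RFC 3339 form. The 15-minute look-back and 5-minute look-ahead are arbitrary defaults chosen for illustration, and this document does not state which parameter names the query tools accept for time bounds, so the function only returns the two strings.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def correlation_window(
    alert_time: datetime,
    before_minutes: int = 15,
    after_minutes: int = 5,
) -> Tuple[str, str]:
    """Return (start, end) around alert_time as RFC 3339 UTC timestamps."""
    if alert_time.tzinfo is None:
        # Naive datetimes are ambiguous; require an explicit timezone.
        raise ValueError("alert_time must be timezone-aware")
    utc = alert_time.astimezone(timezone.utc)
    start = utc - timedelta(minutes=before_minutes)
    end = utc + timedelta(minutes=after_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


fired_at = datetime(2026, 3, 3, 12, 0, 0, tzinfo=timezone.utc)
print(correlation_window(fired_at))
# ('2026-03-03T11:45:00Z', '2026-03-03T12:05:00Z')
```

Using the same window for both the metric query and the log query keeps the two result sets directly comparable.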

Workflow: Incident Response

When responding to a Grafana incident:

List incidents: list_incidents — find open incidents
Incident details: get_incident — timeline, severity, labels
OnCall: get_current_oncall_users — who should be notified
Correlate metrics: query_prometheus — check affected service metrics
Correlate logs: query_loki_logs — find error patterns around incident time
Investigate: find_error_pattern_logs — automated error pattern detection
Update incident: add_activity_to_incident — add findings to timeline
Annotate: create_annotation — mark event on relevant dashboards

Workflow: Log Analysis

When investigating network logs stored in Loki:

Discover labels: list_loki_label_names — find available labels (host, severity, facility)
Label values: list_loki_label_values — enumerate hosts, severity levels
Query logs: query_loki_logs with LogQL:
- By device: {host="core-rtr-01"}
- By severity keyword: {host="core-rtr-01"} |= "error" (a line filter: matches log lines containing the case-sensitive substring "error")
- Pattern match: {job="syslog"} |~ "BGP|OSPF"
Patterns: query_loki_patterns — detect recurring log structures
Stats: query_loki_stats — log volume and rate analysis
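The LogQL examples in this workflow share one structure: a stream selector in braces, optionally followed by a substring filter (`|=`) or a regex filter (`|~`). A minimal sketch of a builder for that structure follows. The function name `logql_query` and its parameters are hypothetical, and the label names are assumed to exist in your Loki datasource.

```python
def logql_query(labels: dict, contains: str = "", regex: str = "") -> str:
    """Build a LogQL query: stream selector plus optional line filters."""

    def quote(text: str) -> str:
        # Double-quoted LogQL strings need backslashes and quotes escaped.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    if not labels:
        raise ValueError("a LogQL stream selector needs at least one label")

    selector = "{" + ", ".join(f"{k}={quote(v)}" for k, v in labels.items()) + "}"
    query = selector
    if contains:
        query += f" |= {quote(contains)}"  # substring match on the log line
    if regex:
        query += f" |~ {quote(regex)}"     # regular-expression match
    return query


print(logql_query({"host": "core-rtr-01"}, contains="error"))
# {host="core-rtr-01"} |= "error"
print(logql_query({"job": "syslog"}, regex="BGP|OSPF"))
# {job="syslog"} |~ "BGP|OSPF"
```

Requiring at least one label reflects that a LogQL query must start with a stream selector; narrowing the selector first also keeps the result set small.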

Integration with Other Skills

Skill	Integration
pyats-health-check	Cross-reference pyATS health data with Grafana metrics and dashboards
pyats-routing	Correlate OSPF/BGP state changes with Grafana metric timelines
gait-session-tracking	Record all Grafana queries and findings in GAIT audit trail
slack-network-alerts	Feed Grafana alerts through Slack and NetClaw for automated investigation
servicenow-change-workflow	Annotate Grafana dashboards during change windows; correlate incidents with CRs
te-network-monitoring	Pair ThousandEyes path data with Grafana infrastructure metrics
aws-cloud-monitoring	Compare Grafana dashboards with CloudWatch data for hybrid visibility
markmap-viz	Visualize Grafana alert rule hierarchies as mind maps

Context Window Management

Grafana dashboards can be large JSON documents. Use these strategies:

Always start with get_dashboard_summary — lightweight overview, not full JSON
Use get_dashboard_property with JSONPath for specific fields
Avoid get_dashboard_by_uid unless you need the complete dashboard definition
Use get_dashboard_panel_queries to extract just the query definitions
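The size difference these strategies exploit is easy to see on a toy example. The sketch below uses a hypothetical, hand-written two-panel dashboard and compares the serialized size of the whole document with the size of just the panel titles and query expressions. It illustrates the idea only and does not reproduce how `get_dashboard_panel_queries` is implemented or what it returns.

```python
import json

# Hypothetical, hand-written dashboard; real exports carry far more fields.
dashboard = {
    "uid": "abc123",
    "title": "Network Interfaces",
    "panels": [
        {
            "id": 1,
            "title": "Inbound traffic",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "fieldConfig": {"defaults": {"unit": "bps"}},
            "targets": [{"refId": "A", "expr": "rate(ifHCInOctets[5m]) * 8"}],
        },
        {
            "id": 2,
            "title": "Input errors",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "fieldConfig": {"defaults": {"unit": "short"}},
            "targets": [{"refId": "A", "expr": "increase(ifInErrors[1h])"}],
        },
    ],
}


def panel_queries(dash: dict) -> list:
    """Keep only each panel's title and query expressions."""
    return [
        {
            "title": panel.get("title", ""),
            "queries": [t["expr"] for t in panel.get("targets", []) if "expr" in t],
        }
        for panel in dash.get("panels", [])
    ]


full_size = len(json.dumps(dashboard))
slim_size = len(json.dumps(panel_queries(dashboard)))
print(f"full: {full_size} chars, queries only: {slim_size} chars")
```

On production dashboards with dozens of panels, layout and styling fields dominate the JSON, so the gap is typically much wider than in this toy case.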

Important Rules

Prefer read-only operations — use search_dashboards, get_dashboard_summary, query_prometheus, query_loki_logs, list_alert_rules before any write operations
Dashboard modifications require ServiceNow CR — unless in lab/dev Grafana instance
Alert rule changes require approval — creating/updating/deleting alert rules affects production monitoring
Token-efficient queries — use get_dashboard_summary over get_dashboard_by_uid, use time ranges to limit Prometheus/Loki result size
GAIT audit mandatory — record all Grafana queries, dashboard modifications, alert changes, and incident updates
No secrets in queries — never embed credentials or sensitive data in PromQL/LogQL expressions
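The approval rules above can be expressed as a small gate that runs before any tool call. The sketch below is one possible encoding, assuming the write tools are the ones named in the tool tables of this document. The function `check_tool_call` and its parameters (`change_request`, `alert_approved`, `lab_instance`) are hypothetical and exist only to illustrate the rules; they are not part of mcp-grafana or NetClaw.

```python
from typing import Optional, Tuple

# Tool names taken from the tables in this document.
DASHBOARD_WRITES = frozenset({"update_dashboard", "patch_dashboard"})
ALERT_WRITES = frozenset(
    {"create_alert_rule", "update_alert_rule", "delete_alert_rule"}
)


def check_tool_call(
    tool: str,
    change_request: Optional[str] = None,
    alert_approved: bool = False,
    lab_instance: bool = False,
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a tool call under the rules above."""
    if tool in DASHBOARD_WRITES:
        if lab_instance:
            return True, "lab/dev instance: no CR needed"
        if change_request:
            return True, f"covered by {change_request}"
        return False, "dashboard changes need a ServiceNow CR"
    if tool in ALERT_WRITES:
        if alert_approved:
            return True, "alert rule change approved"
        return False, "alert rule changes need approval"
    # Everything else has no approval gate here, but still belongs in GAIT.
    return True, "no approval gate for this tool"
```

For example, `check_tool_call("patch_dashboard")` returns a refusal, while `check_tool_call("patch_dashboard", change_request="CHG0012345")` is allowed; the CR number is a made-up placeholder.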

Error Handling

Auth fails (401/403): Check GRAFANA_URL and GRAFANA_SERVICE_ACCOUNT_TOKEN in ~/.openclaw/.env. A 401 usually means the token is missing or invalid; a 403 usually means the service account lacks the Editor role or the required RBAC permissions.
Datasource not found: Use list_datasources to discover available datasource UIDs and names.
PromQL/LogQL errors: Use list_prometheus_metric_names or list_loki_label_names to discover valid metric/label names before querying.
Dashboard not found: Use search_dashboards to find dashboards by title before using UID-based tools.
Rate limiting: Grafana may rate-limit API requests; space out large query batches.
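Spacing out a batch, as suggested for rate limiting, can be sketched as a loop that pauses between calls and backs off when a call is rejected. The code below is a generic pattern under stated assumptions: `RateLimited` is a hypothetical exception that the caller's own wrapper would raise on an HTTP 429 response, and the delay values are arbitrary. The `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller's wrapper on HTTP 429."""


def run_spaced(
    calls: Iterable[Callable[[], T]],
    delay: float = 0.5,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Run calls in order, pausing between them and backing off on RateLimited."""
    results: List[T] = []
    for index, call in enumerate(calls):
        if index > 0:
            sleep(delay)  # fixed gap between consecutive calls
        backoff = delay
        for attempt in range(max_retries + 1):
            try:
                results.append(call())
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the final retry
                backoff *= 2  # exponential backoff before retrying
                sleep(backoff)
    return results
```

Each element of `calls` is a zero-argument callable, for instance a `lambda` that wraps one query, so the batch stays declarative while the pacing lives in one place.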

Repository

automateyournet…/netclaw

First Seen

Mar 3, 2026
