ml-ai
Grafana Cloud AI & ML
Docs: https://grafana.com/docs/grafana-cloud/alerting-and-irm/machine-learning/
Grafana Assistant
Context-aware LLM sidebar agent (GA). Integrates with your Grafana Cloud stack.
Capabilities:
- Convert natural language to PromQL/LogQL/TraceQL
- Explain existing queries in plain English
- Build and edit dashboards from descriptions
- Investigate incidents (correlate metrics, logs, traces)
- MCP server integration — connect external tools to Assistant
- RBAC controls per organization
- Slack integration for on-call workflows
Assistant Investigations (public preview): Multi-agent autonomous incident analysis mode — launches multiple specialized agents in parallel to investigate different signals.
Enable: Grafana Cloud → Administration → AI & LLM → Enable Grafana Assistant
In panel editor: Click the magic wand / "Assistant" icon to get query suggestions and explanations.
Dynamic Alerting
ML-based alerting without static thresholds.
Forecasting (Prophet model)
Trained on 90 days of history; learns daily and weekly seasonality patterns.
# Create forecast job
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/forecast \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"name": "cpu-forecast",
"metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
"datasourceId": 1,
"interval": 300,
"trainingWindow": "90d",
"forecastWindow": "7d",
"algorithm": { "name": "prophet", "config": {} }
}'
Generated metric pairs for alert rules:
# Predicted value
ml_forecast{job="cpu-forecast"}
# Confidence bounds
ml_forecast_lower{job="cpu-forecast"}
ml_forecast_upper{job="cpu-forecast"}
# Alert: actual > upper bound (anomaly above forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
> ml_forecast_upper{job="cpu-forecast"} * 1.1
Outlier Detection
Detects when one series in a group deviates from its peers.
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/outlier \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"name": "service-error-outliers",
"metric": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
"datasourceId": 1,
"interval": 300,
"algorithm": {
"name": "dbscan",
"sensitivity": 0.5,
"config": { "epsilon": 0.5 }
}
}'
# Score > 0: series is an outlier (use in alert rule)
ml_outlier_score{job="service-error-outliers", service="checkout"}
Alert Rules using ML
groups:
- name: ml-alerts
rules:
- alert: CPUAboveForecast
expr: |
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
> ml_forecast_upper{job="cpu-forecast"} * 1.1
for: 5m
labels:
severity: warning
annotations:
summary: "CPU usage significantly above forecast"
- alert: ServiceErrorRateAnomaly
expr: ml_outlier_score{job="service-error-outliers"} > 0.8
for: 5m
labels:
severity: critical
annotations:
summary: "Anomalous error rate on {{ $labels.service }}"
Sift (Automated Root Cause Analysis)
Free for all Grafana Cloud accounts. Automatically investigates incidents by correlating signals.
8 Analysis Types:
| Analysis | What it checks |
|---|---|
| Error Pattern Logs | Clusters log errors by pattern, ranks by frequency/recency |
| HTTP Error Series | Finds HTTP 4xx/5xx spikes correlated with incident window |
| Kube Crashes | OOMKills, pod restarts, evictions in K8s |
| Log Query | Custom LogQL query results correlated to incident time |
| Metric Query | Custom PromQL anomalies around incident window |
| Noisy Neighbors | Detects resource contention from co-located services |
| Recent Deployments | Correlates recent Helm/K8s deployments with incident start |
| Resource Contention | CPU throttling, memory pressure, disk I/O saturation |
Trigger Sift from:
- Explore → "Run Sift Investigation"
- Dashboard panel → "Investigate with Sift"
- Grafana Incident → "Run Sift" button
- Command palette (
Cmd+K) → "Start Sift investigation" - OnCall escalation chains → automatic trigger
# Trigger via API
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"name": "checkout-latency-spike",
"start": "2024-02-01T10:00:00Z",
"end": "2024-02-01T10:30:00Z",
"filters": { "service": "checkout", "namespace": "production" }
}'
Knowledge Graph
Auto-discovers services, pods, nodes, and namespaces from metric labels and trace data. Updates every minute.
Access: Observability → Entity graph
Search syntax:
Show Service api-server
Show all services in namespace production
Show Pod frontend-abc123
RCA Workbench: Structured troubleshooting interface built on the knowledge graph — traces relationships between entities to identify blast radius and upstream causes.
LLM Plugin
Acts as an authenticated proxy for LLM provider API calls from Grafana panels and plugins.
Supported providers: OpenAI, Anthropic (Claude), Azure OpenAI, vLLM, Ollama, LiteLLM
Powered features: Flame graph interpretation, incident auto-summary, panel title generation, Sift log explanations, natural language panel descriptions.
Enable: Administration → Plugins → LLM Plugin → "Enable OpenAI/LLM access via Grafana"
# provisioning/plugins/llm.yaml
apiVersion: 1
apps:
- type: grafana-llm-app
jsonData:
# OpenAI
openAIUrl: https://api.openai.com
openAIModel: gpt-4o
# Or Anthropic:
# provider: anthropic
# anthropicModel: claude-sonnet-4-6
# Or Azure OpenAI:
# openAIUrl: https://your-resource.openai.azure.com
# azureModelMapping: '[["gpt-4o","your-deployment-name"]]'
secureJsonData:
openAIKey: sk-your-openai-key
Adaptive Metrics
Identifies unused metrics to reduce cardinality and storage costs.
# Get aggregation recommendations
curl https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/recommendations \
-H "Authorization: Bearer <token>"
Aggregation rule (drops high-cardinality labels):
- match: "^http_request_duration_seconds.*"
action: keep
match_labels: [method, status, service]
# Drops: pod, container, instance