slo-optimize
SLO Optimizer
Analyze SLO timeline trends, compute statistics over the past 28 days, and generate advisory recommendations backed by real metric values. Never modify SLO definitions directly — route to slo-manage when the user wants to apply a recommendation.
Core Principles
- Use gcx commands exclusively — do not call Grafana APIs directly.
- Trust the user's expertise — skip explanations of what SLOs or burn rates are.
- Use -o json for agent processing of structured output; use the default format for user display.
- Show graph output for timeline data so the user can see the trend visually.
- Every recommendation MUST include supporting data (current values, projected values, or historical comparisons). No generic advice without numbers.
- This skill is advisory only. Route to slo-manage for any changes the user wants to apply.
Prerequisites
gcx configured with a context pointing to the target Grafana instance.
If the user does not supply a UUID, list available SLOs first:
gcx slo definitions list
Ask the user which SLO to analyze if the target is ambiguous.
Optimization Workflow
Step 1: Retrieve SLO Definition
gcx slo definitions get <UUID> -o json
Extract and note:
- spec.name: display name
- spec.objectives[0].value: current objective (e.g., 0.999)
- spec.objectives[0].window: compliance window (e.g., 28d)
- spec.query.type: ratio | freeform | threshold
- spec.query.ratio.groupByLabels: dimensional labels (may be empty)
- spec.alerting: fastBurn / slowBurn configuration (may be absent)
- spec.destinationDatasource.uid: datasource UID for metric queries
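The extraction above can be sketched in Python. This is a minimal illustration, assuming the JSON payload nests fields exactly as the paths listed above suggest; the sample document and the helper name summarize_definition are hypothetical, and a real payload may carry additional fields.

```python
import json

# Hypothetical sample shaped after the field paths listed above; a real
# `gcx slo definitions get <UUID> -o json` payload may differ in detail.
SAMPLE = """
{"spec": {"name": "checkout-availability",
          "objectives": [{"value": 0.999, "window": "28d"}],
          "query": {"type": "ratio", "ratio": {"groupByLabels": []}},
          "destinationDatasource": {"uid": "prom-main"}}}
"""

def summarize_definition(raw: str) -> dict:
    """Pull out the fields the later steps rely on, tolerating absent optional keys."""
    spec = json.loads(raw)["spec"]
    objective = spec["objectives"][0]
    query = spec.get("query") or {}
    ratio = query.get("ratio") or {}
    datasource = spec.get("destinationDatasource") or {}
    return {
        "name": spec["name"],
        "objective": objective["value"],
        "window": objective["window"],
        "query_type": query.get("type"),
        "group_by_labels": ratio.get("groupByLabels") or [],
        "alerting": spec.get("alerting"),       # None when alerting is absent
        "datasource_uid": datasource.get("uid"),
    }

summary = summarize_definition(SAMPLE)
print(summary["name"], summary["objective"], summary["window"])
```

Returning None for missing alerting and an empty list for missing groupByLabels keeps the later checks ("absent or empty") to a single truthiness test.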
Step 2: Fetch 28-Day Timeline
# Default graph output for user display
gcx slo definitions timeline <UUID> --from now-28d --to now
# JSON output for statistical analysis
gcx slo definitions timeline <UUID> --from now-28d --to now -o json
Parse the JSON output to extract SLI values across the time series. Compute:
- mean_sli: average SLI over the 28-day window
- min_sli: lowest observed SLI point
- max_sli: highest observed SLI point
- std_dev: variability indicator
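As a rough sketch, the four statistics can be computed from a flat list of SLI samples. How the samples are pulled out of the timeline JSON is not shown because that schema is not documented here. Treating None as a NODATA point and using the population standard deviation are both assumptions.

```python
from statistics import fmean, pstdev

def sli_statistics(values):
    """Summarize SLI samples; None entries (assumed NODATA points) are ignored."""
    points = [v for v in values if v is not None]
    if not points:
        return None  # nothing to summarize: caller should report NODATA
    return {
        "mean_sli": fmean(points),
        "min_sli": min(points),
        "max_sli": max(points),
        "std_dev": pstdev(points),  # population std dev over the window
    }

stats = sli_statistics([0.999, None, 0.997, 0.998])
print(stats)
```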
If timeline returns no data (NODATA), note it and skip to Step 3 for current status.
Step 3: Get Current Status (Wide Format)
gcx slo definitions status <UUID> -o wide
Extract from the wide output:
- Current SLI value
- Error budget remaining (%)
- Burn rate (current)
- SLI_1H and SLI_1D snapshots
- Status: OK | BREACHING | NODATA
Step 4: Query Raw SLI Metrics (When Timeline Is Insufficient)
When timeline data is sparse (< 7 days of points) or all NODATA, query raw metrics directly using the datasource UID from Step 1:
# SLI window metric (primary trend signal)
gcx metrics query <datasource-uid> \
'grafana_slo_sli_window{slo_uuid="<UUID>"}' \
--from now-28d --to now --step 6h
# Success and total rate for ratio SLOs
gcx metrics query <datasource-uid> \
'grafana_slo_success_rate_5m{slo_uuid="<UUID>"}' \
--from now-28d --to now --step 6h
gcx metrics query <datasource-uid> \
'grafana_slo_total_rate_5m{slo_uuid="<UUID>"}' \
--from now-28d --to now --step 6h
If the datasource UID is not in the definition, resolve it:
gcx datasources list --type prometheus
Step 5: Analyze Trends
Classify the pattern using the timeline data from Steps 2 and 4:
Sustained decline — SLI trending downward for 7 or more consecutive days. Compute the slope over the last 7 days vs. the preceding 7 days to confirm direction.
- Recommendation trigger: investigate underlying service degradation; a window adjustment will not fix a declining service.
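One possible reading of the slope comparison, sketched in Python: fit a least-squares slope to each 7-day block of daily SLI values and treat the decline as confirmed when the recent slope is negative and the recent mean sits below the preceding week's mean. The confirmation rule and the function names are assumptions, not part of the skill.

```python
from statistics import fmean

def slope(values):
    """Least-squares slope of values against their index (change per day)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = fmean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def compare_weekly_slopes(daily_sli):
    """Compare the last 7 daily values with the preceding 7; needs 14+ days."""
    if len(daily_sli) < 14:
        return None
    previous, recent = daily_sli[-14:-7], daily_sli[-7:]
    recent_slope = slope(recent)
    return {
        "recent_slope": recent_slope,
        "previous_slope": slope(previous),
        # Assumed rule: negative recent slope and a lower weekly mean.
        "declining": recent_slope < 0 and fmean(recent) < fmean(previous),
    }

history = [0.999] * 7 + [0.999 - 0.0005 * i for i in range(1, 8)]
print(compare_weekly_slopes(history))
```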
Periodic dips — SLI drops recur at regular intervals (e.g., every weekend, every night). Look for temporal correlation in the min points.
- Recommendation trigger: window adjustment (e.g., 7d → 28d smooths weekend traffic spikes) or objective reduction if the dips are expected.
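A simple way to look for a weekly pattern is to average SLI by day of week and compare the buckets. The sketch below assumes the samples are available as (unix_seconds, sli) pairs in UTC; that input shape and the function name are illustrative, not something the timeline output is known to provide directly.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import fmean

def mean_sli_by_weekday(points):
    """Group (unix_seconds, sli) pairs by UTC weekday (0 = Monday) and average them."""
    buckets = defaultdict(list)
    for ts, sli in points:
        weekday = datetime.fromtimestamp(ts, tz=timezone.utc).weekday()
        buckets[weekday].append(sli)
    return {day: fmean(vals) for day, vals in sorted(buckets.items())}

def at(year, month, day):
    """Helper for the example: midnight UTC as unix seconds."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()

samples = [
    (at(2024, 1, 1), 0.999),   # Monday
    (at(2024, 1, 6), 0.990),   # Saturday
    (at(2024, 1, 8), 0.997),   # Monday
    (at(2024, 1, 13), 0.992),  # Saturday
]
print(mean_sli_by_weekday(samples))
```

A weekday bucket whose mean is consistently lower than the rest is a candidate for the "periodic dips" classification; the same grouping by hour of day would surface nightly dips.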
Sudden drops — Step-change in SLI at a specific timestamp (deployment, config change). Identify the onset timestamp and estimate error budget consumed by the event.
- Recommendation trigger: check alerting is configured; if budget consumed > 20% by a single event, consider tighter fastBurn thresholds.
Budget exhaustion rate — Project when the error budget will reach 0 based on the current
burn rate from Step 3. Formula:
days_until_exhausted = budget_remaining_pct / (burn_rate * 100 / window_days)
- Recommendation trigger: if < 7 days remain, flag as urgent; route to slo-investigate.
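A worked instance of the formula, as a small Python sketch. It assumes the burn rate stays constant at its current value and that a burn rate of 1 consumes exactly 100% of the budget over one full window; the function name is illustrative.

```python
def days_until_exhausted(budget_remaining_pct, burn_rate, window_days):
    """Project days until the error budget hits 0 at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # budget is not being consumed
    daily_burn_pct = burn_rate * 100 / window_days  # % of budget burned per day
    return budget_remaining_pct / daily_burn_pct

# 40% budget left, burning at 2x on a 28-day window:
# daily burn = 2 * 100 / 28 (about 7.14% per day), so 40 / 7.14 is about 5.6 days.
remaining = days_until_exhausted(40, 2, 28)
print(f"{remaining:.1f} days, urgent={remaining < 7}")
```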
Step 6: Generate Advisory Recommendations
Produce numbered recommendations. Each recommendation requires:
- A specific change (what to do)
- Supporting data (why — current value vs. proposed value)
- Expected outcome
Objective tuning
If mean_sli < objective - 0.005 (more than 0.5 pp below the objective):
- Suggest lowering the objective to floor(mean_sli * 1000) / 1000 (rounded down to 3 dp).
- Include: current objective, observed mean SLI, proposed objective.
- Rationale: the SLO is chronically breaching due to an unrealistic target.
If mean_sli > objective + 0.010 (more than 1 pp above the objective):
- Suggest tightening the objective toward mean_sli - 0.005.
- Include: current objective, observed mean SLI, proposed objective.
- Rationale: the SLO is trivially satisfied; tighten to reflect achievable performance.
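The two objective-tuning rules can be expressed as a small decision function. This is a sketch of the thresholds stated above; the function name and the "keep" fallback label are illustrative.

```python
import math

def propose_objective(mean_sli, objective):
    """Apply the lowering / tightening thresholds; returns (action, proposed_value)."""
    if mean_sli < objective - 0.005:
        # Chronically breaching: round the observed mean down to 3 decimal places.
        return ("lower", math.floor(mean_sli * 1000) / 1000)
    if mean_sli > objective + 0.010:
        # Trivially satisfied: move toward the observed mean, keeping 0.5 pp headroom.
        return ("tighten", mean_sli - 0.005)
    return ("keep", objective)

print(propose_objective(0.9912, 0.999))  # mean is 0.78 pp below the objective
print(propose_objective(0.9985, 0.999))  # within both thresholds
```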
groupByLabels addition (ratio query type only)
If spec.query.ratio.groupByLabels is empty or absent:
- Recommend adding dimensional labels such as cluster, service, endpoint, or status_code, depending on what labels exist in the underlying metric series.
- Rationale: without groupByLabels, all dimensions are collapsed, so the SLO cannot identify which dimension is causing a breach.
Alerting configuration
If spec.alerting is absent or empty:
- Recommend configuring fastBurn (page) and slowBurn (ticket) alerts.
- Example thresholds: fastBurn burnRateThreshold: 14.4 over 1h (consumes roughly 2% of the error budget per hour on a 28d to 30d window), slowBurn burnRateThreshold: 1 over 6h.
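The budget figure attached to a burn-rate threshold follows from simple arithmetic: a burn rate of 1 spends the whole budget over one window, so a sustained rate r for h hours spends r * h / window_hours of it. A minimal sketch, with an illustrative function name:

```python
def budget_consumed_pct(burn_rate, hours, window_days):
    """Percent of the error budget spent by a constant burn rate sustained for `hours`."""
    return burn_rate * hours / (window_days * 24) * 100

print(f"{budget_consumed_pct(14.4, 1, 30):.2f}%")  # fast burn, 30d window
print(f"{budget_consumed_pct(14.4, 1, 28):.2f}%")  # fast burn, 28d window
print(f"{budget_consumed_pct(1, 6, 28):.2f}%")     # slow burn example, 28d window
```

Including these percentages alongside a threshold recommendation gives the user a concrete sense of how much budget is lost before each alert fires.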
If alerting is configured and current burn rate (from Step 3) has been above 2x for the past 7 days (compare burn rate from status with recent timeline values):
- Recommend reviewing alerting thresholds — existing alerts may not be firing despite sustained budget drain.
- Include: current burn rate, alert threshold from definition, observed duration above 2x.
Window adjustment
If the SLO window is 7d and periodic dips are detected (weekend pattern):
- Recommend switching to 28d to smooth the variability.
- Include: current window, dip frequency, estimated improvement in budget consumption.
If the SLO window is 28d or 30d and mean_sli is very stable (std_dev < 0.001):
- Note the window is appropriate; no change needed.
Step 7: Present Recommendations and Route to slo-manage
Present all recommendations as advisory text. Do not apply any changes.
After presenting recommendations, ask:
"Would you like me to apply any of these recommendations? If so, I'll switch to slo-manage to pull the current definition and implement the changes with a dry-run first."
If the user confirms, invoke the slo-manage skill to handle the update workflow.
Output Format
SLO: <name>
UUID: <uuid>
Objective: <value> over <window>
Analysis period: now-28d to now
SLI Statistics (28d):
Mean: <value> Min: <value> Max: <value>
Std dev: <value>
Current Status:
SLI: <value> Budget remaining: <pct>% Burn rate: <value>x
SLI (1h): <value> SLI (1d): <value>
[28-day timeline graph]
Trend classification: <Sustained decline | Periodic dips | Sudden drops | Stable>
<One sentence describing the dominant pattern with supporting data>
Advisory Recommendations:
1. <Recommendation title>
Current: <value>
Proposed: <value>
Why: <rationale with numbers>
2. <Recommendation title>
...
[If no recommendations apply:]
No objective or alerting changes recommended. The SLO configuration appears well-calibrated
for the observed performance over the past 28 days.
---
To apply a recommendation: slo-manage will pull the definition and apply the change with
a dry-run. Confirm which recommendation(s) you want to apply.
Error Handling
Collect errors; report them at the end of the analysis, not interleaved with findings.
- gcx slo definitions get fails (not found): Confirm the UUID and context. Run gcx slo definitions list to show available SLOs.
- Timeline returns NODATA: Recording rule metrics may not be populating. Check the destination datasource configuration. Proceed with raw metric queries in Step 4. If raw metrics also return NODATA, report the data gap and recommend verifying that the SLO recording rules are evaluating correctly.
- Datasource UID not in definition: Run gcx datasources list --type prometheus and present the list to the user. Do not block the analysis; use the remaining timeline data from Step 2.
- Timeline data < 7 days of points: The SLO may be newly created. Note the limited analysis window, proceed with available data, and suppress trend classifications that require 7+ days of data.
- Status returns BREACHING: Note the breach in the output. Include budget exhaustion rate in the recommendations. Route to slo-investigate for deeper root cause analysis if the user wants to understand why the SLO is breaching (not just optimize it).
- gcx command not found or auth error: Check gcx config view to verify the active context and credentials.