agency-sre
Installation
SKILL.md
Agency SRE
Treat reliability as an engineering system with measurable tradeoffs.
Use with companion skills
- Use
grafana-expertorgrafana-dashboardswhen the task needs concrete dashboards or alert rules. - Use
kubernetes-specialistfor workload-level health, capacity, and rollout behavior. - Use
k3s-backupwhen disaster recovery or restore posture matters. - Use
agency-incident-response-commanderwhen the work has moved from prevention into active incident handling.
Core workflow
- Start from user impact, not host trivia. Define what the service must do for users and how failure shows up externally.
- Propose or inspect SLOs and SLIs before discussing alerts or capacity.
- Map the golden signals: latency, traffic, errors, and saturation.
- Separate symptoms from causes. Dashboards should accelerate diagnosis, not just look busy.
- Reduce toil by codifying repetitive operational work, especially recurring incident steps.
Default deliverables
- Reliability review with the main failure modes and current blind spots.
- Suggested SLOs or SLIs, even if they are provisional.
- Alerting changes that reduce noise and improve signal quality.
- Runbook or automation recommendations for recurring failure modes.
- Capacity or scaling notes when resource pressure is part of the problem.
Guardrails
- Do not recommend alert spam. Every alert should imply a human decision.
- Do not optimize blindly. Tie changes to measured latency, error rate, saturation, or burn rate.
- Prefer multi-window, multi-burn-rate thinking for serious services.
- Track operational debt explicitly: missing probes, missing dashboards, no restore drill, unowned alerts.
- Frame tradeoffs clearly: reliability work may pause feature velocity when error budget is exhausted.
Fast checklist
- What is the user-visible symptom?
- What metric proves the symptom exists?
- What alert should have fired, and did it?
- What rollout or dependency change happened recently?
- What can be automated so this exact investigation is shorter next time?
Output pattern
Use this structure unless the user asked for something else:
- Reliability objective
- Current signals and gaps
- Recommended instrumentation or alerts
- Toil reduction or automation
- Risks and next reliability bets
Related skills
More from nordz0r/skills
open-webui-guide
Подробная русскоязычная справка по Open WebUI: архитектура, авторизация, функции, пайплайны, API, RAG, масштабирование, отладка и скрытые возможности. Используй этот скилл при любых вопросах об Open WebUI — как он устроен, как развернуть, настроить авторизацию (OAuth, LDAP, JWT), написать функцию или пайплайн, подключить модель (Ollama, OpenAI), настроить RAG/knowledge base, масштабировать на production, отладить проблему. Также используй при написании кода для Open WebUI: функции (filter, pipe, action), пайплайны, конфигурации, docker-compose.
38zapret-openwrt-guide
>-
32nextcloud-admin
>-
24ollama-search
>-
23amneziawg-openwrt-guide
>-
16podkop-openwrt-guide
>-
15