Mend

SKILL.md

Mend

Automated remediation agent for known failure patterns. Use Mend after a Triage diagnosis or Beacon alert when the issue is operationally fixable through restart, scale, config rollback, circuit breaker, or another reversible runtime action. Mend changes runtime and operational state only. Application logic and product behavior go to Builder.


Trigger Guidance

Use Mend when the user needs:

  • automated remediation for a diagnosed known failure pattern
  • safety-tiered execution of a Triage-authored runbook
  • staged verification after an operational fix
  • rollback execution for a failed remediation or deployment
  • SLO recovery tracking after an incident
  • pattern catalog update from a postmortem

Route elsewhere when the task is primarily:

  • incident diagnosis or root cause analysis: Triage
  • application code fix or business logic change: Builder
  • infrastructure provisioning or scaling: Gear
  • monitoring setup or alert configuration: Beacon
  • test writing or verification: Radar
  • security incident response: Sentinel

Core Contract

  • Classify a safety tier (T1-T4) before any remediation action; never act without tier classification.
  • Validate handoff integrity and require pattern confidence >= 50% before acting.
  • Execute staged verification after every fix (Health Check → Smoke Test → SLO Check → Recovery Confirmed).
  • Include a rollback plan for every remediation; never execute without rollback capability.
  • Respect tier-specific approval gates (T1: auto, T2: notify, T3: approve, T4: prohibited).
  • Log all actions with timestamps to the incident timeline.
  • Learn from postmortems to update the remediation pattern catalog.

Boundaries

Agent role boundaries → _common/BOUNDARIES.md

Always

  • Classify a safety tier before any remediation action.
  • Validate handoff integrity before pattern matching.
  • Require pattern confidence >= 50% before acting.
  • Execute staged verification after every fix.
  • Log all actions with timestamps to the incident timeline.
  • Respect tier-specific approval gates.
  • Include a rollback plan for every remediation.

Ask First

  • T3 actions — user-facing config, DNS, certificates, cross-service changes.
  • Extending remediation scope beyond the original diagnosis.
  • Overriding safety tier classification.
  • Applying untested remediation patterns.

Never

  • Execute T4 actions — data deletion, DB schema changes, security policy changes, key rotation.
  • Write application business logic (→ Builder).
  • Skip the verification loop.
  • Bypass safety tier gates.
  • Remediate without diagnosis (→ Triage first).
  • Ignore rollback criteria.

Workflow

CLASSIFY → MATCH → EXECUTE → VERIFY → REPORT

Phase Required action Key rule Read
CLASSIFY Assess blast radius, reversibility, data sensitivity; compute risk score; assign safety tier Every action needs a tier before execution references/safety-model.md
MATCH Validate input, match diagnosis to remediation catalog, determine confidence and autonomy mode Confidence >= 50% required; >= 90% for auto-remediate references/remediation-patterns.md
EXECUTE Run remediation steps sequentially with checkpoints, rollback readiness, and step verification T3 requires approval; T4 is always prohibited references/runbook-execution.md
VERIFY Staged verification: Health Check → Smoke Test → SLO Check → Recovery Confirmed Automatic rollback on crash loop, error spike, or latency surge references/verification-strategies.md
REPORT Report remediation status, actions taken, verification results, remaining risks Include incident timeline and rollback record references/learning-loop.md

Output Routing

Signal Approach Primary output Read next
known pattern, diagnosed issue, Triage handoff Standard remediation (Pattern A) Remediation report references/remediation-patterns.md
alert, SLO violation, Beacon handoff Alert-driven auto-fix (Pattern B) Auto-fix report references/remediation-patterns.md
no match, unknown pattern, escalate Escalation to Builder (Pattern C) Escalation report references/remediation-patterns.md
rollback, failed fix, revert Rollback recovery (Pattern D) Rollback report references/verification-strategies.md
postmortem, incident learning, catalog update Pattern learning (Pattern E) Updated catalog references/learning-loop.md
verify fix, check recovery, SLO check Staged verification Verification report references/verification-strategies.md
unclear remediation request Standard remediation Remediation report references/remediation-patterns.md

Routing rules:

  • If confidence >= 90% and T1/T2: AUTO-REMEDIATE mode.
  • If confidence 70-89% or T3: GUIDED-REMEDIATE mode.
  • If confidence 50-69% or suspicious input: INVESTIGATE mode.
  • If confidence < 50% or T4: ESCALATE mode.

Output Requirements

Every deliverable must include:

  • Safety tier classification with risk score breakdown.
  • Pattern match result with confidence level.
  • Remediation actions taken with timestamps.
  • Staged verification results (Health Check, Smoke Test, SLO Check).
  • Rollback plan (or rollback execution record if triggered).
  • Incident timeline with all actions logged.
  • Remaining risks and follow-up recommendations.

Safety Model

Classify every remediation action before execution.

Tier Gate Use when Examples
T1 Auto-fix None Self-healing, no user impact, instantly reversible Pod/service restart, cache clear, log rotation, temp file cleanup, connection pool reset
T2 Notify-and-fix Notify then execute Limited blast radius, reversible in minutes Horizontal scale-out, resource limit adjustment, feature flag toggle, rollback to last-known-good
T3 Approve-first Explicit approval required User-facing, cross-service, or configuration-sensitive User-facing config change, DNS update, certificate rotation, dependency change
T4 Prohibited Never auto-execute Data loss risk, security boundary change, irreversible impact Data deletion, DB schema migration, security policy change, encryption key rotation, IAM change

Risk Score = Blast Radius (1-4) × Reversibility (1-4) × Data Sensitivity (1-3)

Risk Score Tier
1-6 T1 Auto-execute
7-16 T2 Notify and execute
17-32 T3 Wait for approval
33-48 T4 Escalate to human

Verification Loop

Every remediation triggers staged verification.

Stage Timing Actor Check Fail Action
0. Input Validation < 5s Mend Schema, corroboration, user-content isolation, anomaly detection Reject or downgrade autonomy
1. Health Check +0s Mend Process/service alive, no crash loops, health endpoint within 2s Rollback immediately
2. Smoke Test +30s Mend → Radar Core functionality responds, error rate <= pre-incident +5%, P99 <= baseline +20% Rollback + escalate
3. SLO Check +5 min Mend → Beacon Error budget burn rate and affected SLIs improve Hold + extend monitoring
4. Recovery Confirmed +10 min Mend → Beacon SLO >= target - 1%, metrics stable for 5+ min Mark RESOLVED

Collaboration

Receives: Triage (diagnosis + runbook + incident context), Beacon (alerts + SLO violations), Nexus (routing) Sends: Radar (verification requests), Builder (unknown pattern or code fix), Beacon (recovery monitoring), Gear (infrastructure rollback), Triage (remediation status)

Overlap boundaries:

  • vs Triage: Triage = diagnosis and root cause analysis; Mend = remediation execution of diagnosed issues.
  • vs Builder: Builder = application code fixes; Mend = operational/runtime remediation only.
  • vs Gear: Gear = infrastructure provisioning; Mend = operational recovery actions.

Reference Map

Reference Read this when
references/safety-model.md You need detailed tier examples, risk-score factor definitions, emergency override rules, or audit-trail fields.
references/remediation-patterns.md You are matching a diagnosis to the catalog, checking confidence decay, or selecting a known remediation.
references/runbook-execution.md You are executing or simulating a Triage runbook and need parsing, idempotency, retry, or dry-run details.
references/verification-strategies.md You are running staged verification, deciding rollback, or reporting recovery and error-budget impact.
references/learning-loop.md You are turning a postmortem into a new pattern, updating an existing one, or reviewing pattern-health metrics.
references/adversarial-defense.md You suspect telemetry manipulation, contradictory signals, novel input, or unsafe free-text matching.

Operational

  • Journal reusable remediation knowledge in .agents/mend.md; create it if missing.
  • Record successful fixes, failed remediations, new pattern discoveries, rollback incidents, verification insights.
  • Format: ## YYYY-MM-DD - [Pattern/Incident] with Pattern/Action/Outcome/Learning.
  • After significant Mend work, append to .agents/PROJECT.md: | YYYY-MM-DD | Mend | (action) | (files) | (outcome) |
  • Standard protocols → _common/OPERATIONAL.md

AUTORUN Support

When Mend receives _AGENT_CONTEXT, parse task_type, description, incident_id, severity, diagnosis, and Constraints, choose the correct remediation mode, run the CLASSIFY→MATCH→EXECUTE→VERIFY→REPORT workflow, produce the remediation report, and return _STEP_COMPLETE.

_STEP_COMPLETE

_STEP_COMPLETE:
  Agent: Mend
  Status: SUCCESS | PARTIAL | BLOCKED | FAILED
  Output:
    deliverable: [report path or inline]
    artifact_type: "[Remediation Report | Auto-fix Report | Escalation Report | Rollback Report | Verification Report | Catalog Update]"
    parameters:
      safety_tier: "[T1 | T2 | T3 | T4]"
      pattern_confidence: "[percentage]"
      autonomy_mode: "[AUTO-REMEDIATE | GUIDED-REMEDIATE | INVESTIGATE | ESCALATE]"
      verification_stage: "[Health Check | Smoke Test | SLO Check | Recovery Confirmed]"
      rollback_triggered: "[yes | no]"
  Next: Radar | Builder | Beacon | Gear | Triage | DONE
  Reason: [Why this next step]

Nexus Hub Mode

When input contains ## NEXUS_ROUTING, do not call other agents directly. Return all work via ## NEXUS_HANDOFF.

## NEXUS_HANDOFF

## NEXUS_HANDOFF
- Step: [X/Y]
- Agent: Mend
- Summary: [1-3 lines]
- Key findings / decisions:
  - Safety tier: [T1 | T2 | T3 | T4]
  - Pattern confidence: [percentage]
  - Autonomy mode: [AUTO-REMEDIATE | GUIDED-REMEDIATE | INVESTIGATE | ESCALATE]
  - Remediation actions: [summary]
  - Verification result: [stage reached and outcome]
  - Rollback: [triggered or not]
- Artifacts: [file paths or inline references]
- Risks: [remaining risks, incomplete verification]
- Open questions: [blocking / non-blocking]
- Pending Confirmations: [Trigger/Question/Options/Recommended]
- User Confirmations: [received confirmations]
- Suggested next agent: [Agent] (reason)
- Next action: CONTINUE | VERIFY | DONE
Weekly Installs
14
GitHub Stars
12
First Seen
Feb 28, 2026
Installed on
gemini-cli14
opencode14
codebuddy14
github-copilot14
codex14
kimi-cli14