Slack Incident Workflow

Coordinate network incident response through Slack using NetClaw's App Agent capabilities. This skill defines structured workflows for incident detection, triage, investigation, resolution, and post-incident review — all conducted through Slack threads and channels.

Slack OAuth Scopes Used

│ Scope              │ Purpose                                │
│ assistant:write    │ Act as App Agent in incident threads   │
│ chat:write         │ Post incident updates                  │
│ channels:join      │ Join incident channels                 │
│ channels:history   │ Read channel context for investigation │
│ groups:history     │ Access private incident channels       │
│ groups:read        │ View private channel info              │
│ pins:read          │ Reference pinned runbooks/procedures   │
│ bookmarks:read     │ Access saved incident resources        │
│ bookmarks:write    │ Save incident artifacts                │
│ files:write        │ Attach logs, configs, diagrams         │
│ reactions:write    │ Track incident status via reactions    │
│ users:read         │ Identify on-call engineers             │
│ users.profile:read │ Check engineer availability            │
│ dnd:read           │ Respect Do Not Disturb before paging   │

Incident Lifecycle in Slack

Phase 1: Detection & Declaration

When a critical alert triggers (from the slack-network-alerts skill or a human report), post the incident declaration:

:rotating_light: *INCIDENT DECLARED — Network Outage*
*Severity:* P1 — Service Impacting
*Detected:* 2024-02-21 14:32 UTC
*Reporter:* NetClaw (automated) / @engineer1 (manual)

*Symptoms:*
• R1 unreachable (ping 0%)
• 47 downstream routes lost
• 3 OSPF adjacencies down
• BGP peer to ISP: IDLE

*Impact:*
• Site A has no WAN connectivity
• Estimated affected users: ~200

*Incident Commander:* [awaiting claim — react with :raised_hand: to take IC]
*ServiceNow:* [CR/INC pending]

━━━ *All investigation updates in this thread* ━━━

Phase 2: Triage & Assignment

When an engineer reacts with :raised_hand:, confirm the incident team and post the triage checklist:

:busts_in_silhouette: *Incident Team Formed*
*IC:* @engineer1 (claimed at 14:35 UTC)
*NetClaw:* Automated investigation assistant

*Triage Checklist:*
:white_check_mark: Alert generated and posted
:white_check_mark: Incident declared (P1)
:white_large_square: IC assigned → :white_check_mark: @engineer1
:white_large_square: ServiceNow incident created
:white_large_square: Upstream device checked
:white_large_square: Blast radius confirmed
:white_large_square: Customer communication sent

_NetClaw beginning automated investigation..._
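
The claim step above can be sketched as a small handler for Slack's `reaction_added` event. This is only an illustration: the field names (`reaction`, `user`, `item.ts`) follow the Events API payload, while the `incidents` dictionary, the `claim_ic` function, and the first-claimant-wins policy are assumptions, not part of NetClaw.

```python
def claim_ic(event, incidents):
    """Assign the Incident Commander from a reaction_added event.

    `incidents` (hypothetical store) maps the declaration message ts to a
    dict with an "ic" key. Returns the IC user ID after processing, or None
    if the event does not concern a tracked incident.
    """
    if event.get("type") != "reaction_added" or event.get("reaction") != "raised_hand":
        return None
    incident = incidents.get(event.get("item", {}).get("ts"))
    if incident is None:
        return None
    # Assumed policy: the first claimant wins; later reactions do not override.
    if incident.get("ic") is None:
        incident["ic"] = event["user"]
    return incident["ic"]


incidents = {"1700000000.000100": {"severity": "P1", "ic": None}}
event = {
    "type": "reaction_added",
    "user": "U111",
    "reaction": "raised_hand",
    "item": {"type": "message", "channel": "C123", "ts": "1700000000.000100"},
}
print(claim_ic(event, incidents))
```

Keeping the handler a pure function over the event payload makes the claim rule easy to test without a live workspace.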

Phase 3: Automated Investigation

NetClaw runs diagnostics and posts results in the thread:

:mag: *Automated Investigation — Step 1/4*
_Checking upstream device R2 for connectivity to R1..._

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL \
  "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device \
  '{"device_name":"R2","command":"ping 10.1.1.1 repeat 10"}'

Post the result of each step as it completes:

:mag: *Investigation Results — Step 1/4*
*Ping from R2 to R1 (10.1.1.1):* 0% success — R1 unreachable from upstream

:mag: *Investigation Results — Step 2/4*
*R2 interface Gi1 (toward R1):* up/up, 0 CRC errors, last input 4 min ago
→ Physical layer looks OK from R2 side

:mag: *Investigation Results — Step 3/4*
*R2 OSPF neighbors:* R1 missing from neighbor table (was FULL)
→ OSPF adjacency lost, DR election may be in progress

:mag: *Investigation Results — Step 4/4*
*R2 logs (last 30 min):*

14:31:47: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from FULL to DOWN
14:31:48: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to down
14:32:01: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to up
14:32:15: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from DOWN to INIT

*Analysis:* R2 saw Gi1 flap at 14:31. Line protocol came back up but OSPF hasn't re-converged. Likely physical issue on R1 side causing interface bounce.
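
The analysis in Step 4 can be approximated by parsing the log lines for link flaps and OSPF state changes. The sketch below is a minimal illustration that targets the abbreviated sample lines shown in this document; real device logs may carry longer interface names and different wording, so the patterns and the function names `parse_events` and `summarize` are assumptions.

```python
import re

# Matches lines such as "14:31:47: %OSPF-5-ADJCHG: <message>" (sample format only).
LOG_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}): %([A-Z0-9_]+)-(\d)-([A-Z0-9_]+): (.*)$")


def parse_events(log_text):
    """Split raw log text into structured events, skipping unrecognized lines."""
    events = []
    for line in log_text.splitlines():
        m = LOG_RE.match(line.strip())
        if m:
            time, facility, severity, mnemonic, message = m.groups()
            events.append({"time": time, "facility": facility,
                           "severity": int(severity), "mnemonic": mnemonic,
                           "message": message})
    return events


def summarize(events):
    """Count line-protocol drops per interface and keep the last OSPF state per neighbor."""
    flaps, ospf_state = {}, {}
    for e in events:
        if e["mnemonic"] == "UPDOWN":
            m = re.search(r"Line protocol on ([^,\s]+), changed to (up|down)", e["message"])
            if m and m.group(2) == "down":
                flaps[m.group(1)] = flaps.get(m.group(1), 0) + 1
        elif e["mnemonic"] == "ADJCHG":
            m = re.search(r"Nbr (\S+) on (\S+) from (\w+) to (\w+)", e["message"])
            if m:
                ospf_state[m.group(1)] = m.group(4)
    return {"flaps": flaps, "ospf_state": ospf_state}


SAMPLE = """\
14:31:47: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from FULL to DOWN
14:31:48: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to down
14:32:01: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to up
14:32:15: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from DOWN to INIT
"""
print(summarize(parse_events(SAMPLE)))
```

A neighbor whose last recorded state is anything other than FULL after the link returns is a reasonable trigger for the "OSPF hasn't re-converged" finding.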

Phase 4: Status Updates

Post periodic status updates:

:hourglass_flowing_sand: *Status Update — 14:50 UTC (18 min elapsed)*
*Status:* Investigating
*Finding:* R1 appears to have reloaded unexpectedly. R2 sees the link recover but R1 is not responding to OSPF hellos yet. Possible crash or power event.
*Next Step:* Waiting for R1 to complete boot sequence. Checking console access.
*ETA:* Unknown — dependent on R1 recovery

_ServiceNow INC0012345 updated_
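
The elapsed time in each status header is the difference between the declaration timestamp and the time of the update. A minimal sketch, using the timestamps from this example incident (the function name `elapsed_minutes` is hypothetical):

```python
from datetime import datetime, timezone


def elapsed_minutes(declared_at, now):
    """Whole minutes between declaration and `now` (both timezone-aware)."""
    return int((now - declared_at).total_seconds() // 60)


declared_at = datetime(2024, 2, 21, 14, 32, tzinfo=timezone.utc)
update_at = datetime(2024, 2, 21, 14, 50, tzinfo=timezone.utc)
resolved_at = datetime(2024, 2, 21, 15, 6, tzinfo=timezone.utc)

print(f"{update_at:%H:%M} UTC ({elapsed_minutes(declared_at, update_at)} min elapsed)")
print(f"Duration: {elapsed_minutes(declared_at, resolved_at)} minutes")
```

Using timezone-aware UTC values throughout avoids off-by-an-hour errors when engineers in different zones read the thread.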

Phase 5: Resolution

:white_check_mark: *INCIDENT RESOLVED*
*Duration:* 34 minutes (14:32 — 15:06 UTC)
*Resolution:* R1 experienced a software crash (Traceback in logs). Device auto-reloaded and recovered. All OSPF adjacencies re-established. Full routing restored.

*Post-Resolution Verification:*
• R1 reachable: :white_check_mark: 100% ping success
• OSPF neighbors: :white_check_mark: 3/3 FULL
• BGP peer: :white_check_mark: Established
• Route count: :white_check_mark: 47 routes (matches baseline)
• Connectivity: :white_check_mark: 100% to all targets

*Root Cause:* Software crash — Traceback found in logs indicating bug CSCxx12345. TAC case recommended.

*ServiceNow:* INC0012345 resolved
*GAIT:* Session abc123 closed
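
The post-resolution verification amounts to comparing current state against a pre-incident baseline. The sketch below assumes the diagnostics have already been reduced to simple values; the dictionary keys and the `verify_recovery` function are hypothetical, and `:x:` is used as an assumed failure marker.

```python
def verify_recovery(baseline, current):
    """Return (all_ok, lines) comparing post-resolution state to the baseline."""
    checks = [
        ("R1 reachable", current["ping_success_pct"] == 100,
         f"{current['ping_success_pct']}% ping success"),
        ("OSPF neighbors", current["ospf_full"] == baseline["ospf_neighbors"],
         f"{current['ospf_full']}/{baseline['ospf_neighbors']} FULL"),
        ("BGP peer", current["bgp_state"] == "Established", current["bgp_state"]),
        ("Route count", current["route_count"] == baseline["route_count"],
         f"{current['route_count']} routes"),
    ]
    lines = []
    for name, ok, detail in checks:
        mark = ":white_check_mark:" if ok else ":x:"
        lines.append(f"• {name}: {mark} {detail}")
    return all(ok for _, ok, _ in checks), lines


baseline = {"ospf_neighbors": 3, "route_count": 47}
current = {"ping_success_pct": 100, "ospf_full": 3,
           "bgp_state": "Established", "route_count": 47}
ok, lines = verify_recovery(baseline, current)
print("\n".join(lines))
```

Only declare the incident resolved when every check passes; a single failing line is a reason to keep the thread in the investigating state.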

Phase 6: Post-Incident Review

:clipboard: *Post-Incident Review — Scheduled*
*Incident:* Network Outage — R1 crash
*Date:* 2024-02-22 10:00 UTC
*Channel:* This thread

*Review Artifacts (attached):*
1. :page_facing_up: Timeline of events
2. :page_facing_up: R1 show logging output
3. :page_facing_up: R1 show version (confirms reload reason)
4. :page_facing_up: GAIT audit trail (full session)
5. :page_facing_up: Pre/post health check comparison

*Discussion Topics:*
• Was detection fast enough?
• Was automated investigation helpful?
• What monitoring gaps exist?
• Should R1 be upgraded to patched version?
• Do we need redundant path for this link?

Escalation Matrix

:arrow_up: *Escalation Guide*

│ Severity │ Notify             │ Escalate After    │ Channel          │
│ P1       │ IC + Manager + NOC │ 15 min            │ #incidents       │
│ P2       │ IC + Team          │ 30 min            │ #netclaw-alerts  │
│ P3       │ Assigned engineer  │ 4 hours           │ #netclaw-alerts  │
│ P4       │ Queue only         │ Next business day │ #netclaw-general │

Before escalating, check DND status:

  • If the engineer has DND active, escalate to the next person in the rotation
  • Never suppress a P1 escalation because of DND
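
The matrix and the DND rules can be combined into a small decision helper. This is a sketch under stated assumptions: `dnd_active` is a hypothetical mapping that would be populated from Slack's DND API (the `dnd:read` scope), and paging the first person in the rotation when everyone is on DND during a P1 is one reading of "never suppress", not a confirmed NetClaw behavior.

```python
# Escalation thresholds from the matrix; P4 has no timer (next business day).
ESCALATION = {
    "P1": {"after_min": 15, "channel": "#incidents"},
    "P2": {"after_min": 30, "channel": "#netclaw-alerts"},
    "P3": {"after_min": 240, "channel": "#netclaw-alerts"},
    "P4": {"after_min": None, "channel": "#netclaw-general"},
}


def escalation_due(severity, minutes_elapsed):
    """True once the incident has been open longer than its threshold."""
    after = ESCALATION[severity]["after_min"]
    return after is not None and minutes_elapsed >= after


def pick_target(severity, rotation, dnd_active):
    """Return the first engineer in `rotation` who is not on DND."""
    for user in rotation:
        if not dnd_active.get(user, False):
            return user
    # Everyone is on DND. Assumption: a P1 still pages the first person in
    # the rotation, while lower severities wait.
    return rotation[0] if severity == "P1" and rotation else None


rotation = ["U111", "U222", "U333"]
print(pick_target("P2", rotation, {"U111": True}))
print(escalation_due("P1", 18))
```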

Reaction-Based Status Tracking

│ Reaction           │ Status        │ Meaning                           │
│ :rotating_light:   │ Declared      │ Incident is active                │
│ :raised_hand:      │ Claimed       │ IC has taken ownership            │
│ :mag:              │ Investigating │ Active investigation              │
│ :wrench:           │ Fixing        │ Fix being applied                 │
│ :hourglass:        │ Waiting       │ Waiting on external (vendor, ISP) │
│ :white_check_mark: │ Resolved      │ Incident resolved                 │
│ :bookmark:         │ PIR Scheduled │ Post-incident review planned      │
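
One way to turn the reactions on the declaration message into a single status is to treat the rows above as an ordered progression and report the furthest stage present. The ordering rule and the `current_status` function are assumptions for illustration; reaction names are given without the surrounding colons, as they appear in Slack API payloads.

```python
# Reaction name -> status label, in the order the rows appear above.
STATUS_ORDER = [
    ("rotating_light", "Declared"),
    ("raised_hand", "Claimed"),
    ("mag", "Investigating"),
    ("wrench", "Fixing"),
    ("hourglass", "Waiting"),
    ("white_check_mark", "Resolved"),
    ("bookmark", "PIR Scheduled"),
]


def current_status(reactions):
    """Return the furthest-along status among the reaction names present."""
    present = set(reactions)
    status = None
    for name, label in STATUS_ORDER:
        if name in present:
            status = label  # later rows overwrite earlier ones
    return status


print(current_status(["rotating_light", "raised_hand", "mag"]))
```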

ServiceNow Integration

Create the ServiceNow incident at Phase 1:

python3 $MCP_CALL "python3 -u $SERVICENOW_MCP_SCRIPT" create_incident \
  '{"short_description":"P1 - R1 unreachable, WAN outage Site A","description":"R1 is unreachable. 47 routes lost, 3 OSPF adjacencies down. Impact: ~200 users at Site A without WAN connectivity.","urgency":"1","impact":"1","category":"Network"}'

Update the ServiceNow record as the incident progresses, and close it on resolution.
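
When the description comes from live alert text, hand-writing the JSON inside single quotes is fragile (an apostrophe in the text ends the shell string early). A hedged alternative is to build the argument list in Python and serialize the payload with `json.dumps`. The sketch below only constructs the command; the fallback script names are placeholders, and `build_create_incident_cmd` is a hypothetical helper.

```python
import json
import os
import shlex


def build_create_incident_cmd(short_description, description,
                              urgency="1", impact="1", category="Network"):
    """Build the argv list for the create_incident call shown above."""
    payload = {
        "short_description": short_description,
        "description": description,
        "urgency": urgency,
        "impact": impact,
        "category": category,
    }
    # Placeholder defaults; in practice these environment variables are set.
    mcp_call = os.environ.get("MCP_CALL", "mcp_call.py")
    script = os.environ.get("SERVICENOW_MCP_SCRIPT", "servicenow_mcp.py")
    return ["python3", mcp_call, f"python3 -u {script}",
            "create_incident", json.dumps(payload)]


cmd = build_create_incident_cmd(
    "P1 - R1 unreachable, WAN outage Site A",
    "R1 is unreachable. Site A's WAN link is down.",
)
# shlex.join shows the safely quoted shell form; pass `cmd` to subprocess.run to execute.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` bypasses the shell entirely, so no quoting is needed at execution time.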

GAIT Audit Trail

Record every phase in GAIT:

python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn \
  '{"input":{"role":"assistant","content":"INCIDENT P1: R1 unreachable. Phase 1 declared, Phase 2 IC assigned @engineer1, Phase 3 automated investigation shows R1 crash, Phase 4 monitoring recovery, Phase 5 resolved after 34 min. INC0012345 closed.","artifacts":[]}}'