pyats-health-check

Device Health Check

When to Use

  • Proactive daily/weekly health monitoring
  • Pre-change and post-change validation
  • Incident response — first thing you run when alerted
  • Capacity planning and trending
  • Compliance checks for operational readiness

Health Check Procedure

Always run health checks in this exact order. Each section builds on the previous one.

Step 1: Device Identity & Uptime

Run show version to establish baseline identity.

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'
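
When this call is scripted rather than typed, building the argument list in Python avoids shell-quoting problems with the JSON payload. The following is a minimal sketch, assuming MCP_CALL and PYATS_MCP_SCRIPT each hold a single path; `build_show_command` is a hypothetical helper name, not part of pyATS or the MCP tooling.

```python
import json
import os


def build_show_command(device_name, command, env=None):
    """Return an argv list equivalent to the shell one-liner above.

    Hypothetical helper. Assumes MCP_CALL and PYATS_MCP_SCRIPT are set
    and that each contains a single filesystem path.
    """
    env = os.environ if env is None else env
    # json.dumps handles quoting of the device name and command text.
    payload = json.dumps({"device_name": device_name, "command": command})
    return [
        "python3",
        env["MCP_CALL"],
        f"python3 -u {env['PYATS_MCP_SCRIPT']}",
        "pyats_run_show_command",
        payload,
    ]
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`; PYATS_TESTBED_PATH is inherited from the calling environment.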

Extract and report:

  • Hostname, model, serial number
  • IOS-XE version and image filename
  • Uptime (flag if < 24 hours — indicates recent reload)
  • Last reload reason (flag if unexpected: crash, power failure)
  • Total/available memory
  • License status

Thresholds:

  • Uptime < 24h → WARNING: Recent reload
  • Uptime < 1h → CRITICAL: Very recent reload, check for crash
  • Last reload reason contains "crash" or "error" → CRITICAL
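
The Step 1 thresholds can be sketched as a small function. It assumes the uptime and reload reason have already been extracted from the show version output; `classify_uptime` is a hypothetical name used only for illustration.

```python
from datetime import timedelta


def classify_uptime(uptime, reload_reason):
    """Apply the Step 1 thresholds to a parsed uptime and reload reason."""
    reason = reload_reason.lower()
    # A crash or error in the reload reason outranks any uptime value.
    if "crash" in reason or "error" in reason:
        return "CRITICAL"
    if uptime < timedelta(hours=1):
        return "CRITICAL"
    if uptime < timedelta(hours=24):
        return "WARNING"
    return "HEALTHY"
```

Checking the reload reason first means a device that crashed a week ago is still reported as CRITICAL, which matches the third threshold above.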

Step 2: CPU Utilization

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'

Thresholds (5-second / 1-minute / 5-minute averages):

  • < 50% → HEALTHY
  • 50-75% → WARNING: Elevated CPU
  • 75-90% → HIGH: Investigate top processes
  • > 90% → CRITICAL: Immediate investigation required
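
A minimal sketch of the CPU bands, applied to each of the three averages. The list above shares its boundary values between adjacent bands, so this sketch assumes 75% and 90% both fall in HIGH and only values above 90% are CRITICAL. The sample percentages are made up.

```python
def classify_cpu(percent):
    """Map a CPU utilization percentage to a status band."""
    if percent < 50:
        return "HEALTHY"
    if percent < 75:
        return "WARNING"
    if percent <= 90:  # assumption: exactly 75 and exactly 90 count as HIGH
        return "HIGH"
    return "CRITICAL"


# Made-up sample values for the three averaging windows.
averages = {"five_sec": 12, "one_min": 58, "five_min": 91}
statuses = {window: classify_cpu(value) for window, value in averages.items()}
```

Keeping the three windows separate lets the report distinguish a short spike (high 5-second value only) from sustained load (high 5-minute value).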

Top processes to watch:

  • IP Input — high traffic volume or routing loops
  • BGP Router / BGP I/O — large BGP table or instability
  • OSPF-1 Hello — OSPF adjacency issues
  • Crypto IKMP / Crypto Engine — IPsec overhead
  • SNMP ENGINE — polling storm
  • ARP Input — ARP storm or L2 loop

Step 3: Memory Utilization

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'

Also run:

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform resources"}'

Thresholds:

  • Used < 70% → HEALTHY
  • 70-85% → WARNING: Memory pressure
  • 85-95% → HIGH: May impact routing table updates
  • > 95% → CRITICAL: Risk of process crashes or OOM
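
The memory bands work on a used percentage, which usually has to be derived from used and total figures. A sketch under the assumption that both values are available in the same unit; `memory_status` is a hypothetical name, and treating exactly 85% and 95% as HIGH is an assumption.

```python
def memory_status(used, total):
    """Return (status, used_percent) from used and total memory in the same unit."""
    if total <= 0:
        raise ValueError("total must be positive")
    used_pct = 100.0 * used / total
    if used_pct < 70:
        status = "HEALTHY"
    elif used_pct < 85:
        status = "WARNING"
    elif used_pct <= 95:  # assumption: boundary values fall in HIGH
        status = "HIGH"
    else:
        status = "CRITICAL"
    return status, round(used_pct, 1)
```

Returning the percentage alongside the status gives the report the value it needs for its Details column.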

Memory consumers to watch:

  • BGP Router — large BGP table (full internet table = ~1M routes)
  • CEF process — large FIB
  • OSPF Router — large OSPF LSDB
  • HTTP CORE — web server / RESTCONF overhead
  • IOSD iomem — I/O memory for packet buffers

Step 4: Interface Status

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'

Then get detailed counters. show interfaces returns every interface, so report on the active ones:

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'

Report for each interface:

  • Admin status (up/down) and protocol status (up/down)
  • IP address and subnet
  • Speed, duplex, MTU
  • Input/output rate (bps and pps)
  • Error counters: CRC, input errors, output errors, drops, overruns
  • Resets counter (flag if incrementing — indicates flapping)
  • Last input/output timestamps

Flags:

  • Interface up/down → WARNING: Check physical or protocol
  • CRC errors > 0 → WARNING: Physical layer issue (cabling, optics, duplex mismatch)
  • Input errors incrementing → WARNING: Packet corruption
  • Output drops > 0 → WARNING: Congestion or QoS issue
  • Resets incrementing → CRITICAL: Interface flapping
  • Line protocol down on configured interface → CRITICAL
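
Several flags depend on whether a counter is incrementing, which takes two snapshots rather than one. The sketch below covers only the counter-based flags and assumes each snapshot is a dict with the hypothetical keys `crc`, `in_errors`, `out_drops`, and `resets`; the up/down state checks are left out.

```python
def counter_flags(before, after):
    """Compare two counter snapshots of one interface and return (severity, message) pairs."""
    flags = []
    if after["crc"] > 0:
        flags.append(("WARNING", "CRC errors present"))
    if after["in_errors"] > before["in_errors"]:
        flags.append(("WARNING", "Input errors incrementing"))
    if after["out_drops"] > 0:
        flags.append(("WARNING", "Output drops present"))
    if after["resets"] > before["resets"]:
        flags.append(("CRITICAL", "Interface resets incrementing"))
    return flags
```

An interface with a large but static input error count produces no flag here, since only growth between the two snapshots counts as incrementing.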

Step 5: Hardware & Environment

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show inventory"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform"}'

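Report: Module status (ok/fail), serial numbers, PID, and transceiver types. DOM readings are not part of show inventory or show platform output; where the platform has pluggable optics, they need a separate transceiver command (for example, show interfaces transceiver on Catalyst switches).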

Step 6: NTP Synchronization

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ntp associations"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show clock"}'

Flags:

  • No NTP peer synchronized (no * in associations) → CRITICAL for logging/forensics
  • Clock offset > 100ms → WARNING
  • Clock offset > 1s → CRITICAL
  • No NTP configured at all → CRITICAL
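
The NTP flags reduce to three inputs once the command output has been read. A minimal sketch, assuming those inputs are already extracted; `classify_ntp` is a hypothetical name.

```python
def classify_ntp(configured, synced, offset_ms):
    """Apply the Step 6 flags.

    configured: True if any NTP server or peer is configured
    synced:     True if one association carries the * marker
    offset_ms:  clock offset in milliseconds (sign is ignored)
    """
    if not configured or not synced:
        return "CRITICAL"
    offset = abs(offset_ms)
    if offset > 1000:
        return "CRITICAL"
    if offset > 100:
        return "WARNING"
    return "HEALTHY"
```

The absolute value is taken because an offset of -250 ms is as far from the reference clock as +250 ms.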

Step 7: System Logs

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'

Scan for these patterns:

  • %SYS-*-RELOAD — reload events
  • %LINEPROTO-5-UPDOWN — interface flaps
  • %OSPF-*-ADJCHG — OSPF adjacency changes
  • %BGP-*-ADJCHANGE — BGP peer state changes
  • %DUAL-*-NBRCHANGE — EIGRP neighbor changes
  • %SYS-2-MALLOCFAIL — memory allocation failure (CRITICAL)
  • %SYS-3-CPUHOG — process monopolizing CPU (HIGH)
  • %TRACKING-* — IP SLA or object tracking changes
  • %SEC-* / %AUTHMGR-* — security events
  • %PLATFORM-*-CRASH — crash events (CRITICAL)
  • Traceback — software bug (CRITICAL — open TAC case)
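
The pattern scan can be sketched with regular expressions, translating each `*` severity wildcard into a single digit. This covers a subset of the list above (the %TRACKING and security patterns are omitted for brevity); the labels are illustrative, and severity is None wherever the list assigns none.

```python
import re
from collections import Counter

# (label, regex, severity); severity is None where the list above gives none.
PATTERNS = [
    ("reload", r"%SYS-\d-RELOAD", None),
    ("interface flap", r"%LINEPROTO-5-UPDOWN", None),
    ("OSPF adjacency change", r"%OSPF-\d-ADJCHG", None),
    ("BGP peer change", r"%BGP-\d-ADJCHANGE", None),
    ("EIGRP neighbor change", r"%DUAL-\d-NBRCHANGE", None),
    ("malloc failure", r"%SYS-2-MALLOCFAIL", "CRITICAL"),
    ("CPU hog", r"%SYS-3-CPUHOG", "HIGH"),
    ("crash", r"%PLATFORM-\d-CRASH", "CRITICAL"),
    ("traceback", r"Traceback", "CRITICAL"),
]
COMPILED = [(label, re.compile(rx), sev) for label, rx, sev in PATTERNS]


def scan_logs(text):
    """Count how many log lines match each pattern label."""
    hits = Counter()
    for line in text.splitlines():
        for label, rx, _severity in COMPILED:
            if rx.search(line):
                hits[label] += 1
    return hits
```

A per-label count supports summary lines such as "3 OSPF adjacency flaps" in the report.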

Step 8: Connectivity Validation

Test reachability to critical infrastructure:

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 8.8.8.8 repeat 5"}'

Thresholds:

  • 100% success, RTT < 50ms → HEALTHY
  • 100% success, RTT > 100ms → WARNING: High latency
  • 80-99% success → WARNING: Packet loss
  • < 80% success → CRITICAL: Significant packet loss
  • 0% success → CRITICAL: No reachability
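
A sketch of applying these thresholds to the ping summary line. It assumes the raw IOS text is available (the MCP tool may return it wrapped or parsed), uses the average RTT, and treats the 50-100 ms range, which the list above leaves unassigned, as HEALTHY. All three are assumptions.

```python
import re

# Matches the IOS summary line; the RTT part is absent when every probe fails.
SUMMARY = re.compile(
    r"Success rate is (\d+) percent \((\d+)/(\d+)\)"
    r"(?:, round-trip min/avg/max = (\d+)/(\d+)/(\d+) ms)?"
)


def classify_ping(output):
    """Return (status, detail) for the text of an IOS ping result."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    success = int(match.group(1))
    avg_rtt = int(match.group(5)) if match.group(5) else None
    if success == 0:
        return "CRITICAL", "No reachability"
    if success < 80:
        return "CRITICAL", "Significant packet loss"
    if success < 100:
        return "WARNING", "Packet loss"
    if avg_rtt is not None and avg_rtt > 100:
        return "WARNING", "High latency"
    # Assumption: 50-100 ms is not covered by the list, so it lands here.
    return "HEALTHY", f"{success}% success, avg RTT {avg_rtt} ms"
```

With repeat 5, the only possible success rates are multiples of 20%, so a single lost probe already yields the 80% WARNING case.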

Health Report Format

Always produce a summary table:

Device: R1 (devnetsandboxiosxec8k.cisco.com)
Model: C8000V | IOS-XE: 17.x.x | Uptime: XXd XXh

┌──────────────────┬──────────┬─────────────────────────┐
│ Check            │ Status   │ Details                 │
├──────────────────┼──────────┼─────────────────────────┤
│ CPU (5min avg)   │ HEALTHY  │ 12%                     │
│ Memory           │ HEALTHY  │ 45% used (1.2G/2.6G)    │
│ Interfaces       │ WARNING  │ Gi2 down/down           │
│ Hardware         │ HEALTHY  │ All modules OK          │
│ NTP              │ HEALTHY  │ Synced, offset 2ms      │
│ Logs             │ WARNING  │ 3 OSPF adjacency flaps  │
│ Connectivity     │ HEALTHY  │ 100% to 8.8.8.8, 23ms   │
└──────────────────┴──────────┴─────────────────────────┘

Overall: WARNING — 2 items need attention

Severity order: CRITICAL > HIGH > WARNING > HEALTHY. Overall status = worst individual status.
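
The "worst individual status" rule can be sketched as follows, using the statuses from the sample table above; `overall_status` is a hypothetical name.

```python
# Least to most severe, so a higher index means a worse status.
SEVERITY_ORDER = ["HEALTHY", "WARNING", "HIGH", "CRITICAL"]


def overall_status(checks):
    """Return (worst status, number of checks that are not HEALTHY)."""
    worst = max(checks.values(), key=SEVERITY_ORDER.index)
    attention = sum(1 for status in checks.values() if status != "HEALTHY")
    return worst, attention


checks = {
    "CPU": "HEALTHY",
    "Memory": "HEALTHY",
    "Interfaces": "WARNING",
    "Hardware": "HEALTHY",
    "NTP": "HEALTHY",
    "Logs": "WARNING",
    "Connectivity": "HEALTHY",
}
summary = overall_status(checks)
```

For the sample table this yields WARNING with two items needing attention, which agrees with the Overall line shown there.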

NetBox Cross-Reference (MISSION02 Enhancement)

When NetBox is available ($NETBOX_MCP_SCRIPT is set), cross-reference device state against the source of truth after Steps 1 and 4:

Interface State Validation

Query NetBox for expected interface states:

python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'

Compare NetBox intent vs device reality:

  • NetBox shows interface enabled but device shows down → CRITICAL: Unexpected outage
  • NetBox shows interface disabled but device shows up → WARNING: Undocumented activation
  • Interface exists on device but not in NetBox → WARNING: Undocumented interface
  • Interface in NetBox but not on device → WARNING: NetBox stale data
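
The four comparison rules can be sketched over two simple mappings. The input shapes are assumptions: `netbox` maps interface name to its enabled flag and `device` maps interface name to whether it is up. Real NetBox and pyATS responses would need to be reduced to this form first.

```python
def compare_interfaces(netbox, device):
    """Return (interface, severity, finding) tuples for NetBox intent vs device state.

    netbox: dict of interface name -> enabled (bool), hypothetical shape
    device: dict of interface name -> is_up (bool), hypothetical shape
    """
    findings = []
    for name in sorted(netbox.keys() | device.keys()):
        if name not in netbox:
            findings.append((name, "WARNING", "Undocumented interface"))
        elif name not in device:
            findings.append((name, "WARNING", "NetBox stale data"))
        elif netbox[name] and not device[name]:
            findings.append((name, "CRITICAL", "Unexpected outage"))
        elif not netbox[name] and device[name]:
            findings.append((name, "WARNING", "Undocumented activation"))
    return findings
```

Interface names must be normalized to one form (for example, Gi1 versus GigabitEthernet1) before comparison, or every interface will appear as both undocumented and stale.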

IP Address Validation

Query NetBox for expected IP assignments:

python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'

Compare: Flag any IP_DRIFT where the device IP differs from NetBox.
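
A sketch of the drift check using the standard library `ipaddress` module, which lets a dotted netmask and a prefix length compare as equal. The input shape (interface name mapped to an address string) is an assumption, and interfaces present on only one side are ignored here.

```python
import ipaddress


def find_ip_drift(netbox_ips, device_ips):
    """Return (interface, expected, actual) for addresses that differ.

    Both arguments map interface name -> 'address/prefix' or 'address/netmask'.
    """
    drift = []
    for name in sorted(netbox_ips.keys() & device_ips.keys()):
        expected = ipaddress.ip_interface(netbox_ips[name])
        actual = ipaddress.ip_interface(device_ips[name])
        if expected != actual:
            drift.append((name, str(expected), str(actual)))
    return drift
```

Comparing parsed objects rather than raw strings avoids false IP_DRIFT flags when one side reports 10.10.20.48/24 and the other 10.10.20.48/255.255.255.0.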

Fleet-Wide Health (pCall)

To run health checks across ALL devices simultaneously, first list all devices:

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices

Then run Steps 1-8 on each device concurrently using multiple exec commands. Collect all results and produce a fleet summary:

┌──────────┬──────────┬──────────┬──────────┬──────────┬─────────────┐
│ Device   │ CPU      │ Mem      │ Intf     │ NTP      │ Overall     │
├──────────┼──────────┼──────────┼──────────┼──────────┼─────────────┤
│ R2       │ HEALTHY  │ HEALTHY  │ CRITICAL │ HEALTHY  │ CRITICAL    │
│ SW1      │ HIGH     │ HEALTHY  │ HEALTHY  │ CRITICAL │ CRITICAL    │
│ R1       │ HEALTHY  │ WARNING  │ HEALTHY  │ HEALTHY  │ WARNING     │
└──────────┴──────────┴──────────┴──────────┴──────────┴─────────────┘

Sort devices by severity (CRITICAL first) for triage prioritization.
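
The concurrent run and the triage sort can be sketched with a thread pool. `check_device` is a hypothetical callable that runs Steps 1-8 for one device and returns its overall status; how it invokes the MCP commands is left open.

```python
from concurrent.futures import ThreadPoolExecutor

# Lower rank sorts first, so CRITICAL devices lead the summary.
RANK = {"CRITICAL": 0, "HIGH": 1, "WARNING": 2, "HEALTHY": 3}


def fleet_summary(devices, check_device, max_workers=8):
    """Run check_device(name) for every device concurrently, then sort for triage."""
    devices = list(devices)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so statuses line up with devices.
        statuses = list(pool.map(check_device, devices))
    rows = list(zip(devices, statuses))
    return sorted(rows, key=lambda row: (RANK[row[1]], row[0]))
```

Threads suit this workload because each check spends most of its time waiting on device I/O. The device name is used as a tie-breaker so the output order is stable between runs.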

GAIT Audit Trail

After completing a health check, record the session in GAIT:

python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Health check completed on R1: CPU HEALTHY (12%), Memory WARNING (78%), Interfaces HEALTHY, NTP HEALTHY. Overall: WARNING.","artifacts":[]}}'
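
When the summary text is generated rather than typed, building the payload with `json.dumps` keeps quotes and special characters in the content from breaking the JSON. A sketch that reproduces the payload shape shown above; `gait_turn_payload` is a hypothetical helper, not part of GAIT.

```python
import json


def gait_turn_payload(device, results, overall):
    """Build the JSON argument for gait_record_turn.

    results: list of (check, status, detail) tuples; detail may be empty.
    """
    parts = ", ".join(
        f"{check} {status} ({detail})" if detail else f"{check} {status}"
        for check, status, detail in results
    )
    content = f"Health check completed on {device}: {parts}. Overall: {overall}."
    return json.dumps(
        {"input": {"role": "assistant", "content": content, "artifacts": []}}
    )
```

The returned string is passed as the final argument of the gait_record_turn call.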

More from automateyournetwork/netclaw

First Seen
Mar 6, 2026