pyats-health-check
Device Health Check
Perform comprehensive health assessments on network devices using pyATS. This skill defines the systematic approach for evaluating device health across all critical dimensions.
When to Use
- Proactive daily/weekly health monitoring
- Pre-change and post-change validation
- Incident response — first thing you run when alerted
- Capacity planning and trending
- Compliance checks for operational readiness
Health Check Procedure
Always run health checks in this exact order. Each section builds on the previous one.
Step 1: Device Identity & Uptime
Run show version to establish baseline identity.
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'
Extract and report:
- Hostname, model, serial number
- IOS-XE version and image filename
- Uptime (flag if < 24 hours — indicates recent reload)
- Last reload reason (flag if unexpected: crash, power failure)
- Total/available memory
- License status
Thresholds:
- Uptime < 24h → WARNING: Recent reload
- Uptime < 1h → CRITICAL: Very recent reload, check for crash
- Last reload reason contains "crash" or "error" → CRITICAL
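The thresholds above can be sketched as a small classifier. This is a minimal illustration, assuming the uptime and reload reason have already been extracted from the show version output; the function name and its arguments are hypothetical and not part of the pyATS tooling.

```python
from datetime import timedelta


def classify_uptime(uptime: timedelta, reload_reason: str) -> str:
    """Map uptime and last reload reason to a status using the Step 1 thresholds."""
    reason = reload_reason.lower()
    # A crash or error in the reload reason is CRITICAL regardless of uptime.
    if "crash" in reason or "error" in reason:
        return "CRITICAL"
    if uptime < timedelta(hours=1):
        return "CRITICAL"
    if uptime < timedelta(hours=24):
        return "WARNING"
    return "HEALTHY"
```

Checking the reload reason first means a device that crashed days ago is still reported as CRITICAL rather than HEALTHY.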
Step 2: CPU Utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'
Thresholds (5-second / 1-minute / 5-minute averages):
- < 50% → HEALTHY
- 50-75% → WARNING: Elevated CPU
- 75-90% → HIGH: Investigate top processes
- > 90% → CRITICAL: Immediate investigation required
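As a rough sketch, the CPU bands can be applied to each averaging window separately. The function names are hypothetical, and the handling of exact boundary values (75% counts as HIGH, 90% counts as HIGH) is an assumption, since the bands above overlap at their edges.

```python
def classify_cpu(percent: float) -> str:
    """Return the status band for one CPU utilization figure (0-100)."""
    if percent > 90:
        return "CRITICAL"
    if percent >= 75:
        return "HIGH"
    if percent >= 50:
        return "WARNING"
    return "HEALTHY"


def cpu_report(five_sec: float, one_min: float, five_min: float) -> dict:
    """Classify the 5-second, 1-minute, and 5-minute averages independently."""
    return {
        "5sec": classify_cpu(five_sec),
        "1min": classify_cpu(one_min),
        "5min": classify_cpu(five_min),
    }
```

Keeping the three windows separate helps distinguish a short spike (high 5-second value only) from sustained load (high 5-minute value).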
Top processes to watch:
- IP Input: high traffic volume or routing loops
- BGP Router / BGP I/O: large BGP table or instability
- OSPF-1 Hello: OSPF adjacency issues
- Crypto IKMP / Crypto Engine: IPsec overhead
- SNMP ENGINE: polling storm
- ARP Input: ARP storm or L2 loop
Step 3: Memory Utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'
Also run:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform resources"}'
Thresholds:
- Used < 70% → HEALTHY
- 70-85% → WARNING: Memory pressure
- 85-95% → HIGH: May impact routing table updates
- > 95% → CRITICAL: Risk of process crashes or OOM
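The memory thresholds apply to a used percentage, which usually has to be derived from raw byte counts first. The sketch below assumes used and total bytes have already been pulled from the command output; the function name is hypothetical, and boundary values (70%, 85%) are assigned to the higher band by assumption.

```python
def memory_status(used_bytes: int, total_bytes: int) -> tuple[str, str]:
    """Return (status, details) for memory usage using the Step 3 thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = 100 * used_bytes / total_bytes
    if pct > 95:
        status = "CRITICAL"
    elif pct >= 85:
        status = "HIGH"
    elif pct >= 70:
        status = "WARNING"
    else:
        status = "HEALTHY"
    gib = 2 ** 30
    # Details string in the same style as the health report table.
    details = f"{pct:.0f}% used ({used_bytes / gib:.1f}G/{total_bytes / gib:.1f}G)"
    return status, details
```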
Memory consumers to watch:
- BGP Router: large BGP table (full internet table = ~1M routes)
- CEF process: large FIB
- OSPF Router: large OSPF LSDB
- HTTP CORE: web server / RESTCONF overhead
- IOSD iomem: I/O memory for packet buffers
Step 4: Interface Status
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'
Then for each active interface, get detailed counters:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
Report for each interface:
- Admin status (up/down) and protocol status (up/down)
- IP address and subnet
- Speed, duplex, MTU
- Input/output rate (bps and pps)
- Error counters: CRC, input errors, output errors, drops, overruns
- Resets counter (flag if incrementing — indicates flapping)
- Last input/output timestamps
Flags:
- Interface up/down (admin up, line protocol down) → WARNING: Check physical layer or protocol
- CRC errors > 0 → WARNING: Physical layer issue (cabling, optics, duplex mismatch)
- Input errors incrementing → WARNING: Packet corruption
- Output drops > 0 → WARNING: Congestion or QoS issue
- Resets incrementing → CRITICAL: Interface flapping
- Line protocol down on configured interface → CRITICAL
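Several of these flags depend on whether a counter is incrementing, which requires two samples taken some time apart. A minimal sketch, assuming each sample has been reduced to a plain dict for one interface; the key names (crc, input_errors, output_drops, resets) are hypothetical and would need to be mapped from the actual parsed output.

```python
def counter_flags(before: dict, after: dict) -> list[tuple[str, str]]:
    """Compare two counter snapshots of one interface and return (status, reason) flags."""
    flags = []
    if after["crc"] > 0:
        flags.append(("WARNING", "CRC errors present: physical layer issue"))
    if after["input_errors"] > before["input_errors"]:
        flags.append(("WARNING", "Input errors incrementing: packet corruption"))
    if after["output_drops"] > 0:
        flags.append(("WARNING", "Output drops present: congestion or QoS issue"))
    if after["resets"] > before["resets"]:
        flags.append(("CRITICAL", "Resets incrementing: interface flapping"))
    return flags
```

Counters that only matter when non-zero (CRC, output drops) use the later sample alone; the incrementing checks compare both samples.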
Step 5: Hardware & Environment
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show inventory"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform"}'
Report: Module status (ok/fail), serial numbers, PID, transceiver types and DOM readings.
Step 6: NTP Synchronization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ntp associations"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show clock"}'
Flags:
- No NTP peer synchronized (no * in associations) → CRITICAL for logging/forensics
- Clock offset > 100ms → WARNING
- Clock offset > 1s → CRITICAL
- No NTP configured at all → CRITICAL
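The NTP flags combine into a single status roughly as follows. This sketch assumes the configured/synchronized state and the offset in milliseconds have already been extracted from the command output; the function and parameter names are hypothetical.

```python
from typing import Optional


def classify_ntp(configured: bool, synced: bool, offset_ms: Optional[float]) -> str:
    """Apply the Step 6 flags: sync state first, then clock offset."""
    if not configured or not synced:
        return "CRITICAL"
    if offset_ms is None:
        # Synced but no offset available: nothing further to flag.
        return "HEALTHY"
    magnitude = abs(offset_ms)  # offsets can be negative
    if magnitude > 1000:
        return "CRITICAL"
    if magnitude > 100:
        return "WARNING"
    return "HEALTHY"
```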
Step 7: System Logs
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
Scan for these patterns:
- %SYS-*-RELOAD: reload events
- %LINEPROTO-5-UPDOWN: interface flaps
- %OSPF-*-ADJCHG: OSPF adjacency changes
- %BGP-*-ADJCHANGE: BGP peer state changes
- %DUAL-*-NBRCHANGE: EIGRP neighbor changes
- %SYS-2-MALLOCFAIL: memory allocation failure (CRITICAL)
- %SYS-3-CPUHOG: process monopolizing CPU (HIGH)
- %TRACKING-*: IP SLA or object tracking changes
- %SEC-* / %AUTHMGR-*: security events
- %PLATFORM-*-CRASH: crash events (CRITICAL)
- Traceback: software bug (CRITICAL; open a TAC case)
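One way to scan for the patterns above is to translate each wildcard into a regular expression. This is a sketch under assumptions: the log is available as a list of text lines, the * in each mnemonic stands for a single severity digit, and the severity is left as None wherever the list above does not assign one. The names PATTERNS and scan_logs are hypothetical.

```python
import re

# (compiled pattern, description, severity or None when the source assigns none)
PATTERNS = [
    (re.compile(r"%SYS-\d-RELOAD"), "reload event", None),
    (re.compile(r"%LINEPROTO-5-UPDOWN"), "interface flap", None),
    (re.compile(r"%OSPF-\d-ADJCHG"), "OSPF adjacency change", None),
    (re.compile(r"%BGP-\d-ADJCHANGE"), "BGP peer state change", None),
    (re.compile(r"%DUAL-\d-NBRCHANGE"), "EIGRP neighbor change", None),
    (re.compile(r"%SYS-2-MALLOCFAIL"), "memory allocation failure", "CRITICAL"),
    (re.compile(r"%SYS-3-CPUHOG"), "process monopolizing CPU", "HIGH"),
    (re.compile(r"%TRACKING-"), "IP SLA or object tracking change", None),
    (re.compile(r"%(?:SEC|AUTHMGR)-"), "security event", None),
    (re.compile(r"%PLATFORM-\d-CRASH"), "crash event", "CRITICAL"),
    (re.compile(r"Traceback"), "software bug", "CRITICAL"),
]


def scan_logs(lines):
    """Return (description, severity, line) for every line matching a watched pattern."""
    hits = []
    for line in lines:
        for pattern, description, severity in PATTERNS:
            if pattern.search(line):
                hits.append((description, severity, line))
                break  # report each line once, under the first matching pattern
    return hits
```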
Step 8: Connectivity Validation
Test reachability to critical infrastructure:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 8.8.8.8 repeat 5"}'
Thresholds:
- 100% success, RTT < 50ms → HEALTHY
- 100% success, RTT > 100ms → WARNING: High latency
- 80-99% success → WARNING: Packet loss
- < 80% success → CRITICAL: Significant packet loss
- 0% success → CRITICAL: No reachability
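The ping thresholds can be sketched as a parse step followed by a classification step. This assumes the tool returns raw IOS-style ping text containing a "Success rate is N percent" line and, when any replies arrive, a "min/avg/max = a/b/c ms" section; if the tool returns structured data instead, only the classifier applies. The function names are hypothetical, and treating a 50-100 ms average as HEALTHY is an assumption, since the thresholds above leave that band undefined.

```python
import re
from typing import Optional, Tuple


def parse_ping(output: str) -> Tuple[int, Optional[float]]:
    """Extract (success percent, average RTT in ms or None) from IOS-style ping text."""
    rate = re.search(r"Success rate is (\d+) percent", output)
    if rate is None:
        raise ValueError("no success-rate line found")
    rtt = re.search(r"min/avg/max = (\d+)/(\d+)/(\d+) ms", output)
    avg = float(rtt.group(2)) if rtt else None
    return int(rate.group(1)), avg


def classify_ping(success_pct: int, avg_rtt_ms: Optional[float]) -> str:
    """Apply the Step 8 thresholds to a success rate and average RTT."""
    if success_pct < 80:
        return "CRITICAL"  # covers both significant loss and 0% reachability
    if success_pct < 100:
        return "WARNING"
    if avg_rtt_ms is not None and avg_rtt_ms > 100:
        return "WARNING"
    return "HEALTHY"
```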
Health Report Format
Always produce a summary table:
Device: R1 (devnetsandboxiosxec8k.cisco.com)
Model: C8000V | IOS-XE: 17.x.x | Uptime: XXd XXh
┌──────────────────┬──────────┬─────────────────────────┐
│ Check            │ Status   │ Details                 │
├──────────────────┼──────────┼─────────────────────────┤
│ CPU (5min avg)   │ HEALTHY  │ 12%                     │
│ Memory           │ HEALTHY  │ 45% used (1.2G/2.6G)    │
│ Interfaces       │ WARNING  │ Gi2 down/down           │
│ Hardware         │ HEALTHY  │ All modules OK          │
│ NTP              │ HEALTHY  │ Synced, offset 2ms      │
│ Logs             │ WARNING  │ 3 OSPF adjacency flaps  │
│ Connectivity     │ HEALTHY  │ 100% to 8.8.8.8, 23ms   │
└──────────────────┴──────────┴─────────────────────────┘
Overall: WARNING — 2 items need attention
Severity order: CRITICAL > HIGH > WARNING > HEALTHY. Overall status = worst individual status.
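The "worst individual status" rule amounts to taking a maximum over the severity order. A minimal sketch, with hypothetical names:

```python
# Least to most severe, matching CRITICAL > HIGH > WARNING > HEALTHY.
SEVERITY_ORDER = ["HEALTHY", "WARNING", "HIGH", "CRITICAL"]


def overall_status(statuses):
    """Return (worst status, number of checks that are not HEALTHY)."""
    worst = max(statuses, key=SEVERITY_ORDER.index)
    needs_attention = sum(1 for s in statuses if s != "HEALTHY")
    return worst, needs_attention
```

Applied to the seven rows of the sample report (five HEALTHY, two WARNING), this yields WARNING with 2 items needing attention, which matches the summary line.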
NetBox Cross-Reference (MISSION02 Enhancement)
When NetBox is available ($NETBOX_MCP_SCRIPT is set), cross-reference device state against the source of truth after Steps 1 and 4:
Interface State Validation
Query NetBox for expected interface states:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'
Compare NetBox intent vs device reality:
- NetBox shows interface enabled but device shows down → CRITICAL: Unexpected outage
- NetBox shows interface disabled but device shows up → WARNING: Undocumented activation
- Interface exists on device but not in NetBox → WARNING: Undocumented interface
- Interface in NetBox but not on device → WARNING: NetBox stale data
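The four comparison rules can be sketched as a single pass over the union of interface names. This assumes both sides have been reduced to simple mappings, interface name to enabled (NetBox) and interface name to up (device); those shapes and the function name are hypothetical, and interface names are assumed to be normalized to the same form on both sides.

```python
def compare_interfaces(netbox_enabled: dict, device_up: dict) -> list:
    """Return (interface, status, finding) tuples for NetBox intent vs device reality."""
    findings = []
    for name in sorted(netbox_enabled.keys() | device_up.keys()):
        if name not in netbox_enabled:
            findings.append((name, "WARNING", "Undocumented interface"))
        elif name not in device_up:
            findings.append((name, "WARNING", "NetBox stale data"))
        elif netbox_enabled[name] and not device_up[name]:
            findings.append((name, "CRITICAL", "Unexpected outage"))
        elif not netbox_enabled[name] and device_up[name]:
            findings.append((name, "WARNING", "Undocumented activation"))
    return findings
```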
IP Address Validation
Query NetBox for expected IP assignments:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'
Compare: Flag any IP_DRIFT where the device IP differs from NetBox.
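A drift check is easier to get right if both addresses are normalized before comparison, since the device may report a dotted netmask while NetBox stores prefix length. The sketch below uses the standard ipaddress module; the input shapes (interface name to address string) and the function name are assumptions for illustration.

```python
import ipaddress


def find_ip_drift(netbox_ips: dict, device_ips: dict) -> list:
    """Return (interface, expected, actual) for every IP_DRIFT against NetBox."""
    drift = []
    for name, expected in sorted(netbox_ips.items()):
        actual = device_ips.get(name)
        # Missing address on the device, or a different address/prefix, both count as drift.
        if actual is None or ipaddress.ip_interface(actual) != ipaddress.ip_interface(expected):
            drift.append((name, expected, actual))
    return drift
```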
Fleet-Wide Health (pCall)
To run health checks across ALL devices simultaneously, first list all devices:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices
Then run Steps 1-8 on each device concurrently using multiple exec commands. Collect all results and produce a fleet summary:
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ Device   │ CPU      │ Mem      │ Intf     │ NTP      │ Overall  │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ R1       │ HEALTHY  │ WARNING  │ HEALTHY  │ HEALTHY  │ WARNING  │
│ R2       │ HEALTHY  │ HEALTHY  │ CRITICAL │ HEALTHY  │ CRITICAL │
│ SW1      │ HIGH     │ HEALTHY  │ HEALTHY  │ CRITICAL │ CRITICAL │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Sort devices by severity (CRITICAL first) for triage prioritization.
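The fan-out and triage sort could look roughly like the sketch below. It assumes a per-device callable (here the hypothetical check_device) that runs Steps 1-8 and returns the overall status string; a thread pool is used as one plausible way to run the checks concurrently, not necessarily how the exec commands are dispatched in practice.

```python
from concurrent.futures import ThreadPoolExecutor

# Lower rank sorts first, so CRITICAL devices lead the triage list.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "WARNING": 2, "HEALTHY": 3}


def run_fleet(devices, check_device, max_workers=8):
    """Run check_device on every device concurrently; return (device, status) sorted by severity."""
    devices = list(devices)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(check_device, devices))  # map preserves input order
    results = zip(devices, statuses)
    return sorted(results, key=lambda item: (SEVERITY_RANK[item[1]], item[0]))
```

Ties within a severity level are broken by device name so the output order is stable between runs.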
GAIT Audit Trail
After completing a health check, record the session in GAIT:
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Health check completed on R1: CPU HEALTHY (12%), Memory WARNING (78%), Interfaces HEALTHY, NTP HEALTHY. Overall: WARNING.","artifacts":[]}}'
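When the summary text is generated rather than typed by hand, building the JSON argument programmatically avoids quoting mistakes (for example, an apostrophe in the summary breaking the single-quoted shell argument). A minimal sketch using only the standard library; the payload shape is copied from the command above, and the function name is hypothetical.

```python
import json
import shlex


def gait_turn_arg(summary: str) -> str:
    """Build the shell-safe JSON argument for gait_record_turn from a summary string."""
    payload = {"input": {"role": "assistant", "content": summary, "artifacts": []}}
    # json.dumps handles JSON escaping; shlex.quote handles shell quoting.
    return shlex.quote(json.dumps(payload))
```

The returned string can be placed directly as the final argument of the command line shown above.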