/debug-pod Skill

Diagnose pod failures on OpenShift by automatically gathering status, events, logs, and resource information.

Prerequisites

Before running this skill:

- User is logged into OpenShift cluster
- User has access to the target namespace
- Pod or deployment name is known (or can be identified from recent deployments)

When to Use This Skill

Use this skill when pods are not running, restarting frequently, or stuck in non-ready states such as CrashLoopBackOff, ImagePullBackOff, OOMKilled, or Pending. It automates gathering pod status, events, logs, and resource constraints to identify the root cause.

Critical: Human-in-the-Loop Requirements

See Human-in-the-Loop Requirements for mandatory checkpoint behavior.

Workflow

Step 1: Identify Target Pod

## Pod Debugging

**Current OpenShift Context:**
- Cluster: [cluster]
- Namespace: [namespace]

Which pod would you like me to debug?

1. **Specify pod name** - Enter the pod name directly
2. **List failing pods** - Show pods with issues in current namespace
3. **From deployment** - Debug pods from a specific deployment

Select an option or enter a pod name:

WAIT for user confirmation before proceeding.

If the user selects "List failing pods", use kubernetes MCP pod_list for the namespace, then filter the result to show only pods that are NOT in the Running or Succeeded state:

## Pods with Issues in [namespace]

| Pod | Status | Restarts | Age | Reason |
|-----|--------|----------|-----|--------|
| [pod-name] | CrashLoopBackOff | 5 | 10m | [waiting reason] |
| [pod-name-2] | ImagePullBackOff | 0 | 3m | [waiting reason] |
| [pod-name-3] | Pending | 0 | 15m | [conditions] |

Which pod would you like me to debug?

WAIT for user confirmation before proceeding.
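
The "List failing pods" filter can be sketched as below. This is a minimal illustration, not the skill's implementation: it assumes pod_list returns objects shaped like Kubernetes Pod JSON (`status.phase`, `status.containerStatuses`), and `display_status`, `failing_pods`, and `sample_pods` are hypothetical names. A container's waiting reason is preferred over the phase because a pod in CrashLoopBackOff usually still reports phase Running.

```python
# Assumption: each pod is a dict shaped like Kubernetes Pod JSON.
HEALTHY_STATUSES = {"Running", "Succeeded"}


def display_status(pod: dict) -> str:
    """Return a container's waiting reason if one exists, else the pod phase."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason")
        if reason:
            return reason
    return status.get("phase", "Unknown")


def failing_pods(pods: list) -> list:
    """Keep only pods whose displayed status is not Running or Succeeded."""
    rows = []
    for pod in pods:
        shown = display_status(pod)
        if shown in HEALTHY_STATUSES:
            continue
        statuses = pod.get("status", {}).get("containerStatuses", [])
        rows.append({
            "pod": pod["metadata"]["name"],
            "status": shown,
            "restarts": sum(cs.get("restartCount", 0) for cs in statuses),
        })
    return rows


# Hypothetical sample data for illustration only.
sample_pods = [
    {"metadata": {"name": "web-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"restartCount": 0, "state": {"running": {}}}]}},
    {"metadata": {"name": "api-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"restartCount": 5,
                                       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "job-1"}, "status": {"phase": "Pending"}},
]

for row in failing_pods(sample_pods):
    print(row["pod"], row["status"], row["restarts"])
```

The sketch does not flag pods that are Running but not Ready; a readiness check would need the `ready` field of each container status.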

Step 2: Get Pod Status Overview

Use kubernetes MCP resources_get to get pod details:

## Pod Status: [pod-name]

**Basic Info:**
| Field | Value |
|-------|-------|
| Namespace | [namespace] |
| Node | [node-name or "Not scheduled"] |
| Status | [phase: Pending/Running/Failed/Succeeded] |
| IP | [pod-ip or "Not assigned"] |
| Created | [timestamp] |

**Container Status:**
| Container | State | Ready | Restarts | Exit Code | Reason |
|-----------|-------|-------|----------|-----------|--------|
| [container-name] | [Waiting/Running/Terminated] | [true/false] | [count] | [code or N/A] | [reason] |

**Quick Assessment:**
[Based on status, provide initial assessment - e.g., "Pod is in CrashLoopBackOff - container keeps crashing after startup"]

Continue with detailed analysis? (yes/no)

WAIT for user confirmation before proceeding.
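
The Container Status table can be derived from the pod object roughly as follows. This is a sketch under the assumption that resources_get returns standard Kubernetes Pod JSON; `container_rows` and `sample_pod` are hypothetical names. When a container is currently waiting (for example in CrashLoopBackOff), the exit code is taken from `lastState.terminated`, which records the previous crash.

```python
def container_rows(pod: dict) -> list:
    """Build one row per container for the Container Status table."""
    rows = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        # The state object holds exactly one of: waiting, running, terminated.
        state_name = next(iter(state), "unknown")
        detail = state.get(state_name) or {}
        last_terminated = cs.get("lastState", {}).get("terminated") or {}
        # Prefer the current state's exit code, fall back to the previous run.
        exit_code = detail.get("exitCode", last_terminated.get("exitCode"))
        rows.append({
            "container": cs.get("name"),
            "state": state_name.capitalize(),
            "ready": cs.get("ready", False),
            "restarts": cs.get("restartCount", 0),
            "exit_code": exit_code if exit_code is not None else "N/A",
            "reason": detail.get("reason") or last_terminated.get("reason") or "N/A",
        })
    return rows


# Hypothetical sample: a container that crashed and is waiting to restart.
sample_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [{
            "name": "app",
            "ready": False,
            "restartCount": 5,
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}},
        }],
    }
}

for row in container_rows(sample_pod):
    print(row)
```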

Step 3: Analyze Events

Use kubernetes MCP events_list filtered by pod:

## Recent Events for [pod-name]

| Time | Type | Reason | Message |
|------|------|--------|---------|
| [timestamp] | [Normal/Warning] | [reason] | [message] |
| [timestamp] | [Normal/Warning] | [reason] | [message] |
| ... |

**Event Analysis:**

[Analyze events and identify key issues:]

**Issues Found:**
- [Issue 1 - e.g., "FailedScheduling: 0/3 nodes available - insufficient memory"]
- [Issue 2 - e.g., "ImagePullBackOff: unauthorized - check image pull secrets"]

Continue to view container logs? (yes/no)

WAIT for user confirmation before proceeding.
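
One way to pick out the events worth reporting under "Issues Found" is to keep only Warning events for the target pod and order them by time. The sketch below assumes events_list returns core/v1 Event objects (`type`, `reason`, `message`, `lastTimestamp`, `involvedObject.name`); `warning_events` and `sample_events` are hypothetical names. Timestamps are compared as strings, which works for RFC 3339 UTC values but would need real parsing if time zones are mixed.

```python
def warning_events(events: list, pod_name: str) -> list:
    """Return Warning events for one pod, oldest first."""
    matched = [
        e for e in events
        if e.get("involvedObject", {}).get("name") == pod_name
        and e.get("type") == "Warning"
    ]
    # Events without a lastTimestamp sort first (empty string).
    return sorted(matched, key=lambda e: e.get("lastTimestamp") or "")


# Hypothetical sample data for illustration only.
sample_events = [
    {"type": "Warning", "reason": "BackOff", "message": "Back-off restarting failed container",
     "lastTimestamp": "2024-01-01T10:05:00Z", "involvedObject": {"name": "api-1"}},
    {"type": "Normal", "reason": "Pulled", "message": "Container image already present",
     "lastTimestamp": "2024-01-01T10:01:00Z", "involvedObject": {"name": "api-1"}},
    {"type": "Warning", "reason": "Unhealthy", "message": "Readiness probe failed",
     "lastTimestamp": "2024-01-01T10:02:00Z", "involvedObject": {"name": "api-1"}},
    {"type": "Warning", "reason": "FailedScheduling", "message": "0/3 nodes available",
     "lastTimestamp": "2024-01-01T10:03:00Z", "involvedObject": {"name": "other-pod"}},
]

for event in warning_events(sample_events, "api-1"):
    print(event["lastTimestamp"], event["reason"], event["message"])
```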

Step 4: Get Container Logs

Use kubernetes MCP pod_logs to fetch logs from the current container and, if it has restarted, from the previous container instance:

## Container Logs: [container-name]

**Current Container Logs** (last 50 lines):

[log output]


[If container has restarted, also show previous logs:]

**Previous Container Logs** (before last restart):

[log output from --previous]


**Log Analysis:**

[Analyze logs and identify errors:]

**Errors Found:**
- Line [X]: [error description - e.g., "Connection refused to database on port 5432"]
- Line [Y]: [error description - e.g., "Out of memory - heap allocation failed"]

Continue to analyze resource constraints? (yes/no)

WAIT for user confirmation before proceeding.
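
The "Errors Found" list pairs each finding with a line number. A rough sketch of that scan is shown below; it assumes the log arrives as a single string, and the keyword pattern and the names `ERROR_PATTERN` and `find_error_lines` are illustrative choices rather than part of the skill. Line numbers refer to positions in the full log, even when only the last 50 lines are scanned.

```python
import re

# Illustrative keyword list; real applications may need different patterns.
ERROR_PATTERN = re.compile(
    r"error|exception|fatal|panic|refused|out of memory", re.IGNORECASE
)


def find_error_lines(log_text: str, tail: int = 50) -> list:
    """Return (line_number, text) for suspicious lines in the last `tail` lines."""
    lines = log_text.splitlines()
    start = max(len(lines) - tail, 0)
    return [
        (start + offset + 1, line)
        for offset, line in enumerate(lines[start:])
        if ERROR_PATTERN.search(line)
    ]


# Hypothetical log text for illustration only.
sample_log = "\n".join([
    "starting application",
    "Connection refused to database on port 5432",
    "retrying in 5s",
    "FATAL: Out of memory - heap allocation failed",
])

for number, text in find_error_lines(sample_log):
    print(f"Line {number}: {text}")
```

Keyword matching produces false positives (for example, a line that logs "0 errors"), so the matches are a starting point for the analysis, not a verdict.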

Step 5: Analyze Resource Constraints

Check resource requests, limits, and actual usage:

## Resource Analysis: [pod-name]

**Container: [container-name]**

| Resource | Request | Limit | Status |
|----------|---------|-------|--------|
| Memory | [128Mi] | [512Mi] | [OK / WARNING: OOMKilled] |
| CPU | [100m] | [500m] | [OK / WARNING: throttled] |

**Node Resources (if scheduled):**
| Resource | Allocatable | Allocated | Available |
|----------|-------------|-----------|-----------|
| Memory | [8Gi] | [7.5Gi] | [512Mi] |
| CPU | [4000m] | [3800m] | [200m] |

**Resource Issues:**
- [Issue 1 - e.g., "Container was OOMKilled - memory limit too low for application"]
- [Issue 2 - e.g., "Pod cannot be scheduled - no nodes have 2Gi available memory"]

Continue to full diagnosis summary? (yes/no)

WAIT for user confirmation before proceeding.
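
The "Available" column in the node table is allocatable minus allocated, which requires converting Kubernetes quantity strings to numbers first. The sketch below is a simplified converter, not a full quantity parser: it handles only binary memory suffixes (Ki, Mi, Gi, Ti), plain byte counts, and CPU values in cores or millicores. The function names are hypothetical, and decimal suffixes such as `M` or `G` are deliberately unsupported.

```python
from decimal import Decimal

# Binary suffixes only; decimal suffixes (k, M, G) are not handled here.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '7.5Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # plain byte count


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def fits_on_node(request: str, allocatable: str, allocated: str) -> bool:
    """Check whether a memory request fits in the node's remaining memory."""
    available = memory_bytes(allocatable) - memory_bytes(allocated)
    return memory_bytes(request) <= available


# Values taken from the example table above: 8Gi allocatable, 7.5Gi allocated.
available = memory_bytes("8Gi") - memory_bytes("7.5Gi")
print(available == memory_bytes("512Mi"))
print(fits_on_node("2Gi", "8Gi", "7.5Gi"))
print(cpu_millicores("4000m") - cpu_millicores("3800m"))
```

With the table's example values, the remaining memory equals 512Mi, so a pod requesting 2Gi cannot be scheduled on that node.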

Step 6: Present Diagnosis Summary

## Diagnosis Summary: [pod-name]

### Root Cause

**Primary Issue:** [Categorized root cause]

| Category | Status | Details |
|----------|--------|---------|
| Container Start | [OK/FAIL] | [details] |
| Image Pull | [OK/FAIL] | [details] |
| Resource Scheduling | [OK/FAIL] | [details] |
| Application Health | [OK/FAIL] | [details] |
| Volume Mounts | [OK/FAIL] | [details] |

### Detailed Findings

**[Category 1: e.g., Image Pull Issues]**
- Problem: [specific problem]
- Evidence: [from events/logs]
- Impact: [how this affects the pod]

**[Category 2: e.g., Application Crash]**
- Problem: [specific problem]
- Evidence: [from logs]
- Impact: [how this affects the pod]

### Recommended Actions

1. **[Action 1]** - [description]
   ```bash
   [command to fix - e.g., oc create secret docker-registry...]
   ```

2. **[Action 2]** - [description]
   ```bash
   [command to fix - e.g., oc set resources deployment/app --limits=memory=1Gi]
   ```

3. **[Action 3]** - [description]
