Kubernetes Management (Kubernetes 管理)

Provides Kubernetes cluster management and monitoring for operations engineers: viewing pod status, reading logs, checking resource usage, and inspecting services and deployments.

Quick Start

View Pods (查看 Pod)

显示 production namespace 中的所有 pods
List all pods in the production namespace

Check Pod Logs (查看 Pod 日志)

查看 pod api-server-abc123 的日志
Get logs from pod api-server-abc123

Check Resource Usage (查看资源使用)

显示集群资源使用情况
Show cluster resource usage

Key Features

1. Pod Management (Pod 管理)

View and monitor pods across namespaces:

Common Operations:

# List pods
kubectl get pods -n <namespace>

# Get pod details
kubectl describe pod <pod-name> -n <namespace>

# View pod logs
kubectl logs <pod-name> -n <namespace>

# Follow logs
kubectl logs -f <pod-name> -n <namespace>

2. Service and Deployment (服务与部署)

Monitor services and deployments:

Common Operations:

# List services
kubectl get services -n <namespace>

# List deployments
kubectl get deployments -n <namespace>

# Check deployment status
kubectl rollout status deployment/<name> -n <namespace>

3. Resource Monitoring (资源监控)

Check cluster and pod resource usage:

Common Operations:

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -n <namespace>

# Check cluster info
kubectl cluster-info

Common Workflows

Workflow 1: Troubleshoot Pod Issues

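1. User: "production namespace 中有哪些 pod 出问题了？" (Which pods in the production namespace have problems?)
2. List pods: kubectl get pods -n production
3. Identify pods whose status is not Running or Completed (for example Pending, Error, or CrashLoopBackOff)
4. Get pod details: kubectl describe pod <problem-pod> -n production
5. Check logs: kubectl logs <problem-pod> -n production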
6. Analyze error messages
7. Suggest solutions
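
The triage in steps 2 and 3 can be sketched as a small filter. This is a minimal sketch, not part of the skill itself: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints "<name> <status> <ready>" for pods that look unhealthy.
unhealthy_pods() {
  awk '
    {
      split($2, ready, "/")          # READY column looks like "1/2"
      if ($3 == "Completed") next    # finished Job pods are not a problem
      if ($3 != "Running" || ready[1] != ready[2])
        print $1, $3, $2
    }'
}
```

A possible invocation is `kubectl get pods -n production --no-headers | unhealthy_pods`. For a pod in CrashLoopBackOff, `kubectl logs <problem-pod> -n production --previous` shows the output of the previous container instance, which is often more useful than the current one.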

Workflow 2: Check Deployment Health

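1. User: "检查 api-server deployment 的健康状态" (Check the health of the api-server deployment)
2. Get deployment: kubectl get deployment api-server -n production
3. Check replicas: compare desired vs. available
4. Check rollout status: kubectl rollout status deployment/api-server -n production
5. View recent events: kubectl describe deployment api-server -n production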
6. Report status and any issues
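
The replica comparison in step 3 could look roughly like the following. This is a sketch under stated assumptions: `replica_health` is a hypothetical helper, and an empty "available" value is treated as 0 because the `availableReplicas` field may be absent from the Deployment status when no replicas are available.

```shell
# Hypothetical helper: compares desired and available replica counts
# and prints a one-line verdict.
replica_health() {
  desired=$1
  available=${2:-0}   # empty or missing means no replicas are available
  if [ "$available" -ge "$desired" ]; then
    echo "healthy: $available/$desired replicas available"
  else
    echo "degraded: $available/$desired replicas available"
  fi
}

# Usage against a live cluster (read-only):
#   desired=$(kubectl get deployment api-server -n production -o jsonpath='{.spec.replicas}')
#   available=$(kubectl get deployment api-server -n production -o jsonpath='{.status.availableReplicas}')
#   replica_health "$desired" "$available"
```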

Workflow 3: Monitor Resource Usage

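1. User: "集群资源使用情况如何？" (How is the cluster's resource usage?)
2. Get node metrics: kubectl top nodes
3. Get pod metrics: kubectl top pods --all-namespaces
4. Calculate utilization percentages (kubectl top nodes already reports CPU% and MEMORY% per node)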
5. Identify resource-intensive pods
6. Provide recommendations
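
As an illustration of steps 2 to 5, the filter below flags nodes at or above a utilization threshold. It is a sketch only: `busy_nodes` is a hypothetical name, the default threshold of 80 is an arbitrary assumption, and the parsing assumes the usual column order of `kubectl top nodes --no-headers` (NAME, CPU cores, CPU%, memory, MEMORY%).

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` output on stdin
# and prints nodes whose CPU% or MEMORY% is at or above the threshold.
busy_nodes() {
  threshold=${1:-80}   # percent; 80 is an assumed default, not a standard
  awk -v t="$threshold" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # "85%" -> "85"
      if (cpu + 0 >= t + 0 || mem + 0 >= t + 0)
        print $1, "cpu=" cpu "%", "mem=" mem "%"
    }'
}
```

A possible invocation is `kubectl top nodes --no-headers | busy_nodes 80`. Both `kubectl top` commands depend on the cluster's metrics API being available.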

Examples (示例)

Example 1: List Pods (列出 Pods)

User (Chinese):

显示 production namespace 中的所有 pods (List all pods in the production namespace)

Assistant Response:

I'll list all pods in the production namespace.

<executes: kubectl get pods -n production>

**Production Namespace - Pods:**

| Name | Status | Restarts | Age |
|------|--------|----------|-----|
| api-server-7d4f8b9c-abc123 | Running | 0 | 5d |
| api-server-7d4f8b9c-def456 | Running | 1 | 5d |
| web-app-5c8a7b-ghi789 | Running | 0 | 3d |
| worker-queue-9f3e2d-jkl012 | CrashLoopBackOff | 10 | 2h |
| database-6a4b5c-mno345 | Running | 0 | 15d |

**Summary:**
- Total: 5 pods
- Running: 4 pods
- Issues: 1 pod (worker-queue)

⚠️ **Alert**: worker-queue-9f3e2d-jkl012 is in CrashLoopBackOff. Should I check its logs?

Example 2: View Pod Logs (查看 Pod 日志)

User (English):

Get logs from pod worker-queue-9f3e2d-jkl012 in production namespace

Assistant Response:

I'll retrieve the logs from the pod.

<executes: kubectl logs worker-queue-9f3e2d-jkl012 -n production --tail=50>

**Pod Logs: worker-queue-9f3e2d-jkl012**

[2026-01-16 10:15:23] INFO: Worker starting...
[2026-01-16 10:15:24] INFO: Connecting to Redis...
[2026-01-16 10:15:25] ERROR: Connection failed: redis://redis-service:6379
[2026-01-16 10:15:25] ERROR: Error: getaddrinfo ENOTFOUND redis-service
[2026-01-16 10:15:26] INFO: Retrying connection (1/10)...
[2026-01-16 10:15:27] ERROR: Connection failed: redis://redis-service:6379
[2026-01-16 10:15:28] FATAL: Max retries exceeded, exiting


**Analysis:**
- **Issue**: Cannot connect to Redis service
- **Root Cause**: DNS resolution failure for `redis-service`
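- **Possible Reasons**:
  1. No Service named `redis-service` exists in the production namespace
  2. Service name mismatch, or the Service lives in another namespace and needs its fully qualified name
  3. Network policy blocking DNS lookups from the pod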

**Recommendations:**
1. Check if Redis service exists: `kubectl get svc redis-service -n production`
2. Verify service endpoints: `kubectl get endpoints redis-service -n production`
3. Check network policies
4. Review deployment configuration
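
To get from raw logs to the analysis above, it can help to isolate the error lines first. The filter below is a sketch: `error_lines` is a hypothetical helper, and the pattern assumes the `[timestamp] LEVEL: message` layout seen in this example, which other applications may not follow.

```shell
# Hypothetical helper: keeps only ERROR and FATAL lines from log text on stdin.
# Assumes each line starts with "[timestamp] LEVEL: ".
error_lines() {
  grep -E '^\[[^]]*\] (ERROR|FATAL):'
}
```

A possible invocation is `kubectl logs worker-queue-9f3e2d-jkl012 -n production --tail=50 | error_lines`.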

Example 3: Resource Usage (资源使用)

User (Chinese):

显示集群各节点的资源使用情况 (Show resource usage for each node in the cluster)

Assistant Response:

I'll show the resource usage of the cluster nodes.

<executes: kubectl top nodes>

**Cluster Resource Usage:**

| Node | CPU (cores) | CPU % | Memory (GB) | Memory % |
|------|-------------|-------|-------------|----------|
| node-1 | 3.2 / 8 | 40% | 12.5 / 32 | 39% |
| node-2 | 5.6 / 8 | 70% | 18.2 / 32 | 57% |
| node-3 | 2.1 / 8 | 26% | 8.7 / 32 | 27% |
| node-4 | 6.8 / 8 | 85% | 24.3 / 32 | 76% |

**Analysis:**
- **node-4** is running at high utilization (85% CPU, 76% Memory)
- **node-3** has the most available capacity
- Average cluster utilization: 55% CPU, 50% Memory

**Recommendations:**
1. Consider redistributing pods from node-4
2. Monitor node-4 for potential issues
3. Check if any pods need resource limit adjustments
4. Review horizontal pod autoscaling settings
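
The cluster averages quoted in the analysis are the plain mean of the per-node percentages in the table. A minimal sketch of that arithmetic, using a hypothetical `average` helper:

```shell
# Hypothetical helper: prints the mean of its numeric arguments,
# rounded to the nearest integer.
average() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.0f\n", sum / NR }'
}

average 40 70 26 85   # CPU % per node    -> prints 55
average 39 57 27 76   # memory % per node -> prints 50
```

This weights every node equally, which matches the table because all four nodes have the same capacity. For nodes of different sizes, dividing total usage by total capacity would be the more accurate figure.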

Important Notes

Access Control (访问控制)

- Requires kubectl configured with cluster access
- User permissions are defined by RBAC roles
- Some operations may be restricted depending on the role
- Always verify namespace access before running operations

Safety Guidelines (安全指南)

- Read operations only through this skill
- No pod deletion or modification
- No deployment changes
- No configuration updates
- For changes, use standard deployment processes
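
One way to make the read-only rule concrete is an allowlist check on the kubectl subcommand before anything is executed. The sketch below is illustrative only: `is_read_only` is a hypothetical helper and the allowlist is an assumption based on the commands used in this document, not a complete classification of kubectl.

```shell
# Hypothetical helper: returns 0 if the kubectl subcommand is on the
# read-only allowlist, 1 otherwise. $1 is the subcommand, $2 its first argument.
is_read_only() {
  case "$1" in
    get|describe|logs|top|cluster-info|version) return 0 ;;
    rollout) [ "$2" = "status" ] ;;   # "rollout restart" and "rollout undo" change state
    *) return 1 ;;
  esac
}
```

For example, `is_read_only rollout status` succeeds while `is_read_only delete pod` fails. RBAC on the cluster side remains the real enforcement point; a client-side check like this only catches mistakes early.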

Best Practices (最佳实践)

- Always specify the namespace
- Use labels for filtering
- Check resource limits before scaling
- Monitor metrics before making decisions
- Document all investigations

Prerequisites

kubectl Configuration

Ensure kubectl is configured to access your clusters:

# Verify kubectl is installed
kubectl version --client

# List available contexts
kubectl config get-contexts

# Switch context if needed
kubectl config use-context <context-name>

# Verify access
kubectl get nodes

GKE Authentication (GCP)

For GKE clusters:

# Authenticate with GCP
gcloud auth login

# Get cluster credentials (use --zone=<zone> instead for a zonal cluster)
gcloud container clusters get-credentials <cluster-name> --region=<region>

# Verify access
kubectl cluster-info

Required Permissions

Minimum RBAC permissions needed:

- pods: get, list (list pods and read pod details)
- pods/log: get (read pod logs)
- services: get, list
- deployments (API group apps): get, list
- nodes: get, list
- pods and nodes in the metrics.k8s.io API group: get, list (needed for kubectl top)
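
Whether the current user actually holds these permissions can be checked with `kubectl auth can-i`. The loop below is a sketch: `check_access` is a hypothetical helper, and the list of checks is an assumption that covers only the namespaced permissions above.

```shell
# Hypothetical helper: runs `kubectl auth can-i` for each namespaced
# permission and returns 1 if any of them is missing.
check_access() {
  ns=$1
  missing=0
  for check in \
    "list pods" \
    "get pods" \
    "get pods --subresource=log" \
    "list services" \
    "list deployments.apps"
  do
    # $check is left unquoted on purpose so it splits into verb, resource, and flag.
    if [ "$(kubectl auth can-i $check -n "$ns")" = "yes" ]; then
      echo "ok      $check"
    else
      echo "missing $check"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation is `check_access production`. Nodes are cluster-scoped, so that permission is checked separately without a namespace: `kubectl auth can-i list nodes`.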

Limitations

Current Limitations

- Read-only operations via kubectl commands
- No direct MCP server integration (uses the kubectl CLI)
- No real-time monitoring dashboards
- No automated remediation
- Limited to configured clusters

Future Enhancements

- Direct Kubernetes API integration
- Real-time pod monitoring
- Automated issue detection
- Integration with monitoring systems (Prometheus, Grafana)
- Pod restart and scaling capabilities
- Log aggregation and search

Troubleshooting

Issue 1: "kubectl: command not found"

Solutions:

- Install kubectl: https://kubernetes.io/docs/tasks/tools/
- Add the directory containing kubectl to PATH
- Restart Claude Code so that it picks up the updated PATH
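
A quick way to confirm whether the shell can find kubectl is `command -v`. The wrapper below is a small sketch with a hypothetical name, `require_cmd`:

```shell
# Hypothetical helper: reports where a command lives, or fails if it is
# not on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}
```

Running `require_cmd kubectl` in the same shell that reports the error shows whether the problem is a missing installation or a PATH that has not been reloaded.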

Issue 2: "Unable to connect to cluster"

Solutions:

- Verify the kubectl context: kubectl config current-context
- Check cluster credentials
- Verify network access
- Re-authenticate if needed

Issue 3: "Forbidden: User cannot list pods"

Solutions:

- Check RBAC permissions
- Contact the cluster admin
- Verify the service account
- Switch to the correct namespace

Related Skills

- cloud-resources: Manage cloud infrastructure
- Future: monitoring-alerts, incident-response

Safety & Compliance

This skill provides read-only access to Kubernetes clusters:

✅ View pod status and logs
✅ Monitor resource usage
✅ Check service health
❌ Cannot delete pods
❌ Cannot modify deployments
❌ Cannot change configurations

For any modifications, use established CI/CD pipelines and change management processes.
