operating-kubernetes
Kubernetes Operations
Purpose
Operating Kubernetes clusters in production requires mastery of resource management, scheduling patterns, networking architecture, storage strategies, security hardening, and autoscaling. This skill provides operations-first frameworks for right-sizing workloads, implementing high-availability patterns, securing clusters with RBAC and Pod Security Standards, and systematically troubleshooting common failures.
Use this skill when deploying applications to Kubernetes, configuring cluster resources, implementing NetworkPolicies for zero-trust security, setting up autoscaling (HPA, VPA, KEDA), managing persistent storage, or diagnosing operational issues like CrashLoopBackOff or resource exhaustion.
When to Use This Skill
Common Triggers:
- "Deploy my application to Kubernetes"
- "Configure resource requests and limits"
- "Set up autoscaling for my pods"
- "Implement NetworkPolicies for security"
- "My pod is stuck in Pending/CrashLoopBackOff"
- "Configure RBAC with least privilege"
- "Set up persistent storage for my database"
- "Spread pods across availability zones"
Operations Covered:
- Resource management (CPU/memory, QoS classes, quotas)
- Advanced scheduling (affinity, taints, topology spread)
- Networking (NetworkPolicies, Ingress, Gateway API)
- Storage operations (StorageClasses, PVCs, CSI)
- Security hardening (RBAC, Pod Security Standards, policies)
- Autoscaling (HPA, VPA, KEDA, cluster autoscaler)
- Troubleshooting (systematic debugging playbooks)
Resource Management
Quality of Service (QoS) Classes
Kubernetes assigns QoS classes based on resource requests and limits:
Guaranteed (Highest Priority):
- Requests equal limits for CPU and memory
- Never evicted unless exceeding limits
- Use for critical production services
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"  # Same as request
    cpu: "500m"
Burstable (Medium Priority):
- Requests less than limits (or only requests set)
- Can burst above requests
- Evicted under node pressure
- Use for web servers, most applications
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # 2x request
    cpu: "500m"
BestEffort (Lowest Priority):
- No requests or limits set
- First to be evicted under pressure
- Use only for development/testing
Decision Framework: Which QoS Class?
| Workload Type | QoS Class | Configuration |
|---|---|---|
| Critical API/Database | Guaranteed | requests == limits |
| Web servers, services | Burstable | limits 1.5-2x requests |
| Batch jobs | Burstable | Low requests, high limits |
| Dev/test environments | BestEffort | No requests or limits |
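To confirm which class Kubernetes actually assigned, read it from pod status. These are standard kubectl invocations; the pod name and namespace are illustrative:

```bash
# Print the QoS class assigned to one pod (Guaranteed, Burstable, or BestEffort)
kubectl get pod web-app-7d4b9 -o jsonpath='{.status.qosClass}'

# List the QoS class for every pod in a namespace
kubectl get pods -n production -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
```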
Resource Quotas and LimitRanges
Enforce multi-tenancy with ResourceQuotas (namespace limits) and LimitRanges (per-container defaults):
# ResourceQuota: Namespace-level limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
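The LimitRange mentioned above supplies per-container defaults, so containers that omit requests/limits still count against the quota and land in a sane QoS class. A minimal sketch with illustrative values:

```yaml
# LimitRange: per-container defaults and bounds within the namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-alpha
spec:
  limits:
  - type: Container
    default:            # Applied as limits when a container sets none
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:     # Applied as requests when a container sets none
      cpu: "250m"
      memory: "256Mi"
    max:                # Hard ceiling per container
      cpu: "2"
      memory: "2Gi"
```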
For detailed resource management patterns including Vertical Pod Autoscaler (VPA), see references/resource-management.md.
Advanced Scheduling
Node Affinity
Control which nodes pods schedule on with required (hard) or preferred (soft) constraints:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - g4dn.xlarge  # GPU instance
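A preferred (soft) constraint expresses the same intent as a weighted hint, so the pod still schedules when no matching node is available. A sketch; the zone value is illustrative:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80                      # Higher weight = stronger preference
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
```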
Taints and Tolerations
Reserve nodes for specific workloads (inverse of affinity):
# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule
# Pod tolerates GPU taint
tolerations:
- key: "workload"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
Topology Spread Constraints
Distribute pods evenly across failure domains (zones, nodes):
topologySpreadConstraints:
- maxSkew: 1  # Max difference in pod count across domains
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-app
For advanced scheduling patterns including pod priority and preemption, see references/scheduling-patterns.md.
Networking
NetworkPolicies (Zero-Trust Security)
Implement default-deny security with NetworkPolicies:
# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
# Allow specific ingress (frontend → backend)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
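Note that a default-deny Egress policy also blocks DNS lookups. A common companion policy is a cluster-wide DNS allowance; this sketch assumes CoreDNS/kube-dns runs in kube-system with the conventional `k8s-app: kube-dns` label:

```yaml
# Allow DNS egress once default-deny-all (Egress) is in place
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```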
Ingress vs. Gateway API
Ingress (Legacy):
- Widely supported, mature ecosystem
- Limited expressiveness
- Use for existing applications
Gateway API (Modern):
- Role-oriented design (cluster ops vs. app devs)
- More expressive (HTTPRoute, TCPRoute, TLSRoute)
- Recommended for new applications (the core Gateway API is GA as of v1.0, installed via CRDs)
# Gateway API example
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-routes
spec:
  parentRefs:
  - name: production-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: backend
      port: 8080
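The HTTPRoute above attaches to a Gateway named `production-gateway`, which `parentRefs` assumes already exists. A minimal sketch of that Gateway; the GatewayClass name is a placeholder for whatever controller you run (NGINX Gateway Fabric, Istio, a cloud load balancer, etc.):

```yaml
# Gateway owned by cluster operators; app teams attach HTTPRoutes to it
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
  namespace: production
spec:
  gatewayClassName: example-gateway-class   # Provided by your Gateway controller
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same        # Only routes in this namespace may attach
```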
For detailed networking patterns including service mesh integration, see references/networking.md.
Storage
StorageClasses (Define Performance Tiers)
StorageClasses define storage tiers for different workload needs:
# AWS EBS SSD (high performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"       # gp3 takes a fixed iops value (iopsPerGB applies to io1/io2)
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
Storage Decision Matrix
| Workload | Performance | Access Mode | Storage Class |
|---|---|---|---|
| Database | High | ReadWriteOnce | SSD (gp3/io2) |
| Shared files | Medium | ReadWriteMany | NFS/EFS |
| Logs (temp) | Low | ReadWriteOnce | Standard HDD |
| ML models | High | ReadOnlyMany | Object storage (S3) |
Access Modes:
- ReadWriteOnce (RWO): Single node read-write (most common)
- ReadOnlyMany (ROX): Multiple nodes read-only
- ReadWriteMany (RWX): Multiple nodes read-write (requires network storage)
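A PersistentVolumeClaim ties a workload to one of these tiers; because `fast-ssd` uses WaitForFirstConsumer, the volume is provisioned in the right zone only once a pod is scheduled. A minimal sketch (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce            # Single-node read-write, typical for databases
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```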
For detailed storage operations including volume snapshots and CSI drivers, see references/storage.md.
Security
RBAC (Role-Based Access Control)
Implement least-privilege access with RBAC:
# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# RoleBinding (assign role to user)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: User
  name: jane@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
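After applying the Role and RoleBinding, verify the grant with impersonation; these checks assume the manifests above are applied to the production namespace:

```bash
# Should return "yes"
kubectl auth can-i list pods --namespace production --as jane@example.com

# Should return "no" (the role grants read-only access to pods)
kubectl auth can-i delete pods --namespace production --as jane@example.com
```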
Pod Security Standards
Enforce secure pod configurations at the namespace level:
# Namespace with Restricted PSS (most secure)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Pod Security Levels:
- Restricted: Most secure; enforces current pod-hardening best practices and disallows privilege escalation (use for applications)
- Baseline: Minimally restrictive, prevents known escalations
- Privileged: Unrestricted (only for system workloads)
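For a pod to be admitted under the restricted level, its spec must drop privileges explicitly. A minimal container-level sketch of the settings the Restricted profile checks:

```yaml
# Container securityContext that satisfies the Restricted profile
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```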
For detailed security patterns including policy enforcement (Kyverno/OPA) and secrets management, see references/security.md.
Autoscaling
Horizontal Pod Autoscaler (HPA)
Scale pod replicas based on CPU, memory, or custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
KEDA (Event-Driven Autoscaling)
Scale based on events beyond CPU/memory (queues, cron schedules, Prometheus metrics):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaler
spec:
  scaleTargetRef:
    name: message-processor
  minReplicaCount: 0   # Scale to zero when queue empty
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks
      queueLength: "10"  # Scale up when >10 messages
Autoscaling Decision Matrix
| Scenario | Use HPA | Use VPA | Use KEDA | Use Cluster Autoscaler |
|---|---|---|---|---|
| Stateless web app with traffic spikes | ✅ | ❌ | ❌ | Maybe |
| Single-instance database | ❌ | ✅ | ❌ | Maybe |
| Queue processor (event-driven) | ❌ | ❌ | ✅ | Maybe |
| Pods pending (insufficient nodes) | ❌ | ❌ | ❌ | ✅ |
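Whichever autoscaler applies, pair it with a PodDisruptionBudget so voluntary disruptions (node drains, cluster-autoscaler scale-down) never remove too many replicas at once. A minimal sketch, assuming the web-app Deployment's pods carry an `app: web-app` label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2          # Or set maxUnavailable instead; not both
  selector:
    matchLabels:
      app: web-app
```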
For detailed autoscaling patterns including VPA and cluster autoscaler configuration, see references/autoscaling.md.
Troubleshooting
Common Pod Issues
Pod Stuck in Pending:
kubectl describe pod <pod-name>
# Common causes:
# - Insufficient CPU/memory: Reduce requests or add nodes
# - Node selector mismatch: Fix nodeSelector or add labels
# - PVC not bound: Create PVC or fix name
# - Taint intolerance: Add toleration or remove taint
CrashLoopBackOff:
kubectl logs <pod-name>
kubectl logs <pod-name> --previous # Check previous crash
# Common causes:
# - Application crash: Fix code or configuration
# - Missing environment variables: Add to deployment
# - Liveness probe failing: Increase initialDelaySeconds
# - OOMKilled: Increase memory limit or fix leak
ImagePullBackOff:
kubectl describe pod <pod-name>
# Common causes:
# - Image doesn't exist: Fix image name/tag
# - Authentication required: Create imagePullSecrets
# - Network issues: Check NetworkPolicies, firewall rules
Service Not Accessible:
kubectl get endpoints <service-name> # Should list pod IPs
# If endpoints empty:
# - Service selector doesn't match pod labels
# - Pods aren't ready (readiness probe failing)
# - Check NetworkPolicies blocking traffic
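When the failure mode isn't obvious from the pod alone, a broad sweep of events and node state usually narrows it down; these are standard kubectl invocations (the namespace is illustrative):

```bash
# Recent events, newest last (scheduling failures, OOM kills, probe failures, image errors)
kubectl get events -n production --sort-by=.lastTimestamp

# Pod status, restart counts, and node placement at a glance
kubectl get pods -n production -o wide

# Node conditions and current usage (kubectl top requires metrics-server)
kubectl describe node <node-name> | grep -A8 Conditions
kubectl top nodes
```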
For systematic troubleshooting playbooks including networking and storage issues, see references/troubleshooting.md.
Reference Documentation
Deep Dives
- references/resource-management.md - Resource requests/limits, QoS classes, ResourceQuotas, VPA
- references/scheduling-patterns.md - Node affinity, taints/tolerations, topology spread, priority
- references/networking.md - NetworkPolicies, Ingress, Gateway API, service mesh integration
- references/storage.md - StorageClasses, PVCs, CSI drivers, volume snapshots
- references/security.md - RBAC, Pod Security Standards, policy enforcement, secrets
- references/autoscaling.md - HPA, VPA, KEDA, cluster autoscaler configuration
- references/troubleshooting.md - Systematic debugging playbooks for common failures
Examples
- examples/manifests/ - Copy-paste ready YAML manifests
- examples/python/ - Automation scripts (audit, cost analysis, validation)
- examples/go/ - Operator development examples
Tools
- scripts/validate-resources.sh - Audit pods without resource limits
- scripts/audit-networkpolicies.sh - Find namespaces without NetworkPolicies
- scripts/cost-analysis.sh - Resource cost breakdown by namespace
Related Skills
- building-ci-pipelines - Deploy to Kubernetes from CI/CD (kubectl apply, Helm, GitOps)
- observability - Monitor clusters and workloads (Prometheus, Grafana, tracing)
- secret-management - Secure secrets in Kubernetes (External Secrets, Sealed Secrets)
- testing-strategies - Test manifests and deployments (Kubeval, Conftest, Kind)
- infrastructure-as-code - Provision Kubernetes clusters (Terraform, Cluster API)
- gitops-workflows - Declarative cluster management (Flux, ArgoCD)
Best Practices Summary
Resource Management:
- Always set CPU/memory requests and limits
- Use VPA for automated rightsizing
- Implement resource quotas per namespace
- Monitor actual usage vs. requests
Scheduling:
- Use topology spread constraints for high availability
- Apply taints for workload isolation (GPU, spot instances)
- Set pod priority for critical workloads
Networking:
- Implement NetworkPolicies with default-deny
- Use Gateway API for new applications
- Apply rate limiting at ingress layer
Storage:
- Use CSI drivers (not legacy provisioners)
- Define StorageClasses per performance tier
- Enable volume snapshots for stateful apps
Security:
- Enforce Pod Security Standards (Restricted for apps)
- Implement RBAC with least privilege
- Use policy engines for guardrails (Kyverno/OPA)
- Scan images for vulnerabilities
Autoscaling:
- Use HPA for stateless workloads
- Use KEDA for event-driven workloads
- Enable cluster autoscaler with limits
- Set PodDisruptionBudgets to prevent over-disruption