Kubernetes Operations

Core Workflow

Deployment Lifecycle

# 1. Validate before applying
kubectl apply --dry-run=server -f <manifest> -n <namespace>

# 2. Apply manifests
kubectl apply -f <manifest> -n <namespace>

# 3. Monitor rollout (blocks until complete or timeout)
kubectl rollout status deployment/<name> -n <namespace> --timeout=300s

# 4. Verify pods running
kubectl get pods -n <namespace> -l app=<label> -o wide

# 5. Check events for issues
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
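
The five steps above can be chained so that a failed validation or rollout stops the sequence early. The following is a minimal sketch, not a definitive deployment script: the function name deploy and its argument order (manifest, namespace, deployment name) are assumptions made for illustration.

```shell
# Hypothetical helper: run the lifecycle steps in order, stop at the first failure.
# $1 = manifest path, $2 = namespace, $3 = deployment name (all placeholders)
deploy() {
  # Step 1: server-side validation; nothing is persisted
  kubectl apply --dry-run=server -f "$1" -n "$2" || return 1
  # Step 2: apply for real
  kubectl apply -f "$1" -n "$2" || return 1
  # Step 3: wait for the rollout; on failure or timeout, show recent events (step 5)
  if ! kubectl rollout status "deployment/$3" -n "$2" --timeout=300s; then
    kubectl get events -n "$2" --sort-by='.lastTimestamp' | tail -20
    return 1
  fi
  # Step 4: list the pods in the namespace
  kubectl get pods -n "$2" -o wide
}
```

Usage: deploy app.yaml <namespace> <name>. A non-zero exit status means validation, apply, or rollout failed.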

Quick Health Check

# Cluster overview
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes  # requires metrics-server

# Namespace health
kubectl get all -n <namespace>  # common workload types only; omits ConfigMaps, Secrets, Ingresses
kubectl get pods -n <namespace> -o wide
kubectl top pods -n <namespace>

Troubleshooting Decision Tree

Pod Not Starting

Check pod status: kubectl get pods -n <ns> -o wide
Describe for events: kubectl describe pod <pod> -n <ns>
Check logs: kubectl logs <pod> -n <ns> (add --previous to see output from the crashed container)

Common causes:

ImagePullBackOff: Wrong image name/tag, missing imagePullSecrets
CrashLoopBackOff: App crashes on start, or liveness probes are too aggressive; check logs and probe settings
Pending: Insufficient resources, node selector/affinity issues
ContainerCreating: Volume mount issues, init container stuck
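
The list of common causes can be expressed as a small lookup that suggests a first command for a given pod status. This is a rough sketch only: the function name triage_hint is hypothetical, and the ErrImagePull and Init:* patterns are added on the assumption that they belong with the image-pull and init-container cases.

```shell
# Hypothetical helper: suggest a first command for a pod status.
# $1 = STATUS value from `kubectl get pods`, $2 = pod name, $3 = namespace
triage_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "kubectl describe pod $2 -n $3  # check image name/tag and imagePullSecrets" ;;
    CrashLoopBackOff)
      echo "kubectl logs $2 -n $3 --previous  # check crash output and probe settings" ;;
    Pending)
      echo "kubectl describe pod $2 -n $3  # check resources, node selector/affinity" ;;
    ContainerCreating|Init:*)
      echo "kubectl describe pod $2 -n $3  # check volume mounts and init containers" ;;
    *)
      echo "kubectl describe pod $2 -n $3" ;;
  esac
}
```

Usage: triage_hint CrashLoopBackOff <pod> <ns> prints the suggested command; it does not run it.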

Pod Running But Not Receiving Traffic

Check readiness: kubectl get pods -n <ns> (READY column)
Check endpoints: kubectl get endpoints <service> -n <ns>
Check service selector: kubectl describe service <svc> -n <ns>
Test connectivity: kubectl run debug --rm -it --restart=Never --image=busybox -n <ns> -- wget -qO- <service>:<port>
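
The endpoint and selector checks above can be combined: if a service has no ready addresses, the usual next step is to compare its selector with the pod labels. A minimal sketch, assuming the classic Endpoints resource is available in the cluster; the function name check_endpoints is hypothetical.

```shell
# Hypothetical helper: report whether a service has ready endpoints.
# $1 = service name, $2 = namespace
check_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1: compare the service selector with pod labels"
    # Print the selector, then the labels of every pod in the namespace
    kubectl get service "$1" -n "$2" -o jsonpath='{.spec.selector}'
    echo
    kubectl get pods -n "$2" --show-labels
    return 1
  fi
  echo "ready endpoints: $ips"
}
```

Exit status 0 means at least one ready address, 1 means none, 2 means the lookup itself failed.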

High Restart Count

# Get restart details
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Check terminated state
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Review liveness probe config
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].livenessProbe}'
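
To narrow the restart listing to pods worth investigating, the jsonpath output can be filtered with awk. A minimal sketch: the function name restarts_over and the threshold argument are assumptions, and like the command above it only looks at the first container of each pod.

```shell
# Hypothetical helper: print pods whose first container restarted more than N times.
# $1 = namespace, $2 = restart threshold
restarts_over() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' |
    awk -F '\t' -v max="$2" '$2 + 0 > max + 0 { print $1 "\t" $2 }'
}
```

Usage: restarts_over <ns> 5 lists pod name and restart count, tab-separated, for pods with more than five restarts.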

Common Operations

Logs

# Single pod
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> -c <container>  # multi-container
kubectl logs <pod> -n <ns> --previous      # crashed container
kubectl logs <pod> -n <ns> -f              # follow/stream

# All pods with label
kubectl logs -l app=<label> -n <ns> --all-containers

# Since time
kubectl logs <pod> -n <ns> --since=1h
kubectl logs <pod> -n <ns> --since-time="2024-01-01T00:00:00Z"

Exec/Debug

# Interactive shell
kubectl exec -it <pod> -n <ns> -- /bin/sh
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/bash

# Run command
kubectl exec <pod> -n <ns> -- <command>

# Debug with ephemeral container (k8s 1.25+)
kubectl debug -it <pod> -n <ns> --image=busybox --target=<container>

Scaling

# Manual scale
kubectl scale deployment/<name> -n <ns> --replicas=3

# Autoscaling
kubectl autoscale deployment/<name> -n <ns> --min=2 --max=10 --cpu-percent=80
kubectl get hpa -n <ns>

Rollback

# View history
kubectl rollout history deployment/<name> -n <ns>

# Rollback to previous
kubectl rollout undo deployment/<name> -n <ns>

# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <ns> --to-revision=<N>

# Pause/resume rollout
kubectl rollout pause deployment/<name> -n <ns>
kubectl rollout resume deployment/<name> -n <ns>

Resource Management

# Get resource usage
kubectl top pods -n <ns> --sort-by=memory
kubectl top pods -n <ns> --sort-by=cpu

# Describe resource limits
kubectl describe limitrange -n <ns>
kubectl describe resourcequota -n <ns>

# Get requests/limits for pods
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

Context & Namespace Management

# View contexts
kubectl config get-contexts
kubectl config current-context

# Switch context
kubectl config use-context <context-name>

# Set default namespace
kubectl config set-context --current --namespace=<ns>

# Create namespace
kubectl create namespace <name>

Output Formats

# Wide output with more columns
kubectl get pods -o wide

# YAML/JSON export
kubectl get deployment <name> -o yaml
kubectl get pod <name> -o json

# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# JSONPath
kubectl get pods -o jsonpath='{.items[*].metadata.name}'
kubectl get secret <name> -o jsonpath='{.data.password}' | base64 -d

Port Forwarding

# Forward pod port
kubectl port-forward pod/<name> <local>:<remote> -n <ns>

# Forward service port
kubectl port-forward svc/<name> <local>:<remote> -n <ns>

# Forward deployment (picks a pod)
kubectl port-forward deployment/<name> <local>:<remote> -n <ns>

Labels & Selectors

# Add label
kubectl label pods <pod> env=prod -n <ns>

# Remove label
kubectl label pods <pod> env- -n <ns>

# Select by label
kubectl get pods -l app=nginx,env=prod -n <ns>
kubectl get pods -l 'env in (prod,staging)' -n <ns>
kubectl delete pods -l app=test -n <ns>

Resource Cleanup

# Delete by manifest
kubectl delete -f <manifest> -n <ns>

# Delete by label
kubectl delete pods -l app=<label> -n <ns>

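# Force delete stuck pod (removes the API object immediately, without waiting for the kubelet to confirm the container has stopped)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force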

# Delete completed/failed pods
kubectl delete pods -n <ns> --field-selector=status.phase=Succeeded
kubectl delete pods -n <ns> --field-selector=status.phase=Failed

Health Probes Reference

Probe Types

Liveness: Is container alive? Failure → restart
Readiness: Can container serve traffic? Failure → remove from endpoints
Startup: Has app started? Blocks liveness/readiness until success
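
To show how the three probe types sit side by side in a container spec, the sketch below prints a small Pod manifest. Everything in it is illustrative rather than taken from a real workload: the function name probe_example, the pod name probe-demo, the nginx image, the / path, and all timing values are assumptions.

```shell
# Hypothetical helper: print an example Pod manifest with all three probe types.
probe_example() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx:1.27
    ports:
    - containerPort: 80
    startupProbe:            # liveness/readiness wait until this succeeds
      httpGet: { path: /, port: 80 }
      periodSeconds: 5
      failureThreshold: 12   # up to 12 x 5s = 60s to start
    livenessProbe:           # failure restarts the container
      httpGet: { path: /, port: 80 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:          # failure removes the pod from endpoints
      httpGet: { path: /, port: 80 }
      periodSeconds: 5
      failureThreshold: 2
EOF
}
```

Usage: probe_example | kubectl apply --dry-run=client -f - -n <ns> validates the manifest without creating anything.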

Debugging Probes

# Check probe config
kubectl get pod <pod> -n <ns> -o yaml | grep -A10 livenessProbe

# Test HTTP probe manually
kubectl exec <pod> -n <ns> -- wget -qO- localhost:<port>/healthz

# Check probe events
kubectl describe pod <pod> -n <ns> | grep -A5 "Liveness\|Readiness"

Tips

Always use -n <namespace> explicitly to avoid mistakes
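Use --dry-run=client -o yaml to generate manifests without creating anything: kubectl create deployment <name> --image=<image> --dry-run=client -o yaml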
Add --watch to continuously monitor: kubectl get pods -w
Use kubectl explain <resource>.<field> to understand spec fields
Annotate changes so they show up in rollout history: kubectl annotate deployment/<name> -n <ns> kubernetes.io/change-cause="<reason>"
