kubernetes-best-practices
Kubernetes Best Practices
This skill provides guidance for writing production-ready Kubernetes manifests and managing cloud-native applications.
Resource Management
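Memory: Set requests and limits to the same value so the scheduler reserves exactly what the container is allowed to use. This avoids memory overcommit on the node; a container that exceeds its limit is still OOM-killed.
CPU: Set requests only and omit limits, so the container can burst into idle CPU and is not throttled. Without a CPU limit the Pod falls in the Burstable QoS class, not Guaranteed.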
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    # No CPU limit
Image Versioning
Always pin a specific version; never use the :latest tag unless explicitly requested:
# Good
image: nginx:1.25.3
# Bad
image: nginx:latest
Tags can be re-pointed to a different image, so for true immutability consider pinning to an image digest.
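A digest reference might look like the sketch below. The `<digest>` part is a placeholder, not a real value; the actual SHA-256 digest has to be looked up in your registry for the exact image you tested.

```yaml
# Placeholder digest: replace <digest> with the sha256 value reported by your registry.
image: nginx@sha256:<digest>
```

Unlike a tag, a digest always resolves to the same image content, at the cost of being harder to read in reviews.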
Configuration Management
Secrets: Sensitive data (passwords, tokens, certificates)
ConfigMaps: Non-sensitive configuration (feature flags, URLs, settings)
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: database-url
  - name: LOG_LEVEL
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: log-level
Best practices:
- Never hardcode secrets in manifests
- Use external secret management (Sealed Secrets, External Secrets Operator)
- Rotate secrets regularly
- Limit access with RBAC
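As a rough sketch of what sits behind the `env` example and the RBAC bullet above: the first object is an assumed shape for the `app-config` ConfigMap (the `info` value is illustrative), and the second is a hypothetical Role, here named `app-secrets-reader`, that grants read access to the single Secret `app-secrets` and nothing else.

```yaml
# Assumed contents of the app-config ConfigMap referenced by LOG_LEVEL.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  log-level: "info"          # illustrative value
---
# Hypothetical Role: "get" on one named Secret only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-secrets-reader   # hypothetical name
rules:
  - apiGroups: [""]          # "" is the core API group
    resources: ["secrets"]
    resourceNames: ["app-secrets"]
    verbs: ["get"]
```

The Role only takes effect once a RoleBinding attaches it to a service account or user; the Secret itself is deliberately not shown, since its values should come from an external secret manager rather than a committed manifest.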
Workload Selection
Choose the appropriate workload type:
- Deployment: Stateless applications (web servers, APIs, microservices)
- StatefulSet: Stateful applications (databases, message queues)
- DaemonSet: Node-level services (log collectors, monitoring agents)
- Job/CronJob: Batch processing and scheduled tasks
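For the scheduled-task case, a minimal CronJob could be sketched as follows. The name `nightly-report`, the schedule, and the `busybox` command are assumptions chosen for illustration only.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report          # hypothetical name
spec:
  schedule: "0 2 * * *"         # daily at 02:00 in the controller's time zone
  concurrencyPolicy: Forbid     # skip a run while the previous one is still active
  jobTemplate:
    spec:
      backoffLimit: 3           # retry a failed Job up to 3 times
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: busybox:1.36.1
              command: ["sh", "-c", "echo generating report"]
```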
Security Context
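Apply a restrictive security context. fsGroup is a pod-level field, while capabilities, readOnlyRootFilesystem, and allowPrivilegeEscalation are container-level fields, so the settings are split across both levels:
spec:
  securityContext:              # pod-level
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: app                 # placeholder container name
      securityContext:          # container-level
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false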
Security checklist:
- Run as non-root user
- Drop all capabilities by default
- Use read-only root filesystem
- Disable privilege escalation
- Implement network policies
- Scan images for vulnerabilities
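The network-policy item in the checklist can be sketched as a default-deny policy plus one explicit allow rule. The `production` namespace and the `app: web-app` label appear elsewhere in this document; the `app: frontend` label and port 8080 as the allowed port are assumptions.

```yaml
# Deny all ingress to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}               # empty selector matches all pods
  policyTypes:
    - Ingress
---
# Then allow only the traffic that is actually needed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # hypothetical client label
      ports:
        - protocol: TCP
          port: 8080
```

NetworkPolicy objects are only enforced when the cluster's network plugin supports them; on a plugin without support they are accepted but have no effect.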
Health Checks
Implement all three probe types:
Liveness: Restart container if unhealthy
Readiness: Remove from service endpoints if not ready
Startup: Allow slow-starting containers time to initialize
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  # Allows up to 30 x 10s = 300s for startup; liveness and readiness
  # checks only begin once this probe has succeeded.
  periodSeconds: 10
  failureThreshold: 30
High Availability
Replica counts: Run at least 2 replicas for production workloads
Pod Disruption Budgets: Maintain availability during voluntary disruptions
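apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  # Needs more than 2 replicas: with exactly 2, no voluntary eviction
  # is allowed and node drains will block.
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app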
Additional HA considerations:
- Use anti-affinity rules for pod distribution across nodes
- Configure graceful shutdown periods
- Implement horizontal pod autoscaling
- Set appropriate resource requests for scheduling
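The considerations above could be combined roughly as in the following sketch. The names, the 30-second grace period, the replica bounds, and the 70% CPU target are all illustrative assumptions, not recommendations from the original text.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  # replicas is omitted on purpose: the autoscaler below owns the replica count.
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      terminationGracePeriodSeconds: 30   # time allowed for graceful shutdown
      affinity:
        podAntiAffinity:
          # Prefer (but do not require) spreading replicas across nodes.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web-app
                topologyKey: kubernetes.io/hostname
      containers:
        - name: web-app
          image: nginx:1.25.3
          resources:
            requests:                     # CPU request is required for the utilization target
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "256Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3                          # stays above a PDB with minAvailable: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70          # percent of the CPU request
```

A preferred anti-affinity rule is used here so that pods can still be scheduled on a small cluster; a required rule gives stronger spreading but can leave pods pending when nodes run out.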
Namespace Organization
Use namespaces for environment isolation and apply resource quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    persistentvolumeclaims: "10"
Benefits: Logical separation, resource limits, RBAC boundaries, cost tracking
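Once a quota covers `requests.cpu` and `requests.memory`, pods that do not declare those requests may be rejected at creation time. One way to soften this, sketched here under assumed values, is a LimitRange that supplies defaults for containers that omit them; the name `prod-defaults` and the sizes are hypothetical.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-defaults           # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                  # applied when a container sets no limits
        memory: 128Mi
```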
Labels and Annotations
Use consistent, recommended labels:
metadata:
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/instance: myapp-prod
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: ecommerce
    app.kubernetes.io/managed-by: helm
Service Types
- ClusterIP: Internal cluster communication (default)
- NodePort: External access via node ports (dev/test)
- LoadBalancer: Cloud provider load balancer (production)
- ExternalName: DNS CNAME record (external services)
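For the default case, a ClusterIP Service might be sketched like this. It selects pods by the recommended labels shown earlier; the Service name and the port numbers (80 forwarding to container port 8080) are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: ClusterIP               # the default; shown explicitly for clarity
  selector:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/instance: myapp-prod
  ports:
    - name: http
      port: 80                  # port exposed inside the cluster
      targetPort: 8080          # assumed container port
```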
Storage
Choose an appropriate storage class and access mode:
Access Modes:
- ReadWriteOnce (RWO): Single node read-write
- ReadOnlyMany (ROX): Multiple nodes read-only
- ReadWriteMany (RWX): Multiple nodes read-write
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
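A claim only becomes useful once a pod mounts it. A minimal sketch of that wiring follows; the pod name, the volume name `data`, and the mount path `/var/lib/app` are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                     # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25.3
      volumeMounts:
        - name: data
          mountPath: /var/lib/app   # assumed data directory
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data     # the PVC defined above
```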
Validation and Testing
Always validate before applying to production:
- Client-side validation:
  kubectl apply --dry-run=client -f manifest.yaml
- Server-side validation:
  kubectl apply --dry-run=server -f manifest.yaml
- Test in staging: Deploy to non-production environment first
- Monitor metrics: Watch resource usage and application health
- Gradual rollout: Use rolling updates with health checks
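For the gradual-rollout item, a Deployment's update strategy could be configured along these lines. The fragment belongs under a Deployment's top-level `spec`, and the specific values are illustrative assumptions.

```yaml
spec:
  minReadySeconds: 10           # a new pod must stay ready this long to count as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # add at most one extra pod during the update
      maxUnavailable: 0         # never drop below the desired replica count
```

With `maxUnavailable: 0`, an old pod is only removed after its replacement passes the readiness probe, which is why the rollout depends on the health checks being configured.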
Application Checklist
When creating or reviewing Kubernetes manifests:
- Resource requests and limits configured
- Specific image version pinned (not :latest)
- Secrets and ConfigMaps used for configuration
- Security context implemented (non-root, dropped capabilities)
- Health checks configured (liveness, readiness, startup)
- Pod Disruption Budget defined for HA workloads
- Consistent labels applied
- Appropriate workload type selected
- Namespace and resource quotas configured
- Validated with dry-run before applying