Kubernetes Operations

Purpose

Operating Kubernetes clusters in production requires mastery of resource management, scheduling patterns, networking architecture, storage strategies, security hardening, and autoscaling. This skill provides operations-first frameworks for right-sizing workloads, implementing high-availability patterns, securing clusters with RBAC and Pod Security Standards, and systematically troubleshooting common failures.

Use this skill when deploying applications to Kubernetes, configuring cluster resources, implementing NetworkPolicies for zero-trust security, setting up autoscaling (HPA, VPA, KEDA), managing persistent storage, or diagnosing operational issues like CrashLoopBackOff or resource exhaustion.

When to Use This Skill

Common Triggers:

  • "Deploy my application to Kubernetes"
  • "Configure resource requests and limits"
  • "Set up autoscaling for my pods"
  • "Implement NetworkPolicies for security"
  • "My pod is stuck in Pending/CrashLoopBackOff"
  • "Configure RBAC with least privilege"
  • "Set up persistent storage for my database"
  • "Spread pods across availability zones"

Operations Covered:

  • Resource management (CPU/memory, QoS classes, quotas)
  • Advanced scheduling (affinity, taints, topology spread)
  • Networking (NetworkPolicies, Ingress, Gateway API)
  • Storage operations (StorageClasses, PVCs, CSI)
  • Security hardening (RBAC, Pod Security Standards, policies)
  • Autoscaling (HPA, VPA, KEDA, cluster autoscaler)
  • Troubleshooting (systematic debugging playbooks)

Resource Management

Quality of Service (QoS) Classes

Kubernetes assigns QoS classes based on resource requests and limits:

Guaranteed (Highest Priority):

  • Requests equal limits for CPU and memory in every container
  • Last to be evicted under node pressure; a container is still OOM-killed if it exceeds its memory limit
  • Use for critical production services

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"  # Same as request
    cpu: "500m"

Burstable (Medium Priority):

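  • At least one container sets a request or limit, but the pod does not meet the Guaranteed criteria (typically requests lower than limits, or only requests set)
  • Can burst above requests, up to limits
  • Evicted before Guaranteed pods under node pressure, especially when using more than requested
  • Use for web servers, most applications
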
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # 2x request
    cpu: "500m"

BestEffort (Lowest Priority):

  • No requests or limits set
  • First to be evicted under pressure
  • Use only for development/testing

Decision Framework: Which QoS Class?

| Workload Type | QoS Class | Configuration |
|---|---|---|
| Critical API/Database | Guaranteed | requests == limits |
| Web servers, services | Burstable | limits 1.5-2x requests |
| Batch jobs | Burstable | Low requests, high limits |
| Dev/test environments | BestEffort | No limits |

Resource Quotas and LimitRanges

Enforce multi-tenancy with ResourceQuotas (namespace limits) and LimitRanges (per-container defaults):

# ResourceQuota: Namespace-level limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
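
The ResourceQuota above has a per-container counterpart in LimitRange. The following is a minimal sketch, assuming the same `team-alpha` namespace; the object name and all CPU/memory values are illustrative placeholders, not recommendations from this skill.

```yaml
# LimitRange: per-container defaults and bounds (illustrative values)
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults        # hypothetical name
  namespace: team-alpha
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container sets no requests
      cpu: "100m"
      memory: "128Mi"
    default:                 # applied when a container sets no limits
      cpu: "500m"
      memory: "512Mi"
    max:                     # upper bound for any single container
      cpu: "2"
      memory: "4Gi"
```

Pairing the two matters in practice: once a ResourceQuota constrains requests and limits, pods that omit those values are rejected unless a LimitRange supplies defaults.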

For detailed resource management patterns including Vertical Pod Autoscaler (VPA), see references/resource-management.md.

Advanced Scheduling

Node Affinity

Control which nodes pods schedule on with required (hard) or preferred (soft) constraints:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - g4dn.xlarge  # GPU instance
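
The block above shows a required (hard) constraint. A preferred (soft) constraint could look like the sketch below; the zone value and the weight are assumptions chosen for illustration.

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80               # 1-100; a higher weight means a stronger preference
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a         # hypothetical zone name
```

With a preferred rule the scheduler favors matching nodes but still places the pod elsewhere if none are available, which avoids Pending pods when the preferred capacity runs out.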

Taints and Tolerations

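Reserve nodes for specific workloads. Taints repel pods from a node unless the pod carries a matching toleration. A toleration only permits scheduling onto the tainted node; it does not attract the pod there, so pair it with node affinity when the workload must land on those nodes:

# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule

# Pod spec: tolerate the GPU taint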
tolerations:
- key: "workload"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"

Topology Spread Constraints

Distribute pods evenly across failure domains (zones, nodes):

topologySpreadConstraints:
- maxSkew: 1  # Max difference in pod count
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-app

For advanced scheduling patterns including pod priority and preemption, see references/scheduling-patterns.md.
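
As a rough sketch of the pod priority mentioned above, a PriorityClass and a pod that references it might look like this. The class name and the numeric value are hypothetical; choose values that fit the other priority classes in the cluster.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload      # hypothetical name
value: 100000                  # higher value means higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For services that must schedule ahead of other pods."
---
# Pod spec fragment referencing the class
spec:
  priorityClassName: critical-workload
```

When the cluster is full, the scheduler may evict lower-priority pods to make room for pods using this class, so reserve high values for a small set of workloads.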

Networking

NetworkPolicies (Zero-Trust Security)

Implement default-deny security with NetworkPolicies:

# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Allow specific ingress (frontend → backend)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
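
A default-deny policy that covers Egress also blocks DNS lookups, so pods usually need an explicit DNS allowance. The sketch below assumes cluster DNS runs in the `kube-system` namespace and that namespaces carry the automatic `kubernetes.io/metadata.name` label; the policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress       # hypothetical name
  namespace: production
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

The same reasoning applies to application traffic: with Egress denied by default, the frontend pods also need an egress rule toward the backend on port 8080 before the ingress allowance takes effect end to end.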

Ingress vs. Gateway API

Ingress (Legacy):

  • Widely supported, mature ecosystem
  • Limited expressiveness
  • Use for existing applications

Gateway API (Modern):

  • Role-oriented design (cluster ops vs. app devs)
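  • More expressive (HTTPRoute, TCPRoute, TLSRoute; route kinds other than HTTPRoute may still be experimental, depending on the Gateway API release)
  • Recommended for new applications (Gateway and HTTPRoute have been GA since Gateway API v1.0; the API ships as CRDs and is versioned separately from Kubernetes)
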
# Gateway API example
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-routes
spec:
  parentRefs:
  - name: production-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: backend
      port: 8080
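
For comparison, roughly the same `/api` route expressed as an Ingress is sketched below. The resource name and the `nginx` ingress class are assumptions; use whichever ingress controller class is installed in the cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress            # hypothetical name
spec:
  ingressClassName: nginx      # assumption: a controller with this class is installed
  rules:
  - http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: backend
            port:
              number: 8080
```

The Ingress bundles listener and routing concerns into one object, whereas the HTTPRoute attaches to a separately managed Gateway through `parentRefs`.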

For detailed networking patterns including service mesh integration, see references/networking.md.

Storage

StorageClasses (Define Performance Tiers)

StorageClasses define storage tiers for different workload needs:

# AWS EBS SSD (high performance)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iopsPerGB: "50"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
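
A workload consumes the class through a PersistentVolumeClaim. This is a minimal sketch that assumes the `fast-ssd` class above; the claim name, namespace, and size are illustrative.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data          # hypothetical name
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi           # illustrative size
```

Because the class uses `volumeBindingMode: WaitForFirstConsumer`, the claim stays Pending until a pod that mounts it is scheduled. That is expected and lets the volume be created in the same zone as the pod.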

Storage Decision Matrix

| Workload | Performance | Access Mode | Storage Class |
|---|---|---|---|
| Database | High | ReadWriteOnce | SSD (gp3/io2) |
| Shared files | Medium | ReadWriteMany | NFS/EFS |
| Logs (temp) | Low | ReadWriteOnce | Standard HDD |
| ML models | High | ReadOnlyMany | Object storage (S3) |

Access Modes:

  • ReadWriteOnce (RWO): Single node read-write (most common)
  • ReadOnlyMany (ROX): Multiple nodes read-only
  • ReadWriteMany (RWX): Multiple nodes read-write (requires network storage)

For detailed storage operations including volume snapshots and CSI drivers, see references/storage.md.

Security

RBAC (Role-Based Access Control)

Implement least-privilege access with RBAC:

# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# RoleBinding (assign role to user)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: User
  name: jane@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Pod Security Standards

Enforce secure pod configurations at the namespace level:

# Namespace with Restricted PSS (most secure)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Pod Security Levels:

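  • Restricted: Most secure; enforces pod hardening practices such as running as non-root, dropping all capabilities, and disallowing privilege escalation (use for applications)
  • Baseline: Minimally restrictive, prevents known privilege escalations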
  • Privileged: Unrestricted (only for system workloads)
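
To show what the Restricted level asks of a workload, here is a sketch of a pod that should pass admission in a namespace enforcing `restricted`. The pod name and image are placeholders, and the image is assumed to run as a non-root user.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example     # hypothetical name
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/app:1.0.0   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

Setting the `warn` and `audit` labels before `enforce` is a low-risk way to find workloads that lack these fields before they start being rejected.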

For detailed security patterns including policy enforcement (Kyverno/OPA) and secrets management, see references/security.md.

Autoscaling

Horizontal Pod Autoscaler (HPA)

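Scale pod replicas based on CPU, memory, or custom metrics. Resource metrics need a metrics source such as metrics-server, and utilization targets are computed as a percentage of each container's requests, so the target pods must set requests: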

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down

KEDA (Event-Driven Autoscaling)

Scale based on events beyond CPU/memory (queues, cron schedules, Prometheus metrics):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaler
spec:
  scaleTargetRef:
    name: message-processor
  minReplicaCount: 0   # Scale to zero when queue empty
  maxReplicaCount: 30
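  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks
      mode: QueueLength
      value: "10"                 # Target of about 10 messages per replica
      hostFromEnv: RABBITMQ_HOST  # Assumption: env var on the target holding the AMQP connection string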

Autoscaling Decision Matrix

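| Scenario | Primary autoscaler |
|---|---|
| Stateless web app with traffic spikes | HPA |
| Single-instance database | VPA |
| Queue processor (event-driven) | KEDA |
| Pods pending (insufficient nodes) | Cluster Autoscaler |

In the first three scenarios a second autoscaler may also apply, depending on the workload.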

For detailed autoscaling patterns including VPA and cluster autoscaler configuration, see references/autoscaling.md.
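
As a starting point for VPA, the sketch below runs it in recommendation-only mode against the `web-app` Deployment used in the HPA example. It assumes the VPA components and CRDs are installed in the cluster; the object name and resource bounds are illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"          # recommend only; do not evict or resize pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"
```

Recommendation-only mode is a cautious choice when the same Deployment is also scaled by an HPA on CPU or memory, since letting both controllers act on the same metric can make them work against each other.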

Troubleshooting

Common Pod Issues

Pod Stuck in Pending:

kubectl describe pod <pod-name>

# Common causes:
# - Insufficient CPU/memory: Reduce requests or add nodes
# - Node selector mismatch: Fix nodeSelector or add labels
# - PVC not bound: Create PVC or fix name
# - Taint intolerance: Add toleration or remove taint

CrashLoopBackOff:

kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # Logs from the previous (crashed) container
kubectl describe pod <pod-name>     # Last State shows the exit reason (e.g. OOMKilled)

# Common causes:
# - Application crash: Fix code or configuration
# - Missing environment variables: Add to deployment
# - Liveness probe failing: Increase initialDelaySeconds or add a startupProbe
# - OOMKilled: Increase memory limit or fix leak
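
For slow-starting containers, a startupProbe is an alternative to a large `initialDelaySeconds`: liveness checks are held off until the startup probe succeeds. The fragment below is a sketch; the image, the `/healthz` path, the port, and the thresholds are assumptions to adapt to the application.

```yaml
# Container spec fragment (illustrative values)
containers:
- name: app
  image: registry.example.com/app:1.0.0   # placeholder image
  startupProbe:
    httpGet:
      path: /healthz          # hypothetical health endpoint
      port: 8080
    periodSeconds: 10
    failureThreshold: 30      # allows up to 30 x 10s = 300s to start
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
```

This keeps the liveness probe tight for steady-state failures while still giving the container a generous startup budget.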

ImagePullBackOff:

kubectl describe pod <pod-name>

# Common causes:
# - Image doesn't exist: Fix image name/tag
# - Authentication required: Create imagePullSecrets
# - Network issues: Check node egress to the registry, firewall rules, proxy/DNS settings

Service Not Accessible:

kubectl get endpoints <service-name>  # Should list pod IPs

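# If endpoints empty:
# - Service selector doesn't match pod labels
# - Pods aren't ready (readiness probe failing)
#
# If endpoints are populated but traffic still fails:
# - Service targetPort doesn't match the container port
# - NetworkPolicies are blocking traffic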

For systematic troubleshooting playbooks including networking and storage issues, see references/troubleshooting.md.

Reference Documentation

Deep Dives

  • references/resource-management.md - Resource requests/limits, QoS classes, ResourceQuotas, VPA
  • references/scheduling-patterns.md - Node affinity, taints/tolerations, topology spread, priority
  • references/networking.md - NetworkPolicies, Ingress, Gateway API, service mesh integration
  • references/storage.md - StorageClasses, PVCs, CSI drivers, volume snapshots
  • references/security.md - RBAC, Pod Security Standards, policy enforcement, secrets
  • references/autoscaling.md - HPA, VPA, KEDA, cluster autoscaler configuration
  • references/troubleshooting.md - Systematic debugging playbooks for common failures

Examples

  • examples/manifests/ - Copy-paste ready YAML manifests
  • examples/python/ - Automation scripts (audit, cost analysis, validation)
  • examples/go/ - Operator development examples

Tools

  • scripts/validate-resources.sh - Audit pods without resource limits
  • scripts/audit-networkpolicies.sh - Find namespaces without NetworkPolicies
  • scripts/cost-analysis.sh - Resource cost breakdown by namespace

Related Skills

  • building-ci-pipelines - Deploy to Kubernetes from CI/CD (kubectl apply, Helm, GitOps)
  • observability - Monitor clusters and workloads (Prometheus, Grafana, tracing)
  • secret-management - Secure secrets in Kubernetes (External Secrets, Sealed Secrets)
  • testing-strategies - Test manifests and deployments (Kubeval, Conftest, Kind)
  • infrastructure-as-code - Provision Kubernetes clusters (Terraform, Cluster API)
  • gitops-workflows - Declarative cluster management (Flux, ArgoCD)

Best Practices Summary

Resource Management:

  • Always set CPU/memory requests and limits
  • Use VPA for automated rightsizing
  • Implement resource quotas per namespace
  • Monitor actual usage vs. requests

Scheduling:

  • Use topology spread constraints for high availability
  • Apply taints for workload isolation (GPU, spot instances)
  • Set pod priority for critical workloads

Networking:

  • Implement NetworkPolicies with default-deny
  • Use Gateway API for new applications
  • Apply rate limiting at ingress layer

Storage:

  • Use CSI drivers (not legacy provisioners)
  • Define StorageClasses per performance tier
  • Enable volume snapshots for stateful apps

Security:

  • Enforce Pod Security Standards (Restricted for apps)
  • Implement RBAC with least privilege
  • Use policy engines for guardrails (Kyverno/OPA)
  • Scan images for vulnerabilities

Autoscaling:

  • Use HPA for stateless workloads
  • Use KEDA for event-driven workloads
  • Enable cluster autoscaler with limits
  • Set PodDisruptionBudgets to prevent over-disruption
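
A PodDisruptionBudget for the `web-app` Deployment from the HPA example could be sketched as follows. The object name and the `app: web-app` label are assumptions; the selector must match the labels the Deployment's pods actually carry.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb            # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web-app             # assumption: the Deployment's pods carry this label
```

With `minReplicas: 2` on the HPA, `minAvailable: 1` lets node drains and cluster autoscaler scale-downs proceed one pod at a time without taking the service fully offline.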