Kubernetes Production

Use When

Use when operating production Kubernetes — Helm, autoscaling (HPA/VPA), resource management, StatefulSets, external-secrets, observability (Prometheus/Grafana/Loki), RBAC, Pod Security Standards, NetworkPolicies, admission control, backup (Velero), and cost control.
The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When

The task is unrelated to kubernetes-production or would be better handled by a more specific companion skill.
The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs

Gather relevant project context, constraints, and the concrete problem to solve; load references only as needed.
Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow

Read this SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.
Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards

Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns

Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
Loading every reference file by default instead of using progressive disclosure.

Outputs

A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
References used, companion skills, or follow-up actions when they materially improve execution.

Evidence Produced

Category	Artifact	Format	Example
Operability	K8s deployment runbook	Markdown doc per `skill-composition-standards/references/runbook-template.md` covering rollout, scaling, secret rotation, and PDB review	`docs/k8s/deployment-runbook.md`
Operability	Resource and autoscaler config note	Markdown doc covering requests/limits, HPA/VPA, and PDB rationale	`docs/k8s/resources-config.md`

References

Use the references/ directory for deep detail after reading the core workflow below.

The operational bar for running K8s in production — not just running Pods, but running them predictably, securely, observably, and cheaply.

Prerequisites: Load kubernetes-fundamentals first.

When this skill applies

Moving a POC cluster into production.
Auditing an existing cluster for production readiness.
Adding observability, security, or cost controls.
Reviewing Helm charts before deployment.

The production checklist

[ ] Helm (or Kustomize) — not raw manifests
[ ] requests + limits on every container
[ ] HPA on stateless workloads
[ ] PodDisruptionBudget on every workload with >1 replica
[ ] External secrets (Vault / AWS SM / GCP SM) — not inline Secret manifests
[ ] Observability: Prometheus + Grafana + Loki + Alertmanager
[ ] RBAC: least-privilege ServiceAccounts
[ ] Pod Security Standards enforced (restricted for app workloads)
[ ] NetworkPolicies: default-deny + explicit allow
[ ] Admission control: OPA Gatekeeper or Kyverno
[ ] Image scanning in CI (Trivy, Grype)
[ ] Backups: Velero with off-cluster storage
[ ] Cluster autoscaler or karpenter
[ ] SLOs + runbooks per service

Request sizing heuristics

Measure first; do not guess. Steady-state and peak come from kubectl top plus container_memory_working_set_bytes and rate(container_cpu_usage_seconds_total[5m]) over a representative week.

Memory request  = p95 working set
Memory limit    = p99 working set + 30% headroom
CPU request     = p95 utilisation in cores
CPU limit       = 2x request, or unset (with PriorityClass + tested cluster) for burstable workloads

QoS classes (set deliberately):

Guaranteed  requests == limits          -> tier-1 services, evicted last
Burstable   requests < limits           -> default for most apps
BestEffort  no requests/limits          -> never use in production

HPA does not work without CPU requests. VPA in recommend mode for one week tells you whether you are over- or under-provisioned.

When to reach for an operator

Stateful system you operate (Postgres, Kafka, Redis, ES) -> install a vetted operator (CloudNativePG, Strimzi, ECK)
Repeated multi-step ops you do by hand                   -> consider an operator
Watch a ConfigMap and bounce Pods                        -> CronJob is enough
Tenant lifecycle in SaaS                                 -> GitOps + ApplicationSet first; operator only if dynamic

Rule: install before you build. See references/crd-operators.md for the build/install/avoid matrix and CRD hygiene.

Cluster and node upgrades

Upgrades are releases, not chores. Pre-flight: run kube-no-trouble/pluto against Git for removed APIs, verify a Velero restore in a sandbox, drain-test one canary node.

Order: control plane -> node groups one at a time -> add-ons (CNI first, mesh last). Every workload with replicas > 1 has a PDB; a PDB of minAvailable: 100% blocks drain forever. See references/upgrade-runbook.md.

Helm vs Kustomize

Package to distribute to many users / tenants  -> Helm
Env overlays (dev/staging/prod), single app    -> Kustomize
Complex templating + conditionals              -> Helm
Strict "WYSIWYG" manifests                     -> Kustomize

We use Helm for shipped packages and Kustomize for simple in-house env overlays. Never both for the same workload.

See references/helm-vs-kustomize.md.

Resource management

Requests: guaranteed minimum; scheduler uses this to place Pods. Limits: cap; exceeded CPU = throttled, exceeded memory = OOMKilled.

Rules:

Always set both.
Memory: limit = expected peak + headroom (~30%).
CPU: request = steady-state; limit = 2–5× request or unset (in carefully-configured clusters).
Measure before setting — kubectl top and Prometheus container_memory_working_set_bytes, rate(container_cpu_usage_seconds_total[5m]).
Gradually reduce over-provisioning using VPA in recommend mode.

See references/resource-management.md.

HPA — Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api, namespace: production }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 50, periodSeconds: 60 }]
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Percent, value: 100, periodSeconds: 30 }]

Target 60–70% CPU utilisation to leave headroom for spikes.
minReplicas: 2 minimum (HA).
Scale up fast, scale down slow.
For queue-depth-based or custom metrics: install Prometheus Adapter.

See references/autoscaling-hpa-vpa.md.

Stateful workloads

StatefulSet + PVC for databases, caches, brokers:

Stable names (pod-0, pod-1).
Ordered rollout/rollback.
Each Pod has its own PersistentVolume.
Headless Service for stable DNS per Pod.
PodDisruptionBudget (minAvailable: N-1).
Anti-affinity across nodes/zones.
Backups to object storage (never rely on PV alone).

For databases, strongly consider managed (RDS, Cloud SQL, Neon) before in-cluster. In-cluster DBs make sense only with serious ops maturity.

See references/stateful-workloads.md.

External secrets

Never commit Secret manifests to Git (even base64 is not encryption). Options:

external-secrets operator + Vault / AWS Secrets Manager / GCP Secret Manager / 1Password.
Sealed Secrets (Bitnami) — encrypted manifests safe for Git.
SOPS + age for GitOps-friendly encryption.

Pattern: external-secrets + cloud secret manager is our default in cloud; SOPS for air-gapped or self-hosted.

See references/secrets-external-secrets.md.

Observability stack

Minimum stack:

Prometheus — metrics scraping + storage (or Mimir/Thanos for long-term).
Grafana — dashboards.
Loki — logs.
Alertmanager — alert routing.
OpenTelemetry Collector — receive OTLP, fan out to backends.
Tempo or Jaeger — traces.

Install via kube-prometheus-stack Helm chart.

Golden signals per service: latency, traffic, errors, saturation.

See references/observability-stack.md.

RBAC + Pod Security

RBAC principles:

One ServiceAccount per workload.
No default ServiceAccount usage (set automountServiceAccountToken: false when not needed).
Roles — least privilege.
Audit unused ClusterRoleBindings periodically.

Pod Security Standards — enforce at namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Restricted disallows: privilege escalation, hostPath, hostNetwork, running as root, etc.

See references/rbac-and-pod-security.md.

NetworkPolicies — default deny

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: production }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

Then add explicit allows per workload (e.g., api can reach db, web can reach api, everything can reach DNS + external API).

Requires a CNI that enforces NetworkPolicy: Calico, Cilium, or Azure CNI. Flannel does not.

See references/network-policies.md.

Admission control — OPA Gatekeeper or Kyverno

Policy-as-code enforcement at cluster admission:

Deny :latest images.
Require resources.requests and resources.limits.
Require readinessProbe.
Require specific labels (app.kubernetes.io/*, owner).
Restrict which registries are allowed.
Deny hostPath volumes.

Kyverno — YAML-based policies, easier for most teams. OPA Gatekeeper — Rego-based, more powerful, steeper curve.

See references/admission-control-opa-kyverno.md.

Image scanning

Trivy or Grype in CI for container images and IaC.
Fail builds on HIGH/CRITICAL.
Sign images (cosign) and verify at admission (policy-controller or Kyverno).

Backup — Velero

velero backup create weekly-$(date +%Y%m%d) --include-namespaces production --ttl 720h

Backup + restore cluster resources and PersistentVolumes.
Storage in off-cluster bucket (S3/GCS/Azure Blob).
Scheduled backups with Velero Schedule.
Regular restore drills — backups you never restore are hope, not a backup.

See references/backup-velero.md.

Cost control

Cluster autoscaler (or karpenter on AWS) — nodes scale with demand.
Right-sizing — VPA in recommend mode, kubectl-cost, kubecost for cost dashboards.
Spot / preemptible nodes — for stateless, fault-tolerant workloads. Use node taints + tolerations to steer.
Idle resource alerts — requests far above usage = over-provisioning.

See references/cost-control.md.

Anti-patterns

Relying on default ServiceAccount tokens.
Mounting every ConfigMap/Secret as env vars (file mounts are more secure, support rotation).
Using Deployment for stateful workloads.
HPA on workloads with startup > 60s without startup probe.
Default-allow NetworkPolicy posture.
Skipping backup/restore drills.
No PodDisruptionBudget on critical services.

References

references/helm-vs-kustomize.md
references/resource-management.md
references/autoscaling-hpa-vpa.md
references/stateful-workloads.md
references/secrets-external-secrets.md
references/observability-stack.md
references/rbac-and-pod-security.md
references/network-policies.md
references/admission-control-opa-kyverno.md
references/backup-velero.md
references/cost-control.md
references/upgrade-runbook.md
references/crd-operators.md

kubernetes-production

Kubernetes Production

Use When

Do Not Use When

Required Inputs

Workflow

Quality Standards

Anti-Patterns

Outputs

Evidence Produced

References

When this skill applies

The production checklist

Request sizing heuristics

When to reach for an operator

Cluster and node upgrades

Helm vs Kustomize

Resource management

HPA — Horizontal Pod Autoscaler

Stateful workloads

External secrets

Observability stack

RBAC + Pod Security

NetworkPolicies — default deny

Admission control — OPA Gatekeeper or Kyverno

Image scanning

Backup — Velero

Cost control

Anti-patterns

Read next

References

More from peterbamuhigire/skills-web-dev

google-play-store-review

multi-tenant-saas-architecture

jetpack-compose-ui

gis-mapping

saas-accounting-system

manual-guide