kafka-kubernetes
Kafka on Kubernetes Deployment
Expert guidance for deploying Apache Kafka on Kubernetes using industry-standard tools.
When to Use This Skill
I activate when you need help with:
- Kubernetes deployments: "Deploy Kafka on Kubernetes", "run Kafka in K8s", "Kafka Helm chart"
- Operator selection: "Strimzi vs Confluent Operator", "which Kafka operator to use"
- StatefulSet patterns: "Kafka StatefulSet best practices", "persistent volumes for Kafka"
- Production K8s: "Production-ready Kafka on K8s", "Kafka high availability in Kubernetes"
What I Know
Deployment Options Comparison
| Approach | Difficulty | Production-Ready | Best For |
|---|---|---|---|
| Strimzi Operator | Easy | ✅ Yes | Self-managed Kafka on K8s, CNCF project |
| Confluent Operator | Medium | ✅ Yes | Enterprise features, Confluent ecosystem |
| Bitnami Helm Chart | Easy | ⚠️ Mostly | Quick dev/staging environments |
| Custom StatefulSet | Hard | ⚠️ Requires expertise | Full control, custom requirements |
Recommendation: Strimzi Operator for most production use cases (CNCF project, active community, KRaft support)
Deployment Approach 1: Strimzi Operator (Recommended)
Strimzi is a CNCF Sandbox project providing Kubernetes operators for Apache Kafka.
Features
- ✅ KRaft mode support (Kafka 3.6+, no ZooKeeper)
- ✅ Declarative Kafka management (CRDs)
- ✅ Automatic rolling upgrades
- ✅ Built-in monitoring (Prometheus metrics)
- ✅ Mirror Maker 2 for replication
- ✅ Kafka Connect integration
- ✅ User and topic management via CRDs
Installation (Helm)
# 1. Add Strimzi Helm repository
helm repo add strimzi https://strimzi.io/charts/
helm repo update
# 2. Create namespace
kubectl create namespace kafka
# 3. Install Strimzi Operator
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
--namespace kafka \
--set watchNamespaces="{kafka}" \
--version 0.40.0   # Kafka 3.7.0 (used below) requires Strimzi 0.40.0 or newer
# 4. Verify operator is running
kubectl get pods -n kafka
# Output: strimzi-cluster-operator-... Running
Deploy Kafka Cluster (KRaft Mode)
# kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: kafka-pool
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
replicas: 3
roles:
- controller
- broker
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 100Gi
class: fast-ssd
deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-kafka-cluster
namespace: kafka
annotations:
strimzi.io/kraft: enabled
strimzi.io/node-pools: enabled
spec:
kafka:
version: 3.7.0
metadataVersion: 3.7-IV4
replicas: 3
listeners:
- name: plain
port: 9092
type: internal
tls: false
- name: tls
port: 9093
type: internal
tls: true
authentication:
type: tls
- name: external
port: 9094
type: loadbalancer
tls: true
authentication:
type: tls
config:
default.replication.factor: 3
min.insync.replicas: 2
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
auto.create.topics.enable: false
log.retention.hours: 168
log.segment.bytes: 1073741824
compression.type: lz4
resources:
requests:
memory: 4Gi
cpu: "2"
limits:
memory: 8Gi
cpu: "4"
jvmOptions:
-Xms: 2048m
-Xmx: 4096m
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
# Apply Kafka cluster
kubectl apply -f kafka-cluster.yaml
# Wait for cluster to be ready (5-10 minutes)
kubectl wait kafka/my-kafka-cluster --for=condition=Ready --timeout=600s -n kafka
# Check status
kubectl get kafka -n kafka
# Output: my-kafka-cluster 3.7.0 3 True
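Once the cluster reports Ready, a quick connectivity check is to list topics from a throwaway client pod running the Strimzi Kafka image (a minimal sketch; the image tag is an assumption, so align it with your operator and Kafka versions):
# Run a disposable client pod and list topics over the plain internal listener (9092)
kubectl -n kafka run kafka-client -ti --rm --restart=Never \
  --image=quay.io/strimzi/kafka:0.40.0-kafka-3.7.0 -- \
  bin/kafka-topics.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --list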
Create Topics (Declaratively)
# kafka-topics.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: user-events
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 12
replicas: 3
config:
retention.ms: 604800000 # 7 days
segment.bytes: 1073741824
compression.type: lz4
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: order-events
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 6
replicas: 3
config:
retention.ms: 2592000000 # 30 days
min.insync.replicas: 2
# Apply topics
kubectl apply -f kafka-topics.yaml
# Verify topics created
kubectl get kafkatopics -n kafka
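To inspect partition and replica assignments, you can also run the standard Kafka tooling from inside a broker pod (the pod name below assumes the KafkaNodePool defined earlier, named kafka-pool):
# Describe the topic from inside a broker; pods are named <cluster>-<node-pool>-<id>
kubectl exec -n kafka my-kafka-cluster-kafka-pool-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic user-events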
Create Users (Declaratively)
# kafka-users.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: my-producer
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
authentication:
type: tls
authorization:
type: simple
acls:
- resource:
type: topic
name: user-events
patternType: literal
operations: [Write, Describe]
- resource:
type: topic
name: order-events
patternType: literal
operations: [Write, Describe]
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: my-consumer
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
authentication:
type: tls
authorization:
type: simple
acls:
- resource:
type: topic
name: user-events
patternType: literal
operations: [Read, Describe]
- resource:
type: group
name: my-consumer-group
patternType: literal
operations: [Read]
# Apply users
kubectl apply -f kafka-users.yaml
# Get user credentials (TLS certificates)
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.crt}' | base64 -d > producer.crt
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.key}' | base64 -d > producer.key
kubectl get secret my-kafka-cluster-cluster-ca-cert -n kafka -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
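These PEM files can be used directly by librdkafka-based clients such as kcat. A minimal sketch, assuming you run it from a machine that can resolve the in-cluster bootstrap service:
# List cluster metadata over the mTLS listener (9093) using the extracted credentials
kcat -L \
  -b my-kafka-cluster-kafka-bootstrap.kafka.svc:9093 \
  -X security.protocol=ssl \
  -X ssl.ca.location=ca.crt \
  -X ssl.certificate.location=producer.crt \
  -X ssl.key.location=producer.key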
Deployment Approach 2: Confluent Operator
Confluent for Kubernetes (CFK) provides enterprise-grade Kafka management.
Features
- ✅ Full Confluent Platform (Kafka, Schema Registry, ksqlDB, Connect)
- ✅ Hybrid deployments (K8s + on-prem)
- ✅ Rolling upgrades with zero downtime
- ✅ Multi-region replication
- ✅ Advanced security (RBAC, encryption)
- ⚠️ Requires Confluent Platform license (paid)
Installation
# 1. Add Confluent Helm repository
helm repo add confluentinc https://packages.confluent.io/helm
helm repo update
# 2. Create namespace
kubectl create namespace confluent
# 3. Install Confluent Operator
helm install confluent-operator confluentinc/confluent-for-kubernetes \
--namespace confluent \
--version 0.921.11
# 4. Verify
kubectl get pods -n confluent
Deploy Kafka Cluster
# kafka-cluster-confluent.yaml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
name: kafka
namespace: confluent
spec:
replicas: 3
image:
application: confluentinc/cp-server:7.6.0
init: confluentinc/confluent-init-container:2.7.0
dataVolumeCapacity: 100Gi
storageClass:
name: fast-ssd
metricReporter:
enabled: true
listeners:
internal:
authentication:
type: plain
tls:
enabled: true
external:
authentication:
type: plain
tls:
enabled: true
dependencies:
zookeeper:
endpoint: zookeeper.confluent.svc.cluster.local:2181
podTemplate:
resources:
requests:
memory: 4Gi
cpu: 2
limits:
memory: 8Gi
cpu: 4
# Apply Kafka cluster
kubectl apply -f kafka-cluster-confluent.yaml
# Wait for cluster
kubectl wait kafka/kafka --for=condition=Ready --timeout=600s -n confluent
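Note that the dependencies block above assumes a Zookeeper CR named zookeeper already exists in the confluent namespace. Once the brokers are up, other Confluent Platform components are deployed the same way; a minimal Schema Registry sketch (image tags are assumptions, align them with your CFK release):
# schema-registry.yaml
apiVersion: platform.confluent.io/v1beta1
kind: SchemaRegistry
metadata:
  name: schemaregistry
  namespace: confluent
spec:
  replicas: 1
  image:
    application: confluentinc/cp-schema-registry:7.6.0
    init: confluentinc/confluent-init-container:2.7.0
# Apply: kubectl apply -f schema-registry.yaml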
Deployment Approach 3: Bitnami Helm Chart (Dev/Staging)
The Bitnami Helm chart is the quickest way to stand up Kafka, but it is less suitable for production.
Installation
# 1. Add Bitnami repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# 2. Install Kafka (KRaft mode)
helm install kafka bitnami/kafka \
--namespace kafka \
--create-namespace \
--set kraft.enabled=true \
--set controller.replicaCount=3 \
--set broker.replicaCount=3 \
--set persistence.size=100Gi \
--set persistence.storageClass=fast-ssd \
--set metrics.kafka.enabled=true \
--set metrics.jmx.enabled=true
# 3. In-cluster bootstrap server (the chart's kafka service is ClusterIP by default)
export KAFKA_BOOTSTRAP=kafka.kafka.svc.cluster.local:9092
# For access from outside the cluster, install with --set externalAccess.enabled=true
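To confirm the release came up cleanly, check the pods and their persistent volume claims (the label below matches the chart's default app.kubernetes.io/name value):
# Verify controller/broker pods and storage
kubectl get pods -n kafka -l app.kubernetes.io/name=kafka
kubectl get pvc -n kafka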
Limitations:
- ⚠️ Less production-ready than Strimzi/Confluent
- ⚠️ Limited declarative topic/user management
- ⚠️ Fewer advanced features (no MirrorMaker 2, limited RBAC)
Production Best Practices
1. Storage Configuration
Use SSD-backed storage classes for Kafka logs:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com # AWS EBS CSI driver (use pd.csi.storage.gke.io on GKE)
parameters:
  type: gp3 # AWS EBS gp3 (or io2 for extreme performance)
  iops: "3000" # gp3 parameter; iopsPerGB applies only to io1/io2
  throughput: "125"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
Kafka storage requirements:
- Min IOPS: 3000+ per broker
- Min Throughput: 125 MB/s per broker
- Persistent: Use deleteClaim: false (don't delete data on pod deletion)
2. Resource Limits
resources:
requests:
memory: 4Gi
cpu: "2"
limits:
memory: 8Gi
cpu: "4"
jvmOptions:
-Xms: 2048m # Initial heap (50% of memory request)
-Xmx: 4096m # Max heap (50% of memory limit, leave room for OS cache)
Sizing guidelines:
- Small (dev): 2 CPU, 4Gi memory
- Medium (staging): 4 CPU, 8Gi memory
- Large (production): 8 CPU, 16Gi memory
3. Pod Disruption Budgets
Ensure high availability during K8s upgrades:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: kafka-pdb
namespace: kafka
spec:
maxUnavailable: 1
selector:
matchLabels:
app.kubernetes.io/name: kafka
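The standalone PodDisruptionBudget above is mainly for custom StatefulSet deployments; Strimzi already creates a PDB for the brokers, and you can tune it through the Kafka CR template instead. A sketch, assuming the Strimzi cluster from earlier:
spec:
  kafka:
    template:
      podDisruptionBudget:
        maxUnavailable: 1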
4. Affinity Rules
Spread brokers across availability zones:
spec:
kafka:
template:
pod:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: strimzi.io/name
operator: In
values:
- my-kafka-cluster-kafka
topologyKey: topology.kubernetes.io/zone
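In addition to anti-affinity, Strimzi can propagate the zone into each broker's broker.rack setting so Kafka itself spreads partition replicas across zones; a short sketch for the same Kafka CR:
spec:
  kafka:
    rack:
      topologyKey: topology.kubernetes.io/zone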
5. Network Policies
Restrict access to Kafka brokers:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kafka-network-policy
namespace: kafka
spec:
podSelector:
matchLabels:
strimzi.io/name: my-kafka-cluster-kafka
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: my-producer
- podSelector:
matchLabels:
app: my-consumer
ports:
- protocol: TCP
port: 9092
- protocol: TCP
port: 9093
Monitoring Integration
Prometheus + Grafana Setup
Strimzi provides built-in Prometheus metrics exporter:
# kafka-metrics-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-metrics
namespace: kafka
data:
kafka-metrics-config.yml: |
# Use JMX Exporter config from:
# plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames:
- "kafka.server:type=BrokerTopicMetrics,name=*"
# ... (copy from kafka-jmx-exporter.yml)
# Apply metrics config
kubectl apply -f kafka-metrics-configmap.yaml
# Install Prometheus Operator (if not already installed)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Create PodMonitor for Kafka
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: kafka-metrics
namespace: kafka
spec:
selector:
matchLabels:
strimzi.io/kind: Kafka
podMetricsEndpoints:
- port: tcp-prometheus
interval: 30s
EOF
# Access Grafana dashboards (from kafka-observability skill)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open: http://localhost:3000
# Dashboards: Kafka Cluster Overview, Broker Metrics, Consumer Lag, Topic Metrics, JVM Metrics
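Once metrics are being scraped, alerting rules can be managed through the Prometheus Operator as well. A minimal sketch for under-replicated partitions; the metric name depends on the JMX exporter rules you configured, so treat it as an assumption:
# kafka-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
  labels:
    release: prometheus # must match the kube-prometheus-stack ruleSelector labels
spec:
  groups:
    - name: kafka
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: One or more Kafka partitions are under-replicated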
Troubleshooting
"Pods stuck in Pending state"
Cause: Insufficient resources or storage class not found.
Fix:
# Check events
kubectl describe pod my-kafka-cluster-kafka-pool-0 -n kafka # pods are named <cluster>-<node-pool>-<id>
# Check storage class exists
kubectl get storageclass
# If missing, create fast-ssd storage class (see Production Best Practices above)
"Kafka broker not ready after 10 minutes"
Cause: Slow storage provisioning or resource limits too low.
Fix:
# Check broker logs
kubectl logs my-kafka-cluster-kafka-pool-0 -n kafka
# Common issues:
# 1. Low IOPS on storage → Use GP3 or better
# 2. Low memory → Increase resources.requests.memory
# 3. KRaft quorum not formed → Check all brokers are running
"Cannot connect to Kafka from outside K8s"
Cause: External listener not configured.
Fix:
# Add external listener (Strimzi)
spec:
kafka:
listeners:
- name: external
port: 9094
type: loadbalancer
tls: true
authentication:
type: tls
# Get external bootstrap server
kubectl get kafka my-kafka-cluster -n kafka -o jsonpath='{.status.listeners[?(@.name=="external")].bootstrapServers}'
Scaling Operations
Horizontal Scaling (Add Brokers)
# Strimzi: Update KafkaNodePool replicas
kubectl patch kafkanodepool kafka-pool -n kafka --type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
# Confluent: Update Kafka CR
kubectl patch kafka kafka -n confluent --type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
# Wait for new brokers
kubectl get pods -n kafka -l strimzi.io/cluster=my-kafka-cluster -w # Strimzi manages pods via StrimziPodSets, not StatefulSets
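Adding brokers does not move existing partitions onto them. With Strimzi, enable Cruise Control in the Kafka CR (spec.cruiseControl: {}) and request a rebalance declaratively; a sketch:
# kafka-rebalance.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: full-rebalance
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec: {} # empty spec requests a default full-rebalance proposal
Apply it, review the proposal in the resource status, then approve it with: kubectl annotate kafkarebalance full-rebalance -n kafka strimzi.io/rebalance=approve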
Vertical Scaling (Change Resources)
# Update resources in Kafka CR
kubectl patch kafka my-kafka-cluster -n kafka --type='json' \
-p='[
{"op": "replace", "path": "/spec/kafka/resources/requests/memory", "value": "8Gi"},
{"op": "replace", "path": "/spec/kafka/resources/requests/cpu", "value": "4"}
]'
# Rolling restart will happen automatically
Integration with Other Skills
- kafka-iac-deployment: Alternative to K8s (use Terraform for cloud-managed Kafka)
- kafka-observability: Set up Prometheus + Grafana dashboards for K8s Kafka
- kafka-architecture: Cluster sizing and partitioning strategy
- kafka-cli-tools: Test K8s Kafka cluster with kcat
Quick Reference Commands
# Strimzi
kubectl get kafka -n kafka # List Kafka clusters
kubectl get kafkatopics -n kafka # List topics
kubectl get kafkausers -n kafka # List users
kubectl logs my-kafka-cluster-kafka-pool-0 -n kafka # Check broker logs
# Confluent
kubectl get kafka -n confluent # List Kafka clusters
kubectl get schemaregistry -n confluent # List Schema Registry
kubectl get ksqldb -n confluent # List ksqlDB
# Port-forward for testing
kubectl port-forward -n kafka svc/my-kafka-cluster-kafka-bootstrap 9092:9092
Next Steps After K8s Deployment:
- Use kafka-observability skill to verify Prometheus metrics and Grafana dashboards
- Use kafka-cli-tools skill to test cluster with kcat
- Deploy your producer/consumer applications to K8s
- Set up GitOps for declarative topic/user management (ArgoCD, Flux), as sketched below
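As a starting point for that GitOps step, the KafkaTopic and KafkaUser manifests above can live in Git and be synced by Argo CD. A minimal Application sketch; the repository URL and path are placeholders:
# kafka-config-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kafka-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/kafka-config.git # placeholder repository
    targetRevision: main
    path: kafka # directory containing KafkaTopic/KafkaUser manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: kafka
  syncPolicy:
    automated:
      prune: true
      selfHeal: true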