Kafka Monitoring & Observability

Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.

When to Use This Skill

I activate when you need help with:

  • Monitoring setup: "Set up Kafka monitoring", "configure Prometheus for Kafka", "Grafana dashboards for Kafka"
  • Metrics collection: "Kafka JMX metrics", "export Kafka metrics to Prometheus"
  • Alerting: "Kafka alerting rules", "alert on under-replicated partitions", "critical Kafka metrics"
  • Troubleshooting: "Monitor Kafka performance", "track consumer lag", "broker health monitoring"

What I Know

Available Monitoring Components

This plugin provides a complete monitoring stack:

1. Prometheus JMX Exporter Configuration

  • Location: plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml
  • Purpose: Export Kafka JMX metrics to Prometheus format
  • Metrics Exported:
    • Broker topic metrics (bytes in/out, messages in, request rate)
    • Replica manager (under-replicated partitions, ISR shrinks/expands)
    • Controller metrics (active controller, offline partitions, leader elections)
    • Request metrics (produce/fetch latency)
    • Log metrics (flush rate, flush latency)
    • JVM metrics (heap, GC, threads, file descriptors)
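
For orientation, the exporter config works by mapping JMX MBean patterns to Prometheus metric names. A minimal illustrative rule, assuming the same naming convention as the bundled config (the real kafka-jmx-exporter.yml is the authoritative, much fuller version):

# Illustrative sketch only -- maps BrokerTopicMetrics MBeans to Prometheus counters
lowercaseOutputName: true
rules:
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>Count
    name: kafka_server_broker_topic_metrics_$1_total
    type: COUNTER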

2. Grafana Dashboards (5 Dashboards)

  • Location: plugins/specweave-kafka/monitoring/grafana/dashboards/
  • Dashboards:
    1. kafka-cluster-overview.json - Cluster health and throughput
    2. kafka-broker-metrics.json - Per-broker performance
    3. kafka-consumer-lag.json - Consumer lag monitoring
    4. kafka-topic-metrics.json - Topic-level metrics
    5. kafka-jvm-metrics.json - JVM health (heap, GC, threads)

3. Grafana Provisioning

  • Location: plugins/specweave-kafka/monitoring/grafana/provisioning/
  • Files:
    • dashboards/kafka.yml - Dashboard provisioning config
    • datasources/prometheus.yml - Prometheus datasource config

Setup Workflow 1: JMX Exporter (Self-Hosted Kafka)

For Kafka running on VMs or bare metal (non-Kubernetes).

Step 1: Download JMX Prometheus Agent

# Download JMX Prometheus agent JAR
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar

# Copy JMX Exporter config
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml

Step 2: Configure Kafka Broker

Add JMX exporter to Kafka startup script:

# Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"

Or add to kafka-server-start.sh:

export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"

Step 3: Restart Kafka and Verify

# Reload systemd units (needed after editing the service file) and restart Kafka
sudo systemctl daemon-reload
sudo systemctl restart kafka

# Verify JMX exporter is running (port 7071)
curl localhost:7071/metrics | grep kafka_server

# Expected output: kafka_server_broker_topic_metrics_bytesin_total{...} 12345

Step 4: Configure Prometheus Scraping

Add Kafka brokers to Prometheus config:

# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
        - 'kafka-broker-1:7071'
        - 'kafka-broker-2:7071'
        - 'kafka-broker-3:7071'
    scrape_interval: 30s

# Reload Prometheus
sudo systemctl reload prometheus

# OR send SIGHUP
kill -HUP $(pidof prometheus)

# Verify scraping
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
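
Before reloading, you can validate the edited config offline with promtool, which ships with Prometheus (path assumes a standard package install):

# Sanity-check prometheus.yml before reloading
promtool check config /etc/prometheus/prometheus.yml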

Setup Workflow 2: Strimzi (Kubernetes)

For Kafka running on Kubernetes with Strimzi Operator.

Step 1: Create JMX Exporter ConfigMap

# Create ConfigMap from JMX exporter config
kubectl create configmap kafka-metrics \
  --from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
  -n kafka
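
To confirm the ConfigMap landed with the expected key, a quick check (the backslash escapes the dots in the key name for jsonpath):

# Print the first lines of the exporter rules stored in the ConfigMap
kubectl get configmap kafka-metrics -n kafka \
  -o jsonpath='{.data.kafka-metrics-config\.yml}' | head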

Step 2: Configure Kafka CR with Metrics

# kafka-cluster.yaml (add metricsConfig section)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3

    # ... other config ...

    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml

# Apply updated Kafka CR
kubectl apply -f kafka-cluster.yaml

# Verify metrics endpoint (wait for rolling restart)
kubectl exec -it kafka-my-kafka-cluster-0 -n kafka -- curl localhost:9404/metrics | grep kafka_server

Step 3: Install Prometheus Operator (if not installed)

# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

Step 4: Create PodMonitor for Kafka

# kafka-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-metrics
  namespace: kafka
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  podMetricsEndpoints:
    - port: tcp-prometheus
      interval: 30s

# Apply PodMonitor
kubectl apply -f kafka-podmonitor.yaml

# Verify Prometheus is scraping Kafka
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Should see kafka-metrics/* targets
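
With the port-forward still active, you can also confirm data is flowing via the HTTP API; an empty result list means scraping has not started yet:

curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions' \
  | jq '.data.result'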

Setup Workflow 3: Grafana Dashboards

Installation (Docker Compose)

If using Docker Compose for local development:

# docker-compose.yml (add to existing Kafka setup)
version: '3.8'
services:
  # ... Kafka services ...

  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  grafana-data:

# Start monitoring stack
docker-compose up -d prometheus grafana

# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin
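
The compose file mounts ./monitoring/prometheus/prometheus.yml, which you supply yourself. A minimal sketch, assuming the Kafka service is named kafka and exposes the JMX exporter on port 7071:

# ./monitoring/prometheus/prometheus.yml (service name and port are assumptions)
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:7071']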

Installation (Kubernetes)

Dashboards are auto-provisioned if using kube-prometheus-stack:

# Create ConfigMaps for each dashboard
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "kafka-dashboard-$name" \
    --from-file="$dashboard" \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
done

# Label ConfigMaps for Grafana auto-discovery (kubectl does not glob resource
# names, so list and loop instead)
for cm in $(kubectl get configmap -n monitoring -o name | grep kafka-dashboard-); do
  kubectl label "$cm" -n monitoring grafana_dashboard=1 --overwrite
done

# Grafana will auto-import dashboards (wait 30-60 seconds)

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# URL: http://localhost:3000
# Username: admin
# Password: prom-operator (default kube-prometheus-stack password)

Manual Dashboard Import

If auto-provisioning doesn't work:

# 1. Access Grafana UI
# 2. Go to: Dashboards → Import
# 3. Upload JSON files from:
#    plugins/specweave-kafka/monitoring/grafana/dashboards/

# Or use the Grafana API. The /api/dashboards/db endpoint expects the payload
# wrapped in a {"dashboard": ...} envelope, so wrap raw dashboard exports with jq:
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  jq '{dashboard: ., overwrite: true}' "$dashboard" | \
    curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
      -H "Content-Type: application/json" \
      -d @-
done

Dashboard Overview

1. Kafka Cluster Overview (kafka-cluster-overview.json)

Purpose: High-level cluster health

Key Metrics:

  • Active Controller Count (should be exactly 1)
  • Under-Replicated Partitions (should be 0) ⚠️ CRITICAL
  • Offline Partitions Count (should be 0) ⚠️ CRITICAL
  • Unclean Leader Elections (should be 0)
  • Cluster Throughput (bytes in/out per second)
  • Request Rate (produce, fetch requests per second)
  • ISR Changes (shrinks/expands)
  • Leader Election Rate

Use When: Checking overall cluster health
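
To spot-check these panels outside Grafana, the equivalent PromQL can be run directly in the Prometheus UI (metric names assume the bundled exporter config):

# Should equal exactly 1 across the cluster
sum(kafka_controller_active_controller_count)

# Should equal 0; anything else signals data-loss risk
sum(kafka_server_replica_manager_under_replicated_partitions)

# Cluster bytes-in per second, averaged over 5 minutes
sum(rate(kafka_server_broker_topic_metrics_bytesin_total[5m]))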

2. Kafka Broker Metrics (kafka-broker-metrics.json)

Purpose: Per-broker performance

Key Metrics:

  • Broker CPU Usage (% utilization)
  • Broker Heap Memory Usage
  • Broker Network Throughput (bytes in/out)
  • Request Handler Idle Percentage (low = CPU saturation)
  • File Descriptors (open vs max)
  • Log Flush Latency (p50, p99)
  • JVM GC Collection Count/Time

Use When: Investigating broker performance issues

3. Kafka Consumer Lag (kafka-consumer-lag.json)

Purpose: Consumer lag monitoring

Key Metrics:

  • Consumer Lag per Topic/Partition
  • Total Lag per Consumer Group
  • Offset Commit Rate
  • Current Consumer Offset
  • Log End Offset (producer offset)
  • Consumer Group Members

Use When: Troubleshooting slow consumers or lag spikes
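
Note that lag metrics come from Kafka Exporter rather than the JMX exporter (see Troubleshooting below). Two useful spot-check queries, assuming the exporter's default metric names:

# Total lag per consumer group
sum by (consumergroup) (kafka_consumergroup_lag)

# Lag trend: a sustained positive slope means consumers are falling behind
sum by (consumergroup) (deriv(kafka_consumergroup_lag[5m]))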

4. Kafka Topic Metrics (kafka-topic-metrics.json)

Purpose: Topic-level metrics

Key Metrics:

  • Messages Produced per Topic
  • Bytes per Topic (in/out)
  • Partition Count per Topic
  • Replication Factor
  • In-Sync Replicas
  • Log Size per Partition
  • Current Offset per Partition
  • Partition Leader Distribution

Use When: Analyzing topic throughput and hotspots

5. Kafka JVM Metrics (kafka-jvm-metrics.json)

Purpose: JVM health monitoring

Key Metrics:

  • Heap Memory Usage (used vs max)
  • Heap Utilization Percentage
  • GC Collection Rate (collections/sec)
  • GC Collection Time (ms/sec)
  • JVM Thread Count
  • Heap Memory by Pool (young gen, old gen, survivor)
  • Off-Heap Memory Usage (metaspace, code cache)
  • GC Pause Time Percentiles (p50, p95, p99)

Use When: Investigating memory leaks or GC pauses
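
The heap and GC panels mirror the alert expressions defined below; the same queries can be run ad hoc (metric names match the alert rules in the next section):

# Heap utilization per broker (alerting threshold below is 0.9)
jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"}

# GC time in ms per second; sustained values near 500 suggest tuning is needed
rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m])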

Critical Alerts Configuration

Create Prometheus alerting rules for critical Kafka metrics:

# kafka-alerts.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
spec:
  groups:
    - name: kafka.rules
      interval: 30s
      rules:
        # CRITICAL: Under-Replicated Partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated. Data loss risk!"

        # CRITICAL: Offline Partitions
        - alert: KafkaOfflinePartitions
          expr: kafka_controller_offline_partitions_count > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline. Service degradation!"

        # CRITICAL: No Active Controller
        - alert: KafkaNoActiveController
          expr: kafka_controller_active_controller_count == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"
            description: "Cluster has no active controller. Cannot perform administrative operations!"

        # WARNING: High Consumer Lag
        - alert: KafkaConsumerLagHigh
          expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} has high lag"
            description: "Lag is {{ $value }} messages. Consumers may be slow."

        # WARNING: High CPU Usage
        - alert: KafkaBrokerHighCPU
          expr: os_process_cpu_load{job="kafka"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has high CPU usage"
            description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."

        # WARNING: Low Heap Memory
        - alert: KafkaBrokerLowHeapMemory
          expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has low heap memory"
            description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"

        # WARNING: High GC Time
        - alert: KafkaBrokerHighGCTime
          expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} spending too much time in GC"
            description: "GC time is {{ $value }}ms/sec. Application pauses likely."
# Apply alerts (Kubernetes)
kubectl apply -f kafka-alerts.yml

# Verify alerts loaded
kubectl get prometheusrules -n monitoring
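
For self-hosted Prometheus (no Prometheus Operator), the same rules go in a plain rules file instead; a sketch, with paths as assumptions:

# /etc/prometheus/rules/kafka-alerts.yml should contain only the groups: block
# from spec: above (no PrometheusRule wrapper). Then reference it:
# prometheus.yml
rule_files:
  - /etc/prometheus/rules/kafka-alerts.yml
# ...and reload: kill -HUP $(pidof prometheus)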

Troubleshooting

"Prometheus not scraping Kafka metrics"

Symptoms: No Kafka metrics in Prometheus

Fix:

# 1. Verify JMX exporter is running
curl http://kafka-broker:7071/metrics

# 2. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'

# 3. Check Prometheus logs
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0

# Common issues:
# - Firewall blocking port 7071
# - Incorrect scrape config
# - Kafka broker not running

"Grafana dashboards not loading"

Symptoms: Dashboards show "No data"

Fix:

# 1. Verify Prometheus datasource
# Grafana UI → Configuration → Data Sources → Prometheus → Test

# 2. Check if Kafka metrics exist in Prometheus
# Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total

# 3. Verify dashboard queries match your Prometheus job name
# Dashboard panels use job="kafka" by default
# If your job name is different, update dashboard JSON
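
If you do need to rewrite the job label, a one-line sketch (kafka-prod is a placeholder for your actual job name):

# Rewrite the job label in a dashboard JSON before importing
sed -i 's/job="kafka"/job="kafka-prod"/g' kafka-cluster-overview.json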

"Consumer lag metrics missing"

Symptoms: Consumer lag dashboard empty

Fix: Consumer lag metrics require Kafka Exporter (separate from JMX Exporter):

# Install Kafka Exporter (Kubernetes)
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
  --namespace monitoring \
  --set "kafkaServer={kafka-bootstrap:9092}"

# Or run as Docker container
docker run -d -p 9308:9308 \
  danielqsj/kafka-exporter \
  --kafka.server=kafka:9092 \
  --web.listen-address=:9308

# Add to Prometheus scrape config
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
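
Once scraped, the exporter's gauges should appear; kafka_consumergroup_lag is its lag metric:

# Verify Kafka Exporter is emitting lag metrics
curl -s localhost:9308/metrics | grep kafka_consumergroup_lag | head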

Integration with Other Skills

  • kafka-iac-deployment: Set up monitoring during Terraform deployment
  • kafka-kubernetes: Configure monitoring for Strimzi Kafka on K8s
  • kafka-architecture: Use cluster sizing metrics to validate capacity planning
  • kafka-cli-tools: Use kcat to generate test traffic and verify metrics

Quick Reference Commands

# Check JMX exporter metrics
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"

# Prometheus query examples
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'

# Grafana dashboard export
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json

# Reload Prometheus config
kill -HUP $(pidof prometheus)

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'

Next Steps After Monitoring Setup:

  1. Review all 5 Grafana dashboards to familiarize yourself with metrics
  2. Set up alerting (Slack, PagerDuty, email)
  3. Create runbooks for critical alerts (under-replicated partitions, offline partitions, no controller)
  4. Monitor for 7 days to establish baseline metrics
  5. Tune JVM settings based on GC metrics