skills/mindrally/skills/kafka-development

kafka-development

SKILL.md

Kafka Development

You are an expert in Apache Kafka event streaming and distributed messaging systems. Follow these best practices when building Kafka-based applications.

Core Principles

  • Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
  • Unlike traditional pub/sub, Kafka uses a pull model - consumers pull messages from partitions
  • Design for scalability, durability, and exactly-once semantics where needed
  • Leave NO todos, placeholders, or missing pieces in the implementation

Architecture Overview

Core Components

  • Topics: Categories/feeds for organizing messages
  • Partitions: Ordered, immutable sequences within topics enabling parallelism
  • Producers: Clients that publish messages to topics
  • Consumers: Clients that read messages from topics
  • Consumer Groups: Coordinate consumption across multiple consumers
  • Brokers: Kafka servers that store data and serve clients

Key Concepts

  • Messages are appended to partitions in order
  • Each message has an offset - a unique sequential ID within the partition
  • Consumers maintain their own cursor (offset) and can read streams repeatedly
  • Partitions are distributed across brokers for scalability

Topic Design

Partitioning Strategy

  • Use partition keys to place related events in the same partition
  • Messages with the same key always go to the same partition
  • This ensures ordering for related events
  • Choose keys carefully - uneven distribution causes hot partitions

Partition Count

  • More partitions = more parallelism but more overhead
  • Consider: expected throughput, consumer count, broker resources
  • Start with number of consumers you expect to run concurrently
  • Partitions can be increased but not decreased

Topic Configuration

  • retention.ms: How long to keep messages (default 7 days)
  • retention.bytes: Maximum size per partition
  • cleanup.policy: delete (remove old) or compact (keep latest per key)
  • min.insync.replicas: Minimum replicas that must acknowledge

Producer Best Practices

Reliability Settings

acks=all               # Wait for all replicas to acknowledge
retries=MAX_INT        # Retry on transient failures
enable.idempotence=true # Prevent duplicate messages on retry

Performance Tuning

  • batch.size: Accumulate messages before sending (default 16KB)
  • linger.ms: Wait time for batching (0 = send immediately)
  • buffer.memory: Total memory for buffering unsent messages
  • compression.type: gzip, snappy, lz4, or zstd for bandwidth savings

Error Handling

  • Implement retry logic with exponential backoff
  • Handle retriable vs non-retriable exceptions differently
  • Log and alert on send failures
  • Consider dead letter topics for messages that fail repeatedly

Partitioner

  • Default: hash of key determines partition (null key = round-robin)
  • Custom partitioners for specific routing needs
  • Ensure even distribution to avoid hot partitions

Consumer Best Practices

Offset Management

  • Consumers track which messages they've processed via offsets
  • auto.offset.reset: earliest (start from beginning) or latest (only new messages)
  • Commit offsets after successful processing, not before
  • Use enable.auto.commit=false for exactly-once semantics

Consumer Groups

  • Consumers in a group share partitions (each partition to one consumer)
  • More consumers than partitions = some consumers idle
  • Group rebalancing occurs when consumers join/leave
  • Use group.instance.id for static membership to reduce rebalances

Processing Patterns

  • Process messages in order within a partition
  • Handle out-of-order messages across partitions if needed
  • Implement idempotent processing for at-least-once delivery
  • Consider transactional processing for exactly-once

Timeouts and Failures

  • Implement processing timeout to isolate slow events
  • When timeout occurs, set event aside and continue to next message
  • Maintain overall system performance over processing every single event
  • Use dead letter queues for messages failing all retries

Error Handling and Retry

Retry Strategy

  • Allow multiple runtime retries per processing attempt
  • Example: 3 runtime retries per redrive, maximum 5 redrives = 15 total retries
  • Runtime retries typically cover 99% of failures
  • After exhausting retries, route to dead letter queue

Dead Letter Topics

  • Create dedicated DLT for messages that can't be processed
  • Include original topic, partition, offset, and error details
  • Monitor DLT for patterns indicating systemic issues
  • Implement manual or automated retry from DLT

Schema Management

Schema Registry

  • Use Confluent Schema Registry for schema management
  • Producers validate data against registered schemas during serialization
  • Schema mismatches throw exceptions, preventing malformed data
  • Provides common reference for producers and consumers

Schema Evolution

  • Design schemas for forward and backward compatibility
  • Add optional fields with defaults for backward compatibility
  • Avoid removing or renaming fields
  • Use schema versioning and migration strategies

Kafka Streams

State Management

  • Implement log compaction to maintain latest version of each key
  • Periodically purge old data from state stores
  • Monitor state store size and access patterns
  • Use appropriate storage backends for your scale

Windowing Operations

  • Handle out-of-order events and skewed timestamps
  • Use appropriate time extraction and watermarking techniques
  • Configure grace periods for late-arriving data
  • Choose window types based on use case (tumbling, hopping, sliding, session)

Security

Authentication

  • Use SASL/SSL for client authentication
  • Support SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
  • Enable SSL for encryption in transit
  • Rotate credentials regularly

Authorization

  • Use Kafka ACLs for fine-grained access control
  • Grant minimum necessary permissions per principal
  • Separate read/write permissions by topic
  • Audit access patterns regularly

Monitoring and Observability

Key Metrics

  • Producer: record-send-rate, record-error-rate, batch-size-avg
  • Consumer: records-consumed-rate, records-lag, commit-latency
  • Broker: under-replicated-partitions, request-latency, disk-usage

Lag Monitoring

  • Consumer lag = last produced offset - last committed offset
  • High lag indicates consumers can't keep up
  • Alert on increasing lag trends
  • Scale consumers or optimize processing

Distributed Tracing

  • Propagate trace context in message headers
  • Use OpenTelemetry for end-to-end tracing
  • Correlate producer and consumer spans
  • Track message journey through the pipeline

Testing

Unit Testing

  • Mock Kafka clients for isolated testing
  • Test serialization/deserialization logic
  • Verify partitioning logic
  • Test error handling paths

Integration Testing

  • Use embedded Kafka or Testcontainers
  • Test full producer-consumer flows
  • Verify exactly-once semantics if used
  • Test rebalancing scenarios

Performance Testing

  • Load test with production-like message rates
  • Test consumer throughput and lag behavior
  • Verify broker resource usage under load
  • Test failure and recovery scenarios

Common Patterns

Event Sourcing

  • Store all state changes as immutable events
  • Rebuild state by replaying events
  • Use log compaction for snapshots
  • Enable time-travel debugging

CQRS (Command Query Responsibility Segregation)

  • Separate write (command) and read (query) models
  • Use Kafka as the event store
  • Build read-optimized projections from events
  • Handle eventual consistency appropriately

Saga Pattern

  • Coordinate distributed transactions across services
  • Each service publishes events for next step
  • Implement compensating transactions for rollback
  • Use correlation IDs to track saga instances

Change Data Capture (CDC)

  • Capture database changes as Kafka events
  • Use Debezium or similar CDC tools
  • Enable real-time data synchronization
  • Build event-driven integrations
Weekly Installs
1
Installed on
windsurf1
opencode1
codex1
claude-code1
antigravity1
gemini-cli1