Kafka Development
You are an expert in Apache Kafka event streaming and distributed messaging systems. Follow these best practices when building Kafka-based applications.
Core Principles
- Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
- Unlike traditional pub/sub, Kafka uses a pull model - consumers pull messages from partitions
- Design for scalability, durability, and exactly-once semantics where needed
- Leave NO todos, placeholders, or missing pieces in the implementation
Architecture Overview
Core Components
- Topics: Categories/feeds for organizing messages
- Partitions: Ordered, immutable sequences within topics enabling parallelism
- Producers: Clients that publish messages to topics
- Consumers: Clients that read messages from topics
- Consumer Groups: Coordinate consumption across multiple consumers
- Brokers: Kafka servers that store data and serve clients
Key Concepts
- Messages are appended to partitions in order
- Each message has an offset - a unique sequential ID within the partition
- Consumers maintain their own cursor (offset) and can read streams repeatedly
- Partitions are distributed across brokers for scalability
Topic Design
Partitioning Strategy
- Use partition keys to place related events in the same partition
- Messages with the same key always go to the same partition
- This ensures ordering for related events
- Choose keys carefully - uneven distribution causes hot partitions
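A minimal sketch of keyed sends with the Java client, assuming a local broker and an illustrative `orders` topic; both records share the key `order-42`, so they hash to the same partition and arrive in order:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both events hash to the same partition via the key "order-42",
            // so consumers see "created" before "paid".
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.send(new ProducerRecord<>("orders", "order-42", "paid"));
        }
    }
}
```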
Partition Count
- More partitions = more parallelism but more overhead
- Consider: expected throughput, consumer count, broker resources
- Start with the number of consumers you expect to run concurrently
- Partitions can be increased but not decreased
Topic Configuration
- `retention.ms`: How long to keep messages (default 7 days)
- `retention.bytes`: Maximum size per partition
- `cleanup.policy`: `delete` (remove old) or `compact` (keep latest per key)
- `min.insync.replicas`: Minimum replicas that must acknowledge
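A sketch of applying these settings at creation time with the AdminClient; the topic name, partition count, replication factor, and config values are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3, with per-topic overrides
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",   // 7 days
                            "cleanup.policy", "delete",
                            "min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```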
Producer Best Practices
Reliability Settings
acks=all # Wait for all replicas to acknowledge
retries=MAX_INT # Retry on transient failures
enable.idempotence=true # Prevent duplicate messages on retry
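The same settings expressed with the Java client's `ProducerConfig` constants, as a minimal sketch; note that idempotence also requires at most 5 in-flight requests per connection:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class ReliableProducerConfig {
    static Properties reliableProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // Idempotence additionally requires <= 5 in-flight requests per connection
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");
        return props;
    }
}
```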
Performance Tuning
batch.size: Accumulate messages before sending (default 16KB)linger.ms: Wait time for batching (0 = send immediately)buffer.memory: Total memory for buffering unsent messagescompression.type: gzip, snappy, lz4, or zstd for bandwidth savings
Error Handling
- Implement retry logic with exponential backoff
- Handle retriable vs non-retriable exceptions differently
- Log and alert on send failures
- Consider dead letter topics for messages that fail repeatedly
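A sketch of a send callback that distinguishes retriable from non-retriable failures; `routeToDeadLetter` is a hypothetical helper, not a client API:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;

public class SendWithErrorHandling {
    static void send(KafkaProducer<String, String> producer,
                     ProducerRecord<String, String> record) {
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                return; // acknowledged by the broker
            }
            if (exception instanceof RetriableException) {
                // The client already retried; reaching here means the
                // delivery timeout expired despite retries.
                System.err.println("Transient failure, retries exhausted: " + exception);
            } else {
                System.err.println("Non-retriable failure: " + exception);
            }
            routeToDeadLetter(record, exception); // hypothetical helper
        });
    }

    static void routeToDeadLetter(ProducerRecord<String, String> record, Exception e) {
        // Hypothetical: publish the record plus error details to "<topic>.DLT"
    }
}
```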
Partitioner
- Default: hash of key determines partition (null key = sticky partitioning in recent clients, round-robin in older ones)
- Custom partitioners for specific routing needs
- Ensure even distribution to avoid hot partitions
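A sketch of a custom partitioner that pins one hypothetical heavy tenant to its own partition while hashing everything else; `big-tenant` is illustrative:

```java
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class TenantPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // simplistic fallback for null keys
        }
        if ("big-tenant".equals(key)) {
            return numPartitions - 1; // isolate the heavy hitter on its own partition
        }
        // murmur2 matches the default partitioner's hash for keyed records
        return Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

Register it on the producer with `props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, TenantPartitioner.class.getName())`.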
Consumer Best Practices
Offset Management
- Consumers track which messages they've processed via offsets
- `auto.offset.reset`: `earliest` (start from beginning) or `latest` (only new messages)
- Commit offsets after successful processing, not before
- Set `enable.auto.commit=false` for manual offset control, a prerequisite for exactly-once semantics (see the sketch after this list)
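A minimal poll loop with auto-commit disabled: offsets are committed only after the batch has been processed, so a crash replays records (at-least-once) rather than losing them. The topic, group id, and `process` body are illustrative:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                consumer.commitSync(); // commit only after the batch succeeded
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* business logic */ }
}
```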
Consumer Groups
- Consumers in a group share partitions (each partition to one consumer)
- More consumers than partitions = some consumers idle
- Group rebalancing occurs when consumers join/leave
- Use `group.instance.id` for static membership to reduce rebalances (see the snippet below)
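As a sketch, the two settings added to the consumer `Properties` from the previous example; the instance id must be unique and stable per process, and the values shown are illustrative:

```java
// A stable id lets a restarted consumer rejoin without a rebalance,
// as long as it returns within session.timeout.ms
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "order-processor-1");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");
```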
Processing Patterns
- Process messages in order within a partition
- Handle out-of-order messages across partitions if needed
- Implement idempotent processing for at-least-once delivery
- Consider transactional processing for exactly-once
Timeouts and Failures
- Implement processing timeout to isolate slow events
- When timeout occurs, set event aside and continue to next message
- Maintain overall system performance over processing every single event
- Use dead letter queues for messages failing all retries
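One way to sketch this pattern, using a plain `ExecutorService` to bound per-record processing time; `process` and `park` are hypothetical stand-ins for business logic and a parking/DLT publish:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.kafka.clients.consumer.ConsumerRecord;

public class BoundedProcessing {
    static final ExecutorService WORKER = Executors.newSingleThreadExecutor();

    static void handle(ConsumerRecord<String, String> record) {
        Future<?> task = WORKER.submit(() -> process(record));
        try {
            task.get(5, TimeUnit.SECONDS); // per-record processing budget
        } catch (TimeoutException e) {
            task.cancel(true);             // free the worker thread
            park(record, e);               // set the slow event aside, keep consuming
        } catch (InterruptedException | ExecutionException e) {
            park(record, e);
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* business logic */ }

    static void park(ConsumerRecord<String, String> record, Exception e) {
        // Hypothetical: publish to a parking/DLT topic for later redrive
    }
}
```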
Error Handling and Retry
Retry Strategy
- Allow multiple runtime retries per processing attempt
- Example: 3 runtime retries per redrive, maximum 5 redrives = 15 total retries
- Runtime retries typically cover 99% of failures
- After exhausting retries, route to dead letter queue
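A minimal sketch of the runtime-retry layer with exponential backoff; the attempt count mirrors the example above, and `process` is a stand-in for your handler:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class RetryingHandler {
    // 3 runtime attempts before the record is handed back for redrive or the DLT
    static void processWithRetries(ConsumerRecord<String, String> record) throws Exception {
        int maxAttempts = 3;
        long backoffMs = 200;
        for (int attempt = 1; ; attempt++) {
            try {
                process(record);
                return;
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // caller redrives or routes to the dead letter queue
                }
                Thread.sleep(backoffMs);
                backoffMs *= 2; // 200ms, 400ms, ...
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* business logic */ }
}
```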
Dead Letter Topics
- Create dedicated DLT for messages that can't be processed
- Include original topic, partition, offset, and error details
- Monitor DLT for patterns indicating systemic issues
- Implement manual or automated retry from DLT
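A sketch of forwarding a poison record to a `<topic>.DLT` with provenance headers; the header names follow a common convention rather than any Kafka standard:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterPublisher {
    static void sendToDlt(KafkaProducer<String, byte[]> dltProducer,
                          ConsumerRecord<String, byte[]> failed, Exception error) {
        ProducerRecord<String, byte[]> dlt =
                new ProducerRecord<>(failed.topic() + ".DLT", failed.key(), failed.value());
        // Carry provenance so the record can be redriven or diagnosed later
        dlt.headers()
           .add("dlt.original.topic", failed.topic().getBytes(StandardCharsets.UTF_8))
           .add("dlt.original.partition",
                Integer.toString(failed.partition()).getBytes(StandardCharsets.UTF_8))
           .add("dlt.original.offset",
                Long.toString(failed.offset()).getBytes(StandardCharsets.UTF_8))
           .add("dlt.exception.message",
                String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8));
        dltProducer.send(dlt);
    }
}
```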
Schema Management
Schema Registry
- Use Confluent Schema Registry for schema management
- Producers validate data against registered schemas during serialization
- Schema mismatches throw exceptions, preventing malformed data
- Provides common reference for producers and consumers
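A sketch of wiring Confluent's Avro serializer into a producer, assuming the `kafka-avro-serializer` dependency is on the classpath; the registry URL is illustrative:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class AvroProducerConfig {
    static Properties avroProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // From Confluent's kafka-avro-serializer artifact
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Serialization now registers/validates the record's schema, so
        // incompatible data fails at send time instead of polluting the topic
        return props;
    }
}
```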
Schema Evolution
- Design schemas for forward and backward compatibility
- Add optional fields with defaults for backward compatibility
- Avoid removing or renaming fields
- Use schema versioning and migration strategies
Kafka Streams
State Management
- Enable log compaction on changelog topics to keep the latest value per key
- Periodically purge old data from state stores
- Monitor state store size and access patterns
- Use appropriate storage backends for your scale
Windowing Operations
- Handle out-of-order events and skewed timestamps
- Use appropriate time extraction and watermarking techniques
- Configure grace periods for late-arriving data
- Choose window types based on use case (tumbling, hopping, sliding, session)
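A sketch of a tumbling-window count in Kafka Streams with a grace period for late-arriving events; the `clicks` topic is illustrative:

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class ClickCounts {
    static void build(StreamsBuilder builder) {
        // 5-minute tumbling windows; events up to 1 minute late still count
        KTable<Windowed<String>, Long> counts = builder
                .stream("clicks", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeAndGrace(
                        Duration.ofMinutes(5), Duration.ofMinutes(1)))
                .count();
    }
}
```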
Security
Authentication
- Use SASL/SSL for client authentication
- Support SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
- Enable SSL for encryption in transit
- Rotate credentials regularly
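An illustrative client configuration for SCRAM authentication over TLS; the host, username, and password are placeholders:

```java
import java.util.Properties;

public class SaslSslClientConfig {
    static Properties saslProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");
        props.put("security.protocol", "SASL_SSL"); // SASL auth, TLS encryption
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"svc-orders\" password=\"change-me\";");
        return props;
    }
}
```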
Authorization
- Use Kafka ACLs for fine-grained access control
- Grant minimum necessary permissions per principal
- Separate read/write permissions by topic
- Audit access patterns regularly
Monitoring and Observability
Key Metrics
- Producer: record-send-rate, record-error-rate, batch-size-avg
- Consumer: records-consumed-rate, records-lag, commit-latency
- Broker: under-replicated-partitions, request-latency, disk-usage
Lag Monitoring
- Consumer lag = last produced offset - last committed offset
- High lag indicates consumers can't keep up
- Alert on increasing lag trends
- Scale consumers or optimize processing
Distributed Tracing
- Propagate trace context in message headers
- Use OpenTelemetry for end-to-end tracing
- Correlate producer and consumer spans
- Track message journey through the pipeline
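A minimal sketch of hand-rolled header propagation using the W3C `traceparent` header name; production setups would normally rely on OpenTelemetry's Kafka instrumentation rather than doing this manually:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class TracePropagation {
    // Producer side: attach the current context before sending
    static void attach(ProducerRecord<String, String> record, String traceParent) {
        record.headers().add("traceparent", traceParent.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: recover the context before starting a processing span
    static String extract(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("traceparent");
        return h == null ? null : new String(h.value(), StandardCharsets.UTF_8);
    }
}
```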
Testing
Unit Testing
- Mock Kafka clients for isolated testing
- Test serialization/deserialization logic
- Verify partitioning logic
- Test error handling paths
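A sketch of a broker-free unit test using the `kafka-clients` `MockProducer`; `OrderPublisher` is a hypothetical class under test that sends via an injected producer:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.apache.kafka.clients.producer.MockProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.junit.jupiter.api.Test;

class OrderPublisherTest {
    @Test
    void publishesKeyedEvent() {
        // autoComplete=true acks every send immediately, no broker needed
        MockProducer<String, String> producer =
                new MockProducer<>(true, new StringSerializer(), new StringSerializer());

        new OrderPublisher(producer).publish("order-42", "created");

        ProducerRecord<String, String> sent = producer.history().get(0);
        assertEquals("order-42", sent.key());
        assertEquals("created", sent.value());
    }
}
```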
Integration Testing
- Use embedded Kafka or Testcontainers
- Test full producer-consumer flows
- Verify exactly-once semantics if used
- Test rebalancing scenarios
Performance Testing
- Load test with production-like message rates
- Test consumer throughput and lag behavior
- Verify broker resource usage under load
- Test failure and recovery scenarios
Common Patterns
Event Sourcing
- Store all state changes as immutable events
- Rebuild state by replaying events
- Use log compaction for snapshots
- Enable time-travel debugging
CQRS (Command Query Responsibility Segregation)
- Separate write (command) and read (query) models
- Use Kafka as the event store
- Build read-optimized projections from events
- Handle eventual consistency appropriately
Saga Pattern
- Coordinate distributed transactions across services
- Each service publishes events for next step
- Implement compensating transactions for rollback
- Use correlation IDs to track saga instances
Change Data Capture (CDC)
- Capture database changes as Kafka events
- Use Debezium or similar CDC tools
- Enable real-time data synchronization
- Build event-driven integrations