Stream Processing
Patterns and technologies for real-time data processing, event streaming, and stream analytics.
When to Use This Skill
- Designing real-time data pipelines
- Choosing stream processing frameworks
- Implementing event-driven architectures
- Building real-time analytics
- Understanding streaming vs batch trade-offs
Batch vs Streaming
Comparison
| Aspect | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data | Bounded (finite) | Unbounded (infinite) |
| Processing | Process all at once | Process as it arrives |
| State | Recompute each run | Maintain continuously |
| Complexity | Lower | Higher |
| Cost | Often lower | Often higher |
When to Use Streaming
Use streaming when:
- Real-time responses required (<1 minute)
- Events need immediate action (fraud, alerts)
- Data arrives continuously
- Users expect live updates
- Time-sensitive business decisions
Use batch when:
- Daily/hourly reports sufficient
- Complex transformations needed
- Cost optimization priority
- Historical analysis
- One-time processing
Stream Processing Concepts
Event Time vs Processing Time
Event Time: When event actually occurred
Processing Time: When event is processed
Example:
┌──────────────────────────────────────────┐
│ Event: Purchase at 10:00:00 (event time) │
│ Network delay: 5 seconds                 │
│ Processing: 10:00:05 (processing time)   │
└──────────────────────────────────────────┘
Why it matters:
- Late events need handling
- Ordering not guaranteed
- Watermarks track progress
Watermarks
Watermark = "All events before this time have arrived"
Event stream:
──[10:01]──[10:02]──[10:00]──[10:03]──[Watermark: 10:00]──
Allows system to:
- Know when window is complete
- Handle late events
- Balance latency vs completeness
Windows
Tumbling Window (fixed, non-overlapping):
|─────|─────|─────|
0     5     10    15  (seconds)
Sliding Window (fixed, overlapping):
|─────|
  |─────|
    |─────|
Size: 5s, Slide: 2s
Session Window (activity-based):
|──────| |───────────| |───|
User activity with gaps defines windows
Count Window:
Process every N events
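The window types above can be sketched as plain functions (illustrative Python, independent of any framework; window boundaries follow the usual start-inclusive, end-exclusive convention):

```python
def tumbling_window(event_time: int, size: int) -> tuple:
    """Assign an event to its fixed, non-overlapping window [start, start + size)."""
    start = (event_time // size) * size
    return (start, start + size)

def sliding_windows(event_time: int, size: int, slide: int) -> list:
    """All overlapping windows of width `size`, starting every `slide`,
    that contain the event."""
    # First window start s with s > event_time - size, s a multiple of slide
    start = ((event_time - size) // slide + 1) * slide
    windows = []
    while start <= event_time:
        windows.append((start, start + size))
        start += slide
    return windows
```

For example, with a 5s tumbling window an event at t=7 lands in [5, 10); with size 5s and slide 2s an event at t=4 belongs to three overlapping windows.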
State Management
Stateful operations require maintained state:
- Aggregations (sum, count, avg)
- Joins between streams
- Pattern detection
- Deduplication
State backends:
- In-memory (fast, limited)
- RocksDB (larger, persistent)
- External (Redis, database)
Stream Processing Frameworks
Apache Kafka Streams
Characteristics:
- Library (not a cluster)
- Exactly-once semantics
- Kafka-native
- Java/Scala
Best for:
- Kafka-centric architectures
- Simpler transformations
- Microservices
Example topology:
source → filter → map → aggregate → sink
Apache Flink
Characteristics:
- Distributed cluster
- True streaming (not micro-batch)
- Advanced state management
- SQL support
Best for:
- Complex event processing
- Large-scale streaming
- Low-latency requirements
Example:
DataStream<Event> events = env.addSource(kafkaSource);
events
.keyBy(e -> e.getUserId())
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new CountAggregator())
.addSink(sink);
Apache Spark Streaming
Characteristics:
- Micro-batch processing
- Unified batch + streaming API
- Wide ecosystem
- Python, Scala, Java, R
Best for:
- Teams with Spark experience
- Batch + streaming unified
- Machine learning integration
Latency: Seconds (micro-batch)
Kafka Streams vs Flink vs Spark
| Factor | Kafka Streams | Flink | Spark Streaming |
|---|---|---|---|
| Deployment | Library | Cluster | Cluster |
| Latency | Low | Lowest | Medium |
| State | Good | Excellent | Good |
| Exactly-once | Yes | Yes | Yes |
| Complexity | Low | High | Medium |
| Scaling | With Kafka | Independent | Independent |
| SQL | Limited | Yes | Yes |
| ML integration | Limited | Limited | Excellent |
Stream Processing Patterns
Filtering
Input: All events
Output: Events matching criteria
Example: Only process events where amount > 1000
Mapping/Transformation
Input: Event type A
Output: Event type B
Example: Enrich order events with customer data
Aggregation
Input: Multiple events
Output: Single aggregated result
Examples:
- Count events per window
- Sum amounts per user
- Average latency per endpoint
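A windowed aggregation can be sketched as keyed counting over tumbling windows (illustrative Python; the events and keys are made-up):

```python
from collections import defaultdict

def windowed_counts(events, window_size):
    """Count events per (key, tumbling-window-start). Each event is (key, event_time)."""
    counts = defaultdict(int)
    for key, event_time in events:
        window_start = (event_time // window_size) * window_size
        counts[(key, window_start)] += 1
    return dict(counts)

events = [("user_a", 1), ("user_a", 3), ("user_b", 4), ("user_a", 7)]
result = windowed_counts(events, window_size=5)
# {("user_a", 0): 2, ("user_b", 0): 1, ("user_a", 5): 1}
```

The same shape covers sums and averages: replace the counter with the appropriate accumulator per (key, window) pair.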
Join Patterns
Stream-Stream Join:
┌───────────┐      ┌──────────────┐
│ Orders    │ ──►  │ Join         │
└───────────┘      │ (by order_id │
┌───────────┐      │  in window)  │
│ Shipments │ ──►  │              │
└───────────┘      └──────────────┘
Stream-Table Join (Enrichment):
┌───────────┐      ┌──────────────┐
│ Events    │ ──►  │ Join         │
└───────────┘      │ (lookup by   │
┌───────────┐      │  customer)   │
│ Customer  │ ──►  │              │
│ Table     │      └──────────────┘
└───────────┘
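The stream-table (enrichment) join amounts to a keyed lookup per event — a minimal sketch in Python, with hypothetical customer data:

```python
# The "table" side: a materialized lookup, here just a dict (made-up data)
customers = {"c1": {"name": "Ada", "tier": "gold"}}

def enrich(event_stream, table):
    """Stream-table join: attach the matching table row to each event."""
    for event in event_stream:
        row = table.get(event["customer_id"])  # None when no match (outer-join style)
        yield {**event, "customer": row}

orders = [{"order_id": 1, "customer_id": "c1"}]
enriched = list(enrich(orders, customers))
```

In a real system the table side is kept up to date from a changelog stream; the lookup itself is the same keyed get.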
Deduplication
Problem: Duplicate events from at-least-once delivery
Solution:
1. Track seen IDs in state (with TTL)
2. If seen, drop
3. If new, process and store ID
State: {event_id: timestamp}
TTL: Based on expected duplicate window
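The three steps above can be sketched directly (illustrative Python; a production version would hold this state in a backend such as RocksDB):

```python
def make_deduper(ttl_seconds):
    """Dedup with state {event_id: timestamp}; entries expire after ttl_seconds."""
    seen = {}
    def is_duplicate(event_id, now):
        # Evict expired ids so the state stays bounded (the TTL above)
        expired = [eid for eid, ts in seen.items() if now - ts > ttl_seconds]
        for eid in expired:
            del seen[eid]
        if event_id in seen:
            return True          # seen before: drop
        seen[event_id] = now     # new: process and remember the id
        return False
    return is_duplicate
```

A duplicate arriving inside the TTL window is dropped; once the id expires, the same id is treated as a new event, which is why the TTL must cover the expected duplicate window.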
Event Delivery Guarantees
At-Most-Once
May lose events, never duplicates
Commit → Process → (if failure after commit, the event is lost)
Use when: Loss acceptable, simplicity preferred
At-Least-Once
Never loses, may have duplicates
Process → Commit → (if failure before commit, the event is reprocessed → duplicates)
Use when: No loss acceptable, handle duplicates downstream
Exactly-Once
Never loses, never duplicates
Requires:
- Idempotent operations, OR
- Transactional processing
How it works:
1. Read from source transactionally
2. Process and update state
3. Write output and commit together
Flink: Checkpointing + two-phase commit
Kafka Streams: Transactional producer + exactly-once semantics (EOS) configuration
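Idempotence, the first option above, can be illustrated with a sink keyed by event id: a redelivered event overwrites rather than duplicates (a sketch only; `IdempotentSink` is hypothetical):

```python
class IdempotentSink:
    """Keyed upsert sink: redelivered events overwrite instead of duplicating,
    so at-least-once delivery yields exactly-once *effects*."""
    def __init__(self):
        self.store = {}

    def write(self, event_id, value):
        self.store[event_id] = value  # upsert by event id

sink = IdempotentSink()
sink.write("order-1", {"amount": 100})
sink.write("order-1", {"amount": 100})  # retry after a failure: no duplicate row
```

This is why "idempotent operations OR transactional processing" suffices: either the replay is harmless, or it never becomes visible.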
Late Event Handling
Strategies
1. Drop late events
Simple, may lose data
2. Allow late events (allowed lateness)
Process if within lateness threshold
3. Side output late events
Main stream processes on-time
Side stream handles late separately
4. Reprocess historical
Batch job fixes late data impact
Watermark Strategies
Bounded Out-of-Orderness:
watermark = max_event_time - max_lateness
Example:
max_event_time = 10:00:00
max_lateness = 5 seconds
watermark = 09:59:55
All events with timestamps before 09:59:55 are assumed to have arrived.
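The bounded out-of-orderness formula can be sketched as a small tracker (illustrative Python; real frameworks emit watermarks periodically rather than per event):

```python
class BoundedOutOfOrderness:
    """watermark = max_event_time - max_lateness (the formula above)."""
    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        # Watermarks only advance; a late event never moves them backwards
        self.max_event_time = max(self.max_event_time, event_time)

    def watermark(self):
        return self.max_event_time - self.max_lateness

    def is_late(self, event_time):
        return event_time < self.watermark()

wm = BoundedOutOfOrderness(max_lateness=5)
wm.observe(100)  # event times here are plain seconds for illustration
```

After seeing an event at t=100 with 5s allowed lateness, the watermark sits at 95: an event stamped 94 is late, one stamped 96 is not.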
Scalability Patterns
Partitioning
Partition by key for parallel processing:
┌─────────────────────────────────────────────────────┐
│ Kafka Topic (3 partitions) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐│
│ │ Partition 0 │ │ Partition 1 │ │ Partition 2 ││
│ │ user_a, b │ │ user_c, d │ │ user_e, f ││
│ └─────────────┘ └─────────────┘ └─────────────────┘│
└─────────────────────────────────────────────────────┘
        │               │                │
        ▼               ▼                ▼
   ┌─────────┐     ┌─────────┐      ┌─────────┐
   │Worker 0 │     │Worker 1 │      │Worker 2 │
   └─────────┘     └─────────┘      └─────────┘
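Key-based partitioning boils down to a stable hash modulo the partition count (a sketch; `crc32` is a stand-in hash — Kafka's default partitioner uses murmur2):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition mapping, so every event for a given
    key lands on the same partition and therefore the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Determinism is the point: keyed state (counts, joins, dedup) only works if all events for a key are processed by one worker.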
Backpressure
When downstream can't keep up:
1. Buffer (risk: OOM)
2. Drop (risk: data loss)
3. Backpressure (slow down source)
Flink: Backpressure propagates automatically
Kafka: Consumer lag indicates backpressure
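Option 3 above — slowing the source — can be illustrated with a bounded queue whose `put()` blocks when full (a minimal single-process sketch using Python's standard library):

```python
import queue

# A bounded buffer between source and worker. When the consumer falls behind,
# put() blocks (here: times out), pushing backpressure onto the source instead
# of buffering without limit (OOM) or dropping events (data loss).
buf = queue.Queue(maxsize=2)
buf.put("e1")
buf.put("e2")
try:
    buf.put("e3", timeout=0.05)  # buffer full: the producer is held back
    accepted = True
except queue.Full:
    accepted = False
```

Distributed frameworks generalize the same idea across network channels; in Flink this propagation happens automatically between operators.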
Monitoring Streaming Applications
Key Metrics
Throughput:
- Events per second
- Bytes per second
Latency:
- Processing latency
- End-to-end latency
Health:
- Consumer lag
- Checkpoint duration
- Backpressure rate
- Error rate
Consumer Lag
Lag = Latest offset - Consumer offset
High lag indicates:
- Processing too slow
- Need more parallelism
- Downstream bottleneck
Monitor: Set alerting thresholds
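The lag formula can be sketched per partition (illustrative Python; the offsets and the alert threshold are made-up):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = latest offset in the log - consumer's committed offset."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

lag = consumer_lag({0: 1000, 1: 500}, {0: 990, 1: 500})
alert = max(lag.values()) > 100  # hypothetical alerting threshold
```

Alerting on the worst partition (rather than the total) catches the common failure mode where one hot key stalls a single partition while the rest keep up.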
Best Practices
1. Design for exactly-once when needed
2. Handle late events explicitly
3. Use event time, not processing time
4. Monitor consumer lag closely
5. Plan for state recovery
6. Test with realistic data volumes
7. Implement backpressure handling
8. Keep processing idempotent when possible
Related Skills
- message-queues - Messaging patterns
- data-architecture - Data platform design
- etl-elt-patterns - Data pipeline patterns