Apache Beam Core Concepts
The Beam Model
The Beam model evolved from Google's MapReduce, FlumeJava, and MillWheel projects, and was originally known as the "Dataflow Model."
Key Abstractions
Pipeline
A Pipeline encapsulates the entire data processing task, including reading, transforming, and writing data.
// Java
Pipeline p = Pipeline.create(options);
p.apply(...)
    .apply(...)
    .apply(...);
p.run().waitUntilFinish();
# Python
with beam.Pipeline(options=options) as p:
    (p | 'Read' >> beam.io.ReadFromText('input.txt')
       | 'Transform' >> beam.Map(process)
       | 'Write' >> beam.io.WriteToText('output'))
PCollection
A distributed dataset that can be bounded (batch) or unbounded (streaming).
Properties
- Immutable - Once created, cannot be modified
- Distributed - Elements processed in parallel
- May be bounded or unbounded
- Timestamped - Each element has an event timestamp
- Windowed - Elements assigned to windows
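A minimal Python sketch of the bounded case, using beam.Create to turn an in-memory list into a PCollection (the element values are arbitrary):
# Python
import apache_beam as beam

with beam.Pipeline() as p:
    # beam.Create yields a bounded PCollection from an in-memory list
    words = p | beam.Create(['hello', 'beam', 'pipeline'])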
PTransform
A data processing operation that transforms PCollections.
// Java
PCollection<String> output = input.apply(MyTransform.create());
# Python
output = input | 'Name' >> beam.ParDo(MyDoFn())
Core Transforms
ParDo
General-purpose parallel processing.
// Java
input.apply(ParDo.of(new DoFn<String, Integer>() {
  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<Integer> out) {
    out.output(element.length());
  }
}));
# Python
class LengthFn(beam.DoFn):
    def process(self, element):
        yield len(element)

input | beam.ParDo(LengthFn())
# Or simpler:
input | beam.Map(len)
GroupByKey
Groups elements by key.
PCollection<KV<String, Integer>> input = ...;
PCollection<KV<String, Iterable<Integer>>> grouped = input.apply(GroupByKey.create());
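The Python equivalent, sketched with a hypothetical keyed PCollection named pairs:
# Python
# pairs holds (key, value) tuples such as ('cat', 1)
grouped = pairs | beam.GroupByKey()  # yields ('cat', [1, 5, 9])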
CoGroupByKey
Joins multiple PCollections by key.
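A hedged Python sketch of a two-way join; the emails/phones data is invented:
# Python
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | 'Emails' >> beam.Create([('amy', 'amy@example.com')])
    phones = p | 'Phones' >> beam.Create([('amy', '555-1234')])
    # Each result looks like:
    # ('amy', {'emails': ['amy@example.com'], 'phones': ['555-1234']})
    joined = {'emails': emails, 'phones': phones} | beam.CoGroupByKey()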
Combine
Combines elements (sum, mean, etc.).
// Global combine
input.apply(Combine.globally(Sum.ofIntegers()));
// Per-key combine
input.apply(Combine.perKey(Sum.ofIntegers()));
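The Python counterparts take any associative, commutative callable such as sum (numbers and pairs are hypothetical PCollections):
# Python
total = numbers | beam.CombineGlobally(sum)  # single global result
sums = pairs | beam.CombinePerKey(sum)       # one result per key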
Flatten
Merges multiple PCollections.
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);
PCollection<String> merged = collections.apply(Flatten.pCollections());
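In Python, a tuple of PCollections is piped into beam.Flatten() (pc1 through pc3 are hypothetical):
# Python
merged = (pc1, pc2, pc3) | beam.Flatten()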
Partition
Splits a PCollection into multiple PCollections.
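A minimal Python sketch: the partition function receives the element plus the partition count and must return an index in [0, num_partitions):
# Python
import apache_beam as beam

def by_parity(n, num_partitions):
    return n % num_partitions  # 0 -> evens, 1 -> odds

with beam.Pipeline() as p:
    evens, odds = (p
                   | beam.Create([1, 2, 3, 4])
                   | beam.Partition(by_parity, 2))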
Windowing
Types
- Fixed Windows - Regular, non-overlapping intervals
- Sliding Windows - Overlapping intervals
- Session Windows - Gaps of inactivity define boundaries
- Global Window - All elements in one window (default)
input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));
input | beam.WindowInto(beam.window.FixedWindows(300))  # 300 seconds = 5 minutes
Triggers
Control when results are emitted.
input.apply(Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.standardHours(1))
    .accumulatingFiredPanes());
Side Inputs
Additional read-only inputs available to a ParDo alongside its main input.
PCollectionView<Map<String, String>> sideInput =
    lookupTable.apply(View.asMap());

mainInput.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Map<String, String> lookup = c.sideInput(sideInput);
    // Use lookup...
  }
}).withSideInputs(sideInput));
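In Python, the same pattern is typically expressed by passing a view such as beam.pvalue.AsDict (or AsList/AsSingleton) as an extra argument; the lookup-table contents here are invented:
# Python
import apache_beam as beam

def enrich(code, lookup):
    return lookup.get(code, 'unknown')

with beam.Pipeline() as p:
    table = p | 'Table' >> beam.Create([('a', 'Alpha'), ('b', 'Beta')])
    main = p | 'Main' >> beam.Create(['a', 'b', 'c'])
    # AsDict materializes the side input as a dict for each worker
    out = main | beam.Map(enrich, lookup=beam.pvalue.AsDict(table))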
Pipeline Options
Configure pipeline execution.
public interface MyOptions extends PipelineOptions {
  @Description("Input file")
  @Validation.Required
  String getInput();
  void setInput(String value);
}

MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);
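In the Python SDK, custom flags hook into the standard argparse machinery; a minimal sketch:
# Python
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input', required=True, help='Input file')

options = MyOptions()  # parses sys.argv by default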
Schema
Strongly-typed access to structured data.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class User {
  public abstract String getName();
  public abstract int getAge();
}
PCollection<User> users = ...;
PCollection<Row> rows = users.apply(Convert.toRows());
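In Python, beam.Row gives a rough equivalent: attribute-style access to elements with an inferred schema (the field names are arbitrary):
# Python
import apache_beam as beam

with beam.Pipeline() as p:
    users = p | beam.Create([beam.Row(name='amy', age=30)])
    names = users | beam.Map(lambda u: u.name)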
Error Handling
Dead Letter Queue Pattern
TupleTag<String> successTag = new TupleTag<String>() {};
TupleTag<String> failureTag = new TupleTag<String>() {};

PCollectionTuple results = input.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      c.output(process(c.element()));  // main output on the success tag
    } catch (Exception e) {
      c.output(failureTag, c.element());  // route failures to the DLQ tag
    }
  }
}).withOutputTags(successTag, TupleTagList.of(failureTag)));
results.get(successTag).apply(WriteToSuccess());
results.get(failureTag).apply(WriteToDeadLetter());
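The Python analogue uses with_outputs and TaggedOutput; SafeParse, the tag names, and the input lines are illustrative:
# Python
import apache_beam as beam

class SafeParse(beam.DoFn):
    def process(self, element):
        try:
            yield int(element)
        except ValueError:
            # Route bad records to a separately tagged output
            yield beam.pvalue.TaggedOutput('failed', element)

results = lines | beam.ParDo(SafeParse()).with_outputs('failed', main='parsed')
parsed, failed = results.parsed, results.failed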
Cross-Language Pipelines
Use transforms from other SDKs.
# Use Java Kafka connector from Python
from apache_beam.io.kafka import ReadFromKafka
result = pipeline | ReadFromKafka(
    consumer_config={'bootstrap.servers': 'localhost:9092'},
    topics=['my-topic'])
Best Practices
- Prefer built-in transforms over custom DoFns
- Use schemas for type-safe operations
- Minimize side inputs for performance
- Handle late data explicitly
- Test with DirectRunner before deploying
- Use TestPipeline for unit tests
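A minimal Python unit-test sketch using TestPipeline with assert_that; the transform under test is a stand-in:
# Python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_lengths():
    with TestPipeline() as p:
        out = p | beam.Create(['a', 'bb']) | beam.Map(len)
        assert_that(out, equal_to([1, 2]))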