# Spark Structured Streaming

Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.
## Quick Start
```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Example schema for the JSON payload carried in the Kafka message value
schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

# Basic Kafka-to-Delta streaming: parse the JSON value and flatten it
df = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "topic")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Append to Delta with a persistent checkpoint and a 30-second trigger
(df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/stream")
    .trigger(processingTime="30 seconds")
    .start("/delta/target_table")
)
```
## Core Patterns
| Pattern | Description | Reference |
|---|---|---|
| Kafka Streaming | Kafka to Delta, Kafka to Kafka, Real-Time Mode | See kafka-streaming.md |
| Stream Joins | Stream-stream joins, stream-static joins | See stream-stream-joins.md, stream-static-joins.md |
| Multi-Sink Writes | Write to multiple tables, parallel merges | See multi-sink-writes.md |
| Merge Operations | MERGE performance, parallel merges, optimizations | See merge-operations.md |
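
Several of these patterns meet in `foreachBatch`, which hands each micro-batch to you as an ordinary DataFrame that can feed multiple sinks. A minimal sketch, assuming hypothetical tables `catalog.schema.orders` and `catalog.schema.orders_audit` keyed on `order_id`:

```python
from delta.tables import DeltaTable

def process_batch(batch_df, batch_id):
    # Cache once so the Kafka source is not recomputed per sink
    batch_df.persist()

    # Sink 1: MERGE (upsert) the batch into a Delta table by key
    (DeltaTable.forName(spark, "catalog.schema.orders")
        .alias("t")
        .merge(batch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Sink 2: append the raw batch to an audit table
    (batch_df.write
        .format("delta")
        .mode("append")
        .saveAsTable("catalog.schema.orders_audit"))

    batch_df.unpersist()

(df.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/multi_sink")
    .start())
```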
## Configuration
| Topic | Description | Reference |
|---|---|---|
| Checkpoints | Checkpoint management and best practices | See checkpoint-best-practices.md |
| Stateful Operations | Watermarks, state stores, RocksDB configuration | See stateful-operations.md |
| Trigger & Cost | Trigger selection, cost optimization, RTM | See trigger-and-cost-optimization.md |
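
A minimal sketch tying watermarks and the state store together, assuming illustrative `event_time` and `key` columns; the provider class below is the Databricks RocksDB implementation:

```python
from pyspark.sql.functions import col, window

# Keep large state off the JVM heap with RocksDB (set before the stream
# starts; changing the provider requires a fresh checkpoint)
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)

# The watermark bounds state: events more than 10 minutes late are
# dropped, and window state older than that is evicted
counts = (df
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("key"))
    .count()
)
```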
## Best Practices
| Topic | Description | Reference |
|---|---|---|
| Production Checklist | Comprehensive best practices | See streaming-best-practices.md |
## Production Checklist
- Checkpoint location is persistent (UC volumes, not DBFS)
- Unique checkpoint per stream
- Fixed-size cluster (no autoscaling for streaming)
- Monitoring configured (input rate, lag, batch duration)
- Exactly-once delivery verified via idempotent writes (txnVersion/txnAppId; see the sketch after this list)
- Watermark configured for every stateful operation
- Left joins for stream-static joins (an inner join silently drops stream rows with no static match)
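
A minimal sketch of the exactly-once item above, assuming a Delta path sink; `app_id` is a placeholder you would fix per logical stream:

```python
app_id = "orders-stream-v1"  # placeholder: stable and unique per stream

def write_batch(batch_df, batch_id):
    # txnAppId + txnVersion let Delta recognize and skip a micro-batch
    # that was already committed, so replays are not written twice
    (batch_df.write
        .format("delta")
        .option("txnAppId", app_id)
        .option("txnVersion", batch_id)
        .mode("append")
        .save("/delta/target_table"))

(df.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/exactly_once")
    .start())
```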