# Data Engineering Best Practices
Use this skill for production architecture and standards decisions: storage layout, lifecycle, incremental semantics, schema evolution, quality checks, and cost/performance tradeoffs.
## When to use this skill
Use for:
- Designing Bronze/Silver/Gold or equivalent data layers
- Choosing append vs overwrite vs merge behavior
- Partitioning and file-size strategy
- Defining schema evolution policy
- Setting testing/observability guardrails
- Establishing retention + cost controls
Use domain skills for implementation details:
- @data-engineering-core
- @data-engineering-storage-lakehouse
- @data-engineering-storage-formats
- @data-engineering-storage-remote-access
- @data-engineering-storage-authentication
## Decision checklist (apply in order)
- Data contract
  - Required columns/types?
  - Nullability and key uniqueness?
- Layer semantics
  - Bronze immutable?
  - Silver deduplicated/validated?
  - Gold business-ready aggregates?
- Write mode
  - Append, partition overwrite, or merge?
- Layout
  - Partition keys + target file size set?
- Incremental logic
  - Watermark/checkpoint strategy defined?
- Evolution policy
  - Additive-only by default?
- Operational controls
  - Tests + observability + retention + backfill process?
## Core standards

### 1) Layering (Medallion)
- Bronze: raw immutable ingestion; append-only
- Silver: cleaned, validated, conformed schema
- Gold: consumption-specific marts/features/aggregates
Do not skip Silver validation for convenience; silent quality drift is costly.
### 2) Write semantics
| Operation | Use when | Notes |
|---|---|---|
| Append | strictly new immutable events | simplest, cheapest |
| Partition overwrite | deterministic reprocessing for date/key slice | safe for backfills |
| Merge/Upsert | corrections/late updates/deletes | needs key + conflict semantics |
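The partition-overwrite row in the table can be sketched with plain files. This is a minimal illustration, not a real engine's implementation: the `overwrite_partition` helper is hypothetical, and it assumes a Hive-style `date=...` directory layout with JSONL files. Lakehouse engines implement the same semantics natively (e.g. dynamic partition overwrite in Spark, `replaceWhere` in Delta Lake).

```python
import json
import shutil
import tempfile
from pathlib import Path

def overwrite_partition(table_root: Path, partition: str, rows: list[dict]) -> None:
    """Deterministically replace one partition slice (hypothetical helper).

    Deletes only table_root/<partition>/ and rewrites it, leaving every
    other partition untouched -- 'partition overwrite' semantics.
    """
    part_dir = table_root / partition
    if part_dir.exists():
        shutil.rmtree(part_dir)  # drop only the targeted slice
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

root = Path(tempfile.mkdtemp()) / "events"
overwrite_partition(root, "date=2024-01-01", [{"id": 1}])
overwrite_partition(root, "date=2024-01-02", [{"id": 2}])
# Backfill: rerun the 2024-01-01 slice; 2024-01-02 is untouched.
overwrite_partition(root, "date=2024-01-01", [{"id": 1}, {"id": 3}])
```

Because the rewrite is scoped to one deterministic slice, re-running a backfill is safe and idempotent, which is exactly why this mode is preferred over full-table overwrite.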
### 3) Partitioning
Good partition keys:
- Frequently filtered dimensions (often date + low/moderate-cardinality dimension)
Avoid:
- High-cardinality keys (e.g., user_id)
- Over-partitioning creating tiny files
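A quick cardinality check catches both pitfalls before a bad key ships. A sketch with a hypothetical `vet_partition_key` helper; the `max_partitions` threshold is an illustrative assumption to tune per table and engine:

```python
def vet_partition_key(rows: list[dict], key: str,
                      max_partitions: int = 1000) -> tuple[int, bool]:
    """Return (distinct_count, ok) for a candidate partition key.

    A key that would create more directories than max_partitions per load
    is a tiny-file generator, not a partition key.
    """
    distinct = len({row[key] for row in rows})
    return distinct, distinct <= max_partitions

# Synthetic events: 5000 rows, 10 dates, one row per user.
rows = [{"date": f"2024-01-{1 + i % 10:02d}", "user_id": i} for i in range(5000)]
date_card, date_ok = vet_partition_key(rows, "date")        # 10 partitions: fine
user_card, user_ok = vet_partition_key(rows, "user_id")     # 5000 partitions: reject
```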
### 4) File sizing
Target file size: ~256MB–1GB (workload-dependent).
- Too small → metadata/listing overhead + slow scans
- Too large → poor parallelism and skewed processing
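The target translates directly into an output file count: divide the (compressed) dataset size by the target and repartition or coalesce to that number before the write. A small sketch, with the helper name assumed:

```python
import math

def plan_output_files(total_bytes: int,
                      target_file_bytes: int = 256 * 1024**2) -> int:
    """Number of output files needed to land near the target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# A 10 GiB compacted dataset at a 256 MiB target -> 40 files.
n_files = plan_output_files(10 * 1024**3)
```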
### 5) Schema evolution
Default policy:
- ✅ additive changes first (new nullable columns)
- ⚠️ type widening only when compatibility is clear
- ❌ destructive rename/drop in-place for shared production tables
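The policy above can be enforced mechanically in CI. A sketch, assuming schemas are represented as simple name→type mappings and an illustrative allowlist of safe widenings:

```python
SAFE_WIDENINGS = frozenset({("int32", "int64"), ("float32", "float64")})

def evolution_violations(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Flag schema changes that break the additive-only default policy."""
    issues = []
    for col, typ in old.items():
        if col not in new:
            issues.append(f"destructive: column '{col}' dropped or renamed")
        elif new[col] != typ and (typ, new[col]) not in SAFE_WIDENINGS:
            issues.append(f"incompatible type change on '{col}': {typ} -> {new[col]}")
    return issues

old = {"id": "int64", "amount": "int32"}
additive = evolution_violations(old, {**old, "note": "string"})       # new nullable column
widened = evolution_violations(old, {"id": "int64", "amount": "int64"})  # safe widening
dropped = evolution_violations(old, {"id": "int64"})                  # destructive
```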
### 6) Incremental processing
- Persist watermark/checkpoint externally
- Make re-runs idempotent
- Include late-arriving data strategy (lag window/backfill)
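A minimal sketch of the watermark pattern, assuming a JSON file as the external checkpoint store (in production this would be a metadata table or the engine's own checkpoint mechanism):

```python
import json
import tempfile
from pathlib import Path

def run_increment(events: list[dict], checkpoint: Path) -> list[dict]:
    """Process only events newer than the persisted watermark.

    The watermark lives outside the job, so re-running the same job
    against the same input is an idempotent no-op.
    """
    watermark = json.loads(checkpoint.read_text())["ts"] if checkpoint.exists() else ""
    batch = [e for e in events if e["ts"] > watermark]
    if batch:
        # ...process batch here, then advance the watermark
        checkpoint.write_text(json.dumps({"ts": max(e["ts"] for e in batch)}))
    return batch

ckpt = Path(tempfile.mkdtemp()) / "watermark.json"
events = [{"ts": "2024-01-01T00:00:00"}, {"ts": "2024-01-02T00:00:00"}]
first = run_increment(events, ckpt)   # both rows processed
second = run_increment(events, ckpt)  # rerun: nothing new, nothing done
```

Late-arriving data needs one extension this sketch omits: filter on `ts > watermark - lag_window` and deduplicate downstream, or handle stragglers via targeted backfill.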
### 7) Quality and reliability
Minimum controls:
- Required columns + types
- Primary key uniqueness (or dedupe policy)
- Null thresholds on critical fields
- Freshness/SLA checks
- Run-level metrics (rows in/out, failures, latency)
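The minimum controls fit in a single run-level report. A stdlib sketch with hypothetical names and an illustrative 1% null threshold; real pipelines would use a validation framework instead:

```python
def quality_report(rows: list[dict], required_cols: list[str],
                   pk: str, max_null_frac: float = 0.01) -> dict:
    """Run-level quality report: null thresholds, PK uniqueness, row counts."""
    n = len(rows)
    errors = []
    for col in required_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if nulls / n > max_null_frac:
            errors.append(f"null rate {nulls}/{n} over threshold on '{col}'")
    if len({r.get(pk) for r in rows}) != n:
        errors.append(f"primary key '{pk}' not unique")
    return {"rows_in": n, "errors": errors, "passed": not errors}

ok = quality_report([{"id": 1, "v": 1}, {"id": 2, "v": 2}], ["id", "v"], pk="id")
bad = quality_report([{"id": 1, "v": None}, {"id": 1, "v": 2}], ["id", "v"], pk="id")
```

Emitting this report as metrics on every run is what turns "silent quality drift" into an alert.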
## Anti-patterns (reject in review)
- Full table overwrite for small incremental changes
- No checkpoint/watermark for recurring pipeline
- Unbounded tiny-file generation
- Dynamic SQL built from user values without parameter binding
- Production credentials in code/config committed to repo
- No backfill plan / no rollback strategy
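The parameter-binding anti-pattern is worth one concrete contrast. Shown here with stdlib `sqlite3` for portability; DuckDB and most drivers accept the same `?` placeholder style:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

malicious = "alice' OR '1'='1"
# Bound parameter: the whole string is compared as one literal -> no match.
bound = conn.execute("SELECT name FROM users WHERE name = ?", (malicious,)).fetchall()
# The rejected pattern, f"... WHERE name = '{malicious}'", would instead
# splice the value into the SQL text and return every row (or worse).
```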
## Minimal production blueprint
- Ingest raw to Bronze (append-only)
- Validate + standardize to Silver
- Build Gold outputs
- Emit metrics + quality report
- Persist checkpoint/watermark
- Apply lifecycle rules + periodic compaction/maintenance
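The steps above can be wired together as a skeleton; this is an illustrative in-memory sketch (all names hypothetical), not a production implementation:

```python
def run_pipeline(raw_events: list[dict]) -> dict:
    """End-to-end skeleton of the blueprint steps on toy data."""
    bronze = list(raw_events)                                   # 1. raw, append-only
    silver = [e for e in bronze if e.get("id") is not None]     # 2. validate
    seen, deduped = set(), []
    for e in silver:                                            #    ...and dedupe on key
        if e["id"] not in seen:
            seen.add(e["id"])
            deduped.append(e)
    gold = {"event_count": len(deduped)}                        # 3. business aggregate
    metrics = {"rows_in": len(bronze), "rows_out": len(deduped)}  # 4. run metrics
    # 5. persist checkpoint/watermark and 6. schedule compaction -- omitted here
    return {"gold": gold, "metrics": metrics}

result = run_pipeline([{"id": 1}, {"id": 1}, {"id": None}, {"id": 2}])
```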
## Progressive disclosure (read next as needed)
- best-practices-detailed.md — comprehensive deep-dive examples
- @data-engineering-core/patterns/incremental.md — incremental loading patterns
- @data-engineering-storage-lakehouse — Delta/Iceberg/Hudi-specific behavior
- @data-engineering-quality — validation framework implementation
- @data-engineering-observability — metrics/tracing/alerting