skills/legout/data-agent-skills/data-engineering-best-practices

data-engineering-best-practices

SKILL.md

Data Engineering Best Practices

Use this skill for production architecture and standards decisions: storage layout, lifecycle, incremental semantics, schema evolution, quality checks, and cost/performance tradeoffs.

When to use this skill

Use for:

  • Designing Bronze/Silver/Gold or equivalent data layers
  • Choosing append vs overwrite vs merge behavior
  • Partitioning and file-size strategy
  • Defining schema evolution policy
  • Setting testing/observability guardrails
  • Establishing retention + cost controls

Use domain skills for implementation details:

  • @data-engineering-core
  • @data-engineering-storage-lakehouse
  • @data-engineering-storage-formats
  • @data-engineering-storage-remote-access
  • @data-engineering-storage-authentication

Decision checklist (apply in order)

  1. Data contract
    • Required columns/types?
    • Nullability and key uniqueness?
  2. Layer semantics
    • Bronze immutable?
    • Silver deduplicated/validated?
    • Gold business-ready aggregates?
  3. Write mode
    • Append, partition overwrite, or merge?
  4. Layout
    • Partition keys + target file size set?
  5. Incremental logic
    • Watermark/checkpoint strategy defined?
  6. Evolution policy
    • Additive-only by default?
  7. Operational controls
    • Tests + observability + retention + backfill process?

Core standards

1) Layering (Medallion)

  • Bronze: raw immutable ingestion; append-only
  • Silver: cleaned, validated, conformed schema
  • Gold: consumption-specific marts/features/aggregates

Do not skip Silver validation for convenience; silent quality drift is costly.

2) Write semantics

Operation Use when Notes
Append strictly new immutable events simplest, cheapest
Partition overwrite deterministic reprocessing for date/key slice safe for backfills
Merge/Upsert corrections/late updates/deletes needs key + conflict semantics

3) Partitioning

Good partition keys:

  • Frequently filtered dimensions (often date + low/moderate-cardinality dimension)

Avoid:

  • High-cardinality keys (e.g., user_id)
  • Over-partitioning creating tiny files

4) File sizing

Target file size: ~256MB–1GB (workload-dependent).

  • Too small → metadata/listing overhead + slow scans
  • Too large → poor parallelism and skewed processing

5) Schema evolution

Default policy:

  • ✅ additive changes first (new nullable columns)
  • ⚠️ type widening only when compatibility is clear
  • ❌ destructive rename/drop in-place for shared production tables

6) Incremental processing

  • Persist watermark/checkpoint externally
  • Make re-runs idempotent
  • Include late-arriving data strategy (lag window/backfill)

7) Quality and reliability

Minimum controls:

  • Required columns + types
  • Primary key uniqueness (or dedupe policy)
  • Null thresholds on critical fields
  • Freshness/SLA checks
  • Run-level metrics (rows in/out, failures, latency)

Anti-patterns (reject in review)

  • Full table overwrite for small incremental changes
  • No checkpoint/watermark for recurring pipeline
  • Unbounded tiny-file generation
  • Dynamic SQL built from user values without parameter binding
  • Production credentials in code/config committed to repo
  • No backfill plan / no rollback strategy

Minimal production blueprint

  1. Ingest raw to Bronze (append-only)
  2. Validate + standardize to Silver
  3. Build Gold outputs
  4. Emit metrics + quality report
  5. Persist checkpoint/watermark
  6. Apply lifecycle rules + periodic compaction/maintenance

Progressive disclosure (read next as needed)

  • best-practices-detailed.md — comprehensive deep-dive examples
  • @data-engineering-core/patterns/incremental.md — incremental loading patterns
  • @data-engineering-storage-lakehouse — Delta/Iceberg/Hudi-specific behavior
  • @data-engineering-quality — validation framework implementation
  • @data-engineering-observability — metrics/tracing/alerting

References

Weekly Installs
4
First Seen
13 days ago
Installed on
cline4
github-copilot4
codex4
kimi-cli4
gemini-cli4
cursor4