Data Engineering

Use this skill when the user is building or fixing a data platform, analytics stack, or warehouse-backed reporting workflow.

What this skill covers

  • Reasoning through the full data engineering lifecycle (generation, ingestion, storage, transformation, serving) and the six undercurrents (security, data management, DataOps, data architecture, orchestration, software engineering)
  • Calibrating architecture complexity to the organization's data maturity stage
  • Designing dbt-style staging/intermediate/mart layers with explicit grain and update patterns
  • Picking data models for analytics workloads (Kimball, Inmon, Data Vault, wide tables) with concrete trade-offs
  • Defining metrics before building dashboards or features, using a four-tier hierarchy and six-step decision framework
  • Choosing serving patterns: BI, embedded analytics, operational analytics, reverse ETL, ML feature serving
  • Balancing centralized execution with domain ownership ("data mesh lite")

Boundaries

  • Use jimmy-skills@data-pipeline-reliability when the main problem is retries, duplicates, backfills, replay behavior, overwrite vs merge, or ordered delivery.
  • Use jimmy-skills@data-architecture-strategy when the main problem is choosing between RDW, MDW, lakehouse, data fabric, or data mesh.
  • Use jimmy-skills@data-quality when the main problem is publish gates, contracts, reconciliation, or schema drift.
  • Use jimmy-skills@data-observability when the main problem is freshness, lag, skew, stale dashboards, or SLA alerting.
  • Use jimmy-skills@data-stack-delivery when the main question is how Airflow, Snowflake, dbt, Spark, Kafka, and delivery automation fit together in practice.

Working approach

  1. Start from the decision the user needs to enable, not from the tool.
  2. Identify the lifecycle stage involved: generation, ingestion, storage, transformation, or serving.
  3. Prefer boring architecture:
    • Batch before streaming unless latency requirements are explicit
    • ELT before bespoke ETL when a warehouse can handle transforms
    • Centralized warehouse + domain-reviewed definitions before pure data mesh
  4. Make grain, ownership, and freshness explicit before writing pipelines.
  5. Treat every mart, metric, or dashboard as a data product with an owner, SLA, and quality checks.
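
One way to make steps 4 and 5 concrete is a small contract kept next to the model code. A minimal sketch, assuming a plain Python dataclass; the DataProduct shape and all field names are illustrative, not from any specific framework:

    from dataclasses import dataclass, field

    @dataclass
    class DataProduct:
        """Hypothetical contract for a mart, metric, or dashboard."""
        name: str
        owner: str                # team accountable for correctness
        grain: str                # what one row represents
        freshness_sla_hours: int  # maximum acceptable staleness
        quality_checks: list[str] = field(default_factory=list)

    orders_mart = DataProduct(
        name="mart_finance_orders",
        owner="finance-analytics",
        grain="one row per order per snapshot date",
        freshness_sla_hours=24,
        quality_checks=["not_null(order_id)", "unique(order_id, snapshot_date)"],
    )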

Default procedure

  1. Define the business question or decision first.
  2. Write down the grain, freshness target, and owner.
  3. Choose the simplest ingestion mode that satisfies the latency requirement.
  4. Keep source-shaped data in raw, reusable business entities in intermediate, and team-facing outputs in marts.
  5. Define shared metrics once before building dashboards, alerts, or feature logic (see the registry sketch after this list).
  6. Validate that the serving layer matches the consumer:
    • BI for reviews and planning
    • Embedded analytics for product experiences
    • Operational analytics for fast response
    • Reverse ETL for action in external tools

Default architecture bias

  • raw captures source-shaped data with minimal logic
  • intermediate expresses reusable business entities and joins
  • marts serve a specific team or decision
  • Shared dimensions and semantic metrics are defined once and reused
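
A compact sketch of this layering, with each layer's transform kept as a SQL string in Python; every table and column name below is invented for illustration:

    # Illustrative raw -> intermediate -> mart flow; all names are hypothetical.
    LAYERS = {
        # raw: source-shaped, minimal logic (renames and casts only)
        "raw_orders": """
            SELECT id AS order_id, user_id, amount_cents, created_at
            FROM source_db.orders
        """,
        # intermediate: reusable business entity, joins allowed
        "int_orders_enriched": """
            SELECT o.order_id, o.user_id, o.created_at, u.plan,
                   o.amount_cents / 100.0 AS amount
            FROM raw_orders o JOIN raw_users u USING (user_id)
        """,
        # mart: serves one team and one decision; grain is one row per plan per day
        "mart_revenue_daily": """
            SELECT plan, DATE(created_at) AS day, SUM(amount) AS revenue
            FROM int_orders_enriched
            GROUP BY plan, DATE(created_at)
        """,
    }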

Heuristics

  • If the question is "should this be real-time?", challenge it. Most analytics use cases should start with hourly or daily batch.
  • If analytics queries are hitting an OLTP database, move them to an OLAP store.
  • If metrics differ across teams, add a metrics/semantic layer before adding more dashboards.
  • If the organization is small or mid-sized, borrow domain ownership ideas without adopting a full data mesh.
  • If the user is in edtech or B2C learning, prioritize completion, retention, engagement, and content-quality feedback loops.

Gotchas

  • A successful pipeline run is not the same as a trustworthy dataset. Quality and freshness checks still need to pass (a minimal gate is sketched after this list).
  • Do not design marts before the row grain is explicit. Most reporting errors start there.
  • Do not let each dashboard redefine core metrics such as active_user, completion_rate, or churn_rate.
  • For small and mid-sized teams, "data mesh" usually means domain review of definitions plus centralized platform execution, not decentralized infrastructure.
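
A minimal sketch of that first gotcha as a publish gate: block downstream consumers unless the mart is both fresh and non-empty. The function name, arguments, and 24-hour default are assumptions:

    import datetime as dt

    def is_trustworthy(max_loaded_at: dt.datetime, row_count: int,
                       sla_hours: int = 24) -> bool:
        """Gate publishing on freshness and volume, not just run success.

        max_loaded_at must be timezone-aware (UTC).
        """
        age = dt.datetime.now(dt.timezone.utc) - max_loaded_at
        return age <= dt.timedelta(hours=sla_hours) and row_count > 0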
