data-engineering-catalogs

Installation

SKILL.md

Data Catalogs

Comprehensive guide to data catalog systems: purpose, Iceberg catalog implementations (Hive Metastore, AWS Glue, Tabular), using DuckDB as a lightweight multi-source catalog, and comparisons of open-source catalog tools (Amundsen, DataHub, OpenMetadata). Learn selection criteria, setup patterns, and best practices for data discovery, governance, and unified querying.

Why Catalogs Matter

Data catalogs are centralized metadata repositories that enable:

Data discovery: Find datasets by name, schema, owner, tags
Governance: Access control, data lineage, PII tagging
Schema management: Track table schemas, partitions, evolution over time
Table format abstraction: Iceberg/Delta/Hudi tables registered in catalog can be queried by multiple engines (Spark, Trino, Flink, DuckDB) without knowing underlying storage URIs
Multi-engine consistency: Same table name across Spark/DuckDB/Trino

Without a catalog, you must manage table locations and schemas manually in each engine.

Iceberg Catalogs

Apache Iceberg requires a catalog to store table metadata (schema, location, snapshots, partition specs). The catalog name → table identifier → storage location mapping.

Catalog Types

Catalog	Backend	Managed?	Best For
Hive Metastore	RDBMS (Postgres/MySQL)	Self-operated	Existing Hadoop installations, high partition counts
AWS Glue	AWS-managed (serverless)	AWS-managed	AWS-native stacks, Athena/EMR
Tabular	SaaS (Nessie-backed)	Tabular-managed	Iceberg-native, Git-like branching
Custom REST	Any HTTP service	Self-built	Custom auth, multi-cloud

Hive Metastore

The traditional, battle-tested catalog used by Hadoop/Hive for over a decade. Stores table metadata in a relational database.

Setup:

# Install Hive Metastore (Docker quickstart)
docker run -p 9083:9083 -e METASTORE_DB_TYPE=postgresql \
  -e METASTORE_DB_URL=jdbc:postgresql://localhost/metastore \
  apache/hive:4.0

# Initialize metastore schema (run once)
schematool -dbType postgresql -initSchema

PyIceberg configuration:

from pyiceberg.catalog import Catalog
from pyiceberg.catalog.hive import HiveCatalog

catalog = HiveCatalog(
    name="my_hive",
    uri="thrift://localhost:9083",
    warehouse="s3://my-bucket/warehouse/",
    properties={
        "postgresql.connection": "jdbc:postgresql://localhost/metastore",
        "postgresql.user": "hive",
        "postgresql.password": "hive"
    }
)

# Create table
catalog.create_table(
    identifier=("my_db", "my_table"),
    schema=table_schema,
    location="s3://my-bucket/my_table/"
)

Pros:

Mature, battle-tested (10+ years)
Handles 500k+ partitions with proper DB tuning
Works with any Hadoop ecosystem (Spark, Presto, Hive, Flink)
No vendor lock-in

Cons:

Self-managed (DB ops, backups, HA)
Limited Iceberg-specific features (no branching/namespace)
Thrift protocol (legacy)

AWS Glue Data Catalog

Fully managed catalog by AWS, serverless, Hive-compatible API.

Setup:

from pyiceberg.catalog.glue import GlueCatalog

catalog = GlueCatalog(
    name="my_glue",
    region="us-east-1",
    warehouse="s3://my-bucket/warehouse/"
)

# Tables automatically appear in AWS Glue console

Unity Catalog Federation (Databricks):

# Databricks can read AWS Glue tables via federation
catalog = GlueCatalog(
    name="aws_glue_fed",
    region="us-east-1"
)
# Spark on Databricks queries glue catalog

Pros:

Serverless, no operations
Integrated with Athena, EMR, Redshift Spectrum
Fine-grained IAM permissions via Lake Formation
Works with Delta Lake, Iceberg, Hudi

Cons:

Performance degrades >10k partitions (known issue)
Limited metadata operations (slow operations)
AWS lock-in

Tabular (Iceberg-Native SaaS)

Tabular is a managed catalog service built for Iceberg with Nessie-like versioning (branching, tags, time travel on metadata).

from pyiceberg.catalog.tabular import TabularCatalog

catalog = TabularCatalog(
    name="tabular",
    warehouse="s3://my-bucket/warehouse/",
    token="tabular-token-..."
)

# Git-like operations
catalog.create_table("my_table", schema)
catalog.create_branch("dev")  # Branch for development
catalog.set_current_branch("dev")
# Later merge to main

Pros:

Native Iceberg features (branching, tags, atomic multi-table ops)
No ops overhead
Git-like workflow for datasets

Cons:

Commercial SaaS (cost)
Newer ecosystem

Using DuckDB as a Multi-Source Catalog

DuckDB does NOT provide a production data catalog service (no REST API, limited concurrency). However, you can use it as a lightweight embedded catalog to unify queries across multiple heterogeneous sources using the ATTACH statement. This pattern is suitable for:

Single-user analytics notebooks
PoC/mVP data platforms
Local dev environments
Small teams (< 10 users)

Architecture

DuckDB Process
    ├── ATTACH 'postgres://...' AS postgres
    ├── ATTACH 's3://lakehouse/delta/' AS delta_uc (TYPE unity_catalog)
    ├── ATTACH 's3://lakehouse/iceberg/' AS iceberg
    ├── ATTACH 'ducklake:data/catalog.ducklake' AS ducklake
    └── CREATE VIEW unified_dataset AS
          SELECT 'postgres' AS source, * FROM postgres.public.orders
          UNION ALL
          SELECT 'delta' AS source, * FROM delta.default.orders
          UNION ALL
          SELECT 'iceberg' AS source, * FROM iceberg.db.orders

Attaching PostgreSQL

Use Postgres as a metadata storage backend (DuckLake) or query external Postgres tables:

-- Attach external Postgres database
ATTACH 'postgres://user:pass@host:5432/mydb' AS pg_db;

-- Query Postgres tables
SELECT * FROM pg_db.public.orders LIMIT 10;

DuckLake pattern: Attach a Postgres database as a DuckLake catalog:

-- Install and load DuckLake extension
INSTALL ducklake;
LOAD ducklake;

-- Attach Postgres as DuckLake catalog
ATTACH 'postgres://user:pass@host:5432/lakehouse_catalog' AS my_lakehouse (
    DATA_PATH 's3://my-bucket/lakehouse/'
);

-- Create Delta/Iceberg tables managed by DuckLake
CREATE TABLE my_lakehouse.silver.orders (
    order_id BIGINT,
    amount DOUBLE,
    order_date DATE
) USING DELTA;  -- Or USING ICEBERG

Benefits: Single SQL catalog with ACID transactions, time travel via Postgres WAL.

Attaching Delta Lake (Unity Catalog)

Use the Unity Catalog extension (experimental, works without Databricks):

INSTALL unity_catalog;
LOAD unity_catalog;

-- Create a Unity Catalog connection
CREATE SECRET uc_cred (
    TYPE 'aws',
    REGION 'us-east-1',
    ACCESS_KEY_ID 'AKIA...',
    SECRET_ACCESS_KEY '...'
);

ATTACH 'my_uc' AS uc (
    TYPE unity_catalog,
    ENDPOINT 'https://api.uc.tabular.io',
    SECRET 'uc_cred'
);

-- Query Delta tables
SELECT * FROM uc.my_db.my_delta_table;

Note: This is experimental and may require Tabular's hosted Unity Catalog. For local Delta tables, use delta_scan() directly:

INSTALL delta;
LOAD delta;

SELECT * FROM delta_scan('s3://bucket/delta_table/');

Attaching Iceberg

Use Iceberg extension or read via REST catalog:

from duckdb import DuckDBPyConnection

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")

# Attach Iceberg catalog
con.execute("""
ATTACH 'iceberg_catalog' (
    TYPE iceberg,
    CATALOG 'hive',
    URI 'thrift://localhost:9083',
    WAREHOUSE 's3://bucket/warehouse/'
);
""")

# Query
df = con.execute("SELECT * FROM iceberg.db.my_table").df()

Unified Multi-Source View

Create a virtual view that unions data from all sources:

CREATE VIEW unified_orders AS
SELECT
    'postgres' AS source_system,
    o.order_id,
    o.amount,
    o.order_date,
    NULL AS metadata
FROM pg_db.public.orders o
UNION ALL
SELECT
    'delta' AS source_system,
    o.order_id,
    o.amount,
    o.order_date,
    o._metadata
FROM uc.production.orders o
UNION ALL
SELECT
    'iceberg' AS source_system,
    o.order_id,
    o.amount,
    o.order_date,
    o._metadata
FROM iceberg.analytics.orders o;

Now query the unified view:

SELECT source_system, COUNT(*) FROM unified_orders GROUP BY source_system;

Use case: Cross-platform migration validation (ensure counts match between Postgres source and Delta target).

Limitations of DuckDB-as-Catalog

Limitation	Impact	Workaround
No REST API	Only in-process access; no multi-service sharing	Run query gateway (FastAPI) that wraps DuckDB (but single point of failure)
Write lock	Only one writer at a time (file-based DB)	Use Postgres as DuckLake backend for multi-client
No fine-grained auth	All queries share same DB credentials	Use separate DuckDB files per user (not shared)
Scalability	Metadata fits in memory; works for ≤ 100k tables	Large enterprises need dedicated catalog service (Glue, Hive, Tabular)
Availability	Single file corruption = total loss	Back up catalog DB frequently

Conclusion: Use DuckDB for single-user/developer catalog or small team PoC. For production multi-user data platforms, use AWS Glue (AWS), Hive Metastore (self-hosted), or Tabular (Iceberg-native).

Open Source Catalogs Comparison (2024)

Amundsen (Lyft)

Focus: Data discovery, lightweight

Metadata store: Neo4j (graph) + MySQL (text search)
Lineage: Limited (no built-in column-level)
Search: Simple keyword search, easy to use
Governance: Minimal
Deployment: Easiest (fewer services)
Best for: Small teams needing basic discovery
Status: Slower development post-2023 acquisition

DataHub (LinkedIn)

Focus: Scalable, enterprise-grade

Metadata store: MySQL (RDBMS) + Kafka (streaming updates)
Lineage: Full end-to-end (auto-ingested from Spark, Flink, etc.)
Search: Advanced faceted search, elastic
Governance: Strong (PII tagging, access policies)
Deployment: Complex (requires multiple services)
Best for: Large enterprises ( LinkedIn-scale)
Extensibility: High (many metadata source connectors)

OpenMetadata

Focus: Unified governance & quality

Metadata store: Postgres + Elasticsearch/OpenSearch
Lineage: Built-in, column-level
Search: Good
Governance: Excellent (workflows, tasks, approvals, data quality tests integration)
Deployment: Moderate (needs Postgres, OpenSearch, Airflow for ingestion)
Best for: Teams needing strong governance + quality testing
Modern UI: Cleanest interface

Comparison Summary

Tool	Discovery	Lineage	Governance	Scale	Ops Complexity
Amundsen	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐	Small	Low
DataHub	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	Enterprise	High
OpenMetadata	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	Medium-Large	Moderate

Recommendation:

Start with OpenMetadata for new projects (governance + quality focus)
Use DataHub if you need massive scale and lineage automation
Avoid Amundsen for new deployments (stale)

Catalog Selection Matrix for Lakehouse

Scenario	Recommended Catalog
AWS-native (Athena, Redshift)	AWS Glue
Azure-native (Synapse, Fabric)	Azure Data Catalog / Unity Catalog
Self-hosted Hadoop/Spark	Hive Metastore
Iceberg-first, multi-cloud	Tabular or Hive Metastore
Small team, PoC	DuckDB + DuckLake (single file)
Governance-heavy	OpenMetadata (with separate metastore for Iceberg/Delta)
LinkedIn-scale	DataHub

Setting Up a Multi-Source Catalog (DuckDB Pattern)

Use this concise flow in the main skill:

Create a local DuckDB catalog DB and load extensions (httpfs, postgres, delta, iceberg, ducklake)
Attach each source system (Postgres, Delta, Iceberg)
Build a unified view over all sources
Run validation queries across sources

For complete runnable examples, see:

duckdb-multisource.md

Best Practices

Catalog Selection

Default to Hive Metastore if you have existing Hadoop investments.
Use AWS Glue for pure AWS stacks (Athena, EMR).
Choose Tabular for Iceberg-first with branching needs.
Keep separate Iceberg catalog from business metadata catalog (OpenMetadata) - they serve different purposes.

DuckDB Multi-Source Pattern

Use for development only - Not production multi-user
Store catalog DuckDB file in version control (encrypted if credentials)
Separate credentials from catalog - use environment variables or secret manager
Read-only attaches for source systems - prevent accidental writes
Back up DuckDB regularly if using as primary catalog

References

@data-engineering-storage-lakehouse - Delta Lake, Iceberg table formats
@data-engineering-best-practices - Medallion architecture, dataset lifecycle, partitioning, schema evolution
duckdb-multisource.md - Step-by-step DuckDB multi-source catalog setup
Apache Iceberg Catalog Documentation
PyIceberg Catalog API
DuckDB ATTACH Documentation
DuckLake Documentation
Unity Catalog Extension
OpenMetadata vs DataHub vs Amundsen

Related skills

More from legout/data-platform-agent-skills

Installs

Repository

legout/data-pla…t-skills

First Seen

Feb 11, 2026

Security Audits

Gen Agent Trust HubFail

SocketPass

SnykFail

data-engineering-catalogs

Data Catalogs

Why Catalogs Matter

Iceberg Catalogs

Catalog Types

Hive Metastore

AWS Glue Data Catalog

Tabular (Iceberg-Native SaaS)

Using DuckDB as a Multi-Source Catalog

Architecture

Attaching PostgreSQL

Attaching Delta Lake (Unity Catalog)

Attaching Iceberg

Unified Multi-Source View

Limitations of DuckDB-as-Catalog

Open Source Catalogs Comparison (2024)

Amundsen (Lyft)

DataHub (LinkedIn)

OpenMetadata

Comparison Summary

Catalog Selection Matrix for Lakehouse

Setting Up a Multi-Source Catalog (DuckDB Pattern)

Best Practices

Catalog Selection

DuckDB Multi-Source Pattern

References

More from legout/data-platform-agent-skills

data-science-eda

data-science-visualization

data-engineering-core

data-science-feature-engineering

data-science-notebooks

data-engineering-best-practices