# Managing Data Catalogs
Guide to selecting, configuring, and using data catalog systems for data discovery, governance, and unified table access across multiple query engines.
## When to Use This Skill
Use this skill when:
- Setting up an Iceberg catalog (Hive Metastore, AWS Glue, REST/Tabular)
- Comparing catalog options for your data platform
- Using DuckDB to unify queries across heterogeneous data sources
- Evaluating open-source metadata discovery tools (DataHub, OpenMetadata)
- Configuring cross-service table access (Athena, EMR, Spark, DuckDB)
Do not use this skill for:
- Table format selection (Delta vs Iceberg) → use @designing-data-storage
- Cloud storage authentication patterns → use @accessing-cloud-storage
- Raw storage I/O without catalog abstraction → use @accessing-cloud-storage
- General ETL pipeline patterns → use @building-data-pipelines
## Quick Catalog Type Comparison
| Catalog | Backend | Managed? | Best For |
|---|---|---|---|
| Hive Metastore | RDBMS (Postgres/MySQL) | Self-hosted | Existing Hadoop, high partition counts |
| AWS Glue | AWS-managed serverless | AWS-managed | AWS-native stacks (Athena, EMR) |
| Tabular/REST | SaaS or self-hosted (e.g. Nessie) | Vendor- or self-managed | Iceberg-native, Git-like branching |
| DuckDB (embedded) | Local file/Postgres | Self-hosted | Single-user, PoC, small teams |
## When to Use Which Catalog
| Scenario | Recommended Catalog | Why |
|---|---|---|
| AWS-native (Athena, Redshift Spectrum) | AWS Glue | Serverless, IAM integration |
| Self-hosted Hadoop/Spark | Hive Metastore | Battle-tested, no vendor lock-in |
| Iceberg-first, multi-cloud | Tabular or Hive | Native Iceberg features or flexibility |
| Small team, PoC, analytics | DuckDB | Zero infrastructure, SQL-native |
| LinkedIn-scale metadata | DataHub | Enterprise lineage, scale |
| Governance-heavy workflows | OpenMetadata | Built-in workflows, data quality |
## Core Patterns

### Catalog → Table → Storage Mapping
```text
Catalog (name) → Table identifier → Metadata location → Data files
                                    (schema, snapshots, partitions)
```

**Key insight:** The catalog stores only metadata pointers. Actual data lives in object storage (S3, GCS, Azure).
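The chain above can be sketched with plain Python data structures (illustrative names only, not a real catalog API): the catalog holds nothing but a pointer per table, and everything else is resolved from object storage.

```python
from dataclasses import dataclass

# Hypothetical sketch of the catalog -> table -> storage chain.
# A real catalog (Hive, Glue, REST) stores only the metadata pointer;
# the metadata file then lists the actual data files in object storage.

@dataclass
class TableEntry:
    identifier: str         # e.g. "db.events"
    metadata_location: str  # the pointer kept by the catalog

@dataclass
class TableMetadata:
    schema: list[str]       # column names (simplified)
    snapshots: list[str]    # snapshot ids
    data_files: list[str]   # concrete files in object storage

# The "catalog" is just a name -> pointer mapping:
catalog = {
    "db.events": TableEntry(
        identifier="db.events",
        metadata_location="s3://bucket/warehouse/db/events/metadata/v3.json",
    )
}

# Resolving a table touches the catalog only for the pointer:
entry = catalog["db.events"]
print(entry.metadata_location)
```

This is why any engine that can read the pointer and the metadata file can query the same table: no data passes through the catalog itself.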
### Multi-Engine Access Pattern
```python
# Same table, different engines

# PyIceberg → pandas DataFrame
table = catalog.load_table("db.events")
df = table.scan().to_pandas()

# Spark SQL (via the same catalog)
spark.table("db.events")  # same underlying data

# DuckDB (via ATTACH or a REST catalog):
#   SELECT * FROM iceberg.db.events
```
## Detailed Guides
| Guide | Covers | When to Read |
|---|---|---|
| Hive Metastore | Docker deployment, PyIceberg integration, pros/cons | Self-hosting Hadoop/Spark |
| AWS Glue Catalog | GlueCatalog setup, crawlers, IAM, Unity Catalog federation | AWS-native stacks |
| REST Catalog & Tabular | Tabular SaaS, Nessie patterns, Git-like branching | Iceberg-first, multi-cloud |
| DuckDB Multi-Source | ATTACH patterns, unified views, limitations | Single-user/PoC catalog |
| Open Source Tools | Amundsen vs DataHub vs OpenMetadata comparison | Metadata discovery/governance |
## Quick Reference: PyIceberg Catalog Setup
```python
from pyiceberg.catalog import load_catalog

# Hive Metastore
catalog = load_catalog("hive", **{
    "type": "hive",
    "uri": "thrift://localhost:9083",
    "warehouse": "s3://bucket/warehouse/",
})

# AWS Glue
catalog = load_catalog("glue", **{
    "type": "glue",
    "region": "us-east-1",
    "warehouse": "s3://bucket/warehouse/",
})

# REST/Tabular
catalog = load_catalog("rest", **{
    "type": "rest",
    "uri": "https://api.tabular.io/ws/...",
    "token": "tabular-token-...",
    "warehouse": "s3://bucket/warehouse/",
})

# Create and query a table
table = catalog.create_table("db.events", schema=schema)
table.append(data)
df = table.scan().to_pandas()
```
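PyIceberg can also pick up the same settings declaratively from a `~/.pyiceberg.yaml` file, so code reduces to `load_catalog("hive")` with no kwargs. A sketch mirroring the configs above (values illustrative):

```yaml
# ~/.pyiceberg.yaml
catalog:
  hive:
    type: hive
    uri: thrift://localhost:9083
    warehouse: s3://bucket/warehouse/
  glue:
    type: glue
    region: us-east-1
    warehouse: s3://bucket/warehouse/
```

Keeping connection details in this file (or in environment variables) rather than in code also helps with the credential-handling practices below.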
## Best Practices

### Catalog Selection
- Default to Hive Metastore if you have Hadoop investments
- Use AWS Glue for pure AWS (Athena, EMR, Redshift Spectrum)
- Choose Tabular for Iceberg-native with branching needs
- Separate concerns: Iceberg catalog (table storage) ≠ business metadata catalog (OpenMetadata for discovery)
### DuckDB-as-Catalog (Development Only)
- Use for single-user notebooks or small team PoC
- Store catalog file in version control (encrypt credentials)
- Use Postgres as DuckLake backend for multi-client writes
- Back up regularly if using as primary catalog
### Security
- Never hardcode credentials in catalog configs
- Use IAM roles (AWS), service principals (Azure), or workload identity (GCP)
- Store tokens/secrets in environment variables or secret managers
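A minimal sketch of keeping the token out of the config dict, assuming a `TABULAR_TOKEN` environment variable (a hypothetical name; any secret manager that exports env vars works the same way):

```python
import os

# TABULAR_TOKEN is a hypothetical variable name; export it from your
# secret manager rather than committing the literal value to the repo.
os.environ.setdefault("TABULAR_TOKEN", "dev-only-placeholder")  # local fallback

rest_config = {
    "type": "rest",
    "uri": "https://api.tabular.io/ws/...",
    "token": os.environ["TABULAR_TOKEN"],  # resolved at runtime, never inlined
    "warehouse": "s3://bucket/warehouse/",
}

# Then: catalog = load_catalog("rest", **rest_config)
```

The config that gets committed or logged then contains no secret material, only the name of the variable to resolve.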
## See Also
- @designing-data-storage: Delta Lake, Iceberg table formats, file format selection
- @accessing-cloud-storage: fsspec, pyarrow.fs, obstore for storage access
- @building-data-pipelines: ETL patterns using catalog-registered tables