# Managing Data Catalogs
Guide to selecting, configuring, and using data catalog systems for data discovery, governance, and unified table access across multiple query engines.
## When to Use This Skill
Use this skill when:
- Setting up an Iceberg catalog (Hive Metastore, AWS Glue, REST/Tabular)
- Comparing catalog options for your data platform
- Using DuckDB to unify queries across heterogeneous data sources
- Evaluating open-source metadata discovery tools (DataHub, OpenMetadata)
- Configuring cross-service table access (Athena, EMR, Spark, DuckDB)
Do not use this skill for:
- Table format selection (Delta vs Iceberg) → use @designing-data-storage
- Cloud storage authentication patterns → use @accessing-cloud-storage
- Raw storage I/O without catalog abstraction → use @accessing-cloud-storage
- General ETL pipeline patterns → use @building-data-pipelines
## Quick Catalog Type Comparison
| Catalog | Backend | Managed? | Best For |
|---|---|---|---|
| Hive Metastore | RDBMS (Postgres/MySQL) | Self-hosted | Existing Hadoop, high partition counts |
| AWS Glue | AWS-managed serverless | AWS-managed | AWS-native stacks (Athena, EMR) |
| Tabular/REST | SaaS or self-hosted (e.g. Nessie) | Vendor- or self-managed | Iceberg-native, Git-like branching |
| DuckDB (embedded) | Local file/Postgres | Self-hosted | Single-user, PoC, small teams |
## When to Use Which Catalog
| Scenario | Recommended Catalog | Why |
|---|---|---|
| AWS-native (Athena, Redshift Spectrum) | AWS Glue | Serverless, IAM integration |
| Self-hosted Hadoop/Spark | Hive Metastore | Battle-tested, no vendor lock-in |
| Iceberg-first, multi-cloud | Tabular or Hive | Native Iceberg features or flexibility |
| Small team, PoC, analytics | DuckDB | Zero infrastructure, SQL-native |
| LinkedIn-scale metadata | DataHub | Enterprise lineage, scale |
| Governance-heavy workflows | OpenMetadata | Built-in workflows, data quality |
## Core Patterns

### Catalog → Table → Storage Mapping

```
Catalog (name) → Table identifier → Metadata location → Data files
                                    (schema, snapshots, partitions)
```
Key insight: The catalog stores only metadata pointers. Actual data lives in object storage (S3, GCS, Azure).
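This pointer chain can be illustrated with a plain dictionary — a toy stand-in for a catalog, not a real API:

```python
# Toy illustration: conceptually, a catalog is a mapping from table identifier
# to the location of that table's current metadata file. (Path is made up.)
catalog_entries = {
    "db.events": "s3://bucket/warehouse/db/events/metadata/00003-abc.metadata.json",
}

# Resolving a table reads only this pointer; schema, snapshots, and the list of
# data files all live in the metadata tree on object storage, not in the catalog.
metadata_location = catalog_entries["db.events"]
```

Because engines share nothing but this pointer, any engine that can read the metadata file can query the table.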
### Multi-Engine Access Pattern

```python
# Same table, different engines

# PyIceberg → Pandas DataFrame
table = catalog.load_table("db.events")
df = table.scan().to_pandas()

# Spark SQL (via the same catalog)
spark.table("db.events")  # same underlying data
```

```sql
-- DuckDB (via ATTACH or a REST catalog)
SELECT * FROM iceberg.db.events;
```
## Detailed Guides
| Guide | Covers | When to Read |
|---|---|---|
| Hive Metastore | Docker deployment, PyIceberg integration, pros/cons | Self-hosting Hadoop/Spark |
| AWS Glue Catalog | GlueCatalog setup, crawlers, IAM, Unity Catalog federation | AWS-native stacks |
| REST Catalog & Tabular | Tabular SaaS, Nessie patterns, Git-like branching | Iceberg-first, multi-cloud |
| DuckDB Multi-Source | ATTACH patterns, unified views, limitations | Single-user/PoC catalog |
| Open Source Tools | Amundsen vs DataHub vs OpenMetadata comparison | Metadata discovery/governance |
## Quick Reference: PyIceberg Catalog Setup
```python
from pyiceberg.catalog import load_catalog

# Hive Metastore
catalog = load_catalog("hive", **{
    "type": "hive",
    "uri": "thrift://localhost:9083",
    "warehouse": "s3://bucket/warehouse/",
})

# AWS Glue
catalog = load_catalog("glue", **{
    "type": "glue",
    "region": "us-east-1",
    "warehouse": "s3://bucket/warehouse/",
})

# REST/Tabular
catalog = load_catalog("rest", **{
    "type": "rest",
    "uri": "https://api.tabular.io/ws/...",
    "token": "tabular-token-...",
    "warehouse": "s3://bucket/warehouse/",
})

# Create and query a table (schema is a pyiceberg.schema.Schema)
table = catalog.create_table("db.events", schema=schema)
table.append(data)
df = table.scan().to_pandas()
```
## Best Practices

### Catalog Selection
- Default to Hive Metastore if you have Hadoop investments
- Use AWS Glue for pure AWS (Athena, EMR, Redshift Spectrum)
- Choose Tabular for Iceberg-native with branching needs
- Separate concerns: Iceberg catalog (table storage) ≠ business metadata catalog (OpenMetadata for discovery)
### DuckDB-as-Catalog (Development Only)
- Use for single-user notebooks or small team PoC
- Store catalog file in version control (encrypt credentials)
- Use Postgres as DuckLake backend for multi-client writes
- Back up regularly if using as primary catalog
### Security
- Never hardcode credentials in catalog configs
- Use IAM roles (AWS), service principals (Azure), or workload identity (GCP)
- Store tokens/secrets in environment variables or secret managers
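In practice this means the catalog config reads secrets from the environment at load time. A minimal sketch — `CATALOG_URI` and `CATALOG_TOKEN` are hypothetical variable names, injected by a secret manager or CI, never committed:

```python
import os

# Hypothetical env var names; the fallback URI is a local placeholder.
rest_config = {
    "type": "rest",
    "uri": os.environ.get("CATALOG_URI", "https://localhost:8181"),
    "token": os.environ.get("CATALOG_TOKEN"),  # None if unset; fail fast in real code
    "warehouse": "s3://bucket/warehouse/",
}
# catalog = load_catalog("rest", **rest_config)  # pyiceberg, as in the quick reference
```

On AWS, prefer dropping the token entirely and letting the Glue catalog pick up IAM role credentials from the runtime environment.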
## See Also

- @designing-data-storage: Delta Lake, Iceberg table formats, file format selection
- @accessing-cloud-storage: fsspec, pyarrow.fs, obstore for storage access
- @building-data-pipelines: ETL patterns using catalog-registered tables