skills/legout/data-platform-agent-skills/data-engineering-storage-remote-access-integrations-iceberg

data-engineering-storage-remote-access-integrations-iceberg

SKILL.md

Apache Iceberg with Cloud Storage

Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.

Installation

pip install pyiceberg[pyarrow,pandas,aws]  # AWS backend
# or
pip install pyiceberg[pyarrow,rest]       # REST catalog

Catalog Configuration

AWS Glue Catalog

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",        # Optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",
    }
)

Credentials are read from environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or IAM roles by default. Pass explicitly only when necessary.

REST Catalog (Tabular, custom REST service)

catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    }
)

Hive Metastore

catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    }
)

Local Development (No Catalog)

from pyiceberg.catalog import InMemoryCatalog

catalog = InMemoryCatalog("local")
# Tables stored in ~/.pyiceberg/ by default (local file-based catalog)

Table Operations

# Load existing table
table = catalog.load_table("db.my_table")

# Scan with filter pushdown
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp")
)
df = scan.to_pandas()  # or .to_arrow(), .to_polars()

# Append data
import pyarrow as pa
new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024]
})
table.append(new_data)

# Overwrite (replaces entire table)
table.overwrite(new_data)

Schema Evolution

# Add column (non-breaking)
with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)

# Upgrade column type (e.g., int → long)
with table.update_schema() as update:
    update.upgrade_column("population", IntegerType(), required=False)

Cloud Storage Authentication

See @data-engineering-storage-authentication for:

  • AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, IAM roles
  • GCS: GOOGLE_APPLICATION_CREDENTIALS
  • Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY

PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.

Best Practices

  1. Use a catalog - Never manage Iceberg tables without catalog metadata
  2. Leverage partition evolution - Change partition specs without rewriting data
  3. Archive old snapshots - Run expire_snapshots() to limit metadata growth
  4. Schema evolution over schema enforcement - Iceberg is designed for evolving schemas
  5. ⚠️ Monitor table metadata size - Large histories slow operations
  6. ⚠️ Don't use local filesystem for production - Use a shared catalog (Glue, Hive, REST)

Performance

  • Predicate pushdown: Use row_filter in scan() to skip irrelevant files
  • Column pruning: Use selected_fields to read only needed columns
  • Batch operations: Append multiple records at once for better throughput
  • PyArrow backend: Use PyArrow tables (not pandas) for zero-copy operations

Related Skills

  • @data-engineering-storage-lakehouse/iceberg.md - Iceberg concepts and detailed API
  • @data-engineering-storage-lakehouse - Delta Lake vs Iceberg comparison
  • @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem for direct S3/GCS access

References

Weekly Installs
6
First Seen
Feb 11, 2026
Installed on
pi6
mcpjam4
claude-code4
junie4
windsurf4
zencoder4