data-engineering-storage-remote-access-integrations-iceberg
Apache Iceberg with Cloud Storage
Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.
Installation
pip install "pyiceberg[pyarrow,pandas,glue,s3fs]"  # AWS backend (Glue catalog + S3)
# or
pip install "pyiceberg[pyarrow,pandas]"            # REST catalog (no catalog-specific extra needed)
Catalog Configuration
AWS Glue Catalog
from pyiceberg.catalog import load_catalog
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",  # Optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",
    }
)
Credentials are read from environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or IAM roles by default. Pass explicitly only when necessary.
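If the table does not exist yet, it can be created through the same catalog object. A minimal sketch, assuming the db namespace and the column names used in the examples further down:
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, LongType, NestedField

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="value", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="year", field_type=LongType(), required=False),
)
catalog.create_namespace("db")  # raises if the namespace already exists
table = catalog.create_table("db.my_table", schema=schema)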
REST Catalog (Tabular, custom REST service)
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    }
)
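Whichever backend is used, the returned catalog exposes the same interface, which makes it easy to verify connectivity; a quick sketch (output values are illustrative):
print(catalog.list_namespaces())  # e.g. [("db",)]
print(catalog.list_tables("db"))  # e.g. [("db", "my_table")]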
Hive Metastore
catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    }
)
Local Development (No Catalog)
from pyiceberg.catalog import InMemoryCatalog
catalog = InMemoryCatalog("local")
# Catalog state lives in memory only (nothing persists across runs); PyIceberg also reads catalog configuration from ~/.pyiceberg.yaml
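For local development where tables should survive process restarts, a SQLite-backed SqlCatalog is an alternative; a minimal sketch, assuming the sql-sqlite extra is installed and /tmp paths are acceptable:
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "local",
    **{
        "uri": "sqlite:////tmp/pyiceberg_catalog.db",  # where catalog entries live
        "warehouse": "file:///tmp/warehouse",          # where table data and metadata files are written
    }
)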
Table Operations
# Load existing table
table = catalog.load_table("db.my_table")
# Scan with filter pushdown
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp")
)
df = scan.to_pandas() # or .to_arrow(), .to_polars()
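# Filters can also be built from expression objects instead of filter strings.
# (A sketch: EqualTo/And come from pyiceberg.expressions; field names match the example above.)
from pyiceberg.expressions import And, EqualTo
scan = table.scan(
    row_filter=And(EqualTo("year", 2024), EqualTo("country", "USA")),
    selected_fields=("id", "value", "timestamp")
)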
# Append data
import pyarrow as pa
new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024]
})
table.append(new_data)
# Overwrite (replaces entire table)
table.overwrite(new_data)
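Every append or overwrite commits a new snapshot, and earlier snapshots stay readable until they are expired. A short sketch of listing the history and reading an older snapshot (attribute and parameter names follow PyIceberg's table API; IDs are whatever the table contains):
# List the table's snapshot history
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

# Time travel: read the table as of a specific snapshot
old_df = table.scan(snapshot_id=entry.snapshot_id).to_pandas()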
Schema Evolution
from pyiceberg.types import LongType, StringType

# Add column (non-breaking)
with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)

# Widen a column type (e.g., int → long)
with table.update_schema() as update:
    update.update_column("population", LongType(), required=False)
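Renames and deletions go through the same context manager; a brief sketch (column names are illustrative):
with table.update_schema() as update:
    update.rename_column("value", "amount")  # metadata-only change; no data rewrite
    update.delete_column("country")          # removes the column from the current schema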
Cloud Storage Authentication
See @data-engineering-storage-authentication for:
- AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, IAM roles
- GCS: GOOGLE_APPLICATION_CREDENTIALS
- Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY
PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.
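For non-standard setups, the same pattern used for S3 above applies to the other clouds: pass provider-specific FileIO properties to load_catalog. A sketch assuming PyIceberg's gcs.* and adls.* property names (all values are placeholders):
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        # GCS
        "gcs.project-id": "my-gcp-project",
        # Azure (ADLS Gen2 / Blob)
        "adls.account-name": "mystorageaccount",
        "adls.account-key": "...",
    }
)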
Best Practices
- ✅ Use a catalog - Never manage Iceberg tables without catalog metadata
- ✅ Leverage partition evolution - Change partition specs without rewriting data
- ✅ Archive old snapshots - Run expire_snapshots() to limit metadata growth
- ✅ Schema evolution over schema enforcement - Iceberg is designed for evolving schemas
- ⚠️ Monitor table metadata size - Large histories slow operations (see the sketch after this list)
- ⚠️ Don't use local filesystem for production - Use a shared catalog (Glue, Hive, REST)
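A rough way to see how much history a table has accumulated before deciding to expire snapshots; a sketch using PyIceberg's table metadata model:
meta = table.metadata
print("snapshots:", len(meta.snapshots))
print("schemas:", len(meta.schemas))
print("current snapshot:", meta.current_snapshot_id)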
Performance
- ✅ Predicate pushdown: Use row_filter in scan() to skip irrelevant files (see the sketch after this list)
- ✅ Column pruning: Use selected_fields to read only needed columns
- ✅ Batch operations: Append multiple records at once for better throughput
- ✅ PyArrow backend: Use PyArrow tables (not pandas) for zero-copy operations
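To confirm that a filter is actually pruning files, compare what the scan plans to read with and without it; a sketch using the scan's file-planning API (file paths depend on the table):
all_tasks = list(table.scan().plan_files())
filtered_tasks = list(table.scan(row_filter="year = 2024").plan_files())
print(f"{len(filtered_tasks)} of {len(all_tasks)} data files match the filter")
for task in filtered_tasks[:5]:
    print(task.file.file_path)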
Related Skills
- @data-engineering-storage-lakehouse/iceberg.md - Iceberg concepts and detailed API
- @data-engineering-storage-lakehouse - Delta Lake vs Iceberg comparison
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem for direct S3/GCS access