Apache Iceberg with Cloud Storage
Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.
Installation
pip install "pyiceberg[pyarrow,pandas,glue]" # AWS Glue catalog with PyArrow file I/O
# or
pip install "pyiceberg[pyarrow]" # REST catalog (REST support ships with the core package)
Catalog Configuration
AWS Glue Catalog
from pyiceberg.catalog import load_catalog
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...", # Optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",
    }
)
Credentials are read from environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or IAM roles by default. Pass explicitly only when necessary.
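The "pass explicitly only when necessary" rule can be captured in a small helper. This is a minimal sketch: `glue_catalog_properties` is a hypothetical name, not part of PyIceberg, and it only builds the properties dict used in the Glue example above.

```python
from typing import Optional


def glue_catalog_properties(
    region: str,
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
) -> dict[str, str]:
    """Build Glue catalog properties, adding static keys only when both are given.

    Hypothetical helper for illustration; the property names come from the
    Glue example above.
    """
    props = {"type": "glue", "s3.region": region}
    if access_key_id and secret_access_key:
        # Explicit credentials: only for local development or non-standard setups.
        props["s3.access-key-id"] = access_key_id
        props["s3.secret-access-key"] = secret_access_key
    elif access_key_id or secret_access_key:
        # Half a key pair is almost certainly a configuration mistake.
        raise ValueError("Provide both access_key_id and secret_access_key, or neither")
    return props
```

The result can be passed straight through, for example `load_catalog("glue", **glue_catalog_properties("us-east-1"))`, which leaves credential lookup to the environment or the IAM role.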
REST Catalog (Tabular, custom REST service)
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    }
)
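Catalog properties such as the ones above can also be supplied through environment variables instead of being hardcoded. The sketch below assumes PyIceberg's documented naming convention (a `PYICEBERG_CATALOG__<NAME>__` prefix, `__` for nesting, `_` in place of `-`); `catalog_env_var` is a hypothetical helper, and the convention is worth checking against the configuration docs for your PyIceberg version.

```python
def catalog_env_var(catalog_name: str, property_key: str) -> str:
    """Return the environment variable name assumed to map to a catalog property.

    Hypothetical helper: dots become double underscores, dashes become single
    underscores, and everything is upper-cased.
    """

    def encode(part: str) -> str:
        return part.replace("-", "_").upper()

    key = "__".join(encode(part) for part in property_key.split("."))
    return f"PYICEBERG_CATALOG__{encode(catalog_name)}__{key}"


if __name__ == "__main__":
    for prop in ("uri", "s3.endpoint", "s3.access-key-id", "s3.secret-access-key"):
        print(catalog_env_var("rest", prop))
```

With those variables exported, `load_catalog("rest")` should need no inline secrets.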
Hive Metastore
catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    }
)
Local Development (In-Memory Catalog)
from pyiceberg.catalog.memory import InMemoryCatalog
catalog = InMemoryCatalog("local")
# Catalog metadata is held in memory and is lost when the process exits; use for tests and demos only
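When local tables need to survive between runs, one option is a SQLite-backed catalog. The sketch below only builds the properties. `local_sql_catalog_properties` and the `pyiceberg_catalog.db` file name are illustrative assumptions. It also assumes an absolute POSIX path to an existing directory and SQL catalog support being installed (the `sql-sqlite` extra).

```python
def local_sql_catalog_properties(warehouse_dir: str) -> dict[str, str]:
    """Build properties for a SQLite-backed catalog rooted at warehouse_dir.

    Hypothetical helper; warehouse_dir is assumed to be an absolute POSIX path
    to a directory that already exists.
    """
    root = warehouse_dir.rstrip("/")
    return {
        "type": "sql",
        # SQLite file that stores the catalog's table pointers.
        "uri": f"sqlite:///{root}/pyiceberg_catalog.db",
        # Directory where table data and metadata files are written.
        "warehouse": f"file://{root}",
    }
```

Usage would look like `load_catalog("local", **local_sql_catalog_properties("/tmp/warehouse"))`. This remains a development setup, not a shared production catalog.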
Table Operations
# Load existing table
table = catalog.load_table("db.my_table")
# Scan with filter pushdown
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp")
)
df = scan.to_pandas() # or .to_arrow(), .to_polars()
# Append data
import pyarrow as pa
new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024]
})
table.append(new_data)
# Overwrite (replaces entire table)
table.overwrite(new_data)
Schema Evolution
from pyiceberg.types import LongType, StringType
# Add column (non-breaking)
with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)
# Promote column type (e.g., int → long)
with table.update_schema() as update:
    update.update_column("population", field_type=LongType())
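Only widening type changes are safe to apply in place. The sketch below encodes the promotions permitted by the Iceberg table spec (v2) as I read it: int to long, float to double, and a decimal gaining precision at the same scale. `is_safe_promotion` and the string type names are illustrative assumptions, not a PyIceberg API.

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")

# Primitive widenings assumed from the Iceberg v2 spec.
_WIDENINGS = {("int", "long"), ("float", "double")}


def is_safe_promotion(current: str, target: str) -> bool:
    """Return True if changing a column from current to target only widens it."""
    if current == target:
        return True  # No change at all is trivially safe.
    if (current, target) in _WIDENINGS:
        return True
    cur, tgt = _DECIMAL.fullmatch(current), _DECIMAL.fullmatch(target)
    if cur and tgt:
        same_scale = cur.group(2) == tgt.group(2)
        wider_precision = int(tgt.group(1)) > int(cur.group(1))
        return same_scale and wider_precision
    return False
```

A check like this can run before opening `update_schema()`, so an unsupported change fails early with a clear message.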
Cloud Storage Authentication
See @data-engineering-storage-authentication for:
- AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, IAM roles
- GCS: GOOGLE_APPLICATION_CREDENTIALS
- Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY
PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.
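Before relying on automatic detection, it can help to check which of these variables are actually set. A minimal sketch using only the standard library. `detect_cloud_credentials` is a hypothetical helper, and the variable names are the ones listed above.

```python
import os
from typing import Mapping

# Environment variables listed above, grouped by provider.
REQUIRED_VARS = {
    "aws": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "gcs": ("GOOGLE_APPLICATION_CREDENTIALS",),
    "azure": ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY"),
}


def detect_cloud_credentials(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report, per provider, whether every listed variable is set and non-empty.

    False for "aws" does not mean access will fail: an IAM role can still
    supply credentials without any of these variables.
    """
    return {
        provider: all(env.get(name) for name in names)
        for provider, names in REQUIRED_VARS.items()
    }


if __name__ == "__main__":
    for provider, present in detect_cloud_credentials().items():
        print(f"{provider}: {'set' if present else 'not set'}")
```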
Best Practices
- ✅ Use a catalog - Never manage Iceberg tables without catalog metadata
- ✅ Leverage partition evolution - Change partition specs without rewriting data
- ✅ Archive old snapshots - Run expire_snapshots() to limit metadata growth
- ✅ Schema evolution over schema enforcement - Iceberg is designed for evolving schemas
- ⚠️ Monitor table metadata size - Large histories slow operations
- ⚠️ Don't use local filesystem for production - Use a shared catalog (Glue, Hive, REST)
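The snapshot-expiry practice above boils down to a retention rule: drop snapshots older than a cutoff, but always keep the most recent few. The sketch below models only that selection logic in plain Python. `Snapshot` and `snapshots_to_expire` are hypothetical names and do not call any PyIceberg API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int


def snapshots_to_expire(
    snapshots: Iterable[Snapshot], older_than_ms: int, retain_last: int = 1
) -> list[int]:
    """Return IDs of snapshots older than the cutoff, newest first.

    The newest `retain_last` snapshots are never returned, so the table
    always keeps some history to read from and roll back to.
    """
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    protected = {s.snapshot_id for s in newest_first[:retain_last]}
    return [
        s.snapshot_id
        for s in newest_first
        if s.timestamp_ms < older_than_ms and s.snapshot_id not in protected
    ]
```

Keeping the rule in a pure function makes it easy to test before wiring it to whatever maintenance call your engine or PyIceberg version provides.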
Performance
- ✅ Predicate pushdown: Use row_filter in scan() to skip irrelevant files
- ✅ Column pruning: Use selected_fields to read only needed columns
- ✅ Batch operations: Append multiple records at once for better throughput
- ✅ PyArrow backend: Use PyArrow tables (not pandas) for zero-copy operations
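Predicate pushdown skips files by comparing the filter against per-file column statistics. The sketch below is a simplified model of that idea for a `year = <value>` filter. `DataFile` and `files_matching_year` are hypothetical names, and real Iceberg manifests track bounds for many columns, not just one.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataFile:
    path: str
    year_min: int  # Lower bound of the "year" column in this file.
    year_max: int  # Upper bound of the "year" column in this file.


def files_matching_year(files: Iterable[DataFile], year: int) -> list[str]:
    """Return paths of files whose [min, max] range could contain `year`.

    Files outside the range are skipped without being opened, which is the
    saving that a row filter makes possible.
    """
    return [f.path for f in files if f.year_min <= year <= f.year_max]
```

A file that passes this check may still hold no matching rows. The bounds only rule files out, so the row-level filter is still applied while reading.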
Related Skills
- @data-engineering-storage-lakehouse/iceberg.md - Iceberg concepts and detailed API
- @data-engineering-storage-lakehouse - Delta Lake vs Iceberg comparison
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem for direct S3/GCS access
Repository: legout/data-pla…t-skills
First Seen: Feb 11, 2026