data-engineering-storage-remote-access-integrations-iceberg
Apache Iceberg with Cloud Storage
Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.
Installation
pip install "pyiceberg[pyarrow,pandas,glue,s3fs]"  # AWS backend (Glue catalog + S3)
# or
pip install "pyiceberg[pyarrow,pandas]"            # REST catalog (no catalog-specific extra needed)
Catalog Configuration
AWS Glue Catalog
from pyiceberg.catalog import load_catalog
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",  # Optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",
    }
)
Credentials are read from environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or IAM roles by default. Pass explicitly only when necessary.
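If the table does not exist yet, it can be created through the same catalog object. A minimal sketch, assuming the db namespace and the column names used in the examples further down:
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, LongType, NestedField

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="value", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="year", field_type=LongType(), required=False),
)
catalog.create_namespace("db")  # raises if the namespace already exists
table = catalog.create_table("db.my_table", schema=schema)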
REST Catalog (Tabular, custom REST service)
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    }
)
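Whichever backend is used, the returned catalog exposes the same interface, which makes it easy to verify connectivity; a quick sketch (output values are illustrative):
print(catalog.list_namespaces())  # e.g. [("db",)]
print(catalog.list_tables("db"))  # e.g. [("db", "my_table")]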
Hive Metastore
catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    }
)
Local Development (No Catalog)
from pyiceberg.catalog import InMemoryCatalog
catalog = InMemoryCatalog("local")
# Catalog state lives in memory only (nothing persists across runs); PyIceberg also reads catalog configuration from ~/.pyiceberg.yaml
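For local development where tables should survive process restarts, a SQLite-backed SqlCatalog is an alternative; a minimal sketch, assuming the sql-sqlite extra is installed and /tmp paths are acceptable:
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "local",
    **{
        "uri": "sqlite:////tmp/pyiceberg_catalog.db",  # where catalog entries live
        "warehouse": "file:///tmp/warehouse",          # where table data and metadata files are written
    }
)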
Table Operations
# Load existing table
table = catalog.load_table("db.my_table")
# Scan with filter pushdown
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp")
)
df = scan.to_pandas() # or .to_arrow(), .to_polars()
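# Filters can also be built from expression objects instead of filter strings.
# (A sketch: EqualTo/And come from pyiceberg.expressions; field names match the example above.)
from pyiceberg.expressions import And, EqualTo
scan = table.scan(
    row_filter=And(EqualTo("year", 2024), EqualTo("country", "USA")),
    selected_fields=("id", "value", "timestamp")
)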
# Append data
import pyarrow as pa
new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024]
})
table.append(new_data)
# Overwrite (replaces entire table)
table.overwrite(new_data)
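Every append or overwrite commits a new snapshot, and earlier snapshots stay readable until they are expired. A short sketch of listing the history and reading an older snapshot (attribute and parameter names follow PyIceberg's table API; IDs are whatever the table contains):
# List the table's snapshot history
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

# Time travel: read the table as of a specific snapshot
old_df = table.scan(snapshot_id=entry.snapshot_id).to_pandas()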
Schema Evolution
from pyiceberg.types import LongType, StringType

# Add column (non-breaking)
with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)

# Widen a column type (e.g., int → long)
with table.update_schema() as update:
    update.update_column("population", LongType(), required=False)
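Renames and deletions go through the same context manager; a brief sketch (column names are illustrative):
with table.update_schema() as update:
    update.rename_column("value", "amount")  # metadata-only change; no data rewrite
    update.delete_column("country")          # removes the column from the current schema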
Cloud Storage Authentication
See @data-engineering-storage-authentication for:
- AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, IAM roles
- GCS: GOOGLE_APPLICATION_CREDENTIALS
- Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY
PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.
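For non-standard setups, the same pattern used for S3 above applies to the other clouds: pass provider-specific FileIO properties to load_catalog. A sketch assuming PyIceberg's gcs.* and adls.* property names (all values are placeholders):
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        # GCS
        "gcs.project-id": "my-gcp-project",
        # Azure (ADLS Gen2 / Blob)
        "adls.account-name": "mystorageaccount",
        "adls.account-key": "...",
    }
)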
Best Practices
- ✅ Use a catalog - Never manage Iceberg tables without catalog metadata
- ✅ Leverage partition evolution - Change partition specs without rewriting data
- ✅ Archive old snapshots - Run expire_snapshots() to limit metadata growth
- ✅ Schema evolution over schema enforcement - Iceberg is designed for evolving schemas
- ⚠️ Monitor table metadata size - Large histories slow operations (see the sketch after this list)
- ⚠️ Don't use local filesystem for production - Use a shared catalog (Glue, Hive, REST)
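A rough way to see how much history a table has accumulated before deciding to expire snapshots; a sketch using PyIceberg's table metadata model:
meta = table.metadata
print("snapshots:", len(meta.snapshots))
print("schemas:", len(meta.schemas))
print("current snapshot:", meta.current_snapshot_id)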
Performance
- ✅ Predicate pushdown: Use row_filter in scan() to skip irrelevant files (see the sketch after this list)
- ✅ Column pruning: Use selected_fields to read only needed columns
- ✅ Batch operations: Append multiple records at once for better throughput
- ✅ PyArrow backend: Use PyArrow tables (not pandas) for zero-copy operations
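To confirm that a filter is actually pruning files, compare what the scan plans to read with and without it; a sketch using the scan's file-planning API (file paths depend on the table):
all_tasks = list(table.scan().plan_files())
filtered_tasks = list(table.scan(row_filter="year = 2024").plan_files())
print(f"{len(filtered_tasks)} of {len(all_tasks)} data files match the filter")
for task in filtered_tasks[:5]:
    print(task.file.file_path)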
Related Skills
- @data-engineering-storage-lakehouse/iceberg.md - Iceberg concepts and detailed API
- @data-engineering-storage-lakehouse - Delta Lake vs Iceberg comparison
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem for direct S3/GCS access