spice-connect-data
Connect to Data Sources
Spice federates SQL queries across 30+ data sources without ETL. Connect databases, data lakes, warehouses, and APIs, then query across them with standard SQL.
How Federation Works
Configure datasets pointing to different sources. Spice's query planner (built on Apache DataFusion) optimizes and routes queries with filter pushdown and column projection:
datasets:
- from: postgres:customers
name: customers
params:
pg_host: db.example.com
pg_user: ${secrets:PG_USER}
- from: s3://bucket/orders/
name: orders
params:
file_format: parquet
- from: snowflake:analytics.sales
name: sales
-- Query across all three sources in one statement
SELECT c.name, o.order_total, s.region
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN sales s ON o.id = s.order_id
WHERE s.region = 'EMEA';
Dataset Configuration
datasets:
- from: <connector>:<identifier>
name: <dataset_name>
params:
# connector-specific parameters
acceleration:
enabled: true # optional: materialize locally (see spice-acceleration)
Supported Connectors
Databases
| Connector | From Format | Status |
|---|---|---|
| PostgreSQL | postgres:schema.table |
Stable (also Amazon Redshift) |
| MySQL | mysql:schema.table |
Stable |
| DuckDB | duckdb:database.table |
Stable |
| MS SQL Server | mssql:db.table |
Beta |
| DynamoDB | dynamodb:table |
Release Candidate |
| MongoDB | mongodb:collection |
Alpha |
| ClickHouse | clickhouse:db.table |
Alpha |
Data Warehouses
| Connector | From Format | Status |
|---|---|---|
| Snowflake | snowflake:db.schema.table |
Beta |
| Databricks (Delta Lake) | databricks:catalog.schema.table |
Stable |
| Spark | spark:db.table |
Beta |
Data Lakes & Object Storage
| Connector | From Format | Status |
|---|---|---|
| S3 | s3://bucket/path/ |
Stable |
| Delta Lake | delta_lake:/path/to/delta/ |
Stable |
| Iceberg | iceberg:table |
Beta |
| Azure BlobFS | abfs://container/path/ |
Alpha |
| File (local) | file:./path/to/data |
Stable |
Other Sources
| Connector | From Format | Status |
|---|---|---|
| Spice.ai | spice.ai:path/to/dataset |
Stable |
| Dremio | dremio:source.table |
Stable |
| GitHub | github:github.com/owner/repo/issues |
Stable |
| GraphQL | graphql:endpoint |
Release Candidate |
| FlightSQL | flightsql:query |
Beta |
| ODBC | odbc:connection |
Beta |
| FTP/SFTP | sftp://host/path/ |
Alpha |
| HTTP/HTTPS | https://url/path/data.csv |
Alpha |
| Kafka | kafka:topic |
Alpha |
| Debezium CDC | debezium:topic |
Alpha |
| SharePoint | sharepoint:site/path |
Alpha |
| IMAP | imap:mailbox |
Alpha |
Common Examples
PostgreSQL
datasets:
- from: postgres:public.users
name: users
params:
pg_host: localhost
pg_port: 5432
pg_user: ${ env:PG_USER }
pg_pass: ${ env:PG_PASS }
acceleration:
enabled: true
S3 with Parquet
datasets:
- from: s3://my-bucket/data/sales/
name: sales
params:
file_format: parquet
s3_region: us-east-1
acceleration:
enabled: true
engine: duckdb
GitHub Issues
datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
params:
github_token: ${ secrets:GITHUB_TOKEN }
acceleration:
enabled: true
refresh_mode: append
refresh_check_interval: 24h
refresh_data_window: 14d
Local File
datasets:
- from: file:./data/sales.parquet
name: sales
File Formats
Connectors reading from object stores (S3, ABFS) or network storage (FTP, SFTP) support:
| Format | file_format |
Type |
|---|---|---|
| Apache Parquet | parquet |
Structured |
| CSV | csv |
Structured |
| Markdown | md |
Document |
| Text | txt |
Document |
pdf |
Document | |
| Microsoft Word | docx |
Document |
Document files produce a table with location and content columns:
datasets:
- from: file:docs/decisions/
name: my_documents
params:
file_format: md
Hive Partitioning
datasets:
- from: s3://bucket/data/
name: partitioned_data
params:
file_format: parquet
hive_partitioning_enabled: true
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
Dataset Naming
name: foo→spice.public.fooname: myschema.foo→spice.myschema.foo- Use
.to organize datasets into schemas
Catalogs
Catalog connectors expose external data catalogs, preserving the source schema hierarchy. Tables are accessed as <catalog>.<schema>.<table>.
Note: Acceleration is not supported for catalog tables. Use datasets for accelerated access.
catalogs:
- from: <connector>
name: <catalog_name>
params:
# connector-specific parameters
include:
- 'schema.*' # optional: filter with glob patterns
Supported Catalogs
| Connector | From Value | Status |
|---|---|---|
| Unity Catalog | unity_catalog |
Stable |
| Databricks | databricks |
Beta |
| Iceberg | iceberg |
Beta |
| Spice.ai | spice.ai |
Beta |
| AWS Glue | glue |
Alpha |
Catalog Example
catalogs:
- from: unity_catalog
name: unity
params:
unity_catalog_endpoint: https://my-workspace.cloud.databricks.com
databricks_token: ${ secrets:DATABRICKS_TOKEN }
include:
- 'my_schema.*'
SELECT * FROM unity.my_schema.customers LIMIT 10;
Views
Views are virtual tables defined by SQL queries — useful for pre-aggregations, transformations, and simplified access:
views:
- name: daily_sales
sql: |
SELECT DATE(created_at) as date, SUM(amount) as total, COUNT(*) as orders
FROM orders
GROUP BY DATE(created_at)
- name: order_details
sql: |
SELECT o.id, c.name as customer, p.name as product, o.quantity
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
Views can be accelerated:
views:
- name: rankings
sql: |
SELECT product_id, SUM(quantity) as total_sold
FROM orders GROUP BY product_id ORDER BY total_sold DESC LIMIT 100
acceleration:
enabled: true
refresh_check_interval: 1h
Views are read-only and queried like regular tables: SELECT * FROM daily_sales.
Writing Data
Spice supports writing to Apache Iceberg tables and Amazon S3 Tables via INSERT INTO:
datasets:
- from: iceberg:https://catalog.example.com/v1/namespaces/sales/tables/transactions
name: transactions
access: read_write # required for writes
INSERT INTO transactions SELECT * FROM staging_transactions;
Referencing Secrets
Use ${ store_name:KEY } syntax in params. See spice-secrets for full configuration:
params:
pg_user: ${ env:PG_USER }
pg_pass: ${ secrets:PG_PASSWORD }