# Spice Data Connectors
Data Connectors enable federated SQL queries across databases, data warehouses, data lakes, and files. Spice connects directly to your existing data sources and provides a unified SQL interface — no ETL pipelines required. The query planner (built on Apache DataFusion) optimizes and routes queries, including filter pushdown and column projection.
## Cross-Source Federation
Query across multiple heterogeneous sources in one SQL statement:
```yaml
datasets:
  - from: postgres:customers
    name: customers
    params:
      pg_host: db.example.com
      pg_user: ${secrets:PG_USER}
  - from: s3://bucket/orders/
    name: orders
    params:
      file_format: parquet
  - from: snowflake:analytics.sales
    name: sales
```
```sql
-- Query across all three sources in one statement
SELECT c.name, o.order_total, s.region
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN sales s ON o.id = s.order_id
WHERE s.region = 'EMEA';
```
Without acceleration, each query fetches data directly from the underlying sources with optimized filter pushdown.
## Basic Dataset Configuration
```yaml
datasets:
  - from: <connector>:<identifier>
    name: <dataset_name>
    params:
      # connector-specific parameters
    acceleration:
      enabled: true  # optional: enable local materialization
```
## Supported Connectors

### Databases
| Connector | From Format | Status |
|---|---|---|
| PostgreSQL | `postgres:schema.table` | Stable (also Amazon Redshift) |
| MySQL | `mysql:schema.table` | Stable |
| DuckDB | `duckdb:database.table` | Stable |
| MS SQL Server | `mssql:db.table` | Beta |
| MongoDB | `mongodb:collection` | Alpha |
| ClickHouse | `clickhouse:db.table` | Alpha |
| DynamoDB | `dynamodb:table` | Release Candidate |
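As a sketch, a file-backed DuckDB table might be registered like this; the table name, file path, and the `duckdb_open` parameter are illustrative assumptions, so check the DuckDB connector docs for the exact parameter names:

```yaml
datasets:
  # Hypothetical example: table `main.events` in a local DuckDB file
  - from: duckdb:main.events
    name: events
    params:
      duckdb_open: ./analytics.db  # assumption: path to the DuckDB database file
```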
### Data Warehouses

| Connector | From Format | Status |
|---|---|---|
| Snowflake | `snowflake:db.schema.table` | Beta |
| Databricks (Delta Lake) | `databricks:catalog.schema.table` | Stable |
| Spark | `spark:db.table` | Beta |
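A minimal Snowflake sketch might look like the following; the parameter names here are assumptions for illustration, not confirmed connector options, so verify them against the Snowflake connector documentation:

```yaml
datasets:
  # Hypothetical example; parameter names are assumptions
  - from: snowflake:analytics.public.sales
    name: sales
    params:
      snowflake_account: ${env:SNOWFLAKE_ACCOUNT}  # assumed parameter name
      snowflake_warehouse: COMPUTE_WH              # assumed parameter name
```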
### Data Lakes & Object Storage

| Connector | From Format | Status |
|---|---|---|
| S3 | `s3://bucket/path/` | Stable |
| Delta Lake | `delta_lake:/path/to/delta/` | Stable |
| Iceberg | `iceberg:table` | Beta |
| Azure BlobFS | `abfs://container/path/` | Alpha |
| File (local) | `file:./path/to/data` | Stable |
### Other Sources

| Connector | From Format | Status |
|---|---|---|
| Spice.ai | `spice.ai:path/to/dataset` | Stable |
| Dremio | `dremio:source.table` | Stable |
| GitHub | `github:github.com/owner/repo/issues` | Stable |
| GraphQL | `graphql:endpoint` | Release Candidate |
| FlightSQL | `flightsql:query` | Beta |
| ODBC | `odbc:connection` | Beta |
| FTP/SFTP | `sftp://host/path/` | Alpha |
| HTTP/HTTPS | `https://url/path/data.csv` | Alpha |
| Kafka | `kafka:topic` | Alpha |
| Debezium CDC | `debezium:topic` | Alpha |
| SharePoint | `sharepoint:site/path` | Alpha |
| IMAP | `imap:mailbox` | Alpha |
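For instance, a CSV published at a plain HTTPS URL can be mapped to a dataset directly; the URL below is hypothetical:

```yaml
datasets:
  - from: https://example.com/exports/prices.csv  # hypothetical URL
    name: prices
    params:
      file_format: csv
```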
## Common Examples

### PostgreSQL

```yaml
datasets:
  - from: postgres:public.users
    name: users
    params:
      pg_host: localhost
      pg_port: 5432
      pg_user: ${env:PG_USER}
      pg_pass: ${env:PG_PASS}
    acceleration:
      enabled: true
```
### S3 with Parquet

```yaml
datasets:
  - from: s3://my-bucket/data/sales/
    name: sales
    params:
      file_format: parquet
      s3_region: us-east-1
    acceleration:
      enabled: true
      engine: duckdb
```
### GitHub Issues

```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${secrets:GITHUB_TOKEN}
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 24h
      refresh_data_window: 14d
```
### Local File

```yaml
datasets:
  - from: file:./data/sales.parquet
    name: sales
```
## File Formats
Connectors reading from object stores (S3, ABFS, GCS) or network storage (FTP, SFTP) support:
| Format | `file_format` | Status | Type |
|---|---|---|---|
| Apache Parquet | `parquet` | Stable | Structured |
| CSV | `csv` | Stable | Structured |
| Markdown | `md` | Stable | Document |
| Text | `txt` | Stable | Document |
| PDF | `pdf` | Alpha | Document |
| Microsoft Word | `docx` | Alpha | Document |
### Document Formats
Document files (`md`, `txt`, `pdf`, `docx`) produce a table with `location` and `content` columns:
```yaml
datasets:
  - from: file:docs/decisions/
    name: my_documents
    params:
      file_format: md
```
```sql
SELECT location, content FROM my_documents LIMIT 5;
```
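Since each document row carries its path in `location` and its full text in `content`, ordinary SQL predicates work for simple searches; the search term below is illustrative:

```sql
-- Find decision documents that mention "deprecate"
SELECT location
FROM my_documents
WHERE content LIKE '%deprecate%';
```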
## Hive Partitioning
```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```
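Hive partitioning assumes the partition keys are encoded as `key=value` segments in the object paths. A layout like the following (hypothetical bucket and file names) would surface `year` and `month` as queryable columns that filters can prune on:

```
s3://bucket/data/year=2024/month=01/part-000.parquet
s3://bucket/data/year=2024/month=02/part-000.parquet
```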
## Dataset Naming

- `name: foo` creates `spice.public.foo`
- `name: myschema.foo` creates `spice.myschema.foo`
- Use `.` to organize datasets into schemas
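For example, a hypothetical CRM dataset could be grouped under its own schema just by dotting the name:

```yaml
datasets:
  - from: postgres:public.users
    name: crm.users  # queryable as spice.crm.users
```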