# Spice Data Connectors
Data Connectors enable federated SQL queries across databases, data warehouses, data lakes, and files. Spice connects directly to your existing data sources and provides a unified SQL interface — no ETL pipelines required. The query planner (built on Apache DataFusion) optimizes and routes queries, including filter pushdown and column projection.
## Cross-Source Federation
Query across multiple heterogeneous sources in one SQL statement:
```yaml
datasets:
  - from: postgres:customers
    name: customers
    params:
      pg_host: db.example.com
      pg_user: ${secrets:PG_USER}
  - from: s3://bucket/orders/
    name: orders
    params:
      file_format: parquet
  - from: snowflake:analytics.sales
    name: sales
```
```sql
-- Query across all three sources in one statement
SELECT c.name, o.order_total, s.region
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN sales s ON o.id = s.order_id
WHERE s.region = 'EMEA';
```
Without acceleration, each query fetches data directly from the underlying sources with optimized filter pushdown.
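To see how a query will be federated, including which filters are pushed down to each source, inspect the plan with `EXPLAIN` (standard DataFusion behavior; the exact plan output varies by connector and version):

```sql
-- Shows the logical and physical plan, including pushed-down
-- filters and projected columns for each federated source
EXPLAIN
SELECT c.name, o.order_total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.order_total > 1000;
```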
## Basic Dataset Configuration
```yaml
datasets:
  - from: <connector>:<identifier>
    name: <dataset_name>
    params:
      # connector-specific parameters
    acceleration:
      enabled: true # optional: enable local materialization
```
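Once registered, a dataset is queryable by its `name` over any of the runtime's SQL interfaces. A minimal smoke test, using the placeholder name from the manifest above:

```sql
SELECT * FROM <dataset_name> LIMIT 10;
```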
## Supported Connectors

### Databases
| Connector | From Format | Status |
|---|---|---|
| PostgreSQL | `postgres:schema.table` | Stable (also Amazon Redshift) |
| MySQL | `mysql:schema.table` | Stable |
| DuckDB | `duckdb:database.table` | Stable |
| MS SQL Server | `mssql:db.table` | Beta |
| MongoDB | `mongodb:collection` | Alpha |
| ClickHouse | `clickhouse:db.table` | Alpha |
| DynamoDB | `dynamodb:table` | Release Candidate |
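A minimal MySQL dataset, following the same shape as the PostgreSQL example below; the `mysql_*` parameter names are assumptions modeled on the `pg_*` parameters, so verify them against the MySQL connector docs:

```yaml
datasets:
  - from: mysql:shop.orders
    name: orders
    params:
      mysql_host: localhost                # assumed parameter name
      mysql_user: ${ secrets:MYSQL_USER }  # assumed parameter name
      mysql_pass: ${ secrets:MYSQL_PASS }  # assumed parameter name
```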
### Data Warehouses
| Connector | From Format | Status |
|---|---|---|
| Snowflake | `snowflake:db.schema.table` | Beta |
| Databricks (Delta Lake) | `databricks:catalog.schema.table` | Stable |
| Spark | `spark:db.table` | Beta |
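A sketch of a Snowflake dataset using the from format above. The `snowflake_*` parameter names are assumptions modeled on the other connectors' naming pattern, so check the Snowflake connector docs for the exact set:

```yaml
datasets:
  - from: snowflake:analytics.public.sales
    name: sales
    params:
      snowflake_account: myorg-account1                # assumed parameter name
      snowflake_warehouse: COMPUTE_WH                  # assumed parameter name
      snowflake_username: ${ secrets:SNOWFLAKE_USER }  # assumed parameter name
      snowflake_password: ${ secrets:SNOWFLAKE_PASS }  # assumed parameter name
```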
### Data Lakes & Object Storage
| Connector | From Format | Status |
|---|---|---|
| S3 | `s3://bucket/path/` | Stable |
| Delta Lake | `delta_lake:/path/to/delta/` | Stable |
| Iceberg | `iceberg:table` | Beta |
| Azure BlobFS | `abfs://container/path/` | Alpha |
| File (local) | `file:./path/to/data` | Stable |
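Delta Lake tables are read directly from a table path using the from format above; a minimal sketch (the path is a placeholder):

```yaml
datasets:
  - from: delta_lake:/data/events/   # path to a Delta table (placeholder)
    name: events
```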
### Other Sources
| Connector | From Format | Status |
|---|---|---|
| Spice.ai | `spice.ai:path/to/dataset` | Stable |
| Dremio | `dremio:source.table` | Stable |
| GitHub | `github:github.com/owner/repo/issues` | Stable |
| GraphQL | `graphql:endpoint` | Release Candidate |
| FlightSQL | `flightsql:query` | Beta |
| ODBC | `odbc:connection` | Beta |
| FTP/SFTP | `sftp://host/path/` | Alpha |
| HTTP/HTTPS | `https://url/path/data.csv` | Alpha |
| Kafka | `kafka:topic` | Alpha |
| Debezium CDC | `debezium:topic` | Alpha |
| SharePoint | `sharepoint:site/path` | Alpha |
| IMAP | `imap:mailbox` | Alpha |
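For quick experiments, the HTTP/HTTPS connector can point at a hosted file. A sketch assuming the `file_format` parameter applies here as it does for object stores (the URL is a placeholder):

```yaml
datasets:
  - from: https://example.com/exports/prices.csv
    name: prices
    params:
      file_format: csv   # assumed to work as with object-store connectors
```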
## Common Examples

### PostgreSQL
```yaml
datasets:
  - from: postgres:public.users
    name: users
    params:
      pg_host: localhost
      pg_port: 5432
      pg_user: ${ env:PG_USER }
      pg_pass: ${ env:PG_PASS }
    acceleration:
      enabled: true
```
### S3 with Parquet
```yaml
datasets:
  - from: s3://my-bucket/data/sales/
    name: sales
    params:
      file_format: parquet
      s3_region: us-east-1
    acceleration:
      enabled: true
      engine: duckdb
```
### GitHub Issues
```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${ secrets:GITHUB_TOKEN }
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 24h
      refresh_data_window: 14d
```
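With `refresh_mode: append`, each refresh fetches only new rows instead of re-materializing the whole table, and `refresh_data_window: 14d` scopes fetches to recent data. Queries then run against the local copy; the column names below (`title`, `state`) are assumptions about the GitHub issues schema:

```sql
-- Served from the locally accelerated copy
SELECT title, state
FROM spiceai.issues
LIMIT 10;
```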
### Local File
```yaml
datasets:
  - from: file:./data/sales.parquet
    name: sales
```
## File Formats
Connectors reading from object stores (S3, ABFS, GCS) or network storage (FTP, SFTP) support:
| Format | `file_format` | Status | Type |
|---|---|---|---|
| Apache Parquet | `parquet` | Stable | Structured |
| CSV | `csv` | Stable | Structured |
| Markdown | `md` | Stable | Document |
| Text | `txt` | Stable | Document |
| PDF | `pdf` | Alpha | Document |
| Microsoft Word | `docx` | Alpha | Document |
### Document Formats
Document files (`md`, `txt`, `pdf`, `docx`) produce a table with `location` and `content` columns:
```yaml
datasets:
  - from: file:docs/decisions/
    name: my_documents
    params:
      file_format: md
```
```sql
SELECT location, content FROM my_documents LIMIT 5;
```
## Hive Partitioning
```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```
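This works because Hive-style layouts encode column values in the directory names, and the partition keys (`year` and `month` here) surface as queryable columns. An illustrative layout:

```text
s3://bucket/data/year=2024/month=01/part-000.parquet
s3://bucket/data/year=2024/month=02/part-000.parquet
```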
## Dataset Naming
- `name: foo` creates `spice.public.foo`
- `name: myschema.foo` creates `spice.myschema.foo`
- Use `.` in the name to organize datasets into schemas, as in the sketch below
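A sketch pairing the two forms (the schema and dataset names are placeholders):

```yaml
datasets:
  - from: file:./data/sales.parquet
    name: finance.sales   # queryable as spice.finance.sales
```

```sql
SELECT * FROM finance.sales LIMIT 5;
```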