OpenDuck Distributed DuckDB

Skill by ara.so — Daily 2026 Skills collection

OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + an object store), hybrid execution (a single query split across local and remote workers), and transparent remote database attach via ATTACH 'openduck:mydb'. It is architecturally inspired by MotherDuck but built on a fully open protocol (gRPC + Arrow IPC).


Architecture Overview

DuckDB client (openduck extension)
  └─ ATTACH 'openduck:mydb?endpoint=...' AS cloud
       └─ gRPC + Arrow IPC
            └─ Gateway (Rust)
                 ├─ auth / routing / plan splitting
                 ├─ Worker 1 (embedded DuckDB)
                 └─ Worker N (embedded DuckDB)
                      ├─ Postgres (metadata)
                      └─ Object store (sealed layers)

Key concepts:

  • OpenDuckCatalog / OpenDuckTableEntry — remote tables appear as first-class DuckDB catalog entries (see the sketch after this list)
  • Hybrid execution — gateway labels operators LOCAL or REMOTE, inserts Bridge operators at boundaries
  • Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
  • Protocol — only 2 gRPC RPCs defined in proto/openduck/v1/execution.proto
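
A minimal sketch of the first point: once a database has been attached AS cloud (as shown in the connection sections below), the remote tables are visible through DuckDB's own catalog metadata:

# Assumes a connection `con` with the extension loaded and a remote database attached AS cloud
con.sql("""
    SELECT database_name, schema_name, table_name
    FROM duckdb_tables()
    WHERE database_name = 'cloud'
""").show()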

Repository Layout

crates/
  exec-gateway/     # auth, routing, hybrid plan splitting
  exec-worker/      # embedded DuckDB, Arrow IPC streaming
  exec-proto/       # protobuf/tonic codegen
  openduck-cli/     # unified CLI (serve|gateway|worker)
  diff-*/           # differential storage (layers, metadata, FUSE)

extensions/
  openduck/         # DuckDB C++ extension (StorageExtension + Catalog)

clients/
  python/           # pip-installable openduck wrapper

proto/
  openduck/v1/      # execution.proto

Installation & Build

Prerequisites

  • Rust toolchain (stable)
  • C++ build tools, vcpkg, bison (macOS: brew install bison)
  • DuckDB development headers (handled by the extension Makefile)

1. Build the Rust backend

git clone https://github.com/CITGuru/openduck
cd openduck
cargo build --workspace

2. Build the DuckDB C++ extension

cd extensions/openduck
make
# Output:
# extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

3. Install the Python client (optional)

pip install -e clients/python

Running the Server

# Required env vars
export OPENDUCK_TOKEN=your-secret-token

# Start all-in-one (gateway + worker)
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN

# Or run gateway and worker separately
cargo run -p openduck-cli -- gateway --port 7878
cargo run -p openduck-cli -- worker --gateway http://localhost:7878

Connecting from Python

Via openduck wrapper (recommended)

import openduck  # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH

con = openduck.connect("mydb")  # uses OPENDUCK_TOKEN env var
con.sql("SELECT 1 AS x").show()
con.sql("SELECT * FROM cloud.users LIMIT 10").show()

Via raw DuckDB SDK

import duckdb
import os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
# e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")
con.execute(
    "ATTACH 'openduck:mydb"
    "?endpoint=http://localhost:7878"
    f"&token={os.environ[\"OPENDUCK_TOKEN\"]}' AS cloud;"
)

# Query remote table
con.sql("SELECT * FROM cloud.users LIMIT 10").show()

# Hybrid query — local table joined with remote table
con.sql("""
    SELECT l.product_id, l.name, r.total_sales
    FROM local.products l
    JOIN cloud.sales r ON l.product_id = r.product_id
    WHERE r.total_sales > 1000
""").show()

Environment variables

export OPENDUCK_TOKEN=your-secret-token
export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
export OPENDUCK_ENDPOINT=http://localhost:7878   # default

Connecting from the CLI

duckdb -unsigned -c "
  LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension';
  ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud;
  SHOW ALL TABLES;
  SELECT * FROM cloud.users LIMIT 5;
"

Connecting from Rust

use duckdb::{Config, Connection, Result};

fn main() -> Result<()> {
    let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap();
    let token = std::env::var("OPENDUCK_TOKEN").unwrap();

    // allow_unsigned_extensions can only be set at database startup, not via SET
    let config = Config::default().allow_unsigned_extensions()?;
    let conn = Connection::open_in_memory_with_flags(config)?;

    conn.execute_batch(&format!(r#"
        LOAD '{ext_path}';
        ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud;
    "#))?;

    let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?;
    let rows = stmt.query_map([], |row| {
        Ok(row.get::<_, String>(0)?)
    })?;

    for row in rows {
        println!("{}", row?);
    }
    Ok(())
}

Hybrid Execution Pattern

Hybrid execution happens automatically — the gateway splits the logical plan:

[LOCAL]  HashJoin(l.id = r.id)
  [LOCAL]  Scan(products)       ← runs on your machine
  [LOCAL]  Bridge(REMOTE→LOCAL)
    [REMOTE] Scan(sales)        ← runs on remote worker

Write queries naturally — the extension handles routing:

# This single query runs across two engines transparently
con.sql("""
    SELECT
        p.category,
        SUM(s.amount) AS revenue
    FROM local.products p          -- local table
    JOIN cloud.sales s             -- remote table
      ON p.id = s.product_id
    GROUP BY p.category
    ORDER BY revenue DESC
""").show()

Differential Storage

Differential storage is managed server-side. Key properties:

  • Append-only sealed layers stored in object storage (S3-compatible)
  • Postgres stores layer metadata and snapshot pointers
  • Snapshot isolation — readers always see a consistent view
  • One serialized write path — many concurrent readers

From a client perspective it is fully transparent. DuckDB sees normal table semantics.
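
As a mental model only (the names below are made up, not the real Postgres schema), the layer and snapshot bookkeeping behaves roughly like this: every write seals a new immutable layer, and a snapshot is just the set of layers visible at that point, so a reader pinned to a snapshot never sees later writes.

# Toy model of differential storage bookkeeping (hypothetical, for intuition only)
layers = []     # sealed, immutable layers (object store)
snapshots = {}  # snapshot_id -> visible layer ids (Postgres metadata)

def write(rows):
    layers.append(tuple(rows))                          # seal a new immutable layer
    snapshot_id = len(snapshots)
    snapshots[snapshot_id] = list(range(len(layers)))   # new snapshot sees all layers so far
    return snapshot_id

def read(snapshot_id):
    return [row for i in snapshots[snapshot_id] for row in layers[i]]

s1 = write(["a", "b"])
s2 = write(["c"])
print(read(s1))  # ['a', 'b']       (unaffected by the later write)
print(read(s2))  # ['a', 'b', 'c']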


ATTACH URL Reference

openduck:<database_name>?endpoint=<url>&token=<token>

Parameter   Default                   Description
endpoint    http://localhost:7878     Gateway URL
token       $OPENDUCK_TOKEN env var   Auth token matching server config

Examples:

-- Local dev
ATTACH 'openduck:mydb?token=dev-token' AS cloud;

-- Remote server, explicit endpoint
ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud;

-- Alias: od: also works
ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;

DuckLake Integration

OpenDuck and DuckLake operate at different layers and complement each other:

import duckdb, os

ext_path = os.environ["OPENDUCK_EXTENSION_PATH"]
token = os.environ["OPENDUCK_TOKEN"]

con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}';")

# Attach a DuckLake catalog (managed by a remote worker backed by DuckLake)
con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;")

# Query DuckLake tables transparently via OpenDuck transport
con.sql("SELECT * FROM lh.events WHERE event_date = today()").show()

# Hybrid: local scratch data joined with remote DuckLake table
con.sql("""
    SELECT l.session_id, r.user_email
    FROM memory.sessions l
    JOIN lh.users r ON l.user_id = r.id
""").show()

Protocol Reference

The wire protocol is intentionally minimal. See proto/openduck/v1/execution.proto:

  • ExecuteQuery — send SQL, receive a query handle
  • StreamResults — stream Arrow IPC record batches back to client

Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.
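
For a feel of the result format, the sketch below round-trips a record batch through Arrow IPC bytes with pyarrow; the real batches arrive over the StreamResults gRPC stream rather than an in-memory buffer, and the column names here are made up:

# Encode and decode a batch in the Arrow IPC stream format (illustrative data)
import pyarrow as pa
import pyarrow.ipc as ipc

batch = pa.RecordBatch.from_pydict({"user_id": [1, 2, 3], "email": ["a@x.io", "b@x.io", "c@x.io"]})

sink = pa.BufferOutputStream()
with ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
ipc_bytes = sink.getvalue()          # what a worker streams back, batch by batch

reader = ipc.open_stream(ipc_bytes)  # what a client does with the received bytes
for received in reader:
    print(received.to_pydict())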


Common Patterns

Check which tables are available remotely

-- After ATTACH ... AS cloud
SHOW ALL TABLES;
SELECT table_name FROM information_schema.tables WHERE table_catalog = 'cloud';

Write to a remote table

INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');

Create a remote table from local data

CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0;
INSERT INTO cloud.new_table SELECT * FROM local_csv;

Export remote query result to local Parquet

con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \
   .write_parquet("output/us_east.parquet")

Troubleshooting

Extension is not trusted / signature error

# Always set allow_unsigned_extensions before loading
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})

Or in CLI:

duckdb -unsigned

LOAD fails — extension not found

# Set the env var to the exact built path
export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
ls -la $OPENDUCK_EXTENSION_PATH   # confirm it exists

Connection refused to gateway

# Verify server is running
cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Default port is 7878 — check firewall / port binding
curl http://localhost:7878/health

Token mismatch / auth failure

# Server token and client token must match exactly
export OPENDUCK_TOKEN=same-value-on-both-sides
# Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN
# Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;

Build fails on macOS — bison version

brew install bison
export PATH="$(brew --prefix bison)/bin:$PATH"
cd extensions/openduck && make

Extension version mismatch with DuckDB

The extension must be built against the same DuckDB version as the Python package:

python -c "import duckdb; print(duckdb.__version__)"
# Ensure the extension Makefile targets the same version
# Check extensions/openduck/Makefile for DUCKDB_VERSION

OpenDuck vs Alternatives

Feature                       OpenDuck               Arrow Flight SQL   DuckLake
Remote attach UX              ATTACH 'openduck:db'   Separate driver    ATTACH 'ducklake:...'
Hybrid execution              ✅ split plan           ❌ full remote
DuckDB catalog integration    ✅ native
Protocol RPCs                 2                      ~15                N/A
Differential storage          via Parquet layers
Self-hosted