turbo-pipelines
Turbo Pipeline Configuration & Architecture
YAML configuration reference and architecture guide for Turbo pipelines. For interactive pipeline building, use /turbo-builder. For troubleshooting, use /turbo-doctor. For transform implementation, use /turbo-transforms.
CRITICAL: Always validate YAML with `goldsky turbo validate <file.yaml>` before showing complete pipeline YAML to the user or deploying.
Quick Start
```yaml
name: my-first-pipeline
resource_size: s
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest
transforms: {}
sinks:
  output:
    type: blackhole
    from: transfers
```

```sh
goldsky turbo validate pipeline.yaml   # Validate first
goldsky turbo apply pipeline.yaml -i   # Deploy + inspect
```
Prerequisites
- Goldsky CLI — `curl https://goldsky.com | sh`
- Turbo extension (separate binary) — `curl https://install-turbo.goldsky.com | sh`
- Logged in — `goldsky login`
- Secrets created for sinks if using PostgreSQL, ClickHouse, Kafka, etc. (see /secrets)
Architecture Decisions
Source Type Selection
| Scenario | Source Type | Why |
|---|---|---|
| Decode contract events from logs | dataset | Need raw_logs + _gs_log_decode() |
| Track token transfers | dataset | erc20_transfers has structured data |
| Historical backfill + live | dataset | start_at: earliest processes history |
| Live token balances | kafka | latest_balances_v2 is a streaming topic |
| Real-time state snapshots | kafka | Kafka delivers latest state continuously |
| Only need new data going forward | Either | Dataset with start_at: latest or Kafka |
Note: Kafka sources are used in production but are not documented in official Goldsky docs. Contact Goldsky support for topic names.
Data Flow Patterns
| Pattern | When to Use | Template |
|---|---|---|
| Linear | Single source, single destination, simple processing | templates/linear-pipeline.yaml |
| Fan-out | One source → multiple sinks (different views/subsets) | templates/fan-out-pipeline.yaml |
| Fan-in | Multiple event types → UNION ALL → one table | templates/fan-in-pipeline.yaml |
| Multi-chain | Same logic across chains (separate pipelines) | templates/multi-chain-templated.yaml |
For detailed pattern diagrams, YAML examples, and multi-chain deployment guidance, read references/architecture-patterns.md.
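A minimal fan-in sketch for orientation before the full reference; the second dataset name, the shared column list, and the pipeline/transform names are illustrative assumptions (verify dataset names with /datasets):

```yaml
name: activity-feed-fan-in              # hypothetical pipeline name
resource_size: m
sources:
  erc20:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest
  erc721:
    type: dataset
    dataset_name: base.erc721_transfers  # assumed name; verify with /datasets
    version: 1.2.0
    start_at: latest
transforms:
  combined:                              # UNION ALL merges both streams into one shape
    type: sql
    primary_key: id
    sql: |
      SELECT id, block_number, 'erc20'  AS kind FROM erc20
      UNION ALL
      SELECT id, block_number, 'erc721' AS kind FROM erc721
sinks:
  output:
    type: blackhole                      # placeholder; swap for a real sink
    from: combined
```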
Resource Sizing
| Size | Workers | CPU | Memory | When to Use |
|---|---|---|---|---|
| xs | — | 0.4 | 0.5 Gi | Small datasets, light testing |
| s | 1 | 0.8 | 1.0 Gi | Simple filters, single source/sink, low volume (default) |
| m | 4 | 1.6 | 2.0 Gi | Multiple sinks, Kafka streaming, moderate transform complexity |
| l | 10 | 3.2 | 4.0 Gi | Multi-event decoding + UNION ALL, high-volume backfill |
| xl | 20 | 6.4 | 8.0 Gi | Large chain backfills, complex JOINs |
| xxl | 40 | 12.8 | 16.0 Gi | Highest throughput; up to 6.3M rows/min |
Start small and scale up as needed; conservative sizing avoids wasted resources.
Sink Selection
| Destination | Sink Type | Best For |
|---|---|---|
| Application DB | postgres | Row-level lookups, joins, application serving |
| Real-time aggregates | postgres_aggregate | Balances, counters, running totals via triggers |
| Analytics queries | clickhouse | Large-scale aggregations, time-series data |
| Event processing | kafka | Downstream consumers, event-driven systems |
| GCP messaging | pubsub | Google Cloud Pub/Sub topics (Turbo-only) |
| Serverless streaming | s2_sink | S2.dev streams, alternative to Kafka |
| Notifications | webhook | Lambda functions, API callbacks, alerts |
| Data lake | s3_sink | Long-term archival, batch processing |
| Testing | blackhole | Validate pipeline without writing data |
For full sink field specifications, read references/sink-reference.md.
Streaming vs Job Mode
| Scenario | Mode | Why |
|---|---|---|
| Real-time dashboard | Streaming | Continuous updates needed |
| Backfill 6 months of history | Job | One-time, stops when done |
| Real-time + catch-up on deploy | Streaming | start_at: earliest does backfill then streams |
| Export data to S3 once | Job | No need for continuous processing |
| Webhook notifications on events | Streaming | Needs to react as events happen |
| Load test with historical data | Job | Process and inspect, then discard |
Job mode rules: Runs to completion, auto-deletes ~1hr after finishing. Must delete before redeploying. Cannot pause/resume/restart.
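Job mode pairs naturally with a bounded source. A sketch of a one-time backfill, assuming an illustrative end block and a placeholder sink:

```yaml
name: transfers-backfill-job      # hypothetical name
resource_size: l                  # backfills generally warrant a larger size
job: true                         # runs to completion, then auto-deletes ~1hr later
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest            # start from the beginning of history
    end_block: 10000000           # illustrative bound; stop here instead of streaming
sinks:
  archive:
    type: blackhole               # placeholder; swap for s3_sink, postgres, etc.
    from: transfers
```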
Pipeline Splitting
Default to one pipeline with multiple sinks when the user wants the same source data sent to multiple destinations. A single Turbo pipeline supports multiple sinks natively — each sink has a from: field pointing at a source or transform by name, and sinks run independently (one failing does not block the others, and each can have its own batch_size / batch_flush_interval).
Do NOT split into separate pipelines just because there are multiple destinations. Generating one pipeline per sink duplicates the source ingestion, wastes resources, and decouples deployments that should be atomic.
Split into multiple pipelines only when: sources are fundamentally different (different chains with independent lifecycles), the destinations need different resource sizes, or the user explicitly asks for separate pipelines.
See references/architecture-patterns.md for fan-out (one source → multiple sinks) and fan-in (UNION ALL) YAML examples, and templates/multi-sink-pipeline.yaml / templates/fan-out-pipeline.yaml / templates/fan-in-pipeline.yaml for ready-to-adapt configs.
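A minimal one-pipeline, two-sink sketch of the fan-out approach described above (table, secret, and transform names are placeholders):

```yaml
name: transfers-fan-out
resource_size: m                        # multiple sinks usually warrant m or larger
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest
transforms:
  large_only:                           # filtered subset for the analytics sink
    type: sql
    primary_key: id
    sql: SELECT * FROM transfers WHERE amount > 1000000
sinks:
  app_db:                               # full stream to the application database
    type: postgres
    from: transfers
    schema: public
    table: all_transfers
    secret_name: MY_POSTGRES_SECRET
    primary_key: id
  analytics:                            # subset to ClickHouse; fails independently of app_db
    type: clickhouse
    from: large_only
    table: large_transfers
    secret_name: MY_CLICKHOUSE_SECRET
    primary_key: id
```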
Configuration Reference
Pipeline Structure
```yaml
name: my-pipeline        # Required: unique identifier (lowercase, hyphens)
resource_size: s         # Required: xs/s/m/l/xl/xxl
job: true                # Optional: one-time batch (default: false = streaming)

sources:
  source_name:
    type: dataset        # or: kafka
    # ... source config

transforms:              # Optional
  transform_name:
    type: sql            # or: script, handler, dynamic_table
    # ... transform config

sinks:
  sink_name:
    type: postgres       # or: clickhouse, kafka, webhook, s3_sink, etc.
    # ... sink config
```
Source Configuration
Dataset Source
```yaml
sources:
  my_source:
    type: dataset
    dataset_name: <chain>.<dataset_type>
    version: <version>
    start_at: latest | earliest    # EVM chains
    # start_block: <slot_number>   # Solana only
    # end_block: <block_number>    # Optional: for bounded backfills
    # filter: >-                   # Optional: SQL WHERE for source-level pre-filtering
    #   address = '0x...' AND block_number >= 10000000
```
| Field | Required | Description |
|---|---|---|
| type | Yes | dataset for blockchain data |
| dataset_name | Yes | Format: <chain>.<dataset_type> |
| version | Yes | Dataset version (e.g., 1.2.0) |
| start_at | EVM | latest or earliest |
| start_block | Solana | Specific slot number (omit for latest) |
| end_block | No | Stop at this block (for bounded backfills) |
| filter | No | SQL WHERE clause — pre-filters at ingestion (efficient) |
Use filter for contract addresses and block ranges (coarse pre-filtering). Use transform WHERE for fine-grained filtering.
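A sketch of that split, reusing the filter shown above: the source filter narrows ingestion to one contract, and the transform does the finer shaping (the address is a placeholder):

```yaml
sources:
  usdc_transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest
    filter: >-                    # coarse: only rows for one contract are ingested
      address = '0x...'           # placeholder contract address
transforms:
  large_transfers:
    type: sql
    primary_key: id
    sql: |
      SELECT id, sender, recipient, amount   -- fine-grained projection and filter
      FROM usdc_transfers
      WHERE amount > 1000
```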
For chain prefixes and dataset types, see /datasets.
Kafka Source
```yaml
sources:
  my_source:
    type: kafka
    topic: base.raw.latest_balances_v2
```
No start_at or version fields. Optional: filter, include_metadata, starting_offsets.
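A sketch with those optional fields filled in; the values shown (and the column used in the filter) are illustrative assumptions, not documented defaults:

```yaml
sources:
  balances:
    type: kafka
    topic: base.raw.latest_balances_v2
    filter: >-                    # assumed column name, for illustration only
      balance > 0
    include_metadata: true        # assumed value
    starting_offsets: earliest    # assumed value
```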
Transform Configuration
| Type | Use Case |
|---|---|
| sql | Filtering, projections, SQL functions |
| script | Custom TypeScript/WASM logic |
| handler | Call external HTTP APIs to enrich |
| dynamic_table | Lookup tables backed by a database |
| throttle | Cap throughput at a fixed records-per-second rate |
SQL Transform
```yaml
transforms:
  filtered:
    type: sql
    primary_key: id
    sql: |
      SELECT id, sender, recipient, amount
      FROM source_name
      WHERE amount > 1000
```
| Field | Required | Description |
|---|---|---|
| type | Yes | sql |
| primary_key | Yes | Column for uniqueness/ordering |
| sql | Yes | SQL query (reference sources by name) |
| from | No | Override default source (for chaining) |
Transform Chaining
Chain transforms using from:
```yaml
transforms:
  step1:
    type: sql
    primary_key: id
    sql: SELECT * FROM source WHERE amount > 100
  step2:
    type: sql
    primary_key: id
    from: step1
    sql: SELECT *, 'processed' as status FROM step1
```
For TypeScript, handler, and dynamic table transforms, see /turbo-transforms.
Throttle Transform
Caps the throughput of a stream by buffering records into batches and emitting each batch on a fixed minimum interval. Throttle does not modify data — every input record passes through unchanged. Use it to stay under rate limits of downstream sinks or external APIs, smooth bursty sources, or pace records into HTTP handlers.
```yaml
transforms:
  throttled:
    type: throttle
    from: my_source
    max_batch_size: 100
    min_batch_interval: 10s
```
| Field | Required | Description |
|---|---|---|
| type | Yes | throttle |
| from | Yes | Source or transform to read from |
| max_batch_size | No | Max records per batch |
| min_batch_interval | No | Minimum time between batches (e.g. 10s, 500ms, 1m) |
Effective max throughput ≈ max_batch_size / min_batch_interval records per second. Throttle limits the maximum rate, not the minimum — if upstream is slow, batches will be smaller and arrive less frequently.
Place throttle close to the bottleneck (just before the rate-limited sink or handler) so upstream transforms still process at full speed.
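For example, max_batch_size: 100 with min_batch_interval: 10s works out to roughly 10 records per second. A placement sketch, with the rate-limited sink left as a placeholder:

```yaml
transforms:
  filtered:                        # heavy work runs upstream at full speed
    type: sql
    primary_key: id
    sql: SELECT * FROM my_source WHERE amount > 1000
  throttled:                       # pacing happens last, just before the slow destination
    type: throttle
    from: filtered
    max_batch_size: 100            # ~10 records/sec with the interval below
    min_batch_interval: 10s
sinks:
  rate_limited:
    type: blackhole                # placeholder; substitute the real webhook/handler sink
    from: throttled
```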
Sink Configuration
Quick examples for common sinks. For full field specs of all sink types, read references/sink-reference.md.
PostgreSQL
```yaml
sinks:
  output:
    type: postgres
    from: my_transform
    schema: public
    table: my_table
    secret_name: MY_POSTGRES_SECRET
    primary_key: id
```
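Per-sink batching (the batch_size / batch_flush_interval fields mentioned under Pipeline Splitting) is set on the sink itself; a sketch with assumed values:

```yaml
sinks:
  output:
    type: postgres
    from: my_transform
    schema: public
    table: my_table
    secret_name: MY_POSTGRES_SECRET
    primary_key: id
    batch_size: 1000              # illustrative value: rows buffered per write
    batch_flush_interval: 5s      # illustrative value: flush even if the batch is not full
```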
ClickHouse
```yaml
sinks:
  output:
    type: clickhouse
    from: my_transform
    table: my_table
    secret_name: MY_CLICKHOUSE_SECRET
    primary_key: id
```
Pub/Sub (Turbo-only)
```yaml
sinks:
  output:
    type: pubsub
    from: my_transform
    topic: my-topic
    secret_name: MY_PUBSUB_SECRET
```
The secret holds a GCP project id and service-account JSON. Topic must already exist in GCP.
Blackhole (Testing)
```yaml
sinks:
  output:
    type: blackhole
    from: my_transform
```
Checkpoint Behavior
- Preserved by default when updating a pipeline
- Tied to source names — renaming a source resets its checkpoint
- Tied to pipeline names — renaming the pipeline resets all checkpoints
To reset checkpoints: rename the source or pipeline. Warning: this reprocesses all historical data.
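A before/after sketch of a deliberate reset: renaming only the source key drops its checkpoint while the dataset configuration stays the same.

```yaml
# Before: checkpoint is tied to the source name "transfers"
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest

# After: renaming the key to "transfers_v2" discards the old checkpoint,
# so the next apply reprocesses from the beginning
sources:
  transfers_v2:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest
```

Any transform or sink that references the old name in from: must be updated to the new name in the same apply.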
Starter Templates
| Template | Description | Use Case |
|---|---|---|
| minimal-erc20-blackhole.yaml | Simplest pipeline, no credentials | Quick testing |
| filtered-transfers-sql.yaml | Filter by contract address | USDC, specific tokens |
| postgres-output.yaml | Write to PostgreSQL | Production data storage |
| pubsub-output.yaml | Write to Google Cloud Pub/Sub | GCP-based event consumers |
| multi-chain-pipeline.yaml | Combine multiple chains | Cross-chain analytics |
| solana-transfers.yaml | Solana SPL tokens | Non-EVM chains |
| multi-sink-pipeline.yaml | Multiple outputs | Archive + alerts + streaming |
| linear-pipeline.yaml | Simple decode → filter → sink | Basic linear flow |
| fan-out-pipeline.yaml | One source → multiple sinks | Multi-destination |
| fan-in-pipeline.yaml | Multiple events → UNION ALL | Activity feeds |
| multi-chain-templated.yaml | Per-chain pipeline pattern | Independent chain deploys |
Template location: templates/ (relative to this skill's directory)
CLI Quick Reference
| Action | Command |
|---|---|
| Install Goldsky CLI | curl https://goldsky.com \| sh |
| Install Turbo extension | curl https://install-turbo.goldsky.com \| sh |
| Validate (REQUIRED) | goldsky turbo validate pipeline.yaml |
| Deploy/Update | goldsky turbo apply pipeline.yaml |
| Deploy + Inspect | goldsky turbo apply pipeline.yaml -i |
| List pipelines | goldsky turbo list |
| List datasets | goldsky dataset list (slow, 30-60s) |
| List secrets | goldsky secret list |
For lifecycle commands (pause/resume/restart/delete) and monitoring (inspect/logs), see /turbo-operations.
Troubleshooting
See references/troubleshooting.md for CLI hanging, validation errors, and runtime errors.
Related
- /turbo-builder — Interactive wizard to build pipelines step-by-step
- /turbo-doctor — Diagnose and fix pipeline issues
- /turbo-operations — Lifecycle commands and monitoring reference
- /turbo-transforms — SQL, TypeScript, and dynamic table transform reference
- /datasets — Dataset names and chain prefixes
- /secrets — Sink credential management