turbo-architecture
Turbo Pipeline Architecture
Help users make architecture decisions for Turbo pipelines — source types, data flow patterns, resource sizing, sink strategies, streaming vs job mode, dynamic table design, and multi-chain deployment.
Agent Instructions
Step 1: Understand the Requirements
Ask the user about:
- What data? — Which chain(s), which events/datasets, historical or real-time only?
- Where does it go? — Database, webhook, streaming topic, multiple destinations?
- How much volume? — Single contract or all chain activity? How many events/sec?
- Latency needs? — Real-time dashboards (sub-second) or analytics (minutes OK)?
Step 2: Recommend an Architecture
Use the patterns and decision guides below to recommend a pipeline architecture. Reference the templates in templates/ as starting points:
- templates/linear-pipeline.yaml — Simple decode → filter → sink
- templates/fan-out-pipeline.yaml — One source → multiple sinks
- templates/fan-in-pipeline.yaml — Multiple events → UNION ALL → sink
- templates/multi-chain-templated.yaml — Per-chain pipeline pattern
Step 3: Hand Off to Implementation Skills
After the architecture is decided, direct the user to:
- /turbo-pipelines — To build and deploy the YAML
- /turbo-transforms — To write the SQL transforms
- /secrets — To set up sink credentials
Reminder: When presenting complete pipeline YAML as part of architecture recommendations, validate it first with goldsky turbo validate. Templates in templates/ are structural patterns — any customized version must be validated before presenting to the user.
Source Types
Dataset Sources — Historical + Real-Time
Use type: dataset when you need to process historical blockchain data and/or continue streaming new data.
sources:
  my_source:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest # or: latest
Best for: Raw logs, transactions, transfers, blocks — anything where you need historical backfill or decoded event processing.
Available EVM datasets: raw_logs, blocks, raw_transactions, raw_traces, erc20_transfers, erc721_transfers, erc1155_transfers, decoded_logs
Non-EVM chains also supported: Solana (solana.*), Bitcoin (bitcoin.raw.*), Stellar (stellar_mainnet.*)
Kafka Sources — Real-Time Streaming
Note: Kafka sources are used in production pipelines but are not documented in the official Goldsky docs. Use with caution and contact Goldsky support for topic names.
Use type: kafka when consuming from a Goldsky-managed Kafka topic, typically for continuously-updated state like balances.
sources:
  my_source:
    type: kafka
    topic: base.raw.latest_balances_v2
Best for: Balance snapshots, latest state data, high-volume continuous streams.
Key differences from dataset sources:
- No start_at or version fields
- Optional fields: filter, include_metadata, starting_offsets (illustrated in the sketch below)
- Delivers the latest state rather than historical event logs
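For reference, a sketch of a Kafka source using those optional fields. The field names come from the list above, but the source key, example values, and formats are assumptions, so confirm them with Goldsky support before relying on them:
sources:
  balances:                             # hypothetical source key
    type: kafka
    topic: base.raw.latest_balances_v2
    filter: token_type = 'ERC_20'       # assumed SQL-style row filter
    include_metadata: true              # assumed boolean flag
    starting_offsets: latest            # assumed earliest/latest semantics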
When to Use Which
| Scenario | Source Type | Why |
|---|---|---|
| Decode contract events from logs | dataset | Need raw_logs + _gs_log_decode() |
| Track token transfers | dataset | erc20_transfers has structured data |
| Historical backfill + live | dataset | start_at: earliest processes history |
| Live token balances | kafka | latest_balances_v2 is a streaming topic |
| Real-time state snapshots | kafka | Kafka delivers latest state continuously |
| Only need new data going forward | Either | Dataset with start_at: latest or Kafka |
Data Flow Patterns
Linear Pipeline
The simplest pattern. One source → one or more chained transforms → one sink.
source → transform_a → transform_b → sink
Use when: You have a single data source, single destination, and straightforward processing (decode, filter, reshape).
Example: templates/linear-pipeline.yaml — raw logs → decode → extract trade events → postgres
Resource size: s or m
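As a concrete shape, here is a minimal sketch assuming a dataset source, one SQL filter transform, and a PostgreSQL sink. The names and the exact postgres sink fields are illustrative; treat templates/linear-pipeline.yaml as the validated reference and run goldsky turbo validate on any customized version.
name: my-linear-pipeline
resource_size: s
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest
transforms:
  filtered:
    type: sql
    primary_key: id
    sql: |
      SELECT * FROM transfers
      WHERE contract_address = '0x...'   -- placeholder; column name depends on the dataset schema
sinks:
  app_db:
    type: postgres
    from: filtered
    secret_name: MY_POSTGRES   # remaining postgres sink fields (schema, table) per /turbo-pipelines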
Fan-Out (One Source → Multiple Sinks)
One source feeds multiple transforms, each writing to a different sink.
             ┌→ transform_a → sink_1 (clickhouse)
source ──────┤
             └→ transform_b → sink_2 (webhook)
Use when: You need different views or subsets of the same data going to different destinations — e.g., balances to a warehouse AND token metadata to a webhook.
Example: templates/fan-out-pipeline.yaml — one Kafka source → fungible balances to ClickHouse + all tokens to a webhook
transforms:
  fungible_balances:
    type: sql
    primary_key: id
    sql: |
      SELECT ... FROM latest_balances balance
      WHERE balance.token_type = 'ERC_20' OR balance.token_type IS NULL
  all_tokens:
    type: sql
    primary_key: id
    sql: |
      SELECT ... FROM latest_balances balance
      WHERE balance.token_type IN ('ERC_20', 'ERC_721', 'ERC_1155')
sinks:
  warehouse:
    type: clickhouse
    from: fungible_balances
    # ...
  webhook:
    type: webhook
    from: all_tokens
    # ...
Resource size: m (multiple output paths)
Fan-In (Multiple Events → One Output)
Multiple event types decoded from the same source, normalized to a common schema, then combined with UNION ALL into a single sink.
                ┌→ event_type_a ──┐
source → decode ┤                 ├→ UNION ALL → sink
                └→ event_type_b ──┘
Use when: You want a unified activity feed, combining trades, deposits, withdrawals, transfers, etc. into one table.
Example: templates/fan-in-pipeline.yaml — one raw_logs source → decode → multiple event-type transforms → UNION ALL → ClickHouse
Resource size: l (complex processing with many transforms)
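A sketch of the transform chain, assuming transforms can read from earlier transforms by name; the decode step and event names here are hypothetical (see /turbo-transforms for the real decode SQL), and each event transform normalizes to the same columns before the UNION ALL:
transforms:
  decoded:
    type: sql
    primary_key: id
    sql: |
      SELECT ... FROM raw_logs                 -- decode step (see /turbo-transforms)
  trades:
    type: sql
    primary_key: id
    sql: |
      SELECT id, 'trade' AS activity_type, ... FROM decoded WHERE ...
  deposits:
    type: sql
    primary_key: id
    sql: |
      SELECT id, 'deposit' AS activity_type, ... FROM decoded WHERE ...
  all_activity:
    type: sql
    primary_key: id
    sql: |
      SELECT * FROM trades
      UNION ALL
      SELECT * FROM deposits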
Multi-Chain Fan-In
Multiple sources from different chains combined into a single output.
source_chain_a ──┐
source_chain_b ──┼→ UNION ALL → sink
source_chain_c ──┘
Use when: You want cross-chain analytics or a unified view across chains.
sources:
  eth_transfers:
    type: dataset
    dataset_name: ethereum.erc20_transfers
    version: 1.0.0
    start_at: latest
  base_transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest
transforms:
  combined:
    type: sql
    primary_key: id
    sql: |
      SELECT *, 'ethereum' AS chain FROM eth_transfers
      UNION ALL
      SELECT *, 'base' AS chain FROM base_transfers
Resource size: m or l depending on chain count
Templated Multi-Chain Deployment
When you need the same pipeline logic across multiple chains, create separate pipeline files per chain rather than one multi-source pipeline. This gives you:
- Independent lifecycle (deploy/delete per chain)
- Independent checkpointing (one chain failing doesn't block others)
- Clearer monitoring per chain
Pattern: Copy the pipeline YAML and swap the chain-specific values:
| Field | Chain A (base) | Chain B (arbitrum) |
|---|---|---|
| name | base-balance-streaming | arbitrum-balance-streaming |
| topic | base.raw.latest_balances_v2 | arbitrum.raw.latest_balances_v2 |
| Source key | base_latest_balances_v2 | arbitrum_latest_balances_v2 |
| Transform SQL | 'base' AS chain | 'arbitrum' AS chain |
| Sink table | base_token_balances | arbitrum_token_balances |
Example: templates/multi-chain-templated.yaml — shows the base chain version; duplicate for each chain.
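For illustration, a sketch of what the arbitrum copy might look like after swapping the values in the table above (transform and sink names are hypothetical; the template file is the source of truth):
name: arbitrum-balance-streaming
resource_size: m
sources:
  arbitrum_latest_balances_v2:
    type: kafka
    topic: arbitrum.raw.latest_balances_v2
transforms:
  balances:
    type: sql
    primary_key: id
    sql: |
      SELECT *, 'arbitrum' AS chain FROM arbitrum_latest_balances_v2
sinks:
  warehouse:
    type: clickhouse
    from: balances
    # sink table arbitrum_token_balances, credentials, batching: same structure as the base file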
When to use templated vs multi-source:
| Approach | Pros | Cons |
|---|---|---|
| Templated (per-chain) | Independent lifecycle, clear monitoring | More files to manage |
| Multi-source (one file) | Single deployment, cross-chain UNION possible | Coupled lifecycle, harder to debug |
Resource Sizing
Each size doubles the previous tier's CPU and memory:
| Size | Workers | CPU Request | Memory | When to Use |
|---|---|---|---|---|
| xs | — | 0.4 | 0.5 Gi | Small datasets, light testing |
| s | 1 | 0.8 | 1.0 Gi | Testing, simple filters, single source/sink, low volume (default) |
| m | 4 | 1.6 | 2.0 Gi | Multiple sinks, Kafka streaming, moderate transform complexity |
| l | 10 | 3.2 | 4.0 Gi | Multi-event decoding with UNION ALL, high-volume historical backfill |
| xl | 20 | 6.4 | 8.0 Gi | Large chain backfills, complex JOINs (e.g. Solana accounts + transactions) |
| xxl | 40 | 12.8 | 16.0 Gi | Highest throughput needs; up to 6.3M rows/min |
Rules of thumb from production pipelines:
- Simple filter + single sink → s (default, try this first)
- Kafka source + multiple sinks OR multiple transforms → m
- Raw log decoding + 5+ event types + UNION ALL → l
- Historical backfill of high-volume data → l or xl (can downsize after catch-up)
- Start small and scale up — oversizing defensively just wastes resources (see the one-line sketch below)
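The sizing knob is a single top-level field, so scaling is an edit-and-redeploy operation (see /turbo-lifecycle for the commands); a one-line sketch:
name: dex-activities
resource_size: l   # raise for the backfill, then downsize (e.g. to m) after catch-up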
Sink Selection
Quick Reference
| Destination | Sink Type | Best For |
|---|---|---|
| Application DB | postgres | Row-level lookups, joins, application serving |
| Real-time aggregates | postgres_aggregate | Balances, counters, running totals via triggers |
| Analytics queries | clickhouse | Large-scale aggregations, time-series data |
| Event processing | kafka | Downstream consumers, event-driven systems |
| Serverless streaming | s2_sink | S2.dev streams, alternative to Kafka |
| Notifications | webhook | Lambda functions, API callbacks, alerts |
| Data lake | s3_sink | Long-term archival, batch processing |
| Testing | blackhole | Validate pipeline without writing data |
Decision Flowchart
What's your primary use case?
│
├─ Application serving (REST/GraphQL API)
│ └─ PostgreSQL ← row-level lookups, joins, strong consistency
│
├─ Analytics / dashboards
│ ├─ Time-series queries → ClickHouse ← columnar, fast aggregations
│ └─ Full-text search → Elasticsearch / OpenSearch
│
├─ Real-time aggregations (balances, counters)
│ └─ PostgreSQL Aggregate ← trigger-based running totals
│
├─ Event-driven downstream processing
│ ├─ Need Kafka ecosystem → Kafka
│ └─ Serverless / simpler → S2 (s2.dev)
│
├─ Notifications / webhooks
│ └─ Webhook ← HTTP POST per event
│
├─ Long-term archival
│ └─ S3 ← object storage, cheapest for bulk data
│
├─ Just testing
│ └─ Blackhole ← validates pipeline without writing
│
└─ Multiple of the above
└─ Use multiple sinks in the same pipeline (fan-out pattern)
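For the testing branch, a blackhole sink needs no credentials; a minimal sketch, assuming type and from are its only required fields:
sinks:
  discard:
    type: blackhole
    from: my_transform   # validates the full pipeline without writing any data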
PostgreSQL Aggregate Sink
The postgres_aggregate sink is uniquely suited for real-time running aggregations (balances, counters, totals). It uses a two-table pattern: a landing table that receives raw events, and an aggregation table maintained by a database trigger.
sinks:
  token_balances:
    type: postgres_aggregate
    from: transfers
    schema: public
    landing_table: transfer_events
    agg_table: account_balances
    primary_key: id
    secret_name: MY_POSTGRES
    group_by:
      account:
        type: text
      token_address:
        type: text
    aggregate:
      balance:
        from: amount
        fn: sum
      transfer_count:
        from: id
        fn: count
Supported aggregation functions: sum, count, avg, min, max
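Conceptually, the trigger keeps agg_table equal to what this query would return over the landing table (a sketch of the semantics, not the literal generated SQL):
SELECT
  account,
  token_address,
  SUM(amount) AS balance,          -- aggregate balance: from: amount, fn: sum
  COUNT(id)   AS transfer_count    -- aggregate transfer_count: from: id, fn: count
FROM transfer_events               -- the landing_table
GROUP BY account, token_address;   -- the group_by keys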
Multi-Sink Considerations
- Each sink reads from a from: field — different sinks can read from different transforms
- Sinks are independent — one failing doesn't block others
- Use different batch_size / batch_flush_interval per sink based on latency needs (see the sketch below)
- ClickHouse supports parallelism: N for concurrent writers (default 1)
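A sketch of per-sink tuning, assuming the field names listed above; the transform names, the values, and the flush-interval format are placeholders to check against /turbo-pipelines:
sinks:
  app_db:
    type: postgres
    from: trades                   # hypothetical transform name
    batch_size: 100                # small batches keep serving latency low (placeholder value)
    # batch_flush_interval: ...    # accepted format not shown here; see /turbo-pipelines
  warehouse:
    type: clickhouse
    from: all_activity             # hypothetical transform name
    batch_size: 10000              # larger batches favor throughput (placeholder value)
    parallelism: 4                 # concurrent ClickHouse writers (default 1)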
Webhook Sinks Without Secrets
Webhooks can use a direct URL instead of a secret when no auth headers are needed:
sinks:
  my_webhook:
    type: webhook
    from: my_transform
    url: https://my-lambda.us-west-2.on.aws/
Pipeline Splitting Decisions
One Pipeline vs. Multiple
Use one pipeline when:
- All data comes from the same source
- Transforms share intermediate results (e.g., a shared decode step)
- You want atomic deployment of the whole flow
Split into multiple pipelines when:
- Different data sources with no shared transforms
- Different lifecycle needs (one is stable, another changes frequently)
- Different resource requirements (one needs l, another needs s)
- Different chains with independent processing (templated pattern)
Keeping Pipelines Focused
A pipeline should ideally do one logical thing:
| Pipeline | Focus |
|---|---|
| dex-trades | Trade events → Postgres |
| dex-activities | All activity types → ClickHouse DWH |
| token-balances | Token balances → Postgres |
| base-balance-streaming | Base balances → ClickHouse + webhook |
Even though trades are a subset of activities, they're separate pipelines because they serve different consumers (application DB vs data warehouse).
Streaming vs Job Mode
Turbo pipelines have two execution modes:
Streaming Mode (Default)
name: my-streaming-pipeline
resource_size: s
# job: false (default — omit this field)
- Runs continuously, processing data as it arrives
- Maintains checkpoints for exactly-once processing
- Use for real-time feeds, dashboards, APIs
Job Mode (One-Time Batch)
name: my-backfill-job
resource_size: l
job: true
- Runs to completion and stops automatically
- Auto-deletes resources ~1 hour after completion
- Must delete before redeploying — cannot update a job pipeline, must goldsky turbo delete first
- Cannot use restart — use delete + apply instead
- Use for historical backfills, one-time data migrations, snapshot exports
When to Use Which
| Scenario | Mode | Why |
|---|---|---|
| Real-time dashboard | Streaming | Continuous updates needed |
| Backfill 6 months of history | Job | One-time, stops when done |
| Real-time + catch-up on deploy | Streaming | start_at: earliest does backfill then streams |
| Export data to S3 once | Job | No need for continuous processing |
| Webhook notifications on events | Streaming | Needs to react as events happen |
| Load test with historical data | Job | Process and inspect, then discard |
Job Mode with Bounded Ranges
Combine job mode with start_at: earliest and an end_block to process a specific range:
name: historical-export
resource_size: l
job: true
sources:
  logs:
    type: dataset
    dataset_name: ethereum.raw_logs
    version: 1.0.0
    start_at: earliest
    end_block: 19000000
    filter: >-
      address = '0xdac17f958d2ee523a2206206994597c13d831ec7'
Dynamic Table Architecture
Dynamic tables enable runtime-updatable lookup data within a pipeline. They're the Turbo answer to the "no joins in streaming SQL" limitation.
Pattern: Dynamic Allowlist/Blocklist
                             ┌──────────────────────┐
                             │   External Updates   │
                             │  (Postgres / REST)   │
                             └──────────┬───────────┘
                                        ▼
source ──→ sql transform ──→ [dynamic_table_check()] ──→ sink
The SQL transform filters records against the dynamic table. The table contents can be updated externally without pipeline restart.
Pattern: Lookup Enrichment
source ──→ decode ──→ filter ──→ sql (with dynamic_table_check) ──→ sink
                                                ▲
                                     [token_metadata table]
                                        (postgres-backed)
Store metadata (token symbols, decimals, protocol names) in a PostgreSQL table. Reference it in transforms for enrichment.
Backend Decisions
| Backend | backend_type | When to Use |
|---|---|---|
| PostgreSQL | Postgres | Data managed by external systems, shared across pipeline restarts |
| In-memory | InMemory | Auto-populated from pipeline data, ephemeral, fastest lookups |
Sizing Considerations
- Dynamic tables add memory overhead proportional to table size
- For large lookup tables (>100K rows), use the Postgres backend
- For small, frequently-changing lists (<10K rows), InMemory is faster
- Dynamic table queries are async — they add slight latency per record
For full dynamic table configuration syntax and examples, see /turbo-transforms.
Related
- /turbo-builder — Build and deploy pipelines interactively using these architecture patterns
- /turbo-doctor — Diagnose and fix pipeline issues
- /turbo-pipelines — Pipeline YAML configuration reference
- /turbo-transforms — SQL, TypeScript, and dynamic table transform reference
- /datasets — Blockchain dataset and chain prefix reference
- /secrets — Sink credential management
- /turbo-monitor-debug — Monitoring and debugging reference
- /turbo-lifecycle — Pipeline lifecycle command reference