turbo-pipelines


Turbo Pipeline Configuration & Architecture

YAML configuration reference and architecture guide for Turbo pipelines. For interactive pipeline building, use /turbo-builder. For troubleshooting, use /turbo-doctor. For transform implementation, use /turbo-transforms.

CRITICAL: Always validate YAML with goldsky turbo validate <file.yaml> before showing complete pipeline YAML to the user or deploying.


Quick Start

```yaml
name: my-first-pipeline
resource_size: s
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest
transforms: {}
sinks:
  output:
    type: blackhole
    from: transfers
```

```bash
goldsky turbo validate pipeline.yaml   # Validate first
goldsky turbo apply pipeline.yaml -i   # Deploy + inspect
```

Prerequisites

  • Goldsky CLI: `curl https://goldsky.com | sh`
  • Turbo extension (separate binary): `curl https://install-turbo.goldsky.com | sh`
  • Logged in: `goldsky login`
  • Secrets created for sinks if using PostgreSQL, ClickHouse, Kafka, etc. (see /secrets)

Architecture Decisions

Source Type Selection

| Scenario | Source Type | Why |
|---|---|---|
| Decode contract events from logs | dataset | Need raw_logs + `_gs_log_decode()` |
| Track token transfers | dataset | `erc20_transfers` has structured data |
| Historical backfill + live | dataset | `start_at: earliest` processes history |
| Live token balances | kafka | `latest_balances_v2` is a streaming topic |
| Real-time state snapshots | kafka | Kafka delivers latest state continuously |
| Only need new data going forward | Either | Dataset with `start_at: latest`, or Kafka |

Note: Kafka sources are used in production but are not documented in official Goldsky docs. Contact Goldsky support for topic names.

Data Flow Patterns

| Pattern | When to Use | Template |
|---|---|---|
| Linear | Single source, single destination, simple processing | templates/linear-pipeline.yaml |
| Fan-out | One source → multiple sinks (different views/subsets) | templates/fan-out-pipeline.yaml |
| Fan-in | Multiple event types → UNION ALL → one table | templates/fan-in-pipeline.yaml |
| Multi-chain | Same logic across chains (separate pipelines) | templates/multi-chain-templated.yaml |

For detailed pattern diagrams, YAML examples, and multi-chain deployment guidance, read references/architecture-patterns.md.
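As a quick illustration, the fan-in pattern combines multiple sources with UNION ALL in a single SQL transform. This is a sketch: the `ethereum` chain prefix, column names, and dataset versions are illustrative; adapt them from templates/fan-in-pipeline.yaml.

```yaml
sources:
  base_transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest
  eth_transfers:
    type: dataset
    dataset_name: ethereum.erc20_transfers   # illustrative chain prefix
    version: 1.2.0
    start_at: latest
transforms:
  combined:
    type: sql
    primary_key: id
    sql: |
      SELECT id, sender, recipient, amount FROM base_transfers
      UNION ALL
      SELECT id, sender, recipient, amount FROM eth_transfers
sinks:
  output:
    type: blackhole   # swap for a real sink once validated
    from: combined
```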

Resource Sizing

| Size | Workers | CPU | Memory | When to Use |
|---|---|---|---|---|
| xs | | 0.4 | 0.5 Gi | Small datasets, light testing |
| s | 1 | 0.8 | 1.0 Gi | Simple filters, single source/sink, low volume (default) |
| m | 4 | 1.6 | 2.0 Gi | Multiple sinks, Kafka streaming, moderate transform complexity |
| l | 10 | 3.2 | 4.0 Gi | Multi-event decoding + UNION ALL, high-volume backfill |
| xl | 20 | 6.4 | 8.0 Gi | Large chain backfills, complex JOINs |
| xxl | 40 | 12.8 | 16.0 Gi | Highest throughput; up to 6.3M rows/min |

Start small and scale up; conservative sizing avoids wasted resources.

Sink Selection

| Destination | Sink Type | Best For |
|---|---|---|
| Application DB | postgres | Row-level lookups, joins, application serving |
| Real-time aggregates | postgres_aggregate | Balances, counters, running totals via triggers |
| Analytics queries | clickhouse | Large-scale aggregations, time-series data |
| Event processing | kafka | Downstream consumers, event-driven systems |
| GCP messaging | pubsub | Google Cloud Pub/Sub topics (Turbo-only) |
| Serverless streaming | s2_sink | S2.dev streams, alternative to Kafka |
| Notifications | webhook | Lambda functions, API callbacks, alerts |
| Data lake | s3_sink | Long-term archival, batch processing |
| Testing | blackhole | Validate pipeline without writing data |

For full sink field specifications, read references/sink-reference.md.

Streaming vs Job Mode

| Scenario | Mode | Why |
|---|---|---|
| Real-time dashboard | Streaming | Continuous updates needed |
| Backfill 6 months of history | Job | One-time, stops when done |
| Real-time + catch-up on deploy | Streaming | `start_at: earliest` does backfill then streams |
| Export data to S3 once | Job | No need for continuous processing |
| Webhook notifications on events | Streaming | Needs to react as events happen |
| Load test with historical data | Job | Process and inspect, then discard |

Job mode rules: a job runs to completion and auto-deletes ~1 hour after finishing. It must be deleted before redeploying, and it cannot be paused, resumed, or restarted.
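For example, a bounded backfill combines `job: true` with `end_block`. This is a sketch: the dataset and block number are placeholders.

```yaml
name: backfill-once
resource_size: l
job: true                     # run to completion, then auto-delete ~1hr later
sources:
  transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest
    end_block: 10000000       # placeholder: stop after this block
transforms: {}
sinks:
  output:
    type: blackhole           # swap for a real sink for an actual export
    from: transfers
```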

Pipeline Splitting

Default to one pipeline with multiple sinks when the user wants the same source data sent to multiple destinations. A single Turbo pipeline supports multiple sinks natively — each sink has a from: field pointing at a source or transform by name, and sinks run independently (one failing does not block the others, and each can have its own batch_size / batch_flush_interval).

Do NOT split into separate pipelines just because there are multiple destinations. Generating one pipeline per sink duplicates the source ingestion, wastes resources, and decouples deployments that should be atomic.

Split into multiple pipelines only when: sources are fundamentally different (different chains with independent lifecycles), the destinations need different resource sizes, or the user explicitly asks for separate pipelines.

See references/architecture-patterns.md for fan-out (one source → multiple sinks) and fan-in (UNION ALL) YAML examples, and templates/multi-sink-pipeline.yaml / templates/fan-out-pipeline.yaml / templates/fan-in-pipeline.yaml for ready-to-adapt configs.
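The multi-sink guidance above can be sketched as one pipeline whose sinks tune their batching independently. Names are hypothetical, and the S3 sink's own fields are omitted; see references/sink-reference.md for the full field list.

```yaml
sinks:
  app_db:
    type: postgres
    from: decoded             # hypothetical transform name
    schema: public
    table: events             # hypothetical table
    secret_name: MY_POSTGRES_SECRET
    primary_key: id
    batch_size: 1000          # each sink batches independently
    batch_flush_interval: 5s
  archive:
    type: s3_sink
    from: decoded             # same upstream; failure here does not block app_db
    # s3-specific fields omitted; see references/sink-reference.md
```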


Configuration Reference

Pipeline Structure

```yaml
name: my-pipeline           # Required: unique identifier (lowercase, hyphens)
resource_size: s            # Required: xs/s/m/l/xl/xxl
job: true                   # Optional: one-time batch (default: false = streaming)

sources:
  source_name:
    type: dataset           # or: kafka
    # ... source config

transforms:                 # Optional
  transform_name:
    type: sql               # or: script, handler, dynamic_table
    # ... transform config

sinks:
  sink_name:
    type: postgres          # or: clickhouse, kafka, webhook, s3_sink, etc.
    # ... sink config
```

Source Configuration

Dataset Source

```yaml
sources:
  my_source:
    type: dataset
    dataset_name: <chain>.<dataset_type>
    version: <version>
    start_at: latest | earliest    # EVM chains
    # start_block: <slot_number>   # Solana only
    # end_block: <block_number>    # Optional: for bounded backfills
    # filter: >-                   # Optional: SQL WHERE for source-level pre-filtering
    #   address = '0x...' AND block_number >= 10000000
```

| Field | Required | Description |
|---|---|---|
| `type` | Yes | `dataset` for blockchain data |
| `dataset_name` | Yes | Format: `<chain>.<dataset_type>` |
| `version` | Yes | Dataset version (e.g., 1.2.0) |
| `start_at` | EVM | `latest` or `earliest` |
| `start_block` | Solana | Specific slot number (omit for latest) |
| `end_block` | No | Stop at this block (for bounded backfills) |
| `filter` | No | SQL WHERE clause; pre-filters at ingestion (efficient) |

Use filter for contract addresses and block ranges (coarse pre-filtering). Use transform WHERE for fine-grained filtering.
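Combining the two levels of filtering might look like this. A sketch only: the address, threshold, and source name are placeholders.

```yaml
sources:
  usdc_transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest
    filter: >-                 # coarse: applied at ingestion
      address = '0x0000000000000000000000000000000000000000'
transforms:
  large_only:
    type: sql
    primary_key: id
    sql: |
      SELECT * FROM usdc_transfers
      WHERE amount > 1000000   -- fine-grained: applied in the transform
```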

For chain prefixes and dataset types, see /datasets.

Kafka Source

```yaml
sources:
  my_source:
    type: kafka
    topic: base.raw.latest_balances_v2
```

Kafka sources take no `start_at` or `version` fields. Optional fields: `filter`, `include_metadata`, `starting_offsets`.
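A kafka source using the optional fields might look like the sketch below. The `starting_offsets` value, filter column, and address are illustrative guesses, not documented values; confirm topic names and field semantics with Goldsky support.

```yaml
sources:
  balances:
    type: kafka
    topic: base.raw.latest_balances_v2
    starting_offsets: earliest      # illustrative value
    include_metadata: true
    filter: >-
      token_address = '0x0000000000000000000000000000000000000000'  # illustrative column
```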

Transform Configuration

| Type | Use Case |
|---|---|
| sql | Filtering, projections, SQL functions |
| script | Custom TypeScript/WASM logic |
| handler | Call external HTTP APIs to enrich |
| dynamic_table | Lookup tables backed by a database |
| throttle | Cap throughput to a fixed records-per-second |

SQL Transform

```yaml
transforms:
  filtered:
    type: sql
    primary_key: id
    sql: |
      SELECT id, sender, recipient, amount
      FROM source_name
      WHERE amount > 1000
```

| Field | Required | Description |
|---|---|---|
| `type` | Yes | `sql` |
| `primary_key` | Yes | Column for uniqueness/ordering |
| `sql` | Yes | SQL query (reference sources by name) |
| `from` | No | Override default source (for chaining) |

Transform Chaining

Chain transforms using from:

```yaml
transforms:
  step1:
    type: sql
    primary_key: id
    sql: SELECT * FROM source WHERE amount > 100
  step2:
    type: sql
    primary_key: id
    from: step1
    sql: SELECT *, 'processed' as status FROM step1
```

For TypeScript, handler, and dynamic table transforms, see /turbo-transforms.

Throttle Transform

Caps the throughput of a stream by buffering records into batches and emitting each batch on a fixed minimum interval. Throttle does not modify data — every input record passes through unchanged. Use it to stay under rate limits of downstream sinks or external APIs, smooth bursty sources, or pace records into HTTP handlers.

```yaml
transforms:
  throttled:
    type: throttle
    from: my_source
    max_batch_size: 100
    min_batch_interval: 10s
```

| Field | Required | Description |
|---|---|---|
| `type` | Yes | `throttle` |
| `from` | Yes | Source or transform to read from |
| `max_batch_size` | No | Max records per batch |
| `min_batch_interval` | No | Minimum time between batches (e.g. 10s, 500ms, 1m) |

Effective max throughput ≈ max_batch_size / min_batch_interval records per second. Throttle limits the maximum rate, not the minimum — if upstream is slow, batches will be smaller and arrive less frequently.

Place throttle close to the bottleneck (just before the rate-limited sink or handler) so upstream transforms still process at full speed.
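Putting the placement advice together, a throttle can sit between a transform and a rate-limited webhook sink. A sketch: webhook-specific fields are omitted (see references/sink-reference.md), and names are hypothetical.

```yaml
transforms:
  filtered:
    type: sql
    primary_key: id
    sql: SELECT * FROM my_source WHERE amount > 1000
  throttled:
    type: throttle
    from: filtered
    max_batch_size: 100       # 100 records / 10s ≈ 10 records/sec max
    min_batch_interval: 10s
sinks:
  notify:
    type: webhook
    from: throttled           # only the sink is paced; "filtered" runs at full speed
    # webhook-specific fields omitted; see references/sink-reference.md
```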

Sink Configuration

Quick examples for common sinks. For full field specs of all sink types, read references/sink-reference.md.

PostgreSQL

```yaml
sinks:
  output:
    type: postgres
    from: my_transform
    schema: public
    table: my_table
    secret_name: MY_POSTGRES_SECRET
    primary_key: id
```

ClickHouse

```yaml
sinks:
  output:
    type: clickhouse
    from: my_transform
    table: my_table
    secret_name: MY_CLICKHOUSE_SECRET
    primary_key: id
```

Pub/Sub (Turbo-only)

```yaml
sinks:
  output:
    type: pubsub
    from: my_transform
    topic: my-topic
    secret_name: MY_PUBSUB_SECRET
```

The secret holds a GCP project ID and service-account JSON; the topic must already exist in GCP.

Blackhole (Testing)

```yaml
sinks:
  output:
    type: blackhole
    from: my_transform
```

Checkpoint Behavior

  • Preserved by default when updating a pipeline
  • Tied to source names — renaming a source resets its checkpoint
  • Tied to pipeline names — renaming the pipeline resets all checkpoints

To reset checkpoints: rename the source or pipeline. Warning: this reprocesses all historical data.
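For example, renaming a source is enough to force reprocessing (a sketch; only the key name changes):

```yaml
sources:
  transfers_v2:   # was "transfers": the rename resets this source's checkpoint
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest
```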


Starter Templates

| Template | Description | Use Case |
|---|---|---|
| minimal-erc20-blackhole.yaml | Simplest pipeline, no credentials | Quick testing |
| filtered-transfers-sql.yaml | Filter by contract address | USDC, specific tokens |
| postgres-output.yaml | Write to PostgreSQL | Production data storage |
| pubsub-output.yaml | Write to Google Cloud Pub/Sub | GCP-based event consumers |
| multi-chain-pipeline.yaml | Combine multiple chains | Cross-chain analytics |
| solana-transfers.yaml | Solana SPL tokens | Non-EVM chains |
| multi-sink-pipeline.yaml | Multiple outputs | Archive + alerts + streaming |
| linear-pipeline.yaml | Simple decode → filter → sink | Basic linear flow |
| fan-out-pipeline.yaml | One source → multiple sinks | Multi-destination |
| fan-in-pipeline.yaml | Multiple events → UNION ALL | Activity feeds |
| multi-chain-templated.yaml | Per-chain pipeline pattern | Independent chain deploys |

Template location: templates/ (relative to this skill's directory)


CLI Quick Reference

| Action | Command |
|---|---|
| Install Goldsky CLI | `curl https://goldsky.com \| sh` |
| Install Turbo extension | `curl https://install-turbo.goldsky.com \| sh` |
| Validate (REQUIRED) | `goldsky turbo validate pipeline.yaml` |
| Deploy/Update | `goldsky turbo apply pipeline.yaml` |
| Deploy + Inspect | `goldsky turbo apply pipeline.yaml -i` |
| List pipelines | `goldsky turbo list` |
| List datasets | `goldsky dataset list` (slow, 30-60s) |
| List secrets | `goldsky secret list` |

For lifecycle commands (pause/resume/restart/delete) and monitoring (inspect/logs), see /turbo-operations.


Troubleshooting

See references/troubleshooting.md for CLI hanging, validation errors, and runtime errors.


Related

  • /turbo-builder — Interactive wizard to build pipelines step-by-step
  • /turbo-doctor — Diagnose and fix pipeline issues
  • /turbo-operations — Lifecycle commands and monitoring reference
  • /turbo-transforms — SQL, TypeScript, and dynamic table transform reference
  • /datasets — Dataset names and chain prefixes
  • /secrets — Sink credential management