Redshift Skill

Read-only Redshift exploration and business analysis via the AWS Data API. Works with both provisioned clusters and Redshift Serverless. Cross-platform (Mac + Windows). Works with any AI coding agent.

All scripts are in ${CLAUDE_SKILL_DIR}/scripts/ and require only Python 3 + AWS CLI.

Python Command

Read ~/.redshift-skill/config.json and use the "python" key as the Python command. If config doesn't exist yet, try python3 --version first, falling back to python --version. Throughout this document, PYTHON means the detected Python command.

First-Time Setup

You cannot run the setup wizard directly — it requires interactive terminal input.

Check if ~/.redshift-skill/config.json exists:

If it exists: Read it to confirm the connection details.
If it doesn't exist: Tell the user to run the setup wizard in their terminal:

Run this in your terminal to configure the Redshift connection:
python3 scripts/setup.py
(On Windows, use python instead of python3)

Wait for the user to confirm setup is complete before running any queries.

Quick Reference

Task	Script	When to use	Key Args
Run SQL	`query.py`	Any free-form read-only query	`"SELECT ..."` or `--sql-file=PATH`
List schemas	`schemas.py`	Starting point — see what schemas exist
List tables	`tables.py`	Browse tables, check row counts and sizes before querying	`--schema=NAME`
List columns	`columns.py`	Understand column types, encoding, dist/sort keys	`--schema=NAME --table=NAME`
Table DDL	`ddl.py`	Get full CREATE TABLE with encoding, distkey, sortkey	`--schema=NAME --name=NAME`
View DDL	`ddl.py`	Get CREATE VIEW definition	`--type=view --schema=NAME --name=NAME`
Schema DDL	`ddl.py`	Get CREATE SCHEMA with authorization	`--type=schema [--name=NAME]`
Database DDL	`ddl.py`	Get CREATE DATABASE	`--type=database`
UDF DDL	`ddl.py`	Get function definitions (SQL + Python)	`--type=udf [--schema=NAME]`
External DDL	`ddl.py`	Get Spectrum external table DDL	`--type=external [--schema=NAME]`
Group DDL	`ddl.py`	Get user group definitions	`--type=group`
Table metadata	`table_info.py`	Check size, rows, skew, encoding — before writing big queries	`--schema=NAME --table=NAME`
Search objects	`search.py`	Find tables or columns when you don't know the exact name	`--pattern=TEXT`
Sample data	`sample.py`	Quick peek at actual values — always do this before writing queries	`--schema=NAME --table=NAME`
Disk usage	`space.py`	Find the largest tables, identify storage issues	`[--schema=NAME] [--top=N]`
Data profile	`profile.py`	Per-column stats (nulls, cardinality, min/max/avg)	`--schema=NAME --table=NAME`
Local analytics	`analyze.py`	Analyze saved results locally without hitting Redshift	`FILE --describe`

Common options (all Redshift scripts)

Option	Description
`--format=txt\|csv\|json`	Terminal display format (default: txt)
`--save-format=txt\|csv\|json`	File save format (default: csv)
`--save=PATH`	Save to a specific file path
`--no-save`	Don't auto-save to ~/redshift-exports/
`--save-sql`	Save the SQL query as a .sql file alongside results
`--sql-file=PATH`	Read SQL from a file (query.py only)
`--profile=NAME`	Override AWS profile
`--cluster=NAME`	Override cluster (provisioned)
`--workgroup=NAME`	Override workgroup (serverless)
`--database=NAME`	Override database
`--db-user=NAME`	Override database user (provisioned only)
`--timeout=N`	Max wait seconds (default: 120)
`--max-rows=N`	Max rows to fetch (default: 1000)

Output and File Saving

All query results are automatically saved to ~/redshift-exports/query-{timestamp}.csv. The terminal shows an aligned txt preview (first 200 rows). The saved file defaults to CSV for spreadsheet compatibility.

This means you always have:

Inline preview (200 rows in txt format) — enough to understand the data shape and answer quick questions
Full CSV on disk — for deeper analysis with analyze.py or for the user to open in a spreadsheet

--format controls terminal display (default: txt). --save-format controls the saved file format (default: csv). Use --save-sql to also save the SQL query as a matching .sql file. Use --no-save to skip auto-save. Use --save=PATH to save to a specific location.

Defensive Guardrails

These rules protect the cluster from expensive queries. Follow them — but use judgement. If the user explicitly asks for something that bends a rule, explain the trade-off and proceed if they confirm.

Never SELECT * from tables with >10K rows — this tool is for exploration and analysis, not data export. Use aggregations, filters, or sample.py instead. For smaller tables, SELECT * is fine.
Always add LIMIT when exploring unfamiliar tables (default LIMIT 100).
Check row counts first — run tables.py --schema=X before writing queries so you know what you're dealing with.
Prefer aggregations for large tables — COUNT, SUM, AVG with GROUP BY over pulling raw rows. Raw data is perfectly fine for small tables and when exploring/understanding data.
Filter on sort key for very large fact tables — especially tables sorted by date/timestamp when filtering on a specific time period. Redshift uses zone maps to skip blocks, making date-range filters very efficient. Not strictly required, but strongly recommended for tables >100M rows.
Joins are fine — Redshift handles large table joins well as long as the join condition is correct and you aggregate/filter the result appropriately. The risk is not the join itself but returning unbounded raw data from a join.
Avoid accidental cross joins — always include ON/USING unless you intentionally need a cartesian product (e.g. exploding data by design).
Prefer LIMIT + ORDER BY over unbounded selects when exploring.

Size awareness:

Table size	Approach
<10K rows	Explore freely, `SELECT *` is fine
10K–1M rows	Add `WHERE` or `LIMIT`, aggregations preferred for full-table queries
>1M rows	Always aggregate or filter, never `SELECT *`, use sort key filters on large fact tables

SQL Standards

Every SQL query you write must follow these rules:

Always comment your SQL

Explain the business intent, not just the mechanics. Use section headers for complex queries.

Always show the SQL to the user

Short queries (<10 lines): show the full SQL inline with comments
Long queries (>10 lines): save to ~/redshift-exports/query-{timestamp}.sql, show the key parts inline, and reference the saved file

Use `--sql-file` for complex queries

For long SQL, write it to a file first and execute with --sql-file:

PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=~/redshift-exports/my_query.sql

Tip — inline multiline SQL without reading/writing in the LLM context: When you need to write the SQL file and run the query in one Bash call:

Mac/Linux — heredoc:

cat > "/path/to/query.sql" << 'EOSQL'
-- Your SQL here
SELECT ...
EOSQL
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file="/path/to/query.sql" --save="/path/to/results.csv" --save-sql

The 'EOSQL' quoting prevents shell variable expansion inside the SQL.

Cross-platform — Python:

PYTHON -c "
sql = '''
-- Your SQL here
SELECT ...
'''
open('/path/to/query.sql', 'w').write(sql.strip())
"
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file="/path/to/query.sql" --save="/path/to/results.csv" --save-sql

Tip — reuse SQL files by replacing values: When you already have a SQL file and just need to change specific values (e.g. rolling dates forward a week), use a one-liner to swap them without rewriting the whole file in the LLM context:

Mac/Linux — sed:

sed "s/2026-03-15/2026-03-22/g; s/2026-03-21/2026-03-28/g" original_query.sql > query.sql

Cross-platform — Python:

PYTHON -c "open('query.sql','w').write(open('original_query.sql').read().replace('2026-03-15','2026-03-22').replace('2026-03-21','2026-03-28'))"

Then execute:

PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=query.sql --save="/path/to/results.csv" --save-sql

For templates with many replacements, named placeholders like {{WC_START}} can be used instead of literal dates.

Formatting conventions

------------------------------------------------------------------------------------------------------------------------
-- Monthly revenue by region
-- Purpose: Aggregate order line items by region for the last 12 months
-- Filters: Excludes cancelled orders and zero-quantity items
------------------------------------------------------------------------------------------------------------------------
WITH monthly_revenue AS (
    SELECT  r.region_name,                                                           -- region dimension
            DATE_TRUNC('month', o.order_date) :: DATE                                AS month_nk,
            SUM(o.quantity)                                                          AS total_quantity,
            SUM(o.line_total)                                                        AS total_revenue,
            SUM(o.margin)                                                            AS total_margin,
            COUNT(DISTINCT o.customer_id)                                            AS unique_customers
    FROM    sales.fact_order_lines o
    JOIN    sales.dim_regions r                     USING (region_id)
    WHERE   o.order_date >= DATEADD(month, -12, CURRENT_DATE)
            AND o.quantity > 0
            AND o.order_status <> 'cancelled'
    GROUP   BY 1, 2
)
SELECT  region_name,
        month_nk,
        total_quantity,
        total_revenue,
        total_margin,
        unique_customers,
        -- margin percentage
        ROUND(total_margin / NULLIF(total_revenue, 0) * 100, 2)                     AS margin_pct
FROM    monthly_revenue
ORDER   BY region_name, month_nk

Key formatting rules:

Section headers with -- dashes above the query
SELECT, FROM, JOIN, WHERE, GROUP BY, ORDER BY left-aligned
One column per line — never pack multiple columns onto a single line. Each column expression gets its own line with its AS alias.
Column aliases aligned with AS at a consistent column position
Inline comments explaining non-obvious business logic
CTE names that describe the business concept, not the technical operation

Generous commenting

Every query must include comments that explain the why, not just the what. Treat SQL as documentation for the next person (or the next agent) who reads it.

Header block at the top of every query: purpose, sources, key assumptions, parameters
Section separators (dashed -- lines) between logical blocks (e.g. between actuals, budget, pivot CTEs)
Inline comments on non-obvious expressions (e.g. customer estimation formulas, business rules, why a filter exists)
CTE-level comments explaining what each CTE produces and why it exists as a separate step

------------------------------------------------------------------------------------------------------------------------
-- Weekly Sales Scorecard
-- Purpose: Core trading metrics with TY/LP/LY comparisons and BU/LV variances
-- Source:  fact_basket_items (actuals), fact_commercial_budget_all (BU)
-- Params:  Change the date in the periods CTE to set the reporting period
-- Notes:
--   - Customer estimation done per channel first, then summed to total
--   - BU/LV only have sales and margin — derivative ratios are actuals-only
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-- ACTUALS: per channel for correct customer estimation
------------------------------------------------------------------------------------------------------------------------

Leverage Redshift alias reuse

Redshift allows referencing a column alias later in the same SELECT. Use this to avoid unnecessary intermediate CTEs for derived columns.

Do this — single SELECT with alias reuse:

SELECT  SUM(quantity)                              AS volume,
        COUNT(DISTINCT basket_sk)                  AS orders,
        1.0 * volume / NULLIF(orders, 0)           AS ipb
FROM    fact_basket_items

Not this — unnecessary extra CTE just to compute a derived column:

-- BAD: extra CTE only needed because standard SQL can't reuse aliases
base AS (
    SELECT SUM(quantity) AS volume, COUNT(DISTINCT basket_sk) AS orders
    FROM   fact_basket_items
)
SELECT volume, orders, 1.0 * volume / NULLIF(orders, 0) AS ipb
FROM   base

This applies to all derived columns: ratios, variances, customer estimations, margin rates, etc. If a CTE exists solely to reference aliases from a previous step, collapse it into the source SELECT.

Use `GROUP BY ALL` or positional `GROUP BY`

Redshift supports GROUP BY ALL (groups by every non-aggregate column in SELECT) and positional GROUP BY 1, 2, 3. Prefer these over repeating long CASE expressions in the GROUP BY clause.

Do this:

SELECT  dp.period_name,
        f.country_sk,
        CASE WHEN ps.composite_rnk <= 100 THEN 'Top 100 SKUs'
             ELSE 'Other SKUs' END                              AS sku_segment,
        SUM(f.revenue_excl_vat_all_gbp)                         AS sales
FROM    ...
GROUP BY ALL

Not this — repeating the CASE expression in GROUP BY:

GROUP BY dp.period_name, f.country_sk,
         CASE WHEN ps.composite_rnk <= 100 THEN 'Top 100 SKUs' ELSE 'Other SKUs' END

Use GROUP BY ALL when every non-aggregate column should be grouped. Use GROUP BY 1, 2, 3 when you need ROLLUP or GROUPING SETS on specific positions. When using ROLLUP, positional references are clearest: GROUP BY ROLLUP(1, 2), 3.

Schema Exploration Workflow

Use this graduated approach when exploring an unfamiliar schema. Don't follow this rigidly — if the user's intent is clear and specific, skip straight to the relevant step. Parallelise steps where possible.

Understand the landscape — schemas.py → see what schemas exist
Browse tables — tables.py --schema=X → check row counts and sizes
Understand structure — columns.py --schema=X --table=Y → column types, keys
Check metadata — table_info.py → distribution, sort key, unsorted %
Sample first — sample.py --limit=5 → see actual data values
Profile if needed — profile.py → nulls, cardinality, min/max per column
Write targeted queries — now you know enough to write safe, efficient SQL
Analyze locally — analyze.py on saved results for follow-up

Shortcut: If the user says "how many orders last month?", don't run 6 discovery scripts. Check the table size, write the query, run it.

Business Analysis Workflows

These are patterns for approaching common business questions. Think like a product analyst, commercial analyst, or BI developer.

Understanding business performance

Start with the headline metric — total revenue, order count, customer count for the period
Break down by dimensions — time (daily/weekly/monthly), region, channel, product category
Compare periods — this month vs last month, this year vs last year, vs same period last year
Identify outliers — which segments are significantly above or below expectations?

Root cause analysis

When something looks wrong ("revenue dropped 15% last week"):

Confirm the anomaly — is it real? Check the data source, compare with other metrics
Decompose the metric — is it volume (fewer orders) or value (lower average order)?
Slice by dimensions — which region/channel/category drove the change?
Drill into the outlier — what changed in that segment? New products? Pricing? Stock issues?
Correlate with events — promotions starting/ending, price changes, stock-outs, seasonality

Common analyst patterns

Pattern	Approach	SQL shape
Trend analysis	Track metrics over time	`GROUP BY date/month` + `SUM`/`COUNT`, compare YoY/MoM
Cohort analysis	Group by first purchase, track retention	`MIN(order_date)` per customer, then join back
Distribution analysis	Understand spread of values	`NTILE(100)`, percentiles, use `analyze.py --hist` locally
Top/bottom N	Best and worst performers	`ORDER BY metric DESC LIMIT N`
YoY comparison	Year-over-year growth	Self-join on date shifted by 1 year, or `LAG()` window function
Funnel analysis	Conversion at each step	`COUNT(DISTINCT user_id)` per step, compute drop-off rates
Contribution / Pareto	Which items drive 80% of revenue	Cumulative `SUM() OVER (ORDER BY ...)`, find the 80% cutoff
Segmentation	Group entities by behavior	`CASE WHEN` or `NTILE` to bucket, then profile each segment

Script Details

query.py — Run SQL

Run any read-only SQL query. Accepts SQL as a command-line argument or from a file.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py "SELECT count(1) FROM sales.fact_orders"
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=~/redshift-exports/my_query.sql

count
-----
12500000
1 rows returned (1 total). Duration: 25ms
Results saved to: ~/redshift-exports/query-20260318_150505.csv

schemas.py — List schemas

List all user schemas with their owners. Starting point for exploration.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/schemas.py

schema_name   owner
-----------   ---------
public        admin
sales         etl_user
marketing     etl_user
...

tables.py — List tables

List tables in a schema with row counts, size, distribution style, and sort keys. Run this before writing queries so you know what you're dealing with.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/tables.py --schema=sales

table_name       row_count   size_mb  diststyle  sortkey1     unsorted  pct_used
--------------   ----------  -------  ---------  ----------   --------  --------
dim_customers    250000      120      ALL        customer_id  0.00      0.0100
dim_products     15000       48       ALL        product_id   0.00      0.0000
fact_orders      85000000    45000    KEY(...)   order_date   12.50     5.8600
...

columns.py — List columns

Show column names, data types, encoding, distribution key, and sort key position. Works with tables, views, and late-binding views.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/columns.py --schema=sales --table=fact_orders

pos  column_name    data_type          max_len  encoding  distkey  sortkey
---  -------------  -----------------  -------  --------  -------  -------
1    order_id       CHARACTER VARYING  200      lzo       False    0
2    order_date     DATE                        az64      False    1
3    customer_id    CHARACTER VARYING  200      lzo       True     0
...

ddl.py — Generate DDL

Generate DDL for 7 object types. Full CREATE TABLE includes encoding, distkey, sortkey, constraints, comments, and ownership.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --schema=sales --name=fact_orders
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=view --schema=sales --name=v_daily_revenue
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=schema --name=sales
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=database
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=udf --schema=public
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=group

table_info.py — Table metadata

Detailed stats for a single table: size, rows, distribution skew, encoding, sort key performance.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/table_info.py --schema=sales --table=fact_orders

tablename          row_count   size_mb  pct_used  diststyle     sortkey1    unsorted  stats_off
-----------------  ----------  -------  --------  -----------   ----------  --------  ---------
sales.fact_orders  85000000    45000    5.8600    KEY(cust_id)  order_date  12.50     0.00

search.py — Search objects

Find tables or columns by name pattern (case-insensitive). Useful when you don't know the exact name.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/search.py --pattern=order
PYTHON ${CLAUDE_SKILL_DIR}/scripts/search.py --pattern=revenue --type=column

sample.py — Sample data

Quick peek at actual row data. Always do this before writing complex queries — it helps you understand column values, formats, and nullability.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/sample.py --schema=sales --table=dim_customers --limit=5

space.py — Disk usage

Find the largest tables, optionally filtered by schema.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/space.py --schema=sales --top=5

schemaname  tablename         size_mb  row_count   pct_used
----------  ----------------  -------  ----------  --------
sales       fact_orders       45000    85000000    5.8600
sales       fact_line_items   32000    420000000   4.1700
...

profile.py — Data profiling (via Redshift)

Runs a single Redshift query to compute per-column statistics. Useful for understanding data quality before analysis.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/profile.py --schema=sales --table=dim_customers

column_name     data_type          total_rows  null_count  null_pct  distinct_count  min_val    max_val    avg_val
--------------  -----------------  ----------  ----------  --------  --------------  ---------  ---------  -------
customer_id     CHARACTER VARYING  250000      0           0.0       250000          C0001      C250000
customer_name   CHARACTER VARYING  250000      12          0.0       249500          Aaron      Zoe
signup_date     DATE               250000      0           0.0       3650            2015-01-01 2025-12-31
lifetime_value  NUMERIC            250000      1500        0.6       48000           0.00       125000.00  2340.50

analyze.py — Local analytics (no Redshift)

Analyze previously saved CSV/JSON files locally. No network access needed — great for follow-up analysis without hitting the cluster again.

PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py ~/redshift-exports/query-*.csv --describe
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --sum=revenue
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --group-by=region --sum=sales
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --filter='year=2024' --sort=amount --desc --top=10
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --hist=price

Available operations: --count, --describe, --sum=COL, --avg=COL, --min=COL, --max=COL, --median=COL, --group-by=COL, --filter=EXPR, --sort=COL, --desc, --top=N, --hist=COL

Read-Only Guardrails

Two layers of protection:

You must validate — before passing any SQL to query.py, verify the first keyword is in the allowlist. If the user asks to run a modifying query, refuse and explain why.
lib/client.py enforces — the script itself rejects non-allowlisted SQL before it reaches AWS. This is a hard guardrail that cannot be bypassed.

Allowed: SELECT, WITH, SHOW, DESCRIBE, EXPLAIN, SET

Blocked: CREATE, ALTER, DROP, TRUNCATE, INSERT, UPDATE, DELETE, MERGE, COPY, UNLOAD, GRANT, REVOKE, CALL, EXECUTE, VACUUM, ANALYZE

Multi-statement queries (; followed by another statement) are also blocked.

Bundled SQL

The skill ships with SQL extracted from amazon-redshift-utils admin views in ${CLAUDE_SKILL_DIR}/scripts/sql/. These run directly against system catalogs — no admin schema needed. Used by ddl.py for full DDL generation with distkeys, sortkeys, encoding, constraints, and comments.

Advanced SQL Templates

For ad-hoc exploration via query.py:

Query history (last 20):

SELECT query, TRIM(querytxt) AS sql, starttime, endtime,
       DATEDIFF(ms, starttime, endtime) AS duration_ms
FROM stl_query WHERE userid > 1
ORDER BY starttime DESC LIMIT 20

Running queries:

SELECT user_name, db_name, query, pid, starttime, duration, status
FROM stv_recents WHERE status = 'Running'
ORDER BY starttime DESC

Table dependencies (FK relationships):

SELECT n1.nspname || '.' || c1.relname AS source_table,
       n2.nspname || '.' || c2.relname AS referenced_table,
       con.conname AS constraint_name
FROM pg_constraint con
JOIN pg_class c1 ON con.conrelid = c1.oid
JOIN pg_namespace n1 ON c1.relnamespace = n1.oid
JOIN pg_class c2 ON con.confrelid = c2.oid
JOIN pg_namespace n2 ON c2.relnamespace = n2.oid
WHERE con.contype = 'f'
ORDER BY source_table

redshift