# Redshift Skill
Read-only Redshift exploration and business analysis via the AWS Data API. Works with both provisioned clusters and Redshift Serverless. Cross-platform (Mac + Windows). Works with any AI coding agent.
All scripts are in `${CLAUDE_SKILL_DIR}/scripts/` and require only Python 3 + AWS CLI.
## Python Command

Read `~/.redshift-skill/config.json` and use the `"python"` key as the Python command. If the config doesn't exist yet, try `python3 --version` first, falling back to `python --version`.

Throughout this document, `PYTHON` means the detected Python command.
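The only key this document relies on is `"python"`; the setup wizard also writes connection details whose exact field names are not specified here. A minimal sketch of the file:

```json
{
  "python": "python3"
}
```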
## First-Time Setup

You cannot run the setup wizard directly — it requires interactive terminal input.

Check if `~/.redshift-skill/config.json` exists:

- If it exists: read it to confirm the connection details.
- If it doesn't exist: tell the user to run the setup wizard in their terminal: "Run this in your terminal to configure the Redshift connection: `python3 scripts/setup.py`" (on Windows, use `python` instead of `python3`).

Wait for the user to confirm setup is complete before running any queries.
## Quick Reference

| Task | Script | When to use | Key Args |
|---|---|---|---|
| Run SQL | `query.py` | Any free-form read-only query | `"SELECT ..."` or `--sql-file=PATH` |
| List schemas | `schemas.py` | Starting point — see what schemas exist | |
| List tables | `tables.py` | Browse tables, check row counts and sizes before querying | `--schema=NAME` |
| List columns | `columns.py` | Understand column types, encoding, dist/sort keys | `--schema=NAME --table=NAME` |
| Table DDL | `ddl.py` | Get full CREATE TABLE with encoding, distkey, sortkey | `--schema=NAME --name=NAME` |
| View DDL | `ddl.py` | Get CREATE VIEW definition | `--type=view --schema=NAME --name=NAME` |
| Schema DDL | `ddl.py` | Get CREATE SCHEMA with authorization | `--type=schema [--name=NAME]` |
| Database DDL | `ddl.py` | Get CREATE DATABASE | `--type=database` |
| UDF DDL | `ddl.py` | Get function definitions (SQL + Python) | `--type=udf [--schema=NAME]` |
| External DDL | `ddl.py` | Get Spectrum external table DDL | `--type=external [--schema=NAME]` |
| Group DDL | `ddl.py` | Get user group definitions | `--type=group` |
| Table metadata | `table_info.py` | Check size, rows, skew, encoding — before writing big queries | `--schema=NAME --table=NAME` |
| Search objects | `search.py` | Find tables or columns when you don't know the exact name | `--pattern=TEXT` |
| Sample data | `sample.py` | Quick peek at actual values — always do this before writing queries | `--schema=NAME --table=NAME` |
| Disk usage | `space.py` | Find the largest tables, identify storage issues | `[--schema=NAME] [--top=N]` |
| Data profile | `profile.py` | Per-column stats (nulls, cardinality, min/max/avg) | `--schema=NAME --table=NAME` |
| Local analytics | `analyze.py` | Analyze saved results locally without hitting Redshift | `FILE --describe` |
### Common options (all Redshift scripts)

| Option | Description |
|---|---|
| `--format=txt\|csv\|json` | Terminal display format (default: txt) |
| `--save-format=txt\|csv\|json` | File save format (default: csv) |
| `--save=PATH` | Save to a specific file path |
| `--no-save` | Don't auto-save to `~/redshift-exports/` |
| `--save-sql` | Save the SQL query as a `.sql` file alongside results |
| `--sql-file=PATH` | Read SQL from a file (`query.py` only) |
| `--profile=NAME` | Override AWS profile |
| `--cluster=NAME` | Override cluster (provisioned) |
| `--workgroup=NAME` | Override workgroup (serverless) |
| `--database=NAME` | Override database |
| `--db-user=NAME` | Override database user (provisioned only) |
| `--timeout=N` | Max wait seconds (default: 120) |
| `--max-rows=N` | Max rows to fetch (default: 1000) |
## Output and File Saving

All query results are automatically saved to `~/redshift-exports/query-{timestamp}.csv`. The terminal shows an aligned txt preview (first 200 rows); the saved file defaults to CSV for spreadsheet compatibility.

This means you always have:

- An inline preview (200 rows in txt format) — enough to understand the data shape and answer quick questions
- The full CSV on disk — for deeper analysis with `analyze.py` or for the user to open in a spreadsheet

`--format` controls terminal display (default: txt). `--save-format` controls the saved file format (default: csv). Use `--save-sql` to also save the SQL query as a matching `.sql` file, `--no-save` to skip auto-save, and `--save=PATH` to save to a specific location.
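A typical combination of these flags (the query itself is elided; every option shown is from the table above):

```bash
# JSON in the terminal, JSON on disk, SQL saved alongside the results
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py "SELECT ..." \
  --format=json --save-format=json --save=~/redshift-exports/results.json --save-sql
```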
## Defensive Guardrails

These rules protect the cluster from expensive queries. Follow them — but use judgement. If the user explicitly asks for something that bends a rule, explain the trade-off and proceed if they confirm.
- Never `SELECT *` from tables with >10K rows — this tool is for exploration and analysis, not data export. Use aggregations, filters, or `sample.py` instead. For smaller tables, `SELECT *` is fine.
- Always add `LIMIT` when exploring unfamiliar tables (default `LIMIT 100`).
- Check row counts first — run `tables.py --schema=X` before writing queries so you know what you're dealing with.
- Prefer aggregations for large tables — `COUNT`, `SUM`, `AVG` with `GROUP BY` over pulling raw rows. Raw data is perfectly fine for small tables and when exploring/understanding data.
- Filter on the sort key for very large fact tables — especially tables sorted by date/timestamp when filtering on a specific time period. Redshift uses zone maps to skip blocks, making date-range filters very efficient. Not strictly required, but strongly recommended for tables >100M rows.
- Joins are fine — Redshift handles large table joins well as long as the join condition is correct and you aggregate/filter the result appropriately. The risk is not the join itself but returning unbounded raw data from a join.
- Avoid accidental cross joins — always include `ON`/`USING` unless you intentionally need a cartesian product (e.g. exploding data by design).
- Prefer `LIMIT` + `ORDER BY` over unbounded selects when exploring.
Size awareness:

| Table size | Approach |
|---|---|
| <10K rows | Explore freely; `SELECT *` is fine |
| 10K–1M rows | Add `WHERE` or `LIMIT`; prefer aggregations for full-table queries |
| >1M rows | Always aggregate or filter; never `SELECT *`; use sort key filters on large fact tables |
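Putting the guardrails together — a sketch using the `sales.fact_orders` example table from the Script Details section (85M rows, sorted on `order_date`):

```sql
-- GOOD: aggregates 85M rows down to one row per day, filtering on the sort key
SELECT order_date,                           -- sort key: zone maps skip old blocks
       COUNT(*)                    AS orders,
       COUNT(DISTINCT customer_id) AS customers
FROM sales.fact_orders
WHERE order_date >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY 1
ORDER BY 1

-- BAD: unbounded raw export from a large fact table
-- SELECT * FROM sales.fact_orders
```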
## SQL Standards

Every SQL query you write must follow these rules.

### Always comment your SQL

Explain the business intent, not just the mechanics. Use section headers for complex queries.

### Always show the SQL to the user

- Short queries (under 10 lines): show the full SQL inline with comments
- Long queries (10+ lines): save to `~/redshift-exports/query-{timestamp}.sql`, show the key parts inline, and reference the saved file
### Use --sql-file for complex queries

For long SQL, write it to a file first and execute with `--sql-file`:

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=~/redshift-exports/my_query.sql
```
Tip — inline multiline SQL without reading/writing in the LLM context: when you need to write the SQL file and run the query in one Bash call.

Mac/Linux — heredoc:

```bash
cat > "/path/to/query.sql" << 'EOSQL'
-- Your SQL here
SELECT ...
EOSQL
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file="/path/to/query.sql" --save="/path/to/results.csv" --save-sql
```

The `'EOSQL'` quoting prevents shell variable expansion inside the SQL.
Cross-platform — Python:

```bash
PYTHON -c "
sql = '''
-- Your SQL here
SELECT ...
'''
open('/path/to/query.sql', 'w').write(sql.strip())
"
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file="/path/to/query.sql" --save="/path/to/results.csv" --save-sql
```
Tip — reuse SQL files by replacing values: when you already have a SQL file and just need to change specific values (e.g. rolling dates forward a week), use a one-liner to swap them without rewriting the whole file in the LLM context.

Mac/Linux — sed:

```bash
sed "s/2026-03-15/2026-03-22/g; s/2026-03-21/2026-03-28/g" original_query.sql > query.sql
```

Cross-platform — Python:

```bash
PYTHON -c "open('query.sql','w').write(open('original_query.sql').read().replace('2026-03-15','2026-03-22').replace('2026-03-21','2026-03-28'))"
```

Then execute:

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=query.sql --save="/path/to/results.csv" --save-sql
```

For templates with many replacements, named placeholders like `{{WC_START}}` can be used instead of literal dates (see the sketch below).
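A minimal sketch of filling such placeholders, assuming a hypothetical template file `report_template.sql` containing `{{WC_START}}` and `{{WC_END}}` (only `{{WC_START}}` appears in this document; the rest is illustrative):

```bash
PYTHON -c "
sql = open('report_template.sql').read()
sql = sql.replace('{{WC_START}}', '2026-03-15').replace('{{WC_END}}', '2026-03-21')
open('query.sql', 'w').write(sql)
"
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=query.sql
```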
### Formatting conventions

```sql
------------------------------------------------------------------------------------------------------------------------
-- Monthly revenue by region
-- Purpose: Aggregate order line items by region for the last 12 months
-- Filters: Excludes cancelled orders and zero-quantity items
------------------------------------------------------------------------------------------------------------------------
WITH monthly_revenue AS (
    SELECT r.region_name,                                    -- region dimension
           DATE_TRUNC('month', o.order_date) :: DATE         AS month_nk,
           SUM(o.quantity)                                   AS total_quantity,
           SUM(o.line_total)                                 AS total_revenue,
           SUM(o.margin)                                     AS total_margin,
           COUNT(DISTINCT o.customer_id)                     AS unique_customers
    FROM sales.fact_order_lines o
    JOIN sales.dim_regions r USING (region_id)
    WHERE o.order_date >= DATEADD(month, -12, CURRENT_DATE)
      AND o.quantity > 0
      AND o.order_status <> 'cancelled'
    GROUP BY 1, 2
)
SELECT region_name,
       month_nk,
       total_quantity,
       total_revenue,
       total_margin,
       unique_customers,
       -- margin percentage
       ROUND(total_margin / NULLIF(total_revenue, 0) * 100, 2) AS margin_pct
FROM monthly_revenue
ORDER BY region_name, month_nk
```
Key formatting rules:

- Section headers with `--` dashes above the query
- `SELECT`, `FROM`, `JOIN`, `WHERE`, `GROUP BY`, `ORDER BY` left-aligned
- One column per line — never pack multiple columns onto a single line. Each column expression gets its own line with its `AS` alias.
- Column aliases aligned with `AS` at a consistent column position
- Inline comments explaining non-obvious business logic
- CTE names that describe the business concept, not the technical operation
### Generous commenting

Every query must include comments that explain the why, not just the what. Treat SQL as documentation for the next person (or the next agent) who reads it.

- Header block at the top of every query: purpose, sources, key assumptions, parameters
- Section separators (dashed `--` lines) between logical blocks (e.g. between actuals, budget, pivot CTEs)
- Inline comments on non-obvious expressions (e.g. customer estimation formulas, business rules, why a filter exists)
- CTE-level comments explaining what each CTE produces and why it exists as a separate step
```sql
------------------------------------------------------------------------------------------------------------------------
-- Weekly Sales Scorecard
-- Purpose: Core trading metrics with TY/LP/LY comparisons and BU/LV variances
-- Source:  fact_basket_items (actuals), fact_commercial_budget_all (BU)
-- Params:  Change the date in the periods CTE to set the reporting period
-- Notes:
--   - Customer estimation done per channel first, then summed to total
--   - BU/LV only have sales and margin — derivative ratios are actuals-only
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-- ACTUALS: per channel for correct customer estimation
------------------------------------------------------------------------------------------------------------------------
```
### Leverage Redshift alias reuse

Redshift allows referencing a column alias later in the same SELECT. Use this to avoid unnecessary intermediate CTEs for derived columns.

Do this — single SELECT with alias reuse:

```sql
SELECT SUM(quantity)                    AS volume,
       COUNT(DISTINCT basket_sk)        AS orders,
       1.0 * volume / NULLIF(orders, 0) AS ipb
FROM fact_basket_items
```
Not this — an unnecessary extra CTE just to compute a derived column:

```sql
-- BAD: extra CTE only needed because standard SQL can't reuse aliases
WITH base AS (
    SELECT SUM(quantity) AS volume, COUNT(DISTINCT basket_sk) AS orders
    FROM fact_basket_items
)
SELECT volume, orders, 1.0 * volume / NULLIF(orders, 0) AS ipb
FROM base
```
This applies to all derived columns: ratios, variances, customer estimations, margin rates, etc. If a CTE exists solely to reference aliases from a previous step, collapse it into the source SELECT.
### Use GROUP BY ALL or positional GROUP BY

Redshift supports `GROUP BY ALL` (groups by every non-aggregate column in the SELECT) and positional `GROUP BY 1, 2, 3`. Prefer these over repeating long CASE expressions in the GROUP BY clause.

Do this:

```sql
SELECT dp.period_name,
       f.country_sk,
       CASE WHEN ps.composite_rnk <= 100 THEN 'Top 100 SKUs'
            ELSE 'Other SKUs' END       AS sku_segment,
       SUM(f.revenue_excl_vat_all_gbp)  AS sales
FROM ...
GROUP BY ALL
```
Not this — repeating the CASE expression in GROUP BY:

```sql
GROUP BY dp.period_name, f.country_sk,
         CASE WHEN ps.composite_rnk <= 100 THEN 'Top 100 SKUs' ELSE 'Other SKUs' END
```
Use `GROUP BY ALL` when every non-aggregate column should be grouped. Use `GROUP BY 1, 2, 3` when you need ROLLUP or GROUPING SETS on specific positions. When using ROLLUP, positional references are clearest: `GROUP BY ROLLUP(1, 2), 3`.
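For instance, extending the example above (a sketch — the `FROM` clause is elided as before), `ROLLUP(1, 2)` adds subtotal and grand-total rows while position 3 stays fully grouped:

```sql
SELECT dp.period_name,                       -- position 1: rolled up
       f.country_sk,                         -- position 2: rolled up
       CASE WHEN ps.composite_rnk <= 100 THEN 'Top 100 SKUs'
            ELSE 'Other SKUs' END       AS sku_segment,  -- position 3: always grouped
       SUM(f.revenue_excl_vat_all_gbp)  AS sales
FROM ...
GROUP BY ROLLUP(1, 2), 3
-- Rows produced: (period, country, segment) detail,
--                (period, NULL, segment) subtotals, (NULL, NULL, segment) totals
```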
## Schema Exploration Workflow

Use this graduated approach when exploring an unfamiliar schema. Don't follow it rigidly — if the user's intent is clear and specific, skip straight to the relevant step. Parallelise steps where possible.

- Understand the landscape — `schemas.py` → see what schemas exist
- Browse tables — `tables.py --schema=X` → check row counts and sizes
- Understand structure — `columns.py --schema=X --table=Y` → column types, keys
- Check metadata — `table_info.py` → distribution, sort key, unsorted %
- Sample first — `sample.py --limit=5` → see actual data values
- Profile if needed — `profile.py` → nulls, cardinality, min/max per column
- Write targeted queries — now you know enough to write safe, efficient SQL
- Analyze locally — `analyze.py` on saved results for follow-up

Shortcut: if the user says "how many orders last month?", don't run 6 discovery scripts. Check the table size, write the query, run it.
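For example, a short discovery pass on the `sales` schema used throughout this document might look like this (a sketch — names come from the examples below):

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/tables.py --schema=sales                        # row counts and sizes
PYTHON ${CLAUDE_SKILL_DIR}/scripts/columns.py --schema=sales --table=fact_orders   # types, dist/sort keys
PYTHON ${CLAUDE_SKILL_DIR}/scripts/sample.py --schema=sales --table=fact_orders --limit=5
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py "SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*) AS orders FROM sales.fact_orders WHERE order_date >= DATEADD(month, -3, CURRENT_DATE) GROUP BY 1 ORDER BY 1"
```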
## Business Analysis Workflows
These are patterns for approaching common business questions. Think like a product analyst, commercial analyst, or BI developer.
### Understanding business performance
- Start with the headline metric — total revenue, order count, customer count for the period
- Break down by dimensions — time (daily/weekly/monthly), region, channel, product category
- Compare periods — this month vs last month, this year vs last year, vs same period last year
- Identify outliers — which segments are significantly above or below expectations?
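A sketch of the first two steps combined, reusing the `sales.fact_order_lines` example table from the formatting section:

```sql
-- Headline revenue for the last complete month, broken down by region
SELECT r.region_name,
       SUM(o.line_total)             AS revenue,
       COUNT(DISTINCT o.customer_id) AS customers
FROM sales.fact_order_lines o
JOIN sales.dim_regions r USING (region_id)
WHERE o.order_date >= DATE_TRUNC('month', DATEADD(month, -1, CURRENT_DATE))
  AND o.order_date <  DATE_TRUNC('month', CURRENT_DATE)
  AND o.order_status <> 'cancelled'
GROUP BY 1
ORDER BY revenue DESC
```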
### Root cause analysis
When something looks wrong ("revenue dropped 15% last week"):
- Confirm the anomaly — is it real? Check the data source, compare with other metrics
- Decompose the metric — is it volume (fewer orders) or value (lower average order)?
- Slice by dimensions — which region/channel/category drove the change?
- Drill into the outlier — what changed in that segment? New products? Pricing? Stock issues?
- Correlate with events — promotions starting/ending, price changes, stock-outs, seasonality
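A sketch of the decomposition step against the same example table — splitting a revenue move into volume (units) and value (average selling price), using Redshift alias reuse:

```sql
-- Revenue = units × average selling price: which component moved?
SELECT DATE_TRUNC('week', order_date) :: DATE AS week,
       SUM(quantity)                          AS units,    -- volume
       SUM(line_total)                        AS revenue,
       1.0 * revenue / NULLIF(units, 0)       AS asp       -- value
FROM sales.fact_order_lines
WHERE order_date >= DATEADD(week, -8, CURRENT_DATE)
  AND order_status <> 'cancelled'
GROUP BY 1
ORDER BY 1
```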
### Common analyst patterns
| Pattern | Approach | SQL shape |
|---|---|---|
| Trend analysis | Track metrics over time | GROUP BY date/month + SUM/COUNT, compare YoY/MoM |
| Cohort analysis | Group by first purchase, track retention | MIN(order_date) per customer, then join back |
| Distribution analysis | Understand spread of values | NTILE(100), percentiles, use analyze.py --hist locally |
| Top/bottom N | Best and worst performers | ORDER BY metric DESC LIMIT N |
| YoY comparison | Year-over-year growth | Self-join on date shifted by 1 year, or LAG() window function |
| Funnel analysis | Conversion at each step | COUNT(DISTINCT user_id) per step, compute drop-off rates |
| Contribution / Pareto | Which items drive 80% of revenue | Cumulative SUM() OVER (ORDER BY ...), find the 80% cutoff |
| Segmentation | Group entities by behavior | CASE WHEN or NTILE to bucket, then profile each segment |
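A sketch of the contribution/Pareto pattern against the same example table (assuming a `product_id` column on the fact table — an illustrative assumption, as it isn't shown elsewhere in this document):

```sql
-- Which products drive 80% of the last 12 months' revenue?
WITH product_revenue AS (
    SELECT product_id,
           SUM(line_total) AS revenue
    FROM sales.fact_order_lines
    WHERE order_date >= DATEADD(month, -12, CURRENT_DATE)
    GROUP BY 1
)
SELECT product_id,
       revenue,
       SUM(revenue) OVER (ORDER BY revenue DESC ROWS UNBOUNDED PRECEDING)
           / NULLIF(SUM(revenue) OVER (), 0) AS cumulative_share
FROM product_revenue
ORDER BY revenue DESC
-- Rows with cumulative_share <= 0.80 are the ~80% revenue drivers
```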
## Script Details
### query.py — Run SQL

Run any read-only SQL query. Accepts SQL as a command-line argument or from a file.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py "SELECT count(1) FROM sales.fact_orders"
PYTHON ${CLAUDE_SKILL_DIR}/scripts/query.py --sql-file=~/redshift-exports/my_query.sql
```

```
count
-----
12500000

1 rows returned (1 total). Duration: 25ms
Results saved to: ~/redshift-exports/query-20260318_150505.csv
```
### schemas.py — List schemas

List all user schemas with their owners. The starting point for exploration.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/schemas.py
```

```
schema_name  owner
-----------  ---------
public       admin
sales        etl_user
marketing    etl_user
...
```
### tables.py — List tables

List tables in a schema with row counts, size, distribution style, and sort keys. Run this before writing queries so you know what you're dealing with.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/tables.py --schema=sales
```

```
table_name      row_count  size_mb  diststyle  sortkey1     unsorted  pct_used
--------------  ---------  -------  ---------  -----------  --------  --------
dim_customers      250000      120  ALL        customer_id      0.00    0.0100
dim_products        15000       48  ALL        product_id       0.00    0.0000
fact_orders      85000000    45000  KEY(...)   order_date      12.50    5.8600
...
```
### columns.py — List columns

Show column names, data types, encoding, distribution key, and sort key position. Works with tables, views, and late-binding views.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/columns.py --schema=sales --table=fact_orders
```

```
pos  column_name  data_type          max_len  encoding  distkey  sortkey
---  -----------  -----------------  -------  --------  -------  -------
  1  order_id     CHARACTER VARYING      200  lzo       False          0
  2  order_date   DATE                        az64      False          1
  3  customer_id  CHARACTER VARYING      200  lzo       True           0
...
```
### ddl.py — Generate DDL

Generate DDL for seven object types. Full CREATE TABLE output includes encoding, distkey, sortkey, constraints, comments, and ownership.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --schema=sales --name=fact_orders
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=view --schema=sales --name=v_daily_revenue
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=schema --name=sales
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=database
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=udf --schema=public
PYTHON ${CLAUDE_SKILL_DIR}/scripts/ddl.py --type=group
```
### table_info.py — Table metadata

Detailed stats for a single table: size, rows, distribution skew, encoding, sort key performance.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/table_info.py --schema=sales --table=fact_orders
```

```
tablename          row_count  size_mb  pct_used  diststyle     sortkey1    unsorted  stats_off
-----------------  ---------  -------  --------  ------------  ----------  --------  ---------
sales.fact_orders   85000000    45000    5.8600  KEY(cust_id)  order_date     12.50       0.00
```
### search.py — Search objects

Find tables or columns by name pattern (case-insensitive). Useful when you don't know the exact name.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/search.py --pattern=order
PYTHON ${CLAUDE_SKILL_DIR}/scripts/search.py --pattern=revenue --type=column
```
### sample.py — Sample data

Quick peek at actual row data. Always do this before writing complex queries — it helps you understand column values, formats, and nullability.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/sample.py --schema=sales --table=dim_customers --limit=5
```
### space.py — Disk usage

Find the largest tables, optionally filtered by schema.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/space.py --schema=sales --top=5
```

```
schemaname  tablename        size_mb  row_count  pct_used
----------  ---------------  -------  ---------  --------
sales       fact_orders        45000   85000000    5.8600
sales       fact_line_items    32000  420000000    4.1700
...
```
### profile.py — Data profiling (via Redshift)

Runs a single Redshift query to compute per-column statistics. Useful for understanding data quality before analysis.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/profile.py --schema=sales --table=dim_customers
```

```
column_name     data_type          total_rows  null_count  null_pct  distinct_count  min_val     max_val     avg_val
--------------  -----------------  ----------  ----------  --------  --------------  ----------  ----------  -------
customer_id     CHARACTER VARYING      250000           0       0.0          250000  C0001       C250000
customer_name   CHARACTER VARYING      250000          12       0.0          249500  Aaron       Zoe
signup_date     DATE                   250000           0       0.0            3650  2015-01-01  2025-12-31
lifetime_value  NUMERIC                250000        1500       0.6           48000  0.00        125000.00   2340.50
```
### analyze.py — Local analytics (no Redshift)

Analyze previously saved CSV/JSON files locally. No network access needed — great for follow-up analysis without hitting the cluster again.

```bash
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py ~/redshift-exports/query-*.csv --describe
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --sum=revenue
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --group-by=region --sum=sales
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --filter='year=2024' --sort=amount --desc --top=10
PYTHON ${CLAUDE_SKILL_DIR}/scripts/analyze.py data.csv --hist=price
```

Available operations: `--count`, `--describe`, `--sum=COL`, `--avg=COL`, `--min=COL`, `--max=COL`, `--median=COL`, `--group-by=COL`, `--filter=EXPR`, `--sort=COL`, `--desc`, `--top=N`, `--hist=COL`
## Read-Only Guardrails

Two layers of protection:

1. You must validate — before passing any SQL to `query.py`, verify the first keyword is in the allowlist. If the user asks to run a modifying query, refuse and explain why.
2. `lib/client.py` enforces — the script itself rejects non-allowlisted SQL before it reaches AWS. This is a hard guardrail that cannot be bypassed.

Allowed: `SELECT`, `WITH`, `SHOW`, `DESCRIBE`, `EXPLAIN`, `SET`

Blocked: `CREATE`, `ALTER`, `DROP`, `TRUNCATE`, `INSERT`, `UPDATE`, `DELETE`, `MERGE`, `COPY`, `UNLOAD`, `GRANT`, `REVOKE`, `CALL`, `EXECUTE`, `VACUUM`, `ANALYZE`

Multi-statement queries (`;` followed by another statement) are also blocked.
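A minimal sketch of the first-keyword check you should perform — illustrative only and deliberately conservative; the authoritative enforcement lives in `lib/client.py`:

```python
# Illustrative validation mirroring the documented allowlist — not lib/client.py's actual code
ALLOWED = {"SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN", "SET"}

def is_read_only(sql: str) -> bool:
    """True if the first keyword is allowlisted and no second statement follows."""
    body = sql.strip().rstrip(";").strip()
    if ";" in body:   # a remaining ';' means multi-statement input (conservatively,
        return False  # this also rejects semicolons inside string literals)
    words = body.split()
    return bool(words) and words[0].upper() in ALLOWED
```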
## Bundled SQL

The skill ships with SQL extracted from amazon-redshift-utils admin views in `${CLAUDE_SKILL_DIR}/scripts/sql/`. These queries run directly against the system catalogs — no admin schema needed. `ddl.py` uses them for full DDL generation with distkeys, sortkeys, encoding, constraints, and comments.
## Advanced SQL Templates

For ad-hoc exploration via `query.py`:

Query history (last 20):

```sql
SELECT query, TRIM(querytxt) AS sql, starttime, endtime,
       DATEDIFF(ms, starttime, endtime) AS duration_ms
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 20
```

Running queries:

```sql
SELECT user_name, db_name, query, pid, starttime, duration, status
FROM stv_recents
WHERE status = 'Running'
ORDER BY starttime DESC
```

Table dependencies (FK relationships):

```sql
SELECT n1.nspname || '.' || c1.relname AS source_table,
       n2.nspname || '.' || c2.relname AS referenced_table,
       con.conname                     AS constraint_name
FROM pg_constraint con
JOIN pg_class c1     ON con.conrelid = c1.oid
JOIN pg_namespace n1 ON c1.relnamespace = n1.oid
JOIN pg_class c2     ON con.confrelid = c2.oid
JOIN pg_namespace n2 ON c2.relnamespace = n2.oid
WHERE con.contype = 'f'
ORDER BY source_table
```