
BigQuery Query Optimization

Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.

Query Execution Analysis

Dry Runs and Execution Details

BigQuery does not support an EXPLAIN statement. To estimate a query's cost before running it, use a dry run:

# Estimate bytes scanned without running the query (free)
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM `project.dataset.table` WHERE condition'

To analyze a query after it runs, open the Execution details tab in the console, or inspect the job via the Jobs API or INFORMATION_SCHEMA.JOBS.

What the execution details show:

  • Stages of execution
  • Bytes read per stage
  • Slot time consumed
  • Potential bottlenecks

Performance Best Practices

1. Avoid SELECT *

❌ Bad (scans all columns):

SELECT * FROM `project.dataset.large_table`

✅ Good (only needed columns):

SELECT customer_id, amount, date
FROM `project.dataset.large_table`

Impact: Full table scan vs targeted column read. Can reduce data scanned by 90%+.

2. Filter Early and Often

❌ Bad (filter after aggregation):

SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
GROUP BY customer_id
HAVING SUM(amount) > 1000

✅ Good (filter before aggregation):

SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
WHERE amount > 100  -- Filter early
GROUP BY customer_id
HAVING SUM(amount) > 1000

Note: the WHERE filter changes which rows are aggregated, so this is not equivalent to the first query — add it only when it matches the intended logic.

3. Use Partitioned Tables

Without partition filter:

-- order_date is not the partition column; the entire table is scanned
SELECT * FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'

With partition filter:

-- order_timestamp is the partition column; only relevant partitions are scanned
SELECT * FROM `project.dataset.orders`
WHERE order_timestamp >= TIMESTAMP('2024-01-01')

Key: Filter directly on the partition column (without wrapping it in functions) for automatic partition pruning.

4. Break Complex Queries

❌ Anti-pattern (one huge query):

SELECT ...
FROM (
  SELECT ...
  FROM (
    SELECT ... -- Deeply nested
  )
)
WHERE ...

✅ Good (use CTEs):

WITH base_data AS (
  SELECT customer_id, amount, date
  FROM `project.dataset.orders`
  WHERE date >= '2024-01-01'
),
aggregated AS (
  SELECT customer_id, SUM(amount) as total
  FROM base_data
  GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000

✅ Better (multi-statement with temp tables):

CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01';

CREATE TEMP TABLE aggregated AS
SELECT customer_id, SUM(amount) as total
FROM base_data
GROUP BY customer_id;

SELECT * FROM aggregated WHERE total > 1000;

5. JOIN Optimization

Put largest table first:

-- ✅ Large table first
SELECT l.*, s.detail
FROM `project.dataset.large_table` l
JOIN `project.dataset.small_table` s
  ON l.id = s.id

Use clustering on JOIN columns:

  • Cluster tables on frequently joined columns
  • BigQuery can prune data blocks more effectively
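A hypothetical clustered (and partitioned) copy of a table can be created like this; the table and column names are illustrative:

```sql
-- Illustrative names: partition by day, cluster on the columns
-- most often used in JOIN conditions and equality filters
CREATE TABLE `project.dataset.orders_clustered`
PARTITION BY DATE(order_timestamp)
CLUSTER BY customer_id, product_id
AS
SELECT * FROM `project.dataset.orders`;
```

Clustering pays off when the clustered columns appear in JOIN conditions or equality filters, since BigQuery can then skip unrelated data blocks.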

Consider ARRAY/STRUCT for 1:many:

-- Instead of JOIN for 1:many relationships
SELECT
  order_id,
  ARRAY_AGG(STRUCT(product_id, quantity, price)) as items
FROM `project.dataset.order_items`
GROUP BY order_id

6. Leverage Automatic Features

BigQuery automatically performs:

  • Query rewrites - Optimizes query structure
  • Partition pruning - With proper filters
  • Dynamic filtering - Reduces data scanned

Ensure your queries enable these:

  • Filter on partition columns
  • Use simple, clear predicates
  • Avoid functions on partition columns in WHERE clause
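A sketch of a pruning-friendly filter, assuming `order_timestamp` is the partition column (names illustrative):

```sql
-- Direct comparisons on the partition column keep pruning available
SELECT order_id, amount
FROM `project.dataset.orders`
WHERE order_timestamp >= TIMESTAMP('2024-01-01')
  AND order_timestamp <  TIMESTAMP('2024-02-01');
```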

Parameterized Queries

CLI Syntax

bq query \
  --use_legacy_sql=false \
  --parameter=start_date:DATE:2024-01-01 \
  --parameter=end_date:DATE:2024-12-31 \
  --parameter=min_amount:FLOAT64:100.0 \
  'SELECT *
   FROM `project.dataset.orders`
   WHERE order_date BETWEEN @start_date AND @end_date
   AND amount >= @min_amount'

Python Syntax

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
WHERE order_date >= @start_date
GROUP BY customer_id
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01")
    ]
)

results = client.query_and_wait(query, job_config=job_config)

Key points:

  • Use @ prefix for named parameters
  • Syntax: name:TYPE:value or name::value (STRING default)
  • Cannot use as column/table names
  • Only works with standard SQL

User-Defined Functions (UDFs)

SQL UDF (Simple)

CREATE TEMP FUNCTION CleanEmail(email STRING)
RETURNS STRING
AS (
  LOWER(TRIM(email))
);

SELECT CleanEmail(customer_email) as email
FROM `project.dataset.customers`;

JavaScript UDF (Complex Logic)

CREATE TEMP FUNCTION ParseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, version STRING>
LANGUAGE js AS r"""
  var match = ua.match(/(Chrome|Firefox|Safari)\/(\d+)/);
  return {
    browser: match ? match[1] : 'Unknown',
    version: match ? match[2] : '0'
  };
""";

SELECT ParseUserAgent(user_agent).browser as browser
FROM `project.dataset.sessions`;

Limitations:

  • INT64 unsupported in JavaScript (use FLOAT64 or STRING)
  • JavaScript doesn't support 64-bit integers natively
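A common workaround is to keep INT64 out of the JavaScript boundary by casting to FLOAT64 on the way in (function and table names here are hypothetical):

```sql
-- Workaround sketch: accept and return FLOAT64 instead of INT64
CREATE TEMP FUNCTION DoubleIt(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r"""
  return x * 2;
""";

SELECT DoubleIt(CAST(quantity AS FLOAT64)) AS doubled
FROM `project.dataset.order_items`;
```

For values larger than 2^53, pass STRING instead of FLOAT64 to avoid precision loss.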

Persistent UDFs

-- Create once, use many times
CREATE FUNCTION `project.dataset.clean_email`(email STRING)
RETURNS STRING
AS (LOWER(TRIM(email)));

-- Use anywhere
SELECT `project.dataset.clean_email`(email) FROM ...

Scripting & Procedural Language

Variables

DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM `project.dataset.orders`);

SELECT total_orders;

Loops

LOOP:

DECLARE x INT64 DEFAULT 0;
LOOP
  SET x = x + 1;
  IF x >= 10 THEN
    LEAVE;
  END IF;
END LOOP;

WHILE:

DECLARE x INT64 DEFAULT 0;
WHILE x < 10 DO
  SET x = x + 1;
END WHILE;

FOR with arrays:

DECLARE ids ARRAY<STRING>;
SET ids = ['id1', 'id2', 'id3'];

FOR item IN (SELECT * FROM UNNEST(ids) AS id)
DO
  -- Access each row's columns through the loop variable
  SELECT item.id;
END FOR;

Query Caching

Automatic caching (24 hours):

  • Identical queries are served from the cache at no cost
  • Near-instant response

To bypass cache:

bq query --use_cache=false 'SELECT...'

Common Anti-Patterns

❌ Using LIMIT to reduce cost

-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM `project.dataset.huge_table` LIMIT 10

Impact: Still scans entire table. Use WHERE filters instead.

❌ Functions on partition columns

-- Prevents partition pruning
WHERE CAST(date_column AS STRING) = '2024-01-01'

✅ Better:

WHERE date_column = DATE('2024-01-01')

❌ Cross joins without filters

-- Cartesian product = huge result
SELECT * FROM table1 CROSS JOIN table2

Impact: Can generate millions/billions of rows.

❌ Correlated subqueries

-- Runs subquery for each row
SELECT *
FROM orders o
WHERE amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id)

✅ Better (use window functions):

SELECT *
FROM (
  SELECT *, AVG(amount) OVER (PARTITION BY customer_id) as avg_amount
  FROM orders
)
WHERE amount > avg_amount

Common SQL Gotchas

CTE Re-execution (CRITICAL COST ISSUE)

Problem: When a CTE is referenced multiple times, BigQuery re-executes it each time, billing you multiple times.

❌ Bad (CTE runs 3 times - billed 3x):

WITH expensive_cte AS (
  SELECT * FROM `project.dataset.huge_table`
  WHERE complex_conditions
  AND lots_of_joins
)
SELECT COUNT(*) FROM expensive_cte
UNION ALL
SELECT SUM(amount) FROM expensive_cte
UNION ALL
SELECT MAX(date) FROM expensive_cte;

Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).

✅ Good (use temp table - billed 1x):

CREATE TEMP TABLE expensive_data AS
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions
AND lots_of_joins;

SELECT COUNT(*) FROM expensive_data
UNION ALL
SELECT SUM(amount) FROM expensive_data
UNION ALL
SELECT MAX(date) FROM expensive_data;

When CTEs are OK:

  • Referenced only once
  • Very small result set
  • Part of larger query (BigQuery may optimize)

When to use temp tables:

  • CTE referenced 2+ times
  • Large data volumes
  • Complex/expensive CTE query

NOT IN with NULL Values (SILENT FAILURE)

Problem: NOT IN returns NOTHING (empty result) if ANY NULL exists in the subquery.

❌ Broken (returns empty if blocked_customers has any NULL):

SELECT * FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM blocked_customers  -- If ANY NULL, returns 0 rows!
);

Why it fails:

  • SQL three-valued logic: TRUE, FALSE, UNKNOWN
  • NULL IN (...) evaluates to UNKNOWN
  • NOT UNKNOWN is still UNKNOWN
  • Rows with UNKNOWN are filtered out

✅ Solution 1: Use NOT EXISTS (safest):

SELECT * FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM blocked_customers b
  WHERE b.customer_id = c.customer_id
);

✅ Solution 2: Filter NULLs explicitly:

SELECT * FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id
  FROM blocked_customers
  WHERE customer_id IS NOT NULL  -- Explicit NULL filter
);

✅ Solution 3: Use LEFT JOIN:

SELECT c.*
FROM customers c
LEFT JOIN blocked_customers b
  ON c.customer_id = b.customer_id
WHERE b.customer_id IS NULL;

Best practice: Prefer NOT EXISTS - it's clearer, safer, and often faster.


DML Statement Performance

Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.

Why DML is slow in BigQuery:

  • Columnar storage (not row-based)
  • Designed for bulk reads, not individual updates
  • No indexes for fast row lookups
  • Every update rewrites affected partitions

❌ Very slow (row-by-row updates):

-- Don't do this - takes minutes/hours
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id = '12345';

-- This is even worse - runs once per row
FOR record IN (SELECT order_id FROM orders_to_update)
DO
  UPDATE orders SET status = 'processed' WHERE order_id = record.order_id;
END FOR;

⚠️ Better (batch updates):

UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);

✅ Best (recreate table - fastest):

CREATE OR REPLACE TABLE `project.dataset.orders` AS
SELECT
  * EXCEPT(status),
  CASE
    WHEN order_id IN (SELECT order_id FROM orders_to_update)
      THEN 'processed'
    ELSE status
  END AS status
FROM `project.dataset.orders`;

For INSERT/UPSERT - use MERGE:

MERGE `project.dataset.customers` AS target
USING `project.dataset.customer_updates` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, created_at)
  VALUES (source.customer_id, source.name, CURRENT_TIMESTAMP());

Best practices:

  • Batch updates instead of row-by-row
  • Use MERGE for upserts
  • Consider recreating table for large updates
  • Partition tables to limit update scope
  • Avoid frequent small DML operations
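For example, on a table partitioned by `order_date` (illustrative names), constraining the update with the partition column limits how much data is rewritten:

```sql
-- The partition filter restricts which partitions are rewritten
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_date = '2024-01-15'  -- Partition column
  AND status = 'pending';
```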

Data Type Gotchas

INT64 is the only integer type:

-- All of these are the same: INT64
CREATE TABLE example (
  col1 INT64,      -- ✅ Explicit
  col2 INTEGER,    -- Converted to INT64
  col3 INT,        -- Converted to INT64
  col4 SMALLINT,   -- Converted to INT64
  col5 BIGINT      -- Converted to INT64
);

No UUID type - use STRING:

-- PostgreSQL
CREATE TABLE users (id UUID);

-- BigQuery
CREATE TABLE users (id STRING);  -- Store UUID as string

NUMERIC precision limits:

-- NUMERIC: 38 digits precision, 9 decimal places
NUMERIC(38, 9)

-- BIGNUMERIC: 76 digits precision, 38 decimal places
BIGNUMERIC(76, 38)

-- Example
SELECT
  CAST('12345678901234567890.123456789' AS NUMERIC) AS num,
  CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;

TIMESTAMP vs DATETIME vs DATE:

-- TIMESTAMP: UTC, timezone-aware
SELECT CURRENT_TIMESTAMP();  -- 2024-01-15 10:30:45.123456 UTC

-- DATETIME: No timezone
SELECT CURRENT_DATETIME();   -- 2024-01-15 10:30:45.123456

-- DATE: Date only
SELECT CURRENT_DATE();       -- 2024-01-15

-- Conversion
SELECT
  TIMESTAMP('2024-01-15 10:30:45'),          -- Assumes UTC
  DATETIME(TIMESTAMP '2024-01-15 10:30:45'), -- Loses timezone
  DATE(TIMESTAMP '2024-01-15 10:30:45');     -- Loses time

Type coercion in JOINs:

-- ❌ Implicit cast can prevent optimization
SELECT * FROM table1 t1
JOIN table2 t2
  ON t1.id = CAST(t2.id AS STRING);  -- Prevents clustering optimization

-- ✅ Match types explicitly
SELECT * FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;  -- Both STRING

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.

Basic Syntax

<function> OVER (
  [PARTITION BY partition_expression]
  [ORDER BY sort_expression]
  [window_frame_clause]
)

Ranking Functions

ROW_NUMBER() - Sequential numbering:

SELECT
  customer_id,
  order_date,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders;

Common use: Deduplication

SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
  FROM customers
)
WHERE row_num = 1;

RANK() - Rank with gaps:

SELECT
  product_name,
  revenue,
  RANK() OVER (ORDER BY revenue DESC) AS rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900,  rank 2
-- Product C: $900,  rank 2  (tie)
-- Product D: $800,  rank 4  (gap after tie)

DENSE_RANK() - Rank without gaps:

SELECT
  product_name,
  revenue,
  DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900,  rank 2
-- Product C: $900,  rank 2  (tie)
-- Product D: $800,  rank 3  (no gap)

NTILE() - Divide into N buckets:

SELECT
  customer_id,
  total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_totals;

Analytical Functions

LEAD() and LAG() - Access rows before/after:

-- Time series analysis
SELECT
  date,
  revenue,
  LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
  ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date)) / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change
FROM daily_sales
ORDER BY date;

With PARTITION BY:

-- Per-customer analysis
SELECT
  customer_id,
  order_date,
  amount,
  LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
  amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff
FROM orders;

FIRST_VALUE() and LAST_VALUE():

SELECT
  date,
  revenue,
  FIRST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_day_revenue,
  LAST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_day_revenue
FROM daily_sales;

NTH_VALUE() - Get Nth value:

SELECT
  product_id,
  date,
  sales,
  NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales
FROM product_sales;

Aggregate Window Functions

SUM/AVG/COUNT as window functions:

SELECT
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
  AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7day,
  COUNT(*) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30day_count
FROM daily_sales;

Running totals:

SELECT
  customer_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value
FROM orders;

QUALIFY Clause (Not in Standard SQL)

QUALIFY filters on window function results - no subquery needed!

❌ Standard SQL (verbose):

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
  FROM orders
)
WHERE row_num = 1;

✅ BigQuery with QUALIFY (clean):

SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;

More QUALIFY examples:

-- Get top 3 products per category
SELECT category, product_name, revenue
FROM products
QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;

-- Filter outliers (keep middle 80%)
SELECT *
FROM measurements
QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;

-- Get first order per customer
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;

Window Frame Clauses

Control which rows are included in the window:

-- ROWS: Physical row count
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW  -- Last 4 rows including current

-- RANGE: Logical range over the ORDER BY key
-- (BigQuery requires a numeric key, so convert dates with UNIX_DATE)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW  -- Last 7 days when ordering by UNIX_DATE(date)

-- Examples:
SELECT
  date,
  sales,
  -- Last 7 rows
  AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7rows,

  -- Last 7 days (logical - correct even with gaps in the dates)
  AVG(sales) OVER (ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7days,

  -- All preceding rows
  SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum,

  -- Centered window (3 before, 3 after)
  AVG(sales) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS centered_avg
FROM daily_sales;

Window Function Performance Tips

1. Partition appropriately:

-- ✅ Good: Partitions reduce data scanned
SELECT *
FROM events
WHERE date >= '2024-01-01'  -- Partition filter
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;

2. Avoid window functions in WHERE:

-- ❌ Wrong: Can't use window functions in WHERE
SELECT * FROM orders
WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;  -- ERROR

-- ✅ Use QUALIFY instead
SELECT * FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;

3. Reuse window definitions:

SELECT
  date,
  revenue,
  ROW_NUMBER() OVER w AS row_num,
  RANK() OVER w AS rank,
  AVG(revenue) OVER w AS avg_revenue
FROM sales
WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);

JSON Functions

BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.

Extracting JSON Values

JSON_VALUE() - Extract scalar values (Standard SQL):

SELECT
  JSON_VALUE(json_column, '$.user.name') AS user_name,
  JSON_VALUE(json_column, '$.user.email') AS email,
  CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount,
  CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity
FROM events;

JSON_QUERY() - Extract objects or arrays:

SELECT
  JSON_QUERY(json_column, '$.user') AS user_object,
  JSON_QUERY(json_column, '$.items') AS items_array
FROM events;

JSON_EXTRACT() - Legacy, still widely used:

SELECT
  JSON_EXTRACT(json_column, '$.user.name') AS user_name,
  JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email  -- Returns STRING
FROM events;

JSONPath syntax (BigQuery supports a limited subset):

-- Dot notation
'$.user.name'

-- Array index
'$.items[0].product_id'

-- Quoted field names with special characters
'$."field name"'

Note: wildcards ('$.users[*].name'), array slices, and recursive descent ('$..name') are not supported by BigQuery's JSON functions.

Working with JSON Arrays

JSON_EXTRACT_ARRAY() - Extract array elements:

SELECT
  event_id,
  tag
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;

JSON_VALUE_ARRAY() - Extract array of scalars:

SELECT
  product_id,
  tag
FROM products,
UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;

Complete example:

-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]}
SELECT
  JSON_VALUE(product_json, '$.product_id') AS product_id,
  tag
FROM products,
UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag;

-- Results:
-- A123, electronics
-- A123, sale
-- A123, featured

Creating JSON

TO_JSON_STRING() - Convert to JSON:

SELECT
  customer_id,
  TO_JSON_STRING(STRUCT(
    name,
    email,
    created_at
  )) AS customer_json
FROM customers;

Create JSON objects:

SELECT
  TO_JSON_STRING(STRUCT(
    'John' AS name,
    30 AS age,
    ['reading', 'hiking'] AS hobbies,
    STRUCT('123 Main St' AS street, 'Boston' AS city) AS address
  )) AS person_json;

-- Result:
-- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}

Aggregate into JSON:

SELECT
  customer_id,
  TO_JSON_STRING(
    ARRAY_AGG(
      STRUCT(order_id, amount, date)
      ORDER BY date DESC
      LIMIT 5
    )
  ) AS recent_orders_json
FROM orders
GROUP BY customer_id;

Parsing JSON Strings

PARSE_JSON() - Convert string to JSON:

SELECT
  JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;

Safe JSON parsing (avoid errors):

SELECT
  SAFE.JSON_VALUE(invalid_json, '$.name') AS name  -- Returns NULL on error
FROM events;

Complex JSON Examples

Nested JSON extraction:

-- JSON structure:
-- {
--   "order": {
--     "id": "ORD123",
--     "items": [
--       {"product": "A", "qty": 2, "price": 10.50},
--       {"product": "B", "qty": 1, "price": 25.00}
--     ]
--   }
-- }

SELECT
  JSON_VALUE(data, '$.order.id') AS order_id,
  JSON_VALUE(item, '$.product') AS product,
  CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
  CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;

Transform relational data to JSON:

SELECT
  category,
  TO_JSON_STRING(
    STRUCT(
      category AS category_name,
      COUNT(*) AS product_count,
      ROUND(AVG(price), 2) AS avg_price,
      ARRAY_AGG(
        STRUCT(product_name, price)
        ORDER BY price DESC
        LIMIT 3
      ) AS top_products
    )
  ) AS category_summary
FROM products
GROUP BY category;

JSON Performance Tips

1. Extract once, reuse:

-- ❌ Bad: Multiple extractions
SELECT
  JSON_VALUE(data, '$.user.id'),
  JSON_VALUE(data, '$.user.name'),
  JSON_VALUE(data, '$.user.email')
FROM events;

-- ✅ Better: Extract object once
WITH extracted AS (
  SELECT JSON_QUERY(data, '$.user') AS user_json
  FROM events
)
SELECT
  JSON_VALUE(user_json, '$.id'),
  JSON_VALUE(user_json, '$.name'),
  JSON_VALUE(user_json, '$.email')
FROM extracted;

2. Consider STRUCT columns instead of JSON:

-- If schema is known and stable, use STRUCT
CREATE TABLE events (
  user STRUCT<id STRING, name STRING, email STRING>,
  timestamp TIMESTAMP
);

-- Query with dot notation (faster than JSON extraction)
SELECT user.id, user.name, user.email
FROM events;

3. Materialize frequently accessed JSON fields:

-- BigQuery has no generated columns: add the column, then backfill it
ALTER TABLE events
ADD COLUMN user_id STRING;

UPDATE `project.dataset.events`
SET user_id = JSON_VALUE(data, '$.user.id')
WHERE user_id IS NULL;

-- Now queries can filter efficiently
SELECT * FROM events WHERE user_id = 'U123';

ARRAY and STRUCT

BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.

ARRAY Basics

Creating arrays:

SELECT
  [1, 2, 3, 4, 5] AS numbers,
  ['apple', 'banana', 'cherry'] AS fruits,
  [DATE '2024-01-01', DATE '2024-01-02'] AS dates;

ARRAY_AGG() - Aggregate into array:

SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids,
  ARRAY_AGG(amount) AS order_amounts
FROM orders
GROUP BY customer_id;

With LIMIT:

SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids
FROM orders
GROUP BY customer_id;

UNNEST - Flattening Arrays

Basic UNNEST:

SELECT element
FROM UNNEST(['a', 'b', 'c']) AS element;

-- Results:
-- a
-- b
-- c

UNNEST with table:

-- Table: customers
-- customer_id | order_ids
-- 1           | [101, 102, 103]
-- 2           | [201, 202]

SELECT
  customer_id,
  order_id
FROM customers,
UNNEST(order_ids) AS order_id;

-- Results:
-- 1, 101
-- 1, 102
-- 1, 103
-- 2, 201
-- 2, 202

UNNEST with OFFSET (get array index):

SELECT
  item,
  idx
FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;

-- Results:
-- first, 0
-- second, 1
-- third, 2

STRUCT - Nested Records

Creating structs:

SELECT
  STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person,
  STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;

Querying struct fields:

SELECT
  person.name,
  person.age,
  address.city
FROM (
  SELECT
    STRUCT('John' AS name, 30 AS age) AS person,
    STRUCT('Boston' AS city) AS address
);

ARRAY of STRUCT (Most Powerful Pattern)

Create:

SELECT
  customer_id,
  ARRAY_AGG(
    STRUCT(
      order_id,
      amount,
      order_date,
      status
    )
    ORDER BY order_date DESC
  ) AS orders
FROM orders
GROUP BY customer_id;

Query:

-- Flatten array of struct (ORDER is a reserved keyword, so use a different alias)
SELECT
  customer_id,
  o.order_id,
  o.amount,
  o.order_date
FROM customers,
UNNEST(orders) AS o
WHERE o.status = 'completed';

Filter array elements:

SELECT
  customer_id,
  ARRAY(
    SELECT AS STRUCT order_id, amount
    FROM UNNEST(orders) AS o
    WHERE o.status = 'completed'
    ORDER BY amount DESC
    LIMIT 3
  ) AS top_completed_orders
FROM customers;

ARRAY Functions

ARRAY_LENGTH():

SELECT
  customer_id,
  ARRAY_LENGTH(order_ids) AS total_orders
FROM customers;

ARRAY_CONCAT():

SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined;
-- Result: [1, 2, 3, 4, 5]

ARRAY_TO_STRING():

SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits;
-- Result: 'apple, banana, cherry'

GENERATE_ARRAY():

SELECT GENERATE_ARRAY(1, 10) AS numbers;
-- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10;
-- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

ARRAY_REVERSE():

SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed;
-- Result: [5, 4, 3, 2, 1]

Performance: ARRAY vs JOIN

Traditional approach (2 tables, JOIN):

-- Table 1: customers (1M rows)
-- Table 2: orders (10M rows)

SELECT
  c.customer_id,
  c.name,
  o.order_id,
  o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = '12345';

-- Scans: customers (1M) + orders (10M) = 11M rows
-- Join cost: High

Array approach (1 table with ARRAY):

-- Table: customers (1M rows with nested orders array)

SELECT
  customer_id,
  name,
  o.order_id,
  o.amount
FROM customers,
UNNEST(orders) AS o
WHERE customer_id = '12345';

-- Scans: customers (1M) only
-- No join cost
-- 50-80% faster for 1:many relationships

When to use ARRAY:

  • 1:many relationships (orders per customer)
  • Moderate array size (< 1000 elements)
  • Frequent filtering by parent entity
  • Want to reduce JOINs

When NOT to use ARRAY:

  • Many:many relationships
  • Very large arrays (> 10,000 elements)
  • Need to query array elements independently
  • Array elements frequently updated

Complete Example: Denormalized Design

Traditional normalized:

-- 3 tables, 2 JOINs
SELECT
  c.customer_id,
  c.name,
  o.order_id,
  oi.product_id,
  oi.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;

Denormalized with ARRAY/STRUCT:

-- 1 table, no JOINs
CREATE TABLE customers_denormalized AS
SELECT
  c.customer_id,
  c.name,
  ARRAY_AGG(
    STRUCT(
      o.order_id,
      o.order_date,
      o.status,
      ARRAY(
        SELECT AS STRUCT product_id, quantity, price
        FROM order_items
        WHERE order_id = o.order_id
      ) AS items
    )
    ORDER BY o.order_date DESC
  ) AS orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

-- Query (no JOINs!)
SELECT
  customer_id,
  name,
  o.order_id,
  item.product_id,
  item.quantity
FROM customers_denormalized,
UNNEST(orders) AS o,
UNNEST(o.items) AS item
WHERE customer_id = '12345';

Performance improvement: 3-5x faster for typical queries.


BigQuery-Specific Features

EXCEPT and REPLACE in SELECT

EXCEPT - Exclude columns:

-- Select all except sensitive columns
SELECT * EXCEPT(ssn, password, credit_card)
FROM customers;

-- Combine with WHERE
SELECT * EXCEPT(internal_notes)
FROM orders
WHERE status = 'shipped';

REPLACE - Modify columns:

-- Replace column values
SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price)
FROM products;

-- Anonymize data
SELECT * REPLACE('***' AS ssn, '***' AS credit_card)
FROM customers;

Combine EXCEPT and REPLACE:

SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email)
FROM users;

SAFE Functions (NULL Instead of Errors)

SAFE_CAST() - Returns NULL on error:

-- Regular CAST throws error on invalid input
SELECT CAST('invalid' AS INT64);  -- ERROR

-- SAFE_CAST returns NULL
SELECT SAFE_CAST('invalid' AS INT64) AS result;  -- NULL

SAFE_DIVIDE() - Returns NULL on division by zero:

SELECT
  revenue,
  orders,
  SAFE_DIVIDE(revenue, orders) AS avg_order_value  -- NULL if orders = 0
FROM daily_metrics;

Other SAFE functions:

-- SAFE_SUBTRACT
SELECT SAFE_SUBTRACT(-9223372036854775807, 2);  -- NULL (INT64 underflow)

-- SAFE_MULTIPLY
SELECT SAFE_MULTIPLY(9223372036854775807, 2);  -- NULL (overflow)

-- SAFE_ADD
SELECT SAFE_ADD(9223372036854775807, 1);  -- NULL (overflow)

Use case: Data quality checks:

SELECT
  COUNT(*) AS total_rows,
  COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts,
  COUNT(*) - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts
FROM transactions;

GROUP BY with Column Numbers

-- ✅ Valid: Group by column position
SELECT
  customer_id,
  DATE(order_date) AS order_date,
  SUM(amount) AS total
FROM orders
GROUP BY 1, 2;  -- Same as: GROUP BY customer_id, DATE(order_date)

-- ✅ Also valid: Mix names and numbers
SELECT
  customer_id,
  DATE(order_date) AS order_date,
  SUM(amount) AS total
FROM orders
GROUP BY customer_id, 2;

When useful:

  • Long expressions in SELECT
  • Complex CASE statements
  • Simplifies GROUP BY clause
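For instance, a long CASE expression can be grouped by position instead of repeating it (table and column names illustrative):

```sql
-- Group a long expression by its SELECT-list position
SELECT
  CASE
    WHEN amount >= 1000 THEN 'high'
    WHEN amount >= 100  THEN 'mid'
    ELSE 'low'
  END AS tier,
  COUNT(*) AS order_count
FROM `project.dataset.orders`
GROUP BY 1;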

TABLESAMPLE - Random Sampling

System sampling (fast, approximate):

-- Sample ~10% of data (by blocks)
SELECT * FROM large_table
TABLESAMPLE SYSTEM (10 PERCENT);

Use cases:

  • Quick data exploration
  • Testing queries on large tables
  • Statistical sampling

Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
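A sketch of the ROW_NUMBER() + RAND() approach for an exact-size sample (note that, unlike TABLESAMPLE, this scans the full table):

```sql
-- Exact 1000-row random sample
SELECT *
FROM `project.dataset.large_table`
WHERE TRUE  -- QUALIFY may require an accompanying WHERE/GROUP BY/HAVING clause
QUALIFY ROW_NUMBER() OVER (ORDER BY RAND()) <= 1000;
```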


PIVOT and UNPIVOT

PIVOT - Rows to columns:

SELECT *
FROM (
  SELECT product, quarter, sales
  FROM quarterly_sales
)
PIVOT (
  SUM(sales)
  FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);

-- Before:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150

-- After:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | ... | ...

UNPIVOT - Columns to rows:

SELECT *
FROM quarterly_totals
UNPIVOT (
  sales FOR quarter IN (Q1, Q2, Q3, Q4)
);

-- Before:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | 200 | 250

-- After:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150

Standard SQL vs Legacy SQL

BigQuery has two SQL dialects:

Feature          | Standard SQL            | Legacy SQL
ANSI compliance  | ✅ Yes                  | ❌ No
Recommended      | ✅ Yes                  | ❌ Deprecated
Table reference  | `project.dataset.table` | [project:dataset.table]
CTEs (WITH)      | ✅ Yes                  | ❌ No
Window functions | ✅ Full support         | ⚠️ Limited
ARRAY/STRUCT     | ✅ Native               | ⚠️ Limited
Portability      | ✅ High                 | ❌ BigQuery-only

How to detect Legacy SQL:

-- Legacy SQL indicators:
-- 1. Square brackets for tables
SELECT * FROM [project:dataset.table]

-- 2. GROUP EACH BY
SELECT customer_id, COUNT(*) FROM orders GROUP EACH BY customer_id

-- 3. FLATTEN
SELECT * FROM FLATTEN([project:dataset.table], field)

-- 4. TABLE_DATE_RANGE
SELECT * FROM TABLE_DATE_RANGE([dataset.table_], TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))

Standard SQL equivalent:

-- 1. Backticks
SELECT * FROM `project.dataset.table`

-- 2. Regular GROUP BY
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id

-- 3. UNNEST
SELECT * FROM `project.dataset.table`, UNNEST(field)

-- 4. Partitioned table filter
SELECT * FROM `project.dataset.table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-12-31')

Migration:

# Set default to Standard SQL
bq query --use_legacy_sql=false 'SELECT ...'

# Or in Python
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)

Performance Checklist

Before running expensive queries:

  1. ☐ Use --dry_run to estimate cost
  2. ☐ Check if partition pruning is active
  3. ☐ Verify clustering on JOIN/WHERE columns
  4. ☐ Remove SELECT *
  5. ☐ Filter early with WHERE
  6. ☐ Review the query plan in Execution details
  7. ☐ Consider materialized views for repeated queries
  8. ☐ Test logic on a small sample first (TABLESAMPLE or one partition — LIMIT does not cut cost)
  9. ☐ Avoid CTE re-execution (use temp tables if referenced 2+ times)
  10. ☐ Use NOT EXISTS instead of NOT IN
  11. ☐ Batch DML operations (avoid row-by-row updates)
  12. ☐ Consider ARRAY/STRUCT for 1:many relationships

Monitoring Query Performance

-- Check query statistics
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;

Quick Wins

Immediate improvements:

  1. Add partition filter → 50-90% cost reduction
  2. Remove SELECT * → 30-70% cost reduction
  3. Use clustering → 20-50% performance improvement
  4. Break complex queries → 2-5x faster execution
  5. Enable query cache → Free repeated queries
antigravity12