Data AWS Skill

Version: 1.1 Stack: AWS (Glue, Lambda, S3, Redshift, Athena, Step Functions)

AWS data services charge by usage — every query scans data, every DPU-hour costs money, every unpartitioned table means full scans at $5/TB. The difference between a well-architected data lake and a poorly-structured one isn't just performance — it's real dollars. Unpartitioned Athena queries can cost 100x what partitioned queries cost for the same result. Glue jobs at default DPU count waste compute. Lambda functions with clients initialized inside the handler waste cold start time on every invocation.

These patterns aren't just best practices — they're cost controls.

Scope and Boundaries

This skill covers:

S3 data lake patterns
Glue ETL jobs and crawlers
Lambda for data processing
Athena query patterns
Redshift optimization
Step Functions orchestration

Defers to other skills:

data-iac: Terraform/CDK for infrastructure
data-pipelines: General ETL patterns
security: IAM, encryption, secrets

Use this skill when: Building data solutions on AWS.

Core Principles

S3 as the Foundation — Data lake first, warehouse for served data.
Partition Everything — S3 paths and tables partitioned by date at minimum.
Right-Size Compute — Lambda for small (<10 GB, <15 min), Glue for large, Redshift for served.
Cost Awareness — Every query scans data, every DPU costs money. Partition, compress, columnar.
Event-Driven When Possible — S3 events trigger processing, not polling.
Explicit Schemas — Define schemas in Glue Catalog, never rely on crawlers in production.

Service Selection

Data Size	Frequency	Latency Need	Use
<10 GB	Event-driven	Seconds	Lambda
>10 GB	Scheduled	Minutes	Glue
Any	Ad-hoc SQL	Seconds	Athena
Large	Concurrent BI users	Sub-second	Redshift
Streaming	Continuous	Milliseconds	Kinesis

See references/service-selection.md for detailed comparison and cost breakdown.

S3 Data Lake

Path Structure

s3://{project}-{env}-data/
├── raw/                                    # Immutable source data
│   └── source={source}/year={YYYY}/month={MM}/day={DD}/
├── staged/                                 # Cleaned, validated
│   └── table={table}/year={YYYY}/month={MM}/day={DD}/
├── curated/                                # Business logic applied
│   └── table={table}/year={YYYY}/month={MM}/day={DD}/
└── aggregated/                             # Pre-computed metrics
    └── table={table}/year={YYYY}/month={MM}/day={DD}/

Rules:

Hive-style partitions (key=value/) for automatic Athena partition discovery
Parquet format for staged and beyond (columnar, compressed, splittable)
Raw stays in original format (JSON, CSV) — never modify source data
One file per partition minimum, avoid millions of small files (aim for 128 MB–1 GB each)

Athena

Partition Projection

Eliminates MSCK REPAIR TABLE and partition management overhead:

CREATE EXTERNAL TABLE events (
    event_id STRING,
    user_id STRING,
    event_type STRING,
    payload STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS PARQUET
LOCATION 's3://bucket/curated/table=events/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.year.type' = 'integer',
    'projection.year.range' = '2020,2030',
    'projection.month.type' = 'integer',
    'projection.month.range' = '1,12',
    'projection.month.digits' = '2',
    'projection.day.type' = 'integer',
    'projection.day.range' = '1,31',
    'projection.day.digits' = '2',
    'storage.location.template' =
        's3://bucket/curated/table=events/year=${year}/month=${month}/day=${day}'
);

Query Cost Control

Athena charges $5 per TB scanned. Reduce cost by:

Always filtering on partition columns (WHERE year = '2026' AND month = '02')
Using columnar formats (Parquet scans only requested columns)
Using LIMIT during exploration
Setting workgroup query data scan limits

Glue ETL

PySpark Job Pattern

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "SOURCE_DATABASE", "SOURCE_TABLE", "OUTPUT_PATH"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from Glue Catalog
source_frame = glue_context.create_dynamic_frame.from_catalog(
    database=args["SOURCE_DATABASE"],
    table_name=args["SOURCE_TABLE"],
    push_down_predicate="year='2026' AND month='02'"  # Partition pruning
)

# Transform
mapped_frame = ApplyMapping.apply(
    frame=source_frame,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("user_id", "string", "user_id", "string"),
        ("event_type", "string", "event_type", "string"),
        ("timestamp", "string", "event_ts", "timestamp"),
    ]
)

filtered_frame = Filter.apply(
    frame=mapped_frame,
    f=lambda row: row["event_type"] is not None
)

# Write as Parquet, partitioned
glue_context.write_dynamic_frame.from_options(
    frame=filtered_frame,
    connection_type="s3",
    connection_options={
        "path": args["OUTPUT_PATH"],
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    format_options={"compression": "snappy"}
)

job.commit()

Glue Rules

Start small: 2 DPUs, increase only if job is slow. Each DPU costs $0.44/hr.
Job bookmarks: Enable for incremental processing (avoids reprocessing old data).
No crawlers in production: Schema drift breaks downstream. Define schemas explicitly in Glue Catalog.
Push down predicates: Filter at read time, not after loading all data.

Lambda

S3 Event Processing Pattern

import json
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize outside handler for connection reuse across invocations
s3 = boto3.client("s3")

def handler(event, context):
    """Process S3 event: read from raw, transform, write to staged."""
    processed = 0
    errors = 0

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        try:
            response = s3.get_object(Bucket=bucket, Key=key)
            data = json.loads(response["Body"].read())

            result = transform(data)

            output_key = key.replace("raw/", "staged/")
            s3.put_object(
                Bucket=bucket,
                Key=output_key,
                Body=json.dumps(result),
                ContentType="application/json"
            )
            processed += 1

        except ClientError as e:
            logger.error("S3 error processing %s: %s", key, e.response["Error"]["Code"])
            errors += 1
        except (json.JSONDecodeError, KeyError) as e:
            logger.error("Data error processing %s: %s", key, e)
            errors += 1

    logger.info("Processed %d records, %d errors", processed, errors)
    return {"processed": processed, "errors": errors}

Lambda Rules

Clients outside handler: boto3.client() calls reuse TCP connections across warm invocations.
Memory = CPU: Increasing memory also increases CPU allocation. Profile to find the cost-optimal setting.
Idempotent operations: S3 events can deliver duplicates. Writing the same output twice should be safe.
Dead letter queue: Configure SQS DLQ to capture failed events for reprocessing.

Redshift

Table Design

CREATE TABLE fact_orders (
    order_id        BIGINT        NOT NULL ENCODE az64,
    customer_id     BIGINT        NOT NULL ENCODE az64,
    order_date      DATE          NOT NULL ENCODE az64,
    product_id      BIGINT        NOT NULL ENCODE az64,
    quantity         INTEGER       NOT NULL ENCODE az64,
    total_amount    DECIMAL(12,2) NOT NULL ENCODE az64,
    region          VARCHAR(50)   ENCODE zstd
)
DISTSTYLE KEY
DISTKEY (customer_id)     -- Collocates joins on customer_id
SORTKEY (order_date);     -- Optimizes date range filters

Key design decisions:

DISTKEY: Column used most in JOIN conditions. Collocates matching rows on the same node.
SORTKEY: Column used most in WHERE/ORDER BY. Enables zone map pruning.
DISTSTYLE EVEN: When no clear join column. Spreads rows equally.
DISTSTYLE ALL: For small dimension tables (<1M rows). Copies to every node.

Loading Data with COPY

COPY fact_orders
FROM 's3://bucket/curated/table=orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
FORMAT AS PARQUET;

Always use COPY over INSERT for bulk loads — it reads directly from S3 in parallel across all nodes.

Maintenance

-- Run after significant data changes
VACUUM FULL fact_orders;    -- Reclaims space, re-sorts rows
ANALYZE fact_orders;        -- Updates query planner statistics

Step Functions

ETL Workflow Pattern

{
  "StartAt": "RunGlueJob",
  "States": {
    "RunGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "etl-raw-to-curated",
        "Arguments": {
          "--SOURCE_DATABASE": "raw_db",
          "--SOURCE_TABLE": "events",
          "--OUTPUT_PATH": "s3://bucket/curated/table=events/"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure"
        }
      ],
      "Next": "RefreshAthenaPartitions"
    },
    "RefreshAthenaPartitions": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "MSCK REPAIR TABLE curated_db.events",
        "WorkGroup": "data-team"
      },
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
        "Message": "ETL pipeline completed successfully"
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
        "Message.$": "States.Format('ETL pipeline failed: {}', $.Error)"
      },
      "End": true
    }
  }
}

Rules:

.sync suffix: Waits for the job to complete (otherwise Step Functions just starts it and moves on).
Retry before Catch: Let transient failures auto-recover before falling to error handling.
Timeouts on every state: Prevents stuck executions from running indefinitely.

Anti-Patterns

Anti-Pattern	Problem	Fix
Unpartitioned S3/tables	Full scans, slow and expensive	Partition by date at minimum
Millions of small files	Slow listing, high request costs	Compact into 128 MB–1 GB files
Lambda for large files	Timeout, memory limits	Use Glue or Fargate for >10 GB
Glue crawlers in production	Schema drift, unpredictable	Define schemas explicitly
Redshift without sort/dist keys	Poor query performance, reshuffling	Define keys based on query patterns
Athena without partition pruning	Scans entire dataset at $5/TB	Always filter on partition columns
Polling S3 for new files	Wasteful, delayed	Use S3 event notifications or EventBridge
Hardcoded ARNs/paths	Breaks across environments	Parameterize with job arguments or SSM

Cost Rules of Thumb

Service	Cost Driver	Save Money By
S3	Storage + requests	Compress, lifecycle to Glacier, avoid tiny files
Athena	$5/TB scanned	Partition, Parquet, select only needed columns
Glue	$0.44/DPU-hour	Start at 2 DPUs, use push-down predicates
Lambda	Duration x memory	Right-size memory, reuse connections
Redshift	RPU-hours or node-hours	Pause when idle, use Serverless for variable loads

Checklist

References

references/service-selection.md — When to use which AWS data service, cost comparison

Assets

assets/aws-data-checklist.md — Complete AWS data infrastructure checklist (S3, Athena, Glue, Lambda, Redshift, Step Functions, monitoring, security)

data-aws