aws-well-architected-review
AWS Well-Architected
Apply AWS Well-Architected best practices to all AWS infrastructure work. This applies whether you are writing new infrastructure or reviewing existing infrastructure.
When writing new infra (create, scaffold, add, generate): Apply the mandatory defaults below so every resource is correct from the first draft. Do not generate anti-patterns and then suggest fixes. The code itself is the output. Read rules/generate-defaults.md for framework-specific property name mappings when generating code.
When reviewing existing infra (review, check, audit, validate): Scan for anti-patterns using the review checklist below and produce a structured findings report with concrete fixes.
If both (e.g., "add a Lambda and review the rest"): Write new code with defaults applied, then review the existing code.
Detect the framework and language from the project context. Generate code and fixes in the matching format.
Generate Mode — Mandatory Defaults
Every resource you generate must satisfy these requirements:
Every resource: Encryption at rest + in transit, least-privilege IAM (no wildcards), tags (Environment, Service, Team), no hardcoded secrets (use Secrets Manager / SSM)
Every Lambda: DLQ, retry config, ARM64, tuned memory + timeout (never 128MB/3s defaults), log retention set (never infinite), X-Ray tracing, structured logging (Powertools), connection reuse outside handler, non-secret config via env vars (secrets go in Secrets Manager / SSM)
Every S3 bucket: Public access blocked, encryption, SSL enforced (bucket policy), versioning, lifecycle rules, removal policy RETAIN
Every DynamoDB table: PITR enabled, encryption, PAY_PER_REQUEST (unless steady patterns), Query-friendly key design with GSIs (no Scans), removal policy RETAIN
Every SQS queue: Companion DLQ with maxReceiveCount, encryption, SSL enforced (queue policy)
Every IAM policy: Specific actions only (never Action: "*"), specific resource ARNs (never Resource: "*"), prefer framework grant helpers, no full-access managed policies
Every API Gateway: Throttling rate + burst limits, authorization on every endpoint, X-Ray tracing, explicit CORS origins (never wildcard *)
Every Lambda handler: Structured logging (Powertools, not console.log/print), input validation (zod/pydantic/joi), specific error handling (no silent catch), non-secret config from env vars
All other resources (SNS, RDS, Step Functions, EventBridge, ALB, CloudFront, etc.): Encryption, least-privilege IAM, DLQ/error destination where applicable, multi-AZ for stateful services, automated backups/PITR where supported
Architecture: Services decoupled via queues/events where possible, async over blocking sync, auto-scaling on stateful services, multi-AZ for RDS/ElastiCache/ECS, environment isolation (dev/staging/prod)
Prototypes / minimal requests: If the user explicitly asks for a minimal or prototype setup, apply the defaults but note which ones you would skip in production and why.
Review Mode — Evaluating Existing Infrastructure
When reviewing existing infrastructure code, follow these steps:
Step 1 — Discover AWS Resources
Scan the codebase for infrastructure definitions and handler code. Identify:
- Compute: Lambda, ECS, EC2, Fargate, Step Functions
- Storage: S3, DynamoDB, RDS, ElastiCache, EFS
- Networking: API Gateway, ALB/NLB, CloudFront, VPC, Route 53
- Messaging: SQS, SNS, EventBridge, Kinesis
- Security: IAM roles/policies, KMS, Secrets Manager, WAF, Cognito
- Monitoring: CloudWatch alarms, X-Ray, CloudTrail
- CI/CD: CodePipeline, CodeBuild, CodeDeploy
Search patterns:
- CDK:
new lambda.Function,new s3.Bucket,new dynamodb.Table,new sqs.Queue, etc. - SAM/CFN:
AWS::Lambda::Function,AWS::S3::Bucket,AWS::DynamoDB::Table, etc. - Terraform:
resource "aws_lambda_function",resource "aws_s3_bucket", etc. - Serverless:
functions:,resources:,provider:blocks
Step 2 — Map Architecture Patterns
Identify architectural patterns in use:
- Event-driven (Lambda + SQS/SNS/EventBridge)
- API-driven (API Gateway + Lambda/ECS)
- Data pipeline (Kinesis/SQS + Lambda + DynamoDB/S3)
- Static hosting (S3 + CloudFront)
- Microservices (multiple services with independent deployments)
- Monolithic Lambda (single large function handling many routes)
Step 3 — Evaluate Against Each Pillar
For each pillar, check the specific items listed below.
Pillar 1: Security (SEC)
| Check | Severity | What to look for |
|---|---|---|
| Over-permissive IAM | High | Action: "*", Resource: "*", Effect: Allow with wildcards, full-access managed policies (AdministratorAccess, AmazonS3FullAccess, etc.) |
| Root account usage | High | Root account credentials used for application tasks, CI/CD, or operational access instead of IAM roles |
| Hardcoded credentials | High | API keys, passwords, tokens, connection strings embedded in source files, IaC templates, or config files |
| Public S3 buckets | High | Missing public access block (BlockPublicAccess, PublicAccessBlockConfiguration, aws_s3_bucket_public_access_block) |
| Encryption at rest | High | S3, DynamoDB, RDS, SQS, EBS, EFS without encryption configured |
| Encryption in transit | High | API Gateway without HTTPS, missing TLS on ALB/NLB listeners, no SSL enforcement on S3/SQS |
| No Secrets Manager / Parameter Store | High | Secrets stored in environment variables, config files, or code instead of Secrets Manager or SSM Parameter Store |
| No VPC for sensitive services | Medium | Lambda/ECS accessing databases or internal services without VPC configuration — lack of network isolation |
| No WAF | Medium | API Gateway or ALB exposed to internet without WAF association |
| Security groups | Medium | Ingress 0.0.0.0/0 on non-public-facing resources, overly permissive egress rules |
| KMS key management | Medium | Using AWS-managed keys instead of CMKs for sensitive data |
| Missing auth | Medium | API endpoints without authorization (Cognito, IAM, Lambda authorizer, API keys) |
Pillar 2: Reliability (REL)
| Check | Severity | What to look for |
|---|---|---|
| Missing DLQ | High | Lambda, SQS, or SNS without dead-letter queue configured |
| No retries | High | Lambda invocations, SQS consumers, or SDK calls without retry configuration for transient failures |
| Single point of failure | High | RDS, ElastiCache, or ECS in a single AZ without multi-AZ or cross-AZ redundancy |
| No idempotency | High | Event-driven handlers (SQS, SNS, EventBridge, Kinesis) without idempotency keys, conditional writes, or deduplication |
| No circuit breaker / fallback | Medium | Service-to-service calls without timeout, retry limits, circuit breaker, or fallback logic |
| Service limits / throttling ignored | Medium | No API Gateway throttling, no reserved concurrency on Lambda, no awareness of AWS service quotas |
| No health checks | Medium | ALB target groups without health check configuration |
| No backup | Medium | DynamoDB without PITR, RDS without automated backups, no snapshot policies |
| Error handling | Medium | Handlers with bare catch/except blocks that swallow errors silently |
Pillar 3: Performance Efficiency (PERF)
| Check | Severity | What to look for |
|---|---|---|
| Wrong compute choice | High | EC2 for short-lived tasks that should be Lambda/Fargate, or Lambda for long-running batch jobs that should be ECS/Step Functions |
| Blocking synchronous workflows | High | Synchronous request-response chains where async/event-driven patterns would reduce latency and improve throughput |
| Inefficient DB access | High | DynamoDB Scan operations instead of Query, missing GSIs for common access patterns, N+1 query patterns |
| No caching strategy | Medium | Repeated reads from DynamoDB/RDS without DAX, ElastiCache, CloudFront, or API Gateway caching |
| Large Lambda packages | Medium | Unminified bundles, bundled SDKs not used, no tree-shaking — causing slow cold starts |
| No connection reuse | Medium | Lambda creating new DB/HTTP connections per invocation instead of reusing connections outside the handler |
| Lambda memory | Medium | Default 128MB memory (often too low for Node.js/Python runtimes) |
| Lambda timeout | Medium | Default 3s timeout for non-trivial operations |
| Monolithic Lambda | Medium | Single function handling 10+ routes instead of per-route or per-domain functions |
| Poor batching / streaming | Medium | Processing SQS/Kinesis records one-at-a-time instead of batching, no batch window configuration |
| Missing CDN | Low | Static assets served from S3 or origin without CloudFront |
| ARM architecture | Low | Lambda not using arm64 / Graviton2 |
Pillar 4: Cost Optimization (COST)
| Check | Severity | What to look for |
|---|---|---|
| Over-provisioned compute | High | EC2/RDS instances sized far beyond actual utilization, Lambda with 3008MB+ for simple CRUD |
| Unused resources | High | Orphaned EBS volumes, unattached Elastic IPs, old snapshots, resources defined in IaC but not referenced |
| No auto-scaling | High | EC2, ECS, or DynamoDB (provisioned mode) without auto-scaling policies |
| No cost monitoring | Medium | No AWS Budgets, Cost Anomaly Detection, or billing alarms configured |
| Chatty service communication | Medium | High-frequency synchronous calls between services instead of batching, aggregation, or event-driven patterns |
| No reserved / spot instances | Medium | Steady-state EC2/RDS workloads on On-Demand pricing without Reserved Instances or Savings Plans; batch workloads not using Spot |
| NAT Gateway costs | Medium | Lambda in VPC using NAT Gateway for AWS API access instead of VPC endpoints |
| No lifecycle rules | Medium | S3 buckets without lifecycle policies to transition to IA/Glacier or expire objects |
| Log retention | Low | CloudWatch log groups with infinite retention (default) |
Pillar 5: Operational Excellence (OPS)
| Check | Severity | What to look for |
|---|---|---|
| No centralized logging | High | Missing CloudWatch Logs, or using console.log/print() instead of structured JSON logger (Powertools for Lambda, pino/winston for Node.js) |
| No monitoring / alerting | High | No CloudWatch alarms, no dashboards, no SNS notifications for failures or threshold breaches |
| No CI/CD pipeline | High | Manual deployments via console or CLI instead of CodePipeline, GitHub Actions, GitLab CI, etc. |
| No distributed tracing | Medium | Missing X-Ray tracing on Lambda/API Gateway/ECS — cannot trace requests across services |
| No runbooks / incident response | Medium | No documented runbooks, no automated remediation, no incident response playbooks |
| Unstructured / unsearchable logs | Medium | Log output as plain text without JSON structure, correlation IDs, or service context |
| No deployment strategies | Medium | Deploying directly to production without blue/green, canary, or rolling deployment strategies |
| Centralized config missing | Medium | Hardcoded values instead of SSM Parameter Store, AppConfig, or environment variables |
| IaC completeness | Medium | Resources created via AWS Console not captured in IaC |
| Tagging | Low | Resources missing Environment, Team, Service tags |
Pillar 6: Sustainability (SUS)
| Check | Severity | What to look for |
|---|---|---|
| Always-on without optimization | Medium | EC2/ECS/RDS running 24/7 without scheduled scaling, stop/start schedules, or usage-based right-sizing |
| Inefficient data transfer | Medium | Cross-region or cross-AZ data transfers that could be avoided with local caching, CDN, or regional design |
| No batch processing | Low | Processing items one-at-a-time instead of batching for efficiency |
Cross-Cutting Concerns (CROSS)
These issues span multiple pillars and indicate fundamental architectural problems:
| Check | Severity | What to look for |
|---|---|---|
| No event-driven architecture | High | Synchronous polling or request-response patterns where event-driven (SQS, SNS, EventBridge) would decouple and improve resilience |
| Tight coupling between services | High | Microservices calling each other directly via HTTP/SDK without queues, events, or contracts — failure cascades, deploy dependencies |
| Poor error handling / silent failures | High | Empty catch blocks, errors logged but not propagated, no alerting on failures, swallowed exceptions |
| No backpressure handling | Medium | Producers flooding consumers without throttling, queue-based buffering, or rate limiting |
| No environment isolation | Medium | Same AWS account or resources shared across dev/staging/prod without separation (separate accounts, stacks, or naming) |
Step 4 — Quick Wins
Identify the top 3–5 fixes that are:
- High impact (address High severity findings)
- Low effort (can be fixed in < 30 minutes)
- No architectural changes required
These go at the top of the report so the developer knows where to start.
Step 5 — Detailed Findings
For each issue found, output in this format:
### [PILLAR-NNN] Finding Title
<!-- Number sequentially per pillar: SEC-001, SEC-002, REL-001, etc. -->
- **Severity**: High | Medium | Low
- **Location**: `path/to/file:line`
- **Issue**: Specific description of what is wrong
- **Why it matters**: Business/technical impact if not addressed
- **Fix**: concrete code fix in the SAME framework/language as the existing code
Report Template
Output the review in this structure:
# AWS Well-Architected Review
## Architecture Summary
- **Services detected**: [list]
- **Architecture pattern**: [pattern name]
- **Framework**: [detected framework]
- **Language**: [detected language]
## Quick Wins
1. [high impact, low effort fix — with estimated time]
2. [high impact, low effort fix — with estimated time]
3. [high impact, low effort fix — with estimated time]
## Findings
### Security
[findings sorted by severity...]
### Reliability
[findings sorted by severity...]
### Performance Efficiency
[findings sorted by severity...]
### Cost Optimization
[findings sorted by severity...]
### Operational Excellence
[findings sorted by severity...]
### Sustainability
[findings sorted by severity...]
### Cross-Cutting Concerns
[findings sorted by severity...]
Order the pillar sections by number of High-severity findings (most critical pillar first). Omit pillar sections with zero findings.