aws-well-architected-framework
AWS Well-Architected Framework
Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.
When to Use
Use this skill when:
- Reviewing existing AWS architecture for best practices
- Designing new cloud systems or applications
- Troubleshooting operational issues, security vulnerabilities, or reliability problems
- Optimizing costs or improving performance
- Preparing for architecture reviews or audits
- Migrating workloads to AWS
- Addressing compliance or sustainability requirements
- Answering questions such as "is my architecture good?" or "how can I improve my AWS setup?"
Core Principle
Evaluating an architecture systematically across all six pillars helps produce balanced, well-designed systems that meet business objectives.
The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.
The Six Pillars
| Pillar | Focus | Key Question |
|---|---|---|
| Operational Excellence | Run and monitor systems | How do we operate effectively? |
| Security | Protect information and systems | How do we protect data and resources? |
| Reliability | Recover from failures | How do we ensure workload availability? |
| Performance Efficiency | Use resources effectively | How do we meet performance requirements? |
| Cost Optimization | Avoid unnecessary costs | How do we achieve cost-effective outcomes? |
| Sustainability | Minimize environmental impact | How do we reduce carbon footprint? |
Architecture Review Workflow
CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.
digraph review_flow {
  "Architecture review needed" [shape=doublecircle];
  "Identify workload scope" [shape=box];
  "Review each pillar systematically" [shape=box];
  "Document findings per pillar" [shape=box];
  "Prioritize improvements" [shape=box];
  "Create action plan" [shape=box];
  "All pillars reviewed?" [shape=diamond];
  "Complete" [shape=doublecircle];

  "Architecture review needed" -> "Identify workload scope";
  "Identify workload scope" -> "Review each pillar systematically";
  "Review each pillar systematically" -> "Document findings per pillar";
  "Document findings per pillar" -> "All pillars reviewed?";
  "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
  "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
  "Prioritize improvements" -> "Create action plan";
  "Create action plan" -> "Complete";
}
Red Flags - You're Skipping the Framework:
- "This pillar doesn't apply to this workload" - WRONG, every pillar applies
- Jumping straight to recommendations without documenting current state
- Only reviewing 3-4 pillars instead of all 6
- Providing generic advice instead of workload-specific assessment
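The red flags above can be checked mechanically before a review is called complete. The following is a minimal sketch, not part of the framework itself: the `PillarReview` shape and the `reviewGaps` helper are hypothetical names introduced here for illustration.

```typescript
// The six pillars, as named in the pillar table.
const PILLARS = [
  "Operational Excellence",
  "Security",
  "Reliability",
  "Performance Efficiency",
  "Cost Optimization",
  "Sustainability",
] as const;

type Pillar = (typeof PILLARS)[number];

// Hypothetical record of what a reviewer captured for one pillar.
interface PillarReview {
  pillar: Pillar;
  currentState: string[];
  recommendations: string[];
}

// Lists every reason the review is not yet complete:
// a pillar that was skipped, or one with no documented current state.
function reviewGaps(reviews: PillarReview[]): string[] {
  const problems: string[] = [];
  for (const pillar of PILLARS) {
    const review = reviews.find((r) => r.pillar === pillar);
    if (review === undefined) {
      problems.push(`${pillar}: not reviewed`);
    } else if (review.currentState.length === 0) {
      problems.push(`${pillar}: current state not documented`);
    }
  }
  return problems;
}
```

An empty result means all six pillars were reviewed and each has a documented current state; anything else names the pillar that still needs attention.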
Pillar 1: Operational Excellence
Goal: Support development and run workloads effectively, gain insight into operations, and continuously improve processes.
Design Principles
- Perform operations as code (IaC)
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from operational events and failures
Key Areas
Organization:
- How do teams share architecture knowledge?
- Are there clear ownership and accountability models?
Prepare:
- How do you design workloads for observability?
- Infrastructure as code implementation?
- Deployment practices (CI/CD)?
Operate:
- What's the runbook for common operations?
- How do you understand workload health?
- How do you respond to events?
Evolve:
- How do you learn from operational events?
- Process for continuous improvement?
Common Issues & Solutions
| Issue | Solution |
|---|---|
| Manual deployments | Implement CI/CD with CloudFormation/CDK/Terraform |
| No visibility into system health | Add CloudWatch dashboards, metrics, alarms |
| Operational procedures outdated | Regular runbook reviews, post-incident learning |
| Slow incident response | Create automated remediation with Lambda/Systems Manager |
Quick Implementation Checklist
- Infrastructure defined as code (CloudFormation/CDK/Terraform)
- CI/CD pipeline implemented
- CloudWatch dashboards for key metrics
- Alarms for critical thresholds
- Runbooks documented and accessible
- Regular game days to test procedures
- Post-incident review process
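For the "alarms for critical thresholds" item, it helps to be explicit about how many breaching datapoints should trigger an alarm, so that a single spike does not page anyone. The sketch below models an "M out of N datapoints" rule in plain TypeScript. It is a simplified illustration of the idea, not CloudWatch's actual evaluation logic (it ignores missing-data handling, for example), and `AlarmRule` and `isInAlarm` are hypothetical names.

```typescript
// Hypothetical alarm rule: alarm when at least `datapointsToAlarm`
// of the last `evaluationPeriods` datapoints exceed `threshold`.
interface AlarmRule {
  threshold: number;
  evaluationPeriods: number; // N: how many recent datapoints to inspect
  datapointsToAlarm: number; // M: how many of those must breach
}

function isInAlarm(datapoints: number[], rule: AlarmRule): boolean {
  if (rule.evaluationPeriods <= 0 || rule.datapointsToAlarm <= 0) {
    throw new RangeError("evaluationPeriods and datapointsToAlarm must be positive");
  }
  // Only the most recent N datapoints count.
  const recent = datapoints.slice(-rule.evaluationPeriods);
  const breaching = recent.filter((value) => value > rule.threshold).length;
  return breaching >= rule.datapointsToAlarm;
}

// Example: CPU percentage sampled once per period, alarm on 2 of the last 3 above 80.
const cpuRule: AlarmRule = { threshold: 80, evaluationPeriods: 3, datapointsToAlarm: 2 };
const sustainedLoad = isInAlarm([50, 90, 60, 85, 95], cpuRule);
const singleSpike = isInAlarm([90, 95, 60, 70, 85], cpuRule);
```

Here `sustainedLoad` is true because two of the last three datapoints breach, while `singleSpike` is false because only the final datapoint does.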
Pillar 2: Security
Goal: Protect data, systems, and assets through cloud security practices.
Design Principles
- Implement a strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events
Key Areas
Security Foundations:
- How do you manage credentials and authentication?
- IAM roles and policies following least privilege?
Identity and Access Management:
- How do you manage identities for people and machines?
- MFA enabled for all human access?
Detection:
- How do you detect and investigate security events?
- CloudTrail, GuardDuty, Security Hub configured?
Infrastructure Protection:
- How do you protect networks and compute?
- VPC configuration, security groups, NACLs?
Data Protection:
- How do you classify and protect data?
- Encryption at rest and in transit?
Incident Response:
- How do you respond to security incidents?
- Incident response plan tested?
Critical Security Patterns
Never Do:
// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});
Always Do:
// ✅ CORRECT: Use IAM roles
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials are resolved from the attached IAM role
// CDK: Lambda function with an explicit IAM role
// (assumes: import * as lambda from 'aws-cdk-lib/aws-lambda')
const fn = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege
  role: myRole,
  // ...
});
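"Least privilege" is easier to review when the policy is written out. The following sketch builds an IAM policy document that allows reading objects under a single S3 prefix, plus a small check that flags wildcard grants. The bucket name, prefix, and the helper names `readOnlyPrefixPolicy` and `hasWildcardGrant` are illustrative assumptions; the ARN format assumes the standard `aws` partition.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Grants read access to objects under one prefix of one bucket, and nothing else.
function readOnlyPrefixPolicy(bucketName: string, prefix: string): PolicyDocument {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:GetObject"],
        Resource: [`${bucketArn}/${prefix}*`],
      },
    ],
  };
}

// Flags Allow statements that grant every action ("*" or "service:*")
// or that apply to every resource ("*").
function hasWildcardGrant(doc: PolicyDocument): boolean {
  return doc.Statement.some(
    (s) =>
      s.Effect === "Allow" &&
      (s.Action.some((a) => a === "*" || a.endsWith(":*")) ||
        s.Resource.some((r) => r === "*")),
  );
}

// "reports-bucket" and "daily/" are placeholder values.
const reportReaderPolicy = readOnlyPrefixPolicy("reports-bucket", "daily/");
```

A check like `hasWildcardGrant` is deliberately coarse: it catches the most common over-broad grants during review, but it is not a substitute for a full policy analysis.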
Security Checklist
- No hardcoded credentials anywhere (check git history!)
- IAM roles follow least privilege principle
- MFA enabled for root and privileged accounts
- CloudTrail enabled in all regions
- VPC with proper public/private subnet architecture
- Security groups with minimal inbound rules
- Encryption at rest for all data stores
- HTTPS/TLS for all data in transit
- Secrets Manager or Parameter Store (SecureString parameters) for secrets
- Regular security patching process
- AWS Config for compliance monitoring
- GuardDuty for threat detection
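The first checklist item ("no hardcoded credentials anywhere") can be partly automated. The sketch below scans text for strings shaped like AWS access key IDs. It assumes the common 20-character form with an `AKIA` (long-term) or `ASIA` (temporary) prefix, as in the documentation example key shown earlier; it does not detect secret access keys or other secret types, and `findAccessKeyIds` is a hypothetical helper name.

```typescript
// Assumption: access key IDs are "AKIA" or "ASIA" followed by 16 uppercase
// letters or digits (20 characters total).
const ACCESS_KEY_ID_PATTERN = /\b(?:AKIA|ASIA)[A-Z0-9]{16}\b/g;

interface SecretHit {
  line: number; // 1-based line number within the scanned text
  match: string;
}

// Returns every access-key-shaped string together with the line it appears on.
function findAccessKeyIds(source: string): SecretHit[] {
  const hits: SecretHit[] = [];
  source.split("\n").forEach((text, index) => {
    const matches = text.match(ACCESS_KEY_ID_PATTERN) || [];
    for (let i = 0; i < matches.length; i++) {
      hits.push({ line: index + 1, match: matches[i] });
    }
  });
  return hits;
}

const sampleSource = [
  "const region = 'us-east-1';",
  "const accessKeyId = 'AKIAIOSFODNN7EXAMPLE';",
].join("\n");
const sampleHits = findAccessKeyIds(sampleSource);
```

To cover git history as the checklist suggests, the same function could be run over the output of `git log -p`; dedicated secret scanners do this more thoroughly.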
Pillar 3: Reliability
Goal: Ensure the workload performs its intended function correctly and consistently.
Design Principles
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally
- Stop guessing capacity
- Manage change through automation
Key Areas
Foundations:
- How do you manage service quotas and constraints?
- Network topology designed for HA?
Workload Architecture:
- How do you design workload service architecture?
- Microservices vs monolith considerations?
Change Management:
- How do you monitor workload resources?
- How are changes deployed safely?
Failure Management:
- How do you back up data?
- How do you design for resilience?
- DR plan and RTO/RPO defined?
High Availability Patterns
Multi-AZ Deployment:
Region
├── AZ-1: Application + Database (primary)
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)
Multi-Region Deployment:
Primary Region              Secondary Region
├── Active workload         ├── Standby/Active
├── Database (primary)      ├── Database (replica)
└── Route 53 health check monitoring
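The benefit of the redundancy patterns above can be estimated with basic availability arithmetic. This sketch assumes replica failures are independent, which real deployments only approximate, and the 99% figure is purely illustrative. The function names are hypothetical.

```typescript
// Availability when ANY one of `replicas` independent copies is enough to serve traffic:
// the system is down only if every copy is down at the same time.
function redundantAvailability(single: number, replicas: number): number {
  return 1 - Math.pow(1 - single, replicas);
}

// Availability of a chain where EVERY dependency must be up (for example API -> database).
function serialAvailability(parts: number[]): number {
  return parts.reduce((acc, availability) => acc * availability, 1);
}

// Illustrative numbers: one 99%-available component, deployed in one AZ versus two.
const singleAz = redundantAvailability(0.99, 1); // about 0.99
const twoAz = redundantAvailability(0.99, 2);    // about 0.9999
// Two 99% components in series are less available than either one alone.
const chained = serialAvailability([0.99, 0.99]); // about 0.9801
```

The same arithmetic makes the trade-off visible: each extra replica adds cost, while each extra hard dependency in the request path lowers availability.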
Backup Strategy
| Data Type | Solution | RPO | RTO |
|---|---|---|---|
| RDS | Automated backups + snapshots | < 5 min | < 30 min |
| DynamoDB | Point-in-time recovery | Seconds | Minutes |
| S3 | Versioning + cross-region replication | Minutes (replication is asynchronous) | Immediate |
| EBS | Snapshots via AWS Backup | Hours | Hours |
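The RPO and RTO figures in a table like this are only meaningful when compared against the workload's own targets. A minimal sketch of that comparison follows; `BackupPlan`, `RecoveryTargets`, and `checkRecovery` are hypothetical names, and the numbers are placeholders rather than service guarantees.

```typescript
interface RecoveryTargets {
  rpoMinutes: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable time to restore service
}

interface BackupPlan {
  backupIntervalMinutes: number;  // time between recovery points
  measuredRestoreMinutes: number; // restore time observed in the last recovery test
}

// Compares a backup plan against the targets and explains each shortfall.
function checkRecovery(plan: BackupPlan, targets: RecoveryTargets): string[] {
  const issues: string[] = [];
  // Worst case, a failure happens just before the next backup,
  // so everything written since the previous backup is lost.
  if (plan.backupIntervalMinutes > targets.rpoMinutes) {
    issues.push(
      `RPO at risk: up to ${plan.backupIntervalMinutes} min of data loss, target is ${targets.rpoMinutes} min`,
    );
  }
  if (plan.measuredRestoreMinutes > targets.rtoMinutes) {
    issues.push(
      `RTO at risk: restore took ${plan.measuredRestoreMinutes} min, target is ${targets.rtoMinutes} min`,
    );
  }
  return issues;
}

// Placeholder example: daily snapshots against a one-hour RPO and 30-minute RTO.
const dailySnapshotIssues = checkRecovery(
  { backupIntervalMinutes: 1440, measuredRestoreMinutes: 45 },
  { rpoMinutes: 60, rtoMinutes: 30 },
);
```

Using a measured restore time rather than an estimate ties this check to the "test recovery procedures" design principle.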
Reliability Checklist
- Multi-AZ deployment for critical components
- Health checks configured (ELB, Route 53)
- Auto Scaling groups with proper sizing
- RDS automated backups enabled
- DynamoDB point-in-time recovery enabled
- S3 versioning for critical buckets
- Disaster recovery plan documented and tested
- Chaos engineering tests (failure injection)
- Graceful degradation strategies
- Circuit breaker patterns implemented
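The last two checklist items, graceful degradation and circuit breakers, are often implemented together. The sketch below is a deliberately small, synchronous circuit breaker with a fallback, written for illustration only: production code would usually wrap asynchronous calls and use an established library. The class name, thresholds, and the injectable clock are assumptions made for this example.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private failureThreshold: number;
  private resetAfterMs: number;
  private now: () => number;

  // `now` is injectable so the behaviour can be tested without waiting.
  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // An open breaker moves to half-open once the reset interval has passed.
  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Runs `operation` unless the breaker is open; degrades to `fallback` on failure.
  call<T>(operation: () => T, fallback: () => T): T {
    const stateBefore = this.getState();
    if (stateBefore === "open") {
      return fallback(); // fail fast without touching the dependency
    }
    try {
      const result = operation();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call in half-open, or too many failures, opens the breaker.
      if (stateBefore === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

A caller might wrap a downstream lookup as `breaker.call(() => fetchPrice(id), () => cachedPrice(id))`, where both functions are hypothetical; the fallback is what provides graceful degradation while the breaker is open.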
Pillar 4: Performance Efficiency
Goal: Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.
Design Principles
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy
Key Areas
Selection:
- How do you select appropriate resource types and sizes?
- Compute: EC2, Lambda, Fargate, ECS, EKS?
- Database: RDS, DynamoDB, Aurora, ElastiCache?
- Storage: S3, EFS, EBS, Glacier?
Review:
- How do you evolve your workload to use new resources?
- Regular review of new AWS features?
Monitoring:
- How do you monitor resources?
- CloudWatch, X-Ray for distributed tracing?
Trade-offs:
- How do you use trade-offs to improve performance?
- Caching, consistency models, compression?
Performance Patterns
Caching Strategy:
Client → CloudFront (edge cache)
       → API Gateway
       → Lambda
       → ElastiCache (data cache)
       → DynamoDB/RDS
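The data-cache layer in this flow typically follows a cache-aside pattern: read from the cache, fall through to the database on a miss, then store the result with a time-to-live. The sketch below shows that logic with an in-memory `Map` standing in for ElastiCache, and a synchronous loader standing in for the database call. `TtlCache` and its methods are hypothetical names, and a real client would be asynchronous.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

class TtlCache<V> {
  private entries = new Map<string, CacheEntry<V>>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Cache-aside read: serve a fresh entry if present,
  // otherwise load from the source of truth and remember the result.
  getOrLoad(key: string, load: (key: string) => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = load(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call after writing to the source of truth so readers do not see stale data.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

The TTL is the trade-off knob mentioned under "Trade-offs": a longer TTL reduces database load but widens the window in which readers can see stale data.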
Database Selection:
| Use Case | Recommended Service |
|---|---|
| Relational, complex queries | RDS (PostgreSQL/MySQL) |
| High throughput, simple queries | DynamoDB |
| Graph relationships | Neptune |
| Search and analytics | OpenSearch |
| Time-series data | Timestream |
| In-memory cache | ElastiCache (Redis/Memcached) |
Performance Checklist
- Right-sized compute instances (not over-provisioned)
- Content delivery through CloudFront
- Database read replicas for read-heavy workloads
- Caching layer (ElastiCache, DAX, CloudFront)
- Asynchronous processing with SQS/SNS/EventBridge
- Auto Scaling configured appropriately
- Database indexes optimized
- Monitoring with CloudWatch and X-Ray
- Regular performance testing under load
Pillar 5: Cost Optimization
Goal: Run systems to deliver business value at the lowest price point.
Design Principles
- Implement cloud financial management
- Adopt a consumption model
- Measure overall efficiency
- Stop spending on undifferentiated heavy lifting
- Analyze and attribute expenditure
Key Areas
Practice Cloud Financial Management:
- Cost allocation tags implemented?
- Budgets and alerts configured?
Expenditure and Usage Awareness:
- How do you govern usage?
- Cost Explorer and AWS Budgets configured?
Cost-Effective Resources:
- How do you evaluate cost when selecting services?
- Reserved Instances or Savings Plans for predictable workloads?
Manage Demand:
- How do you manage demand and supply resources?
- Throttling, caching to reduce demand?
Optimize Over Time:
- How do you evaluate new services?
- Regular review of cost optimization opportunities?
Cost Optimization Strategies
| Strategy | Implementation | Potential Savings |
|---|---|---|
| Right-sizing | Use Compute Optimizer recommendations | 20-40% |
| Reserved Instances | 1-year or 3-year commitments | Up to 72% vs On-Demand |
| Savings Plans | Flexible compute commitments | Up to 72% vs On-Demand |
| Spot Instances | Fault-tolerant workloads | Up to 90% vs On-Demand |
| S3 Intelligent-Tiering | Automatic storage class optimization | 40-60% |
| Auto Scaling | Scale resources with demand | 30-50% |
| Lambda instead of EC2 | For appropriate workloads | Varies |
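Whether a commitment such as a Reserved Instance or Savings Plan actually saves money depends on how many hours the workload runs. A simplified model: a commitment with discount `d` is paid for every hour, so it beats On-Demand only when utilization exceeds `1 - d`. The sketch below encodes that arithmetic; the hourly rate and discount are made-up inputs, and the model ignores upfront payment options and the way Savings Plans apply across resources.

```typescript
// Fraction of hours a workload must run for a commitment to beat On-Demand.
function breakEvenUtilization(discount: number): number {
  if (discount <= 0 || discount >= 1) {
    throw new RangeError("discount must be between 0 and 1 (exclusive)");
  }
  return 1 - discount;
}

interface MonthlyCost {
  onDemand: number;  // pay only for hours actually used
  committed: number; // pay the discounted rate for every hour in the month
}

function monthlyCost(
  hourlyOnDemand: number,
  hoursUsed: number,
  hoursInMonth: number,
  discount: number,
): MonthlyCost {
  return {
    onDemand: hourlyOnDemand * hoursUsed,
    committed: hourlyOnDemand * (1 - discount) * hoursInMonth,
  };
}

// Illustrative inputs: $0.10/hour, 40% discount, 730-hour month.
const alwaysOn = monthlyCost(0.1, 730, 730, 0.4); // committed is cheaper
const halfTime = monthlyCost(0.1, 365, 730, 0.4); // On-Demand is cheaper
```

With a 40% discount the break-even point is 60% utilization, which is why the checklist ties commitments to stable workloads.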
Cost Monitoring
// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});
Cost Optimization Checklist
- Cost allocation tags applied consistently
- AWS Budgets configured with alerts
- Cost Explorer reviewed monthly
- Reserved Instances or Savings Plans for stable workloads
- Spot Instances for fault-tolerant workloads
- Unused resources identified and terminated
- S3 lifecycle policies for data management
- Right-sized instances (not over-provisioned)
- Lambda memory optimization
- DynamoDB on-demand vs provisioned analysis
- Data transfer costs analyzed and optimized
Pillar 6: Sustainability
Goal: Minimize the environmental impact of running cloud workloads.
Design Principles
- Understand your impact
- Establish sustainability goals
- Maximize utilization
- Anticipate and adopt new, more efficient offerings
- Use managed services
- Reduce downstream impact
Key Areas
Region Selection:
- Choose regions with renewable energy
- AWS regions with lower carbon intensity
User Behavior Patterns:
- Scale resources with demand
- Remove unused resources
Software and Architecture:
- Optimize code for efficiency
- Use appropriate services (serverless over provisioned)
Data Patterns:
- Minimize data movement
- Use data compression
- Implement lifecycle policies
Hardware Patterns:
- Use minimum necessary hardware
- Use instance types with best performance per watt
Development Process:
- Test sustainability improvements
- Measure and report carbon footprint
Sustainability Checklist
- Workloads in regions with renewable energy
- Auto Scaling to match demand (no idle resources)
- Unused resources regularly cleaned up
- Graviton processors considered for better efficiency
- Managed services used where appropriate
- Data lifecycle policies to reduce storage
- Efficient code (async processing, optimized queries)
- Monitoring resource utilization
- Carbon footprint tracked (AWS Customer Carbon Footprint Tool)
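Several of these items (no idle resources, regular cleanup, monitoring utilization) come down to spotting resources that never do meaningful work. The sketch below flags resources whose peak CPU stays under a threshold across all samples. The sample shape, the 5% threshold, and the resource IDs are illustrative assumptions; in practice the samples would come from a metrics source such as CloudWatch, and CPU alone is only a rough proxy for idleness.

```typescript
interface CpuSample {
  resourceId: string;
  cpuPercent: number;
}

// A resource counts as idle only if NONE of its samples reach the threshold.
function findIdleResources(samples: CpuSample[], idleBelowPercent: number): string[] {
  const peakByResource = new Map<string, number>();
  for (const sample of samples) {
    const peak = peakByResource.get(sample.resourceId);
    if (peak === undefined || sample.cpuPercent > peak) {
      peakByResource.set(sample.resourceId, sample.cpuPercent);
    }
  }
  const idle: string[] = [];
  peakByResource.forEach((peak, resourceId) => {
    if (peak < idleBelowPercent) {
      idle.push(resourceId);
    }
  });
  return idle.sort();
}

// Placeholder data: a dev instance that never exceeds 3% CPU, and a busy prod instance.
const idleCandidates = findIdleResources(
  [
    { resourceId: "dev-api", cpuPercent: 2 },
    { resourceId: "dev-api", cpuPercent: 3 },
    { resourceId: "prod-api", cpuPercent: 40 },
    { resourceId: "prod-api", cpuPercent: 3 },
  ],
  5,
);
```

Using the peak rather than the average avoids flagging a resource that is quiet most of the day but busy during a short daily window.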
Review Process
1. Scoping Phase
Questions to ask:
- What is the workload scope? (entire system vs specific component)
- What are the business objectives?
- What are the compliance requirements?
- What are the current pain points?
2. Review Each Pillar
For each pillar, use this template:
Current State:
- Document what exists today
Gaps:
- What's missing or needs improvement?
Risks:
- What are the high/medium/low priority risks?
Recommendations:
- Specific, actionable improvements
3. Prioritization Matrix
| Priority | Criteria |
|---|---|
| High | Security vulnerabilities, critical availability risks, major cost waste |
| Medium | Performance issues, moderate cost optimization, operational improvements |
| Low | Nice-to-haves, future considerations, minor optimizations |
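Once findings are tagged with a priority from the matrix, ordering them for the action plan is mechanical. The sketch below sorts by priority and breaks ties by effort, lowest first. The tie-break rule is an assumption added here, not part of the matrix, and the `ReviewFinding` shape and effort values are illustrative.

```typescript
type Priority = "High" | "Medium" | "Low";
type Effort = "Low" | "Medium" | "High";

interface ReviewFinding {
  title: string;
  pillar: string;
  priority: Priority;
  effort: Effort;
}

const PRIORITY_RANK: Record<Priority, number> = { High: 0, Medium: 1, Low: 2 };
const EFFORT_RANK: Record<Effort, number> = { Low: 0, Medium: 1, High: 2 };

// Highest priority first; within a priority, cheapest fixes first (assumed tie-break).
function orderFindings(findings: ReviewFinding[]): ReviewFinding[] {
  return findings.slice().sort(
    (a, b) =>
      PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] ||
      EFFORT_RANK[a.effort] - EFFORT_RANK[b.effort],
  );
}

// Example findings; the effort estimates are placeholders.
const orderedTitles = orderFindings([
  { title: "Add API Gateway caching", pillar: "Performance Efficiency", priority: "Low", effort: "Low" },
  { title: "Implement DynamoDB backups", pillar: "Reliability", priority: "High", effort: "Medium" },
  { title: "Add X-Ray tracing", pillar: "Operational Excellence", priority: "Medium", effort: "Low" },
  { title: "Move API keys to Secrets Manager", pillar: "Security", priority: "High", effort: "Low" },
]).map((f) => f.title);
```

The function returns a new array and leaves the input untouched, so the original per-pillar grouping of findings stays available for the report.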
4. Action Plan Template
## Pillar: [Name]
### Issue: [Description]
- **Risk Level:** High/Medium/Low
- **Impact:** [Business impact]
- **Effort:** Low/Medium/High
### Recommendation:
[Specific actions]
### Implementation Steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]
### Success Criteria:
- [Measurable outcome 1]
- [Measurable outcome 2]
### Resources:
- [AWS documentation links]
- [Blog posts or examples]
Common Anti-Patterns
| Anti-Pattern | Issue | Better Approach |
|---|---|---|
| Single AZ deployment | No fault tolerance | Multi-AZ architecture |
| No IaC | Manual config, drift | CloudFormation/CDK/Terraform |
| Hardcoded secrets | Security vulnerability | Secrets Manager/Parameter Store |
| No monitoring | Blind operation | CloudWatch dashboards + alarms |
| No backups | Data loss risk | Automated backup strategy |
| Over-provisioning | Cost waste | Right-sizing + Auto Scaling |
| No cost tracking | Budget overruns | Tags + Budgets + Cost Explorer |
| Tightly coupled monolith | Components cannot scale or deploy independently | Modular design; microservices or serverless where they fit |
Real-World Example
Scenario: Serverless API with authentication
Architecture Review:
Operational Excellence:
- ✅ Lambda functions deployed via CDK
- ✅ CloudWatch logs enabled
- ❌ Missing: Distributed tracing (X-Ray), dashboards
Security:
- ❌ CRITICAL: Hardcoded API keys in Lambda environment variables
- ✅ API Gateway with IAM authorization
- ❌ Missing: Secrets Manager, encryption at rest
Reliability:
- ✅ DynamoDB table (replicated across multiple AZs by default)
- ❌ Single region deployment
- ❌ Missing: Backup strategy, DR plan
Performance:
- ✅ CloudFront for static assets
- ❌ No caching for API responses
- ❌ Lambda cold starts not optimized
Cost:
- ❌ DynamoDB provisioned capacity, but traffic is spiky
- ✅ Lambda usage-based pricing
- ❌ Missing: Budget alerts, cost allocation tags
Sustainability:
- ✅ Serverless architecture (good utilization)
- ❌ Unused dev/test resources running 24/7
Priority Actions:
- HIGH: Move API keys to Secrets Manager (Security)
- HIGH: Implement DynamoDB backups (Reliability)
- MEDIUM: Add X-Ray tracing (Operational Excellence)
- MEDIUM: Switch DynamoDB to on-demand (Cost)
- LOW: Add API Gateway caching (Performance)
Resources
- AWS Well-Architected Framework Whitepaper
- AWS Well-Architected Tool (Interactive reviews)
- Well-Architected Labs
- AWS Architecture Center
- Sustainability Pillar Whitepaper
Common Mistakes When Using This Framework
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| "Sustainability doesn't apply to this workload" | Every workload consumes resources and energy | Review all 6 pillars, even if findings are minimal |
| Skipping current state documentation | Can't measure improvement without baseline | Always document "Current State" before recommendations |
| Generic recommendations | Not actionable or specific to this workload | Provide specific AWS services, code examples, priorities |
| No prioritization | Everything seems equally important | Use HIGH/MEDIUM/LOW risk levels, create phased plan |
| Forgetting about trade-offs | Optimizing one pillar at expense of others | Explicitly call out trade-offs (e.g., multi-region cost vs reliability) |
Using This Skill
When conducting architecture reviews:
- Start with context - understand business objectives and constraints
- Review systematically - go through all 6 pillars, don't skip ANY
- Document findings - use consistent format per pillar (Current State → Gaps → Recommendations)
- Prioritize ruthlessly - security and availability issues first
- Be specific - actionable recommendations with examples and AWS service names
- Provide resources - link to AWS docs and examples
- Create action plan - clear next steps with success criteria and effort estimates
- Call out trade-offs - be explicit about costs and benefits of each recommendation
Remember: Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.
No exceptions to reviewing all 6 pillars - even if a pillar seems "not applicable", document why and what the current state is.