skills/rameshvr/skills/aws-well-architected-framework

aws-well-architected-framework

SKILL.md

AWS Well-Architected Framework

Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.

When to Use

Use this skill when:

  • Reviewing existing AWS architecture for best practices
  • Designing new cloud systems or applications
  • Troubleshooting operational issues, security vulnerabilities, or reliability problems
  • Optimizing costs or improving performance
  • Preparing for architecture reviews or audits
  • Migrating workloads to AWS
  • Addressing compliance or sustainability requirements
  • User asks "is my architecture good?" or "how can I improve my AWS setup?"

Core Principle

Systematic architecture evaluation across 6 pillars ensures balanced, well-designed systems that meet business objectives.

The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.

The Six Pillars

Pillar Focus Key Question
Operational Excellence Run and monitor systems How do we operate effectively?
Security Protect information and systems How do we protect data and resources?
Reliability Recover from failures How do we ensure workload availability?
Performance Efficiency Use resources effectively How do we meet performance requirements?
Cost Optimization Avoid unnecessary costs How do we achieve cost-effective outcomes?
Sustainability Minimize environmental impact How do we reduce carbon footprint?

Architecture Review Workflow

CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.

digraph review_flow {
    "Architecture review needed" [shape=doublecircle];
    "Identify workload scope" [shape=box];
    "Review each pillar systematically" [shape=box];
    "Document findings per pillar" [shape=box];
    "Prioritize improvements" [shape=box];
    "Create action plan" [shape=box];
    "All pillars reviewed?" [shape=diamond];
    "Complete" [shape=doublecircle];

    "Architecture review needed" -> "Identify workload scope";
    "Identify workload scope" -> "Review each pillar systematically";
    "Review each pillar systematically" -> "Document findings per pillar";
    "Document findings per pillar" -> "All pillars reviewed?";
    "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
    "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
    "Prioritize improvements" -> "Create action plan";
    "Create action plan" -> "Complete";
}

Red Flags - You're Skipping the Framework:

  • "This pillar doesn't apply to this workload" - WRONG, every pillar applies
  • Jumping straight to recommendations without documenting current state
  • Only reviewing 3-4 pillars instead of all 6
  • Providing generic advice instead of workload-specific assessment

Pillar 1: Operational Excellence

Goal: Support development and run workloads effectively, gain insight into operations, and continuously improve processes.

Design Principles

  • Perform operations as code (IaC)
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from operational events and failures

Key Areas

Organization:

  • How do teams share architecture knowledge?
  • Are there clear ownership and accountability models?

Prepare:

  • How do you design workloads for observability?
  • Infrastructure as code implementation?
  • Deployment practices (CI/CD)?

Operate:

  • What's the runbook for common operations?
  • How do you understand workload health?
  • How do you respond to events?

Evolve:

  • How do you learn from operational events?
  • Process for continuous improvement?

Common Issues & Solutions

Issue Solution
Manual deployments Implement CI/CD with CloudFormation/CDK/Terraform
No visibility into system health Add CloudWatch dashboards, metrics, alarms
Operational procedures outdated Regular runbook reviews, post-incident learning
Slow incident response Create automated remediation with Lambda/Systems Manager

Quick Implementation Checklist

  • Infrastructure defined as code (CloudFormation/CDK/Terraform)
  • CI/CD pipeline implemented
  • CloudWatch dashboards for key metrics
  • Alarms for critical thresholds
  • Runbooks documented and accessible
  • Regular game days to test procedures
  • Post-incident review process

Pillar 2: Security

Goal: Protect data, systems, and assets through cloud security practices.

Design Principles

  • Implement strong identity foundation
  • Enable traceability
  • Apply security at all layers
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data
  • Prepare for security events

Key Areas

Security Foundations:

  • How do you manage credentials and authentication?
  • IAM roles and policies following least privilege?

Identity and Access Management:

  • How do you manage identities for people and machines?
  • MFA enabled for all human access?

Detection:

  • How do you detect and investigate security events?
  • CloudTrail, GuardDuty, Security Hub configured?

Infrastructure Protection:

  • How do you protect networks and compute?
  • VPC configuration, security groups, NACLs?

Data Protection:

  • How do you classify and protect data?
  • Encryption at rest and in transit?

Incident Response:

  • How do you respond to security incidents?
  • Incident response plan tested?

Critical Security Patterns

Never Do:

// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});

Always Do:

// ✅ CORRECT: Use IAM roles
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials from IAM role

// Lambda function with IAM role
const lambda = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege
  role: myRole,
  // ...
});

Security Checklist

  • No hardcoded credentials anywhere (check git history!)
  • IAM roles follow least privilege principle
  • MFA enabled for root and privileged accounts
  • CloudTrail enabled in all regions
  • VPC with proper public/private subnet architecture
  • Security groups with minimal inbound rules
  • Encryption at rest for all data stores
  • HTTPS/TLS for all data in transit
  • Secrets Manager or Parameter Store for secrets
  • Regular security patching process
  • AWS Config for compliance monitoring
  • GuardDuty for threat detection

Pillar 3: Reliability

Goal: Ensure workload performs its intended function correctly and consistently.

Design Principles

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally
  • Stop guessing capacity
  • Manage change through automation

Key Areas

Foundations:

  • How do you manage service quotas and constraints?
  • Network topology designed for HA?

Workload Architecture:

  • How do you design workload service architecture?
  • Microservices vs monolith considerations?

Change Management:

  • How do you monitor workload resources?
  • How are changes deployed safely?

Failure Management:

  • How do you back up data?
  • How do you design for resilience?
  • DR plan and RTO/RPO defined?

High Availability Patterns

Multi-AZ Deployment:

Region
├── AZ-1: Application + Database
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)

Multi-Region Deployment:

Primary Region          Secondary Region
├── Active workload    ├── Standby/Active
├── Database (primary) ├── Database (replica)
└── Route 53 health check monitoring

Backup Strategy

Data Type Solution RPO RTO
RDS Automated backups + snapshots < 5 min < 30 min
DynamoDB Point-in-time recovery Seconds Minutes
S3 Versioning + cross-region replication Real-time Immediate
EBS Snapshots via AWS Backup Hours Hours

Reliability Checklist

  • Multi-AZ deployment for critical components
  • Health checks configured (ELB, Route 53)
  • Auto Scaling groups with proper sizing
  • RDS automated backups enabled
  • DynamoDB point-in-time recovery enabled
  • S3 versioning for critical buckets
  • Disaster recovery plan documented and tested
  • Chaos engineering tests (failure injection)
  • Graceful degradation strategies
  • Circuit breaker patterns implemented

Pillar 4: Performance Efficiency

Goal: Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.

Design Principles

  • Democratize advanced technologies
  • Go global in minutes
  • Use serverless architectures
  • Experiment more often
  • Consider mechanical sympathy

Key Areas

Selection:

  • How do you select appropriate resource types and sizes?
  • Compute: EC2, Lambda, Fargate, ECS, EKS?
  • Database: RDS, DynamoDB, Aurora, ElastiCache?
  • Storage: S3, EFS, EBS, Glacier?

Review:

  • How do you evolve workload to use new resources?
  • Regular review of AWS new features?

Monitoring:

  • How do you monitor resources?
  • CloudWatch, X-Ray for distributed tracing?

Trade-offs:

  • How do you use trade-offs to improve performance?
  • Caching, consistency models, compression?

Performance Patterns

Caching Strategy:

Client → CloudFront (edge cache)
  → API Gateway
    → Lambda
      → ElastiCache (data cache)
        → DynamoDB/RDS

Database Selection:

Use Case Recommended Service
Relational, complex queries RDS (PostgreSQL/MySQL)
High throughput, simple queries DynamoDB
Graph relationships Neptune
Search and analytics OpenSearch
Time-series data Timestream
In-memory cache ElastiCache (Redis/Memcached)

Performance Checklist

  • Right-sized compute instances (not over-provisioned)
  • Content delivery through CloudFront
  • Database read replicas for read-heavy workloads
  • Caching layer (ElastiCache, DAX, CloudFront)
  • Asynchronous processing with SQS/SNS/EventBridge
  • Auto Scaling configured appropriately
  • Database indexes optimized
  • Monitoring with CloudWatch and X-Ray
  • Regular performance testing under load

Pillar 5: Cost Optimization

Goal: Run systems to deliver business value at lowest price point.

Design Principles

  • Implement cloud financial management
  • Adopt consumption model
  • Measure overall efficiency
  • Stop spending on undifferentiated heavy lifting
  • Analyze and attribute expenditure

Key Areas

Practice Cloud Financial Management:

  • Cost allocation tags implemented?
  • Budgets and alerts configured?

Expenditure and Usage Awareness:

  • How do you govern usage?
  • Cost Explorer and AWS Budgets configured?

Cost-Effective Resources:

  • How do you evaluate cost when selecting services?
  • Reserved Instances or Savings Plans for predictable workloads?

Manage Demand:

  • How do you manage demand and supply resources?
  • Throttling, caching to reduce demand?

Optimize Over Time:

  • How do you evaluate new services?
  • Regular review of cost optimization opportunities?

Cost Optimization Strategies

Strategy Implementation Potential Savings
Right-sizing Use Compute Optimizer recommendations 20-40%
Reserved Instances 1-year or 3-year commitments 30-75%
Savings Plans Flexible compute commitments 30-70%
Spot Instances Fault-tolerant workloads 50-90%
S3 Intelligent-Tiering Automatic storage class optimization 40-60%
Auto Scaling Scale resources with demand 30-50%
Lambda instead of EC2 For appropriate workloads Varies

Cost Monitoring

// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});

Cost Optimization Checklist

  • Cost allocation tags applied consistently
  • AWS Budgets configured with alerts
  • Cost Explorer reviewed monthly
  • Reserved Instances or Savings Plans for stable workloads
  • Spot Instances for fault-tolerant workloads
  • Unused resources identified and terminated
  • S3 lifecycle policies for data management
  • Right-sized instances (not over-provisioned)
  • Lambda memory optimization
  • DynamoDB on-demand vs provisioned analysis
  • Data transfer costs analyzed and optimized

Pillar 6: Sustainability

Goal: Minimize environmental impact of running cloud workloads.

Design Principles

  • Understand your impact
  • Establish sustainability goals
  • Maximize utilization
  • Anticipate and adopt new, more efficient offerings
  • Use managed services
  • Reduce downstream impact

Key Areas

Region Selection:

  • Choose regions with renewable energy
  • AWS regions with lower carbon intensity

User Behavior Patterns:

  • Scale resources with demand
  • Remove unused resources

Software and Architecture:

  • Optimize code for efficiency
  • Use appropriate services (serverless over provisioned)

Data Patterns:

  • Minimize data movement
  • Use data compression
  • Implement lifecycle policies

Hardware Patterns:

  • Use minimum necessary hardware
  • Use instance types with best performance per watt

Development Process:

  • Test sustainability improvements
  • Measure and report carbon footprint

Sustainability Checklist

  • Workloads in regions with renewable energy
  • Auto Scaling to match demand (no idle resources)
  • Unused resources regularly cleaned up
  • Graviton processors considered for better efficiency
  • Managed services used where appropriate
  • Data lifecycle policies to reduce storage
  • Efficient code (async processing, optimized queries)
  • Monitoring resource utilization
  • Carbon footprint tracked (AWS Customer Carbon Footprint Tool)

Review Process

1. Scoping Phase

Questions to ask:

  • What is the workload scope? (entire system vs specific component)
  • What are the business objectives?
  • What are the compliance requirements?
  • What are the current pain points?

2. Review Each Pillar

For each pillar, use this template:

Current State:

  • Document what exists today

Gaps:

  • What's missing or needs improvement?

Risks:

  • What are the high/medium/low priority risks?

Recommendations:

  • Specific, actionable improvements

3. Prioritization Matrix

Priority Criteria
High Security vulnerabilities, critical availability risks, major cost waste
Medium Performance issues, moderate cost optimization, operational improvements
Low Nice-to-haves, future considerations, minor optimizations

4. Action Plan Template

## Pillar: [Name]

### Issue: [Description]
- **Risk Level:** High/Medium/Low
- **Impact:** [Business impact]
- **Effort:** Low/Medium/High

### Recommendation:
[Specific actions]

### Implementation Steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]

### Success Criteria:
- [Measurable outcome 1]
- [Measurable outcome 2]

### Resources:
- [AWS documentation links]
- [Blog posts or examples]

Common Anti-Patterns

Anti-Pattern Issue Better Approach
Single AZ deployment No fault tolerance Multi-AZ architecture
No IaC Manual config, drift CloudFormation/CDK/Terraform
Hardcoded secrets Security vulnerability Secrets Manager/Parameter Store
No monitoring Blind operation CloudWatch dashboards + alarms
No backups Data loss risk Automated backup strategy
Over-provisioning Cost waste Right-sizing + Auto Scaling
No cost tracking Budget overruns Tags + Budgets + Cost Explorer
Monolithic architecture Hard to scale Microservices or serverless

Real-World Example

Scenario: Serverless API with authentication

Architecture Review:

Operational Excellence:

  • ✅ Lambda functions deployed via CDK
  • ✅ CloudWatch logs enabled
  • ❌ Missing: Distributed tracing (X-Ray), dashboards

Security:

  • ❌ CRITICAL: Hardcoded API keys in Lambda environment variables
  • ✅ API Gateway with IAM authorization
  • ❌ Missing: Secrets Manager, encryption at rest

Reliability:

  • ✅ Multi-AZ DynamoDB table
  • ❌ Single region deployment
  • ❌ Missing: Backup strategy, DR plan

Performance:

  • ✅ CloudFront for static assets
  • ❌ No caching for API responses
  • ❌ Lambda cold starts not optimized

Cost:

  • ❌ DynamoDB provisioned capacity, but traffic is spiky
  • ✅ Lambda usage-based pricing
  • ❌ Missing: Budget alerts, cost allocation tags

Sustainability:

  • ✅ Serverless architecture (good utilization)
  • ❌ Unused dev/test resources running 24/7

Priority Actions:

  1. HIGH: Move API keys to Secrets Manager (Security)
  2. HIGH: Implement DynamoDB backups (Reliability)
  3. MEDIUM: Add X-Ray tracing (Operational Excellence)
  4. MEDIUM: Switch DynamoDB to on-demand (Cost)
  5. LOW: Add API Gateway caching (Performance)

Resources

Common Mistakes When Using This Framework

Mistake Why It's Wrong Correct Approach
"Sustainability doesn't apply to this workload" Every workload consumes resources and energy Review all 6 pillars, even if findings are minimal
Skipping current state documentation Can't measure improvement without baseline Always document "Current State" before recommendations
Generic recommendations Not actionable or specific to this workload Provide specific AWS services, code examples, priorities
No prioritization Everything seems equally important Use HIGH/MEDIUM/LOW risk levels, create phased plan
Forgetting about trade-offs Optimizing one pillar at expense of others Explicitly call out trade-offs (e.g., multi-region cost vs reliability)

Using This Skill

When conducting architecture reviews:

  1. Start with context - understand business objectives and constraints
  2. Review systematically - go through all 6 pillars, don't skip ANY
  3. Document findings - use consistent format per pillar (Current State → Gaps → Recommendations)
  4. Prioritize ruthlessly - security and availability issues first
  5. Be specific - actionable recommendations with examples and AWS service names
  6. Provide resources - link to AWS docs and examples
  7. Create action plan - clear next steps with success criteria and effort estimates
  8. Call out trade-offs - be explicit about costs and benefits of each recommendation

Remember: Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.

No exceptions to reviewing all 6 pillars - even if a pillar seems "not applicable", document why and what the current state is.

Weekly Installs
3
Repository
rameshvr/skills
First Seen
Feb 9, 2026
Installed on
opencode3
claude-code3
github-copilot3
codex3
kimi-cli3
gemini-cli3