Senior Cloud Architect

Expert cloud architecture and infrastructure design across AWS, GCP, and Azure.

Keywords

cloud, aws, gcp, azure, terraform, infrastructure, vpc, eks, ecs, lambda, cost-optimization, disaster-recovery, multi-region, iam, security, migration

Quick Start

# Analyze infrastructure costs
python scripts/cost_analyzer.py --account production --period monthly

# Run DR validation
python scripts/dr_test.py --region us-west-2 --type failover

# Audit security posture
python scripts/security_audit.py --framework cis --output report.html

# Generate resource inventory
python scripts/inventory.py --accounts all --format csv

Tools

Script	Purpose
`scripts/cost_analyzer.py`	Analyze cloud spend by service, environment, and tag
`scripts/dr_test.py`	Validate disaster recovery failover procedures
`scripts/security_audit.py`	Audit against CIS benchmarks and compliance frameworks
`scripts/inventory.py`	Inventory all resources across accounts and regions

Cloud Platform Comparison

Service	AWS	GCP	Azure
Compute	EC2, ECS, EKS	GCE, GKE	VMs, AKS
Serverless	Lambda	Cloud Functions	Azure Functions
Storage	S3	Cloud Storage	Blob Storage
Database	RDS, DynamoDB	Cloud SQL, Spanner	Azure SQL Database, Cosmos DB
ML	SageMaker	Vertex AI	Azure ML
CDN	CloudFront	Cloud CDN	Azure CDN

Workflow 1: Design a Production AWS Architecture

Define requirements -- Identify compute, storage, database, and networking needs. Determine RTO/RPO targets.

Provision VPC with Terraform:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"
  name    = "${var.project}-${var.environment}"
  cidr    = var.vpc_cidr
  azs             = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets = var.private_subnets
  public_subnets  = var.public_subnets
  enable_nat_gateway   = true
  single_nat_gateway   = var.environment != "production"
  enable_dns_hostnames = true
  tags = local.common_tags
}

Deploy compute -- ECS/EKS in private subnets behind an ALB in public subnets. Use at least 2 AZs for redundancy.
Configure database -- RDS Multi-AZ for production, single-AZ for staging. Set backup retention to 30 days (production) or 7 days (non-production).
Add caching layer -- ElastiCache (Redis) between application and database.
Layer security -- WAF on CloudFront, NACLs on subnets, security groups on instances. Apply least-privilege IAM.
Validate -- Run `python scripts/security_audit.py --framework cis` and resolve all high-severity findings.
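
The `private_subnets` and `public_subnets` variables passed to the VPC module are lists of CIDR blocks, typically one per availability zone. The following is a minimal sketch of deriving them from the VPC CIDR with Python's standard `ipaddress` module. The helper name `plan_subnets`, the /20 subnet size, and the 10.0.0.0/16 example range are illustrative assumptions, not values defined by this skill.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, az_count: int = 3, new_prefix: int = 20) -> dict:
    """Split a VPC CIDR into one private and one public subnet per AZ.

    Hypothetical helper: the first `az_count` blocks become private subnets
    and the next `az_count` blocks become public subnets.
    """
    network = ipaddress.ip_network(vpc_cidr)
    blocks = list(network.subnets(new_prefix=new_prefix))
    if len(blocks) < 2 * az_count:
        raise ValueError("VPC CIDR too small for the requested layout")
    return {
        "private_subnets": [str(b) for b in blocks[:az_count]],
        "public_subnets": [str(b) for b in blocks[az_count:2 * az_count]],
    }


plan = plan_subnets("10.0.0.0/16")
print(plan["private_subnets"])  # ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
print(plan["public_subnets"])   # ['10.0.48.0/20', '10.0.64.0/20', '10.0.80.0/20']
```

The resulting lists could be supplied as the `private_subnets` and `public_subnets` Terraform variables. The remaining /20 blocks stay free for a later data tier.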

Reference Architecture

Route 53 (DNS) -> CloudFront + WAF -> ALB
  -> ECS/EKS Cluster (AZ-a) + ECS/EKS Cluster (AZ-b)
    -> ElastiCache (Redis)
      -> RDS Multi-AZ (Primary + Standby)

Workflow 2: Optimize Cloud Costs

Audit current spend -- Run `python scripts/cost_analyzer.py --account production --period monthly`.

Right-size instances -- Identify instances with avg CPU <10% and max CPU <30% as downsize candidates:

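# Right-sizing rule; CPU values are utilization percentages
def recommend_size(avg_cpu: float, max_cpu: float) -> str:
    if avg_cpu < 10 and max_cpu < 30:
        return "downsize"  # idle on average and at peak
    if avg_cpu > 80:
        return "upsize"  # sustained high utilization
    return "optimal"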

Convert steady-state workloads to Reserved Instances or Savings Plans:

Type	Discount	Commitment	Use Case
On-Demand	0%	None	Variable workloads
Reserved	30-72%	1-3 years	Steady-state
Savings Plans	30-72%	1-3 years	Flexible compute
Spot	60-90%	None	Fault-tolerant batch

Enforce cost allocation tags -- Require Environment, Project, Owner, CostCenter on all resources. Alert on untagged resources after 24 hours.
Validate -- Re-run cost analyzer and confirm savings target achieved.
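
The tagging rule in this workflow (four required tags, alert after 24 hours) can be sketched as a small check. This is an illustration only: the resource record shape (`id`, `created_at`, `tags`) and the function names are assumptions, and a real implementation would read resources from the cloud provider's inventory or tagging API.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("Environment", "Project", "Owner", "CostCenter")
GRACE_PERIOD = timedelta(hours=24)


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def needs_alert(resource: dict, now: datetime) -> bool:
    """True when a resource is still missing tags after the grace period.

    `resource` is a hypothetical record with `tags` and `created_at` fields.
    """
    overdue = now - resource["created_at"] > GRACE_PERIOD
    return overdue and bool(missing_tags(resource["tags"]))


now = datetime(2024, 1, 3, tzinfo=timezone.utc)
old_untagged = {
    "id": "i-example-1",
    "created_at": now - timedelta(hours=30),
    "tags": {"Environment": "production", "Project": "demo"},
}
new_untagged = {
    "id": "i-example-2",
    "created_at": now - timedelta(hours=2),
    "tags": {},
}

print(missing_tags(old_untagged["tags"]))  # ['Owner', 'CostCenter']
print(needs_alert(old_untagged, now))      # True
print(needs_alert(new_untagged, now))      # False
```

Passing `now` in explicitly keeps the check deterministic, which makes it easier to test than calling the clock inside the function.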

Workflow 3: Plan Disaster Recovery

Select DR strategy based on RTO/RPO requirements:

Strategy	RTO	RPO	Cost
Backup & Restore	Hours	Hours	$
Pilot Light	Minutes	Minutes	$$
Warm Standby	Minutes	Seconds	$$$
Multi-Site Active	Seconds	Near-zero	$$$$

Configure cross-region replication -- Database replication to secondary region. S3 cross-region replication for object storage.
Set up Route 53 failover routing -- Health checks on primary. Automatic DNS failover to secondary.
Define backup policy:
- Database: continuous replication, 35-day retention, cross-region, encrypted
- Application data: daily, 90-day retention, lifecycle to IA at 30d, Glacier at 90d
- Configuration: on-change via git + S3, unlimited retention
Test -- Run `python scripts/dr_test.py --region us-west-2 --type failover` and confirm that the RTO/RPO targets are met.
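
Strategy selection from the RTO/RPO table in this workflow can be sketched as picking the cheapest row whose tiers satisfy both targets. The tier names come from the table. The numeric cut-offs in `tier_for` (one hour, one minute, one second) and the function names are assumptions made for illustration and should be adjusted to the organization's own definitions.

```python
# Tiers from the strategy table, cheapest first: (name, RTO tier, RPO tier).
STRATEGIES = [
    ("Backup & Restore", "hours", "hours"),
    ("Pilot Light", "minutes", "minutes"),
    ("Warm Standby", "minutes", "seconds"),
    ("Multi-Site Active", "seconds", "near-zero"),
]
TIER_RANK = {"hours": 0, "minutes": 1, "seconds": 2, "near-zero": 3}


def tier_for(target_seconds: float) -> str:
    """Map a numeric target to a tier. The cut-offs are assumptions."""
    if target_seconds >= 3600:
        return "hours"
    if target_seconds >= 60:
        return "minutes"
    if target_seconds >= 1:
        return "seconds"
    return "near-zero"


def select_strategy(rto_seconds: float, rpo_seconds: float) -> str:
    """Return the cheapest strategy whose tiers meet both targets."""
    need_rto = TIER_RANK[tier_for(rto_seconds)]
    need_rpo = TIER_RANK[tier_for(rpo_seconds)]
    for name, rto_tier, rpo_tier in STRATEGIES:
        if TIER_RANK[rto_tier] >= need_rto and TIER_RANK[rpo_tier] >= need_rpo:
            return name
    raise ValueError("no strategy in the table meets the targets")


print(select_strategy(15 * 60, 30))           # Warm Standby
print(select_strategy(4 * 3600, 24 * 3600))   # Backup & Restore
print(select_strategy(30, 0))                 # Multi-Site Active
```

Under these assumed cut-offs, an RTO of 15 minutes with an RPO under one minute selects Warm Standby, because Pilot Light only offers a minutes-level RPO.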

Workflow 4: Audit Security Posture

Run audit -- Run `python scripts/security_audit.py --framework cis --output report.html`.
Review network segmentation -- Public subnets contain only NAT GW, ALB, bastion. Private subnets contain application tier. Data subnets contain RDS, Redis, Elasticsearch.

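Enforce least-privilege IAM -- Scope every policy to specific resources and conditions. In the example below, the private-range check uses aws:VpcSourceIp, which applies to requests arriving through a VPC endpoint; aws:SourceIp is evaluated against public addresses and would not match 10.0.0.0/8:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/uploads/*",
      "Condition": {
        "StringEquals": { "aws:PrincipalTag/Team": "engineering" },
        "IpAddress": { "aws:VpcSourceIp": ["10.0.0.0/8"] }
      }
    }
  ]
}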

Verify encryption -- Data encrypted at rest (KMS) and in transit (TLS 1.2+).
Validate -- Re-run audit and confirm all critical and high findings resolved.
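
As a complement to the audit script, the least-privilege rule from this workflow can be sketched as a simple lint over a policy document. This is not how `scripts/security_audit.py` works (its internals are not shown here). The function name and the three checks are assumptions chosen to mirror "specific resources and conditions".

```python
def find_broad_statements(policy: dict) -> list:
    """Return human-readable findings for overly broad Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]

    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Action and Resource may each be a single string or a list.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {index}: wildcard action")
        if "*" in resources:
            findings.append(f"statement {index}: wildcard resource")
        if "Condition" not in stmt:
            findings.append(f"statement {index}: no condition")
    return findings


example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket/uploads/*",
            "Condition": {"StringEquals": {"aws:PrincipalTag/Team": "engineering"}},
        },
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    ],
}

for finding in find_broad_statements(example_policy):
    print(finding)
```

The scoped first statement produces no findings, while the second is flagged three times. A missing condition is not always a defect, so that finding is best treated as a prompt for review.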

AWS Well-Architected Pillars (Decision Checklist)

Operational Excellence: IaC everywhere? Monitoring and alerting? Runbooks for incidents?
Security: Least-privilege IAM? Encryption at rest and in transit? VPC segmentation?
Reliability: Multi-AZ? Auto-scaling? DR tested?
Performance: Right-sized instances? Caching layer? CDN for static assets?
Cost Optimization: Reserved capacity for steady-state? Spot for batch? Unused resources cleaned?
Sustainability: Efficient regions? Right-sized compute? Data lifecycle policies?

Reference Materials

Document	Path
AWS Patterns	references/aws_patterns.md
GCP Patterns	references/gcp_patterns.md
Multi-Cloud Strategies	references/multi_cloud.md
Cost Optimization Guide	references/cost_optimization.md

Troubleshooting

Problem	Cause	Solution
Cross-region latency exceeds 200ms	No regional caching or CDN configured	Deploy CloudFront/Cloud CDN with edge locations closest to user base; enable regional API Gateway caches
Terraform state lock conflicts across teams	Shared state backend without proper locking	Use the S3 backend with a DynamoDB lock table (AWS) or the GCS backend, which locks state natively (GCP), and partition state files per team via workspaces
Multi-cloud DNS failover not triggering	Health check thresholds too lenient or misconfigured endpoints	Set health check interval to 10s, failure threshold to 3, and verify endpoint returns 200 on the exact path monitored
IAM permission errors after cross-account migration	Trust policies not updated for new account IDs	Update AssumeRole trust policies with correct account principals and external IDs; validate with `aws sts assume-role`
Cloud costs spike unexpectedly after scaling event	Auto-scaling max limits set too high or no budget alerts	Set hard max instance counts per ASG, configure billing alerts at 80%/100%/120% thresholds, and review Spot fallback behavior
VPC peering routes not propagating between peered VPCs	Route tables missing entries for peered CIDR ranges	Add explicit route entries in both VPCs pointing peered CIDRs to the peering connection; verify no overlapping CIDRs
DR failover test fails with data inconsistency	Replication lag between primary and secondary regions	Switch to synchronous replication for critical databases or implement application-level consistency checks pre-failover
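
For the DNS failover row above, a rough estimate of time to failover is the health check interval multiplied by the failure threshold, plus the DNS record TTL that clients may still have cached. The sketch below encodes that approximation. The function name and the TTL values are assumptions, and real failover time also depends on resolver behavior and application start-up.

```python
def estimated_failover_seconds(interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Approximate worst-case time until clients resolve the secondary.

    detection time = interval * consecutive failures required
    propagation    = DNS TTL cached by clients (assumed upper bound)
    """
    detection = interval_s * failure_threshold
    return detection + dns_ttl_s


# Settings from the troubleshooting table, with an assumed 30-second TTL.
print(estimated_failover_seconds(10, 3, 30))  # 60

# A 30-second interval with an assumed 60-second TTL.
print(estimated_failover_seconds(30, 3, 60))  # 150
```

With a 10-second interval and a threshold of 3, the estimate stays within a 60-second failover target only if the record TTL is 30 seconds or less.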

Success Criteria

99.99% availability SLA met across all production workloads with documented uptime reports
Cost optimization savings above 25% compared to on-demand baseline through Reserved Instances, Savings Plans, and right-sizing
RTO < 15 minutes and RPO < 1 minute validated through quarterly DR failover tests
Zero critical CIS benchmark findings in production accounts after security audit remediation
Infrastructure drift < 2% measured by Terraform plan diffs on scheduled compliance scans
Cross-region failover completes within 60 seconds with automated Route 53 health check validation
100% resource tagging compliance enforced via automated policy checks with no untagged resources older than 24 hours
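
To make the 99.99% availability criterion concrete, the allowed downtime can be computed from the availability percentage and the reporting period. A minimal sketch follows. The function name and the 30-day and 365-day periods are illustrative assumptions.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(f"{monthly:.2f} minutes per 30 days")   # 4.32 minutes per 30 days
print(f"{yearly:.2f} minutes per 365 days")   # 52.56 minutes per 365 days
```

A budget of roughly 4 minutes per month is shorter than a 15-minute RTO, so a single full regional failover at that RTO would by itself exceed the monthly allowance.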

Scope & Limitations

This skill covers:

Multi-cloud architecture design and comparison across AWS, GCP, and Azure
Infrastructure-as-Code with Terraform including VPC, compute, database, and networking
Disaster recovery planning, cross-region replication, and failover strategies
Cloud cost optimization, right-sizing, and reserved capacity planning

This skill does NOT cover:

Application-level code architecture or microservice design patterns (see senior-architect)
Kubernetes cluster internals, pod scheduling, or service mesh configuration (see senior-devops)
Security compliance frameworks beyond CIS benchmarks such as SOC 2, HIPAA, or GDPR (see ra-qm-team/ compliance skills)
CI/CD pipeline design, build automation, or deployment workflows (see senior-devops)

Integration Points

Skill	Integration	Data Flow
`senior-devops`	Infrastructure provisioning feeds into CI/CD deployment pipelines	Terraform outputs (endpoints, ARNs) → deployment configs
`senior-secops`	Security audit findings inform cloud hardening decisions	CIS benchmark results → security remediation tasks
`senior-architect`	Application architecture requirements drive cloud resource selection	Capacity requirements → compute/storage/network sizing
`aws-solution-architect`	AWS-specific deep dives complement multi-cloud strategy	Cloud platform comparison → AWS implementation details
`ra-qm-team/soc2-compliance`	Compliance requirements shape infrastructure security controls	Compliance matrices → IAM policies, encryption configs, audit logging
`senior-fullstack`	Fullstack application stacks deploy onto cloud infrastructure	Application stack definitions → ECS/EKS task definitions, RDS configs
