Cloud & DevOps

Master skill for containerization, orchestration, infrastructure as code, CI/CD pipelines, cloud providers, and serverless architecture. Philosophy: infrastructure should be reproducible, version-controlled, and immutable -- treat servers as cattle, not pets.

Task Router

Task	Reference
Dockerfiles, images, Compose, security	references/docker.md
Kubernetes objects, Helm, troubleshooting	references/kubernetes.md
Terraform modules, state, workspaces	references/terraform.md
GitHub Actions, GitLab CI, pipelines	references/cicd.md
AWS/GCP/Azure services, architectures	references/cloud-providers.md
Lambda, Cloud Functions, event-driven	references/serverless.md
Prometheus, Grafana, OTel, SLO/SLI, tracing, alerting	references/observability.md
FinOps, cost optimization, reserved instances, Kubecost	references/finops.md
ArgoCD, Flux, GitOps, progressive delivery, Sealed Secrets	references/gitops.md
OPA, Kyverno, NetworkPolicy, image scanning, SBOM, SLSA	references/security-policy.md
Istio, Linkerd, mTLS, traffic management, service graph	references/service-mesh.md
CRDs, operators, admission webhooks, Velero, DR, vCluster	references/kubernetes-advanced.md

Before Starting

Answer these questions before writing any infrastructure code:

Which cloud provider? AWS, GCP, Azure, or multi-cloud? Each has different service names and idioms
What orchestrator? Kubernetes, ECS, Cloud Run, or plain VMs? Determines deployment strategy
Existing infra? Importing existing resources into Terraform is different from greenfield
IaC tool? Terraform (multi-cloud, HCL), CDK/Pulumi (general-purpose languages such as TypeScript or Python), CloudFormation (AWS-only)
CI/CD platform? GitHub Actions, GitLab CI, Jenkins, CircleCI -- each has different syntax
Environment strategy? How many envs (dev/staging/prod)? How are they separated?

Quick Reference

Docker Commands

Need	Command
Build image	`docker build -t app:v1 .`
Run with port	`docker run -p 8080:80 app:v1`
Shell into container	`docker exec -it <id> /bin/sh`
View logs	`docker logs -f <id>`
Prune everything	`docker system prune -a --volumes` (deletes unused images, stopped containers, and unused volumes; use with caution)
Multi-arch build	`docker buildx build --platform linux/amd64,linux/arm64 -t app:v1 .`

Kubernetes Commands

Need	Command
Apply manifest	`kubectl apply -f deployment.yaml`
Get pod logs	`kubectl logs -f deploy/app -n <namespace>`
Shell into pod	`kubectl exec -it pod/app -- /bin/sh`
Port forward	`kubectl port-forward svc/app 8080:80`
Describe resource	`kubectl describe pod/app -n <namespace>`
Watch rollout	`kubectl rollout status deploy/app`
Debug failing pod	`kubectl get events --sort-by=.lastTimestamp`

Terraform Commands

Need	Command
Initialize	`terraform init`
Preview changes	`terraform plan -out=plan.tfplan`
Apply changes	`terraform apply plan.tfplan`
Import resource	`terraform import aws_instance.web i-1234567890`
Show state	`terraform state list`
Destroy all	`terraform destroy` (use with caution)

Core Principles

Infrastructure as Code -- every resource defined in version-controlled files. No manual console clicks. Why: reproducibility, audit trail, code review for infra changes
Immutable infrastructure -- replace, don't patch. Build new images instead of SSHing in to fix. Why: eliminates configuration drift, makes rollbacks trivial
Least privilege -- every service account, IAM role, and container runs with minimum permissions. Why: blast radius reduction when (not if) something is compromised
12-Factor App -- config via env vars, stateless processes, disposable containers. Why: enables horizontal scaling and easy deployment
GitOps -- git is the single source of truth. Infra changes go through PRs. Why: audit trail, rollback via git revert, consistent environments
Defense in depth -- network policies + IAM + secrets management + image scanning. Why: no single security layer is sufficient
Observe everything -- metrics, logs, traces (the three pillars). Alert on symptoms, not causes. Why: you can't fix what you can't see
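
The 12-Factor principle of config via env vars can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `Settings` fields and the variable names `DATABASE_URL`, `PORT`, and `LOG_LEVEL` are hypothetical and chosen only for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Runtime configuration, read once at startup (hypothetical fields)."""
    database_url: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build Settings from environment variables, failing fast on gaps."""
    env = os.environ if env is None else env
    # Required values have no default: a missing one should stop startup
    # instead of letting the process run half-configured.
    if "DATABASE_URL" not in env:
        raise RuntimeError("DATABASE_URL is not set")
    return Settings(
        database_url=env["DATABASE_URL"],
        port=int(env.get("PORT", "8080")),      # assumed default port
        log_level=env.get("LOG_LEVEL", "info"),  # assumed default level
    )
```

Because nothing is read from files baked into the image, the same image can run in dev, staging, and prod with only the environment changing.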

Decision Trees

Containers vs Serverless

Need persistent connections (WebSocket, gRPC)?
  YES → Containers (ECS/K8s/Cloud Run)
  NO → How long do requests take?
    > 15 minutes (beyond AWS Lambda's maximum timeout) → Containers
    ≤ 15 minutes → How much traffic?
      Spiky/unpredictable → Serverless (Lambda/Cloud Functions)
      Steady high volume → Containers (cheaper at scale)
      Very low → Serverless (pay per invocation)
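
The tree above can be read as a small function, which makes the order of the checks explicit. This is a sketch only: the `Traffic` enum and `choose_compute` are hypothetical names, and treating a request of exactly 15 minutes as within the limit is an assumption of the example.

```python
from enum import Enum


class Traffic(Enum):
    SPIKY = "spiky"              # spiky or unpredictable load
    STEADY_HIGH = "steady_high"  # steady, high volume
    VERY_LOW = "very_low"        # occasional invocations


def choose_compute(persistent_connections: bool,
                   request_minutes: float,
                   traffic: Traffic) -> str:
    """Walk the Containers vs Serverless tree from top to bottom."""
    if persistent_connections:   # WebSocket, gRPC
        return "containers"
    if request_minutes > 15:     # assumption: exactly 15 still counts as short
        return "containers"
    if traffic is Traffic.STEADY_HIGH:
        return "containers"      # cheaper at scale
    # Spiky and very low traffic both land on serverless.
    return "serverless"
```

Note that the traffic question is only reached when the first two checks pass, so a long-running job with spiky traffic still lands on containers.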

Kubernetes vs ECS vs Cloud Run

Multi-cloud or vendor-agnostic required?
  YES → Kubernetes (EKS/GKE/AKS)
  NO → AWS-only?
    YES → How complex is the workload?
      Simple HTTP services → App Runner or ECS Fargate
      Complex orchestration, service mesh → EKS
    NO → GCP?
      Simple HTTP/gRPC → Cloud Run (easiest option)
      Complex orchestration → GKE
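
The same tree, written as a function. This is a hedged sketch: `choose_orchestrator` and its parameters are hypothetical, and returning `None` for providers the tree does not cover (for example Azure) is a choice made for this example, not a recommendation.

```python
from typing import Optional


def choose_orchestrator(multi_cloud: bool,
                        provider: str,
                        complex_workload: bool) -> Optional[str]:
    """Walk the Kubernetes vs ECS vs Cloud Run tree.

    provider is a lowercase name such as "aws" or "gcp" (assumed format).
    complex_workload means complex orchestration or a service mesh.
    """
    if multi_cloud:
        return "Kubernetes (EKS/GKE/AKS)"
    if provider == "aws":
        return "EKS" if complex_workload else "App Runner or ECS Fargate"
    if provider == "gcp":
        return "GKE" if complex_workload else "Cloud Run"
    # The tree has no branch for other single-cloud providers.
    return None
```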

Terraform vs CDK vs Pulumi

Multi-cloud?
  YES → Terraform (widest provider support) or Pulumi
  NO → Team prefers HCL (declarative)?
    YES → Terraform
    NO → Team prefers a general-purpose language (TypeScript/Python)?
      YES → CDK (AWS-only) or Pulumi (multi-cloud)
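
As a sketch, the tool choice reduces to three checks in order. `choose_iac_tool` is a hypothetical helper; it returns every candidate the tree names for a branch and an empty list when no branch applies.

```python
from typing import List


def choose_iac_tool(multi_cloud: bool,
                    prefers_hcl: bool,
                    prefers_general_purpose_language: bool) -> List[str]:
    """Walk the Terraform vs CDK vs Pulumi tree and list the candidates."""
    if multi_cloud:
        return ["Terraform", "Pulumi"]
    if prefers_hcl:
        return ["Terraform"]
    if prefers_general_purpose_language:  # TypeScript, Python, ...
        return ["CDK", "Pulumi"]
    # No preference stated: the tree gives no answer for this case.
    return []
```

The multi-cloud check comes first, so CDK is only ever offered to single-cloud teams.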

Anti-Patterns

Hardcoding secrets -- never put credentials in Dockerfiles, manifests, or Terraform files. Use secrets managers (Vault, AWS Secrets Manager, K8s Secrets with external-secrets-operator)
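Running as root -- containers should run as a non-root user. Create a dedicated user in the image and switch to it with a USER instruction in the Dockerfile (for example USER nonroot on base images that ship that user)
Using latest tag -- always pin image versions (a specific tag or, stricter, a digest). latest breaks reproducibility and makes rollbacks unreliable, because the tag can point at a different image each time it is pulled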
Monolithic Terraform state -- split state per environment and per service. Large state files are slow and risky
No health checks -- every container needs liveness and readiness probes. Without a readiness probe, K8s routes traffic to pods that are not ready; without a liveness probe, a hung container is never restarted
Ignoring resource limits -- set CPU/memory requests AND limits. Without them, one pod can starve the entire node
Manual deployments -- if you SSH into prod to deploy, you've already lost. Automate everything through CI/CD
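Storing state locally -- Terraform state belongs in a remote backend (S3, GCS) with state locking enabled (for S3, a DynamoDB table or, in recent Terraform releases, the backend's native lockfile; GCS locks natively). Local state causes conflicts when more than one person or pipeline applies changes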
No network policies -- default K8s allows all pod-to-pod traffic. Define explicit NetworkPolicies
Skipping staging -- deploy to staging first, always. "It works on my machine" is not a deployment strategy
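
Several of the anti-patterns above can be caught mechanically before a manifest reaches a cluster. The following is a minimal sketch, assuming the container spec has already been parsed into a Python dict (for example from YAML); `lint_container` is a hypothetical helper, not a real tool. The field names it checks (`image`, `livenessProbe`, `readinessProbe`, `resources`, `securityContext.runAsNonRoot`) are the standard Kubernetes container fields. It only looks at the container level, so a `runAsNonRoot` set on the pod's security context would be missed.

```python
def lint_container(container: dict) -> list:
    """Return a list of anti-pattern findings for one container spec."""
    findings = []

    # Image pinning: look only at the part after the last "/", so a
    # registry port such as registry:5000 is not mistaken for a tag.
    name = container.get("image", "").rsplit("/", 1)[-1]
    if "@" not in name:  # no digest, so fall back to checking the tag
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag in ("", "latest"):
            findings.append("image tag is missing or 'latest'")

    # Health checks: both probes are expected.
    for probe in ("livenessProbe", "readinessProbe"):
        if probe not in container:
            findings.append(f"{probe} is not defined")

    # Resource requests AND limits for both CPU and memory.
    resources = container.get("resources", {})
    for kind in ("requests", "limits"):
        for resource in ("cpu", "memory"):
            if resource not in resources.get(kind, {}):
                findings.append(f"resources.{kind}.{resource} is not set")

    # Non-root execution (container-level securityContext only).
    if not container.get("securityContext", {}).get("runAsNonRoot", False):
        findings.append("securityContext.runAsNonRoot is not true")

    return findings
```

A check like this could run in CI as a cheap first gate; policy engines such as OPA or Kyverno cover the same ground with admission-time enforcement.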
