Cloud & DevOps
Master skill for containerization, orchestration, infrastructure as code, CI/CD pipelines, cloud providers, and serverless architecture. Philosophy: infrastructure should be reproducible, version-controlled, and immutable -- treat servers as cattle, not pets.
Task Router
| Task | Reference |
|---|---|
| Dockerfiles, images, Compose, security | references/docker.md |
| Kubernetes objects, Helm, troubleshooting | references/kubernetes.md |
| Terraform modules, state, workspaces | references/terraform.md |
| GitHub Actions, GitLab CI, pipelines | references/cicd.md |
| AWS/GCP/Azure services, architectures | references/cloud-providers.md |
| Lambda, Cloud Functions, event-driven | references/serverless.md |
| Prometheus, Grafana, OTel, SLO/SLI, tracing, alerting | references/observability.md |
| FinOps, cost optimization, reserved instances, Kubecost | references/finops.md |
| ArgoCD, Flux, GitOps, progressive delivery, Sealed Secrets | references/gitops.md |
| OPA, Kyverno, NetworkPolicy, image scanning, SBOM, SLSA | references/security-policy.md |
| Istio, Linkerd, mTLS, traffic management, service graph | references/service-mesh.md |
| CRDs, operators, admission webhooks, Velero, DR, vCluster | references/kubernetes-advanced.md |
Before Starting
Answer these questions before writing any infrastructure code:
- Which cloud provider? AWS, GCP, Azure, or multi-cloud? Each has different service names and idioms
- What orchestrator? Kubernetes, ECS, Cloud Run, or plain VMs? Determines deployment strategy
- Existing infra? Importing existing resources into Terraform is different from greenfield
- IaC tool? Terraform (multi-cloud), CDK/Pulumi (imperative), CloudFormation (AWS-only)
- CI/CD platform? GitHub Actions, GitLab CI, Jenkins, CircleCI -- each has different syntax
- Environment strategy? How many envs (dev/staging/prod)? How are they separated?
Quick Reference
Docker Commands
| Need | Command |
|---|---|
| Build image | docker build -t app:v1 . |
| Run with port | docker run -p 8080:80 app:v1 |
| Shell into container | docker exec -it <id> /bin/sh |
| View logs | docker logs -f <id> |
| Prune everything | docker system prune -a --volumes |
| Multi-arch build | docker buildx build --platform linux/amd64,linux/arm64 -t app:v1 . |
Kubernetes Commands
| Need | Command |
|---|---|
| Apply manifest | kubectl apply -f deployment.yaml |
| Get pod logs | kubectl logs -f deploy/app -n namespace |
| Shell into pod | kubectl exec -it pod/app -- /bin/sh |
| Port forward | kubectl port-forward svc/app 8080:80 |
| Describe resource | kubectl describe pod/app -n namespace |
| Watch rollout | kubectl rollout status deploy/app |
| Debug failing pod | kubectl get events --sort-by=.lastTimestamp |
Terraform Commands
| Need | Command |
|---|---|
| Initialize | terraform init |
| Preview changes | terraform plan -out=plan.tfplan |
| Apply changes | terraform apply plan.tfplan |
| Import resource | terraform import aws_instance.web i-1234567890 |
| Show state | terraform state list |
| Destroy all | terraform destroy (use with caution) |
Core Principles
- Infrastructure as Code -- every resource defined in version-controlled files. No manual console clicks. Why: reproducibility, audit trail, code review for infra changes
- Immutable infrastructure -- replace, don't patch. Build new images instead of SSHing in to fix. Why: eliminates configuration drift, makes rollbacks trivial
- Least privilege -- every service account, IAM role, and container runs with minimum permissions. Why: blast radius reduction when (not if) something is compromised
- 12-Factor App -- config via env vars, stateless processes, disposable containers. Why: enables horizontal scaling and easy deployment
- GitOps -- git is the single source of truth. Infra changes go through PRs. Why: audit trail, rollback via git revert, consistent environments
- Defense in depth -- network policies + IAM + secrets management + image scanning. Why: no single security layer is sufficient
- Observe everything -- metrics, logs, traces (the three pillars). Alert on symptoms, not causes. Why: you can't fix what you can't see
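The 12-Factor principle above (config via env vars) can be sketched in a few lines. This is a minimal illustration, not a prescribed layout: the variable names `DATABASE_URL`, `PORT`, and `LOG_LEVEL` and the `Settings` class are hypothetical and chosen only for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Immutable runtime configuration read from the environment."""
    database_url: str
    port: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    # Read from the process environment unless a mapping is injected (useful in tests).
    env = os.environ if env is None else env
    # Fail fast at startup when a required value is missing (hypothetical variable name).
    if "DATABASE_URL" not in env:
        raise RuntimeError("DATABASE_URL is required")
    return Settings(
        database_url=env["DATABASE_URL"],
        port=int(env.get("PORT", "8080")),      # assumed default port
        log_level=env.get("LOG_LEVEL", "INFO"),  # assumed default level
    )
```

Because nothing is baked into the image, the same container can run in dev, staging, and prod with only its environment changed.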
Decision Trees
Containers vs Serverless
Need persistent connections (WebSocket, gRPC)?
  YES → Containers (ECS/K8s/Cloud Run)
  NO → How long do requests take?
    > 15 minutes → Containers
    < 15 minutes → How much traffic?
      Spiky/unpredictable → Serverless (Lambda/Cloud Functions)
      Steady high volume → Containers (cheaper at scale)
      Very low → Serverless (pay per invocation)
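The containers-vs-serverless tree can be written as a small function, which makes the branch order explicit. This is a sketch only: the function name, the parameter names, and the traffic labels (`"spiky"`, `"steady-high"`, `"very-low"`) are assumptions made for the example. The tree does not say what to do at exactly 15 minutes, so the sketch treats that case as the short-request branch.

```python
def choose_compute(persistent_connections: bool,
                   max_request_minutes: float,
                   traffic: str) -> str:
    """Walk the containers-vs-serverless tree top to bottom."""
    # Long-lived connections (WebSocket, gRPC) rule out serverless first.
    if persistent_connections:
        return "containers"
    # Requests longer than 15 minutes go to containers.
    if max_request_minutes > 15:
        return "containers"
    # Short requests: decide on the traffic shape (labels are hypothetical).
    if traffic == "steady-high":
        return "containers"   # cheaper at scale
    if traffic in ("spiky", "very-low"):
        return "serverless"   # pay per invocation
    raise ValueError(f"unknown traffic profile: {traffic!r}")
```

For example, `choose_compute(False, 0.5, "spiky")` follows the NO, short-request, spiky path and returns `"serverless"`.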
Kubernetes vs ECS vs Cloud Run
Multi-cloud or vendor-agnostic required?
  YES → Kubernetes (EKS/GKE/AKS)
  NO → AWS-only?
    YES → How complex is the workload?
      Simple HTTP services → App Runner or ECS Fargate
      Complex orchestration, service mesh → EKS
    NO → GCP?
      Simple HTTP/gRPC → Cloud Run (easiest option)
      Complex orchestration → GKE
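The orchestrator tree maps onto a function in the same way. The names below (`choose_orchestrator`, `provider`, `complex_workload`) are hypothetical. The tree only covers AWS and GCP once multi-cloud is ruled out, so the sketch raises an error for any other provider rather than guessing.

```python
def choose_orchestrator(multi_cloud: bool,
                        provider: str,
                        complex_workload: bool) -> str:
    """Walk the Kubernetes-vs-ECS-vs-Cloud-Run tree."""
    # Vendor-agnostic requirements point to managed Kubernetes.
    if multi_cloud:
        return "Kubernetes (EKS/GKE/AKS)"
    if provider == "aws":
        # Complex orchestration or a service mesh justifies EKS.
        return "EKS" if complex_workload else "App Runner or ECS Fargate"
    if provider == "gcp":
        return "GKE" if complex_workload else "Cloud Run"
    # The tree has no branch for other single-provider setups.
    raise ValueError(f"tree does not cover provider: {provider!r}")
```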
Terraform vs CDK vs Pulumi
Multi-cloud?
  YES → Terraform (widest provider support) or Pulumi
  NO → Team prefers HCL (declarative)?
    YES → Terraform
    NO → Team prefers TypeScript/Python (imperative)?
      YES → CDK (AWS-only) or Pulumi (multi-cloud)
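The IaC tree, sketched the same way. The function and parameter names are hypothetical. The tree gives no recommendation when the team prefers neither HCL nor a general-purpose language, so that case returns an empty list instead of inventing an answer.

```python
from typing import List


def choose_iac_tool(multi_cloud: bool,
                    prefers_hcl: bool,
                    prefers_general_purpose_language: bool) -> List[str]:
    """Return the candidate tools the tree suggests, in the order it lists them."""
    if multi_cloud:
        return ["Terraform", "Pulumi"]
    if prefers_hcl:
        return ["Terraform"]
    if prefers_general_purpose_language:
        # CDK is AWS-only; Pulumi also works outside AWS.
        return ["CDK", "Pulumi"]
    # No branch in the tree covers this combination.
    return []
```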
Anti-Patterns
- Hardcoding secrets -- never put credentials in Dockerfiles, manifests, or Terraform files. Use secrets managers (Vault, AWS Secrets Manager, K8s Secrets with external-secrets-operator)
- Running as root -- containers should use non-root users. Add `USER nonroot` in Dockerfile
- Using `latest` tag -- always pin image versions. `latest` breaks reproducibility and makes rollbacks impossible
- Monolithic Terraform state -- split state per environment and per service. Large state files are slow and risky
- No health checks -- every container needs liveness and readiness probes. Without them, K8s routes traffic to broken pods
- Ignoring resource limits -- set CPU/memory requests AND limits. Without them, one pod can starve the entire node
- Manual deployments -- if you SSH into prod to deploy, you've already lost. Automate everything through CI/CD
- Storing state locally -- Terraform state belongs in a remote backend (S3, GCS) with state locking (e.g. a DynamoDB table for the S3 backend). Local state causes conflicts as soon as more than one person or pipeline applies changes
- No network policies -- by default, K8s allows all pod-to-pod traffic. Define explicit NetworkPolicies
- Skipping staging -- deploy to staging first, always. "It works on my machine" is not a deployment strategy
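Several of the anti-patterns above can be caught mechanically in CI. The sketch below is a rough illustration rather than a replacement for a real linter: `lint_dockerfile` and `lint_container_spec` are hypothetical helpers, the Dockerfile parsing is deliberately naive (it does not resolve `ARG` substitutions in `FROM`), and the container check assumes a dict shaped like one entry of a pod spec's `containers` list.

```python
from typing import List


def lint_dockerfile(text: str) -> List[str]:
    """Flag unpinned or :latest base images and a final stage that runs as root."""
    problems = []
    stage_aliases = set()
    user = None
    for raw in text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM" and len(parts) > 1:
            args = [p for p in parts[1:] if not p.startswith("--")]  # skip --platform=...
            image = args[0]
            user = None  # assume each new stage starts as root
            if len(args) >= 3 and args[1].upper() == "AS":
                stage_aliases.add(args[2])
            # Earlier build stages, scratch, and digest-pinned images are fine.
            if image in stage_aliases or image.lower() == "scratch" or "@" in image:
                continue
            tail = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
            if ":" not in tail:
                problems.append(f"unpinned image (implicit latest): {image}")
            elif tail.endswith(":latest"):
                problems.append(f"image uses latest tag: {image}")
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1]
    if user is None or user.split(":")[0] in ("root", "0"):
        problems.append("final stage runs as root; add a non-root USER")
    return problems


def lint_container_spec(container: dict) -> List[str]:
    """Flag missing probes and missing CPU/memory requests or limits."""
    problems = []
    for probe in ("livenessProbe", "readinessProbe"):
        if probe not in container:
            problems.append(f"missing {probe}")
    resources = container.get("resources", {})
    for kind in ("requests", "limits"):
        for name in ("cpu", "memory"):
            if name not in resources.get(kind, {}):
                problems.append(f"missing resources.{kind}.{name}")
    return problems
```

A pipeline step could call these on every Dockerfile and manifest in the repository and fail the build when either returns a non-empty list.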