# Cloud Management
Operate AWS, Azure, and GCP from the terminal as the control plane. Use provider CLIs plus terminal-invoked tools such as `docker`, `terraform`, `pulumi`, `kubectl`, `helm`, and CI runners. Do not fall back to a portal unless the user explicitly asks for a console workflow.
## Non-Negotiables
- Start read-only. Inspect the repo, current provider scope, existing IaC, and deployed state before proposing writes.
- Treat cloud as a set of concerns, not a single logo. Runtime, database, storage, DNS, secrets, and observability may live on different providers.
- Use CLI-only execution. Prefer `aws`, `az`, and `gcloud`, plus CLI-driven IaC or deployment tools already present in the repo.
- Reuse the existing state manager. Extend Terraform, Bicep, CloudFormation, Pulumi, Helm, or repo-owned deploy scripts instead of creating a parallel path.
- Default to the simplest managed service that fits the workload. Favor managed containers, managed databases, and managed queues over VMs or Kubernetes unless the repo already needs lower-level control.
- Default to cost-conscious dev sizing unless the user explicitly asks for production-ready HA, multi-region, or higher compliance posture.
- Require explicit approval before high-cost, destructive, public-ingress, org-scope, identity-sensitive, or hard-to-reverse changes.
- Prefer short-lived credentials, SSO, managed identity, workload identity, or OIDC federation over static keys.
- When a command surface is uncertain, inspect `--help`, provider docs, or existing repo automation before guessing.
- Finish the loop: apply, verify, capture outputs, and record rollback posture.
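The short-lived credential preference above can be sketched per provider; profile names are placeholders, and each login path assumes the matching identity setup already exists:

```shell
# AWS: short-lived SSO session for a named profile ("dev" is an assumption)
aws sso login --profile dev

# Azure: managed identity, when running on Azure compute
az login --identity

# GCP: interactive auth; --update-adc also refreshes Application
# Default Credentials used by SDK-driven tools such as Terraform
gcloud auth login --update-adc
```

Each of these yields an expiring session rather than a static key on disk.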
## Start Sequence
- Run `python .agents/skills/cloud-management/scripts/detect_repo_stack.py .` from the repo root.
- Read `cli-operating-model.md` for the shared operating discipline.
- Load only the references needed for the task:
  - mixed-cloud selection and workload mapping: `provider-selection.md`
  - approval gates and cost risk: `approval-policy.md`
  - CI/CD and automatic deployments: `cicd-and-auto-deploy.md`
  - inventory, optimization, and incident repair: `inventory-optimization-remediation.md`
  - provider-specific command runbooks:
- Verify provider scope before any write:
- AWS: profile or role, account, region
- Azure: cloud, tenant, subscription, resource group
- GCP: configuration, account, project, enabled APIs
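The scope checks above map to read-only CLI calls; a minimal sketch (no command here mutates anything):

```shell
# AWS: confirm identity, account, profile, and region
aws sts get-caller-identity
aws configure list

# Azure: confirm cloud, tenant, and active subscription
az account show --output table

# GCP: confirm active configuration, account, and project,
# plus which APIs are enabled in that project
gcloud config list
gcloud services list --enabled
```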
- For any non-trivial mutation, run the guard script before presenting or executing the change:

```shell
python .agents/skills/cloud-management/scripts/cloud_change_guard.py \
  --provider aws \
  --environment prod \
  --operation "create ecs service, alb, and rds postgres instance" \
  --resource-type database \
  --monthly-cost-usd 180 \
  --stateful \
  --public-ingress \
  --dns-change \
  --format markdown
```
## Shared Execution Loop
Use this loop for every cloud task:
- Discover: inspect repo shape, existing infra ownership, and live cloud state.
- Decide: choose the least-complex provider-native mapping that fits the workload and current estate.
- Preview: validate templates, inspect diffs, or run dry-run and what-if surfaces where available.
- Approve: request permission when cost, blast radius, identity, ingress, data, or rollback risk justifies it.
- Apply: execute exact CLI or CLI-driven IaC commands.
- Verify: wait for health, deployment status, logs, revisions, and connectivity.
- Record: capture outputs, rollback path, and any follow-up risks.
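For a Terraform-managed repo, one pass through the loop might look like this (the plan-file name is an assumption; the key discipline is applying exactly the reviewed plan):

```shell
# Preview: validate and render a saved plan for review
terraform init -input=false
terraform validate
terraform plan -input=false -out=tfplan

# Approve: show the exact diff before any write
terraform show tfplan

# Apply: execute only the reviewed plan file
terraform apply -input=false tfplan

# Verify and record: capture outputs for endpoints, IDs, and rollback notes
terraform output
```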
## Task Router

### Identify the Right Provider or Multi-Cloud Shape
- Use `detect_repo_stack.py` to infer frameworks, CI, IaC, cloud hints, and likely runtime bias.
- Load `provider-selection.md`.
- Determine ownership by concern:
- runtime
- data
- storage
- registry
- secrets
- DNS and CDN
- CI identity
- Preserve existing systems of record unless the user explicitly wants migration or consolidation.
### Deploy New Resources or Services
- Infer the workload shape and choose the least-complex managed target.
- Estimate recurring and one-time cost before provisioning.
- Run `cloud_change_guard.py` and request approval if required.
- Prefer the repo's existing IaC path. If none exists, use the provider CLI or CLI-invoked IaC in the smallest reasonable footprint.
- Wait, verify, and return the exact endpoints, IDs, and health evidence.
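As one illustration of the smallest-footprint CLI path, a containerized API on Cloud Run; the service name, image path, and region are assumptions:

```shell
# Provision the least-complex managed target; deploy blocks until
# the new revision is ready or fails
gcloud run deploy api \
  --image us-docker.pkg.dev/my-project/app/api:latest \
  --region us-central1 \
  --no-allow-unauthenticated

# Verify: capture the exact endpoint to report back to the user
gcloud run services describe api --region us-central1 \
  --format 'value(status.url)'
```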
### Wire Automatic Deployments
- Reuse the existing CI system when it exists.
- Prefer OIDC, federated credentials, managed identity, or workload identity for CI-to-cloud auth.
- Build immutable artifacts once, publish them to the provider registry, then roll out by reference.
- Split infra deploys from runtime deploys when stateful resources or migrations are involved.
- Load `cicd-and-auto-deploy.md`.
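The build-once, roll-out-by-reference flow can be sketched as follows; the registry path and service name are placeholders:

```shell
# Build one immutable artifact tagged by commit, never "latest"
GIT_SHA="$(git rev-parse --short HEAD)"
IMAGE="us-docker.pkg.dev/my-project/app/api:${GIT_SHA}"
docker build -t "$IMAGE" .
docker push "$IMAGE"

# Roll out by reference: the runtime pulls the published artifact,
# it is never rebuilt at deploy time
gcloud run deploy api --image "$IMAGE" --region us-central1
```

Promoting the same tag through staging and production guarantees every environment runs the identical artifact.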
### Inventory and Optimize
- Inventory first with list, describe, graph, asset, billing, and deployment-state commands.
- Review each layer independently: compute, ingress, database, cache, storage, identity, logging, and spend.
- Look for obvious waste before redesign:
- idle or duplicate resources
- oversized data tiers
- always-on non-prod capacity
- needless NAT, egress, or public ingress
- missing lifecycle rules, autoscaling, or concurrency limits
- Load `inventory-optimization-remediation.md`.
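Inventory-first, read-only starting points per provider (dates, scopes, and project names are placeholders; the AWS tagging API only surfaces taggable resources):

```shell
# AWS: enumerate taggable resources, then pull a month of spend
aws resourcegroupstaggingapi get-resources --output table
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY --metrics UnblendedCost

# Azure: list everything in the current subscription
az resource list --output table

# GCP: search assets across the project (requires the Cloud Asset API)
gcloud asset search-all-resources --scope projects/my-project
```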
### Diagnose and Fix Cloud Errors
- Confirm provider scope and environment.
- Reproduce or isolate the failing surface.
- Inspect deployment history, logs, events, identity, secrets, network, and dependencies.
- Prefer the smallest reversible change first.
- Re-verify service health, rollout status, and rollback readiness.
- Capture the root cause and the exact corrective command sequence.
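On a Kubernetes runtime, the diagnose-then-smallest-reversible-fix sequence might look like this (the deployment name is an assumption):

```shell
# Inspect deployment history and current rollout state
kubectl rollout history deployment/api
kubectl rollout status deployment/api --timeout=120s

# Recent events and logs around the failure window
kubectl get events --sort-by=.lastTimestamp | tail -n 20
kubectl logs deployment/api --since=30m | tail -n 50

# Smallest reversible change first: roll back one revision, then re-verify
kubectl rollout undo deployment/api
kubectl rollout status deployment/api
```

The same shape applies on ECS or Container Apps with the matching describe, log, and revision-rollback commands.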
## Architecture Bias
Favor the simplest managed platform that fits the repo:
- Static site or SPA: storage plus CDN, or the provider's lightweight static hosting path.
- Containerized API: ECS Fargate, Azure Container Apps, or Cloud Run before Kubernetes.
- Web plus worker plus websocket or realtime: split services by responsibility instead of forcing one long-running process shape.
- Event-driven jobs or schedulers: use provider-native schedulers and queues instead of cron inside app containers.
- Kubernetes: choose only when the repo already needs k8s primitives, advanced ingress, sidecars, daemon workloads, or node-level tuning.
For Cloush-style backends, use this as the baseline example, not a hard requirement:
- separate `web`, `worker`, and `socket` or `realtime` runtime surfaces
- managed PostgreSQL, managed Redis or Valkey, object storage, registry, and secrets
- provider-native schedulers and queues
- registry push plus rolling or revision-based service updates, not hand-managed VMs
## Approval Model
Always request explicit permission before:
- deleting, replacing, migrating, restoring, or resizing stateful resources
- changing DNS, TLS, public ingress, private networking, or auth trust
- creating likely expensive services such as HA databases, premium caches, NAT gateways, large load balancers, dedicated clusters, or cross-region replication
- changing org-scope policy, IAM, RBAC, or workload identity bindings
- performing production changes with downtime or restart risk
Use `cloud_change_guard.py` to classify risk and generate the checklist or approval request. Read `approval-policy.md` for the full model.
## Bundled Scripts

- `scripts/detect_repo_stack.py`: Inspect a repo and emit deploy-relevant signals: languages, frameworks, CI, IaC, cloud hints, identity hints, and recommended runtime bias.
- `scripts/cloud_change_guard.py`: Score change risk, determine whether approval is required, and emit a structured checklist or approval template before execution.