Managing Cluster Capacity
Manages cluster capacity across all CockroachDB deployment tiers. What "capacity" means varies by tier — Self-Hosted manages individual nodes, Advanced/BYOC manage node count and machine size, Standard manages provisioned vCPUs, and Basic auto-scales with cost controls.
When to Use This Skill
- Permanently removing a node from a cluster (Self-Hosted)
- Adding nodes to increase capacity (Self-Hosted)
- Scaling cluster node count or machine size (Advanced, BYOC)
- Adjusting provisioned compute (Standard)
- Managing costs on a serverless cluster (Basic)
- Replacing hardware or migrating infrastructure (Self-Hosted, BYOC)
- Replacing a failed or dead node (Self-Hosted)
- Managing storage utilization and disk pressure (Self-Hosted)
For temporary maintenance (not capacity changes): use performing-cluster-maintenance. For a pre-operation health check: use reviewing-cluster-health.
Step 1: Gather Context
Required Context
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Different capacity model per tier |
| Direction? | Scale up (add capacity), Scale down (reduce capacity) | Determines procedure |
Additional Context (by tier)
If Self-Hosted (scaling down):
| Question | Options | Why It Matters |
|---|---|---|
| How many nodes to remove? | 1, multiple | Multiple nodes should be decommissioned together, in one command |
| Target node IDs? | Node IDs from `cockroach node status` | Required for CLI commands |
| Is the node alive or dead? | Alive, Dead | Dead nodes use a different procedure |
| Deployment platform? | Bare metal, VMs, Kubernetes | Changes CLI and cleanup steps |
| Current replication factor? | 3, 5, custom | Must have enough nodes remaining |
| Current node count? | Number | Validates remaining capacity |
| Storage utilization? | Low (<60%), Medium (60-80%), High (>80%) | Determines urgency and whether storage maintenance is needed |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Scale method? | Cloud Console, API, Terraform | Determines procedure |
| Current and target configuration? | e.g., 5 nodes → 3 nodes, or 4 vCPU → 8 vCPU | Validates constraints |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | Affects infrastructure verification |
If Standard:
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for scaling decision |
| Target vCPUs? | Number | Validates workload will fit |
If Basic: Gather cost management goals — Basic auto-scales with no manual capacity control.
Context-Driven Routing
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Capacity Management |
| Advanced | Advanced Scaling |
| BYOC | BYOC Scaling |
| Standard | Standard Compute Management |
| Basic | Basic Cost Management |
Self-Hosted Capacity Management
Applies when: Tier = Self-Hosted
Scaling Down: Decommission Nodes
Pre-Decommission Validation
-- All nodes live
SELECT n.node_id, n.is_live, n.build_tag
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- Ranges fully replicated
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Remaining capacity check
SELECT node_id, store_id,
ROUND(capacity / 1073741824.0, 2) AS total_gb,
ROUND(available / 1073741824.0, 2) AS available_gb,
ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
-- Replication factor
SHOW ZONE CONFIGURATION FOR RANGE default;
Remaining nodes must stay < 60% utilization after absorbing data. Node count after decommission must be >= replication factor.
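To sanity-check the first constraint before acting, you can project post-decommission utilization. A minimal sketch, assuming the target node's data redistributes evenly across the remaining stores (`<target_node_id>` is a placeholder):
-- Projected cluster-wide utilization after removing <target_node_id>
WITH s AS (
  SELECT node_id, capacity, capacity - available AS used
  FROM crdb_internal.kv_store_status
)
SELECT ROUND(
  100 * (SELECT SUM(used) FROM s)::FLOAT
      / (SELECT SUM(capacity) FROM s WHERE node_id != <target_node_id>)::FLOAT,
  2) AS projected_utilization_pct;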
If Node Is Alive: Drain Then Decommission
# Step 1: Drain
cockroach node drain <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
# Step 2: Decommission (single node)
cockroach node decommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
# Step 2: Decommission (multiple nodes — more efficient, do simultaneously)
cockroach node decommission <id_1> <id_2> <id_3> --certs-dir=<certs-dir> --host=<any-live-node>
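Optionally, gauge how much data movement a decommission will trigger before starting. A sketch counting ranges with a replica on the target node (`<target_node_id>` is a placeholder):
-- Each of these ranges must move one replica to another node
SELECT COUNT(*) AS ranges_to_move
FROM crdb_internal.ranges_no_leases
WHERE <target_node_id> = ANY (replicas);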
If Node Is Dead: Replace Failed Node
When a node has been dead longer than server.time_until_store_dead (default 5m), CockroachDB automatically re-replicates its data to surviving nodes. Use this procedure to clean up the dead node and optionally add a replacement.
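To confirm the threshold in effect on your cluster:
SHOW CLUSTER SETTING server.time_until_store_dead;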
Step 1: Confirm the node is dead and data is safe
-- Confirm node is dead
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <dead_node_id>;
-- Verify all ranges are fully replicated (no under-replicated after re-replication)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Check remaining capacity can handle the load
SELECT node_id, ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
If under-replicated ranges exist, wait for re-replication to complete before proceeding.
Step 2: Decommission the dead node (metadata cleanup)
cockroach node decommission <dead_node_id> --certs-dir=<certs-dir> --host=<any-live-node>
Step 3: Add a replacement node (recommended)
If remaining nodes are above 60% utilization, provision a replacement node using the Scaling Up: Add Nodes procedure.
Multiple dead nodes: Decommission all dead nodes simultaneously:
cockroach node decommission <id_1> <id_2> --certs-dir=<certs-dir> --host=<any-live-node>
See replacing-failed-nodes reference for detailed failure scenarios and recovery procedures.
Monitor Decommission Progress
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
Wait for gossiped_replicas = 0 and membership = 'decommissioned'. Then stop the process on the decommissioned node.
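The same state is visible from SQL if polling there is more convenient:
-- Expect membership = 'decommissioned' before stopping the process
SELECT node_id, draining, membership
FROM crdb_internal.gossip_liveness WHERE node_id = <node_id>;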
Cancel a Decommission
cockroach node recommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
Recommission only works while the node is still in the decommissioning state; a fully decommissioned node cannot be recommissioned and must rejoin as a new node.
Scaling Up: Add Nodes
- Provision new hardware/VM with same specs as existing nodes
- Install the same CockroachDB version (`cockroach version` to confirm)
- Start the node with `--join` pointing to existing cluster nodes
- Verify the join: `SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes n JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;`
- Data rebalances automatically; monitor with `SELECT node_id, range_count, lease_count FROM crdb_internal.kv_store_status ORDER BY node_id;` (a convergence sketch follows this list)
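As referenced above, a rough convergence check; a sketch, assuming ranges distribute roughly evenly across nodes:
-- Spread trends toward a small constant as rebalancing completes
SELECT MAX(range_count) - MIN(range_count) AS range_count_spread
FROM crdb_internal.kv_store_status;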
Post-Scaling Verification
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
SELECT node_id, range_count, lease_count,
ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
Advanced Scaling
Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. Capacity is adjusted by changing node count or machine size.
Via Cloud Console
- Cluster → Capacity
- Adjust node count or machine type (vCPUs per node)
- Cockroach Labs handles all node operations (drain, decommission, provisioning) safely
- Monitor progress in Cloud Console
Via Cloud API
# Scale node count
curl -X PATCH -H "Authorization: Bearer $COCKROACH_API_KEY" \
-H "Content-Type: application/json" \
-d '{"config": {"num_nodes": <new_count>}}' \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"
Via Terraform
resource "cockroach_cluster" "example" {
dedicated {
num_virtual_cpus = 8 # vCPUs per node
storage_gib = 150
num_nodes = 5 # total nodes
}
}
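Run `terraform plan` to confirm only the intended capacity fields change, then `terraform apply` to submit the update.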
Pre-Scaling Check
-- Ensure no disruptive jobs are running before scaling down
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j WHERE status = 'running' GROUP BY 1, 2;
Constraints
- Minimum: 3 nodes x 4 vCPUs (12 vCPUs total)
- Scale down: Data must fit on remaining nodes; zone configs must be satisfiable
- Scale up: Additional nodes available within your plan limits
BYOC Scaling
Applies when: Tier = BYOC
Follow all Advanced Scaling steps. BYOC scaling is managed through the same Cloud Console/API/Terraform interfaces.
Cloud Provider Verification (after scaling down)
If AWS:
aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" \
--query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
If GCP:
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
If Azure:
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<name>']"
Additional BYOC Considerations
- Verify security groups/firewall rules after scaling
- Update reserved instance or committed use discount allocations
- Verify network connectivity (PrivateLink/PSC/VPC Peering) is unaffected
- Check cloud billing reflects the new instance count
Standard Compute Management
Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes. Capacity is managed by adjusting provisioned compute (vCPUs).
Adjust Provisioned vCPUs
- Cloud Console → Cluster → Capacity
- Increase or decrease provisioned vCPUs
- Change takes effect without downtime
Before Scaling Down
- Review CPU utilization in Cloud Console — ensure workload fits within reduced compute
- Storage is usage-based and unaffected by compute changes
After Scaling
Monitor P99 latency and QPS in Cloud Console for 24-48 hours. If latency increases after scaling down, scale compute back up.
Basic Cost Management
Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to manage. Capacity scales automatically based on demand. Cost is managed through spending controls.
Manage Spending
- Set spending limits: Cloud Console → Cluster → Settings → configure monthly spending cap
- Review usage: Cloud Console shows Request Unit (RU) consumption over time
- Optimize queries: Reduce RU consumption through query tuning and indexing (see the sketch after this list)
- Archive data: Delete unused tables or databases to reduce storage costs
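For the query-tuning item above, a minimal sketch (the `orders` table and filter are hypothetical): compare EXPLAIN ANALYZE output before and after adding an index on the filtered column.
-- Hypothetical example: check whether this query performs a full table scan
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- If it does, an index on the filter column typically cuts RU consumption
CREATE INDEX IF NOT EXISTS orders_customer_id_idx ON orders (customer_id);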
When to Consider Upgrading
If you need explicit control over compute capacity (guaranteed vCPUs), consider upgrading to Standard. If you need dedicated infrastructure, consider Advanced.
Safety Considerations
| Operation | Tier | Reversible? |
|---|---|---|
| `cockroach node decommission` | SH | Recommission only before completion |
| Stop decommissioned node | SH | No (must rejoin as new node) |
| Add node to cluster | SH | Yes (decommission to remove) |
| Scale via Console/API | ADV/BYOC | Contact support to reverse |
| Adjust provisioned vCPUs | STD | Yes (scale back) |
| Set spending limit | BAS | Yes (adjust anytime) |
Critical (Self-Hosted):
- Never decommission below the replication factor
- Always drain before decommission (for live nodes)
- Decommission multiple nodes simultaneously (not sequentially)
- Verify remaining capacity can absorb the data
- For dead nodes: wait for re-replication to complete before decommissioning
- Monitor storage utilization — nodes above 80% risk performance degradation
Troubleshooting
| Issue | Tier | Fix |
|---|---|---|
| Decommission hangs | SH | Check zone config constraints; investigate stalled ranges |
| Recommission fails | SH | Node already fully decommissioned; must rejoin as new |
| New node not rebalancing | SH | Wait for automatic rebalancing; check range_count |
| Scale-down rejected | ADV/BYOC | Below minimum or data won't fit |
| Latency spike after reduction | STD | Scale provisioned vCPUs back up |
| Cloud instances not cleaned up | BYOC | Contact support; verify in cloud console |
| Dead node not re-replicating | SH | Check server.time_until_store_dead; verify surviving nodes have capacity |
| Storage utilization high after scale-down | SH | Add replacement node or increase disk size |
References
Skill references:
- replacing-failed-nodes — detailed failure scenarios and recovery procedures
Related skills:
- reviewing-cluster-health — Pre/post health checks
- performing-cluster-maintenance — Drain procedure (SH)
- upgrading-cluster-version — Upgrades and lifecycle