databricks-cost-tuning
SKILL.md
Databricks Cost Tuning
Overview
Reduce Databricks spending by optimizing cluster configurations, leveraging spot instances, right-sizing SQL warehouses, and implementing cost governance policies. Databricks charges per DBU (Databricks Unit) with rates varying by workload type: Jobs Compute ($0.15/DBU), All-Purpose Compute ($0.40/DBU), SQL Compute ($0.22/DBU), and Serverless ($0.07/DBU).
Prerequisites
- Databricks Premium or Enterprise workspace
- Access to
system.billing.usagetables - Workspace admin for cluster policy creation
Instructions
Step 1: Identify Top Cost Drivers
-- Top 10 most expensive clusters this month
SELECT cluster_id, cluster_name, sku_name,
SUM(usage_quantity) AS total_dbus,
ROUND(SUM(usage_quantity * list_price), 2) AS cost_usd
FROM system.billing.usage
WHERE usage_date >= date_trunc('month', current_date())
GROUP BY cluster_id, cluster_name, sku_name
ORDER BY cost_usd DESC
LIMIT 10;
Step 2: Enforce Cluster Policies
{
"name": "cost-optimized-interactive",
"definition": {
"autotermination_minutes": { "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 20 },
"num_workers": { "type": "range", "minValue": 1, "maxValue": 8 },
"node_type_id": { "type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"] },
"aws_attributes.availability": { "type": "fixed", "value": "SPOT_WITH_FALLBACK" },
"spark_conf.spark.databricks.cluster.profile": { "type": "fixed", "value": "singleNode" }
}
}
Step 3: Use Spot Instances for Jobs
# jobs/cost_optimized_job.py
job_config = {
"name": "nightly-etl",
"new_cluster": {
"num_workers": 4,
"node_type_id": "m5.2xlarge",
"aws_attributes": {
"availability": "SPOT_WITH_FALLBACK",
"first_on_demand": 1, # Driver on-demand, workers spot
"spot_bid_price_percent": 100,
},
"autotermination_minutes": 0, # Jobs clusters auto-terminate on completion
},
}
# Spot instances save 60-90% on worker node costs
Step 4: Right-Size SQL Warehouses
-- Check warehouse utilization (queries/hour and queue time)
SELECT warehouse_id, warehouse_name,
COUNT(*) AS query_count,
AVG(total_duration_ms) AS avg_duration_ms,
MAX(queue_duration_ms) AS max_queue_ms
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY warehouse_id, warehouse_name;
-- If max_queue_ms is near 0, warehouse is oversized. Reduce cluster size.
-- If max_queue_ms > 30000, warehouse needs more capacity or auto-scaling. # 30000: 30 seconds in ms
Step 5: Schedule Auto-Stop for Development Clusters
# Set aggressive auto-termination for all dev clusters
databricks clusters list --output JSON | \
jq -r '.[] | select(.cluster_name | test("dev|sandbox|test")) | .cluster_id' | \
xargs -I{} databricks clusters edit {} --json '{"autotermination_minutes": 15}'
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Spot instance interruption | Cloud provider reclaiming capacity | Use SPOT_WITH_FALLBACK, checkpoint long jobs |
| Cluster policy too restrictive | Workers can't handle workload | Increase max_workers or allow larger instance types |
| SQL warehouse idle but running | No auto-stop configured | Enable auto-stop with 10-minute timeout |
| Billing data not available | System tables not enabled | Enable system table access in account settings |
Examples
Basic usage: Apply databricks cost tuning to a standard project setup with default configuration options.
Advanced scenario: Customize databricks cost tuning for production environments with multiple constraints and team-specific requirements.
Output
- Configuration files or code changes applied to the project
- Validation report confirming correct implementation
- Summary of changes made and their rationale
Resources
- Official monitoring documentation
- Community best practices and patterns
- Related skills in this plugin pack
Weekly Installs
17
Repository
jeremylongshore…s-skillsGitHub Stars
1.6K
First Seen
Feb 14, 2026
Security Audits
Installed on
codex17
gemini-cli16
github-copilot16
kimi-cli16
amp16
opencode16