databricks-cost-tuning

SKILL.md

Databricks Cost Tuning

Overview

Reduce Databricks spending by optimizing cluster configurations, leveraging spot instances, right-sizing SQL warehouses, and implementing cost governance policies. Databricks charges per DBU (Databricks Unit) with rates varying by workload type: Jobs Compute ($0.15/DBU), All-Purpose Compute ($0.40/DBU), SQL Compute ($0.22/DBU), and Serverless ($0.07/DBU).

Prerequisites

  • Databricks Premium or Enterprise workspace
  • Access to system.billing.usage tables
  • Workspace admin for cluster policy creation

Instructions

Step 1: Identify Top Cost Drivers

-- Top 10 most expensive clusters this month
SELECT cluster_id, cluster_name, sku_name,
       SUM(usage_quantity) AS total_dbus,
       ROUND(SUM(usage_quantity * list_price), 2) AS cost_usd
FROM system.billing.usage
WHERE usage_date >= date_trunc('month', current_date())
GROUP BY cluster_id, cluster_name, sku_name
ORDER BY cost_usd DESC
LIMIT 10;

Step 2: Enforce Cluster Policies

{
  "name": "cost-optimized-interactive",
  "definition": {
    "autotermination_minutes": { "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 20 },
    "num_workers": { "type": "range", "minValue": 1, "maxValue": 8 },
    "node_type_id": { "type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"] },
    "aws_attributes.availability": { "type": "fixed", "value": "SPOT_WITH_FALLBACK" },
    "spark_conf.spark.databricks.cluster.profile": { "type": "fixed", "value": "singleNode" }
  }
}

Step 3: Use Spot Instances for Jobs

# jobs/cost_optimized_job.py
job_config = {
    "name": "nightly-etl",
    "new_cluster": {
        "num_workers": 4,
        "node_type_id": "m5.2xlarge",
        "aws_attributes": {
            "availability": "SPOT_WITH_FALLBACK",
            "first_on_demand": 1,  # Driver on-demand, workers spot
            "spot_bid_price_percent": 100,
        },
        "autotermination_minutes": 0,  # Jobs clusters auto-terminate on completion
    },
}
# Spot instances save 60-90% on worker node costs

Step 4: Right-Size SQL Warehouses

-- Check warehouse utilization (queries/hour and queue time)
SELECT warehouse_id, warehouse_name,
       COUNT(*) AS query_count,
       AVG(total_duration_ms) AS avg_duration_ms,
       MAX(queue_duration_ms) AS max_queue_ms
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY warehouse_id, warehouse_name;
-- If max_queue_ms is near 0, warehouse is oversized. Reduce cluster size.
-- If max_queue_ms > 30000, warehouse needs more capacity or auto-scaling.  # 30000: 30 seconds in ms

Step 5: Schedule Auto-Stop for Development Clusters

# Set aggressive auto-termination for all dev clusters
databricks clusters list --output JSON | \
  jq -r '.[] | select(.cluster_name | test("dev|sandbox|test")) | .cluster_id' | \
  xargs -I{} databricks clusters edit {} --json '{"autotermination_minutes": 15}'

Error Handling

Issue Cause Solution
Spot instance interruption Cloud provider reclaiming capacity Use SPOT_WITH_FALLBACK, checkpoint long jobs
Cluster policy too restrictive Workers can't handle workload Increase max_workers or allow larger instance types
SQL warehouse idle but running No auto-stop configured Enable auto-stop with 10-minute timeout
Billing data not available System tables not enabled Enable system table access in account settings

Examples

Basic usage: Apply databricks cost tuning to a standard project setup with default configuration options.

Advanced scenario: Customize databricks cost tuning for production environments with multiple constraints and team-specific requirements.

Output

  • Configuration files or code changes applied to the project
  • Validation report confirming correct implementation
  • Summary of changes made and their rationale

Resources

  • Official monitoring documentation
  • Community best practices and patterns
  • Related skills in this plugin pack
Weekly Installs
17
GitHub Stars
1.6K
First Seen
Feb 14, 2026
Installed on
codex17
gemini-cli16
github-copilot16
kimi-cli16
amp16
opencode16