
# Databricks Autonomous Operations

## 1. Overview

This skill is both an SDK/CLI/Connect reference and an autonomous operations playbook. It teaches the AI agent to operate as an SRE — independently deploying, monitoring, diagnosing failures, applying fixes, redeploying, and verifying results across all Databricks resource types.

**Core Loop:** Deploy → Poll → Diagnose → Fix → Redeploy → Verify (max 3 iterations before escalating to the user).

### When to Activate This Skill

- Deploying or running Databricks Asset Bundles (`bundle deploy`, `bundle run`)
- Monitoring job or pipeline runs for completion
- Troubleshooting ANY failure: jobs, DLT pipelines, monitors, alerts, clusters, Genie Spaces
- Using the Databricks Python SDK, CLI, Connect, or REST API
- Encountering error messages from Databricks services
- Operating in a self-healing deploy-fix-redeploy cycle
- Checking job/task/pipeline status or retrieving run output

- SDK docs: https://databricks-sdk-py.readthedocs.io/en/latest/
- GitHub: https://github.com/databricks/databricks-sdk-py

**Runnable Examples:** see the `examples/` directory for complete, copy-pasteable scripts:

- `examples/1-authentication.py` — All auth patterns (env vars, profiles, Azure SP, AccountClient, notebook context)
- `examples/2-clusters-and-jobs.py` — Cluster auto-selection, autoscaling, job CRUD, `submit_and_wait`, `run_now_and_wait`
- `examples/3-sql-and-warehouses.py` — Parameterized queries, chunked results, `query_to_dataframe` helper
- `examples/4-unity-catalog.py` — Tables, schemas, catalogs, volumes, file operations, pattern matching
- `examples/5-serving-and-vector-search.py` — Endpoint creation with `TrafficConfig`, chat/embedding queries, vector search with filters
- `examples/6-autonomous-operations.py` — Job monitoring, multi-task output retrieval, failure diagnosis, self-healing loop

## 2. Environment & Authentication

### Setup

- SDK: `uv pip install databricks-sdk`; Connect: `uv pip install databricks-connect`
- CLI version: >= 0.278.0 (check with `databricks --version`)
- Config: `~/.databrickscfg` or the env vars `DATABRICKS_HOST` and `DATABRICKS_TOKEN`

### Quick Auth

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()                      # Auto-detect (env, config file, or notebook)
w = WorkspaceClient(profile="MY_PROFILE")  # Named profile
```

- Token expired? `databricks auth login --host <url> --profile <name>`
- Profile-based CLI: `DATABRICKS_CONFIG_PROFILE=<name> databricks <command>`
- Full auth patterns (Azure SP, AccountClient, etc.): see `references/sdk-api-reference.md`

## 3. SDK API Quick Reference

- Full code examples and patterns: `references/sdk-api-reference.md`
- Runnable scripts: `examples/` directory (see the file listing in Section 1)

| API | Key Operations | Critical Notes |
|---|---|---|
| Clusters | `list`, `get`, `create_and_wait`, `ensure_cluster_is_running`, `start`, `stop` | Use `select_spark_version()` and `select_node_type()` |
| Jobs | `list`, `run_now_and_wait`, `submit_and_wait`, `get_run`, `get_run_output`, `cancel_run` | `get_run_output` needs task `run_id`, not parent job `run_id` |
| SQL Execution | `execute_statement` with `wait_timeout` | Use `StatementParameterListItem` for parameterized queries |
| Warehouses | `list`, `start`, `stop`, `create_and_wait` | Use `auto_stop_mins` to control costs |
| Unity Catalog | `tables.list`, `tables.get`, `tables.exists`, `catalogs.list`, `schemas.list` | `list_summaries` for fast pattern matching |
| Volumes/Files | `files.upload`, `files.download`, `files.list_directory_contents` | Path format: `/Volumes/catalog/schema/volume/file` |
| Serving | `create_and_wait`, `query` (custom/chat/embeddings), `get_open_ai_client` | `scale_to_zero_enabled` for cost control |
| Vector Search | `query_index`, `upsert_data_vector_index`, `sync_index` | Use `filters_json` for filtered queries |
| Pipelines | `list_pipelines`, `get`, `start_update`, `stop_and_wait`, `list_pipeline_events` | Events contain error details for troubleshooting |
| Monitors | `quality_monitors.get`, `delete`, `update`, `run_refresh` | Update MUST include ALL `custom_metrics` |
| Alerts V2 | `alerts_v2.create_alert`, `update_alert`, `delete_alert` | Update requires full `update_mask` |
| Secrets | `create_scope`, `put_secret`, `get_secret` | Use for API keys, tokens |

### Critical Patterns

- Async apps (FastAPI): SDK is synchronous — wrap calls with `asyncio.to_thread()` (sketched below)
- Long-running ops: use the `_and_wait()` variants with `timeout=timedelta(...)`
- Error handling: `from databricks.sdk.errors import NotFound, PermissionDenied, ResourceAlreadyExists`
- Direct REST: `w.api_client.do(method="GET", path="/api/2.0/...")`
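
A minimal sketch of the async wrapping, assuming a FastAPI-style caller; the `cluster_id` argument is a placeholder:

```python
import asyncio

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

w = WorkspaceClient()

async def cluster_state(cluster_id: str) -> str:
    # The SDK call blocks, so offload it to a worker thread
    try:
        cluster = await asyncio.to_thread(w.clusters.get, cluster_id)
        return cluster.state.value if cluster.state else "UNKNOWN"
    except NotFound:
        return "CLUSTER_NOT_FOUND"
```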

## 4. CLI Operations Reference

### Command Matrix

| Category | Command | Purpose |
|---|---|---|
| Bundle | `databricks bundle validate` | Pre-deploy validation (catches ~80% of errors) |
| Bundle | `databricks bundle deploy -t <target>` | Deploy all resources |
| Bundle | `databricks bundle run -t <target> <job>` | Start a job run |
| Bundle | `databricks bundle destroy -t <target>` | Cleanup all resources |
| Jobs | `databricks jobs get-run <RUN_ID> --output json` | Get run status |
| Jobs | `databricks jobs get-run-output <TASK_RUN_ID> --output json` | Get task notebook output |
| Jobs | `databricks jobs list-runs --job-id <JOB_ID> --output json` | List recent runs |
| Jobs | `databricks jobs cancel-run <RUN_ID>` | Cancel running job |
| Pipelines | `databricks pipelines get <PIPELINE_ID> --output json` | Get pipeline status |
| Pipelines | `databricks pipelines get-update <PID> <UID> --output json` | Get update details |
| Pipelines | `databricks pipelines list-pipeline-events <PID> --output json` | Get events/errors |
| Pipelines | `databricks pipelines start-update <PIPELINE_ID>` | Start pipeline update |
| Clusters | `databricks clusters get <CLUSTER_ID> --output json` | Get cluster status |
| Clusters | `databricks clusters events <CLUSTER_ID> --output json` | Get cluster events |
| Warehouses | `databricks warehouses get <WH_ID> --output json` | Get warehouse status |
| Auth | `databricks auth login --host <url> --profile <name>` | Re-authenticate |
| Workspace | `databricks workspace export <path> --format SOURCE` | Export notebook |
| Apps | `databricks apps logs <app-name>` | App deployment/runtime logs |

### Key jq Patterns

See `references/cli-jq-patterns.md` for the complete catalog. Most critical:

```bash
# Job state
databricks jobs get-run <RUN_ID> --output json | jq '.state'

# Task summary for multi-task jobs
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | {task: .task_key, run_id: .run_id, result: .state.result_state}'

# Failed tasks only
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message, url: .run_page_url}'

# Task output (MUST use task run_id, not parent job run_id)
databricks jobs get-run-output <TASK_RUN_ID> --output json | jq -r '.notebook_output.result // "No output"'
```

## 5. Autonomous Deploy-Run-Fix Playbook

This is the core workflow. Follow these steps whenever deploying a bundle, running a job, or operating autonomously.

### Step 0: Discover Bundle Resources

Before deploying, understand what's in the bundle:

```bash
# Read databricks.yml to find targets and included resources
# Then read the resource YAML files under the resources/ directory
# Identify: jobs (atomic/composite/orchestrator), pipelines, dashboards, alerts
```

Determine the target (dev, staging, prod) from the user or default to dev.

### Step 1: Validate & Deploy

```bash
databricks bundle validate -t <target>   # Pre-flight — catches ~80% of errors
databricks bundle deploy -t <target>     # Deploy all resources
```

- If `validate` fails: read the error, fix the YAML (see Section 6 and the common/databricks-asset-bundles skill), then re-validate.
- If `deploy` fails: common causes are expired auth (403), path resolution errors, and invalid task types. See Section 6.

### Step 2: Run

For jobs:

```bash
databricks bundle run -t <target> <job_name>
# Output includes a Run URL like: https://<host>/#job/<JOB_ID>/run/<RUN_ID>
# EXTRACT the RUN_ID from this URL — you need it for polling
```

For DLT pipelines:

```bash
databricks bundle run -t <target> <pipeline_name>
# Output includes: Pipeline URL with PIPELINE_ID
# Also shows UPDATE_ID for this specific update
```

**Cursor-specific:** use `block_until_ms: 0` to background `bundle run` immediately, then read the terminal output file to capture the Run URL and parse the RUN_ID or PIPELINE_ID from it.
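
A minimal parsing sketch, assuming the Run URL format shown above; `terminal_output` stands in for the captured terminal text:

```python
import re

# Hypothetical captured output from `databricks bundle run`
terminal_output = "Run URL: https://<host>/#job/1234/run/5678"

match = re.search(r"#job/(?P<job_id>\d+)/run/(?P<run_id>\d+)", terminal_output)
if match:
    job_id, run_id = match["job_id"], match["run_id"]  # run_id feeds Step 3 polling
```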

### Step 3: Poll Until Terminal State

Jobs — poll with exponential backoff (30s → 60s → 120s):

```bash
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.life_cycle_state'
# PENDING → RUNNING → TERMINATED (also: INTERNAL_ERROR, SKIPPED)
```

When `life_cycle_state == TERMINATED`, check the result:

```bash
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.result_state'
# SUCCESS → go to Step 4a    FAILED → go to Step 4b
```
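
The same polling loop as a Python SDK sketch: the interval doubles from 30s to a 120s cap, and the function returns the result state once the run terminates.

```python
import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunLifeCycleState

w = WorkspaceClient()
TERMINAL = {RunLifeCycleState.TERMINATED, RunLifeCycleState.INTERNAL_ERROR,
            RunLifeCycleState.SKIPPED}

def poll_run(run_id: int) -> str:
    interval = 30
    while True:
        state = w.jobs.get_run(run_id).state
        if state.life_cycle_state in TERMINAL:
            # SUCCESS → Step 4a, FAILED → Step 4b
            if state.result_state:
                return state.result_state.value
            return state.life_cycle_state.value
        time.sleep(interval)
        interval = min(interval * 2, 120)  # 30s → 60s → 120s
```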

DLT Pipelines — poll the latest update state:

```bash
databricks pipelines get <PIPELINE_ID> --output json | jq -r '.latest_updates[0].state'
# COMPLETED → success    FAILED → get events for errors
```

### Step 4a: On SUCCESS — Verify & Report

```bash
# For multi-task jobs, verify all tasks succeeded:
databricks jobs get-run <RUN_ID> --output json \
  | jq '.tasks[] | {task: .task_key, result: .state.result_state}'

# Get task output (structured JSON from dbutils.notebook.exit):
databricks jobs get-run-output <TASK_RUN_ID> --output json \
  | jq -r '.notebook_output.result // "No output"'
```

Report success to user with the Run URL for verification. If this was part of a multi-job deployment, proceed to the next job.

### Step 4b: On FAILURE — Diagnose

**CRITICAL for multi-task jobs:** `get-run-output` only works on the task `run_id`, NOT the parent job `run_id`.

```bash
# Step A: Find failed tasks and their task-level run_ids
databricks jobs get-run <JOB_RUN_ID> --output json \
  | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, run_id: .run_id, error: .state.state_message}'

# Step B: Get detailed output for EACH failed task
databricks jobs get-run-output <TASK_RUN_ID> --output json \
  | jq -r '.notebook_output.result // .error // "No output"'
```

For DLT pipeline failures:

```bash
databricks pipelines list-pipeline-events <PID> --output json \
  | jq '[.events[] | select(.level == "ERROR")] | .[0:5]'
```

Match the error against Section 6 (decision tree) and `references/error-solution-matrix.md`.

### Step 5: Fix

1. Read the source file(s) identified from the stack trace or error message
2. Apply the fix using editor tools (StrReplace, Write)
3. If YAML/config issue → fix the DAB YAML (consult the common/databricks-asset-bundles skill)
4. If dependency issue → deploy upstream assets first (see Section 6 ordering)

### Step 6: Redeploy & Re-run (Loop)

```bash
databricks bundle deploy -t <target>     # Redeploy with the fix
databricks bundle run -t <target> <job>  # Re-run
# → Return to Step 3 (Poll)
```

Maximum 3 iterations. Track each iteration's error and fix.

### Step 7: Escalation (After 3 Failed Attempts)

Present to the user:

1. **All errors encountered** — with run IDs, task keys, error messages
2. **All fixes attempted** — what was changed and why
3. **Root cause hypothesis** — best guess based on evidence
4. **Run page URLs** — direct links to the Databricks UI

### Step 8: Capture Learnings

After every resolved failure (or escalation), trigger admin/self-improvement:

- Add new error → fix mappings to `references/error-solution-matrix.md`
- Update the "Top 10" table in Section 6 if the error is frequent
- Always capture the learning if the fix took 2+ iterations or was escalated

## 6. Master Troubleshooting Decision Tree

- Full error-solution tables: `references/error-solution-matrix.md` (7 categories, 40+ patterns)
- DLT-specific troubleshooting: `references/dlt-pipeline-troubleshooting.md`
- Diagnostic SQL queries: `references/diagnostic-queries.md`

### Quick Decision Guide

1. **Deployment error?** → Check auth (403), YAML syntax, task type (`notebook_task`, not `python_task`), path resolution
2. **Job runtime error?** → Get task-level output (use the task `run_id`), match the error to the solution matrix
3. **DLT pipeline error?** → Get events via the CLI, check upstream tables exist, verify the schema
4. **Monitor/Alert error?** → Check `ResourceAlreadyExists` (delete + recreate), schema existence, `update_mask`
5. **Dependency error?** → Follow the required order: Bronze → Gold Setup → Gold Merge → Semantic → Monitoring → Genie
6. **Cluster error?** → Check events via the CLI, look for OOM/unreachable/timeout patterns

### Top 10 Most Common Errors (Inline)

| Error | Quick Fix |
|---|---|
| `ModuleNotFoundError` | Add to `%pip install` or DAB environment spec |
| `TABLE_OR_VIEW_NOT_FOUND` | Run setup job first; check 3-part `catalog.schema.table` path |
| `DELTA_MULTIPLE_SOURCE_ROW_MATCHING` | Deduplicate source before MERGE |
| Invalid access token (403) | `databricks auth login --host <url> --profile <name>` |
| `ResourceAlreadyExists` | Delete + recreate (monitors, alerts) |
| `python_task` not recognized | Use `notebook_task` with `notebook_path` |
| `PARSE_SYNTAX_ERROR` | Read failing SQL file, fix syntax, redeploy |
| Parameter not found | Use `base_parameters` dict, not CLI-style `parameters` |
| `run_job_task` vs `job_task` | Use `run_job_task` (not `job_task`) |
| Genie `INTERNAL_ERROR` | Deploy semantic layer (TVFs + Metric Views) first |

## 7. Job Design for Monitorability

Full patterns: `references/job-design-patterns.md` — structured output, progress messages, fail-fast, partial success, 3-layer hierarchy.

Key rules:

- Every notebook must call `dbutils.notebook.exit(json.dumps({...}))` with a structured status (see the sketch after this list)
- Print `[Step N/M]` progress messages so the agent can locate failures
- Fail fast: include actionable error messages (e.g., "Run bronze_setup_job first")
- 3-layer hierarchy: atomic (`notebook_task`) → composite (`run_job_task`) → orchestrator (`run_job_task`)
- Partial success: ≥90% of items succeeding = OK; fix individual failures
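
A notebook sketch of these rules; the table name is hypothetical, and `spark` / `dbutils` are the usual notebook globals:

```python
import json

# [Step 1/3] Fail fast with an actionable message if prerequisites are missing
print("[Step 1/3] Checking source table")
if not spark.catalog.tableExists("main.bronze.events"):  # hypothetical table
    raise RuntimeError("main.bronze.events missing: run bronze_setup_job first")

# [Step 2/3] Do the work
print("[Step 2/3] Counting rows")
row_count = spark.table("main.bronze.events").count()

# [Step 3/3] Exit with structured JSON the agent can read via get-run-output
print("[Step 3/3] Done")
dbutils.notebook.exit(json.dumps({"status": "SUCCESS", "rows": row_count}))
```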

## 8. Agent Behavioral Patterns

### Background Command Pattern (Cursor-Specific)

For `databricks bundle run` or any long-running command:

1. Execute with `block_until_ms: 0` to background it immediately
2. Read the terminal output file to check for completion
3. Parse the RUN_ID from the output URL

### Exponential Backoff for Polling

- Normal jobs: 30s → 60s → 120s (no state change = increase the interval)
- Long-running training jobs: start at 60s, max 300s
- DLT pipelines: 30s (typically faster)

### Context Capture on Every Failure

On each failure, record and maintain the following (one possible shape is sketched after the list):

- `run_id` and `task_run_id`
- `job_name` and `task_key`
- Target environment (dev, staging, prod)
- Full error message
- `run_page_url` for UI access
- Iteration count (which fix attempt this is)
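
A sketch of that record as a dataclass; the field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class FailureContext:
    run_id: int
    task_run_id: int
    job_name: str
    task_key: str
    target: str          # dev, staging, prod
    error_message: str
    run_page_url: str
    iteration: int       # which fix attempt this is (max 3)
```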

### Self-Healing Loop

```
Iteration 1: Deploy → Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 2: Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 3: Run → Monitor → [FAIL] → ESCALATE TO USER
```

### Dependency Awareness

Before deploying, verify that prerequisites exist (a minimal check is sketched after the list):

- Genie Spaces require: TVFs + Metric Views (semantic layer)
- Monitoring requires: Gold tables (populated)
- Gold merge requires: Gold tables (setup job)
- Silver DLT requires: Bronze tables
- Alerts require: Monitoring output tables
- Always follow: Bronze → Gold → Semantic → Monitoring → Alerts → Genie
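
A minimal prerequisite check as a sketch; the resource-to-table map and table names are hypothetical:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Hypothetical map: resource -> Unity Catalog tables it depends on
PREREQS = {
    "gold_merge_job": ["main.gold.fact_orders"],
    "monitoring_job": ["main.gold.fact_orders"],
}

def prerequisites_met(resource: str) -> bool:
    return all(
        w.tables.exists(full_name=table).table_exists
        for table in PREREQS.get(resource, [])
    )
```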

### Partial Success Handling

Some jobs use ≥90% success thresholds (a threshold check is sketched below):

- Alerting deployment: 27/30 alerts succeeding = OK; fix the 3 individually
- Monitoring setup: 7/8 monitors = OK; debug the 1 failure
- Do NOT fail the entire workflow for 1 out of N items failing
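
A threshold-check sketch for the ≥90% rule:

```python
def meets_threshold(succeeded: int, total: int, threshold: float = 0.90) -> bool:
    # 27/30 alerts (90%) clears the bar; fix the 3 failures individually
    return total > 0 and succeeded / total >= threshold

assert meets_threshold(27, 30)       # OK: do not fail the whole workflow
assert not meets_threshold(8, 10)    # 80% is below the bar
```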

### CLI Limitations Awareness

Some diagnostics require the Databricks Workspace UI:

- Full notebook execution logs (when `get-run-output` returns truncated or empty output)
- DLT pipeline data flow visualization
- Detailed Spark execution plans

Always provide the `run_page_url` to the user when CLI output is insufficient.

### Learning Capture Pattern (Self-Improvement Integration)

After every troubleshooting session (successful or escalated), run this checklist:

```
Post-Troubleshooting Self-Improvement:
1. Was this error already in references/error-solution-matrix.md?
   ├── YES → Was the documented fix accurate? If not, update it.
   └── NO  → Add the error pattern and fix to the matrix.

2. Did the initial diagnosis match the actual root cause?
   ├── YES → No action needed.
   └── NO  → Document the misleading signals in the relevant skill.

3. How many iterations were needed?
   ├── 1 → Capture only if error was novel or non-obvious.
   ├── 2-3 → ALWAYS update the relevant skill with the fix.
   └── Escalated → Document what the user found; update skills with user's fix.

4. Is this error likely to recur?
   ├── YES → Ensure it's in Top 10 table (if frequent) or error-solution matrix.
   └── NO  → Document as a one-off note in the session, skip skill update.
```

Skill priority for updates (search in this order):

1. `references/error-solution-matrix.md` — error-to-fix mapping
2. common/databricks-autonomous-operations — troubleshooting decision tree
3. common/databricks-asset-bundles — if deployment-related
4. Domain-specific skill (e.g., gold/pipeline-workers/02-merge-patterns for MERGE errors)
5. New skill — only if all of the above are irrelevant (see the admin/self-improvement justification checklist)

### Safety: Never Retry Destructive Operations

Without explicit user confirmation, NEVER retry:

- `databricks bundle destroy`
- `DROP TABLE` / `DROP SCHEMA`
- `w.quality_monitors.delete()`
- `w.alerts_v2.delete_alert()`

## References

### Reference Documents (Load on Demand)

- `references/sdk-api-reference.md` — Full SDK API with code examples for all services
- `references/error-solution-matrix.md` — Error tables for all 7 resource categories (40+ patterns)
- `references/diagnostic-queries.md` — SQL queries for investigation
- `references/dlt-pipeline-troubleshooting.md` — DLT pipeline failure patterns
- `references/cli-jq-patterns.md` — jq patterns for parsing CLI JSON output
- `references/cli-quick-reference.md` — CLI vs SDK side-by-side table
- `references/job-design-patterns.md` — Structured output, progress messages, 3-layer hierarchy

### Runnable Examples

- `examples/1-authentication.py` through `examples/6-autonomous-operations.py` (see Section 1)

### Scripts

- `scripts/monitor_multitask_job.sh` — Multi-task job monitoring shell script

### External Docs

- Databricks SDK for Python: https://databricks-sdk-py.readthedocs.io/en/latest/
- SDK source on GitHub: https://github.com/databricks/databricks-sdk-py
