databricks-autonomous-operations
Databricks Autonomous Operations
1. Overview
This skill is both an SDK/CLI/Connect reference and an autonomous operations playbook. It teaches the AI agent to operate as an SRE — independently deploying, monitoring, diagnosing failures, applying fixes, redeploying, and verifying results across all Databricks resource types.
Core Loop: Deploy → Poll → Diagnose → Fix → Redeploy → Verify (max 3 iterations before escalation to user).
When to Activate This Skill
- Deploying or running Databricks Asset Bundles (bundle deploy, bundle run)
- Monitoring job or pipeline runs for completion
- Troubleshooting ANY failure: jobs, DLT pipelines, monitors, alerts, clusters, Genie Spaces
- Using the Databricks Python SDK, CLI, Connect, or REST API
- Encountering error messages from Databricks services
- Operating in a self-healing deploy-fix-redeploy cycle
- Checking job/task/pipeline status or retrieving run output
SDK Docs: https://databricks-sdk-py.readthedocs.io/en/latest/
GitHub: https://github.com/databricks/databricks-sdk-py
Runnable Examples: See examples/ directory for complete, copy-pasteable scripts:
- examples/1-authentication.py — All auth patterns (env vars, profiles, Azure SP, AccountClient, notebook context)
- examples/2-clusters-and-jobs.py — Cluster auto-selection, autoscaling, job CRUD, submit_and_wait, run_now_and_wait
- examples/3-sql-and-warehouses.py — Parameterized queries, chunked results, query_to_dataframe helper
- examples/4-unity-catalog.py — Tables, schemas, catalogs, volumes, file operations, pattern matching
- examples/5-serving-and-vector-search.py — Endpoint creation with TrafficConfig, chat/embedding queries, vector search with filters
- examples/6-autonomous-operations.py — Job monitoring, multi-task output retrieval, failure diagnosis, self-healing loop
2. Environment & Authentication
Setup
- SDK: uv pip install databricks-sdk; Connect: uv pip install databricks-connect
- CLI version: >= 0.278.0 (databricks --version)
- Config: ~/.databrickscfg or env vars DATABRICKS_HOST, DATABRICKS_TOKEN
Quick Auth
from databricks.sdk import WorkspaceClient
w = WorkspaceClient() # Auto-detect (env, config file, or notebook)
w = WorkspaceClient(profile="MY_PROFILE") # Named profile
- Token expired? databricks auth login --host <url> --profile <name>
- Profile-based CLI: DATABRICKS_CONFIG_PROFILE=<name> databricks <command>
- Full auth patterns (Azure SP, AccountClient, etc.): see references/sdk-api-reference.md
3. SDK API Quick Reference
Full code examples and patterns: references/sdk-api-reference.md
Runnable scripts: examples/ directory (see file listing in Section 1)
| API | Key Operations | Critical Notes |
|---|---|---|
| Clusters | list, get, create_and_wait, ensure_cluster_is_running, start, stop | Use select_spark_version() and select_node_type() |
| Jobs | list, run_now_and_wait, submit_and_wait, get_run, get_run_output, cancel_run | get_run_output needs task run_id, not parent job run_id |
| SQL Execution | execute_statement with wait_timeout | Use StatementParameterListItem for parameterized queries |
| Warehouses | list, start, stop, create_and_wait | Use auto_stop_mins to control costs |
| Unity Catalog | tables.list, tables.get, tables.exists, catalogs.list, schemas.list | list_summaries for fast pattern matching |
| Volumes/Files | files.upload, files.download, files.list_directory_contents | Path format: /Volumes/catalog/schema/volume/file |
| Serving | create_and_wait, query (custom/chat/embeddings), get_open_ai_client | scale_to_zero_enabled for cost control |
| Vector Search | query_index, upsert_data_vector_index, sync_index | Use filters_json for filtered queries |
| Pipelines | list_pipelines, get, start_update, stop_and_wait, list_pipeline_events | Events contain error details for troubleshooting |
| Monitors | quality_monitors.get, delete, update, run_refresh | Update MUST include ALL custom_metrics |
| Alerts V2 | alerts_v2.create_alert, update_alert, delete_alert | Update requires full update_mask |
| Secrets | create_scope, put_secret, get_secret | Use for API keys, tokens |
Critical Patterns
- Async apps (FastAPI): SDK is synchronous — wrap with asyncio.to_thread()
- Long-running ops: Use _and_wait() variants with timeout=timedelta(...)
- Error handling: from databricks.sdk.errors import NotFound, PermissionDenied, ResourceAlreadyExists
- Direct REST: w.api_client.do(method="GET", path="/api/2.0/...")
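The asyncio.to_thread() pattern for async apps can be sketched as follows. Here fetch_cluster_state is a hypothetical stand-in for any blocking SDK call (e.g., w.clusters.get()); the wrapping technique is the same regardless of which call is wrapped.

```python
import asyncio
import time

def fetch_cluster_state(cluster_id: str) -> str:
    # Hypothetical stand-in for a blocking SDK call such as
    # w.clusters.get(cluster_id); any synchronous function works the same way.
    time.sleep(0.05)
    return f"{cluster_id}:RUNNING"

async def handler() -> str:
    # Run the blocking call in a worker thread so the event loop stays free,
    # which is what a FastAPI endpoint should do with the synchronous SDK.
    return await asyncio.to_thread(fetch_cluster_state, "0123-456789-abcde")

print(asyncio.run(handler()))  # 0123-456789-abcde:RUNNING
```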
4. CLI Operations Reference
Command Matrix
| Category | Command | Purpose |
|---|---|---|
| Bundle | databricks bundle validate | Pre-deploy validation (catches ~80% of errors) |
| Bundle | databricks bundle deploy -t <target> | Deploy all resources |
| Bundle | databricks bundle run -t <target> <job> | Start a job run |
| Bundle | databricks bundle destroy -t <target> | Clean up all resources |
| Jobs | databricks jobs get-run <RUN_ID> --output json | Get run status |
| Jobs | databricks jobs get-run-output <TASK_RUN_ID> --output json | Get task notebook output |
| Jobs | databricks jobs list-runs --job-id <JOB_ID> --output json | List recent runs |
| Jobs | databricks jobs cancel-run <RUN_ID> | Cancel running job |
| Pipelines | databricks pipelines get <PIPELINE_ID> --output json | Get pipeline status |
| Pipelines | databricks pipelines get-update <PID> <UID> --output json | Get update details |
| Pipelines | databricks pipelines list-pipeline-events <PID> --output json | Get events/errors |
| Pipelines | databricks pipelines start-update <PIPELINE_ID> | Start pipeline update |
| Clusters | databricks clusters get <CLUSTER_ID> --output json | Get cluster status |
| Clusters | databricks clusters events <CLUSTER_ID> --output json | Get cluster events |
| Warehouses | databricks warehouses get <WH_ID> --output json | Get warehouse status |
| Auth | databricks auth login --host <url> --profile <name> | Re-authenticate |
| Workspace | databricks workspace export <path> --format SOURCE | Export notebook |
| Apps | databricks apps logs <app-name> | App deployment/runtime logs |
Key jq Patterns
See references/cli-jq-patterns.md for the complete catalog. Most critical:
# Job state
databricks jobs get-run <RUN_ID> --output json | jq '.state'
# Task summary for multi-task jobs
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | {task: .task_key, run_id: .run_id, result: .state.result_state}'
# Failed tasks only
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message, url: .run_page_url}'
# Task output (MUST use task run_id, not parent job run_id)
databricks jobs get-run-output <TASK_RUN_ID> --output json | jq -r '.notebook_output.result // "No output"'
5. Autonomous Deploy-Run-Fix Playbook
This is the core workflow. Follow these steps whenever deploying a bundle, running a job, or operating autonomously.
Step 0: Discover Bundle Resources
Before deploying, understand what's in the bundle:
# Read databricks.yml to find targets and included resources
# Then read resource YAML files under resources/ directory
# Identify: jobs (atomic/composite/orchestrator), pipelines, dashboards, alerts
Determine the target (dev, staging, prod) from the user or default to dev.
Step 1: Validate & Deploy
databricks bundle validate -t <target> # Pre-flight — catches ~80% of errors
databricks bundle deploy -t <target> # Deploy all resources
If validate fails: Read the error, fix the YAML (see Section 6 + common/databricks-asset-bundles skill), re-validate.
If deploy fails: Common causes: auth expired (403), path resolution errors, invalid task types. See Section 6.
Step 2: Run
For jobs:
databricks bundle run -t <target> <job_name>
# Output includes a Run URL like: https://<host>/#job/<JOB_ID>/run/<RUN_ID>
# EXTRACT the RUN_ID from this URL — you need it for polling
For DLT pipelines:
databricks bundle run -t <target> <pipeline_name>
# Output includes: Pipeline URL with PIPELINE_ID
# Also shows UPDATE_ID for this specific update
Cursor-specific: Use block_until_ms: 0 to background bundle run immediately. Then read the terminal output file to capture the Run URL. Parse the RUN_ID or PIPELINE_ID from the URL.
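Extracting the RUN_ID can be sketched with a small helper; the URL shape is taken from the Run URL comment in this step.

```python
import re

def parse_run_url(url: str):
    """Extract (job_id, run_id) from a Run URL shaped like
    https://<host>/#job/<JOB_ID>/run/<RUN_ID>; None if it doesn't match."""
    m = re.search(r"#job/(\d+)/run/(\d+)", url)
    return (m.group(1), m.group(2)) if m else None

print(parse_run_url("https://adb-123.4.azuredatabricks.net/#job/987/run/555"))
# ('987', '555')
```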
Step 3: Poll Until Terminal State
Jobs — poll with exponential backoff (30s → 60s → 120s):
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.life_cycle_state'
# PENDING → RUNNING → TERMINATED (also: INTERNAL_ERROR, SKIPPED)
When life_cycle_state == TERMINATED, check the result:
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.result_state'
# SUCCESS → go to Step 4a; FAILED → go to Step 4b
DLT Pipelines — poll the latest update state:
databricks pipelines get <PIPELINE_ID> --output json | jq -r '.latest_updates[0].state'
# COMPLETED → success; FAILED → get events for errors
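The 30s → 60s → 120s backoff loop can be sketched generically. get_state stands in for the CLI/SDK status call, and sleep is injectable so the demo runs instantly; the real loop would pass time.sleep.

```python
import time

def poll_until_terminal(get_state,
                        terminal=("TERMINATED", "INTERNAL_ERROR", "SKIPPED"),
                        start=30, cap=120, sleep=time.sleep):
    """Call get_state() with exponential backoff until it returns a
    terminal life_cycle_state (interval doubles: 30s -> 60s -> 120s cap)."""
    interval = start
    while True:
        state = get_state()
        if state in terminal:
            return state
        sleep(interval)
        interval = min(interval * 2, cap)

# Demo with a canned state sequence instead of a real CLI/SDK call:
states = iter(["PENDING", "RUNNING", "RUNNING", "TERMINATED"])
print(poll_until_terminal(lambda: next(states), sleep=lambda s: None))
# TERMINATED
```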
Step 4a: On SUCCESS — Verify & Report
# For multi-task jobs, verify all tasks succeeded:
databricks jobs get-run <RUN_ID> --output json \
| jq '.tasks[] | {task: .task_key, result: .state.result_state}'
# Get task output (structured JSON from dbutils.notebook.exit):
databricks jobs get-run-output <TASK_RUN_ID> --output json \
| jq -r '.notebook_output.result // "No output"'
Report success to user with the Run URL for verification. If this was part of a multi-job deployment, proceed to the next job.
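Decoding the structured payload a notebook returned via dbutils.notebook.exit(json.dumps({...})) can be sketched as follows; the payload keys in the demo are illustrative, not a fixed schema.

```python
import json

def task_output(output_json: dict):
    """Decode the structured result from a `jobs get-run-output` payload;
    returns None when the task produced no notebook output."""
    raw = output_json.get("notebook_output", {}).get("result")
    return json.loads(raw) if raw else None

# Payload shaped like `databricks jobs get-run-output --output json`:
payload = {"notebook_output": {"result": '{"status": "ok", "rows_written": 1200}'}}
print(task_output(payload))  # {'status': 'ok', 'rows_written': 1200}
```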
Step 4b: On FAILURE — Diagnose
CRITICAL for multi-task jobs: get-run-output only works on the task run_id, NOT the parent job run_id.
# Step A: Find failed tasks and their task-level run_ids
databricks jobs get-run <JOB_RUN_ID> --output json \
| jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, run_id: .run_id, error: .state.state_message}'
# Step B: Get detailed output for EACH failed task
databricks jobs get-run-output <TASK_RUN_ID> --output json \
| jq -r '.notebook_output.result // .error // "No output"'
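Steps A and B can be combined in one sketch over the JSON shapes the CLI returns. get_output stands in for the `get-run-output` call and is keyed by the TASK run_id, never the parent's.

```python
def diagnose_failed_tasks(run_json: dict, get_output) -> list:
    """Find FAILED tasks in the parent run, then fetch detail for each
    using its task-level run_id (mirrors the two jq steps above)."""
    report = []
    for t in run_json.get("tasks", []):
        if t.get("state", {}).get("result_state") != "FAILED":
            continue
        out = get_output(t["run_id"])  # task run_id, not the parent job run_id
        detail = (out.get("notebook_output", {}).get("result")
                  or out.get("error") or "No output")
        report.append({"task": t["task_key"], "run_id": t["run_id"],
                       "detail": detail})
    return report

# Demo with payloads shaped like the CLI's JSON output:
run = {"tasks": [
    {"task_key": "bronze", "run_id": 101, "state": {"result_state": "SUCCESS"}},
    {"task_key": "gold", "run_id": 102, "state": {"result_state": "FAILED"}},
]}
outputs = {102: {"error": "TABLE_OR_VIEW_NOT_FOUND"}}
print(diagnose_failed_tasks(run, outputs.get))
# [{'task': 'gold', 'run_id': 102, 'detail': 'TABLE_OR_VIEW_NOT_FOUND'}]
```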
For DLT pipeline failures:
databricks pipelines list-pipeline-events <PID> --output json \
| jq '[.events[] | select(.level == "ERROR")] | .[0:5]'
Match the error against Section 6 (decision tree) and references/error-solution-matrix.md.
Step 5: Fix
- Read the source file(s) identified from the stack trace or error message
- Apply the fix using editor tools (StrReplace, Write)
- If YAML/config issue → fix the DAB YAML (consult the common/databricks-asset-bundles skill)
- If dependency issue → deploy upstream assets first (see Section 6 ordering)
Step 6: Redeploy & Re-run (Loop)
databricks bundle deploy -t <target> # Redeploy with fix
databricks bundle run -t <target> <job> # Re-run
# → Return to Step 3 (Poll)
Maximum 3 iterations. Track each iteration's error and fix.
Step 7: Escalation (After 3 Failed Attempts)
Present to the user:
- All errors encountered — with run IDs, task keys, error messages
- All fixes attempted — what was changed and why
- Root cause hypothesis — best guess based on evidence
- Run page URLs — direct links to the Databricks UI
Step 8: Capture Learnings
After every resolved failure (or escalation), trigger admin/self-improvement:
- Add new error → fix mappings to references/error-solution-matrix.md
- Update the "Top 10" table in Section 6 if this error is frequent
- Always capture if the fix took 2+ iterations or was escalated
6. Master Troubleshooting Decision Tree
Full error-solution tables: references/error-solution-matrix.md (7 categories, 40+ patterns)
DLT-specific troubleshooting: references/dlt-pipeline-troubleshooting.md
Diagnostic SQL queries: references/diagnostic-queries.md
Quick Decision Guide
- Deployment error? → Check auth (403), YAML syntax, task type (notebook_task not python_task), path resolution
- Job runtime error? → Get task-level output (use the task run_id), match the error to the solution matrix
- DLT pipeline error? → Get events via CLI, check upstream tables exist, verify schema
- Monitor/Alert error? → Check ResourceAlreadyExists (delete + recreate), schema existence, update_mask
- Dependency error? → Follow required order: Bronze → Gold Setup → Gold Merge → Semantic → Monitoring → Genie
- Cluster error? → Check events via CLI, look for OOM/unreachable/timeout patterns
Top 10 Most Common Errors (Inline)
| Error | Quick Fix |
|---|---|
| ModuleNotFoundError | Add to %pip install or DAB environment spec |
| TABLE_OR_VIEW_NOT_FOUND | Run setup job first; check 3-part catalog.schema.table path |
| DELTA_MULTIPLE_SOURCE_ROW_MATCHING | Deduplicate source before MERGE |
| Invalid access token (403) | databricks auth login --host <url> --profile <name> |
| ResourceAlreadyExists | Delete + recreate (monitors, alerts) |
| python_task not recognized | Use notebook_task with notebook_path |
| PARSE_SYNTAX_ERROR | Read failing SQL file, fix syntax, redeploy |
| Parameter not found | Use base_parameters dict, not CLI-style parameters |
| run_job_task vs job_task | Use run_job_task (not job_task) |
| Genie INTERNAL_ERROR | Deploy semantic layer (TVFs + Metric Views) first |
7. Job Design for Monitorability
Full patterns: references/job-design-patterns.md — structured output, progress messages, fail-fast, partial success, 3-layer hierarchy.
Key rules:
- Every notebook must dbutils.notebook.exit(json.dumps({...})) with structured status
- Print [Step N/M] progress messages so the agent can locate failures
- Fail fast: include actionable error messages (e.g., "Run bronze_setup_job first")
- 3-layer hierarchy: atomic (notebook_task) → composite (run_job_task) → orchestrator (run_job_task)
- Partial success: ≥90% of items succeeding = OK; fix individual failures
8. Agent Behavioral Patterns
Background Command Pattern (Cursor-Specific)
For databricks bundle run or any long-running command:
- Execute with block_until_ms: 0 to background immediately
- Read the terminal output file to check for completion
- Parse the RUN_ID from the output URL
Exponential Backoff for Polling
- Normal jobs: 30s → 60s → 120s (no state change = increase interval)
- Long-running training jobs: start at 60s, max 300s
- DLT pipelines: 30s (typically faster)
Context Capture on Every Failure
On each failure, record and maintain:
- run_id and task_run_id
- job_name and task_key
- Target environment (dev, staging, prod)
- Full error message
- run_page_url for UI access
- Iteration count (which fix attempt this is)
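The checklist above maps naturally onto a small record type, sketched here as a dataclass; the field names mirror the checklist but the class itself is illustrative.

```python
from dataclasses import dataclass

@dataclass
class FailureContext:
    """One record per failure, matching the context-capture checklist."""
    run_id: int
    task_run_id: int
    job_name: str
    task_key: str
    target: str        # dev / staging / prod
    error: str
    run_page_url: str
    iteration: int = 1  # which fix attempt this is

ctx = FailureContext(555, 556, "gold_merge_job", "merge", "dev",
                     "DELTA_MULTIPLE_SOURCE_ROW_MATCHING",
                     "https://host/#job/987/run/555")
print(ctx.task_key, ctx.iteration)  # merge 1
```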
Self-Healing Loop
Iteration 1: Deploy → Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 2: Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 3: Run → Monitor → [FAIL] → ESCALATE TO USER
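The three-iteration loop can be sketched with placeholder callables for the run/monitor and diagnose/fix steps; run_once returns True on success, and the loop escalates after the third failed attempt.

```python
def self_heal(run_once, diagnose_and_fix, max_iterations=3):
    """Run -> monitor -> on failure diagnose, fix, redeploy, retry;
    escalate to the user after max_iterations failed attempts."""
    for attempt in range(1, max_iterations + 1):
        if run_once():
            return {"status": "SUCCESS", "iterations": attempt}
        diagnose_and_fix(attempt)  # track which fix attempt this is
    return {"status": "ESCALATE", "iterations": max_iterations}

# Demo: succeeds on the third attempt.
outcomes = iter([False, False, True])
print(self_heal(lambda: next(outcomes), lambda i: None))
# {'status': 'SUCCESS', 'iterations': 3}
```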
Dependency Awareness
Before deploying, verify prerequisites exist:
- Genie Spaces require: TVFs + Metric Views (semantic layer)
- Monitoring requires: Gold tables (populated)
- Gold merge requires: Gold tables (setup job)
- Silver DLT requires: Bronze tables
- Alerts require: Monitoring output tables
- Always follow: Bronze → Gold → Semantic → Monitoring → Alerts → Genie
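The ordering rule can be encoded as a simple prerequisite check; the stage names are taken from the list above, and the function is an illustrative sketch.

```python
# Required deployment order from the dependency list above.
ORDER = ["bronze", "gold", "semantic", "monitoring", "alerts", "genie"]

def missing_prereqs(stage: str, deployed: set) -> list:
    """Everything earlier in ORDER must already be deployed before `stage`."""
    return [s for s in ORDER[:ORDER.index(stage)] if s not in deployed]

print(missing_prereqs("monitoring", {"bronze", "gold"}))  # ['semantic']
```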
Partial Success Handling
Some jobs use ≥90% success thresholds:
- Alerting deployment: 27/30 alerts succeeding = OK, fix the 3 individually
- Monitoring setup: 7/8 monitors = OK, debug the 1 failure
- Do NOT fail the entire workflow for 1 out of N items failing
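A minimal sketch of the threshold check (27/30 = 0.9, which meets the ≥90% bar exactly):

```python
def partial_success_ok(succeeded: int, total: int, threshold: float = 0.9) -> bool:
    """>= 90% of items succeeding counts as overall success; the stragglers
    get fixed individually rather than failing the whole workflow."""
    return total > 0 and succeeded / total >= threshold

print(partial_success_ok(27, 30))  # True  (0.90 meets the bar)
print(partial_success_ok(25, 30))  # False (0.83 does not)
```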
CLI Limitations Awareness
Some diagnostics require the Databricks Workspace UI:
- Full notebook execution logs (when get-run-output returns truncated or empty)
- DLT pipeline data flow visualization
- Detailed Spark execution plans
Always provide run_page_url to the user when CLI output is insufficient.
Learning Capture Pattern (Self-Improvement Integration)
After every troubleshooting session (successful or escalated), run this checklist:
Post-Troubleshooting Self-Improvement:
1. Was this error already in references/error-solution-matrix.md?
├── YES → Was the documented fix accurate? If not, update it.
└── NO → Add the error pattern and fix to the matrix.
2. Did the initial diagnosis match the actual root cause?
├── YES → No action needed.
└── NO → Document the misleading signals in the relevant skill.
3. How many iterations were needed?
├── 1 → Capture only if error was novel or non-obvious.
├── 2-3 → ALWAYS update the relevant skill with the fix.
└── Escalated → Document what the user found; update skills with user's fix.
4. Is this error likely to recur?
├── YES → Ensure it's in Top 10 table (if frequent) or error-solution matrix.
└── NO → Document as a one-off note in the session, skip skill update.
Skill priority for updates (search in this order):
- references/error-solution-matrix.md — error-to-fix mapping
- common/databricks-autonomous-operations — troubleshooting decision tree
- common/databricks-asset-bundles — if deployment-related
- Domain-specific skill (e.g., gold/pipeline-workers/02-merge-patterns for MERGE errors)
- New skill — only if all above are irrelevant (see admin/self-improvement justification checklist)
Safety: Never Retry Destructive Operations
Without explicit user confirmation, NEVER retry:
- databricks bundle destroy
- DROP TABLE / DROP SCHEMA
- w.quality_monitors.delete()
- w.alerts_v2.delete_alert()
References
Reference Documents (Load on Demand)
- references/sdk-api-reference.md — Full SDK API with code examples for all services
- references/error-solution-matrix.md — Error tables for all 7 resource categories (40+ patterns)
- references/diagnostic-queries.md — SQL queries for investigation
- references/dlt-pipeline-troubleshooting.md — DLT pipeline failure patterns
- references/cli-jq-patterns.md — jq patterns for parsing CLI JSON output
- references/cli-quick-reference.md — CLI vs SDK side-by-side table
- references/job-design-patterns.md — Structured output, progress messages, 3-layer hierarchy
Runnable Examples
- examples/1-authentication.py through examples/6-autonomous-operations.py (see Section 1)
Scripts
- scripts/monitor_multitask_job.sh — Multi-task job monitoring shell script