# Databricks Model Serving
Deploy MLflow models and AI agents to scalable REST API endpoints.
## Quick Decision: What Are You Deploying?
| Model Type | Pattern | Reference |
|---|---|---|
| Traditional ML (sklearn, xgboost) | `mlflow.sklearn.autolog()` | 1-classical-ml.md |
| Custom Python model | `mlflow.pyfunc.PythonModel` | 2-custom-pyfunc.md |
| GenAI Agent (LangGraph, tool-calling) | `ResponsesAgent` | 3-genai-agents.md |
## Prerequisites
- DBR 16.1+ recommended (pre-installed GenAI packages)
- Unity Catalog enabled workspace
- Model Serving enabled
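As a quick sanity check, something like the following can be run in a notebook cell to confirm the environment (illustrative, not exhaustive; the catalog query assumes Unity Catalog is enabled):

```python
# Quick environment check in a Databricks notebook
import mlflow

print(mlflow.__version__)  # MLflow 3.x is assumed by the examples below

# Confirms the workspace resolves a Unity Catalog catalog
spark.sql("SELECT current_catalog()").show()
```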
## Foundation Model API Endpoints
ALWAYS use exact endpoint names from this table. NEVER guess or abbreviate.
### Chat / Instruct Models
| Endpoint Name | Provider | Notes |
|---|---|---|
| `databricks-gpt-5-2` | OpenAI | Latest GPT, 400K context |
| `databricks-gpt-5-1` | OpenAI | Instant + Thinking modes |
| `databricks-gpt-5-1-codex-max` | OpenAI | Code-specialized (high perf) |
| `databricks-gpt-5-1-codex-mini` | OpenAI | Code-specialized (cost-opt) |
| `databricks-gpt-5` | OpenAI | 400K context, reasoning |
| `databricks-gpt-5-mini` | OpenAI | Cost-optimized reasoning |
| `databricks-gpt-5-nano` | OpenAI | High-throughput, lightweight |
| `databricks-gpt-oss-120b` | OpenAI | Open-weight, 128K context |
| `databricks-gpt-oss-20b` | OpenAI | Lightweight open-weight |
| `databricks-claude-opus-4-6` | Anthropic | Most capable, 1M context |
| `databricks-claude-sonnet-4-6` | Anthropic | Hybrid reasoning |
| `databricks-claude-sonnet-4-5` | Anthropic | Hybrid reasoning |
| `databricks-claude-opus-4-5` | Anthropic | Deep analysis, 200K context |
| `databricks-claude-sonnet-4` | Anthropic | Hybrid reasoning |
| `databricks-claude-opus-4-1` | Anthropic | 200K context, 32K output |
| `databricks-claude-haiku-4-5` | Anthropic | Fastest, cost-effective |
| `databricks-claude-3-7-sonnet` | Anthropic | Retiring April 2026 |
| `databricks-meta-llama-3-3-70b-instruct` | Meta | 128K context, multilingual |
| `databricks-meta-llama-3-1-405b-instruct` | Meta | Retiring May 2026 (PT) |
| `databricks-meta-llama-3-1-8b-instruct` | Meta | Lightweight, 128K context |
| `databricks-llama-4-maverick` | Meta | MoE architecture |
| `databricks-gemini-3-1-pro` | Google | 1M context, hybrid reasoning |
| `databricks-gemini-3-pro` | Google | 1M context, hybrid reasoning |
| `databricks-gemini-3-flash` | Google | Fast, cost-efficient |
| `databricks-gemini-2-5-pro` | Google | 1M context, Deep Think |
| `databricks-gemini-2-5-flash` | Google | 1M context, hybrid reasoning |
| `databricks-gemma-3-12b` | Google | 128K context, multilingual |
| `databricks-qwen3-next-80b-a3b-instruct` | Alibaba | Efficient MoE |
### Embedding Models
| Endpoint Name | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| `databricks-gte-large-en` | 1024 | 8192 | English, not normalized |
| `databricks-bge-large-en` | 1024 | 512 | English, normalized |
| `databricks-qwen3-embedding-0-6b` | up to 1024 | ~32K | 100+ languages, instruction-aware |
### Common Defaults
- Agent LLM: `databricks-meta-llama-3-3-70b-instruct` (good balance of quality/cost)
- Embedding: `databricks-gte-large-en`
- Code tasks: `databricks-gpt-5-1-codex-mini` or `databricks-gpt-5-1-codex-max`
These are pay-per-token endpoints available in every workspace. For production, consider provisioned throughput mode. See supported models.
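For orientation, here is a minimal sketch of querying one of these pay-per-token endpoints with the Databricks Python SDK; the endpoint name comes from the defaults above, and the SDK picks up notebook or profile credentials automatically:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()

# Pay-per-token chat endpoint; name taken from Common Defaults above
response = w.serving_endpoints.query(
    name="databricks-meta-llama-3-3-70b-instruct",
    messages=[ChatMessage(role=ChatMessageRole.USER, content="What is MLflow?")],
    max_tokens=200,
)
print(response.choices[0].message.content)
```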
## Reference Files
| Topic | File | When to Read |
|---|---|---|
| Classical ML | 1-classical-ml.md | sklearn, xgboost, autolog |
| Custom PyFunc | 2-custom-pyfunc.md | Custom preprocessing, signatures |
| GenAI Agents | 3-genai-agents.md | ResponsesAgent, LangGraph |
| Tools Integration | 4-tools-integration.md | UC Functions, Vector Search |
| Development & Testing | 5-development-testing.md | MCP workflow, iteration |
| Logging & Registration | 6-logging-registration.md | mlflow.pyfunc.log_model |
| Deployment | 7-deployment.md | Job-based async deployment |
| Querying Endpoints | 8-querying-endpoints.md | SDK, REST, MCP tools |
| Package Requirements | 9-package-requirements.md | DBR versions, pip |
## Quick Start: Deploy a GenAI Agent
### Step 1: Install Packages (in notebook or via MCP)
```python
%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic
dbutils.library.restartPython()
```
Or via MCP:

```python
execute_code(code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic")
```
### Step 2: Create Agent File
Create `agent.py` locally with the ResponsesAgent pattern (see 3-genai-agents.md).
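As a minimal sketch of what `agent.py` can look like (an echo agent, assuming MLflow 3.x; a real agent would call an LLM endpoint inside `predict` - see 3-genai-agents.md for the full pattern):

```python
# agent.py - minimal ResponsesAgent sketch (echo agent, illustrative only)
from mlflow.models import set_model
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import ResponsesAgentRequest, ResponsesAgentResponse


class MyAgent(ResponsesAgent):
    def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse:
        # A real agent would call a foundation model endpoint here
        user_text = request.input[-1].content
        return ResponsesAgentResponse(
            output=[self.create_text_output_item(text=f"You said: {user_text}", id="msg_1")]
        )


# Models-from-code entry point: log_model references this file directly
set_model(MyAgent())
```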
### Step 3: Upload to Workspace
```python
manage_workspace_files(
    action="upload",
    local_path="./my_agent",
    workspace_path="/Workspace/Users/you@company.com/my_agent"
)
```
### Step 4: Test Agent
```python
execute_code(
    file_path="./my_agent/test_agent.py",
    cluster_id="<cluster_id>"
)
```
### Step 5: Log Model
```python
execute_code(
    file_path="./my_agent/log_model.py",
    cluster_id="<cluster_id>"
)
```
### Step 6: Deploy (Async via Job)
See 7-deployment.md for job-based deployment that doesn't timeout.
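Inside the deployment job, the core call is `databricks-agents`' deploy (a sketch; the model name and version are illustrative):

```python
# Runs inside the deployment job - creates or updates the serving endpoint
from databricks import agents

deployment = agents.deploy(
    model_name="main.models.my_agent",  # UC path used in log_model.py
    model_version=1,
)
print(deployment.endpoint_name)
```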
### Step 7: Query Endpoint
```python
manage_serving_endpoint(
    action="query",
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
## Quick Start: Deploy a Classical ML Model
```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Enable autolog with auto-registration to Unity Catalog
mlflow.sklearn.autolog(
    log_input_examples=True,
    registered_model_name="main.models.my_classifier"
)

# Train - the model is logged and registered automatically
X_train, y_train = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
```
Then deploy via UI or SDK. See 1-classical-ml.md.
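For the SDK route, a sketch of creating the endpoint looks like this (endpoint name, entity name, and sizing are illustrative):

```python
# Create a serving endpoint for the registered model via the Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()
w.serving_endpoints.create(
    name="sklearn-classifier",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.models.my_classifier",  # UC path from autolog above
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```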
## MCP Tools
If MCP tools are not available, use the SDK/CLI examples in the reference files above.
### Development & Testing
| Tool | Purpose |
|---|---|
| `manage_workspace_files` (`action="upload"`) | Upload agent files to workspace |
| `execute_code` | Install packages, test agent, log model |
### Deployment
| Tool | Purpose |
|---|---|
| `manage_jobs` (`action="create"`) | Create deployment job (one-time) |
| `manage_job_runs` (`action="run_now"`) | Kick off deployment (async) |
| `manage_job_runs` (`action="get"`) | Check deployment job status |
### `manage_serving_endpoint` - Querying
| Action | Description | Required Params |
|---|---|---|
| `get` | Check endpoint status (READY/NOT_READY/NOT_FOUND) | `name` |
| `list` | List all endpoints | (none; optional `limit`) |
| `query` | Send requests to endpoint | `name` + one of: `messages`, `inputs`, `dataframe_records` |
Example usage:

```python
# Check endpoint status
manage_serving_endpoint(action="get", name="my-agent-endpoint")

# List all endpoints
manage_serving_endpoint(action="list")

# Query a chat/agent endpoint
manage_serving_endpoint(
    action="query",
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=500
)

# Query a traditional ML endpoint
manage_serving_endpoint(
    action="query",
    name="sklearn-classifier",
    dataframe_records=[{"age": 25, "income": 50000, "credit_score": 720}]
)
```
## Common Workflows
### Check Endpoint Status After Deployment
```python
manage_serving_endpoint(action="get", name="my-agent-endpoint")
```

Returns:

```json
{
  "name": "my-agent-endpoint",
  "state": "READY",
  "served_entities": [...]
}
```
### Query a Chat/Agent Endpoint
```python
manage_serving_endpoint(
    action="query",
    name="my-agent-endpoint",
    messages=[
        {"role": "user", "content": "What is Databricks?"}
    ],
    max_tokens=500
)
```
### Query a Traditional ML Endpoint
```python
manage_serving_endpoint(
    action="query",
    name="sklearn-classifier",
    dataframe_records=[
        {"age": 25, "income": 50000, "credit_score": 720}
    ]
)
```
## Common Issues
| Issue | Solution |
|---|---|
| Invalid output format | Use `self.create_text_output_item(text, id)` - NOT raw dicts! |
| Endpoint NOT_READY | Deployment takes ~15 min. Use `manage_serving_endpoint(action="get")` to poll. |
| Package not found | Specify exact versions in `pip_requirements` when logging the model |
| Tool timeout | Use job-based deployment, not synchronous calls |
| Auth error on endpoint | Ensure `resources` are specified in `log_model` for auto passthrough |
| Model not found | Check Unity Catalog path: `catalog.schema.model_name` |
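When MCP tools are unavailable, the NOT_READY polling can be done with the SDK instead. A sketch (the 15-minute figure above is typical, not guaranteed):

```python
# Poll until the endpoint leaves the deploying state (SDK-based alternative)
import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointStateReady

w = WorkspaceClient()
while True:
    endpoint = w.serving_endpoints.get("my-agent-endpoint")
    if endpoint.state and endpoint.state.ready == EndpointStateReady.READY:
        break
    time.sleep(60)  # deployments commonly take ~15 minutes
print("Endpoint is READY")
```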
## Critical: ResponsesAgent Output Format
**WRONG** - raw dicts don't work:

```python
return ResponsesAgentResponse(output=[{"role": "assistant", "content": "..."}])
```

**CORRECT** - use helper methods:

```python
return ResponsesAgentResponse(
    output=[self.create_text_output_item(text="...", id="msg_1")]
)
```
Available helper methods:
- `self.create_text_output_item(text, id)` - text responses
- `self.create_function_call_item(id, call_id, name, arguments)` - tool calls
- `self.create_function_call_output_item(call_id, output)` - tool results
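The helpers compose: a tool-calling turn is a sequence of output items. A sketch, with illustrative ids and a hypothetical `lookup_docs` tool:

```python
# Sketch: a tool call, its result, and the final text answer in one response
return ResponsesAgentResponse(
    output=[
        self.create_function_call_item(
            id="fc_1", call_id="call_1",
            name="lookup_docs", arguments='{"query": "model serving"}',
        ),
        self.create_function_call_output_item(call_id="call_1", output="...tool result..."),
        self.create_text_output_item(text="Here is what I found...", id="msg_1"),
    ]
)
```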
## Related Skills
- databricks-agent-bricks - Pre-built agent tiles that deploy to model-serving endpoints
- databricks-vector-search - Create vector indexes used as retriever tools in agents
- databricks-genie - Genie Spaces can serve as agents in multi-agent setups
- databricks-mlflow-evaluation - Evaluate model and agent quality before deployment
- databricks-jobs - Job-based async deployment used for agent endpoints