genai-agents-setup
GenAI Agent Implementation Orchestrator
End-to-end guide for implementing production-grade Databricks GenAI agents using the MLflow 3.0 ResponsesAgent framework with Genie Spaces, Lakebase memory, and comprehensive evaluation.
When to Use
- Implementing a new AI agent on Databricks from scratch
- Creating multi-agent systems with Genie Space data access
- Setting up agent evaluation pipelines with LLM judges
- Deploying agents to Model Serving with OBO authentication
- Adding memory (short-term conversations, long-term preferences) to agents
- Troubleshooting agent implementation errors (AI Playground, authentication, evaluation)
- Planning agent architecture (orchestrator + worker pattern)
Architecture Overview
Recommended: Hybrid Architecture (Genie + Custom Orchestrator)
```
User Query
    │
    ▼
┌─────────────────────┐
│ ResponsesAgent      │ ← MLflow pyfunc model (MANDATORY)
│ (Orchestrator)      │
├─────────────────────┤
│ Intent Classifier   │ ← Routes to domains
│ LangGraph Workflow  │ ← State management
│ Memory Manager      │ ← Short-term + Long-term
└────────┬────────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────┐
│ Genie  │ │ Genie  │ ← Domain-specific data access
│ Space  │ │ Space  │   (OBO authentication)
│ (Cost) │ │ (Perf) │
└────────┘ └────────┘
```
For full architecture details, see: references/architecture-overview.md
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | MLflow ResponsesAgent | AI Playground compatible model |
| Data Access | Genie Spaces + Conversation API | Natural language data queries |
| Orchestration | LangGraph | Multi-agent state management |
| Short-term Memory | Lakebase CheckpointSaver | Conversation continuity |
| Long-term Memory | Lakebase DatabricksStore | User preferences |
| Prompt Management | Unity Catalog + MLflow Registry | Versioned prompts with A/B testing |
| Evaluation | MLflow GenAI Evaluate | LLM judges + custom scorers |
| Monitoring | MLflow Registered Scorers | Continuous production quality |
| Deployment | MLflow Deployment Jobs | Auto-trigger on model version |
| Authentication | OBO + Automatic Auth Passthrough | User-level data governance |
Working Memory Management
This orchestrator spans 9 phases (Phase 0-8) — one of the longest in the pipeline. To maintain coherence without context pollution:
After each phase, persist a brief summary note capturing:
- Phase 0 output: Manifest found (yes/no), agent count, Genie Space IDs from manifest, tool definitions
- Phase 1 output: ResponsesAgent class path, tracing config, base agent working
- Phase 2 output: Genie Space tool IDs, data access verified, query routing decisions
- Phase 3 output: Lakebase table names (conversations, preferences), memory config
- Phase 4 output: Prompt template paths, system prompt derivation from Genie Space instructions
- Phase 5 output: Evaluation dataset table path, scorer names, baseline metrics
- Phase 6 output: Model Serving endpoint name, OBO auth config, deployment job paths
- Phase 7 output: Production monitor names, alert thresholds, drift detection config
- Phase 8 output: Genie Space optimization results, benchmark accuracy improvements
What to keep in working memory: Only the current phase's context, the agent inventory (from Phase 0), and the previous phase's summary note. Discard intermediate outputs (full evaluation DataFrames, raw trace logs, complete prompt texts) — they are in MLflow and on disk.
Implementation Phases
Phase 0: Read Plan (5 minutes)
Before starting implementation, check for a planning manifest that defines what to build.
```python
import yaml
from pathlib import Path

manifest_path = Path("plans/manifests/genai-agents-manifest.yaml")
if manifest_path.exists():
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    # Extract implementation checklist from manifest
    architecture = manifest.get('architecture', {})
    agents = manifest.get('agents', [])
    evaluation = manifest.get('evaluation', {})
    prompts = manifest.get('prompts', [])
    deployment = manifest.get('deployment', {})

    print(f"Plan: {len(agents)} agents, {len(prompts)} prompts")
    print(f"Architecture: {architecture.get('pattern', 'unknown')}")
    # Each agent has: name, domain, genie_space, capabilities, tools, sample_questions
    # Evaluation has: dataset_table, questions_per_domain, scorers, thresholds
else:
    # Fallback: self-discovery from Genie Spaces
    print("No manifest found — falling back to Genie Space self-discovery")
    # Discover existing Genie Spaces, create one agent per space
```
If manifest exists: Use it as the implementation checklist. Every agent, tool, evaluation question, and deployment configuration is pre-defined. Track completion against the manifest's summary counts.
If manifest doesn't exist: Fall back to self-discovery — inventory existing Genie Spaces from the catalog, create one agent per Genie Space, and use default evaluation criteria. This works but may miss specific tool definitions and cross-domain orchestration patterns the planning phase would have defined.
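The manifest-or-fallback decision above can be sketched as a small validation step. The required key names mirror the agent fields listed in the Phase 0 comments (`name`, `domain`, `genie_space`, etc.); the helper itself is illustrative, not part of the skill.

```python
# Illustrative sketch: check manifest agent entries before treating the
# manifest as the implementation checklist. Key names follow the fields
# described above; adapt to your actual manifest schema.
REQUIRED_AGENT_KEYS = {"name", "domain", "genie_space",
                       "capabilities", "tools", "sample_questions"}

def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest is usable."""
    problems = []
    agents = manifest.get("agents", [])
    if not agents:
        problems.append("no agents defined — fall back to Genie Space self-discovery")
    for i, agent in enumerate(agents):
        missing = REQUIRED_AGENT_KEYS - agent.keys()
        if missing:
            problems.append(f"agent[{i}] missing keys: {sorted(missing)}")
    return problems
```

An empty result means proceed with the manifest; any problem message means self-discovery (or a fix to the plan) is needed.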
Phase 1: Foundation (ResponsesAgent + Tracing)
Load skill: responses-agent-patterns
Goal: Create a working ResponsesAgent with MLflow tracing that loads in AI Playground.
Artifacts produced:
- Agent class inheriting from `ResponsesAgent` with `predict()` and `predict_stream()` methods
- MLflow tracing with `@mlflow.trace` decorators
- OBO authentication with context detection
Validate before proceeding:
- Agent loads in AI Playground without errors
- Streaming responses work (delta + done events)
- Traces visible in MLflow Experiment UI
- No `signature` parameter in `log_model()`
Phase 2: Data Access (Genie Spaces)
Load skill: multi-agent-genie-orchestration
Goal: Connect agent to Genie Spaces for structured data access with NO LLM fallback.
Artifacts produced:
- Genie Space configuration registry
- Conversation API integration (start_conversation_and_wait)
- Intent classification (domain routing)
- Parallel domain queries (ThreadPoolExecutor)
- Cross-domain response synthesis
Validate before proceeding:
- Genie queries return real data (not hallucinated)
- Multi-domain queries execute in parallel
- Error messages are explicit (no data fabrication)
- Intent classification routes correctly
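The parallel-query and explicit-error requirements above can be sketched with `ThreadPoolExecutor` (as named in the artifacts list). `query_genie_space` is a hypothetical stand-in for the real Genie Conversation API call; only the fan-out/collect structure is the point here.

```python
from concurrent.futures import ThreadPoolExecutor

def query_genie_space(space_id: str, question: str) -> dict:
    """Hypothetical stand-in for the Genie Conversation API call
    (start_conversation_and_wait). Replace with the real client call."""
    raise NotImplementedError

def query_domains(spaces: dict[str, str], question: str,
                  query_fn=query_genie_space) -> dict[str, dict]:
    """Fan one question out to several Genie Spaces in parallel and collect
    results keyed by domain. Failures become explicit error messages —
    never an LLM fallback that could fabricate data."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=len(spaces)) as pool:
        futures = {domain: pool.submit(query_fn, space_id, question)
                   for domain, space_id in spaces.items()}
        for domain, future in futures.items():
            try:
                results[domain] = future.result(timeout=120)
            except Exception as exc:
                results[domain] = {"error": f"Genie query failed for {domain}: {exc}"}
    return results
```

The per-domain results then feed the cross-domain synthesis step.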
Phase 3: Memory (Lakebase)
Load skill: lakebase-memory-patterns
Goal: Add conversation continuity and user preferences.
Artifacts produced:
- Lakebase setup script (tables creation)
- CheckpointSaver integration with LangGraph
- DatabricksStore for long-term memory
- Memory tools for agent autonomy
- Graceful degradation (agent works without memory)
Validate before proceeding:
- thread_id returned in custom_outputs
- Multi-turn conversations maintain context
- Agent works without memory tables (graceful fallback)
- User preferences persist across sessions
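The graceful-degradation and `thread_id` checks above can be sketched as follows; `factory` and `run_turn` are illustrative names, not APIs from the Lakebase skill.

```python
def init_checkpointer(factory):
    """Try to construct the Lakebase-backed checkpointer (e.g. a
    CheckpointSaver constructor); on any failure return None so the agent
    still runs without memory — graceful degradation."""
    try:
        return factory()
    except Exception:
        return None

def run_turn(message: str, checkpointer=None, thread_id: str = "t-1") -> dict:
    """Sketch of one turn: include thread_id in custom_outputs only when
    memory is available, so callers can detect conversation continuity."""
    response = {"output": f"echo: {message}", "custom_outputs": {}}
    if checkpointer is not None:
        response["custom_outputs"]["thread_id"] = thread_id
    return response
```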
Phase 4: Prompt Management
Load skill: prompt-registry-patterns
Goal: Externalize prompts for runtime updates and A/B testing.
Artifacts produced:
- Unity Catalog agent_config table
- Prompt registration script
- Lazy loading in agent
- A/B testing (champion/challenger)
Validate before proceeding:
- Prompts loaded from Unity Catalog at runtime
- Fallback to embedded defaults if UC unavailable
- SQL injection prevented (escaped quotes)
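The quote-escaping and UC-fallback checks above can be sketched as below; `fetch_fn` is a hypothetical stand-in for the SQL read against the `agent_config` table.

```python
DEFAULT_PROMPTS = {"system": "You are a helpful Databricks agent."}  # embedded fallback

def escape_sql_literal(text: str) -> str:
    """Double single quotes before embedding a prompt in a SQL literal,
    so quotes in prompt text cannot break or inject into the statement."""
    return text.replace("'", "''")

def load_prompt(name: str, fetch_fn) -> str:
    """Lazy-load a prompt from the Unity Catalog agent_config table via
    `fetch_fn`; fall back to the embedded default if UC is unavailable
    or returns nothing."""
    try:
        prompt = fetch_fn(name)
        if prompt:
            return prompt
    except Exception:
        pass
    return DEFAULT_PROMPTS[name]
```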
Phase 5: Evaluation
Load skill: mlflow-genai-evaluation
Goal: Set up pre-deployment evaluation with LLM judges.
Artifacts produced:
- Evaluation dataset (UC table)
- Built-in judges (Relevance, Safety, Guidelines)
- Custom domain-specific scorers
- Threshold configuration
- `_extract_response_text()` helper
- `_call_llm_for_scoring()` helper
Validate before proceeding:
- All custom scorers use `_extract_response_text()`
- Databricks SDK used for LLM calls (NOT `langchain_databricks`)
- 4-6 guidelines (not 8+)
- Metric aliases defined for backward compatibility
Phase 6: Deployment
Load skill: deployment-automation
Goal: Automated deployment pipeline triggered by model version creation.
Artifacts produced:
- Deployment job YAML (Asset Bundle)
- Dataset linking with `mlflow.log_input()`
- Evaluation-then-promote workflow
- Three MLflow experiments (dev, eval, deploy)
Validate before proceeding:
- Job triggers on MODEL_VERSION_CREATED
- Dataset linked to evaluation runs
- Thresholds checked before promotion
- Model promoted with alias management
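The threshold gate in the evaluation-then-promote workflow can be sketched as below. Note how a missing metric fails loudly rather than silently passing, which is exactly the silent-threshold-bypass anti-pattern the metric aliases guard against. Metric names here are placeholders.

```python
def passes_thresholds(metrics: dict[str, float],
                      thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Gate model promotion on evaluation metrics: every thresholded metric
    must be present and meet its minimum. A missing metric is a failure,
    not a pass — silent bypass is the anti-pattern."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing (check metric aliases)")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return (not failures, failures)
```

Only when the gate passes does the deployment job move the model alias.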
Phase 7: Production Monitoring
Load skill: production-monitoring
Goal: Continuous quality monitoring with sampling-based scorers.
Artifacts produced:
- Registered scorers with sampling rates
- Custom heuristic scorers (fast, 100% sampling)
- Trace archival to Unity Catalog
- Monitoring dashboard queries
Validate before proceeding:
- Immutable pattern used (scorer = scorer.start())
- Safety at 100% sampling, expensive judges at 5-20%
- Trace archival enabled
- Dashboard queries return metrics
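Why `scorer = scorer.start()` matters can be shown with a toy immutable scorer. This is a model of the pattern, not the real registered-scorer API: `start()` returns a new configured instance, so a bare `scorer.start(...)` discards the result and nothing actually starts.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RegisteredScorer:
    """Toy model of the immutable registered-scorer pattern (illustrative,
    not the real API). start() returns a NEW instance instead of mutating."""
    name: str
    sample_rate: float = 0.0
    running: bool = False

    def start(self, sample_rate: float) -> "RegisteredScorer":
        # Frozen dataclass: build a configured copy rather than mutating self.
        return replace(self, sample_rate=sample_rate, running=True)

safety = RegisteredScorer("safety")
safety.start(1.0)           # WRONG: result discarded, `safety` is unchanged
safety = safety.start(1.0)  # RIGHT: rebind to the started instance
```

The same reasoning applies to sampling rates: rebind cheap heuristic scorers at 1.0 and expensive LLM judges at 0.05-0.20.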
Phase 8: Genie Optimization
Load skill: @data_product_accelerator/skills/semantic-layer/05-genie-optimization-orchestrator/SKILL.md (cross-domain)
Goal: Improve Genie Space accuracy and repeatability.
Artifacts produced:
- Accuracy/repeatability test scripts
- Control lever adjustments
- API update with array sorting
- Dual persistence (API + repository)
Validate before proceeding:
- Repeatability tested (3+ iterations per question)
- Dual persistence applied (API + repo)
- Arrays sorted before API updates
- Rate limiting in place (12+ seconds)
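The array-sorting and rate-limiting checks above can be sketched together; `send_fn` is a hypothetical stand-in for the real Genie API client, and sorting on the string form is one illustrative way to order mixed-type arrays deterministically.

```python
import time

def prepare_genie_patch(payload: dict) -> dict:
    """Sort every top-level list in the PATCH payload before sending —
    per the anti-pattern table, unsorted arrays are rejected by the API.
    Keying on str() keeps mixed dict/str entries deterministically ordered."""
    return {k: sorted(v, key=str) if isinstance(v, list) else v
            for k, v in payload.items()}

def patch_with_rate_limit(payloads: list[dict], send_fn,
                          min_interval: float = 12.0) -> None:
    """Apply updates with at least `min_interval` seconds between calls
    (12+ seconds per the validation list above)."""
    for i, payload in enumerate(payloads):
        if i > 0:
            time.sleep(min_interval)
        send_fn(prepare_genie_patch(payload))
```

Remember the dual-persistence rule: whatever is PATCHed via the API must also be committed to the repository.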
Critical Anti-Patterns (NEVER Do These)
For the complete anti-patterns reference, see references/anti-patterns.md.
| Anti-Pattern | Consequence | Correct Pattern |
|---|---|---|
| Manual `signature` in `log_model()` | Breaks AI Playground | Let MLflow auto-infer |
| LLM fallback for data queries | Hallucinated fake data | Return explicit error messages |
| OBO in non-serving contexts | Permission errors | Context detection with env vars |
| Missing `_extract_response_text()` | All scorer scores = 0.0 | Universal extraction helper |
| `langchain_databricks` in scorers | Serverless failures | Databricks SDK |
| 8+ guidelines sections | Score = 0.20 | 4-6 focused guidelines |
| Missing metric aliases | Silent threshold bypass | Define aliases for MLflow versions |
| `scorer.start()` without assignment | Scorers don't start | `scorer = scorer.start()` |
| API-only Genie updates | Lost on redeployment | Dual persistence (API + repo) |
| Unsorted Genie API arrays | API rejection | Sort all arrays before PATCH |
File Structure
For the recommended project file structure, see references/file-structure.md.
Summary Validation Checklist
See references/implementation-checklist.md for the complete consolidated checklist covering all phases.
Templates
- Agent class template: `assets/templates/agent-class-template.py`
- Agent settings: `assets/templates/agent-settings-template.py`
References
Official Documentation
- Databricks Agent Framework
- MLflow ResponsesAgent
- MLflow GenAI Evaluate
- MLflow Tracing
- Genie Conversation API
- Production Monitoring
- Lakebase Documentation
- Multi-Agent with Genie
Dependency Skills
| Skill | Purpose |
|---|---|
| `responses-agent-patterns` | ResponsesAgent, streaming, tracing, OBO auth |
| `mlflow-genai-evaluation` | LLM judges, custom scorers, thresholds |
| `lakebase-memory-patterns` | CheckpointSaver, DatabricksStore, graceful degradation |
| `prompt-registry-patterns` | UC storage, MLflow versioning, A/B testing |
| `multi-agent-genie-orchestration` | Genie API, parallel queries, intent classification |
| `deployment-automation` | Deployment jobs, dataset linking, promotion |
| `production-monitoring` | Registered scorers, trace archival, dashboards |
| `mlflow-genai-foundation` | Core MLflow GenAI patterns (tracing, eval basics) |
Troubleshooting
| Skill | Purpose |
|---|---|
| `common/databricks-autonomous-operations` | Deploy → Poll → Diagnose → Fix → Redeploy loop when jobs fail. Read when stuck during deployment or runtime errors. |
Cross-Domain Dependencies
| Skill | Domain | Purpose |
|---|---|---|
| `semantic-layer/05-genie-optimization-orchestrator` | Semantic Layer | Accuracy testing, control levers, dual persistence |
Upstream Agent Bricks (databricks-agent-bricks)
This skill extends the upstream databricks-agent-bricks skill from ai-dev-kit. The following capabilities are available.
Agent Types in Supervisor Agents (MAS)
The upstream databricks-agent-bricks skill supports 5 agent types in Supervisor Agents:
| Agent Type | Parameter | Description |
|---|---|---|
| Knowledge Assistant | `ka_tile_id` | Document Q&A via RAG from PDFs in Volumes |
| Genie Space | `genie_space_id` | Natural language to SQL |
| Custom Endpoint | `endpoint_name` | Model serving endpoint for custom agents |
| UC Function | `uc_function_name` | Unity Catalog function (format: `catalog.schema.function_name`). Agent service principal needs EXECUTE privilege. |
| External MCP Server | `connection_name` | UC HTTP Connection with `is_mcp_connection: 'true'`. Agent service principal needs USE CONNECTION privilege. |
Provide exactly one of `ka_tile_id`, `genie_space_id`, `endpoint_name`, `uc_function_name`, or `connection_name` per agent.
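The exactly-one-of rule can be enforced with a small check before creating a Supervisor Agent; the function name here is illustrative.

```python
# Parameters from the agent-type table; each member agent must set exactly one.
AGENT_PARAMS = ("ka_tile_id", "genie_space_id", "endpoint_name",
                "uc_function_name", "connection_name")

def validate_agent_entry(agent: dict) -> None:
    """Raise if an agent entry sets zero or more than one identifying parameter."""
    provided = [p for p in AGENT_PARAMS if agent.get(p)]
    if len(provided) != 1:
        raise ValueError(
            f"expected exactly one of {AGENT_PARAMS}, got: {provided or 'none'}")
```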
New Actions
- `find_by_name` action for both `manage_ka` and `manage_mas` tools — looks up an existing KA or Supervisor Agent by name when you don't have the tile_id
MCP Tools
- `manage_ka`: Manage Knowledge Assistants (create_or_update, get, find_by_name, delete)
- `manage_mas`: Manage Supervisor Agents (create_or_update, get, find_by_name, delete)
Pipeline Progression
Previous stage: ml/00-ml-pipeline-setup → ML models and experiments should be set up
This is the final stage in the standard pipeline progression. After completing GenAI agent implementation, the data platform is fully operational with:
- Bronze → Silver → Gold data pipeline
- Semantic layer (Metric Views, TVFs, Genie Spaces)
- Observability (Monitoring, Dashboards, Alerts)
- ML models and experiments
- GenAI agents with evaluation and deployment
Post-Completion: Skill Usage Summary (MANDATORY)
After completing all phases of this orchestrator, output a Skill Usage Summary reflecting what you ACTUALLY did — not a pre-written summary.
What to Include
- Every skill `SKILL.md` or `references/` file you read (via the Read tool), in the order you read them
- Which phase you were in when you read it
- Whether it was a Worker, Common, Cross-domain, or Reference file
- A one-line description of what you specifically used it for in this session
Format
| # | Phase | Skill / Reference Read | Type | What It Was Used For |
|---|---|---|---|---|
| 1 | Phase N | `path/to/SKILL.md` | Worker / Common / Cross-domain / Reference | One-line description |
Summary Footer
End with:
- Totals: X worker skills, Y common skills, Z cross-domain skills, W reference files read across N phases
- Skipped: List any skills from the dependency table above that you did NOT need to read, and why (e.g., "phase not applicable", "user skipped", "no issues encountered")
- Unplanned: List any skills you read that were NOT listed in the dependency table (e.g., for troubleshooting, edge cases, or user-requested detours)
Version History
| Date | Version | Changes |
|---|---|---|
| Jan 27, 2026 | 1.0.0 | Initial skill creation from cursor rules 28, 30-35, context prompts, and implementation docs |