genai-agents-setup
GenAI Agent Implementation Orchestrator
End-to-end guide for implementing production-grade Databricks GenAI agents using the MLflow 3.0 ResponsesAgent framework with Genie Spaces, Lakebase memory, and comprehensive evaluation.
When to Use
- Implementing a new AI agent on Databricks from scratch
- Creating multi-agent systems with Genie Space data access
- Setting up agent evaluation pipelines with LLM judges
- Deploying agents to Model Serving with OBO authentication
- Adding memory (short-term conversations, long-term preferences) to agents
- Troubleshooting agent implementation errors (AI Playground, authentication, evaluation)
- Planning agent architecture (orchestrator + worker pattern)
Architecture Overview
Recommended: Hybrid Architecture (Genie + Custom Orchestrator)
```
User Query
    │
    ▼
┌─────────────────────┐
│ ResponsesAgent      │ ← MLflow pyfunc model (MANDATORY)
│ (Orchestrator)      │
├─────────────────────┤
│ Intent Classifier   │ ← Routes to domains
│ LangGraph Workflow  │ ← State management
│ Memory Manager      │ ← Short-term + Long-term
└────────┬────────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────┐
│ Genie  │ │ Genie  │ ← Domain-specific data access
│ Space  │ │ Space  │   (OBO authentication)
│ (Cost) │ │ (Perf) │
└────────┘ └────────┘
```
For full architecture details, see: references/architecture-overview.md
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | MLflow ResponsesAgent | AI Playground compatible model |
| Data Access | Genie Spaces + Conversation API | Natural language data queries |
| Orchestration | LangGraph | Multi-agent state management |
| Short-term Memory | Lakebase CheckpointSaver | Conversation continuity |
| Long-term Memory | Lakebase DatabricksStore | User preferences |
| Prompt Management | Unity Catalog + MLflow Registry | Versioned prompts with A/B testing |
| Evaluation | MLflow GenAI Evaluate | LLM judges + custom scorers |
| Monitoring | MLflow Registered Scorers | Continuous production quality |
| Deployment | MLflow Deployment Jobs | Auto-trigger on model version |
| Authentication | OBO + Automatic Auth Passthrough | User-level data governance |
Working Memory Management
This orchestrator spans 9 phases (Phase 0-8) — one of the longest in the pipeline. To maintain coherence without context pollution:
After each phase, persist a brief summary note capturing:
- Phase 0 output: Manifest found (yes/no), agent count, Genie Space IDs from manifest, tool definitions
- Phase 1 output: ResponsesAgent class path, tracing config, base agent working
- Phase 2 output: Genie Space tool IDs, data access verified, query routing decisions
- Phase 3 output: Lakebase table names (conversations, preferences), memory config
- Phase 4 output: Prompt template paths, system prompt derivation from Genie Space instructions
- Phase 5 output: Evaluation dataset table path, scorer names, baseline metrics
- Phase 6 output: Model Serving endpoint name, OBO auth config, deployment job paths
- Phase 7 output: Production monitor names, alert thresholds, drift detection config
- Phase 8 output: Genie Space optimization results, benchmark accuracy improvements
What to keep in working memory: Only the current phase's context, the agent inventory (from Phase 0), and the previous phase's summary note. Discard intermediate outputs (full evaluation DataFrames, raw trace logs, complete prompt texts) — they are in MLflow and on disk.
Implementation Phases
Phase 0: Read Plan (5 minutes)
Before starting implementation, check for a planning manifest that defines what to build.
```python
import yaml
from pathlib import Path

manifest_path = Path("plans/manifests/genai-agents-manifest.yaml")
if manifest_path.exists():
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    # Extract implementation checklist from manifest
    architecture = manifest.get('architecture', {})
    agents = manifest.get('agents', [])
    evaluation = manifest.get('evaluation', {})
    prompts = manifest.get('prompts', [])
    deployment = manifest.get('deployment', {})

    print(f"Plan: {len(agents)} agents, {len(prompts)} prompts")
    print(f"Architecture: {architecture.get('pattern', 'unknown')}")
    # Each agent has: name, domain, genie_space, capabilities, tools, sample_questions
    # Evaluation has: dataset_table, questions_per_domain, scorers, thresholds
else:
    # Fallback: self-discovery from Genie Spaces
    print("No manifest found — falling back to Genie Space self-discovery")
    # Discover existing Genie Spaces, create one agent per space
```
If manifest exists: Use it as the implementation checklist. Every agent, tool, evaluation question, and deployment configuration is pre-defined. Track completion against the manifest's summary counts.
If manifest doesn't exist: Fall back to self-discovery — inventory existing Genie Spaces from the catalog, create one agent per Genie Space, and use default evaluation criteria. This works but may miss specific tool definitions and cross-domain orchestration patterns the planning phase would have defined.
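The manifest-or-fallback decision above can be sketched as a small validation step. The required key names mirror the agent fields listed in the Phase 0 comments (`name`, `domain`, `genie_space`, etc.); the helper itself is illustrative, not part of the skill.

```python
# Illustrative sketch: check manifest agent entries before treating the
# manifest as the implementation checklist. Key names follow the fields
# described above; adapt to your actual manifest schema.
REQUIRED_AGENT_KEYS = {"name", "domain", "genie_space",
                       "capabilities", "tools", "sample_questions"}

def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest is usable."""
    problems = []
    agents = manifest.get("agents", [])
    if not agents:
        problems.append("no agents defined — fall back to Genie Space self-discovery")
    for i, agent in enumerate(agents):
        missing = REQUIRED_AGENT_KEYS - agent.keys()
        if missing:
            problems.append(f"agent[{i}] missing keys: {sorted(missing)}")
    return problems
```

An empty result means proceed with the manifest; any problem message means self-discovery (or a fix to the plan) is needed.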
Phase 1: Foundation (ResponsesAgent + Tracing)
Load skill: responses-agent-patterns
Goal: Create a working ResponsesAgent with MLflow tracing that loads in AI Playground.
Artifacts produced:
- Agent class inheriting from `ResponsesAgent` with `predict()` and `predict_stream()` methods
- MLflow tracing with `@mlflow.trace` decorators
- OBO authentication with context detection
Validate before proceeding:
- Agent loads in AI Playground without errors
- Streaming responses work (delta + done events)
- Traces visible in MLflow Experiment UI
- No `signature` parameter in `log_model()`
Phase 2: Data Access (Genie Spaces)
Load skill: multi-agent-genie-orchestration
Goal: Connect agent to Genie Spaces for structured data access with NO LLM fallback.
Artifacts produced:
- Genie Space configuration registry
- Conversation API integration (start_conversation_and_wait)
- Intent classification (domain routing)
- Parallel domain queries (ThreadPoolExecutor)
- Cross-domain response synthesis
Validate before proceeding:
- Genie queries return real data (not hallucinated)
- Multi-domain queries execute in parallel
- Error messages are explicit (no data fabrication)
- Intent classification routes correctly
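The parallel-query and explicit-error requirements above can be sketched with `ThreadPoolExecutor` (as named in the artifacts list). `query_genie_space` is a hypothetical stand-in for the real Genie Conversation API call; only the fan-out/collect structure is the point here.

```python
from concurrent.futures import ThreadPoolExecutor

def query_genie_space(space_id: str, question: str) -> dict:
    """Hypothetical stand-in for the Genie Conversation API call
    (start_conversation_and_wait). Replace with the real client call."""
    raise NotImplementedError

def query_domains(spaces: dict[str, str], question: str,
                  query_fn=query_genie_space) -> dict[str, dict]:
    """Fan one question out to several Genie Spaces in parallel and collect
    results keyed by domain. Failures become explicit error messages —
    never an LLM fallback that could fabricate data."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=len(spaces)) as pool:
        futures = {domain: pool.submit(query_fn, space_id, question)
                   for domain, space_id in spaces.items()}
        for domain, future in futures.items():
            try:
                results[domain] = future.result(timeout=120)
            except Exception as exc:
                results[domain] = {"error": f"Genie query failed for {domain}: {exc}"}
    return results
```

The per-domain results then feed the cross-domain synthesis step.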
Phase 3: Memory (Lakebase)
Load skill: lakebase-memory-patterns
Goal: Add conversation continuity and user preferences.
Artifacts produced:
- Lakebase setup script (tables creation)
- CheckpointSaver integration with LangGraph
- DatabricksStore for long-term memory
- Memory tools for agent autonomy
- Graceful degradation (agent works without memory)
Validate before proceeding:
- thread_id returned in custom_outputs
- Multi-turn conversations maintain context
- Agent works without memory tables (graceful fallback)
- User preferences persist across sessions
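The graceful-degradation and `thread_id` checks above can be sketched as follows; `factory` and `run_turn` are illustrative names, not APIs from the Lakebase skill.

```python
def init_checkpointer(factory):
    """Try to construct the Lakebase-backed checkpointer (e.g. a
    CheckpointSaver constructor); on any failure return None so the agent
    still runs without memory — graceful degradation."""
    try:
        return factory()
    except Exception:
        return None

def run_turn(message: str, checkpointer=None, thread_id: str = "t-1") -> dict:
    """Sketch of one turn: include thread_id in custom_outputs only when
    memory is available, so callers can detect conversation continuity."""
    response = {"output": f"echo: {message}", "custom_outputs": {}}
    if checkpointer is not None:
        response["custom_outputs"]["thread_id"] = thread_id
    return response
```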
Phase 4: Prompt Management
Load skill: prompt-registry-patterns
Goal: Externalize prompts for runtime updates and A/B testing.
Artifacts produced:
- Unity Catalog agent_config table
- Prompt registration script
- Lazy loading in agent
- A/B testing (champion/challenger)
Validate before proceeding:
- Prompts loaded from Unity Catalog at runtime
- Fallback to embedded defaults if UC unavailable
- SQL injection prevented (escaped quotes)
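The quote-escaping and UC-fallback checks above can be sketched as below; `fetch_fn` is a hypothetical stand-in for the SQL read against the `agent_config` table.

```python
DEFAULT_PROMPTS = {"system": "You are a helpful Databricks agent."}  # embedded fallback

def escape_sql_literal(text: str) -> str:
    """Double single quotes before embedding a prompt in a SQL literal,
    so quotes in prompt text cannot break or inject into the statement."""
    return text.replace("'", "''")

def load_prompt(name: str, fetch_fn) -> str:
    """Lazy-load a prompt from the Unity Catalog agent_config table via
    `fetch_fn`; fall back to the embedded default if UC is unavailable
    or returns nothing."""
    try:
        prompt = fetch_fn(name)
        if prompt:
            return prompt
    except Exception:
        pass
    return DEFAULT_PROMPTS[name]
```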
Phase 5: Evaluation
Load skill: mlflow-genai-evaluation
Goal: Set up pre-deployment evaluation with LLM judges.
Artifacts produced:
- Evaluation dataset (UC table)
- Built-in judges (Relevance, Safety, Guidelines)
- Custom domain-specific scorers
- Threshold configuration
- `_extract_response_text()` helper
- `_call_llm_for_scoring()` helper
Validate before proceeding:
- All custom scorers use `_extract_response_text()`
- Databricks SDK used for LLM calls (NOT `langchain_databricks`)
- 4-6 guidelines (not 8+)
- Metric aliases defined for backward compatibility
Phase 6: Deployment
Load skill: deployment-automation
Goal: Automated deployment pipeline triggered by model version creation.
Artifacts produced:
- Deployment job YAML (Asset Bundle)
- Dataset linking with `mlflow.log_input()`
- Evaluation-then-promote workflow
- Three MLflow experiments (dev, eval, deploy)
Validate before proceeding:
- Job triggers on MODEL_VERSION_CREATED
- Dataset linked to evaluation runs
- Thresholds checked before promotion
- Model promoted with alias management
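The threshold gate in the evaluation-then-promote workflow can be sketched as below. Note how a missing metric fails loudly rather than silently passing, which is exactly the silent-threshold-bypass anti-pattern the metric aliases guard against. Metric names here are placeholders.

```python
def passes_thresholds(metrics: dict[str, float],
                      thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Gate model promotion on evaluation metrics: every thresholded metric
    must be present and meet its minimum. A missing metric is a failure,
    not a pass — silent bypass is the anti-pattern."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing (check metric aliases)")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return (not failures, failures)
```

Only when the gate passes does the deployment job move the model alias.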
Phase 7: Production Monitoring
Load skill: production-monitoring
Goal: Continuous quality monitoring with sampling-based scorers.
Artifacts produced:
- Registered scorers with sampling rates
- Custom heuristic scorers (fast, 100% sampling)
- Trace archival to Unity Catalog
- Monitoring dashboard queries
Validate before proceeding:
- Immutable pattern used (scorer = scorer.start())
- Safety at 100% sampling, expensive judges at 5-20%
- Trace archival enabled
- Dashboard queries return metrics
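Why `scorer = scorer.start()` matters can be shown with a toy immutable scorer. This is a model of the pattern, not the real registered-scorer API: `start()` returns a new configured instance, so a bare `scorer.start(...)` discards the result and nothing actually starts.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RegisteredScorer:
    """Toy model of the immutable registered-scorer pattern (illustrative,
    not the real API). start() returns a NEW instance instead of mutating."""
    name: str
    sample_rate: float = 0.0
    running: bool = False

    def start(self, sample_rate: float) -> "RegisteredScorer":
        # Frozen dataclass: build a configured copy rather than mutating self.
        return replace(self, sample_rate=sample_rate, running=True)

safety = RegisteredScorer("safety")
safety.start(1.0)           # WRONG: result discarded, `safety` is unchanged
safety = safety.start(1.0)  # RIGHT: rebind to the started instance
```

The same reasoning applies to sampling rates: rebind cheap heuristic scorers at 1.0 and expensive LLM judges at 0.05-0.20.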
Phase 8: Genie Optimization
Load skill: @data_product_accelerator/skills/semantic-layer/05-genie-optimization-orchestrator/SKILL.md (cross-domain)
Goal: Improve Genie Space accuracy and repeatability.
Artifacts produced:
- Accuracy/repeatability test scripts
- Control lever adjustments
- API update with array sorting
- Dual persistence (API + repository)
Validate before proceeding:
- Repeatability tested (3+ iterations per question)
- Dual persistence applied (API + repo)
- Arrays sorted before API updates
- Rate limiting in place (12+ seconds)
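The array-sorting and rate-limiting checks above can be sketched together; `send_fn` is a hypothetical stand-in for the real Genie API client, and sorting on the string form is one illustrative way to order mixed-type arrays deterministically.

```python
import time

def prepare_genie_patch(payload: dict) -> dict:
    """Sort every top-level list in the PATCH payload before sending —
    per the anti-pattern table, unsorted arrays are rejected by the API.
    Keying on str() keeps mixed dict/str entries deterministically ordered."""
    return {k: sorted(v, key=str) if isinstance(v, list) else v
            for k, v in payload.items()}

def patch_with_rate_limit(payloads: list[dict], send_fn,
                          min_interval: float = 12.0) -> None:
    """Apply updates with at least `min_interval` seconds between calls
    (12+ seconds per the validation list above)."""
    for i, payload in enumerate(payloads):
        if i > 0:
            time.sleep(min_interval)
        send_fn(prepare_genie_patch(payload))
```

Remember the dual-persistence rule: whatever is PATCHed via the API must also be committed to the repository.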
Critical Anti-Patterns (NEVER Do These)
For the complete anti-patterns reference, see references/anti-patterns.md.
| Anti-Pattern | Consequence | Correct Pattern |
|---|---|---|
| Manual `signature` in `log_model()` | Breaks AI Playground | Let MLflow auto-infer |
| LLM fallback for data queries | Hallucinated fake data | Return explicit error messages |
| OBO in non-serving contexts | Permission errors | Context detection with env vars |
| Missing `_extract_response_text()` | All scorer scores = 0.0 | Universal extraction helper |
| `langchain_databricks` in scorers | Serverless failures | Databricks SDK |
| 8+ guidelines sections | Score = 0.20 | 4-6 focused guidelines |
| Missing metric aliases | Silent threshold bypass | Define aliases for MLflow versions |
| `scorer.start()` without assignment | Scorers don't start | `scorer = scorer.start()` |
| API-only Genie updates | Lost on redeployment | Dual persistence (API + repo) |
| Unsorted Genie API arrays | API rejection | Sort all arrays before PATCH |
File Structure
For the recommended project file structure, see references/file-structure.md.
Summary Validation Checklist
See references/implementation-checklist.md for the complete consolidated checklist covering all phases.
Templates
- Agent class template: `assets/templates/agent-class-template.py`
- Agent settings: `assets/templates/agent-settings-template.py`
References
Official Documentation
- Databricks Agent Framework
- MLflow ResponsesAgent
- MLflow GenAI Evaluate
- MLflow Tracing
- Genie Conversation API
- Production Monitoring
- Lakebase Documentation
- Multi-Agent with Genie
Dependency Skills
| Skill | Purpose |
|---|---|
| `responses-agent-patterns` | ResponsesAgent, streaming, tracing, OBO auth |
| `mlflow-genai-evaluation` | LLM judges, custom scorers, thresholds |
| `lakebase-memory-patterns` | CheckpointSaver, DatabricksStore, graceful degradation |
| `prompt-registry-patterns` | UC storage, MLflow versioning, A/B testing |
| `multi-agent-genie-orchestration` | Genie API, parallel queries, intent classification |
| `deployment-automation` | Deployment jobs, dataset linking, promotion |
| `production-monitoring` | Registered scorers, trace archival, dashboards |
| `mlflow-genai-foundation` | Core MLflow GenAI patterns (tracing, eval basics) |
Troubleshooting
| Skill | Purpose |
|---|---|
| `common/databricks-autonomous-operations` | Deploy → Poll → Diagnose → Fix → Redeploy loop when jobs fail. Read when stuck during deployment or runtime errors. |
Cross-Domain Dependencies
| Skill | Domain | Purpose |
|---|---|---|
| `semantic-layer/05-genie-optimization-orchestrator` | Semantic Layer | Accuracy testing, control levers, dual persistence |
Upstream Agent Bricks (databricks-agent-bricks)
This skill extends the upstream databricks-agent-bricks skill from ai-dev-kit. The following capabilities are available.
Agent Types in Supervisor Agents (MAS)
The upstream databricks-agent-bricks skill supports 5 agent types in Supervisor Agents:
| Agent Type | Parameter | Description |
|---|---|---|
| Knowledge Assistant | `ka_tile_id` | Document Q&A via RAG from PDFs in Volumes |
| Genie Space | `genie_space_id` | Natural language to SQL |
| Custom Endpoint | `endpoint_name` | Model serving endpoint for custom agents |
| UC Function | `uc_function_name` | Unity Catalog function (format: `catalog.schema.function_name`). Agent service principal needs EXECUTE privilege. |
| External MCP Server | `connection_name` | UC HTTP Connection with `is_mcp_connection: 'true'`. Agent service principal needs USE CONNECTION privilege. |
Provide exactly one of `ka_tile_id`, `genie_space_id`, `endpoint_name`, `uc_function_name`, or `connection_name` per agent.
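The exactly-one-of rule can be enforced with a small check before creating a Supervisor Agent; the function name here is illustrative.

```python
# Parameters from the agent-type table; each member agent must set exactly one.
AGENT_PARAMS = ("ka_tile_id", "genie_space_id", "endpoint_name",
                "uc_function_name", "connection_name")

def validate_agent_entry(agent: dict) -> None:
    """Raise if an agent entry sets zero or more than one identifying parameter."""
    provided = [p for p in AGENT_PARAMS if agent.get(p)]
    if len(provided) != 1:
        raise ValueError(
            f"expected exactly one of {AGENT_PARAMS}, got: {provided or 'none'}")
```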
New Actions
- `find_by_name` action for both `manage_ka` and `manage_mas` tools — looks up an existing KA or Supervisor Agent by name when you don't have the tile_id
MCP Tools
- `manage_ka`: Manage Knowledge Assistants (create_or_update, get, find_by_name, delete)
- `manage_mas`: Manage Supervisor Agents (create_or_update, get, find_by_name, delete)
Pipeline Progression
Previous stage: ml/00-ml-pipeline-setup → ML models and experiments should be set up
This is the final stage in the standard pipeline progression. After completing GenAI agent implementation, the data platform is fully operational with:
- Bronze → Silver → Gold data pipeline
- Semantic layer (Metric Views, TVFs, Genie Spaces)
- Observability (Monitoring, Dashboards, Alerts)
- ML models and experiments
- GenAI agents with evaluation and deployment
Post-Completion: Skill Usage Summary (MANDATORY)
After completing all phases of this orchestrator, output a Skill Usage Summary reflecting what you ACTUALLY did — not a pre-written summary.
What to Include
- Every skill `SKILL.md` or `references/` file you read (via the Read tool), in the order you read them
- Which phase you were in when you read it
- Whether it was a Worker, Common, Cross-domain, or Reference file
- A one-line description of what you specifically used it for in this session
Format
| # | Phase | Skill / Reference Read | Type | What It Was Used For |
|---|---|---|---|---|
| 1 | Phase N | `path/to/SKILL.md` | Worker / Common / Cross-domain / Reference | One-line description |
Summary Footer
End with:
- Totals: X worker skills, Y common skills, Z cross-domain skills, W reference files read across N phases
- Skipped: List any skills from the dependency table above that you did NOT need to read, and why (e.g., "phase not applicable", "user skipped", "no issues encountered")
- Unplanned: List any skills you read that were NOT listed in the dependency table (e.g., for troubleshooting, edge cases, or user-requested detours)
Version History
| Date | Version | Changes |
|---|---|---|
| Jan 27, 2026 | 1.0.0 | Initial skill creation from cursor rules 28, 30-35, context prompts, and implementation docs |