# Agent Foundry (cxas-agent-foundry)

End-to-end lifecycle for GECX conversational agents: build, test, debug, iterate.
## Step tracking — MANDATORY (Phase 0, blocking)
Before doing ANY work — including running setup, asking questions, or scaffolding files — initialize <project>/todo.md from the relevant sub-skill's checklist (verbatim). The checklist is a contract, not a suggestion. If todo.md doesn't exist for the current task, refuse to proceed and create it first.
Long debug/build runs skip verification steps under pressure (pushing without linting, scaffolding without a TDD, claiming "deployed" without actually pushing). The checklist exists precisely because of this. The instinct to skip a step is the moment the checklist earns its keep — that is when you must consult it, not bypass it.
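The Phase-0 gate amounts to a blocking existence check. A minimal sketch (illustration only, not the actual enforcement mechanism; `project_dir` is a placeholder):

```python
from pathlib import Path

def ensure_todo(project_dir: str) -> Path:
    """Phase-0 gate: block ALL work until <project>/todo.md exists."""
    todo = Path(project_dir) / "todo.md"
    if not todo.exists():
        # Refuse to proceed; the caller must first copy the relevant
        # sub-skill's checklist into todo.md verbatim.
        raise RuntimeError(f"{todo} missing: initialize it from the checklist first")
    return todo
```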
## Quick Reference
```shell
# Lint: dispatch the agents/lint-fixer.md sub-agent — DO NOT run `cxas lint` on the main thread.
# Lint output is verbose; keep it inside the sub-agent context.

# Push local files to platform (only after lint-fixer returns status: clean)
cxas push --app-dir <project>/cxas_app/<AppName> \
  --to projects/<project_id>/locations/<location>/apps/<app_id> \
  --project-id <project_id> --location <location>

# Pull platform state to local
cxas pull projects/<project_id>/locations/<location>/apps/<app_id> \
  --project-id <project_id> --location <location> --target-dir <project>/cxas_app/

# Run evals + triage + report (single command)
python .agents/skills/cxas-agent-foundry/scripts/run-and-report.py --message "what changed" --runs 5

# Inspect app architecture
python .agents/skills/cxas-agent-foundry/scripts/inspect-app.py

# Triage failures
python .agents/skills/cxas-agent-foundry/scripts/triage-results.py --last 3

# Run all 6 build-verification gates against the deployed app
python .agents/skills/cxas-agent-foundry/scripts/gate-check.py

# Tune scoring thresholds (similarity, hallucination, extra-tools)
python .agents/skills/cxas-agent-foundry/scripts/app-thresholds.py show

# Sync callback Python code into evals/callback_tests/agents/ + create test.py symlinks.
# Required for tests to be discoverable by test_all_callbacks_in_app_dir.
python .agents/skills/cxas-agent-foundry/scripts/sync-callbacks.py                        # post-push: pull from platform
python .agents/skills/cxas-agent-foundry/scripts/sync-callbacks.py --from-local <app_dir> # pre-push: copy from local app dir

# Cold-start setup (first-time only — venv + project bootstrap)
.agents/skills/cxas-agent-foundry/scripts/setup.sh
python .agents/skills/cxas-agent-foundry/scripts/setup-project.py
```
Disambiguation: `gate-check.py` and `inspect-app.py` overlap on "show me the architecture". Use `gate-check.py` whenever the user is about to push, has just finished building, or wants a verification pass; use `inspect-app.py` for a quick "what's in here" look without the verification gates. When in doubt, use `gate-check.py`.
## Sub-agents
For heavy diagnosis/analysis work that would otherwise burn main-thread context, dispatch one of these sub-agents via the Agent tool. Pass the contents of the relevant .md file as the prompt, then add the inputs the file lists.
| Sub-agent | Reasoning intensity | When to use |
|---|---|---|
| `agents/triage-failure.md` | HIGH | Diagnose ONE failing eval. Fan out for the top 5 failures by category priority in parallel; iterate on more after the first batch returns. |
| `agents/tdd-writer.md` | HIGH | Reverse-engineer a TDD from an existing agent OR draft one from a PRD. Returns the TDD + open-questions handoff; the main thread runs the show/ask/iterate loop with the user (sub-agents can't ask). |
| `agents/scaffolder.md` | MEDIUM | Bulk-generate all agent code (agent JSONs, instruction.txt, tool python_code, callbacks, app.json) from an APPROVED TDD. One dispatch replaces 30-60 main-thread file writes. |
| `agents/coverage-analyst.md` | MEDIUM | Generate a full eval coverage report against an agent's architecture. |
| `agents/eval-writer.md` | MEDIUM | Generate evals for one entire eval TYPE (all goldens, all sims, etc.); reads the TDD's Coverage Map itself. Max 4 dispatches per build. |
| `agents/lint-fixer.md` | LOW (mechanical) | Run `cxas lint` and mechanically fix all errors + deterministic warnings until clean. Never run lint on the main thread. |
For running evals there is no sub-agent. Use `scripts/run-and-report.py --json-summary <path> > /dev/null 2>&1` and read the summary file — see references/debug.md → "Quick Start". The work is deterministic, so it lives in a script, not a sub-agent.
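Reading the summary file back might look like the sketch below. The summary's schema is not specified in this document, so the `passed`/`total` keys are hypothetical; check what `run-and-report.py` actually writes.

```python
import json
from pathlib import Path

def read_summary(path: str) -> float:
    """Return the pass rate from a run-and-report JSON summary.

    NOTE: the 'passed' / 'total' keys are HYPOTHETICAL; inspect the
    file run-and-report.py actually emits for the real schema.
    """
    data = json.loads(Path(path).read_text())
    return data["passed"] / data["total"]
```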
Reasoning intensity is a hint to the runtime: HIGH sub-agents benefit from more thinking budget / a stronger model, LOW sub-agents are recipe-driven and don't. Each sub-agent file repeats this hint at the top with a one-line justification.
## Environment Readiness Check (run BEFORE routing)
Before routing to any sub-skill, check these signals in order:
- Virtualenv exists? Check if the `.venv/` directory exists.
- Config exists? Check if the `.active-project` file exists and the referenced `<project>/gecx-config.json` exists.
- Has built before? Check if any `<project>/cxas_app/` directory has content.
| Signal | Action |
|---|---|
| No `.venv/` or no config | First-time setup needed. Load references/setup.md before doing anything else. |
| `gecx-config.json` exists but no `cxas_app/` content | Returning user, new project. Route normally. |
| All exist | Returning user. Route normally. |
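The three signals can be checked mechanically. A minimal sketch, assuming `.active-project` holds the project directory name (the document says it "references" the project but doesn't spell out the format) and `root` stands in for the skill workspace:

```python
from pathlib import Path

def readiness(root: str) -> str:
    """Check the three readiness signals in order and return an action."""
    base = Path(root)
    active = base / ".active-project"
    if not (base / ".venv").is_dir() or not active.is_file():
        return "first-time setup: load references/setup.md"
    project = active.read_text().strip()  # ASSUMED to name the project dir
    if not (base / project / "gecx-config.json").is_file():
        return "first-time setup: load references/setup.md"
    app_dir = base / project / "cxas_app"
    if not app_dir.is_dir() or not any(app_dir.iterdir()):
        return "returning user, new project: route normally"
    return "returning user: route normally"
```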
## Detect Intent and Route
Read what the user wants and load the appropriate sub-skill:
| User says... | Phase | Load |
|---|---|---|
| "Build me an agent from this PRD" | Build | references/build.md |
| "Create a new cxas app", "Make a new agent", "Set up an agent", "I wanna build an agent" | Build | references/build.md |
| "Create evals for my agent" | Build | references/build.md |
| "Generate tool tests", "create callback tests" | Build | references/build.md |
| "Update evals -- requirements changed" | Build | references/build.md |
| "Update the TDD" | Build | references/build.md |
| "Run evals", "push evals", "check results" | Run | references/run.md |
| "Run tool tests", "test the callbacks" | Run | references/run.md |
| "Generate a report" | Run | references/run.md |
| "Why is this eval failing", "get to 90%" | Debug | references/debug.md |
| "Fix the failing evals", "debug the agent" | Debug | references/debug.md |
| "Tool test is failing", "callback test broke" | Debug | references/debug.md |
| "Edit the agent's instructions", "tweak the auth tool", "fix the greeting", "update this callback" | Build (Edit cycle) | references/build.md → "Editing an Existing Agent" |
Any phrasing that implies creating, building, or setting up an agent/app routes to references/build.md — even if it sounds like "just create the app shell." "Create a new cxas app" is NOT a shortcut to scaffolding; it triggers the full build flow (todo.md → interview/PRD → TDD + approval → scaffold → lint → evals → push). Skipping the interview / TDD because the user said "create" instead of "build" is a routing failure.
Editing an existing agent (instruction tweak, tool change, callback fix) routes to build.md's "Editing an Existing Agent" section — the standard pull → edit → lint → push → run-evals cycle. Don't skip lint or the eval run after — silent regressions are how 90% rates drop to 70%.
If the intent is unclear, ask: "Are you looking to build/create evals, run them, or debug failures?"
## Before Starting
Check memory for project-specific context (app ID, variable handling rules, audio scoring workarounds). If not available, ask the user.