# create-context-tests
`nao test` runs each natural-language prompt through the agent, executes both the agent's SQL and the test's expected SQL against the warehouse, and diffs the result data row by row. A test passes only if the actual data matches — same rows, same values. The suite is the reliability benchmark; every change to `RULES.md` is measured against it. Reference: docs.getnao.io/nao-agent/context-engineering/evaluation.
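The pass criterion above can be sketched in a few lines. This is an illustration only, not nao's actual implementation — in particular, ignoring row order is an assumption here, not documented behavior:

```python
def results_match(actual_rows, expected_rows):
    """Pass only if both result sets contain the same rows with the
    same values. Ignoring row order is an assumption of this sketch,
    not a documented nao behavior."""
    if len(actual_rows) != len(expected_rows):
        return False
    # Normalize each row to a tuple and sort, so the order the
    # warehouse happens to return rows in doesn't affect the verdict
    return sorted(map(tuple, actual_rows)) == sorted(map(tuple, expected_rows))
```

A single differing value, extra row, or missing row fails the test — there is no partial credit.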
## How many tests
One test per key metric in `## Key Metrics Reference` is the floor. Then add tests for: time scoping (especially "last 8 weeks" / "last 30 days"), CTE / multi-step queries, edge cases (NULLs, empty windows), and ambiguous wording ("our users", "active") to validate naming-convention rules.
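As an illustration, a time-scoping test saved under `tests/` might look like the sketch below. The field names and the table/column names are assumptions for illustration — the real schema is in `templates/test.yaml`:

```yaml
# Hypothetical test file -- field names are illustrative only;
# see templates/test.yaml for the real schema.
prompt: "How many signups did we get in the last 30 days?"
expected_sql: |
  SELECT COUNT(*) AS signup_count
  FROM dim_users
  WHERE signup_at >= CURRENT_DATE - INTERVAL '30 days'
```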
## Two authoring rules — apply to every test
**Rule 1 — Prompts read like real chat.** Vague, short, and free of table/column/method hints. The test verifies that the agent reaches the right answer from a realistic user input.
| Bad | Good |
|---|---|
| "What was the churn rate from fct_subscriptions in Q1?" | "How's churn looking this quarter?" |
| "Compute MRR as SUM(mrr_amount) where status='active'" | "What's our MRR?" |
**Rule 2 — Output column names encode format / unit, not source.** A column name communicates how to interpret the value.
| Bad | Good |
|---|---|
| `churn_rate_from_fct_subscriptions` | `churn_rate_float_0_1` |
| `mrr_amount_fct_stripe_mrr` | `mrr_usd_dollars` |
| `signup_at_dim_users` | `signup_date_yyyy_mm_dd` |
Naming patterns: `<metric>_float_0_1` or `<metric>_percentage_0_100` for rates; `<metric>_<currency>_<unit>` for money; `<thing>_count`; `<thing>_date_yyyy_mm_dd`. See `templates/test.yaml`.
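Putting both rules and the naming patterns together, a single test might look like this sketch (the field names are assumptions — check `templates/test.yaml` for the actual schema; the table and column names come from the examples above):

```yaml
# Hypothetical -- field names are illustrative; see templates/test.yaml.
prompt: "What's our MRR?"                       # Rule 1: vague, no table hints
expected_sql: |
  SELECT SUM(mrr_amount) AS mrr_usd_dollars     -- Rule 2: <metric>_<currency>_<unit>
  FROM fct_stripe_mrr
  WHERE status = 'active'
```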
## Steps
1. Ask once: does the user have trusted source-of-truth queries (Looker, dashboards, prior benchmarks)? If yes, transform each into a test (rewrite the SELECT to apply Rule 2; reverse-engineer a Rule 1 prompt). For metrics without a trusted query, draft new tests, one per metric.
2. Save flat under `tests/` (no subfolders), one YAML file per test. Use `templates/test.yaml`.
3. Have the user validate — confirm prompts match their team's phrasing and SQL matches their definition of truth.
4. Run `nao test`. Prerequisites:
   - `cd` into the project directory (where `nao_config.yaml` lives).
   - Start `nao chat &` in the background (the test runner reuses the chat server).
   - An LLM configured in `nao_config.yaml`.
   - The first run prompts for login credentials — let the user type them; don't script around it.
   - If you see `AI_APICallError: Not Found` at `https://api.anthropic.com/messages` (no `/v1/`), run `unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY` first (the parent agent CLI is leaking env vars). See `setup-context` for the full note.

   ```
   nao test -m <model_id> -t 10   # -t = parallelism
   ```

5. Recap results: pass rate, token cost, wall-clock time. Cite this as the baseline.
6. Diagnose failures (optional): read `tests/outputs/` for each failure, identify the rule gap, propose the smallest fix, then route to `write-context-rules` (or `audit-context` for systemic issues). Re-run between fixes so impact is attributable.
## Guardrails
- Tests' SQL must execute as-is — no `<placeholder>` in `FROM`. Use real table / column names.
- Never leak the answer in `prompt` or output column names (Rules 1 + 2).
- One test per metric is the floor; coverage tests come after.
- Apply one context fix at a time between runs.
- If a test contradicts `RULES.md`, stop and ask which is correct — it's a bug in one or the other.
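The first two guardrails are mechanically checkable before a run. A hypothetical pre-flight lint — the function, its heuristics, and the hint list are all assumptions for illustration, not part of nao:

```python
import re

def lint_test(prompt: str, expected_sql: str) -> list[str]:
    """Hypothetical pre-flight lint for the first two guardrails."""
    problems = []
    # Guardrail: SQL must execute as-is -- no <placeholder> tokens
    if re.search(r"<[a-z_]+>", expected_sql):
        problems.append("SQL contains a <placeholder>; use real table/column names")
    # Guardrail: prompts must not leak implementation hints (Rule 1);
    # this hint list is a rough heuristic, not exhaustive
    for hint in ("fct_", "dim_", "SELECT", "SUM("):
        if hint.lower() in prompt.lower():
            problems.append(f"prompt leaks an implementation hint: {hint}")
    return problems
```

An empty return list means the test clears both checks; anything else should be fixed before `nao test` is run.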
## Templates

- `templates/test.yaml` — single-test format.
## More from getnao/nao
### write-context-rules

Create or extend a nao project's `RULES.md`. Owns the `RULES.md` template. Use when the user wants to generate the initial `RULES.md` from synced metadata (called by `setup-context`), or improve their existing `RULES.md`. Do not use for first-time scope setup (use `setup-context`) or for diagnosing existing problems (use `audit-context`).

### audit-context

Diagnose the health of a nao context at any stage of its lifecycle. Use when the user wants a structured review of what's been synced, how `RULES.md` compares to the target structure, whether every table is documented, whether the data model is MECE, whether tests exist and what their failures reveal, and whether context files are bloated. Outputs a structured audit report with ranked recommendations. Do not use for first-time setup (`setup-context`) or routine rule writing (`write-context-rules`).

### setup-context

Bootstrap a nao agent for a project: gather warehouse + scope + extra-context info in one round, look up the warehouse-specific config from nao docs, write `nao_config.yaml`, run `nao init` + `nao sync`, set up the LLM key, and generate the first `RULES.md`. Use when the user has just decided to use nao on a new project. Only for first-time setup; for editing rules, generating tests, or reviewing an existing context, use `write-context-rules` / `create-context-tests` / `audit-context`.

### add-semantic-layer

Wire a semantic layer into a nao agent so that metric queries are routed through a single source of truth. Supports dbt MetricFlow (dbt Cloud with Semantic Layer), Snowflake (views or semantic views via MCP), an in-house nao YAML semantic layer, or other tools (via MCP discovery). Installs the right MCP server, updates `RULES.md` to route metric queries through the semantic layer, and (for the nao YAML option) generates starter metric files. Use after a first round of tests has shown the agent struggling with metric reliability. Do not use for raw rule writing (`write-context-rules`) or first-time setup (`setup-context`).