# Experiment Workspace
This skill manages the full experiment lifecycle through the database.
It replaces what an online experiment dashboard does — experiment creation, data collection tracking, analysis computation, and result storage — with database records accessible via HTTP API. All data flows through a single SQLite database shared by the web UI, sandbox agent, and analysis scripts.
The database is the experiment. The script is the dashboard.
## When to Activate
- Hypothesis exists, primary metric is defined, instrumentation is confirmed (came from `measurement-design`)
- User wants to "start the experiment" — create the tracking structure
- User wants to pull data and run analysis
- User wants to check whether enough data has accumulated
- User wants to archive a completed experiment result
- Project stage is `measuring`
Do not activate if:

- Hypothesis is not yet written → go to `hypothesis-design`
- Primary metric is undefined → go to `measurement-design`
- Data already exists and the user only wants a decision → go to `evidence-analysis` directly
## On Entry — Read Current State
Before doing any work, read the project from the database using the `project-sync` skill's `get-project` command.
Check these fields:
| Field | Purpose |
|---|---|
| `hypothesis` | The causal claim being tested |
| `primaryMetric` | What is being measured |
| `stage` | Current lifecycle position |
| `experiments` | Existing experiment records and their status |
- If `hypothesis` is empty → redirect to `hypothesis-design`
- If `primaryMetric` is empty → redirect to `measurement-design`
- If an experiment already exists for this hypothesis → resume from its current status rather than creating a new one
## What This Skill Manages

### Database Records (Experiment model)
Each experiment is stored as a row in the Experiment table (Prisma schema). Key fields:
| DB Field | Purpose |
|---|---|
| `slug` | Experiment identifier (kebab-case) |
| `hypothesis` | The causal claim being tested |
| `primaryMetricEvent` | Event name for the primary metric |
| `guardrailEvents` | JSON array string of guardrail event names |
| `controlVariant` / `treatmentVariant` | Variant values |
| `minimumSample` | Validity floor per variant |
| `observationStart` / `observationEnd` | Observation window |
| `priorProper` / `priorMean` / `priorStddev` | Prior configuration |
| `inputData` | Collected metric data (JSON string) |
| `analysisResult` | Computed analysis output (JSON string) |
| `status` | `draft` / `collecting` / `analyzing` / `decided` |
| `decision` / `decisionSummary` / `decisionReason` | Final decision (summary = plain-language action, reason = technical rationale) |
### Scripts (Hybrid: TypeScript for I/O, Python for algorithms)
```
skills/experiment-workspace/scripts/
  db-client.ts         ← HTTP API wrapper for DB read/write (TypeScript)
  db_client.py         ← HTTP API wrapper for DB read/write (Python)
  collect-input.ts     ← data collector placeholder (TypeScript, implement fetchMetricSummary)
  stats_utils.py       ← statistical utilities: GaussianPrior, bayesian_result, srm_check (Python, numpy/scipy)
  analyze-bayesian.py  ← Bayesian A/B analysis (Python, reads/writes DB)
  analyze-bandit.py    ← Thompson Sampling weight computation (Python, reads/writes DB)
```
- TypeScript scripts (data I/O): `npx tsx <script>.ts <project-id> <experiment-slug>`
- Python scripts (algorithms): `python <script>.py <project-id> <experiment-slug>`
Two experiment types:

- A/B (default): fixed 50/50 traffic split, one-shot analysis → `analyze-bayesian.py`
- Bandit: dynamic traffic reweighting via Thompson Sampling → `analyze-bandit.py` (requires FeatBit API integration for full automation)
All experiment data lives in the shared SQLite database, accessible via the web app's HTTP API (`SYNC_API_URL`, default `http://localhost:3000`). No local experiment files are needed — the web UI, sandbox agent, and scripts all read and write the same database.
## Decision Actions

### "First time setup"
No file copying is needed. All scripts run from `skills/experiment-workspace/scripts/` using `npx tsx`. The only prerequisites are:

- The web app must be running (it provides the HTTP API that scripts use for DB access)
- `collect-input.ts` must be customized with a `fetchMetricSummary()` implementation for your data source (see `references/data-source-guide.md`)
- Both analysis scripts (`analyze-bayesian.py`, `analyze-bandit.py`) work out of the box once `inputData` exists in the DB

Python with numpy/scipy is required for the analysis scripts. Install once:

```
pip install numpy scipy
```
### "I want to start an experiment"
- Confirm the hypothesis slug — derive from the flag key, e.g. `chat-cta-v2`
- Ensure the web app is running (scripts need the HTTP API)
- Persist the experiment to the database using the `project-sync` skill's `upsert-experiment` command (see Persist State section below)
- Copy `hypothesis` verbatim from the project state read on entry
- Confirm the `observation_window.start` date — this is today if the flag was just enabled
- Set `minimum_sample_per_variant` using the following fallback chain. Do not expose the formula to the user at any step.

  Step 1 — read the hypothesis from project state (loaded on entry):
  - Does it mention a current baseline rate? (e.g. "increase signup rate from 4% to 5%" → p_baseline = 0.04)
  - Does it mention an expected lift that implies a current level? Extract the number and compute `ceil(30 / p_baseline)`
  Step 2 — infer from metric event name and funnel stage:

  - Re-read the primary metric event name: does it suggest a funnel position?
  - Use these heuristics as a starting estimate:

  | Metric type | Typical baseline range | Suggested floor |
  |---|---|---|
  | Button click / CTA | 3–10% | 500 |
  | Signup / registration | 1–5% | 1,000 |
  | Purchase / checkout | 1–3% | 1,500 |
  | Feature engagement (active users) | 10–30% | 200 |
  | Error rate / latency (inverse) | 1–5% | 1,000 |
  Step 3 — collect a short baseline sample from the control group (most accurate):

  - If the flag has been live for at least 1–3 days, guide the user to pull control-only data for that period and share it with the agent.
  - Tell the user exactly what numbers are needed:
    "To get an accurate baseline, I need two numbers from your control group for the past few days:
    - n — how many unique users were exposed to the control variant
    - k — how many of those users triggered the '[metric event]' event
    You can get these from FeatBit's experiment results, your database, or your analytics tool."
  - Once the user provides `n` and `k`: compute `p_baseline = k / n`, then set `ceil(30 / p_baseline)` — this overrides any estimate from Steps 1–2
  Step 4 — ask the user only if Steps 1–3 all fail: "What is the current conversion rate for [metric name]? A rough estimate is fine, e.g. 'about 5%' or 'maybe 1 in 20 users'."

  Step 5 — if no estimate is available from any source:

  - Use 1,000 as a safe conservative default (assumes ~3% baseline)
  - Record the assumption explicitly in the experiment record so it can be revised once real data arrives
- Ask the user whether they have prior knowledge about the expected lift for this metric:
  - "Do you have results from a similar past experiment? If so, what was the approximate lift and how uncertain was it?"
  - If the user provides a past `μ_rel` and `se` (or a rough range): set `priorProper: true`, `priorMean: <μ_rel>`, `priorStddev: <se>` in the experiment
  - If the user ran a pilot phase (separate experiment window) and has its `analysisResult`: read `μ_rel` and `se` from it and use those as the prior — but only if the pilot data will not be included in the new experiment's `inputData`
  - If no prior knowledge is available: set `priorProper: false` (flat prior, the safe default)
- Persist state to the database (see Persist State section below)
- Tell the user: the next step is to collect data (customize `collect-input.ts` if needed), then run the analysis
The agent does not need to touch any online dashboard. Persisting the experiment record to the database is the equivalent of "creating an experiment".
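The minimum-sample fallback chain collapses to one formula once any baseline rate is available. A minimal sketch of that arithmetic (the `30 / p_baseline` rule and the 1,000-user default come from the steps above; the function name and example counts are illustrative):

```python
import math

def minimum_sample_per_variant(p_baseline=None):
    """Validity floor per variant: ceil(30 / p_baseline), falling back to a
    conservative default of 1,000 (~3% baseline) when no estimate exists."""
    if p_baseline is None or p_baseline <= 0:
        return 1000  # Step 5 default; record the assumption in the experiment record
    return math.ceil(30 / p_baseline)

# Step 3 override: baseline measured from control data (k conversions out of n users)
n, k = 2400, 96
print(minimum_sample_per_variant(k / n))  # 4% baseline → 750
print(minimum_sample_per_variant(None))   # no estimate → 1000
```

Note that a baseline from real control counts (Step 3) always overrides the funnel-stage heuristics, which only bracket the same formula with typical baseline ranges.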
### "I want to check if we have enough data"
- Read the experiment from the database and check `inputData`
  - If `inputData` is empty, data has not been collected yet — direct to `references/data-source-guide.md` or customize `collect-input.ts`
- If `inputData` exists, check the `n` (total users) per variant against `minimumSample`
  - You can inspect this from the web UI or by reading the experiment record via the API
- If below minimum: do not proceed to analysis — wait and re-check later
- If above minimum: proceed to run the analysis
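The threshold check is mechanical once `inputData` is parsed. A sketch, assuming `inputData` parses to a variant → `{"n", "k"}` mapping (the real format is defined in `references/experiment-folder-spec.md`; the counts here are illustrative):

```python
minimum_sample = 750  # from the experiment record's minimumSample field
input_data = {"control": {"n": 812, "k": 31}, "treatment": {"n": 640, "k": 35}}

# Every variant must clear the floor before analysis is allowed
ready = all(row["n"] >= minimum_sample for row in input_data.values())
short = [v for v, row in input_data.items() if row["n"] < minimum_sample]
print(ready, short)  # treatment is still 110 users short → wait and re-check later
```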
### "I want to run the analysis"
- Confirm `inputData` exists in the experiment record (read from the database)
- If missing: customize and run `collect-input.ts` or follow `references/data-source-guide.md` to populate it
- Run: `python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <experiment-slug>`
- The script reads `inputData` from the DB, computes results, and writes `analysisResult` back to the DB
- Key outputs to check before handing off (read `analysisResult` from the experiment record):
  - P(win) ≥ 95% → strong signal; ≤ 5% → likely harmful; 20–80% → inconclusive
  - `risk[trt]` — if P(win) is near a boundary, this tells you how costly a wrong call is
  - SRM check — if χ² p-value < 0.01, stop and investigate traffic split before interpreting metrics
- Hand off to `evidence-analysis` with the experiment's `analysisResult` and definition fields
- Persist experiment status to the database (see Persist State section below)
For the full list of metric types and usage patterns (proportion, continuous, inverse, multiple arms, informative prior), see `references/analysis-bayesian.md`.
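The SRM check above is a chi-square goodness-of-fit test against the intended 50/50 split. `stats_utils.py` implements it with numpy/scipy; the sketch below shows the same idea using only the standard library, via the df=1 identity P(χ² > x) = erfc(√(x/2)). The counts are illustrative:

```python
import math

def srm_p_value(n_control, n_treatment):
    """Chi-square goodness-of-fit p-value for an intended 50/50 split.
    With two arms, df = 1, so the survival function is erfc(sqrt(chi2 / 2))."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    return math.erfc(math.sqrt(chi2 / 2))

print(srm_p_value(5000, 4950))  # mild sampling noise → p well above 0.01, proceed
print(srm_p_value(5000, 4000))  # ~20% imbalance → p far below 0.01, investigate
```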
Multi-arm threshold reminder: if the experiment has more than 2 variants (A/B/C/n), raise the P(win) threshold to compensate for multiple comparisons:
| Arms compared | Suggested threshold |
|---|---|
| 2 | 95% |
| 3 | 98.3% |
| 5 | 99% |
See `references/analysis-bayesian.md` → "On Family-wise Error" for details.
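For intuition only: one rule that reproduces the 3- and 5-arm rows of the table is to spread the 5% error budget across the arms, keeping the standard 95% for two arms. This is reverse-engineered from the table, not taken from the reference, so treat `references/analysis-bayesian.md` as the authoritative source:

```python
def pwin_threshold(arms):
    """P(win) threshold per the table above (assumed rule: 5% error budget
    divided by the number of arms once there are more than two)."""
    if arms <= 2:
        return 0.95
    return 1 - 0.05 / arms

print(round(pwin_threshold(3), 3))  # 0.983
print(round(pwin_threshold(5), 3))  # 0.99
```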
### "I want to update the data and re-run"
- Re-run `collect-input.ts` to pull fresh counts — it overwrites `inputData` in the DB
- Re-run: `python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <experiment-slug>`
- `analysisResult` is overwritten with fresh numbers — both scripts are idempotent
- Persist updated experiment status to the database (see Persist State section below)
### "I want to run a Bandit experiment"
A bandit experiment replaces fixed 50/50 traffic with dynamic reweighting. It requires a continuous cycle of data collection → weight computation → FeatBit flag update.
Setup (same as A/B — uses the same experiment record in the DB):
- Create the experiment record following the standard workflow (see "I want to start an experiment")
- Choose `primaryMetricEvent` — bandit optimizes this single metric
- Note: bandit works best for proportion metrics (conversion rate, CTR)
Each reweighting cycle (recommended every 6–24 hours):
- Collect fresh data → update `inputData` in the DB
- Run: `python skills/experiment-workspace/scripts/analyze-bandit.py <project-id> <experiment-slug>`
- Read `analysisResult` from the experiment record:
  - If `enough_units: false` → burn-in not complete, do not apply weights yet (need ≥ 100 users per arm)
  - If `srm_p_value < 0.01` → SRM detected, investigate traffic split before applying weights
  - Otherwise → apply `bandit_weights` to the FeatBit feature flag via API
- Update FeatBit feature flag rollout weights using the FeatBit API (see `references/analysis-bandit.md` for the conversion formula)
Stopping condition: when `best_arm_probabilities[arm] >= 0.95` for any arm, stop reweighting.
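To make the cycle concrete, here is a standard-library sketch of Thompson Sampling for a proportion metric: sample each arm's conversion rate from its Beta posterior, then count how often each arm wins. The arm counts are illustrative; the real `analyze-bandit.py` reads them from `inputData` and its exact internals may differ:

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the sketch is reproducible

# Per-arm counts: n = exposed users, k = conversions (illustrative numbers)
arms = {"control": (1000, 30), "treatment": (1000, 60)}

DRAWS = 5000
wins = Counter()
for _ in range(DRAWS):
    # Sample each arm's rate from its Beta(k + 1, n - k + 1) posterior
    samples = {a: random.betavariate(k + 1, n - k + 1) for a, (n, k) in arms.items()}
    wins[max(samples, key=samples.get)] += 1

# Win frequencies approximate best_arm_probabilities and double as traffic weights
best_arm_probabilities = {a: wins[a] / DRAWS for a in arms}
print(best_arm_probabilities)
```

With a 6% vs 3% conversion rate at 1,000 users per arm, the treatment arm wins almost every draw, which is also why the ≥ 0.95 stopping rule would fire here.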
After stopping — transition to final analysis:
- Set the winning arm to 100% in FeatBit
- Run the final Bayesian analysis on the full dataset: `python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <experiment-slug>`
- Hand off to `evidence-analysis` with the experiment record containing:
  - `analysisResult` (final Bayesian result — note: the δ estimate may have wider uncertainty due to unequal traffic)
  - the previous bandit `analysisResult` (final `best_arm_probabilities` — the most reliable decision signal)
  - experiment definition fields from the DB
For full details on output interpretation and FeatBit API integration, see `references/analysis-bandit.md`.
### "I want to track long-term effects after launch"
A/B and Bandit experiments measure short-term behavior. Transient effects — novelty, seasonal spikes, event-driven traffic — can inflate results during the experiment window. A holdout group validates whether the effect persists over months.
- After full launch, adjust the feature flag traffic split to 95/5 — keep 5% of users on the old variant
- Record the holdout plan in the experiment record (e.g. in a note or dedicated field):
  - holdout percentage: 5%
  - `check_at_days: [30, 60, 90]`
  - `launched_at: <launch date>`
- At each checkpoint (day 30, 60, 90):
  - Collect fresh data for both groups → update `inputData` in the DB
  - Run the analysis with a time-stamped slug: `python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <slug>-holdout-30d`
- Compare P(win) and rel Δ across checkpoints — look for stability, decay, or growth
- When holdout analysis is complete, remove the holdout split from the feature flag
For full interpretation guidance (three patterns: holds / decays / improves), see `references/analysis-holdout.md`.
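The checkpoint schedule follows directly from the holdout plan fields. A small sketch (the launch date is illustrative; the field names follow the plan above):

```python
from datetime import date, timedelta

launched_at = date(2024, 5, 1)   # illustrative launch date
check_at_days = [30, 60, 90]     # from the holdout plan

# Concrete dates at which to re-collect data and re-run the analysis
checkpoints = [launched_at + timedelta(days=d) for d in check_at_days]
print([d.isoformat() for d in checkpoints])
```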
### "I want to close the experiment"
- Set experiment status to `decided` and record `observationEnd`, `decision`, `decisionSummary`, `decisionReason` in the DB
- Persist experiment closure to the database (see Persist State section below)
- Hand off to `learning-capture`
## Operating Rules
- The experiment record in the database is the contract. Do not change `primaryMetricEvent`, `controlVariant`, or `treatmentVariant` after data collection starts — it would invalidate the data already collected.
- `observationStart` must match when the flag was actually enabled. Do not backfill earlier — pre-flag data is not part of the experiment.
- Verify `inputData` sanity before running analysis: `k` ≤ `n` for every row, variant keys match the experiment record, no zero `n` values.
- Do not interpret results by eyeballing `inputData`. Always run `analyze-bayesian.py` and read `analysisResult`.
- If the SRM check flags an imbalance (χ² p < 0.01), do not proceed to `evidence-analysis` — the data is unreliable.
- "The script says 97% confidence" does not mean "ship it." That is `evidence-analysis`'s job.
## Persist State
After completing work, use the project-sync skill to persist state to the database. The specific commands depend on the action performed:
Starting an experiment:

- `upsert-experiment` — save all definition fields:
  - `--status draft`
  - `--hypothesis "..."` — verbatim from project state
  - `--primaryMetricEvent "..."`
  - `--guardrailEvents "..."` — JSON array as string, e.g. `'["chat_opened"]'`
  - `--controlVariant "..."` and `--treatmentVariant "..."`
  - `--minimumSample <N>`
  - `--observationStart "YYYY-MM-DD"`
  - `--priorProper false` (or `true` if an informative prior was chosen)
  - `--priorMean <float>` and `--priorStddev <float>` (only when `priorProper true`)
- `update-state` — save `--lastAction "Created experiment <slug>"`
- `set-stage` — set to `measuring`
- `add-activity` — e.g. `--type stage_update --title "Experiment <slug> created"`
Running / re-running analysis:

- `upsert-experiment` — save `--status analyzing --inputData "<JSON>" --analysisResult "<JSON>"` (the scripts do this automatically)
Closing an experiment:

- `upsert-experiment` — save `--status decided --observationEnd "YYYY-MM-DD"`
- `update-state` — save `--lastAction "Experiment <slug> closed"`
## Handoff Chain

```
measurement-design
  → experiment-workspace   ← this skill
    → evidence-analysis
      → learning-capture
```
When handing off to `evidence-analysis`, pass the experiment's `analysisResult` and definition fields (`hypothesis`, `primaryMetricEvent`, variants, etc.) so the decision can be tied back to the hypothesis.
## Reference Files
- `references/experiment-folder-spec.md` — DB schema reference, experiment fields, `inputData` format, `analysisResult` JSON examples
- `references/analysis-bayesian.md` — Bayesian A/B analysis: metric types, prior patterns, output interpretation, sequential testing, family-wise error
- `references/analysis-bandit.md` — Bandit analysis: Thompson Sampling, `analysisResult` fields, FeatBit API integration, stopping condition
- `references/analysis-holdout.md` — Holdout group: post-launch long-term validation, three effect patterns, checkpoint cadence
- `references/data-source-guide.md` — input contract and §FeatBit / §Database / §Custom patterns for producing `inputData`
- `scripts/db-client.ts` — HTTP API wrapper for DB read/write (TypeScript)
- `scripts/db_client.py` — HTTP API wrapper for DB read/write (Python)
- `scripts/collect-input.ts` — data collector placeholder (implement `fetchMetricSummary`)
- `scripts/stats_utils.py` — statistical utilities: GaussianPrior, bayesian_result, srm_check (Python, numpy/scipy)
- `scripts/analyze-bayesian.py` — ready-to-run Bayesian A/B analysis script (Python)
- `scripts/analyze-bandit.py` — ready-to-run Thompson Sampling weight computation script (Python)