Experiment Workspace

This skill manages the full experiment lifecycle through the database.

It replaces what an online experiment dashboard does — experiment creation, data collection tracking, analysis computation, and result storage — with database records accessible via HTTP API. All data flows through a single SQLite database shared by the web UI, sandbox agent, and analysis scripts.

The database is the experiment. The script is the dashboard.


When to Activate

  • Hypothesis exists, primary metric is defined, instrumentation is confirmed (came from measurement-design)
  • User wants to "start the experiment" — create the tracking structure
  • User wants to pull data and run analysis
  • User wants to check whether enough data has accumulated
  • User wants to archive a completed experiment result
  • Project stage is measuring

Do not activate if:

  • Hypothesis is not yet written → go to hypothesis-design
  • Primary metric is undefined → go to measurement-design
  • Data already exists and the user only wants a decision → go to evidence-analysis directly

On Entry — Read Current State

Before doing any work, read the project from the database using the project-sync skill's get-project command.

Check these fields:

Field          Purpose
hypothesis     The causal claim being tested
primaryMetric  What is being measured
stage          Current lifecycle position
experiments    Existing experiment records and their status

  • If hypothesis is empty → redirect to hypothesis-design
  • If primaryMetric is empty → redirect to measurement-design
  • If an experiment already exists for this hypothesis → resume from its current status rather than creating a new one
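
The entry checks above can be sketched as a small routing function. This is a minimal illustration, assuming the project comes back as a dict with the four fields listed and that each entry in experiments carries hypothesis, slug, and status; matching an experiment to the hypothesis by string equality and the returned labels are assumptions, not part of the project-sync skill.

```python
from typing import Any


def route_on_entry(project: dict[str, Any]) -> str:
    """Decide the next step from the project state read on entry."""
    if not project.get("hypothesis"):
        return "redirect:hypothesis-design"
    if not project.get("primaryMetric"):
        return "redirect:measurement-design"
    # Resume an existing experiment for this hypothesis instead of creating one.
    # Assumption: experiments is a list of dicts with hypothesis/slug/status.
    for exp in project.get("experiments") or []:
        if exp.get("hypothesis") == project["hypothesis"]:
            return f"resume:{exp['slug']}:{exp['status']}"
    return "create"
```

A project with a hypothesis, a primary metric, and no matching experiment yields "create"; anything else is either a redirect or a resume.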

What This Skill Manages

Database Records (Experiment model)

Each experiment is stored as a row in the Experiment table (Prisma schema). Key fields:

DB Field                                     Purpose
slug                                         Experiment identifier (kebab-case)
hypothesis                                   The causal claim being tested
primaryMetricEvent                           Event name for the primary metric
guardrailEvents                              JSON array string of guardrail event names
controlVariant / treatmentVariant            Variant values
minimumSample                                Validity floor per variant
observationStart / observationEnd            Observation window
priorProper / priorMean / priorStddev        Prior configuration
inputData                                    Collected metric data (JSON string)
analysisResult                               Computed analysis output (JSON string)
status                                       draft / collecting / analyzing / decided
decision / decisionSummary / decisionReason  Final decision (summary = plain-language action, reason = technical rationale)
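
Because guardrailEvents, inputData, and analysisResult are stored as JSON strings, callers have to encode and decode them explicitly. The sketch below shows one way to handle that in Python. Only the field names and status values come from the table above; the event name, variant values, and helper name are hypothetical.

```python
import json

VALID_STATUSES = ("draft", "collecting", "analyzing", "decided")


def decode_json_field(value):
    """Decode a JSON-string column; empty or missing values become None."""
    return json.loads(value) if value else None


# Illustrative record: values are placeholders, not real experiment data.
record = {
    "slug": "chat-cta-v2",
    "primaryMetricEvent": "cta_clicked",             # hypothetical event name
    "guardrailEvents": json.dumps(["chat_opened"]),  # stored as a JSON string
    "controlVariant": "false",                       # hypothetical variant values
    "treatmentVariant": "true",
    "minimumSample": 500,
    "priorProper": False,
    "inputData": None,        # filled in by the data collector
    "analysisResult": None,   # filled in by the analysis script
    "status": "draft",
}

assert record["status"] in VALID_STATUSES
guardrails = decode_json_field(record["guardrailEvents"])
```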

Scripts (Hybrid: TypeScript for I/O, Python for algorithms)

skills/experiment-workspace/scripts/
  db-client.ts           ← HTTP API wrapper for DB read/write (TypeScript)
  db_client.py           ← HTTP API wrapper for DB read/write (Python)
  collect-input.ts       ← data collector placeholder (TypeScript, implement fetchMetricSummary)
  stats_utils.py         ← statistical utilities: GaussianPrior, bayesian_result, srm_check (Python, numpy/scipy)
  analyze-bayesian.py    ← Bayesian A/B analysis (Python, reads/writes DB)
  analyze-bandit.py      ← Thompson Sampling weight computation (Python, reads/writes DB)

TypeScript scripts (data I/O): npx tsx <script>.ts <project-id> <experiment-slug>

Python scripts (algorithms): python <script>.py <project-id> <experiment-slug>

Two experiment types:

  • A/B (default): fixed 50/50 traffic split, one-shot analysis → analyze-bayesian.py
  • Bandit: dynamic traffic reweighting via Thompson Sampling → analyze-bandit.py (requires FeatBit API integration for full automation)
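
To make the Bandit option concrete, here is a minimal Thompson Sampling sketch for proportion metrics. It is not the code in analyze-bandit.py: it assumes a uniform Beta(1, 1) prior per arm and estimates each arm's probability of being best by Monte Carlo, using only the standard library. The function name and the (n, k) input shape are assumptions for illustration.

```python
import random


def thompson_best_arm_probabilities(counts, draws=20000, seed=0):
    """Estimate P(arm is best) by sampling from per-arm Beta posteriors.

    counts maps arm name -> (n, k): users exposed and users who converted.
    """
    rng = random.Random(seed)
    arms = list(counts)
    wins = dict.fromkeys(arms, 0)
    for _ in range(draws):
        # One posterior draw per arm under a uniform Beta(1, 1) prior.
        samples = {
            arm: rng.betavariate(1 + k, 1 + n - k)
            for arm, (n, k) in counts.items()
        }
        wins[max(samples, key=samples.get)] += 1
    return {arm: wins[arm] / draws for arm in arms}


probs = thompson_best_arm_probabilities(
    {"control": (1000, 50), "treatment": (1000, 80)}
)
```

An arm that wins more posterior draws would receive more traffic in the next cycle; how those probabilities map to FeatBit rollout weights is covered in references/analysis-bandit.md.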

All experiment data lives in the shared SQLite database, accessible via the web app's HTTP API (SYNC_API_URL, default http://localhost:3000). No local experiment files needed — the web UI, sandbox agent, and scripts all read/write the same database.


Decision Actions

"First time setup"

No file copying is needed. All scripts run from skills/experiment-workspace/scripts/: the TypeScript scripts with npx tsx, the Python scripts with python. Things to know before the first run:

  1. The web app must be running (provides the HTTP API that scripts use for DB access)
  2. collect-input.ts must be customized with a fetchMetricSummary() implementation for your data source (see references/data-source-guide.md)
  3. Both analysis scripts (analyze-bayesian.py, analyze-bandit.py) work out of the box once inputData exists in the DB

Python with numpy/scipy is required for the analysis scripts. Install once:

pip install numpy scipy

"I want to start an experiment"

  1. Confirm the experiment slug, derived from the flag key, e.g. chat-cta-v2

  2. Ensure the web app is running (scripts need the HTTP API)

  3. Persist the experiment to the database using the project-sync skill's upsert-experiment command (see Persist State section below)

  4. Copy the hypothesis verbatim from the project state read on entry

  5. Confirm observationStart (the start of the observation window). This is today if the flag was just enabled

  6. Set minimumSample (the minimum sample per variant) using the following fallback chain. Do not expose the formula to the user at any step.

    Step 1 — read the hypothesis from project state (loaded on entry):

    • Does it mention a current baseline rate? (e.g. "increase signup rate from 4% to 5%" → p_baseline = 0.04)
    • Does it mention an expected lift that implies a current level? Extract the number and compute ceil(30 / p_baseline)

    Step 2 — infer from metric event name and funnel stage:

    • Re-read the primary metric event name: does it suggest a funnel position?

    • Use these heuristics as a starting estimate:

      Metric type                        Typical baseline range  Suggested floor
      Button click / CTA                 3–10%                   500
      Signup / registration              1–5%                    1,000
      Purchase / checkout                1–3%                    1,500
      Feature engagement (active users)  10–30%                  200
      Error rate / latency (inverse)     1–5%                    1,000

    Step 3 — collect a short baseline sample from the control group (most accurate):

    • If the flag has been live for at least 1–3 days, guide the user to pull control-only data for that period and share it with the agent.
    • Tell the user exactly what numbers are needed:

      "To get an accurate baseline, I need two numbers from your control group for the past few days:

      • n: how many unique users were exposed to the control variant
      • k: how many of those users triggered the '[metric event]' event

      You can get these from FeatBit's experiment results, your database, or your analytics tool."
    • Once the user provides n and k: compute p_baseline = k / n, then set minimumSample to ceil(30 / p_baseline). This overrides any estimate from Steps 1–2.

    Step 4 — ask the user only if Steps 1–3 all fail: "What is the current conversion rate for [metric name]? A rough estimate is fine, e.g. 'about 5%' or 'maybe 1 in 20 users'."

    Step 5 — if no estimate is available from any source:

    • Use 1,000 as a safe conservative default (assumes ~3% baseline)
    • Record the assumption explicitly in the experiment record so it can be revised once real data arrives
  7. Ask the user whether they have prior knowledge about the expected lift for this metric:

    • "Do you have results from a similar past experiment? If so, what was the approximate lift and how uncertain was it?"
    • If the user provides a past μ_rel and se (or a rough range): set priorProper: true, priorMean: <μ_rel>, priorStddev: <se> in the experiment
    • If the user ran a pilot phase (separate experiment window) and has its analysisResult: read μ_rel and se from it and use those as the prior — but only if the pilot data will not be included in the new experiment's inputData
    • If no prior knowledge is available: set priorProper: false (flat prior, the safe default)
  8. Persist state to the database (see Persist State section below)

  9. Tell the user: the next step is to collect data (customize collect-input.ts if needed), then run the analysis
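
The fallback chain in step 6 reduces to one formula plus a default. The sketch below shows the arithmetic for the agent's internal use. The function names are hypothetical, and the rounding before the ceiling is an added guard against floating-point fuzz rather than something the skill specifies.

```python
import math

DEFAULT_MINIMUM_SAMPLE = 1000  # Step 5 fallback, assumes a ~3% baseline


def baseline_from_counts(n, k):
    """Step 3: baseline rate from control-group exposures (n) and conversions (k)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    return k / n


def minimum_sample(p_baseline=None):
    """Return (floor, assumption_note) for the minimumSample field."""
    if p_baseline is None:
        # No estimate from any source: record the assumption for later revision.
        return DEFAULT_MINIMUM_SAMPLE, "default 1,000 (assumed ~3% baseline)"
    if not 0 < p_baseline <= 1:
        raise ValueError("p_baseline must be in (0, 1]")
    # round() absorbs floating-point fuzz before taking the ceiling.
    return math.ceil(round(30 / p_baseline, 9)), None
```

For example, a hypothesis that says "from 4% to 5%" gives a floor of 750, and control counts of n = 400, k = 12 (a 3% baseline) give 1,000.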

The agent does not need to touch any online dashboard. Persisting the experiment record to the database is the equivalent of "creating an experiment".

"I want to check if we have enough data"

  1. Read the experiment from the database and check inputData
    • If inputData is empty, data has not been collected yet. Direct the user to references/data-source-guide.md, or customize and run collect-input.ts
  2. If inputData exists, check the n (total users) per variant against minimumSample
    • You can inspect this from the web UI or by reading the experiment record via the API
  3. If any variant is below the minimum: do not proceed to analysis; wait and re-check later
  4. If every variant is at or above the minimum: proceed to run the analysis
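
The check above can be sketched as follows. The shape of inputData used here, a mapping from variant name to {"n": ..., "k": ...}, is an assumption for illustration; the actual schema is whatever collect-input.ts writes (see references/data-source-guide.md).

```python
import json


def sample_status(input_data_json, minimum_sample):
    """Compare per-variant n against minimumSample.

    Returns (ready, shortfall), where shortfall maps each variant that is
    still below the floor to the number of users it still needs.
    """
    if not input_data_json:
        # Nothing collected yet: not ready, and no per-variant numbers to report.
        return False, {}
    data = json.loads(input_data_json)
    shortfall = {
        variant: minimum_sample - row["n"]
        for variant, row in data.items()
        if row["n"] < minimum_sample
    }
    return not shortfall, shortfall
```

Analysis should only start when the first element of the result is True, meaning every variant has reached the floor.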

"I want to run the analysis"

  1. Confirm inputData exists in the experiment record (read from the database)
  2. If missing: customize and run collect-input.ts or follow references/data-source-guide.md to populate it
  3. Run:
    python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <experiment-slug>
    
  4. The script reads inputData from the DB, computes results, and writes analysisResult back to the DB
  5. Key outputs to check before handing off (read analysisResult from the experiment record):
    • P(win) ≥ 95% → strong signal; ≤ 5% → likely harmful; 20–80% → inconclusive
    • risk[trt] — if P(win) is near a boundary, this tells you how costly a wrong call is
    • SRM check — if χ² p-value < 0.01, stop and investigate traffic split before interpreting metrics
  6. Hand off to evidence-analysis with the experiment's analysisResult and definition fields
  7. Persist experiment status to the database (see Persist State section below)
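
The checks in step 5 can be expressed as two small helpers. This is a sketch, not the srm_check in stats_utils.py: it covers only two arms with an expected 50/50 split and uses the closed-form chi-square tail for one degree of freedom so that it needs only the standard library. The label for the 5–20% and 80–95% bands is an assumption, since the thresholds above leave those ranges unnamed.

```python
import math


def classify_p_win(p_win):
    """Map P(win) to the signal bands listed in step 5."""
    if p_win >= 0.95:
        return "strong signal"
    if p_win <= 0.05:
        return "likely harmful"
    if 0.20 <= p_win <= 0.80:
        return "inconclusive"
    # Assumption: bands not named above are treated as boundary cases.
    return "near a boundary: check risk[trt]"


def srm_p_value(n_control, n_treatment):
    """Chi-square p-value for a two-arm split against an expected 50/50."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def srm_ok(n_control, n_treatment, alpha=0.01):
    """True when the traffic split shows no sample ratio mismatch at alpha."""
    return srm_p_value(n_control, n_treatment) >= alpha
```

A split of 5,000 versus 5,400 users fails the check, so the metrics should not be interpreted until the traffic split is understood.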

For the full list of metric types and usage patterns (proportion, continuous, inverse, multiple arms, informative prior), see references/analysis-bayesian.md.
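
To illustrate what an informative prior does, the sketch below combines a Gaussian prior on the relative lift with an observed estimate μ_rel and its standard error se, then reads P(win) from the posterior. It is a textbook normal-normal update under the assumption that higher is better; whether analyze-bayesian.py computes exactly this is not confirmed here, and the function name is hypothetical.

```python
import math


def posterior_rel_lift(mu_rel, se, prior_proper=False, prior_mean=0.0, prior_stddev=1.0):
    """Return (posterior mean, posterior sd, P(win)) for the relative lift."""
    if se <= 0:
        raise ValueError("se must be positive")
    if not prior_proper:
        # Flat prior: the posterior is just the observed estimate.
        post_mean, post_sd = mu_rel, se
    else:
        # Precision-weighted average of prior and data.
        w_prior = 1.0 / prior_stddev ** 2
        w_data = 1.0 / se ** 2
        post_var = 1.0 / (w_prior + w_data)
        post_mean = post_var * (w_prior * prior_mean + w_data * mu_rel)
        post_sd = math.sqrt(post_var)
    # P(win) = P(lift > 0) under the normal posterior.
    p_win = 0.5 * (1.0 + math.erf(post_mean / (post_sd * math.sqrt(2))))
    return post_mean, post_sd, p_win
```

With a flat prior, an observed lift of 0.10 with se 0.05 gives P(win) of about 97.7%; a skeptical prior centered on zero with the same spread pulls the posterior mean to 0.05.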

Multi-arm threshold reminder: if the experiment has more than 2 variants (A/B/C/n), raise the P(win) threshold to compensate for multiple comparisons:

Arms compared  Suggested threshold
2              95%
3              98.3%
5              99%

See references/analysis-bayesian.md → "On Family-wise Error" for details.

"I want to update the data and re-run"

  1. Re-run collect-input.ts to pull fresh counts — it overwrites inputData in the DB
  2. Re-run:
    python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <experiment-slug>
    
  3. analysisResult is overwritten with fresh numbers — both scripts are idempotent
  4. Persist updated experiment status to the database (see Persist State section below)

"I want to run a Bandit experiment"

A bandit experiment replaces fixed 50/50 traffic with dynamic reweighting. It requires a continuous cycle of data collection → weight computation → FeatBit flag update.

Setup (same as A/B — uses the same experiment record in the DB):

  1. Create the experiment record following the standard workflow (see "I want to start an experiment")
  2. Choose primaryMetricEvent — bandit optimizes this single metric
  3. Note: bandit works best for proportion metrics (conversion rate, CTR)

Each reweighting cycle (recommended every 6–24 hours):

  1. Collect fresh data → update inputData in DB
  2. Run:
    python skills/experiment-workspace/scripts/analyze-bandit.py <project-id> <experiment-slug>
    
  3. Read analysisResult from the experiment record:
    • If enough_units: false → burn-in not complete, do not apply weights yet (need ≥ 100 users per arm)
    • If srm_p_value < 0.01 → SRM detected, investigate traffic split before applying weights
    • Otherwise → apply bandit_weights to the FeatBit feature flag via API
  4. Update FeatBit feature flag rollout weights using the FeatBit API (see references/analysis-bandit.md for the conversion formula)

Stopping condition: when best_arm_probabilities[arm] >= 0.95 for any arm, stop reweighting.
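
The per-cycle decision and the stopping condition can be sketched as one function over analysisResult. The keys enough_units, srm_p_value, and best_arm_probabilities are the ones named above; the returned action labels and the choice to check the stopping condition only after the burn-in and SRM gates are assumptions.

```python
def next_bandit_action(result, stop_threshold=0.95, srm_alpha=0.01):
    """Decide what to do after one analyze-bandit.py run.

    result is the decoded analysisResult dict from the experiment record.
    """
    if not result.get("enough_units", False):
        return "wait"              # burn-in not complete, keep current weights
    if result["srm_p_value"] < srm_alpha:
        return "investigate_srm"   # traffic split looks wrong, do not reweight
    if max(result["best_arm_probabilities"].values()) >= stop_threshold:
        return "stop"              # one arm is a clear winner, stop reweighting
    return "apply_weights"         # push bandit_weights to the feature flag
```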

After stopping — transition to final analysis:

  1. Set winning arm to 100% in FeatBit
  2. Run final Bayesian analysis on full dataset:
    python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <experiment-slug>
    
  3. Hand off to evidence-analysis with the experiment record containing:
    • analysisResult (final Bayesian result — note: δ estimate may have wider uncertainty due to unequal traffic)
    • Previous bandit analysisResult (final best_arm_probabilities — most reliable decision signal)
    • Experiment definition fields from the DB

For full details on output interpretation and FeatBit API integration, see references/analysis-bandit.md.

"I want to track long-term effects after launch"

A/B and Bandit experiments measure short-term behavior. Transient effects — novelty, seasonal spikes, event-driven traffic — can inflate results during the experiment window. A holdout group validates whether the effect persists over months.

  1. After full launch, adjust the feature flag traffic split to 95/5 — keep 5% of users on the old variant
  2. Record the holdout plan in the experiment record (e.g. in a note or dedicated field):
    • holdout percentage: 5%
    • check_at_days: [30, 60, 90]
    • launched_at: <launch date>
  3. At each checkpoint (day 30, 60, 90):
    • Collect fresh data for both groups → update inputData in the DB
    • Run analysis with a time-stamped slug:
      python skills/experiment-workspace/scripts/analyze-bayesian.py <project-id> <slug>-holdout-30d
      
  4. Compare P(win) and rel Δ across checkpoints — look for stability, decay, or growth
  5. When holdout analysis is complete, remove the holdout split from the feature flag

For full interpretation guidance (three patterns: holds / decays / improves), see references/analysis-holdout.md.
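
The checkpoint dates and slugs for a holdout plan follow directly from launched_at and check_at_days. A small sketch, assuming the <slug>-holdout-<N>d naming shown above; the function name and the example launch date are hypothetical.

```python
from datetime import date, timedelta


def holdout_schedule(slug, launched_at, check_at_days=(30, 60, 90)):
    """Return (checkpoint slug, due date) pairs for a holdout plan."""
    return [
        (f"{slug}-holdout-{days}d", launched_at + timedelta(days=days))
        for days in check_at_days
    ]


# Hypothetical launch date, for illustration only.
schedule = holdout_schedule("chat-cta-v2", date(2026, 1, 1))
```

Each pair gives the slug to pass to analyze-bayesian.py and the date on which fresh data should be collected for that checkpoint.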

"I want to close the experiment"

  1. Set experiment status to decided and record observationEnd, decision, decisionSummary, decisionReason in the DB
  2. Persist experiment closure to the database (see Persist State section below)
  3. Hand off to learning-capture

Operating Rules

  • The experiment record in the database is the contract. Do not change primaryMetricEvent, controlVariant, or treatmentVariant after data collection starts — it would invalidate the data already collected.
  • observationStart must match when the flag was actually enabled. Do not backfill earlier — pre-flag data is not part of the experiment.
  • Verify inputData sanity before running analysis: k ≤ n for every row, variant keys match the experiment record, no zero n values.
  • Do not interpret results by eyeballing inputData. Always run analyze-bayesian.py and read analysisResult.
  • If the SRM check flags an imbalance (χ² p < 0.01), do not proceed to evidence-analysis — the data is unreliable.
  • "The script says 97% confidence" does not mean "ship it." That is evidence-analysis's job.
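
The inputData sanity rule can be sketched as a validator that returns a list of problems. As in other examples, the assumed shape is a mapping from variant name to {"n": ..., "k": ...}; the real schema is defined by the data collector, and the function name is hypothetical.

```python
def input_data_problems(input_data, control_variant, treatment_variant):
    """Return human-readable problems; an empty list means the data looks sane."""
    problems = []
    expected = {control_variant, treatment_variant}
    if set(input_data) != expected:
        problems.append(
            f"variant keys {sorted(input_data)} do not match {sorted(expected)}"
        )
    for variant, row in input_data.items():
        n, k = row["n"], row["k"]
        if n == 0:
            problems.append(f"{variant}: n is zero")
        if k > n:
            problems.append(f"{variant}: k ({k}) exceeds n ({n})")
    return problems
```

Analysis should only run when the returned list is empty.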

Persist State

After completing work, use the project-sync skill to persist state to the database. The specific commands depend on the action performed:

Starting an experiment:

  1. upsert-experiment — save all definition fields:
    • --status draft
    • --hypothesis "..." — verbatim from project state
    • --primaryMetricEvent "..."
    • --guardrailEvents "..." — JSON array as string, e.g. '["chat_opened"]'
    • --controlVariant "..." and --treatmentVariant "..."
    • --minimumSample <N>
    • --observationStart "YYYY-MM-DD"
    • --priorProper false (or true if informative prior was chosen)
    • --priorMean <float> and --priorStddev <float> (only when priorProper true)
  2. update-state — save --lastAction "Created experiment <slug>"
  3. set-stage — set to measuring
  4. add-activity — e.g. --type stage_update --title "Experiment <slug> created"
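
When the definition fields are held in a dict, the flags for upsert-experiment can be assembled mechanically. This sketch only builds the flag list from the names shown above. How the project-sync command is actually invoked (executable, project id, slug arguments) is not shown in this document, so that part is left out, and the helper name is hypothetical.

```python
import json


def upsert_experiment_flags(fields):
    """Turn a dict of definition fields into a flat list of CLI flags."""
    flags = []
    for name, value in fields.items():
        # priorMean / priorStddev are only sent when priorProper is true.
        if name in ("priorMean", "priorStddev") and not fields.get("priorProper"):
            continue
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, (list, dict)):
            text = json.dumps(value)  # e.g. guardrailEvents as a JSON array string
        else:
            text = str(value)
        flags += [f"--{name}", text]
    return flags
```

Passing the result as an argument list, rather than joining it into a shell string, avoids quoting problems with the JSON values.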

Running / re-running analysis:

  1. upsert-experiment — save --status analyzing --inputData "<JSON>" --analysisResult "<JSON>" (scripts do this automatically)

Closing an experiment:

  1. upsert-experiment — save --status decided --observationEnd "YYYY-MM-DD"
  2. update-state — save --lastAction "Experiment <slug> closed"

Handoff Chain

measurement-design
  → experiment-workspace   ← this skill
      → evidence-analysis
          → learning-capture

When handing off to evidence-analysis, pass the experiment's analysisResult and definition fields (hypothesis, primaryMetricEvent, variants, etc.) so the decision can be tied back to the hypothesis.


Reference Files
