GO Prompt Sensitivity Audit

v1.0 — March 2026

The Problem

We use LLMs to classify customer signals — detecting complaints, identifying vulnerability, scoring severity. It works, but we don't know how fragile it is.

Small prompt changes sometimes flip classifications: reword the system instruction, reorder the few-shot examples, change "classify" to "determine" — and suddenly edge cases go the other way. We need to understand how sensitive our classifier is to prompt variations, find the failure modes, and build a more robust prompt.

Our job: systematically test prompt variations against a labeled dataset, measure what breaks, understand why, and produce a prompt that's demonstrably more robust than what we started with.

How This Works

This is a pair exercise. The agent's job is to be a thinking partner: ask questions, provide scaffolding, help debug, and challenge assumptions. It is not there to write the solution for you.

Work through the phases in order
At each phase, talk through your approach before coding
The agent will push back if something doesn't make sense
If you're stuck, say so — progressively more specific hints are available
If you want the agent to just write something, ask explicitly

Setup

pip install openai anthropic   # at least one
pip install pandas              # for analysis

You'll need an API key for the provider you use, set in the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable.
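
Before running anything, it can help to confirm which key is actually visible to Python. The snippet below is a minimal sketch: the variable names come from the setup notes above, while the provider label `anthropic` is an assumption modeled on the `--provider openai` value used later, and `available_providers` is a hypothetical helper, not part of `scripts/helpers.py`.

```python
import os

# Environment variable names are taken from the setup notes.
# The "anthropic" provider label is an assumption, not confirmed by the script.
KEY_TO_PROVIDER = {
    "OPENAI_API_KEY": "openai",
    "ANTHROPIC_API_KEY": "anthropic",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [provider for key, provider in KEY_TO_PROVIDER.items() if env.get(key)]


if __name__ == "__main__":
    found = available_providers()
    if found:
        print("Usable providers:", ", ".join(found))
    else:
        print("No API key found; set OPENAI_API_KEY or ANTHROPIC_API_KEY.")
```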

Files

| File | Purpose |
|---|---|
| `scripts/baseline_classify.py` | Starting point: runs baseline prompt against all signals |
| `scripts/helpers.py` | Data loading, evaluation, comparison utilities (provided) |
| `references/signals.json` | 30 customer signals (text + channel, no labels) |
| `references/signals_labeled.json` | Same signals with ground truth labels |

Phase 0: Establish Baseline

Run the baseline classifier:

python scripts/baseline_classify.py --provider openai --model gpt-4o-mini

Look at the results. Before doing anything else:

Which signals did the baseline get wrong? Why?
Are the errors random, or is there a pattern?
Look at the baseline prompt in scripts/baseline_classify.py — what's weak about it?

Talk through your observations before moving on.
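
To see the errors side by side, one option is to split them into false positives and false negatives. The sketch below is standalone and makes assumptions: the field names `id` and `is_complaint`, and the shape of `predictions` (a dict from signal id to a boolean), are hypothetical and may not match `references/signals_labeled.json` or the script's output. The three signals are made-up toy data, not taken from the dataset.

```python
def list_errors(labeled, predictions):
    """Split misclassified signals into false positives and false negatives."""
    false_pos, false_neg = [], []
    for signal in labeled:
        truth = signal["is_complaint"]        # hypothetical label field
        pred = predictions[signal["id"]]      # hypothetical id field
        if pred and not truth:
            false_pos.append(signal)
        elif truth and not pred:
            false_neg.append(signal)
    return false_pos, false_neg


# Toy data for illustration only.
toy_labeled = [
    {"id": "s1", "text": "This is unacceptable.", "channel": "email", "is_complaint": True},
    {"id": "s2", "text": "How do I reset my PIN?", "channel": "chat", "is_complaint": False},
    {"id": "s3", "text": "Not ideal, but thanks anyway.", "channel": "phone", "is_complaint": True},
]
toy_predictions = {"s1": True, "s2": True, "s3": False}

fp, fn = list_errors(toy_labeled, toy_predictions)
print("False positives:", [s["id"] for s in fp])
print("False negatives:", [s["id"] for s in fn])
```

Reading the text of each error next to its direction (missed complaint vs. false alarm) is usually a quicker route to spotting a pattern than looking at an accuracy figure alone.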

Phase 1: Hypothesis-Driven Variation Design

Goal: Design a set of prompt variations that test specific hypotheses about sensitivity.

Don't just randomly change words. Each variation should test a specific hypothesis. Some starting points to consider — but come up with your own too:

Instruction framing:

Does it matter if you say "classify" vs "determine" vs "analyze"?
What if the system prompt uses an authoritative tone vs a neutral one?
Does adding "think step by step" change anything?

Few-shot examples:

What happens when you add 2-3 examples to the prompt?
Does the order of examples matter? (complaint first vs non-complaint first)
What if all examples are clear-cut vs including an edge case?

Definition sensitivity:

The baseline defines a complaint loosely. What if we use the formal definition from the FCA (the UK Financial Conduct Authority)?
What if we explicitly list what is NOT a complaint?

Output format:

JSON vs free text — does structured output change classification?
What if you ask for reasoning before the classification vs after?

For each variation, write down:

What you're testing (the hypothesis)
What you expect to happen
The modified prompt

Before coding anything: talk through at least 3 hypotheses. The agent should push back on weak hypotheses and ask about the reasoning.
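
One way to keep the three items above together for every variation is a small record type. This is a sketch, not a required structure: `PromptVariant`, its field names, and the example prompt text are all hypothetical. The `changed_elements` field is there to make it obvious when a variant changes more than one thing relative to the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVariant:
    name: str
    hypothesis: str              # what you're testing
    expected: str                # what you expect to happen
    prompt: str                  # the modified prompt
    changed_elements: tuple = () # what differs from the baseline

    def isolates_single_variable(self):
        """True when exactly one prompt element differs from the baseline."""
        return len(self.changed_elements) == 1


# Illustrative example; the prompt wording is invented for this sketch.
verb_variant = PromptVariant(
    name="verb_determine",
    hypothesis="Replacing 'classify' with 'determine' flips borderline signals.",
    expected="Clear-cut signals unchanged; a few polite complaints flip.",
    prompt="Determine whether the customer message below is a complaint.",
    changed_elements=("instruction_verb",),
)

print(verb_variant.name, verb_variant.isolates_single_variable())
```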

Phase 2: Run the Experiments

Goal: Execute your prompt variants and collect results.

The baseline script supports a --prompt flag for custom prompt templates:

python scripts/baseline_classify.py --provider openai --model gpt-4o-mini --prompt my_variant.txt

Or modify the script to run multiple variants in sequence. The helpers have comparison tools:

from helpers import compare_runs, print_comparison_summary
# compare_runs(variant_a_predictions, variant_b_predictions)
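
To run several variants in sequence without editing the script, a small driver can call it once per prompt file. The sketch below uses only the `--provider`, `--model`, and `--prompt` flags shown above; `build_command` and `run_variants` are hypothetical names, and how the script reports or stores its results is not assumed here, so collecting the output is left to you.

```python
import subprocess
import sys


def build_command(prompt_path, provider="openai", model="gpt-4o-mini"):
    """Build the argument list for one run of the baseline script."""
    return [
        sys.executable, "scripts/baseline_classify.py",
        "--provider", provider,
        "--model", model,
        "--prompt", str(prompt_path),
    ]


def run_variants(prompt_paths, provider="openai", model="gpt-4o-mini"):
    """Run each prompt file in turn, stopping at the first failing run."""
    for path in prompt_paths:
        print(f"Running variant: {path}")
        subprocess.run(build_command(path, provider, model), check=True)
```

Calling `run_variants(["variant_a.txt", "variant_b.txt"])` from the repository root would run both files with the same provider and model, which keeps those two factors fixed across variants.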

Things to think about while running:

Are you controlling for everything except the one variable you're testing?
Temperature is set to 0 — is that sufficient for determinism?
How many runs would you need to be confident a difference is real vs noise?
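
For the last question, one possible check is an exact McNemar test on paired predictions: it looks only at signals where exactly one of two variants is correct and asks whether that split could plausibly be chance. This is a sketch of one standard technique, not something the provided helpers are known to implement; `mcnemar_exact` is a hypothetical name and the inputs are assumed to be three equal-length lists in the same signal order.

```python
from math import comb


def mcnemar_exact(labels, preds_a, preds_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (a_only, b_only, p_value): a_only counts signals only variant A
    classified correctly, b_only counts signals only variant B got right.
    """
    a_only = b_only = 0
    for truth, pa, pb in zip(labels, preds_a, preds_b):
        a_ok, b_ok = pa == truth, pb == truth
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # no disagreements, nothing to test
    k = min(a_only, b_only)
    # Two-sided binomial tail under the null that each side is equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)
```

If five signals disagree and all five favor the same variant, the two-sided p-value is 0.0625, so on a 30-signal dataset a handful of flips is weak evidence on its own. This test compares two variants; it says nothing about run-to-run variation of a single prompt, which needs repeated runs of that same prompt.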

Before moving on: collect results for at least 3 variants plus the baseline.

Phase 3: Analyze the Failure Modes

Goal: Understand why certain signals are sensitive and others aren't.

This is the interesting part. Look across all your runs:

Which signals flip classification depending on the prompt? Those are your sensitive signals.
Which signals are classified correctly every time? Those are robust.
Is there a pattern? (e.g., polite complaints are sensitive, explicit complaints are robust)

Use helpers.evaluate_binary() and helpers.compare_runs() to quantify.
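
Alongside the provided helpers, a standalone pass over all runs can sort signals into the groups described above. The sketch below assumes a hypothetical data shape: `runs` maps a variant name to a dict of signal id to boolean prediction, and `labels` maps signal id to the ground truth. It also separates out a third group, signals that every variant gets wrong, since stable is not the same as correct.

```python
def split_by_stability(runs, labels):
    """Group signal ids into sensitive, stable-correct, and stable-wrong."""
    sensitive, stable_correct, stable_wrong = {}, [], []
    for sid, truth in labels.items():
        outcomes = {name: preds[sid] for name, preds in runs.items()}
        if len(set(outcomes.values())) > 1:
            sensitive[sid] = outcomes  # keep per-variant outcomes for inspection
        elif next(iter(outcomes.values())) == truth:
            stable_correct.append(sid)
        else:
            stable_wrong.append(sid)
    return sensitive, stable_correct, stable_wrong


# Toy data for illustration only.
toy_labels = {"s1": True, "s2": False, "s3": True}
toy_runs = {
    "baseline": {"s1": True, "s2": False, "s3": False},
    "formal_definition": {"s1": True, "s2": True, "s3": False},
}

sensitive, stable_correct, stable_wrong = split_by_stability(toy_runs, toy_labels)
print("Sensitive:", sensitive)
print("Stable and correct:", stable_correct)
print("Stable but wrong:", stable_wrong)
```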

Questions to investigate:

Do the sensitive signals share characteristics? (length, tone, ambiguity, channel?)
For few-shot variants: does the choice of examples have more impact than the instruction wording?
If you asked for reasoning-first (chain of thought), did the reasoning quality predict the classification accuracy?
Are there signals where the LLM is right and the ground truth label is arguably wrong?
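
For the first question, a quick tabulation can show whether sensitive signals cluster by channel or differ in length. This is a minimal sketch: the `text` and `channel` fields follow the file description above, while the `id` field and both function names are assumptions. Tone and ambiguity are not something this can measure; those still need reading.

```python
from collections import defaultdict
from statistics import median


def sensitivity_by_channel(signals, sensitive_ids):
    """Return {channel: (sensitive_count, total_count)}."""
    counts = defaultdict(lambda: [0, 0])
    for signal in signals:
        counts[signal["channel"]][1] += 1
        if signal["id"] in sensitive_ids:
            counts[signal["channel"]][0] += 1
    return {channel: tuple(pair) for channel, pair in counts.items()}


def median_length(signals, ids):
    """Median character length of the signals whose id is in ids."""
    lengths = [len(s["text"]) for s in signals if s["id"] in ids]
    return median(lengths) if lengths else None


# Toy data for illustration only.
toy_signals = [
    {"id": "s1", "text": "abcd", "channel": "email"},
    {"id": "s2", "text": "ab", "channel": "email"},
    {"id": "s3", "text": "abcdef", "channel": "chat"},
]
print(sensitivity_by_channel(toy_signals, {"s2", "s3"}))
print(median_length(toy_signals, {"s2", "s3"}))
```

With 30 signals, per-channel counts will be small, so treat any apparent pattern as a lead to investigate rather than a result.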

Before moving on: you should be able to say "these types of signals are fragile because X, and these prompt elements have the most impact on accuracy."

Phase 4: Build a Robust Prompt

Goal: Combine what you learned into a single prompt that's demonstrably better than the baseline.

This isn't about making the biggest prompt — it's about making the most robust one. Consider:

Which variations improved accuracy on the sensitive signals without hurting the robust ones?
Can you combine techniques (e.g., better definition + few-shot + chain of thought)?
What's the trade-off between prompt length/cost and robustness?

Evaluate your final prompt against the full dataset. Compare to baseline:

Overall accuracy improvement
Improvement specifically on the previously-sensitive signals
Any regressions?
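
The three comparisons above can be computed in one pass. The sketch below is standalone and assumes a hypothetical data shape (dicts from signal id to boolean for labels and predictions); it does not reproduce `helpers.evaluate_binary()`, whose interface is not shown here. A regression is counted as any signal the baseline got right and the final prompt gets wrong.

```python
def accuracy(labels, preds, ids=None):
    """Fraction of correct predictions over ids (all labeled signals by default)."""
    ids = list(labels) if ids is None else list(ids)
    if not ids:
        return None
    return sum(preds[i] == labels[i] for i in ids) / len(ids)


def compare_to_baseline(labels, baseline, final, sensitive_ids):
    """Return (baseline, final) accuracy pairs plus the list of regressions."""
    regressions = [
        i for i in labels
        if baseline[i] == labels[i] and final[i] != labels[i]
    ]
    return {
        "overall": (accuracy(labels, baseline), accuracy(labels, final)),
        "sensitive": (
            accuracy(labels, baseline, sensitive_ids),
            accuracy(labels, final, sensitive_ids),
        ),
        "regressions": regressions,
    }


# Toy data for illustration only.
toy_labels = {"a": True, "b": False, "c": True, "d": False}
toy_baseline = {"a": True, "b": True, "c": False, "d": False}
toy_final = {"a": True, "b": False, "c": True, "d": True}

print(compare_to_baseline(toy_labels, toy_baseline, toy_final, ["b", "c"]))
```

In the toy data the final prompt fixes both sensitive signals but breaks one that was previously correct, which is the kind of trade an overall accuracy number alone would hide.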

Write up outputs/findings.md covering:

Which prompt elements had the biggest impact and why
Your final prompt and why you chose each element
What's still fragile — what class of signals would you want to test next?
How would you set up ongoing sensitivity monitoring in production?

Agent Guidance

Critical: do not answer your own questions. When the walkthrough asks a question, pose it to the user and then stop and wait. Do not offer your interpretation. Just ask and wait.

Ask, then stop. No "I think the answer is..." No "Here's my reading..."
When they run experiments and see results, ask them to interpret first.
When they propose a hypothesis, ask them to predict the outcome before running it.
Push back on hypotheses that aren't specific enough ("I want to try a different prompt" → "What specifically are you changing and what do you expect to happen?").
If they design a variant that doesn't isolate a single variable, point that out.
Only provide analysis after the user has given theirs.
If they're stuck, give one hint — not the full answer.
Keep track of which phase they're in — don't let phases get skipped.
