# Jury System Skill

Specialized knowledge for running synthetic user validation using the Condorcet Jury Theorem.

## When to Use

- Validating that research findings resonate broadly
- Testing that PRD user stories match users' mental models
- Evaluating prototype usability
- Checking graduation criteria between phases

## Core Principle

If each synthetic persona judges whether a feature matches their needs with better than 50% accuracy, and the personas judge roughly independently, then aggregating 100-500+ votes drives the probability of a correct majority verdict toward certainty (the Condorcet Jury Theorem).
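
A minimal sketch of the underlying arithmetic, for intuition (illustrative only, not part of the skill's scripts):

```python
# Condorcet Jury Theorem: probability that a simple majority of n independent
# jurors reaches the correct verdict, when each juror is correct with probability p.
from math import comb

def majority_correct(n: int, p: float) -> float:
    """P(more than half of n jurors vote correctly); use odd n to avoid ties."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

print(majority_correct(1, 0.6))    # 0.6: one juror is only as good as p
print(majority_correct(101, 0.6))  # ~0.98: aggregation sharpens the verdict
print(majority_correct(501, 0.6))  # ~1.0: near-certain collective judgment
```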

## Stratified Sampling

| Dimension | Distribution |
| --- | --- |
| Role | Sales Rep: 40%, Sales Leader: 25%, CSM: 20%, RevOps: 15% |
| Tech Proficiency | Novice: 25%, Intermediate: 50%, Advanced: 25% |
| AI Adoption | Skeptic: 15% (min), Curious: 40%, Early Adopter: 35%, Power User: 10% |

**Critical:** Always include at least 15% skeptics; they catch issues optimists miss.
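
A sketch of exact quota allocation for these strata (the dimension names mirror the table above; the persona schema the scripts actually use may differ):

```python
# Exact quota allocation per dimension; leftover seats from rounding go to
# the largest stratum so counts always sum to n.
QUOTAS = {
    "role": {"Sales Rep": 0.40, "Sales Leader": 0.25, "CSM": 0.20, "RevOps": 0.15},
    "tech_proficiency": {"Novice": 0.25, "Intermediate": 0.50, "Advanced": 0.25},
    "ai_adoption": {"Skeptic": 0.15, "Curious": 0.40, "Early Adopter": 0.35, "Power User": 0.10},
}

def allocate(n: int, dist: dict[str, float]) -> dict[str, int]:
    counts = {stratum: int(n * share) for stratum, share in dist.items()}
    counts[max(dist, key=dist.get)] += n - sum(counts.values())
    return counts

print(allocate(200, QUOTAS["ai_adoption"]))
# {'Skeptic': 30, 'Curious': 80, 'Early Adopter': 70, 'Power User': 20}
```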

## Validation by Phase

### Research Validation

- Sample: 100-200 personas
- Pass: >60% of personas rate resonance 4+

### PRD Validation

- Sample: 200-300 personas
- Pass: >70% of personas rate relevance 4+

### Prototype Evaluation

- Sample: 300-500 personas
- Pass: >70% combined pass rate
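
These gates can be kept as plain configuration; a sketch (the field names are assumptions, not a format the existing scripts consume):

```python
# Phase gates from the subsections above, expressed as data.
PHASE_GATES = {
    "research":  {"sample_range": (100, 200), "pass_rate": 0.60, "metric": "resonance_score"},
    "prd":       {"sample_range": (200, 300), "pass_rate": 0.70, "metric": "relevance_score"},
    "prototype": {"sample_range": (300, 500), "pass_rate": 0.70, "metric": "combined_pass"},
}

def phase_passes(phase: str, observed_rate: float) -> bool:
    """True if the observed pass rate clears the phase's bar."""
    return observed_rate > PHASE_GATES[phase]["pass_rate"]
```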

## Aggregation Rules

| Verdict | Threshold |
| --- | --- |
| Validated | >60% rate 4+ |
| Contested | 40-60% rate 4+ |
| Rejected | <40% rate 4+ |
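
As code, the mapping is direct; a minimal sketch mirroring the table:

```python
def verdict(rate_4_plus: float) -> str:
    """Map the share of jurors scoring 4+ to a verdict per the thresholds above."""
    if rate_4_plus > 0.60:
        return "validated"
    if rate_4_plus >= 0.40:
        return "contested"
    return "rejected"

assert verdict(0.72) == "validated"
assert verdict(0.55) == "contested"
assert verdict(0.31) == "rejected"
```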

## Evaluation Prompt Templates

### Research Validation Prompt

```
You ARE {persona.name}, a {persona.role} at a {persona.company_size} company.

Your context:
- Tech comfort: {persona.tech_literacy}
- AI attitude: {persona.ai_adoption_stage}
- Your primary pain: {persona.primary_pain}

A product team has identified this pain point from customer research:

PAIN POINT: {extracted_pain_point}
SUPPORTING QUOTE: "{supporting_quote}"

As yourself, respond in JSON:

{
  "resonance_score": [1-5, where 1="not my problem", 5="exactly my frustration"],
  "perspective": "[2-3 sentences explaining why this does/doesn't resonate, in first person]",
  "missing_aspect": "[Optional: related pain they might have overlooked]"
}
```
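
Every template asks for JSON, so a thin parsing layer helps; a minimal sketch that treats malformed replies as discarded votes (pairing naturally with the self-consistency filter below):

```python
import json

def parse_vote(raw: str) -> dict | None:
    """Parse a juror's JSON reply; malformed output becomes a discarded vote."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None
```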

### PRD User Story Validation Prompt

```
You ARE {persona.name}. A product team proposes this user story:

USER STORY:
As a {story.persona}, I want to {story.action} so that {story.benefit}.

ACCEPTANCE CRITERIA:
{story.criteria}

Evaluate it as yourself and respond in JSON:

{
  "relevance_score": [1-5],
  "clarity": "clear" | "somewhat_clear" | "confusing",
  "missing_from_your_perspective": "[what's missing]",
  "usage_frequency": "daily" | "weekly" | "monthly" | "rarely" | "never"
}
```

### Prototype Evaluation Prompt

```
You ARE {persona.name}, a {persona.role} at a {persona.company_size} company.

YOUR CONTEXT:
- Tech comfort: {persona.tech_literacy}
- AI trust: {persona.trust_in_ai}
- Patience for learning: {persona.patience_for_learning}

SCENARIO: {scenario.description}
YOUR TASK: {scenario.task.primary_goal}

PROTOTYPE: {prototype_description}

THE COMPLETE EXPERIENCE JOURNEY:
1. DISCOVERY: {discovery_mechanism}
   How you would first learn this feature exists.
2. ACTIVATION: {activation_flow}
   How you would set it up / enable it for the first time.
3. USAGE: {usage_description}
   What your first interaction looks like.
4. ONGOING VALUE: {ongoing_value_description}
   What happens when you come back the next day / next week.
5. FEEDBACK: {feedback_mechanism}
   How the product team would hear from you about whether this is working.

Evaluate this prototype AND the full experience in JSON:

{
  "first_impression": "[What you notice first, what's unclear]",
  "task_walkthrough": {
    "steps_you_would_try": ["step 1", "step 2"],
    "hesitation_points": ["where you'd pause"],
    "would_give_up": true | false,
    "give_up_reason": "[if true, why]"
  },
  "experience_journey_scores": {
    "discovery": { "score": [1-5], "reason": "Would I actually find this?" },
    "activation": { "score": [1-5], "reason": "Could I set this up alone?" },
    "usage": { "score": [1-5], "reason": "Does the first interaction make sense?" },
    "ongoing_value": { "score": [1-5], "reason": "Would I come back to this?" },
    "feedback_loop": { "score": [1-5], "reason": "Would I bother giving feedback?" },
    "experience_coherence": { "score": [1-5], "reason": "Does the full journey feel connected?" }
  },
  "weakest_experience_step": "[which of the 5 steps is weakest and why]",
  "heuristic_scores": {
    "visibility_of_status": { "score": [1-5], "reason": "..." },
    "match_with_expectations": { "score": [1-5], "reason": "..." },
    "user_control": { "score": [1-5], "reason": "..." },
    "consistency": { "score": [1-5], "reason": "..." },
    "error_prevention": { "score": [1-5], "reason": "..." }
  },
  "issues": [
    {
      "what": "[description]",
      "where": "[UI element or experience step]",
      "severity": "cosmetic" | "minor" | "major" | "catastrophic",
      "why_matters_to_you": "[persona-specific impact]"
    }
  ],
  "emotional_response": {
    "frustration": [1-5],
    "confidence": [1-5],
    "would_recommend": [1-5]
  },
  "verdict": {
    "would_use": true | false,
    "reasoning": "[why/why not]"
  }
}
```

## Self-Consistency Filter

Run each evaluation 3 times at temperature 0.7. Count a vote only when at least 2 of the 3 runs agree; discard inconsistent responses.
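
A minimal sketch of the filter (`evaluate` is a hypothetical callable that runs one persona's evaluation and returns its vote):

```python
from collections import Counter
from collections.abc import Callable

def consistent_vote(evaluate: Callable[[], str], runs: int = 3) -> str | None:
    """Run one persona's evaluation `runs` times; keep the vote only if a majority agree."""
    tally = Counter(evaluate() for _ in range(runs))
    vote, count = tally.most_common(1)[0]
    return vote if count * 2 > runs else None  # 2/3 and 3/3 pass; 1/1/1 is discarded
```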

## Model Selection

| Operation | Model | Rationale |
| --- | --- | --- |
| Persona Generation | Claude Haiku | Cost-effective for volume |
| Research Validation | Claude Haiku | Simple resonance scoring |
| PRD Validation | Claude Haiku | Structured output |
| Prototype Evaluation | Claude Haiku | Volume of evaluations |
| Synthesis/Aggregation | Claude Sonnet | Quality of final insights |

**Temperature settings:**

- Persona generation: 0.9 (maximize diversity)
- Evaluation: 0.7 (balanced for self-consistency)
- Synthesis: 0.3 (consistent, coherent output)
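
Applied via the Anthropic Python SDK, these settings would look roughly like this (a sketch; the model ID is a placeholder to be chosen per the table above):

```python
import anthropic

TEMPERATURE = {"persona_generation": 0.9, "evaluation": 0.7, "synthesis": 0.3}

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_evaluation(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model ID
        max_tokens=1024,
        temperature=TEMPERATURE["evaluation"],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```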

## Cost Estimation

| Phase | Sample Size | Estimated Cost |
| --- | --- | --- |
| Research Validation | 200 personas × 5 pains | ~$0.50 |
| PRD Validation | 300 personas × 10 stories | ~$1.00 |
| Prototype Evaluation | 500 personas | ~$2.00 |
| Synthesis | 1 aggregation | ~$0.50 |
| **Total per initiative** | | **~$4.00** |

## Output File Locations

Save to `pm-workspace-docs/initiatives/active/[name]/jury-evaluations/`:

- `research-v1.json` - Pain point resonance
- `prd-v1.json` - User story validation
- `proto-v1.json` - Usability evaluation (raw)
- `jury-report.md` - Human-readable synthesis
- `iteration-log.md` - Change tracking

## Quality Checks

Before trusting results, verify:

- Sample size is adequate (≥100 research, ≥200 PRD, ≥300 proto)
- Skeptic representation is ≥15%
- All relevant archetypes are represented
- The self-consistency filter was applied
- Scores show real variance (std > 0.5 on 5-point scales); near-zero variance means the personas are not actually differentiated (sketch below)
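
A sketch of that variance check, assuming the raw 1-5 scores are available as a list:

```python
from statistics import pstdev

def variance_ok(scores: list[int], min_std: float = 0.5) -> bool:
    """Flag juries whose scores barely vary: differentiated personas should disagree."""
    return pstdev(scores) > min_std

assert not variance_ok([4, 4, 4, 4, 4])       # zero variance: personas collapsed
assert variance_ok([2, 3, 4, 5, 4, 3, 5, 1])  # healthy spread
```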

## Scripts Reference

Existing scripts in `pm-workspace-docs/scripts/jury-system/`:

- `simulate_jury.py` - Run a jury simulation
- `iterate_from_feedback.py` - Generate iteration docs

## This System Supplements, Not Replaces

Use it for:

- ✅ Rapid validation between real interviews
- ✅ Catching obvious mismatches before investing
- ✅ Covering personas you haven't talked to yet

Do NOT use it to:

- ❌ Replace actual customer conversations
- ❌ Make final launch decisions without real validation

kimi-cli5