ai-native-product
AI-Native Product Development
"AI products aren't deterministic. They require continuous calibration, not just A/B tests."
This skill covers AI-Native Product Development — the overlay that modifies discovery, architecture, and delivery when AI is at the core. It addresses the unique challenges of building products where AI agents perform tasks autonomously.
Part of: Modern Product Operating Model — a collection of composable product skills.
Related skills: product-strategy, product-discovery, product-architecture, product-delivery, product-leadership
When to Use This Skill
Use this skill when:
- Building AI agents that act on behalf of users
- Adding LLM-powered features to existing products
- Designing human-AI interaction patterns
- Deciding how much autonomy to give AI
- Setting up eval strategies and calibration loops
- Managing the "agency-control tradeoff"
Not needed for: traditional software products, or ML models used only for backend optimization (no user-facing autonomy).
What Makes AI Products Different
Traditional Software vs. AI Products
| Dimension | Traditional Software | AI-Native Products |
|---|---|---|
| Behavior | Deterministic | Probabilistic |
| Testing | Unit tests, QA | Evals, calibration |
| Correctness | Binary (works or doesn't) | Spectrum (good enough?) |
| User role | Operator | Delegator + Reviewer |
| Failure mode | Error messages | Plausible but wrong outputs |
| Iteration | Ship → Measure → Iterate | Ship → Observe → Calibrate |
| Trust building | Feature completeness | Demonstrated reliability |
The Core Challenge
AI products must navigate a fundamental tension:
More autonomy = More value (fewer steps, faster outcomes)
More autonomy = More risk (errors affect real work)
This is the Agency-Control Tradeoff.
Framework: The CCCD Loop
Credit: Aishwarya Goel & Kiriti Gavini
AI products require a Continuous Calibration and Confidence Development (CCCD) loop:
┌──────────────────────────────────────────────────────────────────────┐
│                              CCCD LOOP                               │
│                                                                      │
│  CALIBRATE  →  CONFIDENCE  →  CONTINUOUS DISCOVERY  →  CALIBRATE     │
│      ↓             ↓                   ↓                   ↓         │
│  Eval and      Build user     Observe AI               Update evals  │
│  adjust AI     trust over     interactions             and models    │
│  behavior      time           at scale                               │
└──────────────────────────────────────────────────────────────────────┘
CCCD Components:
| Component | Purpose | Activities |
|---|---|---|
| Calibrate | Tune AI behavior to match user expectations | Run evals, adjust prompts/models, set guardrails |
| Confidence | Build appropriate user trust | Show AI reasoning, enable verification, demonstrate reliability |
| Continuous Discovery | Observe AI-user interactions at scale | Log interactions, identify failure patterns, surface edge cases |
| → Back to Calibrate | Update based on learnings | Improve evals, retrain, adjust prompts |
The Agency-Control Progression
Five Levels of AI Agency
| Level | Description | AI Does | User Does | Example |
|---|---|---|---|---|
| 1. Assist | AI suggests, user executes | Generates options | Chooses and acts | Autocomplete, suggestions |
| 2. Recommend | AI ranks, user approves | Analyzes and recommends | Reviews and approves | "AI recommends these 3 actions" |
| 3. Execute with confirmation | AI acts after approval | Prepares action | Confirms before execution | "Send this email?" → Yes/No |
| 4. Execute with notification | AI acts, notifies after | Acts autonomously | Reviews outcomes | "I scheduled the meeting and sent invites" |
| 5. Fully autonomous | AI acts without notification | Handles end-to-end | Sets goals, reviews exceptions | AI handles routine tasks silently |
Progression Strategy
Start lower, earn higher:
Level 1 → Build trust → Level 2 → Demonstrate reliability → Level 3 → ...
Graduation Criteria:
| Transition | Levels | Requires |
|---|---|---|
| 1 → 2 | Assist → Recommend | Users accept > 70% of suggestions |
| 2 → 3 | Recommend → Execute with confirmation | Users approve > 80% of recommendations |
| 3 → 4 | Execute with confirmation → Execute with notification | Users confirm > 90% of actions without edits |
| 4 → 5 | Execute with notification → Fully autonomous | Users override < 5% of actions; high-stakes scenarios excluded |
Never fully autonomous for:
- Irreversible actions (delete, send, purchase)
- High-stakes decisions (financial, legal, health)
- Novel situations outside training distribution
- Actions affecting third parties
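The graduation criteria and the "never fully autonomous" exclusions can be encoded as a small check. This is a minimal sketch, not a prescribed implementation: the `TaskStats` fields and the `can_graduate` function are hypothetical names, and the thresholds are copied from the graduation criteria table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStats:
    """Observed rates for one task type (hypothetical field names)."""
    acceptance_rate: float            # share of suggestions users accept (Level 1)
    approval_rate: float              # share of recommendations users approve (Level 2)
    confirm_without_edit_rate: float  # share of confirmations made without edits (Level 3)
    override_rate: float              # share of notified actions users override (Level 4)
    irreversible: bool = False
    high_stakes: bool = False
    novel_situation: bool = False
    affects_third_parties: bool = False


def can_graduate(current_level: int, stats: TaskStats) -> bool:
    """Return True if the task may move from current_level to the next level."""
    if current_level == 1:
        return stats.acceptance_rate > 0.70
    if current_level == 2:
        return stats.approval_rate > 0.80
    if current_level == 3:
        return stats.confirm_without_edit_rate > 0.90
    if current_level == 4:
        # Any exclusion from the "never fully autonomous" list blocks Level 5.
        excluded = (
            stats.irreversible
            or stats.high_stakes
            or stats.novel_situation
            or stats.affects_third_parties
        )
        return stats.override_rate < 0.05 and not excluded
    return False  # Level 5 is the top level; unknown levels never graduate
```

Keeping the exclusions as explicit flags on the task, rather than folding them into the rates, means a strong override rate can never quietly unlock full autonomy for an irreversible or high-stakes action.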
AI-Native Discovery
Standard discovery practices need adaptation for AI products.
Modified Discovery Focus
| Standard Discovery | AI-Native Adaptation |
|---|---|
| "What job are you trying to do?" | + "How much do you want to delegate?" |
| "What's your current workflow?" | + "Which steps are you comfortable AI handling?" |
| "What would success look like?" | + "What errors would be unacceptable?" |
| "Show me how you do this today" | + "Show me how you verify AI work today" |
AI-Specific Discovery Questions
Delegation appetite:
- "Which parts of this task feel tedious vs. require your judgment?"
- "If AI made an error here, what would the consequences be?"
- "How would you want to verify AI's work?"
Trust calibration:
- "What would AI need to demonstrate before you'd trust it to [action]?"
- "Have you used AI tools before? What built or broke your trust?"
- "Would you prefer AI to do more but occasionally err, or do less perfectly?"
Failure tolerance:
- "What kinds of errors are annoying vs. damaging?"
- "How quickly do you need to catch and fix AI mistakes?"
- "What's your 'undo' option if AI gets it wrong?"
Observing AI Interactions
In addition to interviews, AI discovery includes:
| Method | What to Look For |
|---|---|
| Session recordings | Where do users override AI? Where do they accept blindly? |
| Interaction logs | Patterns in edits, rejections, corrections |
| Feedback analysis | Explicit signals (thumbs down, ratings) |
| Support tickets | AI-related complaints and confusion |
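Interaction logs can be reduced to the acceptance and override rates used elsewhere in this skill. The sketch below assumes each logged interaction carries one of three hypothetical outcome labels, and it assumes that an override means either an edit or a rejection; adjust both to match your own logging.

```python
from collections import Counter
from typing import Iterable

# Hypothetical outcome labels for one logged AI interaction.
ACCEPTED, EDITED, REJECTED = "accepted", "edited", "rejected"


def interaction_rates(outcomes: Iterable[str]) -> dict[str, float]:
    """Compute acceptance and override rates from outcome labels.

    Assumption: an override is any interaction the user edited or rejected.
    Labels other than the three known ones are ignored.
    """
    counts = Counter(outcomes)
    total = counts[ACCEPTED] + counts[EDITED] + counts[REJECTED]
    if total == 0:
        return {"acceptance_rate": 0.0, "override_rate": 0.0}
    return {
        "acceptance_rate": counts[ACCEPTED] / total,
        "override_rate": (counts[EDITED] + counts[REJECTED]) / total,
    }
```

For example, a log of eight accepted, one edited, and one rejected interaction gives an acceptance rate of 0.8 and an override rate of 0.2.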
AI-Native Architecture
Solution Brief Additions
For AI features, add the following section to the standard solution brief:
AI-SPECIFIC SECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AGENCY LEVEL
Target: [Level 1-5]
Graduation path: [How might this evolve?]
FAILURE MODES
• [Failure mode 1]: [Consequence] → [Mitigation]
• [Failure mode 2]: [Consequence] → [Mitigation]
EVAL STRATEGY
• [Eval type 1]: [What we measure, how often]
• [Eval type 2]: [What we measure, how often]
CALIBRATION PLAN
• Initial calibration: [Approach]
• Ongoing calibration: [Cadence, triggers]
CONFIDENCE BUILDING
• How AI explains itself: [Approach]
• How users verify: [Mechanisms]
• Trust-building milestones: [Progression]
AI Bet Categories
In addition to standard bet categories:
| Category | Description | Example |
|---|---|---|
| Capability expansion | AI can handle new task types | "AI can now summarize documents" |
| Agency graduation | Move to higher autonomy level | "AI sends emails without confirmation" |
| Calibration improvement | Better accuracy/reliability | "Reduce hallucination rate from 5% to 2%" |
| Confidence building | Better user trust | "Show AI reasoning before action" |
| Guardrail strengthening | Prevent harmful outputs | "Add content policy enforcement" |
AI-Native Delivery
Eval Strategy (Replaces Traditional Testing)
Eval Types:
| Eval Type | Purpose | When to Run |
|---|---|---|
| Unit evals | Test specific capabilities | Every code change |
| Behavioral evals | Test end-to-end flows | Daily/weekly |
| Adversarial evals | Test edge cases and attacks | Before major releases |
| Human evals | Test subjective quality | Weekly sample |
| Production evals | Test on real traffic | Continuous |
Eval Metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Task success rate | Does AI complete the intended task? | > 95% |
| Factual accuracy | Is output factually correct? | > 98% |
| Hallucination rate | Does AI make things up? | < 2% |
| Harmful output rate | Does AI produce unsafe content? | < 0.1% |
| User acceptance rate | Do users accept AI output? | > 80% |
| Override rate | How often do users correct AI? | < 15% |
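The targets in the table above can be checked automatically after each eval run. This is a minimal sketch: the metric keys and the `failing_metrics` helper are hypothetical names, and the thresholds are the table's targets expressed as fractions.

```python
import operator

# Targets from the Eval Metrics table as (comparison, threshold).
# The metric keys are hypothetical names chosen for this sketch.
TARGETS = {
    "task_success_rate": (operator.gt, 0.95),
    "factual_accuracy": (operator.gt, 0.98),
    "hallucination_rate": (operator.lt, 0.02),
    "harmful_output_rate": (operator.lt, 0.001),
    "user_acceptance_rate": (operator.gt, 0.80),
    "override_rate": (operator.lt, 0.15),
}


def failing_metrics(measured: dict[str, float]) -> list[str]:
    """Return the metrics that miss their target or were not measured."""
    failures = []
    for name, (compare, threshold) in TARGETS.items():
        value = measured.get(name)
        if value is None or not compare(value, threshold):
            failures.append(name)
    return failures
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an eval that silently stopped reporting should block a release just as a regression would.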
Eval Cadence:
Code change → Unit evals (automated)
Daily → Behavioral evals (automated)
Weekly → Human evals (sample)
Release → Adversarial evals (red team)
Continuous → Production evals (monitoring)
Staged Rollout for AI Features
AI features require a more cautious rollout than traditional features:
| Stage | Audience | Focus | Duration |
|---|---|---|---|
| Internal | Team | Find obvious failures | 1 week |
| Alpha | 5-10 trusted users | Qualitative feedback on AI behavior | 2 weeks |
| Beta | 5% of users | Quantitative eval metrics | 2-4 weeks |
| Gradual GA | 5% → 25% → 50% → 100% | Monitor at each stage | 4+ weeks |
AI-Specific Rollout Gates:
| Gate | Criteria to Proceed |
|---|---|
| Alpha → Beta | Eval metrics above threshold, no harmful outputs |
| Beta → Gradual GA | User acceptance > 80%, override rate < 15% |
| Each GA increment | Metrics stable, no new failure modes |
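The rollout gates can be expressed as two small functions. This is a sketch under stated assumptions: `beta_gate_passed` and `next_ga_step` are hypothetical helpers, and "metrics stable" and "new failure modes" are passed in as already-computed inputs because the source does not define how to measure them.

```python
# GA exposure steps from the Staged Rollout table (percent of users).
GA_STEPS = [5, 25, 50, 100]


def beta_gate_passed(acceptance_rate: float, override_rate: float) -> bool:
    """Beta to Gradual GA gate: acceptance > 80% and override rate < 15%."""
    return acceptance_rate > 0.80 and override_rate < 0.15


def next_ga_step(current_pct: int, metrics_stable: bool, new_failure_modes: int) -> int:
    """Return the next exposure percentage, or hold at the current one."""
    if current_pct not in GA_STEPS:
        raise ValueError(f"unknown rollout step: {current_pct}")
    if not metrics_stable or new_failure_modes > 0:
        return current_pct  # gate failed: hold exposure and recalibrate
    index = GA_STEPS.index(current_pct)
    return GA_STEPS[min(index + 1, len(GA_STEPS) - 1)]
```

A failed gate holds exposure rather than rolling back; whether to reduce exposure on a failure is a separate decision this sketch leaves to the team.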
Calibration Loop
Continuous calibration process:
OBSERVE → IDENTIFY → CALIBRATE → VALIDATE → DEPLOY
   ↑                                           │
   └───────────────────────────────────────────┘
| Step | Activities | Cadence |
|---|---|---|
| Observe | Monitor production interactions, logs, feedback | Continuous |
| Identify | Surface failure patterns, edge cases, drift | Daily/weekly |
| Calibrate | Adjust prompts, fine-tune, add guardrails | As needed |
| Validate | Run evals on calibrated version | Before deploy |
| Deploy | Ship updates, continue observing | Staged |
Calibration Triggers:
- Eval metrics below threshold
- New failure pattern identified
- User feedback trend (negative)
- Model update available
- New use case discovered
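The triggers above can be evaluated together on each monitoring pass. The sketch below is illustrative only: the `Snapshot` fields are hypothetical, and "negative feedback trend" is assumed here to mean the negative-feedback share rose in each of the last two periods, which the source does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """State gathered by the Observe and Identify steps (hypothetical fields)."""
    evals_below_threshold: bool = False
    new_failure_patterns: int = 0
    # Share of negative feedback per period, oldest first (e.g. weekly).
    negative_feedback_rates: list[float] = field(default_factory=list)
    model_update_available: bool = False
    new_use_cases: int = 0


def calibration_triggers(s: Snapshot) -> list[str]:
    """Return the reasons, if any, to start a calibration cycle."""
    triggers = []
    if s.evals_below_threshold:
        triggers.append("eval metrics below threshold")
    if s.new_failure_patterns > 0:
        triggers.append("new failure pattern identified")
    rates = s.negative_feedback_rates
    # Assumption: a negative trend is two consecutive period-over-period rises.
    if len(rates) >= 3 and rates[-3] < rates[-2] < rates[-1]:
        triggers.append("negative feedback trend")
    if s.model_update_available:
        triggers.append("model update available")
    if s.new_use_cases > 0:
        triggers.append("new use case discovered")
    return triggers
```

An empty list means no calibration cycle is needed on this pass; any non-empty result feeds the Calibrate step.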
AI Metrics Hierarchy
LAGGING
├── User retention (AI users vs. non-AI users)
├── Task completion rate (with AI assist)
└── Revenue from AI features
CORE
├── User acceptance rate
├── Override rate
├── Time-to-completion (with AI)
└── User-reported satisfaction
LEADING
├── Eval metrics (accuracy, hallucination, etc.)
├── Interaction volume
├── Feature discovery rate
└── Feedback sentiment
GUARDRAILS
├── Harmful output rate
├── Latency P95
├── Error rate
└── Cost per interaction
AI-Specific Anti-Patterns
| Anti-Pattern | Why It Fails | Instead |
|---|---|---|
| Ship and hope | AI behavior drifts without monitoring | Continuous calibration |
| Autonomous by default | Users don't trust, don't adopt | Earn autonomy progressively |
| Black box AI | Users can't verify, won't trust | Show reasoning, enable verification |
| No evals | Quality degrades silently | Comprehensive eval strategy |
| Ignore overrides | Miss calibration signals | Override patterns inform calibration |
| One-size-fits-all agency | Different tasks need different levels | Task-specific agency levels |
Templates
This skill includes templates in the templates/ directory:
- agency-assessment.md: Determine appropriate agency level
- eval-strategy.md: Design eval suite for AI feature
- calibration-plan.md: Set up continuous calibration
Using This Skill with Claude
Ask Claude to:
- Assess agency level: "What agency level should [AI feature] have?"
- Design agency progression: "Create a graduation path from assist to autonomous for [feature]"
- Identify failure modes: "What could go wrong with [AI feature]? How do we mitigate?"
- Design eval strategy: "Design an eval suite for [AI feature]"
- Plan calibration: "Create a calibration plan for [AI feature]"
- Adapt discovery: "What AI-specific questions should I ask in discovery for [use case]?"
- Design confidence building: "How should [AI feature] show its reasoning?"
- Plan AI rollout: "Create a staged rollout plan for [AI feature]"
- Set AI metrics: "What metrics should we track for [AI feature]?"
- Review AI brief: "Critique this solution brief for AI considerations"
Connection to Other Skills
| When you need to... | Use skill |
|---|---|
| Define overall product strategy | product-strategy |
| Run discovery (with AI adaptations) | product-discovery |
| Structure bets and roadmap | product-architecture |
| Plan rollout and metrics | product-delivery |
| Scale AI products across teams | product-leadership |
Quick Reference: AI Product Checklist
Before shipping AI features:
- Agency level defined — Clear level for this feature
- Graduation criteria set — How we'll earn higher autonomy
- Failure modes mapped — Know what can go wrong
- Evals in place — Automated quality checks
- Human evals scheduled — Subjective quality review
- Calibration loop running — Continuous improvement process
- Confidence mechanisms built — Users can verify AI work
- Guardrails active — Prevent harmful outputs
- Rollout staged — More cautious than traditional features
- Override tracking — Learning from user corrections
Sources & Influences
- Aishwarya Goel & Kiriti Gavini — CCCD Loop, Agency-Control Trade-off
- Anthropic — Constitutional AI, RLHF approaches
- OpenAI — Eval best practices
- Google DeepMind — AI safety frameworks
Part of the Modern Product Operating Model by Yannick Maurice