Fix Everything

Autonomous error recovery and workflow continuation for: $ARGUMENTS

Overview

Implementation status: workflow-orchestration
Primary purpose: enable autonomous error recovery so non-technical users never see internal failures
Workflow role: cross-cutting error handling and recovery across all workflow stages
Design philosophy: closed-loop problem solving - analyze, fix, verify, continue
User experience goal: seamless workflow execution without technical interruptions

Use When

Any skill execution fails with an error
Data is missing, stale, or corrupted
Configuration is invalid or incomplete
Dependencies are not met
Network or provider failures occur
Workflow is blocked or stuck
User should not be exposed to technical details
Autonomous recovery is possible without user input

Do Not Use When

User explicitly requests to see errors for debugging
Error requires user decision (e.g., which data source to use)
Security or authentication issues require user credentials
Data ambiguity requires user clarification (e.g., which symbol they meant)
Destructive operations need user confirmation
Error is truly unrecoverable and user must be informed

Core Principles

1. Closed-Loop Recovery

Detect error → Analyze root cause → Attempt fix → Verify success → Continue workflow
Never stop at first failure - try multiple recovery strategies
Log all recovery attempts for audit trail
Only escalate to user when all recovery strategies exhausted

2. Graceful Degradation

If optimal path fails, try alternative approaches
If data source fails, try backup sources
If full data unavailable, proceed with partial data (with status tracking)
If skill fails, try equivalent skill or simplified approach

3. Transparent Logging

Log all errors and recovery attempts internally
Present clean, successful results to user
Maintain audit trail for retrospective analysis
Surface only actionable information to user

4. Progressive Recovery

Start with simple fixes (retry, refresh)
Escalate to moderate fixes (alternative sources, workarounds)
Escalate to complex fixes (data reconstruction, workflow rerouting)
Only fail if all strategies exhausted

Execution

Step 1: Error Detection and Classification

When any error occurs, immediately classify it:

Error Categories:

Data Errors:

Missing data (symbol not found, no historical data)
Stale data (data too old for intended use)
Corrupted data (invalid format, parsing errors)
Incomplete data (gaps in time series, missing fields)
Quality issues (outliers, inconsistencies)

Configuration Errors:

Missing configuration (webhook, API keys, paths)
Invalid configuration (malformed values, wrong types)
Conflicting configuration (incompatible settings)

Dependency Errors:

Missing dependencies (Python packages, system tools)
Version conflicts (incompatible versions)
Path errors (files or directories not found)

Provider Errors:

Network failures (timeout, connection refused)
API failures (rate limit, authentication, service down)
Data provider unavailable (maintenance, deprecated)

Execution Errors:

Parse errors (invalid arguments, malformed input)
Runtime errors (exceptions, crashes)
Resource errors (out of memory, disk full)

Workflow Errors:

Blocked workflow (missing prerequisites)
Circular dependencies (skill A needs B, which in turn needs A)
State inconsistency (session state corrupted)
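The categories above can be captured in a small enumeration so that later steps can branch on them. The sketch below is illustrative only: the `ErrorCategory` enum, the `classify_error` function, and the mapping from built-in Python exceptions to categories are assumptions, not part of any existing skill code. Configuration and workflow errors would likely need skill-specific exception types, which are omitted here.

```python
import json
from enum import Enum


class ErrorCategory(Enum):
    DATA = "data"
    CONFIGURATION = "configuration"
    DEPENDENCY = "dependency"
    PROVIDER = "provider"
    EXECUTION = "execution"
    WORKFLOW = "workflow"


def classify_error(exc: BaseException) -> ErrorCategory:
    """Map a raised exception to a category (hypothetical mapping)."""
    # Missing packages and missing paths are both listed under dependency errors.
    if isinstance(exc, (ImportError, FileNotFoundError)):
        return ErrorCategory.DEPENDENCY
    # Timeouts and refused/reset connections point at the network or provider.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorCategory.PROVIDER
    # Undecodable payloads are treated as corrupted data.
    if isinstance(exc, (json.JSONDecodeError, UnicodeDecodeError)):
        return ErrorCategory.DATA
    # Anything else is handled as a generic execution error.
    return ErrorCategory.EXECUTION
```

A real classifier would probably also inspect provider-specific error codes; this version only looks at the exception type.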

Step 2: Root Cause Analysis

For each error, analyze the root cause:

Analysis Questions:

What exactly failed? (specific operation, line, component)
Why did it fail? (immediate cause)
What was the system trying to do? (intended operation)
What are the dependencies? (what does this depend on)
What are the alternatives? (other ways to achieve same goal)
Is this transient or persistent? (retry-able or needs fix)
What's the blast radius? (what else is affected)
Can we proceed without this? (is it critical or optional)

Root Cause Categories:

Transient: Network glitch, temporary provider issue → Retry
Configuration: Missing or invalid config → Fix config
Data availability: Source doesn't have data → Try alternative source
Data quality: Data exists but poor quality → Clean or filter
Dependency: Missing prerequisite → Install or workaround
Logic: Bug in code or workflow → Workaround or fix
Resource: System resource exhausted → Clean up or optimize

Step 3: Recovery Strategy Selection

Based on root cause, select recovery strategy:

Strategy 1: Retry with Backoff

Use for: Transient network errors, temporary provider issues
Approach: Retry 3 times with exponential backoff (1s, 2s, 4s)
Success criteria: Operation succeeds on retry
Fallback: If all retries fail, escalate to Strategy 2
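The retry policy above (three retries, waiting 1s, 2s, then 4s) can be sketched as follows. The function name `retry_with_backoff`, its parameters, and the choice of `TimeoutError` and `ConnectionError` as the "transient" errors are assumptions made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    delays: Tuple[float, ...] = (1.0, 2.0, 4.0),
    transient: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation once, then retry after each delay while it fails transiently."""
    try:
        return operation()
    except transient as exc:
        last_error = exc
    for delay in delays:
        sleep(delay)  # wait 1s, 2s, 4s between attempts
        try:
            return operation()
        except transient as exc:
            last_error = exc
    # All retries failed: re-raise so the caller can move on to Strategy 2.
    raise last_error
```

Non-transient exceptions propagate immediately, since retrying them is unlikely to help. The `sleep` parameter is injectable so the logic can be exercised without real waiting.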

Strategy 2: Alternative Source

Use for: Provider unavailable, data source failure
Approach: Try alternative data providers in priority order
Example: akshare fails → try tushare → try local cache
Success criteria: Alternative source provides data
Fallback: If all sources fail, escalate to Strategy 3
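A provider fallback chain like the one above might look like the sketch below. The fetchers are passed in as plain callables, so nothing here depends on the real akshare or tushare APIs; `fetch_with_fallback` and the `(name, fetcher)` convention are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Fetcher = Callable[[str], Dict]


def fetch_with_fallback(
    symbol: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, Dict]:
    """Try each (name, fetcher) pair in priority order; return the first success."""
    errors: List[str] = []
    for name, fetch in sources:
        try:
            data = fetch(symbol)
        except Exception as exc:  # any provider failure moves on to the next source
            errors.append(f"{name}: {exc}")
            continue
        if data:  # an empty result counts as a failure too
            return name, data
        errors.append(f"{name}: empty result")
    # Nothing worked: raise so the caller can escalate to Strategy 3.
    raise LookupError(f"all sources failed for {symbol}: {'; '.join(errors)}")
```

Returning the source name alongside the data lets the caller log which provider was actually used.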

Strategy 3: Graceful Degradation

Use for: Optional data unavailable, partial failures
Approach: Proceed with partial data, mark status as "partial"
Example: Fundamentals unavailable → proceed with technicals only
Success criteria: Workflow continues with reduced scope
Fallback: If critical data missing, escalate to Strategy 4
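One way to express "proceed with partial data, mark status as partial" is shown below. The report layout, the block status values, and the `build_report` name are assumptions for illustration; only the overall "partial" status comes from the text above.

```python
from typing import Callable, Dict, Set


def build_report(
    symbol: str,
    blocks: Dict[str, Callable[[str], Dict]],
    critical: Set[str],
) -> Dict:
    """Run every block producer; tolerate failures in non-critical blocks."""
    report = {"symbol": symbol, "status": "complete", "blocks": {}, "notes": []}
    for name, producer in blocks.items():
        try:
            report["blocks"][name] = {"status": "ok", "data": producer(symbol)}
        except Exception:
            if name in critical:
                raise  # critical data missing: hand over to the next strategy
            report["blocks"][name] = {"status": "unavailable", "data": None}
            report["notes"].append(f"{name} block unavailable")
            report["status"] = "partial"
    return report
```

With fundamentals marked as optional, a failure there yields a technicals-only report whose status is "partial" rather than an exception.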

Strategy 4: Data Reconstruction

Use for: Missing or corrupted data that can be derived
Approach: Reconstruct from other available data
Example: Missing volume → estimate from price moves and liquidity
Success criteria: Reconstructed data passes quality checks
Fallback: If reconstruction fails, escalate to Strategy 5

Strategy 5: Workflow Rerouting

Use for: Blocked workflow, missing prerequisites
Approach: Find alternative workflow path to achieve goal
Example: market-brief fails → try market-analyze + decision-support separately
Success criteria: Alternative workflow produces equivalent result
Fallback: If no alternative path, escalate to Strategy 6

Strategy 6: Configuration Auto-Fix

Use for: Missing or invalid configuration
Approach: Detect and fix configuration issues automatically
Example: Missing webhook → disable notifications, continue workflow
Success criteria: Configuration fixed, workflow continues
Fallback: If config cannot be auto-fixed, escalate to Strategy 7

Strategy 7: Dependency Auto-Install

Use for: Missing dependencies that can be installed
Approach: Automatically install missing dependencies
Example: Missing Python package → pip install
Success criteria: Dependency installed, operation succeeds
Fallback: If installation fails, escalate to Strategy 8
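A minimal sketch of dependency auto-install, assuming pip is available for the running interpreter and that installing packages without confirmation is acceptable in the target environment. The helper name `import_or_install` is hypothetical.

```python
import importlib
import subprocess
import sys
from types import ModuleType
from typing import Optional


def import_or_install(module_name: str, package_name: Optional[str] = None) -> ModuleType:
    """Import a module, installing its package with pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        pass
    # Use the current interpreter's pip so the package lands in the right environment.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", package_name or module_name]
    )
    importlib.invalidate_caches()  # make the freshly installed package importable
    return importlib.import_module(module_name)
```

If pip fails, `subprocess.check_call` raises `CalledProcessError`, which the caller can treat as the signal to escalate to Strategy 8.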

Strategy 8: Simplified Approach

Use for: Complex operation fails, simpler alternative exists
Approach: Use simpler method that's more likely to succeed
Example: Complex pattern recognition fails → use simple trend classification
Success criteria: Simplified approach produces usable result
Fallback: If simplified approach fails, escalate to user

Step 4: Execute Recovery

Execute selected recovery strategy:

Execution Protocol:

Log recovery attempt (strategy, parameters, timestamp)
Execute recovery operation
Capture result (success/failure, output, errors)
Verify recovery success (does it solve the problem?)
If successful, continue workflow
If failed, try next strategy
If all strategies exhausted, prepare user escalation
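The protocol above amounts to a loop over strategies with logging and verification at each step. The following is a sketch under stated assumptions: `run_recovery`, `RecoveryExhausted`, and the logger name are illustrative, and each strategy is modeled as a zero-argument callable.

```python
import logging
from datetime import datetime, timezone
from typing import Any, Callable, List, Sequence, Tuple

logger = logging.getLogger("fix_everything")


class RecoveryExhausted(Exception):
    """Raised when every strategy has failed; the caller should escalate."""


def run_recovery(
    strategies: Sequence[Tuple[str, Callable[[], Any]]],
    verify: Callable[[Any], bool],
) -> Tuple[str, Any]:
    """Try strategies in order; return (name, result) of the first verified one."""
    attempts: List[Tuple[str, str]] = []
    for name, strategy in strategies:
        started = datetime.now(timezone.utc).isoformat()
        logger.info("recovery attempt: strategy=%s started=%s", name, started)
        try:
            result = strategy()
        except Exception as exc:
            logger.warning("strategy %s raised: %s", name, exc)
            attempts.append((name, f"error: {exc}"))
            continue
        if verify(result):  # does the result actually solve the problem?
            logger.info("strategy %s succeeded", name)
            return name, result
        logger.warning("strategy %s did not pass verification", name)
        attempts.append((name, "verification failed"))
    raise RecoveryExhausted(attempts)
```

The exception carries the list of failed attempts, which can feed both the audit trail and the "what I've already tried" part of an escalation message.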

Recovery Execution Examples:

Example 1: Data Provider Failure

Error: akshare API timeout for symbol 600519
Root Cause: Transient network issue or provider overload
Strategy: Retry with backoff → Alternative source

Recovery Steps:
1. Retry akshare with exponential backoff (1s, 2s, 4s) → Still fails
2. Try tushare as alternative → Success
3. Validate data quality → OK
4. Continue workflow with tushare data
5. Log: "Data source switched from akshare to tushare due to timeout"

Example 2: Missing Fundamental Data

Error: Fundamental data not available for symbol 688001
Root Cause: New STAR market stock, limited fundamental history
Strategy: Graceful degradation

Recovery Steps:
1. Mark fundamental block status as "not_supported"
2. Continue with technical analysis only
3. Add note: "Fundamental analysis limited for STAR market IPO"
4. Proceed to decision support with technical data only
5. Log: "Fundamental data unavailable, proceeded with technical-only analysis"

Example 3: Configuration Missing

Error: Feishu webhook URL not configured
Root Cause: User hasn't set up notifications
Strategy: Configuration auto-fix

Recovery Steps:
1. Detect missing webhook configuration
2. Disable Feishu notifications (fail-open)
3. Continue workflow without notifications
4. Log: "Feishu notifications disabled due to missing webhook"
5. Do not interrupt user workflow

Example 4: Stale Data

Error: Market data is 3 days old, too stale for decision support
Root Cause: Data sync hasn't run recently
Strategy: Data refresh

Recovery Steps:
1. Detect stale data (last update 3 days ago)
2. Trigger data-sync skill automatically
3. Wait for sync completion
4. Verify data freshness (now < 1 day old)
5. Continue with fresh data
6. Log: "Auto-refreshed stale data before analysis"
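The staleness check and refresh in Example 4 could be sketched like this. The one-day threshold comes from the example; `ensure_fresh` and the `refresh` callable (standing in for the data-sync skill) are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def ensure_fresh(
    last_update: datetime,
    refresh: Callable[[], datetime],
    max_age: timedelta = timedelta(days=1),
    now: Optional[datetime] = None,
) -> datetime:
    """Return a timestamp that is fresh enough, refreshing the data if needed."""
    now = now or datetime.now(timezone.utc)
    if now - last_update < max_age:
        return last_update  # data is already fresh, nothing to do
    refreshed = refresh()  # stand-in for triggering data-sync and waiting for it
    if now - refreshed >= max_age:
        raise RuntimeError("data is still stale after refresh")
    return refreshed
```

Raising when the refresh did not help keeps the verification step explicit instead of assuming the sync worked.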

Example 5: Workflow Blocked

Error: decision-dashboard requires watchlist, but no watchlist exists
Root Cause: User hasn't created watchlist yet
Strategy: Workflow rerouting

Recovery Steps:
1. Detect missing prerequisite (watchlist)
2. Offer alternative: "Would you like to analyze a specific symbol instead?"
3. If user provides symbol, route to market-brief
4. If user wants dashboard, guide through watchlist-import first
5. Log: "Rerouted from decision-dashboard to market-brief due to missing watchlist"

Step 5: Verify Recovery Success

After executing recovery, verify it actually solved the problem:

Verification Checks:

Original operation now succeeds
Data quality is acceptable (not just present but usable)
Workflow can continue to next step
No new errors introduced by recovery
Result is equivalent to what the original operation would have produced had it succeeded

Verification Failures:

If verification fails, recovery didn't actually solve the problem
Try next recovery strategy
If all strategies exhausted, escalate to user

Step 6: Continue Workflow Seamlessly

Once recovery succeeds, continue workflow as if nothing happened:

Continuation Protocol:

Resume workflow at the point of failure
Pass recovered data/state to next step
Maintain workflow context and history
Present clean results to user
Log recovery details internally (not shown to user)

User Experience:

User sees: "Analysis complete for 600519"
User does NOT see: "akshare failed, retried with tushare, data quality check passed, analysis complete"
Internal log records all recovery details for audit

Step 7: Audit Trail and Learning

Maintain detailed audit trail for all recoveries:

Audit Log Contents:

Timestamp of error and recovery
Error type and root cause
Recovery strategies attempted
Recovery strategy that succeeded
Time taken for recovery
Impact on workflow (delay, degradation)
Data quality after recovery
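The audit log contents listed above map naturally onto a record type. The field names and the JSON-lines serialization below are assumptions chosen for illustration, not an existing log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RecoveryRecord:
    error_at: str                      # ISO timestamp of the error
    recovered_at: Optional[str]        # ISO timestamp of recovery, None if it failed
    error_type: str
    root_cause: str
    strategies_attempted: List[str] = field(default_factory=list)
    succeeded_strategy: Optional[str] = None
    recovery_seconds: float = 0.0
    workflow_impact: str = "none"      # e.g. "delay" or "degradation"
    data_quality: str = "unknown"      # quality assessment after recovery

    def to_json_line(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log file."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

One record per recovery, written as a single line, keeps the log easy to append to and easy to aggregate later.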

Learning from Recoveries:

Track which errors occur most frequently
Track which recovery strategies work best
Identify patterns in failures
Recommend proactive fixes (e.g., "akshare fails often, consider switching default to tushare")
Update recovery strategies based on success rates
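Tracking which strategies work best can start with a simple success-rate tally over the audit trail. The input shape, an iterable of `(strategy, succeeded)` pairs, and the function name are assumptions for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def strategy_success_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of successful attempts per strategy."""
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for strategy, succeeded in records:
        attempts[strategy] += 1
        if succeeded:
            successes[strategy] += 1
    return {name: successes[name] / attempts[name] for name in attempts}
```

Strategies with persistently low rates for a given error type are candidates for reordering or removal.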

Step 8: User Escalation (Last Resort)

Only escalate to user when all recovery strategies exhausted:

Escalation Criteria:

All recovery strategies failed
Error requires user decision or input
Security/authentication issue needs user credentials
Ambiguity requires user clarification
Destructive operation needs user confirmation

Escalation Message Format:

I encountered an issue that needs your input:

[Clear description of what we're trying to do]

[What went wrong in simple terms]

[What I've already tried]

[What I need from you]

[Options you can choose from]
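The message format above can be assembled by a small helper so that every escalation has the same shape. This is a sketch; `format_escalation` and its parameter names are hypothetical, and the "Already tried" wording is an assumption.

```python
from typing import Sequence


def format_escalation(
    goal: str,
    problem: str,
    tried: Sequence[str],
    need: str,
    options: Sequence[str],
) -> str:
    """Build a plain-language escalation message with numbered options."""
    tried_text = ("Already tried: " + "; ".join(tried)) if tried else ""
    option_text = "\n".join(
        f"{i}. {option}" for i, option in enumerate(options, start=1)
    )
    sections = [
        "I encountered an issue that needs your input:",
        goal,
        problem,
        tried_text,
        need,
        option_text,
    ]
    # Skip empty sections and separate the rest with blank lines.
    return "\n\n".join(section for section in sections if section)
```

The helper accepts only user-facing strings, which makes it harder to leak exception names or stack traces into the message by accident.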

Example Escalation:

I'm trying to analyze symbol "平安" but I found two possible matches:

1. 000001.SZ - Ping An Bank (Shenzhen)
2. 601318.SH - Ping An Insurance (Shanghai)

Which one would you like to analyze?

NOT this:

Error: AmbiguousSymbolException in watchlist-import/run.py line 247
Multiple symbols matched fuzzy search query
Traceback: [stack trace]

Output Contract

Success case: Workflow continues seamlessly, user sees clean results
Partial success: Workflow continues with degraded functionality, user sees note about limitations
Escalation case: User sees clear, actionable message asking for specific input
Audit trail: All recovery attempts logged internally for retrospective analysis
Caller-facing delivery standard:
- Present clean, successful results whenever possible
- Hide internal errors and recovery details from non-technical users
- Surface only actionable information when user input needed
- Frame escalations as questions, not errors
- Maintain professional tone even when things go wrong

Failure Handling

Never crash or show stack traces to users
Never stop workflow at first error - always try recovery
Never expose technical jargon to non-technical users
Always maintain audit trail even if recovery fails
Always provide clear escalation message if all recovery fails

Key Rules

Closed-loop recovery is mandatory. Never stop at first error.
User experience is paramount. Non-technical users should never see internal errors.
Try multiple strategies. Don't give up after first recovery attempt fails.
Graceful degradation is acceptable. Partial results are better than no results.
Audit everything. Log all errors and recoveries for learning.
Escalate clearly. When escalation is needed, ask clear questions instead of showing errors.
Maintain workflow context. Recovery should not lose user's place in workflow.
Verify recovery success. Don't assume recovery worked - verify it.
Learn from failures. Track patterns and improve recovery strategies.
Fail-open when possible. Optional features should not block critical workflow.

Composition

Used by: All skills when errors occur
Calls: Any skill needed for recovery (data-sync, market-data, etc.)
Integrates with: session-status (for state recovery), data-sync (for data refresh)
Logs to: Internal audit trail (not shown to user)
Escalates to: User only when all recovery strategies exhausted

Recovery Strategy Priority Matrix

Error Type	Strategy 1	Strategy 2	Strategy 3	Escalate If
Network timeout	Retry	Alt source	Cache	All sources fail
Missing data	Alt source	Reconstruct	Degrade	Critical data missing
Stale data	Refresh	Cache	Degrade	Refresh fails
Config missing	Auto-fix	Disable feature	Default	Security config
Parse error	Retry	Simplify	Skip	Invalid user input
Dependency missing	Install	Workaround	Degrade	Install fails
Provider down	Alt provider	Cache	Degrade	All providers down
Ambiguous input	Fuzzy match	Ask user	-	Multiple matches
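The matrix above can also be held as data so that the strategy order is looked up rather than hard-coded. The identifiers below are hypothetical snake_case renderings of the table entries.

```python
from typing import Dict, Tuple

# error type -> (strategies in priority order, escalation condition)
RECOVERY_MATRIX: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "network_timeout": (("retry", "alt_source", "cache"), "all sources fail"),
    "missing_data": (("alt_source", "reconstruct", "degrade"), "critical data missing"),
    "stale_data": (("refresh", "cache", "degrade"), "refresh fails"),
    "config_missing": (("auto_fix", "disable_feature", "default"), "security config"),
    "parse_error": (("retry", "simplify", "skip"), "invalid user input"),
    "dependency_missing": (("install", "workaround", "degrade"), "install fails"),
    "provider_down": (("alt_provider", "cache", "degrade"), "all providers down"),
    "ambiguous_input": (("fuzzy_match", "ask_user"), "multiple matches"),
}


def strategies_for(error_type: str) -> Tuple[str, ...]:
    """Return the ordered strategies for an error type, or () if it is unknown."""
    strategies, _escalate_if = RECOVERY_MATRIX.get(error_type, ((), "unknown error type"))
    return strategies
```

An empty tuple for an unknown error type means there is nothing to try automatically, so the caller would go straight to escalation.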

Examples

Example 1: Complete Autonomous Recovery

User Request: "Analyze 600519"

Internal Execution:

market-brief skill called
akshare data fetch fails (timeout)
fix-everything detects error
Retries akshare → still fails
Tries tushare → success
Validates data quality → OK
Continues market-brief with tushare data
Analysis completes successfully

User Sees: "Analysis complete for 600519 [clean results]"

User Does NOT See: Any mention of akshare failure or tushare fallback

Example 2: Graceful Degradation

User Request: "Full analysis for 688001"

Internal Execution:

analysis skill called
Fundamental data fetch fails (STAR market IPO, limited history)
fix-everything detects error
Tries alternative fundamental sources → all fail
Marks fundamental block as "not_supported"
Continues with technical analysis only
Analysis completes with technical-only results

User Sees: "Analysis complete for 688001. Note: Fundamental analysis limited for recent STAR market listing. Technical analysis shows [results]."

Example 3: User Escalation

User Request: "Analyze 平安"

Internal Execution:

watchlist-import tries to resolve "平安"
Finds multiple matches: 000001.SZ (Ping An Bank), 601318.SH (Ping An Insurance)
fix-everything detects ambiguity
Cannot auto-resolve (both are valid, user intent unclear)
Escalates to user with clear options

User Sees: "I found two symbols matching '平安': 1) 000001.SZ - Ping An Bank, 2) 601318.SH - Ping An Insurance. Which would you like to analyze?"

Implementation Notes

This skill is a meta-skill that wraps around all other skills. It should be invoked automatically by the agent framework whenever any error occurs, without requiring explicit user invocation.

The agent should treat this skill as a safety net that ensures workflow continuity and user experience quality, especially for non-technical users who should never see internal system errors.
