model-capability-advisor
Model Capability Advisor
Recommend model capability choices for: $ARGUMENTS
Overview
- Implementation status: code-backed
- Local entry script: `<bundle-root>/model-capability-advisor/run.py`
- Primary purpose: score provided model names against workflow needs and return quick/deep recommendation pairs
- Research layer: meta-workflow optimization (Cross-cutting concern for agent execution design)
- Workflow role: meta-planning support for agent execution design, not a market-analysis surface
- Local executor guarantee: produce heuristic workflow-fit advice from the local capability profiles it knows about
Use When
- The user asks which model names fit a workflow.
- The caller wants quick/deep pairing advice for `analysis`, `chat`, `batch`, `backtest`, or `paper`.
- The user wants warnings about likely model tradeoffs before assigning a workflow.
- The user wants to optimize agent execution for speed vs. depth.
- The user wants to understand model capability differences for quantitative workflows.
Do Not Use When
- The caller needs verification that a provider is actually configured or reachable.
- The user expects benchmark-backed precision or live provider comparison; this skill is heuristic and local.
- The question is about investment analysis rather than agent execution design.
- The user wants actual performance benchmarks rather than capability profiles.
Inputs
- Optional positional `target`.
- Optional `--workflow`; defaults to `analysis`.
- Optional repeatable `--quick MODEL`.
- Optional repeatable `--deep MODEL`.
- Scope note:
  - The local advisor reasons over known capability profiles and workflow heuristics.
  - The agent should say when a recommendation is based only on naming heuristics or incomplete local profiles.
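For example, a hedged invocation using the repeatable flags (the model names are placeholders drawn from the candidates mentioned later in this document, not verified local profiles):

`python3 <bundle-root>/model-capability-advisor/run.py --workflow batch --quick gemini-flash --quick claude-haiku --deep claude-sonnet`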
Execution
Step 1: Define workflow requirements
Identify specific needs for the target workflow:
Workflow types and requirements:
Analysis workflow:
- Speed priority: Fast iteration for exploratory analysis
- Depth priority: Deep reasoning for thesis development
- Key capabilities: Long context, structured output, evidence synthesis
- Typical use: `analysis`, `market-brief`, `stock-data`
Chat workflow:
- Speed priority: Low latency for interactive conversation
- Depth priority: Contextual understanding for multi-turn dialogue
- Key capabilities: Context retention, clarification, follow-up
- Typical use: `strategy-chat`, interactive Q&A
Batch workflow:
- Speed priority: High throughput for screening many symbols
- Depth priority: Consistent quality across batch
- Key capabilities: Parallel processing, cost efficiency
- Typical use: `market-screen`, `decision-dashboard`, `watchlist-import`
Backtest workflow:
- Speed priority: Fast iteration for parameter tuning
- Depth priority: Rigorous evaluation for validation
- Key capabilities: Numerical reasoning, statistical analysis
- Typical use: `backtest-evaluator`, performance analysis
Paper trading workflow:
- Speed priority: Real-time decision making
- Depth priority: Risk assessment and validation
- Key capabilities: Numerical precision, rule adherence
- Typical use: `paper-trading`, `strategy-design`, `decision-support`
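As a minimal sketch of how these requirements could be encoded for the scoring steps that follow (the key and field names are assumptions for illustration, not the executor's actual schema):

```python
# Illustrative encoding of the workflow requirements above; keys and field
# names are assumptions for this sketch, not run.py's actual schema.
WORKFLOW_REQUIREMENTS = {
    "analysis": {
        "speed_priority": "fast iteration for exploratory analysis",
        "depth_priority": "deep reasoning for thesis development",
        "key_capabilities": ["long context", "structured output", "evidence synthesis"],
    },
    "batch": {
        "speed_priority": "high throughput for screening many symbols",
        "depth_priority": "consistent quality across the batch",
        "key_capabilities": ["parallel processing", "cost efficiency"],
    },
    # "chat", "backtest", and "paper" follow the same pattern.
}
```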
Step 2: Define model capability profiles
Establish capability dimensions for model evaluation:
Capability dimensions:
Speed:
- Tokens per second (throughput)
- Time to first token (latency)
- Total response time
- Cost per token
Reasoning depth:
- Complex problem solving
- Multi-step reasoning
- Abstract thinking
- Nuanced judgment
Context handling:
- Maximum context window
- Context retention quality
- Long-range dependency handling
- Context compression efficiency
Structured output:
- JSON/XML generation quality
- Table formatting
- Numerical precision
- Consistency across outputs
Tool use:
- Function calling reliability
- Parameter extraction accuracy
- Error handling
- Multi-tool orchestration
Domain knowledge:
- Financial domain understanding
- Quantitative reasoning
- Statistical analysis
- A-share market specifics
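One way to represent a capability profile for scoring is shown in this minimal sketch (the dataclass, field names, and numbers are illustrative assumptions, not measured benchmarks):

```python
from dataclasses import dataclass

@dataclass
class CapabilityProfile:
    """Heuristic 0-100 scores on each capability dimension (illustrative only)."""
    name: str
    speed: int      # throughput, latency, cost per token
    reasoning: int  # multi-step reasoning, nuanced judgment
    context: int    # window size and retention quality
    structure: int  # JSON/table generation, numerical precision
    tools: int      # function-calling reliability, multi-tool orchestration
    domain: int     # financial and quantitative knowledge

# Placeholder profiles for a hypothetical quick/deep pair; not benchmark data.
PROFILES = [
    CapabilityProfile("quick-model-a", speed=90, reasoning=60, context=65,
                      structure=70, tools=70, domain=55),
    CapabilityProfile("deep-model-b", speed=45, reasoning=90, context=85,
                      structure=85, tools=85, domain=75),
]
```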
Step 3: Score candidate models
Evaluate available models against workflow requirements:
Model scoring methodology:
For each model, score (0-100) on each dimension:
- Speed: Based on known throughput and latency
- Reasoning: Based on benchmark performance and complexity handling
- Context: Based on window size and retention quality
- Structure: Based on output formatting reliability
- Tools: Based on function calling accuracy
- Domain: Based on financial knowledge demonstrations
Workflow-specific weighting:
Analysis workflow weights:
- Reasoning: 40%
- Context: 25%
- Structure: 15%
- Domain: 15%
- Speed: 5%
Chat workflow weights:
- Speed: 35%
- Context: 30%
- Reasoning: 20%
- Structure: 10%
- Tools: 5%
Batch workflow weights:
- Speed: 50%
- Structure: 20%
- Tools: 15%
- Reasoning: 10%
- Context: 5%
Backtest workflow weights:
- Reasoning: 35%
- Structure: 30%
- Domain: 20%
- Speed: 10%
- Context: 5%
Paper trading workflow weights:
- Reasoning: 30%
- Tools: 25%
- Structure: 25%
- Speed: 15%
- Domain: 5%
Composite score calculation:
- Weighted sum of dimension scores
- Normalize to 0-100 scale
- Rank models by composite score
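A minimal sketch of the weighted composite score described above; the weight tables are copied from this section, while the function name and example numbers are assumptions for illustration:

```python
# Workflow-specific weights from this section; each workflow's weights sum to 1.0,
# so a weighted sum of 0-100 dimension scores stays on a 0-100 scale.
WORKFLOW_WEIGHTS = {
    "analysis": {"reasoning": 0.40, "context": 0.25, "structure": 0.15, "domain": 0.15, "speed": 0.05},
    "chat":     {"speed": 0.35, "context": 0.30, "reasoning": 0.20, "structure": 0.10, "tools": 0.05},
    "batch":    {"speed": 0.50, "structure": 0.20, "tools": 0.15, "reasoning": 0.10, "context": 0.05},
    "backtest": {"reasoning": 0.35, "structure": 0.30, "domain": 0.20, "speed": 0.10, "context": 0.05},
    "paper":    {"reasoning": 0.30, "tools": 0.25, "structure": 0.25, "speed": 0.15, "domain": 0.05},
}

def composite_score(dimension_scores: dict[str, float], workflow: str) -> float:
    """Weighted sum of 0-100 dimension scores for the given workflow."""
    weights = WORKFLOW_WEIGHTS[workflow]
    return sum(weights[dim] * dimension_scores.get(dim, 0.0) for dim in weights)

# Example: score a hypothetical model for the analysis workflow (placeholder numbers).
scores = {"speed": 90, "reasoning": 60, "context": 65, "structure": 70, "tools": 70, "domain": 55}
print(round(composite_score(scores, "analysis"), 1))  # 63.5
```

Ranking then reduces to sorting candidates by their composite score for the target workflow.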
Step 4: Generate quick/deep recommendations
Select optimal model pairing:
Quick model selection:
- Prioritize speed and cost efficiency
- Acceptable reasoning depth for routine tasks
- Good enough for screening, monitoring, batch operations
- Typical candidates: Gemini Flash, Claude Haiku, GPT-3.5
Deep model selection:
- Prioritize reasoning depth and accuracy
- Acceptable speed for critical decisions
- Required for thesis development, risk assessment, validation
- Typical candidates: Claude Opus, GPT-4, Gemini Pro
Pairing strategy:
- Use quick model for initial screening and routine tasks
- Escalate to deep model for critical decisions and complex analysis
- Balance cost vs. quality based on workflow stage
- Consider fallback options if primary unavailable
Escalation triggers:
- High uncertainty in quick model output
- Critical decision point (large position, high risk)
- Complex reasoning required (multi-factor analysis)
- Validation needed (cross-check quick model)
- User explicitly requests deeper analysis
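A hedged sketch of the escalation check implied by these triggers (the function name and thresholds are placeholder assumptions, not calibrated values):

```python
def should_escalate(uncertainty: float,
                    position_fraction: float,
                    multi_factor: bool,
                    needs_validation: bool,
                    user_requested_depth: bool) -> bool:
    """Return True when any escalation trigger from this step fires.
    Thresholds are illustrative assumptions, not calibrated values."""
    return (
        uncertainty > 0.3              # high uncertainty in the quick model's output
        or position_fraction > 0.10    # critical decision: large position / high risk
        or multi_factor                # complex reasoning required (multi-factor analysis)
        or needs_validation            # cross-check the quick model's conclusion
        or user_requested_depth        # user explicitly requested deeper analysis
    )

# Example: a routine screen stays on the quick model; a large position escalates.
print(should_escalate(0.1, 0.02, False, False, False))  # False
print(should_escalate(0.1, 0.25, False, False, False))  # True
```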
Step 5: Identify model tradeoffs
Highlight important considerations:
Speed vs. depth tradeoffs:
- Quick models: 5-10× faster but may miss nuances
- Deep models: More thorough but higher latency and cost
- Batch operations: Speed matters more than individual quality
- Critical decisions: Depth matters more than speed
Context window tradeoffs:
- Large context: Can process more data but slower and costlier
- Small context: Faster and cheaper but may miss connections
- Long documents: Require large context models
- Iterative workflows: Can use smaller context with state management
Cost tradeoffs:
- Quick models: $0.001-0.01 per 1K tokens
- Deep models: $0.01-0.10 per 1K tokens
- Batch operations: Cost scales linearly with volume
- Critical decisions: Cost is secondary to quality
Reliability tradeoffs:
- Established models: More reliable but may be slower or costlier
- Newer models: Faster and cheaper but less proven
- Tool use: Some models better at function calling
- Structured output: Some models better at JSON/table generation
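To make the cost tradeoff concrete, here is a back-of-the-envelope estimate using the per-1K-token ranges above; the token counts and prices are illustrative assumptions, not measured figures:

```python
def run_cost(tokens: int, price_per_1k: float) -> float:
    """Estimated cost of one run in dollars, given total tokens processed."""
    return tokens / 1000 * price_per_1k

# Hypothetical batch screen of 200 symbols at roughly 2K tokens each.
batch_tokens = 200 * 2_000
print(f"quick model: ${run_cost(batch_tokens, 0.005):.2f}")  # mid-range of $0.001-0.01 / 1K
print(f"deep model:  ${run_cost(batch_tokens, 0.05):.2f}")   # mid-range of $0.01-0.10 / 1K
```

Because cost scales linearly with volume, routing the batch through the quick model and escalating only flagged symbols to the deep model keeps total spend close to the quick-model figure.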
Step 6: Generate capability advice report
Organize findings into structured report:
Part 1: Workflow Requirements
- Target workflow type
- Speed vs. depth priority
- Key capability requirements
- Typical use cases
Part 2: Quick Model Recommendation
- Recommended model name
- Composite score for workflow
- Dimension scores (speed, reasoning, context, etc.)
- Strengths for this workflow
- Limitations to be aware of
Part 3: Deep Model Recommendation
- Recommended model name
- Composite score for workflow
- Dimension scores
- Strengths for this workflow
- Limitations to be aware of
Part 4: Pairing Rationale
- Why this quick/deep pairing
- When to use quick vs. deep
- Escalation triggers
- Cost/quality tradeoff
Part 5: Alternative Candidates
- Other viable quick models (with scores)
- Other viable deep models (with scores)
- Fallback options if primary unavailable
- Experimental models to consider
Part 6: Model Tradeoffs
- Speed vs. depth considerations
- Context window considerations
- Cost considerations
- Reliability considerations
Part 7: Important Unknowns
- Provider availability (not verified)
- Actual latency in host environment (not measured)
- Tool-call reliability (not tested)
- Cost in production (estimates only)
- Model updates (profiles may be outdated)
Part 8: Recommendations
- Use quick model for: [specific tasks]
- Use deep model for: [specific tasks]
- Monitor performance and adjust
- Consider benchmarking in production
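A minimal sketch of assembling the eight parts into the success format described in the Output Contract (the helper function and the exact section-header style are assumptions for illustration):

```python
REPORT_SECTIONS = [
    "Workflow Requirements",
    "Quick Model Recommendation",
    "Deep Model Recommendation",
    "Pairing Rationale",
    "Alternative Candidates",
    "Model Tradeoffs",
    "Important Unknowns",
    "Recommendations",
]

def render_report(workflow: str, bodies: dict[str, str]) -> str:
    """Render the report header plus the eight parts in order.
    Missing bodies are marked rather than silently dropped."""
    lines = [f"# Model Capability Advice ({workflow})", ""]
    for i, section in enumerate(REPORT_SECTIONS, start=1):
        lines.append(f"## Part {i}: {section}")
        lines.append(bodies.get(section, "(not available)"))
        lines.append("")
    return "\n".join(lines)
```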
Step 7: Run the local executor
`python3 <bundle-root>/model-capability-advisor/run.py --workflow analysis --quick gemini-flash --deep claude-sonnet`
Step 8: Deliver as heuristic planning support
When delivering results, maintain proper framing:
Advice interpretation:
- This is heuristic workflow-fit advice, not empirical benchmarking
- Scores are based on known capability profiles, not production testing
- Recommendations assume typical use cases, not specific edge cases
- Actual performance may vary based on host environment
Verification needed:
- Provider availability (is model actually configured?)
- Actual latency (measure in production environment)
- Tool-call reliability (test with actual workflows)
- Cost in production (monitor actual usage)
- Model updates (check for newer versions)
Limitation disclosure:
- Profiles may be outdated (models update frequently)
- Heuristics may not match specific use case
- No guarantee of availability or performance
- Recommendations are starting point, not final answer
- Empirical testing recommended for high-stakes workflows
Output Contract
- Success format: readable text beginning with `# Model Capability Advice (<workflow>)`.
- Sections: selected quick model, selected deep model, rationale, optional warnings, and candidate profiles.
- Caller-facing delivery standard:
- Eight-part structure: Workflow requirements, quick model recommendation, deep model recommendation, pairing rationale, alternative candidates, model tradeoffs, important unknowns, recommendations
- Heuristic framing: Label result as workflow-fit advice, not empirical benchmarking
- Scoring transparency: Provide dimension scores and composite scores for each model
- Pairing rationale: Explain why this quick/deep pairing and when to use each
- Tradeoff disclosure: Speed vs. depth, context, cost, reliability considerations
- Unknown identification: Provider availability, actual latency, tool reliability not verified
- Verification guidance: State what needs to be verified in production
- No authoritative claims: Avoid implying recommended model is installed, configured, or benchmark-dominant unless verified
Failure Handling
- Parse and argument errors: non-zero exit with a readable `命令错误` ("command error") message.
- Missing candidates do not cause failure; the output degrades into workflow-level guidance.
- Unknown workflow type: provide general guidance and suggest closest known workflow.
- No models provided: recommend based on typical model profiles for workflow.
- Conflicting requirements: highlight tradeoffs and recommend multiple options.
Key Rules
- Keep this skill advisory rather than authoritative.
- Do not imply that a recommended model is installed or configured unless separately verified.
- Use it to improve workflow design, not to replace empirical benchmarking when high-stakes performance matters.
- Scoring must be transparent. Provide dimension scores and weighting methodology.
- Tradeoffs must be explicit. Speed vs. depth, cost vs. quality, etc.
- Unknowns must be identified. Provider availability, actual latency, tool reliability.
- Verification must be recommended. State what needs testing in production.
- Profiles may be outdated. Acknowledge that models update frequently.
- Heuristic framing is mandatory. This is planning support, not empirical truth.
Composition
- Often used before `analysis`, `strategy-chat`, `backtest-evaluator`, or `paper-trading`.
- Works as a planning helper rather than a market-analysis skill.
- Can inform agent execution design across all workflow stages.
- Should be rerun when new models become available or requirements change.
- Results can inform cost optimization and performance tuning.