Paper Analyst

Analyze academic papers supplied as a PDF or as pasted text. Output is in Chinese by default. All outputs follow references/output-schema.md, paper type detection uses references/paper-type-rubric.md, and the anti-hallucination rules are in references/quality-checklist.md.

Quick Reference

File	Purpose
`references/output-schema.md`	Section structure and field rules
`references/paper-type-rubric.md`	How to classify paper type
`references/quality-checklist.md`	Anti-hallucination checklist
`references/presentation-schema.md`	Slide plan JSON schema
`references/presentation-style-guide.md`	Content compression rules for slides
`references/pptx-handoff.md`	How to call the pptx skill for rendering
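`scripts/extract_pdf_meta.py`	Optional: extract PDF metadata to JSON
`scripts/extract_pdf_figures.py`	Extract PDF figures to figures/ and write figures/index.json (presentation_with_figures only)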

Mode Selection

The default mode is standard. Detect the mode from trigger phrases in the user's request:

Mode	Trigger	Output
`quick`	"quick", "简单说", "一句话", "简要"	Header + info + abstract + 3 contributions
`standard`	(default)	Full analysis: sections 1–5
`extended`	"前作", "课题组", "prior work"	standard + author/group prior work
`presentation`	"PPT", "组会", "汇报大纲", "slides"	standard + slide outline
`presentation_with_figures`	"图表", "figures", "带图", "关键图"	presentation + figure annotations

If ambiguous, use standard and offer to switch.
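
The trigger table above can be sketched as a keyword lookup. This is a minimal illustration, not the skill's actual implementation: the function name `detect_mode` and the precedence order (most specific mode first) are assumptions, since the table does not say which mode wins when several triggers appear.

```python
# Trigger keywords per mode, copied from the table above.
# Assumption: check the most specific mode first, because
# presentation_with_figures extends presentation, which extends standard.
MODE_TRIGGERS = [
    ("presentation_with_figures", ["图表", "figures", "带图", "关键图"]),
    ("presentation", ["ppt", "组会", "汇报大纲", "slides"]),
    ("extended", ["前作", "课题组", "prior work"]),
    ("quick", ["quick", "简单说", "一句话", "简要"]),
]


def detect_mode(request: str) -> str:
    """Return the first mode whose trigger appears in the request, else 'standard'."""
    text = request.lower()  # makes "PPT" and "Slides" match their lowercase triggers
    for mode, triggers in MODE_TRIGGERS:
        if any(trigger in text for trigger in triggers):
            return mode
    return "standard"
```

Plain substring matching is deliberately loose, which is why the fallback of using standard and offering to switch still matters for ambiguous requests.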

Workflow

Step 1: Assess Input Quality

Classify PDF quality before analysis:

良好 (good): full text is extractable
降级处理 (degraded): partial text, scanned sections, or garbled encoding
严重降级 (severely degraded): minimal text or an image-only PDF

If degraded: state reason in header line, proceed with available content, mark all gaps explicitly. Never fabricate content to fill gaps.
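
One way to make the three quality levels concrete is to count extracted characters per page. The sketch below is only a heuristic under stated assumptions: `classify_pdf_quality`, `min_chars`, and `severe_ratio` are hypothetical names and placeholder thresholds, not values defined by this skill, and character counts alone cannot detect garbled encoding.

```python
def classify_pdf_quality(page_char_counts, min_chars=200, severe_ratio=0.2):
    """Map per-page extracted character counts to a quality label.

    min_chars and severe_ratio are placeholder thresholds (assumptions);
    tune them against real PDFs before relying on the result.
    """
    if not page_char_counts:
        return "严重降级"  # nothing extracted at all
    readable = sum(1 for count in page_char_counts if count >= min_chars)
    ratio = readable / len(page_char_counts)
    if ratio == 1.0:
        return "良好"  # every page yielded usable text
    if ratio < severe_ratio:
        return "严重降级"  # almost no page yielded usable text
    return "降级处理"  # some pages readable, some not
```

Whatever heuristic is used, the label still has to be accompanied by the stated reason and explicit gap markers described above.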

Optional: if user has Python, suggest running scripts/extract_pdf_meta.py first for structured metadata.

Step 2: Classify Paper Type

Read references/paper-type-rubric.md and classify. Do NOT assume AI/ML. Output the type label and 2–3 evidence indicators before proceeding.

Step 3: Execute Analysis

Follow references/output-schema.md for the selected mode. Apply all rules from references/quality-checklist.md throughout every section.

Step 4: Self-Check Before Output

Verify before finalizing:

Every uncertain field marked [不确定] or [未明确给出]
Every contribution tagged [原文声明] or [模型归纳]
No section silently omitted — skipped sections state why
Paper type label matches rubric evidence
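
The contribution-tagging check in this list lends itself to a mechanical pass. A minimal sketch, assuming contributions are available as plain strings; the helper name `untagged_contributions` is hypothetical:

```python
# The two source tags every contribution must carry (from the checklist above).
SOURCE_TAGS = ("[原文声明]", "[模型归纳]")


def untagged_contributions(contributions):
    """Return the contributions that carry neither source tag."""
    return [
        item for item in contributions
        if not any(tag in item for tag in SOURCE_TAGS)
    ]
```

An empty result means the tagging item passes; the other items in the list still need a manual read.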

Anti-Hallucination Rules

Full rules in references/quality-checklist.md. Non-negotiable constraints:

Source tagging: [原文声明] = directly stated in paper (cite location); [模型归纳] = inferred by model (state reasoning basis)
Uncertainty: [未明确给出] when absent; [不确定] when ambiguous
No domain assumption: classify paper type first, always
No fabrication: venue, DOI, year, affiliations not in text → [未明确给出]
Evidence binding: each contribution must cite section/figure/table/quote
Degraded PDF: state which sections were unreadable; do not fill gaps
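
The no-fabrication rule for metadata can be sketched as a default-filling step. This assumes extracted metadata arrives as a dict; the field keys and the name `normalize_metadata` are illustrative, not part of the skill's schema.

```python
NOT_GIVEN = "[未明确给出]"

# Fields the no-fabrication rule names explicitly; the dict keys are assumed.
METADATA_FIELDS = ("venue", "doi", "year", "affiliations")


def normalize_metadata(extracted: dict) -> dict:
    """Keep values found in the text; mark anything missing or empty as not given."""
    return {name: extracted.get(name) or NOT_GIVEN for name in METADATA_FIELDS}
```

The point of the design is that a missing field produces a visible marker instead of a plausible-looking guess.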

Degraded Input Fallback

Situation	Action
Only abstract available	`quick` mode, note limitation
Scanned PDF, no text	Ask user for text or OCR first
Missing references section	Skip prior work analysis, note absence
Figures unreadable	Skip figure analysis, note absence
Non-English paper	Translate key sections, note source language

Extended Mode: Author Prior Work

Only in extended mode:

Extract all author names from paper
Identify self-citations in reference list (shared authors)
Infer research group focus from affiliations + paper title
List prior works from reference list only — no web search, no external knowledge
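Tag all output: [基于论文内引用，非外部检索] (based on in-paper citations, not external retrieval)
If information is insufficient, state explicitly: "信息不足，无法判断前作关系" (insufficient information to determine prior-work relationships)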
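
The self-citation step above can be sketched as a set intersection over author names. This assumes the reference list has already been parsed into dicts with `authors` and `title` keys (a hypothetical structure), and it uses naive exact matching after whitespace and case normalization, so initials versus full names will not match.

```python
TAG = "[基于论文内引用，非外部检索]"


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())


def find_self_citations(paper_authors, references):
    """Return references that share at least one author with the paper."""
    own = {normalize(author) for author in paper_authors}
    hits = []
    for ref in references:
        shared = sorted(own & {normalize(author) for author in ref["authors"]})
        if shared:
            hits.append({"title": ref["title"], "shared_authors": shared, "tag": TAG})
    return hits
```

If the function returns an empty list, the fallback message about insufficient information applies.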

Presentation Mode: PPT Generation

Only in presentation or presentation_with_figures mode.

Step A: Collect Overrides

Before building the slide plan, check if the user specified any of:

audience (lab / conference / general) — default: lab
duration_hint (10min / 20min / 30min) — default: 20min
talk_style (technical / overview / discussion) — default: technical
emphasis (which sections to expand)
skip (which sections to omit)

If not specified, use defaults silently.
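
The override fields and their defaults can be captured in a small data structure. A sketch under assumptions: the names `TalkOverrides` and `collect_overrides` are hypothetical, and silently falling back to the default for an unrecognized value is one possible reading of "use defaults silently".

```python
from dataclasses import dataclass, field

# Allowed values per field, taken from the list above.
ALLOWED = {
    "audience": {"lab", "conference", "general"},
    "duration_hint": {"10min", "20min", "30min"},
    "talk_style": {"technical", "overview", "discussion"},
}


@dataclass
class TalkOverrides:
    audience: str = "lab"
    duration_hint: str = "20min"
    talk_style: str = "technical"
    emphasis: list = field(default_factory=list)  # sections to expand
    skip: list = field(default_factory=list)      # sections to omit


def collect_overrides(user_values: dict) -> TalkOverrides:
    """Apply recognized user values; keep the default for anything else."""
    overrides = TalkOverrides()
    for key, allowed in ALLOWED.items():
        value = user_values.get(key)
        if value in allowed:
            setattr(overrides, key, value)
    overrides.emphasis = list(user_values.get("emphasis", []))
    overrides.skip = list(user_values.get("skip", []))
    return overrides
```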

Step B0: Extract PDF Figures (presentation_with_figures only)

Before building the slide plan, run:

python scripts/extract_pdf_figures.py <pdf_path>

This saves all figures to figures/ and writes figures/index.json with name, path, and page for each image. Use this index when assigning figure_ref paths in the handoff.
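
To pick a figure for a slide, the index can be grouped by page. The sketch below assumes figures/index.json is a JSON list of objects with `name`, `path`, and `page` keys; the text above names those fields but does not confirm the exact layout, and `load_figure_index` is a hypothetical helper.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_figure_index(index_path="figures/index.json"):
    """Group figure paths by page number.

    Assumes the index is a JSON list of {"name", "path", "page"} objects.
    """
    entries = json.loads(Path(index_path).read_text(encoding="utf-8"))
    by_page = defaultdict(list)
    for entry in entries:
        by_page[entry["page"]].append(entry["path"])
    return dict(by_page)
```

Looking up the page where a method or result is described then gives the candidate paths for figure_ref.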

Step B: Build Slide Plan

Follow references/presentation-schema.md for structure. Follow references/presentation-style-guide.md for compression rules.

Map each slide role to the corresponding output-schema section
Apply user overrides (emphasis → expand, skip → omit)
For presentation_with_figures: set figure_needed: true on method/result slides where a figure is the primary evidence; add figure_ref and figure_hint
Set the slide count from duration_hint: 10min → 6-7 slides, 20min → 9-10, 30min → 12-14
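
The duration-to-slide-count rule is a plain lookup. A minimal sketch; the names `slide_range` and `fits_duration` are hypothetical, and falling back to the 20min range for an unknown hint mirrors the default stated in Step A.

```python
# Inclusive (min, max) slide counts per duration_hint, from the rule above.
SLIDE_RANGES = {"10min": (6, 7), "20min": (9, 10), "30min": (12, 14)}


def slide_range(duration_hint="20min"):
    """Return the (min, max) slide count, defaulting to the 20min range."""
    return SLIDE_RANGES.get(duration_hint, SLIDE_RANGES["20min"])


def fits_duration(slide_count, duration_hint="20min"):
    """Check whether a slide plan's length matches the duration hint."""
    low, high = slide_range(duration_hint)
    return low <= slide_count <= high
```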

Step C: Call pptx Skill

Follow references/pptx-handoff.md for the exact handoff format.

Strip all [原文声明] / [模型归纳] tags before passing to pptx
Do NOT include speaker notes in the handoff
Call pptx skill automatically — do not ask the user first
Exception: if user said "只要大纲" / "just the outline", output the slide plan as text and skip pptx
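
Stripping the source tags before the handoff can be done with one regular expression. This sketch assumes the tags appear in exactly the bare bracketed form shown above; `strip_source_tags` is a hypothetical helper, and the uncertainty tags are left in place because only the two source tags are to be removed.

```python
import re

# Matches either source tag plus any whitespace directly before it.
SOURCE_TAG = re.compile(r"\s*\[(?:原文声明|模型归纳)\]")


def strip_source_tags(text: str) -> str:
    """Remove [原文声明] / [模型归纳] tags and trim the result."""
    return SOURCE_TAG.sub("", text).strip()
```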
