PaperBanana: Academic Illustration Pipeline

Automates publication-ready academic illustrations with five specialized agents run in sequence: Retriever (categorize and select references) -> Planner (multimodal description) -> Stylist (polish) -> Visualizer (render) -> Critic (evaluate and refine). In diagram mode each agent is a separate Gemini API call; in plot mode the host agent works through the same stages and emits plotting code.

Two output modes:

  • DIAGRAM MODE: Each agent is a Python script calling Gemini VLM/image APIs. Run scripts/orchestrate.py for end-to-end execution.
  • PLOT MODE: Statistical plots generated as executable Python matplotlib/seaborn code (code-based to eliminate data hallucination).

Requirements: GOOGLE_API_KEY env var (used for VLM calls in retriever/planner/stylist/critic AND image generation in visualizer), Python 3.10+ with google-genai, matplotlib, seaborn, numpy, pillow.

Paper: PaperBanana: Automating Academic Illustrations with Multi-Agent Systems (arXiv:2601.23265, Google/PKU)


Step 1: Determine Output Mode

Decide which track to follow:

Signal | Mode
User provides raw data, table, CSV + visual intent (bar chart, scatter, etc.) | PLOT MODE
User provides methodology text, description, or figure caption | DIAGRAM MODE
User provides existing figure to improve | Match original type

Critical rule: PLOT MODE always generates Python code (never image generation for data visualizations). Code-based generation eliminates data hallucination errors that corrupt numerical accuracy in image-based approaches.
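
The decision table above can be sketched as a small function. The name `choose_mode` and its parameters are hypothetical, introduced only for illustration; the skill does not define them. The tie-break (raw data wins when both signals are present) is an assumption drawn from the critical rule, not something the skill states explicitly.

```python
from typing import Optional


def choose_mode(has_raw_data: bool,
                has_methodology_text: bool,
                existing_figure_type: Optional[str] = None) -> str:
    """Return "plot" or "diagram" following the Step 1 decision table.

    All parameter names are hypothetical stand-ins for whatever signals
    the host agent extracts from the user's request.
    """
    # An existing figure keeps its original type.
    if existing_figure_type is not None:
        if existing_figure_type not in ("plot", "diagram"):
            raise ValueError(f"unknown figure type: {existing_figure_type!r}")
        return existing_figure_type
    # Assumption: raw data takes priority, because data visualizations
    # must be code-based and never produced by image generation.
    if has_raw_data:
        return "plot"
    if has_methodology_text:
        return "diagram"
    raise ValueError("no usable signal: ask for data or methodology text")
```

For example, `choose_mode(has_raw_data=True, has_methodology_text=False)` selects the plot track.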


Step 2: Execute Pipeline

DIAGRAM MODE — Automated Pipeline

Primary entry point: Run the end-to-end orchestrator:

python scripts/orchestrate.py \
  --methodology-file methodology.txt \
  --caption "Figure 1: Overview of proposed framework" \
  --mode diagram \
  --output output/diagram.png

Or with inline text:

python scripts/orchestrate.py \
  --methodology "Our framework consists of three modules..." \
  --caption "Figure 1: System overview" \
  --mode diagram \
  --output output/diagram.png

The orchestrator chains all 5 agents automatically and handles the Critic's refinement loop (up to 3 iterations). Intermediate outputs are saved to output/work/ for inspection.

Pipeline Details

Read references/DIAGRAM-PROMPTS.md for the actual Gemini prompt templates used by each agent.

Phase 1: RETRIEVER (scripts/retriever.py) — Gemini VLM call

  • Classifies methodology into 1 of 4 categories from references/DIAGRAM-CATEGORIES.md
  • Selects 2 most relevant reference diagrams from the 13 curated examples in assets/references/
  • Identifies visual intent: Framework Overview, Pipeline/Flow, Detailed Module, Architecture Diagram

Phase 2: PLANNER (scripts/planner.py) — Multimodal Gemini VLM call

  • Sends the 2 selected reference images + methodology text to Gemini as a multimodal prompt
  • The VLM "sees" what good methodology diagrams look like (in-context learning from images)
  • Generates an extremely detailed textual description of the target diagram
  • Critical: describe every visual attribute in natural language only; never use hex codes or pixel dimensions

Phase 3: STYLIST (scripts/stylist.py) — Gemini VLM call

  • Takes the Planner's description + full NeurIPS 2025 style guide
  • Applies domain-specific styling based on the category from Phase 1
  • Follows 5 critical rules: preserve aesthetics, intervene minimally, respect domain, enrich details, preserve content
  • Outputs the polished description only

Phase 4: VISUALIZER (scripts/generate_image.py) — Gemini Image API call

  • Uses gemini-3-pro-image-preview to generate the diagram image from the styled description
  • Prepends quality prefix (high-res, legible text, clean background, no watermarks)
  • Aspect ratio selected based on visual intent (16:9 for pipelines, 3:2 for modules)
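
As a rough sketch of how the Visualizer's request might be assembled before the image call: the prefix wording, the dictionary keys, and the fallback ratio below are assumptions for illustration. Only the two aspect ratios and the four prefix themes come from the list above; the real prompt text lives in references/DIAGRAM-PROMPTS.md.

```python
# Hypothetical prefix wording covering the four themes named above.
QUALITY_PREFIX = (
    "High-resolution academic diagram with legible text, "
    "a clean background, and no watermarks. "
)

# Only these two ratios are stated in the skill.
ASPECT_BY_INTENT = {
    "Pipeline/Flow": "16:9",
    "Detailed Module": "3:2",
}
DEFAULT_ASPECT = "16:9"  # assumption: fallback for intents not listed above


def build_visualizer_request(styled_description: str, visual_intent: str) -> dict:
    """Prepend the quality prefix and choose an aspect ratio by visual intent."""
    return {
        "prompt": QUALITY_PREFIX + styled_description.strip(),
        "aspect_ratio": ASPECT_BY_INTENT.get(visual_intent, DEFAULT_ASPECT),
    }
```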

Phase 5: CRITIC (scripts/critic.py) — Multimodal Gemini VLM call

  • Sends the generated image + methodology text to Gemini for multimodal evaluation
  • Scores on 4 dimensions (faithfulness, readability, conciseness, aesthetics)
  • If faithfulness < 7 OR readability < 7: generates revised description → loops to Phase 4
  • Maximum 3 refinement iterations
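
The refinement rule above can be sketched as a loop. This is not the code in scripts/orchestrate.py: `refine`, `render`, and `critique` are hypothetical names, and the sketch assumes that one iteration means one critique followed by one re-render.

```python
from typing import Callable, Dict, Tuple

MAX_ITERATIONS = 3  # maximum refinement iterations
THRESHOLD = 7       # faithfulness and readability must both reach 7


def needs_revision(scores: Dict[str, float]) -> bool:
    """Apply the rule: faithfulness < 7 OR readability < 7."""
    return scores["faithfulness"] < THRESHOLD or scores["readability"] < THRESHOLD


def refine(description: str,
           render: Callable[[str], str],
           critique: Callable[[str, str], Tuple[Dict[str, float], str]]
           ) -> Tuple[str, int]:
    """Render, critique, and re-render until the scores pass or the cap is hit.

    render(description) returns an image path (Phase 4).
    critique(image, description) returns (scores, revised_description) (Phase 5).
    Returns the final image path and the number of refinements performed.
    """
    image = render(description)
    for iteration in range(MAX_ITERATIONS):
        scores, revised = critique(image, description)
        if not needs_revision(scores):
            return image, iteration
        description = revised        # loop back to Phase 4
        image = render(description)
    return image, MAX_ITERATIONS
```

Passing the two phases in as callables keeps the loop testable without any API calls.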

DIAGRAM MODE — Manual Execution

You can also run each agent individually for more control:

# Phase 1: Retriever
python scripts/retriever.py --methodology-file text.txt --output work/retriever.json

# Phase 2: Planner
python scripts/planner.py --methodology-file text.txt --caption "Figure 1: ..." \
  --references work/retriever.json --output work/planner.json

# Phase 3: Stylist
python scripts/stylist.py --description work/planner.json --output work/stylist.json

# Phase 4: Visualizer (extract styled_description from JSON, pass to generate_image.py)
python scripts/generate_image.py --prompt-file work/styled_desc.txt --output output/diagram.png

# Phase 5: Critic
python scripts/critic.py --image output/diagram.png --methodology-file text.txt \
  --description work/stylist.json --output work/critic.json
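
The Phase 4 comment above mentions extracting `styled_description` from the Stylist's JSON but does not show how. A minimal sketch, assuming that key sits at the top level of work/stylist.json (the helper name and the schema are assumptions):

```python
import json
from pathlib import Path


def extract_styled_description(stylist_json: str, out_txt: str,
                               key: str = "styled_description") -> str:
    """Copy one text field from the Stylist's JSON into a prompt file.

    Assumes the JSON is an object with `key` at the top level.
    """
    data = json.loads(Path(stylist_json).read_text(encoding="utf-8"))
    if key not in data:
        raise KeyError(f"{key!r} not found in {stylist_json}; keys: {sorted(data)}")
    out = Path(out_txt)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(data[key], encoding="utf-8")
    return data[key]
```

Calling `extract_styled_description("work/stylist.json", "work/styled_desc.txt")` would produce the file that the Phase 4 command reads.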

PLOT MODE

Read references/PLOT-PROMPTS.md for detailed agent prompts. Read references/PLOT-STYLE-GUIDE.md for aesthetic rules.

Plot mode uses Claude (or the host agent) for reasoning and code generation — no Gemini API calls needed for plot generation itself.

Phase 1: CATEGORIZE (Retriever)

Match data characteristics and visual intent:

Data Type | Plot Types
Categorical comparison | Bar chart, grouped bar, stacked bar
Continuous trends | Line chart, area chart
Correlation/distribution | Scatter plot, histogram, box plot, violin
Matrix/similarity | Heatmap, confusion matrix
Multi-dimensional | Radar/spider chart
Proportional | Pie/donut chart, treemap
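
The mapping above could be held as a lookup table. In this sketch `PLOT_TYPES` and `candidate_plots` are hypothetical names; only the entries come from the table.

```python
PLOT_TYPES = {
    "categorical comparison": ["bar chart", "grouped bar", "stacked bar"],
    "continuous trends": ["line chart", "area chart"],
    "correlation/distribution": ["scatter plot", "histogram", "box plot", "violin"],
    "matrix/similarity": ["heatmap", "confusion matrix"],
    "multi-dimensional": ["radar/spider chart"],
    "proportional": ["pie/donut chart", "treemap"],
}


def candidate_plots(data_type: str) -> list[str]:
    """Return the plot types listed for a data type (case-insensitive)."""
    key = data_type.strip().lower()
    if key not in PLOT_TYPES:
        raise KeyError(f"unknown data type: {data_type!r}")
    return PLOT_TYPES[key]
```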

Phase 2: PLAN (Planner)

Create a detailed specification that explicitly enumerates:

  • Every raw data point with exact coordinates/values
  • Axis ranges, labels, tick marks, scales (linear/log)
  • Color assignments for each series/category
  • Font sizes for title, axis labels, tick labels, legend
  • Line widths, marker sizes, marker shapes
  • Legend placement and formatting
  • Grid style (major/minor, dashed/solid)
  • Figure dimensions and DPI
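
One way to make this specification explicit is a small data structure that the Planner fills in before any plotting code is written. The class names, field names, and default values below are placeholders chosen for illustration; none of them are taken from the skill's scripts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SeriesSpec:
    label: str
    x: List[float]
    y: List[float]
    color: str                  # colour assignment for this series
    marker: str = "o"           # marker shape
    marker_size: float = 5.0    # placeholder default
    line_width: float = 1.5     # placeholder default


@dataclass
class PlotSpec:
    series: List[SeriesSpec]
    x_label: str
    y_label: str
    x_range: Tuple[float, float]
    y_range: Tuple[float, float]
    x_scale: str = "linear"
    y_scale: str = "linear"
    font_sizes: Dict[str, int] = field(default_factory=lambda: {
        "title": 12, "axis_label": 10, "tick_label": 8, "legend": 8})
    legend_loc: str = "upper right"
    grid: Dict[str, str] = field(default_factory=lambda: {
        "which": "major", "linestyle": "dashed"})
    figsize: Tuple[float, float] = (6.0, 4.0)
    dpi: int = 300

    def validate(self) -> None:
        """Reject specs that would plot the data incorrectly."""
        if not self.series:
            raise ValueError("spec has no data series")
        for s in self.series:
            if len(s.x) != len(s.y):
                raise ValueError(f"series {s.label!r}: x and y lengths differ")
        for axis, scale in (("x", self.x_scale), ("y", self.y_scale)):
            if scale not in ("linear", "log"):
                raise ValueError(f"{axis} scale must be 'linear' or 'log'")
        for axis, (low, high) in (("x", self.x_range), ("y", self.y_range)):
            if low >= high:
                raise ValueError(f"{axis} range must be increasing")
```

Validating the spec before code generation catches mismatched data lengths early, when they are cheaper to fix than after a failed render.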

Phase 3: STYLE (Stylist)

Read references/PLOT-STYLE-GUIDE.md for NeurIPS 2025 plot aesthetics.

Key styling rules:

  • White backgrounds only
  • Colorblind-friendly palettes (see assets/palettes/colorblind_safe.json)
  • Sans-serif fonts (Helvetica, Arial, or DejaVu Sans)
  • Markers on line charts for print readability
  • Inward-facing tick marks
  • Subtle grid lines (light gray, dashed)
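
These rules map onto standard matplotlib rcParams. The sketch below is not the contents of academic_default.mplstyle, which is not reproduced here, and it uses the Okabe-Ito palette as a stand-in for assets/palettes/colorblind_safe.json. Markers are left out on purpose, since they are usually set per series in the plotting call.

```python
# Stand-in colourblind-safe palette (Okabe-Ito); an assumption, not the
# skill's own palette file.
OKABE_ITO = ["#0072B2", "#E69F00", "#009E73", "#D55E00",
             "#CC79A7", "#56B4E9", "#F0E442", "#000000"]

ACADEMIC_RC = {
    "figure.facecolor": "white",      # white backgrounds only
    "axes.facecolor": "white",
    "savefig.facecolor": "white",
    "font.family": "sans-serif",
    "font.sans-serif": ["Helvetica", "Arial", "DejaVu Sans"],
    "xtick.direction": "in",          # inward-facing tick marks
    "ytick.direction": "in",
    "axes.grid": True,                # subtle light-gray dashed grid
    "grid.color": "lightgray",
    "grid.linestyle": "--",
}


def apply_style() -> None:
    """Apply the settings globally; matplotlib is imported lazily."""
    import matplotlib as mpl
    from cycler import cycler

    mpl.rcParams.update(ACADEMIC_RC)
    mpl.rcParams["axes.prop_cycle"] = cycler(color=OKABE_ITO)
```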

Phase 4: VISUALIZE (Visualizer — Code Generation)

Generate complete, self-contained Python matplotlib/seaborn code. Use scripts/plot_generator.py as a reference implementation or run it directly with a JSON config:

python scripts/plot_generator.py --config plot_config.json --output figure.pdf

Code requirements:

  • Self-contained: all data defined inline, no external file dependencies
  • Apply .mplstyle from assets/matplotlib_styles/academic_default.mplstyle
  • Set OUTPUT_PATH variable for output file location
  • 300 DPI, bbox_inches='tight'
  • No plt.show() — save only
  • Support both PDF and PNG output

After generating the code, execute it to produce the plot image.
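
A minimal script that follows the code requirements above might look like this. It is a sketch, not output from scripts/plot_generator.py; the data is the bar-chart request from the Quick Start examples in this document, and the style sheet is applied only if the file is present so the script still runs outside the skill directory.

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend: save only, never show
import matplotlib.pyplot as plt

OUTPUT_PATH = "figure.png"  # change the extension to .pdf for PDF output
STYLE_PATH = "assets/matplotlib_styles/academic_default.mplstyle"

# All data defined inline: no external file dependencies.
models = ["BERT", "GPT-4", "Claude", "Gemini"]
f1_scores = [92.3, 88.1, 95.7, 91.2]

if os.path.exists(STYLE_PATH):
    plt.style.use(STYLE_PATH)

fig, ax = plt.subplots(figsize=(5, 3.5))
bars = ax.bar(models, f1_scores, color="#0072B2")
ax.set_ylabel("F1 score")
ax.set_ylim(0, 100)
ax.bar_label(bars, fmt="%.1f", padding=2)  # print each exact value above its bar

fig.savefig(OUTPUT_PATH, dpi=300, bbox_inches="tight")
plt.close(fig)
```

Because every value is typed into the script, the rendered bars can be checked against the source data line by line.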

Phase 5: CRITIQUE (Critic)

Same rubric as diagram mode, plus plot-specific checks:

  • Data fidelity: Every data point correctly plotted
  • Axis accuracy: Ranges, labels, scales match specification
  • Layout: No overlapping labels, legends, or data points
  • Code correctness: Syntax valid, imports available, output saved

If code execution fails, analyze the error, simplify the approach, and regenerate.
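
The code-correctness check can be partly automated before the script is executed. The sketch below is an assumption about how such a check could look, not the behaviour of scripts/validate_output.py. It parses the generated source, confirms that imports resolve, and looks for the save and show calls named in the Phase 4 requirements.

```python
import ast
import importlib.util


def check_generated_code(source: str) -> list[str]:
    """Return a list of problems found in generated plotting code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    method_calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            names = []
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
                method_calls.add(node.func.attr)
        for name in names:
            top_level = name.split(".")[0]
            if importlib.util.find_spec(top_level) is None:
                problems.append(f"missing import: {top_level}")

    if "savefig" not in method_calls:
        problems.append("no savefig call: output is never saved")
    if "show" in method_calls:
        problems.append("show() call found: generated code must only save")
    return problems
```

An empty list means the script passed these static checks; it still has to be executed to confirm that the plot renders.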


Quick Start Examples

Diagram (automated): Run scripts/orchestrate.py with your methodology text file and caption.

Diagram (via agent): "Generate a methodology diagram for my transformer architecture. Here is the methodology section: [paste text]. Caption: Overview of our proposed multi-head attention framework."

Plot: "Create a bar chart comparing model performance. Data: {BERT: 92.3, GPT-4: 88.1, Claude: 95.7, Gemini: 91.2}. Intent: F1 score comparison across language models."

Improve: "Improve the aesthetics of this diagram: [paste existing description or attach current figure]"


File Reference

File | Purpose | When to Read
scripts/orchestrate.py | End-to-end pipeline runner | Diagram mode primary entry point
scripts/retriever.py | VLM-based reference selection | Phase 1 (diagram mode)
scripts/planner.py | Multimodal description generation | Phase 2 (diagram mode)
scripts/stylist.py | VLM-based style application | Phase 3 (diagram mode)
scripts/generate_image.py | Gemini Image API call | Phase 4 (diagram mode)
scripts/critic.py | VLM-based image evaluation | Phase 5 (diagram mode)
scripts/plot_generator.py | Template-based matplotlib generator | Phase 4 (plot mode)
scripts/validate_output.py | Output validation and dependency check | Post-generation validation
references/DIAGRAM-PROMPTS.md | Actual Gemini prompt templates for diagrams | All diagram phases
references/PLOT-PROMPTS.md | Agent prompts for plots | All plot phases
references/DIAGRAM-STYLE-GUIDE.md | NeurIPS 2025 diagram aesthetics | Phase 3 (Style)
references/PLOT-STYLE-GUIDE.md | NeurIPS 2025 plot aesthetics | Phase 3 (Style)
references/EVALUATION-RUBRIC.md | Critic scoring criteria (4 dimensions) | Phase 5 (Critique)
references/DIAGRAM-CATEGORIES.md | 4 diagram categories with keywords | Phase 1 (Categorize)
assets/references/index.json | 13 curated reference diagram metadata | Phase 1 (Retriever)
assets/references/*.jpg | 13 curated reference diagram images | Phase 2 (Planner multimodal input)
assets/palettes/*.json | Color palette definitions | Phase 3 (Style)
assets/matplotlib_styles/*.mplstyle | Matplotlib style sheets | Phase 4 (plot mode)

Environment Setup

# Required for all Gemini API calls (VLM reasoning + image generation)
export GOOGLE_API_KEY="your-api-key-here"

# Install dependencies
pip install google-genai matplotlib seaborn numpy pillow

Verify setup: python scripts/validate_output.py --check-deps
