Extract slide text from PDF

Run the extract_slide_text.py script to extract the text content of each PDF page into a structured markdown file:

uv run .agents/skills/extract-slide-text/extract_slide_text.py <pdf_path> <output_path> [images_dir]

Arguments

pdf_path (required): Path to the PDF file.
output_path (required): Path to write the output markdown file.
images_dir (optional): Path to the slide images directory. Used to generate correct relative image references. Defaults to slide_images/.
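If the script needs to be driven from Python rather than a shell, the invocation above could be wrapped roughly as follows. This is a minimal sketch: the helper names `build_command` and `extract_slide_text` are hypothetical, and only the script path and argument order are taken from the usage line above.

```python
import subprocess
from typing import List, Optional

# Script path as given in the usage line above.
SCRIPT = ".agents/skills/extract-slide-text/extract_slide_text.py"


def build_command(
    pdf_path: str, output_path: str, images_dir: Optional[str] = None
) -> List[str]:
    """Build the argv list for the documented `uv run` invocation.

    images_dir is appended only when given, so the script falls back
    to its documented default (slide_images/) otherwise.
    """
    cmd = ["uv", "run", SCRIPT, pdf_path, output_path]
    if images_dir is not None:
        cmd.append(images_dir)
    return cmd


def extract_slide_text(
    pdf_path: str, output_path: str, images_dir: Optional[str] = None
) -> None:
    """Run the script and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(pdf_path, output_path, images_dir), check=True)
```

Calling `extract_slide_text("talk.pdf", "talk_slides.md")` would then run the script with the default images directory; the file names here are placeholders.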

Output format

A markdown file with one section per slide:

## Slide 1

![Slide 1](slide_images/slide_1.png)

\```
Extracted text content from slide 1
\```

## Slide 2

![Slide 2](slide_images/slide_2.png)

\```
Extracted text content from slide 2
\```

Pages with no extractable text (e.g., full-bleed images) show the placeholder (no extractable text) in place of slide text.
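To make the format concrete, here is one way output of this shape could be assembled from raw pdftotext output. It is a sketch under stated assumptions, not the script's actual code: the function names are hypothetical, it assumes pages are separated by the form feed character that pdftotext emits at page breaks, and it assumes the placeholder sits inside the fenced block.

```python
from typing import List

# Built from single backticks so this example can itself sit in a fenced block.
FENCE = "`" * 3
NO_TEXT = "(no extractable text)"


def split_pages(raw: str) -> List[str]:
    """Split pdftotext output into one stripped string per page.

    Each page ends with a form feed, so splitting leaves one trailing
    empty chunk that is not a real page.
    """
    pages = raw.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [page.strip() for page in pages]


def render_markdown(pages: List[str], images_dir: str = "slide_images/") -> str:
    """Render one '## Slide N' section per page, numbered from 1."""
    images_dir = images_dir.rstrip("/")
    sections = []
    for number, text in enumerate(pages, start=1):
        body = text if text else NO_TEXT  # assumed placement of the placeholder
        sections.append(
            f"## Slide {number}\n\n"
            f"![Slide {number}]({images_dir}/slide_{number}.png)\n\n"
            f"{FENCE}\n{body}\n{FENCE}\n"
        )
    return "\n".join(sections)
```

Keeping the splitting and rendering steps separate makes the empty-page case easy to check in isolation.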

Why this matters

PDF text extraction is deterministic: it reads the text stored in the PDF itself, so it produces ground-truth slide content without relying on vision models. This prevents misidentification of embedded screenshots or demo captures as actual slide content, a common failure mode when using only image-based slide analysis.

Prerequisites

Poppler utilities must be installed (provides the pdftotext command):

macOS: brew install poppler
Ubuntu: apt-get install poppler-utils
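A quick pre-flight check can confirm that pdftotext is on the PATH before running the script. The sketch below is illustrative only: `find_pdftotext` and `install_hint` are hypothetical helpers, and the hint for non-macOS platforms simply reuses the Ubuntu command listed above, which will not suit every Linux distribution.

```python
import shutil
import sys
from typing import Optional


def find_pdftotext() -> Optional[str]:
    """Return the full path to pdftotext, or None if it is not on PATH."""
    return shutil.which("pdftotext")


def install_hint(platform: str = sys.platform) -> str:
    """Suggest an install command based on the commands listed above."""
    if platform == "darwin":
        return "brew install poppler"
    # Assumption: any other platform gets the Ubuntu command.
    return "apt-get install poppler-utils"


def check_prerequisites() -> None:
    """Exit with a helpful message when pdftotext is missing."""
    if find_pdftotext() is None:
        sys.exit(f"pdftotext not found. Try: {install_hint()}")
```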
