split-pdf
Split-PDF: Download, Split, and Deep-Read Academic Papers
CRITICAL RULE: Never read a full PDF. Never. Only read the 4-page split files, and only 3 splits at a time (~12 pages). Reading a full PDF will either crash the session with an unrecoverable "prompt too long" error — destroying all context — or produce shallow, hallucinated output. There are no exceptions.
When This Skill Is Invoked
The user wants you to read, review, or summarize an academic paper. The input is either:
- A file path to a local PDF (e.g.,
./articles/smith_2024.pdf) - A search query or paper title (e.g.,
"Gentzkow Shapiro Sinkinson 2014 competition newspapers")
Important: You cannot search for a paper you don't know exists. The user MUST provide either a file path or a specific search query — an author name, a title, keywords, a year, or some combination that identifies the paper. If the user invokes this skill without specifying what paper to read, ask them. Do not guess.
Step 1: Acquire the PDF
Determine the download directory:
- Inside a research project (has
CLAUDE.md,data/,paper/, etc.): use./articles/in the project directory (create if needed). - Outside a project (e.g., ad-hoc reading from Task Management root or general context): use
to-sort/downloads/in the Task Management folder. This ensures downloaded files persist across sessions.
If a local file path is provided:
- Verify the file exists
- If the file is NOT already inside the download directory, copy it there (do not move — preserve the original location)
- Proceed to Step 2
If a search query or paper title is provided:
- Use WebSearch to find the paper
- If WebSearch doesn't yield a direct PDF link, try the bibliography MCP
scholarly_searchtool first (cross-source search). Fallback to Python OpenAlex client:import sys sys.path.insert(0, ".scripts/openalex") from openalex_client import OpenAlexClient client = OpenAlexClient(email="user@example.edu") results = client.search_works(search="paper title here", per_page=5) # Check open_access.oa_url in results for direct PDF links - Use WebFetch or Bash (curl/wget) to download the PDF
- Save it to the download directory determined above
- Proceed to Step 2
CRITICAL: Always preserve the original PDF. The downloaded or provided PDF in the download directory must NEVER be deleted, moved, or overwritten at any point in this workflow. The split files are derivatives — the original is the permanent artifact. Do not clean up, do not remove, do not tidy. The original stays.
Step 2: Split the PDF
Create a subdirectory for the splits and run the splitting script:
from PyPDF2 import PdfReader, PdfWriter
import os, sys
def split_pdf(input_path, output_dir, pages_per_chunk=4):
os.makedirs(output_dir, exist_ok=True)
reader = PdfReader(input_path)
total = len(reader.pages)
prefix = os.path.splitext(os.path.basename(input_path))[0]
for start in range(0, total, pages_per_chunk):
end = min(start + pages_per_chunk, total)
writer = PdfWriter()
for i in range(start, end):
writer.add_page(reader.pages[i])
out_name = f"{prefix}_pp{start+1}-{end}.pdf"
out_path = os.path.join(output_dir, out_name)
with open(out_path, "wb") as f:
writer.write(f)
print(f"Split {total} pages into {-(-total // pages_per_chunk)} chunks in {output_dir}")
Directory convention (in-project):
articles/
├── smith_2024.pdf # original PDF — NEVER DELETE THIS
└── split_smith_2024/ # split subdirectory
├── smith_2024_pp1-4.pdf
├── smith_2024_pp5-8.pdf
├── smith_2024_pp9-12.pdf
└── ...
Directory convention (ad-hoc / outside project):
to-sort/downloads/
├── smith_2024.pdf # original PDF — NEVER DELETE THIS
└── split_smith_2024/ # split subdirectory
└── ...
The original PDF remains in the download directory permanently. The splits are working copies. If anything goes wrong, you can always re-split from the original.
If PyPDF2 is not installed, install it: uv pip install PyPDF2
Step 3: Read in Batches of 3 Splits
Read exactly 3 split files at a time (~12 pages). After each batch:
- Read the 3 split PDFs using the Read tool
- Update the running notes file (
notes.mdin the split subdirectory) - Pause and tell the user:
"I have finished reading splits [X-Y] and updated the notes. I have [N] more splits remaining. Would you like me to continue with the next 3?"
- Wait for the user to confirm before reading the next batch
Do NOT read ahead. Do NOT read all splits at once. The pause-and-confirm protocol is mandatory.
Step 4: Structured Extraction
As you read, collect information along these dimensions and write them into notes.md:
- Research question — What is the paper asking and why does it matter?
- Audience — Which sub-community of researchers cares about this?
- Method — How do they answer the question? What is the identification strategy?
- Data — What data do they use? Where precisely did they find it? What is the unit of observation? Sample size? Time period?
- Statistical methods — What econometric or statistical techniques do they use? What are the key specifications?
- Findings — What are the main results? Key coefficient estimates and standard errors?
- Contributions — What is learned from this exercise that we didn't know before?
- Replication feasibility — Is the data publicly available? Is there a replication archive? A data appendix? URLs for the underlying data?
These questions extract what a researcher needs to build on or replicate the work — a structured extraction more detailed and specific than a typical summary.
The Notes File
The output is notes.md in the split subdirectory:
articles/split_smith_2024/notes.md
This file is updated incrementally after each batch. Structure it with clear headers for each of the 8 dimensions. After each batch, update whichever dimensions have new information — do not rewrite from scratch.
By the time all splits are read, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. Not a summary — a structured extraction.
When NOT to Split
- Papers shorter than ~15 pages: read directly (still use the Read tool, not Bash)
- Policy briefs or non-technical documents: a rough summary is fine
- Triage only: read just the first split (pages 1-4) for abstract and introduction
Quick Reference
| Step | Action |
|---|---|
| Acquire | Download to ./articles/ (in-project) or to-sort/downloads/ (ad-hoc) |
| Split | 4-page chunks into ./articles/split_<name>/ |
| Read | 3 splits at a time, pause after each batch |
| Write | Update notes.md with structured extraction |
| Confirm | Ask user before continuing to next batch |
For detailed explanation of why this method works, see methodology.md.