arxiv-search
arXiv Search
Search arXiv through its official API for discovery, shortlisting, recent-preprint tracking, and local Markdown note generation from arXiv IDs.
Instructions
- Prefer this skill when the request is about arXiv-native preprints, recent submissions, or fields where arXiv coverage is strong: computer science, mathematics, physics, statistics, quantitative biology, and quantitative finance.
- Use the bundled CLI:
- In this repository:
skills/arxiv-search/scripts/search - After installation:
~/.agents/skills/arxiv-search/scripts/search - Markdown summary builder:
skills/arxiv-search/scripts/summarize
- In this repository:
- The CLI supports two query modes:
- Plain-text mode: bare words are converted into an
all:query joined withAND - Raw mode: if the query already uses arXiv syntax such as
ti:,au:,abs:,cat:,submittedDate:,AND,OR, orANDNOT, it is passed through unchanged
- Plain-text mode: bare words are converted into an
- For exact phrase matching in plain-text mode, add
--phrase. - For recent-paper discovery, prefer
--sort submittedDate --order descending, and optionally add--days N. - Use
--category <cat>when the scope should stay narrow, for examplecs.LG,cs.AI,stat.ML, orq-bio.QM. - For author-specific requests, do not rely on a single name form.
- By default, run the supplied full-name form and an abbreviated-first-name form separately, for example
au:"Peter Nugent"andau:"P. Nugent". - Present these result sets separately in the final answer instead of silently merging them.
- Call out that abbreviated-first-name matches can be ambiguous and may include a different person with the same initial and surname.
- By default, run the supplied full-name form and an abbreviated-first-name form separately, for example
- Treat arXiv API output as discovery metadata. If exact version, withdrawn status, or human-readable abstract-page details matter, verify the final candidates on arxiv.org before presenting a final answer.
- If the user wants peer-reviewed biomedical full text rather than preprints, use
/polars-dovmedinstead. - If the user needs massive bulk harvesting rather than interactive search, the arXiv docs recommend OAI-PMH rather than large search result slices.
- Use
scripts/summarizewhen the user wants stable local notes for specific arXiv IDs. This writes Markdown files containing paper metadata, abstract, links, and a blank## Notessection for follow-up annotation. - Do not rely on server-side
submittedDate:[...]queries for user-facing workflows. As observed on March 18, 2026 UTC, official arXiv API requests usingsubmittedDatereturnedHTTP 500even for the example pattern shown in the arXiv API manual. The bundled CLI therefore applies--daysas a local filter on returnedpublishedtimestamps instead of sendingsubmittedDateto the API. - Respect arXiv API pacing. The official API manual asks clients making repeated requests to incorporate a 3-second delay, and the service can return
HTTP 429 Rate exceededwhen queried too aggressively.
Quick Reference
| Task | Action |
|---|---|
| Search script | skills/arxiv-search/scripts/search |
| Summary builder | skills/arxiv-search/scripts/summarize |
| Base API | https://export.arxiv.org/api/query |
| Default mode | Plain-text terms compiled to all: query terms |
| Raw query | Pass arXiv syntax directly, e.g. cat:cs.LG AND ti:diffusion |
| Author search | Run full-name and abbreviated-first-name au: queries separately, e.g. au:"Peter Nugent" and au:"P. Nugent" |
| Exact phrase | Add --phrase |
| Recent submissions | --sort submittedDate --order descending |
| Relative date filter | --days 30 with local post-filtering |
| Category filter | --category cs.LG |
| ID lookup | --ids 2501.01234,2406.00001 |
| Write local notes | skills/arxiv-search/scripts/summarize 2501.01234 --output-dir arxiv-summaries |
| Network timeout | --timeout 20 |
| Pacing | wait about 3 seconds between repeated requests |
| Help | skills/arxiv-search/scripts/search --help |
Input Requirements
- Python 3
- A search query or one or more arXiv IDs
- Reasonable request pacing for repeated calls:
- arXiv recommends a 3-second delay between consecutive API requests
- cache identical queries instead of refetching them repeatedly in one session
- For Markdown note generation:
- one or more arXiv IDs
- optional
--output-dir(default:arxiv-summaries) - optional
--forceto overwrite existing files
- Optional search modifiers:
--phraseto treat the full plain-text query as one phrase--category <cat>to constrain by arXiv category--days <N>for a recent-submission window, applied locally to returnedpublishedtimestamps--sort relevance|lastUpdatedDate|submittedDate--order ascending|descending--start <N>for paging--timeout <seconds>when the network is slow or partially blocked
- When writing raw queries, use official arXiv field prefixes and operators:
ti,au,abs,co,jr,cat,rn,allAND,OR,ANDNOT
- For author-specific searches, prefer separate raw
au:queries for the full name and abbreviated-first-name form rather than a single merged query. - Avoid
submittedDate:[YYYYMMDDTTTT+TO+YYYYMMDDTTTT]in routine use until arXiv-side failures are resolved.
Output
- JSON with:
- request metadata (
query,compiled_query,start,max_results,sort_by,sort_order) query_modeshowing whether the CLI used raw or plain-text compilationdays_filtermetadata when--daysis usedwarningswhen the CLI applied local workarounds- feed metadata (
total_results,items_per_page,start_index,updated) - normalized result records with:
titlesummaryauthorsarxiv_idabs_urlpdf_urlpublishedupdatedprimary_categorycategoriescomment,journal_ref,doiwhen present
request_urlfor traceability
- request metadata (
- For
scripts/summarize:- JSON describing written files and missing IDs
- one Markdown file per returned arXiv ID, containing metadata, abstract, citation, and a
## Notessection
Quality Gates
- Query mode is appropriate: plain-text convenience or raw arXiv syntax
- Result count is scoped tightly enough to inspect the first page of titles
- Recent-paper requests use
submittedDatesorting and/or a date filter -
--daysis understood as a local filter, not a server-sidesubmittedDatequery - Category-sensitive requests use
--categoryor rawcat:filters - Author-specific requests check full-name and abbreviated-first-name variants separately
- The final answer keeps abbreviated-name matches separate and labels them as potentially ambiguous
- Final answer distinguishes arXiv preprints from peer-reviewed literature
- Exact metadata claims were verified on arxiv.org when version/date details matter
- Repeated calls are paced to avoid
HTTP 429 Rate exceeded
Examples
Example 1: Recent ML preprints
skills/arxiv-search/scripts/search "protein language model" 10 \
--phrase \
--category cs.LG \
--sort submittedDate \
--order descending \
--days 30
Example 2: Raw arXiv query syntax
skills/arxiv-search/scripts/search \
'cat:cs.LG AND ti:"diffusion" ANDNOT abs:"survey"' \
15 \
--sort relevance
Example 3: Quantitative biology preprints
skills/arxiv-search/scripts/search "single cell foundation model" 10 \
--phrase \
--category q-bio.QM \
--sort submittedDate \
--order descending
Example 4: Author-specific search with separate name variants
skills/arxiv-search/scripts/search 'au:"Peter Nugent"' 20 \
--sort submittedDate \
--order descending
skills/arxiv-search/scripts/search 'au:"P. Nugent"' 20 \
--sort submittedDate \
--order descending
Example 5: Fetch records by arXiv ID
skills/arxiv-search/scripts/search --ids 2501.01234,2406.00001
Example 6: Build local Markdown summaries from arXiv IDs
skills/arxiv-search/scripts/summarize 2501.01234 2406.00001 \
--output-dir arxiv-summaries
Troubleshooting
Issue: Results are too broad
Solution: Add --category, narrow the terms, or switch to a raw query with ti:/abs: fields.
Issue: Results are too sparse
Solution: Remove an overly narrow category, drop --phrase, or use OR in a raw query.
Issue: Need the latest papers, not the most relevant
Solution: Use --sort submittedDate --order descending, and add --days N when you want a recent window. The CLI applies the --days cutoff locally to avoid current arXiv API failures on submittedDate queries.
Issue: Need exact author or title matching
Solution: Use a raw query with field prefixes such as au: and ti: rather than plain-text mode. For author searches, run full-name and abbreviated-first-name variants separately and report them separately.
Issue: Author search looks fragmented across name variants
Solution: This is common on arXiv. Check both the full-name form and the abbreviated-first-name form, for example au:"Peter Nugent" and au:"P. Nugent", then keep those result sets separate in the final answer because the abbreviated form may be ambiguous.
Issue: Need more than an interactive page of results
Solution: Use --start for paging. For very large harvesting jobs, switch to OAI-PMH or bulk metadata workflows instead of pulling huge API slices.
Issue: The request hangs or times out
Solution: Retry with a smaller max_results, increase --timeout modestly, or verify whether arXiv is reachable from the current environment.
Issue: Need reusable local notes instead of JSON search output
Solution: Use skills/arxiv-search/scripts/summarize <arxiv-id>... --output-dir <dir> to create Markdown summaries.
Issue: Recent-date filtering returns an arXiv server error
Solution: Do not send raw submittedDate:[...] filters in normal use. Prefer --sort submittedDate --order descending --days N, which uses local published-date filtering in the bundled CLI.
Issue: HTTP 429 Rate exceeded
Solution: Slow down. arXiv recommends a 3-second delay between repeated API calls. Cache identical queries, avoid tight retry loops, and prefer fetching small result slices.
More from fmschulz/omics-skills
bio-phylogenomics
Build marker gene alignments and phylogenetic trees.
19bio-protein-clustering-pangenome
Cluster proteins into orthogroups and derive pangenome matrices.
18plotly-dashboard-skill
Build production-ready Plotly Dash dashboards with consistent theming, clear layouts, and performant callbacks.
17bio-annotation
Functional annotation and taxonomy inference from sequence homology.
16bio-foundation-housekeeping
Initialize a bioinformatics project scaffold with reproducible environments, schemas, and data cataloging. Use for new projects or repo setup.
16bio-stats-ml-reporting
Aggregate results, train ML models, and produce reports with validated references.
16