hf-papers-reporter
Hugging Face Daily Papers Reporter
Generate professional Word reports from Hugging Face Daily Papers with full text extraction and image capture.
What This Skill Does
- Scrapes huggingface.co/papers for the top papers
- Downloads PDFs from arXiv
- Extracts Abstract and Introduction sections
- Extracts figures/images from PDFs
- Generates a formatted Word document (.docx) with:
  - Paper titles and arXiv links
  - Cover images from HF
  - Full abstracts
  - Introduction sections
  - Extracted figures from papers
Quick Start
Run the main script to generate today's report:
cd /path/to/hf-papers-reporter
python3 scripts/process_papers.py
Output will be saved to output/HF_Daily_Papers_Report.docx
Dependencies
Install required packages:
pip3 install PyMuPDF python-docx Pillow beautifulsoup4 requests
How It Works
Step 1: Fetch Paper List
- Scrapes huggingface.co/papers
- Extracts arXiv IDs, titles, and cover image URLs
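As a rough sketch of this step, the snippet below pulls arXiv IDs out of the listing HTML. It assumes paper links look like `/papers/<arXiv id>`, which is an assumption about the page markup rather than something stated here. `extract_arxiv_ids` and `pdf_url` are hypothetical helpers, and a plain regex is used to keep the example dependency-free; the actual script presumably parses the page with beautifulsoup4.

```python
import re

# Assumption: each paper on the listing page is linked as /papers/<arXiv id>,
# e.g. /papers/2401.12345. Adjust the pattern if the markup differs.
PAPER_LINK = re.compile(r'href="/papers/(\d{4}\.\d{4,5})(?:v\d+)?(?:#[^"]*)?"')


def extract_arxiv_ids(html: str, limit: int = 10) -> list[str]:
    """Return unique arXiv IDs in page order, capped at `limit`."""
    seen: list[str] = []
    for match in PAPER_LINK.finditer(html):
        arxiv_id = match.group(1)
        if arxiv_id not in seen:
            seen.append(arxiv_id)
        if len(seen) >= limit:
            break
    return seen


def pdf_url(arxiv_id: str) -> str:
    """Build the arXiv PDF URL that Step 2 downloads from."""
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"
```

In a live run the `html` argument would come from something like `requests.get("https://huggingface.co/papers", timeout=30).text`.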
Step 2: Download & Process (per paper)
Download PDF from arxiv.org/pdf/{id}.pdf
↓
Extract text (first 5 pages)
- Abstract (regex match)
- Introduction (regex match)
↓
Extract images (first 5 pages, max 3 per page)
- Downscale to fit within 600x400 px
↓
Download cover image from HF CDN
- Downscale to fit within 800x600 px
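The regex matching in this step could look roughly like the sketch below. The patterns and the `first_match` helper are illustrative assumptions, not the script's actual expressions. The input `text` stands for the plain text already extracted from the first pages of the PDF.

```python
import re
from typing import Optional

# Hypothetical patterns, tried in order; the first one that matches wins.
ABSTRACT_PATTERNS = [
    # Abstract body runs until the Introduction heading.
    r"(?:Abstract|ABSTRACT)\s*[:.\-]?\s*(.+?)"
    r"(?=\n\s*(?:1\.?\s+)?(?:Introduction|INTRODUCTION)\b)",
    # Fallback: abstract body runs until the first blank line.
    r"(?:Abstract|ABSTRACT)\s*[:.\-]?\s*(.+?)(?=\n\s*\n)",
]
INTRO_PATTERNS = [
    # Introduction body runs until a heading that looks like section 2.
    r"\n\s*(?:1\.?\s+)?(?:Introduction|INTRODUCTION)\s*\n(.+?)(?=\n\s*2\.?\s+[A-Z])",
    # Fallback: take everything after the heading.
    r"\n\s*(?:1\.?\s+)?(?:Introduction|INTRODUCTION)\s*\n(.+)",
]


def first_match(text: str, patterns: list[str]) -> Optional[str]:
    """Return the first captured section, or None if no pattern matches."""
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.DOTALL)
        if match:
            # Collapse the hard line breaks that PDF extraction leaves behind.
            return " ".join(match.group(1).split())
    return None
```

Returning `None` instead of raising lets the caller skip a missing section and still include the paper in the report.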
Step 3: Generate Word Document
- Title page with report name and date
- Each paper as a section with:
  - Cover image (centered)
  - Abstract section
  - Introduction section
  - Extracted figures (up to 4)
Output Structure
hf_papers/
├── pdfs/ # Downloaded PDFs
├── images/ # Cover images + extracted figures
└── output/
    ├── HF_Daily_Papers_Report.docx
    └── papers_data.json
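A minimal sketch of how this layout might be created and how `papers_data.json` might be written. The helper names and the JSON fields (`arxiv_id`, `title`) are assumptions for illustration; the real file may store different keys.

```python
import json
from pathlib import Path


def prepare_workspace(root: Path) -> dict[str, Path]:
    """Create the pdfs/, images/ and output/ folders under `root`."""
    dirs = {name: root / name for name in ("pdfs", "images", "output")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def save_papers_data(papers: list[dict], output_dir: Path) -> Path:
    """Write per-paper metadata next to the generated report."""
    target = output_dir / "papers_data.json"
    target.write_text(
        json.dumps(papers, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target
```

For example, `prepare_workspace(Path("hf_papers"))` followed by `save_papers_data(papers, dirs["output"])` would produce the tree shown above, minus the .docx file.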
Known Issues & Solutions
| Issue | Cause | Fix |
|---|---|---|
| XML encoding error | PDF text contains control characters | The script automatically strips characters in the 0x00-0x1F range |
| No abstract found | PDF structure varies between papers | The script tries several regex patterns in turn |
| Large PDFs | Some papers exceed 20 MB | Only the first 5 pages are processed |
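
The control-character cleanup in the first row can be sketched as follows. This is an illustration, not the script's code: `clean_for_docx` is a hypothetical name, and this version keeps tab, line feed, and carriage return because XML 1.0 allows them, whereas the script may strip the whole 0x00-0x1F range.

```python
import re

# Control characters that XML 1.0 rejects. Tab (0x09), LF (0x0A) and CR (0x0D)
# are legal in XML and are kept here; this is an assumption of the sketch.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_docx(text: str) -> str:
    """Drop characters that would make the .docx XML invalid."""
    return _XML_ILLEGAL.sub("", text)
```

Applying this to every abstract and introduction string before it is added to the document avoids the encoding error at save time.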
Customization
To modify the number of papers (default: 10), edit the PAPERS list in scripts/process_papers.py.
To change image sizes, modify the thumbnail() calls in the script.
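Pillow's `Image.thumbnail()` shrinks an image in place to fit inside the given box while preserving the aspect ratio, and it never enlarges. The sizes above are therefore upper bounds rather than exact output dimensions. The pure-Python sketch below approximates that arithmetic so you can preview the effect of a new size; `fit_within` is a hypothetical helper, and Pillow's own rounding may differ by a pixel.

```python
def fit_within(size: tuple[int, int], box: tuple[int, int]) -> tuple[int, int]:
    """Shrink `size` to fit inside `box`, keeping aspect ratio; never enlarge."""
    width, height = size
    max_w, max_h = box
    # The tighter of the two constraints decides the scale; cap at 1.0 so
    # small images are left untouched.
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, a 1200x900 figure limited to a 600x400 box ends up at roughly 533x400, because the height is the binding constraint.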