chem-data-extractor
Chemistry Data Extractor | 化学数据提取器
Extract structured chemical characterization data from chemistry supplementary materials and return in strict JSON format. 从化学论文补充材料中提取结构化表征数据,以严格JSON格式返回。
Supported Data Fields
compound_name: Full IUPAC or common name (including stereochemistry if given)structure_image_description: Brief description of the molecular structurephysical_state: e.g., "white solid", "colorless oil"mass_obtained: in mgyield_percent: as number onlymelting_point_range: in °C, as string like "126.6–127.3"rf_value: Rf value and solvent systemoptical_rotation: [α]D²⁵ value, concentration, solventhplc_conditions: column, mobile phase, flow rate, wavelength, retention times (major/minor), ee%nmr_1H: frequency, solvent, chemical shifts with multiplicity and coupling constantsnmr_13C: frequency, solvent, chemical shifts with notes (e.g., d, JCF)nmr_19F: frequency, solvent, chemical shifthrms_data: ion type, calculated m/z, found m/z, formularacemic_sample_note: if mentioned
Workflow
Step 1: Ask User for Extraction Mode
When user asks to extract chemistry data, first ask:
Do you want to extract data for:
- A specific compound (provide compound ID like "3i" or "1a")
- All compounds in a single document
- Batch process multiple PDF files (creates folder for each)
Mode 1: Batch Process Multiple PDFs
For processing multiple PDF files at once. Creates a separate folder for each PDF with extracted compounds.
Usage
python scripts/batch_extract.py \
/path/to/pdf_folder \
-o ./output_folder
Options
-o, --output: Output base directory (default:./chem_extract_output)--keep-md: Keep intermediate markdown files (default: cleanup after extraction)--skip-existing: Skip PDFs that already have output folders
Output Structure
output_folder/
├── batch_summary.json # Overall summary of all processed PDFs
├── paper1/
│ ├── compounds.json # All extracted compounds
│ └── summary.json # Brief summary with compound list
├── paper2/
│ ├── compounds.json
│ └── summary.json
└── ...
Example: Batch Process
# Process all PDFs in a folder
python scripts/batch_extract.py ./pdfs/ -o ./extracted_data
# Process single PDF
python scripts/batch_extract.py ./article.pdf -o ./results
# Keep intermediate files, skip existing
python scripts/batch_extract.py ./pdfs/ --keep-md --skip-existing
Mode 2: Single Document Processing
For processing a single document (Markdown or after PDF conversion).
Step 2: Prepare Input File
If the input is a PDF file:
- Use
mineru-pdf-converterskill to convert PDF to Markdown first - Use the generated
full.mdfile as input
If the input is already a Markdown file, use it directly.
Step 3: Extract Data
Use the extraction script to parse the data:
# Extract a specific compound
python scripts/extract_chem_data.py \
/path/to/full.md -c COMPOUND_ID --compact
# Extract all compounds
python scripts/extract_chem_data.py \
/path/to/full.md --compact
Options:
-c, --compound: Extract specific compound by ID (e.g., "3i", "1a")--compact: Remove null/empty fields from output-o, --output: Save output to file instead of stdout
Step 4: Return Results
Output ONLY valid JSON without any extra text, unless the user specifically asks for explanations.
Examples
Example 1: Batch Process Multiple PDFs
# Process all PDFs in a directory
python scripts/batch_extract.py ./supplementary_pdfs/ -o ./extracted_compounds
# Output structure:
# ./extracted_compounds/
# ├── batch_summary.json
# ├── paper1/
# │ ├── compounds.json
# │ └── summary.json
# └── paper2/
# ├── compounds.json
# └── summary.json
Example 2: Extract Single Compound from Markdown
python scripts/extract_chem_data.py full.md -c 3i --compact
Output:
{
"compound_name": "(R)-N-Benzoyl-4-iodobenzenesulfonimidoyl fluoride",
"physical_state": "white solid",
"mass_obtained": 33.2,
"yield_percent": 85,
"melting_point_range": "126.6–127.3",
"rf_value": "0.37 (Pet/EtOAc, 5/1, v/v)",
"optical_rotation": "[α]D25 = +10.3 (c = 0.75, CHCl3)",
"hplc_conditions": {
"column": "CHIRALCEL AY-RH",
"mobile_phase": "n-hexane/2-propanol = 60/40",
"flow_rate": "1.0 mL/min",
"wavelength": "256 nm",
"retention_times": {
"major": "9.642 min",
"minor": "12.955 min"
},
"ee_percent": 90
},
"nmr_1H": {
"frequency": "400 MHz",
"solvent": "CDCl3",
"chemical_shifts": "δ 8.17–8.09 (m, 2H), 8.06–8.00 (m, 2H)..."
},
"nmr_13C": {
"frequency": "100 MHz",
"solvent": "CDCl3",
"chemical_shifts": "δ 170.0, 139.2, 134.3 (d, JCF = 22.0 Hz)..."
},
"nmr_19F": {
"frequency": "376 MHz",
"solvent": "CDCl3",
"chemical_shift": "δ 65.3 (s, 1F)"
},
"hrms_data": {
"ion_type": "[M+Na]+",
"calculated_mz": 411.9275,
"found_mz": 411.9273,
"formula": "C13H9FINNaO2S"
}
}
Example 3: Extract All Compounds from Single File
python scripts/extract_chem_data.py full.md --compact -o all_compounds.json
Output is a JSON array containing all extracted compounds.
Common Compound ID Patterns
Compound IDs typically follow these patterns:
- Numbers:
1,2,3 - Numbers with letters:
1a,3i,5b - Numbers with multiple letters:
1aa,3ba - Numbers with primes:
1',2''
Tips
- Compound identification: The script looks for section headers like
(R)-Compound Name (3i)or# Compound Name (1a) - Data completeness: Not all fields may be present for every compound - missing fields will be
null(or omitted with--compact) - Stereochemistry: The script preserves stereochemical descriptors like (R), (S), (±) in compound names
- Multiple compounds: When extracting all compounds, the output is a JSON array sorted by appearance in document
More from internscience/chemclaw
literature-parsing
将 PDF 文献转换为 Markdown 文件,并提取所有图表图片。使用 MinerU (opendatalab) 进行工业级高质量解析。
13molecular_properties_predictor
预测小分子多种物化性质(沸点、折射率、密度、黏度、表面张力等),当前已真实接入 bamboo_mixer 单分子物性模型后端。
13adme-prediction
ADME 性质预测工具。预测分子的吸收、分布、代谢、排泄性质,包括 Caco-2 通透性、PAMPA、HIA、Pgp 抑制、生物利用度、亲脂性等。使用 Morgan 指纹 + Random Forest/XGBoost。当用户提到 ADME 预测、药物性质、通透性、吸收、代谢等时触发。
13reaction-data-extraction
从 PDF 文献中提取化学反应数据,特别是反应条件优化信息。支持提取反应物、产物、催化剂、溶剂、温度、时间、产率等,并输出结构化 CSV 文件。使用 MinerU + NLP + 规则匹配进行精确提取。
13mol-3d-viewer
将 SMILES 或化学名称转换为分子 3D 结构。支持生成 SDF 文件、3D 分子图片和可交互 HTML 网页(可旋转观察)。
13mineru-pdf-converter
|
13