Checking ChEMBL for Structured SAR Data

Overview

ChEMBL is a manually curated database of ~99,000 medicinal chemistry papers with extracted, standardized bioactivity data. If a paper is in ChEMBL, you can access structured data without parsing PDFs.

Core principle: Check ChEMBL first for medicinal chemistry papers. Curated data is more reliable than table parsing.

When to Use

Use this skill when:

Paper describes medicinal chemistry / drug discovery
Abstract mentions compound series, SAR, or activity data
Paper has IC50, MIC, Ki, EC50, or other bioactivity measurements
Before attempting to extract data from tables/figures
Paper scored ≥ 7 in relevance evaluation

When NOT to use:

Non-medicinal chemistry papers (cell biology, genomics, etc.)
Papers without activity measurements
Reviews without primary data
Very recent papers (< 6 months, likely not curated yet)

ChEMBL API Basics

Base URL: https://www.ebi.ac.uk/chembl/api/data/

No authentication required

CRITICAL: ChEMBL can ONLY be queried by DOI, NOT by PMID

The API returns PMID in results, but does not accept it as a query parameter
Always use DOI for lookups: ?doi=10.1234/example
PMID queries will return 0 results even if paper exists in ChEMBL

Two-step process:

Check if paper (by DOI) is in ChEMBL
If yes, retrieve bioactivity data

Step 1: Check if Paper in ChEMBL

Query by DOI (ONLY method that works):

curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI"

⚠️ IMPORTANT: Must use DOI, not PMID

# ✅ CORRECT - Use DOI
doi="10.1021/jm401507s"
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi"

# ❌ WRONG - PMID won't work (will return 0 results)
pmid="24446688"
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?pubmed_id=$pmid"  # Does NOT work!

If you only have PMID: Fetch DOI from PubMed first, then query ChEMBL with the DOI.

Response structure:

{
  "documents": [
    {
      "document_chembl_id": "CHEMBL3120156",
      "doi": "10.1021/jm401507s",
      "title": "Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor.",
      "abstract": "Hepatitis C virus is a blood-borne infection...",
      "pubmed_id": 24446688,
      "journal": "J Med Chem",
      "year": 2014,
      "doc_type": "PUBLICATION"
    }
  ],
  "page_meta": {
    "total_count": 1
  }
}

Key fields:

document_chembl_id - Use this to retrieve activity data
doc_type - "PUBLICATION" (from literature) or "DATASET" (deposited)
pubmed_id - PMID is in the response, but cannot be used to query ChEMBL
If total_count = 0, paper not in ChEMBL

Parse response:

response=$(curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi")

if [ $(echo "$response" | jq -r '.page_meta.total_count') -gt 0 ]; then
  chembl_id=$(echo "$response" | jq -r '.documents[0].document_chembl_id')
  echo "✓ Found in ChEMBL: $chembl_id"
else
  echo "✗ Not in ChEMBL"
fi

Step 2: Get Activity Data Count

Query activity endpoint:

curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&limit=1"

Extract total count:

activity_url="https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=$chembl_id&limit=1"
activity_count=$(curl -s "$activity_url" | jq -r '.page_meta.total_count')

echo "→ $activity_count bioactivity data points"

Step 3: Report to User and Update Summary

Report immediately:

📄 [15/127] Screening: "Discovery and development of simeprevir"
   Abstract score: 9 → Fetching full text...
   ✓ ChEMBL: CHEMBL3120156 (101 activity data points)
   → IC50 data for HCV NS3 protease inhibitors available

Add to SUMMARY.md:

### [Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor](https://doi.org/10.1021/jm401507s) (Score: 9)

**DOI:** [10.1021/jm401507s](https://doi.org/10.1021/jm401507s)
**PMID:** [24446688](https://pubmed.ncbi.nlm.nih.gov/24446688/)
**ChEMBL:** [CHEMBL3120156](https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3120156/) (101 data points)

**Key Findings:**
- IC50 data for HCV NS3/4A protease inhibitors (from ChEMBL)
- Lead compound simeprevir (TMC435) approved for HCV treatment
- Structures and full activity data: [ChEMBL API](https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156)

**ChEMBL Activity Summary:**
- IC50 values for HCV NS3/4A protease
- PK parameters (AUC, Cmax, clearance)
- DMPK assays (metabolic stability, permeability)

Always include ChEMBL status:

If found: Add ChEMBL ID with link and data point count
If not found: Note "Not in ChEMBL" (still valuable information)

Step 4: Update Tracking Files

Add to papers-reviewed.json:

{
  "10.1021/jm401507s": {
    "pmid": "24446688",
    "status": "relevant",
    "score": 9,
    "chembl_id": "CHEMBL3120156",
    "chembl_activities": 101,
    "has_structured_data": true
  }
}

Optional: Extract Structured Data

For papers with rich ChEMBL data (>20 activities), consider extracting:

# Get all IC50 data
curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&standard_type=IC50&limit=100" > chembl_data.json

# Summary statistics
jq '[.activities[] | .standard_value | tonumber] | "Min: \(min), Max: \(max), Count: \(length)"' chembl_data.json

Report to user:

📊 ChEMBL data extracted:
   - IC50 values for HCV NS3/4A protease
   - All structures downloaded
   - Data saved to: chembl_CHEMBL3120156_ic50.json

Integration with Other Skills

During evaluating-paper-relevance workflow:

After abstract screening (score ≥7)
Before deep dive into full text
Check ChEMBL using this skill
If found:
- Note ChEMBL ID in SUMMARY.md
- Extract activity data (faster than PDF parsing)
- Still fetch full text for methods, discussion, context
If not found:
- Proceed with normal PDF evaluation
- Parse tables manually if needed

Workflow integration point:

Stage 2: Deep Dive
├─ 1. Fetch Full Text (PMC → DOI → Unpaywall)
├─ 1.5. Check ChEMBL ← ADD THIS STEP
│   ├─ Query by DOI
│   ├─ If found: note ChEMBL ID + activity count
│   └─ Report to user
├─ 2. Scan for Relevant Content
└─ 3. Extract Findings

Common Activity Types in ChEMBL

Type	Description	Units
IC50	Half-maximal inhibitory concentration	nM, µM
MIC	Minimum inhibitory concentration	µg/mL, nM
Ki	Inhibition constant	nM, µM
EC50	Half-maximal effective concentration	nM, µM
Kd	Dissociation constant	nM, µM
Potency	General potency measurement	Various

Filter by activity type:

curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&standard_type=MIC"

ChEMBL Coverage

~99,000 documents (as of 2025)

Well represented:

Medicinal chemistry papers
SAR studies with compound series
Lead optimization campaigns
Papers in major journals (J Med Chem, Bioorg Med Chem, Eur J Med Chem, etc.)

Poorly represented:

Very recent papers (6-12 month curation lag)
Papers without extractable structures/activities
Non-drug-discovery research
Purely mechanistic studies

Typical hit rate:

~30-40% of medicinal chemistry papers
Higher for SAR-focused journals

Advantages of ChEMBL Data

vs. PDF table parsing:

✓ Structures already extracted (SMILES format)
✓ Units standardized (all IC50s in nM)
✓ Values validated and curated
✓ Machine-readable JSON
✓ No OCR errors
✓ Linked to assay protocols
✓ Queryable (filter by activity range, target, etc.)

When to still use PDF:

Full experimental procedures
Synthesis routes
Papers not in ChEMBL
Very recent papers
Context and interpretation

Progress Reporting

CRITICAL: Report ChEMBL check for every relevant paper

Example workflow report:

📄 [15/50] Screening: "Novel MmpL3 inhibitors..."
   Abstract score: 8 → Checking ChEMBL...
   ✓ ChEMBL: CHEMBL3456789 (34 data points)
   → Fetching full text...
   → Added to SUMMARY.md with ChEMBL link

For papers not in ChEMBL:

📄 [16/50] Screening: "Another paper..."
   Abstract score: 9 → Checking ChEMBL...
   ✗ Not in ChEMBL (likely too recent or review paper)
   → Fetching full text via Unpaywall...

Helper Script Pattern

For research sessions with many medicinal chemistry papers:

Create check_chembl.py:

#!/usr/bin/env python3
import requests
import json
import sys

def check_chembl(doi):
    """Check if DOI is in ChEMBL and return summary

    IMPORTANT: Must use DOI, not PMID. ChEMBL API does not accept PMID queries.
    """

    # Query document (ONLY works with DOI)
    doc_url = f"https://www.ebi.ac.uk/chembl/api/data/document.json?doi={doi}"
    try:
        doc_response = requests.get(doc_url, timeout=10).json()
    except:
        return None

    # Check if found
    if doc_response.get('page_meta', {}).get('total_count', 0) == 0:
        return {'in_chembl': False}

    doc = doc_response['documents'][0]
    chembl_id = doc['document_chembl_id']

    # Get activity count
    act_url = f"https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id={chembl_id}&limit=1"
    try:
        act_response = requests.get(act_url, timeout=10).json()
        activity_count = act_response.get('page_meta', {}).get('total_count', 0)
    except:
        activity_count = 0

    return {
        'in_chembl': True,
        'chembl_id': chembl_id,
        'activity_count': activity_count,
        'doc_type': doc.get('doc_type'),
        'title': doc.get('title')
    }

if __name__ == "__main__":
    doi = sys.argv[1]
    result = check_chembl(doi)

    if result and result['in_chembl']:
        print(f"✓ {result['chembl_id']} ({result['activity_count']} activities)")
    else:
        print("✗ Not in ChEMBL")

Usage:

python3 check_chembl.py "10.1021/jm401507s"
# Output: ✓ CHEMBL3120156 (101 activities)

Common Mistakes

Querying by PMID: Using PMID instead of DOI → Always returns 0 results, ChEMBL only accepts DOI queries Skipping ChEMBL check: Not checking medicinal chemistry papers → Missing structured data that's already extracted Checking non-medchem papers: Checking genomics/cell biology papers → Wasting time, won't be in ChEMBL Not reporting status: Silent ChEMBL checks → User can't see what's happening Not adding to SUMMARY.md: Forgetting to include ChEMBL ID → Harder for user to access data later Only using ChEMBL: Not fetching full text when paper in ChEMBL → Missing context, methods, discussion Parsing PDFs when in ChEMBL: Manually extracting tables when structured data available → Wasting time and introducing errors

Quick Reference

Task	Command
Check if DOI in ChEMBL	`curl "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI"`
Get activity count	`curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1"`
Get all activities	`curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1000"`
Filter by activity type	`curl "...activity.json?document_chembl_id=ID&standard_type=MIC"`
ChEMBL paper page	`https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL_ID/`

Permissions

Add to .claude/settings.local.json.template:

"Bash(curl*https://www.ebi.ac.uk/chembl/api/data/*)",
"WebFetch(domain:www.ebi.ac.uk)"

Success Criteria

ChEMBL check successful when:

Every medicinal chemistry paper (score ≥7) checked
ChEMBL status reported to user immediately
ChEMBL ID added to SUMMARY.md (if found)
Activity count noted in summary
papers-reviewed.json updated with ChEMBL status

Next Steps

After checking ChEMBL:

If found: Consider extracting structured data for highly relevant papers (≥9)
Continue with full text evaluation for context
For papers not in ChEMBL: Proceed with normal PDF/table parsing
Update SUMMARY.md with all findings

Resources

Full Documentation: See docs/CHEMBL_INTEGRATION.md
ChEMBL API Docs: https://chembl.gitbook.io/chembl-interface-documentation/
ChEMBL Interface: https://www.ebi.ac.uk/chembl/