RCSB Protein Data Bank

The Protein Data Bank hosts over 200,000 experimentally determined macromolecular structures plus computed models from AlphaFold and ModelArchive. This skill provides programmatic access to search, download, and analyze structural data.

Applicable Scenarios

| Task | Examples |
|------|----------|
| Structure Retrieval | Download coordinates for a known PDB ID |
| Similarity Search | Find structures similar by sequence or 3D fold |
| Metadata Access | Get resolution, method, organism, ligands |
| Dataset Building | Compile structures for ML training or analysis |
| Drug Discovery | Identify ligand-bound structures for a target |
| Quality Filtering | Select high-resolution, well-refined structures |

Setup

pip install rcsb-api requests

The rcsb-api package (v1.5.0+) provides:

  • rcsbapi.search - Query construction and execution
  • rcsbapi.data - DataQuery for batch retrieval

Quick Reference

Search Queries

from rcsbapi.search import TextQuery, AttributeQuery, SeqSimilarityQuery, StructSimilarityQuery

# Text search
results = list(TextQuery("kinase inhibitor")())

# Filter by organism (use string attribute paths)
human = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)

# Filter by resolution
high_res = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)

# Filter by experimental method
xray = AttributeQuery(
    attribute="exptl.method",
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)

# Combine queries: & (AND), | (OR), ~ (NOT)
results = list((TextQuery("kinase") & human & high_res)())

# Sequence similarity (MMseqs2) - minimum 25 residues required
seq_query = SeqSimilarityQuery(
    value="VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    evalue_cutoff=1e-5,
    identity_cutoff=0.7
)

# Structure similarity (3D fold)
struct_query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB"
)

Data Retrieval (REST API)

The rcsb-api package's data module has limited functionality. Use the REST API directly for metadata:

import requests

def fetch_entry(pdb_id: str) -> dict:
    """Fetch entry metadata from RCSB REST API."""
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
    resp.raise_for_status()
    return resp.json()

# Example usage
data = fetch_entry("4HHB")
print(data["struct"]["title"])
print(data["rcsb_entry_info"]["resolution_combined"])

# Polymer entity (chain info + sequence)
def fetch_polymer_entity(pdb_id: str, entity_id: int = 1) -> dict:
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}")
    resp.raise_for_status()
    return resp.json()

entity = fetch_polymer_entity("4HHB", 1)
sequence = entity["entity_poly"]["pdbx_seq_one_letter_code"]
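
The entry example above prints `resolution_combined` as-is. A small helper can make that field safer to consume. This is a minimal sketch, assuming the field is a list of numbers when present and missing for methods that report no resolution (for example NMR); `best_resolution` is a hypothetical name, not part of rcsb-api.

```python
from typing import Any, Optional


def best_resolution(entry: dict[str, Any]) -> Optional[float]:
    """Return the lowest (best) resolution in angstroms, or None if absent.

    Assumption: `entry` is the JSON dict from the core entry endpoint and
    `resolution_combined`, when present, is a list of numbers.
    """
    info = entry.get("rcsb_entry_info") or {}
    values = info.get("resolution_combined") or []
    # Drop null items before taking the minimum.
    numeric = [float(v) for v in values if v is not None]
    return min(numeric) if numeric else None
```

Calling `best_resolution(fetch_entry("4HHB"))` would then give a single float or `None` instead of a list.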

File Downloads

| Format | URL Pattern |
|--------|-------------|
| PDB | https://files.rcsb.org/download/{ID}.pdb |
| mmCIF | https://files.rcsb.org/download/{ID}.cif |
| Assembly | https://files.rcsb.org/download/{ID}.pdb1 |
| FASTA | https://www.rcsb.org/fasta/entry/{ID} |

import requests
from pathlib import Path

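def download_structure(pdb_id: str, fmt: str = "cif", outdir: str = ".") -> Path:
    """Download a structure file and return the path it was saved to."""
    url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
    resp = requests.get(url)
    resp.raise_for_status()
    outpath = Path(outdir) / f"{pdb_id}.{fmt}"
    # Write raw bytes so the file is saved exactly as served,
    # independent of the guessed text encoding or platform newlines.
    outpath.write_bytes(resp.content)
    return outpath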
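
The URL patterns listed in this section can also be captured in a lookup, so an unsupported format fails before any request is made. This is a sketch only: `URL_PATTERNS` and `structure_url` are hypothetical names, and upper-casing the ID is a normalization assumption rather than a documented requirement.

```python
# URL templates copied from the File Downloads list.
URL_PATTERNS = {
    "pdb": "https://files.rcsb.org/download/{id}.pdb",
    "cif": "https://files.rcsb.org/download/{id}.cif",
    "assembly": "https://files.rcsb.org/download/{id}.pdb1",
    "fasta": "https://www.rcsb.org/fasta/entry/{id}",
}


def structure_url(pdb_id: str, fmt: str = "cif") -> str:
    """Build the download URL for `pdb_id` in the requested format."""
    key = fmt.lower()
    if key not in URL_PATTERNS:
        raise ValueError(
            f"unsupported format {fmt!r}; expected one of {sorted(URL_PATTERNS)}"
        )
    # Assumption: IDs are normalized to upper case before building the URL.
    return URL_PATTERNS[key].format(id=pdb_id.strip().upper())
```

For example, `structure_url("4hhb", "fasta")` builds the FASTA URL for entry 4HHB.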

Common Workflows

Find High-Quality Human Structures

from rcsbapi.search import TextQuery, AttributeQuery

query = (
    TextQuery("receptor") &
    AttributeQuery(
        attribute="rcsb_entity_source_organism.scientific_name",
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.5
    ) &
    AttributeQuery(
        attribute="exptl.method",
        operator="exact_match",
        value="X-RAY DIFFRACTION"
    )
)
results = list(query())
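
For debugging, it can help to see roughly what a combined query like the one above looks like as a raw Search API request. The sketch below builds such a payload by hand. The node layout (`terminal` and `group` nodes, the `text` and `full_text` services) and the endpoint URL reflect my understanding of the RCSB Search API v2 JSON format and should be checked against the output of `query.to_dict()`. All helper names are hypothetical.

```python
from typing import Any

# Assumed Search API v2 endpoint; the payload would be POSTed here as JSON.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def full_text_node(text: str) -> dict:
    """Terminal node for a free-text search (assumed 'full_text' service)."""
    return {"type": "terminal", "service": "full_text", "parameters": {"value": text}}


def attribute_node(attribute: str, operator: str, value: Any) -> dict:
    """Terminal node for an attribute comparison (assumed 'text' service)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def and_group(*nodes: dict) -> dict:
    """Group node that requires every child node to match."""
    return {"type": "group", "logical_operator": "and", "nodes": list(nodes)}


payload = {
    "query": and_group(
        full_text_node("receptor"),
        attribute_node(
            "rcsb_entity_source_organism.scientific_name", "exact_match", "Homo sapiens"
        ),
        attribute_node("rcsb_entry_info.resolution_combined", "less", 2.5),
        attribute_node("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
    ),
    "return_type": "entry",
}
```

Comparing this dictionary with `query.to_dict()` is one way to confirm that a combined query expresses the intended filters.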

Batch Metadata Retrieval

import requests
import time

def fetch_batch(pdb_ids: list, delay: float = 0.3) -> dict:
    """Fetch metadata with rate limiting."""
    results = {}
    for pdb_id in pdb_ids:
        try:
            resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
            resp.raise_for_status()
            data = resp.json()
            results[pdb_id] = {
                "title": data["struct"]["title"],
                "resolution": data.get("rcsb_entry_info", {}).get("resolution_combined"),
                "method": data.get("exptl", [{}])[0].get("method"),
            }
        except Exception as e:
            results[pdb_id] = {"error": str(e)}
        time.sleep(delay)
    return results

Find Drug-Bound Structures

from rcsbapi.search import AttributeQuery

# Find structures containing imatinib (ligand ID: STI)
query = AttributeQuery(
    attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
    operator="exact_match",
    value="STI"
)
drug_complexes = list(query())

GraphQL for Complex Queries

import requests

query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
    }
    polymer_entities {
      rcsb_polymer_entity { pdbx_description }
      entity_poly { pdbx_seq_one_letter_code }
    }
  }
}
"""

response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query}
)
response.raise_for_status()  # Surface HTTP errors before parsing the body
result = response.json()["data"]["entry"]
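
The same endpoint can return several entries in one request, which is where GraphQL saves the most round trips. The sketch below builds the request body and indexes the response. It assumes the schema exposes an `entries(entry_ids: ...)` root field and an `rcsb_id` field on each entry; `build_batch_payload` and `index_by_id` are hypothetical helper names.

```python
from typing import Iterable

GRAPHQL_URL = "https://data.rcsb.org/graphql"

# Assumption: `entries` accepts a list of entry IDs passed as a variable.
BATCH_QUERY = """
query($ids: [String!]!) {
  entries(entry_ids: $ids) {
    rcsb_id
    struct { title }
    rcsb_entry_info { resolution_combined }
  }
}
"""


def build_batch_payload(pdb_ids: Iterable[str]) -> dict:
    """Return the JSON body for one batched request (IDs deduplicated, sorted)."""
    ids = sorted({pid.strip().upper() for pid in pdb_ids if pid.strip()})
    return {"query": BATCH_QUERY, "variables": {"ids": ids}}


def index_by_id(response_json: dict) -> dict:
    """Map rcsb_id -> entry dict, skipping null items in the response."""
    entries = (response_json.get("data") or {}).get("entries") or []
    return {entry["rcsb_id"]: entry for entry in entries if entry}
```

The payload would be sent with `requests.post(GRAPHQL_URL, json=build_batch_payload(ids))`, and the parsed JSON passed to `index_by_id`.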

Key Concepts

| Term | Definition |
|------|------------|
| PDB ID | 4-character alphanumeric code (e.g., "4HHB"). AlphaFold models use an "AF_" prefix |
| Entity | Distinct molecular species. A homodimer has one entity appearing twice |
| Resolution | Quality metric in angstroms. Lower is better; <2.0 Å is high quality |
| Biological Assembly | Functional oligomeric state (may differ from the asymmetric unit) |
| mmCIF | Modern format replacing legacy PDB; required for large structures |
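
The identifier conventions above can be checked before a request is sent. This is a minimal sketch that follows the definitions in this section literally (four alphanumeric characters for experimental entries, an "AF_" prefix for AlphaFold models). `classify_id` is a hypothetical helper, and real identifiers may follow stricter rules than this check enforces.

```python
import re

# Assumption taken from the Key Concepts list: four alphanumeric characters.
_PDB_ID = re.compile(r"[A-Za-z0-9]{4}")


def classify_id(identifier: str) -> str:
    """Return 'alphafold', 'pdb', or 'unknown' for an identifier string."""
    ident = identifier.strip()
    if ident.upper().startswith("AF_"):
        return "alphafold"
    if _PDB_ID.fullmatch(ident):
        return "pdb"
    return "unknown"
```

Rejecting `"unknown"` inputs early avoids spending a request on something that cannot be a valid ID.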

Common Attribute Paths

Use these string paths with AttributeQuery:

| Attribute | Description |
|-----------|-------------|
| rcsb_entity_source_organism.scientific_name | Source organism (e.g., "Homo sapiens") |
| rcsb_entry_info.resolution_combined | Resolution in angstroms |
| exptl.method | Experimental method (X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
| rcsb_nonpolymer_entity_instance_container_identifiers.comp_id | Ligand/small molecule ID |
| struct.title | Structure title |
| rcsb_accession_info.deposit_date | Deposition date |

Best Practices

| Practice | Rationale |
|----------|-----------|
| Use mmCIF format | PDB format has atom count limits |
| Filter by resolution | <2.5 Å for most analyses; <2.0 Å for detailed work |
| Check experimental method | X-ray vs cryo-EM vs NMR have different quality metrics |
| Rate limit requests | 2-3 req/s to avoid 429 errors |
| Cache downloads | Structures rarely change after release |
| Prefer GraphQL | Reduces requests for complex data needs |
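
The "Cache downloads" practice can be sketched as a thin wrapper that only hits the network when the file is not already on disk. The cache directory name, the `fetch` parameter, and the function names are illustrative assumptions; the default fetcher uses the standard library so the sketch has no dependencies.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def _http_get(url: str) -> bytes:
    """Default fetcher: return the response body as bytes."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def cached_download(
    pdb_id: str,
    cache_dir: str = "pdb_cache",  # hypothetical default location
    fetch: Callable[[str], bytes] = _http_get,
) -> Path:
    """Return the local mmCIF path for `pdb_id`, downloading only on a cache miss."""
    name = pdb_id.strip().upper()
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / f"{name}.cif"
    if not path.exists():
        data = fetch(f"https://files.rcsb.org/download/{name}.cif")
        # Write to a temporary name first so an interrupted download
        # never leaves a truncated file under the final name.
        tmp = path.with_suffix(".part")
        tmp.write_bytes(data)
        tmp.replace(path)
    return path
```

Passing a custom `fetch` makes it possible to plug in a `requests`-based or rate-limited downloader without changing the caching logic.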

Troubleshooting

| Issue | Resolution |
|-------|------------|
| 404 on entry fetch | Entry may be obsoleted; check RCSB website for superseding ID |
| 429 Too Many Requests | Implement exponential backoff; reduce request rate |
| Empty search results | Check query syntax; use query.to_dict() to debug |
| Large structure fails | Use mmCIF format instead of PDB |
| Missing sequence data | Query polymer entity endpoint, not entry |
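
The exponential backoff suggested for 429 responses could look like the sketch below. It is not tied to any HTTP library: `fetch` is a hypothetical callable returning a `(status_code, body)` pair, and the retry count and base delay are illustrative values, not limits published by RCSB.

```python
import random
import time
from typing import Callable, Tuple


def get_with_backoff(
    url: str,
    fetch: Callable[[str], Tuple[int, bytes]],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url), retrying on HTTP 429 with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Wait base_delay * 2^attempt plus jitter so that parallel
            # clients do not all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```

With `requests`, `fetch` could be a small function that calls `requests.get(url)` and returns `(resp.status_code, resp.content)`.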

References

See references/api-reference.md for:

  • Complete REST endpoint documentation
  • All searchable attributes and operators
  • Advanced query patterns
  • Rate limiting strategies
