rdkit
SKILL.md
RDKit: Python Cheminformatics Library
Summary
RDKit (v2023+) provides comprehensive Python APIs for molecular structure manipulation, property calculation, and chemical informatics. It requires Python 3 and NumPy, offering modular components for molecule parsing, descriptors, fingerprints, substructure search, conformer generation, and reaction processing.
Applicable Scenarios
This skill applies when you need to:
| Task Category | Examples |
|---|---|
| Molecule I/O | Parse SMILES, MOL, SDF, InChI; write structures |
| Property Calculation | Molecular weight, LogP, TPSA, H-bond donors/acceptors |
| Fingerprinting | Morgan (ECFP), MACCS keys, atom pairs, topological |
| Similarity Analysis | Tanimoto, Dice, clustering compounds |
| Substructure Search | SMARTS patterns, functional group detection |
| 3D Conformers | Generate, optimize, align molecular geometries |
| Chemical Reactions | Define and execute transformations |
| Drug-Likeness | Lipinski rules, QED, lead-likeness filters |
| Visualization | 2D depictions, highlighting, grid images |
Module Organization
| Module | Purpose | Reference |
|---|---|---|
| rdkit.Chem | Core molecule parsing, serialization, substructure | references/api-reference.md |
| rdkit.Chem.Descriptors | Property calculations | references/descriptors-reference.md |
| rdkit.Chem.rdFingerprintGenerator | Modern fingerprint API | references/api-reference.md |
| rdkit.DataStructs | Similarity metrics, bulk operations | references/api-reference.md |
| rdkit.Chem.AllChem | 3D coordinates, reactions, optimization | references/api-reference.md |
| rdkit.Chem.Draw | Visualization and depiction | references/api-reference.md |
| SMARTS patterns | Substructure query language | references/smarts-patterns.md |
Setup
Install via pip or conda:
# Conda (recommended)
conda install -c conda-forge rdkit
# Pip
pip install rdkit-pypi
Standard imports:
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw
from rdkit import DataStructs
Quick Reference
Parse and Validate Molecules
from rdkit import Chem
mol = Chem.MolFromSmiles('c1ccc(O)cc1')
if mol is None:
print("Invalid SMILES")
Compute Properties
from rdkit.Chem import Descriptors
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
tpsa = Descriptors.TPSA(mol)
Generate Fingerprints
from rdkit.Chem import rdFingerprintGenerator
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = gen.GetFingerprint(mol)
Similarity Search
from rdkit import DataStructs
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
Substructure Match
pattern = Chem.MolFromSmarts('[OH1][C]') # Alcohol
has_alcohol = mol.HasSubstructMatch(pattern)
Generate 3D Conformer
from rdkit.Chem import AllChem
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=42)
AllChem.MMFFOptimizeMolecule(mol)
Implementation Patterns
Drug-Likeness Assessment
from rdkit import Chem
from rdkit.Chem import Descriptors
def assess_druglikeness(smiles: str) -> dict | None:
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return None
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
return {
'MW': mw,
'LogP': logp,
'HBD': hbd,
'HBA': hba,
'TPSA': Descriptors.TPSA(mol),
'RotBonds': Descriptors.NumRotatableBonds(mol),
'Lipinski': mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10,
'QED': Descriptors.qed(mol)
}
Batch Similarity Search
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
def find_similar(query_smiles: str, database: list[str], threshold: float = 0.7) -> list:
query = Chem.MolFromSmiles(query_smiles)
if query is None:
return []
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)
query_fp = gen.GetFingerprint(query)
hits = []
for idx, smi in enumerate(database):
mol = Chem.MolFromSmiles(smi)
if mol:
fp = gen.GetFingerprint(mol)
sim = DataStructs.TanimotoSimilarity(query_fp, fp)
if sim >= threshold:
hits.append((idx, smi, sim))
return sorted(hits, key=lambda x: x[2], reverse=True)
Functional Group Screening
from rdkit import Chem
FUNCTIONAL_GROUPS = {
'alcohol': '[OH1][C]',
'amine': '[NH2,NH1][C]',
'carboxylic_acid': 'C(=O)[OH1]',
'amide': 'C(=O)N',
'ester': 'C(=O)O[C]',
'nitro': '[N+](=O)[O-]'
}
def detect_functional_groups(smiles: str) -> list[str]:
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return []
found = []
for name, smarts in FUNCTIONAL_GROUPS.items():
pattern = Chem.MolFromSmarts(smarts)
if mol.HasSubstructMatch(pattern):
found.append(name)
return found
Conformer Generation with Clustering
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina
def generate_diverse_conformers(smiles: str, n_confs: int = 50, rmsd_thresh: float = 0.5) -> list:
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return []
mol = Chem.AddHs(mol)
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
# Optimize all conformers
for cid in conf_ids:
AllChem.MMFFOptimizeMolecule(mol, confId=cid)
# Cluster by RMSD to get diverse set
if len(conf_ids) < 2:
return list(conf_ids)
dists = []
for i in range(len(conf_ids)):
for j in range(i):
rmsd = AllChem.GetConformerRMS(mol, conf_ids[j], conf_ids[i])
dists.append(rmsd)
clusters = Butina.ClusterData(dists, len(conf_ids), rmsd_thresh, isDistData=True)
return [conf_ids[c[0]] for c in clusters] # Cluster centroids
Batch Processing SDF Files
from rdkit import Chem
from rdkit.Chem import Descriptors
def process_sdf(input_path: str, output_path: str, min_mw: float = 200, max_mw: float = 500):
"""Filter compounds by molecular weight and add property columns."""
supplier = Chem.SDMolSupplier(input_path)
writer = Chem.SDWriter(output_path)
for mol in supplier:
if mol is None:
continue
mw = Descriptors.MolWt(mol)
if not (min_mw <= mw <= max_mw):
continue
# Add computed properties
mol.SetProp('MW', f'{mw:.2f}')
mol.SetProp('LogP', f'{Descriptors.MolLogP(mol):.2f}')
mol.SetProp('TPSA', f'{Descriptors.TPSA(mol):.2f}')
writer.write(mol)
writer.close()
Guidelines
Always validate parsed molecules:
mol = Chem.MolFromSmiles(smiles)
if mol is None:
print(f"Parse failed: {smiles}")
continue
Use bulk operations for performance:
fps = [gen.GetFingerprint(m) for m in mols]
sims = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
Add hydrogens for 3D work:
mol = Chem.AddHs(mol) # Required before EmbedMolecule
AllChem.EmbedMolecule(mol)
Stream large files:
# Memory-efficient: process one at a time
for mol in Chem.ForwardSDMolSupplier(file_handle):
if mol:
process(mol)
# Avoid: loading entire file
all_mols = list(Chem.SDMolSupplier('huge.sdf'))
Thread safety: Most operations are thread-safe except for concurrent access to MolSupplier objects.
Troubleshooting
| Issue | Resolution |
|---|---|
MolFromSmiles returns None |
Invalid SMILES syntax; check input |
| Sanitization error | Use Chem.DetectChemistryProblems(mol) to diagnose |
| Wrong 3D geometry | Call AddHs(mol) before embedding |
| Fingerprint size mismatch | Use same fpSize parameter for all comparisons |
| SMARTS not matching | Check aromatic vs aliphatic atoms (c vs C) |
| Slow SDF processing | Use ForwardSDMolSupplier or MultithreadedSDMolSupplier |
| Memory issues with large files | Stream with ForwardSDMolSupplier, don't load all |
Reference Documentation
Each reference file contains detailed API documentation:
| File | Contents |
|---|---|
references/api-reference.md |
Complete function/class listings by module |
references/descriptors-reference.md |
All molecular descriptors with examples |
references/smarts-patterns.md |
Common SMARTS patterns for substructure search |
External Resources
- RDKit Documentation: https://www.rdkit.org/docs/
- Getting Started Guide: https://www.rdkit.org/docs/GettingStartedInPython.html
- GitHub Repository: https://github.com/rdkit/rdkit
- Cookbook: https://www.rdkit.org/docs/Cookbook.html
Weekly Installs
26
Repository
aminoanalytica/…a-skillsFirst Seen
Feb 25, 2026
Security Audits
Installed on
mcpjam26
claude-code26
replit26
junie26
windsurf26
zencoder26