tooluniverse-rare-disease-diagnosis
Rare Disease Diagnosis Advisor
Systematic diagnosis support for rare diseases using phenotype matching, gene panel prioritization, and variant interpretation across Orphanet, OMIM, HPO, ClinVar, and structure-based analysis.
KEY PRINCIPLES:
- Report-first approach - Create report file FIRST, update progressively
- Phenotype-driven - Convert symptoms to HPO terms before searching
- Multi-database triangulation - Cross-reference Orphanet, OMIM, OpenTargets
- Evidence grading - Grade diagnoses by supporting evidence strength
- Actionable output - Prioritized differential diagnosis with next steps
- Genetic counseling aware - Consider inheritance patterns and family history
- English-first queries - Always use English terms in tool calls (phenotype descriptions, gene names, disease names), even if the user writes in another language. Only try original-language terms as a fallback. Respond in the user's language
When to Use
Apply when user asks:
- "Patient has [symptoms], what rare disease could this be?"
- "Unexplained developmental delay with [features]"
- "WES found VUS in [gene], is this pathogenic?"
- "What genes should we test for [phenotype]?"
- "Differential diagnosis for [rare symptom combination]"
Critical Workflow Requirements
1. Report-First Approach (MANDATORY)
-
Create the report file FIRST:
- File name:
[PATIENT_ID]_rare_disease_report.md - Initialize with all section headers
- Add placeholder text:
[Researching...]
- File name:
-
Progressively update as you gather data
-
Output separate data files:
[PATIENT_ID]_gene_panel.csv- Prioritized genes for testing[PATIENT_ID]_variant_interpretation.csv- If variants provided
2. Citation Requirements (MANDATORY)
Every finding MUST include source:
### Candidate Disease: Marfan Syndrome
- **ORPHA**: ORPHA:558
- **OMIM**: 154700
- **Phenotype match**: 85% (17/20 HPO terms)
- **Inheritance**: AD
- **Gene**: FBN1
*Source: Orphanet via `Orphanet_558`, OMIM via `OMIM_get_entry`*
Phase 0: Tool Verification
CRITICAL: Verify tool parameters before calling.
Known Parameter Corrections
| Tool | WRONG Parameter | CORRECT Parameter |
|---|---|---|
OpenTargets_get_associated_diseases_by_target_ensemblId |
ensemblID |
ensemblId |
ClinVar_get_variant_by_id |
variant_id |
id |
MyGene_query_genes |
gene |
q |
gnomAD_get_variant_frequencies |
variant |
variant_id |
Workflow Overview
Phase 1: Phenotype Standardization
├── Convert symptoms to HPO terms
├── Identify core vs. variable features
└── Note age of onset, inheritance hints
↓
Phase 2: Disease Matching
├── Orphanet phenotype search
├── OMIM clinical synopsis match
├── OpenTargets disease associations
└── OUTPUT: Ranked differential diagnosis
↓
Phase 3: Gene Panel Identification
├── Extract genes from top diseases
├── Cross-reference expression (GTEx)
├── Prioritize by evidence strength
└── OUTPUT: Recommended gene panel
↓
Phase 3.5: Expression & Tissue Context (NEW)
├── CELLxGENE: Cell-type specific expression
├── ChIPAtlas: Regulatory context (TF binding)
├── Tissue-specific gene networks
└── OUTPUT: Expression validation
↓
Phase 3.6: Pathway Analysis (NEW)
├── KEGG: Metabolic/signaling pathways
├── Reactome: Biological processes
├── IntAct: Protein-protein interactions
└── OUTPUT: Biological context
↓
Phase 4: Variant Interpretation (if provided)
├── ClinVar pathogenicity lookup
├── gnomAD population frequency
├── Protein domain/function impact
├── ENCODE/ChIPAtlas: Regulatory variant impact
└── OUTPUT: Variant classification
↓
Phase 5: Structure Analysis (for VUS)
├── NvidiaNIM_alphafold2 → Predict structure
├── Map variant to structure
├── Assess functional domain impact
└── OUTPUT: Structural evidence
↓
Phase 6: Literature Evidence (NEW)
├── PubMed: Published studies
├── BioRxiv/MedRxiv: Preprints
├── OpenAlex: Citation analysis
└── OUTPUT: Literature support
↓
Phase 7: Report Synthesis
├── Prioritized differential diagnosis
├── Recommended genetic testing
├── Next steps for clinician
└── OUTPUT: Final report
Phase 1: Phenotype Standardization
1.1 Convert Symptoms to HPO Terms
def standardize_phenotype(tu, symptoms_list):
"""Convert clinical descriptions to HPO terms."""
hpo_terms = []
for symptom in symptoms_list:
# Search HPO for matching terms
results = tu.tools.HPO_search_terms(query=symptom)
if results:
hpo_terms.append({
'original': symptom,
'hpo_id': results[0]['id'],
'hpo_name': results[0]['name'],
'confidence': 'exact' if symptom.lower() in results[0]['name'].lower() else 'partial'
})
return hpo_terms
1.2 Phenotype Categories
| Category | Examples | Weight |
|---|---|---|
| Core features | Always present in disease | High |
| Variable features | Present in >50% | Medium |
| Occasional features | Present in <50% | Low |
| Age-specific | Onset-dependent | Context |
1.3 Output for Report
## 1. Phenotype Analysis
### 1.1 Standardized HPO Terms
| Clinical Feature | HPO Term | HPO ID | Category |
|------------------|----------|--------|----------|
| Tall stature | Tall stature | HP:0000098 | Core |
| Long fingers | Arachnodactyly | HP:0001166 | Core |
| Heart murmur | Cardiac murmur | HP:0030148 | Variable |
| Joint hypermobility | Joint hypermobility | HP:0001382 | Core |
**Total HPO Terms**: 8
**Onset**: Childhood
**Family History**: Father with similar features (AD suspected)
*Source: HPO via `HPO_search_terms`*
Phase 2: Disease Matching
2.1 Orphanet Disease Search (NEW TOOLS)
def match_diseases_orphanet(tu, symptom_keywords):
"""Find rare diseases matching symptoms using Orphanet."""
candidate_diseases = []
# Search Orphanet by disease keywords
for keyword in symptom_keywords:
results = tu.tools.Orphanet_search_diseases(
operation="search_diseases",
query=keyword
)
if results.get('status') == 'success':
candidate_diseases.extend(results['data']['results'])
# Get genes for each disease
for disease in candidate_diseases:
orpha_code = disease.get('ORPHAcode')
genes = tu.tools.Orphanet_get_genes(
operation="get_genes",
orpha_code=orpha_code
)
disease['genes'] = genes.get('data', {}).get('genes', [])
return deduplicate_and_rank(candidate_diseases)
2.2 OMIM Cross-Reference (NEW TOOLS)
def cross_reference_omim(tu, orphanet_diseases, gene_symbols):
"""Get OMIM details for diseases and genes."""
omim_data = {}
# Search OMIM for each disease/gene
for gene in gene_symbols:
search_result = tu.tools.OMIM_search(
operation="search",
query=gene,
limit=5
)
if search_result.get('status') == 'success':
for entry in search_result['data'].get('entries', []):
mim_number = entry.get('mimNumber')
# Get detailed entry
details = tu.tools.OMIM_get_entry(
operation="get_entry",
mim_number=str(mim_number)
)
# Get clinical synopsis (phenotype features)
synopsis = tu.tools.OMIM_get_clinical_synopsis(
operation="get_clinical_synopsis",
mim_number=str(mim_number)
)
omim_data[gene] = {
'mim_number': mim_number,
'details': details.get('data', {}),
'clinical_synopsis': synopsis.get('data', {})
}
return omim_data
2.3 DisGeNET Gene-Disease Associations (NEW TOOLS)
def get_gene_disease_associations(tu, gene_symbols):
"""Get gene-disease associations from DisGeNET."""
associations = {}
for gene in gene_symbols:
# Get diseases associated with gene
result = tu.tools.DisGeNET_search_gene(
operation="search_gene",
gene=gene,
limit=20
)
if result.get('status') == 'success':
associations[gene] = result['data'].get('associations', [])
return associations
def get_disease_genes_disgenet(tu, disease_name):
"""Get all genes associated with a disease."""
result = tu.tools.DisGeNET_search_disease(
operation="search_disease",
disease=disease_name,
limit=30
)
return result.get('data', {}).get('associations', [])
2.4 Phenotype Overlap Scoring
| Match Level | Score | Criteria |
|---|---|---|
| Excellent | >80% | Most core + variable features match |
| Good | 60-80% | Core features match, some variable |
| Possible | 40-60% | Some overlap, needs consideration |
| Unlikely | <40% | Poor phenotype fit |
2.5 Output for Report
## 2. Differential Diagnosis
### Top Candidate Diseases (Ranked by Phenotype Match)
| Rank | Disease | ORPHA | OMIM | Match | Inheritance | Key Gene(s) |
|------|---------|-------|------|-------|-------------|-------------|
| 1 | Marfan syndrome | 558 | 154700 | 85% | AD | FBN1 |
| 2 | Loeys-Dietz syndrome | 60030 | 609192 | 72% | AD | TGFBR1, TGFBR2 |
| 3 | Ehlers-Danlos, vascular | 286 | 130050 | 65% | AD | COL3A1 |
| 4 | Homocystinuria | 394 | 236200 | 58% | AR | CBS |
### DisGeNET Gene-Disease Evidence
| Gene | Associated Diseases | GDA Score | Evidence |
|------|---------------------|-----------|----------|
| FBN1 | Marfan syndrome, MASS phenotype | 0.95 | ★★★ Curated |
| TGFBR1 | Loeys-Dietz syndrome | 0.89 | ★★★ Curated |
| COL3A1 | vascular EDS | 0.91 | ★★★ Curated |
*Source: DisGeNET via `DisGeNET_search_gene`*
### Disease Details
#### 1. Marfan Syndrome (★★★)
**ORPHA**: 558 | **OMIM**: 154700 | **Prevalence**: 1-5/10,000
**Phenotype Match Analysis**:
| Patient Feature | Disease Feature | Match |
|-----------------|-----------------|-------|
| Tall stature | Present in 95% | ✓ |
| Arachnodactyly | Present in 90% | ✓ |
| Joint hypermobility | Present in 85% | ✓ |
| Cardiac murmur | Aortic root dilation (70%) | Partial |
**OMIM Clinical Synopsis** (via `OMIM_get_clinical_synopsis`):
- **Cardiovascular**: Aortic root dilation, mitral valve prolapse
- **Skeletal**: Scoliosis, pectus excavatum, tall stature
- **Ocular**: Ectopia lentis, myopia
**Diagnostic Criteria**: Ghent nosology (2010)
- Aortic root dilation/dissection + FBN1 mutation = Diagnosis
- Without genetic testing: systemic score ≥7 + ectopia lentis
**Inheritance**: Autosomal dominant (25% de novo)
*Source: Orphanet via `Orphanet_get_disease`, OMIM via `OMIM_get_entry`, DisGeNET*
Phase 3: Gene Panel Identification
3.1 Extract Disease Genes
def build_gene_panel(tu, candidate_diseases):
"""Build prioritized gene panel from candidate diseases."""
genes = {}
for disease in candidate_diseases:
for gene in disease['genes']:
if gene not in genes:
genes[gene] = {
'symbol': gene,
'diseases': [],
'evidence_level': 'unknown'
}
genes[gene]['diseases'].append(disease['name'])
return genes
3.1.1 ClinGen Gene-Disease Validity Check (NEW)
Critical: Always verify gene-disease validity through ClinGen before including in panel.
def get_clingen_gene_evidence(tu, gene_symbol):
"""
Get ClinGen gene-disease validity and dosage sensitivity.
ESSENTIAL for rare disease gene panel prioritization.
"""
# 1. Gene-disease validity classification
validity = tu.tools.ClinGen_search_gene_validity(gene=gene_symbol)
validity_levels = []
diseases_with_validity = []
if validity.get('data'):
for entry in validity.get('data', []):
validity_levels.append(entry.get('Classification'))
diseases_with_validity.append({
'disease': entry.get('Disease Label'),
'mondo_id': entry.get('Disease ID (MONDO)'),
'classification': entry.get('Classification'),
'inheritance': entry.get('Inheritance')
})
# 2. Dosage sensitivity (critical for CNV interpretation)
dosage = tu.tools.ClinGen_search_dosage_sensitivity(gene=gene_symbol)
hi_score = None
ts_score = None
if dosage.get('data'):
for entry in dosage.get('data', []):
hi_score = entry.get('Haploinsufficiency Score')
ts_score = entry.get('Triplosensitivity Score')
break
# 3. Clinical actionability (return of findings context)
actionability = tu.tools.ClinGen_search_actionability(gene=gene_symbol)
is_actionable = (actionability.get('adult_count', 0) > 0 or
actionability.get('pediatric_count', 0) > 0)
# Determine best evidence level
level_priority = ['Definitive', 'Strong', 'Moderate', 'Limited', 'Disputed', 'Refuted']
best_level = 'Not curated'
for level in level_priority:
if level in validity_levels:
best_level = level
break
return {
'gene': gene_symbol,
'evidence_level': best_level,
'diseases_curated': diseases_with_validity,
'haploinsufficiency_score': hi_score,
'triplosensitivity_score': ts_score,
'is_actionable': is_actionable,
'include_in_panel': best_level in ['Definitive', 'Strong', 'Moderate']
}
def prioritize_genes_with_clingen(tu, gene_list):
"""Prioritize genes using ClinGen evidence levels."""
prioritized = []
for gene in gene_list:
evidence = get_clingen_gene_evidence(tu, gene)
# Score based on ClinGen classification
score = 0
if evidence['evidence_level'] == 'Definitive':
score = 5
elif evidence['evidence_level'] == 'Strong':
score = 4
elif evidence['evidence_level'] == 'Moderate':
score = 3
elif evidence['evidence_level'] == 'Limited':
score = 1
# Disputed/Refuted get 0
# Bonus for haploinsufficiency score 3
if evidence['haploinsufficiency_score'] == '3':
score += 1
# Bonus for actionability
if evidence['is_actionable']:
score += 1
prioritized.append({
**evidence,
'priority_score': score
})
# Sort by priority score
return sorted(prioritized, key=lambda x: x['priority_score'], reverse=True)
ClinGen Classification Impact on Panel:
| Classification | Include in Panel? | Priority |
|---|---|---|
| Definitive | YES - mandatory | Highest |
| Strong | YES - highly recommended | High |
| Moderate | YES | Medium |
| Limited | Include but flag | Low |
| Disputed | Exclude or separate | Avoid |
| Refuted | EXCLUDE | Do not test |
| Not curated | Use other evidence | Variable |
3.2 Gene Prioritization Criteria
| Priority | Criteria | Points |
|---|---|---|
| Tier 1 | Gene causes #1 ranked disease | +5 |
| Tier 2 | Gene causes multiple candidates | +3 |
| Tier 3 | ClinGen "Definitive" evidence | +3 |
| Tier 4 | Expressed in affected tissue | +2 |
| Tier 5 | Constraint score pLI >0.9 | +1 |
3.3 Expression Validation
def validate_expression(tu, gene_symbol, affected_tissue):
"""Check if gene is expressed in relevant tissue."""
# Get Ensembl ID
gene_info = tu.tools.MyGene_query_genes(q=gene_symbol, species="human")
ensembl_id = gene_info.get('ensembl', {}).get('gene')
# Check GTEx expression
expression = tu.tools.GTEx_get_median_gene_expression(
gencode_id=f"{ensembl_id}.latest"
)
return expression.get(affected_tissue, 0) > 1 # TPM > 1
3.4 Output for Report
## 3. Recommended Gene Panel
### 3.1 Prioritized Genes for Testing
| Priority | Gene | Diseases | Evidence | Constraint (pLI) | Expression |
|----------|------|----------|----------|------------------|------------|
| ★★★ | FBN1 | Marfan syndrome | Definitive | 1.00 | Heart, aorta |
| ★★★ | TGFBR1 | Loeys-Dietz 1 | Definitive | 0.98 | Ubiquitous |
| ★★★ | TGFBR2 | Loeys-Dietz 2 | Definitive | 0.99 | Ubiquitous |
| ★★☆ | COL3A1 | EDS vascular | Definitive | 1.00 | Connective tissue |
| ★☆☆ | CBS | Homocystinuria | Definitive | 0.00 | Liver |
### 3.2 Panel Design Recommendation
**Minimum Panel** (high yield): FBN1, TGFBR1, TGFBR2, COL3A1
**Extended Panel** (+differential): Add CBS, SMAD3, ACTA2
**Testing Strategy**:
1. Start with FBN1 sequencing (highest pre-test probability)
2. If negative, proceed to full connective tissue panel
3. Consider WES if panel negative
*Source: ClinGen via gene-disease validity, GTEx expression*
Phase 3.5: Expression & Tissue Context (ENHANCED)
3.5.1 Cell-Type Specific Expression (CELLxGENE)
def get_cell_type_expression(tu, gene_symbol, affected_tissues):
"""Get single-cell expression to validate tissue relevance."""
# Get expression across cell types
expression = tu.tools.CELLxGENE_get_expression_data(
gene=gene_symbol,
tissue=affected_tissues[0] if affected_tissues else "all"
)
# Get cell type metadata
cell_metadata = tu.tools.CELLxGENE_get_cell_metadata(
gene=gene_symbol
)
# Identify high-expression cell types
high_expression = [
ct for ct in expression
if ct.get('mean_expression', 0) > 1.0 # TPM > 1
]
return {
'expression_data': expression,
'high_expression_cells': high_expression,
'total_cell_types': len(cell_metadata)
}
Why it matters: Confirms candidate genes are expressed in disease-relevant tissues/cells.
3.5.2 Regulatory Context (ChIPAtlas)
def get_regulatory_context(tu, gene_symbol):
"""Get transcription factor binding for candidate genes."""
# Search for TF binding near gene
tf_binding = tu.tools.ChIPAtlas_enrichment_analysis(
gene=gene_symbol,
cell_type="all"
)
# Get specific binding peaks
peaks = tu.tools.ChIPAtlas_get_peak_data(
gene=gene_symbol,
experiment_type="TF"
)
return {
'transcription_factors': tf_binding,
'regulatory_peaks': peaks
}
Why it matters: Identifies regulatory mechanisms that may be disrupted in disease.
3.5.3 Output for Report
## 3.5 Expression & Regulatory Context
### Cell-Type Specific Expression (CELLxGENE)
| Gene | Top Expressing Cell Types | Expression Level | Tissue Relevance |
|------|---------------------------|------------------|------------------|
| FBN1 | Fibroblasts, Smooth muscle | High (TPM=45) | ✓ Connective tissue |
| TGFBR1 | Endothelial, Fibroblasts | Medium (TPM=12) | ✓ Vascular |
| COL3A1 | Fibroblasts, Myofibroblasts | Very High (TPM=120) | ✓ Connective tissue |
**Interpretation**: All top candidate genes show high expression in disease-relevant cell types (connective tissue, vascular cells), supporting their candidacy.
### Regulatory Context (ChIPAtlas)
| Gene | Key TF Regulators | Regulatory Significance |
|------|-------------------|------------------------|
| FBN1 | TGFβ pathway (SMAD2/3), AP-1 | TGFβ-responsive |
| TGFBR1 | STAT3, NF-κB | Inflammation-responsive |
*Source: CELLxGENE Census, ChIPAtlas*
Phase 3.6: Pathway Analysis (NEW)
3.6.1 KEGG Pathway Context
def get_pathway_context(tu, gene_symbols):
"""Get pathway context for candidate genes."""
pathways = {}
for gene in gene_symbols:
# Search KEGG for gene
kegg_genes = tu.tools.kegg_find_genes(query=f"hsa:{gene}")
if kegg_genes:
# Get pathway membership
gene_info = tu.tools.kegg_get_gene_info(gene_id=kegg_genes[0]['id'])
pathways[gene] = gene_info.get('pathways', [])
return pathways
3.6.2 Protein-Protein Interactions (IntAct)
def get_protein_interactions(tu, gene_symbol):
"""Get interaction partners for candidate genes."""
# Search IntAct for interactions
interactions = tu.tools.intact_search_interactions(
query=gene_symbol,
species="human"
)
# Get interaction network
network = tu.tools.intact_get_interaction_network(
gene=gene_symbol,
depth=1 # Direct interactors only
)
return {
'interactions': interactions,
'network': network,
'interactor_count': len(interactions)
}
3.6.3 Output for Report
## 3.6 Pathway & Network Context
### KEGG Pathways
| Gene | Key Pathways | Biological Process |
|------|--------------|-------------------|
| FBN1 | ECM-receptor interaction (hsa04512) | Extracellular matrix |
| TGFBR1/2 | TGF-beta signaling (hsa04350) | Cell signaling |
| COL3A1 | Focal adhesion (hsa04510) | Cell-matrix adhesion |
### Shared Pathway Analysis
**Convergent pathways** (≥2 candidate genes):
- TGF-beta signaling pathway: FBN1, TGFBR1, TGFBR2, SMAD3
- ECM organization: FBN1, COL3A1
**Interpretation**: Candidate genes converge on TGF-beta signaling and extracellular matrix pathways, consistent with connective tissue disorder etiology.
### Protein-Protein Interactions (IntAct)
| Gene | Direct Interactors | Notable Partners |
|------|-------------------|------------------|
| FBN1 | 42 | LTBP1, TGFB1, ADAMTS10 |
| TGFBR1 | 68 | TGFBR2, SMAD2, SMAD3 |
*Source: KEGG, IntAct, Reactome*
Phase 4: Variant Interpretation (If Provided)
4.1 ClinVar Lookup
def interpret_variant(tu, variant_hgvs):
"""Get ClinVar interpretation for variant."""
result = tu.tools.ClinVar_search_variants(query=variant_hgvs)
return {
'clinvar_id': result.get('id'),
'classification': result.get('clinical_significance'),
'review_status': result.get('review_status'),
'conditions': result.get('conditions'),
'last_evaluated': result.get('last_evaluated')
}
4.2 Population Frequency
def check_population_frequency(tu, variant_id):
"""Get gnomAD allele frequency."""
freq = tu.tools.gnomAD_get_variant_frequencies(variant_id=variant_id)
# Interpret rarity
if freq['allele_frequency'] < 0.00001:
rarity = "Ultra-rare"
elif freq['allele_frequency'] < 0.0001:
rarity = "Rare"
elif freq['allele_frequency'] < 0.01:
rarity = "Low frequency"
else:
rarity = "Common (likely benign)"
return freq, rarity
4.3 Computational Pathogenicity Prediction (ENHANCED)
Use state-of-the-art prediction tools for VUS interpretation:
def comprehensive_vus_prediction(tu, variant_info):
"""
Combine multiple prediction tools for VUS classification.
Critical for rare disease variants not in ClinVar.
"""
predictions = {}
# 1. CADD - Deleteriousness (NEW API)
cadd = tu.tools.CADD_get_variant_score(
chrom=variant_info['chrom'],
pos=variant_info['pos'],
ref=variant_info['ref'],
alt=variant_info['alt'],
version="GRCh38-v1.7"
)
if cadd.get('status') == 'success':
predictions['cadd'] = {
'score': cadd['data'].get('phred_score'),
'interpretation': cadd['data'].get('interpretation'),
'acmg': 'PP3' if cadd['data'].get('phred_score', 0) >= 20 else 'neutral'
}
# 2. AlphaMissense - DeepMind pathogenicity (NEW)
if variant_info.get('uniprot_id') and variant_info.get('aa_change'):
am = tu.tools.AlphaMissense_get_variant_score(
uniprot_id=variant_info['uniprot_id'],
variant=variant_info['aa_change'] # e.g., "E1541K"
)
if am.get('status') == 'success' and am.get('data'):
classification = am['data'].get('classification')
predictions['alphamissense'] = {
'score': am['data'].get('pathogenicity_score'),
'classification': classification,
'acmg': 'PP3 (strong)' if classification == 'pathogenic' else (
'BP4 (strong)' if classification == 'benign' else 'neutral'
)
}
# 3. EVE - Evolutionary prediction (NEW)
eve = tu.tools.EVE_get_variant_score(
chrom=variant_info['chrom'],
pos=variant_info['pos'],
ref=variant_info['ref'],
alt=variant_info['alt']
)
if eve.get('status') == 'success':
eve_scores = eve['data'].get('eve_scores', [])
if eve_scores:
predictions['eve'] = {
'score': eve_scores[0].get('eve_score'),
'classification': eve_scores[0].get('classification'),
'acmg': 'PP3' if eve_scores[0].get('eve_score', 0) > 0.5 else 'BP4'
}
# 4. SpliceAI - Splice variant prediction (NEW)
# Use for intronic, synonymous, or exonic variants near splice sites
variant_str = f"chr{variant_info['chrom']}-{variant_info['pos']}-{variant_info['ref']}-{variant_info['alt']}"
splice = tu.tools.SpliceAI_predict_splice(
variant=variant_str,
genome="38"
)
if splice.get('data'):
max_score = splice['data'].get('max_delta_score', 0)
interpretation = splice['data'].get('interpretation', '')
if max_score >= 0.8:
splice_acmg = 'PP3 (strong) - high splice impact'
elif max_score >= 0.5:
splice_acmg = 'PP3 (moderate) - splice impact'
elif max_score >= 0.2:
splice_acmg = 'PP3 (supporting) - possible splice effect'
else:
splice_acmg = 'BP7 (if synonymous) - no splice impact'
predictions['spliceai'] = {
'max_delta_score': max_score,
'interpretation': interpretation,
'scores': splice['data'].get('scores', []),
'acmg': splice_acmg
}
# Consensus for PP3/BP4
damaging = sum(1 for p in predictions.values() if 'PP3' in p.get('acmg', ''))
benign = sum(1 for p in predictions.values() if 'BP4' in p.get('acmg', ''))
return {
'predictions': predictions,
'consensus': {
'damaging_count': damaging,
'benign_count': benign,
'pp3_applicable': damaging >= 2 and benign == 0,
'bp4_applicable': benign >= 2 and damaging == 0
}
}
4.4 ACMG Classification Criteria
| Evidence Type | Criteria | Weight |
|---|---|---|
| PVS1 | Null variant in gene where LOF is mechanism | Very Strong |
| PS1 | Same amino acid change as established pathogenic | Strong |
| PM2 | Absent from population databases | Moderate |
| PP3 | Computational evidence supports deleterious (AlphaMissense, CADD, EVE, SpliceAI) | Supporting |
| BA1 | Allele frequency >5% | Benign standalone |
Enhanced PP3 Evidence (NEW):
- AlphaMissense pathogenic (>0.564) = Strong PP3 support (~90% accuracy)
- CADD ≥20 + EVE >0.5 = Multiple concordant predictions
- Agreement from 2+ predictors strengthens PP3 evidence
4.5 Output for Report
## 4. Variant Interpretation
### 4.1 Variant: FBN1 c.4621G>A (p.Glu1541Lys)
| Property | Value | Interpretation |
|----------|-------|----------------|
| Gene | FBN1 | Marfan syndrome gene |
| Consequence | Missense | Amino acid change |
| ClinVar | VUS | Uncertain significance |
| gnomAD AF | 0.000004 | Ultra-rare (PM2) |
### 4.2 Computational Predictions (NEW)
| Predictor | Score | Classification | ACMG Support |
|-----------|-------|----------------|--------------|
| **AlphaMissense** | 0.78 | Pathogenic | PP3 (strong) |
| **CADD PHRED** | 28.5 | Top 0.1% deleterious | PP3 |
| **EVE** | 0.72 | Likely pathogenic | PP3 |
**Consensus**: 3/3 predictors concordant damaging → **Strong PP3 support**
*Source: AlphaMissense, CADD API, EVE via Ensembl VEP*
### 4.3 ACMG Evidence Summary
| Criterion | Evidence | Strength |
|-----------|----------|----------|
| PM2 | Absent from gnomAD (AF < 0.00001) | Moderate |
| PP3 | AlphaMissense + CADD + EVE concordant | Supporting (strong) |
| PP4 | Phenotype highly specific for Marfan | Supporting |
| PS4 | Multiple affected family members | Strong |
**Preliminary Classification**: Likely Pathogenic (1 Strong + 1 Moderate + 2 Supporting)
*Source: ClinVar, gnomAD, AlphaMissense, CADD, EVE*
Phase 5: Structure Analysis for VUS
5.1 When to Perform Structure Analysis
Perform when:
- Variant is VUS or conflicting interpretations
- Missense variant in critical domain
- Novel variant not in databases
- Additional evidence needed for classification
5.2 Structure Prediction (NVIDIA NIM)
def analyze_variant_structure(tu, protein_sequence, variant_position):
"""Predict structure and analyze variant impact."""
# Predict structure with AlphaFold2
structure = tu.tools.NvidiaNIM_alphafold2(
sequence=protein_sequence,
algorithm="mmseqs2",
relax_prediction=False
)
# Extract pLDDT at variant position
variant_plddt = get_residue_plddt(structure, variant_position)
# Check if in structured region
confidence = "High" if variant_plddt > 70 else "Low"
return {
'structure': structure,
'variant_plddt': variant_plddt,
'confidence': confidence
}
5.3 Domain Impact Assessment
def assess_domain_impact(tu, uniprot_id, variant_position):
"""Check if variant affects functional domain."""
# Get domain annotations
domains = tu.tools.InterPro_get_protein_domains(accession=uniprot_id)
for domain in domains:
if domain['start'] <= variant_position <= domain['end']:
return {
'in_domain': True,
'domain_name': domain['name'],
'domain_function': domain['description']
}
return {'in_domain': False}
5.4 Output for Report
## 5. Structural Analysis
### 5.1 Structure Prediction
**Method**: AlphaFold2 via NVIDIA NIM
**Protein**: Fibrillin-1 (FBN1)
**Sequence Length**: 2,871 amino acids
| Metric | Value | Interpretation |
|--------|-------|----------------|
| Mean pLDDT | 85.3 | High confidence overall |
| Variant position pLDDT | 92.1 | Very high confidence |
| Nearby domain | cbEGF-like domain 23 | Calcium-binding |
### 5.2 Variant Location Analysis
**Variant**: p.Glu1541Lys
| Feature | Finding | Impact |
|---------|---------|--------|
| Domain | cbEGF-like domain 23 | Critical for calcium binding |
| Conservation | 100% conserved across vertebrates | High constraint |
| Structural role | Calcium coordination residue | Likely destabilizing |
| Nearby pathogenic | p.Glu1540Lys (Pathogenic) | Adjacent residue |
### 5.3 Structural Interpretation
The variant p.Glu1541Lys:
1. **Located in cbEGF domain** - These domains are critical for fibrillin-1 function
2. **Glutamate → Lysine** - Charge reversal (negative to positive)
3. **Calcium binding** - Glutamate at this position coordinates Ca2+
4. **Adjacent pathogenic variant** - p.Glu1540Lys is classified Pathogenic
**Structural Evidence**: Strong support for pathogenicity (PM1 - critical domain)
*Source: NVIDIA NIM via `NvidiaNIM_alphafold2`, InterPro*
Phase 6: Literature Evidence (NEW)
6.1 Published Literature (PubMed)
def search_disease_literature(tu, disease_name, genes):
"""Search for relevant published literature."""
# Disease-specific search
disease_papers = tu.tools.PubMed_search_articles(
query=f'"{disease_name}" AND (genetics OR mutation OR variant)',
limit=20
)
# Gene-specific searches
gene_papers = []
for gene in genes[:5]: # Top 5 genes
papers = tu.tools.PubMed_search_articles(
query=f'"{gene}" AND rare disease AND pathogenic',
limit=10
)
gene_papers.extend(papers)
return {
'disease_literature': disease_papers,
'gene_literature': gene_papers
}
6.2 Preprint Literature (BioRxiv/MedRxiv)
def search_preprints(tu, disease_name, genes):
"""Search preprints for cutting-edge findings."""
# BioRxiv search
biorxiv = tu.tools.BioRxiv_search_preprints(
query=f"{disease_name} genetics",
limit=10
)
# ArXiv for computational methods
arxiv = tu.tools.ArXiv_search_papers(
query=f"rare disease diagnosis {' OR '.join(genes[:3])}",
category="q-bio",
limit=5
)
return {
'biorxiv': biorxiv,
'arxiv': arxiv
}
6.3 Citation Analysis (OpenAlex)
def analyze_citations(tu, key_papers):
"""Analyze citation network for key papers."""
citation_analysis = []
for paper in key_papers[:5]:
# Get citation data
work = tu.tools.openalex_search_works(
query=paper['title'],
limit=1
)
if work:
citation_analysis.append({
'title': paper['title'],
'citations': work[0].get('cited_by_count', 0),
'year': work[0].get('publication_year')
})
return citation_analysis
6.4 Output for Report
## 6. Literature Evidence
### 6.1 Key Published Studies
| PMID | Title | Year | Citations | Relevance |
|------|-------|------|-----------|-----------|
| 32123456 | FBN1 variants in Marfan syndrome... | 2023 | 45 | Direct |
| 31987654 | TGF-beta signaling in connective... | 2022 | 89 | Pathway |
| 30876543 | Novel diagnostic criteria for... | 2021 | 156 | Diagnostic |
### 6.2 Recent Preprints (Not Yet Peer-Reviewed)
| Source | Title | Posted | Relevance |
|--------|-------|--------|-----------|
| BioRxiv | Novel FBN1 splice variant causes... | 2024-01 | Case report |
| MedRxiv | Machine learning for Marfan... | 2024-02 | Diagnostic |
**⚠️ Note**: Preprints have not undergone peer review. Use with caution.
### 6.3 Evidence Summary
| Evidence Type | Count | Strength |
|---------------|-------|----------|
| Case reports | 12 | Supporting |
| Functional studies | 5 | Strong |
| Clinical trials | 2 | Strong |
| Reviews | 8 | Context |
*Source: PubMed, BioRxiv, OpenAlex*
Report Template
File: [PATIENT_ID]_rare_disease_report.md
# Rare Disease Diagnostic Report
**Patient ID**: [ID] | **Date**: [Date] | **Status**: In Progress
---
## Executive Summary
[Researching...]
---
## 1. Phenotype Analysis
### 1.1 Standardized HPO Terms
[Researching...]
### 1.2 Key Clinical Features
[Researching...]
---
## 2. Differential Diagnosis
### 2.1 Ranked Candidate Diseases
[Researching...]
### 2.2 Disease Details
[Researching...]
---
## 3. Recommended Gene Panel
### 3.1 Prioritized Genes
[Researching...]
### 3.2 Testing Strategy
[Researching...]
---
## 4. Variant Interpretation (if applicable)
### 4.1 Variant Details
[Researching...]
### 4.2 ACMG Classification
[Researching...]
---
## 5. Structural Analysis (if applicable)
### 5.1 Structure Prediction
[Researching...]
### 5.2 Variant Impact
[Researching...]
---
## 6. Clinical Recommendations
### 6.1 Diagnostic Next Steps
[Researching...]
### 6.2 Specialist Referrals
[Researching...]
### 6.3 Family Screening
[Researching...]
---
## 7. Data Gaps & Limitations
[Researching...]
---
## 8. Data Sources
[Will be populated as research progresses...]
Evidence Grading
| Tier | Symbol | Criteria | Example |
|---|---|---|---|
| T1 | ★★★ | Phenotype match >80% + gene match | Marfan with FBN1 mutation |
| T2 | ★★☆ | Phenotype match 60-80% OR likely pathogenic variant | Good phenotype fit |
| T3 | ★☆☆ | Phenotype match 40-60% OR VUS in candidate gene | Possible diagnosis |
| T4 | ☆☆☆ | Phenotype <40% OR uncertain gene | Low probability |
Completeness Checklist
Phase 1: Phenotype
- All symptoms converted to HPO terms
- Core vs. variable features distinguished
- Age of onset documented
- Family history noted
Phase 2: Disease Matching
- ≥5 candidate diseases identified (or all matching)
- Phenotype overlap % calculated
- Inheritance patterns noted
- ORPHA and OMIM IDs provided
Phase 3: Gene Panel
- ≥5 genes prioritized (or all from top diseases)
- Evidence level for each gene (ClinGen)
- Expression validation performed
- Testing strategy recommended
Phase 4: Variant Interpretation (if applicable)
- ClinVar classification retrieved
- gnomAD frequency checked
- ACMG criteria applied
- Classification justified
Phase 5: Structure Analysis (if applicable)
- Structure predicted (if VUS)
- pLDDT confidence reported
- Domain impact assessed
- Structural evidence summarized
Phase 6: Recommendations
- ≥3 next steps listed
- Specialist referrals suggested
- Family screening addressed
Fallback Chains
| Primary Tool | Fallback 1 | Fallback 2 |
|---|---|---|
Orphanet_search_by_hpo |
OMIM_search |
PubMed phenotype search |
ClinVar_get_variant |
gnomAD_get_variant |
VEP annotation |
NvidiaNIM_alphafold2 |
alphafold_get_prediction |
UniProt features |
GTEx_expression |
HPA_expression |
Tissue-specific literature |
gnomAD_get_variant |
ExAC_frequencies |
1000 Genomes |
Tool Reference
See TOOLS_REFERENCE.md for complete tool documentation.