tooluniverse-regulatory-variant-analysis
COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.
Regulatory Variant Analysis Skill
Systematic regulatory variant interpretation: discover trait associations from GWAS, map eQTL effects, annotate chromatin context, assess regulatory element overlap, and produce evidence-graded functional impact predictions for non-coding variants.
When to Use
- "What GWAS associations exist for rs12913832?"
- "Find eQTLs for the APOE locus in brain tissue"
- "What regulatory elements overlap this variant region?"
- "Which SNPs are associated with type 2 diabetes from GWAS?"
- "Is this intronic variant in an active enhancer?"
- "What is the RegulomeDB score for rs429358?"
- "Find ENCODE histone marks at the BRCA1 promoter region"
- "Map trait ontology terms for 'blood pressure' to EFO IDs"
NOT for (use other skills instead):
- Coding variant pathogenicity -> Use
tooluniverse-variant-interpretation - Full clinical variant classification (ACMG) -> Use
tooluniverse-variant-interpretation - Gene-disease associations (not variant-specific) -> Use
tooluniverse-gene-disease-association - Pharmacogenomic variant annotation -> Use
tooluniverse-pharmacogenomics - Epigenomics data processing (BED/narrowPeak files) -> Use
tooluniverse-epigenomics
Non-Coding Variant Impact Reasoning
When evaluating a non-coding variant, build evidence across four questions:
1. Is the variant in a regulatory element? Use RegulomeDB to assess whether the variant overlaps TF binding sites, chromatin accessibility peaks, or known regulatory annotations. A low RegulomeDB score (categories 1a-2a) indicates strong evidence that the position is functionally active. Confirm with ENCODE histone marks: H3K27ac signals active enhancers and active promoters; H3K4me1 alone marks poised enhancers; H3K4me3 marks active promoters; H3K27me3 marks silenced regions.
2. Does it alter a transcription factor binding site? Check RegulomeDB's TF binding evidence and ENCODE TF ChIP-seq experiments. A variant that falls within a TF footprint and disrupts the consensus motif is mechanistically actionable, especially if the TF is known to be relevant in the disease tissue.
3. Is there eQTL evidence linking it to a gene? Query GTEx to determine whether the variant (or variants in tight LD) modulates expression of a nearby gene in a tissue-specific or ubiquitous manner. A tissue-specific eQTL suggests cell-type-specific regulation; a ubiquitous eQTL suggests a core regulatory element. The direction of the NES (positive = alternative allele increases expression, negative = decreases) and effect size matter for interpretation.
4. Is there GWAS evidence for trait association? Search the GWAS Catalog for the rsID or the surrounding locus. Genome-wide significant associations (p < 5×10⁻⁸) in relevant traits anchor the variant's biological importance. Cross-reference with OpenTargets for locus-to-gene mapping from multiple GWAS studies.
Synthesizing the evidence: Build a multi-layer case. A variant with GWAS significance + eQTL evidence + RegulomeDB score 1a-2a + active chromatin (H3K27ac) in the relevant tissue represents high-confidence regulatory impact. Two or three converging lines of evidence (e.g., eQTL plus active enhancer) constitute moderate confidence. A single line, or a variant only in a poised but not active regulatory context, represents lower confidence.
Workflow Overview
Input (rsID, genomic coordinates, trait/disease, gene)
|
v
Phase 0: Variant/Trait Resolution
Resolve rsIDs, map trait names to EFO/MONDO IDs via OLS
|
v
Phase 1: GWAS Association Lookup
GWAS Catalog associations, p-values, effect sizes, study metadata
|
v
Phase 2: eQTL Analysis
GTEx tissue-specific eQTLs, target gene identification
|
v
Phase 3: Regulatory Element Annotation
ENCODE histone marks, RegulomeDB scores, chromatin state
|
v
Phase 4: OpenTargets GWAS Integration
OpenTargets GWAS study aggregation, locus-to-gene mapping
|
v
Phase 5: Functional Impact Synthesis
Integrate all evidence, assign regulatory impact level
|
v
Phase 6: Report
Evidence-graded regulatory variant report
Phase 0: Variant/Trait Resolution
Use ols_search_terms to resolve trait names to ontology IDs before GWAS queries. Restrict to ontology="efo" for GWAS traits; OpenTargets prefers MONDO IDs (e.g., MONDO_0005148 for type 2 diabetes rather than EFO_0001360). Use EnsemblVEP_annotate_rsid (param is variant_id, not rsid) for initial consequence annotation and nearest gene identification.
Phase 1: GWAS Association Lookup
gwas_search_associations is the primary tool: accepts disease_trait (free text), efo_id (preferred for precision), rs_id, and p_value threshold. Use p_value=5e-8 for genome-wide significance. For locus-level discovery, gwas_get_variants_for_trait retrieves all SNPs for a trait. gwas_get_snps_for_gene finds GWAS-cataloged SNPs mapped to a specific gene.
Reasoning tip: When GWAS Catalog returns empty for a free-text trait, switch to the efo_id parameter — the catalog uses controlled vocabulary and free-text matching is imprecise.
Phase 2: eQTL Analysis
GTEx_query_eqtl accepts a gene symbol (auto-resolved to GENCODE ID) or Ensembl gene ID. It returns tissue-specific SNP-gene associations with NES (normalized effect size) and p-value per tissue.
When interpreting results, ask: does the eQTL effect occur in the tissue most relevant to the disease? A brain-specific eQTL for a neurodegenerative disease variant is more compelling than a ubiquitous one. Use GTEx_get_median_gene_expression to confirm that the target gene is actually expressed in the relevant tissue before placing weight on eQTL evidence.
Note: GTEx API uses v8 data; gtex_v10 endpoints may return empty for some queries.
Phase 3: Regulatory Element Annotation
RegulomeDB_query_variant (param: rsid) returns a regulatory score and feature annotations. Scores in categories 1a–2a indicate strong regulatory evidence (eQTL overlap + TF binding + chromatin accessibility). Scores 3a–6 represent progressively weaker evidence.
ENCODE_search_histone_experiments accepts histone_mark (e.g., "H3K27ac") and biosample_term_name (tissue or cell line name — NOT a disease name; ENCODE uses biological sample names like "liver" or "breast epithelium"). Use assay_title="TF ChIP-seq" (not just "ChIP-seq") when querying TF binding data.
Reasoning tip: RegulomeDB aggregates ENCODE, Roadmap, and other data. If ENCODE doesn't have the specific biosample, RegulomeDB may still have aggregate evidence from related cell types.
Phase 4: OpenTargets GWAS Integration
OpenTargets_search_gwas_studies_by_disease takes diseaseIds as an array of MONDO IDs. It provides locus-to-gene (L2G) scores from multiple GWAS studies, which go beyond simple proximity to incorporate colocalisation, eQTL, and chromatin data. Use OpenTargets_multi_entity_search or OpenTargets_get_disease_id_description_by_name to resolve disease names to MONDO/EFO IDs first.
Phase 5: Functional Impact Synthesis
After collecting evidence, reason through the layers:
- High impact: GWAS genome-wide significant + eQTL with meaningful NES + RegulomeDB score ≤ 2 + active chromatin (H3K27ac) in relevant tissue. Multiple independent lines converge on the same locus and gene.
- Moderate impact: Two to three lines of evidence (e.g., eQTL + active enhancer overlap, or GWAS significant + RegulomeDB ≤ 3) without full convergence.
- Low impact: Single line of evidence, or only computational annotation (VEP consequence category) without functional data.
- No evidence: No regulatory annotations in any source; the variant may be in a non-functional region or the relevant cell type is not represented in available datasets.
Fallback Strategies
- GWAS Catalog returns empty: Switch from free-text
disease_traittoefo_id; broaden the trait term. - GTEx eQTL empty for gene: Verify gene symbol spelling; try Ensembl ID; increase
sizeparameter. - RegulomeDB returns no data: Query ENCODE directly; the variant may lack regulatory annotations in available data.
- OpenTargets GWAS returns None: Verify MONDO/EFO ID format; try
OpenTargets_multi_entity_searchfirst to confirm the correct ID. - ENCODE tissue not found: ENCODE uses specific biosample names; RegulomeDB aggregates data from many cell types and may cover the gap.
Example Workflows
GWAS Variant Functional Annotation (rs429358 / APOE)
Step 1: gwas_search_associations(rs_id="rs429358")
-> All trait associations (Alzheimer's disease, LDL cholesterol, etc.)
Step 2: GTEx_query_eqtl(gene_symbol="APOE")
-> Tissue-specific eQTL evidence; note effect in brain vs liver
Step 3: RegulomeDB_query_variant(rsid="rs429358")
-> Regulatory score and TF binding annotations
Step 4: ENCODE_search_histone_experiments(histone_mark="H3K27ac", biosample_term_name="brain")
-> Active enhancer context near the variant
Step 5: Synthesize: does GWAS significance + eQTL + active chromatin converge on one gene?
Non-Coding Variant Assessment (Intronic/UTR Variant)
Step 1: EnsemblVEP_annotate_rsid(variant_id="rs12345678")
-> Confirm non-coding consequence, identify nearest gene
Step 2: RegulomeDB_query_variant(rsid="rs12345678")
-> Is this position in a regulatory context?
Step 3: gwas_search_associations(rs_id="rs12345678")
-> Any GWAS associations in relevant traits?
Step 4: GTEx_query_eqtl(gene_symbol=nearest_gene)
-> Does this variant or nearby variants modulate expression?
Step 5: ENCODE_search_histone_experiments(histone_mark="H3K27ac", biosample_term_name=relevant_tissue)
-> Active chromatin confirmation
Step 6: Classify impact based on convergence of evidence lines
Limitations
- GWAS Catalog covers published GWAS only; unpublished studies are not included.
- GTEx eQTL data is from v8; v10 endpoints may return empty.
- RegulomeDB annotations depend on available ENCODE/Roadmap data for the specific cell type.
- eQTL analysis identifies correlation, not causation; fine-mapping is needed to identify causal variants.
- RegulomeDB scores are heuristic; a score of 1a does not guarantee functional impact.
- GWAS associations are population-level; individual variant effects depend on genetic background.