tooluniverse-gwas-finemapping
GWAS Fine-Mapping & Causal Variant Prioritization
Identify and prioritize causal variants at GWAS loci using statistical fine-mapping and locus-to-gene predictions.
Overview
Genome-wide association studies (GWAS) identify genomic regions associated with traits, but linkage disequilibrium (LD) makes it difficult to pinpoint the causal variant. Fine-mapping uses Bayesian statistical methods to compute the posterior probability that each variant is causal, given the GWAS summary statistics.
This skill provides tools to:
- Prioritize causal variants using fine-mapping posterior probabilities
- Link variants to genes using locus-to-gene (L2G) predictions
- Annotate variants with functional consequences
- Suggest validation strategies based on fine-mapping results
Key Concepts
Credible Sets
A credible set is the smallest set of variants that contains the causal variant with a stated probability (typically 95% or 99%), under the assumptions of the fine-mapping model. Each variant in the set has a posterior probability of being causal, computed using methods like:
- SuSiE (Sum of Single Effects)
- FINEMAP (Bayesian fine-mapping)
- PAINTOR (Probabilistic Annotation INtegraTOR)
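As a rough illustration of the definition above, a credible set can be assembled greedily from per-variant posterior probabilities: sort them in descending order and keep adding variants until the cumulative probability reaches the target coverage. This is a minimal sketch that assumes a single causal signal whose probabilities sum to about 1; methods such as SuSiE build one credible set per signal and apply additional checks. The function name and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def build_credible_set(
    pips: Dict[str, float], coverage: float = 0.95
) -> List[Tuple[str, float]]:
    """Return the smallest set of variants whose probabilities sum to >= coverage."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Highest posterior probability first
    ranked = sorted(pips.items(), key=lambda item: item[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    cumulative = 0.0
    for variant_id, pip in ranked:
        selected.append((variant_id, pip))
        cumulative += pip
        if cumulative >= coverage:
            break
    return selected


# Hypothetical posterior probabilities for five variants at one locus
example_pips = {"rs_a": 0.62, "rs_b": 0.21, "rs_c": 0.09, "rs_d": 0.05, "rs_e": 0.03}
credible_set = build_credible_set(example_pips, coverage=0.95)
print([variant_id for variant_id, _ in credible_set])
```

With these values the set stops after four variants, because the top three only reach a cumulative probability of 0.92.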
Posterior Probability
The probability that a specific variant is causal, given the GWAS data and LD structure. Higher posterior probability = more likely to be causal.
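One simple way to obtain such probabilities from summary statistics is Wakefield's approximate Bayes factor (ABF), normalized across the variants in a region. The sketch below assumes exactly one causal variant in the region, equal prior probability for every variant, and a prior effect-size variance of 0.04 (an assumed value, not one taken from this skill). It is not what SuSiE or FINEMAP do, since those model multiple causal variants and use the LD matrix explicitly; the function names are hypothetical.

```python
import math
from typing import List


def log_abf(beta: float, se: float, prior_variance: float = 0.04) -> float:
    """Log approximate Bayes factor in favour of association (Wakefield-style)."""
    v = se ** 2                      # sampling variance of the effect estimate
    z = beta / se
    shrinkage = prior_variance / (v + prior_variance)
    # ABF = sqrt(V / (V + W)) * exp(z^2 / 2 * W / (V + W))
    return 0.5 * math.log(1.0 - shrinkage) + 0.5 * z ** 2 * shrinkage


def posterior_probabilities(betas: List[float], ses: List[float]) -> List[float]:
    """Normalize ABFs into posterior probabilities under a single-causal-variant model."""
    log_abfs = [log_abf(b, s) for b, s in zip(betas, ses)]
    max_log = max(log_abfs)          # subtract the maximum for numerical stability
    weights = [math.exp(x - max_log) for x in log_abfs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical effect sizes and standard errors for three variants in one region
pips = posterior_probabilities(betas=[0.12, 0.11, 0.05], ses=[0.02, 0.02, 0.02])
for pip in pips:
    print(f"{pip:.3f}")
```

Working in log space avoids overflow when z-scores are large, which is common at strongly associated loci.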
Locus-to-Gene (L2G) Predictions
L2G scores integrate multiple data types to predict which gene is affected by a variant:
- Distance to gene (closer = higher score)
- eQTL evidence (expression changes)
- Chromatin interactions (Hi-C, promoter capture)
- Functional annotations (coding variants, regulatory regions)
L2G scores range from 0 to 1, with higher scores indicating stronger gene-variant links.
Use Cases
1. Prioritize Variants at a Known Locus
Question: "Which variant at the TCF7L2 locus is likely causal for type 2 diabetes?"
from python_implementation import prioritize_causal_variants
# Prioritize variants in TCF7L2 for diabetes
result = prioritize_causal_variants("TCF7L2", "type 2 diabetes")
print(result.get_summary())
# Output shows:
# - Credible sets containing TCF7L2 variants
# - Posterior probabilities (via fine-mapping methods)
# - Top L2G genes (which genes are likely affected)
# - Associated traits
2. Fine-Map a Specific Variant
Question: "What do we know about rs429358 (the APOE ε4-defining variant) from fine-mapping?"
from python_implementation import prioritize_causal_variants
# Fine-map a specific variant
result = prioritize_causal_variants("rs429358")
# Check which credible sets contain this variant
for cs in result.credible_sets:
    print(f"Trait: {cs.trait}")
    print(f"Fine-mapping method: {cs.finemapping_method}")
    print(f"Top gene: {cs.l2g_genes[0] if cs.l2g_genes else 'N/A'}")
    print(f"Confidence: {cs.confidence}")
3. Explore All Loci from a GWAS Study
Question: "What are all the causal loci from the recent T2D meta-analysis?"
from python_implementation import get_credible_sets_for_study
# Get all fine-mapped loci from a study
credible_sets = get_credible_sets_for_study("GCST90029024") # T2D GWAS
print(f"Found {len(credible_sets)} independent loci")
# Examine each locus
for cs in credible_sets:
    print(f"\nRegion: {cs.region}")
    # Guard against a missing lead variant or an empty rsID list
    lead = cs.lead_variant
    lead_id = lead.rs_ids[0] if lead and lead.rs_ids else 'N/A'
    print(f"Lead variant: {lead_id}")
    if cs.l2g_genes:
        top_gene = cs.l2g_genes[0]
        print(f"Most likely causal gene: {top_gene.gene_symbol} (L2G: {top_gene.l2g_score:.3f})")
4. Find GWAS Studies for a Disease
Question: "What GWAS studies exist for Alzheimer's disease?"
from python_implementation import search_gwas_studies_for_disease
# Search by disease name
studies = search_gwas_studies_for_disease("Alzheimer's disease")
for study in studies[:5]:
    print(f"{study['id']}: {study.get('nSamples', 'N/A')} samples")
    print(f" Author: {study.get('publicationFirstAuthor', 'N/A')}")
    print(f" Has summary stats: {study.get('hasSumstats', False)}")
# Or use precise disease ontology IDs
studies = search_gwas_studies_for_disease(
    "Alzheimer's disease",
    disease_id="EFO_0000249"  # EFO ID for Alzheimer's
)
5. Get Validation Suggestions
Question: "How should we validate the top causal variant?"
from python_implementation import prioritize_causal_variants
result = prioritize_causal_variants("APOE", "alzheimer")
# Get experimental validation suggestions
suggestions = result.get_validation_suggestions()
for suggestion in suggestions:
    print(suggestion)
# Output includes:
# - CRISPR knock-in experiments
# - Reporter assays
# - eQTL analysis
# - Colocalization studies
Workflow Example: Complete Fine-Mapping Analysis
from python_implementation import (
    prioritize_causal_variants,
    search_gwas_studies_for_disease,
    get_credible_sets_for_study,
)
# Step 1: Find relevant GWAS studies
print("Step 1: Finding T2D GWAS studies...")
studies = search_gwas_studies_for_disease("type 2 diabetes", disease_id="MONDO_0005148")
largest_study = max(studies, key=lambda s: s.get('nSamples', 0) or 0)
print(f"Largest study: {largest_study['id']} ({largest_study.get('nSamples', 'N/A')} samples)")
# Step 2: Get all fine-mapped loci from the study
print("\nStep 2: Getting fine-mapped loci...")
credible_sets = get_credible_sets_for_study(largest_study['id'], max_sets=100)
print(f"Found {len(credible_sets)} credible sets")
# Step 3: Find loci near genes of interest
print("\nStep 3: Finding TCF7L2 loci...")
tcf7l2_loci = [
    cs for cs in credible_sets
    # Treat a missing L2G list as empty so any() does not fail
    if any(gene.gene_symbol == "TCF7L2" for gene in (cs.l2g_genes or []))
]
print(f"TCF7L2 appears in {len(tcf7l2_loci)} loci")
# Step 4: Prioritize variants at TCF7L2
print("\nStep 4: Prioritizing TCF7L2 variants...")
result = prioritize_causal_variants("TCF7L2", "type 2 diabetes")
# Step 5: Print summary and validation plan
print("\n" + "="*60)
print("FINE-MAPPING SUMMARY")
print("="*60)
print(result.get_summary())
print("\n" + "="*60)
print("VALIDATION STRATEGY")
print("="*60)
suggestions = result.get_validation_suggestions()
for suggestion in suggestions:
    print(suggestion)
Data Classes
FineMappingResult
Main result object containing:
- query_variant: Variant annotation
- query_gene: Gene symbol (if queried by gene)
- credible_sets: List of fine-mapped loci
- associated_traits: All associated traits
- top_causal_genes: L2G genes ranked by score
Methods:
- get_summary(): Human-readable summary
- get_validation_suggestions(): Experimental validation strategies
CredibleSet
Represents a fine-mapped locus:
- study_locus_id: Unique identifier
- region: Genomic region (e.g., "10:112861809-113404438")
- lead_variant: Top variant by posterior probability
- finemapping_method: Statistical method used (SuSiE, FINEMAP, etc.)
- l2g_genes: Locus-to-gene predictions
- confidence: Credible set confidence (95%, 99%)
L2GGene
Locus-to-gene prediction:
- gene_symbol: Gene name (e.g., "TCF7L2")
- gene_id: Ensembl gene ID
- l2g_score: Probability score (0-1)
VariantAnnotation
Functional annotation for a variant:
- variant_id: Open Targets format (chr_pos_ref_alt)
- rs_ids: dbSNP identifiers
- chromosome, position: Genomic coordinates
- most_severe_consequence: Functional impact
- allele_frequencies: Population-specific allele frequencies
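The class definitions in python_implementation are not reproduced in this document. As a reading aid, the sketch below mirrors three of the documented classes as plain dataclasses. The field names follow the descriptions above, while every type, every default, and the helper variant_from_id are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VariantAnnotation:
    variant_id: str                                   # chr_pos_ref_alt
    rs_ids: List[str] = field(default_factory=list)
    chromosome: Optional[str] = None
    position: Optional[int] = None
    most_severe_consequence: Optional[str] = None
    allele_frequencies: Dict[str, float] = field(default_factory=dict)


@dataclass
class L2GGene:
    gene_symbol: str
    gene_id: str
    l2g_score: float                                  # between 0 and 1


@dataclass
class CredibleSet:
    study_locus_id: str
    region: Optional[str] = None
    lead_variant: Optional[VariantAnnotation] = None
    finemapping_method: Optional[str] = None
    l2g_genes: List[L2GGene] = field(default_factory=list)
    confidence: Optional[str] = None


def variant_from_id(variant_id: str) -> VariantAnnotation:
    """Hypothetical helper: split a chr_pos_ref_alt identifier into coordinates."""
    chromosome, position, _ref, _alt = variant_id.split("_")
    return VariantAnnotation(
        variant_id=variant_id, chromosome=chromosome, position=int(position)
    )


# Illustrative values only; the identifiers and the score are placeholders
lead = variant_from_id("10_112998590_C_T")
locus = CredibleSet(
    study_locus_id="example-locus",
    region="10:112861809-113404438",
    lead_variant=lead,
    finemapping_method="SuSiE",
    l2g_genes=[L2GGene("TCF7L2", "ENSG_PLACEHOLDER", 0.9)],
    confidence="95%",
)
print(locus.lead_variant.chromosome, locus.lead_variant.position)
```

Using default_factory for the list and dict fields keeps each instance's collections independent, which matters when many credible sets are built in a loop.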
Tools Used
Open Targets Genetics (GraphQL)
OpenTargets_get_variant_info: Variant details and allele frequenciesOpenTargets_get_variant_credible_sets: Credible sets containing a variantOpenTargets_get_credible_set_detail: Detailed credible set informationOpenTargets_get_study_credible_sets: All loci from a GWAS studyOpenTargets_search_gwas_studies_by_disease: Find studies by disease
GWAS Catalog (REST API)
gwas_search_snps: Find SNPs by gene or rsIDgwas_get_snp_by_id: Detailed SNP informationgwas_get_associations_for_snp: All trait associations for a variantgwas_search_studies: Find studies by disease/trait
Understanding Fine-Mapping Output
Interpreting Posterior Probabilities
- > 0.5: Very likely causal (strong candidate)
- 0.1 - 0.5: Plausible causal variant
- 0.01 - 0.1: Possible but uncertain
- < 0.01: Unlikely to be causal
Interpreting L2G Scores
- > 0.7: High confidence gene-variant link
- 0.5 - 0.7: Moderate confidence
- 0.3 - 0.5: Weak but possible link
- < 0.3: Low confidence
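The two sets of bands above (posterior probability and L2G score) can be encoded as a small lookup, which is handy when labelling many variants or genes at once. This is a sketch with hypothetical names. The listed ranges share their endpoints, so a value that falls exactly on a cutoff is assigned to the lower band here, which is a choice made for this example rather than something the bands specify.

```python
from bisect import bisect_left
from typing import List, Tuple

Bands = Tuple[List[float], List[str]]

# Cutoffs in ascending order, with one more label than there are cutoffs
PIP_BANDS: Bands = (
    [0.01, 0.1, 0.5],
    ["unlikely to be causal", "possible but uncertain",
     "plausible causal variant", "very likely causal"],
)
L2G_BANDS: Bands = (
    [0.3, 0.5, 0.7],
    ["low confidence", "weak but possible link",
     "moderate confidence", "high confidence"],
)


def band_label(value: float, bands: Bands) -> str:
    """Map a score in [0, 1] to its interpretation band."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0 and 1")
    cutoffs, labels = bands
    # bisect_left places a value equal to a cutoff in the lower band
    return labels[bisect_left(cutoffs, value)]


print(band_label(0.62, PIP_BANDS))
print(band_label(0.45, L2G_BANDS))
```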
Fine-Mapping Methods Compared
| Method | Approach | Strengths | Use Case |
|---|---|---|---|
| SuSiE | Sum of Single Effects | Handles multiple causal variants | Multi-signal loci |
| FINEMAP | Bayesian shotgun stochastic search | Fast, scalable | Large studies |
| PAINTOR | Functional annotations | Integrates epigenomics | Regulatory variants |
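| CAVIAR | Likelihood model over causal configurations, accounting for LD | Allows multiple causal variants per locus | Loci with a few causal variants (its extension eCAVIAR addresses eQTL colocalization) |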
Common Questions
Q: Why don't all variants have credible sets?
A: Fine-mapping requires:
- GWAS summary statistics (not just top hits)
- LD reference panel
- Sufficient signal strength (p < 5e-8)
- Computational resources
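The signal-strength requirement in the list above can be checked directly from a variant's effect estimate and standard error. The sketch below converts the z-score to a two-sided p-value using the normal distribution and compares it with the conventional 5e-8 threshold. The function names and the example numbers are hypothetical.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8


def two_sided_p_value(beta: float, se: float) -> float:
    """Two-sided p-value for a normally distributed test statistic z = beta / se."""
    z = abs(beta / se)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2.0))


def is_genome_wide_significant(beta: float, se: float) -> bool:
    return two_sided_p_value(beta, se) < GENOME_WIDE_THRESHOLD


# Hypothetical lead variants: z = 6.0 and z = 2.5
print(is_genome_wide_significant(0.12, 0.02))
print(is_genome_wide_significant(0.05, 0.02))
```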
Q: Can a variant be in multiple credible sets?
A: Yes. A variant can appear in credible sets for several traits (consistent with pleiotropy, although membership alone does not establish causality) or in credible sets from different studies of the same trait.
Q: What if the top L2G gene is far from the variant?
A: This suggests a long-range regulatory effect, for example a distal enhancer acting on the gene. Check:
- eQTL evidence in relevant tissues
- Chromatin interaction data (Hi-C)
- Regulatory element annotations (Roadmap, ENCODE)
Q: How do I choose between variants in a credible set?
A: Prioritize by:
- Posterior probability (higher = better)
- Functional consequence (coding > regulatory > intergenic)
- eQTL evidence
- Evolutionary conservation
- Experimental feasibility
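The first three criteria in the list above can be sketched as a lexicographic sort: posterior probability first, then functional consequence, then eQTL support as tie-breakers. The class, the field names, and the three-level consequence tiers are hypothetical simplifications. Conservation and experimental feasibility are left out because they depend on data this document does not describe.

```python
from dataclasses import dataclass
from typing import List

# Lower tier number means higher priority (coding > regulatory > intergenic)
CONSEQUENCE_TIER = {"coding": 0, "regulatory": 1, "intergenic": 2}


@dataclass
class Candidate:
    variant_id: str
    posterior_probability: float
    consequence_class: str
    has_eqtl: bool = False


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by posterior probability, then consequence tier, then eQTL support."""
    return sorted(
        candidates,
        key=lambda c: (
            -c.posterior_probability,
            # Unknown consequence classes sort after all known tiers
            CONSEQUENCE_TIER.get(c.consequence_class, len(CONSEQUENCE_TIER)),
            not c.has_eqtl,
        ),
    )


# Hypothetical credible-set members
members = [
    Candidate("var_1", 0.40, "intergenic"),
    Candidate("var_2", 0.40, "coding"),
    Candidate("var_3", 0.15, "regulatory", has_eqtl=True),
]
for candidate in rank_candidates(members):
    print(candidate.variant_id, candidate.posterior_probability)
```

Because the sort is strictly lexicographic, the tie-breakers only matter when posterior probabilities are equal. A weighted score would be an alternative if functional evidence should be able to outrank a small difference in probability.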
Limitations
- LD-dependent: Fine-mapping accuracy depends on LD structure matching the study population
- Requires summary stats: Not all studies provide full summary statistics
- Computationally intensive: Fine-mapping large studies takes significant resources
- Prior assumptions: Bayesian methods depend on priors (number of causal variants, effect sizes)
- Missing data: Not all GWAS loci have been fine-mapped in Open Targets
Best Practices
- Start with study-level queries when exploring a new disease
- Check multiple studies for replication of signals
- Combine with functional data (eQTLs, chromatin, CRISPR screens)
- Consider ancestry - LD differs across populations
- Validate experimentally - fine-mapping provides candidates, not proof
References
- Wang et al. (2020) "A simple new approach to variable selection in regression, with application to genetic fine mapping." JRSS-B (SuSiE)
- Benner et al. (2016) "FINEMAP: efficient variable selection using summary data from genome-wide association studies." Bioinformatics
- Ghoussaini et al. (2021) "Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics." NAR
- Mountjoy et al. (2021) "An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci." Nat Genet
Related Skills
- tooluniverse-gwas-explorer: Broader GWAS analysis
- tooluniverse-eqtl-colocalization: Link variants to gene expression
- tooluniverse-gene-prioritization: Systematic gene ranking