skills/mims-harvard/tooluniverse/tooluniverse-gwas-finemapping

tooluniverse-gwas-finemapping

SKILL.md

GWAS Fine-Mapping & Causal Variant Prioritization

Identify and prioritize causal variants at GWAS loci using statistical fine-mapping and locus-to-gene predictions.

Overview

Genome-wide association studies (GWAS) identify genomic regions associated with traits, but linkage disequilibrium (LD) makes it difficult to pinpoint the causal variant. Fine-mapping uses Bayesian statistical methods to compute the posterior probability that each variant is causal, given the GWAS summary statistics.

This skill provides tools to:

  • Prioritize causal variants using fine-mapping posterior probabilities
  • Link variants to genes using locus-to-gene (L2G) predictions
  • Annotate variants with functional consequences
  • Suggest validation strategies based on fine-mapping results

Key Concepts

Credible Sets

A credible set is a minimal set of variants that contains the causal variant with high confidence (typically 95% or 99%). Each variant in the set has a posterior probability of being causal, computed using methods like:

  • SuSiE (Sum of Single Effects)
  • FINEMAP (Bayesian fine-mapping)
  • PAINTOR (Probabilistic Annotation INtegraTOR)

Posterior Probability

The probability that a specific variant is causal, given the GWAS data and LD structure. Higher posterior probability = more likely to be causal.

Locus-to-Gene (L2G) Predictions

L2G scores integrate multiple data types to predict which gene is affected by a variant:

  • Distance to gene (closer = higher score)
  • eQTL evidence (expression changes)
  • Chromatin interactions (Hi-C, promoter capture)
  • Functional annotations (coding variants, regulatory regions)

L2G scores range from 0 to 1, with higher scores indicating stronger gene-variant links.

Use Cases

1. Prioritize Variants at a Known Locus

Question: "Which variant at the TCF7L2 locus is likely causal for type 2 diabetes?"

from python_implementation import prioritize_causal_variants

# Prioritize variants in TCF7L2 for diabetes
result = prioritize_causal_variants("TCF7L2", "type 2 diabetes")
print(result.get_summary())

# Output shows:
# - Credible sets containing TCF7L2 variants
# - Posterior probabilities (via fine-mapping methods)
# - Top L2G genes (which genes are likely affected)
# - Associated traits

2. Fine-Map a Specific Variant

Question: "What do we know about rs429358 (APOE4) from fine-mapping?"

# Fine-map a specific variant
result = prioritize_causal_variants("rs429358")

# Check which credible sets contain this variant
for cs in result.credible_sets:
    print(f"Trait: {cs.trait}")
    print(f"Fine-mapping method: {cs.finemapping_method}")
    print(f"Top gene: {cs.l2g_genes[0] if cs.l2g_genes else 'N/A'}")
    print(f"Confidence: {cs.confidence}")

3. Explore All Loci from a GWAS Study

Question: "What are all the causal loci from the recent T2D meta-analysis?"

from python_implementation import get_credible_sets_for_study

# Get all fine-mapped loci from a study
credible_sets = get_credible_sets_for_study("GCST90029024")  # T2D GWAS

print(f"Found {len(credible_sets)} independent loci")

# Examine each locus
for cs in credible_sets:
    print(f"\nRegion: {cs.region}")
    print(f"Lead variant: {cs.lead_variant.rs_ids[0] if cs.lead_variant else 'N/A'}")

    if cs.l2g_genes:
        top_gene = cs.l2g_genes[0]
        print(f"Most likely causal gene: {top_gene.gene_symbol} (L2G: {top_gene.l2g_score:.3f})")

4. Find GWAS Studies for a Disease

Question: "What GWAS studies exist for Alzheimer's disease?"

from python_implementation import search_gwas_studies_for_disease

# Search by disease name
studies = search_gwas_studies_for_disease("Alzheimer's disease")

for study in studies[:5]:
    print(f"{study['id']}: {study.get('nSamples', 'N/A')} samples")
    print(f"   Author: {study.get('publicationFirstAuthor', 'N/A')}")
    print(f"   Has summary stats: {study.get('hasSumstats', False)}")

# Or use precise disease ontology IDs
studies = search_gwas_studies_for_disease(
    "Alzheimer's disease",
    disease_id="EFO_0000249"  # EFO ID for Alzheimer's
)

5. Get Validation Suggestions

Question: "How should we validate the top causal variant?"

result = prioritize_causal_variants("APOE", "alzheimer")

# Get experimental validation suggestions
suggestions = result.get_validation_suggestions()
for suggestion in suggestions:
    print(suggestion)

# Output includes:
# - CRISPR knock-in experiments
# - Reporter assays
# - eQTL analysis
# - Colocalization studies

Workflow Example: Complete Fine-Mapping Analysis

from python_implementation import (
    prioritize_causal_variants,
    search_gwas_studies_for_disease,
    get_credible_sets_for_study
)

# Step 1: Find relevant GWAS studies
print("Step 1: Finding T2D GWAS studies...")
studies = search_gwas_studies_for_disease("type 2 diabetes", "MONDO_0005148")
largest_study = max(studies, key=lambda s: s.get('nSamples', 0) or 0)
print(f"Largest study: {largest_study['id']} ({largest_study.get('nSamples', 'N/A')} samples)")

# Step 2: Get all fine-mapped loci from the study
print("\nStep 2: Getting fine-mapped loci...")
credible_sets = get_credible_sets_for_study(largest_study['id'], max_sets=100)
print(f"Found {len(credible_sets)} credible sets")

# Step 3: Find loci near genes of interest
print("\nStep 3: Finding TCF7L2 loci...")
tcf7l2_loci = [
    cs for cs in credible_sets
    if any(gene.gene_symbol == "TCF7L2" for gene in cs.l2g_genes)
]

print(f"TCF7L2 appears in {len(tcf7l2_loci)} loci")

# Step 4: Prioritize variants at TCF7L2
print("\nStep 4: Prioritizing TCF7L2 variants...")
result = prioritize_causal_variants("TCF7L2", "type 2 diabetes")

# Step 5: Print summary and validation plan
print("\n" + "="*60)
print("FINE-MAPPING SUMMARY")
print("="*60)
print(result.get_summary())

print("\n" + "="*60)
print("VALIDATION STRATEGY")
print("="*60)
suggestions = result.get_validation_suggestions()
for suggestion in suggestions:
    print(suggestion)

Data Classes

FineMappingResult

Main result object containing:

  • query_variant: Variant annotation
  • query_gene: Gene symbol (if queried by gene)
  • credible_sets: List of fine-mapped loci
  • associated_traits: All associated traits
  • top_causal_genes: L2G genes ranked by score

Methods:

  • get_summary(): Human-readable summary
  • get_validation_suggestions(): Experimental validation strategies

CredibleSet

Represents a fine-mapped locus:

  • study_locus_id: Unique identifier
  • region: Genomic region (e.g., "10:112861809-113404438")
  • lead_variant: Top variant by posterior probability
  • finemapping_method: Statistical method used (SuSiE, FINEMAP, etc.)
  • l2g_genes: Locus-to-gene predictions
  • confidence: Credible set confidence (95%, 99%)

L2GGene

Locus-to-gene prediction:

  • gene_symbol: Gene name (e.g., "TCF7L2")
  • gene_id: Ensembl gene ID
  • l2g_score: Probability score (0-1)

VariantAnnotation

Functional annotation for a variant:

  • variant_id: Open Targets format (chr_pos_ref_alt)
  • rs_ids: dbSNP identifiers
  • chromosome, position: Genomic coordinates
  • most_severe_consequence: Functional impact
  • allele_frequencies: Population-specific MAFs

Tools Used

Open Targets Genetics (GraphQL)

  • OpenTargets_get_variant_info: Variant details and allele frequencies
  • OpenTargets_get_variant_credible_sets: Credible sets containing a variant
  • OpenTargets_get_credible_set_detail: Detailed credible set information
  • OpenTargets_get_study_credible_sets: All loci from a GWAS study
  • OpenTargets_search_gwas_studies_by_disease: Find studies by disease

GWAS Catalog (REST API)

  • gwas_search_snps: Find SNPs by gene or rsID
  • gwas_get_snp_by_id: Detailed SNP information
  • gwas_get_associations_for_snp: All trait associations for a variant
  • gwas_search_studies: Find studies by disease/trait

Understanding Fine-Mapping Output

Interpreting Posterior Probabilities

  • > 0.5: Very likely causal (strong candidate)
  • 0.1 - 0.5: Plausible causal variant
  • 0.01 - 0.1: Possible but uncertain
  • < 0.01: Unlikely to be causal

Interpreting L2G Scores

  • > 0.7: High confidence gene-variant link
  • 0.5 - 0.7: Moderate confidence
  • 0.3 - 0.5: Weak but possible link
  • < 0.3: Low confidence

Fine-Mapping Methods Compared

Method Approach Strengths Use Case
SuSiE Sum of Single Effects Handles multiple causal variants Multi-signal loci
FINEMAP Bayesian shotgun stochastic search Fast, scalable Large studies
PAINTOR Functional annotations Integrates epigenomics Regulatory variants
CAVIAR Colocalization Finds shared causal variants eQTL overlap

Common Questions

Q: Why don't all variants have credible sets? A: Fine-mapping requires:

  1. GWAS summary statistics (not just top hits)
  2. LD reference panel
  3. Sufficient signal strength (p < 5e-8)
  4. Computational resources

Q: Can a variant be in multiple credible sets? A: Yes! A variant can be causal for multiple traits (pleiotropy) or appear in different studies for the same trait.

Q: What if the top L2G gene is far from the variant? A: This suggests regulatory effects (enhancers, promoters). Check:

  • eQTL evidence in relevant tissues
  • Chromatin interaction data (Hi-C)
  • Regulatory element annotations (Roadmap, ENCODE)

Q: How do I choose between variants in a credible set? A: Prioritize by:

  1. Posterior probability (higher = better)
  2. Functional consequence (coding > regulatory > intergenic)
  3. eQTL evidence
  4. Evolutionary conservation
  5. Experimental feasibility

Limitations

  1. LD-dependent: Fine-mapping accuracy depends on LD structure matching the study population
  2. Requires summary stats: Not all studies provide full summary statistics
  3. Computational intensive: Fine-mapping large studies takes significant resources
  4. Prior assumptions: Bayesian methods depend on priors (number of causal variants, effect sizes)
  5. Missing data: Not all GWAS loci have been fine-mapped in Open Targets

Best Practices

  1. Start with study-level queries when exploring a new disease
  2. Check multiple studies for replication of signals
  3. Combine with functional data (eQTLs, chromatin, CRISPR screens)
  4. Consider ancestry - LD differs across populations
  5. Validate experimentally - fine-mapping provides candidates, not proof

References

  1. Wang et al. (2020) "A simple new approach to variable selection in regression, with application to genetic fine mapping." JRSS-B (SuSiE)
  2. Benner et al. (2016) "FINEMAP: efficient variable selection using summary data from genome-wide association studies." Bioinformatics
  3. Ghoussaini et al. (2021) "Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics." NAR
  4. Mountjoy et al. (2021) "An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci." Nat Genet

Related Skills

  • tooluniverse-gwas-explorer: Broader GWAS analysis
  • tooluniverse-eqtl-colocalization: Link variants to gene expression
  • tooluniverse-gene-prioritization: Systematic gene ranking
Weekly Installs
105
GitHub Stars
1.1K
First Seen
Feb 20, 2026
Installed on
codex103
gemini-cli102
github-copilot102
opencode101
amp100
kimi-cli100