skills/mims-harvard/tooluniverse/tooluniverse-epigenomics-chromatin

tooluniverse-epigenomics-chromatin

Installation
SKILL.md

Epigenomics and Chromatin Accessibility Research

NOT for (use other skills instead)

  • Methylation array data processing (CpG beta values, differential methylation) -> Use tooluniverse-epigenomics
  • RNA-seq differential expression -> Use tooluniverse-rnaseq-deseq2
  • GWAS variant interpretation -> Use tooluniverse-gwas-snp-interpretation
  • Variant functional annotation from VCF -> Use tooluniverse-variant-analysis

Reasoning: Classify the Question First

Before calling any tool, identify which question type you're answering. Each maps to a different tool set.

(a) Which regulatory elements exist at a locus? Use UCSC_get_encode_cCREs (region-based) or SCREEN_get_regulatory_elements (gene-based). Then check ENCODE_get_chromatin_state for ChromHMM annotation and ENCODE_search_chromatin_accessibility for ATAC-seq evidence.

(b) Which TFs bind there? Use ReMap_get_transcription_factor_binding for ChIP-seq experiments. Use jaspar_search_matrices to retrieve binding motifs and check whether the sequence disrupts a known motif.

(c) How does a variant affect regulation? Use RegulomeDB_query_variant for a scored summary. Then build multi-layer evidence: UCSC_get_encode_cCREs (is the variant in a cCRE?), GTEx_get_single_tissue_eqtls (is it an eQTL?), jaspar_search_matrices (does it disrupt a TF motif?). No single layer is sufficient — see the variant reasoning section below.

(d) What genes are regulated by an element? Use GTEx_get_single_tissue_eqtls or GTEx_query_eqtl to find genes whose expression is associated with variants in the element. Use SCREEN_get_regulatory_elements with element_type="PLS"/"pELS"/"dELS" to classify element-to-promoter relationships.


Reasoning: Histone Marks

Use histone mark identity to guide tool queries and interpret results before fetching data.

  • H3K4me3 = active promoter. If present without H3K27ac, promoter may be active but not hyperacetylated.
  • H3K27ac = active enhancer or promoter. Strong signal = regulatory element is on.
  • H3K4me1 = poised or active enhancer. Needs H3K27ac to confirm activity; H3K4me1 alone = poised.
  • H3K27me3 = Polycomb repression. Gene is silenced by PRC2.
  • H3K9me3 = constitutive heterochromatin. Region is structurally silenced.
  • H3K36me3 = transcribed gene body. Confirms active elongation.

Bivalent promoter logic: If you observe H3K4me3 + H3K27me3 together at the same locus, the promoter is bivalent — poised but not active. This is common in stem cells and developmentally regulated genes. Do not report such genes as "actively transcribed." Use GTEx_get_expression_summary to check if the gene is actually expressed in the tissue of interest.

Inference rule: If a user asks about a mark you haven't queried yet, ask: does the mark you have found already answer the question? H3K4me3 in a region predicts active transcription; you may not need to also query H3K36me3 unless confirming elongation specifically.


Reasoning: eQTL Interpretation

An eQTL means variant X is statistically associated with expression of gene Y in tissue T. Before reporting eQTL results, apply this chain of reasoning:

  1. Association ≠ causation. The variant may be in LD with the causal variant. Report effect size (NES) and p-value, not causality.
  2. Check tissue specificity. Use GTEx_get_multi_tissue_eqtls to see whether the effect is shared across tissues (m-value near 1.0 in many tissues) or tissue-specific (m-value near 1.0 in only one tissue). Tissue-specific eQTLs are stronger candidates for cell-type-specific regulation.
  3. Cross-reference with chromatin. Is the eQTL variant inside a cCRE? Use UCSC_get_encode_cCREs on the variant's coordinates. If yes, the variant likely acts through a regulatory element.
  4. Check TF motif disruption. Use jaspar_search_matrices to find motifs overlapping the eQTL locus. If the variant alleles differ in motif score, it is a candidate causal variant.
  5. Effect direction matters. Positive NES = reference allele increases expression. Negative NES = alternative allele decreases expression.

Reasoning: Variant Regulatory Impact

To assess a non-coding variant's regulatory impact, build evidence from multiple independent layers. No single layer is sufficient.

Layer 1 — RegulomeDB score: High probability (score 1a–2b) means convergent evidence from eQTL + TF binding + DNase. Score 4–7 means weak support. Use as a triage filter.

Layer 2 — Regulatory element overlap: Query UCSC_get_encode_cCREs at the variant's coordinates. If the variant falls in a cCRE (especially PLS or pELS), it is in a functional context.

Layer 3 — eQTL evidence: Query GTEx_get_single_tissue_eqtls for nearby genes. If the variant is a significant eQTL, the association supports regulatory function.

Layer 4 — TFBS disruption: Query jaspar_search_matrices for TFs with motifs at the locus. If the variant changes a high-information-content position in a motif, it is a strong functional candidate.

Synthesis rule: Report each layer separately. Convergence across 3+ layers = high-confidence regulatory variant. A single layer (e.g., eQTL alone) warrants caution.


Phase 0: Disambiguation

MyGene_query_genes: query (string). Converts gene symbols to Ensembl IDs and coordinates. Filter results by symbol == '<GENE>' — first hit may not match.

ensembl_lookup_gene: gene_id (Ensembl ID), species (REQUIRED, "homo_sapiens"). Returns chr/start/end.

Key format notes:

  • GTEx requires versioned GENCODE IDs: ENSG00000012048.20
  • RegulomeDB takes rsIDs: rs4994
  • GTEx variant IDs: chr17_43705621_T_C_b38
  • UCSC cCRE regions: chrom="chr17", start=7668421, end=7687490

Phase 1: Histone Modification & ChIP-seq

ENCODE_search_histone_experiments: target (histone mark), cell_type (or tissue alias), biosample_term_name (most explicit ENCODE ontology name), limit.

ENCODE anatomy term notes: "breast" → try "breast epithelium" or "mammary epithelial cell"; "brain" → "brain" works; if 0 results, append "tissue", "epithelium", or "cell".

result = tu.tools.ENCODE_search_histone_experiments(target="H3K27ac", cell_type="GM12878", limit=5)
# result["data"]["experiments"][0]["accession"] -> "ENCSR000AKC"

GEO_search_chipseq_datasets: Fallback for older or non-ENCODE ChIP-seq datasets.


Phase 2: Chromatin Accessibility & Architecture

ENCODE_search_chromatin_accessibility: cell_type, limit. Returns ATAC-seq experiments.

ENCODE_get_chromatin_state: cell_type, limit. Returns ChromHMM 15-state annotations (TssA, Enh, TssBiv, ReprPC, etc.). Use to confirm bivalent promoter state or enhancer classification.

ENCODE_search_rnaseq_experiments: assay_type (default "total RNA-seq"), biosample, limit. If 0 results, retry with assay_type="polyA plus RNA-seq".

GEO_search_rnaseq_datasets / GEO_search_atacseq_datasets: query, organism, limit (also max_results). GEO adds "ATAC-seq" automatically for the ATAC tool.

ReMap_get_transcription_factor_binding (CTCF): gene_name="CTCF", cell_type, limit. Returns ENCODE TF ChIP-seq experiments.


Phase 3: Regulatory Element Identification

SCREEN_get_regulatory_elements: gene_name, element_type (PLS/pELS/dELS/CTCF-only/DNase-H3K4me3), limit.

UCSC_get_encode_cCREs: chrom (REQUIRED), start (REQUIRED), end (REQUIRED), genome (default "hg38"). Returns cCREs with Z-scores for DNase, H3K4me3, H3K27ac, CTCF signals.

# cCREs near TP53
result = tu.tools.UCSC_get_encode_cCREs(chrom="chr17", start=7668421, end=7687490, genome="hg38")

ENCODE_search_annotations: annotation_type ("candidate Cis-Regulatory Elements" or "chromatin state"), biosample_term_name, organism, assembly, limit.


Phase 4: eQTL Analysis

GTEx_get_single_tissue_eqtls: gene_symbol. Returns all significant eQTLs across tissues with snpId, pValue, tissueSiteDetailId, nes (normalized effect size).

result = tu.tools.GTEx_get_single_tissue_eqtls(gene_symbol="BRCA1")
from collections import Counter
tissue_counts = Counter(e["tissueSiteDetailId"] for e in result["data"])

GTEx_query_eqtl: gene_symbol, tissue (tissueSiteDetailId), page (1-indexed), size. Use for a specific tissue.

GTEx_get_multi_tissue_eqtls: operation="get_multi_tissue_eqtls", gencode_id (versioned, REQUIRED). Returns per-variant m-values showing tissue-sharing. m-value near 1.0 = effect present; near 0.0 = absent.

result = tu.tools.GTEx_get_multi_tissue_eqtls(
    operation="get_multi_tissue_eqtls",
    gencode_id="ENSG00000012048.20"
)

GTEx_calculate_eqtl: operation="calculate_eqtl", gencode_id, variant_id (chr_pos_ref_alt_b38), tissue_site_detail_id. Works for non-significant pairs.

eQTL_list_datasets / eQTL_get_associations: EBI eQTL Catalogue. Use dataset_id (from list call), gene_id (Ensembl), variant. Complementary to GTEx.


Phase 5: Gene Expression Context

GTEx_get_expression_summary: gene_symbol. Recommended — auto-resolves GENCODE versions. Returns median TPM per tissue.

result = tu.tools.GTEx_get_expression_summary(gene_symbol="BRCA1")
top_tissues = sorted(result["data"], key=lambda x: x["median"], reverse=True)[:5]

GTEx_get_median_gene_expression: Requires operation="get_median_gene_expression" + exact versioned gencode_id. Use only when version precision is needed.

GTEx_get_tissue_sites: No params. Returns all tissueSiteDetailId values.


Phase 6: Transcription Factor Binding

jaspar_search_matrices: name (TF name), collection ("CORE"), tax_group ("vertebrates"), species ("9606"), page_size.

result = tu.tools.jaspar_search_matrices(name="CTCF", collection="CORE", page_size=5)

jaspar_get_matrix: Returns position frequency matrix for a JASPAR matrix ID. Use to check if a variant allele disrupts a high-information-content position.

ReMap_get_transcription_factor_binding: gene_name (TF), cell_type, limit. Same tool used for CTCF in Phase 2 — applies to any TF.

STRING_get_functional_annotations: identifiers (gene name), species (9606), category ("Process"/"Function"/"KEGG"). Returns GO/KEGG/Reactome annotations for regulatory context.


Phase 7: Variant Regulatory Scoring

RegulomeDB_query_variant: rsid (e.g., "rs4994"). Returns probability, ranking (1a = strongest, 7 = weakest), and tissue-specific scores.

result = tu.tools.RegulomeDB_query_variant(rsid="rs4994")
score = result["data"]["regulome_score"]
# score["ranking"]: "1a" (eQTL + TF + motif + DNase) ... "7" (no evidence)
# score["probability"]: 0.0–1.0
top_tissues = sorted(score["tissue_specific_scores"].items(), key=lambda x: float(x[1]), reverse=True)[:5]

Rankings 1a–1f all have eQTL evidence. Rankings 2a–3b have TF binding without eQTL. Rankings 4–7 have decreasing evidence. Use ranking <= 2b as a threshold for "strong regulatory support."


Phase 8: Integration

Combine evidence tiers before reporting:

  • T1 (Direct experimental): ENCODE ChIP-seq experiments, GTEx eQTL p < 5e-8
  • T2 (Strong computational): RegulomeDB score <= 2, SCREEN cCRE classification, ChromHMM state
  • T3 (Moderate): eQTL p < 0.05, JASPAR motif match, multi-tissue m-value > 0.5
  • T4 (Annotation-based): STRING GO terms, literature references

Convergence of T1+T2 evidence from independent sources (e.g., ENCODE ChIP-seq overlapping a RegulomeDB 1a variant with GTEx eQTL) constitutes strong evidence for regulatory function. Contradictions between layers (e.g., high RegulomeDB score but no eQTL) should be explicitly noted.


Fallback Strategies

Phase Primary Tool Fallback
Histone ChIP-seq ENCODE_search_histone_experiments GEO_search_chipseq_datasets
RNA-seq ENCODE_search_rnaseq_experiments (total RNA-seq) retry with polyA plus RNA-seq
ATAC-seq ENCODE_search_chromatin_accessibility GEO_search_atacseq_datasets
cCREs UCSC_get_encode_cCREs SCREEN_get_regulatory_elements
eQTLs GTEx_get_single_tissue_eqtls eQTL_get_associations (EBI)
Expression GTEx_get_expression_summary GTEx_get_median_gene_expression
TF motifs jaspar_search_matrices ReMap_get_transcription_factor_binding
Variant scoring RegulomeDB_query_variant combine eQTL + TF binding manually
Weekly Installs
50
GitHub Stars
1.3K
First Seen
Today