tooluniverse-epigenomics
Genomics and Epigenomics Data Processing
Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.
When to Use This Skill
Triggers:
- User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
- Questions about differential methylation analysis
- Age-related CpG detection or epigenetic clock questions
- Chromosome-level methylation density or statistics
- ChIP-seq peak files (BED format) with analysis questions
- ATAC-seq chromatin accessibility questions
- Multi-omics integration (expression + methylation, expression + ChIP-seq)
- Genome-wide epigenomic statistics
- Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
- Questions about missing data across clinical/genomic/epigenomic modalities
- Regulatory element annotation for processed epigenomic data
Example Questions This Skill Solves:
- "How many patients have no missing data for vital status, gene expression, and methylation data?"
- "What is the ratio of filtered age-related CpG density between chromosomes?"
- "What is the genome-wide average chromosomal density of unique age-related CpGs per base pair?"
- "How many CpG sites show significant differential methylation (padj < 0.05)?"
- "What is the Pearson correlation between methylation and expression for gene X?"
- "How many ChIP-seq peaks overlap with promoter regions?"
- "What fraction of ATAC-seq peaks are in enhancer regions?"
- "Which chromosome has the highest density of hypermethylated CpGs?"
- "Filter CpG sites by variance > threshold and map to nearest genes"
- "What is the average beta value difference between tumor and normal for chromosome 17?"
NOT for (use other skills instead):
- Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
- RNA-seq differential expression -> Use
tooluniverse-rnaseq-deseq2 - Variant calling/annotation from VCF -> Use
tooluniverse-variant-analysis - Gene enrichment analysis -> Use
tooluniverse-gene-enrichment - Protein structure analysis -> Use
tooluniverse-protein-structure-retrieval
Required Python Packages
# Core (MUST be available)
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt
# Optional but useful
import pysam # BAM/CRAM file access
import gseapy # Enrichment of genes from methylation analysis
# ToolUniverse (for annotation)
from tooluniverse import ToolUniverse
KEY PRINCIPLES
- Data-first approach - Load and inspect data files BEFORE any analysis
- Question-driven - Parse what the user is actually asking and extract the specific numeric answer
- File format detection - Automatically detect methylation arrays, BED files, BigWig, clinical data
- Coordinate system awareness - Track genome build (hg19, hg38, mm10), handle chr prefix differences
- Statistical rigor - Proper multiple testing correction, effect size filtering, sample size awareness
- Missing data handling - Explicitly report and handle NaN/missing values
- Chromosome normalization - Always normalize chromosome names (chr1 vs 1, chrX vs X)
- CpG site identification - Parse Illumina probe IDs (cg/ch probes), genomic coordinates
- Report-first - Create output file first, populate progressively
- English-first queries - Use English in all tool calls
Complete Workflow
Phase 0: Question Parsing and Data Discovery
CRITICAL FIRST STEP: Before writing ANY code, parse the question to identify what is being asked and what data files are available.
0.1 Discover Available Data Files
import os
import glob
data_dir = "." # or specified path
all_files = glob.glob(os.path.join(data_dir, "**/*"), recursive=True)
# Categorize files
methylation_files = [f for f in all_files if any(x in f.lower() for x in
['methyl', 'beta', 'cpg', 'illumina', '450k', '850k', 'epic', 'mval'])]
chipseq_files = [f for f in all_files if any(x in f.lower() for x in
['chip', 'peak', 'narrowpeak', 'broadpeak', 'histone'])]
atacseq_files = [f for f in all_files if any(x in f.lower() for x in
['atac', 'accessibility', 'openChromatin', 'dnase'])]
bed_files = [f for f in all_files if f.endswith(('.bed', '.bed.gz', '.narrowPeak', '.broadPeak'))]
bigwig_files = [f for f in all_files if f.endswith(('.bw', '.bigwig', '.bigWig'))]
clinical_files = [f for f in all_files if any(x in f.lower() for x in
['clinical', 'patient', 'sample', 'metadata', 'phenotype', 'survival'])]
expression_files = [f for f in all_files if any(x in f.lower() for x in
['express', 'rnaseq', 'fpkm', 'tpm', 'counts', 'transcriptom'])]
manifest_files = [f for f in all_files if any(x in f.lower() for x in
['manifest', 'annotation', 'probe', 'platform'])]
# Print summary
for category, files in [
('Methylation', methylation_files),
('ChIP-seq', chipseq_files),
('ATAC-seq', atacseq_files),
('BED', bed_files),
('BigWig', bigwig_files),
('Clinical', clinical_files),
('Expression', expression_files),
('Manifest', manifest_files),
]:
if files:
print(f"{category}: {files}")
0.2 Parse Question Parameters
Extract these from the question:
| Parameter | Default | Example Question Text |
|---|---|---|
| Significance threshold | 0.05 | "padj < 0.05", "FDR < 0.01" |
| Beta difference threshold | 0 | " |
| Variance filter | None | "variance > 0.01", "top 5000 most variable" |
| Chromosome filter | All | "chromosome 17", "autosomes only" |
| Genome build | hg38 | "hg19", "GRCh37", "mm10" |
| CpG type filter | All | "cg probes only", "exclude ch probes" |
| Region filter | None | "promoter", "gene body", "intergenic" |
| Missing data handling | Report | "complete cases", "no missing data" |
| Specific comparison | Infer | "tumor vs normal", "old vs young" |
| Specific statistic | Infer | "density", "ratio", "count", "average" |
0.3 Decision Tree
Q: What type of epigenomics data?
METHYLATION -> Phase 1 (Methylation Processing)
CHIP-SEQ -> Phase 2 (ChIP-seq Processing)
ATAC-SEQ -> Phase 3 (ATAC-seq Processing)
MULTI-OMICS -> Phase 4 (Integration)
CLINICAL -> Phase 5 (Clinical Integration)
ANNOTATION -> Phase 6 (ToolUniverse Annotation)
Q: Is this a genome-wide statistics question?
YES -> Focus on chromosome-level aggregation (Phase 7)
NO -> Focus on site/region-level analysis
Phase 1: Methylation Data Processing
1.1 Load Methylation Data
import pandas as pd
import numpy as np
def load_methylation_data(file_path, **kwargs):
"""Load methylation beta-value or M-value matrix.
Expected format:
- Rows: CpG probes (cg00000029, cg00000108, ...)
- Columns: Samples (TCGA-XX-XXXX, ...)
- Values: Beta values (0-1) or M-values (log2 ratio)
"""
ext = os.path.splitext(file_path)[1].lower()
if ext in ['.csv']:
df = pd.read_csv(file_path, index_col=0, **kwargs)
elif ext in ['.tsv', '.txt']:
df = pd.read_csv(file_path, sep='\t', index_col=0, **kwargs)
elif ext in ['.parquet']:
df = pd.read_parquet(file_path, **kwargs)
elif ext in ['.h5', '.hdf5']:
df = pd.read_hdf(file_path, **kwargs)
else:
# Try tab first, then comma
try:
df = pd.read_csv(file_path, sep='\t', index_col=0, **kwargs)
except Exception:
df = pd.read_csv(file_path, index_col=0, **kwargs)
return df
def detect_methylation_type(df):
"""Detect if data is beta values (0-1) or M-values (unbounded)."""
sample_vals = df.iloc[:1000, :5].values.flatten()
sample_vals = sample_vals[~np.isnan(sample_vals)]
if sample_vals.min() >= 0 and sample_vals.max() <= 1:
return 'beta'
else:
return 'mvalue'
def beta_to_mvalue(beta):
"""Convert beta values to M-values: M = log2(beta / (1 - beta))."""
beta = np.clip(beta, 1e-6, 1 - 1e-6)
return np.log2(beta / (1 - beta))
def mvalue_to_beta(mvalue):
"""Convert M-values to beta values: beta = 2^M / (2^M + 1)."""
return 2**mvalue / (2**mvalue + 1)
1.2 Load Methylation Manifest / Probe Annotation
def load_probe_annotation(manifest_path):
"""Load Illumina methylation array manifest.
Common columns: IlmnID, Name, CHR, MAPINFO (position), Strand,
UCSC_RefGene_Name, UCSC_RefGene_Group, Relation_to_UCSC_CpG_Island
"""
# Try reading manifest - may have header rows to skip
for skiprows in [0, 7, 8]:
try:
manifest = pd.read_csv(manifest_path, skiprows=skiprows,
low_memory=False)
if 'CHR' in manifest.columns or 'chr' in manifest.columns:
break
if 'Name' in manifest.columns or 'IlmnID' in manifest.columns:
break
except Exception:
continue
# Normalize column names
col_map = {}
for col in manifest.columns:
lower = col.lower()
if lower in ['chr', 'chromosome']:
col_map[col] = 'chr'
elif lower in ['mapinfo', 'position', 'pos', 'start']:
col_map[col] = 'position'
elif lower in ['name', 'ilmnid', 'probe_id', 'cpg_id']:
col_map[col] = 'probe_id'
elif 'refgene_name' in lower or 'gene' in lower:
col_map[col] = 'gene_name'
elif 'refgene_group' in lower:
col_map[col] = 'gene_group'
elif 'cpg_island' in lower or 'relation' in lower:
col_map[col] = 'cpg_island_relation'
manifest = manifest.rename(columns=col_map)
return manifest
def normalize_chromosome(chrom):
"""Normalize chromosome name: '1' -> 'chr1', 'chrX' -> 'chrX', etc."""
if chrom is None or pd.isna(chrom):
return None
chrom = str(chrom).strip()
if not chrom.startswith('chr'):
chrom = 'chr' + chrom
return chrom
def get_chromosome_lengths(genome='hg38'):
"""Return chromosome lengths for common genome builds."""
# hg38 chromosome sizes (main chromosomes)
hg38 = {
'chr1': 248956422, 'chr2': 242193529, 'chr3': 198295559,
'chr4': 190214555, 'chr5': 181538259, 'chr6': 170805979,
'chr7': 159345973, 'chr8': 145138636, 'chr9': 138394717,
'chr10': 133797422, 'chr11': 135086622, 'chr12': 133275309,
'chr13': 114364328, 'chr14': 107043718, 'chr15': 101991189,
'chr16': 90338345, 'chr17': 83257441, 'chr18': 80373285,
'chr19': 58617616, 'chr20': 64444167, 'chr21': 46709983,
'chr22': 50818468, 'chrX': 156040895, 'chrY': 57227415,
}
hg19 = {
'chr1': 249250621, 'chr2': 243199373, 'chr3': 198022430,
'chr4': 191154276, 'chr5': 180915260, 'chr6': 171115067,
'chr7': 159138663, 'chr8': 146364022, 'chr9': 141213431,
'chr10': 135534747, 'chr11': 135006516, 'chr12': 133851895,
'chr13': 115169878, 'chr14': 107349540, 'chr15': 102531392,
'chr16': 90354753, 'chr17': 81195210, 'chr18': 78077248,
'chr19': 59128983, 'chr20': 63025520, 'chr21': 48129895,
'chr22': 51304566, 'chrX': 155270560, 'chrY': 59373566,
}
mm10 = {
'chr1': 195471971, 'chr2': 182113224, 'chr3': 160039680,
'chr4': 156508116, 'chr5': 151834684, 'chr6': 149736546,
'chr7': 145441459, 'chr8': 129401213, 'chr9': 124595110,
'chr10': 130694993, 'chr11': 122082543, 'chr12': 120129022,
'chr13': 120421639, 'chr14': 124902244, 'chr15': 104043685,
'chr16': 98207768, 'chr17': 94987271, 'chr18': 90702639,
'chr19': 61431566, 'chrX': 171031299, 'chrY': 91744698,
}
genomes = {'hg38': hg38, 'hg19': hg19, 'mm10': mm10}
return genomes.get(genome, hg38)
1.3 CpG Site Filtering
def filter_cpg_probes(df, manifest=None, filters=None):
"""Filter CpG probes based on various criteria.
Args:
df: Methylation matrix (probes x samples)
manifest: Probe annotation DataFrame
filters: dict with keys:
- 'variance_threshold': float, minimum variance across samples
- 'mean_beta_range': tuple (min, max), filter probes with extreme mean beta
- 'missing_threshold': float (0-1), max fraction of NaN allowed per probe
- 'chromosomes': list, keep only these chromosomes
- 'exclude_sex_chr': bool, remove chrX and chrY
- 'probe_type': 'cg' or 'ch', keep only one type
- 'cpg_island': str ('Island', 'Shore', 'Shelf', 'OpenSea')
- 'gene_group': str ('TSS200', 'TSS1500', 'Body', '1stExon', etc.)
- 'top_n_variable': int, keep top N most variable probes
"""
if filters is None:
filters = {}
probe_mask = pd.Series(True, index=df.index)
# Probe type filter (cg vs ch)
if 'probe_type' in filters:
ptype = filters['probe_type']
probe_mask &= df.index.str.startswith(ptype)
# Missing data filter
if 'missing_threshold' in filters:
threshold = filters['missing_threshold']
missing_frac = df.isna().mean(axis=1)
probe_mask &= missing_frac <= threshold
# Variance filter
if 'variance_threshold' in filters:
var_threshold = filters['variance_threshold']
probe_var = df.var(axis=1, skipna=True)
probe_mask &= probe_var >= var_threshold
# Mean beta range filter
if 'mean_beta_range' in filters:
min_beta, max_beta = filters['mean_beta_range']
probe_mean = df.mean(axis=1, skipna=True)
probe_mask &= (probe_mean >= min_beta) & (probe_mean <= max_beta)
# Top N most variable
if 'top_n_variable' in filters:
n = filters['top_n_variable']
probe_var = df.var(axis=1, skipna=True)
top_probes = probe_var.nlargest(n).index
probe_mask &= df.index.isin(top_probes)
# Manifest-based filters
if manifest is not None and len(manifest) > 0:
probe_id_col = 'probe_id' if 'probe_id' in manifest.columns else manifest.columns[0]
manifest_indexed = manifest.set_index(probe_id_col) if probe_id_col in manifest.columns else manifest
# Chromosome filter
if 'chromosomes' in filters and 'chr' in manifest_indexed.columns:
valid_chr = [normalize_chromosome(c) for c in filters['chromosomes']]
chr_probes = manifest_indexed[
manifest_indexed['chr'].apply(normalize_chromosome).isin(valid_chr)
].index
probe_mask &= df.index.isin(chr_probes)
# Exclude sex chromosomes
if filters.get('exclude_sex_chr', False) and 'chr' in manifest_indexed.columns:
sex_chr = ['chrX', 'chrY', 'X', 'Y']
nonsex_probes = manifest_indexed[
~manifest_indexed['chr'].apply(normalize_chromosome).isin(['chrX', 'chrY'])
].index
probe_mask &= df.index.isin(nonsex_probes)
# CpG island relation filter
if 'cpg_island' in filters and 'cpg_island_relation' in manifest_indexed.columns:
relation = filters['cpg_island']
island_probes = manifest_indexed[
manifest_indexed['cpg_island_relation'].str.contains(relation, na=False, case=False)
].index
probe_mask &= df.index.isin(island_probes)
# Gene group filter (TSS200, Body, etc.)
if 'gene_group' in filters and 'gene_group' in manifest_indexed.columns:
group = filters['gene_group']
group_probes = manifest_indexed[
manifest_indexed['gene_group'].str.contains(group, na=False, case=False)
].index
probe_mask &= df.index.isin(group_probes)
filtered_df = df[probe_mask]
return filtered_df
1.4 Differential Methylation Analysis
from scipy import stats
import statsmodels.stats.multitest as mt
def differential_methylation(beta_df, group1_samples, group2_samples,
test='ttest', correction='fdr_bh', alpha=0.05):
"""Perform differential methylation analysis between two groups.
Args:
beta_df: Beta-value matrix (probes x samples)
group1_samples: list of sample IDs for group 1
group2_samples: list of sample IDs for group 2
test: 'ttest', 'wilcoxon', or 'ks' (Kolmogorov-Smirnov)
correction: multiple testing correction method
alpha: significance threshold
Returns:
DataFrame with columns: mean_g1, mean_g2, delta_beta, pvalue, padj
"""
g1 = beta_df[group1_samples]
g2 = beta_df[group2_samples]
results = []
for probe in beta_df.index:
vals1 = g1.loc[probe].dropna().values
vals2 = g2.loc[probe].dropna().values
if len(vals1) < 2 or len(vals2) < 2:
results.append({
'probe': probe, 'mean_g1': np.nan, 'mean_g2': np.nan,
'delta_beta': np.nan, 'pvalue': np.nan
})
continue
mean1 = np.nanmean(vals1)
mean2 = np.nanmean(vals2)
delta = mean2 - mean1
if test == 'ttest':
stat, pval = stats.ttest_ind(vals1, vals2, equal_var=False)
elif test == 'wilcoxon':
stat, pval = stats.mannwhitneyu(vals1, vals2, alternative='two-sided')
elif test == 'ks':
stat, pval = stats.ks_2samp(vals1, vals2)
else:
stat, pval = stats.ttest_ind(vals1, vals2, equal_var=False)
results.append({
'probe': probe, 'mean_g1': mean1, 'mean_g2': mean2,
'delta_beta': delta, 'pvalue': pval
})
result_df = pd.DataFrame(results).set_index('probe')
# Multiple testing correction
valid_pvals = result_df['pvalue'].dropna()
if len(valid_pvals) > 0:
reject, padj, _, _ = mt.multipletests(valid_pvals.values, alpha=alpha, method=correction)
result_df.loc[valid_pvals.index, 'padj'] = padj
else:
result_df['padj'] = np.nan
return result_df
def identify_dmps(dm_results, alpha=0.05, delta_beta_threshold=0.0):
"""Identify differentially methylated positions (DMPs).
Args:
dm_results: Output from differential_methylation()
alpha: adjusted p-value threshold
delta_beta_threshold: minimum absolute beta-value difference
Returns:
DataFrame of significant DMPs
"""
dmps = dm_results[
(dm_results['padj'] < alpha) &
(dm_results['delta_beta'].abs() >= delta_beta_threshold)
].copy()
dmps['direction'] = np.where(dmps['delta_beta'] > 0, 'hyper', 'hypo')
return dmps.sort_values('padj')
1.5 Age-Related CpG Analysis
def identify_age_related_cpgs(beta_df, ages, method='correlation',
correction='fdr_bh', alpha=0.05):
"""Identify CpG sites associated with age.
Args:
beta_df: Beta-value matrix (probes x samples)
ages: Series or array of ages corresponding to samples
method: 'correlation' (Pearson/Spearman) or 'regression'
correction: multiple testing method
alpha: significance threshold
Returns:
DataFrame with correlation, p-value, adjusted p-value
"""
results = []
for probe in beta_df.index:
vals = beta_df.loc[probe].values
mask = ~np.isnan(vals) & ~np.isnan(ages.values if hasattr(ages, 'values') else ages)
if sum(mask) < 5:
results.append({'probe': probe, 'correlation': np.nan,
'pvalue': np.nan})
continue
if method == 'correlation':
corr, pval = stats.pearsonr(ages[mask] if hasattr(ages, '__getitem__') else
np.array(ages)[mask], vals[mask])
elif method == 'spearman':
corr, pval = stats.spearmanr(ages[mask] if hasattr(ages, '__getitem__') else
np.array(ages)[mask], vals[mask])
else:
corr, pval = stats.pearsonr(ages[mask] if hasattr(ages, '__getitem__') else
np.array(ages)[mask], vals[mask])
results.append({'probe': probe, 'correlation': corr, 'pvalue': pval})
result_df = pd.DataFrame(results).set_index('probe')
# Multiple testing correction
valid_pvals = result_df['pvalue'].dropna()
if len(valid_pvals) > 0:
reject, padj, _, _ = mt.multipletests(valid_pvals.values, alpha=alpha, method=correction)
result_df.loc[valid_pvals.index, 'padj'] = padj
else:
result_df['padj'] = np.nan
return result_df
1.6 Chromosome-Level Methylation Statistics
def chromosome_cpg_density(cpg_probes, manifest, genome='hg38'):
"""Calculate CpG density per chromosome.
Args:
cpg_probes: list/Index of CpG probe IDs
manifest: probe annotation with chr and position columns
genome: genome build for chromosome lengths
Returns:
DataFrame with chr, n_cpgs, chr_length, density (CpGs per bp)
"""
chr_lengths = get_chromosome_lengths(genome)
# Map probes to chromosomes
probe_id_col = 'probe_id' if 'probe_id' in manifest.columns else manifest.columns[0]
if probe_id_col in manifest.columns:
probe_chr = manifest.set_index(probe_id_col)
else:
probe_chr = manifest
# Get chromosome for each probe
if 'chr' in probe_chr.columns:
chr_col = 'chr'
elif 'CHR' in probe_chr.columns:
chr_col = 'CHR'
else:
raise ValueError("No chromosome column found in manifest")
# Count probes per chromosome
probe_chrs = probe_chr.loc[probe_chr.index.isin(cpg_probes), chr_col]
probe_chrs = probe_chrs.apply(normalize_chromosome)
chr_counts = probe_chrs.value_counts()
results = []
for chrom, count in chr_counts.items():
if chrom in chr_lengths:
length = chr_lengths[chrom]
density = count / length
results.append({
'chr': chrom,
'n_cpgs': count,
'chr_length': length,
'density_per_bp': density,
'density_per_mb': density * 1e6,
})
return pd.DataFrame(results).sort_values('chr',
key=lambda x: x.str.replace('chr', '').replace({'X': '23', 'Y': '24'}).astype(int))
def genome_wide_average_density(density_df):
"""Calculate genome-wide average CpG density across all chromosomes.
Args:
density_df: Output from chromosome_cpg_density()
Returns:
float: genome-wide average density (CpGs per bp)
"""
total_cpgs = density_df['n_cpgs'].sum()
total_length = density_df['chr_length'].sum()
return total_cpgs / total_length
def chromosome_density_ratio(density_df, chr1, chr2):
"""Calculate density ratio between two chromosomes.
Args:
density_df: Output from chromosome_cpg_density()
chr1, chr2: chromosome names (e.g., 'chr1', 'chr2')
Returns:
float: density ratio (chr1 / chr2)
"""
chr1 = normalize_chromosome(chr1)
chr2 = normalize_chromosome(chr2)
d1 = density_df[density_df['chr'] == chr1]['density_per_bp'].values[0]
d2 = density_df[density_df['chr'] == chr2]['density_per_bp'].values[0]
return d1 / d2
Phase 2: ChIP-seq Peak Analysis
2.1 Load BED/Peak Files
def load_bed_file(file_path, format='bed'):
"""Load BED format file (standard BED, narrowPeak, broadPeak).
Standard BED: chrom, start, end, name, score, strand
narrowPeak: + signalValue, pValue, qValue, peak
broadPeak: + signalValue, pValue, qValue
"""
if format == 'narrowPeak' or file_path.endswith('.narrowPeak'):
names = ['chrom', 'start', 'end', 'name', 'score', 'strand',
'signalValue', 'pValue', 'qValue', 'peak']
elif format == 'broadPeak' or file_path.endswith('.broadPeak'):
names = ['chrom', 'start', 'end', 'name', 'score', 'strand',
'signalValue', 'pValue', 'qValue']
else:
# Standard BED - detect number of columns
with open(file_path, 'r') as f:
first_line = f.readline().strip()
while first_line.startswith('#') or first_line.startswith('track') or first_line.startswith('browser'):
first_line = f.readline().strip()
n_cols = len(first_line.split('\t'))
bed_col_names = ['chrom', 'start', 'end', 'name', 'score', 'strand',
'thickStart', 'thickEnd', 'itemRgb', 'blockCount',
'blockSizes', 'blockStarts']
names = bed_col_names[:n_cols]
df = pd.read_csv(file_path, sep='\t', header=None, names=names,
comment='#', low_memory=False)
# Skip track/browser lines
df = df[~df['chrom'].astype(str).str.startswith(('track', 'browser'))]
# Normalize chromosomes
df['chrom'] = df['chrom'].apply(normalize_chromosome)
# Ensure numeric types
df['start'] = pd.to_numeric(df['start'], errors='coerce')
df['end'] = pd.to_numeric(df['end'], errors='coerce')
return df
def peak_statistics(peaks_df):
"""Calculate basic peak statistics.
Args:
peaks_df: BED DataFrame from load_bed_file()
Returns:
dict with peak statistics
"""
peaks_df = peaks_df.copy()
peaks_df['length'] = peaks_df['end'] - peaks_df['start']
stats_dict = {
'total_peaks': len(peaks_df),
'mean_peak_length': peaks_df['length'].mean(),
'median_peak_length': peaks_df['length'].median(),
'total_coverage_bp': peaks_df['length'].sum(),
'peaks_per_chromosome': peaks_df['chrom'].value_counts().to_dict(),
}
if 'signalValue' in peaks_df.columns:
stats_dict['mean_signal'] = peaks_df['signalValue'].mean()
stats_dict['median_signal'] = peaks_df['signalValue'].median()
if 'qValue' in peaks_df.columns:
stats_dict['mean_qvalue'] = peaks_df['qValue'].mean()
return stats_dict
2.2 Peak Annotation
def annotate_peaks_to_genes(peaks_df, gene_annotation=None,
tss_upstream=2000, tss_downstream=500):
"""Annotate peaks to nearest gene / genomic feature.
Args:
peaks_df: BED DataFrame
gene_annotation: DataFrame with gene coordinates (chr, start, end, gene_name, strand)
tss_upstream: bp upstream of TSS to define promoter
tss_downstream: bp downstream of TSS to define promoter
Returns:
DataFrame with peak annotations
"""
if gene_annotation is None:
return peaks_df # No annotation available
annotated = peaks_df.copy()
annotations = []
for _, peak in peaks_df.iterrows():
peak_chr = peak['chrom']
peak_mid = (peak['start'] + peak['end']) // 2
# Filter genes on same chromosome
chr_genes = gene_annotation[gene_annotation['chr'] == peak_chr]
if len(chr_genes) == 0:
annotations.append({
'nearest_gene': 'intergenic',
'distance_to_tss': np.nan,
'feature': 'intergenic'
})
continue
# Calculate distance to TSS
tss_positions = chr_genes.apply(
lambda g: g['start'] if g.get('strand', '+') == '+' else g['end'],
axis=1
)
distances = (peak_mid - tss_positions).abs()
nearest_idx = distances.idxmin()
nearest_gene = chr_genes.loc[nearest_idx]
distance = distances.loc[nearest_idx]
tss = tss_positions.loc[nearest_idx]
# Classify feature type
if abs(peak_mid - tss) <= tss_upstream:
feature = 'promoter'
elif peak['start'] >= nearest_gene['start'] and peak['end'] <= nearest_gene['end']:
feature = 'gene_body'
elif abs(peak_mid - tss) <= 10000:
feature = 'proximal'
else:
feature = 'distal'
annotations.append({
'nearest_gene': nearest_gene.get('gene_name', nearest_gene.name),
'distance_to_tss': int(distance),
'feature': feature
})
ann_df = pd.DataFrame(annotations, index=peaks_df.index)
return pd.concat([peaks_df, ann_df], axis=1)
def classify_peak_regions(annotated_peaks):
"""Classify peaks into genomic regions.
Returns:
dict with counts per region type
"""
if 'feature' not in annotated_peaks.columns:
return {'unknown': len(annotated_peaks)}
return annotated_peaks['feature'].value_counts().to_dict()
2.3 Peak Overlap Analysis
def find_overlaps(peaks_a, peaks_b, min_overlap=1):
"""Find overlapping peaks between two BED DataFrames.
Uses a simple interval overlap approach (pure Python, no pybedtools).
Args:
peaks_a: BED DataFrame (query)
peaks_b: BED DataFrame (subject)
min_overlap: minimum overlap in bp
Returns:
DataFrame of overlapping pairs
"""
overlaps = []
# Group by chromosome for efficiency
for chrom in peaks_a['chrom'].unique():
a_chr = peaks_a[peaks_a['chrom'] == chrom].sort_values('start')
b_chr = peaks_b[peaks_b['chrom'] == chrom].sort_values('start')
if len(b_chr) == 0:
continue
for _, a_peak in a_chr.iterrows():
# Binary search for potential overlaps
for _, b_peak in b_chr.iterrows():
if b_peak['start'] >= a_peak['end']:
break
if b_peak['end'] <= a_peak['start']:
continue
# Calculate overlap
overlap_start = max(a_peak['start'], b_peak['start'])
overlap_end = min(a_peak['end'], b_peak['end'])
overlap_bp = overlap_end - overlap_start
if overlap_bp >= min_overlap:
overlaps.append({
'a_chrom': chrom,
'a_start': a_peak['start'],
'a_end': a_peak['end'],
'b_start': b_peak['start'],
'b_end': b_peak['end'],
'overlap_bp': overlap_bp,
})
return pd.DataFrame(overlaps) if overlaps else pd.DataFrame()
def jaccard_similarity(peaks_a, peaks_b, genome='hg38'):
"""Calculate Jaccard similarity between two peak sets.
Jaccard = intersection / union of genomic coverage.
"""
chr_lengths = get_chromosome_lengths(genome)
total_genome = sum(chr_lengths.values())
# Simple approximation: total bp covered
coverage_a = (peaks_a['end'] - peaks_a['start']).sum()
coverage_b = (peaks_b['end'] - peaks_b['start']).sum()
overlaps = find_overlaps(peaks_a, peaks_b)
if len(overlaps) == 0:
return 0.0
intersection = overlaps['overlap_bp'].sum()
union = coverage_a + coverage_b - intersection
return intersection / union if union > 0 else 0.0
Phase 3: ATAC-seq Analysis
3.1 ATAC-seq Peak Processing
def load_atac_peaks(file_path):
"""Load ATAC-seq peak file (typically narrowPeak format)."""
return load_bed_file(file_path, format='narrowPeak')
def atac_peak_statistics(peaks_df):
"""ATAC-seq specific statistics.
ATAC-seq peaks represent open chromatin regions.
"""
basic_stats = peak_statistics(peaks_df)
# ATAC-specific: nucleosome-free region (NFR) analysis
# NFR peaks typically < 150bp
peaks_df = peaks_df.copy()
peaks_df['length'] = peaks_df['end'] - peaks_df['start']
nfr_peaks = peaks_df[peaks_df['length'] < 150]
nucleosome_peaks = peaks_df[peaks_df['length'] >= 150]
basic_stats['nfr_peaks'] = len(nfr_peaks)
basic_stats['nucleosome_peaks'] = len(nucleosome_peaks)
basic_stats['nfr_fraction'] = len(nfr_peaks) / len(peaks_df) if len(peaks_df) > 0 else 0
return basic_stats
def chromatin_accessibility_by_region(peaks_df, gene_annotation=None):
"""Calculate chromatin accessibility distribution across genomic regions."""
annotated = annotate_peaks_to_genes(peaks_df, gene_annotation)
regions = classify_peak_regions(annotated)
total = sum(regions.values())
region_fractions = {k: v / total for k, v in regions.items()}
return {
'counts': regions,
'fractions': region_fractions,
'total_peaks': total,
}
Phase 4: Multi-Omics Integration
4.1 Expression-Methylation Correlation
def correlate_methylation_expression(beta_df, expression_df, probe_gene_map,
method='pearson', correction='fdr_bh'):
"""Correlate methylation levels with gene expression.
Args:
beta_df: Methylation matrix (probes x samples)
expression_df: Expression matrix (genes x samples)
probe_gene_map: dict or Series mapping probe IDs to gene symbols
method: 'pearson' or 'spearman'
correction: multiple testing correction
Returns:
DataFrame with correlation, p-value per probe-gene pair
"""
# Align samples
common_samples = list(set(beta_df.columns) & set(expression_df.columns))
if len(common_samples) < 5:
raise ValueError(f"Not enough common samples: {len(common_samples)}")
beta_aligned = beta_df[common_samples]
expr_aligned = expression_df[common_samples]
results = []
for probe, gene in probe_gene_map.items():
if probe not in beta_aligned.index or gene not in expr_aligned.index:
continue
meth_vals = beta_aligned.loc[probe].values
expr_vals = expr_aligned.loc[gene].values
mask = ~np.isnan(meth_vals) & ~np.isnan(expr_vals)
if sum(mask) < 5:
continue
if method == 'pearson':
corr, pval = stats.pearsonr(meth_vals[mask], expr_vals[mask])
else:
corr, pval = stats.spearmanr(meth_vals[mask], expr_vals[mask])
results.append({
'probe': probe,
'gene': gene,
'correlation': corr,
'pvalue': pval,
'n_samples': sum(mask),
})
result_df = pd.DataFrame(results)
if len(result_df) > 0:
valid_pvals = result_df['pvalue'].dropna()
if len(valid_pvals) > 0:
reject, padj, _, _ = mt.multipletests(valid_pvals.values, method=correction)
result_df.loc[valid_pvals.index, 'padj'] = padj
return result_df
4.2 ChIP-seq + Expression Integration
def integrate_chipseq_expression(peaks_df, expression_df, gene_annotation,
tss_window=5000):
"""Integrate ChIP-seq peaks with gene expression.
Args:
peaks_df: ChIP-seq peaks (BED)
expression_df: Gene expression (genes x samples)
gene_annotation: Gene coordinates
tss_window: window around TSS for promoter peaks
Returns:
DataFrame with genes having promoter peaks and their expression
"""
annotated = annotate_peaks_to_genes(peaks_df, gene_annotation,
tss_upstream=tss_window)
promoter_peaks = annotated[annotated['feature'] == 'promoter']
# Get genes with promoter peaks
peak_genes = promoter_peaks['nearest_gene'].unique()
# Get expression for these genes
common_genes = [g for g in peak_genes if g in expression_df.index]
result = pd.DataFrame({
'gene': common_genes,
'has_promoter_peak': True,
'mean_expression': [expression_df.loc[g].mean() for g in common_genes],
})
return result
Phase 5: Clinical Data Integration
5.1 Missing Data Analysis
def missing_data_analysis(clinical_df=None, expression_df=None,
methylation_df=None, sample_id_col=None):
"""Analyze missing data across multiple omics modalities.
For BixBench questions like:
"How many patients have no missing data for vital status, gene expression, and methylation?"
Args:
clinical_df: Clinical data (patients x variables)
expression_df: Expression matrix (genes x samples)
methylation_df: Methylation matrix (probes x samples)
sample_id_col: Column name for sample/patient IDs in clinical data
Returns:
dict with completeness statistics
"""
results = {}
# Get sample sets from each modality
clinical_samples = set()
if clinical_df is not None:
if sample_id_col and sample_id_col in clinical_df.columns:
clinical_samples = set(clinical_df[sample_id_col].dropna())
else:
clinical_samples = set(clinical_df.index)
results['clinical_samples'] = len(clinical_samples)
expression_samples = set()
if expression_df is not None:
expression_samples = set(expression_df.columns)
results['expression_samples'] = len(expression_samples)
methylation_samples = set()
if methylation_df is not None:
methylation_samples = set(methylation_df.columns)
results['methylation_samples'] = len(methylation_samples)
# Intersection: samples with ALL data types
all_sets = []
if clinical_samples:
all_sets.append(clinical_samples)
if expression_samples:
all_sets.append(expression_samples)
if methylation_samples:
all_sets.append(methylation_samples)
if len(all_sets) > 0:
complete_samples = set.intersection(*all_sets)
results['complete_samples'] = len(complete_samples)
results['complete_sample_ids'] = sorted(complete_samples)
else:
results['complete_samples'] = 0
# Additional: check for missing values within clinical data
if clinical_df is not None:
for col in clinical_df.columns:
n_missing = clinical_df[col].isna().sum()
n_total = len(clinical_df)
results[f'clinical_{col}_missing'] = n_missing
results[f'clinical_{col}_complete'] = n_total - n_missing
return results
def find_complete_cases(data_frames, variables=None):
"""Find samples that are complete across specified data frames and variables.
Args:
data_frames: dict of {name: DataFrame} where columns are samples
variables: dict of {df_name: [variable_names]} for clinical variables to check
Returns:
set of sample IDs with complete data
"""
sample_sets = []
for name, df in data_frames.items():
if df is not None:
if variables and name in variables:
# Check specific variables for completeness
for var in variables[name]:
if var in df.columns:
complete = set(df[df[var].notna()].index)
sample_sets.append(complete)
elif var in df.index:
complete = set(df.columns[df.loc[var].notna()])
sample_sets.append(complete)
else:
# Just check sample presence
sample_sets.append(set(df.columns))
if not sample_sets:
return set()
return set.intersection(*sample_sets)
Phase 6: ToolUniverse Annotation Integration
Use ToolUniverse tools for biological annotation of epigenomic findings.
6.1 Gene-Level Annotation
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()
# Annotate differentially methylated genes
def annotate_genes_with_tooluniverse(gene_list, tu):
"""Annotate a list of genes using ToolUniverse tools.
Uses:
- Ensembl for gene coordinates and cross-references
- SCREEN for regulatory elements near gene
- ChIPAtlas for ChIP-seq experiments
"""
annotations = {}
for gene in gene_list[:20]: # Limit for API rate
annotation = {'gene': gene}
# Ensembl lookup
try:
ens = tu.tools.ensembl_lookup_gene(id=gene, species='homo_sapiens')
if isinstance(ens, dict):
data = ens.get('data', ens)
annotation['ensembl_id'] = data.get('id', 'N/A')
annotation['chr'] = data.get('seq_region_name', 'N/A')
annotation['start'] = data.get('start', 'N/A')
annotation['end'] = data.get('end', 'N/A')
annotation['biotype'] = data.get('biotype', 'N/A')
except Exception:
pass
# SCREEN regulatory elements
try:
screen = tu.tools.SCREEN_get_regulatory_elements(
gene_name=gene, element_type="enhancer", limit=5
)
if screen is not None:
annotation['screen_enhancers'] = 'available'
except Exception:
pass
annotations[gene] = annotation
return pd.DataFrame.from_dict(annotations, orient='index')
6.2 ChIPAtlas Integration
def query_chipatlas_experiments(antigen, genome='hg38', cell_type=None, tu=None):
"""Query ChIPAtlas for available ChIP-seq experiments.
Args:
antigen: TF or histone mark name (e.g., 'H3K27ac', 'CTCF')
genome: genome build
cell_type: optional cell type filter
Returns:
ChIPAtlas experiment metadata
"""
if tu is None:
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()
params = {
'operation': 'get_experiment_list',
'genome': genome,
'antigen': antigen,
'limit': 50,
}
if cell_type:
params['cell_type'] = cell_type
return tu.tools.ChIPAtlas_get_experiments(**params)
6.3 Ensembl Regulatory Feature Annotation
def annotate_regions_with_ensembl(regions, species='human', tu=None):
"""Annotate genomic regions with Ensembl regulatory features.
Args:
regions: list of (chr, start, end) tuples
species: Ensembl species name
Returns:
dict of region -> regulatory features
"""
if tu is None:
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()
annotations = {}
for chrom, start, end in regions[:10]: # Limit for API rate
# Ensembl uses chromosome without 'chr' prefix
ens_chrom = chrom.replace('chr', '') if chrom.startswith('chr') else chrom
region_str = f"{ens_chrom}:{start}-{end}"
try:
result = tu.tools.ensembl_get_regulatory_features(
region=region_str, feature="regulatory", species=species
)
annotations[(chrom, start, end)] = result
except Exception as e:
annotations[(chrom, start, end)] = {'error': str(e)}
return annotations
Phase 7: Genome-Wide Statistics
7.1 Comprehensive Genome Statistics
def genome_wide_methylation_stats(beta_df, manifest=None, genome='hg38'):
"""Calculate comprehensive genome-wide methylation statistics.
Args:
beta_df: Methylation matrix (probes x samples)
manifest: Probe annotation
genome: genome build
Returns:
dict with genome-wide statistics
"""
stats_result = {
'total_probes': len(beta_df),
'total_samples': beta_df.shape[1],
'global_mean_beta': float(beta_df.mean().mean()),
'global_median_beta': float(beta_df.median().median()),
'global_std_beta': float(beta_df.values[~np.isnan(beta_df.values)].std()),
'missing_fraction': float(beta_df.isna().mean().mean()),
}
# Per-sample statistics
stats_result['sample_means'] = beta_df.mean(axis=0).describe().to_dict()
# Per-probe statistics
probe_var = beta_df.var(axis=1, skipna=True)
stats_result['probe_variance'] = {
'mean': float(probe_var.mean()),
'median': float(probe_var.median()),
'max': float(probe_var.max()),
}
stats_result['high_variance_probes'] = int((probe_var > 0.01).sum())
# Chromosome-level stats (if manifest available)
if manifest is not None:
density_df = chromosome_cpg_density(beta_df.index.tolist(), manifest, genome)
stats_result['chromosome_density'] = density_df.to_dict('records')
stats_result['genome_wide_density'] = genome_wide_average_density(density_df)
return stats_result
def summarize_differential_methylation(dm_results, alpha=0.05):
"""Summarize differential methylation results.
Args:
dm_results: Output from differential_methylation()
Returns:
dict with summary statistics
"""
sig = dm_results[dm_results['padj'] < alpha]
hyper = sig[sig['delta_beta'] > 0]
hypo = sig[sig['delta_beta'] < 0]
return {
'total_tested': len(dm_results),
'total_significant': len(sig),
'hypermethylated': len(hyper),
'hypomethylated': len(hypo),
'fraction_significant': len(sig) / len(dm_results) if len(dm_results) > 0 else 0,
'mean_delta_beta_sig': float(sig['delta_beta'].mean()) if len(sig) > 0 else 0,
'max_abs_delta_beta': float(sig['delta_beta'].abs().max()) if len(sig) > 0 else 0,
}
ToolUniverse Tool Parameter Reference
Regulatory Annotation Tools
| Tool | Parameters | Returns |
|---|---|---|
ensembl_lookup_gene |
id: str, species: str (REQUIRED) |
{status, data: {id, display_name, seq_region_name, start, end, strand, biotype}, url} |
ensembl_get_regulatory_features |
region: str (no "chr"), feature: str, species: str |
{status, data: [...features...]} |
ensembl_get_overlap_features |
region: str, feature: str, species: str |
Gene/transcript overlap data |
SCREEN_get_regulatory_elements |
gene_name: str, element_type: str, limit: int |
cCREs (enhancers, promoters, insulators) |
ReMap_get_transcription_factor_binding |
gene_name: str, cell_type: str, limit: int |
TF binding sites |
RegulomeDB_query_variant |
rsid: str |
{status, data, url} regulatory score |
jaspar_search_matrices |
search: str, collection: str, species: str |
{count, results: [...matrices...]} |
ENCODE_search_experiments |
assay_title: str, target: str, organism: str, limit: int |
Experiment metadata |
ChIPAtlas_get_experiments |
operation: str (REQUIRED: "get_experiment_list"), genome: str, antigen: str, cell_type: str, limit: int |
Experiment list |
ChIPAtlas_search_datasets |
operation REQUIRED, antigenList/celltypeList |
Dataset search results |
ChIPAtlas_enrichment_analysis |
Various input types (BED regions, motifs, genes) | Enrichment results |
ChIPAtlas_get_peak_data |
operation REQUIRED |
Peak data download URLs |
FourDN_search_data |
operation: str (REQUIRED: "search_data"), assay_title: str, limit: int |
Chromatin conformation data |
Gene Annotation Tools
| Tool | Parameters | Returns |
|---|---|---|
MyGene_query_genes |
query: str |
{hits: [{_id, symbol, ensembl, ...}]} |
MyGene_batch_query |
gene_ids: list[str], fields: str |
{results: [{query, symbol, ...}]} |
HGNC_get_gene_info |
symbol: str |
Gene symbol, aliases, IDs |
GO_get_annotations_for_gene |
gene_id: str |
GO annotations |
CRITICAL Tool Notes
- ensembl_lookup_gene: REQUIRES
species='homo_sapiens'parameter - ensembl_get_regulatory_features: Region format is "17:start-end" (NO "chr" prefix)
- ChIPAtlas tools: ALL require
operationparameter (SOAP-style) - FourDN tools: ALL require
operationparameter (SOAP-style) - SCREEN: Returns JSON-LD format with
@context, @graphkeys
Response Format Notes
- Methylation data: Typically stored as probes (rows) x samples (columns), beta values 0-1
- BED files: Tab-separated, 0-based half-open coordinates
- narrowPeak: 10-column BED extension with signalValue, pValue, qValue, peak
- Illumina manifests: Contains probe ID, chromosome, position, gene annotation
- Clinical data: Patient/sample-centric with clinical variables as columns
Fallback Strategies
| Scenario | Primary | Fallback |
|---|---|---|
| No manifest file | Load from data dir | Build minimal from Ensembl lookup |
| No pybedtools | Pure Python overlap | pandas-based interval intersection |
| No pyBigWig | Skip BigWig analysis | Use pre-computed summary tables |
| Missing clinical data | Report missing | Use available samples only |
| Low sample count | Parametric test | Use non-parametric (Wilcoxon) |
| Large dataset (>500K probes) | Full analysis | Sample or chunk-based processing |
Common Use Patterns
Pattern 1: Methylation Array Analysis
Input: Beta-value matrix + manifest + clinical data
Question: "How many CpGs are differentially methylated?"
Flow:
1. Load beta matrix, manifest, clinical data
2. Filter CpG probes (cg only, remove sex chr, variance filter)
3. Define groups from clinical data
4. Run differential_methylation()
5. Apply thresholds (padj < 0.05, |delta_beta| > 0.2)
6. Report count and direction (hyper/hypo)
Pattern 2: Age-Related CpG Density
Input: Beta-value matrix + manifest + ages
Question: "What is the density ratio of age-related CpGs between chr1 and chr2?"
Flow:
1. Load beta matrix and ages from clinical data
2. Run identify_age_related_cpgs()
3. Filter significant age-related CpGs
4. Map to chromosomes using manifest
5. Calculate chromosome_cpg_density()
6. Compute ratio between specified chromosomes
Pattern 3: Multi-Omics Missing Data
Input: Clinical + expression + methylation data files
Question: "How many patients have complete data for all modalities?"
Flow:
1. Load all data files
2. Extract sample IDs from each
3. Find intersection (common samples)
4. Check for NaN/missing within clinical variables
5. Report complete cases count
Pattern 4: ChIP-seq Peak Annotation
Input: BED/narrowPeak file
Question: "What fraction of peaks are in promoter regions?"
Flow:
1. Load BED file with load_bed_file()
2. Load or fetch gene annotation (Ensembl)
3. Run annotate_peaks_to_genes()
4. Classify regions with classify_peak_regions()
5. Calculate fraction in promoters
Pattern 5: Methylation-Expression Integration
Input: Beta matrix + expression matrix + probe-gene mapping
Question: "What is the correlation between methylation and expression?"
Flow:
1. Load both matrices
2. Build probe-gene map from manifest
3. Align samples across datasets
4. Run correlate_methylation_expression()
5. Report significant anti-correlations
Edge Cases
Missing Probe Annotation
When no manifest/annotation file is available:
- Extract chromosome from probe ID naming patterns if possible
- Use ToolUniverse Ensembl tools to build minimal annotation
- Report limitation: "chromosome mapping unavailable for X probes"
Mixed Genome Builds
When data uses different builds:
- Detect build from context (data README, file names, known coordinates)
- Use appropriate chromosome lengths for density calculations
- Do NOT mix hg19 and hg38 coordinates
Very Large Datasets
For datasets with >500K CpG sites:
- Use chunked processing for differential methylation
- Pre-filter by variance before statistical testing
- Use vectorized operations (avoid row-by-row loops where possible)
Sample ID Mismatches
Clinical and molecular data may use different ID formats:
- TCGA: barcode (TCGA-XX-XXXX-01A) vs patient ID (TCGA-XX-XXXX)
- Try truncating or matching partial IDs
- Report number of matched/unmatched samples
Limitations
- No native pybedtools: Uses pure Python interval operations (slower for very large BED files)
- No native pyBigWig: Cannot read BigWig files directly without package
- No R bridge: Does not use methylKit, ChIPseeker, or DiffBind
- Illumina-centric: Methylation functions designed for 450K/EPIC arrays
- Statistical simplicity: Uses t-test/Wilcoxon for differential methylation (not limma/bumphunter)
- No peak calling: Assumes peaks are pre-called; does not run MACS2 or similar
- API rate limits: ToolUniverse annotation limited to ~20 genes per batch
Summary
Genomics & Epigenomics Data Processing Skill provides:
- Methylation analysis - Beta-value processing, CpG filtering, differential methylation, age-related CpGs, chromosome density
- ChIP-seq analysis - Peak loading (BED/narrowPeak), peak annotation, overlap analysis, statistics
- ATAC-seq analysis - Chromatin accessibility peaks, NFR detection, region classification
- Multi-omics integration - Methylation-expression correlation, ChIP-seq + expression
- Clinical integration - Missing data analysis across modalities, complete case identification
- Genome-wide statistics - Chromosome-level density, genome-wide averages, density ratios
- ToolUniverse annotation - Ensembl, SCREEN, ChIPAtlas, JASPAR, ENCODE for biological context
Core packages: pandas, numpy, scipy, statsmodels, pysam, gseapy ToolUniverse tools: 25+ tools across Ensembl, SCREEN, ENCODE, ChIPAtlas, JASPAR, ReMap, RegulomeDB, 4DN Best for: BixBench-style quantitative questions about methylation data, ChIP-seq peaks, chromatin accessibility, and multi-omics integration