tooluniverse-multi-omics-integration
Multi-Omics Integration
Coordinate and integrate multiple omics datasets for comprehensive systems biology analysis. This skill orchestrates specialized ToolUniverse skills to perform cross-omics correlation, multi-omics clustering, pathway-level integration, and unified interpretation across molecular layers.
When to Use This Skill
Triggers:
- User has multiple omics datasets (RNA-seq + proteomics, methylation + expression, etc.)
- Requests for integrative multi-omics analysis
- Cross-omics correlation queries (e.g., "How does methylation affect expression?")
- Multi-omics biomarker discovery
- Systems biology questions requiring multiple molecular layers
- Precision medicine applications with multi-omics patient data
- Questions about molecular mechanisms across omics types
Example Questions This Skill Solves:
- "Integrate RNA-seq and proteomics data to find genes with concordant changes"
- "How does promoter methylation correlate with gene expression?"
- "Perform multi-omics clustering to identify patient subtypes"
- "Which pathways are dysregulated across transcriptome, proteome, and metabolome?"
- "Find multi-omics biomarkers for disease classification"
- "Correlate CNV with gene expression to identify dosage effects"
- "Integrate GWAS variants, eQTLs, and expression data"
- "Perform MOFA+ analysis on multi-omics cancer data"
Core Capabilities
| Capability | Description |
|---|---|
| Data Integration | Match samples across omics, handle missing data, normalize scales |
| Cross-Omics Correlation | Correlate features across molecular layers (gene expression vs protein, methylation vs expression) |
| Multi-Omics Clustering | MOFA+, NMF, joint clustering to identify omics-driven subtypes |
| Pathway Integration | Combine omics evidence at pathway level for unified biological interpretation |
| Biomarker Discovery | Identify multi-omics signatures with improved predictive power |
| Skill Coordination | Orchestrate RNA-seq, epigenomics, variant-analysis, protein-interactions, gene-enrichment skills |
| Visualization | Circos plots, integrated heatmaps, network visualizations |
| Reporting | Unified multi-omics reports with cross-layer insights |
Workflow Overview
Input: Multiple Omics Datasets
|
v
Phase 1: Data Loading & QC
|-- Load RNA-seq (expression matrix)
|-- Load proteomics (protein abundance)
|-- Load methylation (beta values or M-values)
|-- Load variants (CNV, SNV from VCF)
|-- Load metabolomics (metabolite abundance)
|-- Quality control per omics type
|
v
Phase 2: Sample Matching
|-- Match samples across omics by ID
|-- Identify common samples
|-- Handle batch effects
|-- Normalize sample identifiers
|
v
Phase 3: Feature Mapping
|-- Map features to common identifier space (genes, proteins, metabolites)
|-- Link CpG sites to genes (promoter, gene body)
|-- Map variants to genes
|-- Create unified feature matrix
|
v
Phase 4: Cross-Omics Correlation
|-- Gene expression vs protein abundance (translation efficiency)
|-- Promoter methylation vs expression (epigenetic regulation)
|-- CNV vs expression (dosage effect)
|-- eQTL variants vs expression (genetic regulation)
|-- Metabolite vs enzyme expression (metabolic flux)
|
v
Phase 5: Multi-Omics Clustering
|-- MOFA+ (Multi-Omics Factor Analysis) for latent factors
|-- NMF (Non-negative Matrix Factorization) for patient subtypes
|-- Joint clustering across omics
|-- Identify omics-specific vs shared variation
|
v
Phase 6: Pathway-Level Integration
|-- Aggregate omics to pathway level
|-- Score pathway dysregulation (combined evidence)
|-- Use ToolUniverse enrichment tools (Reactome, KEGG, GO)
|-- Identify driver pathways across omics
|
v
Phase 7: Biomarker Discovery
|-- Feature selection across omics
|-- Multi-omics signatures for classification
|-- Cross-validation and performance
|-- Interpretation and biological validation
|
v
Phase 8: Generate Integrated Report
|-- Summary statistics per omics
|-- Cross-omics correlation results
|-- Multi-omics clusters and subtypes
|-- Top dysregulated pathways
|-- Multi-omics biomarkers
|-- Biological interpretation
Phase Details
Phase 1: Data Loading & Quality Control
Objective: Load multiple omics datasets and perform quality control.
Supported omics types:
- Transcriptomics: RNA-seq count matrices, microarray
- Proteomics: Protein abundance (MS-based)
- Epigenomics: Methylation (450K, EPIC arrays, WGBS), ChIP-seq peaks
- Genomics: CNV, SNV, structural variants
- Metabolomics: Metabolite abundance (targeted, untargeted)
Data formats:
- Expression: CSV/TSV matrices, HDF5, AnnData (.h5ad)
- Proteomics: MaxQuant output, Spectronaut, DIA-NN
- Methylation: IDAT files, beta value matrices
- Variants: VCF, SEG files (CNV)
- Metabolomics: Peak tables, identified metabolites
Quality control per omics:
# RNA-seq QC
- Filter low-count genes (mean counts < threshold)
- Normalize (TPM, FPKM, or DESeq2)
- Log-transform for correlation
# Proteomics QC
- Filter proteins with a high proportion of missing values
- Impute missing values (minimum, KNN)
- Normalize (median, quantile)
# Methylation QC
- Remove failed probes
- Correct for batch effects (ComBat)
- Filter cross-reactive probes
# Variants QC
- Use variant-analysis skill for VCF QC
- CNV segmentation validation
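The QC steps above can be sketched for two of the layers. This is a minimal illustration, assuming features x samples pandas DataFrames and log-scale protein abundances. The function names and the `min_mean_count` and `max_missing_frac` thresholds are hypothetical, and counts-per-million stands in for the TPM/FPKM/DESeq2 options listed above as the simplest library-size normalization.

```python
import numpy as np
import pandas as pd


def qc_rnaseq(counts, min_mean_count=10):
    """Filter low-count genes, then log2(CPM + 1) transform (genes x samples)."""
    kept = counts.loc[counts.mean(axis=1) >= min_mean_count]
    # Counts per million, using library sizes from the unfiltered matrix
    cpm = kept / counts.sum(axis=0) * 1e6
    return np.log2(cpm + 1)


def qc_proteomics(abundance, max_missing_frac=0.5):
    """Drop sparse proteins, median-normalize samples, minimum-impute gaps."""
    kept = abundance.loc[abundance.isna().mean(axis=1) <= max_missing_frac]
    # Median normalization: shift every sample to a shared median
    medians = kept.median(axis=0)
    normalized = kept - medians + medians.mean()
    # Minimum imputation: fill each protein's gaps with its lowest observed value
    return normalized.apply(lambda row: row.fillna(row.min()), axis=1)
```

Thresholds like these are dataset-dependent and worth checking against the distribution of counts and missingness before use.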
Phase 2: Sample Matching
Objective: Identify common samples across omics datasets.
Sample ID harmonization:
def match_samples_across_omics(omics_data_dict):
    """
    Match samples across multiple omics datasets.

    Parameters:
        omics_data_dict: {
            'rnaseq': DataFrame (genes x samples),
            'proteomics': DataFrame (proteins x samples),
            'methylation': DataFrame (CpGs x samples),
            'cnv': DataFrame (genes x samples)
        }

    Returns:
        - common_samples: List of sample IDs present in all omics
        - matched_data: Dict of DataFrames with common samples only
    """
    # Extract sample IDs from each omics
    sample_ids = {
        omics_type: set(df.columns)
        for omics_type, df in omics_data_dict.items()
    }

    # Find common samples (intersection)
    common_samples = set.intersection(*sample_ids.values())

    # Subset each omics to common samples
    matched_data = {
        omics_type: df[sorted(common_samples)]
        for omics_type, df in omics_data_dict.items()
    }
    return sorted(common_samples), matched_data
Handling missing omics:
- Pairwise integration if not all samples have all omics
- Document sample availability matrix
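A sample availability matrix can be built directly from the column labels of each dataset. The sketch below assumes the same features x samples layout as `match_samples_across_omics`; both function names are hypothetical.

```python
import pandas as pd


def sample_availability(omics_data_dict):
    """Boolean table (samples x omics): True where a sample has that layer."""
    columns = {name: set(df.columns) for name, df in omics_data_dict.items()}
    all_samples = sorted(set().union(*columns.values()))
    return pd.DataFrame(
        {name: [s in cols for s in all_samples] for name, cols in columns.items()},
        index=all_samples,
    )


def pairwise_common_samples(availability):
    """Count samples shared by each pair of omics layers (omics x omics)."""
    as_int = availability.astype(int)
    return as_int.T @ as_int
```

The pairwise counts show which two-layer integrations remain feasible when the all-omics intersection is too small.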
Phase 3: Feature Mapping
Objective: Map features from different omics to common gene-level identifiers.
Gene-centric integration:
# Map all features to genes
feature_mapping = {
'rnaseq': 'gene_symbol', # Already gene-level
'proteomics': 'gene_symbol', # Map protein to gene
'methylation': 'gene_symbol', # Map CpG to gene (promoter)
'cnv': 'gene_symbol', # CNV regions to overlapping genes
'metabolomics': 'enzyme_gene' # Metabolite to enzyme gene
}
CpG to gene mapping:
- Promoter methylation: CpGs within TSS ± 2kb
- Gene body methylation: CpGs within gene boundaries
- Average methylation per gene (weighted by probe coverage)
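The promoter rule above might look like this in pandas. This is a sketch under assumptions: the annotation columns `gene_symbol` and `distance_to_tss` are hypothetical names rather than the layout of any specific array manifest, and the mean is unweighted for simplicity, not weighted by probe coverage.

```python
import pandas as pd


def promoter_methylation_per_gene(beta, probe_annotation, window_bp=2000):
    """
    Average beta values of promoter CpGs per gene.

    beta: DataFrame (CpGs x samples) of beta values
    probe_annotation: DataFrame indexed by CpG ID with hypothetical columns
        'gene_symbol' and 'distance_to_tss' (signed distance in bp)
    """
    in_promoter = probe_annotation['distance_to_tss'].abs() <= window_bp
    promoter_probes = probe_annotation.loc[in_promoter, 'gene_symbol']
    # Keep only probes that are both annotated and measured
    shared = promoter_probes.index.intersection(beta.index)
    # Unweighted mean across each gene's promoter probes
    return beta.loc[shared].groupby(promoter_probes.loc[shared]).mean()
```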
CNV to gene mapping:
- Use variant-analysis skill to identify genes in CNV regions
- Calculate copy number per gene (log2 ratio)
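For intuition, the region-to-gene step reduces to interval overlap. The sketch below is not the variant-analysis skill's implementation; it assumes half-open coordinates, one sample's segments as plain tuples, and a hypothetical function name.

```python
def copy_number_per_gene(segments, genes):
    """
    Assign each gene the length-weighted mean log2 ratio of overlapping segments.

    segments: list of (chrom, start, end, log2_ratio) tuples for one sample
    genes: dict mapping gene symbol -> (chrom, start, end)
    """
    result = {}
    for symbol, (g_chrom, g_start, g_end) in genes.items():
        weighted_sum = 0.0
        covered = 0
        for s_chrom, s_start, s_end, log2_ratio in segments:
            if s_chrom != g_chrom:
                continue
            # Overlap length in base pairs (zero or negative means no overlap)
            overlap = min(g_end, s_end) - max(g_start, s_start)
            if overlap > 0:
                weighted_sum += overlap * log2_ratio
                covered += overlap
        if covered:
            result[symbol] = weighted_sum / covered
    return result
```

Genes with no overlapping segment are omitted rather than set to zero, so missing coverage is not mistaken for a neutral copy number.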
Phase 4: Cross-Omics Correlation
Objective: Correlate features across molecular layers to understand regulation.
Example analyses:
4.1: Expression vs Protein (Translation Efficiency)
from scipy.stats import spearmanr

def correlate_rna_protein(rnaseq_data, proteomics_data):
    """
    Correlate mRNA and protein levels for each gene.

    Expected: Positive correlation (r ~ 0.4-0.6 typical)
    Discordance indicates post-transcriptional regulation
    """
    # Find common genes (both inputs: genes x samples, same sample order)
    common_genes = set(rnaseq_data.index) & set(proteomics_data.index)

    correlations = {}
    for gene in common_genes:
        rna = rnaseq_data.loc[gene]
        protein = proteomics_data.loc[gene]

        # Spearman correlation (robust to outliers)
        r, p = spearmanr(rna, protein)
        correlations[gene] = {'r': r, 'p': p}

    # Identify discordant genes (low RNA-protein correlation)
    discordant = {g: v for g, v in correlations.items() if abs(v['r']) < 0.2}
    return correlations, discordant
4.2: Methylation vs Expression (Epigenetic Regulation)
from scipy.stats import spearmanr

def correlate_methylation_expression(methylation_data, rnaseq_data):
    """
    Correlate promoter methylation with gene expression.

    Expected: Negative correlation (increased methylation → decreased expression)
    """
    # For each gene with promoter methylation
    results = {}
    for gene in methylation_data.index:
        if gene in rnaseq_data.index:
            meth = methylation_data.loc[gene]  # Average promoter beta
            expr = rnaseq_data.loc[gene]
            r, p = spearmanr(meth, expr)
            results[gene] = {'r': r, 'p': p, 'direction': 'repressive' if r < 0 else 'activating'}

    # Identify genes with strong methylation-expression anticorrelation
    regulated = {g: v for g, v in results.items() if v['r'] < -0.5 and v['p'] < 0.01}
    return results, regulated
4.3: CNV vs Expression (Dosage Effect)
from scipy.stats import pearsonr

def correlate_cnv_expression(cnv_data, rnaseq_data):
    """
    Correlate copy number with gene expression.

    Expected: Positive correlation (gene dosage effect)
    """
    results = {}
    for gene in cnv_data.index:
        if gene in rnaseq_data.index:
            cnv = cnv_data.loc[gene]  # log2 ratio
            expr = rnaseq_data.loc[gene]
            r, p = pearsonr(cnv, expr)
            results[gene] = {'r': r, 'p': p}

    # Genes with dosage effect (CNV drives expression)
    dosage_genes = {g: v for g, v in results.items() if v['r'] > 0.5 and v['p'] < 0.01}
    return results, dosage_genes
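The correlation functions above test thousands of genes, so the raw `p < 0.01` cutoffs will admit false positives. One common remedy, not part of the original functions, is Benjamini-Hochberg adjustment. A minimal sketch, assuming the `{gene: {'r': ..., 'p': ...}}` result format used above; the helper names are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # p_(i) * n / i for the i-th smallest p-value
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0, 1)
    return adjusted


def significant_genes(results, alpha=0.05):
    """Keep genes whose BH-adjusted p-value is below alpha; adds a 'q' field."""
    genes = list(results)
    q = benjamini_hochberg([results[g]['p'] for g in genes])
    return {g: {**results[g], 'q': float(qv)} for g, qv in zip(genes, q) if qv < alpha}
```

Applying `significant_genes` to the `results` dict from any of the three functions gives an FDR-controlled gene list, which can then be filtered by effect size as before.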
Phase 5: Multi-Omics Clustering
Objective: Identify patient subtypes using integrated omics data.
Method 1: MOFA+ (Multi-Omics Factor Analysis)
MOFA+ identifies latent factors that explain variation across omics.
# Conceptual workflow (uses R's MOFA2 package or Python implementation)
# 1. Prepare multi-omics data as list of matrices
# 2. Run MOFA+ to identify factors
# 3. Inspect factor variance explained per omics
# 4. Cluster samples based on factor scores
# Example interpretation:
# Factor 1: Explains 40% variance in RNA-seq, 30% in proteomics → Cell proliferation
# Factor 2: Explains 50% variance in methylation → Epigenetic subtype
# Factor 3: Explains 20% variance in CNV → Genomic instability
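Step 3, variance explained per factor and per omics, can be illustrated with plain NumPy once factor scores and loadings are available. This sketch uses the coefficient-of-determination definition and assumes feature-centered data. It does not call MOFA2 or mofapy2, and the function name is hypothetical.

```python
import numpy as np


def variance_explained(view, factors, weights):
    """
    R^2 per factor for one omics view.

    view: array (samples x features), feature-centered
    factors: array (samples x k) of factor scores
    weights: array (features x k) of loadings for this view
    """
    total = np.sum(view ** 2)
    r2 = []
    for k in range(factors.shape[1]):
        # Rank-1 reconstruction of the view from factor k alone
        reconstruction = np.outer(factors[:, k], weights[:, k])
        r2.append(1.0 - np.sum((view - reconstruction) ** 2) / total)
    return np.array(r2)
```

Running this once per omics view yields a factors x views table, which is what statements like "Factor 2 explains 50% of variance in methylation" summarize.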
Method 2: Joint NMF (Non-negative Matrix Factorization)
Decompose multi-omics matrices into shared latent components.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

def joint_nmf_clustering(omics_data_dict, n_clusters=3):
    """
    Perform joint NMF across omics for clustering.

    Inputs must be non-negative (features x samples) with matched sample order.
    Returns patient cluster assignments based on shared factors.
    """
    # Stack omics matrices along the feature axis (after normalization)
    combined_matrix = np.vstack([
        omics_data_dict['rnaseq'].values,
        omics_data_dict['proteomics'].values,
        omics_data_dict['methylation'].values
    ])

    # Run NMF
    model = NMF(n_components=n_clusters, init='nndsvd', random_state=42)
    W = model.fit_transform(combined_matrix)  # Feature loadings (features x components)
    H = model.components_  # Sample coefficients (components x samples)

    # Cluster samples based on H (components)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(H.T)
    return clusters, W, H
Method 3: Similarity Network Fusion (SNF)
Integrate omics through patient similarity networks.
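The idea can be sketched in NumPy as follows. This is a simplified illustration in the spirit of SNF, not the published algorithm or the SNFtool API: it uses a plain Gaussian kernel with a median-distance bandwidth instead of the scaled exponential kernel from the paper, requires at least two views with identical sample order, and all function names are hypothetical.

```python
import numpy as np


def affinity_matrix(data):
    """Gaussian affinity between samples (rows of data)."""
    sq_norms = np.sum(data ** 2, axis=1)
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * data @ data.T, 0)
    dist = np.sqrt(sq_dist)
    # Bandwidth: median of the pairwise distances
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    return np.exp(-sq_dist / (2 * sigma ** 2))


def row_normalize(matrix):
    return matrix / matrix.sum(axis=1, keepdims=True)


def knn_kernel(affinity, k):
    """Keep the k largest affinities per row (self included), then row-normalize."""
    sparse = np.zeros_like(affinity)
    top = np.argsort(affinity, axis=1)[:, -k:]
    rows = np.arange(affinity.shape[0])[:, None]
    sparse[rows, top] = affinity[rows, top]
    return row_normalize(sparse)


def fuse_networks(views, k=3, n_iter=10):
    """
    views: list of >= 2 arrays (samples x features), same sample order.
    Returns a symmetric fused samples x samples similarity matrix.
    """
    full = [row_normalize(affinity_matrix(v)) for v in views]
    local = [knn_kernel(affinity_matrix(v), k) for v in views]
    for _ in range(n_iter):
        updated = []
        for i, s in enumerate(local):
            others = np.mean([p for j, p in enumerate(full) if j != i], axis=0)
            # Cross-diffusion: spread the other views' similarities through
            # this view's local neighbourhood structure
            updated.append(row_normalize(s @ others @ s.T))
        full = updated
    fused = np.mean(full, axis=0)
    return (fused + fused.T) / 2
```

The fused matrix can be passed to spectral clustering to obtain patient subtypes. Note that the views here are samples x features, the transpose of the features x samples matrices used elsewhere in this document.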
Phase 6: Pathway-Level Integration
Objective: Aggregate multi-omics evidence at the pathway level.
Approach: Score pathway dysregulation using combined evidence from multiple omics.
import numpy as np

def integrate_pathway_evidence(omics_results, pathway_genes):
    """
    Score pathway dysregulation across omics.

    omics_results: {
        'rnaseq': {'gene': fold_change},
        'proteomics': {'gene': fold_change},
        'methylation': {'gene': methylation_diff},
        'cnv': {'gene': log2_copy_ratio}
    }
    pathway_genes: List of genes in pathway
    """
    omics_types = ('rnaseq', 'proteomics', 'methylation', 'cnv')
    pathway_scores = []
    omics_with_evidence = set()

    for gene in pathway_genes:
        # Absolute effect from every omics layer that measured this gene
        # (direction is ignored, so methylation changes count in either direction)
        evidence = {
            omics: abs(omics_results[omics][gene])
            for omics in omics_types
            if gene in omics_results.get(omics, {})
        }
        if evidence:
            pathway_scores.append(sum(evidence.values()) / len(evidence))
            omics_with_evidence.update(evidence)

    # Aggregate pathway score (mean of gene scores)
    pathway_score = np.mean(pathway_scores) if pathway_scores else 0
    return {
        'pathway_score': pathway_score,
        'n_genes_with_evidence': len(pathway_scores),
        # Distinct omics layers contributing to any gene in the pathway
        'n_omics_types': len(omics_with_evidence)
    }
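Fold changes, methylation differences, and copy-number ratios live on different scales, so averaging their raw absolute values lets the layer with the widest range dominate. One possible remedy is to rescale each layer before combining. This is a sketch, not part of the original scoring function: the median-absolute-effect scaling and both function names are assumptions.

```python
from statistics import median


def scale_layer(effects):
    """Divide absolute effects by the layer's median non-zero absolute effect."""
    magnitudes = {gene: abs(value) for gene, value in effects.items()}
    nonzero = [m for m in magnitudes.values() if m > 0]
    scale = median(nonzero) if nonzero else 1.0
    return {gene: m / scale for gene, m in magnitudes.items()}


def scaled_pathway_score(omics_results, pathway_genes):
    """Mean over pathway genes of the mean scaled effect across available layers."""
    scaled = {name: scale_layer(effects) for name, effects in omics_results.items()}
    gene_scores = []
    for gene in pathway_genes:
        values = [layer[gene] for layer in scaled.values() if gene in layer]
        if values:
            gene_scores.append(sum(values) / len(values))
    return sum(gene_scores) / len(gene_scores) if gene_scores else 0.0
```

After scaling, a value of 1.0 means "a typical effect for that layer", which makes contributions from different omics comparable.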
Use ToolUniverse enrichment tools:
# Get pathways for gene set
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()  # Load tool definitions before calling any tool

# Enrichment for genes dysregulated in ANY omics
# (rnaseq_degs, diff_proteins, methylation_dmgs: gene lists from the per-omics analyses)
all_dysregulated_genes = set()
all_dysregulated_genes.update(rnaseq_degs)
all_dysregulated_genes.update(diff_proteins)
all_dysregulated_genes.update(methylation_dmgs)

# Run enrichment
enrichment = tu.run_one_function({
    "name": "enrichr_enrich",
    "arguments": {
        "gene_list": ",".join(sorted(all_dysregulated_genes)),
        "library": "KEGG_2021_Human"
    }
})

# Score each pathway with multi-omics evidence
for pathway in enrichment['data']['results']:
    pathway_genes = pathway['genes']
    pathway['multi_omics_score'] = integrate_pathway_evidence(
        omics_results, pathway_genes
    )
Phase 7: Biomarker Discovery
Objective: Identify multi-omics signatures for disease classification.
Feature selection across omics:
def select_multiomics_features(X_dict, y, n_features=50):
    """
    Select top features across omics for classification.

    X_dict: {
        'rnaseq': DataFrame (samples x genes),
        'proteomics': DataFrame (samples x proteins),
        'methylation': DataFrame (samples x CpGs)
    }
    y: Target labels (disease vs control)

    Returns: Selected features per omics
    """
    from sklearn.feature_selection import SelectKBest, f_classif

    selected_features = {}
    for omics_type, X in X_dict.items():
        selector = SelectKBest(f_classif, k=min(n_features, X.shape[1]))
        selector.fit(X, y)

        # Get selected feature names
        selected_idx = selector.get_support()
        selected_features[omics_type] = X.columns[selected_idx].tolist()
    return selected_features
Multi-omics classification:
def multiomics_classification(X_dict, y, selected_features):
    """
    Train classifier using multi-omics features.

    Note: if selected_features was chosen using all samples, the
    cross-validated AUC is optimistically biased. For an unbiased
    estimate, repeat feature selection inside each training fold.
    """
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Concatenate selected features from each omics
    X_combined = []
    for omics_type, features in selected_features.items():
        X_combined.append(X_dict[omics_type][features])
    X_combined = pd.concat(X_combined, axis=1)

    # Train classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_combined, y, cv=5, scoring='roc_auc')
    return {
        'mean_auc': scores.mean(),
        'std_auc': scores.std(),
        'n_features': X_combined.shape[1],
        'features_per_omics': {k: len(v) for k, v in selected_features.items()}
    }
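When features are selected on the full dataset and then cross-validated, information from the held-out folds leaks into the selection step. A leakage-free variant wraps selection and classifier in one scikit-learn pipeline so that selection is refit in every training fold. The sketch below is a simplification: it selects from a single concatenated matrix (samples x features) instead of per omics, and the function name is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def cross_validated_auc(X, y, n_features=50, n_splits=5):
    """ROC AUC with feature selection refit inside every training fold."""
    pipeline = make_pipeline(
        SelectKBest(f_classif, k=min(n_features, X.shape[1])),
        RandomForestClassifier(n_estimators=100, random_state=42),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
    return scores.mean(), scores.std()
```

With small cohorts the gap between leaky and leakage-free estimates can be large, so the pipeline version is the safer number to report.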
Phase 8: Integrated Reporting
Generate comprehensive multi-omics report:
# Multi-Omics Integration Report
## Dataset Summary
- **Omics Types**: RNA-seq, Proteomics, Methylation, CNV
- **Common Samples**: 45 patients (30 disease, 15 control)
- **Features**: 15,000 genes, 5,000 proteins, 450K CpGs, 20K CNV regions
## Cross-Omics Correlation
### RNA-Protein Correlation
- **Overall correlation**: r = 0.52 (expected: 0.4-0.6)
- **Highly correlated**: 3,245 genes (45%)
- **Discordant genes**: 890 genes (post-transcriptional regulation)
### Methylation-Expression
- **Promoter methylation**: Anticorrelation r = -0.41
- **Epigenetically regulated genes**: 1,256 genes (p < 0.01)
- **Example**: BRCA1 promoter hypermethylation → 3-fold reduced expression
### CNV-Expression Dosage Effect
- **Genes with dosage effect**: 445 genes (r > 0.5, p < 0.01)
- **Example**: MYC amplification (3 copies) → 2.8-fold increased expression
## Multi-Omics Clustering
### MOFA+ Analysis
- **Factor 1** (25% variance): Cell cycle genes (RNA + protein)
- **Factor 2** (18% variance): Immune signature (RNA + methylation)
- **Factor 3** (15% variance): Metabolic reprogramming (RNA + metabolites)
### Patient Subtypes
- **Subtype 1** (n=18): High proliferation, MYC amplification
- **Subtype 2** (n=15): Immune-enriched, hypomethylation
- **Subtype 3** (n=12): Metabolic dysregulation, mitochondrial dysfunction
## Pathway Integration
### Top Dysregulated Pathways (Multi-Omics Score)
1. **Cell Cycle** (score: 8.5) - RNA (↑), Protein (↑), CNV (amplification)
2. **Immune Response** (score: 7.2) - RNA (↑), Methylation (hypo)
3. **Glycolysis** (score: 6.8) - RNA (↑), Metabolites (↑)
## Multi-Omics Biomarkers
### Classification Performance
- **AUC**: 0.92 ± 0.04 (5-fold CV)
- **Features**: 50 total (20 RNA, 15 protein, 10 methylation, 5 CNV)
- **Top biomarkers**:
- MYC expression (RNA)
- CDK1 protein abundance
- BRCA1 promoter methylation
- TP53 CNV status
## Biological Interpretation
The multi-omics analysis reveals three distinct disease subtypes driven by different molecular mechanisms:
1. **Proliferative subtype**: Characterized by MYC amplification driving coordinated upregulation of cell cycle genes at both RNA and protein levels.
2. **Immune subtype**: Hypomethylation of immune genes leading to increased expression and T-cell infiltration.
3. **Metabolic subtype**: Shift from oxidative phosphorylation to glycolysis, with concordant changes in enzyme expression and metabolite levels.
These subtypes may respond differently to targeted therapies.
ToolUniverse Skills Coordination
This skill orchestrates multiple specialized skills:
| Skill | Used For | Phase |
|---|---|---|
| tooluniverse-rnaseq-deseq2 | Load and analyze RNA-seq data | Phase 1, 4 |
| tooluniverse-epigenomics | Methylation analysis, ChIP-seq peaks | Phase 1, 4 |
| tooluniverse-variant-analysis | CNV and SNV processing | Phase 1, 3, 4 |
| tooluniverse-protein-interactions | Protein network context | Phase 6 |
| tooluniverse-gene-enrichment | Pathway enrichment | Phase 6 |
| tooluniverse-expression-data-retrieval | Public omics data retrieval | Phase 1 |
| tooluniverse-target-research | Gene/protein annotation | Phase 3, 8 |
Example Use Cases
Use Case 1: Cancer Multi-Omics
Question: "Integrate TCGA breast cancer RNA-seq, proteomics, methylation, and CNV data"
Workflow:
- Load 4 omics types for 500 patients
- Match samples (450 common across all omics)
- Correlate RNA-protein (identify translation-regulated genes)
- Correlate methylation-expression (find epigenetically silenced genes)
- Correlate CNV-expression (identify dosage-sensitive genes)
- Run MOFA+ to find latent factors
- Identify 4 subtypes with distinct multi-omics profiles
- Perform pathway enrichment per subtype
- Select multi-omics biomarkers (AUC=0.94)
Use Case 2: eQTL + Expression
Question: "How do GWAS variants affect gene expression through methylation?"
Workflow:
- Load genotype data (SNPs from GWAS)
- Load expression data (RNA-seq)
- Load methylation data (450K array)
- For each GWAS SNP:
  - Test association with nearby gene expression (eQTL)
  - Test association with nearby CpG methylation (meQTL)
  - Test CpG-gene correlation
- Identify SNP → methylation → expression regulatory chains
- Annotate with ToolUniverse (GWAS traits, gene function)
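The per-SNP tests in this workflow can be sketched for a single SNP-CpG-gene triplet. This is an illustration only: it uses Pearson correlation with a hypothetical `min_abs_r` cutoff in place of proper eQTL/meQTL regression with covariates, and three pairwise correlations are consistent with mediation but do not prove it.

```python
import numpy as np


def regulatory_chain(genotype, methylation, expression, min_abs_r=0.3):
    """
    Check one SNP-CpG-gene triplet for a SNP -> methylation -> expression pattern.

    genotype: allele dosages (0, 1, 2) per sample
    methylation: CpG beta values per sample
    expression: gene expression per sample
    """
    links = {
        'eqtl': np.corrcoef(genotype, expression)[0, 1],
        'meqtl': np.corrcoef(genotype, methylation)[0, 1],
        'cpg_gene': np.corrcoef(methylation, expression)[0, 1],
    }
    # The chain is supported only if all three links pass the cutoff
    supported = all(abs(r) >= min_abs_r for r in links.values())
    return {**{k: float(v) for k, v in links.items()}, 'chain_supported': bool(supported)}
```

A formal follow-up would test whether the SNP-expression association weakens after conditioning on methylation.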
Use Case 3: Drug Response Multi-Omics
Question: "Predict drug response using multi-omics profiles"
Workflow:
- Load baseline multi-omics (pre-treatment)
- Load drug response data (IC50 or clinical response)
- Correlate each omics with response
- Select multi-omics features predictive of response
- Train multi-omics classifier
- Identify pathways associated with resistance/sensitivity
- Use ToolUniverse drug-repurposing skill for alternative options
Advanced Analysis Patterns
Pattern 1: Omics-Driven Patient Stratification
For precision medicine applications where patient stratification is the goal.
Pattern 2: Multi-Omics Network Analysis
Build integrated networks combining PPI, co-expression, regulatory interactions.
Pattern 3: Temporal Multi-Omics
Longitudinal multi-omics data (time-series or treatment response).
Pattern 4: Spatial Multi-Omics
Spatial transcriptomics + proteomics for tissue architecture.
Quantified Minimums
| Component | Requirement |
|---|---|
| Omics types | At least 2 omics datasets |
| Common samples | At least 10 samples across omics |
| Cross-correlation | Pearson/Spearman correlation computed |
| Clustering | At least one method (MOFA+, NMF, or SNF) |
| Pathway integration | Enrichment with multi-omics evidence scores |
| Report | Summary, correlations, clusters, pathways, biomarkers |
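The first two rows of the table are easy to check before any analysis starts. A minimal sketch, with a hypothetical function name and defaults taken from the table:

```python
def check_minimums(sample_ids_by_omics, min_omics=2, min_common_samples=10):
    """
    sample_ids_by_omics: dict mapping omics name -> iterable of sample IDs
    Returns (ok, problems), where problems lists every unmet requirement.
    """
    problems = []
    n_omics = len(sample_ids_by_omics)
    if n_omics < min_omics:
        problems.append(f"need at least {min_omics} omics datasets, got {n_omics}")

    # Samples present in every provided omics layer
    id_sets = [set(ids) for ids in sample_ids_by_omics.values()]
    common = set.intersection(*id_sets) if id_sets else set()
    if len(common) < min_common_samples:
        problems.append(
            f"need at least {min_common_samples} common samples, got {len(common)}"
        )
    return not problems, problems
```

Returning every unmet requirement at once, instead of stopping at the first, lets the report state exactly what is missing.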
Limitations
- Sample size: Multi-omics integration requires sufficient samples (n≥20 recommended)
- Missing data: Some patients may not have all omics types
- Batch effects: Different omics platforms/batches require careful normalization
- Computational: Large multi-omics datasets may require significant memory/compute
- Interpretation: Multi-omics results require domain expertise for biological validation
References
Methods:
- MOFA+: https://doi.org/10.1186/s13059-020-02015-1
- Similarity Network Fusion: https://doi.org/10.1038/nmeth.2810
- Multi-omics review: https://doi.org/10.1038/s41576-019-0093-7
ToolUniverse Skills:
- See individual skill documentation for omics-specific methods