tooluniverse-proteomics-analysis
Proteomics Analysis
Comprehensive analysis of mass spectrometry-based proteomics data from protein identification through quantification, differential expression, post-translational modifications, and systems-level interpretation.
When to Use This Skill
Triggers:
- User has proteomics data (MS output files)
- Questions about protein abundance or expression
- Differential protein expression analysis requests
- PTM analysis (phosphorylation, acetylation, ubiquitination)
- Protein-RNA correlation analysis
- Multi-omics integration involving proteomics
- Protein complex or interaction analysis
- Proteomics biomarker discovery
Example Questions This Skill Solves:
- "Analyze this MaxQuant output for differential protein expression"
- "Which proteins are significantly upregulated in disease vs control?"
- "Correlate protein abundance with mRNA expression"
- "What post-translational modifications change between conditions?"
- "Identify protein complexes in my co-IP MS data"
- "Which pathways are enriched in differentially expressed proteins?"
- "Find protein biomarkers for disease classification"
- "Compare protein and RNA levels to identify translation-regulated genes"
Core Capabilities
| Capability | Description |
|---|---|
| Data Import | MaxQuant, Spectronaut, DIA-NN, Proteome Discoverer, FragPipe outputs |
| Quality Control | Missing value analysis, intensity distributions, sample clustering |
| Normalization | Median, quantile, TMM, VSN normalization methods |
| Imputation | MinProb, KNN, QRILC for missing values |
| Differential Expression | Limma, DEP, MSstats for statistical testing |
| PTM Analysis | Phospho-site localization, PTM enrichment, kinase prediction |
| Protein-RNA Integration | Correlation analysis, translation efficiency |
| Pathway Enrichment | Over-representation and GSEA for protein sets |
| PPI Analysis | Protein complex detection, interaction networks via STRING/IntAct |
| Reporting | Comprehensive reports with volcano plots, heatmaps, pathway diagrams |
Workflow Overview
Input: MS Proteomics Data
|
v
Phase 1: Data Import & QC
|-- Load MaxQuant/Spectronaut/DIA-NN output
|-- Parse protein groups, intensities, modifications
|-- Quality control plots (missing values, intensity distributions)
|-- Sample correlation and PCA
|
v
Phase 2: Preprocessing
|-- Filter low-confidence proteins
|-- Handle missing values (imputation or filtering)
|-- Log-transform intensities
|-- Normalize across samples
|
v
Phase 3: Differential Expression Analysis
|-- Statistical testing (limma, t-test, ANOVA)
|-- Multiple testing correction (BH, Bonferroni)
|-- Fold change calculation
|-- Significance thresholds (p < 0.05, |log2FC| > 1)
|
v
Phase 4: PTM Analysis (if applicable)
|-- Identify modified peptides
|-- Localization probability filtering
|-- PTM site quantification
|-- Kinase-substrate prediction
|-- PTM enrichment analysis
|
v
Phase 5: Functional Enrichment
|-- Gene Ontology enrichment
|-- KEGG/Reactome pathway enrichment
|-- Protein complex enrichment (CORUM)
|-- Tissue-specific enrichment
|
v
Phase 6: Protein-Protein Interactions
|-- Query STRING for interaction networks
|-- Identify protein complexes
|-- Network clustering and modules
|-- Hub protein identification
|
v
Phase 7: Multi-Omics Integration (optional)
|-- Correlate with RNA-seq data
|-- Identify translation-regulated proteins
|-- Compare with variant/CNV data
|-- Integrate with metabolomics
|
v
Phase 8: Generate Report
|-- Summary statistics
|-- Volcano plots and heatmaps
|-- Pathway diagrams
|-- Protein network visualizations
|-- Multi-omics integration plots
Phase Details
Phase 1: Data Import & Quality Control
Objective: Load proteomics data and assess data quality.
Supported input formats:
MaxQuant (most common):
proteinGroups.txt- Protein-level quantificationevidence.txt- Peptide-level dataPhospho (STY)Sites.txt- Phosphorylation sitesmodificationSpecificPeptides.txt- Other PTMs
Spectronaut:
*_Report.tsv- Protein/peptide quantification- DIA-based quantification
DIA-NN:
report.tsv- Protein groupsreport.pr_matrix.tsv- Protein matrix
Proteome Discoverer:
*_Proteins.txt*_PSMs.txt
Data loading:
def load_maxquant_proteins(protein_groups_file):
"""
Load MaxQuant proteinGroups.txt file.
Returns:
- DataFrame with proteins as rows, samples as columns
- Metadata (protein names, gene names, sequence coverage)
"""
import pandas as pd
# Read file
df = pd.read_csv(protein_groups_file, sep='\t')
# Extract intensity columns (LFQ or raw)
intensity_cols = [col for col in df.columns if 'LFQ intensity' in col or 'Intensity ' in col]
# Create intensity matrix
intensity_matrix = df[intensity_cols].copy()
intensity_matrix.columns = [col.replace('LFQ intensity ', '').replace('Intensity ', '')
for col in intensity_cols]
# Add protein metadata
metadata = df[['Protein IDs', 'Gene names', 'Fasta headers',
'Peptides', 'Sequence coverage [%]']].copy()
return intensity_matrix, metadata
Quality Control:
- Missing value assessment:
def assess_missing_values(intensity_matrix):
"""
Calculate percentage of missing values per protein and sample.
"""
# Per protein
missing_per_protein = (intensity_matrix == 0).sum(axis=1) / intensity_matrix.shape[1]
# Per sample
missing_per_sample = (intensity_matrix == 0).sum(axis=0) / intensity_matrix.shape[0]
# Visualize
plot_missing_value_heatmap(intensity_matrix)
return missing_per_protein, missing_per_sample
- Intensity distribution:
def plot_intensity_distributions(intensity_matrix):
"""
Plot log10 intensity distributions per sample.
Check for consistent distributions across samples.
"""
import matplotlib.pyplot as plt
import numpy as np
log_intensities = np.log10(intensity_matrix.replace(0, np.nan))
# Boxplot per sample
log_intensities.plot(kind='box')
plt.ylabel('log10 Intensity')
plt.title('Intensity Distribution per Sample')
# Should see similar median and spread across samples
- Sample correlation:
def plot_sample_correlation(intensity_matrix):
"""
Calculate and visualize sample-sample correlation.
Expect: High correlation within replicates, lower between conditions.
"""
# Log-transform and remove zeros
log_data = np.log2(intensity_matrix.replace(0, np.nan))
# Correlation matrix
corr_matrix = log_data.corr(method='pearson')
# Heatmap
import seaborn as sns
sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', vmin=0.8, vmax=1.0)
- PCA:
def perform_pca(intensity_matrix, sample_groups):
"""
Principal component analysis for sample clustering.
"""
from sklearn.decomposition import PCA
# Prepare data (log, impute, scale)
log_data = np.log2(intensity_matrix.replace(0, np.nan))
# Simple imputation with minimum value
imputed = log_data.fillna(log_data.min().min())
# PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(imputed.T)
# Plot with group colors
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=sample_groups)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
Phase 2: Preprocessing & Normalization
Objective: Clean data and normalize across samples for fair comparison.
Filtering:
def filter_proteins(intensity_matrix, metadata, min_valid=3):
"""
Filter out low-confidence proteins.
Criteria:
- At least 2 unique peptides (from metadata)
- At least min_valid samples with detected intensity
- Remove contaminants and reverse sequences
"""
# Filter by peptide count
valid_proteins = metadata['Peptides'] >= 2
# Filter by detection in samples
n_detected = (intensity_matrix > 0).sum(axis=1)
valid_detection = n_detected >= min_valid
# Remove contaminants (from MaxQuant)
is_contaminant = metadata['Protein IDs'].str.contains('CON__', na=False)
is_reverse = metadata['Protein IDs'].str.contains('REV__', na=False)
# Combined filter
keep = valid_proteins & valid_detection & ~is_contaminant & ~is_reverse
return intensity_matrix[keep], metadata[keep]
Missing value imputation:
def impute_missing_values(intensity_matrix, method='MinProb'):
"""
Impute missing protein intensities.
Methods:
- MinProb: Random from minimum observed + normal noise (for MNAR)
- KNN: K-nearest neighbors imputation
- QRILC: Quantile regression-based imputation
"""
if method == 'MinProb':
# Assume missing = low abundance (MNAR assumption)
min_val = intensity_matrix[intensity_matrix > 0].min().min()
width = 0.3 # Standard deviation of noise
shift = 1.8 # Downshift from minimum
# Replace zeros with random low values
imputed = intensity_matrix.copy()
missing_mask = imputed == 0
n_missing = missing_mask.sum().sum()
random_vals = np.random.normal(
loc=min_val - shift,
scale=width,
size=n_missing
)
imputed.values[missing_mask.values] = random_vals
return imputed
elif method == 'KNN':
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(
imputer.fit_transform(intensity_matrix.replace(0, np.nan)),
index=intensity_matrix.index,
columns=intensity_matrix.columns
)
return imputed
Normalization:
def normalize_intensities(intensity_matrix, method='median'):
"""
Normalize protein intensities across samples.
Methods:
- median: Divide by median intensity per sample
- quantile: Quantile normalization (same distribution)
- TMM: Trimmed mean of M-values (from edgeR)
- VSN: Variance-stabilizing normalization
"""
if method == 'median':
# Median normalization
medians = intensity_matrix.median(axis=0)
global_median = medians.median()
norm_factors = global_median / medians
normalized = intensity_matrix * norm_factors
return normalized
elif method == 'quantile':
# Quantile normalization
from sklearn.preprocessing import quantile_transform
normalized = pd.DataFrame(
quantile_transform(intensity_matrix, axis=1),
index=intensity_matrix.index,
columns=intensity_matrix.columns
)
return normalized
Phase 3: Differential Expression Analysis
Objective: Identify proteins with significant abundance changes between conditions.
Statistical testing with limma:
def differential_expression_limma(log2_intensities, group1_samples, group2_samples):
"""
Perform differential expression using limma-like approach.
Returns:
- log2 fold changes
- p-values
- adjusted p-values (BH)
"""
from scipy import stats
results = []
for protein in log2_intensities.index:
# Extract intensities for each group
group1 = log2_intensities.loc[protein, group1_samples]
group2 = log2_intensities.loc[protein, group2_samples]
# Calculate statistics
mean1 = group1.mean()
mean2 = group2.mean()
log2fc = mean2 - mean1
# t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
results.append({
'protein': protein,
'log2FC': log2fc,
'mean_group1': mean1,
'mean_group2': mean2,
'p_value': p_value,
't_statistic': t_stat
})
results_df = pd.DataFrame(results)
# Multiple testing correction (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
results_df['adj_p_value'] = multipletests(results_df['p_value'], method='fdr_bh')[1]
# Classify significance
results_df['significant'] = (
(results_df['adj_p_value'] < 0.05) &
(np.abs(results_df['log2FC']) > 1.0)
)
return results_df
Volcano plot:
def plot_volcano(de_results, title='Volcano Plot'):
"""
Visualize differential expression results.
"""
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
# Non-significant
non_sig = de_results[~de_results['significant']]
plt.scatter(non_sig['log2FC'], -np.log10(non_sig['p_value']),
c='gray', alpha=0.5, s=10)
# Significant
sig = de_results[de_results['significant']]
plt.scatter(sig['log2FC'], -np.log10(sig['p_value']),
c='red', alpha=0.7, s=20)
# Thresholds
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', label='p=0.05')
plt.axvline(-1, color='blue', linestyle='--')
plt.axvline(1, color='blue', linestyle='--', label='|log2FC|=1')
plt.xlabel('log2 Fold Change')
plt.ylabel('-log10(p-value)')
plt.title(title)
plt.legend()
Phase 4: PTM Analysis
Objective: Analyze post-translational modifications (phosphorylation, acetylation, etc.)
Phosphoproteomics workflow:
def analyze_phosphosites(phospho_sites_file, intensity_matrix):
"""
Analyze phosphorylation site changes.
Input: MaxQuant Phospho (STY)Sites.txt
Output: Differential phosphorylation per site
"""
# Load phospho data
phospho = pd.read_csv(phospho_sites_file, sep='\t')
# Filter by localization probability
phospho_confident = phospho[phospho['Localization prob'] > 0.75]
# Extract site information
phospho_confident['site'] = (
phospho_confident['Gene names'] + '_' +
phospho_confident['Amino acid'] +
phospho_confident['Position'].astype(str)
)
# Quantification (similar to protein-level analysis)
# ... perform differential analysis ...
return phospho_results
Kinase-substrate prediction:
def predict_kinases(phospho_sites):
"""
Predict upstream kinases for phosphorylation sites.
Uses ToolUniverse PhosphoSitePlus or KEA3 tools.
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
# For each significant phosphosite
kinase_predictions = []
for site in phospho_sites:
# Query kinase-substrate databases
# (would use actual ToolUniverse tool here)
result = tu.run_one_function({
"name": "phosphosite_plus_query", # hypothetical
"arguments": {"site": site}
})
kinase_predictions.append(result)
return kinase_predictions
Phase 5: Functional Enrichment
Objective: Interpret biological meaning of protein changes via pathway analysis.
Gene Ontology enrichment:
def pathway_enrichment_proteins(de_proteins, organism='human'):
"""
Perform pathway enrichment for differentially expressed proteins.
Uses ToolUniverse gene-enrichment skill.
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
# Extract gene names for significant proteins
sig_proteins = de_proteins[de_proteins['significant']]
gene_list = sig_proteins['gene_name'].tolist()
# Run enrichment via ToolUniverse
enrichment = tu.run_one_function({
"name": "enrichr_enrich",
"arguments": {
"gene_list": ",".join(gene_list),
"library": "KEGG_2021_Human"
}
})
return enrichment
Protein complex enrichment:
def protein_complex_enrichment(protein_list):
"""
Test for enrichment of known protein complexes (CORUM database).
"""
# Query CORUM or use ToolUniverse
# Identify if proteins are part of known complexes
pass
Phase 6: Protein-Protein Interactions
Objective: Identify interaction networks and protein complexes.
STRING network analysis:
def build_protein_network(protein_list, confidence=0.7):
"""
Build PPI network using STRING database.
Uses ToolUniverse STRING tools.
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
# Get interactions
interactions = tu.run_one_function({
"name": "string_get_interactions",
"arguments": {
"proteins": ",".join(protein_list),
"species": 9606, # human
"score_threshold": int(confidence * 1000)
}
})
# Build network graph
import networkx as nx
G = nx.Graph()
for interaction in interactions['data']:
G.add_edge(
interaction['protein1'],
interaction['protein2'],
score=interaction['score']
)
return G
Module detection:
def detect_protein_modules(network_graph):
"""
Identify tightly connected protein modules (complexes).
"""
from networkx.algorithms import community
# Detect communities
communities = community.greedy_modularity_communities(network_graph)
# Annotate modules with enriched functions
modules = []
for i, comm in enumerate(communities):
module_proteins = list(comm)
# Run enrichment for this module
enrichment = pathway_enrichment_proteins(module_proteins)
modules.append({
'module_id': i,
'proteins': module_proteins,
'size': len(module_proteins),
'top_function': enrichment['top_terms'][0]
})
return modules
Phase 7: Multi-Omics Integration
Objective: Integrate proteomics with transcriptomics and other omics.
Protein-RNA correlation:
def correlate_protein_rna(protein_data, rna_data, common_samples):
"""
Correlate protein and mRNA levels for each gene.
Expected: r ~ 0.4-0.6 (moderate correlation)
Discordance indicates post-transcriptional regulation
"""
from scipy.stats import spearmanr
# Find common genes
common_genes = set(protein_data.index) & set(rna_data.index)
correlations = {}
for gene in common_genes:
protein = protein_data.loc[gene, common_samples]
rna = rna_data.loc[gene, common_samples]
r, p = spearmanr(protein, rna)
correlations[gene] = {
'r': r,
'p': p,
'regulation': classify_regulation(r, protein.mean(), rna.mean())
}
return correlations
def classify_regulation(r, protein_level, rna_level):
"""
Classify regulatory mechanism based on correlation and levels.
"""
if r > 0.6 and protein_level > 0 and rna_level > 0:
return 'transcriptional_upregulation'
elif r > 0.6 and protein_level < 0 and rna_level < 0:
return 'transcriptional_downregulation'
elif r < 0.2 and protein_level > 0 and rna_level < 0:
return 'translational_upregulation'
elif r < 0.2 and protein_level < 0 and rna_level > 0:
return 'protein_degradation'
else:
return 'mixed_regulation'
Integration with multi-omics skill:
def integrate_with_multiomics(protein_data, rna_data, methylation_data):
"""
Pass proteomics data to multi-omics integration skill.
Enables comprehensive analysis across all molecular layers.
"""
# Prepare for multi-omics skill
omics_data = {
'proteomics': protein_data,
'rnaseq': rna_data,
'methylation': methylation_data
}
# Invoke multi-omics integration skill
from tooluniverse import ToolUniverse
# (Would use Skill tool to invoke tooluniverse-multi-omics-integration)
return integrated_analysis
Phase 8: Report Generation
Generate comprehensive proteomics report:
# Proteomics Analysis Report
## Dataset Summary
- **Samples**: 20 (10 disease, 10 control)
- **Proteins Identified**: 5,432
- **Proteins Quantified**: 4,987 (at least 3 samples)
- **Platform**: Orbitrap Fusion Lumos, MaxQuant 2.0
## Quality Control
- **Missing Values**: 15% average per protein
- **Sample Correlation**: 0.92-0.98 within groups
- **PCA**: Clear separation between disease and control (PC1: 35% variance)
## Differential Expression
- **Significant Proteins**: 432 (adj. p < 0.05, |log2FC| > 1)
- Upregulated: 245 proteins
- Downregulated: 187 proteins
- **Top upregulated**: MYC (log2FC=3.2), EGFR (log2FC=2.8)
- **Top downregulated**: TP53 (log2FC=-2.5), BRCA1 (log2FC=-2.1)
## Phosphoproteomics
- **Phosphosites Quantified**: 8,543
- **Differentially Phosphorylated**: 234 sites (p < 0.05)
- **Top Predicted Kinases**: CDK1, MAPK1, AKT1
## Pathway Enrichment
### Top Pathways (Upregulated)
1. **Cell Cycle** (p=1e-15) - 45 proteins, including cyclins, CDKs
2. **DNA Replication** (p=1e-12) - 23 proteins
3. **Glycolysis** (p=1e-10) - 18 proteins
### Top Pathways (Downregulated)
1. **Apoptosis** (p=1e-14) - 32 proteins, including caspases
2. **DNA Repair** (p=1e-11) - 28 proteins
3. **Oxidative Phosphorylation** (p=1e-9) - 25 proteins
## Protein Network Analysis
- **Network**: 432 nodes, 1,245 edges (STRING confidence > 0.7)
- **Modules Detected**: 8 functional modules
- Module 1: Cell cycle (85 proteins)
- Module 2: Metabolism (62 proteins)
- Module 3: Translation (48 proteins)
## Protein-RNA Correlation
- **Overall Correlation**: r = 0.54 (moderate, expected)
- **High Correlation**: 2,134 genes (r > 0.6) - transcriptional regulation
- **Low Correlation**: 456 genes (r < 0.2) - post-transcriptional regulation
- **Translation-Regulated**: 89 proteins (high protein, low RNA)
## Biological Interpretation
Disease state shows increased proliferation (MYC, cyclins) with concurrent
suppression of apoptosis and DNA repair (TP53, BRCA1). Metabolic shift toward
glycolysis evident at protein level. Post-transcriptional upregulation of
translation machinery suggests adaptation to proliferative demands.
## Potential Biomarkers
Top 10 proteins for disease classification (Random Forest AUC=0.95):
1. MYC (protein)
2. EGFR (protein)
3. CDK1 (phospho-T161)
4. TP53 (protein)
5. BRCA1 (protein)
Integration with ToolUniverse
Skills Coordinated:
| Skill | Used For | Phase |
|---|---|---|
tooluniverse-gene-enrichment |
Pathway enrichment | Phase 5 |
tooluniverse-protein-interactions |
PPI networks | Phase 6 |
tooluniverse-rnaseq-deseq2 |
RNA-seq for integration | Phase 7 |
tooluniverse-multi-omics-integration |
Cross-omics analysis | Phase 7 |
tooluniverse-target-research |
Protein annotation | Phase 8 |
Example Use Cases
Use Case 1: Cancer Proteomics
Question: "Analyze proteomics data from breast cancer vs normal tissue"
Workflow:
- Load MaxQuant proteinGroups.txt
- QC and filter (keep proteins with 2+ peptides, detected in 3+ samples)
- Impute missing, normalize by median
- Differential expression (limma): 432 significant proteins
- Pathway enrichment: Cell cycle, metabolism upregulated
- STRING network: Identify hub proteins (MYC, EGFR)
- Integrate with TCGA RNA-seq: Find translation-regulated genes
- Report: Comprehensive analysis with biomarkers
Use Case 2: Phosphoproteomics Signaling
Question: "What kinase signaling is activated in response to drug treatment?"
Workflow:
- Load Phospho (STY)Sites.txt from MaxQuant
- Filter by localization probability > 0.75
- Differential phosphorylation analysis
- Kinase prediction for significant sites
- Identify MAPK1, CDK1, AKT1 as top kinases
- Pathway enrichment: MAPK, PI3K/AKT pathways
- Report: Drug activates growth signaling
Use Case 3: Protein-RNA Integration
Question: "Which proteins are regulated post-transcriptionally?"
Workflow:
- Load proteomics (MaxQuant) and RNA-seq (DESeq2) data
- Match samples, extract common genes
- Correlate protein and RNA for each gene
- Identify low-correlation genes (r < 0.2)
- Classify: translation upregulation, protein degradation
- Enrichment: Find pathways enriched in post-transcriptional regulation
- Report: 89 translation-regulated proteins, RNA-binding proteins enriched
Quantified Minimums
| Component | Requirement |
|---|---|
| Proteins quantified | At least 500 proteins |
| Replicates | At least 3 per condition |
| Filtering | 2+ unique peptides per protein |
| Statistical test | limma or t-test with multiple testing correction |
| Pathway enrichment | At least one method (GO, KEGG, or Reactome) |
| Report | Summary, QC, DE results, pathways, visualizations |
Limitations
- Platform-specific: Optimized for MS-based proteomics (not Western blot quantification)
- Missing values: High missing rate (>50% per protein) limits statistical power
- PTM analysis: Requires enrichment protocols for comprehensive PTM profiling
- Absolute quantification: Relative abundance only (unless TMT/SILAC used)
- Protein isoforms: Typically collapsed to gene level
- Dynamic range: MS has limited dynamic range vs mRNA sequencing
References
Methods:
- MaxQuant: https://doi.org/10.1038/nbt.1511
- Limma for proteomics: https://doi.org/10.1093/nar/gkv007
- DEP workflow: https://doi.org/10.1038/nprot.2018.107
Databases:
- STRING: https://string-db.org
- PhosphoSitePlus: https://www.phosphosite.org
- CORUM: https://mips.helmholtz-muenchen.de/corum