skills/mims-harvard/tooluniverse/tooluniverse-crispr-screen-analysis

ToolUniverse CRISPR Screen Analysis

Comprehensive skill for analyzing CRISPR-Cas9 genetic screens to identify essential genes, synthetic lethal interactions, and therapeutic targets through robust statistical analysis and pathway enrichment.

Overview

CRISPR screens enable genome-wide functional genomics by systematically perturbing genes and measuring fitness effects. This skill provides an 8-phase workflow for:

  • Processing sgRNA count matrices
  • Quality control and normalization
  • Gene-level essentiality scoring (MAGeCK-like and BAGEL-like approaches)
  • Synthetic lethality detection
  • Pathway enrichment analysis
  • Drug target prioritization with DepMap integration
  • Integration with expression and mutation data

Core Workflow

Phase 1: Data Import & sgRNA Count Processing

Load sgRNA count matrix (MAGeCK format or generic TSV). Expected columns: sgRNA, Gene, plus sample columns. Create experimental design table linking samples to conditions (baseline/treatment) with replicate assignments.
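A minimal sketch of this import step, written in plain Python so it carries no dependencies. The function names `read_count_table` and `build_design` are hypothetical and are not the skill's own helpers; the only thing taken from the text above is the expected column layout (sgRNA, Gene, then one column per sample).

```python
import csv
import io

def read_count_table(handle):
    """Parse a tab-separated table with columns: sgRNA, Gene, <sample columns>."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in ("sgRNA", "Gene")]
    counts = {}         # sgRNA id -> {sample: read count}
    sgrna_to_gene = {}  # sgRNA id -> gene symbol
    for row in reader:
        sgrna = row["sgRNA"]
        sgrna_to_gene[sgrna] = row["Gene"]
        counts[sgrna] = {s: int(row[s]) for s in samples}
    return counts, {"samples": samples, "sgrna_to_gene": sgrna_to_gene}

def build_design(samples, conditions):
    """Link each sample to a condition and number replicates within each condition."""
    if len(samples) != len(conditions):
        raise ValueError("samples and conditions must have the same length")
    seen = {}
    design = []
    for sample, condition in zip(samples, conditions):
        seen[condition] = seen.get(condition, 0) + 1
        design.append(
            {"sample": sample, "condition": condition, "replicate": seen[condition]}
        )
    return design

# Toy input standing in for a real count file (values are made up).
example = io.StringIO(
    "sgRNA\tGene\tT0_1\tT14_1\n"
    "sg1\tGENE_A\t120\t15\n"
    "sg2\tGENE_A\t95\t20\n"
)
counts, meta = read_count_table(example)
design = build_design(meta["samples"], ["baseline", "treatment"])
```

For a file on disk, the same function can be called as `read_count_table(open(path, newline=""))`.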

Phase 2: Quality Control & Filtering

Assess sgRNA distribution quality:

  • Library sizes per sample (total reads)
  • Zero-count sgRNAs: Count across samples
  • Low-count filtering: Remove sgRNAs with fewer than 30 reads (default threshold) in more than N-2 of the N samples
  • Gini coefficient: Assess distribution skewness per sample
  • Report filtering recommendations
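The QC metrics above can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, the count data is made up, and the filter reads "<30 reads in >N-2 samples" as "drop an sgRNA when more than N-2 samples fall below the threshold".

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 = perfectly even distribution)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form over ascending-sorted values with 1-based index i.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def qc_summary(counts, samples):
    """Per-sample library size, number of zero-count sgRNAs, and Gini coefficient."""
    summary = {}
    for s in samples:
        column = [row[s] for row in counts.values()]
        summary[s] = {
            "library_size": sum(column),
            "zero_count": sum(1 for c in column if c == 0),
            "gini": gini(column),
        }
    return summary

def filter_low_counts(counts, samples, min_reads=30):
    """Keep sgRNAs unless they are below min_reads in more than N-2 samples."""
    max_low = len(samples) - 2
    return {
        sg: row
        for sg, row in counts.items()
        if sum(1 for s in samples if row[s] < min_reads) <= max_low
    }

samples = ["A", "B", "C", "D"]
counts = {
    "sg1": {"A": 100, "B": 120, "C": 90, "D": 110},
    "sg2": {"A": 5, "B": 0, "C": 10, "D": 200},   # low in 3 of 4 samples: removed
    "sg3": {"A": 10, "B": 20, "C": 50, "D": 60},  # low in 2 of 4 samples: kept
}
qc = qc_summary(counts, samples)
kept = filter_low_counts(counts, samples)
```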

Phase 3: Normalization

Normalize sgRNA counts to account for library size differences:

  • Median ratio (DESeq2-like): Calculate geometric mean reference, compute size factors as median of ratios
  • Total count (CPM-like): Divide by library size in millions

Calculate log2 fold changes (LFC) between treatment and control conditions with pseudocount.
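As a rough sketch of both normalization options and the LFC step: size factors are computed as the exponentiated median of log ratios to the per-sgRNA geometric mean, which is the usual way the median-of-ratios idea is implemented. Function names, the pseudocount default of 1.0, and the toy counts are assumptions made for illustration.

```python
import math
from statistics import median

def median_ratio_size_factors(counts, samples):
    """Size factor per sample: median ratio of counts to the geometric-mean reference."""
    log_ratios = {s: [] for s in samples}
    for row in counts.values():
        values = [row[s] for s in samples]
        if min(values) <= 0:
            continue  # geometric mean is undefined with zeros, so skip this sgRNA
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s in samples:
            log_ratios[s].append(math.log(row[s]) - log_geo_mean)
    # Raises statistics.StatisticsError if every sgRNA contained a zero.
    return {s: math.exp(median(log_ratios[s])) for s in samples}

def normalize(counts, size_factors):
    """Divide each count by its sample's size factor."""
    return {
        sg: {s: c / size_factors[s] for s, c in row.items()}
        for sg, row in counts.items()
    }

def cpm(counts, samples):
    """Total-count alternative: counts per million reads in each sample."""
    lib = {s: sum(row[s] for row in counts.values()) for s in samples}
    return {sg: {s: row[s] / lib[s] * 1e6 for s in samples} for sg, row in counts.items()}

def log2_fold_change(norm, control, treatment, pseudocount=1.0):
    """log2((mean treatment + pseudocount) / (mean control + pseudocount)) per sgRNA."""
    lfc = {}
    for sg, row in norm.items():
        ctrl = sum(row[s] for s in control) / len(control)
        trt = sum(row[s] for s in treatment) / len(treatment)
        lfc[sg] = math.log2((trt + pseudocount) / (ctrl + pseudocount))
    return lfc

# Toy data: T14 was sequenced twice as deeply as T0, and sg4 is depleted.
counts = {
    "sg1": {"T0": 100, "T14": 200},
    "sg2": {"T0": 50, "T14": 100},
    "sg3": {"T0": 10, "T14": 20},
    "sg4": {"T0": 80, "T14": 20},
}
samples = ["T0", "T14"]
size_factors = median_ratio_size_factors(counts, samples)
norm = normalize(counts, size_factors)
cpm_norm = cpm(counts, samples)
lfc = log2_fold_change(norm, control=["T0"], treatment=["T14"])
```

In this toy case the size factors absorb the twofold depth difference, so sg1 to sg3 end up with an LFC near zero while sg4 stays clearly negative.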

Phase 4: Gene-Level Scoring

Two scoring approaches:

  • MAGeCK-like (RRA): Rank all sgRNAs by LFC in ascending order (most depleted first) and compute the mean rank per gene, a simplified stand-in for MAGeCK's robust rank aggregation. Lower mean rank = more essential. Output includes sgRNA count and mean LFC per gene.
  • BAGEL-like (Bayes Factor): Use reference essential/non-essential gene sets to estimate LFC distributions, then calculate a likelihood ratio (Bayes Factor) for each gene. Higher BF = more likely essential.
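Both scoring ideas can be sketched in a few lines. These are deliberate simplifications and should not be read as MAGeCK or BAGEL themselves: the first averages ranks instead of running robust rank aggregation, and the second fits a single normal distribution to each reference set, whereas BAGEL uses kernel density estimates with resampling. All names and LFC values below are hypothetical.

```python
import math
from statistics import mean, stdev

def mean_rank_scores(lfc, sgrna_to_gene):
    """Rank sgRNAs by LFC (1 = most depleted) and average the ranks per gene."""
    ordered = sorted(lfc, key=lfc.get)  # ties keep input order
    rank = {sg: i for i, sg in enumerate(ordered, start=1)}
    by_gene = {}
    for sg in lfc:
        by_gene.setdefault(sgrna_to_gene[sg], []).append(sg)
    scores = {
        gene: {
            "mean_rank": mean(rank[sg] for sg in sgs),
            "mean_lfc": mean(lfc[sg] for sg in sgs),
            "n_sgrnas": len(sgs),
        }
        for gene, sgs in by_gene.items()
    }
    # Most essential (lowest mean rank) first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1]["mean_rank"]))

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def bayes_factors(lfc, sgrna_to_gene, essential_ref, nonessential_ref):
    """Sum of per-sgRNA log2 likelihood ratios (essential vs non-essential model)."""
    ess = [v for sg, v in lfc.items() if sgrna_to_gene[sg] in essential_ref]
    non = [v for sg, v in lfc.items() if sgrna_to_gene[sg] in nonessential_ref]
    mu_e, sd_e = mean(ess), stdev(ess)  # needs >= 2 reference sgRNAs per set
    mu_n, sd_n = mean(non), stdev(non)
    bf = {}
    for sg, v in lfc.items():
        gene = sgrna_to_gene[sg]
        log_ratio = normal_logpdf(v, mu_e, sd_e) - normal_logpdf(v, mu_n, sd_n)
        bf[gene] = bf.get(gene, 0.0) + log_ratio / math.log(2)
    return bf

lfc = {"e1": -3.0, "e2": -2.6, "e3": -3.4, "n1": 0.1, "n2": -0.3, "n3": 0.2,
       "x1": -3.2, "x2": -3.1, "y1": 0.0, "y2": 0.15}
sgrna_to_gene = {"e1": "ESS", "e2": "ESS", "e3": "ESS", "n1": "NON", "n2": "NON",
                 "n3": "NON", "x1": "X", "x2": "X", "y1": "Y", "y2": "Y"}
scores = mean_rank_scores(lfc, sgrna_to_gene)
bf = bayes_factors(lfc, sgrna_to_gene, essential_ref={"ESS"}, nonessential_ref={"NON"})
```

Here gene X behaves like the essential reference (lowest mean rank, positive BF) and gene Y like the non-essential reference (negative BF).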

Phase 5: Synthetic Lethality Detection

Compare essentiality scores between wildtype and mutant cell lines:

  • Merge gene scores, calculate delta LFC and delta rank
  • Filter for genes essential in mutant (LFC < threshold) but not wildtype (LFC > -0.5) with large rank change
  • Sort by differential essentiality

Check candidate hits against known dependencies in DepMap and the literature using PubMed search.
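The merge-and-filter steps can be sketched as below. Only the wildtype cutoff of -0.5 comes from the text above; the mutant LFC threshold of -1.0 and the minimum rank change of 100 are placeholder assumptions, as are the function name, the input layout, and the gene values.

```python
def find_synthetic_lethal(wt, mut, mutant_lfc_threshold=-1.0,
                          wildtype_lfc_floor=-0.5, min_rank_change=100):
    """wt and mut map gene -> {"mean_lfc": float, "rank": int} (rank 1 = most essential)."""
    hits = []
    for gene in wt.keys() & mut.keys():  # only genes scored in both screens
        delta_lfc = mut[gene]["mean_lfc"] - wt[gene]["mean_lfc"]
        delta_rank = wt[gene]["rank"] - mut[gene]["rank"]  # positive = more essential in mutant
        if (mut[gene]["mean_lfc"] < mutant_lfc_threshold
                and wt[gene]["mean_lfc"] > wildtype_lfc_floor
                and delta_rank >= min_rank_change):
            hits.append({"gene": gene, "delta_lfc": delta_lfc, "delta_rank": delta_rank})
    # Largest drop in LFC (strongest differential essentiality) first.
    hits.sort(key=lambda h: h["delta_lfc"])
    return hits

wt = {
    "GENE_A": {"mean_lfc": -0.1, "rank": 900},
    "GENE_B": {"mean_lfc": -3.0, "rank": 5},    # essential in both: not a hit
    "GENE_C": {"mean_lfc": 0.0, "rank": 800},
}
mut = {
    "GENE_A": {"mean_lfc": -2.0, "rank": 20},
    "GENE_B": {"mean_lfc": -3.1, "rank": 4},
    "GENE_C": {"mean_lfc": -0.2, "rank": 700},  # not depleted enough in mutant
}
hits = find_synthetic_lethal(wt, mut)
```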

Phase 6: Pathway Enrichment Analysis

Submit top essential genes to Enrichr for pathway enrichment:

  • KEGG pathways
  • GO Biological Process
  • Retrieve enriched terms with p-values and gene lists
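Enrichr computes its statistics server-side, so nothing here reproduces its API. As a local illustration of what an over-representation p-value means, the sketch below evaluates the one-sided hypergeometric tail (equivalent to a one-sided Fisher's exact test) for the overlap between a query gene list and a pathway. The function name and the numbers are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_in_term, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn from the background."""
    upper = min(n_in_term, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# 10 background genes, 4 in the pathway, 3 query genes, 2 of them in the pathway.
p = hypergeom_enrichment_p(n_background=10, n_in_term=4, n_query=3, n_overlap=2)
```

In the toy case the tail is (6*6 + 4*1) / 120 = 1/3, so an overlap of 2 out of 3 is unremarkable against such a small background.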

Phase 7: Drug Target Prioritization

Composite scoring combining:

  • Essentiality (50% weight): Normalized mean LFC from CRISPR screen
  • Expression (30% weight): Log2 fold change from RNA-seq (if available)
  • Druggability (20% weight): Number of drug interactions from DGIdb

Query DGIdb for each candidate gene to find existing drugs, interaction types, and sources.
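A minimal sketch of the 50/30/20 composite score. The weights come from the list above; everything else is an assumption made for illustration: min-max scaling of each component to [0, 1], negating mean LFC so that stronger depletion scores higher, and treating missing expression or DGIdb data as zero.

```python
def min_max(values):
    """Scale a {gene: value} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

def priority_scores(mean_lfc, expression_lfc, n_drug_interactions,
                    weights=(0.5, 0.3, 0.2)):
    """Rank genes by weighted essentiality, expression, and druggability."""
    genes = list(mean_lfc)
    essentiality = min_max({g: -mean_lfc[g] for g in genes})  # more depleted = higher
    expression = min_max({g: expression_lfc.get(g, 0.0) for g in genes})
    druggability = min_max({g: n_drug_interactions.get(g, 0) for g in genes})
    w_ess, w_expr, w_drug = weights
    scores = {
        g: w_ess * essentiality[g] + w_expr * expression[g] + w_drug * druggability[g]
        for g in genes
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = priority_scores(
    mean_lfc={"GENE_A": -3.0, "GENE_B": -1.0, "GENE_C": 0.0},
    expression_lfc={"GENE_A": 2.0, "GENE_B": 0.0},  # GENE_C has no RNA-seq value
    n_drug_interactions={"GENE_A": 0, "GENE_B": 10, "GENE_C": 5},
)
```

With these made-up inputs GENE_A ranks first on essentiality and expression alone, even though it has no recorded drug interactions, which shows how strongly the 50% essentiality weight dominates.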

Phase 8: Report Generation

Generate markdown report with:

  • Summary statistics (total genes, essential genes, non-essential genes)
  • Top 20 essential genes table (rank, gene, mean LFC, sgRNAs, score)
  • Pathway enrichment results (top 10 terms per database)
  • Drug target candidates (rank, gene, essentiality, expression FC, druggability, priority score)
  • Methods section
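The first two report sections could be rendered roughly as follows. The column order follows the list above; the function names, the input layout, and the mean LFC cutoff of -1.0 used to call a gene essential are assumptions for this sketch.

```python
def top_genes_table(gene_scores, top_n=20):
    """Markdown table of the top_n genes; gene_scores must already be sorted."""
    header = "| Rank | Gene | Mean LFC | sgRNAs | Score |"
    divider = "|---|---|---|---|---|"
    rows = [
        f"| {i} | {g['gene']} | {g['mean_lfc']:.2f} | {g['n_sgrnas']} | {g['score']:.2f} |"
        for i, g in enumerate(gene_scores[:top_n], start=1)
    ]
    return "\n".join([header, divider, *rows])

def build_report(gene_scores, lfc_cutoff=-1.0):
    """Summary statistics followed by the top essential genes table."""
    n_essential = sum(1 for g in gene_scores if g["mean_lfc"] < lfc_cutoff)
    lines = [
        "# CRISPR Screen Report",
        "",
        "## Summary",
        "",
        f"- Total genes: {len(gene_scores)}",
        f"- Essential genes: {n_essential}",
        f"- Non-essential genes: {len(gene_scores) - n_essential}",
        "",
        "## Top essential genes",
        "",
        top_genes_table(gene_scores),
    ]
    return "\n".join(lines)

gene_scores = [
    {"gene": "GENE_A", "mean_lfc": -2.5, "n_sgrnas": 4, "score": 3.5},
    {"gene": "GENE_B", "mean_lfc": -0.2, "n_sgrnas": 4, "score": 40.0},
]
report = build_report(gene_scores)
```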

ToolUniverse Tool Integration

Key Tools Used:

  • PubMed_search - Literature search for gene essentiality
  • Enrichr_submit_genelist - Pathway enrichment submission
  • Enrichr_get_results - Retrieve enrichment results
  • DGIdb_query_gene - Drug-gene interactions and druggability
  • STRING_get_network - Protein interaction networks
  • KEGG_get_pathway - Pathway visualization

Expression Integration:

  • GEO_get_dataset - Download expression data
  • ArrayExpress_get_experiment - Alternative expression source

Variant Integration:

  • ClinVar_query_gene - Known pathogenic variants
  • gnomAD_get_gene - Population allele frequencies

Quick Start

import pandas as pd
from tooluniverse import ToolUniverse

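# The helper functions called below are not imported here; this assumes the
# per-phase snippets from ANALYSIS_DETAILS.md have already been defined.

# 1. Load data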
counts, meta = load_sgrna_counts("sgrna_counts.txt")
design = create_design_matrix(['T0_1', 'T0_2', 'T14_1', 'T14_2'],
                               ['baseline', 'baseline', 'treatment', 'treatment'])

# 2. Process
filtered_counts, filtered_mapping = filter_low_count_sgrnas(counts, meta['sgrna_to_gene'])
norm_counts, _ = normalize_counts(filtered_counts)
lfc, _, _ = calculate_lfc(norm_counts, design)

# 3. Score genes
gene_scores = mageck_gene_scoring(lfc, filtered_mapping)

# 4. Enrich pathways
enrichment = enrich_essential_genes(gene_scores, top_n=100)

# 5. Find drug targets
drug_targets = prioritize_drug_targets(gene_scores)

# 6. Generate report
report = generate_crispr_report(gene_scores, enrichment, drug_targets)

References

  • Li W, et al. (2014) MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology
  • Hart T, et al. (2015) High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell
  • Meyers RM, et al. (2017) Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens. Nature Genetics
  • Tsherniak A, et al. (2017) Defining a Cancer Dependency Map. Cell (DepMap)

See Also

  • ANALYSIS_DETAILS.md - Detailed code snippets for all 8 phases
  • USE_CASES.md - Complete use cases (essentiality screen, synthetic lethality, drug target discovery, expression integration) and best practices
  • EXAMPLES.md - Example usage and quick reference
  • QUICK_START.md - Quick start guide
  • FALLBACK_PATCH.md - Fallback patterns for API issues