skills/mims-harvard/tooluniverse/tooluniverse-cell-line-profiling

tooluniverse-cell-line-profiling

Installation
SKILL.md

Cancer Cell Line Profiling and Selection

Comprehensive profiling of cancer cell lines for experimental model selection. Transforms a query (cancer type, gene, or cell line name) into an actionable report covering identity verification, molecular features, gene dependencies, drug sensitivities, and druggable targets.

KEY PRINCIPLES:

  1. Decision-first - Answer "which cell line should I use?" not "here is all the data"
  2. Multi-source validation - Cross-reference DepMap, Cellosaurus, COSMIC, PharmacoDB
  3. Actionable output - Ranked cell line recommendations with rationale
  4. Practical focus - Include availability, growth characteristics, common pitfalls
  5. Gene-aware - When a gene of interest is given, prioritize lines with relevant mutations/dependencies
  6. Source-referenced - Cite database sources for every claim
  7. English-first queries - Always use English terms in tool calls, even if the user writes in another language

LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first rather than reasoning from memory. A database-verified answer is always more reliable than a guess.


COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

When to Use

Apply for: cell line selection by cancer type/gene, cell line profiling, gene dependencies, drug sensitivity queries, cell line comparisons, mutation checks.


Phase 0: Tool Parameter Reference (CRITICAL)

BEFORE calling ANY tool, verify parameters against this table.

Tool Key Parameters Notes
DepMap_search_cell_lines query (required) Search by name, e.g., "A549", "MCF"
DepMap_get_cell_line model_name OR model_id Name: "A549"; ID: "SIDM00001"
DepMap_get_cell_lines tissue, cancer_type, page_size Filter by tissue (e.g., "Lung")
DepMap_get_gene_dependencies gene_symbol (required), model_id Gene effect scores; negative = essential
DepMap_search_genes query (required) Validate gene symbol in DepMap first
cellosaurus_search_cell_lines q (required), size Solr syntax: id:HeLa, ox:9606 AND char:cancer
cellosaurus_get_cell_line_info accession (required, CVCL_ format) Full cell line record
cellosaurus_query_converter query (required) Natural language to Solr syntax
COSMIC_search_mutations terms OR query, max_results Search "BRAF V600E" or gene name
COSMIC_get_mutations_by_gene gene OR gene_name, max_results All mutations for a gene
PharmacoDB_get_cell_line operation="get_cell_line", cell_name Cell line metadata + datasets
PharmacoDB_get_experiments operation="get_experiments", compound_name, cell_line_name, dataset_name, per_page Drug response data (IC50, AAC, EC50)
PharmacoDB_get_biomarker_assoc operation="get_biomarker_associations", compound_name, tissue_name, mdata_type, per_page Gene-drug sensitivity correlations
PharmacoDB_search operation="search", query Find PharmacoDB IDs
CellMarker_search_cancer_markers operation="search_cancer_markers", cancer_type, gene_symbol, cell_type Cancer cell markers
CellMarker_search_by_gene operation="search_by_gene", gene_symbol (required), species Cell types expressing a gene
HPA_get_comparative_expression_by_gene_and_cellline gene_name (required), cell_line (required) Supported lines: ishikawa, hela, mcf7, a549, hepg2, jurkat, pc3, rh30, siha, u251
CLUE_get_cell_lines operation="get_cell_lines", cell_id L1000 CMap cell line info (requires CLUE_API_KEY)
SYNERGxDB_search_combos drug_name_1, drug_name_2, sample (tissue or cell ID) Drug combination synergy (ZIP, Bliss, Loewe)
SYNERGxDB_list_cell_lines - All cell lines in SYNERGxDB
DGIdb_get_drug_gene_interactions genes: list[str] Druggable gene interactions
OpenTargets_get_associated_drugs_by_target_ensemblID ensemblId, size Drugs targeting a gene
STRING_get_network protein_ids: list[str], species: int (9606) PPI network for gene context
MyGene_query_genes query (NOT q) Resolve gene symbol to Ensembl ID
cBioPortal_get_mutations study_id, gene_list (STRING, not array) Cell line mutations from CCLE

Workflow Overview

Input: Cancer type AND/OR Gene of interest AND/OR Cell line name(s)

Phase 1: Cell Line Identification
  - Search and verify cell line identity (Cellosaurus)
  - Get metadata: species, disease, STR profile, cross-references
  - If cancer type given without cell line: find candidate lines (DepMap)

Phase 2: Molecular Profiling
  - Mutation landscape (COSMIC, cBioPortal CCLE)
  - Gene expression (HPA, DepMap)
  - Cancer markers (CellMarker)

Phase 3: Gene Dependencies (CRISPR Screens)
  - Gene essentiality scores from DepMap
  - Identify selectively essential genes
  - Compare across cell lines if multiple candidates

Phase 4: Drug Sensitivity
  - IC50/AAC from PharmacoDB (GDSC, CCLE, CTRPv2, PRISM)
  - Biomarker associations for drug response
  - Drug combination synergy (SYNERGxDB)

Phase 5: Target Druggability & Recommendations
  - Druggable targets (DGIdb, OpenTargets)
  - Final ranked recommendation with rationale

Phase 1: Cell Line Identification

Goal: Verify cell line identity and find candidates.

If specific cell line given: (1) cellosaurus_search_cell_lines(q="id:<NAME>") → get CVCL accession, species, disease, contamination flags. (2) cellosaurus_get_cell_line_info(accession="CVCL_XXXX") for STR profile. (3) DepMap_get_cell_line(model_name="...") for tissue, cancer_type, MSI, ploidy. (4) PharmacoDB_get_cell_line(operation="get_cell_line", cell_name="...") for datasets.

If cancer type only: (1) DepMap_get_cell_lines(tissue="Lung", page_size=20). (2) Narrow by gene mutations/dependencies in Phases 2-3. (3) CellMarker_search_cancer_markers(operation="search_cancer_markers", cancer_type="Lung").

OUTPUT: Table of candidate cell lines with: name, tissue, cancer type, key identifiers.


Phase 2: Molecular Profiling

Goal: Characterize mutational and expression landscape.

2A Mutations: COSMIC_get_mutations_by_gene(gene="EGFR") + cBioPortal_get_mutations(study_id="ccle_broad_2019", gene_list="EGFR,KRAS,TP53"). Note: gene_list is a comma-separated STRING. CCLE study ID: ccle_broad_2019.

2B Expression: HPA_get_comparative_expression_by_gene_and_cellline(gene_name="EGFR", cell_line="a549"). Only 10 lines supported: hela, mcf7, a549, hepg2, jurkat, pc3, rh30, siha, u251, ishikawa.

2C Cancer markers: CellMarker_search_by_gene(operation="search_by_gene", gene_symbol="EGFR", species="Human")

OUTPUT: Mutation table (gene, AA change, type) + expression summary per cell line.


Phase 3: Gene Dependencies (CRISPR Screens)

Goal: Determine which genes are essential in candidate cell lines.

LIMITATION: DepMap_get_gene_dependencies returns gene metadata (HGNC ID, Ensembl ID) but NOT per-cell-line CRISPR scores. Full Chronos scores require depmap.org download.

Available tools: (1) DepMap_search_genes(query="EGFR") — validate gene exists. (2) DepMap_get_gene_dependencies(gene_symbol="EGFR") — metadata only. (3) Alternatives: cBioPortal CCLE for mutation data, PubMed for published screens, or direct user to depmap.org/portal.

Interpreting Chronos scores (from DepMap portal): <-0.5 = essential; ~0 = not essential; ~-1.0 = strongly essential. Selective dependency (essential in some lineages only) indicates therapeutic window.

OUTPUT: Gene validation + mutation status per cell line.

Offline DepMap analysis (when API lacks CRISPR scores): Download CRISPRGeneEffect.csv + Model.csv from https://depmap.org/portal/download/all/. Load with pandas, find gene column (format: "KRAS (3845)"), merge with metadata, filter by lineage, sort by score. Most negative Chronos score = most dependent.

If DepMap data is unavailable: Use cBioPortal_get_mutations(study_id="ccle_broad_2019", gene_list="KRAS") for mutation data, and the Quick Reference table below for common recommendations.


Phase 4: Drug Sensitivity

Goal: Profile drug response data.

4A PharmacoDB: PharmacoDB_get_experiments(operation="get_experiments", compound_name="Erlotinib", cell_line_name="A549", per_page=20) for dose-response (IC50, AAC, EC50). Omit compound_name to get all drugs for a cell line. Use PharmacoDB_get_biomarker_assoc(compound_name="...", tissue_name="...", mdata_type="mutation") for sensitivity biomarkers.

4B SYNERGxDB: SYNERGxDB_search_combos(drug_name_1="gemcitabine", drug_name_2="erlotinib", sample="lung"). Positive ZIP = synergy. Covers cytotoxic agents only (not targeted therapies/biologics).

4C CLUE: CLUE_get_cell_lines(operation="get_cell_lines", cell_id="MCF7") — requires CLUE_API_KEY.

OUTPUT: Drug sensitivity table (drug, IC50, AAC, dataset) + synergy data if available.


Phase 5: Target Druggability and Recommendations

5A Druggability: DGIdb_get_drug_gene_interactions(genes=["EGFR", "KRAS"]) + MyGene_query_genes(query="EGFR")OpenTargets_get_associated_drugs_by_target_ensemblID(ensemblId="...", size=10) + STRING_get_network(protein_ids=["EGFR"], species=9606).

5B Final Recommendation: Synthesize all phases. Explain WHY one line is better for this specific use case.

Decision Criteria with Concrete Thresholds

Criterion Weight Score 3 (Best) Score 2 (Acceptable) Score 1 (Poor)
Mutation match x3 Exact mutation (e.g., KRAS G12D) Same gene, different mutation No mutation in gene of interest
Co-mutation simplicity x2 Few co-mutations (cleaner background) Moderate co-mutations Complex background (3+ driver mutations)
Gene dependency x2 DepMap score < -0.5 (essential) Score -0.5 to -0.2 (moderately essential) Score > -0.2 (not essential)
Drug sensitivity data x1 In GDSC + CCLE + PRISM (3+ datasets) In 1-2 datasets No drug response data
Practical factors x1 Adherent, well-characterized, widely used Suspension or less common Hard to culture, contamination-prone

Total score = sum of (criterion score × weight). Max = 27. Rank cell lines by total score.

Use-Case-Specific Guidance

The best cell line depends on what you're doing with it:

Use Case Key Requirements Extra Considerations
CRISPR knockout screen Adherent growth, good lentiviral transduction, pre-existing Cas9 clones (check Cellosaurus for "-Cas9" derivatives) Doubling time matters for library coverage; <72h ideal
Drug sensitivity testing In PharmacoDB/GDSC, known IC50 for reference compounds Check SYNERGxDB for combo data
Xenograft model Known tumorigenicity in mice, available PDX data Check if line forms tumors in nude/NSG mice (Cellosaurus often notes this)
Mechanism of action Clean genetic background, gene dependency confirmed Fewer co-mutations = easier to attribute phenotypes
Biomarker discovery Isogenic pairs available, well-characterized omics Check if isogenic knockouts exist (Cellosaurus)
Drug combination In SYNERGxDB with combo data, known single-agent responses ZIP score available for synergy assessment

Cellosaurus Derivative Lines

Check for pre-made derivatives — this can save months of lab work:

  • cellosaurus_search_cell_lines(q="ca:<PARENT_LINE>", size=20) — finds all derivatives
  • Look for: Cas9-expressing clones, drug-resistant derivatives, knockout lines, fluorescent reporter lines
  • Example: PANC-1-Cas9-554 through PANC-1-Cas9-559 (CVCL_WL48-WL53) are pre-validated Cas9 clones

DepMap API Fallbacks

If DepMap_get_gene_dependencies fails (common for some genes):

  • The Sanger Cell Model Passports API may not index all genes. Note this limitation.
  • Recommend the user check DepMap portal (depmap.org) directly for CRISPR dependency data.
  • Use cBioPortal_get_mutations(study_id="ccle_broad_2019", gene_list="<GENE>") as an alternative source for cell line mutation data.

OUTPUT: Ranked cell line table with total scores, per-criterion breakdown, and a text recommendation explaining the top pick and runner-up with biological reasoning.


Common Use Patterns

Pattern Question Type Key Tools (in order)
1 "Which cell line for [cancer] + [gene]?" DepMap_get_cell_lines → DepMap_get_gene_dependencies → COSMIC_get_mutations_by_gene → cBioPortal_get_mutations (ccle_broad_2019) → PharmacoDB_get_experiments → rank by mutation + dependency + drug sensitivity
2 "Profile cell line X" cellosaurus_search → DepMap_get_cell_line → PharmacoDB_get_cell_line → cBioPortal_get_mutations → HPA expression (if supported) → PharmacoDB_get_experiments
3 "Which lines are sensitive to [drug]?" DepMap_get_cell_lines (tissue filter) → PharmacoDB_get_experiments (compound) → PharmacoDB_get_biomarker_assoc → rank by AAC (higher=sensitive) or IC50 (lower=sensitive)
4 "Compare A vs B" Run Pattern 2 for both in parallel → side-by-side comparison table
5 "Drug combos for [cell line]?" SYNERGxDB_search_combos → PharmacoDB_get_experiments (single-agent baseline) → report synergistic pairs with ZIP scores

Quick Reference: Common Cancer Cell Lines by Type

Cancer Type Key Cell Lines Common Mutations
NSCLC A549 (KRAS G12S), H1975 (EGFR L858R/T790M), PC-9 (EGFR del19), HCC827 (EGFR del19/amp), H460 (KRAS Q61H), H1299 (NRAS Q61K, TP53-null) KRAS, EGFR, TP53, STK11
Breast MCF7 (ER+/PR+), MDA-MB-231 (TNBC, KRAS G13D), T-47D (ER+), BT-474 (HER2+), SK-BR-3 (HER2+) PIK3CA, TP53, BRCA1/2
Colorectal HCT116 (KRAS G13D, MSI-H), SW480 (KRAS G12V), HT-29 (BRAF V600E), Caco-2 (APC), LoVo (KRAS G13D, MSI-H) APC, KRAS, TP53, BRAF
Melanoma A375 (BRAF V600E), SK-MEL-28 (BRAF V600E), WM266-4 (BRAF V600D), MeWo (WT BRAF) BRAF, NRAS, TP53
Pancreatic PANC-1 (KRAS G12D), MIA PaCa-2 (KRAS G12C), AsPC-1 (KRAS G12D), Capan-1 (BRCA2 mut) KRAS, TP53, CDKN2A, SMAD4
Prostate PC-3 (AR-negative), LNCaP (AR+, PTEN-null), DU145 (AR-negative), VCaP (AR amp, TMPRSS2-ERG) AR, PTEN, TP53, RB1
Ovarian SKOV3 (HER2+, TP53 mut), OVCAR3 (TP53 mut), A2780 (sensitive), A2780cis (cisplatin-resistant) TP53, BRCA1/2
Leukemia K562 (CML, BCR-ABL), Jurkat (T-ALL), HL-60 (AML), THP-1 (AML, monocytic) BCR-ABL, FLT3, NPM1
Glioblastoma U251 (TP53 mut), U87MG (PTEN-null), T98G (TP53/PTEN mut), LN229 (TP53 mut, PTEN WT) TP53, PTEN, EGFR, IDH1
Liver HepG2 (hepatoblastoma, WT TP53), Hep3B (HBV+, TP53-null), Huh7 (HCC, TP53 Y220C) TP53, CTNNB1, AXIN1

Cross-Referencing Cell Line IDs

Use cell line NAME as common key across databases. IDs: DepMap=SIDM, Cellosaurus=CVCL, cBioPortal=sample (e.g. A549_LUNG), PharmacoDB/SYNERGxDB=name string. When names differ ("HCT 116" vs "HCT116"), check Cellosaurus synonyms first.

Mutation-based filtering: cBioPortal_get_mutations(study_id="ccle_broad_2019", gene_list="KRAS") → filter by amino_acid_change → extract cell line names → query other databases.


Error Handling

Issue Resolution
DepMap returns no results for cell line name Try alternative names: check Cellosaurus synonyms first
cBioPortal CCLE study ID unknown Use ccle_broad_2019 as default CCLE study
PharmacoDB cell line name mismatch Use PharmacoDB_search(operation="search", query="<name>") to find the canonical name
HPA cell line not supported Only 10 lines supported (hela, mcf7, a549, hepg2, jurkat, pc3, rh30, siha, u251, ishikawa). Skip HPA for other lines
CLUE requires API key Skip CLUE tools if CLUE_API_KEY not set; note in report
Gene symbol not found in DepMap Use DepMap_search_genes(query="<symbol>") to check aliases
Cellosaurus accession pattern Must be CVCL_XXXX format; search first if you only have a name
SYNERGxDB no results for drug combo Drug may not be in database; SYNERGxDB covers cytotoxic agents, not most targeted therapies

Completeness Checklist

Before finalizing the report, verify:

  • Cell line identity verified (Cellosaurus or DepMap)
  • Species confirmed as human (unless otherwise specified)
  • Key mutations documented (COSMIC or cBioPortal)
  • Gene dependency assessed (DepMap CRISPR, if gene of interest provided)
  • Drug sensitivity data included (PharmacoDB, at least one dataset)
  • Druggability of key targets checked (DGIdb or OpenTargets)
  • Practical recommendation provided (not just raw data)
  • All claims cite their source database
  • Known limitations noted (missing data, unsupported lines)
Weekly Installs
46
GitHub Stars
1.3K
First Seen
Today