Geneformer

Use This Skill When

Use this skill when the task involves official Geneformer workflows such as:

converting raw scRNA-seq data into Geneformer tokenized datasets
fine-tuning Geneformer for cell or gene classification
extracting cell or gene embeddings
generating state embeddings for downstream perturbation analysis
running in silico perturbation or in silico treatment style analyses
distinguishing pretrained zero-shot usage from fine-tuned classifier usage

This skill is for Geneformer-specific workflows, not generic single-cell model use.

Start Here

Confirm the input is raw-count scRNA-seq data and still suitable for tokenization.
Check that ensembl_id (a per-gene attribute) and n_counts (a per-cell attribute) are available.
Tokenize first unless the user already has a Geneformer .dataset.
Decide whether the task is classification, embedding extraction, or in silico perturbation.
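
The checks above can be sketched as a small pre-flight function. This is a minimal sketch and not part of Geneformer: `preflight`, its arguments, and its messages are hypothetical. It works on plain Python sequences so it can be adapted to any loader (for AnnData, `gene_attrs` would typically be `adata.var.columns` and `cell_attrs` would be `adata.obs.columns`).

```python
from typing import Iterable, List


def preflight(gene_attrs: Iterable[str],
              cell_attrs: Iterable[str],
              sample_counts: Iterable[float]) -> List[str]:
    """Return a list of problems; an empty list means the input looks tokenizable.

    Hypothetical helper for illustration, not a Geneformer API.
    """
    problems = []
    if "ensembl_id" not in set(gene_attrs):
        problems.append("missing gene attribute 'ensembl_id'")
    if "n_counts" not in set(cell_attrs):
        problems.append("missing cell attribute 'n_counts'")
    # Raw counts are non-negative whole numbers; fractional or negative
    # values suggest normalized, log-transformed, or scaled data.
    if any(v < 0 or not float(v).is_integer() for v in sample_counts):
        problems.append("values do not look like raw counts")
    return problems


# A raw-count input with both attributes passes with no problems reported.
print(preflight(["ensembl_id"], ["n_counts", "cell_type"], [0, 3, 12]))
# A normalized matrix keyed by gene symbols fails all three checks.
print(preflight(["gene_symbol"], ["cell_type"], [0.0, 1.7, 2.3]))
```

Passing only a small sample of matrix values keeps the check cheap on large datasets, at the cost of possibly missing a few non-integer entries.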

Choose A Path

Tokenization

Use TranscriptomeTokenizer first for almost every Geneformer workflow. This step converts raw-count .loom or .h5ad data into tokenized datasets used by the downstream APIs.

Geneformer expects:

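row attribute ensembl_id: an Ensembl gene ID for each gene (for .h5ad input, a column of adata.var)
cell attribute n_counts: the total read count for each cell (for .h5ad input, a column of adata.obs)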

Optional metadata can be passed through during tokenization.
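
A tokenization call might look like the sketch below. The wrapper `tokenize_directory`, the `cell_type` metadata column, and the default values are assumptions made for illustration. The `TranscriptomeTokenizer` constructor and `tokenize_data` arguments follow the published Geneformer examples, so check them against the installed version.

```python
from pathlib import Path


def tokenize_directory(data_dir: str, out_dir: str, prefix: str,
                       file_format: str = "h5ad", nproc: int = 4) -> Path:
    """Tokenize every raw-count file in data_dir (hypothetical wrapper)."""
    if file_format not in {"loom", "h5ad"}:
        raise ValueError("file_format must be 'loom' or 'h5ad'")

    # Imported lazily so this module loads without Geneformer installed.
    from geneformer import TranscriptomeTokenizer

    # Keys are attribute names in the input files; values are the column
    # names to store in the tokenized dataset. "cell_type" is an assumed
    # metadata column, used here only for illustration.
    tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=nproc)
    tk.tokenize_data(data_dir, out_dir, prefix, file_format=file_format)

    # Expected location of the tokenized dataset (assumed naming scheme).
    return Path(out_dir) / f"{prefix}.dataset"
```

The returned path is what the classification, embedding, and perturbation steps take as input.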

Classification

Use Classifier for:

cell state classification
cell type annotation
gene classification tasks

The input is a tokenized Geneformer .dataset object, not raw AnnData.
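
A minimal sketch of preparing a cell classifier follows. `make_cell_state_dict` and `prepare_cell_classifier` are hypothetical helpers, and the `Classifier` keyword arguments are taken from the published Geneformer examples as an assumption about the installed version.

```python
from typing import Dict, List, Union


def make_cell_state_dict(state_key: str,
                         states: Union[str, List[str]] = "all") -> Dict[str, object]:
    """Build the label spec for a cell classifier (hypothetical helper)."""
    if states != "all" and len(states) < 2:
        raise ValueError("need at least two states to classify")
    return {"state_key": state_key, "states": states}


def prepare_cell_classifier(dataset_path: str, out_dir: str, prefix: str,
                            state_key: str):
    """Label a tokenized .dataset for fine-tuning (sketch)."""
    # Imported lazily so this module loads without Geneformer installed.
    from geneformer import Classifier

    cc = Classifier(
        classifier="cell",
        cell_state_dict=make_cell_state_dict(state_key),
        nproc=4,
    )
    # input_data_file must be a tokenized .dataset, not an .h5ad file.
    cc.prepare_data(
        input_data_file=dataset_path,
        output_directory=out_dir,
        output_prefix=prefix,
    )
    return cc
```

In the published examples, training then proceeds through the classifier's `validate` method, which takes the model directory and the files written by `prepare_data`. Confirm the exact arguments in the installed version before relying on them.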

Embedding extraction

Use EmbExtractor when the task is to:

extract CLS, cell, or gene embeddings
plot or inspect cell embeddings
generate state embeddings for later perturbation analysis
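
Embedding extraction from a pretrained model can be sketched as below. The wrapper `extract_embeddings` and its defaults are hypothetical, and the `EmbExtractor` arguments are assumed to match the published Geneformer examples.

```python
VALID_EMB_MODES = {"cls", "cell", "gene"}


def extract_embeddings(model_dir: str, dataset_path: str, out_dir: str,
                       prefix: str, emb_mode: str = "cell",
                       max_ncells: int = 1000):
    """Extract embeddings from a tokenized .dataset (hypothetical wrapper)."""
    if emb_mode not in VALID_EMB_MODES:
        raise ValueError(f"emb_mode must be one of {sorted(VALID_EMB_MODES)}")

    # Imported lazily so this module loads without Geneformer installed.
    from geneformer import EmbExtractor

    embex = EmbExtractor(
        model_type="Pretrained",   # zero-shot use; no fine-tuned head
        emb_mode=emb_mode,
        max_ncells=max_ncells,
        emb_layer=-1,              # second-to-last layer
        forward_batch_size=50,
        nproc=4,
    )
    # Returns the extracted embeddings and writes them under out_dir.
    return embex.extract_embs(model_dir, dataset_path, out_dir, prefix)
```

For a fine-tuned classifier, `model_type` would instead be `"CellClassifier"` or `"GeneClassifier"`, with `num_classes` set to match the model.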

In silico perturbation

Use InSilicoPerturber for zero-shot or model-based perturbation analyses such as:

deleting genes or shifting their rank, for example by overexpression
modeling start and goal cell states
ranking perturbations by movement toward a desired cell state

This is one of Geneformer's defining workflows. It measures how a perturbation shifts cell or gene embeddings, so treat it as a separate analysis rather than ordinary classifier inference.
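
A deletion screen toward a goal state could be sketched as follows. The helpers `make_cell_states_to_model` and `run_deletion_screen`, the state names, and the batch sizes are hypothetical. The `EmbExtractor.get_state_embs` and `InSilicoPerturber` arguments are assumed to match the published Geneformer examples for a fine-tuned cell classifier.

```python
from typing import Dict, List, Optional


def make_cell_states_to_model(state_key: str, start_state: str,
                              goal_state: str,
                              alt_states: Optional[List[str]] = None) -> Dict[str, object]:
    """Describe the start, goal, and alternate states (hypothetical helper)."""
    alt = list(alt_states or [])
    if start_state == goal_state:
        raise ValueError("start_state and goal_state must differ")
    if start_state in alt or goal_state in alt:
        raise ValueError("alt_states must not repeat the start or goal state")
    return {"state_key": state_key, "start_state": start_state,
            "goal_state": goal_state, "alt_states": alt}


def run_deletion_screen(model_dir: str, dataset_path: str, out_dir: str,
                        prefix: str, states: Dict[str, object],
                        num_classes: int) -> None:
    """Delete each gene in turn and record embedding shifts (sketch)."""
    # Imported lazily so this module loads without Geneformer installed.
    from geneformer import EmbExtractor, InSilicoPerturber

    # Step 1: summarize each modeled state as a reference embedding.
    embex = EmbExtractor(model_type="CellClassifier", num_classes=num_classes,
                         emb_mode="cell", max_ncells=1000, emb_layer=0,
                         summary_stat="exact_mean", forward_batch_size=50,
                         nproc=4)
    state_embs = embex.get_state_embs(states, model_dir, dataset_path,
                                      out_dir, f"{prefix}_state_embs")

    # Step 2: perturb cells and compare against the state embeddings.
    isp = InSilicoPerturber(perturb_type="delete", genes_to_perturb="all",
                            model_type="CellClassifier",
                            num_classes=num_classes, emb_mode="cell",
                            cell_states_to_model=states,
                            state_embs_dict=state_embs, max_ncells=1000,
                            emb_layer=0, forward_batch_size=50, nproc=4)
    isp.perturb_data(model_dir, dataset_path, out_dir, prefix)


states = make_cell_states_to_model("disease", "dcm", "nf", ["hcm"])
print(states)
```

The raw perturbation output still needs to be aggregated before genes can be ranked. Geneformer provides `InSilicoPerturberStats` for that step.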

Guardrails

Do not pass feature-selected matrices into the tokenizer; the docs expect raw counts without feature selection.
Do not use gene symbols where the tokenizer expects ensembl_id.
Do not confuse tokenized .dataset files with AnnData objects.
Do not skip tokenization and jump directly to classifiers or perturbation APIs.
For perturbation tasks, be explicit about model type, embedding mode, and target cell states.
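
The gene-identifier guardrail can be checked before tokenization with a sketch like the one below. `non_ensembl_ids` is a hypothetical helper, and the pattern assumes human Ensembl gene IDs (`ENSG` followed by 11 digits). Versioned IDs such as `ENSG00000012048.23` are flagged as well, on the assumption that they may not match the token dictionary as-is.

```python
import re
from typing import Iterable, List

# Assumption: human Ensembl gene IDs, "ENSG" followed by exactly 11 digits.
_ENSG = re.compile(r"ENSG\d{11}")


def non_ensembl_ids(gene_ids: Iterable[str], limit: int = 5) -> List[str]:
    """Return up to `limit` identifiers that are not bare Ensembl gene IDs."""
    bad = [g for g in gene_ids if not _ENSG.fullmatch(str(g))]
    return bad[:limit]


print(non_ensembl_ids(["ENSG00000141510", "TP53", "ENSG00000012048.23"]))
# → ['TP53', 'ENSG00000012048.23']
```

An empty result suggests the identifiers are safe to store as ensembl_id; anything returned should be mapped or stripped of its version suffix first.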

Official Workflow Surface

Component	Use
`TranscriptomeTokenizer`	create tokenized datasets
`Classifier`	fine-tune cell or gene classifiers
`MTLClassifier`	multitask cell classification
`EmbExtractor`	extract and summarize embeddings
`InSilicoPerturber`	simulate perturbations / treatment directions
