single-cell-foundation-model-stofm
SKILL.md
SToFM
Use This Skill When
Use this skill for the local SToFM repository at /DATA/disk0/zhaosy/home/SToFM.
It is the right choice when the task involves:
- preprocessing spatial transcriptomics data into the format expected by SToFM
- converting mouse genes to the human Geneformer vocabulary when needed
- generating cell embeddings with the official get_embeddings.py pipeline
- working with spatial coordinates, sub-slice splitting, or hypernode construction
- using SToFM embeddings for downstream region segmentation or cell type annotation
- understanding the repo's two-stage architecture: cell encoder plus SE(2) Transformer
Do not use this skill for ordinary scRNA-seq analysis without spatial coordinates.
Start Here
- Confirm the data has usable spatial coordinates.
- Check whether the input has already been preprocessed into both data.h5ad and hf.dataset.
- Check that the required checkpoints exist for both the cell encoder and the SE(2) Transformer.
- Prefer the official embedding pipeline before building downstream heads.
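The file checks above can be scripted. The following is a minimal sketch, not part of the repo: the function name missing_stofm_inputs and the example checkpoint paths are hypothetical, and it assumes data.h5ad and hf.dataset sit in the same directory.

```python
from pathlib import Path


def missing_stofm_inputs(data_dir, checkpoint_paths):
    """Return the expected inputs that are absent, as a list of path strings."""
    data_dir = Path(data_dir)
    # File names follow the preprocessing outputs named in this skill.
    expected = [data_dir / "data.h5ad", data_dir / "hf.dataset"]
    # Checkpoint locations are supplied by the caller (assumption:
    # one path per model stage); the repo does not ship them.
    expected += [Path(p) for p in checkpoint_paths]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    missing = missing_stofm_inputs(
        "my_slice",
        ["ckpt/cell_encoder", "ckpt/se2_transformer"],
    )
    print("missing:", missing or "nothing")
```

An empty result only means the files exist; it does not confirm that the data carries usable spatial coordinates.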
Choose A Path
Preprocessing
Use preprocessing/preprocess.py first unless the dataset is already in the
expected SToFM format.
The repo's preprocessing flow:
- starts from AnnData
- expects Geneformer-style transcriptome tokenization
- adds obs["n_counts"]
- uses var["ensembl_id"]
- maps mouse gene ids to human ids when needed
- saves both:
  - hf.dataset for the cell encoder
  - data.h5ad for later spatial loading
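To make the n_counts and gene-mapping steps concrete, here is a rough NumPy sketch. It is not the logic of preprocessing/preprocess.py: the function name prepare_cell_fields, the mouse_to_human dictionary, and the choice to compute totals before dropping unmapped genes are all assumptions made for illustration. In the real flow these values live in the AnnData obs and var tables.

```python
import numpy as np


def prepare_cell_fields(counts, gene_ids, mouse_to_human=None):
    """Compute per-cell totals and optionally map gene ids to human ids.

    counts: (n_cells, n_genes) raw count matrix.
    gene_ids: Ensembl ids, one per column of `counts`.
    mouse_to_human: hypothetical dict from mouse to human Ensembl ids.
    """
    counts = np.asarray(counts)
    # Total counts per cell; plays the role of obs["n_counts"].
    n_counts = counts.sum(axis=1)
    if mouse_to_human is None:
        return counts, list(gene_ids), n_counts
    # Keep only genes with a human counterpart, then rename them.
    # Many-to-one mappings are not merged in this sketch.
    keep = [i for i, g in enumerate(gene_ids) if g in mouse_to_human]
    human_ids = [mouse_to_human[gene_ids[i]] for i in keep]
    return counts[:, keep], human_ids, n_counts
```

Calling it without a mapping leaves the matrix and ids untouched, which corresponds to data that already uses human gene ids.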
Embedding generation
The main official workflow is get_embeddings.py.
This path:
- loads the pretrained cell encoder
- loads the SToFM SE(2) Transformer
- encodes cells from hf.dataset if ce_emb.npy is missing
- loads spatial coordinates from data.h5ad
- splits large slices into sub-slices
- builds hypernodes and attention biases
- runs the SE(2) Transformer
- saves final embeddings such as stofm_emb.npy
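The sub-slice step can be pictured with a small example. The sketch below is one simple way to split a slice, assuming 2D coordinates and a hypothetical max_cells limit; it halves the slice along its wider axis until every piece is small enough. The strategy actually used by get_embeddings.py may differ, so treat this only as an illustration of the idea.

```python
import numpy as np


def split_into_subslices(coords, max_cells):
    """Recursively halve a slice along its wider axis until every
    piece holds at most `max_cells` cells. Returns index arrays."""
    if max_cells < 1:
        raise ValueError("max_cells must be at least 1")
    coords = np.asarray(coords, dtype=float)
    pieces, stack = [], [np.arange(len(coords))]
    while stack:
        idx = stack.pop()
        if len(idx) <= max_cells:
            pieces.append(idx)
            continue
        pts = coords[idx]
        # Split along the axis with the larger spatial extent.
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = np.argsort(pts[:, axis], kind="stable")
        half = len(idx) // 2
        stack.append(idx[order[:half]])
        stack.append(idx[order[half:]])
    return pieces
```

Each returned array indexes the cells of one sub-slice, so the pieces together cover every cell exactly once.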
Downstream tasks
The repo's recommended downstream pattern is simple:
- generate SToFM embeddings first
- train a task-specific head on top of those embeddings
The README specifically highlights:
- tissue region semantic segmentation
- cell type annotation
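As an illustration of a task-specific head, the sketch below fits a softmax-regression classifier on fixed embeddings with plain NumPy. It is an assumption-laden example rather than the repo's downstream code: the function names, the learning rate, and the integer label encoding are hypothetical, and it assumes the rows of the embedding file line up with the cells that carry labels.

```python
import numpy as np


def train_linear_head(emb, labels, n_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression head on fixed embeddings by gradient descent."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)  # integer class ids in [0, n_classes)
    n, d = emb.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = emb @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of the mean cross-entropy
        W -= lr * (emb.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict_labels(emb, W, b):
    """Return the most likely class id for each embedding row."""
    return np.argmax(np.asarray(emb, dtype=float) @ W + b, axis=1)


# Typical use (assumed file name and label source):
# emb = np.load("stofm_emb.npy")
# W, b = train_linear_head(emb, cell_type_ids, n_classes=10)
```

The embeddings stay frozen here; only the small head is trained, which matches the two-step pattern described above.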
Guardrails
- Do not skip spatial information; SToFM is not just a transcriptome encoder.
- Do not treat hf.dataset alone as sufficient input for the full model; the SE(2) stage also needs spatial structure.
- Do not assume mouse genes can be used directly; the repo explicitly maps them into the human vocabulary.
- Do not run downstream heads on raw expression if the intended workflow is SToFM; generate official embeddings first.
- Do not assume the repo ships checkpoints; they are external downloads.
Read More Only If Needed
- Read references/local-usage.md for paths, required files, and practical checks.
- Read references/model-notes.md for the paper-level positioning and the multi-scale workflow.