datasets-loading
OmicVerse Built-in Datasets
ov.datasets provides 30+ ready-to-use datasets with automatic download, caching, and fallback to mock data. Use these instead of manually downloading files or relying on scanpy.datasets.
When to Use This Module
- Tutorials/demos: Load standard benchmarks (PBMC3k, Paul15, dentate gyrus) with one function call
- Testing pipelines: Use `create_mock_dataset()` to generate synthetic data without downloads
- Gene set analysis: Use `predefined_signatures` for curated GMT gene sets (cell cycle, gender, mitochondrial, tissue-specific)
- Velocity workflows: Load pre-formatted datasets with spliced/unspliced layers
Dataset Catalog
Single-Cell
| Function | Cells | Genes | Description |
|---|---|---|---|
| `ov.datasets.pbmc3k()` | 2,700 | 32,738 | 10x PBMC3k (raw or processed) |
| `ov.datasets.pbmc8k()` | ~8,000 | — | 10x PBMC 8k |
| `ov.datasets.paul15()` | 2,730 | 3,451 | Myeloid progenitors |
| `ov.datasets.krumsiek11()` | 640 | 11 | Myeloid differentiation simulation |
| `ov.datasets.bone_marrow()` | 5,780 | 27,876 | Bone marrow hematopoietic |
| `ov.datasets.hematopoiesis()` | — | — | Processed hematopoiesis |
| `ov.datasets.hematopoiesis_raw()` | — | — | Raw hematopoiesis |
| `ov.datasets.sc_ref_Lymph_Node()` | ~10,000 | ~15,000 | Lymph node reference |
| `ov.datasets.bhattacherjee()` | ~5,000 | ~2,000 | Mouse PFC cocaine study |
| `ov.datasets.human_tfs()` | — | — | Human TF list (DataFrame) |
RNA Velocity & Trajectories
| Function | Cells | Genes | Description |
|---|---|---|---|
| `ov.datasets.dentate_gyrus()` | 18,213 | 27,998 | Dentate gyrus (loom) |
| `ov.datasets.dentate_gyrus_scvelo()` | 2,930 | 13,913 | DG subset from scVelo |
| `ov.datasets.zebrafish()` | 4,181 | 16,940 | Zebrafish developmental |
| `ov.datasets.pancreatic_endocrinogenesis()` | — | — | Pancreatic epithelial |
| `ov.datasets.pancreas_cellrank()` | 2,930 | 13,913 | Pancreas cellrank benchmark |
| `ov.datasets.scnt_seq_neuron_splicing()` | 13,476 | 44,021 | scNT-seq neuron splicing |
| `ov.datasets.scnt_seq_neuron_labeling()` | 3,060 | 24,078 | scNT-seq neuron labeling |
| `ov.datasets.sceu_seq_rpe1()` | ~2,930 | ~13,913 | scEU-seq RPE1 |
| `ov.datasets.sceu_seq_organoid()` | 3,831 | 9,157 | scEU-seq organoid |
| `ov.datasets.haber()` | 7,216 | 27,998 | Intestinal epithelium |
| `ov.datasets.chromaffin()` | — | — | Chromaffin cell lineage |
| `ov.datasets.hg_forebrain_glutamatergic()` | 1,720 | 32,738 | Human forebrain |
| `ov.datasets.toggleswitch()` | 200 | 2 | Two-gene simulation |
Spatial & Multiome
| Function | Description |
|---|---|
| `ov.datasets.seqfish()` | SeqFISH spatial transcriptomics |
| `ov.datasets.multi_brain_5k()` | 10x E18 mouse brain multiome (MuData) |
Bulk RNA-seq & Deconvolution
| Function | Description |
|---|---|
| `ov.datasets.burczynski06()` | UC/CD PBMC bulk (127 samples) |
| `ov.datasets.moignard15()` | Embryo hematopoiesis qRT-PCR |
| `ov.datasets.decov_bulk_covid_bulk()` | COVID-19 PBMC bulk |
| `ov.datasets.decov_bulk_covid_single()` | COVID-19 PBMC single-cell ref |
Synthetic
| Function | Description |
|---|---|
| `ov.datasets.create_mock_dataset()` | Configurable synthetic scRNA-seq |
| `ov.datasets.blobs()` | Gaussian blob clusters |
Mock Data Generation
Use create_mock_dataset() when you need data without network access or for pipeline testing:
import omicverse as ov

# Basic mock dataset
adata = ov.datasets.create_mock_dataset(
    n_cells=2000,
    n_genes=1500,
    n_cell_types=6,
    with_clustering=False,
    random_state=42,
)
# adata.obs: cell_type, sample_id, condition, tissue
# adata.var: gene_symbols, highly_variable

# With full preprocessing (normalized, PCA, UMAP, leiden)
adata = ov.datasets.create_mock_dataset(
    n_cells=5000,
    n_genes=3000,
    n_cell_types=10,
    with_clustering=True,
)
Features:
- Negative binomial expression distribution
- Cell-type-specific marker genes (2-5x expression multiplier)
- Gene names: `Gene_0001`, `Gene_0002`, ...
- `with_clustering=True` adds: normalization, HVG, scaling, PCA, UMAP, leiden
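To make the feature list concrete, here is a minimal standard-library sketch of how negative binomial counts with cell-type-specific marker genes could be generated. It is not the code behind `create_mock_dataset()`; the function names, the gamma-Poisson construction, and parameters such as `markers_per_type`, `base_mean`, and `dispersion` are assumptions made for illustration only.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean, dispersion, rng):
    """Negative binomial as a gamma-Poisson mixture with the given mean."""
    # gammavariate(shape, scale): the rate has expectation shape * scale == mean.
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rate, rng)


def make_mock_counts(n_cells=300, n_genes=30, n_cell_types=3,
                     markers_per_type=5, base_mean=2.0, dispersion=2.0,
                     fold_range=(2.0, 5.0), seed=42):
    """Hypothetical helper: returns (counts, cell_types, gene_names)."""
    rng = random.Random(seed)
    # Assign cell types round-robin so every type has the same number of cells.
    cell_types = [i % n_cell_types for i in range(n_cells)]
    # Cell type t owns marker genes [t * m, (t + 1) * m), each with its own
    # fold change drawn uniformly from fold_range.
    marker = {}
    for t in range(n_cell_types):
        for g in range(t * markers_per_type, (t + 1) * markers_per_type):
            marker[g] = (t, rng.uniform(*fold_range))
    counts = []
    for ct in cell_types:
        row = []
        for g in range(n_genes):
            mean = base_mean
            if g in marker and marker[g][0] == ct:
                mean *= marker[g][1]  # boost the marker in its own cell type
            row.append(sample_negative_binomial(mean, dispersion, rng))
        counts.append(row)
    gene_names = [f"Gene_{g + 1:04d}" for g in range(n_genes)]
    return counts, cell_types, gene_names


counts, cell_types, gene_names = make_mock_counts()
print(len(counts), len(counts[0]), gene_names[:2])
```

The sketch sticks to the standard library so it runs without extra packages; with numpy available, `numpy.random.Generator.negative_binomial` would be the more usual way to draw the counts.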
Predefined Gene Set Signatures
Pre-loaded GMT files for common scoring tasks:
from omicverse.datasets import predefined_signatures, load_signatures_from_file
# Available signature keys
print(list(predefined_signatures.keys()))
# ['cell_cycle_human', 'cell_cycle_mouse', 'gender_human', 'gender_mouse',
# 'mitochondrial_genes_human', 'mitochondrial_genes_mouse',
# 'ribosomal_genes_human', 'ribosomal_genes_mouse',
# 'apoptosis_human', 'apoptosis_mouse',
# 'human_lung', 'mouse_lung', 'mouse_brain', 'mouse_liver', 'emt_human']
# Load a signature → dict[str, list[str]]
cell_cycle = load_signatures_from_file(predefined_signatures['cell_cycle_human'])
# {'S_genes': ['MCM5', 'PCNA', ...], 'G2M_genes': ['HMGB2', 'CDK1', ...]}
# Use with scoring
import scanpy as sc
sc.tl.score_genes_cell_cycle(
    adata,
    s_genes=cell_cycle['S_genes'],
    g2m_genes=cell_cycle['G2M_genes'],
)
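For readers unfamiliar with the GMT format, the sketch below shows how such a file maps onto a `dict[str, list[str]]`. Each line is tab-separated: set name, a description field, then the genes. This is an illustration of the format rather than the implementation of `load_signatures_from_file`; `read_gmt` and the demo file are hypothetical.

```python
import os
import tempfile


def read_gmt(path):
    """Parse a GMT file into {set_name: [gene, ...]}."""
    signatures = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank lines and sets without any genes
            name, _description, *genes = fields
            signatures[name] = [gene for gene in genes if gene]
    return signatures


# Write a tiny demo file (two sets, two genes each) and read it back.
demo_text = "S_genes\tdemo\tMCM5\tPCNA\nG2M_genes\tdemo\tHMGB2\tCDK1\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gmt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(demo_text)
    demo_sets = read_gmt(demo_path)

print(demo_sets)
```

A parser like this returns an empty dict when the file has no well-formed lines, which is one way an incorrect file could produce an empty result.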
Critical API Reference
# CORRECT: use ov.datasets for standard benchmarks
adata = ov.datasets.pbmc3k()
# WRONG: manually downloading what's already built-in
# import urllib.request
# urllib.request.urlretrieve('https://...', 'pbmc3k.h5ad') # unnecessary!
# adata = ov.read('pbmc3k.h5ad')
# CORRECT: pbmc3k(processed=True) for pre-processed version
adata = ov.datasets.pbmc3k(processed=True)
# WRONG: loading raw then manually preprocessing for a demo
# adata = ov.datasets.pbmc3k()
# sc.pp.normalize_total(adata) # unnecessary if you just need a quick demo
# CORRECT: mock data for testing (no network needed)
adata = ov.datasets.create_mock_dataset(n_cells=500, n_genes=200)
# WRONG: creating synthetic data manually with numpy
# X = np.random.poisson(1, (500, 200)) # missing metadata, layers, etc.
Caching Behavior
- Default cache directory: `./data/` (relative to working directory)
- Skip if exists: All functions check for existing files before downloading
- Mirror fallback: Stanford and Figshare mirrors for reliability
- Mock fallback: Most functions generate mock data if download fails (network issues)
- `var_names_make_unique()` called automatically after loading
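The lookup order described above (cache, then mirrors, then mock data) can be sketched as follows. This is a simplified model and not OmicVerse's actual code: `fetch_with_fallback`, its parameters, the injected `download` and `make_mock` callables, and the `.invalid` URLs are all hypothetical placeholders.

```python
import tempfile
from pathlib import Path


def fetch_with_fallback(filename, mirrors, download, make_mock, cache_dir="./data"):
    """Return (result, source), trying cache, then each mirror, then mock data."""
    cache_dir = Path(cache_dir)
    target = cache_dir / filename
    if target.exists():
        return target, "cache"  # skip-if-exists: no network access needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    for url in mirrors:
        try:
            download(url, target)
            return target, url
        except OSError:
            # Drop any partial file before trying the next mirror.
            target.unlink(missing_ok=True)
    return make_mock(), "mock"  # every mirror failed


def _fake_download(url, target):
    """Stand-in downloader: the 'primary' mirror always fails."""
    if "primary" in url:
        raise OSError("primary mirror down")
    Path(target).write_text("payload")


def _offline(url, target):
    """Stand-in downloader that simulates having no network at all."""
    raise OSError(f"cannot reach {url}")


mirrors = ["https://primary.invalid/demo.h5ad", "https://backup.invalid/demo.h5ad"]
with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_fallback("demo.h5ad", mirrors, _fake_download, dict, cache_dir=tmp)
    second = fetch_with_fallback("demo.h5ad", mirrors, _offline, dict, cache_dir=tmp)
    third = fetch_with_fallback("other.h5ad", mirrors, _offline,
                                lambda: {"mock": True}, cache_dir=tmp)

sources = [first[1], second[1], third[1]]
print(sources)
```

The second call succeeds while offline because the file is already cached, which is also why manually placing a file in the cache directory under the expected filename avoids a download.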
Troubleshooting
- Download timeout / 403 error: Some datasets use `download_data_requests()` with custom headers. If persistent, manually download the file to `./data/` with the expected filename and the function will find it.
- `ModuleNotFoundError: No module named 'muon'` when calling `multi_brain_5k()`: Install muon: `pip install muon`. This function returns MuData, not AnnData.
- Mock dataset has no `.raw` or `layers['counts']`: Add manually after creation: `ov.utils.store_layers(adata, layers='counts')` and `adata.raw = adata`.
- `load_signatures_from_file` returns empty dict: Verify the GMT file path. Use `predefined_signatures['key']` which resolves to the bundled file via `importlib.resources`.
- Dentate gyrus loom download is slow: The loom file is large (~200MB). Use `ov.datasets.dentate_gyrus_scvelo()` for the smaller pre-processed subset (2,930 cells).
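One way to avoid hitting the `muon` error partway through a pipeline is to check for the optional dependency up front. The sketch below uses only the standard library; `has_module` and `require_module` are hypothetical helper names, not part of OmicVerse.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def require_module(name, install_hint):
    """Raise a ModuleNotFoundError with an install hint if `name` is missing."""
    if not has_module(name):
        raise ModuleNotFoundError(
            f"No module named {name!r}. Install it with: {install_hint}"
        )


# Report availability without importing the package itself.
if has_module("muon"):
    print("muon is available; multi_brain_5k() can return MuData")
else:
    print("muon is missing; install it with: pip install muon")
```

Calling `require_module("muon", "pip install muon")` before `ov.datasets.multi_brain_5k()` would surface the problem before any download starts.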
Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Downloads: `tqdm`, `requests` (for mirror fallback)
- Multiome: `muon` (only for `multi_brain_5k()`)
- Signatures: `importlib.resources` (stdlib)
Examples
- "Load the PBMC3k dataset and run the standard preprocessing pipeline."
- "Create a mock dataset with 5000 cells and 8 cell types for testing my clustering workflow."
- "Load cell cycle gene signatures and score my adata for S and G2M phase genes."
References
- Quick copy/paste commands: `reference.md`