Benchmark Datasets#

Levine_32dim#

The standard benchmark dataset for cytometry clustering methods.

Source	Levine et al. (2015) Cell 162(1):184-197
Total cells	265,627
Gated cells	104,184 (39.2%)
Markers	32 surface CyTOF markers
Populations	14 manually gated immune populations
Rare populations (<3%)	7
Technology	CyTOF (mass cytometry)
Tissue	Human bone marrow

Population distribution#

Population	Cells	Fraction
Large (>5%)	4 populations	~32% of gated cells each
Medium (1-5%)	3 populations	~2% each
Rare (<1%)	7 populations	0.1–0.5% each

The high number of rare populations (half of all populations) makes this dataset particularly challenging for methods that don’t account for density imbalance.

Download#

The dataset is automatically downloaded from lmweber/benchmark-data-Levine-32-dim when first used:

from benchmarks.download_data import load_dataset
X, labels, markers = load_dataset("Levine_32dim")

Preprocessing#

Ungated cells (60.8% of total) are removed for evaluation
Data is arcsinh-transformed with cofactor 5 (standard for CyTOF) before clustering
All 32 surface markers are used for clustering

Synthetic#

A challenging synthetic dataset designed to stress-test clustering methods with overlapping populations, hierarchical structure, and rare subsets.

Cells	50,000
Features	15
Populations	12
Rare populations (<3%)	3
Seed	42 (deterministic)

Design#

4 lineages of 3 sub-populations each — sub-populations within a lineage are close in feature space (harder to separate)
Per-feature variable spread — some markers are more discriminative than others
3 rare populations (~1% each, tighter clusters)
Global measurement noise added

from benchmarks.download_data import generate_synthetic_benchmark
X, labels, markers = generate_synthetic_benchmark()