Benchmark Datasets

Levine_32dim

A widely used benchmark dataset for evaluating cytometry clustering methods.

Property                 Value
Source                   Levine et al. (2015) Cell 162(1):184-197
Total cells              265,627
Gated cells              104,184 (39.2%)
Markers                  32 surface CyTOF markers
Populations              14 manually gated immune populations
Rare populations (<3%)   7
Technology               CyTOF (mass cytometry)
Tissue                   Human bone marrow

Population distribution

Size class      Populations   Fraction of gated cells
Large (>5%)     4             ~32% each
Medium (1-5%)   3             ~2% each
Rare (<1%)      7             0.1-0.5% each

The high number of rare populations (half of all populations) makes this dataset particularly challenging for methods that don't account for density imbalance.
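
The imbalance described above can be checked directly from a label vector. The following is a minimal sketch, not part of the benchmark code: the helper names `population_fractions` and `rare_populations` are hypothetical, the toy labels are made up for illustration, and the 3% default threshold is taken from the property table above.

```python
import numpy as np

def population_fractions(labels):
    """Return {population: fraction of cells} for a 1-D array of labels."""
    names, counts = np.unique(np.asarray(labels), return_counts=True)
    total = int(counts.sum())
    # tolist() converts NumPy scalars to plain Python types for clean dict keys
    return {name: count / total for name, count in zip(names.tolist(), counts.tolist())}

def rare_populations(labels, threshold=0.03):
    """Names of populations whose fraction of cells is below `threshold`."""
    return [name for name, frac in population_fractions(labels).items() if frac < threshold]

# Toy labels (not real data): one large, one medium, and one rare population.
toy_labels = np.array(["A"] * 960 + ["B"] * 35 + ["C"] * 5)
fractions = population_fractions(toy_labels)
rare = rare_populations(toy_labels, threshold=0.03)
```

Applied to the `labels` array of a real dataset, the same two calls would give the per-population fractions and the list of populations under the chosen threshold.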

Download

The dataset is downloaded automatically from the lmweber/benchmark-data-Levine-32-dim GitHub repository on first use:

from benchmarks.download_data import load_dataset
X, labels, markers = load_dataset("Levine_32dim")

Preprocessing

  • Ungated cells (60.8% of total) are removed for evaluation
  • Data is arcsinh-transformed with cofactor 5 (standard for CyTOF) before clustering
  • All 32 surface markers are used for clustering
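
The two preprocessing steps above can be sketched with NumPy. This is an illustration under assumptions, not the benchmark's own code: the function names are hypothetical, and the use of `-1` as the label for ungated cells is an assumption, since this page does not say how ungated cells are encoded or whether `load_dataset` already applies these steps.

```python
import numpy as np

def arcsinh_transform(X, cofactor=5.0):
    """Apply the arcsinh transform element-wise: arcsinh(x / cofactor)."""
    return np.arcsinh(np.asarray(X, dtype=float) / cofactor)

def drop_ungated(X, labels, ungated_label=-1):
    """Keep only cells with a manual gate (ungated_label is an assumed sentinel)."""
    labels = np.asarray(labels)
    gated = labels != ungated_label
    return np.asarray(X)[gated], labels[gated]

# Toy raw intensities for 3 cells x 2 markers; the second cell is "ungated".
X_raw = np.array([[0.0, 5.0], [10.0, 50.0], [5.0, 0.0]])
labels_raw = np.array([0, -1, 1])

X_gated, labels_gated = drop_ungated(X_raw, labels_raw)
X_transformed = arcsinh_transform(X_gated, cofactor=5.0)
```

Because the transform is element-wise, it gives the same values whether it is applied before or after the ungated cells are dropped.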

Synthetic

A challenging synthetic dataset designed to stress-test clustering methods with overlapping populations, hierarchical structure, and rare subsets.

Property                 Value
Cells                    50,000
Features                 15
Populations              12
Rare populations (<3%)   3
Seed                     42 (deterministic)

Design

  • 4 lineages of 3 sub-populations each -- sub-populations within a lineage are close in feature space (harder to separate)
  • Per-feature variable spread -- some markers are more discriminative than others
  • 3 rare populations (~1% each, tighter clusters)
  • Global measurement noise added

The dataset is generated deterministically (seed 42):

from benchmarks.download_data import generate_synthetic_benchmark
X, labels, markers = generate_synthetic_benchmark()
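
To make the design points above concrete, here is a rough sketch of how a dataset with this structure could be generated. It is not the implementation behind `generate_synthetic_benchmark`: the function name `sketch_synthetic`, the Gaussian populations, every numeric scale (lineage separation, sub-population offset, spread range, noise level), the marker names, and the choice of which populations are rare are all assumptions made for illustration.

```python
import numpy as np

def sketch_synthetic(n_cells=50_000, n_features=15, n_lineages=4,
                     subs_per_lineage=3, n_rare=3, rare_fraction=0.01, seed=42):
    """Illustrative generator: lineages of nearby Gaussian sub-populations."""
    rng = np.random.default_rng(seed)
    n_pops = n_lineages * subs_per_lineage

    # Lineage centers are far apart; sub-populations are small offsets from
    # their lineage center, so they sit close together in feature space.
    lineage_centers = rng.normal(0.0, 4.0, size=(n_lineages, n_features))
    offsets = rng.normal(0.0, 0.8, size=(n_pops, n_features))
    centers = np.repeat(lineage_centers, subs_per_lineage, axis=0) + offsets

    # Per-feature spread: features with a small spread are more discriminative.
    feature_spread = rng.uniform(0.5, 1.5, size=n_features)

    # The last n_rare populations get ~1% of cells each (an arbitrary choice);
    # the remaining populations share the rest equally.
    fractions = np.full(n_pops, (1.0 - n_rare * rare_fraction) / (n_pops - n_rare))
    fractions[-n_rare:] = rare_fraction
    sizes = np.round(fractions * n_cells).astype(int)
    sizes[0] += n_cells - sizes.sum()  # absorb rounding so sizes sum to n_cells

    # Rare populations are drawn as tighter clusters.
    tightness = np.ones(n_pops)
    tightness[-n_rare:] = 0.5

    X = np.concatenate([
        centers[k] + rng.normal(size=(sizes[k], n_features)) * feature_spread * tightness[k]
        for k in range(n_pops)
    ])
    labels = np.repeat(np.arange(n_pops), sizes)

    # Global measurement noise added to every cell and feature.
    X += rng.normal(0.0, 0.1, size=X.shape)

    markers = [f"marker_{i}" for i in range(n_features)]  # hypothetical names
    return X, labels, markers

X_syn, labels_syn, markers_syn = sketch_synthetic()
```

Fixing the seed of a single `numpy.random.Generator` is what makes repeated calls return identical arrays, mirroring the deterministic behaviour listed in the table above.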