Benchmark Datasets
==================

Levine_32dim
------------

The standard benchmark dataset for cytometry clustering methods.

.. list-table::
   :header-rows: 0
   :widths: auto

   * - **Source**
     - Levine et al. (2015) Cell 162(1):184-197
   * - **Total cells**
     - 265,627
   * - **Gated cells**
     - 104,184 (39.2%)
   * - **Markers**
     - 32 surface CyTOF markers
   * - **Populations**
     - 14 manually gated immune populations
   * - **Rare populations (<3%)**
     - 7
   * - **Technology**
     - CyTOF (mass cytometry)
   * - **Tissue**
     - Human bone marrow

Population distribution
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: auto

   * - Size class
     - Populations
     - Fraction
   * - Large (>5%)
     - 4 populations
     - ~32% of gated cells each
   * - Medium (1-5%)
     - 3 populations
     - ~2% each
   * - Rare (<1%)
     - 7 populations
     - 0.1--0.5% each

The high number of rare populations (half of all populations) makes this
dataset particularly challenging for methods that don't account for density
imbalance.

Download
~~~~~~~~

The dataset is downloaded automatically from
``lmweber/benchmark-data-Levine-32-dim`` when first used:

.. code-block:: python

   from benchmarks.download_data import load_dataset

   X, labels, markers = load_dataset("Levine_32dim")

Preprocessing
~~~~~~~~~~~~~

- Ungated cells (60.8% of total) are removed for evaluation
- Data is arcsinh-transformed with cofactor 5 (standard for CyTOF) before
  clustering
- All 32 surface markers are used for clustering

----

Synthetic
---------

A challenging synthetic dataset designed to stress-test clustering methods
with overlapping populations, hierarchical structure, and rare subsets.

.. list-table::
   :header-rows: 0
   :widths: auto

   * - **Cells**
     - 50,000
   * - **Features**
     - 15
   * - **Populations**
     - 12
   * - **Rare populations (<3%)**
     - 3
   * - **Seed**
     - 42 (deterministic)

Design
~~~~~~

- **4 lineages** of 3 sub-populations each; sub-populations within a lineage
  are close in feature space (harder to separate)
- **Per-feature variable spread**: some markers are more discriminative than
  others
- **3 rare populations** (~1% each, tighter clusters)
- **Global measurement noise** added

.. code-block:: python

   from benchmarks.download_data import generate_synthetic_benchmark

   X, labels, markers = generate_synthetic_benchmark()
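
The design points listed for the synthetic dataset can be illustrated with a
minimal NumPy sketch. This is not the implementation behind
``generate_synthetic_benchmark``: the function name ``sketch_synthetic``, the
Gaussian cluster shapes, every scale constant, the choice of which
sub-populations are rare, and the ``marker_<i>`` names are assumptions made
only for illustration.

.. code-block:: python

   import numpy as np


   def sketch_synthetic(n_cells=50_000, n_features=15, n_lineages=4,
                        subs_per_lineage=3, n_rare=3, rare_fraction=0.01,
                        seed=42):
       """Hypothetical illustration; not the benchmarks package's generator."""
       rng = np.random.default_rng(seed)  # fixed seed makes the output deterministic
       n_pops = n_lineages * subs_per_lineage

       # Assumption: the first sub-population of the first n_rare lineages is rare.
       rare_ids = np.arange(n_rare) * subs_per_lineage

       # Rare populations get ~1% each; the others share the remainder equally.
       fractions = np.full(n_pops, (1.0 - n_rare * rare_fraction) / (n_pops - n_rare))
       fractions[rare_ids] = rare_fraction
       counts = np.floor(fractions * n_cells).astype(int)
       counts[np.argmax(counts)] += n_cells - counts.sum()  # absorb rounding remainder

       # Lineage centres are far apart; sub-population centres are small offsets
       # from their lineage centre, so siblings sit close together in feature space.
       lineage_centers = rng.normal(0.0, 4.0, size=(n_lineages, n_features))

       # Per-feature spread: features with a smaller spread separate clusters better.
       feature_spread = rng.uniform(0.5, 1.5, size=n_features)

       X_parts, label_parts = [], []
       for pop in range(n_pops):
           lineage = pop // subs_per_lineage
           center = lineage_centers[lineage] + rng.normal(0.0, 1.0, size=n_features)
           tightness = 0.5 if pop in rare_ids else 1.0  # rare clusters are tighter
           X_parts.append(rng.normal(center, feature_spread * tightness,
                                     size=(counts[pop], n_features)))
           label_parts.append(np.full(counts[pop], pop))

       X = np.vstack(X_parts)
       labels = np.concatenate(label_parts)
       X += rng.normal(0.0, 0.3, size=X.shape)  # global measurement noise
       markers = [f"marker_{i}" for i in range(n_features)]
       return X, labels, markers


   if __name__ == "__main__":
       X, labels, markers = sketch_synthetic()
       # Per-population fractions; the same check applies to any label vector.
       pops, counts = np.unique(labels, return_counts=True)
       for pop, frac in zip(pops, counts / counts.sum()):
           print(f"population {pop}: {frac:.3%}")

The ``np.unique(labels, return_counts=True)`` check at the end should also work
on integer or string labels returned by the loaders described on this page, which
gives a quick way to confirm which populations fall below a chosen rarity
threshold.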