densitree User Guide¶
A Python implementation of the SPADE (Spanning-tree Progression Analysis of Density-normalized Events) algorithm for high-dimensional cytometry and single-cell data analysis.
Table of Contents¶
- Installation
- Quick Start
- Core Concepts
- API Reference
- Examples
- Method Background & Literature
- Comparison with Other Tools
Installation¶
Dependencies: numpy, scipy, scikit-learn, networkx, matplotlib, plotly, pandas.
Quick Start¶
import numpy as np
from densitree import SPADE
# Generate synthetic data: 1000 cells, 5 markers
rng = np.random.default_rng(42)
X = np.vstack([
rng.normal(loc=[0, 0, 5, 3, 1], scale=0.5, size=(300, 5)),
rng.normal(loc=[4, 4, 1, 1, 6], scale=0.5, size=(400, 5)),
rng.normal(loc=[8, 2, 3, 7, 2], scale=0.5, size=(300, 5)),
])
spade = SPADE(n_clusters=10, downsample_target=0.2, random_state=0)
spade.fit(X)
print(spade.labels_[:20]) # cluster labels for every cell
print(spade.result_.cluster_stats_) # per-cluster summary table
spade.result_.plot_tree(color_by=0, backend="matplotlib")
Core Concepts¶
SPADE works in five sequential steps:
-
Density estimation -- For each cell, compute local density via k-nearest-neighbor distances. Cells in crowded regions get high density scores.
-
Density-dependent downsampling -- Cells in dense regions are sampled with lower probability, while rare populations are preserved. This prevents abundant cell types from dominating the clustering.
-
Agglomerative clustering -- The downsampled cells are clustered into
n_clustersgroups using Ward's linkage hierarchical clustering. -
Upsampling -- Every original cell (not just the downsampled ones) is assigned to the cluster whose centroid is nearest.
-
Minimum spanning tree (MST) -- Cluster centroids are connected into a tree where edge weights are Euclidean distances. This tree captures the progression/hierarchy structure of the data.
Data transforms¶
Before any step runs, an optional transform is applied to all features:
transform |
Formula | Typical use |
|---|---|---|
"arcsinh" (default) |
arcsinh(x / cofactor) |
CyTOF data (cofactor=5) or flow cytometry (cofactor=150) |
"log" |
log(1 + x) |
Count data, scRNA-seq |
None |
Identity | Already-transformed data |
API Reference¶
SPADE¶
SPADE(
n_clusters=50, # number of clusters in the tree
downsample_target=0.05, # fraction of cells to retain
knn=5, # k for density estimation
transform="arcsinh", # "arcsinh", "log", or None
cofactor=150.0, # arcsinh cofactor
random_state=None, # seed for reproducibility
)
Methods:
| Method | Returns | Description |
|---|---|---|
fit(X) |
self |
Run the full SPADE pipeline |
fit_predict(X) |
ndarray |
Run pipeline and return cluster labels |
Attributes (after fit):
| Attribute | Type | Description |
|---|---|---|
labels_ |
ndarray[int] |
Cluster assignment for every cell |
result_ |
SPADEResult |
Rich output object (see below) |
Input X can be a numpy array or a pandas DataFrame. If a DataFrame is passed, column names are preserved in the result.
SPADEResult¶
| Attribute | Type | Description |
|---|---|---|
labels_ |
ndarray[int] |
Cluster labels (same as SPADE.labels_) |
tree_ |
networkx.Graph |
MST with node attributes size and median |
X_down |
ndarray |
Downsampled cell matrix |
down_idx |
ndarray[int] |
Indices of downsampled cells in the original array |
cluster_stats_ |
pd.DataFrame |
One row per cluster: size, median_<feature> columns |
Methods:
result.plot_tree(
color_by=None, # feature index (int) or name (str) to color nodes
size_by="count", # "count" scales nodes by cell count
backend="matplotlib", # "matplotlib" or "plotly"
)
BaseStep¶
Abstract base class for pipeline steps. Subclass it to create custom steps:
from densitree import BaseStep
import numpy as np
class MyDensityEstimator(BaseStep):
def run(self, data: np.ndarray, **ctx) -> dict:
# your custom density logic
return {"density": my_density_values}
spade = SPADE(density_estimator=MyDensityEstimator())
Examples¶
Example 1: Basic flow cytometry analysis¶
import pandas as pd
from densitree import SPADE
# Load a CSV exported from FlowJo or similar
df = pd.read_csv("flow_data.csv")
markers = ["CD3", "CD4", "CD8", "CD19", "CD56"]
X = df[markers]
spade = SPADE(
n_clusters=30,
downsample_target=0.1,
transform="arcsinh",
cofactor=150.0, # standard for fluorescence flow cytometry
random_state=42,
)
spade.fit(X)
# Explore clusters
print(spade.result_.cluster_stats_)
# Color tree by CD4 expression
fig = spade.result_.plot_tree(color_by="CD4", backend="matplotlib")
fig.savefig("spade_tree_cd4.png", dpi=150, bbox_inches="tight")
Example 2: CyTOF / mass cytometry data¶
from densitree import SPADE
# CyTOF uses cofactor=5 for arcsinh transform
spade = SPADE(
n_clusters=50,
downsample_target=0.05,
transform="arcsinh",
cofactor=5.0,
random_state=0,
)
spade.fit(X_cytof) # numpy array, shape (n_cells, n_markers)
# Interactive plotly visualization
fig = spade.result_.plot_tree(color_by=0, backend="plotly")
fig.show()
Example 3: Comparing conditions¶
import numpy as np
from densitree import SPADE
# Fit on combined data from two conditions
X_combined = np.vstack([X_healthy, X_disease])
condition = np.array(
["healthy"] * len(X_healthy) + ["disease"] * len(X_disease)
)
spade = SPADE(n_clusters=20, downsample_target=0.1, random_state=0)
spade.fit(X_combined)
# Count cells per cluster per condition
labels = spade.labels_
for cluster_id in range(20):
mask = labels == cluster_id
n_healthy = (condition[mask] == "healthy").sum()
n_disease = (condition[mask] == "disease").sum()
print(f"Cluster {cluster_id}: healthy={n_healthy}, disease={n_disease}")
Example 4: Working with the tree directly (networkx)¶
import networkx as nx
from densitree import SPADE
spade = SPADE(n_clusters=15, downsample_target=0.1, random_state=0)
spade.fit(X)
tree = spade.result_.tree_
# Find the two most connected clusters (highest degree)
degrees = sorted(tree.degree, key=lambda x: x[1], reverse=True)
print(f"Hub clusters: {degrees[:3]}")
# Shortest path between two clusters
path = nx.shortest_path(tree, source=0, target=10, weight="weight")
print(f"Path from cluster 0 to 10: {path}")
# Export tree for use in other tools
nx.write_graphml(tree, "spade_tree.graphml")
Example 5: Custom density estimator¶
import numpy as np
from densitree import SPADE, BaseStep
class KDEDensityEstimator(BaseStep):
"""Use scipy KDE instead of k-NN for density estimation."""
def __init__(self, bandwidth: float = 1.0):
self.bandwidth = bandwidth
def run(self, data: np.ndarray, **ctx) -> dict:
from scipy.stats import gaussian_kde
kde = gaussian_kde(data.T, bw_method=self.bandwidth)
density = kde(data.T)
return {"density": density}
spade = SPADE(
n_clusters=10,
downsample_target=0.1,
density_estimator=KDEDensityEstimator(bandwidth=0.5),
random_state=0,
)
spade.fit(X)
Example 6: Exporting results to a DataFrame¶
import pandas as pd
from densitree import SPADE
spade = SPADE(n_clusters=20, downsample_target=0.1, random_state=0)
spade.fit(X)
# Add cluster labels back to original data
df_annotated = pd.DataFrame(X, columns=marker_names)
df_annotated["spade_cluster"] = spade.labels_
df_annotated.to_csv("annotated_cells.csv", index=False)
# Export cluster statistics
spade.result_.cluster_stats_.to_csv("cluster_stats.csv")
Method Background & Literature¶
The SPADE algorithm¶
SPADE was introduced to address a fundamental challenge in cytometry: how to organize high-dimensional single-cell data into an interpretable hierarchy that reflects biological structure (e.g., hematopoietic differentiation).
The key insight is density-dependent downsampling. Traditional clustering on cytometry data is dominated by abundant cell types (e.g., mature lymphocytes), causing rare populations (e.g., progenitors, transitional cells) to be absorbed into larger clusters. By downsampling proportionally to the inverse of local density, SPADE ensures rare populations are represented in the clustering step.
After clustering the downsampled cells, a minimum spanning tree is constructed over the cluster centroids. This tree naturally captures gradual transitions between cell phenotypes -- branches in the tree correspond to differentiation trajectories.
Key references¶
- Qiu, P., Simonds, E.F., Bendall, S.C., et al. (2011). "Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE." Nature Biotechnology, 29(10), 886-891. doi:10.1038/nbt.1991
-
The original SPADE paper. Introduces the algorithm and demonstrates it on mouse and human bone marrow CyTOF data.
-
Bendall, S.C., Simonds, E.F., Qiu, P., et al. (2011). "Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum." Science, 332(6030), 687-696. doi:10.1126/science.1198704
-
High-profile application of CyTOF + SPADE for mapping the human hematopoietic system.
-
Qiu, P. (2017). "Toward deterministic and semiautomated SPADE analysis." Cytometry Part A, 91(7), 714-727. doi:10.1002/cyto.a.23068
-
Addresses the stochastic nature of the original algorithm and proposes improvements for reproducibility.
-
Samusik, N., Good, Z., Spitzer, M.H., Davis, K.L., & Nolan, G.P. (2016). "Automated mapping of phenotype space with single-cell data." Nature Methods, 13(6), 493-496. doi:10.1038/nmeth.3863
-
Systematic comparison of SPADE, FlowSOM, PhenoGraph, and other clustering methods on benchmark datasets.
-
Levine, J.H., Simonds, E.F., Bendall, S.C., et al. (2015). "Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis." Cell, 162(1), 184-197. doi:10.1016/j.cell.2015.05.047
- Introduces PhenoGraph; includes SPADE comparisons.
Comparison with Other Tools¶
SPADE implementations¶
| Feature | densitree | R spade (Bioconductor) | Cytobank | FlowJo plugin |
|---|---|---|---|---|
| Language | Python | R / C++ | Cloud (web UI) | Java (desktop) |
| Input formats | numpy, pandas | FCS files | FCS upload | FCS via FlowJo |
| scikit-learn compatible | Yes | No | No | No |
| Custom steps / extensible | Yes (BaseStep ABC) | Limited | No | No |
| Interactive visualization | plotly | No (static only) | Yes (built-in) | Yes (built-in) |
| FCS file parsing | No (bring your own) | Yes (built-in) | Yes | Yes |
| Maintained | Active | Archived | Active | Active |
| Cost | Free / open source | Free / open source | Commercial subscription | Commercial license |
| Reproducibility (seed) | Yes | Partial (improved in spade2) | Limited | Limited |
densitree vs. R spade (Bioconductor)¶
The original R implementation (spade, now archived) was the reference. It includes FCS parsing, C++-accelerated density estimation, and direct integration with the Bioconductor/flowCore ecosystem. densitree trades FCS parsing for Python ecosystem integration: it works natively with numpy arrays, pandas DataFrames, and scikit-learn conventions (fit/fit_predict). If you're already working in Python (e.g., with scanpy, anndata, or general ML pipelines), densitree avoids the R interop overhead.
densitree vs. Cytobank / FlowJo¶
Cytobank and FlowJo offer SPADE as part of larger GUI-driven analysis platforms. They are ideal for bench scientists who want point-and-click workflows with built-in gating, compensation, and visualization. densitree is for programmatic use: batch processing, reproducible pipelines, parameter sweeps, and integration with custom downstream analysis. You trade the GUI for full scripting control.
SPADE vs. other single-cell clustering methods¶
| Method | Key idea | Output | Best for |
|---|---|---|---|
| SPADE | Density-normalized downsampling + MST | Tree of clusters | Hierarchy/trajectory visualization, rare population preservation |
| FlowSOM | Self-organizing maps + consensus clustering | Grid/tree of clusters | Fast, scalable, good default for most panels |
| PhenoGraph | k-NN graph + Louvain community detection | Flat clusters | Discovering phenotypically distinct populations |
| UMAP + Leiden | Dimension reduction + graph clustering | Flat clusters + 2D embedding | Exploratory visualization, scRNA-seq |
When to use SPADE over alternatives:
- You care about tree structure -- SPADE's MST explicitly models gradual transitions between cell types, which is valuable for differentiation studies.
- You need to preserve rare populations -- the density-dependent downsampling is SPADE's signature feature.
- You want an interpretable number of clusters -- SPADE's
n_clustersparameter gives direct control, unlike graph-based methods where cluster count emerges from resolution parameters.
When to consider alternatives:
- For very large datasets (>1M cells), FlowSOM is typically faster.
- If you don't need tree structure and just want the best cluster assignments, PhenoGraph or Leiden clustering may give better separation.
- For scRNA-seq data, the scanpy/Leiden ecosystem is more mature and better integrated with downstream tools (DEG analysis, trajectory inference with PAGA, etc.).