Optimal Rates For Density-Based Clustering Using DBSCAN

Alessandro Rinaldo

Department of Statistics and Data Science

Carnegie Mellon University

joint work with Daren Wang and Xinyang Lu

September 8, 2018

WHOA-PSI 2018

Inference for Clustering

Clustering is one of the oldest and most important problems in data analysis. There is a vast literature in Statistics, Machine Learning, CS, Probability, etc., and countless algorithms...

Abstract formulation: organize a set of objects into groups, so that objects in the same group are maximally similar and objects in different groups are maximally dissimilar.

From a statistical standpoint, the goals, scope and performance of the clustering task are often poorly defined. In general, statistical inference for clustering is relatively underdeveloped.

Today I would like to (1) make a case in support of density-based clustering as a statistically principled paradigm for clustering and (2) present some consistency results.

Inference for Clustering

We are interested in clustering an i.i.d. sample $X_n = (X_1, \ldots, X_n)$ from an unknown probability distribution $P$ on $\mathbb{R}^d$, with $d$ fixed(!), having a Lebesgue density $p$. We wish to be as agnostic about $P$ as possible.

[Figure: scatter plots of four example data sets.]

Unlabeled panel (axis label: normal4[1:1000, ][,1])

EngyTime, n = 4096, dimension = 2, classes = 2, main problem: Gaussian mixture

Target, n = 770, dimension = 2, classes = 6, main problem: outliers

Chainlink, n = 1000, dimension = 3, classes = 2, main problem: not linearly separable



The Density-Based Clustering Approach: Hartigan (1975, 1981)

For a fixed threshold λ > 0, the λ-upper level set (high-density region) of p is

$$L(\lambda) = \{x \in \mathbb{R}^d : p(x) \geq \lambda\}.$$

A λ-cluster of P is a connected component of L(λ).

Alternatively, for α ∈ [0, 1], set $\lambda_\alpha = \sup\{\lambda : P(L(\lambda)) \geq \alpha\}$ and $L(\alpha) = L(\lambda_\alpha)$. An α-cluster of P is a connected component of L(α): a minimal volume set of prescribed probability content (see Rinaldo et al., 2012, and Chen, 2018+).
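To make the definition of a λ-cluster concrete, here is a minimal sketch in one dimension, where connected components of an upper level set are simply maximal runs of grid points above the threshold. The function name `level_set_clusters_1d`, the grid, and the two-component Gaussian mixture are hypothetical choices made for illustration only.

```python
import numpy as np


def level_set_clusters_1d(grid, density, lam):
    """Return (lo, hi) endpoints of maximal runs of grid points with density >= lam.

    On a 1D grid these runs approximate the connected components of L(lam).
    """
    above = density >= lam
    clusters = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                      # a new run begins
        elif not flag and start is not None:
            clusters.append((grid[start], grid[i - 1]))
            start = None                   # the run has ended
    if start is not None:                  # run reaching the end of the grid
        clusters.append((grid[start], grid[-1]))
    return clusters


def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


# Hypothetical density: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
grid = np.linspace(-5.0, 5.0, 2001)
density = 0.5 * normal_pdf(grid, -2.0, 0.5) + 0.5 * normal_pdf(grid, 2.0, 0.5)

high = level_set_clusters_1d(grid, density, lam=0.1)    # two separate modes
low = level_set_clusters_1d(grid, density, lam=1e-5)    # modes have merged
```

At the higher threshold the level set splits into one interval per mode, while at the very low threshold the two intervals merge into one, which is the nesting that the cluster tree records.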

Cluster Tree

The family $T$ of all λ-clusters is the cluster tree of $P$. $T$ satisfies the tree property: $A, B \in T$ implies that

A ⊂ B or B ⊂ A or A ∩ B = ∅.

This hierarchy of inclusions forms a dendrogram, with height parameter λ or α.

For topological and measure-theoretical details, see Steinwart (2015).
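The tree property is easy to check for a finite family of finite sets, such as the clusters of sample points produced by an estimator. The helper below is a small sketch; the name `has_tree_property` is hypothetical.

```python
from itertools import combinations


def has_tree_property(family):
    """Return True if every pair of sets in the family is nested or disjoint."""
    sets = [frozenset(s) for s in family]
    for a, b in combinations(sets, 2):
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True


nested_family = [{1, 2, 3}, {1, 2}, {3}, {4, 5}]   # nested or disjoint pairs only
overlapping_family = [{1, 2}, {2, 3}]              # overlap without nesting
```

Here `has_tree_property(nested_family)` holds, while `overlapping_family` violates the property because its two sets intersect without one containing the other.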

The Cluster Tree


The cluster tree of P is a data structure encoding all the clustering properties of P: high-density + connectivity.

It represents the 0-th order persistent homology of P.

The “number of clusters” depends on the height (level λ or α) at which the tree is cut.



In Kim et al. (2016) we look at the topology and a partial ordering over cluster trees.

Vast theoretical literature on level set estimation and some on clustering:

Support estimation: Korostelev and Tsybakov (1993), Mammen and Tsybakov (1995), Cuevas and Fraiman (1997), Biau, Cadre and Pelletier (2008), Patschkowski and Rohde (2015).

Level set estimation for fixed λ: Polonik (1995), Tsybakov (1997), Walther (1997), Scott and Nowak (2006), Cuevas, González-Manteiga and Rodríguez-Casal (2006), Singh, Scott and Nowak (2009), Rigollet and Vert (2011), Rinaldo and Wasserman (2010).

Cluster tree estimation: Koltchinskii (2000), Stuetzle and Nugent (2010). More recently: Chaudhuri, Dasgupta, Kpotufe and von Luxburg (2013), Balakrishnan et al. (2013), Steinwart (2011, 2012, 2015), Sriperumbudur and Steinwart (2012, 2017), Kim et al. (2016).

Some algorithms: DBSCAN, HDBSCAN, OPTICS, denpro, pdfcluster, DeBaCl, etc.

Back to Clustering the Sample Points...

Cluster Tree Estimator

A cluster tree estimator is a collection of subsets of $X_n$ with the tree property.

The best (oracle) estimator is the one whose clusters are the sets $A \cap X_n$, $A \in T$ (when non-empty). Of course, it is unreasonable to ask that any tree estimator will do as well as the oracle estimator.

The Chaudhuri-Dasgupta Approach

Define a separation criterion that quantifies how "far apart" any two high-density regions (clusters) are.

For each n, construct a collection $A_n$ of high-density regions of P (possibly clusters) that fulfill said criterion in such a way that, as n → ∞, the degree of separation is vanishing.

A cluster tree estimator $T_n$ is consistent with respect to $A_n$ when the following holds, simultaneously over all A ≠ A′ in $A_n$: with probability tending to 1 as n → ∞, the smallest clusters in $T_n$ containing $A \cap X_n$ and $A' \cap X_n$ (if non-empty) are disjoint.

A Blueprint for Constructing Cluster Tree Estimators

Typically, building a cluster tree estimator entails two steps:

Density estimation step (statistically hard): construct a density estimator to determine the high-density points. In this talk, we will use KDEs: for h > 0 and a kernel K, set

$$x \in \mathbb{R}^d \mapsto \hat{p}_h(x) = \frac{1}{n h^d C_d} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right),$$

where $C_d$ is a normalizing constant. Nearest-neighbor density estimators could also be used.

Connectivity step (computationally hard): cluster the high-density sample points. Using the topology of $\mathbb{R}^d$ is computationally infeasible (even if p were known exactly). It is necessary to use some approximation to speed things up.
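The density estimation step can be sketched as follows for the spherical kernel, assuming K is the indicator of the closed unit ball and the normalizing constant is the volume of the unit ball in $\mathbb{R}^d$. Both conventions, and the names `unit_ball_volume` and `spherical_kde`, are assumptions made for this illustration.

```python
import math

import numpy as np


def unit_ball_volume(d):
    """Volume of the unit Euclidean ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def spherical_kde(points, sample, h):
    """Evaluate the spherical-kernel KDE at each row of `points`.

    `points` has shape (m, d) and `sample` has shape (n, d). The estimate at x
    is the fraction of sample points within distance h of x, divided by the
    volume of a ball of radius h.
    """
    points = np.asarray(points, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n, d = sample.shape
    # Pairwise distances between evaluation points and sample points.
    dists = np.linalg.norm(points[:, None, :] - sample[None, :, :], axis=2)
    counts = (dists <= h).sum(axis=1)
    return counts / (n * h ** d * unit_ball_volume(d))


# Hypothetical toy sample in the plane.
toy_sample = [[0.0, 0.0], [0.5, 0.0], [3.0, 3.0]]
value_at_origin = spherical_kde([[0.0, 0.0]], toy_sample, h=1.0)[0]
```

With this kernel, ranking the sample points by estimated density amounts to ranking them by the number of neighbors within distance h.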

The DBSCAN-CT Estimator

For h > 0, let $\hat{p}_h$ be the spherical kernel density estimator and let

$$X_{(1)} \leq \ldots \leq X_{(n)}$$

be the sample points ordered based on their $\hat{p}_h$ values.

Algorithm 1: The DBSCAN Cluster Tree Estimator
1: Input: i.i.d. sample $X_n$ and h > 0.
2: for $k \in \{0, \ldots, n-1\}$ do
3:     Construct a graph $G_{h,k}$ with node set $\{X_{(n-i)} : i = 0, \ldots, k\}$ and edge set $\{(X_{(i)}, X_{(j)}) : \|X_{(i)} - X_{(j)}\| < 2h\}$.
4:     Compute $C(h,k)$, the set of connected components of $G_{h,k}$.
5: end for
6: Output: $T_n = \{C(h,k), k \in \{0, \ldots, n-1\}\}$.

In the connectivity step (line 4), the topology of the support is approximated with that of the 2h-neighborhood graph over $X_n$, single-linkage style. This can be efficiently computed with a union-find algorithm. The original DBSCAN algorithm was designed for “flat clustering” at a fixed k and is slightly different.
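A minimal sketch of Algorithm 1 with a union-find structure, under stated assumptions: the density ranking uses neighbor counts within distance h (equivalent to the spherical-kernel KDE up to a constant), ties in density are broken by sample index, and all pairwise distances are stored, which is only sensible for small n. The names `UnionFind` and `dbscan_cluster_tree` are hypothetical and this is not the authors' implementation.

```python
import numpy as np


class UnionFind:
    """Minimal union-find (disjoint-set) structure over indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[ri] = rj


def dbscan_cluster_tree(sample, h):
    """Return a list whose k-th entry is C(h, k), as a set of frozensets of indices."""
    sample = np.asarray(sample, dtype=float)
    n = sample.shape[0]
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
    counts = (dists <= h).sum(axis=1)            # proportional to the KDE value
    order = np.argsort(-counts, kind="stable")   # highest density first
    uf = UnionFind(n)
    active = []                                  # nodes of the current graph
    levels = []
    for idx in order:
        idx = int(idx)
        # Adding one node: link it to every active node closer than 2h.
        for j in active:
            if dists[idx, j] < 2 * h:
                uf.union(idx, j)
        active.append(idx)
        components = {}
        for j in active:
            components.setdefault(uf.find(j), set()).add(j)
        levels.append({frozenset(c) for c in components.values()})
    return levels


# Hypothetical toy sample: two tight groups far apart.
sample = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0], [5.2, 5.0]]
levels = dbscan_cluster_tree(sample, h=0.3)
```

Because the graph at step k + 1 only adds one node to the graph at step k, components can only grow or merge, so a single union-find pass over the density-ordered points covers all n graphs.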

Linking DBSCAN-CT to $\hat{p}_h$ (also, why 2h?)

For any λ ≥ 0, set

$$D(\lambda) = \{x : \hat{p}_h(x) \geq \lambda\} \cap X_n$$

and

$$\hat{L}(\lambda) = \bigcup_{X_j \in D(\lambda)} B(X_j, h). \qquad (1)$$

Let k and h be the input to DBSCAN. Then the node set of $G_{h,k}$ is the set $D(\lambda_k)$, where $\lambda_k = \frac{k}{n h^d v_d}$. Furthermore, two points $X_i$ and $X_j$ in $D(\lambda_k)$ are in the same connected component of $\hat{L}(\lambda_k)$ if and only if they are in the same graphical connected component of $G_{h,k}$. Consequently, for any pair A and A′ of subsets of $\mathbb{R}^d$ with $A \cap X_n \neq \emptyset$ and $A' \cap X_n \neq \emptyset$,

if $A \subset \hat{L}(\lambda_k)$ is connected, all the sample points in A belong to the same connected component of $G_{h,k}$;

if A and A′ belong to distinct connected components of $\hat{L}(\lambda_k)$, then the sample points in A and the sample points in A′ belong to distinct connected components of $G_{h,k}$.
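As a hedged illustration of this link (and of the factor 2), assume the spherical kernel is the indicator of the unit ball, $K(u) = \mathbf{1}\{\|u\| \leq 1\}$, normalized by $C_d = v_d$, the volume of the unit ball in $\mathbb{R}^d$. These conventions, and the use of open balls in the last line, are assumptions rather than statements from the slides. Then

$$
\begin{aligned}
\hat{p}_h(x) &= \frac{|B(x,h) \cap X_n|}{n h^d v_d}, \\
\hat{p}_h(X_i) \geq \lambda_k = \frac{k}{n h^d v_d} &\iff |B(X_i,h) \cap X_n| \geq k, \\
B(X_i,h) \cap B(X_j,h) \neq \emptyset &\iff \|X_i - X_j\| < 2h.
\end{aligned}
$$

So the nodes of the graph are the sample points with at least k sample points within distance h (the DBSCAN core points), and an edge at distance below 2h means exactly that the two balls of radius h overlap.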

Consistency of DBSCAN-CT for Arbitrary Densities

(ε, σ)-separation criterion

Let ε, σ > 0. Two connected subsets A and A′ of the support of P are said to be (ε, σ)-separated when

they belong to different connected components of $L(\lambda^* - \varepsilon)$, where $\lambda^* = \inf_{x \in A \cup A'} p(x) > \varepsilon$, and

$\min_{k \neq l} \mathrm{dist}(C_k, C_l) > \sigma$, where $C_1, \ldots, C_m$ are the connected components of $L(\lambda^* - \varepsilon)$.

The parameter σ quantifies the geometric (horizontal) distance while ε measures the probabilistic (vertical) distance.

In addition, we avoid sets that are too thin.

Thick sets

A subset A is h-thick if $A_{-h} := \{x \in A : B(x,h) \subset A\} \neq \emptyset$.

Consistency of DBSCAN-CT for Arbitrary Densities

(ε, σ)-consistency of DBSCAN-CT

For a constant C = C(d), when the DBSCAN-CT estimator is computed with parameter

$$\frac{\sigma}{4} \geq h \geq C \left(\frac{\log n}{n \varepsilon^2}\right)^{1/d},$$

the following holds with probability at least $1 - \frac{1}{n}$, uniformly over all 2h-thick and (ε, σ)-separated clusters A and A′: $A_{-2h} \cap X_n$ and $A'_{-2h} \cap X_n$ (if non-empty) each belong to a separate cluster of $T_n$.

Paraphrasing, and ignoring log terms, the sample complexity of DBSCAN-CT for (ε, σ)-consistency of 2h-thick clusters is

$$n \geq C(d) \frac{1}{\varepsilon^2 \sigma^d}.$$

Note: we may let ε, σ, h → 0 as n → ∞.
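A short worked step behind the paraphrase, assuming the lower bound on h scales as $(\log n / (n \varepsilon^2))^{1/d}$: the window of admissible bandwidths is non-empty exactly when the lower bound does not exceed σ/4.

$$
C \left(\frac{\log n}{n \varepsilon^2}\right)^{1/d} \leq \frac{\sigma}{4}
\iff
\frac{n}{\log n} \geq \frac{(4C)^d}{\varepsilon^2 \sigma^d}.
$$

Dropping the log n factor and absorbing $(4C)^d$ into a constant depending only on d gives the stated sample complexity.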


The above sample complexity is minimax rate optimal, up to a log n factor.

The previous result is nearly identical to (and follows easily from the proof of) the seminal consistency result for cluster tree estimation of Chaudhuri and Dasgupta (2010), who analyzed a different, computationally more expensive algorithm.

The (ε, σ)-separation criterion applies to arbitrary densities. Because of that, the vertical and horizontal separation parameters are essentially decoupled and, as a result, the sample complexity depends jointly on them, i.e. on $\frac{1}{\varepsilon^2 \sigma^d}$.

Furthermore, the resulting rate is agnostic to the degree of smoothness of the underlying density...

...which begs the question:

Can faster rates be obtained with smoother densities?

Brief Recap of Nonparametric Density Estimation

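A Lebesgue density $p : \mathbb{R}^d \to \mathbb{R}$ is said to belong to the Hölder class Σ(L, α) with parameters α, L > 0 when, $\forall x, y \in \mathbb{R}^d$ and $\forall s \in \mathbb{N}^d$ with $|s| := \sum_{i=1}^d s_i = \lfloor \alpha \rfloor$,

$$|D^s p(x) - D^s p(y)| \leq L \|x - y\|^{\alpha - |s|}.$$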

When 0 < α ≤ 1, the above reduces to a Lipschitz condition on p.

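Variance: If K is “well behaved”, then

$$\mathbb{P}\left( \|\hat{p}_h - p_h\|_\infty \leq C \sqrt{\frac{\log(1/\gamma) + \log(1/h)}{n h^d}} \right) \geq 1 - \gamma, \quad \gamma \in (0, 1),$$

where $C = C(\|p\|_\infty, d, K) > 0$ and $p_h(x) = \mathbb{E}[\hat{p}_h(x)]$, $x \in \mathbb{R}^d$.

Bias: If p ∈ Σ(L, α), then, for some C′ = C′(L, α) > 0,

$$\|p_h - p\|_\infty \leq C' h^{\alpha}.$$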

If α > 1, the kernel K has to be α-valid (see Rigollet and Vert, 2009).
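The two bounds can be balanced as follows. This is a heuristic sketch that takes γ = 1/n, treats log(1/h) as of order log n, and assumes the variance bound has the form $\sqrt{\log n / (n h^d)}$.

$$
h^{\alpha} \asymp \sqrt{\frac{\log n}{n h^d}}
\iff h^{2\alpha + d} \asymp \frac{\log n}{n}
\iff h \asymp \left(\frac{\log n}{n}\right)^{1/(2\alpha + d)},
\qquad
\|\hat{p}_h - p\|_\infty \lesssim \left(\frac{\log n}{n}\right)^{\alpha/(2\alpha + d)}.
$$

Under these assumptions, the bandwidth that equates bias and variance yields the familiar sup-norm rate for densities in Σ(L, α).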

δ-Separation Criterion

For smooth densities, one can formulate a one-parameter criterion for cluster separation. The smoothness of the density ensures that the vertical and horizontal distances cannot be decoupled.

δ-separation criterion

Two connected subsets A and A′ of the support of p are δ-separated when they belong to distinct connected components of the level set $\{x \in \mathbb{R}^d : p(x) > \lambda^* - \delta\}$, where $\lambda^* := \inf_{x \in A \cup A'} p(x)$.

For continuous densities, this corresponds to the notion of merge distance by Eldridge et al. (2015).

Let $\{\delta_n\}_n \subset \mathbb{R}_+$ and $\{\gamma_n\}_n \subset (0, 1)$ be vanishing.

A cluster tree estimator $T_n$ is $\delta_n$-consistent if, with probability no smaller than $1 - \gamma_n$, for any pair of connected subsets A and A′ of the support of p that are $\delta_n$-separated, the two smallest clusters in $T_n$ containing $A \cap X_n$ and $A' \cap X_n$ (if non-empty) are distinct.

The sequence δn then defines a consistency rate.

Interestingly, we need to distinguish two cases: α ≤ 1 and α > 1.

δ-Consistency: α ≤ 1

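The DBSCAN-CT estimator with parameter h of order $\left(\frac{\log n}{n}\right)^{1/(2\alpha + d)}$ is $\delta_n$-consistent with rate

$$\delta_n \geq C \left(\frac{\log n}{n}\right)^{\alpha/(2\alpha + d)} \quad \text{and} \quad \gamma_n = \frac{1}{n},$$

for a constant $C = C(\|p\|_\infty, L, d)$.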

The above rate is minimax optimal, up to a log n factor.

This is not surprising: Sriperumbudur and Steinwart (2012) have an analogous result about DBSCAN in different settings.

δ-Consistency: α > 1

If p ∈ Σ(L, α) with α > 1, the DBSCAN-CT estimator is no longer rate optimal, for two types of reasons:

Statistical reasons: by standard nonparametric theory, in order to minimize the bias of $\hat{p}_h$ we can no longer use a spherical kernel but instead need to deploy a smoother kernel (e.g., an α-valid kernel). Easy to fix.

Computational reasons: when using a single-linkage type of method for the connectivity step, the error we incur in approximating the level sets of p is of order h, larger than the order of the bias, $h^{\alpha}$. So, when α > 1, DBSCAN-CT would not be optimal even if we could use the true density to get the high-density points! Hard to fix.

Theoretically trivial but computationally infeasible consistent estimator: use an α-valid kernel to get a rate-optimal KDE $\hat{p}_h$, and compute the connected components of the upper level set of $\hat{p}_h$ to cluster the data.

Instead we will derive conditions on p ensuring that single-linkage still works!

Assumptions for α > 1

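There exists a δ0 > 0 such that, for any split level λ∗ of p and any 0 < δ ≤ δ0, the set $\Omega_\delta = \{x : p(x) \geq \lambda^* + \delta\}$ satisfies the following conditions.

Standard Assumption: for any r ≥ 0 and $x \in \Omega_\delta$,

$$\mathrm{Vol}(B(x, r) \cap \Omega_\delta) \gtrsim v_d r^d.$$

The Covering Condition: for any r > 0, there exists a collection of points $N_r \subset \Omega_\delta$ such that $\mathrm{card}(N_r) \lesssim r^{-d}$ and

$$\bigcup_{y \in N_r} B(y, r) \supset \Omega_\delta.$$

Low Noise Condition: letting $\{C_k\}_{k=1}^m$ be the connected components of $\{x : p(x) > \lambda^*\}$,

$$\min_{k \neq k'} \mathrm{dist}\left(C_k \cap \{p \geq \lambda^* + \delta\},\; C_{k'} \cap \{p \geq \lambda^* + \delta\}\right) \gtrsim \delta^{1/\alpha}.$$

There exist densities satisfying the above conditions, for any α > 1. For example, natural splines (α = 3) and Morse functions (α = 2).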

δ-Consistency: α > 1

Consider the DBSCAN-CT(α) estimator: same as the DBSCAN-CT estimator, except that a KDE with an α-valid kernel (α > 1) is used to rank the data. It can similarly be computed with a union-find algorithm.

Let p ∈ Σ(L, α), α > 1, be any density function with compact connected support and finitely many split levels bounded from below by λ0 > 0 and satisfying the above conditions. Then, the DBSCAN-CT(α) estimator with parameter h of order $\left(\frac{\log n}{n}\right)^{1/(2\alpha + d)}$ is $\delta_n$-consistent with rate

$$\delta_n \gtrsim \left(\frac{\log n}{n}\right)^{\alpha/(2\alpha + d)} \quad \text{and} \quad \gamma_n \lesssim \frac{1}{n} + h^{-d} \exp\{-\lambda_0 n h^d\}.$$

The constants depend on d, L, $\|p\|_\infty$ and α.

The above rate is minimax optimal, up to log n factors.

We have shown that DBSCAN-based procedures deliver, under some conditions, optimal consistency rates for cluster tree estimation with faster rates arising from smoother densities, while remaining computationally feasible.

Interestingly, the rates we obtain match the rates for minimax estimation of densities in Σ(L, α) in the ‖ · ‖∞ norm. Though in hindsight this may not be that surprising, this result further confirms that the ‖ · ‖∞ norm is a good (possibly the right) metric for cluster tree consistency. See also Kim et al. (2016).

It is surprising (to us) however that one can in principle perform optimal clustering using computationally feasible methods for the connectivity step.

Future work

For instance...

Choice of the tuning parameter h.

Adaptivity to α.

Extension to algorithms based on knn-graphs and knn-density estimators.

Extension to non-Euclidean data.

Inference beyond consistency: confidence sets for the cluster tree.

Study single-linkage type procedures to estimate higher order topological properties of the densities. In an upcoming work we will show how to draw statistical inference for the persistent homology of high-density sets using the Rips complex, further advancing a line of work initiated by Bobrowski, Mukherjee and Taylor (2014).

Investigate if and how density-based clustering can work in high dimensions.

Inference for Cluster Trees: Beyond Consistency

How do we carry out statistical inference for a discrete object such as a cluster tree?

In Kim et al. (2016) we illustrate the difficulties in building confidence sets for density trees. We also propose bootstrap-based pruning methods to “simplify” the tree and eliminate spurious clusters.

[Figure: three example data sets, each with a corresponding panel labeled alpha = 0.05 (axis ticks from 0.0 to 1.0).]

Ring data, n = 1200

Mickey mouse data, n = 1200

Yingyang data, n = 3200







It is worth noting that this is a strongly nonparametric problem for which only one-sided inference (Donoho, 1988) is feasible: it is not possible to learn the entire tree, but we may be able to discern parts of it.

Density Clustering in High-Dimensions

Our results are established within the classic nonparametric framework, leading to dimension-dependent rates! This would suggest that density clustering cannot/should not be done in high dimensions.

In fact, empirically that is not the case (see next slide).

Dimension-dependent rates are a bias issue, which calls for a vanishing h. But if we keep h fixed and we target $p_h$ instead of p, then the dimension no longer affects the rate. (It is still in the constants!)

Intuition: in high dimensions, density clustering may still work provided that good clustering solutions come from heavily biased density estimators. This paradigm requires a different analysis.


High-Dimensional Density-Based Clustering: Population Admixture

Data from the Human Genome Diversity Project (HGDP) dataset, available at


Cleaned-up dataset comprised of 11,775 SNPs from 931 subjects from 53 populations from Crosset et al. (2010).

The goal of the analysis is to identify the hierarchy of high-density clusters of individuals in the sample, ideally capturing the correct membership in populations.

We use DeBaCl, a density clustering algorithm that uses knn-graphs. Below, in the first level set tree k = 40; in the second, k = 6.

