data analysis a statistical approach to topological...a statistical approach to topological data...
TRANSCRIPT
![Page 1: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/1.jpg)
Bertrand MICHEL
Laboratoire de Statistique Theorique et AppliqueeUniversite Pierre et Marie Curie
A Statistical Approach to TopologicalData Analysis
Stochastic Geometry ConferenceNantes, April 6 2016
![Page 2: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/2.jpg)
I - Introduction : Statistics andTopological Data Analysis
![Page 3: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/3.jpg)
Topological data analysis and topological inference
• Geometric inference, algebraic topology tools and computationaltopology have recently witnessed important developments with regardsto data analysis, giving birth to the field of topological data analysis(TDA).
• The aim of TDA is to infer relevant, qualitative and quantitative topolog-ical structures (clusters, holes ...) directly from the data.
• The two popular methods in TDA : Mapper algorithm [Singh et al., 2007]and persistent homology [Edelsbrunner et al., 2002].
• Topological inference methods aim toinfer topological properties of an unknowntopological space X, typically from a pointcloud Xn “close” to X.
![Page 4: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/4.jpg)
[distribution of galaxies]
[3D shape database]
[Sensor Data]
Application fields of TDA methods
![Page 5: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/5.jpg)
Topological data analysis methods can be used:
• For exploratory analysis, visualization:
• For feature extraction in supervised settings (prediction) :
[Chazal et al., 2014b]
[Chazal et al., 2015a]
![Page 6: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/6.jpg)
Non exhaustive list of questions for a statistical approach to TDA :
• proving consistency of TDA methods.
• providing confidence regions for topological features and discussing thesignificance of the estimated topological quantities.
• selecting relevant scales at which the topological phenomenon should beconsidered.
• dealing with outliers and providing robust methods for TDA.
• ...
Statistics and TDAUntil very recently, TDA and topological inference mostly relied on deterministicapproaches. Alternatively, a statistical approach to TDA means that :
• we consider data as generated from an unknown distribution
• the inferred topological features by TDA methods are seen as estimatorsof topological quantities describing an underlying object.
![Page 7: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/7.jpg)
II- Homologyand
Persistent homology
![Page 8: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/8.jpg)
Basic tools for TDA : Offsets and Simplicial Complexes
Point clouds in themselves do not carry any non trivial topological or geometricstructure.
For a point cloud Xn in Rd (or in a metric space),the α-offset of Xn is defined by
Xαn =⋃x∈Xn
B(x, α).
More generally, for any compact set X,
Xα :=⋃x∈X
B(x, α) = d−1X ([0, α])
where the distance function dX to X is
dX(y) = infx∈X‖x− y‖ (in Rd)
General idea: deduce from (Xαn)r>0 some topological and geometric informationof an underlying object.
![Page 9: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/9.jpg)
Basic tools for TDA : Offsets and Simplicial Complexes
Non-discrete sets such as offsets, and also continuous mathematical shapes likecurves, surfaces cannot easily be encoded as finite discrete structures.
A geometric simplicial complex C is a set of simplices such that:
• Any face of a simplex from C is also in C.
• The intersection of any two simplices s1, s2 ∈ C is either a face of boths1 and s2, or empty.
![Page 10: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/10.jpg)
Basic tools for TDA : Offsets and Simplicial ComplexesExamples:
• A simplex [x0, x1, · · · , xk] is in the Cech complex Cechα(Xn) if and only
if⋂kj=0B(xj , α) 6= ∅.
• A simplex [x0, x1, · · · , xk] is in the Rips complex Ripsα(Xn) if and only if‖xj − xj′‖ ≤ α for all j, j′ ∈ {1, . . . , k}.
Can be also defined for a set of points in any metric space or for any compactmetric space.
Rips Cech
Nerve Theorem : the offsets Xαn of a point cloud Xn in Rd are homotopy equiv-alent to the Cech complex Cechα(Xn)
![Page 11: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/11.jpg)
Filtrations of simplicial complexes
Given a point cloud Xn in Rd, we generally define a filtration of (nested simpli-cial) complexes by considering all the possibles scale parameters α : (Cα)α∈A
α
Cα1 Cα2
![Page 12: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/12.jpg)
Homology inference
• Singular homology provides a algebraic description of “holes” in a geo-metric shape (connected components, loops, etc ...)
• Betti number βk is the rank of the k-th homology group.
• Computational Topology : Betti numbers can be computed on simplicialcomplexes.
Homology inference [Niyogi et al., 2008 and 2011] [Balakrishnan et al., 2012] :The Betti number (actually the homotopy type) of Riemannian manifolds withpositive reach can be recovered with high probability from offsets of a sample on(or close to) the manifold.
![Page 13: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/13.jpg)
Persistent homology
Starting from a point cloud Xn, let Filt = (Cα)α∈A be a fitration of nestedsimplicial complexes.
• multiscale information ;
• more stable and more robust ;
• (but does not answer the scale selection problem...)
α
Persistent homology: identification of “persistent” topological features along thefiltration.
![Page 14: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/14.jpg)
Barecodes and Persistence Diagrams
Xn
Barecode
Filtration of simplicialcomplexes Filt(Xn)
Offsets
![Page 15: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/15.jpg)
Barecodes and Persistence Diagrams
Dgm (Filt(Xn))Persistence diagram of the
filtration Filt(Xn) built on Xn.
Xn
Barecode
Filtration of simplicialcomplexes Filt(Xn)
Offsets birth
death
![Page 16: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/16.jpg)
Distance between persistence diagrams and stability
birth
death
∞
0
Multiplicity: 2
Add the diagonal
Dgm1
Dgm2
The bottleneck distance between two diagrams Dgm1 and Dgm2 is
db(Dgm1,Dgm2) = infγ∈Γ
supp∈Dgm1
‖p− γ(p)‖∞
where Γ is the set of all the bijections between Dgm1 and Dgm2 and
‖p− q‖∞ = max(|xp − xq|, |yp − yq|).
![Page 17: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/17.jpg)
Distance between persistence diagrams and stability
birth
death
∞
0
Multiplicity: 2
Add the diagonal
Theorem [Chazal et al., 2012]: For any compact metric spaces (X, ρ) and (Y, ρ′),
db (Dgm(Filt(X)),Dgm(Filt(Y))) ≤ 2 dGH (X,Y) .
Consequently, if X and Y are embedded in the same metric space (M, ρ) then
db (Dgm(Filt(X)),Dgm(Filt(Y))) ≤ 2 dH (X,Y) .
Dgm(Filt(Y))
Dgm(Filt(X))
![Page 18: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/18.jpg)
III - Statisticsand
Persistent homology
![Page 19: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/19.jpg)
Persistence diagram inference [Chazal et al., 2014b]
∞
00
Xn Filt(Xn)
Dgm(Filt(Xn))n points sampled in Xaccording to µ
Filt(X)
∞
00
Dgm(Filt(X))
X
Convergence???
Estimator of Dgm(Filt(K))
(M, ρ) metric spaceX compact set in M. well defined for any
compact metric space[Chazal et al., 2012]
Joint work with F. Chazal, M. Glisse and C. Labruere.
![Page 20: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/20.jpg)
Theorem: For a, b > 0 :
supµ∈P(a,b,M)
E[db(Dgm(Filt(Xµ)),Dgm(Filt(Xn)))
]≤ C
(lnn
n
)1/b
where C only depends on a and b.
Under additional technical hypotheses, for any estimator Dgmn of Dgm(Filt(Xµ)):
lim infn→∞
supµ∈P(a,b,M)
E[db(Dgm(Filt(Xµ)), Dgmn)
]≥ C′n−1/b
where C′ is an absolute constant.
For a, b > 0, µ satisfies the (a, b)-standard assumption on its support Xµ if for anyx ∈ Xµ and any r > 0 :
µ(B(x, r)) ≥ min(arb, 1).
P(a, b,M) : set of all the probability measures satisfying the (a, b)-standard as-sumption on the metric space (M, ρ).
Persistence diagram inference [Chazal et al., 2014a]
![Page 21: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/21.jpg)
Confidence sets for persistence diagrams [Fasy et al., 2014]
P(
Dgm(Filt(K)) ∈ R)≥ 1− α ??
![Page 22: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/22.jpg)
Confidence sets for persistence diagrams [Fasy et al., 2014]
Using the Hausdorff stability, we can define confidence sets for persistence dia-grams:
db (Dgm (Filt(K)) ,Dgm (Filt(Xn))) ≤ dH(K,Xn).
It is sufficient to find cn such that
lim supn→∞
(dH(K,Xn) > cn
)≤ α.
P(
Dgm(Filt(K)) ∈ R)≥ 1− α ??
![Page 23: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/23.jpg)
IV - Robust distance functions forTDA and geometric inference
![Page 24: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/24.jpg)
Standard TDA methods are not robust to outliers
Xr :=⋃x∈X
B(x, r)
= d−1X ([0, r])
where the distancefunction dX to X is
dX(y) = infx∈X‖x− y‖
![Page 25: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/25.jpg)
Standard TDA methods are not robust to outliers
Xr :=⋃x∈X
B(x, r)
= d−1X ([0, r])
where the distancefunction dX to X is
dX(y) = infx∈X‖x− y‖
![Page 26: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/26.jpg)
We would like to consider the sub levels of an alternative distance functionrelated to the sampling measure, which support is X, or close to X.
Robust TDA with an alternative distance function ?
![Page 27: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/27.jpg)
Preliminary distance function to a measure P :Let u ∈]0, 1[ be a positive mass, and P a probability measure on Rd:
δP,u(x) = inf {r > 0 : P (B(x, r)) ≥ u}
supp(P )
x
δP,u(x)
u
δP,u is the smallest distance needed tocapture a mass of at least u.
δP,u is the quantile function at u of the r.v.
‖x−X‖
where X ∼ P .
Distance To Measure [Chazal et al., 2011]
![Page 28: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/28.jpg)
Preliminary distance function to a measure P :Let u ∈]0, 1[ be a positive mass, and P a probability measure on Rd:
δP,u(x) = inf {r > 0 : P (B(x, r)) ≥ u}
supp(P )
x
δP,u(x)
u
Definition: Given a probability measure Pon Rd and m > 0, the distance function tothe measureP (DTM) is defined by
dP,m : x ∈ Rd 7→(
1
m
∫ m
0
δ2P,u(x)du
)1/2
Distance To Measure [Chazal et al., 2011]
![Page 29: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/29.jpg)
Distance To Measure [Chazal et al., 2011]
• Stability under Wassertein perturbations:
‖dP,m − dQ,m‖∞ ≤1√mW2(P,Q)
• The function x 7→ d2P,m(x) is semiconcave, this is ensuring strong reg-ularity properties on the geometry of its sublevel sets.
• Consequently, if P is a probability distribution close to P for Wassersteindistance W2, then the sublevel sets of dP ,m provide a topologicallycorrect approximation of the support of P .
Properties of the DTM :
![Page 30: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/30.jpg)
Let X1, . . . , Xn sample according to P and let Pn be the empirical measure.Then
d2Pn,
kn
(x) =n
k
k∑i=1
||x−X(i)||2
where ||X(1) − x|| ≥ ||X(2) − x|| ≥ · · · ≥ ||X(k) − x| · · · ≥ ||X(n) − x||
Distance to The Empirical Measure (DTEM)
x
k = 8
X(i)
![Page 31: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/31.jpg)
Estimation of the DTM via the empirical DTM
Quantity of interest:d2Pn,
kn
(x)− d2P, kn
(x)
• Observe that
d2P,m(x) =1
m
∫ m
0
F−1x (u)du
where Fx is the cdf of ‖x−X‖2 with X ∼ P .
• The distance to the empirical measure is the empirical counter part ofthe distance to P :
dPn,m(x)2 =1
m
∫ m
0
F−1x,n(u)du
where Fx,n is the cdf of ‖x−X‖2 with X ∼ Pn.
• Finally we get that
d2Pn,
kn
(x)− d2P, kn
(x) =1
m
∫ m
0
{F−1x,n(u)− F−1x (u)
}du
[Chazal et al., 2014b] and [Chazal et al., 2015b]
![Page 32: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/32.jpg)
Estimation of the DTM via the empirical DTM
Quantity of interest:d2Pn,
kn
(x)− d2P, kn
(x)
Two complementary approaches of the problem:
• Asymptotic approach : knn = m is fixed and n tends to infinity.
• Non asymptotic approach : n is fixed, and we want a tight controlover the fluctuations of the empirical DTM, in function of k, whichcan be taken very small.
We do not use Wasserstein stability for either of the two approaches.Wasserstein rates of convergence [Fournier and Guillin, 2013 ;Dereich et al.,2013] do not provide tight rates for the DTM in this context.
[Chazal et al., 2014b] and [Chazal et al., 2015b]
![Page 33: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/33.jpg)
Theorem: Let P be a measure on Rd with compact support. Let D be acompact domain on Rd and m ∈ (0, 1). Assume that there exists an uni-form upper bound ωD on the modulus of continuity for the family (F−1x )x∈Dsatisfying
limu→0
ωD(u) = ωD(0) = 0.
Then√n(d2Pn,m − d
2P,m)converges in distribution to B on D, where B is a
centered Gaussian process with covariance kernel
κ(x, y) =1
m2
∫ F−1x (m)
0
∫ F−1y (m)
0
(P[B(x,
√t) ∩B(y,
√s)]−Fx(t)Fy(s)
)ds dt.
Functional convergence [Chazal et al., 2014b]
Modulus of continuity ωx of F−1x : for any v ∈ (0, 1]
ωx(v) := sup(u,u′)∈[0,1]2, u 6=u′, ‖u−u′‖≤v
|F−1x (u)− F−1x (u′)|.
joint work with F. Chazal, B. Fasy, F. Lecci, A. Rinaldo and L. Wasserman
![Page 34: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/34.jpg)
Theorem: Let x be a fixed observation point in Rd. Assume that ωx :(0, 1] → R+ is an upper bound on the modulus of continuity of F−1x . Letk < n
2 . For any λ > 0:
P(∣∣∣d2Pn, kn (x)− d2
P, kn(x)∣∣∣ ≥ λ) ≤ 2 exp
(−n
8
kn[
ωx(kn
)]2λ2)
+ ...
Assume moreover that ωx(u)/u is a non increasing function, then
E(∣∣∣d2Pn, kn (x)− d2
P, kn(x)∣∣∣) ≤ C√
n
√n
kωx
(k
n
).
Fluctuations of the DTEM [Chazal et al., 2015b]
E(∣∣∣d2Pn, kn (x)− d2
P, kn(x)∣∣∣) ≤ Cn
k
1√n
√k
nωx
(k
n
)renormalization by the mass proportion
parametric rate of convergence
localization at the origin
statistical complexityof the problem
joint work with F. Chazal and P. Massart
![Page 35: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/35.jpg)
Fluctuations of the DTEM [Chazal et al., 2015b]
The quantile function F−1x carries some geometric information.For instance ω(0+) = 0 means that the support of dFx is a closed interval.
![Page 36: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/36.jpg)
Bootstrap and significance of topological features[Chazal et al., 2014b]
Aim : studying the persistent homology of the sub-levels of the DTM andproviding confidence regions.
Two alternative boostrap methods :
• by bootstrapping the DTM
• Bottleneck Bootstrap
![Page 37: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/37.jpg)
Bootstrap and significance of topological features[Chazal et al., 2014b]
Bootstrapping the DTM
For m ∈ (0, 1), define cα by
P(√n||d2P,m − d2Pn,m||∞ > cα
)= α.
Let X∗1 , . . . , X∗n be a sample from Pn, and let P ∗n be the corresponding (boot-
strap) empirical measure.
We consider the bootstrap quantity dP∗n ,m(x) of dPn,m.
The bootstrap estimate cα is defined by
P(√
n||d2Pn,m − d2P∗n ,m
||∞ > cα |X1, . . . , Xn
)= α
where cα can be approximated by Monte Carlo.
Theorem: If F−1x is regular enough, the distance to measure function isHadamard differentiable at P . Consequently, the bootstrap method for theDTM is asymptotically valid.
![Page 38: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/38.jpg)
Bootstrap and significance of topological features[Chazal et al., 2014b]
Dgm : persistence diagram of the sub-levels of dP,m
Dgm : persistence diagram of the sub-levels of dPn,m.Let
Cn =
{E ∈ Diag : db(Dgm, E) ≤ cα√
n
},
where Diag is the set of all the persistence diagrams.
Then,
P(Dgm ∈ Cn) = P(
db(Dgm, Dgm) ≤ cα√n
)≥ P
(‖d2P,m − d2Pn,m‖∞ ≤
cα√n
)
Bootstrapping the DTM
Bootstrap estimate
![Page 39: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/39.jpg)
Bootstrap and significance of topological features[Chazal et al., 2014b]
The Bottleneck Bootstrap
Dgm : persistence diagram of the sub-levels of dP,m
Dgm : persistence diagram of the sub-levels of dPn,m.
Dgm∗
: persistence diagram of the sub-levels of dP∗n ,m.
We directly bootstrap in the set of the persistence diagram by considering the
random quantity db(Dgm∗, Dgm). We define tα by
P(√
ndb(Dgm∗, Dgm) > tα |X1, . . . , Xn
)= α.
The quantile tα can be estimated by Monte Carlo.
![Page 40: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/40.jpg)
Bootstrap and significance of topological features[Chazal et al., 2014b]
In practice, the bottleneck bootstrap can lead to more precise inferences becausein many cases the following stability result is not sharp
db(Dgm,Dgm) ≤ ‖d2P,m − d2Pn,m‖∞.
For both methods we can identify significant features by putting a band of size2cα or 2tα around the diagonal:
![Page 41: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/41.jpg)
Concluding remarks
• TDA methods focus on the topological properties (homology / persistenthomology) of a shape.
• TDA methods can be used
– as an “exploratory method”, in particuar when the point cloud issampled on (close to) a real geometric object
– as a “feature extraction” procedure, next these extracted features canbe used for learning purposes.
• TDA is an emerging field, at the interface maths, computer sciences, statis-tics.
• Many topics about the statistical analysis of TDA
• Applications in many fields of sciences ( medecine, biology, dynamic sys-tems, astronomy, dynamical systems, physics ...)
• TDA methods need to bring together Geometric Inference, ComputationalTopology and Geometry, Statistics and Learning methods.
![Page 42: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/42.jpg)
Thank you !
![Page 43: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/43.jpg)
References[Balakrishnan et al., 2012] Balakrishnan, S., Rinaldo, A., Sheehy, D., Singh, A., and Wasser-
man, L. A.. Minimax rates for homology inference. Journal of Machine Learning Research- Proceedings Track, 22:64–72.
[Bubenik, 2015] Bubenik, P. Statistical topological data analysis using persistence landscapes.Journal of Machine Learning Research, 16:77–102.
[Caillerie and Michel, 2011] Caillerie, C. and Michel, B. (2011). Model selection for simplicialapproximation. Foundations of Computational Mathematics, 11(6):707–731.
[Carlsson, 2009] Carlsson, G. Topology and data. AMS Bulletin, 46(2):255–308.
[Chazal and Lieutier, 2007] Chazal, F. and Lieutier, A. Stability and computation of topolog-ical invariants of solids in {\ Bbb R}ˆ n. Discrete & Computational Geometry, 37(4):601–617.
[Chazal et al., 2012] Chazal, F., de Silva, V., Glisse, M., and Oudot, S. The structure andstability of persistence modules. arXiv preprint arXiv:1207.3674.
[Chazal et al., 2014a] Chazal, F., Glisse, M., Labruere, C., and Michel, B. Convergence ratesfor persistence diagram estimation in topological data analysis. To appear in Journal ofMachine Learning Research.
[Chazal et al., 2014b] Chazal, F., Fasy, B. T., Lecci, F., Michel, B., Rinaldo, A., and Wasser-man, L. (2014a). Robust topological inference: Distance to a measure and kernel distance.ArXiv preprint 1412.7197.
References
![Page 44: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/44.jpg)
References[Chazal et al., 2015a] Chazal, F., Fasy, B. T., Lecci, F., Michel, B., Rinaldo, A., and Wasser-
man, L. Subsampling methods for persistent homology. To appear in Proceedings of the 32st International Conference on Machine Learning (ICML-15).
[Chazal et al., 2015b] Chazal, F., Massart, P., and Michel, B. (2015b). Rates of convergencefor robust geometric inference. ArXiv preprint 1505.07602.
[Dereich et al., 2013] Dereich, S., Scheutzow, M., and Schottstedt, R. (2013). Constructivequantization: Approximation by empirical measures. Ann. Inst. H. Poincare Probab. Statist.,49:1183–1203.
[Edelsbrunner et al., 2002] Edelsbrunner, H., Letscher, D., and Zomorodian, A. (2002). Topo-logical persistence and simplification. Discrete Comput. Geom., 28:511–533.
[Federer, 1959] Federer, H. (1959). Curvature measures. Transactions of the American Math-ematical Society, pages 418–491.
[Fasy et al., 2014] Fasy, B. T., Lecci, F., Rinaldo, A., Wasserman, L., Balakrishnan, S., andSingh, A. Confidence sets for persistence diagrams. The Annals of Statistics, 42(6):2301–2339.
[Fournier and Guillin, 2013] Fournier, N. and Guillin, A. (2013). On the rate of convergencein wasserstein distance of the empirical measure. Probability Theory and Related Fields,pages 1–32.
[Genovese et al., 2012] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman,L. (2012). Manifold estimation and singular deconvolution under hausdorff loss. Ann.Statist., 40:941–963.
References
![Page 45: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/45.jpg)
References[Massart, 2007] Massart, P. (2007). Concentration Inequalities and Model Selection, volume
Lecture Notes in Mathematics 1896. Springer-Verlag.
[Niyogi et al., 2008] Niyogi, P., Smale, S., and Weinberger, S. (2008). Finding the homologyof submanifolds with high confidence from random samples. Discrete & ComputationalGeometry, 39(1-3):419–441.
[Singh et al., 2007] Singh, G., Memoli, F., and Carlsson, G. E. (2007). Topological methodsfor the analysis of high dimensional data sets and 3d object recognition. In SPBG, pages91–100. Citeseer.
[Singh et al., 2009] Singh, A., Scott, C., and Nowak, R. (2009). Adaptive Hausdorff estimationof density level sets. Ann. Statist., 37(5B):2760–2782.
[Turner et al., 2014] Turner, K., Mileyko, Y., Mukherjee, S. and Harer, J. (2014) Frechetmeans for distributions of persistence diagrams. Discrete & Computational Geometry,52(1):44–70 .
References
![Page 46: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/46.jpg)
Topological invariants
For comparing topological spaces, we consider topological invariants (preservedby homeomorphism) : numbers, groups, polynomials.
How topological spaces can be compared from a topological point of view ?
![Page 47: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/47.jpg)
Topological invariants
Homotopy is weaker than homeomorphism but is preserves many topologicalinvariants.
For comparing topological spaces, we consider topological invariants (preservedby homeomorphism) : numbers, groups, polynomials.
How topological spaces can be compared from a topological point of view ?
• Two continous functions f : X → Y and g : X → Y are homotopicif there exists a continous application H : X × [0, 1] → Y such thatH(·, 0) = f and H(·, 1) = g.
• Two topological spaces X and Y are homotopic if there exists two con-tinous applications f : X → Y and g : Y → X such that
– g ◦ f is homotopic to idX ;
– f ◦ g is homotopic to idY ;
![Page 48: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/48.jpg)
Topological Stability and Regularity
Topological inference : under “regularity assumptions”, topological properties ofX can be recovered from (the off-sets) of a close enough object Y.
![Page 49: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/49.jpg)
Topological Stability and Regularity
Topological inference : under “regularity assumptions”, topological properties ofX can be recovered from (the off-sets) of a close enough object Y.
• The local feature size is a local notion of regularity :For x ∈ X, lfsX(x) := d (x,M(Xc)) .
• Weak feature size and its extensions [Chazal and Lieutier, 2007] (by con-sidering the critical values of dX).
• The global version of the local featuresize is the reach [Federer, 1959] :
κ(X) = infx∈Xc
lfsX(x).
The reach is small if either X is notsmooth or if X is close to being self-intersecting.
![Page 50: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/50.jpg)
Topological Stability and Regularity
Topological inference : under “regularity assumptions”, topological properties ofX can be recovered from (the off-sets) of a close enough object Y.
Theorem [Chazal and Lieutier, 2007]: Let X and Y be two compact sets inRd and let ε > 0 be such that dH(X,Y) < ε, wfs(X) > 2ε and wfs(Y) > 2ε.Then for any 0 < α < 2ε, Xα and Yβ are homotopy equivalent.
Example :
dH(X,Y) = inf {α ≥ 0 | X ⊂ Yα and Y ⊂ Xα}
![Page 51: Data Analysis A Statistical Approach to Topological...A Statistical Approach to Topological Data Analysis Stochastic Geometry Conference Nantes, April 6 2016 I - Introduction : Statistics](https://reader030.vdocument.in/reader030/viewer/2022041100/5ed79ef09661ae43ff66a284/html5/thumbnails/51.jpg)
Topological Stability and Regularity
Topological inference : under “regularity assumptions”, topological properties ofX can be recovered from (the off-sets) of a close enough object Y.
Theorem [Chazal and Lieutier, 2007]: Let X and Y be two compact sets inRd and let ε > 0 be such that dH(X,Y) < ε, wfs(X) > 2ε and wfs(Y) > 2ε.Then for any 0 < α < 2ε, Xα and Yβ are homotopy equivalent.
Example :
dH(X,Y) = inf {α ≥ 0 | X ⊂ Yα and Y ⊂ Xα}
Sampling conditions in Hausdorff metric.
Statistical analysis of homotopy inference can be deduced from support estima-tion of a distribution under the Hausdorff metric.