TRANSCRIPT
Topology and Data
Gunnar Carlsson¹
Department of Mathematics, Stanford University
http://comptop.stanford.edu/
June 27, 2008
¹Research supported in part by DARPA and NSF
Introduction
- The general area of geometric data analysis attempts to give insight into data by imposing a geometry on it
- Sometimes very natural (physics), sometimes less so (genomics)
- The value of geometry is that it allows us to organize and view data more effectively, for better understanding
- Can obtain an idea of a reasonable layout or overview of the data
- Sometimes all that is required is a qualitative overview
Methods for Imposing a Geometry
- Define a metric
- Define a graph or network structure
- Cluster the data
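The graph-based method can be sketched in a few lines; the function name, threshold, and toy points below are illustrative, not from the talk. Two points are joined whenever they lie within a chosen scale ε:

```python
from itertools import combinations
from math import dist

def neighborhood_graph(points, eps):
    """Impose a graph structure on a point cloud: connect two points
    whenever their Euclidean distance is at most eps."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) <= eps]

pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(neighborhood_graph(pts, 1.5))  # [(0, 1)]: only the first two points are linked
```

Clustering the data can then be read off as the connected components of this graph.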
Methods for Summarizing or Visualizing a Geometry
- Linear projections
- Multidimensional scaling, ISOMAP, LLE
- Project to a tree
Properties of Data Geometries
We Don’t Trust Large Distances
- In physics, distances have strong theoretical backing, and should be viewed as reliable
- In biology or the social sciences, distances are constructed using a notion of similarity, but have no theoretical backing (e.g. the Jukes-Cantor distance between sequences)
- Means that small distances still represent similarity, but comparison of long distances makes little sense
Properties of Data Geometries
We Only Trust Small Distances a Bit
[Figure: two pairs of nearby points, at distances 1.5 and 1.8]
- Both pairs are regarded as similar, but the strength of the similarity as encoded by the distance may not be so significant
- Similarity is more like a 0/1-valued quantity than an R-valued one
Properties of Data Geometries
Connections are Noisy
- Distance measurements are noisy, as are the connections in many graph models
- Requires stochastic geometric methods for study
- Methods of Coifman et al. and others are relevant here
Topology
Homeomorphic
Topology
- To see that these pairs are the “same” requires distortion of distances, i.e. stretching and shrinking
- We do not permit “tearing”, i.e. distorting distances in a discontinuous way
- How to make this precise?
Topology
- One would like to say that all non-zero distances in a metric space are the same
- But d(x, y) = 0 means x = y
- Idea: consider instead distances from points to subsets. These can be zero.
This accomplishes the intuitive idea of permitting arbitrary rescalings of distances while leaving “infinite nearness” intact.
Topology
- Topology is the idealized form of what we want in dealing with data, namely permitting arbitrary rescalings which vary over the space
- Now we must make versions of topological methods which are “less idealized”
- Means in particular finding ways of tracking or summarizing behavior as metrics are deformed or other parameters are changed
- Ultimately means building in noise and uncertainty. This is in the future: “statistical topology”.
Outline
1. Homology as signature for shape identification
2. Image processing example
3. Topological “imaging” of data
4. Signatures for significance of structural invariants
Persistent Homology
- Homology: the crudest measure of topological properties
- For every space X and dimension k, one constructs a vector space Hk(X) whose dimension (the k-th Betti number βk) is a mathematically precise version of the intuitive notion of counting “k-dimensional holes”
- Computed using linear algebraic methods, basically Smith normal form
- β0 is a count of the number of connected components
- The βi's form a signature which encodes topological information about the shape
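As a concrete instance of the linear-algebra computation, the sketch below finds β0 and β1 over GF(2) for a circle triangulated as the boundary of a triangle (three vertices, three edges, no 2-simplices). The bitmask encoding is an implementation choice, not anything from the talk:

```python
def gf2_rank(rows):
    """Rank over GF(2) of a 0/1 matrix whose rows are given as bitmask ints."""
    rank = 0
    rows = list(rows)
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

# Boundary matrix d1 of the triangle boundary: rows = vertices v0, v1, v2,
# bits = edges (v0v1, v0v2, v1v2); there are no 2-simplices, so d2 = 0.
d1 = [0b011, 0b101, 0b110]
r1 = gf2_rank(d1)
betti0 = 3 - r1          # number of vertices minus rank(d1)
betti1 = (3 - r1) - 0    # dim ker(d1) minus rank(d2)
print(betti0, betti1)    # 1 1: one component, one loop, as for a circle
```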
Persistent Homology
β0 = 1, β1 = 1, and βi = 0 for i ≥ 2
Persistent Homology
β0 = 1, β1 = 0, β2 = 0, and βk = 0 for k ≥ 3
Persistent Homology
β0 = 1, β1 = 2, β2 = 1, and βk = 0 for k ≥ 3
Persistent Homology
Question: For a point cloud X, can one infer the Betti numbers of the space X from which it is sampled?
Persistent Homology - Cech Complex
Persistent Homology - Cech Complex
C(X, ε) involves the choice of a parameter ε (the radius of the balls)
Points are connected if the balls of radius ε around them overlap
The complex grows with ε
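Computing the full Čech complex requires checking that all the balls share a common point; the Vietoris-Rips complex, which only checks pairwise overlaps, is the standard computable stand-in and matches the description of the edges above. A minimal sketch, with illustrative names and toy data:

```python
from itertools import combinations
from math import dist

def rips_complex(points, eps, max_dim=2):
    """Vietoris-Rips approximation to C(X, eps): a simplex is included
    whenever every pair of its eps-balls overlaps, i.e. all pairwise
    distances are <= 2*eps.  (The true Cech complex additionally demands
    a common intersection point of all the balls.)"""
    n = len(points)
    def close(i, j):
        return dist(points[i], points[j]) <= 2 * eps
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):
        simplices += [s for s in combinations(range(n), k)
                      if all(close(i, j) for i, j in combinations(s, 2))]
    return simplices

pts = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.9)]
print(len(rips_complex(pts, 0.6)))  # 7: three vertices + three edges + one triangle
```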
Persistent Homology
[Figures: the complex C(X, ε) on a point cloud at increasing scales, ε = 1/3 and ε = 1/2]
Persistent Homology
- Obtain a diagram of vector spaces
  · · · → Hi(C(X, ε1)) → Hi(C(X, ε2)) → Hi(C(X, ε3)) → · · ·
  when ε1 ≤ ε2 ≤ ε3, etc.
- Called persistence vector spaces
- Such diagrams can be classified by bar codes
- The analogue of dimension for ordinary vector spaces
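In dimension zero these barcodes can be computed by single linkage: every point is born at ε = 0, and a component dies when it merges into an older one. A sketch, using the pairwise merge distance itself as the scale (glossing over the factor of two between ball radius and overlap distance):

```python
from itertools import combinations
from math import dist

def zero_dim_barcode(points):
    """0-dimensional persistence barcode via union-find: each point is
    born at scale 0; when two components merge at distance d, one of
    them dies, contributing a bar [0, d)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted((dist(p, q), i, j) for (i, p), (j, q)
                   in combinations(enumerate(points), 2))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            bars.append((0.0, d))      # one component dies at scale d
    bars.append((0.0, float('inf')))   # the last component never dies
    return bars

print(zero_dim_barcode([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]))
```

On the three collinear points above this yields two short bars and one infinite bar, the long bars being the “honest” features.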
Persistent Homology - Bar Codes
A segment indicates a basis element “born” at the left-hand endpoint and which dies at the right-hand endpoint
Geometrically, this means a loop which begins to exist (i.e. becomes closed) at the left-hand endpoint and is filled in at the right-hand endpoint.
Persistent Homology - Bar Codes
Interpretation:
Long segments correspond to “honest” geometric features in the point cloud
Short segments correspond to “noise”
Look at an example.
Example: Natural Image Statistics
- Joint with V. de Silva, T. Ishkanov, A. Zomorodian
- An image taken by a black and white digital camera can be viewed as a vector, with one coordinate for each pixel
- Each pixel has a “gray scale” value, which can be thought of as a real number (in reality, it takes one of 256 values)
- A typical camera uses tens of thousands of pixels, so images lie in a very high dimensional space, call it pixel space, P
Example: Natural Image Statistics
D. Mumford: What can be said about the set of images I ⊆ P one obtains when one takes many images with a digital camera?
(Lee, Mumford, Pedersen): Useful to study the local structure of images statistically
Example: Natural Image Statistics
3 × 3 patches in images
Example: Natural Image Statistics
Observations:
1. Each patch gives a vector in R9
2. Most patches will be nearly constant, or low contrast, because of the presence of regions of solid shading in most images
3. Low contrast will dominate the statistics, and is not interesting
Example: Natural Image Statistics
- Lee-Mumford-Pedersen [LMP] study only high contrast patches
- Collect ca. 4.5 × 10⁶ high contrast patches from a collection of images obtained by van Hateren and van der Schaaf
- Normalize mean intensity by subtracting the mean from each pixel value, to obtain patches with mean intensity = 0
- This puts the data on an 8-dimensional hyperplane, ∼= R8
Example: Natural Image Statistics
- Normalize contrast by dividing by the norm, so as to obtain patches with norm = 1
- Means that the data now lies on a 7-dimensional ellipsoid, ∼= S7
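The two normalization steps can be sketched as follows; this uses the plain Euclidean norm in place of the contrast D-norm that LMP actually use, and the patch-extraction details are illustrative:

```python
import numpy as np

def normalize_patches(image):
    """Slice a grayscale image into 3x3 patches, flatten each to a vector
    in R^9, subtract the mean intensity (landing on an 8-dim hyperplane),
    then divide by the Euclidean norm (landing on a sphere).
    Note: LMP normalize contrast with a 'D-norm'; plain norm used here."""
    h, w = image.shape
    patches = [image[i:i + 3, j:j + 3].reshape(9)
               for i in range(h - 2) for j in range(w - 2)]
    out = []
    for p in patches:
        p = p - p.mean()              # mean intensity = 0
        n = np.linalg.norm(p)
        if n > 1e-8:                  # skip constant (zero-contrast) patches
            out.append(p / n)
    return np.array(out)

img = np.arange(16, dtype=float).reshape(4, 4)
pts = normalize_patches(img)
print(pts.shape)                      # (4, 9): one unit vector per patch
```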
Example: Natural Image Statistics
Result: point cloud data M lying on a sphere in R8
We wish to analyze it with persistent homology to understand it qualitatively
Example: Natural Image Statistics
First Observation: The points fill out S7, in the sense that every point in S7 is “close” to a point in M
However, the density of points varies a great deal from region to region
How to analyze?
Example: Natural Image Statistics
Thresholding M
Define M[T] ⊆ M by
M[T] = {x | x is in the T-th percentile of densest points}
What is the persistent homology of these M[T]'s?
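A common way to realize M[T] is a codensity estimate based on the distance to the k-th nearest neighbor; the estimator and its parameters below are assumptions, since the talk does not fix one:

```python
import numpy as np

def densest_fraction(X, T, k=5):
    """M[T]: the densest T percent of the points, with density estimated
    as 1 / (distance to the k-th nearest neighbor).  This k-NN estimator
    is illustrative; any density estimator can be substituted."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(D, axis=1)[:, k]        # column 0 is the point itself
    density = 1.0 / kth
    cutoff = np.percentile(density, 100 - T)
    return X[density >= cutoff]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(len(densest_fraction(X, 25)))       # roughly a quarter of the points survive
```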
Example: Natural Image Statistics
5 × 10⁴ points, T = 25
The one-dimensional barcode suggests β1 = 5
Example: Natural Image Statistics
THREE CIRCLE MODEL
Three Circle Model
The red and green circles do not touch; each touches the black circle
Example: Natural Image Statistics
Does the data fit with this model?
Example: Natural Image Statistics
[Figure: the primary circle together with the two secondary circles]
Example: Natural Image Statistics
IS THERE A TWO-DIMENSIONAL SURFACE IN WHICH THIS PICTURE FITS?
Example: Natural Image Statistics
4.5 × 10⁶ points, T = 10
Betti 0 = 1
Betti 1 = 2
Betti 2 = 1
Example: Natural Image Statistics
K - KLEIN BOTTLE
Example: Natural Image Statistics
[Figure: identification square for the Klein bottle, with edge identifications labeled P, Q, R, S]
Identification Space Model
Example: Natural Image Statistics
Do the three circles fit naturally inside K?
Example: Natural Image Statistics
PRIMARY CIRCLE
[Figure: the primary circle inside K, with labels P, Q]
Example: Natural Image Statistics
SECONDARY CIRCLES
[Figure: the secondary circles inside K, with labels R, S]
Example: Natural Image Statistics
Natural Image Statistics
The Klein bottle makes sense in the space of quadratic polynomials in two variables, as the polynomials which can be written as
f = q(λ(x))
where
1. q is a single variable quadratic
2. λ is a linear functional
3. ∫_D f = 0
4. ∫_D f² = 1
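A discrete sketch of this parametrization on the 3 × 3 grid, with the two integral conditions over the disk D replaced by their grid analogues (mean zero, unit norm); the grid coordinates and function names are assumptions:

```python
import numpy as np

def model_patch(a, b, c1, c2):
    """Evaluate f = q(lambda(x)) on the 3x3 grid, with lambda(x, y) = a*x + b*y
    and q(t) = c1*t + c2*t**2 (the constant term of q drops out after
    centering), then impose discrete versions of the side conditions:
    mean zero (for int_D f = 0) and unit norm (for int_D f^2 = 1)."""
    xs = np.array([-1.0, 0.0, 1.0])
    t = a * xs[None, :] + b * xs[:, None]   # lambda on the grid
    f = c1 * t + c2 * t ** 2                # q(lambda(x))
    f = f - f.mean()                        # discrete  int_D f = 0
    return f / np.linalg.norm(f)            # discrete  int_D f^2 = 1

patch = model_patch(1.0, 0.0, 1.0, 0.0)     # a pure horizontal linear gradient
print(patch.shape)                          # (3, 3)
```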
Mapper
Algebraic topology can produce signatures which can help in mapping out a data set.
Can one obtain flexible topological mapping methods, with combinatorial simplicial complex images?
Yes, joint work with G. Singh and F. Memoli.
Mapper - Mayer-Vietoris Blowup
X a space, U = {Uα}α∈A a covering of X.
∆ is the simplex with vertex set A.
For ∅ ≠ S ⊆ A, X(S) = ⋂s∈S Us, and ∆[S] = the face spanned by S.
Let XU ⊆ X × ∆, XU = ⋃S X(S) × ∆[S].
Mapper - Mayer-Vietoris Blowup
There exists a map πX : XU → X, which is a homotopy equivalence under mild hypotheses
N(U) = ⋃{S | X(S) ≠ ∅} ∆[S]
There exists a second map π∆ : XU → N(U)
π∆ is an equivalence if all the X(S)'s are empty or contractible
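When the cover is finite and each Uα is a finite set, the nerve N(U) can be enumerated directly; the toy cover below is illustrative:

```python
from itertools import combinations

def nerve(cover):
    """Nerve N(U) of a cover: `cover` is a list of sets U_alpha; a set S
    of indices spans a simplex exactly when the intersection of the
    U_alpha over alpha in S is nonempty."""
    simplices = []
    for k in range(1, len(cover) + 1):
        for S in combinations(range(len(cover)), k):
            if set.intersection(*(cover[i] for i in S)):
                simplices.append(S)
    return simplices

cover = [{1, 2}, {2, 3}, {3, 4}]
print(nerve(cover))   # three vertices plus the two edges (0, 1) and (1, 2)
```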
Mapper - Mayer-Vietoris Blowup
Intermediate construction M(X, U):
M(X, U) = ∐S π0(X(S)) × ∆[S] / ≃
where, for S ⊆ T, with the maps
π0(X(S)) × ∆[S] ←φ− π0(X(T)) × ∆[S] −ψ→ π0(X(T)) × ∆[T]
we set φ(x, ζ) ≃ ψ(x, ζ)
Mapper - Statistical Version
Now given a point cloud data set X, and a covering U.
Build the simplicial complex the same way, but with the π0 operation replaced by single linkage clustering with a fixed error parameter ε.
It is critical that the clustering operation be functorial.
A partition of unity subordinate to U gives a map from X to M(X, U).
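A minimal sketch of this statistical Mapper, with π0 replaced by single linkage at a fixed ε and the cover pulled back along a real-valued filter; all names and the toy data are illustrative:

```python
from itertools import combinations
from math import dist

def single_linkage(points, idx, eps):
    """Clusters of the points listed in idx: single linkage with fixed
    error parameter eps, i.e. connected components of the
    eps-neighborhood graph (a stand-in for pi_0)."""
    parent = {i: i for i in idx}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(idx, 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(j)] = find(i)
    clusters = {}
    for i in idx:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

def mapper(points, f, intervals, eps):
    """Mapper with a real-valued filter f and an interval cover of its
    range: cluster each preimage, take clusters as vertices, and connect
    two vertices when the clusters share a point."""
    verts = []
    for a, b in intervals:
        idx = [i for i, p in enumerate(points) if a <= f(p) <= b]
        verts += single_linkage(points, idx, eps)
    edges = [(u, v) for u, v in combinations(range(len(verts)), 2)
             if verts[u] & verts[v]]
    return verts, edges

pts = [(float(i), 0.0) for i in range(6)]
verts, edges = mapper(pts, lambda p: p[0], [(0, 2.5), (2, 5)], 1.1)
print(len(verts), edges)   # two clusters joined by one edge
```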
Mapper - Statistical Version
How to choose coverings?
Given a reference map (or filter) f : X → Z, where Z is a metric space, and a covering U of Z, we can consider the covering {f⁻¹(Uα)}α∈A of X. Typical choices of Z: R, R2, S1.
The construction gives an image complex of the data set which can reflect interesting properties of X.
Mapper - Statistical Version
Typical one dimensional filters:
- Density estimators
- “Eccentricity”: ∑x′∈X d(x, x′)²
- Eigenfunctions of the graph Laplacian for the Vietoris-Rips graph
- User defined, data dependent filter functions
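The eccentricity filter, for instance, is essentially one line with numpy:

```python
import numpy as np

def eccentricity(X):
    """Eccentricity filter: for each x, the sum over x' in X of d(x, x')^2.
    Large values single out points far from the bulk of the data."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return D2.sum(axis=1)

print(eccentricity(np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])))  # [5. 2. 5.]
```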
Mapper - Statistical Version
Miller-Reaven Diabetes Study, 1976
Mapper - Statistical Version
Cell Cycle Microarray Data
Joint with M. Nicolau, Nagarajan, G. Singh
Mapper - Scale Space
How do we choose the parameter ε in the single linkage clustering?
Can one allow ε to vary with α?
An important question: too many parameter choices makes the tool unusable, and choosing one ε for the entire space is too restrictive.
Mapper - Scale Space
Construct a new space with a reference map to Z.
For each α, we construct the zero dimensional persistence diagram for f⁻¹Uα.
Consider the set of all endpoints of intervals in the persistence diagram. This provides a decomposition of the real line in which ε varies into intervals. Call these intervals S-intervals.
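The endpoints of the zero dimensional persistence intervals of a finite metric space are the scales at which single-linkage clusters merge, i.e. the edge weights of a minimum spanning tree. A sketch under that identification; helper names are mine:

```python
# S-intervals for one preimage: cut the eps-axis at the distinct
# merge scales, which equal the MST edge weights (Prim's algorithm).

def mst_edge_weights(points, dist):
    """Edge weights of a minimum spanning tree of the point set."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest connection to the growing tree
    best[0] = 0.0
    weights = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:
            weights.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return weights

def s_intervals(points, dist):
    """Decompose [0, inf) at the distinct merge scales."""
    cuts = sorted(set(mst_edge_weights(points, dist)))
    pts = [0.0] + cuts + [float("inf")]
    return list(zip(pts, pts[1:]))

# Four points on a line: merges at scales 1 and 8.
print(s_intervals([0.0, 1.0, 2.0, 10.0], lambda p, q: abs(p - q)))
# [(0.0, 1.0), (1.0, 8.0), (8.0, inf)]
```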
Mapper - Scale Space
I Vertex set of SS(X, U) consists of pairs (α, I), where α ∈ A and I is an S-interval for the zero dimensional persistence diagram of f⁻¹(Uα).
I We connect (α, I) and (β, J) with an edge if (a) Uα ∩ Uβ ≠ ∅ and (b) I ∩ J ≠ ∅.
I SS(X, U) is equipped with a reference map π : SS(X, U) → NU given on vertices by (α, I) ↦ α
Mapper - Scale Space
A varying choice of scale is now determined by a section of π, i.e. a map

σ : NU −→ SS(X, U)

so that πσ = id_NU.
Sections can be given a weighting, depending on the length of I for the vertices and on the length of I ∩ J for the edges.
Finding the high weight sections in the case of 1-D filters is computationally tractable.
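For a 1-D filter the nerve NU is a path, so a maximum-weight section can be found by dynamic programming along the path. A sketch under that assumption; since the slide only says the weights depend on the lengths of I and I ∩ J, the weighting functions are passed in as parameters, and the interval data in the example is made up:

```python
# Max-weight section of the scale-space graph over a path nerve:
# pick one S-interval per cover element, consecutive picks must overlap.

def best_section(intervals, vertex_weight, edge_weight):
    """intervals[k] = list of S-intervals (lo, hi) for the k-th cover
    element along the path; returns one interval index per element."""
    def overlap(I, J):
        return min(I[1], J[1]) - max(I[0], J[0]) > 0
    n = len(intervals)
    # best[k][i] = best weight of a section of the first k+1 elements
    # whose choice at position k is interval i
    best = [[vertex_weight(I) for I in intervals[0]]]
    back = []
    for k in range(1, n):
        row, brow = [], []
        for i, I in enumerate(intervals[k]):
            cands = [(best[k - 1][j] + edge_weight(J, I), j)
                     for j, J in enumerate(intervals[k - 1]) if overlap(J, I)]
            w, j = max(cands) if cands else (float("-inf"), -1)
            row.append(w + vertex_weight(I))
            brow.append(j)
        best.append(row)
        back.append(brow)
    # trace the optimum back along the path
    i = max(range(len(intervals[-1])), key=lambda i: best[-1][i])
    choice = [i]
    for k in range(n - 1, 0, -1):
        i = back[k - 1][i]
        choice.append(i)
    return list(reversed(choice))

intervals = [[(0, 1), (1, 4)], [(0, 2), (3, 5)], [(0, 1), (1.5, 4)]]
vw = lambda I: I[1] - I[0]                       # length of I
ew = lambda I, J: min(I[1], J[1]) - max(I[0], J[0])  # length of overlap
print(best_section(intervals, vw, ew))  # [1, 1, 1]
```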
Variants on Persistence: Zig-Zags
Bootstrap - B. Efron
I Studies statistics of measures of central tendency across different samples within a data set
I Can give an assessment of the reliability of conclusions to be drawn from the statistics of the data set
I How can one adapt the technique to apply to qualitative information, such as the presence of loops or decompositions into clusters?
Variants on Persistence: Zig-Zags
How to distinguish?
Variants on Persistence: Zig-Zags
I Family of samples S1, S2, . . . , Sk from point cloud data X
I Construct new samples Si ∪ Si+1
I Fit together into a diagram

· · · Si−1 → Si−1 ∪ Si ← Si → Si ∪ Si+1 ← Si+1 · · ·

I Apply Hk to VR-complexes on each of these, to get a diagram of vector spaces of the same shape
I If a family of homology classes “matches up” under the induced maps, then they are stable across samples
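At the level of H0 (clusters), the "matching up" can be made concrete: a cluster of Si and a cluster of Si+1 match when both land in the same cluster of Si ∪ Si+1 under the inclusions. A sketch with single linkage at a fixed eps, which is my illustrative choice:

```python
# Clusters of two samples match when their images under the inclusions
# into the union lie in the same cluster of the union.

def components(points, eps):
    """Single-linkage components at scale eps (Euclidean), as point sets."""
    pts = list(points)
    n = len(pts)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in range(n):
        for b in range(a + 1, n):
            d = sum((x - y) ** 2 for x, y in zip(pts[a], pts[b])) ** 0.5
            if d <= eps:
                parent[find(a)] = find(b)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), set()).add(pts[i])
    return list(comps.values())

def matched_pairs(S1, S2, eps):
    """Pairs (i, j): cluster i of S1 and cluster j of S2 that land in
    the same cluster of S1 ∪ S2.  Adding points only merges clusters,
    so each sample cluster sits inside exactly one union cluster."""
    c1, c2 = components(S1, eps), components(S2, eps)
    cu = components(list(set(S1) | set(S2)), eps)
    def image(c):
        return next(k for k, comp in enumerate(cu) if c <= comp)
    return [(i, j) for i, a in enumerate(c1) for j, b in enumerate(c2)
            if image(a) == image(b)]

# Two samples of the same two groups: both clusters match up.
S1 = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)]
S2 = [(0.2, 0.0), (5.2, 0.0)]
print(sorted(matched_pairs(S1, S2, 1.0)))  # [(0, 0), (1, 1)]
```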
Variants on Persistence: Zig-Zags
To carry out the analysis, one needs a classification of diagrams of vector spaces of the shape of the upper row. The second row is the shape for ordinary persistence.
Variants on Persistence: Zig-Zags
Classification exists, due to P. Gabriel
Every diagram of this shape decomposes into a direct sum of cyclic diagrams, i.e. diagrams in which each vector space is either one-dimensional or zero-dimensional.
Can therefore parametrize isomorphism classes by barcodes, just as in the case of ordinary persistence.
Long intervals correspond to elements stable across samples; the others are artifacts.
Variants on Persistence: Zig-Zags
Results have value in other situations:
I Analysis of time varying data
I Analysis of the behavior of data under varying choices of density estimators
I Analysis of the behavior of witness complexes under varying choices of landmarks
This analysis is relevant and interesting even in the zero dimensional case, i.e. clustering.