quality of similarity rankings in time series - sstd …sstd2011.cs.umn.edu/files/slides/berc.pdfsnn...

18
Quality of Similarity Rankings in Time Series T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert, A. Zimek Motivation Interpreting Distance Fct. Distance Functions Curse of Dimens. SNN Distance Experiments SNN performance Histograms Effects of noise Conclusions Quality of Similarity Rankings in Time Series 12th International Symposium on Spatial and Temporal Databases (SSTD 2011) Thomas Bernecker 1 , Michael E. Houle 2 , Hans-Peter Kriegel 1 , Peer Kröger 1 , Matthias Renz 1 , Erich Schubert 1 , Arthur Zimek 1 1 Ludwig-Maximilians-Universität München, Munich, Germany 2 National Institute of Informatics, Tokyo, Japan 2011-08-26 — Minneapolis, MN 1/18

Upload: buihuong

Post on 15-May-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Quality of Similarity Rankingsin Time Series

12th International Symposium on Spatial and TemporalDatabases (SSTD 2011)

Thomas Bernecker1, Michael E. Houle2,Hans-Peter Kriegel1, Peer Kröger1, Matthias Renz1,

Erich Schubert1, Arthur Zimek1

1 Ludwig-Maximilians-Universität München, Munich, Germany2 National Institute of Informatics, Tokyo, Japan

2011-08-26 — Minneapolis, MN1/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Time Series Distances

Time series research

. . . has plenty of:I New distance functionsI Dimensionality reductionI Approximations

. . . but:I How big is a distance of 0.432?I How big is a difference of 0.123?

What is the meaning of these values?

2/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Interpreting distance functions

Distance functions used to have a physical meaning:I “As the crow flies”I “Taxicab metric”

This worked well for the three-dimensional world.

But this is not so in time series:I “Curse of dimensionality”

loss of contrast in high-dimensional dataI Dimension-alignment as done by time warpingI Edit distances treat big and small edits the same

But: the distance functions work!

3/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

The “Curse of Dimensionality”

Commonly described asI Distances become “indiscernible”I Distances “lose their usefulness”I Hypercube becomes “vastly” bigger than hypersphereI Nearest and farthest neighbor become similarI Mathematical:

limdim→∞

distmax − distmin

distmin→ 0

So they should not work.But: they do!

4/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

How bad is the “Curse of Dimensionality”?

Some facts on the “Curse of Dimensionality”(from Houle et al. 2010):

I Mathematics proven for i.i.d. data onlyI Relevant dimensions make the problem easierI Irrelevant dimensions make the problem harderI ⇒ mostly a matter of “signal to noise ratio”I Numerical contrast goes away,

but ranking still remains meaningful

Goal: Restore contrast and intuitionusing the ranking information

of the existing distance functions!

5/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Shared Nearest Neighbor Similarity

Idea: Similar objects have similar neighbors.

SNNs(x, y) = |NNs(x) ∩ NNs(y)|

simcoss(x, y) =SNNs(x, y)

s

Properties:I Intuitive value range from “None” to “All”I Intuitive interpretation (“social”)I Good contrast, good performanceI Needs an “okay” existing rankingI Extra parameter s to chooseI More expensive to use (second order distance)

6/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Shared Nearest Neighbor Distance

The similarity function needs to be transformed to a(non-metrical) distance function:

dinvs(x, y) = 1− simcoss(x, y)

dacoss(x, y) = arccos (simcoss(x, y))

dlns(x, y) = − ln simcoss(x, y)

Just like cosine distance.Interpretable as “cosine distance” in “neighbor space”.Similar: Jaccard distance (metrical)

J(x, y) := 1− |NNs(x) ∩ NNs(y)||NNs(x) ∪ NNs(y)|

7/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Experiments

Experimental results

8/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Data sets used

Four very different data sets:I Cylinder-Bell-Funnel (CBF): artificialI Synthetic control: artificialI Leaf dataset: outlines of tree leafsI Lightning-7: lightning strike emissions

Each modified in different ways:I Original data setI Extended with noise (irrelevant attributes)I Extended with “signal” (relevant attributes)

9/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Unmodified data sets

Results on unmodified data sets

Benefits of using SNNExemplary on the Cylinder-Bell-Funnel

(artificial) data set

10/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Contrast gain using SNN

Visual improvement (unmodified CBF data set):

Euclidean DTW 20% LCSS 20%

DTW s = 70 DTW s = 100 LCSS s = 100

11/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Distance Histograms

Numerical contrast improved (unmodified CBF data set):

10 20 30 40 50 60 70

PD

F

Primary distance

0 0.2 0.4 0.6 0.8 1

PD

F

SNN 100 distance

Euclidean

100 200 300 400 500 600 700 800

PD

F

Primary distance

0 0.2 0.4 0.6 0.8 1

PD

F

SNN 100 distance

Manhattan

0 10 20 30 40 50 60 70

PD

F

Primary distance

0 0.2 0.4 0.6 0.8 1

PD

F

SNN 100 distance

DTW 20%

5 10 15 20 25 30 35 40 45 50 55 60

PD

F

Primary distance

0 0.2 0.4 0.6 0.8 1

PD

F

SNN 100 distance

ERP 20%

30 40 50 60 70 80 90 100 110 120

PD

F

Primary distance

0 0.2 0.4 0.6 0.8 1

PD

F

SNN 100 distance

EDR 20%

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

PD

F

Primary distance

0 0.2 0.4 0.6 0.8 1

PD

F

SNN 100 distance

LCSS 20%12/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Effect of neighborhood size s:

Effect of variation of SNN size parameter s (CBF):

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 50 100 150 200

Mean

RO

C A

UC

SNN size

Euclidean

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 50 100 150 200

Mean

RO

C A

UC

SNN size

Manhattan

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 50 100 150 200

Mean

RO

C A

UC

SNN size

DTW 20%

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 50 100 150 200

Mean R

OC

AU

C

SNN size

ERP 20%

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 50 100 150 200

Mean R

OC

AU

C

SNN size

EDR 20%

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 50 100 150 200

Mean R

OC

AU

C

SNN size

LCSS 20%

13/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Modified data sets

Results on modified data sets

Adding noise to the data set,Changing the signal to noise ratio

14/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Adding noise

Adding noise to the data (Leaf data set)

0.5

0.6

0.7

0.8

0.9

1

1 2 4 8 16

Mea

n R

OC

AU

C

Data set size multiplicator

EuclideanEuclidean SNN 60

Euclidean

0.5

0.6

0.7

0.8

0.9

1

1 2 4 8 16

Mea

n R

OC

AU

C

Data set size multiplicator

ManhattanManhattan SNN 60

Manhattan

0.5

0.6

0.7

0.8

0.9

1

1 2 4 8 16

Mea

n R

OC

AU

C

Data set size multiplicator

DTW 20%DTW 20% SNN 60

DTW 20%

0.5

0.6

0.7

0.8

0.9

1

1 2 4 8 16

Mea

n R

OC

AU

C

Data set size multiplicator

ERP 20%ERP 20% SNN 60

ERP 20%

0.5

0.6

0.7

0.8

0.9

1

1 2 4 8 16

Mea

n R

OC

AU

C

Data set size multiplicator

EDR 20%EDR 20% SNN 60

EDR 20%

0.5

0.6

0.7

0.8

0.9

1

1 2 4 8 16

Mea

n R

OC

AU

C

Data set size multiplicator

LCSS 20%LCSS 20% SNN 60

LCSS 20%

15/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Chaning signal to noise ratio

Changing the signal-to-noise ratio (Leaf data set)

0.5

0.6

0.7

0.8

0.9

1

0% 25% 50% 75% 100%

Mean

RO

C A

UC

Relative amount of noise

4-fold4-fold, SNN 60

16-fold16-fold, SNN 60

Euclidean

0.5

0.6

0.7

0.8

0.9

1

0% 25% 50% 75% 100%

Mean

RO

C A

UC

Relative amount of noise

4-fold4-fold, SNN 60

16-fold16-fold, SNN 60

Manhattan

0.5

0.6

0.7

0.8

0.9

1

0% 25% 50% 75% 100%

Mean

RO

C A

UC

Relative amount of noise

4-fold4-fold, SNN 60

16-fold16-fold, SNN 60

DTW 20%

0.5

0.6

0.7

0.8

0.9

1

0% 25% 50% 75% 100%

Mean

RO

C A

UC

Relative amount of noise

4-fold4-fold, SNN 60

16-fold16-fold, SNN 60

ERP 20%

0.5

0.6

0.7

0.8

0.9

1

0% 25% 50% 75% 100%

Mean

RO

C A

UC

Relative amount of noise

4-fold4-fold, SNN 60

16-fold16-fold, SNN 60

EDR 20%

0.5

0.6

0.7

0.8

0.9

1

0% 25% 50% 75% 100%

Mean

RO

C A

UC

Relative amount of noise

4-fold4-fold, SNN 60

16-fold16-fold, SNN 60

LCSS 20%

16/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Conclusions

Conclusions

Second order “shared nearest neighbor” distances offer:I Improved performanceI Better numerical contrastI Parameter s is not difficult to chooseI Less sensitive to noiseI . . . but computationally more expensive

17/18

Quality ofSimilarityRankings

in Time Series

T. Bernecker,M. E. Houle,H.-P. Kriegel,

P. Kröger,M. Renz,

E. Schubert,A. Zimek

Motivation

InterpretingDistance Fct.Distance Functions

Curse of Dimens.

SNN Distance

ExperimentsSNN performance

Histograms

Effects of noise

Conclusions

Thank youfor your attention!

18/18