latent variable modelling - github pagesadamian.github.io/talks/damianou_dimredkenya15.pdf ·...

42
Dimensionality Reduction & Latent Variable Modelling Andreas Damianou University of Sheffield, UK Workshop on Data Science in Africa, 17 th June, 2015 Dedan Kimathi University of Technology, Kenya

Upload: others

Post on 24-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Dimensionality Reduction&

Latent Variable Modelling

Andreas Damianou

University of Sheffield, UK

Workshop on Data Science in Africa,17th June, 2015

Dedan Kimathi University of Technology, Kenya

Page 2: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Working with data

● Data-science: everything revolves around a dataset

● Dataset: the set of data (to be) collected for our algorithms to learn from

● Example: animals dataset

height weight

3340287015

202812367

animal 1animal 2animal 3animal 4animal 5

Page 3: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Working with data

● Data-science: everything revolves around a dataset

● Dataset: the set of data (to be) collected for our algorithms to learn from

● Example: animals dataset

height weight

3340287015

202812367

animal 1animal 2animal 3animal 4animal 5

Y =

Page 4: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Working with data

● Data-science: everything revolves around a dataset

● Dataset: the set of data (to be) collected for our algorithms to learn from

● Example: animals dataset

hhhhh

wwwww

yyyyy

Y =

5 rows (instances)2 columns (features)

height weight

12345

12345

12345

12345

Page 5: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

What is “dimensionality”?

● Simply, the number of features used for describing each instance.

● In the previous example: height and width.

● The number of features depends on our selection or limitations during data collection.

Page 6: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Notation

It's convenient to use notation from linear algebra.

Y =

1,2,

n,

yy

y

1,2,

n,

yy

y

1,2,

n,

yy

y

11

1

22

2

dd

d

n rows → n instances

d columns → d features (dimensions)

So, the matrix Y containsd-dimensional data

Page 7: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

High-dimensional data

● Data having large number of features, d● Examples

– Micro-array data

– Images

Y =

1,2,

n,

yy

y

1,2,

n,

yy

y

1,2,

n,

yy

y

11

1

22

2

dd

d

...

Page 8: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Properties of high-dimensional data

Page 9: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Properties of high-dimensional data

British peopleNon-British people

Page 10: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Name Height

Neil: 1.97John: 1.94Mike: 1.89Ciira: 1.76Andreas: 1.74

1.70 1.80 1.90 2.00

A C M J N

Page 11: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Name Height

Neil: 1.97John: 1.94Mike: 1.89Ciira: 1.76Andreas: 1.74

1.70 1.80 1.90 2.00

A C M J Ny*

Page 12: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Name Height

Neil: 1.97John: 1.94Mike: 1.89Ciira: 1.76Andreas: 1.74

1.70 1.80 1.90 2.00

A C M J Ny*Nearest neighbour classification for a new instance y*

Distance from C to M: sqrt((C_h – M_h)^2) = 0.0009

Page 13: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Adding another feature: weight

J

A

C

MN

Distance from C to M: sqrt((C_h – M_h)^2 + (C_w – M_w)^2) = 16

Page 14: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Adding another feature: “beardness”

Distance from C to M: sqrt((C_h – M_h)^2 + (C_w – M_w)^2 + (C_b – M_b)^2 ) = 65

A

C

M

J

N

Page 15: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Adding another feature: “beardness”

Distance from C to M: sqrt((C_h – M_h)^2 + (C_w – M_w)^2 + (C_b – M_b)^2 ) = 65

A

C

M

J

N

Show demo!! (humanClassif)

Page 16: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

The (potential) problem

● What happens as the dimensionality grows?–

● Why is that a problem?–

● Is that always a problem?–

Page 17: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

The (potential) problem

● What happens as the dimensionality grows?– Distances grow bigger

– Everything seems to be spread apart

● Why is that a problem?–

● Is that always a problem?–

Page 18: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

The (potential) problem

● What happens as the dimensionality grows?– Distances grow bigger

– Everything seems to be spread apart

● Why is that a problem?– Think about doing nearest neighbour classification. For a test

instance, everything would just be super far...

● Is that always a problem?–

Page 19: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

The (potential) problem

● What happens as the dimensionality grows?– Distances grow bigger

– Everything seems to be spread apart

● Why is that a problem?– Think about doing nearest neighbour classification. For a test

instance, everything would just be super far...

● Is that always a problem?– No, but with real, “dirty” data, it can often be

Page 20: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

The (potential) problem

● What happens as the dimensionality grows?– Distances grow bigger

– Everything seems to be spread apart

● Why is that a problem?– Think about doing nearest neighbour classification. For a test

instance, everything would just be super far...

● Is that always a problem?– No, but with real, “dirty” data, it can often be

● Another problem we'll see later: noise in the data

Page 21: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Why do dimensionality reduction?

Page 22: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Why do dimensionality reduction?

● Pre-process data for another task (e.g. classification)

● Compression (lossy)

● Visualisation

● Data understanding / clearing

Page 23: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Why do dimensionality reduction?

● Pre-process data for another task (e.g. classification)

● Compression (lossy)

● Visualisation

● Data understanding / clearing

Page 24: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Compression

hhhhh

wwwww

Y =

12345

12345

12345

hhhhh

+ w+ w+ w+ w+ w

12345

12345

12345

redundant!!

Page 25: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Why do dimensionality reduction?

● Pre-process data for another task (e.g. classification)

● Compression (lossy)

● Visualisation

● Data understanding / clearing

Page 26: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Spring-ball example

On the board!

Page 27: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Clustering vs Dimensionality Reduction

Dimensionality Reduction

Clustering

n x d matrixm x d matrix

m < n

n x k matrix

k < d

Page 28: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Dimensionality reduction in action

1st way: One solution is to just drop one of the features

hhhhh

wwwww

Y =

height weight

12345

12345

12345

hhhhh

X' =

height

12345

12345

h

w

h

A DC EB

A

D

C

E

B

Page 29: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Dimensionality reduction in action

2nd way: Another solution is to transform the two features into one

hhhhh

wwwww

Y =

height weight

12345

12345

12345

sssss

X'' =

size

12345

12345

h

w

s

A

A

D

C

E

B

BC DEs = f(h,w)

Page 30: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Dimensionality reduction in action

Another solution is to transform the two features to one

hhhhh

wwwww

Y =

height weight

12345

12345

12345

sssss

X'' =

size

12345

12345

h

w

s

A

A

D

C

E

B

BC DEs = f(h,w)

But how do we find s?

Page 31: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Principal Component Analysis

Run demo!! (pcaPlot)

Page 32: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Eigendecomposition

● Gives a feeling of the properties of the matrix

++++

+ +++

+

+ +

+

+

+

++++

++++ ++++

+

Page 33: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Eigendecomposition

● Gives a feeling of the properties of the matrix

++++

+ +++

+

+ +

+

+

u u1 2

● u1 and u2 define the axes with maximum variances, where the data is most spread

● To reduce the dimensionality I project the data on the axis where data is the most spread

● There is no class information given

+

++++

++++ ++++

+

Page 34: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

● We need to optimise in order to minimise the distance between the true point and its reconstrcution:

x* = argmin || y – rec(x) ||x

2

y is 1 times d x is 1 times q

● rec(x) = x W'

• What is the dimensionality of W'?• Answer: d times q

• Then: x = y W. Now W is q times d.

• We can determine both x and W by solving an optimisation problem (called eigenvalue problem)

Page 35: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

● When class labels are given, PCA cannot take them into account... but we hope that the natural separation of the data (see clustering) will encode this information

x

xx

xxx

x

xx

x

x x x

++

++ +

+++

+ +

+

+

xx

++ +

+

+

Page 36: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Principal Component Analysis

• Remember: data are noisy!• Trade-off: reduce size / noise without losing too much information

2D → 1D 3D → 2D

Page 37: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

But what about this case...

xx

x xx x

xx

xx

xxx x

xxx

ooooo

oo o

oo

ooo

oo

Page 38: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

But what about this case...

xx

x xx x

xx

xx

xxx x

xxx

ooooo

oo o

oo

ooo

oo

xxx

xxx

xxxx

xxxxxx

x

ooo

oooooo

oooo

oo

PCA

Page 39: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

But what about this case...

xx

x xx x

xx

xx

xxx x

xxx

ooooo

oo o

oo

ooo

oo

xxx

xxx

xxxx

xxxxxx

x

ooo

oooooo

oooo

oo

xxxxxxxx

xxoooooooooo

PCA

Page 40: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

But what about this case...

xx

x xx x

xx

xx

xxx x

xxx

ooooo

oo o

oo

ooo

oo

xxx

xxx

xxxx

xxxxxx

x

ooo

oooooo

oooo

oo

xxxxxxxx

xxoooooooooo

LDA

PCA

In discriminant analysis, we want to maximise the spread between classes.

Page 41: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Latent variable models

● Non-probabilistic approach:

● Probabilistic approach:

Learn “backwards”: Model the relationship between Y and X:

p(Y|X). This comes from: y = wx + ε

Then, we can get the posterior distribution:

p(X|Y)

Y X

Page 42: Latent Variable Modelling - GitHub Pagesadamian.github.io/talks/Damianou_dimRedKenya15.pdf · Working with data Data-science: everything revolves around a dataset Dataset: the set

Linear vs Non-linear

(Slide from Neil Lawrence.)