Visualization using tSNE
DESCRIPTION
An introduction to tSNE against the background of dimension reduction.
TRANSCRIPT
Visualization using tSNE
Yan Xu
Jun 7, 2013
Dimension Reduction Overview

Dimension reduction methods can be sorted along three axes:
• Linear (PCA) vs. nonlinear
• Parametric (LDA) vs. non-parametric
• Global (ISOMAP, MDS) vs. local (LLE, SNE)
tSNE (t-distributed Stochastic Neighbor Embedding)

Lineage: MDS → SNE (2002: local + probability) → symmetric SNE (easier implementation) → UNI-SNE (2007: tackles the crowding problem) → tSNE (2008: a more stable and faster solution) → Barnes-Hut-SNE (2013: O(N²) → O(N log N)).
MDS: Multi-Dimensional Scaling
• Multi-Dimensional Scaling arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-D space.
$$\mathrm{Cost} = \sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^2, \qquad d_{ij} = \lVert x_i - x_j \rVert \ \text{(high-D distance)}, \qquad \hat{d}_{ij} = \lVert y_i - y_j \rVert \ \text{(low-D distance)}$$

Sammon mapping from MDS

$$\mathrm{Cost} = \sum_{i<j} \frac{\left( \lVert x_i - x_j \rVert - \lVert y_i - y_j \rVert \right)^2}{\lVert x_i - x_j \rVert}$$

Sammon mapping puts too much emphasis on getting very small distances exactly right. It is slow to optimize and gets stuck in a different local optimum each time.
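As a concrete reading of the two cost functions above, here is a minimal numpy sketch (not from the slides; X and Y are assumed to be n × d arrays of high-D and low-D coordinates, and the function names are illustrative):

import numpy as np

def pairwise_dist(Z):
    # Euclidean distance matrix between the rows of Z
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def mds_cost(X, Y):
    # sum of squared discrepancies between high-D and low-D distances
    d_high, d_low = pairwise_dist(X), pairwise_dist(Y)
    return ((d_high - d_low) ** 2).sum() / 2.0  # each pair counted once

def sammon_cost(X, Y, eps=1e-12):
    # like MDS, but each pair is down-weighted by its high-D distance,
    # which is what over-emphasizes very small distances
    d_high, d_low = pairwise_dist(X), pairwise_dist(Y)
    off = ~np.eye(len(X), dtype=bool)           # exclude the zero diagonal
    return ((d_high - d_low) ** 2 / (d_high + eps))[off].sum() / 2.0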
Global to Local?
The idea is to make the local configurations of points in the low-dimensional
space resemble the local configurations in the high-dimensional space.
LLE (Locally Linear Embedding): maps that preserve local geometry

First find the weights that best reconstruct each point from its neighbors $N(i)$ in high-D:

$$\mathrm{Cost} = \sum_i \Big\lVert x_i - \sum_{j \in N(i)} w_{ij}\, x_j \Big\rVert^2, \qquad \sum_{j \in N(i)} w_{ij} = 1$$

Then, keeping those weights fixed, measure how well the same weights reconstruct each point in low-D:

$$\mathrm{Cost} = \sum_i \Big\lVert y_i - \sum_{j \in N(i)} w_{ij}\, y_j \Big\rVert^2 \qquad \text{(fixed weights)}$$

Find the $y_i$ that minimize this cost, subject to the constraint that the $y$ have unit variance on each dimension.
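A short sketch of LLE's first step under the formulation above: solving for the reconstruction weights of each point from its k nearest neighbors. The function name, k, and the regularizer reg are illustrative choices, not part of the slides:

import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    # step 1 of LLE: for each point, weights over its k nearest neighbors
    # that best reconstruct it, with the weights summing to 1
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(((X - X[i]) ** 2).sum(1))[1:k + 1]  # k nearest, excluding i
        Z = X[idx] - X[i]                    # center the neighbors on x_i
        G = Z @ Z.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)   # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx] = w / w.sum()              # enforce the sum-to-one constraint
    return W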
A probabilistic version of local MDS:
Stochastic Neighbor Embedding (SNE)
• It is more important to get local distances right than non-local ones.
• Stochastic neighbor embedding has a probabilistic way of deciding if
a pairwise distance is “local”.
• Convert each high-dimensional similarity into the probability that one
data point will pick the other data point as its neighbor.
$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \qquad \text{probability of picking } j \text{ given } i \text{ in high D}$$

$$q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)} \qquad \text{probability of picking } j \text{ given } i \text{ in low D}$$
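In code, each conditional distribution is a softmax over negative scaled squared distances. A minimal numpy sketch (the per-point bandwidths sigma are assumed given; the next slide explains how to choose them). $q_{j|i}$ is the same computation on the low-D points with the bandwidth fixed so that $2\sigma^2 = 1$:

import numpy as np

def cond_probs_high(X, sigma):
    # p_{j|i}: Gaussian neighbor probabilities in high-D (one row per i);
    # sigma is a length-n array of per-point bandwidths sigma_i
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)                  # a point never picks itself
    P = np.exp(logits - logits.max(1, keepdims=True))  # numerically stable softmax
    return P / P.sum(1, keepdims=True)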
Picking the radius of the Gaussian that is
used to compute the p’s
• We need to use different radii in different parts of the space so that
we keep the effective number of neighbors about constant.
• A big radius leads to a high entropy for the distribution over
neighbors of i. A small radius leads to a low entropy.
• So decide what entropy you want and then find the radius that
produces that entropy.
• It's easier to specify a perplexity:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$

The perplexity can be read as a smooth measure of the effective number of neighbors.
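A sketch of the radius search the slide describes: binary search on $\sigma_i$ until the entropy of the conditional distribution matches the requested perplexity. The names, bounds, and tolerances are illustrative:

import numpy as np

def sigma_for_perplexity(d2_row, target, tol=1e-5, max_iter=50):
    # find sigma_i so that the perplexity of p_{.|i} matches the target;
    # d2_row holds squared distances from point i to every other point
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-(d2_row - d2_row.min()) / (2.0 * sigma ** 2))  # stable shift
        p /= p.sum()
        perp = 2.0 ** -(p * np.log2(p + 1e-12)).sum()  # perplexity = 2^entropy
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma       # too many effective neighbors: shrink the radius
        else:
            lo = sigma       # too few: grow it
        sigma = np.sqrt(lo * hi)
    return sigma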
The cost function for a low-dimensional representation

$$\mathrm{Cost} = \sum_i \mathrm{KL}(P_i \,\Vert\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

Gradient descent:

$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$$

Gradient update with a momentum term ($\eta$: learning rate, $\alpha(t)$: momentum):

$$\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} - \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)$$
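A one-line reading of the momentum update (the values of eta and alpha are illustrative):

def momentum_step(Y, Y_prev, grad, eta=100.0, alpha=0.5):
    # one update: step down the gradient, plus a fraction alpha of the
    # previous displacement; eta = learning rate, alpha = momentum
    Y_new = Y - eta * grad + alpha * (Y - Y_prev)
    return Y_new, Y    # caller: Y, Y_prev = momentum_step(Y, Y_prev, grad)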
A simpler version of SNE: turning conditional probabilities into pairwise probabilities

The obvious way is a single joint Gaussian over all pairs:

$$p_{ij} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma^2\right)}{\sum_{k \ne l} \exp\!\left(-\lVert x_k - x_l \rVert^2 / 2\sigma^2\right)}$$

A more robust choice symmetrizes the conditionals, which guarantees that every point contributes to the cost:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \quad \Rightarrow \quad \sum_j p_{ij} > \frac{1}{2n}$$

$$\mathrm{Cost} = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad \frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)$$
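A sketch of the symmetric formulation: building $p_{ij}$ from the conditional matrix (as computed by the earlier sketch) and evaluating the gradient above, vectorized with numpy. Names are illustrative:

import numpy as np

def p_pairwise(P_cond):
    # p_ij = (p_{j|i} + p_{i|j}) / 2n from the n x n conditional matrix
    n = len(P_cond)
    return (P_cond + P_cond.T) / (2.0 * n)

def sym_sne_grad(P, Q, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j), vectorized:
    # sum_j r_ij (y_i - y_j) = (row-sum of R) * y_i - (R @ Y)_i
    R = P - Q
    return 4.0 * (R.sum(1, keepdims=True) * Y - R @ Y)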
MNIST database of handwritten digits: 28×28 images.

Problem?
Why SNE does not have gaps between classes

Crowding problem: the area available to accommodate moderately distant datapoints is not large enough compared with the area available to accommodate nearby datapoints.

A uniform background model (UNI-SNE) eliminates this effect and allows gaps between classes to appear: with a background of mixing proportion $\rho$ mixed in, $q_{ij}$ can never fall below $\frac{2\rho}{n(n-1)}$.
From UNI-SNE to t-SNE

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
High dimension: Convert distances into probabilities using a
Gaussian distribution
Low dimension: Convert distances into probabilities using a
probability distribution that has much heavier tails than a Gaussian.
Student's t-distribution, where $V$ is the number of degrees of freedom; t-SNE uses $V = 1$.

[Figure: density of the standard normal distribution vs. the t-distribution with $V = 1$.]
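A minimal sketch of the heavy-tailed low-D affinities (numpy; the function name is illustrative):

import numpy as np

def q_student_t(Y):
    # q_ij proportional to (1 + ||y_i - y_j||^2)^(-1), i.e. a Student t
    # kernel with one degree of freedom, normalized over all pairs
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)   # exclude i == j from the normalization
    return num / num.sum()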
Compare tSNE with SNE and UNI-SNE

[Figure: comparison of tSNE with SNE and UNI-SNE.]
Optimization method for tSNE

High-D affinities (Gaussian):

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

Low-D affinities (Student t with one degree of freedom):

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
Optimization method for tSNE

Tricks (a sketch that wires them together follows this list):
1. Keep the momentum term small until the map points have become moderately well organized.
2. Use the adaptive learning rate scheme described by Jacobs (1988), which gradually increases the learning rate in directions where the gradient is stable.
3. Early compression: force map points to stay close together at the start of the optimization.
4. Early exaggeration: multiply all the $p_{ij}$'s by 4 in the initial stages of the optimization.
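A skeleton of the optimization loop with tricks 1 and 4 wired in (trick 2's adaptive gains and trick 3's compression penalty are omitted for brevity). The gradient is the standard t-SNE gradient from the paper; the iteration counts and constants follow the paper's defaults, and q_student_t is the helper sketched earlier:

import numpy as np

def tsne_grad(P, Q, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j) / (1 + ||y_i - y_j||^2)
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = (P - Q) / (1.0 + d2)
    return 4.0 * (W.sum(1, keepdims=True) * Y - W @ Y)

def tsne_optimize(P, n_iter=1000, eta=100.0, seed=0):
    n = len(P)
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((n, 2))   # tiny init: points start close together
    Y_prev = Y.copy()
    for t in range(n_iter):
        P_t = 4.0 * P if t < 50 else P       # trick 4: early exaggeration
        alpha = 0.5 if t < 250 else 0.8      # trick 1: momentum small at first
        Q = q_student_t(Y)
        grad = tsne_grad(P_t, Q, Y)
        Y, Y_prev = Y - eta * grad + alpha * (Y - Y_prev), Y
    return Y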
[Figure: 6000 MNIST digits embedded by t-SNE, Sammon mapping, Isomap, and Locally Linear Embedding.]
tSNE vs Diffusion maps

Diffusion distance:

$$p_{ij}^{(1)} = e^{-\lVert x_i - x_j \rVert^2}$$

Diffusion maps:

$$p_{ij}^{(t)} = \sum_{k=1}^{n} p_{ik}^{(t-1)}\, p_{kj}^{(t-1)}$$
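Read literally, the recursion above squares the transition matrix at each step. A sketch of that reading (the row normalization is an added assumption, needed to make the $p_{ij}$ proper transition probabilities):

import numpy as np

def diffusion_transitions(X, t=3):
    # one-step affinities, then repeated squaring per the recursion above
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2)
    P /= P.sum(1, keepdims=True)   # row-normalize (assumption, see lead-in)
    for _ in range(t - 1):
        P = P @ P                  # p^(t)_ij = sum_k p^(t-1)_ik p^(t-1)_kj
    return P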
Weaknesses
1. It is unclear how t-SNE performs on general dimensionality reduction tasks;
2. the relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data;
3. it is not guaranteed to converge to a global optimum of its cost function.
References:
t-SNE homepage:
http://homepage.tudelft.nl/19j49/t-SNE.html
Advanced Machine Learning, Lecture 11: Non-linear Dimensionality Reduction
http://www.cs.toronto.edu/~hinton/csc2535/lectures.html
Plugin Ad: tSNE in Farsight

splot = new SNEPlotWindow(this);
splot->setPerplexity(perplexity);
splot->setModels(table, selection);
splot->show();