Visualization using tSNE
DESCRIPTION
An introduction to tSNE against the background of dimension reduction.
TRANSCRIPT
Visualization using tSNE
Yan Xu
Jun 7, 2013
Dimension Reduction Overview

Dimension reduction methods can be sorted along three axes:
• Linear (PCA) vs. nonlinear
• Parametric (LDA) vs. non-parametric
• Global (ISOMAP, MDS) vs. local (LLE, SNE)
tSNE (t-distributed Stochastic Neighbor Embedding)

Lineage: MDS → SNE (2002: local + probability) → symmetric SNE (easier implementation) → UNI-SNE (2007: tackles the crowding problem) → tSNE (2008: a more stable and faster solution) → Barnes-Hut-SNE (2013: O(N²) → O(N log N)).
MDS: Multi-Dimensional Scaling
• Multi-Dimensional Scaling arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-D space.
$$\mathrm{Cost} = \sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^2, \qquad d_{ij} = \lVert x_i - x_j \rVert \ \text{(high-D distance)}, \qquad \hat{d}_{ij} = \lVert y_i - y_j \rVert \ \text{(low-D distance)}$$

Sammon mapping from MDS

$$\mathrm{Cost} = \sum_{i<j} \frac{\left( \lVert x_i - x_j \rVert - \lVert y_i - y_j \rVert \right)^2}{\lVert x_i - x_j \rVert}$$

Sammon mapping puts too much emphasis on getting very small distances exactly right. It is slow to optimize and gets stuck in a different local optimum each time.
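As a concrete reading of the two cost functions above, here is a minimal numpy sketch (not from the slides; X and Y are assumed to be n × d arrays of high-D and low-D coordinates, and the function names are illustrative):

import numpy as np

def pairwise_dist(Z):
    # Euclidean distance matrix between the rows of Z
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def mds_cost(X, Y):
    # sum of squared discrepancies between high-D and low-D distances
    d_high, d_low = pairwise_dist(X), pairwise_dist(Y)
    return ((d_high - d_low) ** 2).sum() / 2.0  # each pair counted once

def sammon_cost(X, Y, eps=1e-12):
    # like MDS, but each pair is down-weighted by its high-D distance,
    # which is what over-emphasizes very small distances
    d_high, d_low = pairwise_dist(X), pairwise_dist(Y)
    off = ~np.eye(len(X), dtype=bool)           # exclude the zero diagonal
    return ((d_high - d_low) ** 2 / (d_high + eps))[off].sum() / 2.0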
Global to Local?
The idea is to make the local configurations of points in the low-dimensional
space resemble the local configurations in the high-dimensional space.
LLE (Locally Linear Embedding): maps that preserve local geometry

First find the weights that best reconstruct each point from its neighbors $N(i)$ in high-D:

$$\mathrm{Cost} = \sum_i \Big\lVert x_i - \sum_{j \in N(i)} w_{ij}\, x_j \Big\rVert^2, \qquad \sum_{j \in N(i)} w_{ij} = 1$$

Then, keeping those weights fixed, measure how well the same weights reconstruct each point in low-D:

$$\mathrm{Cost} = \sum_i \Big\lVert y_i - \sum_{j \in N(i)} w_{ij}\, y_j \Big\rVert^2 \qquad \text{(fixed weights)}$$

Find the $y_i$ that minimize this cost, subject to the constraint that the $y$ have unit variance on each dimension.
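A short sketch of LLE's first step under the formulation above: solving for the reconstruction weights of each point from its k nearest neighbors. The function name, k, and the regularizer reg are illustrative choices, not part of the slides:

import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    # step 1 of LLE: for each point, weights over its k nearest neighbors
    # that best reconstruct it, with the weights summing to 1
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(((X - X[i]) ** 2).sum(1))[1:k + 1]  # k nearest, excluding i
        Z = X[idx] - X[i]                    # center the neighbors on x_i
        G = Z @ Z.T                          # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)   # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx] = w / w.sum()              # enforce the sum-to-one constraint
    return W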
A probabilistic version of local MDS:
Stochastic Neighbor Embedding (SNE)
• It is more important to get local distances right than non-local ones.
• Stochastic neighbor embedding has a probabilistic way of deciding if
a pairwise distance is “local”.
• Convert each high-dimensional similarity into the probability that one
data point will pick the other data point as its neighbor.
$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \qquad \text{probability of picking } j \text{ given } i \text{ in high D}$$

$$q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)} \qquad \text{probability of picking } j \text{ given } i \text{ in low D}$$
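In code, each conditional distribution is a softmax over negative scaled squared distances. A minimal numpy sketch (the per-point bandwidths sigma are assumed given; the next slide explains how to choose them). $q_{j|i}$ is the same computation on the low-D points with the bandwidth fixed so that $2\sigma^2 = 1$:

import numpy as np

def cond_probs_high(X, sigma):
    # p_{j|i}: Gaussian neighbor probabilities in high-D (one row per i);
    # sigma is a length-n array of per-point bandwidths sigma_i
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)                  # a point never picks itself
    P = np.exp(logits - logits.max(1, keepdims=True))  # numerically stable softmax
    return P / P.sum(1, keepdims=True)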
Picking the radius of the Gaussian that is
used to compute the p’s
• We need to use different radii in different parts of the space so that
we keep the effective number of neighbors about constant.
• A big radius leads to a high entropy for the distribution over
neighbors of i. A small radius leads to a low entropy.
• So decide what entropy you want and then find the radius that
produces that entropy.
• It's easier to specify a perplexity:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$

The perplexity can be read as a smooth measure of the effective number of neighbors.
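A sketch of the radius search the slide describes: binary search on $\sigma_i$ until the entropy of the conditional distribution matches the requested perplexity. The names, bounds, and tolerances are illustrative:

import numpy as np

def sigma_for_perplexity(d2_row, target, tol=1e-5, max_iter=50):
    # find sigma_i so that the perplexity of p_{.|i} matches the target;
    # d2_row holds squared distances from point i to every other point
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-(d2_row - d2_row.min()) / (2.0 * sigma ** 2))  # stable shift
        p /= p.sum()
        perp = 2.0 ** -(p * np.log2(p + 1e-12)).sum()  # perplexity = 2^entropy
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma       # too many effective neighbors: shrink the radius
        else:
            lo = sigma       # too few: grow it
        sigma = np.sqrt(lo * hi)
    return sigma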
The cost function for a low-dimensional representation

$$\mathrm{Cost} = \sum_i \mathrm{KL}(P_i \,\Vert\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

Gradient descent:

$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$$

Gradient update with a momentum term ($\eta$: learning rate, $\alpha(t)$: momentum):

$$\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} - \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)$$
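A one-line reading of the momentum update (the values of eta and alpha are illustrative):

def momentum_step(Y, Y_prev, grad, eta=100.0, alpha=0.5):
    # one update: step down the gradient, plus a fraction alpha of the
    # previous displacement; eta = learning rate, alpha = momentum
    Y_new = Y - eta * grad + alpha * (Y - Y_prev)
    return Y_new, Y    # caller: Y, Y_prev = momentum_step(Y, Y_prev, grad)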
A simpler version of SNE: turning conditional probabilities into pairwise probabilities

The obvious way is a single joint Gaussian over all pairs:

$$p_{ij} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma^2\right)}{\sum_{k \ne l} \exp\!\left(-\lVert x_k - x_l \rVert^2 / 2\sigma^2\right)}$$

A more robust choice symmetrizes the conditionals, which guarantees that every point contributes to the cost:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \quad \Rightarrow \quad \sum_j p_{ij} > \frac{1}{2n}$$

$$\mathrm{Cost} = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad \frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)$$
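A sketch of the symmetric formulation: building $p_{ij}$ from the conditional matrix (as computed by the earlier sketch) and evaluating the gradient above, vectorized with numpy. Names are illustrative:

import numpy as np

def p_pairwise(P_cond):
    # p_ij = (p_{j|i} + p_{i|j}) / 2n from the n x n conditional matrix
    n = len(P_cond)
    return (P_cond + P_cond.T) / (2.0 * n)

def sym_sne_grad(P, Q, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j), vectorized:
    # sum_j r_ij (y_i - y_j) = (row-sum of R) * y_i - (R @ Y)_i
    R = P - Q
    return 4.0 * (R.sum(1, keepdims=True) * Y - R @ Y)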
MNIST database of handwritten digits: 28×28 images.

Problem?
Why SNE does not have gaps between classes

Crowding problem: the area available to accommodate moderately distant datapoints is not large enough compared with the area available to accommodate nearby datapoints.

A uniform background model (UNI-SNE) eliminates this effect and allows gaps between classes to appear: with a background of mixing proportion $\rho$ mixed in, $q_{ij}$ can never fall below $\frac{2\rho}{n(n-1)}$.
From UNI-SNE to t-SNE

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
High dimension: Convert distances into probabilities using a
Gaussian distribution
Low dimension: Convert distances into probabilities using a
probability distribution that has much heavier tails than a Gaussian.
Student's t-distribution, where $V$ is the number of degrees of freedom; t-SNE uses $V = 1$.

[Figure: density of the standard normal distribution vs. the t-distribution with $V = 1$.]
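A minimal sketch of the heavy-tailed low-D affinities (numpy; the function name is illustrative):

import numpy as np

def q_student_t(Y):
    # q_ij proportional to (1 + ||y_i - y_j||^2)^(-1), i.e. a Student t
    # kernel with one degree of freedom, normalized over all pairs
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)   # exclude i == j from the normalization
    return num / num.sum()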
Compare tSNE with SNE and UNI-SNE

[Figure: comparison of tSNE with SNE and UNI-SNE.]
Optimization method for tSNE

High-D affinities (Gaussian):

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

Low-D affinities (Student t with one degree of freedom):

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
Optimization method for tSNE

Tricks (a sketch that wires them together follows this list):
1. Keep the momentum term small until the map points have become moderately well organized.
2. Use the adaptive learning rate scheme described by Jacobs (1988), which gradually increases the learning rate in directions where the gradient is stable.
3. Early compression: force map points to stay close together at the start of the optimization.
4. Early exaggeration: multiply all the $p_{ij}$'s by 4 in the initial stages of the optimization.
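A skeleton of the optimization loop with tricks 1 and 4 wired in (trick 2's adaptive gains and trick 3's compression penalty are omitted for brevity). The gradient is the standard t-SNE gradient from the paper; the iteration counts and constants follow the paper's defaults, and q_student_t is the helper sketched earlier:

import numpy as np

def tsne_grad(P, Q, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j) / (1 + ||y_i - y_j||^2)
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = (P - Q) / (1.0 + d2)
    return 4.0 * (W.sum(1, keepdims=True) * Y - W @ Y)

def tsne_optimize(P, n_iter=1000, eta=100.0, seed=0):
    n = len(P)
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((n, 2))   # tiny init: points start close together
    Y_prev = Y.copy()
    for t in range(n_iter):
        P_t = 4.0 * P if t < 50 else P       # trick 4: early exaggeration
        alpha = 0.5 if t < 250 else 0.8      # trick 1: momentum small at first
        Q = q_student_t(Y)
        grad = tsne_grad(P_t, Q, Y)
        Y, Y_prev = Y - eta * grad + alpha * (Y - Y_prev), Y
    return Y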
[Figure: 6000 MNIST digits embedded by t-SNE, Sammon mapping, Isomap, and Locally Linear Embedding.]
tSNE vs Diffusion maps

Diffusion distance:

$$p_{ij}^{(1)} = e^{-\lVert x_i - x_j \rVert^2}$$

Diffusion maps:

$$p_{ij}^{(t)} = \sum_{k=1}^{n} p_{ik}^{(t-1)}\, p_{kj}^{(t-1)}$$
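Read literally, the recursion above squares the transition matrix at each step. A sketch of that reading (the row normalization is an added assumption, needed to make the $p_{ij}$ proper transition probabilities):

import numpy as np

def diffusion_transitions(X, t=3):
    # one-step affinities, then repeated squaring per the recursion above
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2)
    P /= P.sum(1, keepdims=True)   # row-normalize (assumption, see lead-in)
    for _ in range(t - 1):
        P = P @ P                  # p^(t)_ij = sum_k p^(t-1)_ik p^(t-1)_kj
    return P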
Weaknesses
1. It is unclear how t-SNE performs on general dimensionality reduction tasks;
2. the relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data;
3. it is not guaranteed to converge to a global optimum of its cost function.
References:
t-SNE homepage:
http://homepage.tudelft.nl/19j49/t-SNE.html
Advanced Machine Learning, Lecture 11: Non-linear Dimensionality Reduction
http://www.cs.toronto.edu/~hinton/csc2535/lectures.html
Plugin Ad: tSNE in Farsight

splot = new SNEPlotWindow(this);
splot->setPerplexity(perplexity);
splot->setModels(table, selection);
splot->show();