
Multivariate Statistics – multi-dimensional scaling (MDS) and visualization

Suppose we have n sample points x1, . . . , xn ∈ Rp with distances Dij for all pairs (i, j) with i, j ∈ {1, . . . , n}, given for example by a metric

Dij = ‖xi − xj‖.

The data-point xi ∈ Rp can be thought of as the ith row Xi· of an n × p-dimensional data-matrix X.

The distance matrix D ∈ Rn×n can also be derived in some other way (see examples on Isomap, word2vec, and visualization of Random Forests below). In fact, we do not need to have the underlying data points x1, . . . , xn. A distance matrix D is sufficient for MDS.

The goal of visualization with multi-dimensional scaling is to find an arrangement of z1, . . . , zn ∈ Rq with q < p (typically q = 2) that preserves distances as well as possible.

Several variations of MDS exist. They are in general characterized by three choices. We assume the underlying data-points x1, . . . , xn are available; if they are not and just a distance matrix D is given, we can skip the first item below.

(i) First choice: a function fX that measures distance/similarity/affinity D ∈ Rn×n between the original data-points:

Di,j = fX(xi, xj),

where the subscript in fX indicates that the function f might depend on the whole dataset.

(ii) Second choice: a function fZ that measures distance/similarity/affinity D̃ in the “embedded” space Rq:

D̃i,j = fZ(zi, zj).

(iii) Third choice: the so-called stress function S that measures how far apart the distances/affinities D and D̃ lie, which is typically separable over all pairs such that

S(D, D̃) = ∑_{i,j=1}^n s(Di,j, D̃i,j),

where typically s(·, ·) ≥ 0 and s(d, d̃) = 0 if and only if d = d̃.

The goal is now to find a configuration Z = (z1, . . . , zn) of points in Rq that minimizes the stress:

(z1, . . . , zn) = argminZ S(D, D̃),

where D is given and D̃ is a function of the chosen configuration Z. The data-points z1, . . . , zn can then be visualized easily if q = 2 or q = 3.
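As a small, self-contained illustration of this recipe (an R sketch added here for concreteness; the data and all names are placeholders), one can take Euclidean distances for both fX and fZ, a squared-error stress, and minimize the stress with a general-purpose optimizer:

## MDS as direct stress minimization: a minimal sketch with made-up data
set.seed(1)
n <- 20; p <- 5; q <- 2
X <- matrix(rnorm(n * p), n, p)            # data points x_1, ..., x_n in R^p
D <- as.matrix(dist(X))                    # choice (i): D_ij = ||x_i - x_j||

stress <- function(zvec) {                 # choice (iii): squared-error stress
  Z <- matrix(zvec, n, q)
  Dtilde <- as.matrix(dist(Z))             # choice (ii): D~_ij = ||z_i - z_j||
  sum((D - Dtilde)^2)
}

fit <- optim(rnorm(n * q), stress, method = "BFGS", control = list(maxit = 500))
Z <- matrix(fit$par, n, q)                 # embedded points z_1, ..., z_n
plot(Z, xlab = "dimension 1", ylab = "dimension 2")

In practice one would use the dedicated implementations mentioned in the following sections (cmdscale, MASS::sammon, MASS::isoMDS, ...) rather than this generic optimizer, but the three choices above stay the same.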

Some special choices:


Figure 1: Two-dimensional classical scaling embedding (PCA-based) of SNP data from European people from Novembre et al. There are n = 1387 people in the study and the number of SNPs measured is approximately p ≈ 500000.


I. Classical scaling.

Here we choose as distances inner products:

(i) Di,j = (xi − x̄)t(xj − x̄).

(ii) D̃i,j = (zi − z̄)t(zj − z̄).

(iii) Stress function S(D, D̃) = ∑_{i,j=1}^n (Di,j − D̃i,j)² = ‖D − D̃‖₂².

Note that the inner products rely on the choice of an origin (x̄ resp. z̄). Assume that the data matrix X has mean-centered columns. From now on we neglect the mean. Then

(i) D = XXt ∈ Rn×n.

(ii) D̃ = ZZt ∈ Rn×n (but of rank at most q).

(iii) S(D, D̃) = ∑_{i,j=1}^n (Di,j − D̃i,j)² = ∑_{i,j=1}^n ((XXt)i,j − (ZZt)i,j)² = trace[((XXt) − (ZZt))²].

If X = UΛ^{1/2}Vt is the SVD of X (we write Λ^{1/2}, the square root of the diagonal eigenvalue matrix, for the matrix of singular values, instead of the commonly used D, to avoid confusion with the distance matrix D), then

XXt = UΛUt.

Let the decomposition of Z be given, for Ũ ∈ Rn×n and Λ̃^{1/2} a diagonal matrix with at most q non-zero entries, as

ZZt = ŨΛ̃Ũt.

The optimal solution, if defining G := UtŨ and using the symmetry of XXt and ZZt, is

min_{Z∈Rn×q} trace[((XXt) − (ZZt))²] = min_{Ũ,Λ̃} trace[(UΛUt − ŨΛ̃Ũt)²] = min_{G,Λ̃} trace[(GtΛG − Λ̃)²],

where we have used GGt = GtG = 1n×n and the cyclic property of the trace in the last step. Using these constraints, the optimal solution for a fixed Λ̃ is seen to be G = 1n×n, that is

Ũ = U,

and the optimal solution for Λ̃ is then given by

argmin_{Λ̃} trace[Λ² + Λ̃² − 2ΛΛ̃] = argmin_{Λ̃} trace[(Λ − Λ̃)²].

Using the constraint that Λ̃ can contain only up to q non-zero entries and has to be diagonal, we find that the optimal solution is given by keeping the first q (largest) non-zero entries of Λ and setting all others equal to 0:

Λ̃kk = Λkk if k ≤ q, and Λ̃kk = 0 otherwise.

The optimal solution (up to translations and rotations) for classical scaling is hence to choose an embedding as

Z = UΛ̃^{1/2},



where Λ̃ is the truncated matrix Λ, keeping the first q diagonal entries and setting all others to zero. Using Z = UΛ̃^{1/2} as an embedding is the same as using the first q columns of the score matrix A of the PCA solution as an embedding. Classical scaling is thus equivalent to PCA. In other words: if we have a rank-q PCA approximation

X ≈ AH, with X ∈ Rn×p, A ∈ Rn×q and H ∈ Rq×p,

where A = UΛ^{1/2} and H = Vt come from the SVD X = UΛ^{1/2}Vt (restricted to the leading q components), then the scores A are the optimal embedding and we should choose Z ≡ A for a q-dimensional embedding with classical scaling. An example for gene data is shown in Figure 1.
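The equivalence can be checked numerically; the following R sketch (with random, column-centred data standing in for X) compares cmdscale, which implements classical scaling, with the first q PCA scores:

## Classical scaling vs. PCA scores: equal up to sign flips of the columns
set.seed(1)
X <- scale(matrix(rnorm(100 * 6), 100, 6), center = TRUE, scale = FALSE)

Z_mds <- cmdscale(dist(X), k = 2)               # classical scaling with q = 2
Z_pca <- prcomp(X, center = FALSE)$x[, 1:2]     # first q columns of the score matrix A

max(abs(abs(Z_mds) - abs(Z_pca)))               # numerically ~ 0 (agreement up to signs)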

Exact solution (stress 0) if original data are low-rank. If the rank of X is q or smaller, then the stress will be 0, as then Λ̃ = Λ and

ZZt = UΛ̃Ut = XXt and hence D̃ ≡ D

and we can embed in dimension q while preserving all inner products.

II. Least-squares scaling.

To get rid of the choice of an origin, we can change the classical scaling to a least-squares scaling where the distances are the Euclidean distances in the original and embedded space:

(i) Di,j = ‖xi − xj‖2.

(ii) D̃i,j = ‖zi − zj‖2.

(iii) Stress function S(D, D̃) = ∑_{i,j=1}^n (Di,j − D̃i,j)² = ‖D − D̃‖₂².

The optimal configuration Z no longer has an explicit solution and is usually computed iteratively. Note, however, that in the case of centered X (mean 0 in all columns), the matrix D can be expressed as

(Di,j)² = xiᵗxi + xjᵗxj − 2xiᵗxj.

Let d ∈ Rn be the vector with the diagonal entries of XXt and let e be the vector in Rn with all entries equal to 1. Let H be the n × n matrix with entries Hi,j = (Di,j)². Then

H = deᵗ + edᵗ − 2XXt.

This implies that if we find an embedding Z ∈ Rn×q for which ZZt = XXt (assuming again mean-centered columns), then

D̃i,j = Di,j for all 1 ≤ i, j ≤ n,

and the embedding will preserve the distances exactly. This can be achieved (using the classical scaling, which tries to match XXt with ZZt directly) if the rank of X is q or smaller.
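The identity for H can be checked directly in R; the sketch below uses small random data and is only meant to make the matrix notation concrete:

## H = d e^t + e d^t - 2 X X^t for squared Euclidean distances of centred X
set.seed(1)
X <- scale(matrix(rnorm(8 * 3), 8, 3), center = TRUE, scale = FALSE)
H <- as.matrix(dist(X))^2                  # H_ij = (D_ij)^2
d <- diag(X %*% t(X))                      # d_i = x_i^t x_i
e <- rep(1, nrow(X))
max(abs(H - (d %*% t(e) + e %*% t(d) - 2 * X %*% t(X))))   # ~ 0 up to rounding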


III. Sammon mapping.

We might care more about getting the distances between neighbouring points right and less about replicating the distances between points that have a large distance already. Sammon mapping achieves this by replacing the least-squares stress function by a rescaled version:

(i) Di,j = ‖xi − xj‖2.

(ii) D̃i,j = ‖zi − zj‖2.

(iii) Stress function

S(D, D̃) = ∑_{i,j=1}^n (Di,j − D̃i,j)² / Di,j,

where the new stress function is down-weighting the approximation quality for pairs of observations that have a large distance in the original space.
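A Sammon mapping is available in R through MASS::sammon; the sketch below uses the iris measurements purely as example data (duplicate rows are removed because the Sammon stress is undefined for zero distances between distinct samples):

## Sammon mapping with MASS::sammon (iterative, may end in a local minimum)
library(MASS)
keep <- !duplicated(iris[, 1:4])
D    <- dist(iris[keep, 1:4])
fit  <- sammon(D, k = 2)                   # minimizes a normalized version of the rescaled stress
plot(fit$points, col = iris$Species[keep], pch = 19,
     xlab = "dimension 1", ylab = "dimension 2")
fit$stress                                 # final stress value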

Some general remarks: Minimizing the stress function typically uses greedy optimization (classical scaling is an exception, as it can be solved exactly via a singular value decomposition) and can end in a local minimum, so several solutions are typically computed (using different starting points) and the best one (with the minimal stress) is chosen. Note also that MDS does not provide an explicit mapping into the lower-dimensional space. If we want to add a new data-point, we would have to re-compute the solution with now n + 1 points instead of the original n (although one can of course keep the first set of n points fixed and just minimize the stress between the original n points and the newly added point).

IV. Shepard-Kruskal nonmetric scaling.

If we just want to preserve the ordering of the distances and not the distances themselves, an attractive alternative is to allow a monotone transformation of the distances between the embedded points and to optimize:

(i) Di,j = ‖xi − xj‖2.

(ii) D̃i,j = ‖zi − zj‖2.

(iii) Stress function

S(D, D̃) = min_{g monotone} ∑_{i,j=1}^n (Di,j − g(D̃i,j))²,

where the function g is an arbitrary monotonically increasing function, usually with g(0) = 0. The optimization typically alternates between keeping D̃ constant and optimizing g, and keeping g constant and optimizing D̃. This non-metric scaling tries to preserve the ordering of the distances as well as possible.
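Kruskal's non-metric scaling is implemented in MASS::isoMDS; the sketch below applies it to the built-in swiss data, which is just a convenient stand-in for any data set with a meaningful distance matrix:

## Shepard-Kruskal non-metric scaling with MASS::isoMDS
library(MASS)
D   <- dist(scale(swiss))                  # Euclidean distances on standardized variables
fit <- isoMDS(D, k = 2)                    # alternates between fitting g and the configuration
fit$stress                                 # Kruskal stress (in percent)
plot(fit$points, type = "n", xlab = "dimension 1", ylab = "dimension 2")
text(fit$points, labels = rownames(swiss), cex = 0.6)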


Figure 2: Top: the Isomap idea from Tenenbaum et al. (2000): use as distances the shortest-path lengths between sample points in a k-NN graph and approximate them by an MDS-type projection in two-dimensional space. Bottom: the unfolding (right) of a “Swiss roll” (left) using Isomap.


Figure 3: Two examples of two-dimensional Isomap embeddings, of hand gestures and of handwritten characters. Note that the axis labels/interpretations are only added post hoc as an interpretation of what the dimensions could correspond to.

V. Isomap.

If the data lie on a very non-linear manifold, the Euclidean distances are sometimes not very meaningful; see Figure 2 for an example. Isomap is an example of an algorithm that replaces Euclidean distances by graph-based distances. The first step is the construction of a so-called k-NN graph, in which each sample corresponds to a node. For each node/sample, an edge is drawn to its k nearest neighbours (where nearest neighbours are still determined in the Euclidean metric). The distances Di,j are then taken to be the shortest-path lengths in the k-NN graph.

(i) Di,j = length of shortest path between i and j in k-NN graph

(ii) D̃i,j = ‖zi − zj‖2.

(iii) Stress function S(D, D̃) = ‖D − D̃‖₂².

After replacing the distances with shortest-path lengths in a k-NN graph, standard least-squares MDS is used to find a good embedding z1, . . . , zn in a lower-dimensional space. The idea is illustrated in Figure 2 and some examples from the original paper are shown in Figure 3.
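The Isomap construction can be sketched in a few lines of R: build the k-NN graph, compute shortest-path distances (a plain Floyd-Warshall loop is enough for small n) and apply classical scaling to the resulting distance matrix. The “Swiss roll”-type data below are simulated for illustration only; if the neighbourhood graph is disconnected (some entries of G stay infinite), k has to be increased:

## Minimal Isomap sketch: k-NN graph + shortest paths + classical scaling
set.seed(1)
n <- 300; k <- 7
theta <- runif(n, 1.5 * pi, 4.5 * pi)                     # position along the roll
X <- cbind(theta * cos(theta), runif(n, 0, 10), theta * sin(theta))

E <- as.matrix(dist(X))                                   # Euclidean distances
G <- matrix(Inf, n, n); diag(G) <- 0                      # graph distances (Inf = no edge)
for (i in 1:n) {
  nb <- order(E[i, ])[2:(k + 1)]                          # k nearest neighbours of sample i
  G[i, nb] <- E[i, nb]
  G[nb, i] <- E[i, nb]                                    # make the graph undirected
}
for (m in 1:n)                                            # Floyd-Warshall shortest paths
  G <- pmin(G, outer(G[, m], G[m, ], "+"))

Z <- cmdscale(as.dist(G), k = 2)                          # MDS on the geodesic distances
plot(Z, col = rainbow(50)[cut(theta, 50)], pch = 19, cex = 0.5)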

The approach is appealing if we believe that the data live on a low-dimensional (but non-linear) manifold in a higher-dimensional space. The procedure might, however, not be very robust to noise in the observations (as this would create “shortcuts” in the nearest-neighbour graph). Many similar ideas exist (diffusion maps etc.), all with advantages and disadvantages.

VI. t-SNE.

The recently proposed t-SNE (t-distributed stochastic neighbour embedding) takes the idea of Sammon mapping a step further by focussing just on getting the nearest neighbours correctly represented and not caring about the “large-scale” structure of the data. The basic underlying idea of this (and other similar) approaches is to approximate well not the distances themselves but


something like inverse distances, which are close to 0 for large distances; hence it does not matter how far apart samples with a large distance Di,j are placed in the embedding, as long as they are not nearest neighbours.

For t-SNE the choices are specifically:

(i) A distance, or rather an affinity function, that is close to 0 for large Euclidean distances (and so is a similarity rather than a distance, since it is large for near neighbours):

Di,j = exp(−‖xi − xj‖₂²/(2σ²)) / ∑_{k≠l} exp(−‖xk − xl‖₂²/(2σ²)).

The value of σ is a tuning parameter (there are different versions of t-SNE; in some of them the parameter σi can be set individually for each sample and the sum in the denominator runs only over pairs that contain sample i; see the original paper if you are interested). Note that Di,j can be interpreted probabilistically as a transition probability in a random walk in sample space (where a jump to near neighbours is more likely than to far-away points).

(ii) A distance/similarity function in the embedded space that makes use of the Cauchy distribution (t-distribution with one degree of freedom) rather than a normal distribution:

D̃i,j = (1 + ‖zi − zj‖₂²)⁻¹ / ∑_{k≠l} (1 + ‖zk − zl‖₂²)⁻¹.

(iii) For the stress function a Kullback-Leibler divergence is used (viewing D and D̃ as probability distributions):

S(D, D̃) = ∑_{i,j=1}^n Di,j log(Di,j / D̃i,j),

which can also be written as a cross-entropy-type expression (equivalent up to a constant value once we fix D):

S(D, D̃) = −∑_{i,j=1}^n Di,j log D̃i,j.

The t-SNE choices focus mostly on keeping nearby data points xi and xj (that is, pairs with a large affinity Di,j in the original space) close to each other in the embedded space, whereas pairs with a small value of Di,j can be put far away in the embedded space or close to each other; the stress function is rather insensitive to the latter.
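In R, t-SNE is available for example through the Rtsne package; the sketch below assumes an n × p numeric matrix X and a factor labels of class memberships (both placeholders, standing in for the MNIST digits of Figure 4):

## t-SNE sketch with the Rtsne package; X and labels are assumed to exist
library(Rtsne)
fit <- Rtsne(X, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(fit$Y, col = labels, pch = 19, cex = 0.5,
     xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")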

Comparison and extensions

Preserving short- versus long-range structure. The approach taken by t-SNE tries to preserve the short-range distances, and it is largely inconsequential how far apart the long-distance pairs end up in the new space. This emphasis is in a sense opposite to least-squares scaling, which tries mostly to preserve the long-distance relationships. Two examples are shown in Figure 4 and illustrate the relative advantage of one over the other, depending on the application.
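A minimal stand-in for the map example of Figure 4 (top left) can be run with the built-in UScitiesD distances; classical scaling is used here instead of least-squares scaling for brevity, and the orientation of the recovered map is arbitrary:

## Reconstructing a map of US cities from pairwise distances
Z <- cmdscale(UScitiesD, k = 2)
plot(Z, type = "n", xlab = "dimension 1", ylab = "dimension 2")
text(Z, labels = rownames(Z), cex = 0.7)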



Figure 4: Two examples (with the R code as in lectures). Top row: reconstructing a map from pairwise road distances in the US with least-squares scaling (left) and t-SNE (right). Here least-squares scaling is clearly better as it focusses on preserving large-scale structure (which is inherently low-rank in this application). Bottom row: two-dimensional embedding of the MNIST data with least-squares scaling (left) and t-SNE (right). t-SNE manages to produce clusters that correspond to different digits by preserving the nearest-neighbour information, while least-squares scaling tries to preserve the (mostly irrelevant) large-distance pairs in the original data.


Figure 5: The MDS visualization of the Vowel data for a nodesize of (from left to right) 20 and 40. Each colour corresponds to one of the ten vowels. If classes overlap, it means that the Random Forest will confuse these two classes, and vice versa. Note that the visualizations are only well-defined up to rotations.

Adaptive input metric. Most of the embeddings start with a Euclidean metric in some form for the original data. We can also define the metric in the original space adaptively. One idea is to use Random Forests in a supervised setting where we care about a target variable Y we would like to forecast. The distance Di,j between two samples can then be defined as one minus the fraction of trees in the RF for which samples xi and xj end in the same leaf node of the tree. This guarantees that Di,i = 0 for all i = 1, . . . , n and Di,j ≥ 0 in general. Then least-squares MDS can for example be used after we replace step (i) with this new metric.
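A sketch of this construction with the randomForest package is given below; iris is used as a stand-in for the Vowel data, and classical scaling replaces least-squares MDS for simplicity (the randomForest package also ships an MDSplot() helper that performs essentially these steps):

## Random-Forest distances D_ij = 1 - proximity(i, j), followed by MDS
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)
D  <- 1 - rf$proximity              # fraction of trees in which i and j end in different leaves
Z  <- cmdscale(as.dist(D), k = 2)   # classical scaling on the RF-based distances
plot(Z, col = iris$Species, pch = 19, xlab = "Dim 1", ylab = "Dim 2")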

One example is the Vowel dataset of Newman (1998), where the goal is to forecast the type of steady-state vowel (hiD, hlD, etc.) out of ten different types, using ten different features from the sound recordings.

The patterns also reveal which classes are more similar to each other, although this can also be read off the confusion matrix. In addition, the plot can reveal whether there is only a partial overlap, in the sense that some members of a class can be classified as originating from this class with high confidence even though other members of the class overlap with another class. As a caution against over-interpreting these plots: a low-dimensional classification based on the first few MDS dimensions of an RF output often produces classification performance very similar to that of the original RF, but this does not need to be true in general.

Word2vec embedding. Another interesting application is the word2vec embedding. It tries to embed each word w in a corpus into a q-dimensional space, where q is typically 300 or thereabouts. Let zi be the position of word i. Each word now also has a second position ci in the same q-dimensional space, which is the “context position”. The positions are determined as follows. We try to model the probability that word i appears in a neighbourhood of word j as

Pz,z′(word i in neighbourhood | word j in center) = c · exp(ciᵗzj) / ∑_{i′} exp(ci′ᵗzj),


Figure 6: Some examples from queries by the word2vec embedding from the original paper Mikolov et al. (2013).

where c > 0 is a constant and can be set in first approximation to be the chosen number of neighbours of a word (say 5 or 10). We now try to place the locations zi and z′i ≡ ci for all words i such that the “negative log-likelihood” is minimized:

argmin_{z,z′} ∑_{(i,j)∈D} − log(Pz,z′(word i in neighbourhood | word j)),

where D is the set of pairs (i, j) for which word i occurs in a neighbourhood of word j in the corpus of text being used. We can define, with only slight abuse of notation, distances (or rather affinities) and the stress function as

Di,j := #{word i in neighbourhood of word j},

D̃i,j := c · exp(ciᵗzj) / ∑_{i′} exp(ci′ᵗzj),

S(D, D̃) := −∑_{i,j} Di,j log(D̃i,j).

Note that D and D̃ are no longer symmetric. Minimizing the stress function, we get the optimisation in the same format as the cross-entropy-type embedding formulations above. Similar to t-SNE, the stress function takes the form of a cross-entropy (which is identical to the Kullback-Leibler-type divergence used in t-SNE up to the “entropy” of D, if D, after normalisation, is interpreted as a probability distribution).
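To make the notation concrete, the following R sketch evaluates this stress for given word positions Z (rows zi), context positions C (rows ci) and a co-occurrence count matrix D; all objects are small random placeholders and this is not an actual word2vec implementation, which in practice optimizes this type of objective with stochastic gradient methods and approximations such as negative sampling:

## Cross-entropy stress of a word2vec-type embedding for given positions
word2vec_stress <- function(Z, C, D, cc = 5) {
  S <- C %*% t(Z)                                    # S[i, j] = c_i^t z_j
  P <- cc * sweep(exp(S), 2, colSums(exp(S)), "/")   # D~[i, j], normalized over i
  -sum(D * log(P))
}

## tiny random example: 30 "words" embedded in q = 10 dimensions
set.seed(1)
V <- 30; q <- 10
Z <- matrix(rnorm(V * q, sd = 0.1), V, q)
C <- matrix(rnorm(V * q, sd = 0.1), V, q)
D <- matrix(rpois(V * V, 2), V, V); diag(D) <- 0     # fake co-occurrence counts
word2vec_stress(Z, C, D)

Minimizing this stress over Z and C (for a tiny vocabulary this could even be done with optim) yields the embedding.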

Some examples typically used for the embeddings are then results of the form (which seem to hold approximately if training on a large corpus of text)

zking − zqueen + zwife ≈ zhusband

zqueens − zqueen + zking ≈ zkings

zSwitzerland − zBerne ≈ zFrance − zParis

Some examples from the original paper are shown in Figure 6.
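Given some pre-trained embedding matrix Z (one row per word, with rownames giving the vocabulary; how Z was obtained does not matter here), such analogy queries reduce to vector arithmetic plus a nearest-neighbour search, as in this R sketch:

## Analogy query: words closest (in cosine similarity) to z_a - z_b + z_w
analogy <- function(Z, a, b, w, n = 5) {
  target <- Z[a, ] - Z[b, ] + Z[w, ]
  sims   <- (Z %*% target) / (sqrt(rowSums(Z^2)) * sqrt(sum(target^2)))
  head(sort(sims[, 1], decreasing = TRUE), n)
}

## e.g. analogy(Z, "king", "queen", "wife") should rank "husband" highly.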
