1
Latent Semantic Mapping: Dimensionality Reduction via Globally Optimal Continuous Parameter Modeling
Jerome R. Bellegarda
2
Outline
• Introduction
• LSM
• Applications
• Conclusions
3
Introduction
• LSA in IR:
– Words of queries and documents
– Recall and precision
• Assumption: there is some underlying latent semantic structure in the data
– Latent structure is conveyed by correlation patterns
– Documents: bag-of-words model
• LSA improves separability among different topics
4
Introduction
5
Introduction
• Success of LSA:
– Word clustering
– Document clustering
– Language modeling
– Automated call routing
– Semantic inference for spoken interface control
• These solutions all leverage LSA’s ability to expose global relationships in context and meaning
6
Introduction
• Three unique factors for LSA:
– The mapping of discrete entities
– The dimensionality reduction
– The intrinsically global outlook
• The terminology is changed to latent semantic mapping (LSM) to convey increased reliance on these general properties
7
Latent Semantic Mapping
• LSA defines a mapping between two discrete sets and a continuous vector space:
– M: an inventory of M individual units, such as words
– N: a collection of N meaningful compositions of units, such as documents
– L: a continuous vector space
– r_i: unit in M
– c_j: composition in N
8
Feature Extraction
• Construction of a matrix W of co-occurrences between units and compositions
• The (i, j) cell of W:

$$w_{i,j} = (1 - \varepsilon_i)\,\frac{c_{i,j}}{n_j}$$

– c_{i,j}: the number of times r_i occurs in c_j
– n_j: the total number of units present in c_j
– ε_i: the normalized entropy of r_i in the collection N
9
Feature Extraction
• The normalized entropy of r_i (see below)
• A value of ε_i close to 0 means that the unit is present only in a few specific compositions
• The global weight 1 − ε_i is therefore a measure of the indexing power of the unit r_i
$$\varepsilon_i = -\frac{1}{\log N}\sum_{j=1}^{N}\frac{c_{i,j}}{t_i}\,\log\frac{c_{i,j}}{t_i}, \qquad t_i = \sum_{j=1}^{N} c_{i,j}$$

$$0 \le \varepsilon_i \le 1,$$ with equality if and only if c_{i,j} = t_i and c_{i,j} = t_i / N, respectively.
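As a concrete illustration, here is a minimal NumPy sketch that builds W from a raw M×N count matrix using exactly these two formulas; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def lsm_weighted_matrix(C):
    """Return W with w_ij = (1 - eps_i) * c_ij / n_j, plus the entropies eps_i."""
    C = np.asarray(C, dtype=float)
    _, N = C.shape
    t = C.sum(axis=1, keepdims=True)   # t_i: total occurrences of unit r_i
    n = C.sum(axis=0, keepdims=True)   # n_j: total units in composition c_j
    P = np.divide(C, t, out=np.zeros_like(C), where=t > 0)
    with np.errstate(divide="ignore"):
        logP = np.where(P > 0, np.log(P), 0.0)
    eps = -(P * logP).sum(axis=1, keepdims=True) / np.log(N)  # normalized entropy
    W = (1.0 - eps) * np.divide(C, n, out=np.zeros_like(C), where=n > 0)
    return W, eps.ravel()
```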
10
Singular Value Decomposition
• The M×N unit-composition matrix W defines two vector representations for the units and the compositions
• r_i: a row factor of dimension N
• c_j: a column factor of dimension M
• Impractical:
– M, N can be extremely large
– The vectors r_i, c_j are typically sparse
– The two spaces are distinct from each other
11
Singular Value Decomposition
• Employ the SVD:

$$W \approx \hat{W} = U S V^{T}$$

– U: M×R left singular matrix with row vectors u_i
– S: R×R diagonal matrix of singular values s_1 ≥ s_2 ≥ … ≥ s_R > 0
– V: N×R right singular matrix with row vectors v_j
– U, V are column-orthonormal: U^T U = V^T V = I_R
– R ≪ min(M, N)
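A minimal sketch of the rank-R truncation using NumPy's standard SVD routine, assuming W comes from the feature-extraction step above (the helper name is illustrative).

```python
import numpy as np

def truncated_svd(W, R):
    """Rank-R SVD: returns U (MxR), S (RxR), V (NxR) and the approximation W_hat."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s sorted s_1 >= s_2 >= ...
    U, S, V = U[:, :R], np.diag(s[:R]), Vt[:R, :].T   # keep the R largest singular values
    return U, S, V, U @ S @ V.T                       # W_hat = U S V^T
```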
12
Singular Value Decomposition
13
Singular Value Decomposition
• Ŵ captures the major structural associations in W and ignores higher-order effects
• The closeness of vectors in L:
– Unit-unit comparison
– Composition-composition comparison
– Unit-composition comparison
14
Closeness Measure
• WW^T: co-occurrences between units
• W^TW: co-occurrences between compositions
• r_i, r_j: units which have similar patterns of occurrence across the compositions
• c_i, c_j: compositions which have similar patterns of occurrence across the units
15
Closeness Measure
• Unit-unit comparisons:

$$W W^{T} = U S^{2} U^{T}$$

• Cosine measure:

$$K(r_i, r_j) = \cos(u_i S,\, u_j S) = \frac{u_i S^{2} u_j^{T}}{\|u_i S\|\,\|u_j S\|}$$

• Distance in $[0, \pi]$:

$$D(r_i, r_j) = \cos^{-1} K(r_i, r_j)$$
16
Unit-Unit Comparisons
17
Closeness Measure
• Composition-composition comparisons:

$$W^{T} W = V S^{2} V^{T}$$

• Cosine measure:

$$K(c_i, c_j) = \cos(v_i S,\, v_j S) = \frac{v_i S^{2} v_j^{T}}{\|v_i S\|\,\|v_j S\|}$$

• Distance in $[0, \pi]$:

$$D(c_i, c_j) = \cos^{-1} K(c_i, c_j)$$
18
Closeness Measure
• Unit-composition comparisons:

$$\hat{W} = U S V^{T}$$

• Cosine measure:

$$K(r_i, c_j) = \cos(u_i S^{1/2},\, v_j S^{1/2}) = \frac{u_i S v_j^{T}}{\|u_i S^{1/2}\|\,\|v_j S^{1/2}\|}$$

• Distance in $[0, \pi]$:

$$D(r_i, c_j) = \cos^{-1} K(r_i, c_j)$$
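The three comparisons differ only in which scaled vectors enter the cosine. A minimal sketch, assuming U, S, V come from the truncated SVD above (all function names are illustrative):

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit_unit(U, S, i, j):      # K(r_i, r_j) = cos(u_i S, u_j S)
    return _cos(U[i] @ S, U[j] @ S)

def comp_comp(V, S, i, j):      # K(c_i, c_j) = cos(v_i S, v_j S)
    return _cos(V[i] @ S, V[j] @ S)

def unit_comp(U, V, S, i, j):   # K(r_i, c_j) = cos(u_i S^1/2, v_j S^1/2)
    Sh = np.sqrt(S)             # S is diagonal, so element-wise sqrt works
    return _cos(U[i] @ Sh, V[j] @ Sh)

def distance(K):                # D = arccos(K), a value in [0, pi]
    return float(np.arccos(np.clip(K, -1.0, 1.0)))
```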
19
LSM Framework Extension
• Observe a new composition c̃_p, p > N; the tilde reflects the fact that the composition was not part of the original N
• c̃_p, a column vector of dimension M, can be thought of as an additional column of the matrix W
• U and S do not change:

$$\tilde{c}_p = U S \tilde{v}_p^{T}$$
20
LSM Framework Extension
• c̃_p: pseudo-composition
• ṽ_p: pseudo-composition vector, obtained by folding c̃_p into L:

$$\tilde{v}_p = \tilde{c}_p^{T}\, U S^{-1}$$

• If the addition of c̃_p causes the major structural associations in W to shift in some substantial manner, the singular vectors will become inadequate
• Otherwise, ṽ_p behaves like a genuine composition vector
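A one-line sketch of this folding-in step, assuming U and S come from the original decomposition and c_tilde is an M-dimensional column weighted the same way as the columns of W (names are illustrative):

```python
import numpy as np

def fold_in(c_tilde, U, S):
    """Map a new composition into L: v_tilde = c_tilde^T U S^{-1}."""
    return c_tilde @ U @ np.linalg.inv(S)  # S is diagonal, so inversion is cheap
```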
21
LSM Framework Extension
• It would then be necessary to re-compute the SVD to find a proper representation for c̃_p
22
Salient Characteristics of LSM
• A single vector embedding for both units and compositions in the same continuous vector space L
• A relatively low dimensionality, which makes operations such as clustering meaningful and practical
• An underlying structure reflecting globally meaningful relationships, with natural similarity metrics to measure the distance between units, between compositions or between units and compositions in L
23
Applications
• Semantic classification
• Multi-span language modeling
• Junk e-mail filtering
• Pronunciation modeling
• TTS Unit Selection
24
Semantic Classification
• Semantic classification refers to determining which one of several predefined topics a given document is most closely aligned with
• The centroid of each cluster can be viewed as the semantic representation of that topic in LSM space: a semantic anchor
• A newly observed word sequence is classified by computing the distance between the document and each semantic anchor, and picking the minimum of

$$D(c_i, c_j) = \cos^{-1} K(c_i, c_j)$$
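A minimal sketch of nearest-anchor classification, assuming each topic's training compositions are given as groups of rows of V; all names are illustrative:

```python
import numpy as np

def semantic_anchors(topic_vectors):
    """Centroid of each topic's composition vectors = its semantic anchor."""
    return {topic: np.mean(vs, axis=0) for topic, vs in topic_vectors.items()}

def classify(v_tilde, S, anchors):
    """Pick the anchor with minimum LSM distance to the new document vector."""
    def dist(a, b):
        a, b = a @ S, b @ S  # compare scaled vectors, as in K(c_i, c_j)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(cos, -1.0, 1.0))
    return min(anchors, key=lambda topic: dist(v_tilde, anchors[topic]))
```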
25
Semantic Classification
• Domain knowledge is automatically encapsulated in the LSM space in a data-driven fashion
• For desktop interface control: semantic inference
26
Semantic Inference
27
Multi-Span Language Modeling
• In a standard n-gram, the history is the string

$$H_{q-1}^{(n)} = r_{q-1}\, r_{q-2} \cdots r_{q-n+1}$$

• In LSM language modeling, the history is the current document up to word r_{q-1}:

$$H_{q-1}^{(l)} = \tilde{c}_{q-1}$$

• Pseudo-document, continually updated as q increases (here r_i denotes the unit vector of the newly observed word):

$$\tilde{c}_q = \frac{1}{n_q}\left[(n_q - 1)\,\tilde{c}_{q-1} + (1 - \varepsilon_i)\, r_i\right], \qquad \tilde{v}_q = \tilde{c}_q^{T}\, U S^{-1}$$
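A minimal sketch of this incremental update; since the slide's formula is garbled in the source, this follows the reconstruction above and should be read as an approximation of the intent (names are illustrative):

```python
import numpy as np

def update_pseudo_document(c_tilde, n_q, i, eps, U, S):
    """Fold word r_i (vocabulary index i) into the pseudo-document at length n_q."""
    c_new = c_tilde * (n_q - 1) / n_q     # rescale the previous weighted counts
    c_new[i] += (1.0 - eps[i]) / n_q      # add the new word with weight 1 - eps_i
    v_new = c_new @ U @ np.linalg.inv(S)  # re-map: v_tilde_q = c_tilde_q^T U S^{-1}
    return c_new, v_new
```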
28
Multi-Span Language Modeling
• An Integrated n-gram + LSM formulation for the overall language model probability:
– Different syntactic constructs can be used to carry the same meaning (content words)
$$\Pr(r_q \mid H_{q-1}^{(n+l)}) = \Pr(r_q \mid H_{q-1}^{(n)}, H_{q-1}^{(l)}) = \frac{\Pr(r_q \mid r_{q-1}\, r_{q-2} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_q)}{\sum_{r_i \in M} \Pr(r_i \mid r_{q-1}\, r_{q-2} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_i)}$$
29
Multi-Span Language Modeling

$$\begin{aligned}
\Pr(r_q \mid H_{q-1}^{(n+l)}) &= \Pr(r_q \mid H_{q-1}^{(n)}, H_{q-1}^{(l)}) \\
&= \frac{\Pr(r_q, H_{q-1}^{(l)} \mid H_{q-1}^{(n)})}{\sum_{r_i \in M} \Pr(r_i, H_{q-1}^{(l)} \mid H_{q-1}^{(n)})} \\
&= \frac{\Pr(r_q \mid H_{q-1}^{(n)})\,\Pr(H_{q-1}^{(l)} \mid r_q, H_{q-1}^{(n)})}{\sum_{r_i \in M} \Pr(r_i \mid H_{q-1}^{(n)})\,\Pr(H_{q-1}^{(l)} \mid r_i, H_{q-1}^{(n)})} \\
&= \frac{\Pr(r_q \mid r_{q-1} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_q, r_{q-1} \cdots r_{q-n+1})}{\sum_{r_i \in M} \Pr(r_i \mid r_{q-1} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_i, r_{q-1} \cdots r_{q-n+1})} \\
&= \frac{\Pr(r_q \mid r_{q-1} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_q)}{\sum_{r_i \in M} \Pr(r_i \mid r_{q-1} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_i)}
\end{aligned}$$

Assume that the probability of the document history given the current word is not affected by the immediate context preceding it.
30
Multi-Span Language Modeling
$$\begin{aligned}
\Pr(r_q \mid H_{q-1}^{(n+l)}) &= \frac{\Pr(r_q \mid r_{q-1} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_q)}{\sum_{r_i \in M} \Pr(r_i \mid r_{q-1} \cdots r_{q-n+1})\,\Pr(\tilde{c}_{q-1} \mid r_i)} \\
&= \frac{\Pr(r_q \mid r_{q-1} \cdots r_{q-n+1})\,\dfrac{\Pr(\tilde{c}_{q-1},\, r_q)}{\Pr(r_q)}}{\sum_{r_i \in M} \Pr(r_i \mid r_{q-1} \cdots r_{q-n+1})\,\dfrac{\Pr(\tilde{c}_{q-1},\, r_i)}{\Pr(r_i)}} \\
&= \frac{\Pr(r_q \mid r_{q-1} \cdots r_{q-n+1})\,\dfrac{\Pr(r_q \mid \tilde{c}_{q-1})\,\Pr(\tilde{c}_{q-1})}{\Pr(r_q)}}{\sum_{r_i \in M} \Pr(r_i \mid r_{q-1} \cdots r_{q-n+1})\,\dfrac{\Pr(r_i \mid \tilde{c}_{q-1})\,\Pr(\tilde{c}_{q-1})}{\Pr(r_i)}}
\end{aligned}$$

The Pr(c̃_{q-1}) terms cancel between numerator and denominator, so only the ratio Pr(r | c̃_{q-1}) / Pr(r) matters.
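In code, the last line amounts to rescaling each n-gram probability by Pr(r_i | c̃_{q-1}) / Pr(r_i) and renormalizing over the vocabulary. A minimal sketch, assuming the three distributions are available as NumPy arrays indexed by vocabulary position (array names are illustrative):

```python
import numpy as np

def integrated_prob(q_idx, p_ngram, p_lsm, p_uni):
    """Pr(r_q | H) proportional to Pr(r_q | n-gram) * Pr(r_q | c_tilde) / Pr(r_q)."""
    scores = p_ngram * p_lsm / p_uni     # numerator of the final expression, per word
    return scores[q_idx] / scores.sum()  # denominator: sum over all r_i in M
```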
31
Junk E-mail Filtering
• It can be viewed as a degenerate case of semantic classification (two categories):
– Legitimate
– Junk
• M: an inventory of words and symbols
• N: a binary collection of e-mail messages
• Two semantic anchors
32
Pronunciation Modeling
• Also called grapheme-to-phoneme conversion (GPC)
• Orthographic anchors – (one for each in-vocabulary word)
• Orthographic neighborhood
– In-vocabulary words with high closeness to the out-of-vocabulary word (see the sketch below)
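A minimal sketch of building such a neighborhood, assuming an LSM space where units are letter sequences and compositions are in-vocabulary spellings with anchor vectors taken from V; this setup and every name below are assumptions for illustration:

```python
import numpy as np

def orthographic_neighborhood(oov_counts, U, S, anchors, k=5):
    """Return the k in-vocabulary words closest to a folded-in OOV spelling."""
    v = oov_counts @ U @ np.linalg.inv(S)  # fold the OOV spelling into L
    def closeness(word):
        a, b = anchors[word] @ S, v @ S    # cosine of scaled vectors, as in K(c_i, c_j)
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(anchors, key=closeness, reverse=True)[:k]
```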
33
Pronunciation Modeling
34
Conclusions
• Descriptive Power
– Forgoing local constraints is not acceptable in some situations
• Domain Sensitivity
– Depends on the quality of the training data
– Polysemy
• Updating the LSM Space
– Re-computing the SVD on the fly is not practical
• The success of LSM stems from its three salient characteristics