
Page 1:

Bayesian Learning for Latent Semantic Analysis

Jen-Tzung Chien, Meng-Sun Wu and Chia-Sheng Wu

Presenter: Hsuan-Sheng Chiu

Page 2:

References

Chia-Sheng Wu, "Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval", 2005.

Q. Huo and C.-H. Lee, "On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate", 1997.

Page 3:

Outline

Introduction

PLSA
- ML (Maximum Likelihood)
- MAP (Maximum A Posteriori)
- QB (Quasi-Bayes)

Experiments

Conclusions

Page 4:

Introduction

LSA vs. PLSA: linear algebra vs. probability

Semantic space and latent topics

Batch learning vs. incremental learning

Page 5:

PLSA

PLSA is a general machine learning technique which adopts the aspect model to represent co-occurrence data.

Topics (hidden variables): z_k \in Z = \{z_1, ..., z_K\}

Corpus (document-word pairs): Y = \{(d_i, w_j)\}, d_i \in \{d_1, ..., d_N\}, w_j \in \{w_1, ..., w_M\}

Page 6:

PLSA

Assume that d_i and w_j are conditionally independent given the associated topic z_k:

P(d_i, w_j | z_k) = P(d_i | z_k) P(w_j | z_k)

Joint probability:

P(d_i, w_j) = P(d_i) P(w_j | d_i)
            = P(d_i) \sum_{k=1}^{K} P(z_k, w_j | d_i)
            = P(d_i) \sum_{k=1}^{K} P(z_k) P(d_i, w_j | z_k) / P(d_i)
            = P(d_i) \sum_{k=1}^{K} P(z_k) P(w_j | z_k) P(d_i | z_k) / P(d_i)
            = P(d_i) \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i)
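To make the factorization concrete, here is a minimal NumPy sketch (array names and toy sizes are illustrative, not from the paper) that builds the two parameter matrices and evaluates the joint probability:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4, 6, 2                        # toy sizes: documents, words, topics
n = rng.integers(0, 5, size=(N, M)).astype(float)   # counts n(d_i, w_j)

P_w_z = rng.random((M, K)); P_w_z /= P_w_z.sum(axis=0)   # P(w_j | z_k)
P_z_d = rng.random((K, N)); P_z_d /= P_z_d.sum(axis=0)   # P(z_k | d_i)
P_d = n.sum(axis=1) / n.sum()            # document prior P(d_i) from counts

# P(d_i, w_j) = P(d_i) * sum_k P(w_j | z_k) P(z_k | d_i)
P_w_given_d = P_w_z @ P_z_d              # (M, N): P(w_j | d_i)
P_dw = P_w_given_d * P_d[None, :]        # (M, N): joint P(d_i, w_j)
```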

Page 7:

ML PLSA

Log likelihood of Y:

\log P(Y | \theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j)

where \theta = \{P(w_j | z_k), P(z_k | d_i)\} and n(d_i, w_j) is the count of word w_j in document d_i.

ML estimation:

\theta_{ML} = \arg\max_{\theta} \log P(Y | \theta)
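Under the same sketch, the log likelihood is a one-liner (n is the N x M count matrix and P_dw the M x N joint from the previous snippet; the small constant guards against log 0):

```python
log_likelihood = float(np.sum(n * np.log(P_dw.T + 1e-12)))
```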

Page 8:

ML PLSA

Maximization:

\max_{\theta} \log P(Y | \theta)
 = \max_{\theta} \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j)
 = \max_{\theta} \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) [ \log P(d_i) + \log P(w_j | d_i) ]
 = \max_{\theta} [ \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i) + \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j | d_i) ]
 = \max_{\theta} \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j | d_i)

since the P(d_i) term does not depend on the parameters \theta = \{P(w_j | z_k), P(z_k | d_i)\}.

Page 9:

ML PLSA

Complete data: P(w_j, z_k | d_i)

Incomplete data: P(w_j | d_i)

EM (Expectation-Maximization) algorithm: alternate an E-step and an M-step, using

P(w_j, z_k | d_i) = P(w_j | d_i) P(z_k | d_i, w_j)

Page 10:

ML PLSA

E-step: take the expectation of the log likelihood over the topic posterior computed with the current parameters \theta:

\sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) E_{z|d_i,w_j}[ \log P(w_j | d_i) ]
 = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) E_{z|d_i,w_j}[ \log P(w_j, z_k | d_i) - \log P(z_k | d_i, w_j) ]
 = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log P(w_j, z_k | d_i)
   - \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log P(z_k | d_i, w_j)

Only the first term depends on the new parameters \hat{\theta}, so it is the quantity maximized in the M-step.

Page 11:

ML PLSA

Auxiliary function:

Q(\hat{\theta} | \theta) = E[ \log P(Z, Y | \hat{\theta}) | Y, \theta ]
 = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log [ \hat{P}(w_j | z_k) \hat{P}(z_k | d_i) ]

and

P(z_k | d_i, w_j) = P(w_j | z_k) P(z_k | d_i) / \sum_{l=1}^{K} P(w_j | z_l) P(z_l | d_i)
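A sketch of this E-step with the arrays introduced earlier (post[k, i, j] stands for P(z_k | d_i, w_j); names are illustrative):

```python
post = np.einsum('jk,ki->kij', P_w_z, P_z_d)      # numerator P(w_j|z_k) P(z_k|d_i)
post /= post.sum(axis=0, keepdims=True) + 1e-12   # normalize over topics l = 1..K
```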

Page 12:

ML PLSA

M-step: introduce Lagrange multipliers for the constraints \sum_j \hat{P}(w_j | z_k) = 1 and \sum_k \hat{P}(z_k | d_i) = 1:

Q^{ML}_{P(w_j|z_k)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j) P(z_k | d_i, w_j) \log \hat{P}(w_j | z_k) + \sum_{k=1}^{K} \lambda_k ( 1 - \sum_{j=1}^{M} \hat{P}(w_j | z_k) )

Q^{ML}_{P(z_k|d_i)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j) P(z_k | d_i, w_j) \log \hat{P}(z_k | d_i) + \sum_{i=1}^{N} \mu_i ( 1 - \sum_{k=1}^{K} \hat{P}(z_k | d_i) )

Page 13:

ML PLSA

Differentiation: for F(w) = \sum_{j=1}^{N} y_j \log w_j + \lambda ( 1 - \sum_{j=1}^{N} w_j ), setting \partial F / \partial w_j = 0 gives w_j = y_j / \sum_{j'=1}^{N} y_{j'}.

New parameter estimation:

\hat{P}_{ML}(w_j | z_k) = \sum_{i=1}^{N} n(d_i, w_j) P(z_k | d_i, w_j) / \sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m) P(z_k | d_i, w_m)

\hat{P}_{ML}(z_k | d_i) = \sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j) / \sum_{l=1}^{K} \sum_{j=1}^{M} n(d_i, w_j) P(z_l | d_i, w_j)
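The corresponding M-step in the same sketch; alternating it with the E-step above until the log likelihood stabilizes gives the full EM loop:

```python
nw = np.einsum('ij,kij->jk', n, post)   # sum_i n(d_i,w_j) P(z_k|d_i,w_j), (M, K)
P_w_z = nw / (nw.sum(axis=0, keepdims=True) + 1e-12)   # new P(w_j | z_k)

nd = np.einsum('ij,kij->ki', n, post)   # sum_j n(d_i,w_j) P(z_k|d_i,w_j), (K, N)
P_z_d = nd / (nd.sum(axis=0, keepdims=True) + 1e-12)   # new P(z_k | d_i)
```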

Page 14:

MAP PLSA

Estimation by maximizing the posterior probability:

\theta_{MAP} = \arg\max_{\theta} P(\theta | X) = \arg\max_{\theta} [ \log P(X | \theta) + \log g(\theta) ]

Definition of the prior distribution. Dirichlet density:

f(x_1, ..., x_K) \propto \prod_{i=1}^{K} x_i^{\alpha_i - 1},  x_i \ge 0,  \sum_{i=1}^{K} x_i = 1

Prior density, assuming the priors over P(w_j | z_k) and P(z_k | d_i) are independent:

g(\theta) \propto \prod_{k=1}^{K} [ \prod_{j=1}^{M} P(w_j | z_k)^{\alpha_{j,k} - 1} \prod_{i=1}^{N} P(z_k | d_i)^{\beta_{i,k} - 1} ]

(\delta_{ij} denotes the Kronecker delta: \delta_{ij} = 1 if i = j, 0 otherwise)

Page 15:

MAP PLSA

Consider the prior density:

\log g(\theta) = \sum_{k=1}^{K} \sum_{j=1}^{M} (\alpha_{j,k} - 1) \log P(w_j | z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log P(z_k | d_i) + const

Maximum a posteriori:

\max_{\theta} [ \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(w_j | d_i) + \log g(\theta) ]

Page 16:

MAP PLSA

E-step: take the expectation

\sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) E_{z|d_i,w_j}[ \log P(w_j | d_i) ] + \log g(\theta)

Auxiliary function:

R(\hat{\theta} | \theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log [ \hat{P}(w_j | z_k) \hat{P}(z_k | d_i) ]
 + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat{P}(w_j | z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log \hat{P}(z_k | d_i)

Page 17:

MAP PLSA

M-step: add Lagrange multipliers for the normalization constraints:

R(\hat{\theta} | \theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log [ \hat{P}(w_j | z_k) \hat{P}(z_k | d_i) ]
 + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat{P}(w_j | z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{i,k} - 1) \log \hat{P}(z_k | d_i)
 + \sum_{k=1}^{K} \lambda_{w,k} ( 1 - \sum_{j=1}^{M} \hat{P}(w_j | z_k) ) + \sum_{i=1}^{N} \lambda_{d,i} ( 1 - \sum_{k=1}^{K} \hat{P}(z_k | d_i) )

Page 18:

MAP PLSA

Auxiliary function for each parameter set:

Q^{MAP}_{P(w_j|z_k)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j) P(z_k | d_i, w_j) \log \hat{P}(w_j | z_k)
 + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat{P}(w_j | z_k) + \lambda_{w,k} ( 1 - \sum_{j=1}^{M} \hat{P}(w_j | z_k) )

Q^{MAP}_{P(z_k|d_i)} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} n(d_i, w_j) P(z_k | d_i, w_j) \log \hat{P}(z_k | d_i)
 + \sum_{k=1}^{K} \sum_{i=1}^{N} (\beta_{i,k} - 1) \log \hat{P}(z_k | d_i) + \lambda_{d,i} ( 1 - \sum_{k=1}^{K} \hat{P}(z_k | d_i) )

Page 19:

MAP PLSA

Differentiation gives the new parameter estimation:

\hat{P}_{MAP}(w_j | z_k) = [ \sum_{i=1}^{N} n(d_i, w_j) P(z_k | d_i, w_j) + \alpha_{j,k} - 1 ] / \sum_{m=1}^{M} [ \sum_{i=1}^{N} n(d_i, w_m) P(z_k | d_i, w_m) + \alpha_{m,k} - 1 ]

\hat{P}_{MAP}(z_k | d_i) = [ \sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j) + \beta_{i,k} - 1 ] / [ n(d_i) + \sum_{l=1}^{K} (\beta_{i,l} - 1) ]

where n(d_i) = \sum_{k=1}^{K} \sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j)
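In code, the MAP M-step only shifts the ML counts by (hyperparameter - 1); a sketch continuing the arrays above, with a flat Dirichlet prior (all ones) assumed purely for illustration:

```python
alpha = np.ones((M, K))   # assumed prior for P(w_j | z_k); flat for illustration
beta = np.ones((K, N))    # assumed prior for P(z_k | d_i)

nw_map = nw + (alpha - 1.0)
P_w_z_map = nw_map / nw_map.sum(axis=0, keepdims=True)   # MAP P(w_j | z_k)

nd_map = nd + (beta - 1.0)
P_z_d_map = nd_map / nd_map.sum(axis=0, keepdims=True)   # MAP P(z_k | d_i)
```

With the flat prior the MAP estimates reduce to the ML estimates, matching the formulas above.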

Page 20:

QB PLSA

The model needs to be updated continuously for an online information system. Estimation by maximizing the posterior probability:

\theta^{(n)}_{QB} = \arg\max_{\theta} P(\theta | X^{(n)}) = \arg\max_{\theta} P(X^{(n)} | \theta) P(\theta | \varphi^{(n-1)})
 \approx \arg\max_{\theta} P(X^{(n)} | \theta) g(\theta | \varphi^{(n-1)})

The posterior density is approximated by the closest tractable prior density with hyperparameters \varphi^{(n-1)} = \{\alpha^{(n-1)}_{j,k}, \beta^{(n-1)}_{i,k}\}, where \theta^{(n)}_{QB} = \{P^{(n)}_{QB}(w_j | z_k), P^{(n)}_{QB}(z_k | d_i)\}.

As compared to MAP PLSA, the key difference of QB PLSA is the updating of the hyperparameters.

Page 21:

QB PLSA

Conjugate prior: in Bayesian probability theory, a conjugate prior is a prior distribution with the property that the posterior distribution belongs to the same family of distributions. It yields:

A closed-form solution

A reproducible prior/posterior pair for incremental learning

Page 22:

QB PLSA

Hyperparameter \alpha: the terms of the posterior expectation involving \hat{P}(w_j | z_k) are

\sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k} - 1) \log \hat{P}(w_j | z_k),  with \sum_{j=1}^{M} \hat{P}(w_j | z_k) = 1 and \hat{P}(w_j | z_k) \ge 0.

Combining them with the data terms gives

\hat{P}(w_j | z_k) = [ \sum_{i=1}^{N} n(d_i, w_j) P(z_k | d_i, w_j) + \alpha_{j,k} - 1 ] / \sum_{m=1}^{M} [ \sum_{i=1}^{N} n(d_i, w_m) P(z_k | d_i, w_m) + \alpha_{m,k} - 1 ]

so that the hyperparameter is updated as

\alpha^{(n)}_{j,k} = \sum_{i=1}^{N} n(d^{(n)}_i, w^{(n)}_j) P(z_k | d^{(n)}_i, w^{(n)}_j) + \alpha^{(n-1)}_{j,k}

Page 23:

QB PLSA

After careful arrangement, the exponential of the posterior expectation function can be expressed as:

\exp[ R(\hat{\theta}^{(n)} | \theta^{(n)}) ] \propto \prod_{k=1}^{K} [ \prod_{j=1}^{M} \hat{P}(w_j | z_k)^{\alpha^{(n)}_{j,k} - 1} \prod_{i=1}^{N} \hat{P}(z_k | d_i)^{\beta^{(n)}_{i,k} - 1} ]

A reproducible prior/posterior pair is generated to build the updating mechanism of the hyperparameters:

\alpha^{(n)}_{j,k} = \sum_{i=1}^{N} n(d^{(n)}_i, w^{(n)}_j) P(z_k | d^{(n)}_i, w^{(n)}_j) + \alpha^{(n-1)}_{j,k}

\beta^{(n)}_{i,k} = \sum_{j=1}^{M} n(d^{(n)}_i, w^{(n)}_j) P(z_k | d^{(n)}_i, w^{(n)}_j) + \beta^{(n-1)}_{i,k}
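A sketch of this update (illustrative names; n_batch and post_batch are the counts and topic posteriors computed on batch n):

```python
def qb_update(alpha, beta, n_batch, post_batch):
    """alpha: (M, K); beta: (K, N); n_batch: (N, M); post_batch: (K, N, M)."""
    alpha_new = alpha + np.einsum('ij,kij->jk', n_batch, post_batch)
    beta_new = beta + np.einsum('ij,kij->ki', n_batch, post_batch)
    return alpha_new, beta_new
```

The returned hyperparameters serve as the prior for batch n+1, which is exactly the reproducible prior/posterior pair.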

Page 24:

Initial Hyperparameters

An open issue in Bayesian learning.

If the initial prior knowledge is too strong, or after a lot of adaptation data have been incrementally processed, new adaptation data usually have only a small impact on parameter updating in incremental training.

\alpha^{(0)}_{j,k} = 1 + \sum_{i=1}^{N} P(z_k | d_i, w_j)

\beta^{(0)}_{i,k} = 1 + \sum_{j=1}^{M} P(z_k | d_i, w_j)
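One reading of these formulas as a sketch (post is the (K, N, M) posterior array from the earlier snippets): start each hyperparameter at 1 plus the accumulated posterior mass, so the initial prior stays weak and early batches can still move the parameters.

```python
alpha0 = 1.0 + post.sum(axis=1).T   # (M, K): 1 + sum_i P(z_k | d_i, w_j)
beta0 = 1.0 + post.sum(axis=2)      # (K, N): 1 + sum_j P(z_k | d_i, w_j)
```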

Page 25:

Experiments

MED corpus:
- 1033 medical abstracts with 30 queries
- 7014 unique terms
- 433 abstracts for ML training
- 600 abstracts for MAP or QB training
- Query subset for testing
- K = 8

Reuters-21578:
- 4270 documents for training
- 2925 for QB learning
- 2790 documents for testing
- 13353 unique words
- 10 categories

Page 26:

Experiments

(results figure not preserved in this transcript)

Page 27:

Experiments

(results figure not preserved in this transcript)

Page 28:

Experiments

(results figure not preserved in this transcript)

Page 29:

Conclusions

This paper presented an adaptive text modeling and classification approach for PLSA-based information systems.

Future work:
- Extension of PLSA to bigram or trigram modeling will be explored.
- Application to spoken document classification and retrieval.

Page 30:

Discriminative Maximum Entropy Language Model for Speech Recognition

Chuang-Hua Chueh, To-Chang Chien and Jen-Tzung Chien

Presenter: Hsuan-Sheng Chiu

Page 31:

References

R. Rosenfeld, S. F. Chen and X. Zhu, "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration", 2001.

W.-H. Tsai, "An Initial Study on Language Model Estimation and Adaptation Techniques for Mandarin Large Vocabulary Continuous Speech Recognition", 2005.

Page 32:

Outline

Introduction

Whole-sentence exponential model

Discriminative ME language model

Experiment

Conclusions

Page 33:

Introduction

Language models:
- Statistical n-gram model
- Latent semantic language model
- Structured language model

Based on the maximum entropy principle, we can integrate different features to establish the optimal probability distribution.

Page 34:

Whole-Sentence Exponential Model

Traditional method:

p(s) = p(w_1, ..., w_n) = \prod_{i=1}^{n} p(w_i | w_1 ... w_{i-1})

Exponential form:

p(s) = \frac{p_0(s)}{Z} \exp[ \sum_{i} \lambda_i f_i(s) ]

Usage: when used for speech recognition, the model is not suitable for the first pass of the recognizer and should be used to re-score N-best lists.
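A sketch of scoring with the exponential form over a finite candidate set such as an N-best list (p0, feats and lam are illustrative stand-ins, not the authors' code):

```python
import numpy as np

def sentence_probs(candidates, p0, feats, lam):
    """candidates: sentences; p0: dict of baseline probabilities p0(s);
    feats: feature functions f_i(s); lam: weights lambda_i."""
    scores = np.array([p0[s] * np.exp(sum(l * f(s) for l, f in zip(lam, feats)))
                       for s in candidates])
    return scores / scores.sum()   # the normalizer plays the role of Z
```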

Page 35:

Whole-Sentence ME Language Model

Expectation of a feature function.

Empirical (over R training sentences):

\tilde{p}(f^L_i) = \frac{1}{R} \sum_{r=1}^{R} f^L_i(s_r)

Actual (model) expectation:

p(f^L_i) = \sum_{s} p(s) f^L_i(s)

Constraint:

p(f^L_i) = \tilde{p}(f^L_i),  for i = 1, ..., F

Page 36:

Whole-Sentence ME Language Model

To solve the constrained optimization problem, maximize the entropy

H(p) = -\sum_{s} p(s) \log p(s)

subject to the F expectation constraints and \sum_{s} p(s) = 1, via the Lagrangian

\Lambda(p, \lambda) = H(p) + \sum_{i=1}^{F} \lambda_i [ p(f^L_i) - \tilde{p}(f^L_i) ] + \lambda_0 [ \sum_{s} p(s) - 1 ]

Setting \partial \Lambda / \partial p(s) = 0 yields the ME solution

p_{ME}(s) = \frac{ \exp( \sum_{i=1}^{F} \lambda_i f^L_i(s) ) }{ \sum_{s'} \exp( \sum_{i=1}^{F} \lambda_i f^L_i(s') ) }

Page 37:

GIS Algorithm

Input: feature functions f^L_1, ..., f^L_F and empirical distribution \tilde{p}

Output: optimal Lagrange multipliers \hat{\lambda}_i

1. Initialization: \lambda_i = 0 for all i = 1, ..., F.
2. For each i = 1, ..., F, update based on

   \lambda^{(t+1)}_i = \lambda^{(t)}_i + \frac{1}{F} \log \frac{ \tilde{p}(f^L_i) }{ p^{(t)}(f^L_i) },   where p^{(t)}(f^L_i) = \sum_{s} p^{(t)}(s) f^L_i(s)

3. Go to step 2 if \lambda has not converged.
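A compact sketch of this loop over an enumerable sentence set (an N-best list, say); it assumes, as GIS requires, that \sum_i f_i(s) equals the same constant F for every sentence:

```python
import numpy as np

def gis(sentences, feats, p_tilde, iters=100):
    F = len(feats)
    lam = np.zeros(F)                                           # step 1
    f = np.array([[fi(s) for fi in feats] for s in sentences])  # |S| x F table
    for _ in range(iters):                                      # steps 2-3
        p = np.exp(f @ lam)
        p /= p.sum()                                            # current p(s)
        p_model = p @ f                                         # model p(f_i)
        lam += np.log(p_tilde / p_model) / F                    # GIS update
    return lam
```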

Page 38:

Discriminative ME Language Model

In general, ME can be considered as a maximum likelihood model using a log-linear distribution.

We propose a discriminative language model based on the whole-sentence ME model (DME).

Page 39:

Discriminative ME Language Model

Acoustic features for ME estimation: the sentence-level log-likelihood ratio of competing and target sentences,

f^X_A(s) = \log \frac{ p(X | s_X) }{ p(X | \bar{s}_X) } if s = s_X, and 0 if s \neq s_X

where s_X is the target sentence and \bar{s}_X the competing sentence for speech signal X.

Feature weight parameter:

\lambda^X_A = 1 if X \in A, and 0 if X \notin A

Namely, we activate the feature parameter to one for those speech signals observed in the training database A.
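A sketch of the acoustic feature (log_p_X_given_s is an assumed lookup of acoustic log-likelihoods log p(X|s); the ratio is attached only to the target sentence):

```python
def acoustic_feature(s, s_target, s_competing, log_p_X_given_s):
    # f_A^X(s): log-likelihood ratio at the target sentence, zero elsewhere
    if s == s_target:
        return log_p_X_given_s[s_target] - log_p_X_given_s[s_competing]
    return 0.0
```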

Page 40:

Discriminative ME Language Model

New estimation with both linguistic and acoustic features:

p_{LA}(s) = \frac{ \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s) + \lambda^X_A f^X_A(s) ) }{ \sum_{s'} \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s') + \lambda^X_A f^X_A(s') ) }

Upgrade to discriminative linguistic parameters:

p_{DME}(s) = \frac{ \exp( \sum_{i=1}^{F} \lambda^D_i f^L_i(s) ) }{ \sum_{s'} \exp( \sum_{i=1}^{F} \lambda^D_i f^L_i(s') ) }

Page 41:

Discriminative ME Language Model

(figure not preserved in this transcript)

Page 42:

Experiment

Corpus: TCC300
- 32 mixtures
- 12 Mel-frequency cepstral coefficients
- 1 log-energy, plus first derivatives
- 4200 sentences for training, 450 for testing

Corpus: Academia Sinica CKIP balanced corpus
- Five million words
- Vocabulary of 32909 words

Page 43:

Experiment

(results figure not preserved in this transcript)

Page 44:

Conclusions

A new ME language model integrating linguistic and acoustic features for speech recognition.

The derived ME language model is inherently discriminative.

The DME model involves a constrained optimization procedure and is powerful for knowledge integration.

Page 45:

Relation between DME and MMI

MMI criterion:

\lambda_{MMI} = \arg\max_{\lambda} \log p(S | X) = \arg\max_{\lambda} \log \frac{ p(X | S) p(S) }{ \sum_{S'} p(X | S') p(S') }

Modified MMI criterion over R training utterances:

\tilde{\lambda}_{MMI} = \arg\max_{\lambda} \sum_{r=1}^{R} \log \frac{ p(X_r | s_r) p(s_r) }{ \sum_{s'} p(X_r | s') p(s') }

Express the ME model as an ML model.

Page 46:

Relation between DME and MMI

The optimal parameter:

\hat{\lambda}_{DME} = \arg\max_{\lambda} \sum_{r=1}^{R} \log p_{LA}(s_r)
 = \arg\max_{\lambda} \sum_{r=1}^{R} \log \frac{ \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s_r) + \lambda^{X_r}_A f^{X_r}_A(s_r) ) }{ \sum_{s'} \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s') + \lambda^{X_r}_A f^{X_r}_A(s') ) }
 = \arg\max_{\lambda} \sum_{r=1}^{R} \log \frac{ \exp( f^{X_r}_A(s_r) ) \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s_r) ) }{ \sum_{s'} \exp( f^{X_r}_A(s') ) \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s') ) }

Page 47:

Relation between DME and MMI

\hat{\lambda}_{DME} = \arg\max_{\lambda} \sum_{r=1}^{R} \log \frac{ \exp( f^{X_r}_A(s_r) ) \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s_r) ) }{ \sum_{s'} \exp( f^{X_r}_A(s') ) \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s') ) }
 = \arg\max_{\lambda} \sum_{r=1}^{R} \log \frac{ p(X_r | s_r) \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s_r) ) }{ \sum_{s'} p(X_r | s') \exp( \sum_{i=1}^{F} \lambda^L_i f^L_i(s') ) }
 = \arg\max_{\lambda} \sum_{r=1}^{R} \log \frac{ p(X_r | s_r) p(s_r) }{ \sum_{s'} p(X_r | s') p(s') } = \tilde{\lambda}_{MMI}

with the whole-sentence ME model playing the role of the language model p(s); that is, DME estimation coincides with the modified MMI criterion.

Page 48:

Relation between DME and MMI