
    CMSC 25025 / Stat 37601

    Machine Learning and Large Scale Data Analysis

    Tuesday, April 21


    For Today

    •   Mixtures (redux)

    •  Bayesian inference (redux)

    •  Topic models


    Mixtures

    •  Key technique: Mixture models

    •  Mixtures have latent variables

    •  Flexible tool

    •  Simple and difficult at the same time


    Gaussian Mixture

    [Figure: plot of the mixture density p(x) for x from −4 to 6.]

    p(x) = (2/5) φ(x; −1.25, 1) + (3/5) φ(x; 2.95, 1)

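    A minimal Python sketch of this density (numpy and scipy are assumptions of this note, not part of the slides):

        import numpy as np
        from scipy.stats import norm

        # p(x) = (2/5) phi(x; -1.25, 1) + (3/5) phi(x; 2.95, 1)
        def p(x):
            return 0.4 * norm.pdf(x, -1.25, 1.0) + 0.6 * norm.pdf(x, 2.95, 1.0)

        x = np.linspace(-4, 6, 500)
        print(x[np.argmax(p(x))], p(x).max())  # the taller mode sits near x = 2.95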

    Bumps and More Bumps  (MacKay and Williams)

    A mixture of k Gaussians can have more than k modes.

    [Figure: contour plot of a two-dimensional Gaussian mixture (axes from −4 to 4) and a zoomed-in view (axes from −0.5 to 1) showing extra modes.]


    Mixtures

    •  Mixture of f and g:

       p(x) = η f(x) + (1 − η) g(x)

       This is the simplest, most common kind of latent variable model.

    •  Hidden variable representation: define Z ∼ Bernoulli(η) and

       p(x) = Σ_{z∈{0,1}} p(x | z) p(z)

       with p(x | z = 1) = f(x), p(x | z = 0) = g(x), and p(z) = η^z (1 − η)^(1−z), so that P(Z = 1) = η.

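    A small sketch of the hidden variable representation above: draw Z first, then x given z, and check that the empirical mean matches the marginal mixture (numpy assumed; η = 2/5 as in the earlier Gaussian mixture):

        import numpy as np

        rng = np.random.default_rng(0)
        eta, n = 0.4, 100_000

        # First draw the latent Z ~ Bernoulli(eta), then x | z.
        z = rng.binomial(1, eta, size=n)
        x = np.where(z == 1,
                     rng.normal(-1.25, 1.0, size=n),   # f, chosen when z = 1
                     rng.normal(2.95, 1.0, size=n))    # g, chosen when z = 0

        # Under the mixture, E[x] = eta * (-1.25) + (1 - eta) * 2.95.
        print(x.mean(), eta * (-1.25) + (1 - eta) * 2.95)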

    Gaussian Mixture: All the Key Concepts

    [Figure: the two-component Gaussian mixture density p(x) again, for x from −4 to 6.]


    Bayesian Inference

    The parameter θ of a model is viewed as a random variable.

    Inference is usually carried out as follows:

    •  Choose a generative model  p (x  | θ) for the data.

    •  Choose a prior distribution π(θ) that expresses beliefs about the parameter before seeing any data.

    •  After observing data Dn = {x1, . . . , xn}, update beliefs and calculate the posterior distribution p(θ | Dn).


    Bayes’ Theorem

    The posterior distribution can be written as

    p(θ | x1, . . . , xn) = p(x1, . . . , xn | θ) π(θ) / p(x1, . . . , xn) = Ln(θ) π(θ) / cn ∝ Ln(θ) π(θ)

    where Ln(θ) = ∏_{i=1}^n p(xi | θ) is the likelihood function and

    cn = p(x1, . . . , xn) = ∫ p(x1, . . . , xn | θ) π(θ) dθ = ∫ Ln(θ) π(θ) dθ

    is the normalizing constant, which is also called the evidence.

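    The evidence cn rarely has a closed form, but for a one-dimensional θ it can be approximated numerically. A minimal sketch (the Beta(2, 2) prior and the toy data here are illustrative assumptions, not from the slides):

        import numpy as np

        theta = np.linspace(1e-4, 1 - 1e-4, 1000)
        data = np.array([1, 0, 1, 1, 0, 1, 0, 1])      # toy Bernoulli observations

        prior = theta * (1 - theta)                    # Beta(2, 2), up to a constant
        lik = theta**data.sum() * (1 - theta)**(len(data) - data.sum())

        unnorm = lik * prior
        cn = unnorm.sum() * (theta[1] - theta[0])      # evidence by Riemann sum
        posterior = unnorm / cn                        # now integrates to one
        print(theta[np.argmax(posterior)])             # posterior mode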

    Example

    X ∼ Bernoulli(θ) with data Dn = {x1, . . . , xn}. Prior Beta(α, β) distribution:

    πα,β(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)

    Let s = Σ_{i=1}^n xi be the number of “successes.”

    The posterior distribution θ | Dn is Beta(α + s, β + n − s). The posterior mean is a mixture:

    θ̄ = (α + s) / (α + β + n) = [n / (α + β + n)] θ̂ + [(α + β) / (α + β + n)] θ0

    where θ̂ = s/n is the MLE and θ0 = α/(α + β) is the prior mean.


    Example

    n  =  15 points sampled as X   ∼ Bernoulli(θ = 0.4), with s  = 7 heads.

    [Figure: two panels, “good prior” and “bad prior,” each showing the prior distribution (black, dashed), the likelihood function (blue, dotted), and the posterior distribution (red, solid) over θ ∈ [0, 1].]

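    The conjugate update is one line of code. A sketch with scipy; the two Beta priors below are illustrative stand-ins for the “good” and “bad” priors in the figure:

        from scipy.stats import beta

        n, s = 15, 7                       # 7 heads in n = 15 Bernoulli(0.4) draws

        for a, b in [(2, 2), (20, 2)]:     # assumed "good" and "bad" priors
            post = beta(a + s, b + n - s)  # posterior Beta(alpha + s, beta + n - s)
            w = n / (a + b + n)            # weight on the MLE
            mix = w * (s / n) + (1 - w) * (a / (a + b))
            print(post.mean(), mix)        # the two numbers agree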

    Dirichlet

    The multinomial model with a Dirichlet prior is the generalization of the Bernoulli/Beta model:

    Dirichletα(θ) = [Γ(Σ_{j=1}^K αj) / ∏_{j=1}^K Γ(αj)] ∏_{j=1}^K θj^(αj−1)

    where α = (α1, . . . , αK) ∈ R^K_+ is a non-negative vector.

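    By the same conjugacy, observed category counts (n1, . . . , nK) turn a Dirichlet(α) prior into a Dirichlet(α + n) posterior. A minimal sketch (the counts are hypothetical, chosen to match the n = 20 example on the next slide):

        import numpy as np

        alpha = np.array([6.0, 6.0, 6.0])      # Dirichlet(6, 6, 6) prior
        counts = np.array([11, 5, 4])          # hypothetical multinomial counts, n = 20

        alpha_post = alpha + counts            # Dirichlet posterior parameters
        print(alpha_post / alpha_post.sum())   # posterior mean of theta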

    Example

    [Figure: four contour plots — the Dirichlet(6, 6, 6) prior, the likelihood function with n = 20, the posterior distribution with n = 20, and the posterior distribution with n = 200.]


    Summary

    •  Mixtures are latent variable models

    •  The mixing weight encodes a hidden variable

    •  Computing with mixtures uses basic probabilistic reasoning

    •  But can get complicated

    •  Topic models are flexible mixture models for complex data like documents and images (next)


    Ball and Elephants


    Captioning

    Generated captions for two images (www.cs.toronto.edu/~nitish/nips2014demo/):

    Image 1 (bird):
      there is a large bird on the water
      a small bird sitting on top of a lake
      a large white bird standing on the water on a beach
      a bird is on the water on a beach
      a bird that is standing in the water

    Image 2 (baseball):
      a professional baseball game is played in the middle of the field
      several players at the end of a baseball game
      a group of players playing a baseball game
      the baseball players are playing games at the field
      a baseball players are playing with a game and fans


    Intro to Topic Modeling

    Some of the following slides are from Dave Blei’s 2011 tutorial on Topic Modeling:

    http://www.cs.princeton.edu/~blei/topicmodeling.html

    A survey paper describing many of these ideas in more detail is here:

    http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

    See also:

    http://awards.acm.org/award_winners/blei_3974465.cfm


    Discover topics from a corpus

    Four discovered topics (top words):

    human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences

    evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common

    disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis

    computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations


    Model the evolution of topics over time

    [Figure: two time-series scatter plots, 1880–2000, of topic-word frequencies: “Theoretical Physics” (RELATIVITY, LASER, FORCE) and “Neuroscience” (NERVE, OXYGEN, NEURON).]


    Model connections between topics

    [Figure: a graph of connected topics discovered from Science, with word clusters such as genetics (mutant, mutations, gene), molecular biology (protein, binding, kinase, phosphorylation), medicine (patients, disease, treatment, drugs), immunology (t cells, antigens, immune response), neuroscience (neurons, synaptic, cortical), ecology (species, forest, populations), earth science (earthquakes, volcanic, mantle, climate), physics (magnetic, superconductivity, quantum, laser), chemistry (reactions, molecules, enzymes), astronomy (stars, galaxies, universe), and science policy (research, funding, education).]


    Annotate images

    [Figure: six images with predicted annotations.]

    SKY WATER TREE MOUNTAIN PEOPLE
    SCOTLAND WATER FLOWER HILLS TREE
    SKY WATER BUILDING PEOPLE WATER
    FISH WATER OCEAN TREE CORAL
    PEOPLE MARKET PATTERN TEXTILE DISPLAY
    BIRDS NEST TREE BRANCH LEAVES


    Discover influential articles

    [Figure: weighted influence (0.000–0.030) of selected articles by year, 1880–2000.]

    Jared M. Diamond, Distributional Ecology of New Guinea Birds, Science (1973) [296 citations]

    W. B. Scott, The Isthmus of Panama in Its Relation to the Animal Life of North and South America, Science (1916) [3 citations]

    William K. Gregory, The New Anthropogeny: Twenty-Five Stages of Vertebrate Evolution, from Silurian Chordate to Man, Science (1933) [3 citations]

    Derek E. Wildman et al., Implications of Natural Selection in Shaping 99.4% Nonsynonymous DNA Identity between Humans and Chimpanzees: Enlarging Genus Homo, PNAS (2003) [178 citations]


    Predict links between articles

    Query document: Markov chain Monte Carlo convergence diagnostics: A comparative review

    Links predicted by the RTM (ψe):
      Minorization conditions and convergence rates for Markov chain Monte Carlo
      Rates of convergence of the Hastings and Metropolis algorithms
      Possible biases induced by MCMC convergence diagnostics
      Bounding convergence time of the Gibbs sampler in Bayesian image restoration
      Self regenerative Markov chain Monte Carlo
      Auxiliary variable methods for Markov chain Monte Carlo with applications
      Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
      Diagnosing convergence of Markov chain Monte Carlo algorithms
      Exact Bound for the Convergence of Metropolis Chains

    Links predicted by LDA + Regression:
      Self regenerative Markov chain Monte Carlo
      Minorization conditions and convergence rates for Markov chain Monte Carlo
      Gibbs-markov models
      Auxiliary variable methods for Markov chain Monte Carlo with applications
      Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
      Mediating instrumental variables
      A qualitative framework for probabilistic inference
      Adaptation for Self Regenerative MCMC


    Characterize political decisions

    dod,defense,defense and appropriation,military,subtitle
    veteran,veterans,bills,care,injury
    people,woman,american,nation,school
    producer,eligible,crop,farm,subparagraph
    coin,inspector,designee,automobile,lebanon
    bills,iran,official,company,sudan
    human,vietnam,united nations,call,people
    drug,pediatric,product,device,medical
    child,fire,attorney,internet,bills
    surveillance,director,court,electronic,flood
    energy,bills,price,commodity,market
    land,site,bills,interior,river
    child,center,poison,victim,abuse
    coast guard,vessel,space,administrator,requires
    science,director,technology,mathematics,bills
    computer,alien,bills,user,collection
    head,start,child,technology,award
    loss,crop,producer,agriculture,trade
    bills,tax,subparagraph,loss,taxable
    cover,bills,bridge,transaction,following
    transportation,rail,railroad,passenger,homeland security
    business,administrator,bills,business concern,loan
    defense,iraq,transfer,expense,chapter
    medicare,medicaid,child,chip,coverage
    student,loan,institution,lender,school
    energy,fuel,standard,administrator,lamp
    housing,mortgage,loan,family,recipient
    bank,transfer,requires,holding company,industrial
    county,eligible,ballot,election,jurisdiction
    tax credit,budget authority,energy,outlays,tax


    Organize and browse large corpora


    This tutorial

    •  What are topic models?

    •  What kinds of things can they do?

    •  How do I compute with a topic model?

    •  What are some unanswered questions in this field?

    •  How can I learn more?


    Uber Topics

    Hi Prof. Lafferty,

    I took your ML+LSDA course last Spring. The course was super helpful, and I just wanted to let you know that I’m currently using Latent Dirichlet Allocation at my current job at Uber!

    We’re using LDA to discover topics in rider feedback – when riders write comments about their driver after the trip. We’re trying to find topics such as ’unprofessional driver’, ’driver no-show’, ’sexual harassment’, etc. LDA has worked really well with this – so thank you for covering it in much detail in your course.


    Bag Demo


    Introduction to Topic Modeling


    Probabilistic modeling

    1   Data are assumed to be observed from a generative probabilistic process that includes hidden variables.

        •  In text, the hidden variables are the thematic structure.

    2   Infer the hidden structure using posterior inference.

        •  What are the topics that describe this collection?

    3   Situate new data into the estimated model.

        •  How does a new document fit into the topic structure?


    Latent Dirichlet allocation (LDA)

    Simple intuition: Documents exhibit multiple topics.


    Generative model for LDA

    [Figure: four example topics with word probabilities, a document, and its topic proportions and assignments.]

    Topic 1: gene 0.04, dna 0.02, genetic 0.01, ...
    Topic 2: life 0.02, evolve 0.01, organism 0.01, ...
    Topic 3: brain 0.04, neuron 0.02, nerve 0.01, ...
    Topic 4: data 0.02, number 0.02, computer 0.01, ...

    •   Each topic is a distribution over words

    •   Each document is a mixture of corpus-wide topics

    •   Each word is drawn from one of those topics (see the sketch below)

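    A minimal numpy sketch of this generative process (the vocabulary size, topic count, and document length below are toy values, not from the slides):

        import numpy as np

        rng = np.random.default_rng(0)
        K, V, D, N = 4, 1000, 5, 50      # topics, vocabulary, documents, words/doc
        eta, alpha = 0.01, 0.1           # Dirichlet hyperparameters

        # Each topic beta_k is a distribution over the V words.
        beta = rng.dirichlet(np.full(V, eta), size=K)

        docs = []
        for d in range(D):
            theta = rng.dirichlet(np.full(K, alpha))    # per-document topic proportions
            z = rng.choice(K, size=N, p=theta)          # per-word topic assignments
            w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
            docs.append(w)
        # docs now holds D synthetic documents of N word ids each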

    The posterior distribution

    [Figure: topics, documents, and topic proportions and assignments.]

    •  In reality, we only observe the documents

    •  The rest of the structure consists of hidden variables


    The posterior distribution

    [Figure: the same topics, documents, and topic proportions and assignments.]

    •  Our goal is to  infer the hidden variables

    •   I.e., compute their distribution conditioned on the documents

    p(topics, proportions, assignments | documents)


    LDA as a graphical model

    [Graphical model: α → θd → zd,n → wd,n ← βk ← η. θd: per-document topic proportions; zd,n: per-word topic assignment; wd,n: observed word; βk: topics; α: proportions parameter; η: topic parameter.]

    •  Encodes our assumptions about the data

    •  Connects to algorithms for computing with data

    •   See Pattern Recognition and Machine Learning  (Bishop, 2006).


    LDA as a graphical model

    [Graphical model: same figure as above.]

    •  Nodes are random variables; edges indicate dependence.

    •  Shaded nodes are observed.

    •  Plates indicate replicated variables.


    LDA as a graphical model

    [Graphical model: same figure as above.]

    The joint distribution factorizes as

    ∏_{i=1}^K p(βi | η) ∏_{d=1}^D ( p(θd | α) ∏_{n=1}^N p(zd,n | θd) p(wd,n | β1:K, zd,n) )

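    To make the factorization concrete, here is a self-contained sketch (scipy assumed) that generates one document (D = 1 for brevity) and evaluates the log of this joint:

        import numpy as np
        from scipy.stats import dirichlet

        rng = np.random.default_rng(0)
        K, V, N = 2, 5, 10
        eta, alpha = np.full(V, 0.5), np.full(K, 0.5)

        beta = rng.dirichlet(eta, size=K)                    # topics
        theta = rng.dirichlet(alpha)                         # topic proportions
        z = rng.choice(K, size=N, p=theta)                   # topic assignments
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words

        logp = (sum(dirichlet.logpdf(b, eta) for b in beta)  # prod_k p(beta_k | eta)
                + dirichlet.logpdf(theta, alpha)             # p(theta | alpha)
                + np.sum(np.log(theta[z]))                   # prod_n p(z_n | theta)
                + np.sum(np.log(beta[z, w])))                # prod_n p(w_n | beta, z_n)
        print(logp)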

    LDA

    [Graphical model: the LDA plate diagram, with θd, Zd,n, Wd,n inside plates N and D, βk inside plate K, and hyperparameters α and η.]

    •  This joint defines a posterior.

    •  From a collection of documents, infer

       •  Per-word topic assignments zd,n
       •  Per-document topic proportions θd
       •  Per-corpus topic distributions βk

    •  Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, ...


    LDA

    [Graphical model: the same plate diagram as above.]

    Approximate posterior inference algorithms

    •  Mean field variational methods  (Blei et al., 2001, 2003)

    •   Expectation propagation  (Minka and Lafferty, 2002)

    •  Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)

    •  Collapsed variational inference (Teh et al., 2006)

    •  Online variational inference  (Hoffman et al., 2010)

    Also see Mukherjee and Blei (2009) and Asuncion et al. (2009).

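    As a concrete starting point, here is a short sketch with scikit-learn’s LatentDirichletAllocation, whose "online" learning method implements the online variational inference of Hoffman et al. (2010). The library choice and the toy corpus are assumptions of this note, not something the slides prescribe:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["gene dna genetic sequencing",       # toy corpus; a real corpus
                "brain neuron nerve cortex",         # would be far larger
                "gene dna brain neuron"]

        X = CountVectorizer().fit_transform(docs)    # document-term count matrix
        lda = LatentDirichletAllocation(n_components=2,
                                        learning_method="online",  # Hoffman et al. (2010)
                                        random_state=0).fit(X)
        print(lda.transform(X))                      # per-document topic proportions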

    Example inference

    [Graphical model: the same plate diagram as above.]

    •   Data: The OCR’ed collection of  Science  from 1990–2000

    •  17K documents

    •  11M words

    •  20K unique terms (stop words and rare words removed)

    •   Model: 100-topic LDA model using variational inference.


    Example inference

    [Figure: bar plot of the inferred topic proportions (“Probability,” 0.0–0.4) for one document across topics 1–100; only a handful of topics have probability noticeably above zero.]

    Example inference (II)

    Four discovered topics (top words):

    problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic

    model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models

    selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive

    species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem


    Used to explore and browse document collections


    Aside: The Dirichlet distribution

    •  The Dirichlet distribution is an exponential family distribution over the simplex, i.e., positive vectors that sum to one:

       p(θ | α) = [Γ(Σi αi) / ∏i Γ(αi)] ∏i θi^(αi−1)

    •  It is conjugate to the multinomial. Given a multinomial observation, the posterior distribution of θ is a Dirichlet.

    •  The parameter α controls the mean shape and sparsity of θ.

    •  The topic proportions are a K-dimensional Dirichlet. The topics are a V-dimensional Dirichlet.

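    The next few figures show draws from a 10-dimensional symmetric Dirichlet at different values of α. A small numpy sketch that reproduces the qualitative effect:

        import numpy as np

        rng = np.random.default_rng(0)
        for a in [100, 10, 1, 0.1, 0.01, 0.001]:
            theta = rng.dirichlet(np.full(10, a))
            # Large alpha: near-uniform draws; small alpha: nearly all
            # mass on a single coordinate (sparse draws).
            print(a, np.round(theta, 3))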

    α = 1

    [Figure: 15 draws from a 10-dimensional symmetric Dirichlet with α = 1; each panel plots the 10 coordinate values.]


    α = 10

    [Figure: 15 draws with α = 10.]


    α = 100

    [Figure: 15 draws with α = 100; draws are nearly uniform.]


    α = 1

    [Figure: 15 draws with α = 1, shown again for comparison.]


    α = 0.1

    [Figure: 15 draws with α = 0.1.]


    α = 0.01

    [Figure: 15 draws with α = 0.01.]


    α = 0.001

    [Figure: 15 draws with α = 0.001; nearly all the mass falls on a single coordinate.]


    Why does LDA “work”?

    Why does the LDA posterior put “topical” words together?

    •  Word probabilities are maximized by dividing the words among the topics. (More terms means more mass to be spread around.)

    •  In a mixture, this is enough to find clusters of co-occurring words.

    •  In LDA, the Dirichlet on the topic proportions can encourage sparsity, i.e., a document is penalized for using many topics.

    •  Loosely, this can be thought of as softening the strict definition of “co-occurrence” in a mixture model.

    •  This flexibility leads to sets of terms that more tightly co-occur.


    Summary of LDA

    [Graphical model: the LDA plate diagram again.]

    •  LDA can

       •  visualize the hidden thematic structure in large corpora
       •  generalize new data to fit into that structure

    •  Builds on Deerwester et al. (1990) and Hofmann (1999)

    •  It is a mixed membership model (Erosheva, 2004)

    •  Relates to multinomial PCA (Jakulin and Buntine, 2002)

    •  Was independently invented for genetics (Pritchard et al., 2000)


    Implementations of LDA

    There are many available implementations of topic modeling:

    LDA-C∗       A C implementation of LDA
    HDP∗         A C implementation of the HDP (“infinite LDA”)
    Online LDA∗  A Python package for LDA on massive data
    LDA in R∗    Package in R for many topic models
    LingPipe     Java toolkit for NLP and computational linguistics
    Mallet       Java toolkit for statistical NLP
    TMVE∗        A Python package to build browsers from topic models

    ∗ available at www.cs.princeton.edu/~blei/