Semantic Representations with Probabilistic Topic Models


Page 1: Semantic Representations with Probabilistic Topic Models

Semantic Representations with Probabilistic Topic Models

Mark Steyvers, Department of Cognitive Sciences

University of California, Irvine

Joint work with: Tom Griffiths, UC Berkeley; Padhraic Smyth, UC Irvine

Page 2: Semantic Representations with Probabilistic Topic Models

Topic Models in Machine Learning

• Unsupervised extraction of content from large text collections

• Topics provide quick summary of content / gist

What is in this corpus?

What is in this document, paragraph, or sentence?

Which documents are similar to a query?

What are the topical trends over time?

Page 3: Semantic Representations with Probabilistic Topic Models

Topic Models in Psychology

• Topic models address three computational problems for the semantic memory system:

1) Gist extraction: what is this set of words about?

2) Disambiguation: what is the sense of this word?

- E.g. “football field” vs. “magnetic field”

3) Prediction: what fact, concept, or word is next?

Page 4: Semantic Representations with Probabilistic Topic Models

Two approaches to semantic representation: semantic networks and semantic spaces

How are these learned?

[Figure: a semantic network and a semantic space over words such as PLAY, BALL, GAME, FUN, STAGE, THEATER, BAT, BANK, MONEY, CASH, LOAN, RIVER, STREAM]

Can be learned (e.g. Latent Semantic Analysis), but is this representation flexible enough?

Page 5: Semantic Representations with Probabilistic Topic Models

Overview

I Probabilistic Topic Models

generative model

statistical inference: Gibbs sampling

II Explaining human memory

word association

semantic isolation

false memory

III Information retrieval

Page 6: Semantic Representations with Probabilistic Topic Models

Probabilistic Topic Models

• Extract topics from large text collections

unsupervised

generative

Bayesian statistical inference

• Our modeling work is based on:

– pLSI Model: Hofmann (1999)

– LDA Model: Blei, Ng, and Jordan (2001, 2003)

– Topics Model: Griffiths and Steyvers (2003, 2004)

Page 7: Semantic Representations with Probabilistic Topic Models

Model input: “bag of words”

• Matrix of number of times words occur in documents

• Note: some function words are deleted: “the”, “a”, “and”, etc

[Table: word-by-document matrix of counts, summarized by the conditional distribution P( w | d ), with rows RIVER, STREAM, BANK, MONEY and columns Doc1, Doc2, Doc3, …]
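For concreteness, a minimal Python sketch (not from the original slides) of how such a word-document count matrix might be built; the toy documents and stop-word list are made-up assumptions.

```python
from collections import Counter

# Toy corpus; the documents and stop-word list are illustrative assumptions.
docs = [
    "the bank gave a loan so money flowed to the bank",
    "the river bank and the stream were flooded",
    "money and a loan from the bank",
]
stop_words = {"the", "a", "and", "so", "to", "from", "were"}

# Tokenize, drop function words, and build the vocabulary.
tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Word-by-document matrix of counts: counts[i][j] = frequency of vocab[i] in doc j.
counts = [[Counter(doc)[w] for doc in tokenized] for w in vocab]

for w, row in zip(vocab, counts):
    print(f"{w:>8}", row)
```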

Page 8: Semantic Representations with Probabilistic Topic Models

Probabilistic Topic Models

• A topic represents a probability distribution over words

– Related words get high probability in same topic

• Example topics extracted from NIH/NSF grants:

Topic 1: brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural

Topic 2: schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical

Topic 3: memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance

Topic 4: disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals

Probability distribution over words. Most likely words listed at the top

Page 9: Semantic Representations with Probabilistic Topic Models

Document = mixture of topics

[Figure: the four example topics above (brain imaging, schizophrenia, memory, disease); one document is modeled as an 80%/20% mixture of two topics, another as 100% a single topic]

Page 10: Semantic Representations with Probabilistic Topic Models

Generative Process

• For each document, choose a mixture of topics

θ ~ Dirichlet(α)

• Sample a topic z ∈ [1..T] from the mixture

z ~ Multinomial(θ)

• Sample a word from the topic

w ~ Multinomial(φ(z))

[Graphical model: φ ~ Dirichlet(β) for each of the T topics; for each of the D documents, a topic mixture θ; for each of the Nd words in a document, a topic z and a word w]
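A minimal sketch of this generative process in Python (illustration only, not the toolbox implementation); the vocabulary, number of topics and documents, and hyperparameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["river", "stream", "bank", "money", "loan"]   # toy vocabulary
T, D, n_words = 2, 4, 10        # topics, documents, words per document (arbitrary)
alpha, beta = 0.5, 0.5          # Dirichlet hyperparameters (arbitrary)

# One word distribution per topic: phi[t] ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * len(vocab), size=T)

for d in range(D):
    theta = rng.dirichlet([alpha] * T)          # topic mixture for this document
    z = rng.choice(T, size=n_words, p=theta)    # topic assignment for each word
    words = [rng.choice(vocab, p=phi[t]) for t in z]   # each word drawn from its topic
    print(f"doc {d}: theta={np.round(theta, 2)} ->", " ".join(words))
```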

Page 11: Semantic Representations with Probabilistic Topic Models

Prior Distributions

• Dirichlet priors encourage sparsity on topic mixtures and topics

θ ~ Dirichlet( α ), φ ~ Dirichlet( β )

[Figure: probability simplices with corners Topic 1, Topic 2, Topic 3 (for θ) and Word 1, Word 2, Word 3 (for φ); darker colors indicate lower probability]

Page 12: Semantic Representations with Probabilistic Topic Models

Creating Artificial Dataset

Can we recover the original topics and topic mixtures from this data?

Two topics, 16 documents

[Figure: 16 artificial documents (rows) over the words River, Stream, Bank, Money, Loan]

          topic 1   topic 2
River      0.33       0
Stream     0.33       0
Bank       0.33      0.33
Money       0        0.33
Loan        0        0.33

Page 13: Semantic Representations with Probabilistic Topic Models

Statistical Inference

• Three sets of latent variables:
– topic mixtures θ
– word mixtures φ
– topic assignments z

• Estimate posterior distribution over topic assignments: P( z | w )

(we can later infer θ and φ)

Page 14: Semantic Representations with Probabilistic Topic Models

Statistical Inference

• Exact inference is impossible

• Use approximate methods:

• Markov chain Monte Carlo (MCMC) with Gibbs sampling

P( z | w ) = P( w, z ) / Σz' P( w, z' )

(the denominator sums over T^n terms)

Page 15: Semantic Representations with Probabilistic Topic Models

Gibbs Sampling

P( zi = t | z-i ) ∝ ( n(t,d) + α ) / ( Σt' n(t',d) + Tα ) × ( n(w,t) + β ) / ( Σw' n(w',t) + Wβ )

(all counts exclude the current token i)

count of word w assigned to topic t

count of topic t assigned to doc d

probability that word i is assigned to topic t
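A compact, illustrative sketch of a collapsed Gibbs sampler along these lines, written for clarity rather than speed; the corpus format (a list of (doc_id, word_id) token pairs) and hyperparameter defaults are assumptions, not the toolbox's interface. The document-side denominator is constant across topics, so it is dropped from the unnormalized probabilities.

```python
import numpy as np

def gibbs_lda(tokens, D, W, T, alpha=0.5, beta=0.01, iters=200, seed=0):
    """tokens: list of (doc_id, word_id) pairs; returns topic assignments and counts."""
    rng = np.random.default_rng(seed)
    z = rng.integers(T, size=len(tokens))            # random initial assignments
    ndt = np.zeros((D, T)); nwt = np.zeros((W, T)); nt = np.zeros(T)
    for (d, w), t in zip(tokens, z):
        ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1

    for _ in range(iters):
        for i, (d, w) in enumerate(tokens):
            t = z[i]                                 # remove token i from the counts
            ndt[d, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1
            # sampling equation: (n_td + alpha) * (n_wt + beta) / (n_t + W*beta)
            p = (ndt[d] + alpha) * (nwt[w] + beta) / (nt + W * beta)
            t = rng.choice(T, p=p / p.sum())         # draw a new topic for token i
            z[i] = t
            ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    return z, ndt, nwt
```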

Page 16: Semantic Representations with Probabilistic Topic Models

Example of Gibbs Sampling

• Assign word tokens randomly to topics:

(●=topic 1; ○=topic 2 )

[Figure: 16 documents (rows) by the words River, Stream, Bank, Money, Loan (columns), with each word token randomly assigned to a topic]

Page 17: Semantic Representations with Probabilistic Topic Models

[Figure: topic assignments for the 16 documents over the words River, Stream, Bank, Money, Loan]

After 1 iteration

• Apply sampling equation to each word token:

(●=topic 1; ○=topic 2 )

Page 18: Semantic Representations with Probabilistic Topic Models

[Figure: topic assignments for the 16 documents over the words River, Stream, Bank, Money, Loan]

After 4 iterations

(●=topic 1; ○=topic 2 )

Page 19: Semantic Representations with Probabilistic Topic Models

[Figure: topic assignments for the 16 documents over the words River, Stream, Bank, Money, Loan]

After 8 iterations

(●=topic 1; ○=topic 2 )

Page 20: Semantic Representations with Probabilistic Topic Models

[Figure: topic assignments for the 16 documents over the words River, Stream, Bank, Money, Loan]

After 32 iterations

(●=topic 1; ○=topic 2 )

          topic 1   topic 2
River      0.42       0
Stream     0.29      0.05
Bank       0.28      0.31
Money       0        0.29
Loan        0        0.35

Page 21: Semantic Representations with Probabilistic Topic Models

INPUT: word-document counts (word order is irrelevant)

OUTPUT:
topic assignments to each word P( zi )
likely words in each topic P( w | z )
likely topics in each document ("gist") P( θ | d )

Algorithm input/output

Page 22: Semantic Representations with Probabilistic Topic Models

Software

Public-domain MATLAB toolbox for topic modeling on the Web:

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Page 23: Semantic Representations with Probabilistic Topic Models

Examples Topics from New York Times

Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP

Page 24: Semantic Representations with Probabilistic Topic Models

Example topics from an educational corpus

PRINTING, PAPER, PRINT, PRINTED, TYPE, PROCESS, INK, PRESS, IMAGE

PLAY, PLAYS, STAGE, AUDIENCE, THEATER, ACTORS, DRAMA, SHAKESPEARE, ACTOR

TEAM, GAME, BASKETBALL, PLAYERS, PLAYER, PLAY, PLAYING, SOCCER, PLAYED

JUDGE, TRIAL, COURT, CASE, JURY, ACCUSED, GUILTY, DEFENDANT, JUSTICE

HYPOTHESIS, EXPERIMENT, SCIENTIFIC, OBSERVATIONS, SCIENTISTS, EXPERIMENTS, SCIENTIST, EXPERIMENTAL, TEST

STUDY, TEST, STUDYING, HOMEWORK, NEED, CLASS, MATH, TRY, TEACHER

Example topics from psych review abstracts

SIMILARITY, CATEGORY, CATEGORIES, RELATIONS, DIMENSIONS, FEATURES, STRUCTURE, SIMILAR, REPRESENTATION, OBJECTS

STIMULUS, CONDITIONING, LEARNING, RESPONSE, STIMULI, RESPONSES, AVOIDANCE, REINFORCEMENT, CLASSICAL, DISCRIMINATION

MEMORY, RETRIEVAL, RECALL, ITEMS, INFORMATION, TERM, RECOGNITION, ITEMS, LIST, ASSOCIATIVE

GROUP, INDIVIDUAL, GROUPS, OUTCOMES, INDIVIDUALS, GROUPS, OUTCOMES, INDIVIDUALS, DIFFERENCES, INTERACTION

EMOTIONAL, EMOTION, BASIC, EMOTIONS, AFFECT, STATES, EXPERIENCES, AFFECTIVE, AFFECTS, RESEARCH

Page 25: Semantic Representations with Probabilistic Topic Models

Choosing number of topics

• Bayesian model selection

• Generalization test

– e.g., perplexity on out-of-sample data

• Non-parametric Bayesian approach

– Number of topics grows with size of data

– E.g. Hierarchical Dirichlet Processes (HDP)
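As a hedged illustration of the generalization test, a small sketch of computing perplexity on out-of-sample documents, assuming topic mixtures theta and topics phi have already been estimated; the data layout is an assumption.

```python
import numpy as np

def perplexity(docs, theta, phi):
    """docs: list of lists of word ids; theta: D x T topic mixtures; phi: T x W topics.
    Lower perplexity on held-out documents indicates better generalization."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            # P(w | d) = sum_t P(w | t) P(t | d)
            log_lik += np.log(theta[d] @ phi[:, w])
            n_tokens += 1
    return np.exp(-log_lik / n_tokens)
```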

Page 26: Semantic Representations with Probabilistic Topic Models

Applications to Human Memory

Page 27: Semantic Representations with Probabilistic Topic Models

Computational Problems for Semantic Memory System

• Gist extraction

– What is this set of words about?

• Disambiguation

– What is the sense of this word?

• Prediction

– what fact, concept, or word is next?

Gist extraction: P( z | w )

Disambiguation: P( zi | w, context )

Prediction: P( w2 | w1 )

Page 28: Semantic Representations with Probabilistic Topic Models

Disambiguation

“FIELD”

[Figure: P( z of FIELD | w ) for the cue “FIELD”: probability is split between a magnetism topic (FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES) and a sports topic (BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD)]

“FOOTBALL FIELD”

[Figure: with the added context word “FOOTBALL”, most of the probability shifts to the sports topic]
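A toy sketch of this disambiguation computation under a single-topic assumption, P( z | words ) ∝ P( z ) Π P( w | z ); the two topics and all probability values are invented for illustration.

```python
import numpy as np

# Two toy topics over a tiny vocabulary (probabilities are made up for illustration).
vocab = ["field", "magnetic", "current", "football", "ball", "game"]
phi = np.array([
    [0.30, 0.30, 0.25, 0.05, 0.05, 0.05],   # "magnetism" topic
    [0.20, 0.02, 0.03, 0.25, 0.25, 0.25],   # "sports" topic
])
prior = np.array([0.5, 0.5])                 # P(z), assumed uniform

def topic_posterior(words):
    """P(z | words) under a single-topic assumption: prior times product of P(w | z)."""
    idx = [vocab.index(w) for w in words]
    post = prior * phi[:, idx].prod(axis=1)
    return post / post.sum()

print(topic_posterior(["field"]))              # ambiguous: split across topics
print(topic_posterior(["football", "field"]))  # context pushes mass to the sports topic
```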

Page 29: Semantic Representations with Probabilistic Topic Models

Modeling Word Association

Page 30: Semantic Representations with Probabilistic Topic Models

Word Association (norms from Nelson et al. 1998)

CUE: PLANET

Page 31: Semantic Representations with Probabilistic Topic Models

Word Association (norms from Nelson et al. 1998)

CUE: PLANET

People's associates (associate number 1-8): EARTH, STARS, SPACE, SUN, MARS, UNIVERSE, SATURN, GALAXY

(vocabulary = 5000+ words)

Page 32: Semantic Representations with Probabilistic Topic Models

Word Association as a Prediction Problem

• Given that a single word is observed, predict what other words

might occur in that context

• Under a single topic assumption:

P( w2 | w1 ) = Σz P( w2 | z ) P( z | w1 )

(w1 = cue, w2 = response)
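A small illustrative sketch of this prediction, P( w2 | w1 ) = Σz P( w2 | z ) P( z | w1 ), with P( z | w1 ) obtained by Bayes' rule; the vocabulary and topic probabilities below are made up.

```python
import numpy as np

vocab = ["planet", "earth", "stars", "sun", "space", "football"]
# Two made-up topics: an "astronomy" topic and a "sports" topic.
phi = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.19, 0.01],
    [0.02, 0.05, 0.02, 0.05, 0.06, 0.80],
])
prior = np.array([0.5, 0.5])

def associates(cue):
    """P(w2 | w1) = sum_z P(w2 | z) P(z | w1), with P(z | w1) from Bayes' rule."""
    i = vocab.index(cue)
    p_z_given_cue = prior * phi[:, i]
    p_z_given_cue /= p_z_given_cue.sum()
    p_w2 = p_z_given_cue @ phi                 # mix the topics
    ranked = sorted(zip(vocab, p_w2), key=lambda x: -x[1])
    return [(w, round(float(p), 3)) for w, p in ranked if w != cue]

print(associates("planet"))
```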

Page 33: Semantic Representations with Probabilistic Topic Models

Word Association (norms from Nelson et al. 1998)

CUE: PLANET

People's associates (ranks 1-8): EARTH, STARS, SPACE, SUN, MARS, UNIVERSE, SATURN, GALAXY

Model's predicted associates (ranks 1-8): STARS, STAR, SUN, EARTH, SPACE, SKY, PLANET, UNIVERSE

First associate "EARTH" has rank 4 in the model

Page 34: Semantic Representations with Probabilistic Topic Models

[Figure: median rank of people's first associate under the topics model, plotted against the number of topics (300-1700)]

Page 35: Semantic Representations with Probabilistic Topic Models

[Figure: the same plot with LSA added for comparison; median rank of the first associate for the topics model versus LSA (cosine and inner product), as a function of the number of topics (300-1700)]

Page 36: Semantic Representations with Probabilistic Topic Models

Episodic Memory

Semantic Isolation Effects / False Memory

Page 37: Semantic Representations with Probabilistic Topic Models

Semantic Isolation Effect

Study this list: PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH

Recall: HAMMER, PEAS, CARROTS, ...

Page 38: Semantic Representations with Probabilistic Topic Models

Semantic isolation effect / Von Restorff effect

• Finding: contextually unique words are better remembered

• Verbal explanations:

– Attention, surprise, distinctiveness

• Our approach:

– assume memories can be accessed and encoded at

multiple levels of description

• Semantic/ Gist aspects – generic information

• Verbatim – specific information

Page 39: Semantic Representations with Probabilistic Topic Models


Computational Problem

• How to tradeoff specificity and generality?

– Remembering detail and gist

• Dual route topic model =

topic model + encoding of

specific words

Page 40: Semantic Representations with Probabilistic Topic Models


Dual route topic model

• Two ways to generate words:

– Topic Model

– Verbatim word distribution (unique to document)

• Each word comes from a single route

– Switch variable xi for every word i:

xi = 0 → topics route

xi = 1 → verbatim route

• Conditional prob. of a word under a document:

P( w | d ) = P( x = 0 | d ) P( w | d, topics ) + P( x = 1 | d ) P( w | d, verbatim )
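A minimal sketch of this conditional probability (illustration only; the array shapes and argument names are assumptions).

```python
import numpy as np

def word_prob(w, theta_d, phi, verbatim_d, p_verbatim):
    """P(w | d) for the dual route model:
    P(x=0|d) * P(w | d, topics) + P(x=1|d) * P(w | d, verbatim)."""
    p_topics = theta_d @ phi[:, w]          # mix over topics for this document
    p_special = verbatim_d[w]               # document-specific (verbatim) word distribution
    return (1 - p_verbatim) * p_topics + p_verbatim * p_special
```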

Page 41: Semantic Representations with Probabilistic Topic Models

Graphical Model

Variable x is a switch :

x = 0 → sample from topic
x = 1 → sample from verbatim word distribution

Page 42: Semantic Representations with Probabilistic Topic Models


Applying Dual Route Topic Model to Human Memory

• Train model on educational corpus (TASA)

– 37K documents, 1700 topics

• Apply model to list memory experiments

– Study list is a “document”

– Recall probability based on model

P( w | d ) = P( x = 0 | d ) P( w | d, topics ) + P( x = 1 | d ) P( w | d, verbatim )

Page 43: Semantic Representations with Probabilistic Topic Models

Study words: PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH

[Figure: encoding and retrieval under the dual route model — switch probability (verbatim vs. topic route), verbatim (special) word probability, which is highest for the contextually unique word HAMMER, topic probability, which is dominated by a VEGETABLES topic (with FURNITURE and TOOLS far behind), and the resulting retrieval probability for each study word]

Page 44: Semantic Representations with Probabilistic Topic Models


Hunt & Lamb (2001 exp. 1)

OUTLIER LIST: PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH

CONTROL LIST: SAW, SCREW, CHISEL, DRILL, SANDPAPER, HAMMER, NAILS, BENCH, RULER, ANVIL

[Figure: DATA — probability of recall (0.0-1.0) for the target word (HAMMER) and the background words, in the outlier list versus the pure list]

Page 45: Semantic Representations with Probabilistic Topic Models

False Memory (e.g. Deese, 1959; Roediger & McDermott)

Study this list: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy

Recall: SLEEP, BED, REST, ...

Page 46: Semantic Representations with Probabilistic Topic Models

[Figure: study lists containing 3, 6, or 9 (up to 15) associates of the nonstudied lure ANGER — e.g. MAD, FEAR, HATE, RAGE, TEMPER, FURY, WRATH, HAPPY, FIGHT — mixed with unrelated filler words (SMOOTH, NAVY, HEAT, SALAD, TUNE, COURTS, CANDY, PALACE, PLUSH, TOOTH, BLIND, WINTER)]

DATA (Robinson & Roediger, 1997): [Figure: probability of recall for studied items and for the nonstudied lure, as a function of the number of associates studied (3-15)]

PREDICTED: [Figure: model retrieval probability for studied associates and for the nonstudied lure, as a function of the number of associates studied (3-15)]

False memory effects

Page 47: Semantic Representations with Probabilistic Topic Models

Modeling Serial Order Effects in Free Recall

Page 48: Semantic Representations with Probabilistic Topic Models


Problem

• Dual route model predicts no sequential effects

– Order of words is important in human memory

experiments

• Standard Gibbs sampler is psychologically implausible:

– Assumes list is processed in parallel

– Each item can influence encoding of each other item

Page 49: Semantic Representations with Probabilistic Topic Models


Semantic isolation experiment to study order effects

• Study lists of 14 words

– 14 isolate lists (e.g. A A A B A A ... A A )

– 14 control lists (e.g. A A A A A A ... A A )

• Varied serial position of isolate (any of 14 positions)

Page 50: Semantic Representations with Probabilistic Topic Models


Immediate Recall Results

Isolate list: B A A A A ... A

Control list: A A A A A ... A

Page 51: Semantic Representations with Probabilistic Topic Models


Immediate Recall Results

Isolate list: B A A A A ... A

Control list: A A A A A ... A

Page 52: Semantic Representations with Probabilistic Topic Models


Immediate Recall Results

Isolate list: A B A A A ... A

Control list: A A A A A ... A

Page 53: Semantic Representations with Probabilistic Topic Models


Immediate Recall Results

Isolate list: A A B A A ... A

Control list: A A A A A ... A

Page 54: Semantic Representations with Probabilistic Topic Models


Immediate Recall Results

Page 55: Semantic Representations with Probabilistic Topic Models


Modified Gibbs Sampling Scheme

• Update items non-uniformly in Gibbs sampler

• Probability of updating item i after observing words 1..t

Words further back in time are less likely to be re-assigned:

Pr( item i is updated at time t ) ∝ λ^( t − i )

(t = current time; λ = parameter, 0 ≤ λ ≤ 1)
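An illustrative sketch of such a non-uniform update schedule; the specific rule Pr(update i) ∝ λ^(t−i) follows the reconstruction above and should be read as an assumption about the exact form.

```python
import numpy as np

def update_probabilities(t, lam):
    """Probability of re-sampling item i (1..t) after observing items 1..t,
    assuming Pr(update i) is proportional to lam ** (t - i):
    lam = 1 treats all items equally; lam = 0 only updates the newest item."""
    i = np.arange(1, t + 1)
    if lam == 0:
        p = (i == t).astype(float)
    else:
        p = lam ** (t - i)
    return p / p.sum()

print(update_probabilities(5, 0.3))   # older items are rarely re-assigned
```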

Page 56: Semantic Representations with Probabilistic Topic Models

Effect of Sampling Scheme

[Figure: recall probability by serial position (1-14), for control versus outlier lists, under three settings of the update parameter: λ = 1, λ = 0.3, and λ = 0 (study order)]

Page 57: Semantic Representations with Probabilistic Topic Models

Normalized Serial Position Effects

[Figure: P( Isolate ) − P( Main ) as a function of serial position, for the observed DATA and for the MODEL]

Page 58: Semantic Representations with Probabilistic Topic Models

Information Retrieval &

Human Memory

Page 59: Semantic Representations with Probabilistic Topic Models


Example

• Searching for information on Padhraic Smyth:

Page 60: Semantic Representations with Probabilistic Topic Models


Query = “Smyth”

Page 61: Semantic Representations with Probabilistic Topic Models


Query = “Smyth irish computer science department”

Page 62: Semantic Representations with Probabilistic Topic Models

Query = “Smyth irish computer science department weather prediction seasonal climate fluctuations hmm models nips conference consultant yahoo netflix prize dave newman steyvers”

Page 63: Semantic Representations with Probabilistic Topic Models

Problem

• More information in a query can lead to worse search results

• Human memory typically works better with more cues

• Problem: how can we better match queries to documents to allow for partial matches, and matches across documents?

Page 64: Semantic Representations with Probabilistic Topic Models

Dual route model for information retrieval

• Encode documents with two routes

– contextually unique words → verbatim route

– thematic words → topics route

Page 65: Semantic Representations with Probabilistic Topic Models

Example encoding of a psych review abstract

alcove attention learning covering map is a connectionist model of category learning that incorporates an exemplar based representation d . l . medin and m . m . schaffer 1978 r . m . nosofsky 1986 with error driven learning m . a . gluck and g . h . bower 1988 d . e . rumelhart et al 1986 . alcove selectively attends to relevant stimulus dimensions is sensitive to correlated dimensions can account for a form of base rate neglect does not suffer catastrophic forgetting and can exhibit 3 stage u shaped learning of high frequency exceptions to rules whereas such effects are not easily accounted for by models using other combinations of representation and learning method .

Kruschke, J. K.. ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Contextually unique words: ALCOVE, SCHAFFER, MEDIN, NOSOFSKY

Topic 1 (p=0.21): learning phenomena acquisition learn acquired ...

Topic 22 (p=0.17): similarity objects object space category dimensional categories spatial

Topic 61 (p=0.08): representations representation order alternative 1st higher 2nd descriptions problem form

Page 66: Semantic Representations with Probabilistic Topic Models

Retrieval Experiments

• For each candidate document, calculate how likely the

query was “generated” from the model’s encoding

P( Query | d ) = Πw∈Query [ P( x = 0 | d ) P( w | d, topics ) + P( x = 1 | d ) P( w | d, verbatim ) ]
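A sketch of this query likelihood computation (illustration only; the array-based document representation is an assumption). Documents would be ranked by this score.

```python
import numpy as np

def query_likelihood(query_word_ids, theta_d, phi, verbatim_d, p_verbatim):
    """log P(Query | d) = sum over query words of the dual route word probability."""
    log_p = 0.0
    for w in query_word_ids:
        p_topics = theta_d @ phi[:, w]                 # topics route
        p_special = verbatim_d[w]                      # verbatim route
        log_p += np.log((1 - p_verbatim) * p_topics + p_verbatim * p_special)
    return log_p   # rank candidate documents by this log likelihood
```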

Page 67: Semantic Representations with Probabilistic Topic Models

Information Retrieval Results

Evaluation metric: precision for the 10 highest ranked documents

APs:

Method   Title   Desc   Concepts
TFIDF    .406    .434   .549
LSI      .455    .469   .523
LDA      .478    .463   .556
SW       .488    .468   .561
SWB      .495    .473   .558

FRs:

Method   Title   Desc   Concepts
TFIDF    .300    .287   .483
LSI      .366    .327   .487
LDA      .428    .340   .487
SW       .448    .407   .560
SWB      .459    .400   .560

Page 68: Semantic Representations with Probabilistic Topic Models


Information retrieval systems in the mind & web

• Similar computational demands:

– Both retrieve the most relevant items from a large information

repository in response to external cues or queries.

• Useful analogies/ interdisciplinary approaches

• Many cognitive aspects in information retrieval

– Internet content is produced by humans

– Queries are formulated by humans

Page 69: Semantic Representations with Probabilistic Topic Models

Recent Papers

• Steyvers, M., Griffiths, T.L., & Dennis, S. (2006). Probabilistic inference in human semantic memory. Trends in Cognitive Sciences, 10(7), 327-334.

• Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244.

• Griffiths, T.L., Steyvers, M., & Firl, A. (in press). Google and the mind: Predicting fluency with PageRank. Psychological Science.

• Steyvers, M., & Griffiths, T.L. (in press). Rational Analysis as a Link between Human Memory and Information Retrieval. In N. Chater and M. Oaksford (Eds.), The Probabilistic Mind: Prospects from Rational Models of Cognition. Oxford University Press.

• Chemudugunta, C., Smyth, P., & Steyvers, M. (2007, in press). Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. In: Advances in Neural Information Processing Systems, 19.

Page 70: Semantic Representations with Probabilistic Topic Models

Text Mining Applications

Page 71: Semantic Representations with Probabilistic Topic Models

Topics provide quick summary of content

• Who writes on what topics?

• What is in this corpus? What is in this document?

• What are the topical trends over time?

• Who is mentioned in what context?

Page 72: Semantic Representations with Probabilistic Topic Models

Faculty Browser

• System spiders UCI/UCSD faculty websites related to

CalIT2 = California Institute for Telecommunications

and Information Technology

• Applies topic model on text extracted from pdf files

• Browser demo:

http://yarra.calit2.uci.edu/calit2/

Page 73: Semantic Representations with Probabilistic Topic Models

one topic

most prolific researchers for this topic

Page 74: Semantic Representations with Probabilistic Topic Models

topics this researcher works on

other researchers with similar

topical interests

one researcher

Page 75: Semantic Representations with Probabilistic Topic Models

Inferred network of researchers connected through topics

Page 76: Semantic Representations with Probabilistic Topic Models

330,000 articles

2000-2002

Analyzing the New York Times

Page 77: Semantic Representations with Probabilistic Topic Models

Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …

Extracted Named Entities

• Used standard algorithms to extract named entities:

- People- Places- Organizations

Page 78: Semantic Representations with Probabilistic Topic Models

Standard Topic Model with Entities

Basketball — words: team 0.028, play 0.015, game 0.013, season 0.012, final 0.011, games 0.011, point 0.011, series 0.011, player 0.010, coach 0.009, playoff 0.009, championship 0.007, playing 0.006, win 0.006; entities: LAKERS 0.062, SHAQUILLE-O-NEAL 0.028, KOBE-BRYANT 0.028, PHIL-JACKSON 0.019, NBA 0.013, SACRAMENTO 0.007, RICK-FOX 0.007, PORTLAND 0.006, ROBERT-HORRY 0.006, DEREK-FISHER 0.006

Tour de France — words: tour 0.039, rider 0.029, riding 0.017, bike 0.016, team 0.016, stage 0.014, race 0.013, won 0.012, bicycle 0.010, road 0.009, hour 0.009, scooter 0.008, mountain 0.008, place 0.008; entities: LANCE-ARMSTRONG 0.021, FRANCE 0.011, JAN-ULLRICH 0.003, LANCE 0.003, U-S-POSTAL-SERVICE 0.002, MARCO-PANTANI 0.002, PARIS 0.002, ALPS 0.002, PYRENEES 0.001, SPAIN 0.001

Holidays — words: holiday 0.071, gift 0.050, toy 0.023, season 0.019, doll 0.014, tree 0.011, present 0.008, giving 0.008, special 0.007, shopping 0.007, family 0.007, celebration 0.007, card 0.007, tradition 0.006; entities: CHRISTMAS 0.058, THANKSGIVING 0.018, SANTA-CLAUS 0.009, BARBIE 0.004, HANUKKAH 0.003, MATTEL 0.003, GRINCH 0.003, HALLMARK 0.002, EASTER 0.002, HASBRO 0.002

Oscars — words: award 0.026, film 0.020, actor 0.020, nomination 0.019, movie 0.015, actress 0.011, won 0.011, director 0.010, nominated 0.010, supporting 0.010, winner 0.008, picture 0.008, performance 0.007, nominees 0.007; entities: OSCAR 0.035, ACADEMY 0.020, HOLLYWOOD 0.009, DENZEL-WASHINGTON 0.006, JULIA-ROBERT 0.005, RUSSELL-CROWE 0.005, TOM-HANK 0.005, STEVEN-SODERBERGH 0.004, ERIN-BROCKOVICH 0.003, KEVIN-SPACEY 0.003

Page 79: Semantic Representations with Probabilistic Topic Models

Topic Trends

[Figure: proportion of words assigned to a topic in each time slice, from Jan 2000 to Jan 2003, for three topics: Tour-de-France, Anthrax, and Quarterly Earnings]
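A small sketch (not from the slides) of how such topic trends could be computed from topic assignments, assuming each word token is tagged with a time slice.

```python
from collections import defaultdict

def topic_trend(assignments, topic):
    """assignments: list of (time_slice, topic_id) pairs, one per word token.
    Returns, per time slice, the proportion of tokens assigned to `topic`."""
    totals, hits = defaultdict(int), defaultdict(int)
    for t, z in assignments:
        totals[t] += 1
        hits[t] += (z == topic)
    return {t: hits[t] / totals[t] for t in sorted(totals)}
```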

Page 80: Semantic Representations with Probabilistic Topic Models

Example of Extracted Entity-Topic Network

[Figure: inferred entity-topic network. Topic nodes: Muslim_Militance, Mid_East_Conflict, Palestinian_Territories, Pakistan_Indian_War, FBI_Investigation, Detainees, Mid_East_Peace, US_Military, Religion, Terrorist_Attacks, Afghanistan_War. Entity nodes: AL_QAEDA, HAMID_KARZAI, MOHAMMED, MOHAMMED_ATTA, NORTHERN_ALLIANCE, BIN_LADEN, TALIBAN, ZAWAHIRI, YASSER_ARAFAT, EHUD_BARAK, ARIEL_SHARON, HAMAS, AL_HAZMI, KING_HUSSEIN]

Page 81: Semantic Representations with Probabilistic Topic Models

Prediction of Missing Entities in Text

Shares of XXXX slid 8 percent, or $1.10, to $12.65 Tuesday, as major

credit agencies said the conglomerate would still be challenged in repaying

its debts, despite raising $4.6 billion Monday in taking its finance group

public. Analysts at XXXX Investors service in XXXX said they were

keeping XXXX and its subsidiaries under review for a possible debt

downgrade, saying the company “will continue to face a significant debt

burden,'' with large slices of debt coming due, over the next 18 months.

XXXX said …

Test article with entities

removed

Page 82: Semantic Representations with Probabilistic Topic Models

Prediction of Missing Entities in Text

[Same test article as on the previous slide, with entities removed]

Actual missing entities: fitch, goldman-sachs, lehman-brother, moody, morgan-stanley, new-york-stock-exchange, standard-and-poor, tyco, tyco-international, wall-street, worldco

Page 83: Semantic Representations with Probabilistic Topic Models

Prediction of Missing Entities in Text

[Same test article as above, with entities removed]

Actual missing entities: fitch, goldman-sachs, lehman-brother, moody, morgan-stanley, new-york-stock-exchange, standard-and-poor, tyco, tyco-international, wall-street, worldco

Predicted entities given observed words (matches in blue): wall-street, new-york, nasdaq, securities-exchange-commission, sec, merrill-lynch, new-york-stock-exchange, goldman-sachs, standard-and-poor

Page 84: Semantic Representations with Probabilistic Topic Models

Model Extensions

Page 85: Semantic Representations with Probabilistic Topic Models

Model Extensions

• HMM-topics model

– Modeling aspects of syntax

• Hierarchical topic model

– Modeling relations between topics

• Collocation topic models

– Learning collocations of words within topics

Page 86: Semantic Representations with Probabilistic Topic Models

Hidden Markov Topic Model

Page 87: Semantic Representations with Probabilistic Topic Models

Hidden Markov Topics Model

• Syntactic dependencies → short-range dependencies

• Semantic dependencies → long-range dependencies

[Graphical model: for each word w in the sequence, a class state s following a Markov chain, and a topic assignment z]

Semantic state: generate words from topic model

Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)

Page 88: Semantic Representations with Probabilistic Topic Models

Semantic state (topic distributions):
z = 1 (probability 0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
z = 2 (probability 0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

Syntactic states:
THE 0.6, A 0.3, MANY 0.1
OF 0.6, FOR 0.3, BETWEEN 0.1

[Figure: transition probabilities 0.9, 0.1, 0.2, 0.8, 0.7, 0.3 between the semantic state and the syntactic states]

Transition between semantic state and syntactic states
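An illustrative sketch of generating text from this composite model, using the word distributions shown on the slide; the full state-transition matrix is an assumption, since the slide only shows individual arrow probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Word distributions from the slide.
topics = {1: {"HEART": .2, "LOVE": .2, "SOUL": .2, "TEARS": .2, "JOY": .2},
          2: {"SCIENTIFIC": .2, "KNOWLEDGE": .2, "WORK": .2, "RESEARCH": .2, "MATHEMATICS": .2}}
p_topic1 = 0.4                                   # P(z = 1) from the slide; P(z = 2) = 0.6
syntactic = {"DET": {"THE": .6, "A": .3, "MANY": .1},
             "PREP": {"OF": .6, "FOR": .3, "BETWEEN": .1}}

# State transition matrix (rows: from-state); the exact values are an assumption.
states = ["SEM", "DET", "PREP"]
trans = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.0, 0.2],
                  [0.9, 0.1, 0.0]])

def draw(dist):
    words = list(dist)
    return words[rng.choice(len(words), p=list(dist.values()))]

state, sentence = "DET", []
for _ in range(6):
    if state == "SEM":                           # semantic state: word comes from a topic
        z = 1 if rng.random() < p_topic1 else 2
        sentence.append(draw(topics[z]))
    else:                                        # syntactic state: word comes from the HMM class
        sentence.append(draw(syntactic[state]))
    state = states[rng.choice(len(states), p=trans[states.index(state)])]

print(" ".join(sentence))                        # e.g. THE LOVE OF RESEARCH ...
```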

Page 89: Semantic Representations with Probabilistic Topic Models

THE ………………………………

[Figure: the same HMM-topics network as above; the first word, THE, is generated from the THE/A/MANY syntactic class]

Combining topics and syntax

Page 90: Semantic Representations with Probabilistic Topic Models

THE LOVE ……………………

[Figure: the second word, LOVE, is generated from the semantic state, topic z = 1]

Combining topics and syntax

Page 91: Semantic Representations with Probabilistic Topic Models

THE LOVE OF ………………

[Figure: the third word, OF, is generated from the OF/FOR/BETWEEN syntactic class]

Combining topics and syntax

Page 92: Semantic Representations with Probabilistic Topic Models

THE LOVE OF RESEARCH ……

[Figure: the fourth word, RESEARCH, is generated from the semantic state, topic z = 2]

Combining topics and syntax

Page 93: Semantic Representations with Probabilistic Topic Models

Semantic topics:

FOOD, FOODS, BODY, NUTRIENTS, DIET, FAT, SUGAR, ENERGY, MILK, EATING, FRUITS, VEGETABLES, WEIGHT, FATS, NEEDS, CARBOHYDRATES, VITAMINS, CALORIES, PROTEIN, MINERALS

MAP, NORTH, EARTH, SOUTH, POLE, MAPS, EQUATOR, WEST, LINES, EAST, AUSTRALIA, GLOBE, POLES, HEMISPHERE, LATITUDE, PLACES, LAND, WORLD, COMPASS, CONTINENTS

DOCTOR, PATIENT, HEALTH, HOSPITAL, MEDICAL, CARE, PATIENTS, NURSE, DOCTORS, MEDICINE, NURSING, TREATMENT, NURSES, PHYSICIAN, HOSPITALS, DR, SICK, ASSISTANT, EMERGENCY, PRACTICE

BOOK, BOOKS, READING, INFORMATION, LIBRARY, REPORT, PAGE, TITLE, SUBJECT, PAGES, GUIDE, WORDS, MATERIAL, ARTICLE, ARTICLES, WORD, FACTS, AUTHOR, REFERENCE, NOTE

GOLD, IRON, SILVER, COPPER, METAL, METALS, STEEL, CLAY, LEAD, ADAM, ORE, ALUMINUM, MINERAL, MINE, STONE, MINERALS, POT, MINING, MINERS, TIN

BEHAVIOR, SELF, INDIVIDUAL, PERSONALITY, RESPONSE, SOCIAL, EMOTIONAL, LEARNING, FEELINGS, PSYCHOLOGISTS, INDIVIDUALS, PSYCHOLOGICAL, EXPERIENCES, ENVIRONMENT, HUMAN, RESPONSES, BEHAVIORS, ATTITUDES, PSYCHOLOGY, PERSON

CELLS, CELL, ORGANISMS, ALGAE, BACTERIA, MICROSCOPE, MEMBRANE, ORGANISM, FOOD, LIVING, FUNGI, MOLD, MATERIALS, NUCLEUS, CELLED, STRUCTURES, MATERIAL, STRUCTURE, GREEN, MOLDS

PLANTS, PLANT, LEAVES, SEEDS, SOIL, ROOTS, FLOWERS, WATER, FOOD, GREEN, SEED, STEMS, FLOWER, STEM, LEAF, ANIMALS, ROOT, POLLEN, GROWING, GROW

Page 94: Semantic Representations with Probabilistic Topic Models

Syntactic classes:

GOOD, SMALL, NEW, IMPORTANT, GREAT, LITTLE, LARGE, *, BIG, LONG, HIGH, DIFFERENT, SPECIAL, OLD, STRONG, YOUNG, COMMON, WHITE, SINGLE, CERTAIN

THE, HIS, THEIR, YOUR, HER, ITS, MY, OUR, THIS, THESE, A, AN, THAT, NEW, THOSE, EACH, MR, ANY, MRS, ALL

MORE, SUCH, LESS, MUCH, KNOWN, JUST, BETTER, RATHER, GREATER, HIGHER, LARGER, LONGER, FASTER, EXACTLY, SMALLER, SOMETHING, BIGGER, FEWER, LOWER, ALMOST

ON, AT, INTO, FROM, WITH, THROUGH, OVER, AROUND, AGAINST, ACROSS, UPON, TOWARD, UNDER, ALONG, NEAR, BEHIND, OFF, ABOVE, DOWN, BEFORE

SAID, ASKED, THOUGHT, TOLD, SAYS, MEANS, CALLED, CRIED, SHOWS, ANSWERED, TELLS, REPLIED, SHOUTED, EXPLAINED, LAUGHED, MEANT, WROTE, SHOWED, BELIEVED, WHISPERED

ONE, SOME, MANY, TWO, EACH, ALL, MOST, ANY, THREE, THIS, EVERY, SEVERAL, FOUR, FIVE, BOTH, TEN, SIX, MUCH, TWENTY, EIGHT

HE, YOU, THEY, I, SHE, WE, IT, PEOPLE, EVERYONE, OTHERS, SCIENTISTS, SOMEONE, WHO, NOBODY, ONE, SOMETHING, ANYONE, EVERYBODY, SOME, THEN

BE, MAKE, GET, HAVE, GO, TAKE, DO, FIND, USE, SEE, HELP, KEEP, GIVE, LOOK, COME, WORK, MOVE, LIVE, EAT, BECOME

Page 95: Semantic Representations with Probabilistic Topic Models

NIPS Semantics (example topics):

EXPERTS, EXPERT, GATING, HME, ARCHITECTURE, MIXTURE, LEARNING, MIXTURES, FUNCTION, GATE

DATA, GAUSSIAN, MIXTURE, LIKELIHOOD, POSTERIOR, PRIOR, DISTRIBUTION, EM, BAYESIAN, PARAMETERS

STATE, POLICY, VALUE, FUNCTION, ACTION, REINFORCEMENT, LEARNING, CLASSES, OPTIMAL, *

MEMBRANE, SYNAPTIC, CELL, *, CURRENT, DENDRITIC, POTENTIAL, NEURON, CONDUCTANCE, CHANNELS

IMAGE, IMAGES, OBJECT, OBJECTS, FEATURE, RECOGNITION, VIEWS, #, PIXEL, VISUAL

KERNEL, SUPPORT, VECTOR, SVM, KERNELS, #, SPACE, FUNCTION, MACHINES, SET

NETWORK, NEURAL, NETWORKS, OUTPUT, INPUT, TRAINING, INPUTS, WEIGHTS, #, OUTPUTS

NIPS Syntax (example classes):

MODEL, ALGORITHM, SYSTEM, CASE, PROBLEM, NETWORK, METHOD, APPROACH, PAPER, PROCESS

IS, WAS, HAS, BECOMES, DENOTES, BEING, REMAINS, REPRESENTS, EXISTS, SEEMS

SEE, SHOW, NOTE, CONSIDER, ASSUME, PRESENT, NEED, PROPOSE, DESCRIBE, SUGGEST

USED, TRAINED, OBTAINED, DESCRIBED, GIVEN, FOUND, PRESENTED, DEFINED, GENERATED, SHOWN

IN, WITH, FOR, ON, FROM, AT, USING, INTO, OVER, WITHIN

HOWEVER, ALSO, THEN, THUS, THEREFORE, FIRST, HERE, NOW, HENCE, FINALLY

#, *, I, X, T, N, -, C, F, P

Page 96: Semantic Representations with Probabilistic Topic Models

Random sentence generation

LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Page 97: Semantic Representations with Probabilistic Topic Models

Nested Chinese Restaurant Process

Page 98: Semantic Representations with Probabilistic Topic Models

Topic Hierarchies

• In the regular topic model, there are no relations between topics

[Figure: example topic hierarchy with topic 1 at the root, topics 2-3 at the next level, and topics 4-7 at the leaves]

• Nested Chinese Restaurant Process

– Blei, Griffiths, Jordan, Tenenbaum

(2004)

– Learn hierarchical structure, as well as topics within the structure
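A minimal sketch of the nested CRP prior over root-to-leaf paths (illustration only; the tree representation and the concentration parameter gamma are assumptions).

```python
import random

def ncrp_path(tree, depth, gamma=1.0):
    """Draw a root-to-leaf path of length `depth` through a nested CRP.
    `tree` maps a node (tuple path) to a dict of child -> customer count."""
    path = ()
    for _ in range(depth):
        children = tree.setdefault(path, {})
        n = sum(children.values())
        # Chinese restaurant process: existing child with prob count/(n+gamma),
        # a brand-new child with prob gamma/(n+gamma).
        r = random.uniform(0, n + gamma)
        for child, count in children.items():
            if r < count:
                break
            r -= count
        else:
            child = len(children)          # open a new table / topic node
        children[child] = children.get(child, 0) + 1
        path = path + (child,)
    return path

tree = {}
for _ in range(5):
    print(ncrp_path(tree, depth=3))        # documents tend to reuse popular branches
```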

Page 99: Semantic Representations with Probabilistic Topic Models

Example: Psych Review Abstracts

[Figure: topic hierarchy learned from Psych Review abstracts; topics include very general word lists (THE, OF, AND, TO, IN, A, IS and A, MODEL, MEMORY, FOR, MODELS, TASK, INFORMATION, RESULTS, ACCOUNT) and more specific topics such as:]

RESPONSE, STIMULUS, REINFORCEMENT, RECOGNITION, STIMULI, RECALL, CHOICE, CONDITIONING

SPEECH, READING, WORDS, MOVEMENT, MOTOR, VISUAL, WORD, SEMANTIC

ACTION, SOCIAL, SELF, EXPERIENCE, EMOTION, GOALS, EMOTIONAL, THINKING

GROUP, IQ, INTELLIGENCE, SOCIAL, RATIONAL, INDIVIDUAL, GROUPS, MEMBERS

SEX, EMOTIONS, GENDER, EMOTION, STRESS, WOMEN, HEALTH, HANDEDNESS

REASONING, ATTITUDE, CONSISTENCY, SITUATIONAL, INFERENCE, JUDGMENT, PROBABILITIES, STATISTICAL

IMAGE, COLOR, MONOCULAR, LIGHTNESS, GIBSON, SUBMOVEMENT, ORIENTATION, HOLOGRAPHIC

CONDITIONING, STRESS, EMOTIONAL, BEHAVIORAL, FEAR, STIMULATION, TOLERANCE, RESPONSES

SELF, SOCIAL, PSYCHOLOGY, RESEARCH, RISK, STRATEGIES, INTERPERSONAL, PERSONALITY, SAMPLING

MOTION, VISUAL, SURFACE, BINOCULAR, RIVALRY, CONTOUR, DIRECTION, CONTOURS, SURFACES

DRUG, FOOD, BRAIN, AROUSAL, ACTIVATION, AFFECTIVE, HUNGER, EXTINCTION, PAIN

Page 100: Semantic Representations with Probabilistic Topic Models

Generative Process

[Figure: the same topic hierarchy as on the previous slide]

Page 101: Semantic Representations with Probabilistic Topic Models

Collocation Topic Model

Page 102: Semantic Representations with Probabilistic Topic Models

What about collocations?

• Why are these words related?

– PLAY - GROUND

– DOW - JONES

– BUMBLE - BEE

• Suggests at least two routes for association:

– Semantic

– Collocation

Integrate collocations into topic model

Page 103: Semantic Representations with Probabilistic Topic Models

[Figure: graphical model — a topic mixture per document; for each word, a topic, the word, and a switch variable x]

If x = 0, sample a word from the topic

If x = 1, sample a word from a distribution conditioned on the previous word

Collocation Topic Model
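A toy sketch of the per-word switch between the topic route and the previous-word (collocation) route; the vocabulary, topic, and bigram probabilities below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

vocab = ["dow", "jones", "rises", "market"]
phi = np.array([[0.4, 0.05, 0.25, 0.3]])        # one toy topic over the vocabulary
# P(next word | previous word): captures collocations such as "dow jones".
bigram = {"dow": {"jones": 0.9, "rises": 0.05, "market": 0.05}}

def next_word(prev, theta, p_collocation=0.5):
    """x = 1 -> word from the distribution conditioned on the previous word,
    x = 0 -> word from a topic drawn from the document's topic mixture."""
    if prev in bigram and rng.random() < p_collocation:
        words, probs = zip(*bigram[prev].items())
        return rng.choice(words, p=probs), 1
    z = rng.choice(len(phi), p=theta)
    return rng.choice(vocab, p=phi[z]), 0

theta = np.array([1.0])
w, x = "dow", 0
for _ in range(3):
    w, x = next_word(w, theta)
    print(w, "(x=%d)" % x)
```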

Page 104: Semantic Representations with Probabilistic Topic Models

Example: “DOW JONES RISES”

[Figure: DOW is sampled from a topic; for JONES, x = 1, so it is sampled from the distribution over words following DOW; for RISES, x = 0, so it is sampled from a topic]

JONES is more likely explained as a word following DOW than as a word sampled from a topic

Result: DOW_JONES recognized as a collocation

Collocation Topic Model

Page 105: Semantic Representations with Probabilistic Topic Models

Examples Topics from New York Times

[Same four New York Times topics as shown earlier (Stock Market, Wall Street Firms, Terrorism, Bankruptcy), with collocations such as DOW_JONES, NASDAQ_COMPOSITE_INDEX, OSAMA_BIN_LADEN, and BANKRUPTCY_PROTECTION treated as single terms]