Latent Association Analysis of Document Pairs Gengxin Miao University of California, Santa Barbara Presented at the IBM T.J. Watson Research Center Hawthorne, NY December 2, 2011

Page 1: Latent Association Analysis  of Document Pairs

Latent Association Analysis of Document Pairs

Gengxin Miao
University of California, Santa Barbara

Presented at the IBM T.J. Watson Research Center
Hawthorne, NY
December 2, 2011

Page 2: Latent Association Analysis  of Document Pairs

Gengxin Miao UC Santa Barbara 2

Networked Texts

[Figure: three panels.
Panel titles: Texts flow on expert networks; Semantically associated texts; Interconnected text streams.
Labels: nodes A through H, t1, t2, t3, "DB2 logon"; Diseases, Symptoms, Treatments; Users, Queries, Web pages, "Belong to the same search task".]

Page 3: Latent Association Analysis  of Document Pairs

Semantically Associated Documents

Page 4: Latent Association Analysis  of Document Pairs

Applications

Software system maintenance: root cause finding, problem prediction

Machine translation

Question answering

Healthcare assistance

Page 5: Latent Association Analysis  of Document Pairs

Huge Datasets

Beyond human learner’s capability

Page 6: Latent Association Analysis  of Document Pairs

Modeling Options

Word-level mapping

Topic-level mapping

Document-level mapping

[Figure labels: Source Document Set; Target Document Set]

Page 7: Latent Association Analysis  of Document Pairs

Word-Level Mapping (UAI’09)

Learns a dictionary between the two document sets
Applies to machine translation
Word mappings are typically noisy

Page 8: Latent Association Analysis  of Document Pairs

Topic-Level Mapping (EMNLP’09)

Assumes the associated documents share the same topic proportion

Works well for translations between languages

Topic simplex of the source document set

Topic simplex of the target document set

Page 9: Latent Association Analysis  of Document Pairs

Document-Level Mapping (our work)

One-to-many or many-to-one mappings are broken down into one-to-one document pairs

Two documents are associated by their association factor

Page 10: Latent Association Analysis  of Document Pairs

Latent Association Analysis – Framework

Generative process:
Draw an association factor for each document pair
Draw topic proportions for both the source and the target document
Draw the words in each document

Generative Models Ranking Algorithms Experiment

Page 11: Latent Association Analysis  of Document Pairs

Latent Association Analysis – An Instantiation

Canonical Correlation Analysis (CCA): captures the semantic association in document pairs

Correlated Topic Model (CTM): captures the document and word co-occurrence

Page 12: Latent Association Analysis  of Document Pairs

The Generative Process

A pair of documents arises from the following process:
Draw an L-dimensional association factor
For the source/target document, draw the topic proportions
For each word in the documents, draw a topic and a word
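
The three steps of this process can be sketched in Python. This is a minimal sketch under stated assumptions, not the exact model on the slides: the association factor is assumed to be standard Gaussian, the topic weights are assumed to be a linear function of it plus Gaussian noise (the usual probabilistic CCA form), and the proportions are obtained with a softmax as in a correlated topic model. The names `T`, `mu`, `noise_std`, `beta`, and `random_side` are hypothetical and introduced only for illustration.

```python
import numpy as np

def softmax(eta):
    """Map real-valued topic weights to proportions that sum to 1."""
    e = np.exp(eta - eta.max())
    return e / e.sum()

def sample_document(rng, y, T, mu, noise_std, beta, n_words):
    """Sample one document given the shared association factor y."""
    n_topics, vocab_size = beta.shape
    # Assumed form: topic weights are linear in y plus Gaussian noise.
    eta = T @ y + mu + noise_std * rng.standard_normal(n_topics)
    theta = softmax(eta)  # topic proportions
    # For each word: draw a topic, then draw a word from that topic.
    topics = rng.choice(n_topics, size=n_words, p=theta)
    words = np.array([rng.choice(vocab_size, p=beta[k]) for k in topics])
    return theta, topics, words

def sample_document_pair(rng, source, target, n_factors, n_words=50):
    """source and target are dicts with keys T, mu, noise_std, beta (hypothetical layout)."""
    y = rng.standard_normal(n_factors)  # L-dimensional association factor
    src = sample_document(rng, y, n_words=n_words, **source)
    tgt = sample_document(rng, y, n_words=n_words, **target)
    return y, src, tgt

def random_side(rng, n_topics, vocab_size, n_factors):
    """Random illustrative parameters for one side (source or target)."""
    return dict(
        T=rng.standard_normal((n_topics, n_factors)),
        mu=np.zeros(n_topics),
        noise_std=0.1,
        beta=rng.dirichlet(np.ones(vocab_size), size=n_topics),
    )
```

The source and target sides have their own topics and vocabularies; the only thing they share is the association factor `y`, which is what ties the two documents of a pair together.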

Page 13: Latent Association Analysis  of Document Pairs

Problems

Inference: given a model M and a document pair, how do we determine the association factor, topic proportions, and topic assignments that best describe the document pair?

Model fitting: given a set of document pairs, how do we calculate the parameters in M that best describe the entire document pair set?

Page 14: Latent Association Analysis  of Document Pairs

Inference

Objective function

Given a model and a document pair, calculate the topic assignments and the topic proportions

The posterior distribution is intractable to compute: the topic assignments and the topic proportions are correlated when conditioned on observations

Page 15: Latent Association Analysis  of Document Pairs

Variational Inference

Decouple the parameters using a variational distribution Q

Fit the variational parameters to approximate the true posterior distribution
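
As a generic illustration of this idea (the standard variational bound, not the specific derivation from the talk): write w for the observed words and h for all latent variables of a document pair, here the association factor, topic proportions, and topic assignments. The symbol h is introduced only for this sketch. For any distribution Q over h, the log likelihood splits into a lower bound plus a KL divergence:

```latex
\log p(\mathbf{w})
  = \underbrace{\mathbb{E}_{Q}\big[\log p(\mathbf{w}, \mathbf{h})\big]
      - \mathbb{E}_{Q}\big[\log Q(\mathbf{h})\big]}_{\text{lower bound}}
  + \mathrm{KL}\big(Q(\mathbf{h}) \,\|\, p(\mathbf{h} \mid \mathbf{w})\big)
```

Because the KL term is non-negative and the left-hand side does not depend on Q, raising the lower bound by adjusting the variational parameters is the same as moving Q closer to the true posterior. Choosing a Q that factorizes over the latent variables is what decouples them.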

Page 16: Latent Association Analysis  of Document Pairs

Variational Parameters

Page 17: Latent Association Analysis  of Document Pairs

Model Fitting

Page 18: Latent Association Analysis  of Document Pairs

LAA Ranking Methods

Direct Ranking
Ranking function for a candidate document pair
Word frequency can distort the probability

Latent Ranking

Page 19: Latent Association Analysis  of Document Pairs

Two-Step Ranking

Separate Topic Models
Source document has topic proportion
Target document has topic proportion

Topic-Level Mapping
Canonical Correlation Analysis captures the association between the topic proportions

Rank Target Documents
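
The second and third steps of this baseline can be sketched as follows. This is a minimal sketch under assumptions: the topic proportions are taken as already inferred by two separate topic models (rows of `X` for source documents, rows of `Y` for target documents), CCA is fit with a small ridge term `reg` because proportions sum to 1 and their covariance is singular, and candidates are ordered by Euclidean distance in the canonical space. The slide does not state the ranking criterion, so the distance rule is an assumption, and the function names are hypothetical.

```python
import numpy as np

def fit_cca(X, Y, n_components, reg=1e-3):
    """Regularized CCA between paired rows of X and Y.

    Returns the two column means and the two projection matrices.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both sides with Cholesky factors, then take an SVD
    # of the whitened cross-covariance Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    U, _, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U[:, :n_components])
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return mx, my, A, B

def rank_targets(x, Y_candidates, model):
    """Return candidate indices ordered from best to worst match for x."""
    mx, my, A, B = model
    u = (x - mx) @ A                 # source document in canonical space
    V = (Y_candidates - my) @ B      # candidate targets in canonical space
    dist = np.linalg.norm(V - u, axis=1)
    return np.argsort(dist)
```

Given a fitted `model`, `rank_targets(x, Y_candidates, model)[0]` is the index of the candidate target judged most likely to be associated with the source document `x`.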

Page 20: Latent Association Analysis  of Document Pairs

Experiments

Datasets
IT-Change: changes made to an IT environment and the consequent problems
24,317 document pairs; 20,000 used for training, the rest used for testing
IT-Solution: IT problems and their solutions
19,696 document pairs; 15,000 used for training, the rest used for testing

Evaluation
Randomly select 100 document pairs in the testing dataset
For each source document, rank the 100 target documents
Use the rank of the correct target document as the accuracy measurement
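
The evaluation protocol can be sketched as below. This is an illustrative sketch, not the authors' evaluation code: it assumes some ranking method has produced a square score matrix in which entry (i, j) scores target document j for source document i (higher is better), and that the correct target for source i sits at index i. The function name is hypothetical.

```python
import numpy as np

def correct_target_ranks(scores):
    """Rank (1 = best) of the correct target for each source document.

    scores[i, j] is the score of target j for source i; the correct
    target for source i is assumed to be target i.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.diag(scores)
    # The rank is one plus the number of candidates scored strictly higher.
    return 1 + (scores > correct[:, None]).sum(axis=1)
```

With 100 sampled test pairs the matrix is 100 by 100, a perfect method gives rank 1 for every source document, and a lower average rank means better accuracy.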

Page 21: Latent Association Analysis  of Document Pairs

Accuracy Analysis

Page 22: Latent Association Analysis  of Document Pairs

Example

Page 23: Latent Association Analysis  of Document Pairs

Summary

The LAA framework is capable of modeling two document sets associated by a bipartite graph

One-to-many mappings or many-to-one mappings of documents are taken into consideration

We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications

The LAA-latent ranking algorithm ranks the correct target document better than other state-of-the-art algorithms

Page 24: Latent Association Analysis  of Document Pairs

Acknowledgment

Prof. Louise E. Moser

Prof. Xifeng Yan

Dr. Shu Tao

Dr. Ziyu Guan

Dr. Nikos Anerousis

Page 25: Latent Association Analysis  of Document Pairs

Q & A?

Thanks!

Page 26: Latent Association Analysis  of Document Pairs

Unigram Model

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$

Page 27: Latent Association Analysis  of Document Pairs

Mixture of Unigrams

$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$
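
For comparison with the unigram model, here is a small numerical sketch of the two likelihoods. In the unigram model every word of a document is drawn independently from one word distribution. In the mixture of unigrams a single topic z is drawn for the whole document and all of its words come from that topic's distribution, so the likelihood sums over topics. The function and argument names are hypothetical, and the sum is computed in log space for numerical stability.

```python
import numpy as np

def unigram_log_likelihood(words, word_probs):
    """log p(w) when every word is drawn from the same distribution."""
    return float(np.log(word_probs[words]).sum())

def mixture_log_likelihood(words, topic_probs, topic_word_probs):
    """log p(w) when one topic is drawn per document.

    topic_probs[k] is p(z = k); topic_word_probs[k, v] is p(w = v | z = k).
    """
    # log p(z) + sum over the document's words of log p(w_n | z), per topic
    per_topic = np.log(topic_probs) + np.log(topic_word_probs[:, words]).sum(axis=1)
    # log-sum-exp over topics
    m = per_topic.max()
    return float(m + np.log(np.exp(per_topic - m).sum()))
```

With a single topic the mixture reduces to the unigram model, which gives a quick sanity check on any implementation.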

Page 28: Latent Association Analysis  of Document Pairs

Probabilistic Latent Semantic Indexing

$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$

Page 29: Latent Association Analysis  of Document Pairs

LDA and CTM

[Figure: two panels, each labeled topic 1, topic 2, topic 3]