

Page 1: LDA Training System xueminzhao@tencent.com 8/22/2012

LDA Training System

xueminzhao@tencent.com 8/22/2012

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Problem – Text Relevance

• Q1: apple pie

• Q2: iphone crack

• Doc1: Apple Computer Inc. is a well known company located in California, USA.

• Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.
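
A minimal sketch of why literal term matching struggles with the queries and documents above. It assumes a plain bag-of-words overlap score; the `tokens` and `overlap` helpers are illustrative names and are not part of the system described in these slides.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, doc):
    """Number of distinct query terms that also occur in the document."""
    return len(tokens(query) & tokens(doc))

queries = {
    "Q1": "apple pie",
    "Q2": "iphone crack",
}
docs = {
    "Doc1": "Apple Computer Inc. is a well known company located in California, USA.",
    "Doc2": "The apple is the pomaceous fruit of the apple tree, "
            "species Malus domestica in the rose family.",
}

# Score every (query, document) pair by shared terms.
scores = {
    (q, d): overlap(q_text, d_text)
    for q, q_text in queries.items()
    for d, d_text in docs.items()
}

for pair, score in sorted(scores.items()):
    print(pair, score)
```

Under this score Q1 ties between Doc1 and Doc2, because both contain "apple", and Q2 shares no term with either document. Term overlap alone therefore cannot tell the fruit sense from the company sense, which is presumably the gap the topic models in the next slides are meant to close.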

Topic Models

Topic Model – Generative Process
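
A minimal sketch of the standard LDA generative process as described by Blei et al. (2003), assuming symmetric Dirichlet priors. The function names, the toy vocabulary, and the values of `alpha` and `beta` are illustrative assumptions, not taken from the slides.

```python
import random

def sample_dirichlet(rng, concentration):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha, beta, seed=0):
    """Generate documents as lists of (word, topic) pairs."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a symmetric Dirichlet(beta).
    phi = [sample_dirichlet(rng, [beta] * len(vocab)) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One topic mixture per document, drawn from a symmetric Dirichlet(alpha).
        theta = sample_dirichlet(rng, [alpha] * num_topics)
        doc = []
        for _ in range(doc_len):
            # Pick a topic for this token, then a word from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=phi[z])[0]
            doc.append((w, z))
        corpus.append(doc)
    return corpus

vocab = ["apple", "pie", "iphone", "crack", "fruit", "company"]
corpus = generate_corpus(num_docs=3, doc_len=8, num_topics=2,
                         vocab=vocab, alpha=0.5, beta=0.5, seed=0)
```

Inference runs this process in reverse: only the words are observed, and the topic assignments and distributions have to be estimated.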

Topic Model – Inference

Latent Dirichlet Allocation

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Gibbs Sampling for LDA

Gibbs Sampling for LDA

Document-Topic Statistics

Topic-Word Statistics

For each token,

For each token,

For each token,

For each token,

For each token,

Sample a new topic

For each token,
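
The per-token loop named in these slides can be sketched as a collapsed Gibbs sampler in the style of Griffiths and Steyvers (2004). This is a minimal sketch under stated assumptions: symmetric priors `alpha` and `beta`, dense count tables, and hypothetical names (`init_state`, `gibbs_sweep`, `n_dt`, `n_tw`, `n_t`) that do not come from the slides.

```python
import random

def init_state(docs, num_topics, vocab_size, rng):
    """Give every token a random topic and build the three count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_t = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
    return z, n_dt, n_tw, n_t

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta, rng):
    """One pass of collapsed Gibbs sampling over every token in the corpus."""
    num_topics = len(n_t)
    vocab_size = len(n_tw[0])
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dt[d][old] -= 1
            n_tw[old][w] -= 1
            n_t[old] -= 1
            # Unnormalized conditional probability of each topic for this token.
            weights = [
                (n_dt[d][t] + alpha) * (n_tw[t][w] + beta)
                / (n_t[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            # Sample a new topic and add the token back under it.
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[d][i] = new
            n_dt[d][new] += 1
            n_tw[new][w] += 1
            n_t[new] += 1

# Toy corpus: each document is a list of word ids from a 4-word vocabulary.
docs = [[0, 1, 0, 2], [2, 3, 3], [0, 0, 1, 3, 2]]
rng = random.Random(0)
z, n_dt, n_tw, n_t = init_state(docs, num_topics=2, vocab_size=4, rng=rng)
for _ in range(20):
    gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha=0.1, beta=0.01, rng=rng)
```

Each token costs one pass over all topics to build `weights`, so the per-token cost of this dense version grows linearly with the number of topics.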

Summary so far

The normalizing constant

The normalizing constant

The normalizing constant

Statistics are sparse
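
A minimal sketch of how sparsity helps with the normalizing constant, following the three-bucket decomposition of Yao, Mimno, and McCallum (KDD'09). It assumes a symmetric `alpha` for brevity; the function names and the toy counts are illustrative, not taken from the slides.

```python
import math

def dense_normalizer(n_dt_row, n_tw_col, n_t, alpha, beta, vocab_size):
    """Sum of the unnormalized topic weights, touching every topic."""
    return sum(
        (alpha + n_dt_row[t]) * (beta + n_tw_col[t]) / (beta * vocab_size + n_t[t])
        for t in range(len(n_t))
    )

def sparse_buckets(n_dt_row, n_tw_col, n_t, alpha, beta, vocab_size):
    """Split the same sum into smoothing (s), document (r), and word (q) buckets."""
    denom = [beta * vocab_size + n for n in n_t]
    topics = range(len(n_t))
    # s depends only on the priors and topic totals.
    s = sum(alpha * beta / denom[t] for t in topics)
    # r is nonzero only for topics that occur in the current document.
    r = sum(n_dt_row[t] * beta / denom[t] for t in topics if n_dt_row[t] > 0)
    # q is nonzero only for topics that the current word has been assigned to.
    q = sum((alpha + n_dt_row[t]) * n_tw_col[t] / denom[t]
            for t in topics if n_tw_col[t] > 0)
    return s, r, q

# Toy counts for one (document, word) pair over 6 topics.
n_dt_row = [3, 0, 0, 2, 0, 0]       # topic counts within the document
n_tw_col = [0, 0, 5, 1, 0, 0]       # topic counts for the word
n_t = [40, 10, 25, 30, 12, 8]       # total tokens per topic

s, r, q = sparse_buckets(n_dt_row, n_tw_col, n_t,
                         alpha=0.1, beta=0.01, vocab_size=1000)
total = dense_normalizer(n_dt_row, n_tw_col, n_t,
                         alpha=0.1, beta=0.01, vocab_size=1000)
print(math.isclose(s + r + q, total))
```

The three buckets add up to the dense sum, but `r` and `q` only visit topics with nonzero counts. In the published method `s` and `r` are also cached and updated incrementally rather than recomputed per token; this sketch recomputes them for clarity.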

Summary so far

Huge savings: time and memory

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Priors for LDA

Priors for LDA

Priors for LDA

Priors for LDA

Priors for LDA

Comparing Priors for LDA

Optimizing m

Selecting T

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Overview

MapReduce Jobs

Scalability

• Hypothesis
  - memory: 40 GB per machine;
  - 5 words per doc.

• Scalability
  - if #<docs> <= 1,000,000,000, no #<topics> limit;
  - if #<topics> < 14,000, no #<docs> limit.

Experiment for Correctness Validation

References

• D. Blei, Andrew Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003.

• Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS, 2004.

• Gregor Heinrich. Parameter estimation for text analysis. Technical Report, 2009.

• Limin Yao, David Mimno, and Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD'09.

• Hanna M. Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS, 2009.

• David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS, 2007.

• Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM, 2009.

• Xueminzhao. LDA design doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html.

Thanks!