
LDA Training System

xueminzhao@tencent.com, 8/22/2012

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Problem – Text Relevance

• Q1: apple pie

• Q2: iphone crack

• Doc1: Apple Computer Inc. is a well-known company located in California, USA.

• Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.

Topic Models

Topic Model – Generative Process

Topic Model - Inference

Latent Dirichlet Allocation
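
The transcript preserves only the slide titles here. As a reminder of what the generative-process slides cover, below is a minimal sketch of LDA's generative story in Python/NumPy; the parameter names and values (n_docs, n_topics, vocab_size, doc_len, alpha, beta) are illustrative and not taken from the slides.

import numpy as np

def generate_corpus(n_docs=100, n_topics=10, vocab_size=1000, doc_len=50,
                    alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Topic-word distributions: phi_k ~ Dirichlet(beta) for each topic k.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for _ in range(n_docs):
        # Document-topic distribution: theta_d ~ Dirichlet(alpha).
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)     # draw a topic for this token
            w = rng.choice(vocab_size, p=phi[z])  # draw a word from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus

Inference (the next slide) runs this story backwards: given only the words, recover the topic assignments and the theta/phi distributions.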

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Gibbs Sampling for LDA

Document-Topic Statistics

Topic-Word Statistics

For each token, sample a new topic from its conditional distribution given all other topic assignments, then update the document-topic and topic-word statistics.
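
The sampling formula itself did not survive in the transcript; the standard collapsed Gibbs update for LDA (Griffiths & Steyvers 2004, in the references) is

P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; (n_{d,k}^{-i} + \alpha_k)\,\frac{n_{k,w_i}^{-i} + \beta}{n_k^{-i} + V\beta}

where n_{d,k} counts tokens in document d assigned to topic k, n_{k,w} counts assignments of word w to topic k, n_k = \sum_w n_{k,w}, V is the vocabulary size, the superscript -i excludes the current token from the counts, and \alpha_k is a constant \alpha in the symmetric-prior case.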

Summary so far

The normalizing constant

Statistics are sparse

Summary so far

Huge savings: time and memory
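
The SparseLDA idea behind these slides (Yao, Mimno & McCallum, KDD 2009, in the references) splits the normalizing constant into three buckets so that only the sparse parts are recomputed per token; it is sketched here because the equations did not survive in the transcript:

Z = \sum_k \frac{(\alpha_k + n_{d,k})(\beta + n_{k,w})}{\beta V + n_k} = s + r + q, \qquad
s = \sum_k \frac{\alpha_k \beta}{\beta V + n_k}, \quad
r = \sum_k \frac{n_{d,k}\,\beta}{\beta V + n_k}, \quad
q = \sum_k \frac{(\alpha_k + n_{d,k})\,n_{k,w}}{\beta V + n_k}

s depends only on the slowly changing topic totals and can be cached; r is nonzero only for topics that occur in document d; q is nonzero only for topics in which word w occurs. Sampling first picks one of the three buckets and then a topic within it, which is where the time and memory savings come from.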

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Priors for LDA
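
These slides presumably follow Wallach, Mimno & McCallum (NIPS 2009, in the references), which compares symmetric and asymmetric Dirichlet priors. As a sketch of the notation (not taken from the slides):

\theta_d \sim \mathrm{Dirichlet}(\alpha\,\mathbf{m}), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta\,\mathbf{u})

where m is a (possibly non-uniform) base measure over topics with \sum_k m_k = 1, u is the uniform base measure over the vocabulary, and \alpha, \beta are concentration parameters. The paper's headline result is that an asymmetric prior over the document-topic distributions combined with a symmetric prior over the topic-word distributions (the "AS" configuration) is substantially more robust than the fully symmetric setting.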

Comparing Priors for LDA

Optimizing m
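
The slide's exact procedure for optimizing m is not preserved. One common choice (used, for example, in MALLET-style hyperparameter optimization, and offered here only as an assumed reconstruction) is Minka's fixed-point update for the Dirichlet-multinomial likelihood:

(\alpha m_k)^{\mathrm{new}} = \alpha m_k \cdot \frac{\sum_d \left[\Psi(n_{d,k} + \alpha m_k) - \Psi(\alpha m_k)\right]}{\sum_d \left[\Psi(n_d + \alpha) - \Psi(\alpha)\right]}

where \Psi is the digamma function, n_{d,k} are the per-document topic counts, n_d is the length of document d, and \alpha = \sum_k \alpha m_k.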

Selecting T

Outline

• Introduction

• SparseLDA

• Rethinking LDA: Why Priors Matter

• LDA Training System Design: MapReduce-LDA

Overview

MapReduce Jobs
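
The job structure is not spelled out in the transcript. The sketch below assumes the design follows the AD-LDA / PLDA line of work cited in the references: each map task runs collapsed Gibbs sampling over its shard of documents against a local copy of the topic-word counts, and the reduce phase merges the count deltas before the next iteration. All names (map_shard, reduce_counts, etc.) and the tiny synthetic corpus are illustrative only, simulated on a single machine.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Tiny synthetic corpus: list of documents, each a list of word ids.
V, T, alpha, beta = 20, 3, 0.1, 0.01                 # vocab size, topics, priors
docs = [list(rng.integers(0, V, size=8)) for _ in range(6)]

# Global statistics, as broadcast at the start of an iteration.
z = [[int(rng.integers(0, T)) for _ in doc] for doc in docs]   # topic assignments
n_kw = np.zeros((T, V), dtype=int)                   # topic-word counts
n_dk = np.zeros((len(docs), T), dtype=int)           # document-topic counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_kw[z[d][i], w] += 1
        n_dk[d, z[d][i]] += 1
n_k = n_kw.sum(axis=1)                               # tokens per topic

def map_shard(shard):
    """Map task: resample topics for one shard of documents against a local
    copy of the topic-word counts, emitting ((topic, word), delta) pairs."""
    local_kw, local_k = n_kw.copy(), n_k.copy()
    deltas = defaultdict(int)
    for d in shard:
        for i, w in enumerate(docs[d]):
            old = z[d][i]
            local_kw[old, w] -= 1; local_k[old] -= 1; n_dk[d, old] -= 1
            # Collapsed Gibbs conditional (same formula as on the earlier slide).
            p = (n_dk[d] + alpha) * (local_kw[:, w] + beta) / (local_k + V * beta)
            new = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = new
            local_kw[new, w] += 1; local_k[new] += 1; n_dk[d, new] += 1
            deltas[(old, w)] -= 1; deltas[(new, w)] += 1
    return deltas

def reduce_counts(all_deltas):
    """Reduce phase: merge per-shard deltas into the global topic-word table."""
    for deltas in all_deltas:
        for (k, w), dv in deltas.items():
            n_kw[k, w] += dv
    return n_kw

# One "iteration": two map tasks over disjoint document shards, then a reduce.
# (n_k would be recomputed as n_kw.sum(axis=1) before the next iteration.)
shard_outputs = [map_shard([0, 1, 2]), map_shard([3, 4, 5])]
reduce_counts(shard_outputs)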

Scalability

• Assumptions: 40 GB of memory per machine; 5 words per doc.

• Scalability: if #docs <= 1,000,000,000, there is no limit on #topics; if #topics < 14,000, there is no limit on #docs.
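
The derivation behind these limits is not shown on the slide. One back-of-envelope reading, offered purely as an assumption: at 5 words per doc and roughly 8 bytes per token (word id plus topic assignment),

10^9 docs × 5 tokens/doc × 8 bytes/token = 4 × 10^10 bytes ≈ 40 GB,

which is independent of the number of topics; the 14,000-topic bound would then come from fitting the #topics × vocabulary count table into the same 40 GB, which depends on the (unstated) vocabulary size.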

Experiment for Correctness Validation

References

• D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003.
• T. L. Griffiths and M. Steyvers. Finding Scientific Topics. PNAS, 2004.
• G. Heinrich. Parameter Estimation for Text Analysis. Technical Report, 2009.
• L. Yao, D. Mimno, and A. McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD 2009.
• H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why Priors Matter. NIPS 2009.
• D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS 2007.
• Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009.
• Xuemin Zhao. LDA Design Doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html

Thanks!
