![Page 2: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/2.jpg)
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
![Page 3: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/3.jpg)
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
![Page 4: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/4.jpg)
Problem – Text Relevance
• Q1: apple pie
• Q2: iphone crack
• Doc1: Apple Computer Inc. is a well known company located in California, USA.
• Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.
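A minimal sketch of why plain keyword matching struggles on this example. The scoring function `overlap` is a hypothetical stand-in for a term-matching relevance measure, not something taken from the slides: it counts shared words, so it cannot tell the company sense of "apple" from the fruit sense, and it finds nothing in common between "iphone crack" and the Apple Computer document.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, doc):
    """Number of distinct query words that also occur in the document."""
    return len(tokens(query) & tokens(doc))

queries = {"Q1": "apple pie", "Q2": "iphone crack"}
docs = {
    "Doc1": "Apple Computer Inc. is a well known company located in California, USA.",
    "Doc2": "The apple is the pomaceous fruit of the apple tree, "
            "species Malus domestica in the rose family.",
}

# Score every (query, document) pair by shared vocabulary only.
scores = {(q, d): overlap(q_text, d_text)
          for q, q_text in queries.items()
          for d, d_text in docs.items()}

for pair, score in scores.items():
    print(pair, score)
```

Q1 ties on both documents and Q2 matches neither, which is the gap a topic-level representation is meant to close.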
![Page 5: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/5.jpg)
Topic Models
![Page 6: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/6.jpg)
Topic Model – Generative Process
![Page 7: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/7.jpg)
Topic Model - Inference
![Page 8: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/8.jpg)
Latent Dirichlet Allocation
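The standard LDA generative process (Blei, Ng, and Jordan) can be sketched in a few lines. This is an illustrative sketch under assumptions, not the system described in this deck: the function names, the symmetric hyperparameters `alpha` and `beta`, and the toy sizes are all chosen here for the example.

```python
import random

def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate (word, topic) pairs for each document following LDA."""
    rng = random.Random(seed)
    # One word distribution per topic: phi_t ~ Dirichlet(beta).
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # One topic distribution per document: theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet(rng, alpha, num_topics)
        # For each token, draw a topic from theta_d, then a word from phi_topic.
        topics = rng.choices(range(num_topics), weights=theta, k=doc_len)
        words = [rng.choices(range(vocab_size), weights=phi[t])[0] for t in topics]
        docs.append(list(zip(words, topics)))
    return docs, phi

docs, phi = generate_corpus(num_docs=3, doc_len=8, num_topics=4, vocab_size=20)
for doc in docs:
    print(doc)
```

Inference runs this process in reverse: only the words are observed, and the topic assignments, `theta`, and `phi` have to be recovered.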
![Page 9: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/9.jpg)
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
![Page 10: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/10.jpg)
Gibbs Sampling for LDA
![Page 11: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/11.jpg)
Gibbs Sampling for LDA
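For reference, the collapsed Gibbs sampler for LDA (Griffiths and Steyvers) resamples the topic of token $i$, which is word $w$ in document $d$, from the conditional below. The notation is an assumption made here: $n_{t|d}$ is the number of tokens in document $d$ assigned to topic $t$, $n_{w|t}$ the number of times word $w$ is assigned to topic $t$, $n_t$ the total number of tokens assigned to topic $t$, and $V$ the vocabulary size. All counts exclude token $i$ itself.

$$
p(z_i = t \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \left(n_{t|d} + \alpha_t\right) \frac{n_{w|t} + \beta}{n_t + \beta V}
$$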
![Page 12: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/12.jpg)
Document-Topic Statistics
![Page 13: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/13.jpg)
Topic-Word Statistics
![Page 14: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/14.jpg)
![Page 15: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/15.jpg)
For each token,
![Page 16: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/16.jpg)
For each token,
![Page 17: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/17.jpg)
For each token,
![Page 18: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/18.jpg)
For each token,
![Page 19: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/19.jpg)
For each token,
![Page 20: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/20.jpg)
Sample a new topic
![Page 21: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/21.jpg)
For each token,
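The per-token loop can be sketched as one sweep of collapsed Gibbs sampling: remove the token from the counts, compute the unnormalized weight of every topic, sample a new topic, and add the token back. This is a dense, single-machine sketch with a symmetric `alpha`; the function and variable names are assumptions for illustration and are not taken from the slides.

```python
import random

def initialize(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_t = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
    return z, n_dt, n_tw, n_t

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta, vocab_size, rng):
    """Resample the topic of every token once, updating counts in place."""
    num_topics = len(n_t)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            # Remove the token's current assignment from all statistics.
            n_dt[d][old] -= 1
            n_tw[old][w] -= 1
            n_t[old] -= 1
            # Unnormalized conditional probability of each topic.
            weights = [(n_dt[d][t] + alpha) * (n_tw[t][w] + beta)
                       / (n_t[t] + beta * vocab_size)
                       for t in range(num_topics)]
            # Sample a new topic and add the token back.
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[d][i] = new
            n_dt[d][new] += 1
            n_tw[new][w] += 1
            n_t[new] += 1

rng = random.Random(0)
docs = [[0, 1, 2, 1], [3, 4, 3, 0], [2, 2, 4, 1]]  # toy corpus of word ids
state = initialize(docs, num_topics=2, vocab_size=5, rng=rng)
for _ in range(20):
    gibbs_sweep(docs, *state, alpha=0.5, beta=0.1, vocab_size=5, rng=rng)
z, n_dt, n_tw, n_t = state
print(z)
```

Each token costs time proportional to the number of topics in this dense form, which is the cost SparseLDA targets.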
![Page 22: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/22.jpg)
Summary so far
![Page 23: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/23.jpg)
The normalizing constant
![Page 24: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/24.jpg)
The normalizing constant
![Page 25: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/25.jpg)
The normalizing constant
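In SparseLDA (Yao, Mimno, and McCallum), the normalizing constant of the per-token sampling distribution is split into three parts. The notation is assumed here: $n_{t|d}$ counts tokens of topic $t$ in document $d$, $n_{w|t}$ counts assignments of word $w$ to topic $t$, $n_t$ is the total count for topic $t$, and $V$ is the vocabulary size.

$$
Z = \sum_t \frac{(\alpha_t + n_{t|d})(\beta + n_{w|t})}{\beta V + n_t} = s + r + q,
$$

$$
s = \sum_t \frac{\alpha_t \beta}{\beta V + n_t}, \qquad
r = \sum_t \frac{n_{t|d}\,\beta}{\beta V + n_t}, \qquad
q = \sum_t \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_t}.
$$

The identity follows from expanding $(\alpha_t + n_{t|d})(\beta + n_{w|t}) = \alpha_t\beta + n_{t|d}\,\beta + (\alpha_t + n_{t|d})\,n_{w|t}$. The term $s$ does not depend on the document or the word, $r$ is nonzero only for topics present in the document, and $q$ is nonzero only for topics in which the word occurs.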
![Page 26: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/26.jpg)
Statistics are sparse
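A sketch of how sparsity can be exploited when sampling, following the three-bucket idea of SparseLDA (Yao, Mimno, and McCallum). The data layout (dicts holding only nonzero counts), the symmetric `alpha`, and all names are assumptions for illustration. In this sketch the smoothing bucket `s` is recomputed on each call; an implementation would typically cache it and update it incrementally.

```python
import random

def sparse_buckets(doc_topic, word_topic, n_t, alpha, beta, vocab_size):
    """Return (s, r, q); r and q touch only topics with nonzero counts."""
    denom = [beta * vocab_size + n for n in n_t]
    s = sum(alpha * beta / dn for dn in denom)                 # smoothing-only
    r = sum(c * beta / denom[t] for t, c in doc_topic.items())  # document-topic
    q = sum((alpha + doc_topic.get(t, 0)) * c / denom[t]        # topic-word
            for t, c in word_topic.items())
    return s, r, q

def dense_normalizer(doc_topic, word_topic, n_t, alpha, beta, vocab_size):
    """Reference value: sum the full conditional over every topic."""
    return sum((alpha + doc_topic.get(t, 0)) * (beta + word_topic.get(t, 0))
               / (beta * vocab_size + n_t[t]) for t in range(len(n_t)))

def sample_topic(doc_topic, word_topic, n_t, alpha, beta, vocab_size, rng):
    """Pick a bucket first, then walk only that bucket's entries."""
    denom = [beta * vocab_size + n for n in n_t]
    s, r, q = sparse_buckets(doc_topic, word_topic, n_t, alpha, beta, vocab_size)
    u = rng.random() * (s + r + q)
    if u < q:
        items = [(t, (alpha + doc_topic.get(t, 0)) * c / denom[t])
                 for t, c in word_topic.items()]
    elif u < q + r:
        u -= q
        items = [(t, c * beta / denom[t]) for t, c in doc_topic.items()]
    else:
        u -= q + r
        items = [(t, alpha * beta / denom[t]) for t in range(len(n_t))]
    for t, mass in items:
        u -= mass
        if u <= 0:
            return t
    return items[-1][0]  # guard against floating-point round-off

n_t = [10, 0, 5, 3, 0, 7]      # tokens per topic (toy values)
doc_topic = {0: 2, 3: 1}       # nonzero document-topic counts
word_topic = {0: 4, 2: 1}      # nonzero topic-word counts for this word
s, r, q = sparse_buckets(doc_topic, word_topic, n_t, 0.5, 0.1, 50)
dense = dense_normalizer(doc_topic, word_topic, n_t, 0.5, 0.1, 50)
topic = sample_topic(doc_topic, word_topic, n_t, 0.5, 0.1, 50, random.Random(1))
print(s + r + q, dense, topic)
```

Because most of the probability mass usually sits in the `q` bucket, most draws only visit the few topics in which the word actually occurs.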
![Page 27: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/27.jpg)
Summary so far
![Page 28: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/28.jpg)
Huge savings: time and memory
![Page 29: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/29.jpg)
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
![Page 30: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/30.jpg)
Priors for LDA
![Page 31: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/31.jpg)
Priors for LDA
![Page 32: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/32.jpg)
Priors for LDA
![Page 33: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/33.jpg)
Priors for LDA
![Page 34: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/34.jpg)
Priors for LDA
![Page 35: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/35.jpg)
Comparing Priors for LDA
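As background from Wallach, Mimno, and McCallum, the comparison is between symmetric and asymmetric Dirichlet priors on the two sets of distributions. The symbols below follow that paper and are not confirmed by the slide text: $\mathbf{m}$ and $\mathbf{n}$ are base measures, and $T$ and $V$ are the number of topics and the vocabulary size.

$$
\theta_d \sim \mathrm{Dirichlet}(\alpha\,\mathbf{m}), \qquad
\phi_t \sim \mathrm{Dirichlet}(\beta\,\mathbf{n}).
$$

A prior is symmetric when its base measure is uniform ($m_t = 1/T$ or $n_w = 1/V$) and asymmetric otherwise. This gives four combinations: AA, AS, SA, and SS, where the first letter refers to the prior over $\theta_d$ and the second to the prior over $\phi_t$. With an asymmetric prior over $\theta_d$, the document-topic factor in the Gibbs conditional becomes $n_{t|d} + \alpha\, m_t$.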
![Page 36: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/36.jpg)
Optimizing m
![Page 37: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/37.jpg)
Selecting T
![Page 38: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/38.jpg)
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
![Page 39: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/39.jpg)
Overview
![Page 40: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/40.jpg)
MapReduce Jobs
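One common way to distribute Gibbs sampling, described in the AD-LDA paper by Newman et al. listed in the references, is to let each worker sample its own shard of documents against a private copy of the topic-word counts and then merge the changes. The sketch below shows only that merge step. It is an assumption about how such a reduce step could look, not the job layout used by this system, and the function name is hypothetical.

```python
def merge_topic_word_counts(global_counts, local_counts_list):
    """Combine per-worker count tables: global + sum of (local - global)."""
    merged = [row[:] for row in global_counts]
    for local in local_counts_list:
        for t, row in enumerate(local):
            for w, count in enumerate(row):
                # Add only the change this worker made during its sweep.
                merged[t][w] += count - global_counts[t][w]
    return merged

# Topic-word counts before the sweep (2 topics x 2 words).
global_counts = [[2, 1],
                 [0, 3]]
# Worker A moved one token of word 1 from topic 1 to topic 0.
worker_a = [[2, 2],
            [0, 2]]
# Worker B moved one token of word 0 from topic 0 to topic 1.
worker_b = [[1, 1],
            [1, 3]]

merged = merge_topic_word_counts(global_counts, [worker_a, worker_b])
print(merged)
```

The total count of each word is unchanged by the merge, since every worker only moves tokens between topics.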
![Page 41: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/41.jpg)
Scalability
• Hypothesis
  - memory: 40 GB per machine;
  - 5 words per doc.
• Scalability
  - if #<docs> <= 1,000,000,000, no #<topics> limit;
  - if #<topics> < 14,000, no #<docs> limit.
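The slide does not show how these bounds were derived. As a rough sanity check using only the stated numbers, the arithmetic below computes how many tokens the document limit implies and how many bytes per token a 40 GB budget would allow if all tokens were held on one machine. Treating 1 GB as 10^9 bytes and the single-machine reading are both assumptions made here.

```python
MEMORY_BYTES = 40 * 10**9      # 40 GB per machine, assuming 1 GB = 10^9 bytes
WORDS_PER_DOC = 5              # stated hypothesis
MAX_DOCS = 1_000_000_000       # stated document limit

# Total tokens in the corpus at the document limit.
total_tokens = MAX_DOCS * WORDS_PER_DOC

# Memory available per token if every token had to fit on one machine.
bytes_per_token = MEMORY_BYTES / total_tokens

print(total_tokens, bytes_per_token)
```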
![Page 42: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/42.jpg)
Experiment for Correctness Validation
![Page 43: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/43.jpg)
References
• D. Blei, Andrew Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 2003.
• Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS, 2004.
• Gregor Heinrich. Parameter estimation for text analysis. Technical Report, 2009.
• Limin Yao, David Mimno, and Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD, 2009.
• Hanna M. Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS, 2009.
• David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS, 2007.
• Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM, 2009.
• Xueminzhao. LDA design doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html.
![Page 44: LDA Training System xueminzhao@tencent.com 8/22/2012](https://reader035.vdocument.in/reader035/viewer/2022062713/56649cf05503460f949bfef2/html5/thumbnails/44.jpg)
Thanks!