Page 1: Local Linear Matrix Factorization for Document Modeling Institute of Computing Technology, Chinese Academy of Sciences bailu@software.ict.ac.cn Lu Bai,

Local Linear Matrix Factorization for Document Modeling

Lu Bai, Jiafeng Guo, Yanyan Lan, Xueqi Cheng

Institute of Computing Technology, Chinese Academy of Sciences

bailu@software.ict.ac.cn

Outline

• Introduction
• Our approach
• Experimental results
• Conclusion

Introduction

classification

ranking

recommendation

Background

Low dimensional representations can be produced by decomposing the document-word matrix into low rank matrices:

$D^{T} = \theta \times \beta$, with $D^{T} \in \mathbb{R}^{N \times M}$, $\theta \in \mathbb{R}^{N \times K}$, $\beta \in \mathbb{R}^{K \times M}$

D: document-word matrix
θ: document-topic matrix
β: term-topic matrix

Preserving local geometric relations can improve the low dimensional representation
• Smoothing the low dimensional representation
• Improving the model’s generalization
• Avoiding over-fitting
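The low rank decomposition described on this slide can be illustrated with standard multiplicative NMF updates. This is a minimal sketch and not the authors' implementation: the function name `nmf`, the toy matrix `X` (standing in for the N×M matrix written as D^T on the slide), the iteration count, and the small constant `eps` are all assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (N x M) as theta (N x k) @ beta (k x M)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    theta = rng.random((n, k)) + eps   # document-topic factor
    beta = rng.random((k, m)) + eps    # topic-term factor
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        beta *= (theta.T @ X) / (theta.T @ theta @ beta + eps)
        theta *= (X @ beta.T) / (theta @ beta @ beta.T + eps)
        # Track the squared Frobenius reconstruction error.
        errors.append(np.linalg.norm(X - theta @ beta) ** 2)
    return theta, beta, errors

# Toy counts (assumed data): 4 documents, 5 words, two rough word groups.
X = np.array([[3, 2, 0, 0, 0],
              [2, 3, 1, 0, 0],
              [0, 0, 0, 4, 2],
              [0, 0, 1, 3, 3]], dtype=float)
theta, beta, errors = nmf(X, k=2)
```

Each row of `theta` is the K-dimensional representation of one document; the plain factorization uses no information about neighboring documents, which is the gap the rest of the talk addresses.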

Previous work

No local geometric regularization
• None or global regularization only
• e.g. SVD, PLSA, LDA, NMF, etc.
• Over-fitting & poor generalization

Pairwise Neighborhood Smoothing
• Increasing the low dimensional affinity over nearby document pairs
• e.g. LapPLSA, LTM, DTM, etc.
• Losing the geometric information among pairs, especially in unbalanced document distribution

Heuristic similarity measure & neighbors
• Empirical similarity threshold and neighbor numbers
• e.g. LapPLSA, LTM
• Improper similarity measure or number of neighbors hurts the representation

A new low dimensional representation mining method by better exploiting the geometric relationship among documents

Our approach

Basic ideas

• Factorizing the document-word matrix in the NMF way
  Mining low dimensional semantic representations

• Modeling documents’ relationships with local linear combinations
  Preserving rich local geometric information

• Regularizing the local linear combination weights with the ℓ1 norm
  Selecting neighbors without a similarity measure or threshold

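Local Linear Matrix Factorization (LLMF)

Factorizing the document-term matrix as NMF
• Regularization terms on the factors are used for reducing over-fitting

Factorizing the matrix with neighbors
• The normalized document-word matrix avoids the bias of long documents
• Each document is modeled as a linear combination of other documents; a coefficient weights the norm of the combination weights
• Picking document neighbors
• Learning salient combination weights

(The two minimization objectives shown on this slide are not preserved in the transcript.)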
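The idea of picking neighbors through sparse combination weights can be sketched as an ℓ1-regularized least-squares problem per document. Everything specific here is an assumption, since the slide's formulas are not available: the objective 0.5·||x_i − Σ_j w_j x_j||² + λ·||w||_1, the L2 row normalization, the value `lam=0.1`, and the helper names. The solver is plain proximal gradient (ISTA), swapped in for brevity; the talk itself names OWL-QN.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks entries toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def local_weights(X, i, lam=0.1, n_iter=500):
    """Sparse weights reconstructing row i of X from the other rows (ISTA)."""
    others = np.delete(np.arange(X.shape[0]), i)
    A = X[others].T                          # columns are candidate neighbors
    b = X[i]
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - b)             # gradient of the least-squares term
        w = soft_threshold(w - step * grad, step * lam)
    full = np.zeros(X.shape[0])              # weight of a document on itself stays 0
    full[others] = w
    return full

# Toy counts (assumed data), rows normalized to unit length so that
# long documents do not dominate.
counts = np.array([[3, 2, 0, 0, 0],
                   [2, 3, 1, 0, 0],
                   [0, 0, 0, 4, 2],
                   [0, 0, 1, 3, 3]], dtype=float)
X_norm = counts / np.linalg.norm(counts, axis=1, keepdims=True)
w0 = local_weights(X_norm, i=0)
```

On this toy data only the second document receives a nonzero weight for document 0, so the neighbor set falls out of the sparsity pattern rather than from a similarity threshold or a fixed neighbor count.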

Cont’

Combining matrix factorization and local neighbor factorization

Final objective function (the formula is not preserved in the transcript)

Graphical Model of LLMF

LLMF vs Others

Comparing with models without geometric information
• E.g. NMF, PLSA, LDA
• LLMF smooths a document's representation with its neighbors

Comparing with models with geometric constraints
• E.g. LapPLSA, LTM
• LLMF is free of similarity measure and neighborhood threshold
• LLMF is more robust in preserving local geometric structure in unbalanced data distributions

[Figure: points A, B, C, D, E, F with φ_AB, φ_AC, φ_AD marked]

Model fitting

Estimating the combination weights first
• Not differentiable, because of the ℓ1 norm
• Solved with OWL-QN

Estimating θ and β
• The objective is bi-convex in θ and β
• Solved with coordinate gradient descent
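Bi-convexity means the objective is convex in one factor when the other is held fixed, so alternating updates are safe. The sketch below illustrates this on an assumed objective, ||X − θβ||² + λ(||θ||² + ||β||²) with non-negative factors; it omits the neighbor term entirely and is not the authors' objective or code. The names `objective` and `block_descent`, the step-size rule, and `lam` are illustrative assumptions.

```python
import numpy as np

def objective(X, theta, beta, lam):
    """Assumed objective: squared error plus Frobenius penalties on both factors."""
    return (np.linalg.norm(X - theta @ beta) ** 2
            + lam * (np.linalg.norm(theta) ** 2 + np.linalg.norm(beta) ** 2))

def block_descent(X, k, lam=0.1, n_iter=300, seed=0):
    """Alternate projected gradient steps on theta and beta."""
    rng = np.random.default_rng(seed)
    theta = rng.random((X.shape[0], k))
    beta = rng.random((k, X.shape[1]))
    history = [objective(X, theta, beta, lam)]
    for _ in range(n_iter):
        # theta-step: with beta fixed the objective is a convex quadratic in theta.
        lip = 2.0 * (np.linalg.norm(beta @ beta.T, 2) + lam)
        grad = 2.0 * (theta @ beta - X) @ beta.T + 2.0 * lam * theta
        theta = np.maximum(theta - grad / lip, 0.0)   # project onto theta >= 0
        # beta-step: same argument with theta fixed.
        lip = 2.0 * (np.linalg.norm(theta.T @ theta, 2) + lam)
        grad = 2.0 * theta.T @ (theta @ beta - X) + 2.0 * lam * beta
        beta = np.maximum(beta - grad / lip, 0.0)     # project onto beta >= 0
        history.append(objective(X, theta, beta, lam))
    return theta, beta, history

# Toy counts (assumed data).
X = np.array([[3, 2, 0, 0, 0],
              [2, 3, 1, 0, 0],
              [0, 0, 0, 4, 2],
              [0, 0, 1, 3, 3]], dtype=float)
theta, beta, history = block_descent(X, k=2)
```

Using a step of one over the block's Lipschitz constant means each half-step cannot increase the objective, which is why `history` never goes up even though the joint problem is not convex.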

Experimental Settings

Data sets
• 20news & la1 (from Weka)
• Word stemming
• Stop word removal

Data set    Num. of documents    Num. of words    Num. of categories
20news      18,744               26,214           20
la1         2,850                13,195           5

Cont’

Baseline methods
• PLSA, LDA, NMF, LapPLSA

Parameter setting
• Low dimension and the weights of the norm regularizers (the values are not preserved in the transcript)

Document classification
• Libsvm, linear kernel
• Training set : testing set = 3:2
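The 3:2 training/testing ratio can be made concrete with a small split helper. This is only a sketch: the slides do not say how documents were assigned to each side, so the random shuffle, the seed, and the name `split_3_to_2` are assumptions.

```python
import random

def split_3_to_2(items, seed=0):
    """Shuffle items and split them into training and testing parts at 3:2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 5          # three fifths go to training
    return shuffled[:cut], shuffled[cut:]

# Document indices for a corpus the size of la1 (2,850 documents).
train_ids, test_ids = split_3_to_2(range(2850))
```

For la1 this gives 1,710 training and 1,140 testing documents; the learned low dimensional vectors of the training documents would then be passed to the linear-kernel SVM.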

Experimental Results

Topics Learned by LLMF over the Two Datasets

† Topic labels are assigned manually according to the top words in each topic

Cont’

Document classification
• LapPLSA and LLMF are better than NMF, PLSA, LDA
• LLMF achieves higher accuracy than all baseline methods on both datasets
• LLMF with different parameter values is consistently better than pure NMF

Conclusion

Conclusions
• We propose a novel method, LLMF, for learning low dimensional representations of documents with local linear constraints.
• LLMF captures the rich geometric information among documents better than methods based on independent pairwise relationships.
• Experiments on the 20news and la1 benchmarks show that the proposed approach learns better semantic representations than the baseline methods.

Future work
• We would extend LLMF to parallel and distributed settings
• It is promising to apply LLMF in recommendation systems

Thanks!!

Q&A