Sophia (Xueyao) Liang
CPSC 503 Final Project
K = 3 (unsupervised)
$p(z_1 \mid d)$, $p(z_2 \mid d)$, $p(z_3 \mid d)$
Olympic, Vancouver
Snow, cold
Moon light, spider man
      W1   W2   W3   W4   …
D1    1    0    1    1
D2    …    …    …    …
D3    …    …    …    …
…     …    …    …    …
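A binary document-word matrix like the one above could be built along these lines. This is only a minimal sketch: the function name `bag_of_words`, whitespace tokenization, and the three toy documents are assumptions for illustration, and the stop-word removal and stemming used for the real data are not shown.

```python
def bag_of_words(docs):
    """Return (vocab, matrix) with matrix[i][j] = 1 if word j occurs in document i."""
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc.lower().split():
            matrix[i][index[w]] = 1  # binary presence, not a count
    return vocab, matrix


# Toy documents (hypothetical), reusing words from the topic examples.
docs = ["olympic vancouver snow", "snow cold", "olympic cold cold"]
vocab, X = bag_of_words(docs)
for name, row in zip(["D1", "D2", "D3"], X):
    print(name, row)
```

Each row is one document and each column one vocabulary word, matching the D-by-W layout of the slide.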
$z_k \in \{z_1, z_2, \dots, z_N\}$

$$L = \sum_{i} \sum_{j} \ln p(d_i, w_j)$$

$$p(d_i, w_j) = \sum_{k} p(z_k)\, p(d_i \mid z_k)\, p(w_j \mid z_k)$$

Expectation:

$$p(z_k \mid d_i, w_j) = \frac{p(z_k)\, p(d_i \mid z_k)\, p(w_j \mid z_k)}{\sum_{k'} p(z_{k'})\, p(d_i \mid z_{k'})\, p(w_j \mid z_{k'})}$$

Maximization:

$$p(w_j \mid z_k) = \frac{\sum_{i} p(z_k \mid d_i, w_j)}{\sum_{i, j'} p(z_k \mid d_i, w_{j'})}$$

$$p(d_i \mid z_k) = \frac{\sum_{j} p(z_k \mid d_i, w_j)}{\sum_{i', j} p(z_k \mid d_{i'}, w_j)}$$

$$p(z_k) = \frac{\sum_{i, j} p(z_k \mid d_i, w_j)}{\sum_{i, j, k'} p(z_{k'} \mid d_i, w_j)}$$
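The expectation and maximization updates of PLSA can be sketched in plain Python as follows. This is an illustrative sketch, not the project's implementation: the names `plsa` and `log_likelihood`, the random initialization, the fixed iteration count, and the toy matrix are all assumptions. The sums run over observed (document, word) pairs and are weighted by `X[i][j]`, which for a binary matrix reduces to an unweighted sum over observed pairs.

```python
import math
import random


def plsa(X, K, iters=50, seed=0):
    """Fit symmetric PLSA by EM; X[i][j] > 0 marks word j occurring in document i."""
    rng = random.Random(seed)
    D, W = len(X), len(X[0])

    def random_dist(n):
        v = [rng.random() + 1e-3 for _ in range(n)]
        s = sum(v)
        return [x / s for x in v]

    pz = random_dist(K)                         # pz[k]      = p(z_k)
    pd_z = [random_dist(D) for _ in range(K)]   # pd_z[k][i] = p(d_i | z_k)
    pw_z = [random_dist(W) for _ in range(K)]   # pw_z[k][j] = p(w_j | z_k)
    pairs = [(i, j) for i in range(D) for j in range(W) if X[i][j] > 0]

    for _ in range(iters):
        # Expectation: posterior p(z_k | d_i, w_j) for every observed pair.
        post = {}
        for i, j in pairs:
            joint = [pz[k] * pd_z[k][i] * pw_z[k][j] for k in range(K)]
            s = sum(joint)
            post[i, j] = [v / s for v in joint]

        # Maximization: re-estimate all three distributions from the posteriors.
        totals = []
        for k in range(K):
            d_sum, w_sum = [0.0] * D, [0.0] * W
            for i, j in pairs:
                r = X[i][j] * post[i, j][k]
                d_sum[i] += r
                w_sum[j] += r
            total = sum(d_sum)
            pd_z[k] = [v / total for v in d_sum]
            pw_z[k] = [v / total for v in w_sum]
            totals.append(total)
        grand = sum(totals)
        pz = [t / grand for t in totals]
    return pz, pd_z, pw_z


def log_likelihood(X, pz, pd_z, pw_z):
    """L = sum over observed (i, j) of X[i][j] * ln p(d_i, w_j)."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x > 0:
                p = sum(pz[k] * pd_z[k][i] * pw_z[k][j] for k in range(len(pz)))
                total += x * math.log(p)
    return total


# Toy binary document-word matrix (hypothetical data).
X = [[1, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
pz, pd_z, pw_z = plsa(X, K=2, iters=30)
print(pz, log_likelihood(X, pz, pd_z, pw_z))
```

Because each EM iteration cannot decrease the log-likelihood, comparing `log_likelihood` after 1 and after 30 iterations from the same seed is a quick sanity check.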
Document-document matrix, modeled by $p(d_i, c_j)$:

      D1   D2   D3   D4   …
D1    1    0    1    1
D2    …    …    …    …
D3    …    …    …    …
…     …    …    …    …

Document-word matrix, modeled by $p(d_i, w_j)$:

      W1   W2   W3   W4   …
D1    1    0    1    1
D2    …    …    …    …
D3    …    …    …    …
…     …    …    …    …
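The document-document matrix can be filled from (ID, cited ID) pairs. The sketch below assumes the pairs are already loaded into memory; the function name `citation_matrix`, the paper IDs, and the choice to drop citations whose cited paper is not in the corpus are illustrative assumptions, not the project's actual preprocessing.

```python
def citation_matrix(paper_ids, citations):
    """Return (C, dropped): C[i][j] = 1 if paper i cites paper j, both in the corpus."""
    index = {pid: i for i, pid in enumerate(paper_ids)}
    n = len(paper_ids)
    C = [[0] * n for _ in range(n)]
    dropped = 0
    for citing, cited in citations:
        if citing in index and cited in index and citing != cited:
            C[index[citing]][index[cited]] = 1
        else:
            dropped += 1  # e.g. the cited paper is not in the corpus
    return C, dropped


# Hypothetical IDs; "p9" stands for a cited paper missing from the corpus.
paper_ids = ["p1", "p2", "p3"]
citations = [("p1", "p2"), ("p1", "p3"), ("p3", "p9")]
C, dropped = citation_matrix(paper_ids, citations)
print(C, dropped)
```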
Edge weights $w(d_i, d_{i'})$:

1. $w(d_i, d_{i'}) = 1$ if $d_i$ and $d_{i'}$ are linked, and $w(d_i, d_{i'}) = 0$ otherwise.

2. $$w(d_i, d_{i'}) = \frac{C}{|I(d_i)|\,|I(d_{i'})|} \sum_{m=1}^{|I(d_i)|} \sum_{n=1}^{|I(d_{i'})|} w\big(I_m(d_i), I_n(d_{i'})\big)$$
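The recursive weight in option 2 has the same shape as SimRank, so it can be approximated by fixed-point iteration. The sketch below rests on several assumptions not stated on the slide: $I(d)$ is read as the list of documents linking to $d$, $w(d, d) = 1$ is used as the base case, pairs where either document has no in-links get weight 0, and `C=0.8` and `iters=10` are arbitrary illustrative values.

```python
def simrank_weights(in_links, C=0.8, iters=10):
    """Iterate w(a, b) = C / (|I(a)| |I(b)|) * sum over m in I(a), n in I(b) of w(m, n).

    in_links[a] lists the documents linking to document a (assumed meaning of I).
    """
    n = len(in_links)
    # Base case (assumption): every document is fully similar to itself.
    w = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        new = [[0.0] * n for _ in range(n)]
        for a in range(n):
            for b in range(n):
                if a == b:
                    new[a][b] = 1.0
                elif in_links[a] and in_links[b]:
                    total = sum(w[m][k] for m in in_links[a] for k in in_links[b])
                    new[a][b] = C * total / (len(in_links[a]) * len(in_links[b]))
        w = new
    return w


# Toy graph: document 0 links to documents 1 and 2.
in_links = [[], [0], [0]]
w = simrank_weights(in_links)
print(w[1][2])
```

In the toy graph, documents 1 and 2 share their only in-link, so their weight equals `C`, while document 0 has no in-links and gets weight 0 with the others.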
$$O = -(1 - \lambda) \cdot L + \lambda \cdot R$$

$$R = \sum_{i, i'} w(d_i, d_{i'}) \sum_{k} \big(p(z_k \mid d_i) - p(z_k \mid d_{i'})\big)^2 \quad (i \ne i')$$

$$p(z_k \mid d_i) \propto p(z_k) \cdot p(d_i \mid z_k)$$
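The regularizer and the combined objective can be evaluated as sketched below. The function names and toy values are hypothetical. The sign convention, with O minimized so that a higher log-likelihood L and a lower R are both rewarded, follows the NetPLSA formulation and is an assumption here.

```python
def topic_posteriors(pz, pd_z):
    """p(z_k | d_i) proportional to p(z_k) * p(d_i | z_k), normalized over k."""
    K, D = len(pz), len(pd_z[0])
    out = []
    for i in range(D):
        scores = [pz[k] * pd_z[k][i] for k in range(K)]
        s = sum(scores)
        out.append([v / s for v in scores])
    return out


def regularizer(w, pz_d):
    """R = sum over i != i' of w[i][i'] * sum_k (p(z_k|d_i) - p(z_k|d_i'))^2."""
    D = len(pz_d)
    R = 0.0
    for i in range(D):
        for j in range(D):
            if i != j and w[i][j]:
                R += w[i][j] * sum((a - b) ** 2 for a, b in zip(pz_d[i], pz_d[j]))
    return R


def objective(L, R, lam):
    """O = -(1 - lam) * L + lam * R (assumed sign convention; smaller is better)."""
    return -(1.0 - lam) * L + lam * R


# Toy example: two linked documents that belong to different topics.
pz = [0.5, 0.5]
pd_z = [[1.0, 0.0], [0.0, 1.0]]   # pd_z[k][i] = p(d_i | z_k)
w = [[0, 1], [1, 0]]
pz_d = topic_posteriors(pz, pd_z)
R = regularizer(w, pz_d)
O = objective(L=-10.0, R=R, lam=0.5)
print(pz_d, R, O)
```

Linked documents with different topic distributions raise R, which is what penalizes topic assignments that disagree with the network.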
Efficient Algorithm:
Expectation (PLSA)
Maximization (PLSA)
The result of the previous steps may not yield a better value of O
Parameter Inference: No closed-form solution for the maximization step
$$p(d_i \mid z_k) \leftarrow (1 - \gamma)\, p(d_i \mid z_k) + \gamma\, \frac{\sum_{i'} w(d_i, d_{i'})\, p(d_{i'} \mid z_k)}{\sum_{i'} w(d_i, d_{i'})}$$
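One pass of this neighbor-smoothing update might look like the sketch below. The name `smooth_pd_z`, the symbol `gamma` for the mixing coefficient, and the rule that a document with no weighted neighbors keeps its value are assumptions. The rows are not renormalized afterwards, since the update above does not show that step.

```python
def smooth_pd_z(pd_z, w, gamma):
    """Mix each p(d_i | z_k) with the weighted average over the neighbors of d_i."""
    D = len(w)
    smoothed = []
    for row in pd_z:                      # row[i] = p(d_i | z_k) for one topic k
        new_row = []
        for i in range(D):
            denom = sum(w[i][j] for j in range(D))
            if denom == 0:
                new_row.append(row[i])    # isolated document: left unchanged (assumption)
                continue
            neighbor_avg = sum(w[i][j] * row[j] for j in range(D)) / denom
            new_row.append((1.0 - gamma) * row[i] + gamma * neighbor_avg)
        smoothed.append(new_row)
    return smoothed


# Toy example: two mutually linked documents and one topic.
w = [[0, 1], [1, 0]]
pd_z = [[0.8, 0.2]]
print(smooth_pd_z(pd_z, w, gamma=0.5))
```

With `gamma=0` the update leaves the PLSA estimate untouched, and larger values pull linked documents toward each other.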
Potential Problems of the Model
Parameter inference: higher time complexity and slower to converge
$O = -(1 - \lambda) \cdot L + \lambda \cdot R$, with $L \approx -10000$ and $R \approx 100$
Cora Data version 1.0
30000 scientific papers, with citation information
Important files:
  papers (ID-name, link, author, …)
  citations (ID-cited ID)
  classifications (link-category)
  directory: extractions (PostScript form of the papers)
Problems:
  Cited paper not in the corpus
  No abstract for some PostScript files
  Too many categories
  Duplicated or isolated papers

Cora Data version 1.0: papers in category Machine Learning
About 2700 papers
1400 frequent words (stop words removed, stemmed)

Category              Papers
Theory                315
Reinforcement         217
Genetic Algorithms    418
Neural Networks       818
Probabilistic         426
Case Based            298
Rule Learning         180
$$\arg\max_{k}\; p(z_k \mid d)$$
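Assigning each document to its most probable topic is a one-line arg max. In the sketch below, `predict_topic` and the posterior values are hypothetical.

```python
def predict_topic(pz_d_row):
    """Return the index k that maximizes p(z_k | d) for one document."""
    return max(range(len(pz_d_row)), key=lambda k: pz_d_row[k])


# Toy posteriors p(z_k | d) for two documents and three topics.
pz_d = [[0.2, 0.7, 0.1],
        [0.6, 0.3, 0.1]]
labels = [predict_topic(row) for row in pz_d]
print(labels)
```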
[Figure: Accuracy and recall for each category. (A) Accuracy; (B) Recall]

Overall Accuracy
PHITS    PLSA     NetPLSA
0.470    0.501    0.562
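Overall accuracy and per-category recall can be computed as sketched below. The sketch assumes each predicted topic has already been mapped to a category name. How that mapping was done is not stated here, and the labels are toy values, not the reported results.

```python
from collections import Counter


def overall_accuracy(true_labels, predicted):
    """Fraction of documents whose predicted category equals the true one."""
    hits = sum(t == p for t, p in zip(true_labels, predicted))
    return hits / len(true_labels)


def recall_per_category(true_labels, predicted):
    """For each true category: correctly predicted documents / documents in it."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Toy labels (hypothetical).
true_labels = ["Theory", "Theory", "Neural Networks", "Neural Networks"]
predicted = ["Theory", "Neural Networks", "Neural Networks", "Neural Networks"]
print(overall_accuracy(true_labels, predicted))
print(recall_per_category(true_labels, predicted))
```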
Supported the claim that adding network structure to the model can improve topic modeling results
Modeled the network at the level of individual articles
The chosen framework has inherent problems
The results are still far from satisfactory
How to model the network structure of blog articles, especially when modeling it at the level of individual articles
Bag-of-words matrix extraction
Better integrated model, maybe LDA based
Efficiency of the algorithm
Recommendation based on topic community discovery