![Page 1](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/1.jpg)
Term Frequency Analytics and Document Clustering
Thomas Jones DC NLP Meetup 10/09/2013
![Page 2](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/2.jpg)
Who Am I and What Do I Do?
• Statistician at IDA/Science and Technology Policy Institute (STPI) since January
• Stats/econometrics (professionally) since early 2008
• Former enlisted infantry Marine – (but now I only shoot eigenvalues)
![Page 3](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/3.jpg)
Who is This Talk For?
![Page 4](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/4.jpg)
The Library of Babel
http://www.betaversion.org/~stefano/linotype/news/26/
![Page 5](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/5.jpg)
Frequency Analysis in 3 Steps
1. Data Curation
   a. Remove stop words and other terms/symbols/numbers
   b. Count words/n-grams and re-weight
   c. Calculate distance/similarity between documents
2. Preliminary visualization
   a. Plot a nearest neighbor network
3. Cluster analysis
   a. Choose your favorite algorithm
   b. Find the most frequent terms in a cluster
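Steps 1a and 1b can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the speaker's code: the stop-word list, the `tokenize` and `curate` helpers, and the sample title are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops numbers and symbols)."""
    return re.findall(r"[a-z]+", text.lower())

def curate(text, n=1):
    """Return n-gram counts for one document after stop-word removal."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

# Made-up bill title, used only to exercise the helpers.
title = "A bill to authorize the appointment of 12 officers in the Navy"
unigrams = curate(title)       # single-word counts
bigrams = curate(title, n=2)   # two-word phrase counts
```

Here n-grams are built after stop words are dropped, so "officers navy" counts as a bigram. Building them before removal is an equally reasonable choice.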
![Page 6](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/6.jpg)
The Document Term Matrix
[Figure: a document term matrix. Each row is an individual document, each column is a term, and each cell holds a raw count.]
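A document term matrix can be assembled by hand as a list of rows. The sketch below assumes a toy three-document corpus invented for illustration and stores the matrix densely. A corpus of realistic size would normally call for a sparse representation.

```python
from collections import Counter

# Toy corpus; the document texts are made up for illustration.
docs = [
    "navy officers appointments",
    "marine corps band",
    "navy marine corps",
]

# Per-document raw term counts.
counts = [Counter(d.split()) for d in docs]

# One column per distinct term, in alphabetical order.
vocab = sorted(set().union(*counts))

# dtm[i][j] = raw count of vocab[j] in docs[i]; absent terms count as 0.
dtm = [[c[term] for term in vocab] for c in counts]
```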
![Page 7](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/7.jpg)
Texts as Points in Space
[Figure: scatter plot of documents as points. Axes: Hummus and Cheeseburgers, each from 0 to 12.]
![Page 8](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/8.jpg)
“Distance” Between Documents
[Figure: scatter plot on Hummus and Cheeseburgers axes (0 to 12) with three documents labeled A, B, and C.]
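To make "distance" concrete, the sketch below compares Euclidean distance with cosine similarity for three documents. The coordinates for A, B, and C are hypothetical, since the slide's actual values are not recoverable from the transcript, and the point the original figure makes may differ.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical coordinates: (count of "hummus", count of "cheeseburgers").
A = (2, 1)
B = (10, 5)  # same mix of terms as A, but a much longer document
C = (1, 2)   # similar length to A, but the opposite mix
```

With these made-up points, Euclidean distance puts A closer to C, while cosine similarity treats A and B as pointing in the same direction. Cosine similarity ignores document length and compares only the mix of terms.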
![Page 9](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/9.jpg)
Which Words Contain Information? The TF-IDF Frequency Weights
[Figure: Inverse Document Frequency Weight (y-axis, 0 to 12) plotted against Number of Documents in Which a Term Appears (x-axis, 1 to 1293).]
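The shape of that curve follows from the usual definition of the weight, sketched below. The transcript does not say which logarithm base or smoothing the speaker used, so the plain natural-log form `log(N / df)` here is an assumption, as are the helper names and the corpus size of 1000.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency as log(N / df). Natural log is an assumption."""
    return math.log(n_docs / doc_freq)

def tf_idf(row, doc_freqs, n_docs):
    """Re-weight one row of raw counts by each term's IDF."""
    return [tf * idf(n_docs, df) for tf, df in zip(row, doc_freqs)]

n_docs = 1000                      # hypothetical corpus size
rare = idf(n_docs, 1)              # term found in a single document
common = idf(n_docs, 100)          # term found in a tenth of the corpus
everywhere = idf(n_docs, n_docs)   # term found in every document: weight 0.0
```

A term that appears in every document gets a weight of zero, so it contributes nothing to distances between documents.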
![Page 10](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/10.jpg)
VISUAL EXPLORATION
![Page 11](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/11.jpg)
Sample Data
• Source: http://www.congressionalbills.org/
• Titles of 5,000 randomly sampled Congressional bills that were signed into law from the 80th to the 112th Congress
• Used for example visuals only, not a thorough analysis
![Page 12](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/12.jpg)
Nearest Neighbor Networks
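One way to build such a network is to link every document to its most similar neighbor and hand the resulting edge list to a graph-drawing tool. The sketch below is an assumption about the construction, not the speaker's method. The vectors are invented and `nearest_neighbor_edges` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbor_edges(vectors, k=1):
    """Link each document to its k most similar other documents."""
    edges = []
    for i, u in enumerate(vectors):
        sims = [(cosine_similarity(u, v), j)
                for j, v in enumerate(vectors) if j != i]
        sims.sort(reverse=True)  # most similar first
        edges.extend((i, j) for _, j in sims[:k])
    return edges

# Made-up weighted term vectors for four documents.
vectors = [
    (3, 0, 1),
    (2, 0, 1),
    (0, 4, 1),
    (0, 3, 2),
]
edges = nearest_neighbor_edges(vectors)
```

Comparing every pair of documents is quadratic in corpus size, which is workable for a few thousand titles but needs more care beyond that.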
![Page 13](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/13.jpg)
CLUSTER ANALYSIS
![Page 14](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/14.jpg)
Partitional Clustering
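The transcript does not say which partitional algorithm the talk used. As one well-known example, here is a minimal k-means (Lloyd's algorithm) sketch with caller-supplied starting centroids. The points and starting centroids are invented for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = mean(members)
    return labels, centroids

# Two made-up, well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
```

If the document vectors are first scaled to unit length, Euclidean distance ranks neighbors the same way cosine similarity does, so this sketch stays compatible with a cosine-based workflow.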
![Page 15](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/15.jpg)
What Does a Cluster Represent?
• Clusters are groups of documents.
• Documents are grouped around a co-occurrence of terms (TF-IDF)
• Manual inspection of documents augments analyses.
Bills Pertaining to the Navy and Marine Corps
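| Term | Freq |
|---|---|
| navy | 100 |
| corps | 86 |
| medical | 44 |
| marine | 37 |
| appointments | 36 |
| officers | 35 |
| army | 34 |
| band | 29 |
| grade | 28 |
| nurse | 26 |
| duty | 24 |
| permanent | 24 |
| united | 23 |
| authorize | 21 |
| states | 21 |
| nurses | 19 |
| career | 18 |
| estates | 18 |
| norfolk | 18 |
| held | 17 |
| members | 16 |
| attendance | 16 |
| force | 16 |
| air | 15 |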
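A table like this can be produced by summing term counts over the documents assigned to one cluster. The sketch below assumes raw counts. The transcript does not say whether the slide's frequencies are raw or weighted, and the documents, labels, and `top_terms` helper are invented for illustration.

```python
from collections import Counter

def top_terms(doc_term_counts, labels, cluster, n=5):
    """Sum term counts over the documents in one cluster and rank them."""
    totals = Counter()
    for counts, label in zip(doc_term_counts, labels):
        if label == cluster:
            totals.update(counts)  # adds this document's counts to the totals
    return totals.most_common(n)

# Made-up per-document term counts and cluster assignments.
docs = [
    Counter({"navy": 2, "officers": 1}),
    Counter({"navy": 1, "corps": 2}),
    Counter({"highway": 2, "funds": 1}),
]
labels = [0, 0, 1]
ranked = top_terms(docs, labels, cluster=0, n=2)
```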
![Page 16](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/16.jpg)
Nearest Neighbor Network of Clusters
![Page 17](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/17.jpg)
Cluster Nearest Neighbor Network
![Page 18](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/18.jpg)
![Page 19](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/19.jpg)
TAKEAWAYS
![Page 20](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/20.jpg)
If you remember nothing else…
• Corpus representation = Document Term Matrix
• Frequency measure = Term Frequency Inverse Document Frequency
• Distance/Similarity measure = Cosine similarity
![Page 21](https://reader033.vdocument.in/reader033/viewer/2022043021/5f3d6efe24c77e021b7f632a/html5/thumbnails/21.jpg)
Pro Tips
• Longer documents are more internally heterogeneous and can be more difficult to cluster meaningfully
• Context-specific dictionaries are helpful.
• Dimensionality (i.e. data size) requires thoughtful programming
• Get more clusters than you think you need and then aggregate them after inspection.
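The last tip can be as simple as a lookup table written after reading through each cluster. The sketch below is one possible approach. The cluster numbers, topic names, and `aggregate_clusters` helper are all hypothetical.

```python
def aggregate_clusters(labels, merge_map):
    """Relabel fine-grained clusters using an analyst-supplied mapping.

    Clusters missing from merge_map keep their original label.
    """
    return [merge_map.get(label, label) for label in labels]

# Made-up fine-grained cluster assignments for six documents.
labels = [0, 1, 2, 3, 1, 0]

# Mapping written by hand after inspecting each cluster's top terms.
merge_map = {0: "military", 1: "military", 2: "transportation"}

merged = aggregate_clusters(labels, merge_map)
```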