Empirical Study of Topic Modeling in Twitter
Liangjie Hong and Brian D. Davison
Computer Science and Engineering
Lehigh University
Bethlehem, PA USA
SOMA 2010
Why do we care about text modeling in Twitter?
• Understanding users’ interests
• Understanding the social network
• Identifying emerging topics
Problems
• Tweets are too short (140 characters)
• Hashtags
• Abbreviations
• Multiple languages
Question
How can we train an “effective” standard topic model?
We found
• Topics learned by different aggregation strategies are substantially different
• Training the model at the user level is faster
• Learned topics can help classification tasks
A quick review of topic models
• LDA
• Author-Topic
Our goal
Obtain topic mixtures for both tweets and users
Training Schemes
• Train on tweets
  • Infer users + tweets
• Train on aggregated tweets (by users)
  • Infer tweets
• Train on aggregated tweets (by terms)
  • Infer users + tweets
• Author-Topic model
  • Infer tweets
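The two aggregation schemes can be illustrated with a minimal Python sketch. The function names, the whitespace tokenization and the toy tweets are assumptions made for illustration, not the authors' code; the resulting documents would then be passed to a standard topic model such as LDA.

```python
from collections import defaultdict

def aggregate_by_user(tweets):
    """Build one training document per user by joining that user's tweets.

    `tweets` is an iterable of (user, text) pairs (hypothetical input format).
    """
    docs = defaultdict(list)
    for user, text in tweets:
        docs[user].append(text)
    return {user: " ".join(texts) for user, texts in docs.items()}

def aggregate_by_term(tweets, terms):
    """Build one training document per term from all tweets containing it."""
    wanted = {t.lower() for t in terms}
    docs = defaultdict(list)
    for _user, text in tweets:
        tokens = set(text.lower().split())  # naive whitespace tokenization
        for term in wanted & tokens:
            docs[term].append(text)
    return {term: " ".join(texts) for term, texts in docs.items()}

# Toy data, invented for this sketch.
tweets = [
    ("jon", "hello world"),
    ("kim", "topic models for twitter"),
    ("jon", "twitter is short text"),
]
user_docs = aggregate_by_user(tweets)
term_docs = aggregate_by_term(tweets, ["twitter"])
```

Aggregating by user yields far fewer, longer documents than training on individual tweets, which is consistent with the faster training reported for the user-level scheme.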
Datasets
• 1,992,758 tweets + 514,130 users
  • 3,697,498 terms
• 274 verified users from Twitter Suggestion
  • 16 categories
  • 50,447 tweets (150 tweets per user)
Tasks
• Topic modeling
• Retweet Prediction
• User & Tweets Topical Classification
Logistic Regression
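As a rough illustration of using topic mixtures as classifier input, the sketch below trains a tiny logistic regression with plain stochastic gradient descent. The two-topic feature vectors, labels, learning rate and epoch count are all invented for this example; the slides do not specify the actual features or training setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss with respect to the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0

# Hypothetical topic mixtures (two topics) with binary class labels.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
preds = [predict(w, b, x) for x in X]
```

In practice each row of `X` would be the topic mixture inferred for a tweet or a user, possibly combined with other features.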
Topic Modeling
Retweet Prediction
Positive examples
@Jon Hello World (2009-11-01 13:15pm)
Hello World (2009-11-01 12:00pm)
@Kim @Jon Hello World (2009-11-01 13:23pm)
@Frank @Kim @Jon Hello World (2009-11-01 17:49pm)
Negative examples
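The example messages can be taken apart with a short Python sketch. It assumes that the leading @-names in the slide's notation record the chain of users a message passed through; `split_retweet` is a hypothetical helper, and how messages are then labeled as positive or negative is not stated here and is left out.

```python
def split_retweet(text):
    """Separate leading @-mentions from the message body.

    Returns (chain, message), where `chain` lists the mentioned user names
    in order. Assumes whitespace-separated tokens (an illustrative choice).
    """
    tokens = text.split()
    chain = []
    i = 0
    while i < len(tokens) and tokens[i].startswith("@"):
        chain.append(tokens[i][1:])  # drop the leading "@"
        i += 1
    return chain, " ".join(tokens[i:])

chain, message = split_retweet("@Frank @Kim @Jon Hello World")
original = split_retweet("Hello World")
```

Messages that share the same body after stripping the chain can be grouped to trace how one tweet spread over time.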
Tweets Classification
User Classification
Conclusion
• User-level aggregation is helpful
  • Fast, with good results
• Author-Topic model does not directly apply
• Topic modeling can help other tasks
  • Tweet classification
Thank you, and thanks to the IBM Travel Grant!
Contact Info:
Liangjie [email protected] Laboratory
Computer Science and Engineering
Lehigh University
Bethlehem, PA 18015 USA