TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR
October 13-16, 2015 • Austin, TX
TweetMogaz: The Arabic Tweets Platform
Ahmed Adel
Team Lead, BADR
Who Am I?
• B.Sc. in Engineering from Alexandria University
• BADR Co-Founder
• Now: Part-time Team Lead @ BADR
• 8+ years experience in software development
• Mainly Java, JavaScript
• Solr, Hadoop, Hive, ...
BADR
• Established software house in Egypt
• Founded in 2006
• Provides Big Data consulting services and solutions
• Machine Learning, NLP, Data Science, ...
• Hadoop, Solr, Spark, Hive, Flume, Incorta, ...
Agenda
• What is TweetMogaz
• System Modules
  • Tweets processing
  • Indexing
  • Event detection
  • Archivers
  • …
• System Architecture
• Tricks and Challenges
• What's Next
What Is TweetMogaz?
• Innovation and applied research project @ BADR
• Portal for browsing, filtering and searching Arabic tweets
• ... and event detection
• Based on several research papers
  • Magdy W., A. Ali, and K. Darwish. A Summarization Tool for Time-Sensitive Social Media. CIKM 2012
  • Magdy W. TweetMogaz: A News Portal of Tweets. SIGIR 2013
  • Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
  • Magdy W. and T. Elsayed. Adaptive Method for Following Dynamic Topics on Twitter. ICWSM 2014
Why Arabic
• 230 million speakers
• 6th largest in the world (native + 2nd)
• One of the 6 UN official languages
[Bar chart: speakers in millions (native and 2nd language), scale 0 to 1,200, for Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, German, Bengali, Portuguese, and Japanese]
Main Features
• Classifying
• Browsing
• Searching
• Event detection
• Time machine
System Modules
• Tweets processing module
• Indexing module
• Event detection module
  • Events
  • Active hashtags
• WordCloud generator
• Archivers
  • Short-term
  • Long-term
• Analytics
Tweets Processing Module
Tweets Processing Module
• Retrieves tweets (streams and search queries)
• Filters out inappropriate tweets
• Text pre-processing
  • Normalization
    • ي ، ى
    • أ ، ا ، إ ، آ
    • ه ، ة
    • Kashida: ـ ، ْ
  • Removing stop-words
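The normalization and stop-word steps above could be sketched roughly as follows. This is a minimal illustration, not the TweetMogaz implementation: the direction of each letter mapping (which variant is treated as canonical), the tiny stop-word list, and the function names `normalize` and `preprocess` are all assumptions.

```python
# Assumed mapping direction: each listed variant is folded into one canonical letter.
_LETTER_MAP = str.maketrans({
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0640": None,      # kashida (tatweel) is dropped
    "\u0652": None,      # sukun diacritic is dropped
})


def normalize(text):
    """Fold letter variants and strip kashida and sukun."""
    return text.translate(_LETTER_MAP)


# Hypothetical, tiny stop-word list ("in", "from", "on"), normalized the same way
# as the tweet text so that lookups match.
STOP_WORDS = frozenset(
    normalize(word)
    for word in ("\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649")
)


def preprocess(text):
    """Normalize, split on whitespace, and drop stop-words."""
    return [token for token in normalize(text).split() if token not in STOP_WORDS]
```

Normalizing the stop-word list with the same function keeps the two steps consistent regardless of which mapping direction is chosen.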
Tweets Processing Module
• Classification at indexing time
  • Multiple classes map to a multi-valued field (politics, sport, religion, etc.)
  • Boolean classifier
  • Adaptive classifier (Naïve Bayes/SVM, experimental)
• Scoring at indexing time
  • Recent (date): latest tweets in a specific category
  • Top (score field): trending tweets (high retweet rate in the past 48 hours)
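A Boolean classifier feeding a multi-valued field might look like the sketch below. The keyword sets, the class names beyond those on the slide, and the function name `classify` are hypothetical; the slides do not describe the actual rules.

```python
# Hypothetical keyword rules; a real deployment would use Arabic terms.
CLASS_KEYWORDS = {
    "politics": {"election", "parliament"},
    "sport": {"match", "league"},
}


def classify(tokens):
    """Return every class whose keyword set intersects the tweet's tokens.

    The returned list is what would be written to a multi-valued field,
    so a tweet can belong to several classes at once.
    """
    token_set = set(tokens)
    return sorted(
        name for name, keywords in CLASS_KEYWORDS.items() if keywords & token_set
    )
```

For example, `classify(["election", "match"])` yields both `politics` and `sport`, while a tweet matching no rule gets an empty list.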
Score
[Chart: score (0 to 0.018) plotted against tweet age in seconds (0 to 39k)]
Indexing Module
Indexing Module
• Responsible for indexing tweets to the corresponding Solr cores
• Realtime core (< 10 mins)
• Up to 48 hours cores
  • Media: photos, videos
  • Text only and text that contains links
  • All tweets
• Short-term archive cores (> 48 hours and < 30 days)
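One way to picture the routing of a tweet to cores by age is the sketch below. The core names are invented, and two points are assumptions rather than facts from the slides: that a tweet younger than 10 minutes also lives in the 48-hour cores, and that anything older than 30 days belongs to a long-term archive.

```python
from datetime import timedelta


def cores_for(age, has_media):
    """Return the (hypothetical) core names that should hold a tweet of this age."""
    if age > timedelta(days=30):
        return ["long_term_archive"]
    if age > timedelta(hours=48):
        return ["short_term_archive"]
    # Within 48 hours: the "all tweets" core plus the media or text core.
    cores = ["recent_all", "recent_media" if has_media else "recent_text"]
    if age < timedelta(minutes=10):
        cores.insert(0, "realtime")
    return cores
```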
Event Detection Module
Event Detection Module
• Responsible for detecting events
  • Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Feature-pivot (term) approach
Event Detection Module
• Clusters are created based on a distance threshold (fuzzy clusters)
  • Distance threshold 0.4 (experimental)
• In an 8-hour window
  • Faceting on processed text using min_count
  • Builds facets for stems
  • For each facet, calculate the distance to all other facets: O(n²)
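The pairwise, threshold-based clustering can be sketched as below. The slides do not name the distance measure, so Jaccard distance over the sets of tweet IDs containing each stem is an assumption, as are the function names; only the 0.4 threshold and the O(n²) comparison come from the slide.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard overlap of two sets (assumed distance measure)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def fuzzy_clusters(term_docs, threshold=0.4):
    """Group stems whose tweet sets lie within the distance threshold.

    term_docs maps each stem (facet) to the set of tweet IDs containing it.
    Every stem seeds its own cluster, so a stem may appear in several
    clusters (fuzzy membership). Comparing all pairs is O(n^2).
    """
    terms = sorted(term_docs)
    clusters = set()
    for seed in terms:
        members = frozenset(
            other
            for other in terms
            if jaccard_distance(term_docs[seed], term_docs[other]) <= threshold
        )
        if len(members) > 1:  # ignore stems that cluster only with themselves
            clusters.add(members)
    return clusters
```

Storing clusters as frozensets in a set removes exact duplicates that arise when two seeds produce the same membership.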
Event Detection Module
• Cluster enrichment
  • Enhancing clusters with fewer than 6 terms
  • Running a Solr AND query with all keywords and selecting the terms with the highest TF-IDF to enrich the cluster
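A rough sketch of the enrichment step follows. The field name `text`, the helper names, and the choice to pad a cluster up to exactly 6 terms are assumptions; computing TF-IDF over the query results is left out, and the scores are passed in as a plain dictionary. Keywords are assumed to contain no double quotes.

```python
def build_and_query(keywords, field="text"):
    """Build a Solr query string requiring all keywords, e.g. text:("a" AND "b")."""
    joined = " AND ".join(f'"{keyword}"' for keyword in keywords)
    return f"{field}:({joined})"


def enrich(cluster, tfidf_scores, target_size=6):
    """Pad a small cluster with the highest-scoring terms not already in it."""
    if len(cluster) >= target_size:
        return list(cluster)
    candidates = sorted(
        (term for term in tfidf_scores if term not in cluster),
        key=lambda term: (-tfidf_scores[term], term),
    )
    return list(cluster) + candidates[: target_size - len(cluster)]
```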
Event Detection Module
• Cluster de-duplication over time
  • Search using the cluster keywords of each detected cluster
  • For each response result, build a stem frequency vector
  • Compare the two vectors for similarity (cosine = 0.5, experimental)
  • Old clusters are updated to maintain the chronological order of events
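The vector comparison can be illustrated with a small cosine-similarity sketch over stem frequency counts. Treating a similarity at or above 0.5 as a duplicate is an assumption about how the threshold is applied, and the function names are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two stem frequency vectors (Counters)."""
    dot = sum(count * b[stem] for stem, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_event(stems_a, stems_b, threshold=0.5):
    """Treat two clusters as the same event if their stem vectors are similar enough."""
    return cosine_similarity(Counter(stems_a), Counter(stems_b)) >= threshold
```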
Event Detection Module
• Relevant tweets retrieval
  • Query against the 48-hour cores
Event Detection Module
• Active hashtag detection
  • Separate field added at index time
  • Stored in the events core with type hashtag
  • Build normalized top hashtag facets every 24 hours for the past week
  • Query Solr for hashtags older than 1 week and eliminate them
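As an in-memory illustration of the idea (in the real system this is done with Solr facets), the sketch below extracts hashtags and ranks those seen within the past week. The regular expression, the `(created_at, text)` tuple layout, and the function names are assumptions.

```python
import re
from collections import Counter
from datetime import timedelta

# \w matches Unicode letters in Python 3, so Arabic hashtags are covered.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a tweet, without the leading '#'."""
    return HASHTAG_RE.findall(text)


def active_hashtags(tweets, now, window=timedelta(days=7), top_n=10):
    """Rank hashtags by frequency, ignoring tweets older than the window.

    tweets is an iterable of (created_at, text) pairs.
    """
    counts = Counter()
    for created_at, text in tweets:
        if now - created_at <= window:
            counts.update(extract_hashtags(text))
    return [tag for tag, _ in counts.most_common(top_n)]
```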
WordCloud
Word Cloud: Bi-gram detection
• Facet for a specific class
• Facets next to each other, with a specific threshold, tend to be a bi-gram
• For example: ريال مدريد - كأس العالم (Real Madrid - World Cup)
• min_count applies
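The slide does not spell out what "next to each other, with a specific threshold" means. One possible reading, sketched below purely as an assumption, is that facets adjacent in the count-sorted facet list with nearly equal counts are bi-gram candidates. The parameter names and default values are hypothetical.

```python
def candidate_bigrams(facet_counts, min_count=5, max_rel_diff=0.1):
    """Pair up adjacent facets (sorted by count) whose counts are nearly equal.

    facet_counts maps a term to its facet count for one class.
    Note: this yields candidate pairs only; it does not recover word order.
    """
    ranked = sorted(
        ((term, count) for term, count in facet_counts.items() if count >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
    pairs = []
    for (term_a, count_a), (term_b, count_b) in zip(ranked, ranked[1:]):
        if (count_a - count_b) / count_a <= max_rel_diff:
            pairs.append((term_a, term_b))
    return pairs
```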
Archiving Module
Archiving Module
• Why?
  • Space is finite!
  • Faster performance when searching recent cores
• Short-term archiving
  • Archive tweets that are older than 48 hours
  • Same Solr instance
• Long-term archiving
  • Archive tweets that are older than 30 days
  • Separate Solr instance
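Selecting the tweets each archiver should move could be expressed with Solr date-math range filters, as in this sketch. The field name `created_at` and the helper name are assumptions; the 48-hour and 30-day boundaries come from the slide.

```python
def archive_filter(field, older_than, newer_than=None):
    """Build a Solr range filter using date math, e.g. created_at:[* TO NOW-30DAYS]."""
    lower = f"NOW-{newer_than}" if newer_than else "*"
    return f"{field}:[{lower} TO NOW-{older_than}]"


# Tweets due for short-term archiving: older than 48 hours, newer than 30 days.
SHORT_TERM_FQ = archive_filter("created_at", "48HOURS", "30DAYS")
# Tweets due for long-term archiving: older than 30 days.
LONG_TERM_FQ = archive_filter("created_at", "30DAYS")
```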
System Architecture
System Architecture
• SolrCloud
  • 2 shards
  • Replication factor of 2
  • ZooKeeper ensemble for distribution management
  • SolrJ API
• Front-end
  • Node.js
  • AngularJS (web and mobile web)
• Long-term archive
  • Separate Solr instance
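For reference, a collection with this layout can be requested through the Solr Collections API. The sketch below only builds the request URL; the host, port, and collection name are placeholders, not values from the talk.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=2, replication_factor=2):
    """Build a Collections API CREATE request for a sharded, replicated collection."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return f"{base_url}/admin/collections?{params}"


# Placeholder host and collection name.
url = create_collection_url("http://localhost:8983/solr", "tweets")
```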
Analytics and Visualization
Analytics and Visualization
• Banana dashboards
  • Deployed on both realtime and archive
  • Insights on the tweet distribution per class, trends over time of specific search queries
  • Realtime on production with the 'Auto-refresh' feature
  • Users with highest retweets
Challenges and Tricks
Tricks
• Archiving
  • Initially developed on Solr 4.4
  • Upgraded to 4.7+ for deep paging
• Archivers syncing
  • Short-term is writing and long-term is reading
  • Have to sync in case of deep paging
[Diagram: short-term cores, long-term cores, short-term archiver (W), long-term archiver (R)]
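Deep paging in Solr 4.7+ is done with the `cursorMark` parameter: the first request sends `cursorMark=*`, the sort must include the unique key, and paging ends when the returned `nextCursorMark` equals the one that was sent. The loop below is a minimal sketch; `fetch_page` is a hypothetical callback that performs the actual Solr request and returns the parsed JSON response.

```python
def iter_all_docs(fetch_page, rows=100):
    """Yield every document of a result set using cursor-based deep paging.

    fetch_page(cursor_mark, rows) is a hypothetical callback that sends the
    query with a sort on the unique key plus the given cursorMark, and
    returns the decoded JSON response as a dict.
    """
    cursor = "*"
    while True:
        response = fetch_page(cursor, rows)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
```

Because the cursor reflects index state, a reader paging while another process writes to the same cores may miss or repeat documents, which is consistent with the need to sync the two archivers.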
Challenges
• Twitter (micro-blogs): very short text
• Arabic has many dialects: colloquial, formal, regional variations
Next Steps
Next Steps
• Integrating an adaptive classifier that can handle the characteristics of micro-blogs
• Search query trend over time
• Engage system users
• Integrate R for statistical processing (classification, detection, …)