EVENT IDENTIFICATION IN SOCIAL MEDIA
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
Social Media Sites Host Many “Event” Documents
Photo-sharing: Flickr
Video-sharing: YouTube
Social networking: Facebook
“Event”= something that occurs at a certain time in a certain place [Yang et al. ’99]
Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade
Smaller events, without traditional news coverage: local food drive, street fair
…
Social media documents for “All Points West” festival, Liberty State Park, New Jersey, 8/8/08
Identifying Events and Associated Social Media Documents
Applications
  Event search and browsing
  Local search
  …
General approach: group similar documents via clustering
  Each cluster corresponds to one event and its associated social media documents
Event Identification: Challenges
Uneven data quality
  Missing, short, uninformative text
  … but revealing structured context available: tags, date/time, geo-coordinates
Scalability
Dynamic data stream of event information
Unknown number of events
  Necessary for many clustering algorithms
  Difficult to estimate
Clustering Social Media Documents
Social media document representation
Social media document similarity
Social media document clustering
  Clustering task: definition
  Ensemble algorithm: combining multiple clustering results
  Preliminary evaluation
Social Media Document Representation
Title
Description
Tags
Date/Time
Location
All-Text
Social Media Document Similarity
Text: tf-idf weights, cosine similarity
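A minimal sketch of the text similarity named above, assuming whitespace-tokenized text, raw term frequency, and a smoothed idf of log(1 + N/df). The slide does not specify the exact weighting, and the function names `tfidf_vectors` and `cosine` and the sample tags are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times smoothed idf.
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tag sets for three photos.
tags = [
    "all points west festival".split(),
    "all points west liberty state park".split(),
    "thanksgiving day parade".split(),
]
vecs = tfidf_vectors(tags)
sim_same = cosine(vecs[0], vecs[1])  # photos sharing several tags
sim_diff = cosine(vecs[0], vecs[2])  # photos sharing no tags
```

The same two functions could be applied separately to each textual feature (title, description, tags, all-text), giving one similarity score per feature.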
[Figure: document representations used for similarity: Title, Description, Tags, All-Text, Date/Time-Keywords, Date/Time-Proximity, Location-Keywords, Location-Proximity]
Location: geo-coordinate proximity
Time: proximity in minutes
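The two proximity measures above might be sketched as follows. The slide only states "proximity in minutes" and "geo-coordinate proximity", so the linear mapping to [0, 1], the one-year time window, and the use of haversine distance are all assumptions, and the coordinates are made up for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def time_similarity(t1, t2, window_minutes=365 * 24 * 60):
    """Map the gap between two timestamps, in minutes, to a score in [0, 1].

    The one-year default window is an assumption, not a value from the slides.
    """
    gap_minutes = abs((t1 - t2).total_seconds()) / 60.0
    return max(0.0, 1.0 - gap_minutes / window_minutes)

def haversine_km(p1, p2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def location_similarity(p1, p2, max_km=math.pi * EARTH_RADIUS_KM):
    """Map distance to a score in [0, 1]; max_km defaults to half the Earth's circumference."""
    return max(0.0, 1.0 - haversine_km(p1, p2) / max_km)

# Illustrative values only.
half_hour = time_similarity(datetime(2008, 8, 8, 18, 0),
                            datetime(2008, 8, 8, 18, 30),
                            window_minutes=60)
near = location_similarity((40.70, -74.05), (40.71, -74.04))
far = location_similarity((40.70, -74.05), (34.05, -118.24))
```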
Social Media Document Clustering Framework
[Figure: Social media documents → Document feature representation → Event clusters]
Consensus function: combine ensemble similarities
Clustering: Ensemble Algorithm
Clusterers: C_title, C_tags, C_time
Weights: W_title, W_tags, W_time (learned in a training step)
Consensus function: f(C, W)
Ensemble clustering solution
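One way to read f(C, W) is as a weighted average of binary same-cluster votes, one vote per clusterer. The sketch below assumes that reading; the weight values, the cluster labels, and the names `same_cluster_votes` and `consensus` are hypothetical.

```python
def same_cluster_votes(clusterings, i, j):
    """For each clusterer, vote 1 if documents i and j share a cluster, else 0."""
    return {name: int(labels[i] == labels[j])
            for name, labels in clusterings.items()}

def consensus(votes, weights):
    """Weighted average of the binary votes, normalized by the total weight."""
    total = sum(weights[name] for name in votes)
    return sum(weights[name] * votes[name] for name in votes) / total

# Hypothetical cluster labels for three documents under three clusterers.
clusterings = {
    "title": [0, 1, 2],
    "tags":  [0, 0, 1],
    "time":  [0, 0, 0],
}
# Hypothetical weights; in the talk these are learned in a training step.
weights = {"title": 0.5, "tags": 0.9, "time": 0.6}

score_01 = consensus(same_cluster_votes(clusterings, 0, 1), weights)
score_02 = consensus(same_cluster_votes(clusterings, 0, 2), weights)
```

Documents 0 and 1 are grouped together by the two higher-weighted clusterers, so their consensus score comes out higher than that of documents 0 and 2.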
Clustering: Measuring Quality
Homogeneous clusters
Complete clusters
Metric: Normalized Mutual Information (NMI)
  Shared information between clustering solution and “ground truth”
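As a rough illustration of the metric, the sketch below computes NMI as 2·I(C; E) / (H(C) + H(E)). This is one common normalization and is assumed here; the slide does not say which variant was used. The label lists are toy data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments over the same documents."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
               for (x, y), nij in joint.items())

def nmi(predicted, truth):
    """NMI with the assumed normalization 2 * I / (H(predicted) + H(truth))."""
    denom = entropy(predicted) + entropy(truth)
    if denom == 0:
        return 1.0  # both assignments put every document in a single cluster
    return 2.0 * mutual_information(predicted, truth) / denom

perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same grouping, different label names
unrelated = nmi([0, 1, 0, 1], [0, 0, 1, 1])    # grouping independent of the truth
```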
Experimental Setup
Data: >270K Flickr photos
  Event labels from Yahoo!’s “upcoming” event database
  Split into 3 parts for training/validation/testing
Clusterers: single pass algorithm with centroid similarity
Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
Consensus function: weighted average of clusterers’ binary predictions
Final prediction step: single pass clustering algorithm
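A single-pass clusterer with centroid similarity could look roughly like the sketch below. It assumes dense feature vectors, cosine similarity, a fixed similarity threshold, and running-mean centroids; none of these details are given on the slide, and the threshold value is arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_pass_cluster(vectors, threshold):
    """Assign each vector to its most similar centroid, or start a new cluster.

    Each document is seen exactly once, and the number of clusters
    does not need to be known in advance.
    """
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for k, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            # No centroid is similar enough: open a new cluster.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the centroid as the running mean of its members.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
    return labels

# Toy two-dimensional feature vectors; the threshold of 0.8 is arbitrary.
labels = single_pass_cluster([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 0.8)
```

Because the algorithm processes documents in arrival order and creates clusters on demand, it fits the streaming setting and the unknown number of events listed among the challenges.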
Preliminary Evaluation Results
Individual clusterer performance
  Highest NMI: Tags, All-Text
  Lowest NMI: Description, Title
Ensemble performance, compared against all individual clusterers
  Highest overall performance in terms of NMI
  More homogeneous clusters: each event is spread over fewer clusters
Details in paper
Document similarity metric
  Ensemble approach
    Weight assignment
    Choice of clusterers
  Train a classifier to predict document similarity
    Features correspond to similarity scores
      All-text, title, tags, time, location, etc.
      Numeric values in [0,1]
    State-of-the-art classifiers: SVM, Logistic Regression, …
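To make the classifier idea concrete, here is a small sketch that trains a hand-rolled logistic regression on pairwise similarity features. It is only an illustration under stated assumptions: the feature order (text, time, location), the toy training pairs, the learning rate, and the function names are all invented here, and a real system would more likely use an established library implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_same_event(w, b, x):
    """Probability that a document pair with similarity features x shares an event."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by plain stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = predict_same_event(w, b, x) - y  # gradient of log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Toy pairs: (text similarity, time similarity, location similarity), all in [0, 1].
pairs = [
    [0.9, 0.95, 0.9], [0.8, 0.9, 0.85], [0.7, 0.99, 0.95],  # same event
    [0.1, 0.2, 0.3], [0.2, 0.1, 0.1], [0.3, 0.4, 0.2],      # different events
]
same_event = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(pairs, same_event)
p_similar = predict_same_event(w, b, [0.85, 0.9, 0.9])
p_dissimilar = predict_same_event(w, b, [0.15, 0.2, 0.2])
```

The predicted probability could then stand in for the ensemble's consensus score as the document similarity used by the final clustering step.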
Future Work: Alternative Choices
Final clustering step
  Apply graph partitioning algorithms
  Requires estimating the number of clusters
Evaluation metrics: beyond NMI
Datasets: Flickr, LastFM, YouTube
Exploit social network connections
Conclusions
Identified events and their corresponding social media documents
Proposed a clustering solution
Leveraged different representations of social media documents
Employed various social media similarity metrics
Developed a weighted ensemble clustering approach
Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs