A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
Data Stream Processing
Challenges:
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Applications: credit card fraud detection, sensor data classification, social media stream mining
Data stream
Social Media Stream Processing in Time-Critical Situations
2013 Pakistan Earthquake (September 28 at 07:34 UTC)
2010 Haiti Earthquake (January 12 at 21:53 UTC)
Social Media Platforms
Availability of Immense Data:
Around 16 thousand tweets per minute were posted during Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
Social Media Data Streams Classification
We address two issues in the supervised classification of social media streams:
1. How to keep the categories used for classification up-to-date?
2. While adding new categories, how to maintain high classification accuracy?
Input and Output
[Figure: INPUT — categories A, B, C and a miscellaneous category Z; OUTPUT — refined categories A’, B’, C’, new categories Z1 and Z2, and a reduced miscellaneous category Z’]
Problem Definition
Given as input a data set of documents D = {d1, ..., dn},
categorized into a taxonomy C = {C1, ..., Ck, Z} containing a miscellaneous category Z,
with a partitioning of the documents into the taxonomy.
Our task is to produce a new taxonomy C’ = {C1’, ..., Ck’, Z1, ..., ZN, Z’}
with the following characteristics:
• There are N new categories: Z1, ..., ZN
• Pre-existing categories are slightly modified: Ci’ ≈ Ci
• New categories are different from the old: Zj ≠ Ci
• The size of the miscellaneous category is reduced: |Z’| < |Z|
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraint formation using classified items
2. Clustering using COD-Means
3. Labeling-error identification (using outlier detection)
Constraints Formation
1. Items in the same category have Must-link constraints
2. Items belonging to different categories have Cannot-link constraints
Category A  Category B  Category C  Category Z
Must-link
Cannot-link
Note: Items in Z do not have any constraints
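The two rules above can be sketched as follows; this is a minimal illustration, assuming each item carries a single category label and the miscellaneous category is named "Z" (names are hypothetical, not from the original implementation):

```python
from itertools import combinations

def build_constraints(labels):
    """Build Must-link (ML) and Cannot-link (CL) pairs from labeled items.

    labels: dict mapping item id -> category name.
    Items in the miscellaneous category "Z" get no constraints.
    """
    ml, cl = set(), set()
    labeled = [(i, c) for i, c in labels.items() if c != "Z"]
    for (i, ci), (j, cj) in combinations(labeled, 2):
        if ci == cj:
            ml.add((i, j))   # same category -> Must-link
        else:
            cl.add((i, j))   # different categories -> Cannot-link
    return ml, cl

# Toy example: two items in A, one in B, one miscellaneous
ml, cl = build_constraints({1: "A", 2: "A", 3: "B", 4: "Z"})
```

Note that item 4 (miscellaneous) appears in no constraint, matching the note above.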
Objective Function
Standard distortion error
If an ML constraint is violated, the cost of the violation equals the distance between the two centroids of the clusters that contain the instances.
If a CL constraint is violated, the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
Assignment and Update Rules
Rule 1: For items without any constraints, use the standard distortion error.
Rule 2: For items with Must-link constraints, the cost of a violation is the distance between their centroids.
Rule 3: For items with Cannot-link constraints, the cost is the distance between centroid c and its nearest centroid.
δ(x, y) is the Kronecker delta function, i.e., it is 1 if x = y and 0 if x ≠ y.
Update rule: The update rule computes a modified average of all points that belong to a cluster.
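The three assignment rules can be illustrated with a small sketch; for brevity it assumes at most one ML and one CL partner per point (the function name and signature are illustrative, not from the original implementation):

```python
import numpy as np

def assignment_cost(x, c, centroids, ml_partner_cluster=None, cl_partner_cluster=None):
    """Cost of assigning point x to cluster c under the three rules.

    Rule 1: squared distance to the centroid (standard distortion).
    Rule 2: if x's Must-link partner sits in a different cluster, add the
            distance between the two centroids.
    Rule 3: if x's Cannot-link partner sits in the same cluster, add the
            distance between centroid c and its nearest other centroid h(c).
    """
    cost = np.sum((x - centroids[c]) ** 2)  # Rule 1
    if ml_partner_cluster is not None and ml_partner_cluster != c:
        cost += np.linalg.norm(centroids[c] - centroids[ml_partner_cluster])  # Rule 2
    if cl_partner_cluster is not None and cl_partner_cluster == c:
        cost += min(np.linalg.norm(centroids[c] - centroids[h])
                    for h in range(len(centroids)) if h != c)  # Rule 3
    return cost
```

Each point is assigned to the cluster minimizing this cost, so constraint violations are penalized rather than forbidden outright.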
COD-Means Algorithm
1. Initialization (e.g., random pick of k centroids)
2. Assignment of items based on the three assignment rules, considering ML and CL constraints
3. Points in each cluster are sorted by their distance to the centroid; the top l are removed and inserted into the outlier set L
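The three steps above can be sketched as a simplified loop; this sketch omits the ML/CL penalties (it uses plain distortion, i.e., Rule 1 only) and performs the outlier step once after convergence, so it is an illustration of the structure, not the full algorithm:

```python
import numpy as np

def cod_means(X, k, l, iters=10, seed=0):
    """Simplified COD-Means skeleton:
    1. initialize k centroids by a random pick of data points,
    2. iterate nearest-centroid assignment and centroid updates,
    3. per cluster, sort points by distance to the centroid and move
       the top-l farthest into the outlier set L.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    # Outlier step: the top-l farthest points in each cluster go into L
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    assign = d.argmin(axis=1)
    L = set()
    for c in range(k):
        idx = np.where(assign == c)[0]
        order = idx[np.argsort(-d[idx, c])]  # farthest first
        L.update(order[:l].tolist())
    return centroids, assign, L
```

Points placed in L are the candidate labeling errors that are later inspected in the expert-machine-crowd loop.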
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel?
2. What is the nature of outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve?
Eight disaster-related datasets collected from Twitter were used
Cluster Novelty and Coherence: K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
Impact on Classification Performance
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel categories and labeling errors from live streams
• We used real-world Twitter datasets and performed extensive experimentation
• We showed that COD-Means is able to identify new categories and labeling errors efficiently
Thank you for your attention!