A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting
Muhammad Imran*, Sanjay Chawla*, Carlos Castillo**
*Qatar Computing Research Institute, Doha, Qatar
**Eurecat, Barcelona, Spain
Data Stream Processing
Challenges:
1. Infinite length
2. Concept-drift (change in data distributions)
3. Concept-evolution (new categories emerge)
4. Limited labeled data
Applications: credit card fraud detection, sensor data classification, social media stream mining
Data stream
Social Media Stream Processing in Time-Critical Situations
2013 Pakistan Earthquake (September 28 at 07:34 UTC)
2010 Haiti Earthquake (January 12 at 21:53 UTC)
Social Media Platforms
Availability of Immense Data:
Around 16 thousand tweets per minute were posted during Hurricane Sandy in the US.
Opportunities:
- Early warning and event detection
- Situational awareness
- Actionable information extraction
- Rapid crisis response
- Post-disaster analysis
Disease outbreaks
Social Media Data Streams Classification
We address two issues in the supervised classification of social media streams:
1. How to keep the categories used for classification up-to-date?
2. While adding new categories, how to maintain high classification accuracy?
Input and Output
[Figure: INPUT — categories A, B, C and a miscellaneous category Z; OUTPUT — refined categories A’, B’, C’, new categories Z1 and Z2, and a reduced miscellaneous category Z’]
Problem Definition
Given as input a data set of documents D = {d1, ..., dn},
categorized into a taxonomy C = {C1, ..., Ck, Z} containing a miscellaneous category Z,
with a partitioning of the documents into the taxonomy.
Our task is to produce a new taxonomy C’ = {C1’, ..., Ck’, Z1, ..., ZN, Z’}
with the following characteristics:
• There are N new categories: Z1, ..., ZN
• Pre-existing categories are slightly modified: Ci’ ≈ Ci
• New categories are different from the old: Zj ≠ Ci
• The size of the miscellaneous category is reduced: |Z’| < |Z|
Expert-Machine-Crowd Setting
Constraints Outlier Detection (COD-Means):
1. Constraint formation using classified items
2. Clustering using COD-Means
3. Labeling-error identification (using outlier detection)
Constraints Formation
1. Items in the same category have Must-link constraints
2. Items belonging to different categories have Cannot-link constraints
Category A  Category B  Category C  Category Z
Must-link
Cannot-link
Note: Items in Z do not have any constraints
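The two rules above can be sketched as follows; this is a minimal illustration, assuming each item carries a single category label and the miscellaneous category is named "Z" (names are hypothetical, not from the original implementation):

```python
from itertools import combinations

def build_constraints(labels):
    """Build Must-link (ML) and Cannot-link (CL) pairs from labeled items.

    labels: dict mapping item id -> category name.
    Items in the miscellaneous category "Z" get no constraints.
    """
    ml, cl = set(), set()
    labeled = [(i, c) for i, c in labels.items() if c != "Z"]
    for (i, ci), (j, cj) in combinations(labeled, 2):
        if ci == cj:
            ml.add((i, j))   # same category -> Must-link
        else:
            cl.add((i, j))   # different categories -> Cannot-link
    return ml, cl

# Toy example: two items in A, one in B, one miscellaneous
ml, cl = build_constraints({1: "A", 2: "A", 3: "B", 4: "Z"})
```

Note that item 4 (miscellaneous) appears in no constraint, matching the note above.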
Objective Function
Standard distortion error
If an ML constraint is violated, the cost of the violation equals the distance between the two centroids of the clusters that contain the instances.
If a CL constraint is violated, the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
Assignment and Update Rules
Rule 1: For items without any constraints, use the standard distortion error.
Rule 2: For items with Must-link constraints, the cost of a violation is the distance between their centroids.
Rule 3: For items with Cannot-link constraints, the cost is the distance between centroid c and its nearest centroid.
δ(x, y) is the Kronecker delta function, i.e., it is 1 if x = y and 0 if x ≠ y.
Update rule: The update rule computes a modified average of all points that belong to a cluster.
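The three assignment rules can be illustrated with a small sketch; for brevity it assumes at most one ML and one CL partner per point (the function name and signature are illustrative, not from the original implementation):

```python
import numpy as np

def assignment_cost(x, c, centroids, ml_partner_cluster=None, cl_partner_cluster=None):
    """Cost of assigning point x to cluster c under the three rules.

    Rule 1: squared distance to the centroid (standard distortion).
    Rule 2: if x's Must-link partner sits in a different cluster, add the
            distance between the two centroids.
    Rule 3: if x's Cannot-link partner sits in the same cluster, add the
            distance between centroid c and its nearest other centroid h(c).
    """
    cost = np.sum((x - centroids[c]) ** 2)  # Rule 1
    if ml_partner_cluster is not None and ml_partner_cluster != c:
        cost += np.linalg.norm(centroids[c] - centroids[ml_partner_cluster])  # Rule 2
    if cl_partner_cluster is not None and cl_partner_cluster == c:
        cost += min(np.linalg.norm(centroids[c] - centroids[h])
                    for h in range(len(centroids)) if h != c)  # Rule 3
    return cost
```

Each point is assigned to the cluster minimizing this cost, so constraint violations are penalized rather than forbidden outright.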
COD-Means Algorithm
1. Initialization (e.g., random pick of k centroids)
2. Assignment of items based on the three assignment rules, considering ML and CL constraints
3. Points in each cluster are sorted by their distance to the centroid; the top l are removed and inserted into the outlier set L
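The three steps above can be sketched as a simplified loop; this sketch omits the ML/CL penalties (it uses plain distortion, i.e., Rule 1 only) and performs the outlier step once after convergence, so it is an illustration of the structure, not the full algorithm:

```python
import numpy as np

def cod_means(X, k, l, iters=10, seed=0):
    """Simplified COD-Means skeleton:
    1. initialize k centroids by a random pick of data points,
    2. iterate nearest-centroid assignment and centroid updates,
    3. per cluster, sort points by distance to the centroid and move
       the top-l farthest into the outlier set L.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    # Outlier step: the top-l farthest points in each cluster go into L
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    assign = d.argmin(axis=1)
    L = set()
    for c in range(k):
        idx = np.where(assign == c)[0]
        order = idx[np.argsort(-d[idx, c])]  # farthest first
        L.update(order[:l].tolist())
    return centroids, assign, L
```

Points placed in L are the candidate labeling errors that are later inspected in the expert-machine-crowd loop.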
Dataset and Experiments
1. Are the new clusters identified by the COD-Means algorithm genuinely different and novel?
2. What is the nature of outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers?
3. What is the impact of outliers on the quality of clusters generated by COD-Means?
4. Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve?
Eight disaster-related datasets collected from Twitter were used
Cluster Novelty and Coherence: K-Means vs. COD-Means
• The proposed approach generates more cohesive and novel clusters by removing outliers.
• As the value of L increases, tighter and more coherent clusters are observed.
Data Improvements Evaluation
1. Labeling errors in non-miscellaneous categories
2. Items incorrectly labeled as miscellaneous
Impact on Classification Performance
Conclusion
• Our setting: supervised stream classification
• We presented COD-Means to learn novel categories and labeling errors from live streams
• We used real-world Twitter datasets and performed extensive experimentation
• We showed that COD-Means is able to identify new categories and labeling errors efficiently
Thank you for your attention!