
Upload: kory-sutton

Post on 13-Dec-2015


INTRODUCTION

Game of Cricket (IPL) – 2 Teams, 2 Sessions, 11 Players/Team, ~4 Hours

#IPL, #IPL2015 – Official hashtags for the Indian Premier League

Why is it Interesting –

• Emotions on Twitter

• The Buzz of IPL

• IPL on Twitter – 62.7 million tweets last week, Twitter Battle

• Involvement – 101.77 million for first six games

CONCLUSIONS

1. Classified the sentiment of tweets into 5 categories: Unpleasant, Sad, Neutral, Happy, Pleasant/Ecstatic.

2. Recognized and classified named entities using gazettes, with automated pre-tagging and manual correction of the tweet data.

3. Applied k-means/k-means++ to the tweet data to explore clusters, both for known events in the game of cricket and with unknown cluster initialization.

4. Summarized matches by chunking tweets over time, identifying the peaks, and selecting "summarizing tweets" from the peak chunks.

5. Visualized the output of all the above methods using Data-Driven Documents (D3) and Python's matplotlib.

REFERENCES

1. Clustering

Our implementation: k-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii, http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

Off-the-shelf implementation: sklearn (k-means, k-means++), http://scikit-learn.org/stable/modules/clustering.html#k-means

2. Summarization

Summarizing Sporting Events Using Twitter, Jeffrey Nichols, Jalal Mahmud, Clemens Drews, IBM Research – Almaden, http://www.jeffreynichols.com/papers/summary-iui2012.pdf

3. Gazette

Events, Venues, Players: http://www.iplt20.com/, http://www.cricbuzz.com/

ACKNOWLEDGEMENTS

We would like to thank Prof. Kenji Sagae and Justin Garten for their continuous support and valuable inputs.

OBJECTIVE

To apply the following NLP techniques to tweets about the game of cricket (IPL):

• Sentiment Analysis

• Named Entity Recognition

• Clustering

• Summarization

CricTwee – Tweet Analysis for the Game of Cricket

Kunal Parakh, Preetam Shingavi

Computer Science Graduate Students at University of Southern California, USA

Sentiments, Named Entities, Clusters, Summary

DATASET

Around 1000 manually annotated and corrected tweets

Gazettes –

• NER – Persons, Locations, Venues, Teams

• Events – Toss, Wicket, Milestones, Boundaries, Result

[Flow diagram: Tweepy StreamListener, DB, Automated Pre-tagging (POS, NER, EVENTS) with ARK POS Tagger and Gazettes, Pre-tagged File, Manual Annotation & Correction, Tagged Training File, Untagged Data]
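The event pre-tagging step can be illustrated with a small sketch. It assumes that pre-tagging amounts to keyword lookup against an event gazette; the poster does not describe the matching rule, and the gazette entries and the function name `pretag_events` below are hypothetical.

```python
import re

# Hypothetical gazette entries. The poster's gazettes were compiled from
# iplt20.com and cricbuzz.com and are not reproduced here.
EVENT_GAZETTE = {
    "Toss": {"toss"},
    "Wicket": {"wicket", "out", "bowled", "caught", "lbw"},
    "Boundaries": {"four", "six", "boundary"},
    "Milestones": {"fifty", "century", "hundred"},
    "Result": {"won", "win", "lost", "victory"},
}


def pretag_events(tweet):
    """Return the sorted event labels whose keywords occur in the tweet."""
    # Lowercase and split on non-alphanumerics so "SIX!" matches "six".
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    return sorted(label for label, words in EVENT_GAZETTE.items() if tokens & words)
```

Pre-tags produced this way are only a starting point; in the poster's workflow they are corrected by hand before training.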

Sentiment Analysis

[Flow diagram: Tagged Training File, Feature Extraction, Train Data, Naïve Bayes Train, Model File; Development Data, Naïve Bayes Classifier, Result]

Feature Extraction – Skipped tokens with POS tags D, #, P, ^, & (ARK tagset: determiner, hashtag, pre-/postposition, proper noun, coordinating conjunction)

Evaluation

Method                          Accuracy (approximate)
Megam                           76%
Naïve Bayes Classifier          72%
NLTK Naïve Bayes Classifier     72%
Ngram Naïve Bayes Classifier    58%

Classes – Unpleasant, Sad, Neutral, Happy, Ecstatic
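As an illustration of the feature extraction and classification steps, here is a minimal multinomial Naïve Bayes sketch in plain Python. It is not the authors' code: the add-one smoothing, the bag-of-words features, and the names `extract_features` and `NaiveBayes` are assumptions. Only the list of skipped POS tags comes from the poster.

```python
import math
from collections import Counter, defaultdict

# ARK POS tags that the poster says were skipped during feature extraction.
SKIPPED_TAGS = {"D", "#", "P", "^", "&"}


def extract_features(tagged_tweet):
    """Keep lowercased tokens whose POS tag is not in SKIPPED_TAGS."""
    return [tok.lower() for tok, tag in tagged_tweet if tag not in SKIPPED_TAGS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (assumed)."""

    def fit(self, examples):
        # examples: iterable of (tagged_tweet, label) pairs
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tagged_tweet, label in examples:
            self.label_counts[label] += 1
            for word in extract_features(tagged_tweet):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, tagged_tweet):
        total = sum(self.label_counts.values())
        # Words never seen in training are ignored.
        words = [w for w in extract_features(tagged_tweet) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training takes a list of `(tagged_tweet, label)` pairs, where each tagged tweet is a list of `(token, pos_tag)` tuples and the label is one of the five classes.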

Named Entity Recognition

[Flow diagram: Tagged Training File, Feature Extraction, Gazettes, Classifier, Result (Named Entities)]

Named Entities – Persons, Locations, Teams, Venues

Feature Extraction – BIO encoding for tokens with the POS tag “^” (proper noun)

Evaluation – Manually checked the classified named entities against the entities in the gazettes.
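BIO encoding restricted to “^” tokens can be sketched as follows. The greedy longest-match lookup, the entity-type names, and the function `bio_encode` are assumptions for illustration; the poster states only that BIO labels were produced for proper-noun tokens with help from gazettes.

```python
def bio_encode(tagged_tokens, gazette):
    """Assign B-/I-/O labels to a list of (token, pos_tag) pairs.

    Only tokens tagged "^" (proper noun in the ARK tagset) are candidates.
    `gazette` maps a lowercased entity name to its entity type.
    """
    labels = ["O"] * len(tagged_tokens)
    i = 0
    while i < len(tagged_tokens):
        if tagged_tokens[i][1] != "^":
            i += 1
            continue
        # Find the end of the run of consecutive proper-noun tokens.
        j = i
        while j < len(tagged_tokens) and tagged_tokens[j][1] == "^":
            j += 1
        # Try the longest span starting at i first, then shorter ones.
        matched = False
        for end in range(j, i, -1):
            name = " ".join(tok.lower() for tok, _ in tagged_tokens[i:end])
            if name in gazette:
                etype = gazette[name]
                labels[i] = "B-" + etype
                for k in range(i + 1, end):
                    labels[k] = "I-" + etype
                i = end
                matched = True
                break
        if not matched:
            i += 1  # proper noun not in the gazette stays "O"
    return labels
```

A proper noun that is missing from the gazette is left as "O" in this sketch, which is where manual correction of the pre-tagged file would come in.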

Clustering

[Flow diagram: DB, Untagged Data, Scikit-learn Clustering (k-means++), Result (k clusters); Feature Extraction, TF-IDF & Cosine Similarity, K-means clustering, Result (k clusters); Known Events, Unknown Events]

Pre-defined Clusters – Tweets belonging to each event: Toss, Wickets, Boundaries, Milestones, Result

Evaluation – Exploratory
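The clustering pipeline can be sketched in plain Python as TF-IDF vectors, cosine similarity, and k-means with k-means++ seeding. This is a sketch under stated assumptions, not the authors' implementation: whitespace tokenisation, the `log(n/df) + 1` IDF form, the use of normalised centroids (a spherical k-means variant), and all function names are illustrative choices.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return unit-length sparse TF-IDF vectors (dict: term -> weight)."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenised for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        vec = {t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans_pp_init(vectors, k, rng):
    """k-means++ seeding: sample each new centre proportionally to D(x)^2."""
    centres = [rng.choice(vectors)]
    while len(centres) < k:
        # For unit vectors, squared Euclidean distance equals 2 * (1 - cosine).
        d2 = [max(0.0, 2.0 * (1.0 - max(cosine(v, c) for c in centres)))
              for v in vectors]
        if sum(d2) <= 0.0:
            centres.append(rng.choice(vectors))  # all points coincide
        else:
            centres.append(rng.choices(vectors, weights=d2, k=1)[0])
    return centres


def kmeans(vectors, k, iters=20, seed=0):
    """Cluster unit vectors by cosine similarity; return one label per vector."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(vectors, k, rng)
    assign = []
    for _ in range(iters):
        new_assign = [max(range(k), key=lambda j: cosine(v, centres[j]))
                      for v in vectors]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        for j in range(k):
            total = Counter()
            for v, a in zip(vectors, assign):
                if a == j:
                    for t, w in v.items():
                        total[t] += w
            if not total:
                continue  # empty cluster keeps its previous centre
            norm = math.sqrt(sum(w * w for w in total.values()))
            centres[j] = {t: w / norm for t, w in total.items()}
    return assign
```

Calling `kmeans(tfidf_vectors(tweets), k)` returns a cluster index per tweet. Results depend on the seed, as with any k-means variant, so an exploratory evaluation like the poster's would typically inspect several runs.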

Summarization

[Flow diagram: DB, Untagged Data, Timeline Chunking, Chunk Filter (Find Peaks), Scikit-learn Clustering (k-means++), Keywords, Gazettes, Summary Filter, Top 5 Tweets of Each Chunk, Result (Summary)]

Chunking – Segregate time-stamped tweets into chunks of k minutes.

Chunk Filter – Find peaks using a threshold calculated by averaging the number of tweets per chunk.

Summary Filter – Calculate scores based on keywords from clusters and events from gazettes.
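The three summarization steps can be sketched as follows. Several details are assumptions, since the poster leaves them open: timestamps are taken as seconds, the threshold is the mean tweet count over non-empty chunks, and a tweet's score is the number of its tokens found in a keyword set. The function names are hypothetical.

```python
from collections import defaultdict


def chunk_tweets(tweets, k_minutes):
    """Group (timestamp_seconds, text) pairs into k-minute chunks."""
    width = k_minutes * 60
    chunks = defaultdict(list)
    for ts, text in tweets:
        chunks[int(ts // width)].append(text)
    return dict(chunks)


def find_peaks(chunks):
    """Return ids of chunks whose tweet count exceeds the mean count."""
    # Assumption: the mean is taken over non-empty chunks only.
    threshold = sum(len(c) for c in chunks.values()) / len(chunks)
    return sorted(cid for cid, c in chunks.items() if len(c) > threshold)


def summarise(chunks, peaks, keywords, top_n=5):
    """For each peak chunk, keep the top_n tweets by keyword-hit score."""
    summary = {}
    for cid in peaks:
        scored = sorted(
            chunks[cid],
            key=lambda text: sum(w in keywords for w in text.lower().split()),
            reverse=True,  # the sort is stable, so ties keep arrival order
        )
        summary[cid] = scored[:top_n]
    return summary
```

In the poster's pipeline the keyword set would come from cluster keywords and gazette events; here it is simply passed in as a Python set.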


CONTACT

Kunal Parakh, Email – [email protected]

Preetam Shingavi, Email – [email protected]

CSCI 544 – Advanced Natural Language Processing

University of Southern California

Why is it Challenging –

• About Tweets – Unstructured, Annotation Task

• Manual Analysis

• Dynamic Data

• Evaluation of Models