Machine learning applications on text data

Using Machine learning and R: Finding Order in the Chaos (Harshad Saykhedkar)

Upload: harshad-saykhedkar

Posted on 27-May-2015

Category: Technology


DESCRIPTION

Do you get the feeling of ‘the cart before the horse’ on hearing buzzwords like social data mining or sentiment analysis? Fundamental text mining methods are the real ‘workhorses’ behind these buzzwords. This presentation aims to give an understanding of the fundamentals in plain English.

TRANSCRIPT

Page 1: Machine learning applications on text data

Using Machine learning and R

Finding Order in the Chaos

Harshad Saykhedkar

Page 2: Machine learning applications on text data

Source of text and applications

● Emails : Spam detection

● Product descriptions / reviews : Sentiment analysis, recommendation

● Blogs / informational content : Content recommendations

● Web pages / news articles : Topic identification, trending topics

● Tweets / comments / social content : Sentiment analysis, named entity recognition

Page 3: Machine learning applications on text data

The main idea

(Text mining) is a wonderful world. Let's go exploring...!

Page 4: Machine learning applications on text data

Itinerary

● R you ready ?

● Prep camp

● The wandering traveller

● The seeker

Page 5: Machine learning applications on text data

R you ready ?

Page 6: Machine learning applications on text data

Packing our bags : Checks

● Starting R

● Loading required packages

● Check sessionInfo( )
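
The three checks above might look like this in a fresh R session. This is a minimal sketch: the slides do not name the packages the workshop loads, so `stats` (which ships with R) stands in as a placeholder.

```r
# Load a package. 'stats' is bundled with R, so this call works on any install.
# Assumption: replace it with whichever text-mining packages the session needs.
library(stats)

# sessionInfo() reports the R version, the platform, and the attached packages.
info <- sessionInfo()
print(info$R.version$version.string)
print(info$basePkgs)
```

If a package is missing, `library()` stops with an error, which is the point of running this check before the hands-on parts.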

Page 7: Machine learning applications on text data

Packing our bags : Datatypes

Atomic Vector

Lists

"Let's try our hands"

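As a quick way to try the two datatypes, the sketch below builds an atomic vector and a list. The variable names and the sample review are only illustrative, not taken from the workshop material.

```r
# Atomic vector: every element has the same type.
words <- c("great", "location", "fabulous", "staff")

# Mixing types in c() coerces everything to the most general type.
mixed <- c(1, "two", TRUE)   # becomes a character vector

# List: elements may differ in type and in length.
review <- list(id = 3L,
               text = "Great location, fabulous staff",
               tokens = words)

print(typeof(words))
print(typeof(mixed))
print(length(review$tokens))
```

Lists are the natural container for tokenised text, because each document usually produces a different number of tokens.
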
Page 8: Machine learning applications on text data

Packing our bags : Functions

● Expressions which are evaluated

● Can be passed around

● Definitions can be nested

Details not covered : Argument matching, call by value, environments and lexical scoping, promises, etc.
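
The three points about functions can be sketched as follows. All function names here (`count_chars`, `apply_to_each`, `make_prefixer`) are hypothetical helpers invented for the illustration.

```r
# A function is an ordinary object: it can be stored in a variable...
count_chars <- function(x) nchar(x)

# ...and passed around as an argument to another function.
apply_to_each <- function(items, f) {
  # f is any one-argument function supplied by the caller
  sapply(items, f, USE.NAMES = FALSE)
}

# Definitions can be nested: make_prefixer() defines and returns an inner function.
make_prefixer <- function(prefix) {
  function(x) paste0(prefix, x)
}

tag_word <- make_prefixer("word:")

lengths_out <- apply_to_each(c("spam", "ham"), count_chars)
tagged <- tag_word("spam")
print(lengths_out)
print(tagged)
```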

Page 9: Machine learning applications on text data

Prep Camp

Page 10: Machine learning applications on text data

Prep camp : Sentiment Analysis

● Bag of words model

● Simple aggregated score

' terrible service & disorganised '

' OK - some good some bad '

' Great location, fabulous staff '
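
A bag-of-words score for the three reviews above could be computed roughly like this. The positive and negative word lists are a tiny hypothetical lexicon made up for the example; a real analysis would use a much larger one.

```r
# Hypothetical mini-lexicon (assumption, not from the slides).
positive <- c("good", "great", "fabulous")
negative <- c("bad", "terrible", "disorganised")

reviews <- c(" terrible service & disorganised ",
             " OK - some good some bad ",
             " Great location, fabulous staff ")

score_review <- function(text) {
  # Bag of words: lower-case, split on anything that is not a letter, ignore order.
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  # Simple aggregated score: +1 per positive word, -1 per negative word.
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

scores <- sapply(reviews, score_review, USE.NAMES = FALSE)
print(scores)
```

With this lexicon the first review scores negative, the second comes out neutral because "good" and "bad" cancel, and the third scores positive.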

Page 11: Machine learning applications on text data

Prep camp : Improvements

● Part of speech ambiguity

● Further exploration ?

● Equal weightage model

● Double negations ?

Page 12: Machine learning applications on text data

The Wandering Traveller

Page 13: Machine learning applications on text data

Wandering traveller : Unsupervised Learning

Can define distance

Entity as point in space

How to derive this model for text ?

[Figure: entities plotted as points against axes Feature 1 and Feature 2]
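
To make "entity as point in space" concrete, here is a small sketch that measures distances between three made-up points with two features each. The coordinates are invented for the illustration.

```r
# Three hypothetical entities, each described by two numeric features.
pts <- rbind(p1 = c(1, 1),
             p2 = c(2, 1),
             p3 = c(5, 5))
colnames(pts) <- c("feature1", "feature2")

# dist() returns all pairwise distances; Euclidean is the default method.
d <- dist(pts, method = "euclidean")
print(d)
```

Here p1 and p2 sit close together while p3 is far from both, which is the kind of structure a clustering method picks up on.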

Page 14: Machine learning applications on text data

Wandering traveller : Vector Space Model

[Figure: Comments, Blogs, Tweets plotted against axes labelled Word, Phrase, Theme]
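
One possible way to derive such a model for text is a document-term matrix, where each document becomes a vector of word counts. The sketch below uses base R only and three invented snippets; it is not necessarily how the workshop built its matrix.

```r
# Made-up example documents (assumption, for illustration only).
docs <- c(comment = "great camera great price",
          blog    = "camera review with sample photos",
          tweet   = "great phone")

tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document.
count_words <- function(tk) tabulate(match(tk, vocab), nbins = length(vocab))

# One row per document, one column per word; each cell is a raw count.
dtm <- t(sapply(tokens, count_words))
dimnames(dtm) <- list(names(docs), vocab)
print(dtm)
```

Each row is now a point in a space with one axis per word, so the distance ideas from the previous slide apply directly.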

Page 15: Machine learning applications on text data

Wandering traveller : TfIdf and other details

" But how to measure the importance of a word for a doc ? "

● Binary : Is the 'word' in the 'doc' ?

● Tf : How many times does the 'word' occur in the 'doc' ?

● TfIdf : Penalize the obvious! Down-weight words that occur in most docs
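
The three weightings can be compared on a toy count matrix. The counts are made up, and the idf formula used here, log(number of docs / number of docs containing the word), is one common variant among several.

```r
# Toy term counts (rows = documents, columns = words); values are invented.
tf <- rbind(doc1 = c(camera = 2, the = 3, zoom = 1),
            doc2 = c(camera = 0, the = 2, zoom = 0),
            doc3 = c(camera = 1, the = 4, zoom = 0))

binary <- (tf > 0) * 1            # Binary: is the word in the doc?
df     <- colSums(tf > 0)         # number of docs containing each word
idf    <- log(nrow(tf) / df)      # a word found in every doc gets idf = 0
tfidf  <- sweep(tf, 2, idf, `*`)  # scale each column (word) by its idf

print(round(tfidf, 3))
```

The word "the" appears in every document, so its TfIdf weight drops to zero even though its raw counts are the highest.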

Page 16: Machine learning applications on text data

Wandering traveller : Hierarchical Clustering

● Define distance measure

● Keep merging the most similar clusters

[Figure: example items Washing Machine, Washer Dryer, Camera]
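
A rough sketch of the two steps, using `dist()` and `hclust()` from base R. The word counts for the three products are invented, and average linkage is an arbitrary choice for the example.

```r
# Hypothetical word counts for three products (the vocabulary is made up).
counts <- rbind(
  "Washing Machine" = c(wash = 3, spin = 2, dry = 0, lens = 0, zoom = 0),
  "Washer Dryer"    = c(wash = 2, spin = 2, dry = 3, lens = 0, zoom = 0),
  "Camera"          = c(wash = 0, spin = 0, dry = 0, lens = 3, zoom = 2)
)

d  <- dist(counts)                    # step 1: define a distance (Euclidean here)
hc <- hclust(d, method = "average")   # step 2: keep merging the closest clusters
groups <- cutree(hc, k = 2)           # cut the tree into two groups
print(groups)
# plot(hc) would draw the dendrogram in an interactive session
```

The two laundry products share vocabulary and merge first, leaving the camera in a group of its own.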

Page 17: Machine learning applications on text data

Wandering traveller : Improvements

● Stemming, lemmatization

● Latent semantic analysis

"Cameras" Vs "Camera"

"Phone" "Touch Screen"

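Both improvements can be hinted at in base R. In the sketch below, `naive_stem` is a crude hypothetical stand-in for a real stemmer, and latent semantic analysis is approximated by a truncated SVD of a made-up count matrix with an arbitrary choice of two latent dimensions.

```r
# Crude stand-in for a stemmer: lower-case and strip one trailing "s".
# A real stemmer or lemmatizer handles far more cases than this.
naive_stem <- function(w) sub("s$", "", tolower(w))
stems <- naive_stem(c("Cameras", "Camera"))

# Made-up counts: d3 only says "phone", d4 only says "touch screen",
# and d5 uses all three words, which links the two vocabularies.
dtm <- rbind(
  d1 = c(camera = 2, lens = 1, phone = 0, touch = 0, screen = 0),
  d2 = c(camera = 1, lens = 2, phone = 0, touch = 0, screen = 0),
  d3 = c(camera = 0, lens = 0, phone = 2, touch = 0, screen = 0),
  d4 = c(camera = 0, lens = 0, phone = 0, touch = 1, screen = 1),
  d5 = c(camera = 0, lens = 0, phone = 1, touch = 1, screen = 1)
)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Latent semantic analysis as a truncated SVD: keep the k strongest dimensions.
s <- svd(dtm)
k <- 2
doc_coords <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(doc_coords) <- rownames(dtm)

sim_raw    <- cosine(dtm["d3", ], dtm["d4", ])              # 0: no shared words
sim_latent <- cosine(doc_coords["d3", ], doc_coords["d4", ])
print(c(raw = sim_raw, latent = sim_latent))
```

In the raw word space the "phone" and "touch screen" documents look unrelated, while in the reduced space they land close together because d5 shows the words co-occurring.
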
Page 18: Machine learning applications on text data

The Seeker

Page 19: Machine learning applications on text data

Seeker : Supervised Learning

● Labels given with features

● Find a rule, then classify unobserved cases

[Figure: labelled points plotted against axes Feature 1 and Feature 2]

Page 20: Machine learning applications on text data

Seeker : Naive Bayes Classifier

● Independence of features

● Train the model on training set

● Test accuracy on a holdout sample

            Predicted 0    Predicted 1
Actual 0    F(0, 0)        F(0, 1)
Actual 1    F(1, 0)        F(1, 1)
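
The train, test, and confusion-table steps can be sketched with a hand-rolled classifier in base R. This is a multinomial Naive Bayes with add-one smoothing written for illustration; the messages and labels are invented, and the workshop may well have used a package implementation instead.

```r
# Made-up labelled messages: 1 = spam, 0 = not spam.
train_text <- c("win cash prize now", "cheap prize win", "free cash offer",
                "meeting at noon", "project meeting notes", "lunch at noon")
train_y <- c(1, 1, 1, 0, 0, 0)
# Holdout sample, never seen during training.
test_text <- c("win free prize", "noon project meeting",
               "cash offer now", "lunch notes")
test_y <- c(1, 0, 1, 0)

tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")
vocab <- sort(unique(unlist(tokenize(train_text))))
classes <- c(0, 1)

# Word counts for one class, with add-one (Laplace) smoothing.
word_counts <- function(texts) {
  tk <- unlist(tokenize(texts))
  tabulate(match(tk, vocab), nbins = length(vocab)) + 1
}

# Training: log P(word | class) for every word, one column per class.
log_lik <- sapply(classes, function(cl) {
  n <- word_counts(train_text[train_y == cl])
  log(n / sum(n))
})
rownames(log_lik) <- vocab
log_prior <- log(sapply(classes, function(cl) mean(train_y == cl)))

predict_nb <- function(text) {
  tk <- tokenize(text)[[1]]
  tk <- tk[tk %in% vocab]   # ignore words never seen in training
  # Independence of features: per-word log-probabilities simply add up.
  scores <- log_prior + colSums(log_lik[tk, , drop = FALSE])
  classes[which.max(scores)]
}

pred <- sapply(test_text, predict_nb, USE.NAMES = FALSE)
confusion <- table(Actual = test_y, Predicted = pred)
accuracy <- mean(pred == test_y)
print(confusion)
print(accuracy)
```

The diagonal of `confusion` holds the correctly classified cases, F(0, 0) and F(1, 1) in the table above, and the off-diagonal cells hold the errors.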

Page 21: Machine learning applications on text data

Learnings

Page 22: Machine learning applications on text data

Learnings

● How to clean up and preprocess data in text form ?

● How to model the data ?

● How to cluster the data ?

● How to classify the data ?

Page 23: Machine learning applications on text data

Source of text and applications

● Emails : Spam detection

● Product descriptions / reviews : Sentiment analysis, recommendation

● Blogs / informational content : Content recommendations

● Web pages / news articles : Topic identification, trending topics

● Tweets / comments / social content : Sentiment analysis, named entity recognition

Page 24: Machine learning applications on text data

Questions ?

Page 25: Machine learning applications on text data

About me

"Avid R learner, trying to apply bunch of these techniques to the digital ads world"

Contact [email protected]