Post on 25-Feb-2016

Page 1: Introduction to Machine Learning and Text Mining

Introduction to Machine Learning and Text Mining

Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute

Page 2: Introduction to Machine Learning and Text Mining

Naïve Approach: When all you have is a hammer…

[Figure labels: Target, Representation, Data]

Page 3: Introduction to Machine Learning and Text Mining

Slightly less naïve approach: Aimless wandering…

[Figure labels: Target, Representation, Data]

Page 4: Introduction to Machine Learning and Text Mining

Expert Approach: Hypothesis driven

[Figure labels: Target, Representation, Data]

Page 5: Introduction to Machine Learning and Text Mining

Suggested Readings

Witten, I. H., Frank, E., Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques, third edition. Elsevier: San Francisco.

Page 6: Introduction to Machine Learning and Text Mining

What is machine learning?

Automatically or semi-automatically:

Inducing concepts (i.e., rules) from data

Finding patterns in data

Explaining data

Making predictions

[Figure: Data → Learning Algorithm → Model; New Data → Classification Engine → Prediction]

Page 7: Introduction to Machine Learning and Text Mining
Page 8: Introduction to Machine Learning and Text Mining
Page 9: Introduction to Machine Learning and Text Mining

If Outlook = sunny, no
else if Outlook = overcast, yes
else if Outlook = rainy and Windy = TRUE, no
else yes

Perfect on training data
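The rule on this slide can be written as a small function. This is a minimal sketch: the name `apply_rule` and the boolean `windy` argument are illustrative choices, and the transcript does not say what the yes/no label stands for.

```python
def apply_rule(outlook: str, windy: bool) -> str:
    """Return 'yes' or 'no' by following the slide's if/else chain in order."""
    if outlook == "sunny":
        return "no"
    elif outlook == "overcast":
        return "yes"
    elif outlook == "rainy" and windy:
        return "no"
    else:
        # Everything not matched above, e.g. rainy and not windy.
        return "yes"


if __name__ == "__main__":
    for outlook, windy in [("sunny", False), ("overcast", True),
                           ("rainy", True), ("rainy", False)]:
        print(outlook, windy, "->", apply_rule(outlook, windy))
```

Because the branches are checked top to bottom, the order matters: a sunny day is labeled "no" before the Windy attribute is ever consulted.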

Page 10: Introduction to Machine Learning and Text Mining

If Outlook = sunny, no
else if Outlook = overcast, yes
else if Outlook = rainy and Windy = TRUE, no
else yes

Performance on training data?

Not perfect on testing data

Page 11: Introduction to Machine Learning and Text Mining

If Outlook = sunny, no
else if Outlook = overcast, yes
else if Outlook = rainy and Windy = TRUE, no
else yes

IMPORTANT! If you evaluate the performance of your rule on the same data you trained on, you won’t get an accurate estimate of how well it will do on new data.
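One way to see why training-set accuracy is misleading is a learner that simply memorizes its training rows. The sketch below is an illustration only: the memorizing learner and every data row are invented for this example and do not come from the slides.

```python
from collections import Counter


def train_memorizer(rows):
    """Memorize attributes -> label, with the majority label as a fallback."""
    table = {features: label for features, label in rows}
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return table, default


def predict(model, features):
    table, default = model
    return table.get(features, default)


def accuracy(model, rows):
    correct = sum(predict(model, features) == label for features, label in rows)
    return correct / len(rows)


# Hypothetical rows: ((outlook, windy), label). Invented for illustration.
train_rows = [
    (("sunny", False), "no"),
    (("sunny", True), "no"),
    (("overcast", False), "yes"),
    (("rainy", False), "yes"),
    (("rainy", True), "no"),
]
test_rows = [
    (("overcast", True), "yes"),  # combination never seen in training
    (("sunny", False), "yes"),    # seen in training, but with a different label
    (("rainy", False), "yes"),
]

model = train_memorizer(train_rows)
train_acc = accuracy(model, train_rows)
test_acc = accuracy(model, test_rows)

if __name__ == "__main__":
    print("training accuracy:", train_acc)
    print("testing accuracy:", test_acc)
```

The memorizer scores perfectly on the rows it has already seen, which says nothing about rows it has not seen; only the held-out score estimates performance on new data.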

Page 12: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

First, train on segments 2, 3, 4, 5, 6, and 7 and apply the trained model to segment 1. The result is Accuracy1.

[Figure: data split into segments 1 to 7. Fold 1: segment 1 is TEST; all other segments are TRAIN.]

Page 13: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

Next, train on segments 1, 3, 4, 5, 6, and 7 and apply the trained model to segment 2. The result is Accuracy2.

[Figure: data split into segments 1 to 7. Fold 2: segment 2 is TEST; all other segments are TRAIN.]

Page 14: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

Next, train on segments 1, 2, 4, 5, 6, and 7 and apply the trained model to segment 3. The result is Accuracy3.

[Figure: data split into segments 1 to 7. Fold 3: segment 3 is TEST; all other segments are TRAIN.]

Page 15: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

Next, train on segments 1, 2, 3, 5, 6, and 7 and apply the trained model to segment 4. The result is Accuracy4.

[Figure: data split into segments 1 to 7. Fold 4: segment 4 is TEST; all other segments are TRAIN.]

Page 16: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

Next, train on segments 1, 2, 3, 4, 6, and 7 and apply the trained model to segment 5. The result is Accuracy5.

[Figure: data split into segments 1 to 7. Fold 5: segment 5 is TEST; all other segments are TRAIN.]

Page 17: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

Next, train on segments 1, 2, 3, 4, 5, and 7 and apply the trained model to segment 6. The result is Accuracy6.

[Figure: data split into segments 1 to 7. Fold 6: segment 6 is TEST; all other segments are TRAIN.]

Page 18: Introduction to Machine Learning and Text Mining

Simple Cross Validation

Let’s say your data has attributes A, B, and C.

You want to train a rule to predict D.

Next, train on segments 1, 2, 3, 4, 5, and 6 and apply the trained model to segment 7. The result is Accuracy7.

Finally: average Accuracy1 through Accuracy7.

[Figure: data split into segments 1 to 7. Fold 7: segment 7 is TEST; all other segments are TRAIN.]
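The seven folds walked through above can be sketched as a single loop. This is a minimal illustration under stated assumptions: the majority-class learner `train_majority` is a hypothetical stand-in for whatever rule learner is actually used, and the rows (placeholder attribute values "a", "b", "c" and yes/no values for D) are invented.

```python
from collections import Counter


def train_majority(rows):
    """Hypothetical stand-in learner: always predict the most common D."""
    majority = Counter(d for _, d in rows).most_common(1)[0][0]
    return lambda attributes: majority


def cross_validate(segments, train):
    """Hold out each segment once; return per-fold accuracies and their mean."""
    accuracies = []
    for i, test_segment in enumerate(segments):
        # Train on every segment except the one held out for this fold.
        train_rows = [row for j, segment in enumerate(segments)
                      if j != i for row in segment]
        model = train(train_rows)
        correct = sum(model(attributes) == d for attributes, d in test_segment)
        accuracies.append(correct / len(test_segment))
    return accuracies, sum(accuracies) / len(accuracies)


# Invented data: 7 segments of 2 rows each, as ((A, B, C), D).
labels_per_segment = [
    ["yes", "yes"], ["yes", "no"], ["yes", "yes"], ["no", "no"],
    ["yes", "yes"], ["yes", "no"], ["yes", "yes"],
]
segments = [[(("a", "b", "c"), d) for d in labels]
            for labels in labels_per_segment]

accuracies, mean_accuracy = cross_validate(segments, train_majority)

if __name__ == "__main__":
    for fold, acc in enumerate(accuracies, start=1):
        print(f"Accuracy{fold} = {acc:.2f}")
    print(f"Average = {mean_accuracy:.3f}")
```

Every row is used for testing exactly once and is never part of the training data for the fold that tests it, which is what makes the averaged accuracy a fairer estimate than accuracy on the training data.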

Page 19: Introduction to Machine Learning and Text Mining

Working with Text

Page 20: Introduction to Machine Learning and Text Mining

Basic Idea

Represent text as a vector where each position corresponds to a term

This is called the “bag of words” approach

Term positions: Cheese, Cows, Eat, Hamsters, Make, Seeds

Cows make cheese. → 110010

Hamsters eat seeds. → 001101

Page 21: Introduction to Machine Learning and Text Mining

Basic Idea

Represent text as a vector where each position corresponds to a term

This is called the “bag of words” approach

Term positions: Cheese, Cows, Eat, Hamsters, Make, Seeds

Cows make cheese. → 110010

Hamsters eat seeds. → 001101

But same representation for “Cheese makes cows.”!
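The vectors on this slide can be reproduced with a few lines of code. This is a sketch, not the tooling used in the course: the function names are illustrative, and `normalize` is a deliberately crude stand-in for a real stemmer (it only lowercases and strips one trailing "s"), which is the assumption needed for "makes" to map onto the "Make" position.

```python
import re


def normalize(token):
    """Crude stand-in for a stemmer: lowercase and strip one trailing 's'."""
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def tokenize(text):
    return [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]


def build_vocabulary(texts):
    """One vector position per distinct term, in alphabetical order."""
    return sorted({token for text in texts for token in tokenize(text)})


def to_vector(text, vocabulary):
    """Binary bag of words: 1 if the term occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]


corpus = ["Cows make cheese.", "Hamsters eat seeds."]
vocabulary = build_vocabulary(corpus)

if __name__ == "__main__":
    print(vocabulary)
    for text in corpus + ["Cheese makes cows."]:
        print(text, "".join(str(bit) for bit in to_vector(text, vocabulary)))
```

Word order is discarded when the text is turned into a set of terms, so "Cows make cheese." and "Cheese makes cows." end up with identical vectors.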

Page 22: Introduction to Machine Learning and Text Mining

Part of Speech Tagging

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Page 23: Introduction to Machine Learning and Text Mining

Part of Speech Tagging

23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Page 24: Introduction to Machine Learning and Text Mining

Basic Types of Features

Unigram: single words (e.g., prefer, sandwich, take)

Bigram: pairs of words next to each other (e.g., machine_learning, eat_wheat)

POS-Bigram: pairs of POS tags next to each other (e.g., DT_NN, NNP_NNP)
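The three feature types can be sketched as follows. The example sentence and its part-of-speech tags are hand-written assumptions for illustration; in practice the tags would come from a POS tagger, which is not shown here.

```python
def unigrams(items):
    """Single items, in order."""
    return list(items)


def bigrams(items):
    """Adjacent pairs joined with an underscore, as in the slide's examples."""
    return [f"{left}_{right}" for left, right in zip(items, items[1:])]


# Hand-tagged example sentence (an assumption, not tagger output).
tagged_sentence = [("the", "DT"), ("cow", "NN"), ("eats", "VBZ"), ("wheat", "NN")]

words = [word for word, _ in tagged_sentence]
tags = [tag for _, tag in tagged_sentence]

word_unigrams = unigrams(words)
word_bigrams = bigrams(words)
pos_bigrams = bigrams(tags)

if __name__ == "__main__":
    print("Unigrams:   ", word_unigrams)
    print("Bigrams:    ", word_bigrams)
    print("POS-Bigrams:", pos_bigrams)
```

Unlike the plain bag of words, bigram features keep some local word order: "the_cow" and "cow_the" are different features.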

Page 25: Introduction to Machine Learning and Text Mining

Keep this picture in mind…

Machine learning isn’t magic

But it can be useful for identifying meaningful patterns in your data when used properly

Proper use requires insight into your data
