machine learning and data mining an introduction with wekammartin/lds/ahpcrc081810.pdf · machine...

25
Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr. Martin Based on slides by Gregory Piatetsky-Shapiro from Kdnuggets http://www.kdnuggets.com/data_mining_course/

Upload: others

Post on 25-May-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Machine Learning andData Mining

An Introduction with WEKA

AHPCRC Workshop - 8/18/10 - Dr. MartinBased on slides by Gregory Piatetsky-Shapiro from Kdnuggets

http://www.kdnuggets.com/data_mining_course/

Page 2: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Some review

• What are we doing?• Data Mining• And a really brief intro to machine

learning

Page 3: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Finding patterns

• Goal: programs that detect patterns andregularities in the data

• Strong patterns ⇒ good predictions– Problem 1: most patterns are not interesting– Problem 2: patterns may be inexact (or

spurious)– Problem 3: data may be garbled or missing

Page 4: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Machine learning techniques

• Algorithms for acquiring structural descriptions fromexamples

• Structural descriptions represent patterns explicitly– Can be used to predict outcome in new situation– Can be used to understand and explain how prediction is

derived(may be even more important)

• Methods originate from artificial intelligence,statistics, and research on databases

witten&eibe

Page 5: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Can machines really learn?• Definitions of “learning” from dictionary:

To get knowledge of by study,experience, or being taughtTo become aware by information orfrom observationTo commit to memoryTo be informed of, ascertain; to receive instruction

Difficult to measure

Trivial for computers

Things learn when they change theirbehavior in a way that makes themperform better in the future.

• Operational definition:Does a slipper learn?

• Does learning imply intention?

witten&eibe

Page 6: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

ClassificationLearn a method for predicting the instance class

from pre-labeled (classified) instances

Many approaches:Regression,Decision Trees,Bayesian,Neural Networks,...

Given a set of points from classeswhat is the class of new point ?

Page 7: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Classification: LinearRegression

• Linear Regressionw0 + w1 x + w2 y >= 0

• Regressioncomputes wi fromdata to minimizesquared error to‘fit’ the data

• Not flexible enough

Page 8: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Classification: Decision Trees

X

Y

if X > 5 then blueelse if Y > 3 then blueelse if X > 2 then greenelse blue

52

3

Page 9: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Classification: Neural Nets

• Can select morecomplex regions

• Can be moreaccurate

• Also can overfit thedata – find patternsin random noise

Page 10: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Built in Data Sets

• Weka comes with some built in datasets

• Described in chapter 1• We’ll start with the Weather Problem

– Toy (very small)– Data is entirely fictitious

Page 11: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

But First…

• Components of the input:– Concepts: kinds of things that can be learned

• Aim: intelligible and operational concept description

– Instances: the individual, independent examples ofa concept

• Note: more complicated forms of input are possible

– Attributes: measuring aspects of an instance• We will focus on nominal and numeric ones

Page 12: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

What’s in an attribute?• Each instance is described by a fixed predefined set of

features, its “attributes”• But: number of attributes may vary in practice

– Possible solution: “irrelevant value” flag• Related problem: existence of an attribute may depend of value

of another one• Possible attribute types (“levels of measurement”):

– Nominal, ordinal, interval and ratio

witten&eibe

Page 13: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

What’s a concept?• Data Mining Tasks (Styles of learning):

– Classification learning:predicting a discrete class

– Association learning:detecting associations between features

– Clustering:grouping similar instances into clusters

– Numeric prediction:predicting a numeric quantity

• Concept: thing to be learned• Concept description: output of learning scheme

witten&eibe

Page 14: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

The weather problem

notruehighmildrainyyesfalsenormalhotovercastyestruehighmildovercastyestruenormalmildsunnyyesfalsenormalmildrainyyesfalsenormalmildsunnynofalsehighmildsunnyyestruenormalmildovercastnotruenormalmildrainyyesfalsenormalmildrainyyesfalsehighmildrainyyesfalsehighhotovercastnotruehighhotsunnynofalsehighhotsunnyPlayWindyHumidityTemperatureOutlook

Given past data,Can you come upwith the rules for Play/Not Play ?

What is the game?

Page 15: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

The weather problem

• Given this data, what are the rules forplay/not play?

……………YesFalseNormalMildRainyYesFalseHighHotOvercastNoTrueHighHotSunnyNoFalseHighHotSunny

PlayWindyHumidityTemperatureOutlook

Page 16: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

The weather problem

• Conditions for playing

……………YesFalseNormalMildRainyYesFalseHighHotOvercastNoTrueHighHotSunnyNoFalseHighHotSunny

PlayWindyHumidityTemperatureOutlook

If outlook = sunny and humidity = high then play = noIf outlook = rainy and windy = true then play = noIf outlook = overcast then play = yesIf humidity = normal then play = yesIf none of the above then play = yes

witten&eibe

Page 17: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Weather data with mixed attributes

notrue9171rainyyesfalse7581overcastyestrue9072overcastyestrue7075sunnyyesfalse8075rainyyesfalse7069sunnynofalse9572sunnyyestrue6564overcastnotrue7065rainyyesfalse8068rainyyesfalse9670rainyyesfalse8683overcastnotrue9080sunnynofalse8585sunnyPlayWindyHumidityTemperatureOutlook

Page 18: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Weather data with mixedattributes

• How will the rules change when someattributes have numeric values?

……………YesFalse8075RainyYesFalse8683OvercastNoTrue9080SunnyNoFalse8585Sunny

PlayWindyHumidityTemperatureOutlook

Page 19: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Weather data with mixed attributes

• Rules with mixed attributes

……………YesFalse8075RainyYesFalse8683OvercastNoTrue9080SunnyNoFalse8585Sunny

PlayWindyHumidityTemperatureOutlook

If outlook = sunny and humidity > 83 then play = noIf outlook = rainy and windy = true then play = noIf outlook = overcast then play = yesIf humidity < 85 then play = yesIf none of the above then play = yes

witten&eibe

Page 20: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Some fun with WEKA

• Open WEKA preferably in Linux• We need to find the data file

find . -name \*arff -ls May want to copy into an easier place to

get to gunzip *.gz Take a look at the file format

Page 21: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

The ARFF format%% ARFF file for weather data with some numeric features%@relation weather

@attribute outlook {sunny, overcast, rainy}@attribute temperature numeric@attribute humidity numeric@attribute windy {true, false}@attribute play? {yes, no}

@datasunny, 85, 85, false, nosunny, 80, 90, true, noovercast, 83, 86, false, yes...

witten&eibe

Page 22: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Open Weka Explorer Open file… Choose weather.arff

Note that if you have a file in .csv format E.g. from Excel It can be opened and will be automatically converted to

.arff format

Page 23: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Weka

Page 24: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Classifying Weather Data

• Click on Classify– Choose bayes -> NaïveBayesSimple– Choose trees -> J48– Try some more

Page 25: Machine Learning and Data Mining An Introduction with WEKAmmartin/LDS/AHPCRC081810.pdf · Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/18/10 - Dr

Keep Exploring

• Try the iris data set• Does it work better?