data mining practical machine learning tools and techniques chapter 1: what’s it all about? rodney...

111
Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Upload: kory-walker

Post on 12-Jan-2016

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Data MiningPractical Machine Learning Tools and Techniques

Chapter 1: What’s it all about?

Rodney Nielsen

Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Page 2: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

What’s Data Mining All About?

• Data vs information• Data mining and machine learning• Structural descriptions

◆ Rules: classification and association

◆ Decision trees

• Datasets◆ Weather, contact lens, CPU performance, labour negotiation data, soybean

classification

• Fielded applications◆ Ranking web pages, loan applications, screening images, load forecasting,

machine fault diagnosis, market basket analysis

• Generalization as search• Data mining and ethics

Page 3: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Data Vs. Information

• Society produces huge amounts of data◆Sources: business, science, medicine, economics,

geography, environment, sports, …

• Potentially valuable resource• Raw data is useless: need techniques to

automatically extract information from it◆Data: recorded facts

◆Information: patterns underlying the data

Page 4: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Information is Crucial

• Example 1: in vitro fertilization◆Given: embryos described by 60 features

◆Problem: selection of embryos that will survive

◆Data: historical records of embryos and outcome

• Example 2: cow culling◆Given: cows described by 700 features

◆Problem: selection of cows that should be culled

◆Data: historical records and farmers’ decisions

Page 5: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Data Mining

• Goal: Extracting◆ implicit,

◆ previously unknown,

◆ potentially useful

information from data• Needed: programs that detect patterns and regularities in

the data• Strong patterns > good predictions

◆ Problem 1: most patterns are not interesting

◆ Problem 2: patterns may be inexact (or spurious)

◆ Problem 3: data may be garbled or missing

Page 6: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Machine Learning Techniques

• Algorithms for acquiring structural descriptions from examples

• Structural descriptions represent patterns explicitly◆Can be used to predict outcome in new situation

◆Can be used to understand and explain how prediction is derived

◆Which is more important?

• Methods originate from artificial intelligence, statistics, and research on databases,among other disciplines

Page 7: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Structural Descriptions

• Example: if-then rules

If tear production rate = reducedthen recommendation = none

Otherwise, if age = young and astigmatic = no then recommendation = soft

……………

HardNormalYesMyopePresbyopic

NoneReducedNoHypermetrope

Pre-presbyopic

SoftNormalNoHypermetrope

Young

NoneReducedNoMyopeYoung

Recommended lenses

Tear production

rate

AstigmatismSpectacle prescription

Age

Page 8: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Can Machines Really Learn?

• Definitions of “learning” from dictionaryTo get knowledge of by study,experience, or being taughtTo become aware by information orfrom observationTo commit to memoryTo be informed of, ascertain; to receive instruction

Difficult to measure

Trivial for computers

Things learn when they change their behavior in a way that makes them perform better in the future.

● Operational definition:

Does a slipper learn?

Page 9: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

The Weather Problem

• Conditions for playing a certain game

……………

YesFalseNormalMildRainy

YesFalseHighHotOvercast

NoTrueHighHotSunny

NoFalseHighHotSunny

PlayWindyHumidityTemperatureOutlook

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes

Page 10: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Classification vs. Association Rules

• Classification rule:• Predicts value of a given attribute (the classification of an example)

• Association rule:• Predicts value of arbitrary attribute (or combination)

If outlook = sunny and humidity = highthen play = no

If temperature = cool then humidity = normal

If humidity = normal and windy = falsethen play = yes

If outlook = sunny and play = no then humidity = high

If windy = false and play = no then outlook = sunny and humidity = high

Page 11: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Weather Data with Mixed Attributes

• Some attributes have numeric values

……………

YesFalse8075Rainy

YesFalse8683Overcast

NoTrue9080Sunny

NoFalse8585Sunny

PlayWindyHumidityTemperatureOutlook

If outlook = sunny and humidity > 83 then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity < 85 then play = yes

If none of the above then play = yes

Page 12: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Homework

• Download and familiarize yourself with Weka:• http://www.cs.waikato.ac.nz/~ml/weka/

downloading.html• http://www.cs.waikato.ac.nz/~ml/weka/svn.html

• Read Chapters 1 and 2 of:• Data Mining: Practical Machine Learning Tools

and Techniques, by Witten, Frank & Hall• Available online through UNT library as a pdf

• http://libproxy.library.unt.edu:2079/science/book/9780123748560

Page 13: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Homework

• Course website:

Page 14: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

The Contact Lenses Data

NoneReducedYesHypermetropePre-presbyopic NoneNormalYesHypermetropePre-presbyopic NoneReducedNoMyopePresbyopic

NoneNormalNoMyopePresbyopicNoneReducedYesMyopePresbyopicHardNormalYesMyopePresbyopicNoneReducedNoHypermetropePresbyopicSoftNormalNoHypermetropePresbyopic

NoneReducedYesHypermetropePresbyopicNoneNormalYesHypermetropePresbyopic

SoftNormalNoHypermetropePre-presbyopic

NoneReducedNoHypermetropePre-presbyopic

HardNormalYesMyopePre-presbyopic

NoneReducedYesMyopePre-presbyopic

SoftNormalNoMyopePre-presbyopic

NoneReducedNoMyopePre-presbyopic

hardNormalYesHypermetropeYoungNoneReducedYesHypermetropeYoungSoftNormalNoHypermetropeYoung

NoneReducedNoHypermetropeYoungHardNormalYesMyopeYoungNoneReducedYesMyopeYoungSoftNormalNoMyopeYoung

NoneReducedNoMyopeYoung

Recommended lenses

Tear production rate

AstigmatismSpectacle prescription

Age

Page 15: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

A Complete and Correct Rule Set

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = noand tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = noand tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myopeand astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = noand tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yesand tear production rate = normal then recommendation = hard

If age young and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = pre-presbyopicand spectacle prescription = hypermetropeand astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetropeand astigmatic = yes then recommendation = none

Page 16: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

A Decision Tree for This Problem

Page 17: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Classifying Iris Flowers

Iris virginica1.95.12.75.8102

101

52

51

2

1

Iris virginica2.56.03.36.3

Iris versicolor

1.54.53.26.4

Iris versicolor

1.44.73.27.0

Iris setosa0.21.43.04.9

Iris setosa0.21.43.55.1

TypePetal width

Petal length

Sepal width

Sepal length

If petal length < 2.45 then Iris setosa

If sepal width < 2.10 then Iris versicolor

...

Page 18: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Predicting CPU Performance

• Example: 209 different computer configurations

• Linear regression function

0

0

32

128

CHMAX

0

0

8

16

CHMIN

Channels Performance

Cache (Kb)

Main memory (Kb)

Cycle time (ns)

45040001000480209

67328000512480208

26932320008000292

19825660002561251

PRPCACHMMAXMMINMYCT

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

Page 19: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Data From Labor Negotiations

goodgoodgoodbad{good,bad}Acceptability of contracthalffull?none{none,half,full}Health plan contributionyes??no{yes,no}Bereavement assistancefullfull?none{none,half,full}Dental plan contributionyes??no{yes,no}Long-term disability

assistance

avggengenavg{below-avg,avg,gen}Vacation12121511(Number of days)Statutory holidays???yes{yes,no}Education allowance

Shift-work supplementStandby payPensionWorking hours per weekCost of living adjustmentWage increase third yearWage increase second yearWage increase first yearDurationAttribute

44%5%?Percentage??13%?Percentage???none{none,ret-allw, empl-

cntr}

40383528(Number of hours)none?tcfnone{none,tcf,tc}????Percentage4.04.4%5%?Percentage4.54.3%4%2%Percentage2321(Number of years)40…321Type

Page 20: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Decision Trees for the Labor Data

Page 21: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Soybean Classification

Diaporthe stem canker

19Diagnosis

Normal3ConditionRoot

Yes2Stem lodging

Abnormal2ConditionStem

…?3Leaf spot sizeAbnormal2ConditionLeaf?5Fruit spots

Normal4Condition of fruit pods

Fruit…

Absent2Mold growthNormal2ConditionSeed

…Above normal3PrecipitationJuly7Time of occurrence

Sample valueNumber of

values

Attribute

Page 22: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

The Role of Domain Knowledge

• In this domain, “leaf condition is normal” implies “leaf malformation is absent”!

If leaf condition is normaland stem condition is abnormaland stem cankers is below soil lineand canker lesion color is brown

thendiagnosis is rhizoctonia root rot

If leaf malformation is absentand stem condition is abnormaland stem cankers is below soil lineand canker lesion color is brown

thendiagnosis is rhizoctonia root rot

Page 23: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Fielded Applications

• The result of learning—or the learning method itself—is deployed in practical applications◆ Processing loan applications

◆ Screening images for oil slicks

◆ Electricity supply forecasting

◆ Diagnosis of machine faults

◆ Marketing and sales

◆ Separating crude oil and natural gas

◆ Reducing banding in rotogravure printing

◆ Finding appropriate technicians for telephone faults

◆ Scientific applications: biology, astronomy, chemistry

◆ Automatic selection of TV programs

◆ Monitoring intensive care patients

Page 24: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Processing Loan Applications(American Express)

• Given: questionnaire with financial and personal information

• Question: should money be lent?• Simple statistical method covers 90% of cases• Borderline cases referred to loan officers• But: 50% of accepted borderline cases defaulted!• Solution: reject all borderline cases?

◆ No! Borderline cases are most active customers

Page 25: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enter Machine Learning

• 1000 training examples of borderline cases

• 20 attributes:

◆ age

◆ years with current employer

◆ years at current address

◆ years with the bank

◆ other credit cards possessed,…

• Learned rules: correct on 70% of cases

◆ human experts only 50%

• Rules could be used to explain decisions to customers

Page 26: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Screening Images

• Given: radar satellite images of coastal waters• Problem: detect oil slicks in those images• Oil slicks appear as dark regions with changing size and

shape• Not easy: lookalike dark regions can be caused by weather

conditions (e.g. high wind)• Expensive process requiring highly trained personnel

Page 27: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enter Machine Learning

• Extract dark regions from normalized image• Attributes:

◆ size of region◆ shape, area◆ intensity◆ sharpness and jaggedness of boundaries◆ proximity of other regions◆ info about background

• Constraints:◆ Few training examples—oil slicks are rare!◆ Unbalanced data: most dark regions aren’t slicks◆ Regions from same image form a batch◆ Requirement: adjustable false-alarm rate

◆ Precision-Recall tradeoff

Page 28: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Load Forecasting

• Electricity supply companies need forecast of future demand for power

• Forecasts of min/max load for each hour Significant savings

• Given: manually constructed load model that assumes “normal” climatic conditions

• Problem: adjust for weather conditions• Static model consist of:

◆ Base load for the year

◆ Load periodicity over the year

◆ Effect of holidays

Page 29: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enter Machine Learning

• Prediction corrected using “most similar” days• Attributes:

◆ Temperature

◆ Humidity

◆ Wind speed

◆ Cloud cover readings

◆ Plus difference between actual load and predicted load

• Average difference among three “most similar” days added to static model

• Linear regression coefficients form attribute weights in similarity function

Page 30: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Diagnosis of Machine Faults

• Diagnosis: classical domain of expert systems• Given: Fourier analysis of vibrations measured

at various points of a device’s mounting• Question: which fault is present?• Preventative maintenance of electromechanical

motors and generators• Information very noisy• So far: diagnosis by expert/hand-crafted rules

Page 31: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enter Machine Learning

• Available: 600 faults with expert’s diagnosis• ~300 unsatisfactory, rest used for training• Attributes augmented by intermediate concepts

that embodied causal domain knowledge• Expert not satisfied with initial rules because they

did not relate to his domain knowledge• Further background knowledge resulted in more

complex rules that were satisfactory• Learned rules outperformed hand-crafted ones

Page 32: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Marketing and Sales I

• Companies precisely record massive amounts of marketing and sales data

• Applications:◆ Customer loyalty:

identifying customers that are likely to defect by detecting changes in their behavior(e.g. banks/phone companies)

◆ Special offers:identifying profitable customers(e.g. reliable owners of credit cards that need extra money during the holiday season)

Page 33: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Marketing and Sales II

• Market basket analysis◆Association techniques find groups of items that tend to

occur together in a transaction(used to analyze checkout data)

• Historical analysis of purchasing patterns• Identifying prospective customers

◆Focusing promotional mailouts(targeted campaigns are cheaper than mass-marketed ones)

Page 34: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

UNT

• Where should UNT be using Data Mining and why?

Page 35: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

The World

• What ripe opportunities do you see for DM in the world?

Page 36: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

DM vs. ML vs. Statistics

• What are the differences and relationships between:• Data Mining and Statistics?• Data Mining and Machine Learning?

Page 37: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Machine Learning and Statistics

• Historical difference (grossly oversimplified):◆Statistics: testing hypotheses

◆Machine learning: finding the right hypothesis

• But: huge overlap◆Decision trees (C4.5 and CART)

◆Nearest-neighbor methods

• Today: perspectives have significant overlap◆Many ML algorithms employ statistical techniques

Homework: http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai

Page 38: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Generalization as Search

• Learning = Generalization

• Interesting Learning => Generalization

Page 39: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Generalization as Search

• Inductive learning: find a concept description that fits the data

• Example: rule sets as concept description language

• … continued

Page 40: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Example Rules

if outlook=sunny and temperature=hot and humidity=high and windy=false

then play = no

ifoutlook

temperature humidity windy

thenplay

sunny hot high false no

if outlook = sunny and humidity = highthen play = no

play = yes

Page 41: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Weather Data Points (Examples)

outlook temperature humidity windy playsunny hot high false no

sunny hot high true no

overcast hot high false yes

rainy mild high false yes

rainy cool normal false yes

rainy cool normal true no

overcast cool normal true yes

sunny mild high false no

sunny cool normal false yes

rainy mild normal false yes

sunny mild normal true yes

overcast mild high true yes

overcast hot normal false yes

rainy mild high true no

Page 42: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Generalization as Search

• Inductive learning: find a concept description that fits the data

• Example: rule sets as concept description language◆Enormous, but finite, search space

• … continued

Page 43: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerating the Concept Space

• Search space for weather problem◆Number of unique possible rules:

◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules

Page 44: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

All Possible Rules

Rule #sif outlook temperature humidity windy

then play

1 sunny hot high false no

2 sunny hot high false yes

3 sunny hot high true no

4 sunny hot high true yes

5-6 sunny hot high * ...

7-12 sunny hot normal ...

13-18 sunny hot * ...

19-36 sunny mild ...

37-54 sunny cool ...

55-72 sunny * ...

73-144 overcast ...

145-216 rainy ...

217-288 * ...

* Match any value = omit feature from the rule

Page 45: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Example Rules

Rule #if outlook temperature humidity windy

then play

...

59 sunny * high * no

...

288 * * * * yes

59: if outlook = sunny and humidity = high then play = no

288: play = yes

Page 46: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerating the Concept Space

• Search space for weather problem◆Number of unique possible rules:

◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules

◆Do we have 288 total possible solutions/hypotheses?

◆Do we have 288 total possible concepts?

Page 47: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

The Concept Space

• A Concept is (or is defined by):• A set of rules• E.g., rules: {59, 144, 213, 282}

• The Concept Space is (or is defined by):• All the possible sets of rules

if outlook=sunny and temperature=hot and humidity=high and windy=false

then play = no

if outlook = sunny and humidity = high

then play = no

play = yes

Page 48: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerating the Concept Space

• Search space for weather problem◆Number of unique possible rules:

◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules

◆Number of unique possible rule sets?

Page 49: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerate Concept Space

• {1}• {2}• …• {288}• {1, 2}• {1, 3}• …• {287, 288}• {1, 2, 3}• …• {1, 2, …, 288}

Page 50: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerating the Concept Space

• Search space for weather problem◆Number of unique possible rules:

◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules

◆Number of unique possible rule sets:

◆Simplifying assumption:◆Each rule set contains at most 14 rules

◆There are around 2.7x1034 possible rule “sets”

◆I.e., around 2.7x1034 possible hypotheses in the search space

Page 51: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Generalization as Search

• Inductive learning: find a concept description that fits the data

• Example: rule sets as concept description language◆Enormous, but finite, search space

• Simple solution:◆Enumerate the concept space (2.7x1034 rule “sets”)

◆Eliminate descriptions that do not fit examples

◆Surviving descriptions contain target concept

Page 52: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerate All Possible Rules

Rule #sif outlook temperature humidity windy

then play

1 sunny hot high false no

2 sunny hot high false yes

3 sunny hot high true no

4 sunny hot high true yes

5-6 sunny hot high * ...

7-12 sunny hot normal ...

13-18 sunny hot * ...

19-36 sunny mild ...

37-54 sunny cool ...

55-72 sunny * ...

73-144 overcast ...

145-216 rainy ...

217-288 * ...

Page 53: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerate Concept Space

• {1}• {2}• …• {288}• {1, 2}• {1, 3}• …• {287, 288}• …• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15}• …• {275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288}

• ~2.7x1034 rule “sets”

Page 54: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Generalization as Search

• Inductive learning: find a concept description that fits the data

• Example: rule sets as concept description language◆Enormous, but finite, search space

• Simple solution:◆Enumerate the concept space (2.7x1034 rule “sets”)

◆Eliminate concept descriptions that do not fit examples

◆Surviving descriptions contain target concept

Page 55: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Eliminate Rules

Rule #sif outlook temperature humidity windy

then play

1 sunny hot high false no

2 sunny hot high false yes

3 sunny hot high true no

4 sunny hot high true yes

5-6 sunny hot high * ...

7-12 sunny hot normal ...

13-18 sunny hot * ...

19-36 sunny mild ...

37-54 sunny cool ...

55-72 sunny * ...

73-144 overcast ...

145-216 rainy ...

217-288 * ...

Page 56: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Eliminate Concepts

• {1}• {2}• …• {288}• {1, 2}• {1, 3}• …• {287, 288}• … there are some valid sets in here• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15}• … and in here• {275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288}

• ~2.7x1034 rule “sets”

Page 57: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Generalization as Search

• Inductive learning: find a concept description that fits the data

• Example: rule sets as concept description language◆Enormous, but finite, search space

• Simple solution:◆Enumerate the concept space (2.7x1034 rule “sets”)

◆Eliminate descriptions that do not fit examples

◆Surviving descriptions contain target concept

◆This would rarely be practical

Page 58: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Enumerating the Concept Space

• Other practical problems:◆ More than one description may survive

◆ No description may survive• Language is unable to describe target concept• or data contains noise

• Another view of generalization as search:hill-climbing in description space according to pre-specified matching criterion◆ Most practical algorithms use heuristic search that cannot guarantee an

optimum solution

Page 59: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Bias

• Important decisions in learning systems:◆Concept description language

◆Order in which the space is searched

◆Way that overfitting to the particular training data is avoided

• These form the “bias” of the search:◆Language bias

◆Search bias

◆Overfitting-avoidance bias

Page 60: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Bias

• What is meant by bias in machine learning?

Page 61: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Language Bias

• Important question:◆Is the concept description language universal

or does it restrict what can be learned?

• A universal language can express arbitrary subsets of examples

• If language includes logical or (“disjunction”), it is universal• Example: rule sets

• Domain knowledge can be used to exclude some concept descriptions a priori from the search

Page 62: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Overfitting-Avoidance Bias

• Can be seen as a form of search bias• Modified evaluation criterion

◆E.g. balancing simplicity and number of errors

• Modified search strategy◆E.g. pruning (simplifying a description)

• Pre-pruning: stops at a simple description before search proceeds to an overly complex one

• Post-pruning: generates a complex description first and simplifies it afterwards

Page 63: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Search Bias

• Search heuristic◆“Greedy” search: performing the best single step

◆“Beam search”: keeping several alternatives

◆…

• Direction of search◆General-to-specific

• E.g. specializing a rule by adding conditions

◆Specific-to-general• E.g. generalizing an individual instance into a rule

Page 64: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Direction of Search

• What is the difference between the two directions of search, which is better, and why?

Page 65: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

The Examples outlook temperature humidity windy

play

sunny hot high false nosunny hot high true noovercast hot high false yesrainy mild high false yesrainy cool normal false yesrainy cool normal true noovercast cool normal true yessunny mild high false nosunny cool normal false yesrainy mild normal false yessunny mild normal true yesovercast mild high true yesovercast hot normal false yesrainy mild high true no

Page 66: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Examples Rules

Rule #sif outlook temperature humidity windy

then play

1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no

Page 67: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Otherwise…

Rule #sif outlook temperature humidity windy

then play

1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes

Page 68: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes

Page 69: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1’ sunny hot high * no

3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes

Page 70: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1’ sunny hot high * no

3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes

Page 71: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1’ sunny hot high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes

Page 72: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1’ sunny hot high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 73: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 74: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 75: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 76: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 77: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 78: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

15 * * * * yes

Page 79: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

Otherwise * * * * yes

Page 80: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes

7 overcast cool normal true yes

9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes

Otherwise * * * * yes

Page 81: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Specific General

Rule #sif outlook temperature humidity windy

then play

1” sunny * high * no

2’ rainy * * true no

Otherwise * * * * yes

Page 82: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Search Bias

• Search heuristic◆“Greedy” search: performing the best single step

◆“Beam search”: keeping several alternatives

◆…

• Direction of search◆General-to-specific

• E.g. specializing a rule by adding conditions

◆Specific-to-general• E.g. generalizing an individual instance into a rule

Page 83: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

1 * * * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 84: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

1 sunny * * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 85: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

1 rainy * * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 86: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 4

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 87: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 4

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 88: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * hot * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 89: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * mild * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 90: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * cool * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 91: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * * high * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 92: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * * normal * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 93: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * * * false yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 94: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 * * * true yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 95: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 rainy * * false yes 3

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 96: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yes

Examples

Page 97: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2

Otherwise * * * * no 5

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true no

Examples

Page 98: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2

Otherwise * * * * no 5

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true no

Examples

Page 99: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

General Specific

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2

Otherwise * * * * no 5

1: if outlook = overcast then play = yes

2: if outlook = rainy and windy = false then play = yes

3: if outlook = sunny and humidity = normal then play = yes

Otherwise, play = no

Page 100: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Bias

• Important decisions in learning systems:◆Concept description language

◆Order in which the space is searched

◆Way that overfitting to the particular training data is avoided

• These form the “bias” of the search:◆Language bias

◆Search bias

◆Overfitting-avoidance bias

Page 101: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Data Mining and Ethics I

• Ethical issues arise in practical applications• Anonymizing data is difficult

◆85% of Americans can be identified from just zip code, birth date and sex

• Data mining and discrimination◆ E.g. loan applications: using some information (e.g. sex, religion,

race) is unethical

• Ethical situation depends on application◆ E.g. same information ok in medical application

• Attributes may contain problematic information◆ E.g. zip code may correlate with race

Page 102: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Data Mining and Ethics II

• Important questions:◆Who is permitted access to the data?

◆For what purpose was the data collected?

◆What kind of conclusions can be legitimately drawn from it?

• Caveats must be attached to results• Purely statistical arguments are not sufficient!• Are resources put to good use?

Page 103: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Questions

Page 104: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

generalspecific: Search v2

Rule #sif outlook temperature humidity windy

then play

1 * * * * no

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

Page 105: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Rule #sif outlook temperature humidity windy

then play

1 sunny * high * no2 rainy * * true no

Otherwise * * * * yes

sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes

Examples

generalspecific: Search v2

Page 106: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Rule #sif outlook temperature humidity windy

then play

1 sunny * high * no2 rainy * * true no

Otherwise * * * * yes

1: if outlook = sunny and humidity = high then play = no

2: if outlook = rainy and windy = true then play = no

Otherwise, play = yes

generalspecific: Search v2

Page 107: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Search v1 versus Search v2

T\O sunny overcast rainyhot

mildcool

Rule #sif outlook temperature humidity windy

then play

# Exs

1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2

Otherwise * * * * no 5

T\O sunny overcast rainyhot

mildcool

T\O sunny overcast rainyhot

mildcool

T\O sunny overcast rainyhot

mildcool

Humidity high normalWindy

true

false

Page 108: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Rule #sif outlook temperature humidity windy

then play

1 sunny * high * no2 rainy * * true no

Otherwise * * * * yes

Search v1 versus Search v2

T\O sunny overcast rainyhot

mildcool

T\O sunny overcast rainyhot

mildcool

T\O sunny overcast rainyhot

mildcool

T\O sunny overcast rainyhot

mildcool

Humidity highWindy

true

false

normal

Page 109: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Bias

• Important decisions in learning systems:◆Concept description language

◆Order in which the space is searched

◆Way that overfitting to the particular training data is avoided

• These form the “bias” of the search:◆Language bias

◆Search bias

◆Overfitting-avoidance bias

Page 110: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Ross Quinlan

• Machine learning researcher from 1970’s• University of Sydney, Australia• 1986 “Induction of decision trees” ML

Journal• 1993 C4.5: Programs for machine learning.

• Morgan Kaufmann

• 199? Started

Page 111: Data Mining Practical Machine Learning Tools and Techniques Chapter 1: What’s it all about? Rodney Nielsen Many of these slides were adapted from: I. H

Rodney Nielsen, Human Intelligence & Language Technologies Lab

Statisticians

• Sir Ronald Aylmer Fisher

• Born: 17 Feb 1890 London, EnglandDied: 29 July 1962 Adelaide, Australia

• Numerous distinguished contributions to developing the theory and application of statistics for advancing quantitative modelling of vast field of biology

• Leo Breiman• Developed decision trees• 1984 Classification and Regression

Trees. Wadsworth.