data mining practical machine learning tools and techniques chapter 1: what’s it all about? rodney...
TRANSCRIPT
Data MiningPractical Machine Learning Tools and Techniques
Chapter 1: What’s it all about?
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall
Rodney Nielsen, Human Intelligence & Language Technologies Lab
What’s Data Mining All About?
• Data vs information• Data mining and machine learning• Structural descriptions
◆ Rules: classification and association
◆ Decision trees
• Datasets◆ Weather, contact lens, CPU performance, labour negotiation data, soybean
classification
• Fielded applications◆ Ranking web pages, loan applications, screening images, load forecasting,
machine fault diagnosis, market basket analysis
• Generalization as search• Data mining and ethics
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Data Vs. Information
• Society produces huge amounts of data◆Sources: business, science, medicine, economics,
geography, environment, sports, …
• Potentially valuable resource• Raw data is useless: need techniques to
automatically extract information from it◆Data: recorded facts
◆Information: patterns underlying the data
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Information is Crucial
• Example 1: in vitro fertilization◆Given: embryos described by 60 features
◆Problem: selection of embryos that will survive
◆Data: historical records of embryos and outcome
• Example 2: cow culling◆Given: cows described by 700 features
◆Problem: selection of cows that should be culled
◆Data: historical records and farmers’ decisions
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Data Mining
• Goal: Extracting◆ implicit,
◆ previously unknown,
◆ potentially useful
information from data• Needed: programs that detect patterns and regularities in
the data• Strong patterns > good predictions
◆ Problem 1: most patterns are not interesting
◆ Problem 2: patterns may be inexact (or spurious)
◆ Problem 3: data may be garbled or missing
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Machine Learning Techniques
• Algorithms for acquiring structural descriptions from examples
• Structural descriptions represent patterns explicitly◆Can be used to predict outcome in new situation
◆Can be used to understand and explain how prediction is derived
◆Which is more important?
• Methods originate from artificial intelligence, statistics, and research on databases,among other disciplines
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Structural Descriptions
• Example: if-then rules
If tear production rate = reducedthen recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
……………
HardNormalYesMyopePresbyopic
NoneReducedNoHypermetrope
Pre-presbyopic
SoftNormalNoHypermetrope
Young
NoneReducedNoMyopeYoung
Recommended lenses
Tear production
rate
AstigmatismSpectacle prescription
Age
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Can Machines Really Learn?
• Definitions of “learning” from dictionaryTo get knowledge of by study,experience, or being taughtTo become aware by information orfrom observationTo commit to memoryTo be informed of, ascertain; to receive instruction
Difficult to measure
Trivial for computers
Things learn when they change their behavior in a way that makes them perform better in the future.
● Operational definition:
Does a slipper learn?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
The Weather Problem
• Conditions for playing a certain game
……………
YesFalseNormalMildRainy
YesFalseHighHotOvercast
NoTrueHighHotSunny
NoFalseHighHotSunny
PlayWindyHumidityTemperatureOutlook
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Classification vs. Association Rules
• Classification rule:• Predicts value of a given attribute (the classification of an example)
• Association rule:• Predicts value of arbitrary attribute (or combination)
If outlook = sunny and humidity = highthen play = no
If temperature = cool then humidity = normal
If humidity = normal and windy = falsethen play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Weather Data with Mixed Attributes
• Some attributes have numeric values
……………
YesFalse8075Rainy
YesFalse8683Overcast
NoTrue9080Sunny
NoFalse8585Sunny
PlayWindyHumidityTemperatureOutlook
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Homework
• Download and familiarize yourself with Weka:• http://www.cs.waikato.ac.nz/~ml/weka/
downloading.html• http://www.cs.waikato.ac.nz/~ml/weka/svn.html
• Read Chapters 1 and 2 of:• Data Mining: Practical Machine Learning Tools
and Techniques, by Witten, Frank & Hall• Available online through UNT library as a pdf
• http://libproxy.library.unt.edu:2079/science/book/9780123748560
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Homework
• Course website:
Rodney Nielsen, Human Intelligence & Language Technologies Lab
The Contact Lenses Data
NoneReducedYesHypermetropePre-presbyopic NoneNormalYesHypermetropePre-presbyopic NoneReducedNoMyopePresbyopic
NoneNormalNoMyopePresbyopicNoneReducedYesMyopePresbyopicHardNormalYesMyopePresbyopicNoneReducedNoHypermetropePresbyopicSoftNormalNoHypermetropePresbyopic
NoneReducedYesHypermetropePresbyopicNoneNormalYesHypermetropePresbyopic
SoftNormalNoHypermetropePre-presbyopic
NoneReducedNoHypermetropePre-presbyopic
HardNormalYesMyopePre-presbyopic
NoneReducedYesMyopePre-presbyopic
SoftNormalNoMyopePre-presbyopic
NoneReducedNoMyopePre-presbyopic
hardNormalYesHypermetropeYoungNoneReducedYesHypermetropeYoungSoftNormalNoHypermetropeYoung
NoneReducedNoHypermetropeYoungHardNormalYesMyopeYoungNoneReducedYesMyopeYoungSoftNormalNoMyopeYoung
NoneReducedNoMyopeYoung
Recommended lenses
Tear production rate
AstigmatismSpectacle prescription
Age
Rodney Nielsen, Human Intelligence & Language Technologies Lab
A Complete and Correct Rule Set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = noand tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = noand tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myopeand astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = noand tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yesand tear production rate = normal then recommendation = hard
If age young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopicand spectacle prescription = hypermetropeand astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetropeand astigmatic = yes then recommendation = none
Rodney Nielsen, Human Intelligence & Language Technologies Lab
A Decision Tree for This Problem
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Classifying Iris Flowers
…
…
…
Iris virginica1.95.12.75.8102
101
52
51
2
1
Iris virginica2.56.03.36.3
Iris versicolor
1.54.53.26.4
Iris versicolor
1.44.73.27.0
Iris setosa0.21.43.04.9
Iris setosa0.21.43.55.1
TypePetal width
Petal length
Sepal width
Sepal length
If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Predicting CPU Performance
• Example: 209 different computer configurations
• Linear regression function
0
0
32
128
CHMAX
0
0
8
16
CHMIN
Channels Performance
Cache (Kb)
Main memory (Kb)
Cycle time (ns)
45040001000480209
67328000512480208
…
26932320008000292
19825660002561251
PRPCACHMMAXMMINMYCT
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Data From Labor Negotiations
goodgoodgoodbad{good,bad}Acceptability of contracthalffull?none{none,half,full}Health plan contributionyes??no{yes,no}Bereavement assistancefullfull?none{none,half,full}Dental plan contributionyes??no{yes,no}Long-term disability
assistance
avggengenavg{below-avg,avg,gen}Vacation12121511(Number of days)Statutory holidays???yes{yes,no}Education allowance
Shift-work supplementStandby payPensionWorking hours per weekCost of living adjustmentWage increase third yearWage increase second yearWage increase first yearDurationAttribute
44%5%?Percentage??13%?Percentage???none{none,ret-allw, empl-
cntr}
40383528(Number of hours)none?tcfnone{none,tcf,tc}????Percentage4.04.4%5%?Percentage4.54.3%4%2%Percentage2321(Number of years)40…321Type
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Decision Trees for the Labor Data
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Soybean Classification
Diaporthe stem canker
19Diagnosis
Normal3ConditionRoot
…
Yes2Stem lodging
Abnormal2ConditionStem
…?3Leaf spot sizeAbnormal2ConditionLeaf?5Fruit spots
Normal4Condition of fruit pods
Fruit…
Absent2Mold growthNormal2ConditionSeed
…Above normal3PrecipitationJuly7Time of occurrence
Sample valueNumber of
values
Attribute
Rodney Nielsen, Human Intelligence & Language Technologies Lab
The Role of Domain Knowledge
• In this domain, “leaf condition is normal” implies “leaf malformation is absent”!
If leaf condition is normaland stem condition is abnormaland stem cankers is below soil lineand canker lesion color is brown
thendiagnosis is rhizoctonia root rot
If leaf malformation is absentand stem condition is abnormaland stem cankers is below soil lineand canker lesion color is brown
thendiagnosis is rhizoctonia root rot
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Fielded Applications
• The result of learning—or the learning method itself—is deployed in practical applications◆ Processing loan applications
◆ Screening images for oil slicks
◆ Electricity supply forecasting
◆ Diagnosis of machine faults
◆ Marketing and sales
◆ Separating crude oil and natural gas
◆ Reducing banding in rotogravure printing
◆ Finding appropriate technicians for telephone faults
◆ Scientific applications: biology, astronomy, chemistry
◆ Automatic selection of TV programs
◆ Monitoring intensive care patients
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Processing Loan Applications(American Express)
• Given: questionnaire with financial and personal information
• Question: should money be lent?• Simple statistical method covers 90% of cases• Borderline cases referred to loan officers• But: 50% of accepted borderline cases defaulted!• Solution: reject all borderline cases?
◆ No! Borderline cases are most active customers
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enter Machine Learning
• 1000 training examples of borderline cases
• 20 attributes:
◆ age
◆ years with current employer
◆ years at current address
◆ years with the bank
◆ other credit cards possessed,…
• Learned rules: correct on 70% of cases
◆ human experts only 50%
• Rules could be used to explain decisions to customers
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Screening Images
• Given: radar satellite images of coastal waters• Problem: detect oil slicks in those images• Oil slicks appear as dark regions with changing size and
shape• Not easy: lookalike dark regions can be caused by weather
conditions (e.g. high wind)• Expensive process requiring highly trained personnel
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enter Machine Learning
• Extract dark regions from normalized image• Attributes:
◆ size of region◆ shape, area◆ intensity◆ sharpness and jaggedness of boundaries◆ proximity of other regions◆ info about background
• Constraints:◆ Few training examples—oil slicks are rare!◆ Unbalanced data: most dark regions aren’t slicks◆ Regions from same image form a batch◆ Requirement: adjustable false-alarm rate
◆ Precision-Recall tradeoff
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Load Forecasting
• Electricity supply companies need forecast of future demand for power
• Forecasts of min/max load for each hour Significant savings
• Given: manually constructed load model that assumes “normal” climatic conditions
• Problem: adjust for weather conditions• Static model consist of:
◆ Base load for the year
◆ Load periodicity over the year
◆ Effect of holidays
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enter Machine Learning
• Prediction corrected using “most similar” days• Attributes:
◆ Temperature
◆ Humidity
◆ Wind speed
◆ Cloud cover readings
◆ Plus difference between actual load and predicted load
• Average difference among three “most similar” days added to static model
• Linear regression coefficients form attribute weights in similarity function
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Diagnosis of Machine Faults
• Diagnosis: classical domain of expert systems• Given: Fourier analysis of vibrations measured
at various points of a device’s mounting• Question: which fault is present?• Preventative maintenance of electromechanical
motors and generators• Information very noisy• So far: diagnosis by expert/hand-crafted rules
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enter Machine Learning
• Available: 600 faults with expert’s diagnosis• ~300 unsatisfactory, rest used for training• Attributes augmented by intermediate concepts
that embodied causal domain knowledge• Expert not satisfied with initial rules because they
did not relate to his domain knowledge• Further background knowledge resulted in more
complex rules that were satisfactory• Learned rules outperformed hand-crafted ones
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Marketing and Sales I
• Companies precisely record massive amounts of marketing and sales data
• Applications:◆ Customer loyalty:
identifying customers that are likely to defect by detecting changes in their behavior(e.g. banks/phone companies)
◆ Special offers:identifying profitable customers(e.g. reliable owners of credit cards that need extra money during the holiday season)
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Marketing and Sales II
• Market basket analysis◆Association techniques find groups of items that tend to
occur together in a transaction(used to analyze checkout data)
• Historical analysis of purchasing patterns• Identifying prospective customers
◆Focusing promotional mailouts(targeted campaigns are cheaper than mass-marketed ones)
Rodney Nielsen, Human Intelligence & Language Technologies Lab
UNT
• Where should UNT be using Data Mining and why?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
The World
• What ripe opportunities do you see for DM in the world?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
DM vs. ML vs. Statistics
• What are the differences and relationships between:• Data Mining and Statistics?• Data Mining and Machine Learning?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Machine Learning and Statistics
• Historical difference (grossly oversimplified):◆Statistics: testing hypotheses
◆Machine learning: finding the right hypothesis
• But: huge overlap◆Decision trees (C4.5 and CART)
◆Nearest-neighbor methods
• Today: perspectives have significant overlap◆Many ML algorithms employ statistical techniques
Homework: http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Generalization as Search
• Learning = Generalization
• Interesting Learning => Generalization
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Generalization as Search
• Inductive learning: find a concept description that fits the data
• Example: rule sets as concept description language
• … continued
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Example Rules
if outlook=sunny and temperature=hot and humidity=high and windy=false
then play = no
ifoutlook
temperature humidity windy
thenplay
sunny hot high false no
if outlook = sunny and humidity = highthen play = no
play = yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Weather Data Points (Examples)
outlook temperature humidity windy playsunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Generalization as Search
• Inductive learning: find a concept description that fits the data
• Example: rule sets as concept description language◆Enormous, but finite, search space
• … continued
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerating the Concept Space
• Search space for weather problem◆Number of unique possible rules:
◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules
Rodney Nielsen, Human Intelligence & Language Technologies Lab
All Possible Rules
Rule #sif outlook temperature humidity windy
then play
1 sunny hot high false no
2 sunny hot high false yes
3 sunny hot high true no
4 sunny hot high true yes
5-6 sunny hot high * ...
7-12 sunny hot normal ...
13-18 sunny hot * ...
19-36 sunny mild ...
37-54 sunny cool ...
55-72 sunny * ...
73-144 overcast ...
145-216 rainy ...
217-288 * ...
* Match any value = omit feature from the rule
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Example Rules
Rule #if outlook temperature humidity windy
then play
...
59 sunny * high * no
...
288 * * * * yes
59: if outlook = sunny and humidity = high then play = no
288: play = yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerating the Concept Space
• Search space for weather problem◆Number of unique possible rules:
◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules
◆Do we have 288 total possible solutions/hypotheses?
◆Do we have 288 total possible concepts?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
The Concept Space
• A Concept is (or is defined by):• A set of rules• E.g., rules: {59, 144, 213, 282}
• The Concept Space is (or is defined by):• All the possible sets of rules
if outlook=sunny and temperature=hot and humidity=high and windy=false
then play = no
if outlook = sunny and humidity = high
then play = no
play = yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerating the Concept Space
• Search space for weather problem◆Number of unique possible rules:
◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules
◆Number of unique possible rule sets?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerate Concept Space
• {1}• {2}• …• {288}• {1, 2}• {1, 3}• …• {287, 288}• {1, 2, 3}• …• {1, 2, …, 288}
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerating the Concept Space
• Search space for weather problem◆Number of unique possible rules:
◆ 4 x 4 x 3 x 3 x 2 = 288 possible rules
◆Number of unique possible rule sets:
◆Simplifying assumption:◆Each rule set contains at most 14 rules
◆There are around 2.7x1034 possible rule “sets”
◆I.e., around 2.7x1034 possible hypotheses in the search space
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Generalization as Search
• Inductive learning: find a concept description that fits the data
• Example: rule sets as concept description language◆Enormous, but finite, search space
• Simple solution:◆Enumerate the concept space (2.7x1034 rule “sets”)
◆Eliminate descriptions that do not fit examples
◆Surviving descriptions contain target concept
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerate All Possible Rules
Rule #sif outlook temperature humidity windy
then play
1 sunny hot high false no
2 sunny hot high false yes
3 sunny hot high true no
4 sunny hot high true yes
5-6 sunny hot high * ...
7-12 sunny hot normal ...
13-18 sunny hot * ...
19-36 sunny mild ...
37-54 sunny cool ...
55-72 sunny * ...
73-144 overcast ...
145-216 rainy ...
217-288 * ...
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerate Concept Space
• {1}• {2}• …• {288}• {1, 2}• {1, 3}• …• {287, 288}• …• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15}• …• {275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288}
• ~2.7x1034 rule “sets”
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Generalization as Search
• Inductive learning: find a concept description that fits the data
• Example: rule sets as concept description language◆Enormous, but finite, search space
• Simple solution:◆Enumerate the concept space (2.7x1034 rule “sets”)
◆Eliminate concept descriptions that do not fit examples
◆Surviving descriptions contain target concept
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Eliminate Rules
Rule #sif outlook temperature humidity windy
then play
1 sunny hot high false no
2 sunny hot high false yes
3 sunny hot high true no
4 sunny hot high true yes
5-6 sunny hot high * ...
7-12 sunny hot normal ...
13-18 sunny hot * ...
19-36 sunny mild ...
37-54 sunny cool ...
55-72 sunny * ...
73-144 overcast ...
145-216 rainy ...
217-288 * ...
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Eliminate Concepts
• {1}• {2}• …• {288}• {1, 2}• {1, 3}• …• {287, 288}• … there are some valid sets in here• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}• {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15}• … and in here• {275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288}
• ~2.7x1034 rule “sets”
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Generalization as Search
• Inductive learning: find a concept description that fits the data
• Example: rule sets as concept description language◆Enormous, but finite, search space
• Simple solution:◆Enumerate the concept space (2.7x1034 rule “sets”)
◆Eliminate descriptions that do not fit examples
◆Surviving descriptions contain target concept
◆This would rarely be practical
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Enumerating the Concept Space
• Other practical problems:◆ More than one description may survive
◆ No description may survive• Language is unable to describe target concept• or data contains noise
• Another view of generalization as search:hill-climbing in description space according to pre-specified matching criterion◆ Most practical algorithms use heuristic search that cannot guarantee an
optimum solution
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Bias
• Important decisions in learning systems:◆Concept description language
◆Order in which the space is searched
◆Way that overfitting to the particular training data is avoided
• These form the “bias” of the search:◆Language bias
◆Search bias
◆Overfitting-avoidance bias
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Bias
• What is meant by bias in machine learning?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Language Bias
• Important question:◆Is the concept description language universal
or does it restrict what can be learned?
• A universal language can express arbitrary subsets of examples
• If language includes logical or (“disjunction”), it is universal• Example: rule sets
• Domain knowledge can be used to exclude some concept descriptions a priori from the search
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Overfitting-Avoidance Bias
• Can be seen as a form of search bias• Modified evaluation criterion
◆E.g. balancing simplicity and number of errors
• Modified search strategy◆E.g. pruning (simplifying a description)
• Pre-pruning: stops at a simple description before search proceeds to an overly complex one
• Post-pruning: generates a complex description first and simplifies it afterwards
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Search Bias
• Search heuristic◆“Greedy” search: performing the best single step
◆“Beam search”: keeping several alternatives
◆…
• Direction of search◆General-to-specific
• E.g. specializing a rule by adding conditions
◆Specific-to-general• E.g. generalizing an individual instance into a rule
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Direction of Search
• What is the difference between the two directions of search, which is better, and why?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
The Examples outlook temperature humidity windy
play
sunny hot high false nosunny hot high true noovercast hot high false yesrainy mild high false yesrainy cool normal false yesrainy cool normal true noovercast cool normal true yessunny mild high false nosunny cool normal false yesrainy mild normal false yessunny mild normal true yesovercast mild high true yesovercast hot normal false yesrainy mild high true no
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Examples Rules
Rule #sif outlook temperature humidity windy
then play
1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Otherwise…
Rule #sif outlook temperature humidity windy
then play
1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1’ sunny hot high * no
3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1’ sunny hot high * no
3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1’ sunny hot high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1’ sunny hot high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes
10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
15 * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
Otherwise * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes
7 overcast cool normal true yes
9 sunny cool normal false yes10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes
Otherwise * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Specific General
Rule #sif outlook temperature humidity windy
then play
1” sunny * high * no
2’ rainy * * true no
Otherwise * * * * yes
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Search Bias
• Search heuristic◆“Greedy” search: performing the best single step
◆“Beam search”: keeping several alternatives
◆…
• Direction of search◆General-to-specific
• E.g. specializing a rule by adding conditions
◆Specific-to-general• E.g. generalizing an individual instance into a rule
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
1 * * * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
1 sunny * * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
1 rainy * * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 4
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 4
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * hot * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * mild * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * cool * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * * high * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * * normal * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * * * false yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 * * * true yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 rainy * * false yes 3
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2
Otherwise * * * * no 5
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true no
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2
Otherwise * * * * no 5
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true no
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
General Specific
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2
Otherwise * * * * no 5
1: if outlook = overcast then play = yes
2: if outlook = rainy and windy = false then play = yes
3: if outlook = sunny and humidity = normal then play = yes
Otherwise, play = no
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Bias
• Important decisions in learning systems:◆Concept description language
◆Order in which the space is searched
◆Way that overfitting to the particular training data is avoided
• These form the “bias” of the search:◆Language bias
◆Search bias
◆Overfitting-avoidance bias
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Data Mining and Ethics I
• Ethical issues arise in practical applications• Anonymizing data is difficult
◆85% of Americans can be identified from just zip code, birth date and sex
• Data mining and discrimination◆ E.g. loan applications: using some information (e.g. sex, religion,
race) is unethical
• Ethical situation depends on application◆ E.g. same information ok in medical application
• Attributes may contain problematic information◆ E.g. zip code may correlate with race
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Data Mining and Ethics II
• Important questions:◆Who is permitted access to the data?
◆For what purpose was the data collected?
◆What kind of conclusions can be legitimately drawn from it?
• Caveats must be attached to results• Purely statistical arguments are not sufficient!• Are resources put to good use?
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Questions
Rodney Nielsen, Human Intelligence & Language Technologies Lab
generalspecific: Search v2
Rule #sif outlook temperature humidity windy
then play
1 * * * * no
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Rule #sif outlook temperature humidity windy
then play
1 sunny * high * no2 rainy * * true no
Otherwise * * * * yes
sunny hot high false nosunny hot high true nosunny mild high false norainy mild high true norainy cool normal true nosunny mild normal true yessunny cool normal false yesovercast hot normal false yesovercast hot high false yesovercast mild high true yesovercast cool normal true yesrainy mild normal false yesrainy mild high false yesrainy cool normal false yes
Examples
generalspecific: Search v2
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Rule #sif outlook temperature humidity windy
then play
1 sunny * high * no2 rainy * * true no
Otherwise * * * * yes
1: if outlook = sunny and humidity = high then play = no
2: if outlook = rainy and windy = true then play = no
Otherwise, play = yes
generalspecific: Search v2
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Search v1 versus Search v2
T\O sunny overcast rainyhot
mildcool
Rule #sif outlook temperature humidity windy
then play
# Exs
1 overcast * * * yes 42 rainy * * false yes 33 sunny * normal * yes 2
Otherwise * * * * no 5
T\O sunny overcast rainyhot
mildcool
T\O sunny overcast rainyhot
mildcool
T\O sunny overcast rainyhot
mildcool
Humidity high normalWindy
true
false
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Rule #sif outlook temperature humidity windy
then play
1 sunny * high * no2 rainy * * true no
Otherwise * * * * yes
Search v1 versus Search v2
T\O sunny overcast rainyhot
mildcool
T\O sunny overcast rainyhot
mildcool
T\O sunny overcast rainyhot
mildcool
T\O sunny overcast rainyhot
mildcool
Humidity highWindy
true
false
normal
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Bias
• Important decisions in learning systems:◆Concept description language
◆Order in which the space is searched
◆Way that overfitting to the particular training data is avoided
• These form the “bias” of the search:◆Language bias
◆Search bias
◆Overfitting-avoidance bias
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Ross Quinlan
• Machine learning researcher from 1970’s• University of Sydney, Australia• 1986 “Induction of decision trees” ML
Journal• 1993 C4.5: Programs for machine learning.
• Morgan Kaufmann
• 199? Started
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Statisticians
• Sir Ronald Aylmer Fisher
• Born: 17 Feb 1890 London, EnglandDied: 29 July 1962 Adelaide, Australia
• Numerous distinguished contributions to developing the theory and application of statistics for advancing quantitative modelling of vast field of biology
• Leo Breiman• Developed decision trees• 1984 Classification and Regression
Trees. Wadsworth.