Example Data Sets - University at Buffalo
1
Example Data Sets
• Contact Lens (symbolic)
• Weather (symbolic data)
• Weather (numeric + symbolic)
• Iris (numeric; outcome: symbolic)
• CPU Perf. (numeric; outcome: numeric)
• Labor Negotiations (missing values)
• Soybean
2
Contact Lens Data
age             spectacle prescription  astigmatism  tear production rate  recommended lenses
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
pre-presbyopic  myope                   yes          reduced               none
pre-presbyopic  myope                   yes          normal                hard
pre-presbyopic  hypermetrope            no           reduced               none
pre-presbyopic  hypermetrope            no           normal                soft
pre-presbyopic  hypermetrope            yes          reduced               none
pre-presbyopic  hypermetrope            yes          normal                none
presbyopic      myope                   no           reduced               none
presbyopic      myope                   no           normal                none
presbyopic      myope                   yes          reduced               none
presbyopic      myope                   yes          normal                hard
presbyopic      hypermetrope            no           reduced               none
presbyopic      hypermetrope            no           normal                soft
presbyopic      hypermetrope            yes          reduced               none
presbyopic      hypermetrope            yes          normal                none
3
Structural Patterns
• Part of structural description
• Example is simplistic because all combinations of possible values are represented in table
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
4
Structural Patterns
• In most learning situations, the set of examples given as input is far from complete
• Part of the job is to generalize to other, new examples
5
Weather Data
outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
6
Weather Problem
• The four attributes take 3, 3, 2, and 2 values, giving 3 × 3 × 2 × 2 = 36 possible combinations, of which 14 are present in the set of examples
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
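A minimal sketch of how these rules behave on the 14 examples, assuming they are meant to be read in order so that the first rule that fires decides the class. The names WEATHER and classify are illustrative, not from the slides.

```python
# The 14 weather examples: (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook, temperature, humidity, windy):
    """Apply the five rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


# Size of the full instance space: 3 * 3 * 2 * 2 attribute-value combinations
n_combinations = 3 * 3 * 2 * 2

# Examples the ordered rule list gets wrong
errors = [row for row in WEATHER if classify(*row[:4]) != row[4]]

# The fourth rule taken on its own, ignoring the rules before it
normal_rows = [row for row in WEATHER if row[2] == "normal"]
normal_and_yes = [row for row in normal_rows if row[4] == "yes"]

print(n_combinations, len(errors), len(normal_rows), len(normal_and_yes))
```

Read in order, the list makes no errors on the 14 examples. Taken out of context, "humidity = normal then play = yes" is wrong for one of the seven examples it covers (rainy, cool, normal, true), which is why the order matters.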
7
Weather Data with Some Numeric Attributes
outlook   temperature  humidity  windy  play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no
8
Classification and Association Rules
• Classification Rules: rules which predict the classification of the example in terms of whether to play or not
If outlook = sunny and humidity > 83 then play = no
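As a rough check, the rule can be run against the numeric weather table, reading its condition as humidity > 83. The names WEATHER_NUMERIC and rule_fires are illustrative.

```python
# (outlook, temperature, humidity, windy, play) from the numeric weather table
WEATHER_NUMERIC = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def rule_fires(outlook, humidity):
    """Condition of the classification rule: sunny and humidity above 83."""
    return outlook == "sunny" and humidity > 83


# Examples the rule covers, and those among them where play = no
covered = [row for row in WEATHER_NUMERIC if rule_fires(row[0], row[2])]
correct = [row for row in covered if row[4] == "no"]

print(len(covered), len(correct))
```

The rule covers three of the five sunny examples and all three have play = no; the two sunny examples it does not cover both have humidity 70 and play = yes.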
9
Classification and Association Rules
• Association Rules: rules which strongly associate different attribute values
• Association rules which derive from weather table
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
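One way to see what "strongly associate" means is to count, for each rule, how many of the 14 symbolic weather examples match the left-hand side and what fraction of those also satisfy the right-hand side. This is a sketch; the function name evaluate and the two measures it returns are illustrative choices, not terminology from the slides.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [dict(zip(FIELDS, values)) for values in [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]]


def evaluate(antecedent, consequent):
    """Return (rows matching the antecedent, fraction of them matching the consequent)."""
    matching = [r for r in ROWS if all(r[k] == v for k, v in antecedent.items())]
    hits = [r for r in matching if all(r[k] == v for k, v in consequent.items())]
    return len(matching), len(hits) / len(matching)


# The four association rules listed on the slide
RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"}, {"outlook": "sunny", "humidity": "high"}),
]

results = [evaluate(antecedent, consequent) for antecedent, consequent in RULES]
print(results)
```

Unlike a classification rule, the right-hand side here can be any attribute (or several), not only play.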
10
Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
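The nine rules can be checked mechanically against the 24-row contact lens table. The sketch below rebuilds the table from the attribute values (the rows enumerate every combination, with tear production rate varying fastest) and collects the recommendations of all rules that fire for each row. The names TABLE, RULES, and fired_recommendations are illustrative.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

# Recommended lenses for the 24 rows, in table order
LABELS = [
    "none", "soft", "none", "hard", "none", "soft", "none", "hard",  # young
    "none", "soft", "none", "hard", "none", "soft", "none", "none",  # pre-presbyopic
    "none", "none", "none", "hard", "none", "soft", "none", "none",  # presbyopic
]

TABLE = [
    {"age": a, "prescription": p, "astigmatic": s, "tear": t, "lenses": label}
    for (a, p, s, t), label in zip(
        product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES), LABELS
    )
]

# Each rule: (conditions that must all hold, recommendation)
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def fired_recommendations(row):
    """Set of recommendations made by every rule whose conditions hold for this row."""
    return {
        rec for conditions, rec in RULES
        if all(row[key] == value for key, value in conditions.items())
    }


# Rows where the fired rules do not agree exactly with the table
mismatches = [row for row in TABLE if fired_recommendations(row) != {row["lenses"]}]
print(len(TABLE), len(mismatches))
```

Every row is covered by at least one rule, and whenever several rules fire they agree, so on this table the rule set gives the same answer in any order.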
11
Decision Tree for Contact Lens Data
[Figure: decision tree with internal nodes tear production rate, astigmatism, and spectacle prescription, and leaves none, soft, hard, none]
12
Iris Data
     sepal length  sepal width  petal length  petal width  type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
3    4.7           3.2          1.3           0.2          Iris setosa
4    4.6           3.1          1.5           0.2          Iris setosa
5    5.0           3.6          1.4           0.2          Iris setosa
…
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
53   6.9           3.1          4.9           1.5          Iris versicolor
54   5.5           2.3          4.0           1.3          Iris versicolor
55   6.5           2.8          4.6           1.5          Iris versicolor
…
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
103  7.1           3.0          5.9           2.1          Iris virginica
104  6.3           2.9          5.6           1.8          Iris virginica
105  6.5           3.0          5.8           2.2          Iris virginica
13
Iris Rules Learned
• If petal-length < 2.45 then Iris-setosa
• If sepal-width < 2.10 then Iris-versicolor
• If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
• ...
14
CPU Performance Data
     cycle      main memory (Kb)  cache  channels      performance
     time (ns)  min     max       (Kb)   min    max
     MYCT       MMIN    MMAX      CACH   CHMIN  CHMAX  PRP
1    125        256     6000      256    16     128    198
2    29         8000    32000     32     8      32     269
3    29         8000    32000     32     8      32     220
4    29         8000    32000     32     8      32     172
5    29         8000    16000     32     8      16     132
…
207  125        2000    8000      0      2      14     52
208  480        512     8000      32     0      0      67
209  480        1000    4000      0      0      0      45
15
CPU Performance
• Numerical Prediction: outcome as linear sum of weighted attributes
• Regression equation:
• PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX
• Regression can discover linear relationships, not non-linear ones
16
Labor Negotiations Data
attribute                   type                 1     2     3     …  40
duration                    (number of years)    1     2     3        2
wage increase first year    percentage           2%    4%    4.3%     4.5
wage increase second year   percentage           ?     5%    4.4%     4.0
wage increase third year    percentage           ?     ?     ?        ?
cost of living adjustment   {none, tcf, tc}      none  tcf   ?        none
working hours per week      (number of hours)    28    35    38       40
pension                     {none, ret-allw,     none  ?     ?        ?
standby pay                 percentage           ?     13%   ?        ?
shift-work supplement       percentage           ?     5%    4%       4
education allowance         {yes, no}            yes   ?     ?        ?
statutory holidays          (number of days)     11    15    12       12
vacation                    {below-avg, avg,     avg   gen   gen      avg
long-term disability        {yes, no}            no    ?     ?        yes
dental plan contribution    {none, half, full}   none  ?     full     full
bereavement assistance      {yes, no}            no    ?     ?        yes
health plan contribution    {none, half, full}   none  ?     full     half
acceptability of contract   {good, bad}          bad   good  good     good
17
Classification
[Figure: plot of the loan data set with axes Debt and Income, divided into regions labeled Loan and No loan]
A simple linear classification boundary for the loan data set; shaded region denotes class “no loan”
19
Clustering
[Figure: plot of the loan data set with axes Debt and Income, with groups labeled Cluster 1, Cluster 2, and Cluster 3]
A simple clustering of the loan data set into 3 clusters; note that the original labels are replaced by +’s
20
Non-Linear Classification
[Figure: plot of the loan data set with axes Debt and Income, divided into regions labeled Loan and No Loan]
An example of classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set
21
Nearest Neighbor Classifier
[Figure: plot of the loan data set with axes Debt and Income, divided into regions labeled Loan and No Loan]
Classification boundaries for a nearest neighbor classifier for the loan data set
22
Decision Trees for ...
[Figure: decision tree with internal nodes Wage increase first year, Statutory holidays, and Wage increase first year; leaves Bad and Good; branch thresholds at 2.5, 10, and 4]
23
… Labor Negotiations Data
[Figure: larger decision tree with internal nodes Wage increase first year, Working hours per week, Statutory holidays, Health plan contribution, and Wage increase first year; leaves Bad and Good; branch thresholds at 2.5, 36, 10, and 4; Health plan contribution branches none, half, and full]
24
Soybean Data
             Attribute                Number of Values  Sample Value
Environment  time of occurrence       7                 July
             precipitation            3                 above normal
             temperature              3                 normal
Seed         condition                2                 normal
             mold growth              2                 absent
             discoloration            2                 absent
Fruit        condition of fruit pods  4                 normal
Leaves       condition                2                 abnormal
             yellow leaf spot halo    3                 absent
             leaf spot margins        3                 no data
Stem         condition                2                 abnormal
             stem lodging             2                 yes
             stem cankers             4                 above the soil line
Roots        condition                3                 normal
Diagnosis                             19                diaporthe stem canker
25
Two Example Rules
If [leaf condition is normal and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot

If [leaf malformation is absent and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot
26
Iris Data–Clustering Problem
     Sepal Length  Sepal Width  Petal Length  Petal Width
1    5.1           3.5          1.4           0.2
2    4.9           3            1.4           0.2
3    4.7           3.2          1.3           0.2
4    4.6           3.1          1.5           0.2
5    5             3.6          1.4           0.2
…
51   7             3.2          4.7           1.4
52   6.4           3.2          4.5           1.5
53   6.9           3.1          4.9           1.5
54   5.5           2.3          4             1.3
55   6.5           2.8          4.6           1.5
…
101  6.3           3.3          6             2.5
102  5.8           2.7          5.1           1.9
103  7.1           3            5.9           2.1
104  6.3           2.9          5.6           1.8
105  6.5           3            5.8           2.2
27
Weather Data–Numeric Class
Outlook   Temperature  Humidity  Windy  Play-time
sunny     85           85        false  5
sunny     80           90        true   0
overcast  83           86        false  55
rainy     70           96        false  40
rainy     68           80        false  65
rainy     65           70        true   45
overcast  64           65        true   60
sunny     72           95        false  0
sunny     69           70        false  70
rainy     75           80        false  45
sunny     75           70        true   50
overcast  72           90        true   55
overcast  81           75        false  75
rainy     71           91        true   10
29
Family Tree
First Person  Second Person  Sister-of?
Peter         Peggy          no
Peter         Steven         no
…             …
Steven        Peter          no
Steven        Graham         no
Steven        Pam            yes
Steven        Grace          no
…             …
Ian           Pippa          yes
…             …
Anna          Nikki          yes
…             …
Nikki         Anna           yes

First Person  Second Person  Sister-of?
Steven        Pam            yes
Graham        Pam            yes
Ian           Pippa          yes
Brian         Pippa          yes
Anna          Nikki          yes
Nikki         Anna           yes
All the rest                 no
30
Family Tree As Table
Name    Gender  Parent1  Parent2
Peter   male    ?        ?
Peggy   female  ?        ?
Steven  male    Peter    Peggy
Graham  male    Peter    Peggy
Pam     female  Peter    Peggy
Ian     male    Grace    Ray
31
Sister-of As Table
First Person                      Second Person                     Sister of?
Name    Gender  Parent1  Parent2  Name   Gender  Parent1  Parent2
Steven  male    Peter    Peggy    Pam    female  Peter    Peggy     yes
Graham  male    Peter    Peggy    Pam    female  Peter    Peggy     yes
Ian     male    Grace    Ray      Pippa  female  Grace    Ray       yes
Brian   male    Grace    Ray      Pippa  female  Grace    Ray       yes
Anna    female  Pam      Ian      Nikki  female  Pam      Ian       yes
Nikki   female  Pam      Ian      Anna   female  Pam      Ian       yes
All the rest                                                        no
32
Another Relationship As Table
First Person                      Second Person                      Ancestor of?
Name   Gender  Parent1  Parent2   Name    Gender  Parent1  Parent2
Peter  male    ?        ?         Steven  male    Peter    Peggy     yes
Peter  male    ?        ?         Pam     female  Peter    Peggy     yes
Peter  male    ?        ?         Anna    female  Pam      Ian       yes
Peter  male    ?        ?         Nikki   female  Pam      Ian       yes
Pam    female  Peter    Peggy     Nikki   female  Pam      Ian       yes
Grace  female  ?        ?         Ian     male    Grace    Ray       yes
Grace  female  ?        ?         Nikki   female  Pam      Ian       yes
Other examples here                                                  yes
All the rest                                                         no
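Both relations can be derived from the family table instead of being listed pair by pair. The sketch below is one possible formulation, not taken from the slides: it assumes that "sister of" means the second person is female and has the same two known parents as the first, and that "ancestor of" means a parent, or an ancestor of a parent. Only people whose gender and parents appear in the tables are included, and None stands for a "?" entry.

```python
# name -> (gender, parent1, parent2); None stands for "?" in the table
PEOPLE = {
    "Peter": ("male", None, None),
    "Peggy": ("female", None, None),
    "Steven": ("male", "Peter", "Peggy"),
    "Graham": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Ian": ("male", "Grace", "Ray"),
    "Pippa": ("female", "Grace", "Ray"),
    "Anna": ("female", "Pam", "Ian"),
    "Nikki": ("female", "Pam", "Ian"),
}


def sister_of(first, second):
    """Assumed rule: second is female and shares both known parents with first."""
    if first == second:
        return False
    _, *parents_first = PEOPLE[first]
    gender_second, *parents_second = PEOPLE[second]
    if None in parents_first or None in parents_second:
        return False  # unknown parents never count as a match
    return gender_second == "female" and parents_first == parents_second


def ancestor_of(first, second):
    """Assumed rule: first is a parent of second, or an ancestor of one of second's parents."""
    if second not in PEOPLE:
        return False  # nothing is recorded about this person's parents
    parents = [p for p in PEOPLE[second][1:] if p is not None]
    return any(p == first or ancestor_of(first, p) for p in parents)


sister_pairs = sorted(
    (a, b) for a in PEOPLE for b in PEOPLE if sister_of(a, b)
)
print(sister_pairs)
print(ancestor_of("Peter", "Nikki"), ancestor_of("Grace", "Nikki"), ancestor_of("Nikki", "Pam"))
```

The recursive definition of ancestor_of is what makes this relation awkward to express as a flat table of attribute values: the number of generations between the two people is not fixed in advance.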
33
ARFF File for Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
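To make the file layout concrete, here is a minimal sketch of a reader for the subset of ARFF used above: % comments, @relation, @attribute with either numeric or a {…} list of values, and comma-separated @data rows. It is not a full ARFF parser (no quoting, sparse data, dates, or strings), and parse_arff is a hypothetical helper name. Only the first three data rows are included to keep the example short.

```python
ARFF_TEXT = """\
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
%
% first three of the 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_arff(text):
    """Return (attributes, rows) for the simple ARFF subset shown above."""
    attributes = []  # list of (name, "numeric" or list of nominal values)
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            values = [value.strip() for value in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                # Numeric attributes become floats; "?" (missing) stays as text
                row[name] = float(value) if kind == "numeric" and value != "?" else value
            rows.append(row)
        elif lowered.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [value.strip() for value in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
print([name for name, _ in attributes])
print(rows[0])
```

The header declares each attribute's name and type once, so the data section can stay a plain comma-separated list.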
35
Exclusive-Or Problem
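[Figure: plot of the four instances on axes x and y (values 0 and 1), labeled a and b, alongside a decision tree that tests x = 1? at the root and y = 1? on each branch, with leaves a and b]
If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 0 and y = 0 then class = b
If x = 1 and y = 1 then class = b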
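A small sketch of why this problem needs tests on both attributes: the four rules are written out as a function, compared with a two-level tree that tests x and then y, and then every classifier that looks at only one attribute is scored on the four points. The function names are illustrative.

```python
from itertools import product


def xor_rules(x, y):
    """The four rules from the slide, one per instance."""
    if x == 1 and y == 0:
        return "a"
    if x == 0 and y == 1:
        return "a"
    if x == 0 and y == 0:
        return "b"
    if x == 1 and y == 1:
        return "b"
    raise ValueError("x and y must each be 0 or 1")


def xor_tree(x, y):
    """Two-level tree: test x at the root, then y on each branch."""
    if x == 1:
        return "b" if y == 1 else "a"
    return "a" if y == 1 else "b"


POINTS = list(product([0, 1], repeat=2))

# Best accuracy achievable by any classifier that inspects a single attribute
best_single_attribute_accuracy = 0.0
for attribute_index in (0, 1):
    for class_for_0, class_for_1 in product("ab", repeat=2):
        prediction = {0: class_for_0, 1: class_for_1}
        n_correct = sum(
            prediction[point[attribute_index]] == xor_rules(*point)
            for point in POINTS
        )
        best_single_attribute_accuracy = max(
            best_single_attribute_accuracy, n_correct / len(POINTS)
        )

print(best_single_attribute_accuracy)
```

On these four points no single-attribute test gets more than half of them right, so the tree has to repeat the y test under both branches of x.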
36
Replicated Subtree
[Figure: decision tree with a replicated subtree, testing attributes x, y, z, and w; branches labeled 1, 2, and 3; leaves labeled a and b]
If x = 1 and y = 1 then class = a
If z = 0 and w = 1 then class = a
Otherwise class = b
38
Rules for Iris Data
Default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
          and petal-width < 1.75
       then Iris-versicolor
            except if petal-length ≥ 4.95 and petal-width < 1.55
                   then Iris-virginica
            else if sepal-length < 4.95 and sepal-width ≥ 2.45
                 then Iris-virginica
       else if petal-length ≥ 3.35
            then Iris-virginica
                 except if petal-length < 4.85 and sepal-length < 5.95
                        then Iris-versicolor
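Rules with exceptions map naturally onto nested conditionals: each "except" refines the conclusion of the rule it hangs from, and each "else" is tried only when the preceding condition at the same level fails. The sketch below encodes one reading of the nesting, which is an assumption since the slide's indentation did not survive, and tries it on a few rows from the Iris table. The function name classify_iris is illustrative.

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Default rule plus nested exceptions, under one reading of the nesting."""
    label = "Iris-setosa"  # default
    if 2.45 <= petal_length < 5.355 and petal_width < 1.75:
        label = "Iris-versicolor"
        # exceptions to the versicolor rule
        if petal_length >= 4.95 and petal_width < 1.55:
            label = "Iris-virginica"
        elif sepal_length < 4.95 and sepal_width >= 2.45:
            label = "Iris-virginica"
    elif petal_length >= 3.35:
        label = "Iris-virginica"
        # exception to the virginica rule
        if petal_length < 4.85 and sepal_length < 5.95:
            label = "Iris-versicolor"
    return label


# Rows 1, 51, 53, 101, and 102 of the Iris table
samples = [
    (5.1, 3.5, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.9, 3.1, 4.9, 1.5),
    (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9),
]
predictions = [classify_iris(*sample) for sample in samples]
print(predictions)
```

Because each exception only applies inside the rule above it, a reader can stop at any level and still have a usable, if less precise, classifier.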
40
Training Data for Shapes Problem
Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying
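The slide gives only the training data. One hypothesis that fits all eight rows, offered here as an assumption and not as the rule the slides intend, is that a shape is lying when its width exceeds its height. The sketch checks that hypothesis; classify_shape is an illustrative name.

```python
# (width, height, sides, class) from the table above
SHAPES = [
    (2, 4, 4, "standing"),
    (3, 6, 4, "standing"),
    (4, 3, 4, "lying"),
    (7, 8, 3, "standing"),
    (7, 6, 3, "lying"),
    (2, 9, 4, "standing"),
    (9, 1, 4, "lying"),
    (10, 2, 3, "lying"),
]


def classify_shape(width, height):
    """Hypothesised rule: compare the two attributes with each other."""
    return "lying" if width > height else "standing"


misclassified = [
    shape for shape in SHAPES if classify_shape(shape[0], shape[1]) != shape[3]
]
print(len(misclassified))
```

The rule compares two attributes with each other, which is different from the rules seen so far, where each test compares one attribute with a constant.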
41
CPU Performance Data
(a) linear regression
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
(b) regression tree
[Figure: regression tree with tests on CHMIN, CACH, MMAX, MMIN, MYCT, and CHMAX. Leaf values: 19.3 (28/8.7%), 29.8 (37/8.18%), 37.3 (19/11.3%), 64.6 (24/19.2%), 18.3 (7/3.83%), 281 (11/56%), 59.3 (24/16.9%), 157 (21/73.7%), 75.7 (10/24.6%), 133 (16/28.8%), 783 (5/359%), 492 (7/53.9%)]
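The linear regression in (a) is a weighted sum of the six attributes plus a constant. As a sketch, the equation can be applied to two machines from the CPU performance table (rows 1 and 209) using the rounded coefficients printed on the slide. The names COEFFICIENTS, INTERCEPT, and predict_prp are illustrative.

```python
# Coefficients as printed in the linear regression equation (a)
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Weighted sum of the attribute values plus the intercept."""
    return INTERCEPT + sum(
        weight * machine[name] for name, weight in COEFFICIENTS.items()
    )


# Rows 1 and 209 of the CPU performance table (listed PRP: 198 and 45)
machine_1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
machine_209 = {"MYCT": 480, "MMIN": 1000, "MMAX": 4000, "CACH": 0, "CHMIN": 0, "CHMAX": 0}

prediction_1 = predict_prp(machine_1)
prediction_209 = predict_prp(machine_209)
print(round(prediction_1, 3), round(prediction_209, 3))
```

With these rounded coefficients the two predictions come out near 333.7 and 6.4, some way from the listed values of 198 and 45, which illustrates how a single linear equation can fit individual machines poorly.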
42
CPU Performance Data
(c) model tree
[Figure: model tree with tests on CHMIN, CACH, and MMAX. Leaves: LM1 (65/7.32%), LM2 (26/6.37%), LM3 (24/14.5%), LM4 (50/22.17%), LM5 (21/45.5%), LM6 (23/63.5%)]
LM1 PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2 PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3 PRP = 38.1 + 0.012 MMIN
LM4 PRP = 10.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5 PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6 PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX