Example Data Sets - University at Buffalo


Post on 26-Jan-2022


1

Example Data Sets

• Contact Lens (symbolic)
• Weather (symbolic data)
• Weather (numeric + symbolic)
• Iris (numeric; outcome: symbolic)
• CPU Perf. (numeric; outcome: numeric)
• Labor Negotiations (missing values)
• Soybean

2

Contact Lens Data

age             spectacle prescription  astigmatism  tear production rate  recommended lenses
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
pre-presbyopic  myope                   yes          reduced               none
pre-presbyopic  myope                   yes          normal                hard
pre-presbyopic  hypermetrope            no           reduced               none
pre-presbyopic  hypermetrope            no           normal                soft
pre-presbyopic  hypermetrope            yes          reduced               none
pre-presbyopic  hypermetrope            yes          normal                none
presbyopic      myope                   no           reduced               none
presbyopic      myope                   no           normal                none
presbyopic      myope                   yes          reduced               none
presbyopic      myope                   yes          normal                hard
presbyopic      hypermetrope            no           reduced               none
presbyopic      hypermetrope            no           normal                soft
presbyopic      hypermetrope            yes          reduced               none
presbyopic      hypermetrope            yes          normal                none

3

Structural Patterns

• Part of structural description

• Example is simplistic because all combinations of possible values are represented in table

If tear production rate = reduced then recommendation = none

Otherwise, if age = young and astigmatic = no then recommendation = soft

4

Structural Patterns

• In most learning situations, the set of examples given as input is far from complete

• Part of the job is to generalize to other, new examples

5

Weather Data

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

6

Weather Problem

• The four attributes give 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes
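The five rules above read as a decision list: they are tried in order and the first one that fires decides the class. A minimal Python sketch of that reading follows, checked against the 14 examples from the symbolic weather table. The names `ATTRIBUTE_VALUES`, `WEATHER`, and `classify` are illustrative assumptions, not part of the original slides.

```python
from math import prod

# Possible values per attribute, as they appear in the weather table.
ATTRIBUTE_VALUES = {
    "outlook": ("sunny", "overcast", "rainy"),
    "temperature": ("hot", "mild", "cool"),
    "humidity": ("high", "normal"),
    "windy": ("true", "false"),
}

# The 14 examples: outlook, temperature, humidity, windy, play.
WEATHER = [line.split() for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]


def classify(outlook, temperature, humidity, windy):
    """Apply the rules in order; the first rule that fires wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy == "true":
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "none of the above"


# Size of the full instance space: 3 * 3 * 2 * 2.
n_combinations = prod(len(values) for values in ATTRIBUTE_VALUES.values())

# Number of table rows whose class the decision list reproduces.
correct = sum(classify(*row[:4]) == row[4] for row in WEATHER)

print(f"{len(WEATHER)} of {n_combinations} combinations observed, "
      f"{correct} classified correctly")
```

Order matters here: taken on its own, the rule "humidity = normal then play = yes" would misclassify the rainy, cool, normal, windy example, which the second rule catches first.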

7

Weather Data with Some Numeric Attributes

outlook   temperature  humidity  windy  play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

8

Classification and Association Rules

• Classification Rules: rules that predict the class of an example, here whether to play or not

If outlook = sunny and humidity > 83 then play = no

9

Classification and Association Rules

• Association Rules: rules which strongly associate different attribute values

• Association rules which derive from weather table

If temperature = cool then humidity = normal

If humidity = normal and windy = false then play = yes

If outlook = sunny and play = no then humidity = high

If windy = false and play = no then outlook = sunny and humidity = high
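One way to see that these four association rules hold on the symbolic weather table is to count, for each rule, how many examples match its left-hand side and what fraction of those also satisfy its right-hand side. The sketch below does that. The function name `evaluate` and the terms "coverage" and "accuracy" are labels chosen for this illustration, not taken from the slides.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")

# The 14 examples from the symbolic weather table.
WEATHER = [dict(zip(COLUMNS, line.split())) for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]


def evaluate(antecedent, consequent, data):
    """Return (coverage, accuracy) for the rule antecedent -> consequent.

    coverage: number of examples matching the antecedent
    accuracy: fraction of those that also match the consequent
    """
    matches = [r for r in data
               if all(r[k] == v for k, v in antecedent.items())]
    hits = [r for r in matches
            if all(r[k] == v for k, v in consequent.items())]
    accuracy = len(hits) / len(matches) if matches else 0.0
    return len(matches), accuracy


# The four association rules listed above, as (antecedent, consequent) pairs.
RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"},
     {"outlook": "sunny", "humidity": "high"}),
]

results = [evaluate(a, c, WEATHER) for a, c in RULES]
for (antecedent, consequent), (coverage, accuracy) in zip(RULES, results):
    print(antecedent, "->", consequent, coverage, accuracy)
```

Unlike a classification rule, the right-hand side of an association rule may be any attribute or combination of attributes, which is why `consequent` is a dictionary rather than a single class value.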

10

Rules for Contact Lens Data

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
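Because the contact lens table lists every combination of attribute values (3 × 2 × 2 × 2 = 24), the rule set can be checked exhaustively. The sketch below rebuilds the table in its original row order and applies the nine rules. It assumes the rules are tried in the order listed, which the slides do not state explicitly, and all identifiers are illustrative.

```python
from itertools import product

AGES = ("young", "pre-presbyopic", "presbyopic")
PRESCRIPTIONS = ("myope", "hypermetrope")
ASTIGMATIC = ("no", "yes")
TEAR_RATES = ("reduced", "normal")

# Recommendations in table order (tear rate varies fastest, age slowest).
LABELS = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()

EXAMPLES = list(zip(product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES),
                    LABELS))


def recommend(age, prescription, astigmatic, tear_rate):
    """Apply the nine rules in the listed order (an assumption)."""
    normal = tear_rate == "normal"
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no" and normal:
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and normal:
        return "soft"
    if age == "presbyopic" and prescription == "myope" and astigmatic == "no":
        return "none"
    if prescription == "hypermetrope" and astigmatic == "no" and normal:
        return "soft"
    if prescription == "myope" and astigmatic == "yes" and normal:
        return "hard"
    if age == "young" and astigmatic == "yes" and normal:
        return "hard"
    if (age == "pre-presbyopic" and prescription == "hypermetrope"
            and astigmatic == "yes"):
        return "none"
    if (age == "presbyopic" and prescription == "hypermetrope"
            and astigmatic == "yes"):
        return "none"
    return None  # no rule fired


mismatches = [(x, y) for x, y in EXAMPLES if recommend(*x) != y]
print(len(EXAMPLES), "examples,", len(mismatches), "mismatches")
```

The list of mismatches comes out empty on this table, so the nine rules reproduce all 24 recommendations.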

11

Decision Tree for Contact Lens Data

[Figure: decision tree with internal nodes "tear production rate", "astigmatism", and "spectacle prescription", and leaves "none", "soft", "hard", and "none"]

12

Iris Data

     sepal length  sepal width  petal length  petal width  type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
3    4.7           3.2          1.3           0.2          Iris setosa
4    4.6           3.1          1.5           0.2          Iris setosa
5    5.0           3.6          1.4           0.2          Iris setosa
…
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
53   6.9           3.1          4.9           1.5          Iris versicolor
54   5.5           2.3          4.0           1.3          Iris versicolor
55   6.5           2.8          4.6           1.5          Iris versicolor
…
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
103  7.1           3.0          5.9           2.1          Iris virginica
104  6.3           2.9          5.6           1.8          Iris virginica
105  6.5           3.0          5.8           2.2          Iris virginica

13

Iris Rules Learned

• If petal-length < 2.45 then Iris-setosa

• If sepal-width <2.10 then Iris-versicolor

• If sepal-width < 2.45 and petal-length <4.55 then Iris-versicolor

• ...

14

CPU Performance Data

     cycle      main memory (Kb)  cache  channels       performance
     time (ns)  min      max      (Kb)   min    max
     MYCT       MMIN     MMAX     CACH   CHMIN  CHMAX  PRP
1    125        256      6000     256    16     128    198
2    29         8000     32000    32     8      32     269
3    29         8000     32000    32     8      32     220
4    29         8000     32000    32     8      32     172
5    29         8000     16000    32     8      16     132
…
207  125        2000     8000     0      2      14     52
208  480        512      8000     32     0      0      67
209  480        1000     4000     0      0      0      45

15

CPU Performance

• Numerical Prediction: outcome as linear sum of weighted attributes

• Regression equation:

PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX

• Regression can discover linear relationships, not non-linear ones

16

Labor Negotiations Data

attribute                    type                      1     2     3     …  40
duration                     (number of years)         1     2     3        2
wage increase first year     percentage                2%    4%    4.3%     4.5
wage increase second year    percentage                ?     5%    4.4%     4.0
wage increase third year     percentage                ?     ?     ?        ?
cost of living adjustment    {none, tcf, tc}           none  tcf   ?        none
working hours per week       (number of hours)         28    35    38       40
pension                      {none, ret-allw, …}       none  ?     ?        ?
standby pay                  percentage                ?     13%   ?        ?
shift-work supplement        percentage                ?     5%    4%       4
education allowance          {yes, no}                 yes   ?     ?        ?
statutory holidays           (number of days)          11    15    12       12
vacation                     {below-avg, avg, …}       avg   gen   gen      avg
long-term disability         {yes, no}                 no    ?     ?        yes
dental plan contribution     {none, half, full}        none  ?     full     full
bereavement assistance       {yes, no}                 no    ?     ?        yes
health plan contribution     {none, half, full}        none  ?     full     half
acceptability of contract    {good, bad}               bad   good  good     good

17

Classification

[Figure: loan data set plotted on Income and Debt axes, with regions labeled "Loan" and "No loan"]

A simple linear classification boundary for the loan data set; shaded region denotes class “no loan”

18

Linear Regression

[Figure: loan data set plotted on Income and Debt axes, with a fitted line labeled "Regression Line"]

A simple linear regression for the loan data set

19

Clustering

[Figure: loan data set plotted on Income and Debt axes, with groups labeled "Cluster 1", "Cluster 2", and "Cluster 3"]

A simple clustering of the loan data set into 3 clusters; note that the original labels are replaced by +’s

20

Non-Linear Classification

[Figure: loan data set plotted on Income and Debt axes, with regions labeled "Loan" and "No Loan"]

An example of classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set

21

Nearest Neighbor Classifier

[Figure: loan data set plotted on Income and Debt axes, with regions labeled "Loan" and "No Loan"]

Classification boundaries for a nearest neighbor classifier for the loan data set

Decision Trees for ...

[Figure: decision tree with internal nodes "Wage increase first year" (split at 2.5), "Statutory holidays" (split at 10), and "Wage increase first year" (split at 4); leaves labeled "Bad" and "Good"]

22

23

… Labor Negotiations Data

[Figure: larger decision tree with internal nodes "Wage increase first year" (split at 2.5), "Working hours per week" (split at 36), "Statutory holidays" (split at 10), "Health plan contribution" (branches none, half, full), and "Wage increase first year" (split at 4); leaves labeled "Bad" and "Good"]

24

Soybean Data

             Attribute               Number of Values  Sample Value
Environment  time of occurrence      7                 July
             precipitation           3                 above normal
             temperature             3                 normal
Seed         condition               2                 normal
             mold growth             2                 absent
             discoloration           2                 absent
Fruit        condition of fruit pods 4                 normal
Leaves       condition               2                 abnormal
             yellow leaf spot halo   3                 absent
             leaf spot margins       3                 no data
Stem         condition               2                 abnormal
             stem lodging            2                 yes
             stem cankers            4                 above the soil line
Roots        condition               3                 normal
Diagnosis                            19                diaporthe stem canker

25

Two Example Rules

If [leaf condition is normal and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot

If [leaf malformation is absent and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot

26

Iris Data–Clustering Problem

     Sepal Length  Sepal Width  Petal Length  Petal Width
1    5.1           3.5          1.4           0.2
2    4.9           3            1.4           0.2
3    4.7           3.2          1.3           0.2
4    4.6           3.1          1.5           0.2
5    5             3.6          1.4           0.2
…
51   7             3.2          4.7           1.4
52   6.4           3.2          4.5           1.5
53   6.9           3.1          4.9           1.5
54   5.5           2.3          4             1.3
55   6.5           2.8          4.6           1.5
…
101  6.3           3.3          6             2.5
102  5.8           2.7          5.1           1.9
103  7.1           3            5.9           2.1
104  6.3           2.9          5.6           1.8
105  6.5           3            5.8           2.2

27

Weather Data–Numeric Class

Outlook   Temperature  Humidity  Windy  Play-time
sunny     85           85        false  5
sunny     80           90        true   0
overcast  83           86        false  55
rainy     70           96        false  40
rainy     68           80        false  65
rainy     65           70        true   45
overcast  64           65        true   60
sunny     72           95        false  0
sunny     69           70        false  70
rainy     75           80        false  45
sunny     75           70        true   50
overcast  72           90        true   55
overcast  81           75        false  75
rainy     71           91        true   10

28

Family Tree

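[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F); Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M); Pam (F) = Ian (M), with children Anna (F) and Nikki (F)]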

29

Family Tree

First Person  Second Person  Sister-of?
Peter         Peggy          no
Peter         Steven         no
…             …
Steven        Peter          no
Steven        Graham         no
Steven        Pam            yes
Steven        Grace          no
…             …
Ian           Pippa          yes
…             …
Anna          Nikki          yes
…             …
Nikki         Anna           yes

First Person  Second Person  Sister-of?
Steven        Pam            yes
Graham        Pam            yes
Ian           Pippa          yes
Brian         Pippa          yes
Anna          Nikki          yes
Nikki         Anna           yes
All the rest                 no

30

Family Tree As Table

Name    Gender  Parent1  Parent2
Peter   male    ?        ?
Peggy   female  ?        ?
Steven  male    Peter    Peggy
Graham  male    Peter    Peggy
Pam     female  Peter    Peggy
Ian     male    Grace    Ray

31

Sister-of As Table

First Person                       Second Person                     Sister of?
Name    Gender  Parent1  Parent2   Name   Gender  Parent1  Parent2
Steven  male    Peter    Peggy     Pam    female  Peter    Peggy     yes
Graham  male    Peter    Peggy     Pam    female  Peter    Peggy     yes
Ian     male    Grace    Ray       Pippa  female  Grace    Ray       yes
Brian   male    Grace    Ray       Pippa  female  Grace    Ray       yes
Anna    female  Pam      Ian       Nikki  female  Pam      Ian       yes
Nikki   female  Pam      Ian       Anna   female  Pam      Ian       yes
All the rest                                                         no

32

Another Relationship As Table

First Person                       Second Person                     Ancestor of?
Name    Gender  Parent1  Parent2   Name    Gender  Parent1  Parent2
Peter   male    ?        ?         Steven  male    Peter    Peggy    yes
Peter   male    ?        ?         Pam     female  Peter    Peggy    yes
Peter   male    ?        ?         Anna    female  Pam      Ian      yes
Peter   male    ?        ?         Nikki   female  Pam      Ian      yes
Pam     female  Peter    Peggy     Nikki   female  Pam      Ian      yes
Grace   female  ?        ?         Ian     male    Grace    Ray      yes
Grace   female  ?        ?         Nikki   female  Pam      Ian      yes
Other examples here                                                  yes
All the rest                                                         no

33

ARFF File for Weather Data

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
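To make the structure of the ARFF file concrete, here is a minimal hand-written reader in Python. It is a sketch that handles only what this file uses (`%` comments, `@relation`, `@attribute` with `numeric` or a `{...}` value list, and comma-separated `@data` rows). The function name `parse_arff` is an assumption for illustration. The sketch does not handle missing values (`?`), quoted strings, or other ARFF features.

```python
ARFF_TEXT = """\
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
"""


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        keyword = line.split(None, 1)[0].lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                # Numeric attributes become floats; nominal ones stay strings.
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), "attributes,", len(rows), "instances")
```

The header declares each attribute's name and type once, so the data section can stay a plain comma-separated list with no per-row labels.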

Simple Disjunction

[Figure: decision tree for a simple disjunction, with internal nodes a, b, c, and d, branches labeled y and n, and leaves labeled x]

34

35

Exclusive-Or Problem

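[Figure: plot of the four instances on x and y axes (values 0 and 1), with classes a and b at opposite corners, and a decision tree that tests "x = 1?" at the root and "y = 1?" on each branch, with leaves a and b]

If x = 1 and y = 0 then class = a

If x = 0 and y = 1 then class = a

If x = 0 and y = 0 then class = b

If x = 1 and y = 1 then class = b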
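The four rules above can be written out directly, which makes it easy to check that neither attribute predicts the class on its own. The sketch below is illustrative; the function name `classify` and the variable names are assumptions.

```python
from itertools import product


def classify(x, y):
    """Direct transcription of the four exclusive-or rules."""
    if x == 1 and y == 0:
        return "a"
    if x == 0 and y == 1:
        return "a"
    if x == 0 and y == 0:
        return "b"
    if x == 1 and y == 1:
        return "b"
    raise ValueError("x and y must each be 0 or 1")


# Class for every point of the instance space.
table = {(x, y): classify(x, y) for x, y in product((0, 1), repeat=2)}

# Classes seen for each value of x alone: both classes occur either way,
# so a tree must test both attributes on every path.
classes_by_x = {x: {table[(x, y)] for y in (0, 1)} for x in (0, 1)}

print(table)
print(classes_by_x)
```

The class is a exactly when x and y differ, which is why the tree has to repeat the test on y under both outcomes of the test on x.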

36

Replicated Subtree

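[Figure: decision tree with internal nodes x, y, z, and w, branches labeled 1, 2, and 3, and leaves a and b; the subtree that tests z and w appears more than once]

If x = 1 and y = 1 then class = a

If z = 0 and w = 1 then class = a

Otherwise class = b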

37

New Iris Flower

Sepal Length  Sepal Width  Petal Length  Petal Width  Type
5.1           3.5          2.6           0.2          ?

38

Rules for Iris Data

Default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
          and petal-width < 1.75
       then Iris-versicolor
            except if petal-length ≥ 4.95 and petal-width < 1.55
                   then Iris-virginica
                   else if sepal-length < 4.95 and sepal-width ≥ 2.45
                        then Iris-virginica
       else if petal-length ≥ 3.35
            then Iris-virginica
                 except if petal-length < 4.85 and sepal-length < 5.95
                        then Iris-versicolor
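Rules with exceptions map naturally onto nested conditionals: each "except" overrides the conclusion of the rule it is attached to. The Python sketch below shows one reading of the rule set above. The nesting of the exceptions is an assumption, since the indentation of the original slide is not preserved in this transcript, and the function name `classify_iris` is illustrative.

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """One reading of the rules with exceptions (nesting is assumed)."""
    label = "Iris-setosa"  # default
    if 2.45 <= petal_length < 5.355 and petal_width < 1.75:
        label = "Iris-versicolor"
        # Exceptions to the versicolor rule.
        if petal_length >= 4.95 and petal_width < 1.55:
            label = "Iris-virginica"
        elif sepal_length < 4.95 and sepal_width >= 2.45:
            label = "Iris-virginica"
    elif petal_length >= 3.35:
        label = "Iris-virginica"
        # Exception to the virginica rule.
        if petal_length < 4.85 and sepal_length < 5.95:
            label = "Iris-versicolor"
    return label


# The new flower from the "New Iris Flower" slide.
new_flower = classify_iris(5.1, 3.5, 2.6, 0.2)
print(new_flower)
```

Under this reading the new flower (5.1, 3.5, 2.6, 0.2) falls under the first exception to the default and is labeled Iris-versicolor, because its petal length of 2.6 is no longer below 2.45.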

39

The Shapes Problem

Shaded: Standing

Unshaded: Lying

40

Training Data for Shapes Problem

Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying

CPU Performance Data

PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

(a) linear regression

[Figure: regression tree with internal nodes testing CHMIN, MMAX, CACH, MMIN, MYCT, and CHMAX; leaf labels: 19.3 (28/8.7%), 29.8 (37/8.18%), 37.3 (19/11.3%), 64.6 (24/19.2%), 18.3 (7/3.83%), 281 (11/56%), 59.3 (24/16.9%), 157 (21/73.7%), 75.7 (10/24.6%), 133 (16/28.8%), 783 (5/359%), 492 (7/53.9%)]

(b) regression tree

41
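As a worked example of the linear regression equation, the sketch below plugs the first instance of the CPU performance table (MYCT 125, MMIN 256, MMAX 6000, CACH 256, CHMIN 16, CHMAX 128, actual PRP 198) into it. The names `INTERCEPT`, `COEFFICIENTS`, and `predict_prp` are illustrative assumptions.

```python
# Intercept and weights copied from the linear regression equation (a).
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(instance):
    """Weighted sum of the attribute values plus the intercept."""
    return INTERCEPT + sum(weight * instance[name]
                           for name, weight in COEFFICIENTS.items())


# First row of the CPU performance table.
first = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
         "CACH": 256, "CHMIN": 16, "CHMAX": 128}
actual_prp = 198

predicted = predict_prp(first)
print(f"predicted {predicted:.1f}, actual {actual_prp}")
```

For this instance the equation gives roughly 333.7 against an actual value of 198, which illustrates that a single linear equation fitted to all 209 instances can be far off on an individual one.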

42

CPU Performance Data

[Figure: model tree with internal nodes testing CHMIN, CACH, and MMAX; leaf labels: LM1 (65/7.32%), LM2 (26/6.37%), LM3 (24/14.5%), LM4 (50/22.17%), LM5 (21/45.5%), LM6 (23/63.5%)]

LM1 PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2 PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3 PRP = 38.1 + 0.012 MMIN
LM4 PRP = 10.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5 PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6 PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

(c) model tree

43

Partitioning Instance Space

44

Ways to Represent Clusters