
Session 09

Supervising Lecturer: Danang Junaedi


• Given a collection of records (the training set):
  ◦ Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  ◦ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (a small code illustration follows below).
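As a minimal sketch of this build/validate workflow (assuming Python with scikit-learn; the records and their numeric encoding below are only illustrative, not part of the original slides):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Illustrative records: (Refund, MaritalStatus, TaxableIncome) encoded as numbers.
    X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
         [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
    y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

    # Divide the data set into a training set (builds the model) and a test set (validates it).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))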


[Diagram: the Learn Model step induces a model from the training set; the Apply Model step uses that model to classify the test set.]

Training set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test set (class to be predicted):

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.


• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines


• Transform the data into a decision tree and decision rules.


• A decision tree is a chronological representation of the decision problem.
• Each decision tree has two types of nodes: round nodes correspond to the states of nature, while square nodes correspond to the decision alternatives.
• The branches leaving each round node represent the different states of nature, while the branches leaving each square node represent the different decision alternatives.
• At the end of each limb of a tree are the payoffs attained from the series of branches making up that limb.


• Diagnosing particular diseases, such as hypertension, cancer, and stroke
• Choosing a product, such as a house, a vehicle, or a computer
• Selecting an exemplary employee according to given criteria
• Detecting faults in a computer or a computer network, e.g., intrusion detection and virus detection (trojans and their variants)
• etc.


Training Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Married → NO
         └─ Single, Divorced → TaxInc?
                               ├─ < 80K → NO
                               └─ > 80K → YES


For the same training data, a different tree also fits, with MarSt as the first splitting attribute:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No  → TaxInc?
                               ├─ < 80K → NO
                               └─ > 80K → YES

There could be more than one tree that fits the same data!


[Diagram: the same learn/apply illustration as before, with the same training and test tables; the model that is learned and applied is now the Decision Tree.]


Applying the model to test data:

Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree:
1. Refund = No, so follow the "No" branch to the MarSt node.
2. Marital Status = Married, so follow the "Married" branch, which ends in the leaf NO.
3. Assign Cheat to "No".
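This traversal can be written down directly; a minimal sketch in Python (the function simply hard-codes the tree above, and the field names are illustrative):

    def classify(record):
        # Hard-coded version of the decision tree above.
        if record["Refund"] == "Yes":
            return "No"
        if record["MaritalStatus"] == "Married":
            return "No"
        # Single or Divorced: test taxable income (in thousands).
        return "No" if record["TaxableIncome"] < 80 else "Yes"

    # The test record above: Refund = No, Married, 80K  ->  "No"
    print(classify({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))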


• The data is expressed as a table with attributes and records.
• An attribute is a parameter used as a criterion for building the tree. For example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature. One attribute states the solution (class) of each data item and is called the target attribute.
• Attributes take values called instances. For example, the attribute weather may have the instances sunny, cloudy, and rainy.
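Such a table can be represented directly in code; a small sketch (the attribute names and values follow the play-tennis example above and are only illustrative):

    # Each record maps attributes to instances; "Play" is the target attribute.
    records = [
        {"Weather": "Sunny",  "Wind": "Weak",   "Temperature": "Hot",  "Play": "No"},
        {"Weather": "Cloudy", "Wind": "Weak",   "Temperature": "Mild", "Play": "Yes"},
        {"Weather": "Rainy",  "Wind": "Strong", "Temperature": "Cool", "Play": "No"},
    ]
    attributes = ["Weather", "Wind", "Temperature"]   # candidate splitting criteria
    target = "Play"                                   # target attribute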


1. Transform the data (table) into a tree model.
   ◦ ID3 Algorithm
   ◦ C4.5 Algorithm
   ◦ etc.
2. Transform the tree model into rules (see the example below).
   ◦ Disjunction (v → OR)
   ◦ Conjunction (^ → AND)
3. Simplify the rules (Pruning).
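For example, the decision tree built earlier for the Cheat data reads off as one rule per root-to-leaf path (each rule a conjunction of tests, the rule set as a whole their disjunction):

    IF Refund = Yes THEN Cheat = No
    IF Refund = No ^ MarSt = Married THEN Cheat = No
    IF Refund = No ^ MarSt = Single or Divorced ^ TaxInc < 80K THEN Cheat = No
    IF Refund = No ^ MarSt = Single or Divorced ^ TaxInc > 80K THEN Cheat = Yes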


Given a set of examples, S, categorised in categories ci, then:
1. Choose the root node to be the attribute, A, which scores the highest for information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate Sv. Then:
   ◦ If Sv is empty, choose the category cdefault which contains the most examples from S, and put this as the leaf node category which ends that branch.
   ◦ If Sv contains only examples from a category c, then put c as the leaf node category which ends that branch.
   ◦ Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to Sv (note: not relative to S). This new node starts the cycle again (from 2), with S replaced by Sv in the calculations, and the tree gets built iteratively like this.

• The algorithm terminates either when all the attributes have been exhausted, or when the decision tree perfectly classifies the examples.
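A compact sketch of this procedure in Python (illustrative only; it reuses the record/attribute/target representation from the earlier sketch, and entropy and information gain follow the definitions given later in this session):

    from collections import Counter
    from math import log2

    def entropy(examples, target):
        counts = Counter(r[target] for r in examples)
        total = sum(counts.values())
        return -sum((n / total) * log2(n / total) for n in counts.values())

    def gain(examples, attribute, target):
        total = len(examples)
        remainder = 0.0
        for v in set(r[attribute] for r in examples):
            subset = [r for r in examples if r[attribute] == v]
            remainder += len(subset) / total * entropy(subset, target)
        return entropy(examples, target) - remainder

    def id3(examples, attributes, target):
        classes = set(r[target] for r in examples)
        if len(classes) == 1:              # all examples in one category: leaf node
            return classes.pop()
        default = Counter(r[target] for r in examples).most_common(1)[0][0]
        if not attributes:                 # attributes exhausted: majority-class leaf
            return default
        best = max(attributes, key=lambda a: gain(examples, a, target))
        tree = {best: {}}
        for v in set(r[best] for r in examples):
            subset = [r for r in examples if r[best] == v]
            rest = [a for a in attributes if a != best]
            tree[best][v] = id3(subset, rest, target) if subset else default
        return tree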


• Specify the problem: determine the attributes and the target attribute from the available data.
• Compute the Entropy of each criterion on the given sample data.
• Compute the Information Gain of each criterion (the formula is restated below).
• The selected node is the criterion with the highest Information Gain.
• Repeat until the final nodes, which contain the target attribute, are obtained.
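For reference, the Information Gain of attribute A on sample S is the standard quantity

    Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) × Entropy(Sv)

which is the form used in the worked example below.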


• Entropy(S) is the expected number of bits needed to extract a class (+ or -) from a number of random data items in the sample space S.
• Entropy can be seen as the number of bits required to encode a class. The smaller the Entropy, the better the attribute is for extracting a class.
• The optimal code length for a message with probability p is -log2 p bits.
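For a two-class sample this gives the familiar definition (the standard formula, stated here for reference):

    Entropy(S) = -(p+) × log2(p+) - (p-) × log2(p-)

where p+ and p- are the proportions of positive and negative examples in S.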


Training Data Set [shown as an image in the original slides: eight records with the attributes Usia, Berat, and Jenis Kelamin and the target attribute Hipertensi]


Decision Tree?

From the sample (attributes and target attribute):
• Number of instances = 8
• Number of positive instances = 3
• Number of negative instances = 5


Entropy(Hipertensi) = -(P positive) × log2(P positive) - (P negative) × log2(P negative)
                    = -(3/8) × log2(3/8) - (5/8) × log2(5/8)
                    = -(0.375 × -1.415) - (0.625 × -0.678)
                    = 0.531 + 0.424
                    = 0.955
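A quick numerical check of this value (a sketch; the helper just evaluates the formula above):

    from math import log2

    def entropy2(p_pos, p_neg):
        # Two-class entropy; a term with probability 0 contributes 0 by convention.
        return sum(-p * log2(p) for p in (p_pos, p_neg) if p > 0)

    print(round(entropy2(3/8, 5/8), 3))   # 0.954, i.e. the 0.955 reported above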

• Number of instances = 8
• Instances by Usia (age):
  ◦ Muda (young): positive instances = 1, negative instances = 3
  ◦ Tua (old): positive instances = 2, negative instances = 2
• Entropy of Usia:
  ◦ Entropy(Muda) = 0.906
  ◦ Entropy(Tua) = 1


Entropy(Muda) = -(P positive) × log2(P positive) - (P negative) × log2(P negative)
Entropy(Tua)  = -(P positive) × log2(P positive) - (P negative) × log2(P negative)


Gain(S, Usia) = Entropy(S) - Σ over v in {Muda, Tua} of (|Sv| / |S|) × Entropy(Sv)
              = Entropy(S) - (|S_Muda| / |S|) × Entropy(S_Muda) - (|S_Tua| / |S|) × Entropy(S_Tua)
              = 0.955 - (4/8) × 0.906 - (4/8) × 1
              = 0.955 - 0.453 - 0.5
              = 0.002

• Number of instances = 8
• Instances by Berat (weight):
  ◦ Overweight: positive instances = 3, negative instances = 1
  ◦ Average: positive instances = 0, negative instances = 2
  ◦ Underweight: positive instances = 0, negative instances = 2
• Entropy of Berat:
  ◦ Entropy(Overweight) = 0.906
  ◦ Entropy(Average) = 0.5
  ◦ Entropy(Underweight) = 0.5


Entropy(Overweight)  = -(P positive) × log2(P positive) - (P negative) × log2(P negative)
Entropy(Average)     = -(P positive) × log2(P positive) - (P negative) × log2(P negative)
Entropy(Underweight) = -(P positive) × log2(P positive) - (P negative) × log2(P negative)


Gain(S, Berat) = Entropy(S) - Σ over v in {Overweight, Average, Underweight} of (|Sv| / |S|) × Entropy(Sv)
               = Entropy(S) - (|S_Overweight| / |S|) × Entropy(S_Overweight)
                            - (|S_Average| / |S|) × Entropy(S_Average)
                            - (|S_Underweight| / |S|) × Entropy(S_Underweight)
               = 0.955 - (4/8) × 0.906 - (2/8) × 0.5 - (2/8) × 0.5
               = 0.955 - 0.453 - 0.125 - 0.125
               = 0.252

• Number of instances = 8
• Instances by Jenis Kelamin (gender):
  ◦ Pria (male): positive instances = 2, negative instances = 4
  ◦ Wanita (female): positive instances = 1, negative instances = 1
• Entropy of Jenis Kelamin:
  ◦ Entropy(Pria) = 1
  ◦ Entropy(Wanita) = 0.75

Entropy(Pria)   = -(P positive) × log2(P positive) - (P negative) × log2(P negative)
Entropy(Wanita) = -(P positive) × log2(P positive) - (P negative) × log2(P negative)


Gain(S, JenisKelamin) = Entropy(S) - Σ over v in {Pria, Wanita} of (|Sv| / |S|) × Entropy(Sv)
                      = Entropy(S) - (|S_Pria| / |S|) × Entropy(S_Pria) - (|S_Wanita| / |S|) × Entropy(S_Wanita)
                      = 0.955 - (6/8) × 1 - (2/8) × 0.75
                      = 0.955 - 0.75 - 0.188
                      = 0.017

• The attribute Berat is selected because its Information Gain is the highest.
• Number of instances for Overweight = 4
• Number of instances for Average = 2
• Number of instances for Underweight = 2
• Compute the highest Gain again to determine the next branch.


Partial tree so far:

Berat?
├─ Overweight
├─ Average
└─ Underweight

• Number of instances = 4
• Instances with (Berat = Overweight), by Usia:
  ◦ Muda: positive instances = 1, negative instances = 0
  ◦ Tua: positive instances = 2, negative instances = 1
• Instances with (Berat = Overweight), by Jenis Kelamin:
  ◦ Pria: positive instances = 2, negative instances = 1
  ◦ Wanita: positive instances = 1, negative instances = 0

Classification Rule?

• Underfitting and Overfitting
  ◦ Underfitting: when the model is too simple, both training and test errors are large
  ◦ Overfitting results in decision trees that are more complex than necessary
• Missing Values
• Costs of Classification


• Pre-Pruning (Early Stopping Rule)
  ◦ Stop the algorithm before it becomes a fully-grown tree
  ◦ Typical stopping conditions for a node:
    - Stop if all instances belong to the same class
    - Stop if all the attribute values are the same
  ◦ More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold
    - Stop if the class distribution of the instances is independent of the available features (e.g., using the χ2 test)
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)


• Post-pruning
  ◦ Grow decision tree to its entirety
  ◦ Trim the nodes of the decision tree in a bottom-up fashion
  ◦ If generalization error improves after trimming, replace sub-tree by a leaf node
  ◦ Class label of leaf node is determined from majority class of instances in the sub-tree
  ◦ Can use MDL for post-pruning


Example (pessimistic-error pruning): a node A? has class counts Class = Yes: 20, Class = No: 10, so Error = 10/30. Splitting A? produces four children A1, A2, A3, A4 whose (Class = Yes, Class = No) counts are (8, 4), (3, 4), (4, 1), and (5, 1).

Training Error (Before splitting) = 10/30
Pessimistic error (Before splitting) = (10 + 0.5)/30 = 10.5/30

Training Error (After splitting) = 9/30
Pessimistic error (After splitting) = (9 + 4 × 0.5)/30 = 11/30

The pessimistic error after splitting is larger than before splitting, so: PRUNE!
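A small check of this arithmetic (a sketch; the 0.5 penalty per leaf is the one used on the slide):

    def pessimistic_error(training_errors, num_leaves, num_records, penalty=0.5):
        # (training errors + penalty per leaf) / number of training records
        return (training_errors + penalty * num_leaves) / num_records

    print(pessimistic_error(10, 1, 30))   # before splitting: 0.35    (10.5/30)
    print(pessimistic_error(9, 4, 30))    # after splitting:  ~0.367  (11/30)
    # The split increases the pessimistic error, so the sub-tree is pruned.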


• Missing values affect decision tree construction in three different ways:
  ◦ Affects how impurity measures are computed
  ◦ Affects how to distribute an instance with a missing value to child nodes
  ◦ Affects how a test instance with a missing value is classified


Computing the impurity measure with a missing value:

Tid   Refund   Marital Status   Taxable Income   Class
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    ?        Single           90K              Yes    <- missing value

Class counts:

              Class = Yes   Class = No
Refund = Yes       0             3
Refund = No        2             4
Refund = ?         1             0

Before Splitting:
Entropy(Parent) = -0.3 log(0.3) - (0.7) log(0.7) = 0.8813

Split on Refund:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = -(2/6)log(2/6) - (4/6)log(4/6) = 0.9183
Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
Gain = 0.9 × (0.8813 - 0.551) = 0.3303
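These entropies can be verified directly (a sketch; log here is the base-2 logarithm, as in the calculation above):

    from math import log2

    parent = -0.3 * log2(0.3) - 0.7 * log2(0.7)            # 0.8813
    refund_no = -(2/6) * log2(2/6) - (4/6) * log2(4/6)     # 0.9183
    children = 0.3 * 0 + 0.6 * refund_no                   # 0.551
    print(round(parent, 4), round(refund_no, 4), round(children, 4))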

Distributing an instance with a missing value:

For records 1-9 (whose Refund value is known), splitting on Refund gives:

Refund = Yes:  Class = Yes 0,  Class = No 3
Refund = No:   Class = Yes 2,  Class = No 4

Record 10 (Refund = ?, Single, 90K, Yes) is then distributed across both branches:

Probability that Refund = Yes is 3/9
Probability that Refund = No is 6/9

Assign the record to the left child (Refund = Yes) with weight = 3/9 and to the right child (Refund = No) with weight = 6/9, giving:

Refund = Yes:  Class = Yes 0 + 3/9,  Class = No 3
Refund = No:   Class = Yes 2 + 6/9,  Class = No 4


Classifying an instance with a missing attribute value:

New record:

Tid   Refund   Marital Status   Taxable Income   Class
11    No       ?                85K              ?

Using the decision tree (Refund, then MarSt, then TaxInc) and the weighted counts at the MarSt node:

              Married   Single   Divorced   Total
Class = No       3         1         0        4
Class = Yes     6/9        1         1       2.67
Total           3.67       2         1       6.67

Probability that Marital Status = Married is 3.67/6.67
Probability that Marital Status = {Single, Divorced} is 3/6.67
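Evaluating these fractions, 3.67/6.67 ≈ 0.55 and 3/6.67 ≈ 0.45, so the record is sent down the Married branch with weight ≈ 0.55 and down the Single/Divorced branch with weight ≈ 0.45; the predicted class is then taken from the weighted vote of the leaves it reaches (this last step is the usual C4.5 treatment, stated here for completeness rather than taken from the slides).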

