TRANSCRIPT
Session 09
Lecturer: Danang Junaedi
• Given a collection of records (training set): each record contains a set of attributes, and one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  ◦ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
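As a minimal sketch of this workflow (my own illustration, not from the slides; it assumes scikit-learn is installed, and the toy records mirror the table below):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy records encoded numerically:
# [Attrib1 (1 = Yes), Attrib2 (0/1/2 = Small/Medium/Large), Attrib3 (income in K)]
X = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]  # class labels (1 = Yes)

# Divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # learn the model
accuracy = accuracy_score(y_test, model.predict(X_test))  # apply it to the test set
print(accuracy)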
[Figure: the classification framework — a learning algorithm induces a model from the training set (Learn Model), and the model is then applied to records with unknown class (Apply Model).]

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Decision Tree based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Converts the data into a decision tree and decision rules.
• A decision tree is a chronological representation of the decision problem.
• Each decision tree has two types of nodes: round nodes correspond to the states of nature, while square nodes correspond to the decision alternatives.
• The branches leaving each round node represent the different states of nature; the branches leaving each square node represent the different decision alternatives.
• At the end of each limb of the tree are the payoffs attained from the series of branches making up that limb.
• Diagnosing particular diseases, such as hypertension, cancer, stroke, and others
• Choosing a product, such as a house, vehicle, or computer
• Selecting exemplary employees according to given criteria
• Detecting faults in computers or computer networks, e.g., intrusion detection and virus detection (trojans and variants)
• etc.
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
• Refund = Yes → NO
• Refund = No → test MarSt:
  ◦ MarSt = Married → NO
  ◦ MarSt = Single or Divorced → test TaxInc:
    - TaxInc < 80K → NO
    - TaxInc ≥ 80K → YES
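The tree above can be transcribed directly as nested conditionals; a sketch (the function name and record encoding are mine, not from the slides):

def classify(refund, marital_status, taxable_income):
    """Walk the decision tree above for one record; returns the Cheat label."""
    if refund == "Yes":
        return "No"
    # Refund = No: test marital status
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands)
    return "No" if taxable_income < 80 else "Yes"

print(classify("No", "Married", 80))  # -> "No", as in the walkthrough below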
The same training data also fits a second tree, this time with MarSt at the root:
• MarSt = Married → NO
• MarSt = Single or Divorced → test Refund:
  ◦ Refund = Yes → NO
  ◦ Refund = No → test TaxInc:
    - TaxInc < 80K → NO
    - TaxInc ≥ 80K → YES
There could be more than one tree that fits the same data!
[Figure: the same framework as before, now with a decision tree as the learned model — the training set is fed to a tree-induction algorithm (Learn Model), producing a Decision Tree that is then applied to the test set (Apply Model).]
Applying the model to test data:

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and, at each node, follow the branch that matches the record:
1. Refund = No, so take the "No" branch to the MarSt node.
2. Marital Status = Married, so take the "Married" branch, which ends in the leaf NO.
3. Assign Cheat to "No".
• The data is represented as a table with attributes and records.
• An attribute is a parameter used as a criterion for building the tree. For example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature. One attribute states the solution (class) of each data item and is called the target attribute.
• The values an attribute can take are called instances. For example, the attribute weather can have the instances sunny, cloudy, and rainy.
1. Convert the data (table) into a tree model.
   ◦ ID3 algorithm
   ◦ C4.5 algorithm
   ◦ etc.
2. Convert the tree model into rules, using disjunction (∨ = OR) and conjunction (∧ = AND).
3. Simplify the rules (pruning).
Given a set of examples, S, categorised in categories ci:
1. Choose the root node to be the attribute, A, which scores the highest for information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate Sv. Then:
   ◦ If Sv is empty, choose the category cdefault which contains the most examples from S, and put this as the leaf node category which ends that branch.
   ◦ If Sv contains only examples from a category c, then put c as the leaf node category which ends that branch.
   ◦ Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to Sv (note: not relative to S). This new node starts the cycle again (from step 2), with S replaced by Sv in the calculations, and the tree gets built iteratively like this.
The algorithm terminates either when all the attributes have been exhausted, or when the decision tree perfectly classifies the examples.
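A compact Python sketch of this procedure (my own illustration, not from the slides; it assumes each example is a dict whose class label is stored under a "target" key, and it breaks gain ties arbitrarily):

import math
from collections import Counter

def entropy(examples, target="target"):
    """Entropy of the class distribution of a set of examples."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain(examples, attr, target="target"):
    """Information gain from splitting the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for v in {ex[attr] for ex in examples}:
        sv = [ex for ex in examples if ex[attr] == v]
        remainder += len(sv) / total * entropy(sv, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target="target"):
    """Recursive ID3: returns a leaf label or an {attribute: {value: subtree}} dict."""
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:               # Sv contains only one category -> leaf
        return classes.pop()
    majority = Counter(ex[target] for ex in examples).most_common(1)[0][0]
    if not attributes:                  # attributes exhausted -> majority-class leaf
        return majority
    best = max(attributes, key=lambda a: gain(examples, a, target))
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:
        # Branch values are taken from the examples themselves, so Sv is never
        # empty here; a fuller implementation would send unseen values to `majority`.
        sv = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(sv, rest, target)
    return tree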
• Specify the problem: determine the attributes and the target attribute from the available data.
• Compute the entropy of each criterion over the given sample data.
• Compute the information gain of each criterion.
• The selected node is the criterion with the highest information gain.
• Repeat until the final nodes, which contain the target attribute, are obtained.
• Entropy(S) is the expected number of bits needed to extract a class (+ or −) from a set of random data in the sample space S.
• Entropy can be viewed as the number of bits needed to encode a class: the smaller the entropy, the better the attribute is for extracting a class.
• The optimal code length for a message with probability p is −log2 p bits.
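In formula form, for a sample S with a fraction p+ of positive and p− of negative examples (taking 0 · log2 0 = 0):

Entropy(S) = −p+ · log2(p+) − p− · log2(p−)

For example, a class with probability p = 0.375 needs −log2 0.375 ≈ 1.415 bits, which is exactly the term that appears in the worked calculation below.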
Worked example (hypertension data) — what is the decision tree?
[Table: sample attributes Usia (age), Berat (weight), and Jenis Kelamin (gender); target attribute Hipertensi.]
• Number of instances = 8
• Number of positive instances = 3
• Number of negative instances = 5
Entropy(Hipertensi) = −P_positive · log2(P_positive) − P_negative · log2(P_negative)
                    = −(3/8) · log2(3/8) − (5/8) · log2(5/8)
                    = 0.375 × 1.415 + 0.625 × 0.678
                    = 0.531 + 0.424
                    = 0.955
• Number of instances = 8
• Instances for Usia (age):
  ◦ Muda (young): 1 positive, 3 negative
  ◦ Tua (old): 2 positive, 2 negative
• Entropy for Usia:
  ◦ Entropy(Muda) = 0.811
  ◦ Entropy(Tua) = 1
Entropy(Muda) = −P_positive · log2(P_positive) − P_negative · log2(P_negative)
              = −(1/4) · log2(1/4) − (3/4) · log2(3/4) = 0.500 + 0.311 = 0.811
Entropy(Tua) = −(2/4) · log2(2/4) − (2/4) · log2(2/4) = 0.500 + 0.500 = 1
Gain(S, Usia) = Entropy(S) − Σ_{v ∈ {Muda, Tua}} (|S_v| / |S|) · Entropy(S_v)
              = 0.955 − (4/8) · Entropy(S_Muda) − (4/8) · Entropy(S_Tua)
              = 0.955 − (4/8) · 0.811 − (4/8) · 1
              = 0.955 − 0.406 − 0.500
              = 0.049
• Number of instances = 8
• Instances for Berat (weight):
  ◦ Overweight: 3 positive, 1 negative
  ◦ Average: 0 positive, 2 negative
  ◦ Underweight: 0 positive, 2 negative
• Entropy for Berat:
  ◦ Entropy(Overweight) = 0.811
  ◦ Entropy(Average) = 0
  ◦ Entropy(Underweight) = 0 (a subset whose instances all share one class has entropy 0)
Entropy(Overweight) = −(3/4) · log2(3/4) − (1/4) · log2(1/4) = 0.311 + 0.500 = 0.811
Entropy(Average) = −(0/2) · log2(0/2) − (2/2) · log2(2/2) = 0  (taking 0 · log2 0 = 0)
Entropy(Underweight) = −(0/2) · log2(0/2) − (2/2) · log2(2/2) = 0
Gain(S, Berat) = Entropy(S) − Σ_{v ∈ {Overweight, Average, Underweight}} (|S_v| / |S|) · Entropy(S_v)
               = 0.955 − (4/8) · 0.811 − (2/8) · 0 − (2/8) · 0
               = 0.955 − 0.406
               = 0.549
• Number of instances = 8
• Instances for Jenis Kelamin (gender):
  ◦ Pria (male): 2 positive, 4 negative
  ◦ Wanita (female): 1 positive, 1 negative
• Entropy for Jenis Kelamin:
  ◦ Entropy(Pria) = 0.918
  ◦ Entropy(Wanita) = 1
Entropy(Pria) = −(2/6) · log2(2/6) − (4/6) · log2(4/6) = 0.528 + 0.390 = 0.918
Entropy(Wanita) = −(1/2) · log2(1/2) − (1/2) · log2(1/2) = 0.500 + 0.500 = 1
Gain(S, JenisKelamin) = Entropy(S) − Σ_{v ∈ {Pria, Wanita}} (|S_v| / |S|) · Entropy(S_v)
                      = 0.955 − (6/8) · 0.918 − (2/8) · 1
                      = 0.955 − 0.689 − 0.250
                      = 0.016
• The attribute Berat (weight) is selected because its information gain is the highest (0.549, versus 0.049 for Usia and 0.016 for Jenis Kelamin; see the numeric check below).
• Number of instances for Overweight = 4, for Average = 2, for Underweight = 2.
• Within the branch that is not yet pure, compute the gains again and pick the highest to form the next split.
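A quick numeric check of the three gains, computed straight from the positive/negative counts per attribute value (a sketch; the helper names are mine):

import math

def H(pos, neg):
    """Entropy of a (pos, neg) class split; 0 * log2(0) is taken as 0."""
    total = pos + neg
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n)

S = H(3, 5)  # entropy of the full sample, ~0.954

def gain(splits):
    """splits: one (pos, neg) pair per attribute value."""
    n = sum(p + q for p, q in splits)
    return S - sum((p + q) / n * H(p, q) for p, q in splits)

print(round(gain([(1, 3), (2, 2)]), 3))          # Usia          -> 0.049
print(round(gain([(3, 1), (0, 2), (0, 2)]), 3))  # Berat         -> 0.549
print(round(gain([(2, 4), (1, 1)]), 3))          # Jenis Kelamin -> 0.016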
[Partial tree: root node Berat, with branches Overweight, Average, and Underweight.]
• Number of instances = 4 (the Berat = Overweight subset)
• Instances with (Berat = Overweight), by Usia:
  ◦ Muda: 1 positive, 0 negative
  ◦ Tua: 2 positive, 1 negative
• Instances with (Berat = Overweight), by Jenis Kelamin:
  ◦ Pria: 2 positive, 1 negative
  ◦ Wanita: 1 positive, 0 negative
Classification rules?
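Rules can be read off the tree as one conjunction per path (my illustration based on the partial tree above; the Overweight branch is not yet pure, so its rules must wait for the next split):
• IF Berat = Average THEN Hipertensi = No
• IF Berat = Underweight THEN Hipertensi = No
• IF Berat = Overweight AND (outcome of the next split) THEN ...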
• Underfitting and Overfitting
  ◦ Underfitting: the model is too simple, so both training and test errors are large.
  ◦ Overfitting: the model fits the training data too closely, resulting in decision trees that are more complex than necessary.
• Missing Values
• Costs of Classification
• Pre-Pruning (Early Stopping Rule)
  ◦ Stop the algorithm before it grows a full tree.
  ◦ Typical stopping conditions for a node:
    - Stop if all instances belong to the same class.
    - Stop if all the attribute values are the same.
  ◦ More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold.
    - Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test).
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
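These stopping conditions map onto pre-pruning hyperparameters in common libraries; a sketch with scikit-learn (the parameter values are arbitrary, and the χ²-independence condition has no direct equivalent here):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=10,        # stop if fewer instances than a user-specified threshold
    min_impurity_decrease=0.01,  # stop if the split barely improves impurity
    max_depth=5,                 # an additional cap on tree growth
)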
• Post-Pruning
  ◦ Grow the decision tree in its entirety.
  ◦ Trim the nodes of the decision tree in a bottom-up fashion.
  ◦ If the generalization error improves after trimming, replace the sub-tree with a leaf node.
  ◦ The class label of the leaf node is determined from the majority class of instances in the sub-tree.
  ◦ Can use MDL (Minimum Description Length) for post-pruning.
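For reference, scikit-learn's post-pruning uses cost-complexity pruning rather than MDL, but it follows the same grow-then-trim idea; a sketch (X_train and y_train as in the earlier snippet):

from sklearn.tree import DecisionTreeClassifier

# Candidate pruning strengths for the fully grown tree:
path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
# A larger ccp_alpha trims more sub-trees; pick a value by validating on held-out data.
clf = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)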
Example (deciding whether to prune a split on attribute A):

Before splitting: Class = Yes 20, Class = No 10.
  Training error (before splitting) = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30
After splitting on A into four children:
  A1: Class = Yes 8, Class = No 4  → 4 errors
  A2: Class = Yes 3, Class = No 4  → 3 errors
  A3: Class = Yes 4, Class = No 1  → 1 error
  A4: Class = Yes 5, Class = No 1  → 1 error
  Training error (after splitting) = 9/30
  Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Since 11/30 > 10.5/30, PRUNE the split!
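The same arithmetic in a few lines (a sketch; the fixed 0.5 penalty per leaf follows the example above):

def pessimistic_error(train_errors, n_leaves, n_records, penalty=0.5):
    """Training errors plus a fixed penalty per leaf node, over all records."""
    return (train_errors + penalty * n_leaves) / n_records

before = pessimistic_error(10, 1, 30)  # 10.5/30 = 0.350
after = pessimistic_error(9, 4, 30)    # 11/30  ~ 0.367
print(after > before)                  # True -> prune the split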
• Missing values affect decision tree construction in three different ways:
  ◦ how impurity measures are computed,
  ◦ how to distribute an instance with missing values to child nodes, and
  ◦ how a test instance with missing values is classified.
Computing impurity measures when a value is missing (the Refund value of record 10 is unknown):

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes

Class counts:
             Class=Yes  Class=No
Refund=Yes   0          3
Refund=No    2          4
Refund=?     1          0

Before splitting: Entropy(Parent) = −0.3 · log2(0.3) − 0.7 · log2(0.7) = 0.8813

Split on Refund:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = −(2/6) · log2(2/6) − (4/6) · log2(4/6) = 0.9183
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.2973
(The factor 0.9 is the fraction of records whose Refund value is known.)
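The same computation in code (a sketch; entropies use log base 2):

import math

def H(probs):
    """Entropy of a probability distribution; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p)

parent = H([3/10, 7/10])                                # 0.8813, over all ten records
children = (3/10) * H([0, 1]) + (6/10) * H([2/6, 4/6])  # 0.551
gain = 0.9 * (parent - children)                        # 0.9 = fraction with Refund known
print(round(gain, 4))                                   # 0.2973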
Distributing an instance with a missing value: the nine records whose Refund value is known go to the child matching that value; record 10 (Refund = ?) is sent to both children with fractional weights.

Probability that Refund = Yes is 3/9
Probability that Refund = No is 6/9
Assign record 10 to the left child (Refund = Yes) with weight = 3/9 and to the right child (Refund = No) with weight = 6/9, giving:

Refund = Yes child: Class=Yes 0 + 3/9, Class=No 3
Refund = No child: Class=Yes 2 + 6/9, Class=No 4
Classifying a test instance with a missing value — new record:

Tid  Refund  Marital Status  Taxable Income  Class
11   No      ?               85K             ?

Refund = No leads to the MarSt node, but Marital Status is unknown, so the record is propagated down the MarSt branches in proportion to the (weighted) training counts at that node:

             Married  Single  Divorced  Total
Class=No     3        1       0         4
Class=Yes    6/9      1       1         2.67
Total        3.67     2       1         6.67

Probability that Marital Status = Married is 3.67/6.67
Probability that Marital Status = {Single, Divorced} is 3/6.67
References:
1. Mourad Ykhlef. 2009. Decision Tree Induction System. King Saud University.
2. Achmad Basuki, Iwan Syarif. 2003. Decision Tree. Politeknik Elektronika Negeri Surabaya (PENS) – ITS.
3. Simon Colton. 2004. Decision Tree Learning.
4. Tom M. Mitchell. 1997. Machine Learning. McGraw-Hill.