CIS671-Knowledge Discovery and Data Mining
Vasileios Megalooikonomou
Dept. of Computer and Information Sciences
Temple University
AI reminders
(based on notes by Christos Faloutsos)
Agenda
• Statistics
• AI - decision trees
  – Problem
  – Approach
  – Conclusions
• DB
Decision Trees
• Problem: Classification - i.e.,
  – given a training set (N tuples, with M attributes, plus a label attribute)
  – find rules to predict the label for newcomers
Pictorially:
Decision trees

| Age | Chol-level | Gender | … | CLASS-ID |
|-----|------------|--------|---|----------|
| 30  | 150        | M      | … | +        |
| …   | …          | …      | … | –        |
| …   | …          | …      | … | ??       |
Decision trees
• Issues:
  – missing values
  – noise
  – ‘rare’ events
Decision trees
• Types of attributes:
  – numerical (= continuous) - e.g., ‘salary’
  – ordinal (= integer) - e.g., ‘# of children’
  – nominal (= categorical) - e.g., ‘car-type’
Decision trees
• Pictorially, we have
[Figure: scatter plot of training points labeled ‘+’ and ‘-’, with num. attr#1 (e.g., ‘age’) on one axis and num. attr#2 (e.g., ‘chol-level’) on the other]
Decision trees
• and we want to label ‘?’
[Figure: the same scatter plot, now with an unlabeled point ‘?’ to classify]
Decision trees
• so we build a decision tree:
[Figure: the scatter plot partitioned by splits at age = 50 and chol-level = 40; the ‘?’ lands in one of the resulting regions]
Decision trees
• so we build a decision tree:
    age < 50?
    ├─ Y: +
    └─ N: chol. < 40?
          ├─ Y: -
          └─ N: ...
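Read as code, one plausible rendering of this tree (a hedged sketch; the thresholds come from the slide, and the ‘...’ branch is left open, as on the slide):

```python
def classify(age, chol_level):
    # root split from the slide: age < 50?
    if age < 50:
        return '+'
    # otherwise test the second attribute: chol. < 40?
    if chol_level < 40:
        return '-'
    ...                       # the slide leaves this subtree unspecified

print(classify(30, 150))      # '+', matching the training tuple above
```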
Agenda
• Statistics
• AI - decision trees
  – Problem
  – Approach
  – Conclusions
• DB
Decision trees
• Typically, two steps:
  – tree building
  – tree pruning (for over-training/over-fitting)
Tree building
• How?
[Figure: the unpartitioned scatter plot of ‘+’ and ‘-’ points, before any split is chosen]
Tree building
• How?
• A: Partition, recursively - pseudocode:

    Partition(dataset S):
        if all points in S have the same label:
            return
        evaluate splits along each attribute A
        pick the best split, to divide S into S1 and S2
        Partition(S1); Partition(S2)
• Q1: how to introduce splits along attribute Ai
• Q2: how to evaluate a split?
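A minimal runnable sketch of this recursion in Python (my illustration, not the lecture's code; it previews the answers to Q1 and Q2 by trying every binary threshold split and scoring it with the gini index defined later):

```python
def gini(labels):
    """1 - p+^2 - p-^2 for a list of '+'/'-' labels."""
    if not labels:
        return 0.0
    p = labels.count('+') / len(labels)
    return 1.0 - p**2 - (1 - p)**2

def best_split(S, num_attrs):
    """Return (score, attr, threshold, S1, S2) for the lowest
    weighted-gini binary split, or None if nothing separates S."""
    best = None
    for a in range(num_attrs):
        for t in sorted({x[a] for x, _ in S}):
            S1 = [(x, y) for x, y in S if x[a] < t]
            S2 = [(x, y) for x, y in S if x[a] >= t]
            if not S1 or not S2:
                continue
            score = (len(S1) * gini([y for _, y in S1]) +
                     len(S2) * gini([y for _, y in S2])) / len(S)
            if best is None or score < best[0]:
                best = (score, a, t, S1, S2)
    return best

def partition(S, num_attrs, depth=0):
    labels = [y for _, y in S]
    if len(set(labels)) <= 1:          # pure node: stop recursing
        print('  ' * depth + 'leaf:', labels[0] if labels else '?')
        return
    split = best_split(S, num_attrs)
    if split is None:                  # identical points, mixed labels
        return
    _, a, t, S1, S2 = split
    print('  ' * depth + f'attr#{a} < {t}?')
    partition(S1, num_attrs, depth + 1)
    partition(S2, num_attrs, depth + 1)

# tiny example: (age, chol-level) tuples
data = [((30, 150), '+'), ((45, 120), '+'), ((60, 35), '-'), ((70, 180), '-')]
partition(data, num_attrs=2)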
Tree building
• Q1: how to introduce splits along attribute Ai
• A1 (both cases sketched below):
  – for numerical attributes:
    • binary split, or
    • multiple split
  – for categorical attributes:
    • compute all subsets (expensive!), or
    • use a greedy approach
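A hedged sketch of candidate-split generation (the function names are mine; the all-subsets enumerator makes the ‘expensive!’ remark concrete, since it is exponential in the number of categories):

```python
from itertools import combinations

def numeric_thresholds(values):
    """Candidate binary-split thresholds for a numerical attribute:
    midpoints between consecutive distinct values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def all_subset_splits(categories):
    """Every non-trivial two-way split of a categorical domain,
    each emitted once - 2^(k-1) - 1 splits for k categories."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    for r in range(len(rest)):           # left side always holds cats[0]
        for extra in combinations(rest, r):
            left = {first, *extra}
            yield left, set(cats) - left

print(numeric_thresholds([30, 45, 60, 70]))      # [37.5, 52.5, 65.0]
print(list(all_subset_splits(['sports', 'truck', 'family'])))
```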
Tree building
• Q1: how to introduce splits along attribute Ai
• Q2: how to evaluate a split?
Tree building
• Q1: how to introduce splits along attribute Ai
• Q2: how to evaluate a split?
• A: by how close to uniform each resulting subset is - i.e., we need a measure of uniformity:
Tree building
entropy: H(p+, p-) = -p+ log2(p+) - p- log2(p-)

[Figure: plot of H(p+, p-) vs. p+ over [0, 1] - zero at p+ = 0 and p+ = 1, maximal at p+ = 0.5]

p+: relative frequency of class ‘+’ in S

Any other measure?
Tree building
entropy: H(p+, p-)          ‘gini’ index: 1 - p+^2 - p-^2

[Figure: side-by-side plots of entropy and gini vs. p+ over [0, 1] - both zero at p+ = 0 and p+ = 1, maximal at p+ = 0.5]

p+: relative frequency of class ‘+’ in S
Tree building
entropy: H(p+, p-)          ‘gini’ index: 1 - p+^2 - p-^2
(How about multiple labels?)
Tree building
Intuition:
• entropy: #bits to encode the class label
• gini: classification error, if we randomly guess ‘+’ with prob. p+
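Both measures in a few lines of Python (a hedged sketch; the formulas are standard, the function names are mine):

```python
from math import log2

def entropy(p_plus):
    """H(p+, p-): expected bits per class label."""
    h = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0:
            h -= p * log2(p)
    return h

def gini(p_plus):
    """1 - p+^2 - p-^2: the error rate of guessing ‘+’ with prob. p+,
    since P(wrong) = p+(1 - p+) + (1 - p+)p+ = 1 - p+^2 - p-^2."""
    return 1.0 - p_plus**2 - (1.0 - p_plus)**2

# both vanish on pure sets and peak at p+ = 0.5:
print(entropy(0.5), gini(0.5))   # 1.0 0.5
print(entropy(1.0), gini(1.0))   # 0.0 0.0
```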
Tree building
Thus, we choose the split that reduces the entropy/classification-error the most. E.g.:
[Figure: the scatter plot with a candidate split on num. attr#1 - one side is pure ‘+’; the other side mixes 2 ‘+’ with 6 ‘-’]
Tree building
• Before the split: we need
  (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13)
  bits total, to encode all the class labels.
• After the split we need:
  0 bits for the first half and
  (2+6) * H(2/8, 6/8) bits for the second half.
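Plugging in the numbers (my arithmetic check of the slide's expressions):

```python
from math import log2

def H(p):                      # binary entropy in bits
    q = 1.0 - p
    return -(p * log2(p) if p else 0.0) - (q * log2(q) if q else 0.0)

before = 13 * H(7 / 13)        # (7+6) * H(7/13, 6/13)
after = 0 + 8 * H(2 / 8)       # 0 for the pure half, (2+6) * H(2/8, 6/8)
print(round(before, 2))        # 12.94
print(round(after, 2))         # 6.49
```

so this split saves roughly 6.5 bits.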
Agenda
• Statistics
• AI - decision trees
  – Problem
  – Approach
    • tree building
    • tree pruning
  – Conclusions
• DB
Tree pruning
• What for?
[Figure: the scatter plot carved into many tiny regions that isolate individual points - an over-fitted tree]
Tree pruning
• Q: How to do it?
[Figure: the same over-fitted partitioning]
Tree pruning
• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?) A sketch follows.
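A hedged sketch of this test-set-driven pruning (the Node layout, the stored per-node majority label, and the bottom-up order are my assumptions for illustration):

```python
class Node:
    def __init__(self, attr=None, thresh=None, left=None, right=None,
                 label=None, majority=None):
        self.attr, self.thresh = attr, thresh    # internal-node split
        self.left, self.right = left, right
        self.label = label                       # set on leaves only
        self.majority = majority                 # majority class of the
                                                 # training points seen here

def predict(node, x):
    if node.label is not None:
        return node.label
    child = node.left if x[node.attr] < node.thresh else node.right
    return predict(child, x)

def accuracy(root, test_set):
    return sum(predict(root, x) == y for x, y in test_set) / len(test_set)

def prune(node, root, test_set):
    """Bottom-up: collapse a subtree into a majority-label leaf
    whenever that does not hurt accuracy on the ‘testing’ set."""
    if node.label is not None:
        return
    prune(node.left, root, test_set)
    prune(node.right, root, test_set)
    before = accuracy(root, test_set)
    node.label = node.majority        # tentatively make this a leaf
    if accuracy(root, test_set) < before:
        node.label = None             # revert: pruning hurt accuracy
```

One drawback the slide hints at: the ‘testing’ labels are spent on pruning, so less data is left for actually fitting the tree.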
Tree pruning
• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)
• A2: or, rely on MDL (= Minimum Description Length) - in detail:
Tree pruning
• envision the problem as compression (of what?)
Tree pruning
• envision the problem as compression (of what?)
• and try to minimize the # bits to compress
  (a) the class labels AND
  (b) the representation of the decision tree
(MDL)
• a brilliant idea - e.g., the best n-degree polynomial to compress a set of points:
• the one that minimizes (sum of errors + n)
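The same idea in runnable form (a hedged sketch: the unweighted ‘errors + n’ cost is the slide's simplification, and numpy's polyfit is my choice of fitting routine):

```python
import numpy as np

def mdl_degree(x, y, max_degree=8):
    """Pick the polynomial degree minimizing (sum of errors + n)."""
    best_n, best_cost = 0, float('inf')
    for n in range(max_degree + 1):
        coeffs = np.polyfit(x, y, n)               # fit degree-n polynomial
        errors = np.abs(np.polyval(coeffs, x) - y).sum()
        cost = errors + n                          # data cost + model cost
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n

# a noisy quadratic should come out at (or near) degree 2:
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 3 * x**2 - x + rng.normal(scale=0.05, size=x.size)
print(mdl_degree(x, y))
```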
Conclusions
• Classification through trees
• Building phase - splitting policies
• Pruning phase (to avoid over-fitting)
• Observation: classification is subtly related to compression
Reference
• M. Mehta, R. Agrawal and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining”, Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.