CIS671-Knowledge Discovery and Data Mining Vasileios Megalooikonomou Dept. of Computer and Information Sciences Temple University AI reminders (based on notes by Christos Faloutsos)

Post on 19-Jan-2016

CIS671-Knowledge Discovery and Data Mining

Vasileios Megalooikonomou

Dept. of Computer and Information Sciences

Temple University

AI reminders

(based on notes by Christos Faloutsos)

Agenda

• Statistics
• AI - decision trees
  – Problem
  – Approach
  – Conclusions
• DB

Decision trees

• Problem: classification, i.e.,
• given a training set (N tuples, with M attributes, plus a label attribute)
• find rules to predict the label for newcomers

Pictorially:

Decision trees

Age   Chol-level   Gender   …   CLASS-ID
30    150          M        …   +
…     …            …        …   -
…     …            …        …   ??

Decision trees

• Issues:
  – missing values
  – noise
  – ‘rare’ events

Decision trees

• types of attributes
  – numerical (= continuous) - e.g.: ‘salary’
  – ordinal (= integer) - e.g.: ‘# of children’
  – nominal (= categorical) - e.g.: ‘car-type’

Decision trees

• Pictorially, we have

[Figure: scatter plot of ‘+’ and ‘-’ training points; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Decision trees

• and we want to label ‘?’

[Figure: scatter plot of ‘+’ and ‘-’ training points plus one unlabeled point ‘?’; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Decision trees

• so we build a decision tree:

[Figure: scatter plot of ‘+’ and ‘-’ training points plus the unlabeled point ‘?’, with split values marked at 50 and 40; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Decision trees

• so we build a decision tree:

age<50
  Y: +
  N: chol. <40
       Y: -
       N: ...
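Read as code, the tree above is just a chain of nested tests. The following is a minimal sketch, assuming the reading in which the 'Y' branch of age<50 is the '+' leaf and the 'Y' branch of chol.<40 is the '-' leaf. The function name `classify` is hypothetical, and the branch the slide leaves as '...' is deliberately not filled in.

```python
def classify(age, chol):
    """Walk the two-level tree from the slide (thresholds 50 and 40)."""
    if age < 50:
        return "+"
    if chol < 40:
        return "-"
    # The slide shows '...' for this branch, so no label is assumed here.
    return None


print(classify(30, 150))  # age < 50, so the first test decides
print(classify(60, 30))   # age >= 50 and chol < 40
```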

Agenda

• Statistics
• AI - decision trees
  – Problem
  – Approach
  – Conclusions
• DB

Decision trees

• Typically, two steps:
  – tree building
  – tree pruning (for over-training/over-fitting)

Tree building

• How?

[Figure: scatter plot of ‘+’ and ‘-’ training points; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Tree building

• How?
• A: Partition, recursively - pseudocode:

Partition(dataset S)
    if all points in S have the same label
        then return
    evaluate splits along each attribute A
    pick best split, to divide S into S1 and S2
    Partition(S1); Partition(S2)

• Q1: how to introduce splits along attribute Ai
• Q2: how to evaluate a split?
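The pseudocode can be turned into a short runnable sketch. The version below makes two assumptions that the slide leaves open: splits are binary thresholds on numerical attributes (one possible answer to Q1), and a split is scored by the total number of bits needed to encode the labels on each side (one possible answer to Q2). The names `bits` and `partition` and the dictionary layout of the tree are hypothetical.

```python
import math
from collections import Counter


def bits(labels):
    """Bits to encode a list of labels: n * H(class frequencies)."""
    n = len(labels)
    return -sum(c * math.log2(c / n) for c in Counter(labels).values())


def partition(points, labels):
    """Recursively split until every subset carries a single label."""
    if len(set(labels)) <= 1:
        return {"label": labels[0] if labels else None}
    best = None
    for attr in range(len(points[0])):  # evaluate splits along each attribute
        for thr in sorted({p[attr] for p in points})[1:]:
            left = [i for i, p in enumerate(points) if p[attr] < thr]
            right = [i for i, p in enumerate(points) if p[attr] >= thr]
            cost = bits([labels[i] for i in left]) + bits([labels[i] for i in right])
            if best is None or cost < best[0]:  # pick the best split
                best = (cost, attr, thr, left, right)
    if best is None:  # identical points with different labels: cannot split
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, attr, thr, left, right = best
    return {
        "attr": attr,
        "thr": thr,
        "yes": partition([points[i] for i in left], [labels[i] for i in left]),
        "no": partition([points[i] for i in right], [labels[i] for i in right]),
    }


# Toy data (made up for illustration): columns are (age, chol-level).
points = [(30, 35), (40, 50), (60, 30), (70, 45)]
labels = ["+", "+", "-", "-"]
tree = partition(points, labels)
print(tree)
```

The sketch returns the tree instead of only recursing, so the result can be inspected; the stopping rule is the one from the pseudocode, plus a guard for points that cannot be separated.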

Tree building

• Q1: how to introduce splits along attribute Ai
• A1:
  – for numerical attributes:
    • binary split, or
    • multiple split
  – for categorical attributes:
    • compute all subsets (expensive!), or
    • use a greedy approach
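To make the options in A1 concrete, here is a small sketch of how candidate splits could be enumerated. It assumes midpoints between consecutive distinct values as binary thresholds for numerical attributes, and an exhaustive listing of two-way partitions for categorical ones. Both function names are hypothetical. The exhaustive listing produces 2^(k-1) - 1 candidates for k categories, which is why the slide calls it expensive; the greedy alternative is not specified in the slides and is not sketched here.

```python
from itertools import combinations


def numeric_thresholds(values):
    """Candidate binary splits: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def categorical_partitions(values):
    """All ways to divide the categories into two non-empty groups."""
    cats = sorted(set(values))
    if len(cats) < 2:
        return []
    first, rest = cats[0], cats[1:]
    splits = []
    # Fixing `first` on the left side avoids counting each partition twice.
    for k in range(len(rest)):
        for combo in combinations(rest, k):
            left = {first, *combo}
            splits.append((left, set(cats) - left))
    return splits


print(numeric_thresholds([30, 50, 30, 70]))
print(len(categorical_partitions(["sedan", "suv", "truck"])))
```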

Tree building

• Q1: how to introduce splits along attribute Ai
• Q2: how to evaluate a split?
• A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Tree building

entropy: H(p+, p-)

[Plot: entropy H(p+, p-) as a function of p+, for p+ from 0 to 1]

Any other measure?

p+: relative frequency of class + in S

Tree building

entropy: H(p+, p-)

[Plot: entropy H(p+, p-) as a function of p+, for p+ from 0 to 1]

‘gini’ index: 1 - (p+)^2 - (p-)^2

[Plot: ‘gini’ index as a function of p+, for p+ from 0 to 1]

p+: relative frequency of class + in S

Tree building

entropy: H(p+, p-)

‘gini’ index: 1 - (p+)^2 - (p-)^2

(How about multiple labels?)
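One standard way to answer the question above, offered as a reminder and not taken from the slides: with k class labels whose relative frequencies in S are p_1, ..., p_k, both measures extend by summing over all classes.

$$H(p_1,\dots,p_k) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad \mathrm{gini}(p_1,\dots,p_k) = 1 - \sum_{i=1}^{k} p_i^2$$

For k = 2 these reduce to H(p+, p-) and 1 - (p+)^2 - (p-)^2.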

Tree building

Intuition:

• entropy: #bits to encode the class label

• gini: classification error, if we randomly guess ‘+’ with prob. p+
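Both intuitions can be checked numerically. The sketch below assumes base-2 logarithms (so entropy is measured in bits) and the usual convention that 0 * log 0 = 0; the function names are hypothetical. The last function computes the expected error of the random guesser described above, which works out to 2 * p+ * p-, the same quantity as 1 - (p+)^2 - (p-)^2.

```python
import math


def entropy(freqs):
    """Average number of bits per class label; zero frequencies contribute 0."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


def gini(freqs):
    """1 minus the sum of squared relative frequencies."""
    return 1 - sum(p * p for p in freqs)


def random_guess_error(p_plus):
    """Error rate when we guess '+' with prob. p+ and the true label is '+' with prob. p+."""
    p_minus = 1 - p_plus
    # Wrong when we say '+' but the truth is '-', or the other way round.
    return p_plus * p_minus + p_minus * p_plus


for p in (0.0, 0.25, 0.5, 1.0):
    print(p, entropy([p, 1 - p]), gini([p, 1 - p]), random_guess_error(p))
```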

Tree building

Thus, we choose the split that reduces entropy/classification-error the most. E.g.:

[Figure: scatter plot of ‘+’ and ‘-’ training points; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Tree building

• Before the split, we need
  (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13)
  bits in total to encode all the class labels
• After the split, we need
  0 bits for the first half, and
  (2+6) * H(2/8, 6/8) bits for the second half
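Plugging in the numbers makes the saving visible. This is a minimal sketch, assuming base-2 logarithms so that H is measured in bits; the helper name `H` mirrors the slide notation.

```python
import math


def H(*freqs):
    """Entropy in bits of a list of relative frequencies."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


before = (7 + 6) * H(7 / 13, 6 / 13)   # about 12.94 bits
after = 0 + (2 + 6) * H(2 / 8, 6 / 8)  # about 6.49 bits

print(f"before the split: {before:.2f} bits")
print(f"after the split:  {after:.2f} bits")
print(f"saved:            {before - after:.2f} bits")
```

Under these assumptions the split roughly halves the cost of encoding the labels, from about 12.94 bits to about 6.49 bits.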

Agenda

• Statistics
• AI - decision trees
  – Problem
  – Approach
    • tree building
    • tree pruning
  – Conclusions
• DB

Tree pruning

• What for?

[Figure: scatter plot of ‘+’ and ‘-’ training points; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

...

Tree pruning

• Q: How to do it?

[Figure: scatter plot of ‘+’ and ‘-’ training points; axes: num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

...

Tree pruning

• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)
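A1 can be sketched as a bottom-up pass over the tree. The code below is an illustration under several assumptions, none of which come from the slides: trees are nested dictionaries, each internal node stores the majority label of its training points under a hypothetical key `"majority"`, and a subtree is collapsed whenever the resulting leaf makes no more errors on the ‘testing’ set than the subtree does (ties are pruned).

```python
def predict(node, x):
    """Follow the tests down to a leaf and return its label."""
    while "label" not in node:
        node = node["yes"] if x[node["attr"]] < node["thr"] else node["no"]
    return node["label"]


def errors(node, data):
    """Number of (x, y) pairs in data that the (sub)tree misclassifies."""
    return sum(predict(node, x) != y for x, y in data)


def prune(node, data):
    """Bottom-up pruning against held-out data routed to this node."""
    if "label" in node or not data:
        return node
    yes = [(x, y) for x, y in data if x[node["attr"]] < node["thr"]]
    no = [(x, y) for x, y in data if x[node["attr"]] >= node["thr"]]
    node = dict(node, yes=prune(node["yes"], yes), no=prune(node["no"], no))
    leaf = {"label": node["majority"]}
    return leaf if errors(leaf, data) <= errors(node, data) else node


# Hypothetical over-fitted tree and 'testing' set, made up for illustration.
tree = {
    "attr": "age", "thr": 50, "majority": "+",
    "yes": {"label": "+"},
    "no": {"attr": "chol", "thr": 40, "majority": "-",
           "yes": {"label": "-"}, "no": {"label": "+"}},
}
testing = [
    ({"age": 30, "chol": 150}, "+"),
    ({"age": 60, "chol": 30}, "-"),
    ({"age": 70, "chol": 45}, "-"),
    ({"age": 65, "chol": 50}, "-"),
]
pruned = prune(tree, testing)
print(pruned)
```

In this toy run the chol-level test is collapsed into a ‘-’ leaf, while the age test survives because removing it would add errors.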

Tree pruning

• Q: How to do it?
• A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)
• A2: or, rely on MDL (= Minimum Description Length) - in detail:

Tree pruning

• envision the problem as compression (of what?)

Tree pruning

• envision the problem as compression (of what?)
• and try to minimize the # bits to compress
  (a) the class labels AND
  (b) the representation of the decision tree
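The trade-off between (a) and (b) can be illustrated with the 7 ‘+’ / 6 ‘-’ example from the tree-building slides. The sketch below assumes a flat, hypothetical cost of `bits_per_node` bits for every node of the tree; real MDL-based pruning uses a more careful encoding of the tree, so this only shows the shape of the comparison.

```python
import math
from collections import Counter


def label_bits(labels):
    """(a) Bits to encode the class labels in one leaf: n * H(frequencies)."""
    n = len(labels)
    return sum(-c * math.log2(c / n) for c in Counter(labels).values())


def description_length(leaves, n_nodes, bits_per_node):
    """(a) bits for the class labels plus (b) bits for the tree itself."""
    return sum(label_bits(leaf) for leaf in leaves) + n_nodes * bits_per_node


unsplit = [list("+" * 7 + "-" * 6)]                # one leaf, one node
split = [list("+" * 5), list("+" * 2 + "-" * 6)]   # two leaves plus the test node


def should_prune(bits_per_node):
    """Prune the split if the single leaf gives the shorter description."""
    return (description_length(unsplit, 1, bits_per_node)
            <= description_length(split, 3, bits_per_node))


print(should_prune(2.0))  # cheap nodes: the split pays for itself
print(should_prune(4.0))  # expensive nodes: the split is not worth its cost
```

With 2 bits per node the split is kept (about 12.49 bits versus 14.94), and with 4 bits per node it is pruned (about 18.49 bits versus 16.94).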

(MDL)

• a brilliant idea - e.g.: best n-degree polynomial to compress these points:
• the one that minimizes (sum of errors + n)

Conclusions

• Classification through trees
• Building phase - splitting policies
• Pruning phase (to avoid over-fitting)
• Observation: classification is subtly related to compression

Reference

• M. Mehta, R. Agrawal and J. Rissanen, `SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996