Djamel A. Zighed and Nicolas Nicoloyannis

ERIC Laboratory, University of Lyon 2 (France)

zighed@univ-lyon2.fr

Prague, Sept. 04


About the Computer Science Department

• Lyon has 3 universities and 100,000 students.
• Lumière University Lyon 2 has 22,000 students.
• Lyon 2 is mainly a liberal arts university.
• The Faculty of Economics has three departments, among them the computer science department.
• We belong to this department.
• We offer Bachelor, Master and PhD programs for 300 students.


ERIC Lab at the University

Faculties of the University of Lyon 2: Economics, Sociology, Linguistics, Law

Research centers of the university: ERIC (Knowledge Engineering Research Center)

- The budget of ERIC does not depend on the university; it is granted by the national Ministry of Education.
- We have a large autonomy in decision making.


ERIC Lab

• Founded in 1995

• 11 professors (N. Nicoloyannis, director)

• 15 PhD students

• Grants + contracts + WK + … = 200 K€/year

• Research topics
– Data mining (theory, tools and applications)
– Data warehouse management (theory, tools and applications)


Data Mining (theory, tools, applications)

• Theory
– Induction graphs
– Learning and classification

• Tools
– SIPINA: platform for data mining

• Applications
– Medical fields
– Chemical applications
– Human sciences
– …

Data mining theory, tools and applications for complex data


Data mining on complex data

• An example: breast cancer diagnosis


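Motivations

Let $\Omega$ be a set of data and let $X$ and $Y$ be two attributes, with $X = \{x_1, \dots, x_r\}$ and $Y = \{y_1, \dots, y_c\}$.

Contingency table $T_{XY}$:

$$
\begin{array}{c|ccc|c}
X \backslash Y & y_1 & \cdots & y_c & \\ \hline
x_1 & n_{11} & \cdots & n_{1c} & n_{1\cdot} \\
\vdots & \vdots & & \vdots & \vdots \\
x_r & n_{r1} & \cdots & n_{rc} & n_{r\cdot} \\ \hline
 & n_{\cdot 1} & \cdots & n_{\cdot c} & n
\end{array}
$$

Association measure computed on $T_{XY}$: it measures the strength of the relationship between $X$ and $Y$.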
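As a minimal sketch of the notation above, the following Python function builds the contingency table $T_{XY}$ and its marginals from two lists of attribute values. The function name `contingency_table` and the toy data are illustrative assumptions and do not come from the slides.

```python
from collections import Counter

def contingency_table(x_values, y_values):
    """Cross-tabulate two attributes observed on the same cases.

    Returns (rows, cols, table) where table[i][j] is n_ij, the number of
    cases with X = rows[i] and Y = cols[j].
    """
    if len(x_values) != len(y_values):
        raise ValueError("X and Y must be observed on the same cases")
    rows = sorted(set(x_values))
    cols = sorted(set(y_values))
    counts = Counter(zip(x_values, y_values))
    table = [[counts[(x, y)] for y in cols] for x in rows]
    return rows, cols, table

# Hypothetical toy data: 6 cases described by X and Y.
X = ["a", "a", "b", "b", "c", "c"]
Y = ["u", "u", "u", "v", "v", "v"]
rows, cols, table = contingency_table(X, Y)

row_totals = [sum(row) for row in table]        # n_i. (row marginals)
col_totals = [sum(col) for col in zip(*table)]  # n_.j (column marginals)
n = sum(row_totals)                             # n (total number of cases)
```

Here `table` plays the role of the cells $n_{ij}$, while `row_totals`, `col_totals` and `n` correspond to the margins of the table.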


According to a specific association measure, can we improve the strength of the relationship by merging some rows and/or some columns?


Formally: is there a reduced table $T'_{XY}$ with $r' \le r$ rows and $c' \le c$ columns such that $t(T'_{XY}) > t(T_{XY})$, where $t(\cdot)$ denotes the chosen association measure?


An example

For the initial table $T_{XY}$: $\hat{t} = 0.14$


Goal: Find the groupings that maximize the association between attributes.

For the preceding example, maximizing Tschuprow's $t$ gives the reduced table $T'_{XY}$ with $\hat{t}' = 0.32$.

Yes, we can improve the association by reducing the size of the contingency table: $\hat{t}' > \hat{t}$.
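To make the effect concrete, here is a small Python sketch that computes Tschuprow's $t$ with its standard definition, $t = \sqrt{\chi^2 / (n \sqrt{(r-1)(c-1)})}$, and compares a table before and after merging two rows. The table `T` is a hypothetical toy example, not the example shown on the slide, and the names `tschuprow_t` and `merge_rows` are assumptions made for illustration.

```python
import math

def tschuprow_t(table):
    """Tschuprow's t for an r x c table of counts (list of lists)."""
    r, c = len(table), len(table[0])
    if r < 2 or c < 2:
        return 0.0  # convention assumed here for degenerate tables
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip empty rows or columns
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))

def merge_rows(table, i, j):
    """Return a new table in which rows i and j are summed into one row."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [row for k, row in enumerate(table) if k not in (i, j)] + [merged]

# Hypothetical table: rows 0 and 1 have the same distribution over Y.
T = [[10, 0],
     [10, 0],
     [0, 20]]
t_before = tschuprow_t(T)
t_after = tschuprow_t(merge_rows(T, 0, 1))
```

In this toy case the merged table is smaller and its $t$ is larger, because the two merged rows carried the same information about $Y$ while the normalizing factor $\sqrt{(r-1)(c-1)}$ decreases.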


Extension


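According to a specific association measure, can we find the optimal reduced contingency table?

That is, find $T^*_{XY}$ with $r^* \le r$ rows and $c^* \le c$ columns such that $t(T^*_{XY}) = \max_i t(T^i_{XY})$, where $t(\cdot)$ is the association measure and the $T^i_{XY}$ are the tables obtained from $T_{XY}$ by grouping rows and/or columns.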


Optimal solution (exhaustive search)

Goal: Find the best cross partition on $T$.

$P_{T_X}$: the set of all partitions brought about over $X$
$P_{T_Y}$: the set of all partitions brought about over $Y$
$\#P_{T_X}$: the size of the set $P_{T_X}$
$\#P_{T_Y}$: the size of the set $P_{T_Y}$

The number of cases we have to check is $\#P_{T_X} \times \#P_{T_Y}$.

The value of $\#P_{T_X}$ depends on whether the attribute is ordinal or nominal.
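The size of the search space can be sketched in Python using two standard combinatorial counts, which are assumed here to match the ordinal and nominal cases of the slide: $2^{k-1}$ groupings of $k$ ordered categories into contiguous classes, and the Bell number $B(k)$ for partitions of $k$ unordered categories. Both counts include the trivial partition that puts all categories in one class. The function names are illustrative.

```python
def ordinal_partitions(k):
    """Number of groupings of k ordered categories into contiguous classes."""
    return 2 ** (k - 1)

def nominal_partitions(k):
    """Bell number B(k): number of partitions of a set of k categories."""
    # Bell triangle: each new row starts with the last entry of the previous
    # row, and each next entry adds the entry above-left to the previous one.
    row = [1]
    for _ in range(k - 1):
        nxt = [row[-1]]
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[-1]

def cross_partitions(r, c, count):
    """Number of cross partitions of an r x c table for a given counting rule."""
    return count(r) * count(c)

ordinal_cases = cross_partitions(6, 6, ordinal_partitions)
nominal_cases = cross_partitions(6, 6, nominal_partitions)
```

For a 6 × 6 table this gives 32 × 32 cases with ordinal attributes and 203 × 203 cases with nominal attributes, and the nominal count grows much faster as the table gets larger.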


According to a specific association measure, can we find the optimal reduced contingency table?

Yes, but the solution is intractable in the real world because of its high time complexity.
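The exhaustive search can be sketched as follows for the ordinal case, where only adjacent categories may be grouped. This is an illustrative sketch under stated assumptions: Tschuprow's $t$ is used as the association measure, degenerate tables with a single row or column are given $t = 0$, and all function names and the toy table are hypothetical.

```python
import math
from itertools import combinations

def tschuprow_t(table):
    """Tschuprow's t for an r x c table of counts (list of lists)."""
    r, c = len(table), len(table[0])
    if r < 2 or c < 2:
        return 0.0  # convention assumed here for degenerate tables
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))

def contiguous_partitions(k):
    """Yield every partition of range(k) into contiguous blocks."""
    for m in range(k):  # m = number of cut points
        for cuts in combinations(range(1, k), m):
            bounds = (0, *cuts, k)
            yield [range(a, b) for a, b in zip(bounds, bounds[1:])]

def group(table, row_blocks, col_blocks):
    """Sum the cells of the table inside each (row block, column block)."""
    return [[sum(table[i][j] for i in rb for j in cb) for cb in col_blocks]
            for rb in row_blocks]

def exhaustive_search(table):
    """Try every cross partition and keep the one with the largest t."""
    col_parts = list(contiguous_partitions(len(table[0])))
    best = None
    for rp in contiguous_partitions(len(table)):
        for cp in col_parts:
            t = tschuprow_t(group(table, rp, cp))
            if best is None or t > best[0]:
                best = (t, rp, cp)
    return best

# Hypothetical table with ordered row categories.
T = [[10, 0],
     [10, 0],
     [0, 20]]
best_t, row_blocks, col_blocks = exhaustive_search(T)
```

The two nested loops visit $2^{r-1} \times 2^{c-1}$ cross partitions, which is why this approach only remains practical for small tables.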


Heuristic

Proceed successively to the grouping of the 2 (row or column) values that maximizes the increase in the association criterion $t$.

• Starting with the finest categories: $T^0 = T_{r \times c}$.
• The algorithm successively determines $T^k$, $k = 1, 2, \dots$
• Stop when no grouping increases the criterion, i.e. when $t(T^{k+1}) \le t(T^k)$.
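The greedy procedure described above can be sketched in Python as follows. The sketch assumes Tschuprow's $t$ as the criterion and nominal attributes, so any pair of rows or any pair of columns is a candidate; for ordinal attributes the candidates would be restricted to adjacent pairs. The names `merge` and `greedy_grouping` and the toy table are illustrative assumptions.

```python
import math
from itertools import combinations

def tschuprow_t(table):
    """Tschuprow's t for an r x c table of counts (list of lists)."""
    r, c = len(table), len(table[0])
    if r < 2 or c < 2:
        return 0.0  # convention assumed here for degenerate tables
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))

def merge(table, i, j, axis):
    """Merge rows (axis=0) or columns (axis=1) i and j, with i < j."""
    if axis == 1:
        transposed = [list(col) for col in zip(*table)]
        return [list(row) for row in zip(*merge(transposed, i, j, 0))]
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [merged if k == i else row for k, row in enumerate(table) if k != j]

def greedy_grouping(table):
    """Repeat the single best merge until no merge increases t."""
    current, t_current = table, tschuprow_t(table)
    while True:
        candidates = []
        for axis, size in ((0, len(current)), (1, len(current[0]))):
            for i, j in combinations(range(size), 2):
                cand = merge(current, i, j, axis)
                candidates.append((tschuprow_t(cand), cand))
        if not candidates:
            return current, t_current
        t_best, best = max(candidates, key=lambda pair: pair[0])
        if t_best <= t_current:  # stopping rule: no strict improvement
            return current, t_current
        current, t_current = best, t_best

# Hypothetical table: rows 0 and 1 have the same distribution over Y.
T = [[10, 0],
     [10, 0],
     [0, 20]]
reduced, t_reduced = greedy_grouping(T)
```

Each step evaluates only the pairwise merges of the current table, so the number of tables examined stays polynomial in $r$ and $c$, at the price of possibly missing the true optimum.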


Complexity


Simulation

Goal: How far is the quasi-optimal solution from the true optimum?

The comparison is tractable only for tables no larger than 6 × 6.

Simulation Design

Randomly generate 200 tables.

Analyze the distribution of the deviations between optima and quasi-optima.

Generating the Tables

10,000 cases are distributed among the $c \times r$ cells of the table with a uniform distribution (worst case).
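The table generation step might look like the following Python sketch. The 6 × 6 size, the fixed seed and the function name `random_table` are assumptions made for illustration; the slides only state that 200 tables are generated with 10,000 cases spread uniformly over the cells.

```python
import random

def random_table(r, c, n_cases=10_000, rng=random):
    """Distribute n_cases uniformly at random over the r x c cells."""
    table = [[0] * c for _ in range(r)]
    for _ in range(n_cases):
        cell = rng.randrange(r * c)        # every cell is equally likely
        table[cell // c][cell % c] += 1
    return table

rng = random.Random(0)  # arbitrary fixed seed, for reproducibility only
tables = [random_table(6, 6, rng=rng) for _ in range(200)]
```

With a uniform distribution the generated tables show almost no association, so any gain from grouping is hard to find, which is presumably why the slides call it the worst case.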


Quasi-optimal solution


Conclusion

• Implementation in a new approach to decision tree induction.
– Zighed, D.A., Ritschard, G., Erray, W. and Scuturici, V.-M. (2003), Arbogodaï, a New Approach for Decision Trees, in Lavrac, N., Gamberger, D., Todorovski, L. and Blockeel, H. (eds), Knowledge Discovery in Databases: PKDD 2003, LNAI 2838, Berlin: Springer, 495-506.
– Zighed, D.A., Ritschard, G., Erray, W. and Scuturici, V.-M. (2003), Decision tree with optimal join partitioning, to appear in Journal of Information Intelligent Systems, Kluwer (2004).

• Divisive top-down approach
• Extension to the multidimensional case