Page 1: Contributions to MiningMart

Contributions to MiningMart

Petr Berka

Laboratory for Intelligent Systems

University of Economics, Prague

[email protected]

Page 2: Contributions to MiningMart

MiningMart presentation (c) Petr Berka, LISp, 2001

University of Economics, Prague

LISp - Laboratory for Intelligent Systems

SALOME - Laboratory for Multidisciplinary Approaches to Decision-making Support in Economics and Management

Page 3: Contributions to MiningMart

LISp research

probabilistic methods - decomposable probability models and Bayesian networks

symbolic ML methods - 4FT association rules and decision rules

logical calculi for knowledge discovery in databases

Page 4: Contributions to MiningMart

LISp activities

Organized conferences: ECML'97, PKDD'99

Organized workshops: Discovery Challenge (PKDD'99, PKDD 2000, PKDD 2001), WUPES'97, WUPES 2000

International projects: MLNet, Sol-Eu-Net, EUNITE, MUM, MGT, KDNet

Page 5: Contributions to MiningMart

SALOME research

Quantitative and AI (pattern recognition, fuzzy, neural nets) approaches to supporting decision making in economics and management

Page 6: Contributions to MiningMart

SALOME activities

Organized workshops: STIPR'97, MME'99

International projects: Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge

Page 7: Contributions to MiningMart

LISp software

LISp-Miner (data mining system):
  DataSource (for data manipulation)
  4FT-Miner (4FT association rules)
  KEX (decision rules)

experimental software for building graphical models

preprocessing procedures:
  related to KEX
  based on information theoretic approach

Page 8: Contributions to MiningMart

LISp-Miner procedures

DataSource:
  creating new (virtual) attributes using SQL
  equidistant and equifrequent discretization
  grouping attribute values
  computing attribute-value frequencies
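The two discretization schemes and the frequency count named on this slide can be illustrated with a small sketch. This is not DataSource code: the function names and the `ages` data are invented for illustration, and equifrequent binning here simply splits the sorted values into equally sized groups (so tied values may land in different intervals).

```python
from collections import Counter

def equidistant_bins(values, k):
    """Assign each value to one of k equal-width intervals (0 .. k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all values are equal
    # The maximum value would land in interval k, so clamp it to k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equifrequent_bins(values, k):
    """Assign each value to one of k intervals holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical attribute values
ages = [21, 22, 23, 24, 25, 30, 40, 50, 60]

ed = equidistant_bins(ages, 3)   # interval width (60 - 21) / 3 = 13
ef = equifrequent_bins(ages, 3)  # three values per interval
freq = Counter(ed)               # attribute-value frequencies after binning

print(ed)    # [0, 0, 0, 0, 0, 0, 1, 2, 2]
print(ef)    # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(freq)  # Counter({0: 6, 2: 2, 1: 1})
```

On skewed data such as this, equal-width intervals put most examples into the first interval, while equal-frequency intervals spread them evenly.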

Page 9: Contributions to MiningMart

LISp-Miner procedures

4FT-Miner (GUHA procedure): 4FT association rules in the form

Ant ~ Suc / Cond

KEX: weighted decision rules in the form

Ant ⇒ C (weight)

Page 10: Contributions to MiningMart

4FT-Miner basic idea

Generate a (potential) rule, e.g.

COLOUR(red) ∧ SIZE(small) ⇒(0.9, 20) TEMP(high)

AGE(21-30) ∧ SALARY(low) ⇒(0.85, 15) PAYMENTS(High) / LOAN(bad)

Verify a rule using the four-fold table

        Suc   ¬Suc
Ant      a     b
¬Ant     c     d

$\Rightarrow_{p,B}$ is TRUE iff $a/(a+b) \ge p$ and $a \ge B$

$\Leftrightarrow_{p,B}$ is TRUE iff $a/(a+b+c) \ge p$ and $a \ge B$
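The verification step can be sketched as counting the four frequencies a, b, c, d and testing them against the thresholds p and B. This is an illustrative sketch, not 4FT-Miner code: the function names, the record layout and the toy data are assumptions, and only the two quantifiers whose conditions are a/(a+b) ≥ p and a/(a+b+c) ≥ p (each with a ≥ B) are covered.

```python
def four_fold_table(rows, ant, suc):
    """Count a, b, c, d for the predicates ant and suc over a list of records."""
    a = b = c = d = 0
    for r in rows:
        if ant(r) and suc(r):
            a += 1          # Ant and Suc
        elif ant(r):
            b += 1          # Ant and not Suc
        elif suc(r):
            c += 1          # not Ant and Suc
        else:
            d += 1          # neither
    return a, b, c, d

def founded_implication(a, b, p, base):
    """TRUE iff a/(a+b) >= p and a >= base."""
    return a >= base and a + b > 0 and a / (a + b) >= p

def double_founded_implication(a, b, c, p, base):
    """TRUE iff a/(a+b+c) >= p and a >= base."""
    return a >= base and a + b + c > 0 and a / (a + b + c) >= p

# Invented toy data: 37 records with three attributes
rows = (
    [{"colour": "red", "size": "small", "temp": "high"}] * 20
    + [{"colour": "red", "size": "small", "temp": "low"}] * 2
    + [{"colour": "blue", "size": "small", "temp": "high"}] * 5
    + [{"colour": "blue", "size": "large", "temp": "low"}] * 10
)

ant = lambda r: r["colour"] == "red" and r["size"] == "small"
suc = lambda r: r["temp"] == "high"

a, b, c, d = four_fold_table(rows, ant, suc)
print(a, b, c, d)                                       # 20 2 5 10
print(founded_implication(a, b, p=0.9, base=20))        # True  (20/22 >= 0.9)
print(double_founded_implication(a, b, c, 0.9, 20))     # False (20/27 < 0.9)
```

A conditional rule (the "/ Cond" part) could be handled the same way by first filtering `rows` to the records that satisfy the condition.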

Page 11: Contributions to MiningMart

KEX basic idea

Generate a (potential) rule, e.g.

YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)

If the rule refines the current set of rules (its validity a/(a+b) differs from the weight inferred during consultation), add it to the rule base with a proper weight
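The refinement test can be sketched as follows. The slide only states that a rule is added when its validity a/(a+b) differs from the inferred weight; everything else here is an assumption made for illustration: the pseudo-Bayesian combination function `combine`, the fixed `tolerance` used in place of a statistical test, the way the new weight is chosen (so that combining it with the inferred weight reproduces the validity), and all function names. Weights are assumed to lie strictly between 0 and 1.

```python
def combine(w1, w2):
    """Assumed pseudo-Bayesian combination of two weights in (0, 1)."""
    return (w1 * w2) / (w1 * w2 + (1 - w1) * (1 - w2))

def refining_weight(validity, inferred):
    """Weight w such that combine(w, inferred) equals validity (odds ratio)."""
    odds = (validity / (1 - validity)) / (inferred / (1 - inferred))
    return odds / (1 + odds)

def maybe_add_rule(rule_base, ant, a, b, inferred, tolerance=0.05):
    """Add (ant, weight) if validity a/(a+b) differs enough from the inferred weight."""
    validity = a / (a + b)
    if abs(validity - inferred) <= tolerance:
        return False  # current rules already explain this validity
    rule_base.append((ant, refining_weight(validity, inferred)))
    return True

rule_base = []
# Hypothetical counts: 18 examples with Ant and class C, 2 with Ant and another class
added = maybe_add_rule(rule_base, "YEARS-IN-COMPANY(0-3) & AGE(0-25)",
                       a=18, b=2, inferred=0.5)
print(added, rule_base)
```

With a neutral inferred weight of 0.5 the stored weight is simply the validity itself; with any other inferred weight the stored weight compensates for what the existing rules already contribute.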

Page 12: Contributions to MiningMart

LISp-Miner architecture

[Diagram components: Data (ODBC, ACCESS); MetaData (ODBC, ACCESS); LM (Windows); Results]

Page 13: Contributions to MiningMart

Preprocessing (LISp)

KEX-oriented:
  (fuzzy) discretization + grouping of values
  computing the amount of noise in data
  random sampling + balancing of data
  handling missing values

Information theory:
  attribute selection
  attribute grouping

Page 14: Contributions to MiningMart

… fuzzy discretization

$N_{Class}(Int) / N(Int) \;\; <\,> \;\; N_{Class} / N$

Page 15: Contributions to MiningMart

… amount of noise

Amount of noise: 20%

max. possible accuracy = 80%

head  body  smile  holding  jacket  tie  class
o     r     y      s        r       y    +
o     r     y      s        r       y    -
o     r     y      f        y       n    -
o     r     y      b        y       n    -
o     r     n      s        r       y    +
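One way to arrive at the 20% figure is to count the examples that no classifier could get right because an identical attribute vector also occurs with a different class. The sketch below follows that reading, which is an assumption about how the procedure counts noise; the function name is invented, and the data is the five-row table from the slide.

```python
from collections import Counter, defaultdict

def noise_amount(rows):
    """Fraction of examples that contradict the majority class of their attribute vector."""
    groups = defaultdict(Counter)
    for *attrs, label in rows:
        groups[tuple(attrs)][label] += 1
    # In each group of identical examples, only the majority class can be predicted.
    unavoidable_errors = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return unavoidable_errors / len(rows)

# head, body, smile, holding, jacket, tie, class
rows = [
    ("o", "r", "y", "s", "r", "y", "+"),
    ("o", "r", "y", "s", "r", "y", "-"),
    ("o", "r", "y", "f", "y", "n", "-"),
    ("o", "r", "y", "b", "y", "n", "-"),
    ("o", "r", "n", "s", "r", "y", "+"),
]

noise = noise_amount(rows)
print(noise)      # 0.2
print(1 - noise)  # 0.8, the maximum achievable accuracy
```

The first two rows agree on every attribute but differ in class, so one of the five examples must be misclassified.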

Page 16: Contributions to MiningMart

… data sampling

random split into training and testing set

select random stratified sample

balance unbalanced classes
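The three sampling operations can be sketched with the standard `random` module. The function names, the 80/20 toy data and the fixed seed are invented for illustration, and balancing is done here by undersampling every class to the size of the smallest one, which is only one of several possible strategies.

```python
import random
from collections import defaultdict

def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def by_class(rows, label_of):
    """Group rows by class label."""
    groups = defaultdict(list)
    for r in rows:
        groups[label_of(r)].append(r)
    return groups

def stratified_sample(rows, fraction, label_of, rng):
    """Draw the same fraction from every class, preserving class proportions."""
    sample = []
    for members in by_class(rows, label_of).values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def balance(rows, label_of, rng):
    """Undersample every class to the size of the smallest class."""
    groups = by_class(rows, label_of)
    n = min(len(m) for m in groups.values())
    return [r for m in groups.values() for r in rng.sample(m, n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [(i, "+") for i in range(80)] + [(i, "-") for i in range(20)]
label = lambda r: r[1]

train, test = train_test_split(data, 0.3, rng)
strat = stratified_sample(data, 0.1, label, rng)
balanced = balance(data, label, rng)

print(len(train), len(test))                # 70 30
print(sum(1 for r in strat if r[1] == "+")) # 8 of the 10 sampled rows
print(len(balanced))                        # 40 (20 per class)
```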

Page 17: Contributions to MiningMart

… handling missing values

remove example

substitute missing with new value

substitute missing with majority value

proportional substitution
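The four strategies can be sketched on a single attribute column. This is an illustration under stated assumptions: missing values are represented by `None`, "remove example" is shown as dropping the entry (in a full table the whole row would be dropped), and "proportional substitution" is read as drawing a replacement with probability proportional to the observed value frequencies. All names are invented.

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a missing value

def remove_examples(column):
    """Drop entries whose value is missing."""
    return [v for v in column if v is not MISSING]

def substitute_new_value(column, new_value="unknown"):
    """Treat 'missing' as an additional attribute value."""
    return [new_value if v is MISSING else v for v in column]

def substitute_majority(column):
    """Replace missing values with the most frequent observed value."""
    majority = Counter(v for v in column if v is not MISSING).most_common(1)[0][0]
    return [majority if v is MISSING else v for v in column]

def substitute_proportional(column, rng):
    """Replace missing values by sampling from the observed values."""
    observed = [v for v in column if v is not MISSING]
    return [rng.choice(observed) if v is MISSING else v for v in column]

colour = ["red", "red", None, "blue", None]

print(remove_examples(colour))       # ['red', 'red', 'blue']
print(substitute_new_value(colour))  # ['red', 'red', 'unknown', 'blue', 'unknown']
print(substitute_majority(colour))   # ['red', 'red', 'red', 'blue', 'red']
filled = substitute_proportional(colour, random.Random(0))
```

Because `rng.choice` picks uniformly from the list of observed values (with repeats), "red" is twice as likely as "blue" to fill a gap in this column.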

Page 18: Contributions to MiningMart

… information theory

Attribute selection - based on mutual information

Attribute grouping - based on information content
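Attribute selection by mutual information can be sketched as ranking each attribute by I(A; C), the mutual information between the attribute A and the class C, estimated from relative frequencies. The function names and the toy table are invented; the actual procedure may use a different estimator or selection rule.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from the joint relative frequencies of xs and ys."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def rank_attributes(table, labels):
    """Sort attributes by mutual information with the class, highest first."""
    scores = {name: mutual_information(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: 'salary' determines the class, 'colour' is independent of it
labels = ["+", "+", "-", "-"]
table = {
    "colour": ["red", "blue", "red", "blue"],
    "salary": ["high", "high", "low", "low"],
}

ranking = rank_attributes(table, labels)
print(ranking)  # [('salary', 1.0), ('colour', 0.0)]
```

An attribute that fully determines a balanced binary class carries 1 bit of information about it, and an independent attribute carries none.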

Page 19: Contributions to MiningMart

Preprocessing architecture

[Diagram components: Data (ASCII), procedure, Results; Input data (ASCII), procedure, Output data (ASCII)]

Page 20: Contributions to MiningMart

SALOME software

Feature Selection Toolbox (Multi-Purpose Tool for Pattern Recognition):
  feature selection
  approximation-based modeling
  classification

a consulting system helping to choose the most suitable method is being developed

Page 21: Contributions to MiningMart

Search strategies for FS

Search for a subset maximizing a criterion function (distance, divergence):

with a priori information:
  exhaustive search
  branch and bound based algorithms
  floating search algorithms

without a priori information:
  approximation method
  divergence method
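The simplest of the listed strategies, exhaustive search, can be sketched in a few lines: evaluate the criterion on every subset of the requested size and keep the best one. The criterion below is an invented toy (per-feature relevance with a penalty for one redundant pair), not a distance or divergence measure from the Feature Selection Toolbox, and all names are hypothetical.

```python
from itertools import combinations

def exhaustive_search(features, k, criterion):
    """Return the k-element subset (as a tuple) that maximizes the criterion."""
    return max(combinations(features, k), key=criterion)

# Invented toy criterion: individual relevance minus a redundancy penalty
RELEVANCE = {"f1": 0.9, "f2": 0.8, "f3": 0.5, "f4": 0.1}
REDUNDANT_PAIR = {"f1", "f2"}  # assume f1 and f2 carry nearly the same information

def toy_criterion(subset):
    score = sum(RELEVANCE[f] for f in subset)
    if REDUNDANT_PAIR <= set(subset):
        score -= 0.6
    return score

best = exhaustive_search(["f1", "f2", "f3", "f4"], 2, toy_criterion)
print(best)  # ('f1', 'f3')
```

The two individually best features are not the best pair here, which is why subset search evaluates features jointly. Exhaustive search evaluates every k-subset, so its cost grows combinatorially with the number of features; branch and bound and floating search aim to reach good subsets with far fewer criterion evaluations.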

Page 22: Contributions to MiningMart

FST architecture

[Diagram components: Data (ASCII); FST (Windows); Results]

Page 23: Contributions to MiningMart

References

LISp-Miner:

Berka, P., Ivanek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In: Bergadano, de Raedt (eds.): Proc. ECML'94, Springer 1994, 339-342.

Berka, P., Rauch, J.: Data Mining using GUHA and KEX. In: Callaos, Yang, Aguilar (eds.): 4th Int. Conf. on Information Systems, Analysis and Synthesis ISAS'98, 1998, Vol. 2, 238-244.

Rauch, J.: Classes of Four Fold Table Quantifiers. In: Zytkow, Quafafou (eds.): Principles of Data Mining and Knowledge Discovery. Springer 1998, 203-211.

Page 24: Contributions to MiningMart

References

Preprocessing:

Bruha, I., Berka, P.: Discretization and Fuzzification of Numerical Attributes in Attribute-Based Learning. In: Szczepaniak, Lisboa, Kacprzyk (eds.): Fuzzy Systems in Medicine, Physica Verlag, 2000, 112-138.

Pudil, P., Novovičová, J.: Novel Methods for Subset Selection with Respect to Problem Knowledge. IEEE Transactions on Intelligent Systems - Special Issue on Feature Transformation and Subset Selection, 1998, 66-74.

Zvarova, J., Studeny, M.: Information theoretical approach to constitution and reduction of medical data. International Journal of Medical Informatics 45 (1997), n. 1-2, pp. 65-74.