Knowledge Discovery in Remote Access Databases

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at the Institute of Mathematics and Computer Science Informatics, University of Debrecen

By Zakaria Suliman Zubi
Supervised by Prof. Arato Matyas and Prof. Fazekas Gábor

Upload: zakaria-zubi

Post on 04-Dec-2014


Page 2: Knowledge Discovery in Remote Access Databases

Overview of the Thesis

Part 1
• Introduction to Knowledge Discovery in Databases (KDD) and Data Mining (DM).
• Goal of the Thesis Work.

Part 2
• Remote Access KDD models.
• Logical Foundation in Data Mining.
• Mining the Discovered Association Rules.
• Data Mining Query Languages.

Part 3
• Knowledge Discovery Query Language (KDQL).
• I-extended Databases (I-ED).
• Implementation of KDQL.
• Conclusion.
• Appendix A, B.

Page 3: Knowledge Discovery in Remote Access Databases

Introduction to KDD and DM

KDD is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

DM is a single step in the KDD process. It extracts trends or patterns from raw databases and transforms them carefully and accurately into useful, understandable information.

Each section of the introduction (chapter 1) follows the same structure: the history, importance, appearances and tools of KDD and DM.

KDD process
• Data cleaning: the phase in which noisy and irrelevant data are removed from the collection.
• Data integration: multiple data sources, often heterogeneous, may be combined in a common source.
• Data selection: the data relevant to the analysis is decided on and retrieved from the data collection.
• Data transformation: the phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user.
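The phases of the KDD process can be sketched as a small pipeline. This is only an illustration under assumptions: the record layout (dictionaries with an "item" key), the function name kdd_pipeline and the frequency-count "mining" step are hypothetical and do not come from the thesis.

```python
from collections import Counter


def kdd_pipeline(sources, min_count=2):
    """Toy walk through the KDD phases on lists of {"item": ...} records."""
    # Data cleaning: drop missing records and records without an item.
    cleaned = [[r for r in src if r and r.get("item")] for src in sources]
    # Data integration: combine the (possibly heterogeneous) sources.
    integrated = [r for src in cleaned for r in src]
    # Data selection: keep only the attribute relevant to the analysis.
    selected = [r["item"] for r in integrated]
    # Data transformation: normalise spelling so equal items compare equal.
    transformed = [s.strip().lower() for s in selected]
    # Data mining: here simply a frequency count (stand-in for a real miner).
    patterns = Counter(transformed)
    # Pattern evaluation: keep patterns that pass the interestingness measure.
    interesting = {k: v for k, v in patterns.items() if v >= min_count}
    # Knowledge representation: a plain-text bar per interesting pattern.
    report = [f"{item:<8}{'#' * count}" for item, count in sorted(interesting.items())]
    return interesting, report


SHOP_A = [{"item": "Bread"}, {"item": "milk "}, None, {"item": ""}]
SHOP_B = [{"item": "bread"}, {"item": "Milk"}, {"item": "coke"}]

interesting, report = kdd_pipeline([SHOP_A, SHOP_B])
print("\n".join(report))
```

Each phase is one line here so that the order of the steps stays visible; a real system would replace the counting step with an actual mining algorithm.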

Page 4: Knowledge Discovery in Remote Access Databases

Introduction to KDD and DM

Figure: topics that KDD and DM share with several other fields.

Page 5: Knowledge Discovery in Remote Access Databases

Introduction to KDD and DM

• Access to the databases is established via Open Database Connectivity (ODBC).
• The databases are queried with Structured Query Language (SQL). SQL lets users define the data in a database and manipulate it (add, delete and retrieve records).
• Data visualization is used to represent the Data Mining results.
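The SQL operations named above (defining data, then adding, deleting and retrieving it) can be tried out in a few lines. This sketch makes assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a remote database reached over ODBC, and the purchase table and its columns are hypothetical.

```python
import sqlite3


def run_sql_demo():
    """Define, add, delete and retrieve data with plain SQL."""
    # Assumption: an in-memory SQLite database replaces the ODBC data source.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Define the data.
    cur.execute("CREATE TABLE purchase (tid INTEGER, item TEXT)")
    # Add data.
    cur.executemany(
        "INSERT INTO purchase VALUES (?, ?)",
        [(1, "bread"), (1, "milk"), (2, "bread"), (3, "coke")],
    )
    # Delete data.
    cur.execute("DELETE FROM purchase WHERE item = ?", ("coke",))
    # Retrieve data.
    cur.execute("SELECT item, COUNT(*) FROM purchase GROUP BY item ORDER BY item")
    rows = cur.fetchall()
    conn.close()
    return rows


print(run_sql_demo())  # [('bread', 2), ('milk', 1)]
```

With a real ODBC data source only the connection line would change; the SQL statements stay the same, which is the point of querying through a standard language.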


Page 7: Knowledge Discovery in Remote Access Databases

Goal of the Thesis Work

• In this thesis work, we investigated the problem of matching a DM problem with the set of DM algorithms that are suitable for solving it.
• We studied how visualization, integrated with algorithmic approaches, can be used to tune the parameters of DM algorithms. Parameter selection has so far been explored only by algorithmic approaches; visualization supports it in a more systematic way than using default values or setting parameter values without clues.
• We introduced visualization to provide expressive information about induced models and statistical entities, and to support the interactive and dynamic exploration of induced models for DM.


Page 9: Knowledge Discovery in Remote Access Databases

Remote Access KDD models

Figure: connection between KDD and ODBC.

Page 10: Knowledge Discovery in Remote Access Databases

Figure: architecture of the ODBC_KDD (1) model.

Page 11: Knowledge Discovery in Remote Access Databases

Figure: architecture of the ODBC_KDD (2) model.


Page 13: Knowledge Discovery in Remote Access Databases

Logical Foundation in Data Mining (LFDM)

Advantages of Logical Foundation in Data Mining
• Expressiveness: first-order logic can represent more complex concepts than traditional attribute-value languages.
• Readability: formulae are easier to read than decision trees or a set of linear equations.
• Background knowledge: background knowledge can grow during discovery time, for example in time series.
• Multiple tables: multiple database tables can be handled without explicit and expensive joins.
• Deductive databases: logical discovery engines can be transparently linked to relational databases via deductive databases.

Disadvantages of Logical Foundation in Data Mining
• Language complexity: first-order hypotheses are usually constructed through heavy search (feasibility of discovery).
• Database access times: checking one single candidate might involve heavy querying.
• Number handling: logical approaches to discovery usually suffer from poor number handling capabilities.

Page 14: Knowledge Discovery in Remote Access Databases

Logical Foundation in Data Mining (LFDM)

Translating first order queries into SQL

A natural language question such as “find all employees who earn more than 1000000 HUF a year and more than their manager” can be written as a first-order rule:

    expensive_employee(Name) ← employee(Name, Salary1, Manager), Salary1 > 1000000, employee(Manager, Salary2, _), Salary1 > Salary2

The rule translates into the following SQL query:

    SELECT employee_0.NAME
    FROM employee employee_0, employee employee_1
    WHERE employee_0.SALARY > 1000000
      AND employee_1.NAME = employee_0.MANAGER
      AND employee_0.SALARY > employee_1.SALARY
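The translated query can be checked on a handful of rows. This is a sketch under assumptions: the employee table layout (NAME, SALARY, MANAGER) is read off the query, the sample rows are invented, and an in-memory SQLite database stands in for the remote database.

```python
import sqlite3

QUERY = """
SELECT employee_0.NAME
FROM employee employee_0, employee employee_1
WHERE employee_0.SALARY > 1000000
  AND employee_1.NAME = employee_0.MANAGER
  AND employee_0.SALARY > employee_1.SALARY
"""


def expensive_employees(rows):
    """Run the self-join query over (name, salary, manager) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (NAME TEXT, SALARY INTEGER, MANAGER TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
    names = sorted(name for (name,) in conn.execute(QUERY))
    conn.close()
    return names


# Hypothetical sample data, salaries in HUF per year.
SAMPLE = [
    ("Anna", 1500000, "Bela"),   # above the threshold and above her manager
    ("Bela", 1200000, "Csaba"),  # above the threshold but below his manager
    ("Csaba", 2000000, None),    # no manager, so the join finds no partner
    ("Dora", 900000, "Bela"),    # below the threshold
]

print(expensive_employees(SAMPLE))  # ['Anna']
```

The two employee atoms in the rule body become the two aliases of the same table, and the shared variable Manager becomes the join condition.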


Page 16: Knowledge Discovery in Remote Access Databases

Mining the Discovered Association Rules

Association Rules

What is an association rule? Let I = {i1, i2, …, in} be the set of all possible items and let D, the task-relevant data, be a set of transactions. A transaction T = {ia, ib, …, it} is a set of items with T ⊆ I. An association rule has the form P ⇒ Q, where P ⊆ I, Q ⊆ I and P ∩ Q = Ø. The rule P ⇒ Q holds in D with support s and has a confidence c in the transaction set D:

$$\mathrm{Support}(P \Rightarrow Q) = \mathrm{Probability}(P \cup Q)$$

$$\mathrm{Confidence}(P \Rightarrow Q) = \mathrm{Probability}(Q \mid P)$$

Example: “In 80% of the cases when people buy bread, they also buy milk”, that is, bread ==> milk with 80% confidence.
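Support and confidence can be computed directly from a list of transactions. The sketch below is illustrative: the function names and the ten sample baskets are assumptions, chosen so that the bread and milk example comes out at 80% confidence.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def support(transactions, body, head):
    """Fraction of all transactions that contain both body and head."""
    both = set(body) | set(head)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, body, head):
    """Fraction of the transactions containing body that also contain head."""
    body_count = count_containing(transactions, body)
    if body_count == 0:
        return 0.0
    both = set(body) | set(head)
    return count_containing(transactions, both) / body_count


# Hypothetical baskets: 5 contain bread, 4 of those also contain milk.
BASKETS = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread", "eggs"},
    {"milk"}, {"eggs"}, {"coke"}, {"cheese"}, {"coke", "cheese"},
]

print(support(BASKETS, {"bread"}, {"milk"}))     # 0.4
print(confidence(BASKETS, {"bread"}, {"milk"}))  # 0.8
```

Support is measured against all transactions, confidence only against those that contain the body, which is why the same rule has 40% support but 80% confidence here.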

Page 17: Knowledge Discovery in Remote Access Databases

Mining the Discovered Association Rules

Mining the Association Rules

What is mining association rules?
• Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Selecting the most "interesting" rules based on their confidence factors. A rule P ⇒ Q holds in D with support s and has a confidence c in the transaction set D.

Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Examples, in the form “Body → Head [support, confidence]”:
• buys(x, “bread”) → buys(x, “milk”) [6%, 65%]
• major(x, “CS”) ∧ takes(x, “Database”) → grade(x, “5”) [1%, 75%]

Page 18: Knowledge Discovery in Remote Access Databases

Mining the Discovered Association Rules

Mining the Association Rules (cont.)

How do we mine association rules?
• Input: a database of transactions. Each transaction is a list of items (e.g. purchased by a customer in a visit).
• Find all rules that associate the presence of one set of items with that of another set of items.
• Example: “98% of people who purchase tires and auto accessories also get automotive services done”.
• There are no restrictions on the number of items in the body of the rule.
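One common way to find such rules is a level-wise search for frequent itemsets followed by rule generation. The sketch below uses an Apriori-style search as an illustration; the transcript does not say which algorithm the thesis uses, and the function names and sample visits are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return ({itemset: hit count}, number of transactions)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        level = {}
        for c in candidates:
            hits = sum(1 for t in transactions if c <= t)
            if hits / n >= min_support:
                level[c] = hits
        frequent.update(level)
        # Join step: merge frequent k-itemsets that differ in exactly one item.
        candidates = list(
            {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        )
    return frequent, n


def mine_rules(transactions, min_support, min_confidence):
    """Return (body, head, support, confidence) for every rule above both thresholds."""
    frequent, n = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, hits in frequent.items():
        for size in range(1, len(itemset)):
            for body_items in combinations(sorted(itemset), size):
                body = frozenset(body_items)
                # Every subset of a frequent itemset is frequent, so body is known.
                conf = hits / frequent[body]
                if conf >= min_confidence:
                    rules.append((body, itemset - body, hits / n, conf))
    return rules


# Hypothetical garage visits.
VISITS = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "accessories"},
    {"oil"},
]

for body, head, sup, conf in mine_rules(VISITS, 0.5, 0.7):
    print(sorted(body), "==>", sorted(head), round(sup, 2), round(conf, 2))
```

The body may contain any number of items, as stated above; the head is allowed to be a set here as well, which is more general than the single-item heads used later in the KDQL example.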


Page 20: Knowledge Discovery in Remote Access Databases

Data Mining Query Language (DMQL)

What is Data Mining Query Language?

Data Mining Query Language (DMQL) supports the iterative nature of the KDD process. Once the discovered knowledge has been presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Page 21: Knowledge Discovery in Remote Access Databases

Data Mining Query Language (DMQL)

Types of patterns discovered by DMQL
• Characterization: data characterization is a summarization of general features of objects in a target class, and produces what are called characteristic rules.
• Discrimination: data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class.
• Association analysis: association analysis is the discovery of what are commonly called association rules.
• Classification: classification analysis is the organization of data in given classes.
• Prediction: prediction has attracted considerable attention given the potential implications of successful forecasting in a business context.
• Clustering: clustering is the organization of data in classes that are not given in advance.
• Outlier analysis: outliers are data elements that cannot be grouped in a given class or cluster.
• Evolution and deviation analysis: evolution and deviation analysis pertain to the study of time-related data that changes over time.


Page 23: Knowledge Discovery in Remote Access Databases

Knowledge Discovery Query Language (KDQL)

What is KDQL in principle?

Knowledge Discovery Query Language (KDQL) is a KDD query language suggested for the ODBC_KDD (2) model. It mines association rules in databases (e.g. a DBMS or a relational database) and then visualizes the discovered results in different chart forms (2D and 3D). KDQL has not yet been implemented as such. KDQL combines KDD technology and data visualization with the need for a query language for DM tasks. This led us to develop a language tool that can handle both approaches in one session.

Figure labels: Request Data; Data to Visualize; Visualization Tool; Database Management System (DBMS).

Page 24: Knowledge Discovery in Remote Access Databases

Knowledge Discovery Query Language (KDQL)

Visualization techniques for DMQL

Figure labels: Data Mining Query Language (DMQL); Visualization Tools; Database Management System (DBMS).


Page 26: Knowledge Discovery in Remote Access Databases

I-Extended Databases (I-ED)

Motivation

An I-extended database is a database that, in addition to data, also contains intensionally defined generalizations about the data. An I-extended database has properties similar to those of an inductive database. We formalize this concept and show how it can be used throughout the whole process of DM, thanks to the closure property of the framework.

The basic message of the I-extended database is as follows:
• An I-extended database consists of a normal database associated with a subset of patterns from a class of patterns, and an evaluation function that tells how the patterns occur in the data.
• An I-extended database can be queried (in principle) just by using normal relational algebra or SQL, with the added property of being able to refer to the values of the evaluation function on the patterns.
• Modeling KDD processes as a sequence of queries on an I-extended database gives rise to chances for reasoning about and optimizing these processes.
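The idea of a normal database plus a set of patterns plus an evaluation function, all reachable through SQL, can be illustrated with two tables. This is a toy sketch under assumptions: the table names purchase and pattern, the use of support as the evaluation function, the restriction to itemsets of at most two items, and the in-memory SQLite database are all hypothetical choices and not the thesis implementation.

```python
import sqlite3
from itertools import combinations


def build_i_extended_db(purchases):
    """Store the raw data and, next to it, patterns with their evaluation value."""
    conn = sqlite3.connect(":memory:")
    # The normal database part.
    conn.execute("CREATE TABLE purchase (tid INTEGER, item TEXT)")
    conn.executemany("INSERT INTO purchase VALUES (?, ?)", purchases)
    # The pattern part: each itemset with the value of the evaluation function.
    conn.execute("CREATE TABLE pattern (items TEXT PRIMARY KEY, support REAL)")
    baskets = {}
    for tid, item in purchases:
        baskets.setdefault(tid, set()).add(item)
    all_items = sorted({item for _, item in purchases})
    for size in (1, 2):
        for combo in combinations(all_items, size):
            hits = sum(1 for b in baskets.values() if set(combo) <= b)
            conn.execute(
                "INSERT INTO pattern VALUES (?, ?)",
                (",".join(combo), hits / len(baskets)),
            )
    return conn


def frequent_patterns(conn, min_support):
    """Ordinary SQL that refers to the evaluation function's values."""
    rows = conn.execute(
        "SELECT items FROM pattern WHERE support >= ? ORDER BY items",
        (min_support,),
    )
    return [items for (items,) in rows]


PURCHASES = [(1, "bread"), (1, "milk"), (2, "bread"), (2, "milk"), (3, "bread"), (4, "cheese")]

conn = build_i_extended_db(PURCHASES)
print(frequent_patterns(conn, 0.5))  # ['bread', 'bread,milk', 'milk']
```

Because the result of a pattern query is again a table, it can feed the next query, which is the closure property mentioned above.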


Page 28: Knowledge Discovery in Remote Access Databases

Implementation of KDQL

Motivation of KDQL

• The background of KDQL is the Structured Query Language (SQL), since several extensions to SQL have been proposed to serve as a Data Mining Query Language (DMQL).
• SQL + DM (rules) is the appropriate form for this task at the user interface.
• DM (rules) uses association rules to interact with the I-extended database. The association rules are obtained through KDQL rules, and the results are represented graphically in 2D and 3D charts.

Page 29: Knowledge Discovery in Remote Access Databases

Implementation of KDQL

Figure: architecture of KDQL.

Page 30: Knowledge Discovery in Remote Access Databases

Implementation of KDQL

Example of KDQL

For example, the rule

    {cheese, coke} ==> bread

states that if cheese and coke are bought together in a transaction, bread is also bought in the same transaction. In this association rule, the body is a set of items and the head is a single item. The rule {cheese, coke} ==> cheese is not interesting because it is a tautology: if the head is implied by the body, the rule provides no new information. This problem has the following formulation:

    KDQL RULE Associations AS
    SELECT DISTINCT 1..n item AS BODY, 1..1 item AS HEAD, SUPPORT, CONFIDENCE
    FROM Purchase
    GROUP BY transaction
    EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.2
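What the KDQL statement asks for can be emulated in a few lines: group the Purchase rows by transaction, take bodies of one or more items and heads of exactly one item, and keep the rules that reach the two thresholds. The sketch is an assumption about the statement's meaning read from the example; the function name, the row layout (transaction, item) and the sample rows are hypothetical, and the brute-force enumeration only suits small baskets.

```python
from collections import Counter
from itertools import combinations


def extract_rules(purchase_rows, min_support=0.1, min_confidence=0.2):
    """Return sorted (body, head, support, confidence) tuples."""
    # GROUP BY transaction
    groups = {}
    for transaction, item in purchase_rows:
        groups.setdefault(transaction, set()).add(item)
    n = len(groups)
    # Count every itemset that occurs inside some transaction.
    counts = Counter()
    for items in groups.values():
        for size in range(1, len(items) + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    rules = []
    for itemset, hits in counts.items():
        if len(itemset) < 2 or hits / n < min_support:
            continue
        for head in itemset:              # 1..1 item AS HEAD
            body = itemset - {head}       # 1..n item AS BODY, never contains the head
            conf = hits / counts[body]
            if conf >= min_confidence:
                rules.append((tuple(sorted(body)), head, hits / n, conf))
    return sorted(rules)


# Hypothetical Purchase rows: (transaction, item).
PURCHASE = [
    (1, "cheese"), (1, "coke"), (1, "bread"),
    (2, "cheese"), (2, "coke"), (2, "bread"),
    (3, "cheese"), (3, "coke"),
    (4, "bread"),
]

for body, head, sup, conf in extract_rules(PURCHASE):
    print(set(body), "==>", head, round(sup, 2), round(conf, 2))
```

Since the head is always removed from the itemset before the body is formed, tautologies such as {cheese, coke} ==> cheese cannot be produced.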

Page 31: Knowledge Discovery in Remote Access Databases

Implementation of KDQL

KDQL rules operator

    <KDQL_RULES_OP> := KDD RULES <TableName> AS
        SELECT DISTINCT <BodyDescr>, <HeadDescr> [,SUPPORT] [,CONFIDENCE]
        [WHERE <WhereClause>]
        FROM <FromList> [WHERE <WhereClause>]
        GROUP BY <Attribute> <AttributeList> [HAVING <HavingClause>]
        [CLUSTER BY <Attribute> <AttributeList> [HAVING <HavingClause>]]
        EXTRACTING RULES WITH SUPPORT: <real>, CONFIDENCE: <real>

    <BodyDescr> := [<Cardinality_Shape>] <AttrName> <AttrList> AS BODY
        /* default cardinality shape for the body: 1..n */
    <HeadDescr> := [<Cardinality_Shape>] <AttrName> <AttrList> AS HEAD
        /* default cardinality shape for the head: 1..1 */
    <Cardinality_Shape> := <Number> .. (<Number> | n)
    <AttributeList> := {<AttributeName>, <AttributeName>, … <AttributeName>}
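Two small pieces of this grammar, the cardinality shape and the EXTRACTING RULES clause, are easy to recognise with regular expressions. The sketch below is only an illustration of those two productions; the function names are hypothetical and it is not a parser for the full KDQL operator.

```python
import re

# <Number> .. (<Number> | n)
CARDINALITY = re.compile(r"^(\d+)\s*\.\.\s*(\d+|n)$")

# EXTRACTING RULES WITH SUPPORT: <real>, CONFIDENCE: <real>
THRESHOLDS = re.compile(
    r"EXTRACTING\s+RULES\s+WITH\s+SUPPORT\s*:\s*([0-9.]+)\s*,\s*CONFIDENCE\s*:\s*([0-9.]+)",
    re.IGNORECASE,
)


def parse_cardinality(text):
    """Return (low, high); high is None when the upper bound is the open 'n'."""
    match = CARDINALITY.match(text.strip())
    if match is None:
        raise ValueError(f"not a cardinality shape: {text!r}")
    low = int(match.group(1))
    high = None if match.group(2) == "n" else int(match.group(2))
    if high is not None and high < low:
        raise ValueError(f"upper bound below lower bound: {text!r}")
    return low, high


def parse_thresholds(statement):
    """Return (support, confidence) from the EXTRACTING RULES clause."""
    match = THRESHOLDS.search(statement)
    if match is None:
        raise ValueError("no EXTRACTING RULES clause found")
    return float(match.group(1)), float(match.group(2))


STATEMENT = """KDQL RULE Associations AS
SELECT DISTINCT 1..n item AS BODY, 1..1 item AS HEAD, SUPPORT, CONFIDENCE
FROM Purchase GROUP BY transaction
EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.2"""

print(parse_cardinality("1..n"))    # (1, None)
print(parse_thresholds(STATEMENT))  # (0.1, 0.2)
```

Returning None for the open upper bound keeps the default body shape 1..n distinguishable from any fixed number.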


Page 33: Knowledge Discovery in Remote Access Databases

Conclusion

• KDQL is a part of the ODBC_KDD (2) model.
• KDQL calls the I-extended database via an ODBC connection.
• The I-extended database retrieves all the requested information from traditional databases via ODBC.
• KDQL was implemented to handle DM tasks with visualization.
• Visualization techniques can be used to visualize interesting association rules discovered from the databases.

Page 34: Knowledge Discovery in Remote Access Databases

Results

The major results of the thesis work are summarized as follows.
• Proposing a new remote access KDD model, ODBC_KDD (2), an attractive model that delivers results with more detailed descriptions such as visualization, scripts, statistical inferences and more.
• Proposing and implementing a database concept, the I-extended database (I-ED), which is maintained and accelerated through the Knowledge Discovery Query Language (KDQL).
• Proposing, within the ODBC_KDD (2) model, a query language called KDQL. KDQL interacts with the conceptual database called the I-extended database. It is a new KDD query language that can discover association rules.
• Using visualization tools in KDQL to represent the retrieved results in different 2D and 3D visual forms such as pie, point, line and bar charts.
• Using the support and confidence of data items to locate the important association rules in the databases, through the I-extended database accessed by KDQL.


Page 36: Knowledge Discovery in Remote Access Databases

Appendix A, B

• Appendix A: the proposed syntax of the KDQL statement rules.
• Appendix B: images from the program.

Page 37: Knowledge Discovery in Remote Access Databases

Dedications and Acknowledgments

• First I want to thank my wife Emaan Zubi for her understanding and for making the last steps of writing this dissertation enjoyable, and also my kids Yhaia, Mohamed and Suliman for being nice kids while I was doing this work.
• My parents: my father Suliman Zubi and my mother Memona Yousef.
• I would like to thank Dr. Fazekas Gábor for accepting me as a Ph.D. student under his supervision. I would also like to thank him for his continuous encouragement, confidence and support, for reviewing the text of this thesis, and for sharing with me his knowledge and love of this field.
• My senior supervisor Prof. Dr. Arató Mátyás for his encouragement.
• Dr. Kormos Janos, my teacher and friend, for his insightful comments, advice and help.
• Dr. Bajalinov Erik for the frequent constructive discussions regarding programming in Delphi.
• My deepest thanks to Dr. Varga Katalin and Dr. Várterész Magdolna for refereeing my Ph.D. dissertation work.
• Mr. Basheer Nassain, the Libyan student advisor, and Mr. Khalid Zintaney of the financial office in the Libyan Embassy, Budapest, for their support.
• All people in this committee.
• Finally I want to thank all my friends and the people in the Institute of Mathematics and Informatics, Debrecen University.

Page 38: Knowledge Discovery in Remote Access Databases


Thank you!!!
