TRANSCRIPT
KNOWLEDGE DISCOVERY IN INTERNET DATABASES
A THESIS
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF MASTER OF SCIENCE
IN
COMPUTER SCIENCE UNIVERSITY OF REGINA
BY Xiaobo Yu
Regina, Saskatchewan
December 22, 1997
© Copyright 1997: Xiaobo Yu
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Abstract
A major objective in knowledge discovery in Internet database research is to support exploration and analysis of large amounts of data from several databases, each available via the Internet. This thesis describes an approach to achieving this objective based on a multidatabase. The multidatabase system provides a single front-end for several autonomous, heterogeneous database management systems.
A prototype software system, called KDID, has been developed to perform discovery tasks on Internet databases. A discovery task is decomposed into parameters for the task and a global database query. The global query is translated and decomposed into a set of local database queries, which are sent to Internet databases by database agents. KDID standardizes and accumulates the results of the local queries in a single database called the multidatabase. Knowledge discovery is then performed on the retrieved data by a discovery tool, DB-Discover, which performs high level, dynamic summarization and generalization of large amounts of data.
The approach is based on a global schema, which describes some related data. The correspondence between this global schema and the individual databases is maintained in a central registry. A registration subsystem is included in KDID to register Internet databases. The subsystem interacts with database administrators to obtain database schemas and integrate them with the global schema.
Acknowledgement
First, I must thank my supervisor and mentor, Dr. Howard Hamilton, for his guidance, critical advice, encouragement, and patience, without which this thesis might not have been completed. I am grateful to him for accepting me as one of his students, for believing in my ability to obtain this degree, for providing access to equipment, and for arranging financial support for me. An excellent supervisor, researcher, educator, and a fine human being, Dr. Hamilton has left a profound and lasting impression on me.
Next, I want to thank members of my thesis committee: Dr. Larry Saxton and Dr. Menchi Liu provided helpful comments, and Dr. Ken Runtz served as the external examiner. The Institute for Robotics and Intelligent Systems, the Faculty of Graduate Studies and Research, and the Department of Computer Science provided much-appreciated financial support. I also thank all members of the Department.
I am also greatly indebted to Dr. Edmund H. Dale, Professor Emeritus, and Miss Anne Rigney for helping me to come to the University of Regina to continue my studies. Without their help, I might not have been able to come to Canada; and having come, Dr. Dale was always there to offer support, advice and encouragement. His belief in me gave me confidence and determination to succeed in a culture and environment that were completely new to me. I also thank my friends Allan and Sharon Schmidt, Stuart and Yvonne Mann, and Len Morrison for their valuable friendship and help, which made my stay in Regina pleasant.

Many thanks go to all my friends, Chu Tongsheng, Gai Huifa, Hu Qiang, Shu Jun, Wang Changwen, Xie Yongzeng, Xing Minqing and Xu Zhan, and their families, for all their help and friendship. Special thanks are due to Brock Barber, for his valuable and friendly help with DB-Discover. I also thank my fellow graduate students and office mates, Carlos Rivera, Colin Carter, Li Liangchun, Pang Wanglin, Sivakumar Nagarajan, November Scheidt and Zhang Jian, who made this learning experience enjoyable.
Especially, I am indebted to my parents and younger brother in China for their constant love and understanding. Last, but not least, I am indebted to my wife, Fu Lei, for her unwavering love, devotion and understanding, without which this thesis could not have been finished.
Contents
Abstract
Acknowledgement
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Background and Related Research
  2.1 The Internet and Internet Databases
    2.1.1 The Internet
    2.1.2 The Client-Server Model
    2.1.3 The Hypertext Transfer Protocol
    2.1.4 The HTML Forms
    2.1.5 The World Wide Web
    2.1.6 Internet Databases
  2.2 The Multidatabase Model
  2.3 Knowledge Discovery in Databases
    2.3.1 Types of Discovered Knowledge
    2.3.2 DB-Discover
  2.4 Resource Discovery in the Internet
    2.4.1 The Client Directory Server Model
    2.4.2 The Multiple Layered Database Model
Chapter 3 An Overview of the KDID System
  3.1 The Architecture of KDID
    3.1.1 The Internet Front-end
    3.1.2 The Interface Module
    3.1.3 The Multidatabase Module
    3.1.4 The KDD Application Module
  3.2 Database Types
    3.2.1 Overview
    3.2.2 Oracle Databases
    3.2.3 Mini SQL Databases
    3.2.4 Microsoft Access Databases
  3.3 The Multidatabase Architecture for the KDID System
    3.3.1 The Architecture
    3.3.2 Data Integration
    3.3.3 Multidatabase Query Processing
Chapter 4 Design Issues
  4.1 The Interface Module
    4.1.1 The Form Data Parser
    4.1.2 Global Query Composition
  4.2 On-line Database Registration
    4.2.1 Security Issues
    4.2.2 The Registration Approach
  4.3 Global Query to Internal Query Translation
  4.4 Query Decomposition
    4.4.1 The Decomposition Algorithm
    4.4.2 Query Decomposition Examples
  4.5 Mechanisms for Resolving Data Value Differences
    4.5.1 Data Value Standardization
    4.5.2 Scale Conversion Functions
Chapter 5 Prototype Design and Testing
  5.1 Constructing a Test Multidatabase System
    5.1.1 Global Schema
    5.1.2 The D1 Database
    5.1.3 The D2 Database
    5.1.4 The D3 Database
    5.1.5 Relationships Between the Three Databases
  5.2 Database Registration Process
  5.3 Typical Knowledge Discovery Tasks
    5.3.1 Discovery Task 1
    5.3.2 Discovery Task 2
    5.3.3 Task Realization Using HTML Forms
    5.3.4 Discovery Results
  5.4 Performance Analysis
    5.4.1 Parallel Data Retrieval from Participating Databases
    5.4.2 Data Value Standardization
  5.5 Discussion
Chapter 6 Conclusion
  6.1 Summary
  6.2 Contributions
  6.3 Areas for Future Research
Bibliography
Appendix A Schema Information for the D1 Database
  A.1 The AWARD Table
  A.2 The DISCIPLINE Table
  A.3 The AREA Table
Appendix B Schema Information of Database D2
  B.1 The SCHOLARSHIP Table
  B.2 The COMMITTEE Table
  B.3 The GRANT-TYPE Table
Appendix C Schema Information of Database D3
  C.1 The ORGANIZATION Table
Appendix D Glossary
List of Tables
2.1 Typical Network Domain Types
2.2 Values of the TYPE Tag
2.3 Example Internet Databases
2.4 A Sample Server Database
2.5 A Sample Portion of a Relation in Layer 1
2.6 A Generalized Portion of a Relation in Layer 2
3.1 Functions Used to Contact the Mini SQL Server
3.2 ODBC Functions Used to Contact a Database Server
3.3 Data Types in Oracle, Mini SQL and MS Access
4.1 Format of Schema Mapping
4.2 Examples of Schema Mapping
4.3 Database Information for the Example Multidatabase
4.4 Example of Provinces and their Variations
5.1 Global Schema of the AWARD Table
5.2 Global Schema of the ORGANIZATION Table
5.3 Schema of the GRANT-TYPE Table
5.4 Comparison Between the AWARD and SCHOLARSHIP Tables
5.5 Time for Sequential and Parallel Retrieval
5.6 Standardization Time with 32K Memory and a 23788-Item Dictionary
5.7 Standardization Time with 64K Memory and a 47576-Item Dictionary
5.8 Standardization Time with 64K Memory and a 127576-Item Dictionary
A.1 Schema of the AWARD Table
A.2 Schema of the DISCIPLINE Table
A.3 Schema of the AREA Table
B.1 Schema of the SCHOLARSHIP Table
B.2 Schema of the COMMITTEE Table
B.3 Schema of the GRANT-TYPE Table
C.1 Schema of the ORGANIZATION Table
List of Figures
1.1 Number of Servers
1.2 Internet User Growth
2.1 Client-Server Model
2.2 An Example of an HTML Form Script
2.3 An HTML Form Example
2.4 Architecture of a Multidatabase System
2.5 A Concept Hierarchy for the Province Attribute: Tree View
2.6 Client Directory Server Model
2.7 The Format of a Query to the Directory Server
2.8 Multiple Layered Database Model
3.1 The Architecture of KDID
3.2 Software Hierarchy Chart
3.3 An HTML Form with User Data
3.4 An Example of a Form Data Set
3.5 Processed (Name, Value) Pairs
3.6 A Tabbed Concept Hierarchy File
3.7 Architecture of the KDID Multidatabase
3.8 Type Mapping Diagram
4.1 Flow Diagram for Query Processing
4.2 The Form Data Parser
4.3 Global Query Composition
4.4 On-line Database Registration
4.5 Global Query to Internal Query Translation
4.6 Procedure process_all_predicates
4.7 Query Decomposition Algorithm
4.8 Global Query
4.9 Internal Query
4.10 Local Query Submitted to DB1, DB2
4.11 Local Query Submitted to DB3
4.12 Data Value Standardization
4.13 Procedure lookup
4.14 A Hash Table Example
5.1 A Concept Hierarchy for the PROVINCE Attribute
5.2 Relations among the Three Tables
5.3 Secure Site Certificate
5.4 User Database Registration Application Form
5.5 Form for Schema Mapping
5.6 User Database Registration Acknowledgement
5.7 An SQL-like Query for Discovery Task 1
5.8 SQL Query for Task 1 After Transformation
5.9 Query 1 for Task 1 for the D1 Database
5.10 Query 2 for Task 1 for the D3 Database
5.11 Query 3 for Task 1 for the D2 Database
5.12 An SQL-like Query for Discovery Task 2
5.13 Query 1 for Task 2 for the D1 Database
5.14 Query 2 for Task 2 for the D2 Database
5.15 Query 3 for Task 2 for the D2 Database
5.16 Query 4 for Task 2 for the D3 Database
5.17 HTML Form for Discovery Task 1
5.18 HTML Form for Discovery Task 1 with User Data
5.19 Final Result for Discovery Task 1
5.20 Retrieved Data without Generalization
5.21 Test Time for Parallel and Sequential Retrieval
Chapter 1
Introduction
Since the amount of information connected to the Internet is growing rapidly, an
increasing potential for knowledge discovery in this information exists. As shown in
Figure 1.1 taken from [64], the number of servers connected to the Internet grew from
one thousand in 1984 to more than 16 million in January, 1997. Additionally, as
shown in Figure 1.2, the number of users grew from fewer than one million in 1993
to more than 47 million in January 1997 [65].
In the 1990s, Internet information retrieval tools, such as Wide Area Information
Servers (WAIS), Archie, Prospero, Gopher, the World Wide Web (WWW), Netfind,
X.500, and Indie have been developed to help users find interesting information in the
Internet [68]. These tools facilitate browsing, searching, and organizing information
accessible via the Internet, but they do not provide knowledge discovery techniques
for structured data sources, such as databases, connected to the Internet.
Knowledge discovery provides a means of coping with the massive amount of data
produced daily. Users want to find knowledge hidden in this information. Knowl-
edge discovery is "the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data" [34]. If the output of a knowledge
discovery technique is a pattern that is considered interesting, it can be considered
as "knowledge". Knowledge discovery combines database and machine learning tech-
niques. Many algorithms for discovering association rules, classification rules, sequen-
tial patterns, and time sequences have been proposed for knowledge extraction from
databases and have been applied successfully to large relational databases [34], but
Figure 1.1: Number of Servers [64].
they have not been applied to the global information base, i.e., all data available via the Internet.
There are several complications to applying knowledge discovery techniques to the
global information base: the number of sites, the fast growing user community, the
limited bandwidth, the great amounts of data, the unstructured nature of much of
this data, and the slow speed of data analysis. Included in the global information base
are many databases, which due to their structured nature are amenable to efficient
analysis.
An Internet database is a database which provides access to Internet users. Often,
an Internet database provides authoritative information in a specific category with
WWW as its database management interface. For example, the Semaphore Corpora-
tion's Internet databases include a complete postal database with al1 United States
addresses, zip codes and mail carrier route numbers. In this thesis, attention is re-
stricted to relational databases with interfaces accepting Structured Query Language
(SQL) queries, which are the most common type of commercial databases on the Internet.
Figure 1.2: Internet User Growth.
Knowledge discovery in Internet databases concerns the application of knowledge
discovery in database techniques to multiple relational databases available on the
Internet. The goal is to analyze more data than can be analyzed by a user without
automated support.
In this thesis, a multidatabase approach is introduced for conducting knowledge
discovery in Internet databases. The approach has been embodied in the Knowledge Discovery in Internet Databases (KDID) software, which was designed and implemented as part of this thesis research.
There are four main reasons why constructing such a model is worthwhile. First,
the multidatabase approach provides a means of using databases already connected
to the Internet. According to the Gale Directory of Databases [49], the number of
on-line databases increased from 4,000 in 1989 to 10,000 databases in 1996. These
databases are distributed worldwide and cover a large variety of subject areas. They
are provided by 1,800 on-line services and database distributors. A multidatabase
approach is appropriate because the databases are autonomous, heterogeneous, and
geographically dispersed. As well, databases containing similar information may use
different schemas. Building a multidatabase system allows the use of a single front-end
for many different database management systems.
Secondly, the multidatabase approach extends the scope of existing knowledge discovery from database systems from a single database to multiple databases. Several
existing knowledge discovery in database software packages permit the discovery of
useful information from large amounts of data stored in a single relational database.
For example, the DB-Discover research software system for knowledge discovery is
useful for data access and summarization for a single relational database [18]. It
allows high level, dynamic organization of data without modifying the data in any
database. Provided with access to a relational multidatabase, DB-Discover can summarize interesting knowledge from multiple relational databases instead of a single database.
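As a rough illustration of the kind of summarization DB-Discover performs (the city-to-province concept hierarchy and the data below are invented for this sketch, and this is not DB-Discover's actual data structure or file format), attribute values can be replaced by their parent concepts and the resulting duplicate tuples merged with a count:

```python
from collections import Counter

# Hypothetical one-level concept hierarchy: city -> province.
HIERARCHY = {
    "Regina": "Saskatchewan",
    "Saskatoon": "Saskatchewan",
    "Calgary": "Alberta",
    "Edmonton": "Alberta",
}

def generalize(values, level_map):
    """Replace each attribute value by its parent concept and merge
    the now-identical values, keeping a count of merged tuples."""
    counts = Counter(level_map.get(v, v) for v in values)
    return sorted(counts.items())

rows = ["Regina", "Calgary", "Saskatoon", "Regina", "Edmonton"]
print(generalize(rows, HIERARCHY))
# [('Alberta', 2), ('Saskatchewan', 3)]
```

Real attribute-oriented generalization climbs multi-level hierarchies and handles several attributes at once, but the merge-and-count step is the same in spirit.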
Thirdly, the proposed approach provides a convenient way for database adminis-
trators to make additional databases available for discovery tasks. By using Hypertext
Markup Language (HTML) forms, a database administrator can register a database
with the KDID system. As well, the relationship between data in this database and
the global schema used by the KDID system can be conveniently defined.
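The central registry can be pictured as a lookup from global-schema attributes to the local database columns that hold them. The sketch below uses table names that appear later in the thesis (AWARD, SCHOLARSHIP, databases D1 and D2), but the column names and the dictionary layout are invented for illustration; the actual KDID mapping format is described in Chapter 4:

```python
# Hypothetical central registry: (global table, global attribute) ->
# list of (database, local table, local column) entries.
REGISTRY = {
    ("AWARD", "amount"): [("D1", "AWARD", "amount"),
                          ("D2", "SCHOLARSHIP", "value")],
    ("AWARD", "province"): [("D1", "AWARD", "province")],
}

def local_columns(table: str, attribute: str):
    """Look up which local database columns correspond to a
    global-schema attribute; empty if no database registered it."""
    return REGISTRY.get((table, attribute), [])

print(local_columns("AWARD", "amount"))
# [('D1', 'AWARD', 'amount'), ('D2', 'SCHOLARSHIP', 'value')]
```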
Fourthly, the approach provides a convenient way for users to initiate discovery
tasks. The KDID system is accessible via Internet browsers, such as Netscape. Using HTML-based forms provided by KDID, a user can specify a DB-Discover task,
including what data are to be retrieved from the multidatabase. The usefulness of
the discovered knowledge is affected by how well the user understands the domain
described by the retrieved data [6]. HTML forms help users understand the domain. For example, if the user tries to discover information relating patient age with the
occurrence of diseases, it is more meaningful and convenient if he/she is able to select
items from a form based on a database schema for medical information than if he/she
must compose SQL queries. The incorporation of an HTML-based interface in the
KDID system makes it easier for any user to initiate discovery tasks without the aid
of database specialists.
The main goal of this thesis is to demonstrate the feasibility of conducting knowl-
edge discovery from Internet databases. The thesis focuses on three particular tasks:
on-line registration of user databases, translation of discovery tasks entered on a World Wide Web (web) page into a series of queries on appropriate databases, and the integration of results from these databases.
Chapter 2 presents background material required to place the design of the KDID
system in context. The background material concerns the Internet and Internet
databases, multidatabases, and techniques for knowledge discovery in databases. A
survey of relevant research on resource discovery in the Internet is also presented.
Examples are supplied to illustrate each model of resource discovery.
Chapter 3 is an overview of the KDID system for knowledge discovery in Internet
databases. It describes the general architecture of the KDID system and each com-
ponent. Relevant information concerning database types is also presented. The main
focus is on the multidatabase model, the platform for all knowledge discovery tasks.
Chapter 4 presents the design issues encountered during the implementation of
a prototype KDID system. An approach for manipulating schema information and
algorithms for query processing in the multidatabase context are described and ex-
amples are supplied. Mechanisms for resolving differences in data values are also
presented.
Chapter 5 describes the creation of a multidatabase for testing, the on-line database
registration, and the results of some knowledge discovery tasks. The performance of
the KDID system is also analyzed.
Chapter 6 presents conclusions. It summarizes the contributions of this research and discusses possible areas for future research.
Chapter 2
Background and Related Research
In this chapter, background material and related research are presented. Since
this thesis concerns the application of knowledge discovery techniques to Internet
databases using a multidatabase approach, three appropriate background subjects
are the Internet and Internet databases, multidatabases, and knowledge discovery
techniques. Section 2.1 describes the Internet, the client-server model, the HTTP
protocol, HTML forms, the WWW, and Internet databases. Section 2.2 explains the
multidatabase model. Section 2.3 describes the techniques applied in knowledge dis-
covery in databases. Section 2.4 presents a survey of approaches to resource discovery
in the Internet, which is the most relevant topic in the literature to the subject of
this thesis.
2.1 The Internet and Internet Databases
2.1.1 The Internet
The Internet is the worldwide collection of interconnected computer networks
and gateways that use the Internet Protocol (IP) and function as a single cooperative
network [48]. The Internet provides three levels of network services: connectionless
packet delivery, connection oriented streams, and application level services [48]. In-
formation can be transferred in electronic form via communication paths, such as
optical fiber lines and satellites, around the globe through the Internet.
Any computer system directly connected to the network has a domain name and an IP address. A domain name is typically of the form system.site.domain, for example, gopher.voa.gov. The most common domain types are shown in Table 2.1. An IP address is a unique 32 bit unsigned integer assigned to a computer.

Table 2.1: Typical Network Domain Types.

Domain Code | Meaning                      | Example
edu         | educational organization     | www.sims.berkeley.edu
com         | commercial organization      | www.bypass.com
gov         | US governmental organization | gopher.voa.gov
mil         | US military organization     | web.nps.navy.mil
net         | network organization         | www.myhomepage.net
org         | non-profit organization      | ftp.ifcss.org
ca          | Canada                       | mercury.cs.uregina.ca
cn          | China                        | www.qd.sd.cn
uk          | the United Kingdom           | www.comp.brad.ac.uk

2.1.2 The Client-Server Model

Figure 2.1: Client-Server Model.
To understand how the Internet operates, the concept of client-server computing is crucial. The client-server model, as shown in Figure 2.1, is a form of distributed
computing that divides the application processing between a client and a server that
are connected by a network. A client process, which often corresponds to a user
interface, sends a request to a server process. The server receives the request, performs
the appropriate action, and returns the result to the client process [50]. Typically, a
server waits for any request messages sent to a particular port on a machine. Client-
server processing requires reliable communication between clients and servers.
With a client-server architecture, it is possible to create an interface that is in-
dependent of the server machine hosting the data. Therefore, the user interface of a
client-server application can be run on a Windows NT computer and the server can
be run on a mainframe. Clients can be also written for DOS-based or UNIX-based
computers. This allows information to be stored in a central server and disseminated
to different types of remote computers.
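The request-reply interaction described above can be sketched in a few lines of Python sockets. This is a toy exchange over TCP on one machine (the server's "action" is simply upper-casing the request), not any part of the KDID software:

```python
import socket
import threading

# Server side: wait for a request on a port, perform the action,
# return the result to the client.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve_one() -> None:
    conn, _ = srv.accept()
    with conn:
        conn.sendall(conn.recv(1024).upper())   # the "appropriate action"

t = threading.Thread(target=serve_one)
t.start()

# Client side: connect to the server's port, send a request, read the reply.
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b"hello server")
    c.shutdown(socket.SHUT_WR)    # signal end of request
    reply = c.recv(1024)

t.join()
srv.close()
print(reply)  # b'HELLO SERVER'
```

The separation is the point: the client could run on any machine and operating system, while the server runs wherever the data lives, as the surrounding text notes.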
2.1.3 The Hypertext Transfer Protocol
The Hypertext Transfer Protocol (HTTP) is an application-level protocol for distributed, collaborative, hypermedia information systems [9]. The HTTP protocol is
based on the client-server model described in Section 2.1.2. A client establishes a connection with a server and sends a request consisting of a request method, a Uniform
Resource Locator (URL), and the body of the request. The server responds with a
status line and the body of the reply. These ideas are now explained in more detail.
Most HTTP communication is initiated when a client process, such as a browser,
sends a request to a server controlling resources. HTTP communication takes place
over Transmission Control Protocol/Internet Protocol (TCP/IP) connections. The default is TCP port 80 [8], but other ports can be used; for example, the port number for the HTTP server on chiron.cs.uregina.ca is 8050.
An HTTP Uniform Resource Locator (URL) is used to specify the location of
network resources via the HTTP protocol [9]. The syntax for an HTTP URL is
''http:" "//" host [ ":" port ] [ path 1. The identified resource is located by the HTTP server process listening for TCP con-
nections on the specified port of the host. The path identifies the full path of the re-
source. For example, http://ch~on.cs.ureginaaca:8050/htb~soeaction means that
an HTTP server is listening for TCP connections on 8050 of chiron.cs.uregina.ca. The
directory on chiron is called /htbin and the resource is sorne-action.
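The host [ ":" port ] [ path ] pieces of this syntax map directly onto the components returned by the standard Python URL splitter; the URL below is the example from the text:

```python
from urllib.parse import urlsplit

# Decompose an HTTP URL into the components of the grammar
# "http:" "//" host [ ":" port ] [ path ].
url = "http://chiron.cs.uregina.ca:8050/htbin/some-action"
parts = urlsplit(url)

print(parts.scheme)    # 'http'
print(parts.hostname)  # 'chiron.cs.uregina.ca'
print(parts.port)      # 8050  (None if omitted; HTTP then assumes 80)
print(parts.path)      # '/htbin/some-action'
```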
The body of the request contains the actual data to be transferred preceded
by several optional header lines. The data type of the body is determined via the
CONTENT-TYPE and CONTENT-ENCODING header fields. A CONTENT-TYPE
header specifies the media type of the data, and a CONTENT-ENCODING header
indicates an additional encoding technique applied to the content. The CONTENT-
LENGTH header indicates the size of a request or reply.
The request methods for HTTP include GET and POST. The GET method re-
trieves whatever data is identified by the URL. If the URL refers to a data producing
process, the produced data is returned in the response rather than the source text
of the process. The POST method requests that the destination server accept the
body of the request as a new subordinate of the resource identified by the URL. For
example, POST might be used when the body of the request is a filled-in form being
submitted. The actual function performed by the POST method is determined by
the server and dependent on the URL [9].
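The request structure described above (request line with a method, headers such as CONTENT-TYPE and CONTENT-LENGTH, then the body) can be made concrete with a small helper. This is an HTTP/1.0-style sketch; the helper function and its exact header choices are this example's, not a real library API or the thesis' code:

```python
def build_request(method: str, path: str, host: str, body: str = "") -> str:
    """Assemble the text of an HTTP/1.0 request: a request line, headers
    describing the body, a blank line, then the body itself."""
    headers = [f"{method} {path} HTTP/1.0", f"Host: {host}"]
    if body:
        headers.append("Content-Type: application/x-www-form-urlencoded")
        headers.append(f"Content-Length: {len(body)}")
    return "\r\n".join(headers) + "\r\n\r\n" + body

# A GET retrieves whatever the URL identifies; it carries no body.
print(build_request("GET", "/htbin/some-action", "chiron.cs.uregina.ca"))

# A POST carries a body, e.g. a filled-in form, described by the
# Content-Type and Content-Length headers.
print(build_request("POST", "/htbin/kdid", "chiron.cs.uregina.ca",
                    "SORT=disc_code&ORDER=ASC"))
```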
2.1.4 The HTML Forms
The Hypertext Markup Language (HTML) is a simple data format used to create hypertext documents that are portable from one platform to another [8]. A document in HTML consists of a mixture of HTML commands and any number of American Standard Code for Information Interchange (ASCII) texts.
A form is a template for a form data set and an associated method and action
URL. INPUT, SELECT, TEXTAREA are tags used to specify user-entered inputs
with the form. The INPUT tag specifies a simple input inside a form. Attributes
to INPUT are TYPE, NAME, VALUE, CHECKED, SIZE, and MAXLENGTH. The
TYPE tag has the six possible values shown in Table 2.2.
Table 2.2: Values of the TYPE Tag.

Value    | Explanation
text     | Default text entry field
password | Text entry; characters entered are represented as asterisks
checkbox | A single toggle button
radio    | A single button
submit   | A push button causing the current form to be assembled into a query URL and sent to a remote server
reset    | A push button causing the various input elements in the form to be reset to their default values
Any number of SELECT tags is allowed inside one form and they can be intermixed with other HTML commands and plain text. SELECT has opening and
closing tags, i.e., <SELECT> and </SELECT>. Each SELECT tag has a sequence of
OPTION tags. The attributes to SELECT include NAME, SIZE, MULTIPLE.
Figure 2.2 gives an example of an HTML form script. The value of each INPUT
TYPE determines a form action. For example, INPUT TYPE="RADIO" produces
a radio button on the form. A SELECT tag can have multiple values if the tag
MULTIPLE follows. For example, the TARGET options are "amount", "province",
"orgname", and "dept". Several or all of them can be selected at the same time.
<FORM ACTION="http://chiron.cs.uregina.ca:8050/htbin/kdid" METHOD="POST">
Sort by
<SELECT NAME="SORT">
<OPTION SELECTED="SELECTED">disc_code
<OPTION>area_code
</SELECT>
<INPUT TYPE="RADIO" NAME="ORDER" VALUE="ASC" CHECKED> Ascending
<INPUT TYPE="RADIO" NAME="ORDER" VALUE="DESC"> Descending
Choose a concept hierarchy file
<SELECT NAME="LOAD">
<OPTION SELECTED="SELECTED">nserc.chf
<OPTION>pas1.chf
<OPTION>pas.chf
</SELECT>
Select target attributes first
<SELECT NAME="TARGET" MULTIPLE>
<OPTION SELECTED="SELECTED">amount
<OPTION>province
<OPTION>orgname
<OPTION>dept
</SELECT>
<INPUT TYPE="SUBMIT" VALUE="QUERY">
<INPUT TYPE="RESET" VALUE="RESET">
</FORM>
Figure 2.2: An Example of an HTML Form Script.
A form data set is a sequence of (NAME, VALUE) pairs. The names are specified
by the NAME attribute of form inputs, and the values are either by default given
by the form or filled in by a user. The resulting form data set is used to access
information service as a function of the action and method [8].
In order to protect form data from misinterpretation during transmission and processing, a number of encoding methods can be used. The default encoding for all forms
is application/x-www-form-urlencoded. When a form is application/x-www-form-urlencoded, the form field names and values are escaped. Space characters are
replaced by the "+" sign, and non-alphanumeric characters are replaced by "%HH",
i.e., a percent sign and two hexadecimal digits representing the ASCII code of the
character. Fields are listed in the order they appear in the HTML script with the
name separated from the value by an "=" sign, and the pairs are separated from each
other by an "&" sign.
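These encoding rules can be sketched with a few lines of Python's standard library (this is not KDID code; the field names echo Figure 2.2 and the values are invented):

```python
from urllib.parse import urlencode

# Encode a form data set with the default application/x-www-form-urlencoded
# rules described above: names and values are escaped, spaces become "+",
# other non-alphanumeric characters become "%HH", and fields are joined
# with "=" and "&" in order of appearance.
form_data = [
    ("SORT", "disc_code"),
    ("ORDER", "ASC"),
    ("TARGET", "amount of award"),  # the space is encoded as "+"
]

encoded = urlencode(form_data)
print(encoded)  # SORT=disc_code&ORDER=ASC&TARGET=amount+of+award
```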
An HTML browser, such as Netscape, processes a form by presenting the HTML script with each field in its initial state. The user can modify the fields. When
the user submits the form, the form data set is processed according to its method
and action URL. For example, the action URL in the form shown in Figure 2.2
is http://chiron.cs.uregina.ca:8050/htbin/kdid, and the method is POST. Therefore,
when the user clicks on the submit button, an HTML form appears on the browser,
as shown in Figure 2.3.
2.1.5 The World Wide Web
The World Wide Web ties together vast amounts of information stored on the
Internet. The World Wide Web is a collection of mutually referencing hypertext doc-
uments scattered across the Internet, and serviced by HTTP servers. Each server
provides access to the information at its site to local and remote clients. Web docu-
ments are often written in HTML. By clicking on highlighted words on a web page,
the user jumps to another document linked to the highlighted text. Netscape is a
client application for the World Wide Web.
HTML documents link to each other by including URLs of documents they want
to reference. URLs can point either to static documents or to scripts, such as that
shown in Figure 2.2, which dynamically generate documents. The function of an
HTTP server is to accept URLs from clients, interpret those URLs as either local
document addresses or invocations of local scripts which produce documents, and
to transmit the resulting HTML documents to the client. The function of an HTTP
client, such as Netscape, is to display documents received from servers and to transmit
URLs to servers in response to interactive events.

Figure 2.3: An HTML Form Example.
2.1.6 Internet Databases
As described in Chapter 1, an Internet database is a database with access provided to Internet users. Here we limit attention to relational databases that use the
American National Standards Institute (ANSI) Structured Query Language (SQL) statements to manipulate data. Internet databases run in distributed, client-server
environments. Clients of an Internet database are allowed to be on any Internet host.
Some Internet databases are accessed directly as SQL servers, i.e., servers that re-
ceive SQL queries as requests. Other Internet databases use WWW pages as their
database management interface. A user of such a database communicates with it
with interactive and dynamic HTML forms.
Typically, an Internet database with a web interface has an underlying database
server. User requests are translated into SQL (or another query language) and trans-
mitted to the database server. The results are returned to the user via the web.
The HTTP server for the web interface and the database server may be on the same
machine or different machines. In the approach described in this thesis, a separate
database server is required for the Internet databases to be accessed by the KDID
system.
Table 2.3 shows 10 Internet databases and their data categories. For example,
the College Theatre Internet Database provides a platform to share the college theatre experience for college theatre clubs and departments among US colleges and
universities.
Table 2.3: Example Internet Databases.

Internet Database                                        Subject
College Theatre Internet Database                        College theatre experience
Showa Chemical Internet Database                         Chemical products
Past Pages Internet Database                             Information on paperbacks, cheaper hardcovers, magazines, postcards, prints
Florida Film Locations Internet Database                 Location pictures from various film commissions in Florida
Nexus Database                                           Classified advertisements
ASFA Database                                            Aquatic sciences and fisheries abstracts
Embase Database                                          Clinical and experimental aspects of arthritis
Environmental Sciences & Pollution Management Database   Aspects of environmental sciences
Findex Database                                          Market research reports
Metadex Database                                         Aspects of metallurgical science and technology

2.2 The Multidatabase Model
A multidatabase system provides integrated access to autonomous, heterogeneous
databases via a single, relatively simple request [47]. Multidatabase systems facilitate the sharing of information among different departments of a company or among
different companies where database management systems are incompatible with each
other. With a multidatabase system, a user does not have to send multiple requests
in different languages to multiple information sources. Instead the user sends one
query to the multidatabase system which handles the details.

Figure 2.4: Architecture of a Multidatabase System.
Typically, a multidatabase system includes a multidatabase, which is a database
that acts as a front-end to multiple, possibly heterogeneous local database management systems [47], as shown in Figure 2.4. A local database management system
does not have to make any modification and retains full control over local data and
processing. Cooperating with the global system by serving global requests is strictly
voluntary.
A key aspect of multidatabases is site autonomy. Each site determines indepen-
dently what information it will share with the global system, and what global requests
it will service [47]. Global changes, such as the addition and deletion of other sites, do
not have any effect on the local database management systems. The multidatabase
model is desirable because the capital invested in hardware, software and user train-
ing for the local databases is preserved. As well, site autonomy acts as a security
measure [47], because the local database management system has full control over
what processing options are allowed.
A number of multidatabase systems have been developed for production use, in-
cluding the DATAPLEX system and the Integrated Manufacturing Data Administration System (IMDAS).
DATAPLEX is a distributed database management system. It allows queries to
retrieve and update data managed by diverse database systems. The relational model
is used as the global data model to provide a uniform user interface. The DATAPLEX
system has an SQL Parser, a Distributed Query Decomposer and a Data Dictionary
Manager. The SQL Parser checks the syntax of the SQL statements and determines
their meaning. The Distributed Query Decomposer decomposes an SQL query into
a set of local queries. As well, it determines the location of the user and merges all
results at the user's location. The Data Dictionary Manager finds the location of the
data referenced by a query [81].
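The decomposition step can be pictured with the following sketch. This is an illustrative Python fragment, not DATAPLEX's actual code; the table-to-site dictionary and site names are invented:

```python
# Data-dictionary-driven query decomposition: each table referenced by a
# global query is looked up in the dictionary, and one local retrieval is
# produced per site.
data_dictionary = {
    "award": "site_a",
    "organization": "site_b",
}

def decompose(tables):
    """Group the tables referenced by a global query by the site storing them."""
    local_queries = {}
    for table in tables:
        site = data_dictionary[table]
        local_queries.setdefault(site, []).append("select * from " + table)
    return local_queries

print(decompose(["award", "organization"]))
# {'site_a': ['select * from award'], 'site_b': ['select * from organization']}
```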
The IMDAS system was developed to provide access to sources of manufacturing
data and to allow new application programs to be added without changing the existing
databases. The IMDAS system uses an SQL-like query language. Each local database
system has a Basic Data Server to provide the interface between the local database
management system and the integrated database system [81].
2.3 Knowledge Discovery in Databases
The notion of finding useful patterns in data has been given a variety of names,
including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing [20]. The term knowledge
discovery in databases (KDD) was coined by Piatetsky-Shapiro in 1989 [33]. This
term emphasizes that knowledge is the end product of a data-driven discovery pro-
cess.
KDD has evolved, and continues to evolve, from the intersection of the research
fields of machine learning, pattern recognition, databases, statistics, knowledge acquisition from expert systems, data visualization, and high performance computing
[20]. Techniques from machine learning, pattern recognition, and statistics are being
adapted to serve in KDD components. An important factor behind KDD is the
research involving database management. Effective data manipulation can greatly
improve the performance of a KDD system. In this thesis, the KDD technique imple-
mented in DB-Discover, the discovery of characteristic rules, is applied to Internet
databases.
In this section, knowledge discovery techniques are categorized based on the type
of discovered knowledge. As well, the DB-Discover system is described.
2.3.1 Types of Discovered Knowledge
Types of knowledge include classification rules, decision trees, association rules,
characteristic rules, and discriminant rules.
Data classification classifies data based on the values of certain attributes [20]. It is the process of finding the common properties among a set of data in a database. A training set is required to construct a classification model. Each tuple in the training
set consists of the same set of attributes as the tuples in a large database. Each tuple
has a known class identity. The objective of the classification process is to analyze
the training set and develop an accurate model for each class. Such models are used
to classify data in the large database to develop better descriptions for each class in
the database. These descriptions are called classification rules.
A decision tree is generated from a training set. A classification algorithm takes
the training set of attribute values and a class as input. The decision tree for the
training set consists of nodes that are tests on the attributes [32]. The outgoing
branches of a node correspond to all possible outcomes of the test at the node. The
subset of the training set at a node in the tree is partitioned along the branches.
Discovering association rules requires the derivation of a set of rules in the form of
A1 ∧ ... ∧ Am → B1 ∧ ... ∧ Bn, where Ai (for i ∈ {1, ..., m}) and Bj (for j ∈ {1, ..., n}) are sets of attribute values, from the relevant data sets in a database [20]. A good
example is to search for associations among items a customer purchases together in
a single transaction. For example, a KDD system might discover that if a customer
buys macaroni, he also has a tendency to buy cheese in the same transaction.
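The strength of such a rule is usually measured by its support and confidence, which the following sketch computes over a set of invented market-basket transactions:

```python
# A minimal sketch of measuring an association rule A -> B over a set of
# transactions. The transactions below are for illustration only.
transactions = [
    {"macaroni", "cheese", "milk"},
    {"macaroni", "cheese"},
    {"macaroni", "bread"},
    {"cheese", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# Customers who bought macaroni also bought cheese in two of three cases.
print(confidence({"macaroni"}, {"cheese"}))
```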
A characteristic discovery task is to find interesting relationships between various
attributes of one or more relations in the database. A characteristic relation is a
relation generalized from data retrieved from a database guided by a set of concept
hierarchies.
A discriminant rule is an assertion that discriminates concepts of the target class
from the contrasting class [40]. A discriminant rule can be discovered by generalizing
data in both the target class and the contrasting classes synchronously and excluding
properties that overlap between the two classes.
DB-Discover is a research software package that permits the discovery of useful
information from large amounts of database data. DB-Discover is applied on rela-
tional databases and produces generalized relations. It performs generalization and
summarization efficiently. Data summarization presents the general characteristics
or a summarized high level view over a set of user specified data from a database.
DB-Discover allows high level, dynamic data organization without modifying the data
itself.
Software for performing characteristic discovery tasks has been implemented in
DB-Discover. DB-Discover is based on an attribute-oriented generalization algorithm
which takes as input a relation retrieved from a database and generalizes the data
guided by a set of concept hierarchies [18], as explained below.
Data generalization is then performed on the data by applying data generalization
techniques, including attribute removal, concept tree climbing, attribute threshold
control, propagation of counts and other aggregation function values. The summarized data is expressed in the form of a generalized relation on which other operations
or transformations can be performed to transform the summarized data into different
output formats. For example, the generalized relation can be mapped into charts and
curves, using visualization tools.
A concept hierarchy, as shown in Figure 2.5, is a tree of concepts arranged hierar-
chically according to generality. For discrete valued attributes, leaf nodes correspond
to actual data values which may be found in the database. For continuous valued
attributes, leaf nodes represent ranges of values. Each higher level concept is a gen-
eralization.
Figure 2.5: A Concept Hierarchy for the Province Attribute: Tree View.
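One-level climbing of such a hierarchy amounts to a parent lookup, as the sketch below shows. The parent table is only a guess at the shape of the province hierarchy in Figure 2.5, not its actual contents:

```python
# Climbing a concept hierarchy one level: each leaf value is replaced by its
# parent concept, and values already at the top level are kept unchanged.
parent = {
    "Saskatchewan": "Western",
    "Alberta": "Western",
    "Ontario": "Eastern",
    "Quebec": "Eastern",
}

def generalize(value):
    """Return the next higher concept for a value, or the value itself."""
    return parent.get(value, value)

print(generalize("Saskatchewan"))  # Western
```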
2.4 Resource Discovery in the Internet
A variety of researchers are attempting to apply knowledge discovery techniques
to data available on the Internet. In this section, two proposals, the client directory
server model and the multiple layered database model, are described and evaluated.
2.4.1 The Client Directory Server Model
Li and Danzig [54] proposed the Client Directory Server model (CDS) for resource
discovery in the Internet based on the client-server model, shown in Figure 2.1. Thou-
sands of servers are available to provide information over the Internet. If every server
had to be consulted for every query, query time would be a bottleneck and there
would be duplicate searches.
Instead, the CDS model, shown in Figure 2.6, uses a directory to determine rele-
vant servers. Formally, a Client Directory Server (CDS) system is defined as (c, d, S), where c is a client corresponding to a user interface that requests information, d is a
server directory that contains a mechanism for dynamically locating incoming server
names, server locations and other information, and S is a set of servers scattered
through the network.
Figure 2.6: Client Directory Server Model.
A server directory records a description and summary for each information server.
In Figure 2.6, the client sends a query to the directory server to identify the servers
that are appropriate for the query. Then the client sends a request to each of these
servers. The dashed lines show that if only one server has to be visited, no other
server will be contacted, which decreases the network load.
Table 2.4: A Sample Server Database.

Server Name            Location                                Topic
cs.fas.sfu.ca          Computing Science, Simon Fraser         knowledge discovery, intelligent scheduling
mercury.cs.uregina.ca  Computer Science, University of Regina  knowledge discovery, intelligent scheduling, information filtering
www.enee.umd.edu       University of Maryland                  computational linguistics
www.mcc.com:80         MCC Corporation                         knowledge discovery

Table 2.4 shows a sample database with data for four servers. The format of the
query sent by the client to the directory server is shown in Figure 2.7.

select server_name
from server_db
where one of keywords like "knowledge discovery"

Figure 2.7: The Format of a Query to the Directory Server.
If the client asks, "Find all sites that have knowledge discovery related materials,"
the result of the query is cs.fas.sfu.ca, mercury.cs.uregina.ca, and www.mcc.com:80.
Then the client sends requests to the three servers to look for the information.
The advantage of the CDS model is the server directory. It keeps track of each
information server. Before a request is sent to any server, the directory is queried.
Only the servers involved in the request are contacted. This mechanism decreases
query time.
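The lookup step can be sketched as follows. The fragment is illustrative only; the topic sets are modelled loosely on the sample server database of Table 2.4:

```python
# The CDS lookup step: the client first asks the server directory which
# servers list a keyword, and would then contact only those servers.
server_db = {
    "cs.fas.sfu.ca": {"knowledge discovery", "intelligent scheduling"},
    "mercury.cs.uregina.ca": {"knowledge discovery", "information filtering"},
    "www.enee.umd.edu": {"computational linguistics"},
    "www.mcc.com:80": {"knowledge discovery"},
}

def relevant_servers(keyword):
    """Return, in sorted order, the servers whose entry lists the keyword."""
    return sorted(name for name, topics in server_db.items() if keyword in topics)

print(relevant_servers("knowledge discovery"))
# ['cs.fas.sfu.ca', 'mercury.cs.uregina.ca', 'www.mcc.com:80']
```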
The CDS model is keyword based. The relevance of a database to a query is
determined by keyword analysis of the entry for the database in the server directory,
not the database itself. The CDS model does not apply to integrated data sources
because it assumes its input data is text data in unstructured form. As well, the
client directory server model assumes that descriptions of all servers are provided.
2.4.2 The Multiple Layered Database Model
Han and Zaiane proposed a multiple layered database model for knowledge and
resource discovery in the Internet [41]. A multiple layered database (MLDB) is a
database formed by generalization and transformation of information. Knowledge
discovery proceeds from the lowest layer to the highest. Layer 0 is the original
database. Layer 1 and higher layers of an MLDB are constructed on top of layer
0. Knowledge discovery can be performed efficiently in an MLDB once it has been
constructed. Generalization from each layer to the next higher layer reduces the size
of the database.
Figure 2.8: Multiple Layered Database Model [41].
The MLDB model is based on studies on databases and knowledge discovery
[34] [71] [42]. With the generalization techniques developed for knowledge discovery from databases, the voluminous, primitive information in the global information base
can be transformed and generalized into classified, structured high level information.
It can then be stored in a distributed database as layer 1 in the MLDB model [41].
The database constructed in layer 1 is too large to create and maintain. Further
merging and generalization at this level produces a higher layer, layer 2. This level can
be replicated at each site for further integration. Information retrieval and knowledge
discovery can be conducted in this higher level database.
The MLDB model is based on an extended relational model with newly defined
operators to handle information in the form of hypertext and multimedia. An MLDB has three components (s, h, d), where s is a database schema, h is a set of concept
hierarchies, and d is a set of database relations. Successful generalization is a key to
the construction of an MLDB system.
Concept generalization on nonnumeric values, such as indices, keywords, site loca-
tions, server names, and organizations, relies on the concept hierarchies that represent
necessary background knowledge to direct generalization. With the usage of a concept hierarchy, primitive data can be expressed in terms of generalized concepts in a
higher layer. Concept hierarchies in an MLDB must be standard and observed by all
local databases.
An algorithm based on the attribute-oriented generalization for the construction
of an MLDB is presented in [41]. A sample relation in layer 1 is presented in Table
2.5.
URL                         Organization                       Department                      Interest                                     Publication
http://www.vanderbilt.edu   Vanderbilt University              computer science department     machine learning                             kdd tutorial ...
http://www.isi.edu          University of Southern California  information science institute   machine learning, data mining                query reformulation for dynamic information integration ...
http://www-db.stanford.edu  Stanford University                department of computer science  interoperability of heterogeneous databases  an approach to behavior sharing in federated database systems ...

Table 2.5: A Sample Portion of a Relation in Layer 1.
By extracting documents related to computer science, a layer 2 relation can be
created by generalization, shown in Table 2.6.

Organization                       Interest                                     Publication
Vanderbilt University              machine learning                             ...
University of Southern California  machine learning                             ...
Stanford University                interoperability of heterogeneous databases  ...

Table 2.6: A Generalized Portion of a Relation in Layer 2.
The advantage of the MLDB model is that it tries to structure all Internet resources into the relational database model. Then KDD techniques can be applied to
a constructed relational database. The major weakness of the MLDB model is that
all data at each layer except layer 0 have to be stored. With the huge amounts of data
available on the Internet, it would take too much disk space to store data at the lower
levels, for example, level 1. The MLDB model is a bottom-up method, that is, no
results are produced until everything has been summarized. As well, there are many
possible summaries for the same documents, and it is hard to obtain and maintain
concept hierarchies.
Chapter 3
An Overview of the KDID System
In this chapter, the architecture of the KDID system for knowledge discovery
in Internet databases is described. KDID addresses the need to conduct knowledge
discovery tasks in relational databases available on-line via the Internet.
This chapter is organized as follows. In Section 3.1, the components of the KDID system are described. Section 3.2 covers the database types. In Section 3.3, the
multidatabase model, which is the basis of the KDID system, is introduced, and the
data integration problems are discussed.
3.1 The Architecture of KDID
As shown in Figure 3.1, KDID consists of three main components: the Interface
Module, the Multidatabase Module and the KDD Application Module. Figure 3.2
gives a software hierarchy chart for the main components. The KDID Control Mod-
ule, shown at the top of Figure 3.2, is a relatively small module that implements the
communication paths shown in Figure 3.1. It invocates the other modules and coor-
dinates communication among them. The Internet Front-end is the communication
medium between the user and the KDID system. It receives user requests from Web
pages, sends user requests as HTML form data to the Interface Module, and receives
back information to be displayed. The Interface Module parses the HTML form data
and composes a global query. The global query is sent to the Query Manager of the
Multidatabase Module. The Query Manager then translates the global query into the
KDID internal format according to mappings between local and global schema. The
Query Manager decomposes the internal query into a set of local database queries. It
then checks each query for legitimacy and reports any errors. If all queries are decomposed successfully, a database agent is run for each local database. Each agent sends a
query to its local database management system and receives the results. These results
are uploaded into the multidatabase system in standardized format. When all relevant tasks have been processed, a KDD application is applied to the multidatabase.
The knowledge discovered is then returned to the Internet Front-end.
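This control flow can be pictured as the skeleton below. The function names are hypothetical stand-ins for the modules in Figure 3.1, passed in as callables so the sketch stays self-contained:

```python
# A high-level sketch of the KDID control flow described above, not the
# actual KDID Control Module.
def run_discovery_task(form_data, parse, compose, decompose, agents, kdd):
    pairs = parse(form_data)                 # Interface Module: (Name, Value) pairs
    task, global_query = compose(pairs)      # discovery task + global query
    local_queries = decompose(global_query)  # Query Manager: one query per database
    results = [agents[db](q) for db, q in local_queries.items()]  # database agents
    return kdd(task, results)                # KDD application on the uploaded results

# Stub components standing in for the real modules:
result = run_discovery_task(
    "SORT=disc_code",
    parse=lambda f: [("SORT", "disc_code")],
    compose=lambda pairs: ("task", "global query"),
    decompose=lambda q: {"pas": "local query"},
    agents={"pas": lambda q: ["row 1", "row 2"]},
    kdd=lambda task, results: (task, results),
)
print(result)  # ('task', [['row 1', 'row 2']])
```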
Each of the modules is described in greater detail in the remainder of this subsection.
Figure 3.1: The Architecture of KDID.

Figure 3.2: Software Hierarchy Chart.
3.1.1 The Internet Front-end
The Internet Front-end provides two services for KDID: input of discovery task
parameters including a multidatabase query and output of discovered results.
An SQL query sent to a multidatabase can return a large amount of data. The
KDID control module in Figure 3.2 directs the KDD application module to conduct
the discovery task on the results of the query. The discovered knowledge is given
at a higher conceptual level than the original data because high level concepts from
the concept hierarchies are used instead of base-level data. This helps to reduce the
amount of data returned dramatically. If the discovered knowledge does not meet the
user's need, the task parameters can be refined and the user can send a new request.
Knowledge discovered by the KDD application is sent to an Internet browser. The
discovered knowledge is translated to HTML format and displayed on a web page.
Advantages in implementing the Internet front-end based on the World Wide Web
include no interface development and versatile hypertext links. No actual interface
programming is needed because the browser provides a generic graphical user interface
with features such as navigation, scrolling, interactive fields, text layouts and image
display. KDID supports hypertext links to database query forms and links that
represent database queries. Hypertext links not only provide a simple means of inter-
connecting databases, but they also allow databases to be connected to any resource
accessible via the Internet.
The World Wide Web does not solve all database or program problems. A number
of issues remain to be solved. They include performance and interface programma-
bility. The speed at which current browsers download documents leaves much to
be desired, even when the documents are locally resident. There are limitations on
the interfaces that can currently be created using HTML. The lack of a general-purpose front-end scripting language with commands for storing and retrieving local
information and for moving between forms and documents under program control is
a major restriction.
3.1.2 The Interface Module
The Interface Module facilitates the process of obtaining results for a discovery
task involving data from several databases for end users. Many existing knowledge
discovery from database systems can only deal with a single database or a single
type of database. The Interface Module parses HTML form data and composes the
processed form data into a global query. The global query is translated into a unique
internal format which is decomposed into a set of queries suitable for each database
management system.
Form Data Parser
When the user at the Internet Front-end submits an HTML form, a form data
set is passed to the Interface Module. For example, when a user submits the HTML form shown in Figure 3.3, the form data set shown in Figure 3.4 is parsed.
The Form Data Parser parses the form data and obtains a set of (Name, Value)
pairs. For example, the (Name, Value) pairs shown in Figure 3.5 are obtained from
the form data set in Figure 3.4.
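The parsing step corresponds to standard form-data decoding, which can be sketched with Python's standard library (this is not the KDID parser; the data below is invented, echoing the field names of Figure 2.2):

```python
from urllib.parse import parse_qsl

# Decode an application/x-www-form-urlencoded form data set into
# (Name, Value) pairs, as the Form Data Parser does.
form_data = "SORT=disc_code&ORDER=ASC&LOAD=nserc.chf&TARGET=amount&TARGET=province"

pairs = parse_qsl(form_data)  # unescapes "+" and "%HH", splits on "&" and "="
print(pairs)
# [('SORT', 'disc_code'), ('ORDER', 'ASC'), ('LOAD', 'nserc.chf'),
#  ('TARGET', 'amount'), ('TARGET', 'province')]
```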
Global Query Composer
The Global Query Composer takes the processed results from the Form Data Parser
to compose a knowledge discovery task and a global query. The global query is
decomposed into a set of local database queries to retrieve the relevant data from
the local databases. Some (Name, Value) pairs are relevant to the discovery task,
and some to both the discovery task and the global query. For example, the (Name,
Value) pair (tuple, 10000) is a parameter of a typical DB-Discover discovery task that
sets the maximum number of tuples that should be retrieved from the database to
10000. The (TARGET, amount) and (TARGET, area_code) pairs are the target
attributes of both the global query and discovery task.
From the example result in Figure 3.5, a global query

select amount, area_code, disc_code, dept, org_name, province, cname
from award a, organization b, committee c
where a.org_code = b.org_code and a.cname = c.com_name
and ( disc_code >= 23500 or disc_code <= 24500 )

and a discovery task

login dbd dbd
connect pas
load nserc.chf
select amount, area_code, disc_code, dept, org_name, province, cname
from award a, organization b, committee c
where a.org_code = b.org_code and a.cname = c.com_name
and ( disc_code >= 23500 or disc_code <= 24500 )
set threshold for default 10
set threshold for tuple 10000
set sum amount
set percent sum

are composed.

Figure 3.3: An HTML Form with User Data.

Figure 3.4: An Example of a Form Data Set.

Figure 3.5: Processed (Name, Value) Pairs.

The global query is also used to specify the relevant data to be retrieved from
the multidatabase for the discovery task. A discovery task has parameters to make
database connections and to load concept hierarchy files. It also has parameters to
set attribute options and thresholds. A predicate of a task might have higher level
concepts.

3.1.3 The Multidatabase Module

The Multidatabase Module provides an interface to multiple databases in a way
that makes them appear as a single database. This module communicates with the
Interface Module and the KDD Application Module. It could be used unchanged
with any KDD application module that makes global queries requiring translation into
retrievals from multiple databases. The global query passed from the Interface Module
is sent to the Query Manager in the Multidatabase Module. The Query Manager
translates the query into an internal format according to the schema mappings. The
internal query is then decomposed into a set of local database queries. A database
agent is run for each of the local databases. A database agent forms a gateway to a
local database. It checks the validity and legality of the query passed by the Query
Manager. If the processing operation is permitted, and the local database is registered
to the schema information manager, the database agent sends the query to the local
database and receives the results. The retrieved data are standardized and uploaded
into the multidatabase.
An advantage of the configuration of the Multidatabase Module in Figure 3.1 is
that each database agent is implemented and installed at the site where the multi-
database resides. There is no imposition of any code on the local database machines,
which preserves site autonomy, as discussed in Section 2.2. Acceptance of KDID among database administrators should be increased because KDID only requires the
processing of database queries at their sites. A disadvantage of this approach is
that all raw data must be transferred across the network to the machine where the
multidatabase is running.
3.1.4 The KDD Application Module
The knowledge discovery applications most closely related to database technolo-
gies are data summarization and generalization tools. These applications present the
general characteristics or a summarized high level view over a set of user specified
data in a database. Data in databases contain detailed information at the lowest
level. For example, the "award" relation may contain attributes about information
concerning award amounts, discipline codes, and provinces or area codes. It is de-
sirable to summarize a large set of data and present it to the user at a higher level.
For example, a discipline code that is over 23000 and lower than 23500 is "computer
science". One such tool available is DB-Discover, which is a data generalization tool.
Features of DB-Discover to which KDID was adapted are described below.
Features of DB-Discover
A KDD technique implemented in DB-Discover is attribute-oriented induction (AOI). An AOI algorithm takes as input a relation retrieved from a database and generalizes the data, guided by a set of concept hierarchies. In KDID, the input is a relation retrieved from the multidatabase, which is created by KDID from data sets retrieved from several different types of databases.
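As a toy illustration of one AOI generalization step (not DB-Discover's actual code), each low-level value can be replaced by the concept that covers it in a hierarchy. The DISC-CODE ranges below are assumptions based on Figure 3.6, and the real algorithm repeats this climb until its thresholds are satisfied:

```c
/* Toy sketch of one attribute-oriented induction step: replace a
 * low-level DISC-CODE value with the hierarchy concept covering it.
 * Ranges are assumed from Figure 3.6; DB-Discover's real algorithm
 * climbs the hierarchy repeatedly until thresholds are met. */
const char *generalize_disc_code(long code)
{
    if (code >= 23000 && code < 23500) return "HARDWARE";
    if (code >= 25500 && code < 26000) return "DATABASES";
    if (code >= 26000 && code < 26500) return "AI";
    if (code >= 23000 && code < 27000) return "Computer"; /* parent concept */
    return "ANY";                                         /* hierarchy root */
}
```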
To perform a knowledge discovery task requires several steps. First, a database connection must be established. In the KDID system, since the multidatabase is created based on the Oracle model, the connection to the multidatabase is made under the Oracle environment, which requires a logon name and password. The logon name and password are not visible to any user because a system script supplies them. For example,

<INPUT TYPE="hidden" NAME="LOGIN" VALUE="login dbd dbd">

where the input type for KDID is "hidden". This feature protects the security of the multidatabase.

After a connection is established, one or more concept hierarchy files must be loaded. Concept hierarchy files are defined using tabbed ASCII files [74]. Figure 3.6 is an example of a tabbed hierarchy file for the DISC-CODE attribute shown in Figure 2.5.
Computer
    HARDWARE            23000-23500
    SYS ORGANIZATION    23500-24000
    SOFTWARE            24000-24500
    THEORY              24500-25000
    MATHEMATICS         25000-25500
    DATABASES           25500-26000
    AI                  26000-26500
    COMPUTING METHODS   26500-27000

Figure 3.6: A Tabbed Concept Hierarchy File.
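A line of such a hierarchy file can be parsed with a few lines of C. This is a hedged sketch, not KDID's actual loader: the helper name and the assumption that the concept name and range are separated by the last tab are mine.

```c
#include <stdio.h>
#include <string.h>

/* Parse one "CONCEPT<tab>low-high" line from a tabbed hierarchy file.
 * The last tab separates the name from the range, so multi-word
 * concepts such as "SYS ORGANIZATION" are handled.  Returns 1 on
 * success, 0 on a malformed line.  Illustrative only. */
int parse_hierarchy_line(const char *line, char *concept, size_t cap,
                         long *low, long *high)
{
    const char *tab = strrchr(line, '\t');
    size_t n;
    if (tab == NULL)
        return 0;
    n = (size_t)(tab - line);
    if (n >= cap)
        n = cap - 1;
    memcpy(concept, line, n);
    concept[n] = '\0';
    return sscanf(tab + 1, "%ld-%ld", low, high) == 2;
}
```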
When the hierarchies have been loaded, the discovery task can be defined. The first step is to select target attributes, in the same format as a standard SQL query:

select (attribute list),

in which the attribute list may contain several attributes. When the target attributes and tables have been specified, the predicate needs to be defined, e.g.,

where disc-code = "Computer" and province = "Western".

The values "Computer" and "Western" are not actual data values in the database, but concepts defined in the DISC-CODE and PROVINCE hierarchies. The system translates the predicate disc-code = "Computer" into disc-code >= 23000 and disc-code <= 27000.
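This concept-to-range translation can be sketched as a table lookup. The ranges and helper names below are assumptions based on Figure 3.6, not KDID's implementation:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the predicate rewriting step: a hierarchy
 * concept is replaced by a numeric range condition on the attribute.
 * The table is assumed from Figure 3.6. */
struct concept_range { const char *name; long low, high; };

static const struct concept_range disc_ranges[] = {
    { "Computer",  23000, 27000 },
    { "HARDWARE",  23000, 23500 },
    { "DATABASES", 25500, 26000 },
};

/* Write "attr >= low and attr <= high" into buf; return 0 if the
 * concept is not in the hierarchy table. */
int translate_concept(const char *attr, const char *concept,
                      char *buf, size_t cap)
{
    size_t i;
    for (i = 0; i < sizeof disc_ranges / sizeof disc_ranges[0]; i++) {
        if (strcmp(disc_ranges[i].name, concept) == 0) {
            snprintf(buf, cap, "%s >= %ld and %s <= %ld",
                     attr, disc_ranges[i].low, attr, disc_ranges[i].high);
            return 1;
        }
    }
    return 0;
}
```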
Other task parameters affect different aspects of the discovery task. These include setting the retrieval threshold, which determines the maximum number of tuples to be retrieved, and setting the attribute threshold, which specifies the maximum number of distinct values that an attribute may have. After data has been retrieved, certain numerical summary information can be displayed. These include setting percent count, which displays the proportion of the current tuple's count in relation to the sum of all tuples' counts. The sum of an attribute is the total number of items for non-numeric data and the total of the values for numeric data. Sums can be automatically calculated by the DB-Discover system [74].
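The percent count statistic is a simple ratio. As a sketch (the function name is mine, not DB-Discover's):

```c
/* Percent count: one generalized tuple's count as a percentage of the
 * sum of all tuples' counts.  A sketch of the statistic described in
 * the text, not DB-Discover's code. */
double percent_count(long tuple_count, const long counts[], int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += counts[i];
    return total == 0 ? 0.0 : 100.0 * (double)tuple_count / (double)total;
}
```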
3.2 Database Types
3.2.1 Overview
As described in Section 2.3, types of databases include relational databases, object-oriented databases, deductive databases, and multimedia databases. To ensure generality, three different database models are included in the KDID system: Oracle databases, Mini SQL databases, and Microsoft Access databases. All three are relational databases, but their data management and manipulation languages differ. These databases were chosen because they are available on the local network and it is easier to implement network agents for relational databases. The other database models are more complicated, and current knowledge discovery tools are designed for relational databases.
3.2.2 Oracle Databases
The Oracle Corporation introduced the Oracle relational database management system in 1979. The Oracle database management system uses ANSI standard SQL as the database access language [58]. It is one of the largest-selling relational database management systems in the commercial world. Oracle developed a superset of regular SQL, called SQL*Plus. Any interaction with the Oracle server is done through these SQL statements.
The SQL standard was defined by the ANSI/X3H2 committee as a module language. A module language is a small language for expressing SQL operations in pure SQL syntactic form. It is not possible to use the module language to code direct calls to the Oracle database management system. Instead, SQL statements are embedded directly into the host language. In C, embedded SQL statements are prefixed by EXEC SQL and terminated by a semicolon. An executable SQL statement can appear wherever an executable host statement can appear.
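For illustration only, an embedded SQL fragment might look as follows. The table and host variable names are invented for this example, and the fragment must be run through Oracle's embedded-SQL precompiler rather than compiled directly as C:

```c
/* Sketch only: requires an embedded-SQL precompiler; names invented. */
EXEC SQL BEGIN DECLARE SECTION;
    long amount;
EXEC SQL END DECLARE SECTION;

EXEC SQL SELECT AMOUNT INTO :amount
         FROM AWARD
         WHERE DISC_CODE = 23100;
```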
3.2.3 Mini SQL Databases
Mini SQL is a database engine designed to provide fast access to stored data with low memory requirements [45]. Mini SQL offers only a subset of the SQL standard as its query interface. It was chosen as a test database management system because it is designed to work in the client-server environment over a TCP/IP network.
The Mini SQL server listens for connections on a TCP socket. The availability of the TCP socket allows client applications to access data stored on machines over the network. The Mini SQL system supplies an Application Programming Interface (API) library, which allows any C program to communicate with the database server. Using the API library avoids the use of embedded SQL. Table 3.1 shows some of the functions used to implement the network agent that contacts the Mini SQL server over the network.
Name              Purpose
msqlConnect       form an interconnection with a Mini SQL server
msqlSelectDB      select a database
msqlQuery         send a query to a selected database
msqlStoreResult   store the result returned by a SELECT query
msqlFreeResult    free data space
msqlListDBs       obtain a list of databases
msqlListTables    retrieve a list of tables
msqlListFields    obtain information about fields in a table
msqlClose         close the connection to the Mini SQL server

Table 3.1: Functions Used to Contact the Mini SQL Server.

3.2.4 Microsoft Access Databases
Microsoft Access is a relational database management system. MS Access provides a variety of objects to display and manage information. These objects include macros and modules. A macro is a list of actions to be performed by the database system. For example, the MS Access database system can automatically open a set of forms when a database is opened. MS Access provides a built-in database programming language, Access Basic. Procedures can be written in Access Basic for operations requiring complex, automated processing. A module is a Microsoft Access object that contains Access Basic procedures. For example, an MS Access database may need to get information from other database systems; procedures that deal with this action are modules.
The MS Access database management system uses the Open Database Connectivity (ODBC) standard defined by Microsoft. ODBC is an industry standard that enables access to data in most of the popular data sources, such as Access, dBase, Paradox, Informix, Sybase, and Microsoft SQL Server. Using ODBC, a client can access local data without knowing the database model of the participating database. An ODBC driver is a database programming interface which allows an application developed for one database, such as Microsoft Access, to be ported to another database, such as Mini SQL, without any involvement of the original developer. An application communicates with the ODBC driver, and the ODBC driver passes each call to the database. Table 3.2 lists some ODBC functions for interacting with a database server.
Name              Purpose
SQLPrepare        prepare an SQL string for execution
SQLExecute        execute a prepared statement, using the current values of the parameter markers in the statement
SQLFetch          fetch a row of data from a result set
SQLGetData        return data for a single unbound column in the current row
SQLTables         return a list of table names stored in a specific data source
SQLColumns        return the list of column names in specified tables
SQLConnect        load a driver and establish a connection to a data source
SQLError          return error or status information
SQLDisconnect     close the connection associated with a specific connection handle

Table 3.2: ODBC Functions Used to Contact a Database Server.

3.3 The Multidatabase Architecture for the KDID System
3.3.1 The Architecture
Users of KDID view all the local databases as a single database and leave the internal view to the system. Consider a user who requests information on heart disease medication in three different hospitals. Information about medication, drug usage, and pharmaceutical manufacturers is distributed among different databases. For users of the KDID system, the goal is to obtain knowledge from the different databases; that is, only the retrieval task is involved at the local database level. There is no need for the KDID system to update, insert, or delete any data in any local database. These tasks are taken care of by the local database management systems.
Figure 3.7 shows the architecture of the KDID multidatabase model. The global database agent module parses a global SQL query, which in the KDID system is part of a discovery task entered through the Internet front-end. The global database agent

Figure 3.7: Architecture of the KDID Multidatabase.

contacts the schema information manager to check the legitimacy of the query. If the query is legal, it is sent to the query processor, which will translate it into the KDID system's internal query format. Each attribute has its full path to each local database. The global query processor will check the attribute names used in each local database. When the global query has been translated into the internal query, the query decomposer takes it as input and decomposes it into a set of local database queries. Each of the local queries is sent to a local database that has its own local database access module to access the local database management system. These databases exist at different geographic locations and use different database schema architectures. Each employs its own database management system with its own database query and manipulation languages.
The Schema Information Manager coordinates the global database agent and the local database agents. When a local database administrator decides to register his/her database with the KDID system, the Schema Information Manager contacts the local database, using the information supplied by the local user. Once the local schema information has been retrieved, the schema information manager supplies information about the global schema. The local database administrator then describes the relationship between the global and the local attributes. If there is no relationship between their database and the databases registered, the local database administrator can abort the registration process. The registration is strictly voluntary, and before the registration, information on the KDID system and a brief description of the databases registered are presented to the local database administrator. Once the local attributes have been selected for sharing with other databases, a concept hierarchy is searched and displayed for the user.
The multidatabase system suits the KDID knowledge discovery task. It provides a means for resolving the differences in data representation and function among local database management systems, and it discovers more meaningful knowledge for users. Data gathered from several databases include a wide range of information, and patterns discovered based on these data might be more interesting.
KDID is capable of integrating relational databases in this research, but as observed, the system can be extended to include network, hierarchical, and object-oriented databases. The users will be provided with a global integrated view of data stored in different databases. User requests will be formulated in terms of the integrated data model and then translated into local database queries. The information to resolve data conflicts is stored in the KDID system directory.
3.3.2 Data Integration
Creating a multidatabase is complicated by the heterogeneity and autonomy of its local systems. Heterogeneity exists through differences at the operating system, database, hardware, or communication level of the local systems. In this section, the focus is on the database integration issues. The types of data integration, the reasons the differences occur, and examples of how to solve the integration problems are discussed. The issues include attribute name differences, attribute value differences, which are purely syntactic, scale and type differences, missing data, and conflicting values.
Each database participates in a multidatabase model by exporting a part of its schema, called the export schema. A global schema is created by the integration of multiple export schemas. If there are no relations among the concepts represented in each local schema, the global schema is simply the union of the local schemas. The same concepts may be represented in different databases, and concepts may be represented differently.
Types of data integration conflicts are described below. According to Batini et al. [7], different perspectives and equivalent constructs cause data integration conflicts. Perspective difference is a modeling problem caused during the design phase of a database schema. Different designers adopt different viewpoints when modeling the same information. The rich set of constructs in data models allows for many possibilities of modeling. This results in variations in the conceptual database structure. Typically, in conceptual models, several combinations of constructs can model the same real-world domain equivalently.
Attribute Name Difference
Attributes having the same meaning may be given different names in different databases. These are called synonyms. Another case is homonyms, that is, two different attributes having an identical name. Homonyms can be resolved easily by changing one of the attribute names into a different name in the global database.
Attribute Value Difference
Owing to design perspective differences, different designers may use different models for the same entity in different databases. For example, some designers may assign one character to the attribute SEX, others may assign six; thus the data values retrieved from different databases differ. This can be resolved by the standardization procedure.
Scale and Type Difference
The same attribute may be stored in different databases using different units of measure. This difference can be resolved by defining a conversion function when the data is retrieved. For example, if the attribute PRICE is measured in one database in Canadian dollars, but in another in US dollars, a function price = price x 1.36 can convert the US dollars into Canadian dollars.
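Such a conversion function is a one-liner. The 1.36 rate comes from the text's example and is illustrative, not a live exchange rate:

```c
/* Conversion function for the scale-difference example: convert a
 * US-dollar price to Canadian dollars at the illustrative rate 1.36. */
double usd_to_cad(double price_usd)
{
    return price_usd * 1.36;
}
```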
Different database systems define different data types, and most relational databases define them according to the SQL standard. Type differences among database management systems create problems when implementing the multidatabase model. Types have to be converted for data definition and data manipulation before the multidatabase creation. For example,

CREATE TABLE tempo (
    col1 NUMBER(10) NOT NULL,
    col2 CHAR(50)
);
Table 3.3 shows the data types defined in Oracle, Mini SQL, and Microsoft Access and their inter-relationships. The names of data types and their parameters differ when used for data definition. This makes an application program dependent on the underlying database management system.
Oracle: CHAR, VARCHAR2, NUMBER, LONG
Mini SQL: CHAR, TEXT, REAL, UNSIGNED
Microsoft Access: BIT, SHORT, INTEGER, LONG, FLOAT, CURRENCY, DATETIME, MEMO, LONGBINARY

Table 3.3: Data Types in Oracle, Mini SQL and MS Access.
The data type mapping developed by Steindl [80] is presented in Figure 3.8. Based on this mapping, a function can be implemented to convert Mini SQL data types and Microsoft Access data types to Oracle data types.

void mapping(char *ldbtype, char *oracletype, int paralen)

paralen is determined by the local database type ldbtype. If ldbtype is a string of characters, the length has to be determined. If it is a number, the precision is required.

Figure 3.8: Type Mapping.
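A sketch of how such a mapping function might be implemented follows; the specific type pairs are assumptions for illustration, not Steindl's full table [80]:

```c
#include <stdio.h>
#include <string.h>

/* Hedged sketch of the mapping function of Figure 3.8: convert a
 * local (Mini SQL or MS Access) type name to an Oracle type string,
 * attaching the length or precision parameter where one is needed.
 * The pairs below are illustrative assumptions. */
void mapping(const char *ldbtype, char *oracletype, int paralen)
{
    if (strcmp(ldbtype, "CHAR") == 0 || strcmp(ldbtype, "TEXT") == 0)
        sprintf(oracletype, "CHAR(%d)", paralen);     /* length required */
    else if (strcmp(ldbtype, "INT") == 0 || strcmp(ldbtype, "SHORT") == 0)
        sprintf(oracletype, "NUMBER(%d)", paralen);   /* precision required */
    else if (strcmp(ldbtype, "REAL") == 0 || strcmp(ldbtype, "FLOAT") == 0)
        strcpy(oracletype, "NUMBER");
    else
        strcpy(oracletype, "LONG");                   /* illustrative fallback */
}
```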
Missing Data
A database at one location may not store all the information of interest concerning an entity. Data can be present in one database but missing from the other; data in one relation may be the summarization of the other relation.
Conflicting values
If databases at different geographic locations store information concerning similar data items, there is a danger of conflicting values. It is difficult to establish that a conflict exists and to correct the discrepancy. If there are "Medication" relations at two databases, how do we determine when the same drug is being prescribed in each relation? If the drug has prices in each relation, should these prices necessarily be equal? One option is to project both relations; then, later in the process of analysis and discovery, it might be safe to take an average, depending on the discovery factors. Furthermore, in the process of knowledge discovery, the exact amount or exact price may not be required; instead, a range may be required.
3.3.3 Multidatabase Query Processing
The user input from the Internet front-end first passes through the HTML form data parser, which identifies the necessary entries for query composition. The global query is composed by the query composition module and is translated into a full-path internal query. The internal query is decomposed into a set of local database queries, which in turn are sent to different local database agents. If a local query is legitimate, the query is executed against the local database management system.
A global query is a full-path query that may include constant predicates and join conditions. A local database query has only the constant conditions, with the join condition decomposed into attributes to be retrieved.
The format of a global query is:
select (attribute list) from (relation list) where (constant condition) and (join condition)
The attribute list may contain several full-path attributes in the form database:relation.attribute. The relation list is of the form database:relation.
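Decomposing such a full-path attribute is simple string splitting. A hedged sketch (the function name and the assumption of caller-supplied buffers are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Split "database:relation.attribute" into its three parts.
 * Returns 1 on success, 0 if the path is malformed.  Callers must
 * supply buffers large enough for each component. */
int split_full_path(const char *path, char *db, char *rel, char *attr)
{
    const char *colon = strchr(path, ':');
    const char *dot;
    if (colon == NULL)
        return 0;
    dot = strchr(colon + 1, '.');
    if (dot == NULL)
        return 0;
    sprintf(db, "%.*s", (int)(colon - path), path);
    sprintf(rel, "%.*s", (int)(dot - colon - 1), colon + 1);
    strcpy(attr, dot + 1);
    return 1;
}
```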
The basis of query processing in KDID is the SQL select statement. Conditions in the select statement have one of the following formats:

1. A θ d or A θ B, where θ ∈ {=, >, ≥, <, ≤}, d is a primitive value, and A and B are attribute names. For example, in such a condition, nserc:award.amount, nserc:award.org-code, and pas:organization.org-code are attribute names, and 10000 is a primitive value.
2. A θ c, where θ ∈ {like, not like}, c is a character string, and A is an attribute name. For example,

pas:organization.province like S%

means that pas:organization.province is an attribute name and that S% matches any character string starting with the letter S.
3. A θ B, where θ ∈ {in, not in}, A is an attribute name, and B is a list of values. For example,

pas:organization.province in (Alberta, Manitoba, Ontario)

means that pas:organization.province is an attribute name and that θ can be chosen from in and not in.
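For format 2, a minimal matcher for patterns with a single trailing % can be sketched as follows; full SQL "like" also supports _ and interior %, which this sketch deliberately ignores:

```c
#include <string.h>

/* Minimal sketch: evaluate "value like pattern" where pattern either
 * has no wildcard or ends in a single '%', as in 'S%'.  Full SQL
 * "like" semantics are out of scope for this illustration. */
int like_prefix(const char *value, const char *pattern)
{
    size_t n = strlen(pattern);
    if (n == 0 || pattern[n - 1] != '%')
        return strcmp(value, pattern) == 0;     /* no wildcard: exact match */
    return strncmp(value, pattern, n - 1) == 0; /* compare prefix before '%' */
}
```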
Chapter 4
Design Issues
In this chapter, design issues related to the KDID system are described. Various design issues that arose during the implementation of a prototype system are
described and algorithms for the major modules are presented.
Figure 4.1 is the flow diagram for the query processing. A user query is parsed and assembled from an HTML script file, and is then translated into an internal query. The internal query is then decomposed into a set of local queries. Results from the local queries are retrieved and transformed into internal results. The internal results are imported into the multidatabase system. The user results are the final knowledge discovered, using one of the KDD applications.
This chapter is organized as follows. In Section 4.1, design issues related to the
Interface Module are discussed. Section 4.2 presents the on-line database registration
process, and the mappings of local database schema information to the global schema.
Section 4.3 describes the global query to internal query translation algorithm. In
Section 4.4, the global query decomposition algorithm is proposed and examples of
decomposing a global query are given. In Section 4.5, mechanisms for resolving data
value differences are discussed.
4.1 The Interface Module
Two design issues related to the Interface Module are the design of the HTML form data parser and the global query composition algorithm.
Figure 4.1: Flow Diagram for Query Processing.
4.1.1 The Form Data Parser
An HTML form data parser is provided in Figure 4.2. Given a form data set, this parser removes redundant special characters, converts the hexadecimal numbers to ASCII codes, and collects all (NAME, VALUE) pairs. It is described below.
Let S = (N, V, P) be an HTML form data set, where N is a set of HTML commands, V is a set of values following the HTML commands, and P is a set of special characters and hexadecimal numbers. Let E = {Ei = (Ni, Vi) | Ni ∈ N, Vi ∈ V} be a set of processed (Name, Value) pairs.
The parser procedure processes a form data set S, removes redundant elements in P, and returns all tuples in E.
When a URL containing a link to an HTML form script is activated, a form appears on a Web page. After the user fills in the fields and submits the form, the parser begins by checking the CONTENT_TYPE of the form. If the CONTENT_TYPE is not application/x-www-form-urlencoded, the procedure prints an error message "encoding method not implemented" and exits. Otherwise, the parser procedure continues by checking the request method. The request method implemented for the
Procedure parser
Input: S, an HTML form data set
Output: E, a set of (Name, Value) pairs
begin
    t := get(S, "CONTENT_TYPE")
    if t ≠ "application/x-www-form-urlencoded" then
        return error("not_implemented")
    end if
    m := get(S, "METHOD")
    if m = "POST" then
        len := get(S, "CONTENT_LENGTH")
        while len > 0 do
            Ni' := getword(S, len)
            Vi' := getword(S, len)
            Ni := del_special_char_convert_hex(Ni')
            Vi := del_special_char_convert_hex(Vi')
            if Vi ≠ null then
                add (Ni, Vi) to E
            end if
        end while
    else
        return error("method_must_be_POST")
    end if
    return E
end

Figure 4.2: The Form Data Parser.
parser is POST. If a request method other than POST appears in the script file, an error message "method must be POST" is printed and the procedure exits. If the method is POST, the CONTENT_LENGTH is obtained from S. The function getword extracts Ni' and Vi' from S, and the length len of S is decremented based on the lengths of Ni' and Vi'. The length of each Ni' or Vi' is different. The pair Ni and Vi are the results after the special characters are removed and the hexadecimal numbers are converted to ASCII codes. All pairs are returned, using a data structure.
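The hexadecimal conversion mentioned above is standard URL decoding; a sketch (the function name is mine):

```c
#include <stdio.h>

/* Sketch of the special-character handling applied to form data:
 * '+' becomes a space and %XX hexadecimal escapes become the
 * corresponding ASCII characters.  out must be at least as large
 * as in. */
void url_decode(const char *in, char *out)
{
    unsigned int c;
    while (*in) {
        if (*in == '+') {
            *out++ = ' ';
            in++;
        } else if (*in == '%' && in[1] && in[2]
                   && sscanf(in + 1, "%2x", &c) == 1) {
            *out++ = (char)c;
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```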
4.1.2 Global Query Composition
Let R be the assembled SQL query. E = {Ei = (Ni, Vi) | Ni ∈ N, Vi ∈ V} is defined in Section 4.1.1. The GQC procedure for global query composition, shown in Figure 4.3, takes the (NAME, VALUE) pairs processed by the HTML form data parser as input and assembles them into an SQL-like global query. The output string R is sent to the query translation function.
The GQC procedure assumes that the series of pairs begins with one or more pairs specifying the targets of the SELECT statement, followed by one pair specifying the table name, then followed by zero or more pairs specifying predicates and connectors to be combined to form a WHERE clause. The literal strings "TARGET", "PREDICATE", and "CONNECTOR" are reserved words in HTML script files.
The GQC procedure checks each attribute name Ni to determine if it is "TARGET". If so, the corresponding target attribute value is concatenated to R. The variable next is updated to the position of the pair immediately following the last observed pair with name Ni equal to "TARGET". If N_next is "TABLE", then V_next is concatenated to R in the format database:table. Otherwise, an error message "table not found" is printed and the procedure exits. When the procedure begins to process the predicates, if E_next+1 is the first predicate, the keyword WHERE is concatenated onto R. The procedure processes the keywords like and in separately to ensure that special characters, such as single quotation marks, double quotation marks, backslashes, percent signs, and ampersands, are processed appropriately.
Procedure GQC
Input: E, a (Name, Value) set
Output: R, an output SQL string
begin
    R := "SELECT"
    for each tuple Ei ∈ E do
        if Ni = "TARGET" then
            R := concat(R, Vi)
            next := i + 1
        end if
    end for
    R := concat(R, "FROM")
    if N_next ≠ "TABLE" then
        return error("table_not_found")
    else
        R := concat(R, V_next)
    end if
    if next < |E| then
        R := concat(R, "WHERE")
        for i = next + 1 ... |E| do
            if Ni = "PREDICATE" or Ni = "CONNECTOR" then
                R := concat(R, Vi)
            else
                return error("predicate_or_connector_not_found")
            end if
        end for
    end if
    return R
end

Figure 4.3: Global Query Composition.
4.2 On-line Database Registration
Before a query can be translated into local database queries, the mappings from the global schema to each of the local schemas must be resolved. In [76][81][59], the authors assume that mapping functions have been implemented, and no details of the mappings are described. The mappings are vital to KDID since they involve Internet databases. A fundamental assumption in KDID is that each local database administrator registers his/her local database with the KDID system. By registering their databases with KDID, they provide access to their information under controlled conditions. This sharing allows the discovery of more interesting knowledge.
An on-line strategy for mapping database schemas is proposed in this thesis. Using HTML forms provided by the KDID database registration subsystem, the local database administrator can manipulate the meta-data, that is, information about database schemas and their intended meaning, to indicate how their local database maps to a global schema provided by KDID. In the remainder of this section, the registration subsystem is described with emphasis on security aspects and the registration process. Secure connections are used because the information about the local databases is sensitive; for example, a local database account name and password might be provided for a dedicated account.
4.2.1 Security Issues
In this section, the relevant security issues are discussed, and protocols and standards are introduced.
As the development of the Internet continues, the security problem is being considered. Security is a baseline requirement for network computing. Privacy, authentication, authorization, and integrity are all required in any security strategy to prevent eavesdropping, manipulation, and impersonation [63].
Various solutions to the security problem have been proposed and implemented. The Netscape Company proposed the Secure Sockets Layer (SSL) transport protocol to improve Internet security. The SSL protocol provides data encryption, server authentication, message integrity, and optional client authentication for a TCP/IP connection [63]. The connection security provided by the SSL protocol has three properties: private connections, peer identity authentication, and reliable connections. Encryption is used after an initial handshake to define a secret key [63]. Symmetric cryptography is used for data encryption. To authenticate a peer's identity, asymmetric (or public key) cryptography is used. Message transport, which includes a message integrity check using a keyed Message Authentication Code (MAC), ensures reliable connections.
To manipulate schema information through the Internet, a secure Internet server must be set up. Apache-SSL is a secure web server based on Apache [3] and the SSL protocol. Apache is an HTTP server based on the NCSA httpd server version 1.3 with increased functionality, speed, and reliability. The Apache-SSL server has every feature described by the SSL protocol. The server uses a single X.509 certificate that enables the server to authenticate itself to clients requesting SSL connections. When a server presents a certificate during an SSL handshake, the Internet browser checks the certificate against its certificate database. If the server certificate is in the database, or if the server certificate is signed by a Certificate Authority whose certificate is in the database, the SSL handshake can conclude successfully. The format and meaning of X.509 certificates are defined by RSA Laboratories [86].
4.2.2 The Registration Approach
Figure 4.4 describes the steps taken to register a local database with the KDID schema manager, retrieve local schema information, and integrate it with the global schema. The local database to be registered is assumed to contain data relevant to the subject of the global schema. Consider a global schema concerning books published. The local user here is assumed to be the local database administrator, who knows the content of one publisher's database and is aware of the KDID system. When a local database administrator decides to register his/her local database, a hypertext link is activated and a registration form is displayed. This connection is secure, based on the secure server set up using the SSL protocol.
When the user fills in the required information and submits the registration form,
Steps for On-line Database Registration

1. Local database administrator (DBA) initiates registration;
2. Establish a secure connection between server and client;
3. Local DBA completes the application form;
4. Parse user-submitted data;
5. Create a database agent using user data;
6. Retrieve global schema information with the intended meanings;
7. Generate an HTML form for the local DBA;
8. Process user-submitted mappings;
9. Update the global schema information.

Figure 4.4: On-line Database Registration.
the KDID system parses the user data and creates an agent for this database. The database agent uses the supplied information to determine the network location of the database; for example, the address of the NSERC database is chiron.cs.uregina.ca. Schema information is retrieved for the relations specified by the local database.
After the local schema information has been retrieved, the global schema is queried to obtain all global attributes and their intended meanings.
An HTML form is generated automatically based on the global and local schema information. In the first part of the form, each global attribute is shown, preceded by a number. The second part of the form displays the content of the local database schema, including the name, type, and length for each local attribute. The local user is required to fill in the number of the corresponding global attribute. If there is no mapping for a local attribute in the global schema, the default is "N/A".
Global attribute name    dbname:relation.attribute [, dbname:relation.attribute] ...

Table 4.1: Format of Schema Mapping.
The last step is to update the global schema according to the user mappings. The format of a schema mapping is shown in Table 4.1. The part in square brackets is optional; that is, a global attribute can be in several local databases and their local
Procedure giobalqyAo>nternalqy
Input: GQ, a global query Output: IQ, an internd query string
procedure global-qry-tointernal-qry begin
get-t hree,parts(GQ, targets, relations, predicates) processA3argets(targets, IQ) proceasall~elations(relations, IQ) processall-predicates(predicates, IQ)
end {global-qry-tointernal-qry)
Figure 4.5: Global Query to Internal Query Translation.
names might be different. Table 4.2 is an example.
Table 4.2: Examples of Schema Mapping.
DEPARTMENT ORG-CODE
4.3 Global Query to Internal Query Translation
GQ is the assembled global query. Let IQ be the internal query. Figure 4.5 shows the global query to internal query translation procedure. The main objective is to decompose the query string passed from Figure 4.3 into three parts: the targets, the relations, and the predicates. Each part is processed separately, with target attributes, relations, and predicates translated into attributes, relations and predicates with full paths to local databases. Figure 4.6 shows the procedure to process all predicates. The process_all_targets and process_all_relations procedures parse the target attributes and all relations in the global query, respectively.
The process_all_predicates procedure gets the total number of predicates and the number of items in each predicate. Each predicate is processed and every attribute
Procedure process_all_predicates
Input:  predicates, a set of predicates
Output: IQ, an internal query string

procedure process_all_predicates
begin
    num_predicates := get_predicate_num(predicates)
    for i = 1 to num_predicates do
        num_items_in_predicate := get_items(predicate)
        for j = 1 to num_items_in_predicate do
            if j ≠ num_items_in_predicate - 1 then
                skip_special_character(predicate, COMMA)
            else
                skip_special_character(predicate, END_OF_STRING)
            end if
            tmp_pred := process_each_predicate(predicate, attr)
            attr := lookup(attr)
            if attr ≠ null then
                new_pred := assemble_predicate(attr)
            else
                print_error("mapping_not_found")
            end if
            IQ := concat(IQ, new_pred)
        end for
    end for
end {process_all_predicates}

Figure 4.6: Procedure process_all_predicates.
is retrieved, based on the schema mapping information from the system directory, for its valid path, using the function lookup. If an attribute cannot be found in the directory, an error message is passed back.
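The lookup-and-expand step can be illustrated with a small sketch. The directory contents and function names below are assumptions for the example, not the KDID code; the "mapping not found" error mirrors the error path in Figure 4.6:

```python
# A schema-mapping directory in the style of Table 4.2
# (contents assumed for illustration).
DIRECTORY = {
    "DEPARTMENT": ["NSERC:AWARD.DEPT", "COMACC:SCHOLARSHIP.DEPARTMENT"],
    "ORG_CODE": ["NSERC:AWARD.ORG_CODE", "PAS:ORGANIZATION.ORG_CODE"],
}

def translate_attribute(attr):
    """Return the full local paths for a global attribute, or raise."""
    paths = DIRECTORY.get(attr)
    if paths is None:
        # corresponds to the mapping-not-found error in Figure 4.6
        raise KeyError("mapping not found: %s" % attr)
    return paths

def translate_predicate(attr, op, value):
    """Expand one global predicate into one predicate per local path."""
    return ["%s %s %s" % (path, op, value)
            for path in translate_attribute(attr)]
```

For instance, a global predicate on DEPARTMENT expands into one fully qualified predicate for each local database that stores the attribute.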
4.4 Query Decomposition
4.4.1 The Decomposition Algorithm
The decomposition algorithm to process a translated internal query is presented in Figure 4.7. Given an internal query, SELECT A FROM Tables WHERE Conds, for each table specification DB..T appearing in Tables, where DB is a database name and T is a table name, a local database query

    SELECT A_{DB..T} FROM T WHERE Conds_{DB..T}

is generated, where A_{DB..T} consists of each attribute NAME such that DB..T.NAME occurs either in A or in Conds. Conds_{DB..T} consists of those conditions in Conds involving only constant terms and terms of the form DB..T.NAME; that is, Conds_{DB..T} does not involve any references to other tables DB'..T' where DB ≠ DB' or T ≠ T'.

The local database query is formed by consulting the schema mapping information in the system directory to remove parts involving tables other than DB..T. The query is then translated into the specific DBMS manipulation language, replacing terms of the form DB..T.NAME with the attribute name NAME. If Conds_{DB..T} is empty, the WHERE keyword is omitted.
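The decomposition rule can be sketched in a few lines, under simplifying assumptions: the internal query arrives as parsed lists, attributes are written `DB..T.NAME`, and a condition is kept locally only when every attribute it mentions belongs to the same DB..T. This is an illustration of the rule, not the thesis implementation:

```python
def table_of(term):
    """'DB..T.NAME' -> 'DB..T' (terms without '..' are constants)."""
    return term.rsplit(".", 1)[0] if ".." in term else None

def decompose(targets, tables, conds):
    """Return {DB..T: local query string}, following the rule above."""
    queries = {}
    for dbt in tables:
        # A_{DB..T}: attributes of DB..T occurring in the targets or conditions
        terms = list(targets) + [t for c in conds
                                 for t in c.split() if ".." in t]
        attrs = []
        for term in terms:
            if table_of(term) == dbt:
                name = term.rsplit(".", 1)[1]
                if name not in attrs:
                    attrs.append(name)
        # C_{DB..T}: conditions mentioning no table other than DB..T
        local = [c.replace(dbt + ".", "") for c in conds
                 if {table_of(t) for t in c.split() if ".." in t} <= {dbt}]
        query = "SELECT %s FROM %s" % (", ".join(attrs), dbt.split("..")[1])
        if local:                  # WHERE is omitted when C_{DB..T} is empty
            query += " WHERE %s" % " AND ".join(local)
        queries[dbt] = query
    return queries
```

Run on a miniature version of the patient example, the join condition between PATIENT and DRUG is dropped from both local queries, matching the pruning behaviour described above.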
4.4.2 Query Decomposition Examples
To illustrate the decomposition of internal queries, let us consider an example multidatabase environment. The multidatabase consists of four databases located at three different sites: REGINA (headquarters and hospitals), WINNIPEG (hospitals), and EDMONTON (pharmaceutical manufacturing). Each local database management system may be different. In Table 4.3, DB1 at REGINA, DB2 at WINNIPEG, and DB3 at EDMONTON are three local databases, and DB4 at REGINA is the multidatabase.
Algorithm for Query Decomposition
Input:  a query of the form SELECT (A) FROM (Tables) WHERE (C)
Output: (database, query) pairs

Define DBS to be the list of databases,
       A   to be the list of attributes in (A),
       T   to be the list of tables in (Tables),
       C   to be the list of conditions in (C)

for each DB ∈ DBS do
    for each table DB..T do
        generate SELECT A_{DB..T} FROM T WHERE C_{DB..T},
            where A_{DB..T} contains every attribute NAME
                such that DB..T.NAME ∈ A or DB..T.NAME ∈ C,
            and C_{DB..T} ⊆ C contains the conditions built from
                constant terms and terms of the form DB..T.NAME,
                with no references to any other table DB'..T'
                where DB ≠ DB' or T ≠ T'
        check the system directory to remove parts not in DB..T
        translate the query into the specific DBMS manipulation language
        replace DB..T.NAME with the attribute name NAME
        if C_{DB..T} = NULL then
            omit the keyword WHERE
        end if
    end for
end for

Figure 4.7: Query Decomposition Algorithm.
Three database schemas exist to store the medical information.
Database   Schema   Local/Global   Location
DB1        A        local          REGINA
DB2        A        local          WINNIPEG
DB3        B        local          EDMONTON
DB4        C        global         REGINA

Schema A:  PATIENT (PNO, Name, Sex, Age, Phys, Diag, Pres, Drug_Use_Time)
           PHYSICIAN (Physno, Physname, Deptno)
Schema B:  DRUG (DNO, Dname, Ingredient, Manufacturer, Date)

Table 4.3: Database Information for the Example Multidatabase.
Schema A exists in database DB1 at site REGINA and in database DB2 at site WINNIPEG. Schema B exists in database DB3 at site EDMONTON. Only the patients, drugs and physicians at the local hospitals are stored in databases DB1, DB2 and DB3, respectively. Schema C is created dynamically based on the global query created by the Interface Module. Schema C exists in the multidatabase DB4 at site REGINA, where the global information about patients, drugs and physicians is stored. An example of schema C would appear like
Local/Global local local local
global
PATIENT (SEX, AGE, DIAGNOSIS, PRESCRIPTION, ON_DRUG_TIME).
Consider the following discovery task: "Look for interesting relations between male patients over age 60 and their medication". A suitable global query to retrieve data relevant to this discovery task, generated by the Interface Module, is shown in Figure 4.8.
A user is unaware of the complexity of the local and global schema mappings and of the details of the local database schema information; he/she need only submit qualified
SELECT SEX, AGE, DIAGNOSIS, PRESCRIPTION, ON_DRUG_TIME
FROM PATIENT, PHYSICIAN, DRUG
WHERE SEX = 'male' AND AGE > 60
  AND PATIENT.Drugno = DRUG.DNO
  AND PATIENT.Physno = PHYSICIAN.Physno

Figure 4.8: Global Query.
SELECT DB1:PATIENT.SEX, DB2:PATIENT.SEX,
       DB1:PATIENT.AGE, DB2:PATIENT.AGE,
       DB1:PATIENT.DIAGNOSIS, DB2:PATIENT.DIAGNOSIS,
       DB1:PATIENT.PRESCRIPTION, DB2:PATIENT.PRESCRIPTION,
       DB1:PATIENT.ON_DRUG_TIME, DB2:PATIENT.ON_DRUG_TIME,
       DB1:PHYSICIAN.Physname, DB2:PHYSICIAN.Physname,
       DB1:PATIENT.Drugno, DB2:PATIENT.Drugno, DB3:DRUG.DNO
FROM DB1:PATIENT, DB2:PATIENT, DB1:PHYSICIAN, DB2:PHYSICIAN, DB3:DRUG
WHERE DB1:PATIENT.SEX = 'male' AND DB2:PATIENT.SEX = 'male'
  AND DB1:PATIENT.AGE > 60 AND DB2:PATIENT.AGE > 60
  AND DB1:PATIENT.Drugno = DB3:DRUG.DNO
  AND DB2:PATIENT.Drugno = DB3:DRUG.DNO
  AND DB1:PATIENT.Physno = DB1:PHYSICIAN.Physno
  AND DB2:PATIENT.Physno = DB2:PHYSICIAN.Physno

Figure 4.9: Internal Query.
SELECT SEX, AGE, DIAGNOSIS, PRESCRIPTION, ON_DRUG_TIME, Physname, Drugno
FROM PATIENT, PHYSICIAN
WHERE SEX = 'male' AND AGE > 60
  AND PATIENT.Physno = PHYSICIAN.Physno
Figure 4.10: Local Query Submitted to DB1 and DB2.
SELECT DNO, Dname
FROM DRUG
Figure 4.11: Local Query Submitted to DB3.
queries based on the information supplied by the HTML form. The user's work is made easier by the Internet front-end and HTML forms, which require a user to enter as few keywords as possible to start a discovery task. After a fully qualified query has been successfully composed, it is submitted to the query manager for further processing. In this phase, the query manager produces a translated internal query, as shown in Figure 4.9.
The query manager then calls the decompose procedure to decompose the internal query string. The set of local queries L consists of queries pruned for each local database system. For this example, L includes L_DB1, L_DB2 and L_DB3, as shown in Figures 4.10 and 4.11.
Each local query is submitted to the local database management system. Under the control of the local database management systems, all local queries are executed. The transaction of each local query is monitored by the query manager. If there is any error or transaction failure, an error message is returned by the query manager. This makes it easier to keep track of the work of each local query.
The results of local queries are sent back to the headquarters' site at REGINA to be inserted in the global database, DB4.

Thus, when a user submits an HTML form, a query is composed by the Interface Module. The query manager translates it into an internal query and decomposes it into a set of local queries. If the local queries are executed successfully, the local results from each local database management system are transferred to the multidatabase at the headquarters' site. The results are inserted into temporary tables created in the multidatabase, DB4. The KDD application module is then executed, based on the tables in DB4. The result returned to the user is the discovered knowledge, that is, the user result shown in Figure 4.1.
4.5 Mechanisms for Resolving Data Value Differences
In this section, mechanisms are presented for resolving data value and scale differences among component databases. When the data for the same attribute are represented differently in two databases, it is difficult to automate the standardization process. In [15] [24], the authors do not give a solution but assume that the database administrator knows the differences between the two data sets retrieved from the two databases. The database administrator may either treat the data differently, by changing the attribute name in one database to another name, or retrieve all the data and let the user decide.
4.5.1 Data Value Standardization
In this thesis, an assumption for resolving data value differences is that a standard dictionary is supplied for each attribute in the global schema. The format for a dictionary name is attribute.dic. For example, the dictionary for the attribute PROVINCE is province.dic. If the type of an attribute is number, the values of that attribute are not checked. The reason is discussed in the next subsection.

Figure 4.12 shows the data value standardization algorithm. Qresult is an attribute set retrieved from a relation. The algorithm checks each attribute attribute_i. If the type of attribute_i is not number, each value of attribute_i is checked. If the value
Standardization Algorithm
Input:  qresult, a set of retrieved results
Output: sresult, a set of standardized results

begin
    for each attribute_i ∈ qresult do
        if attribute_i.type ≠ number_type then
            for each attribute_ij ∈ attribute_i do
                standard := lookup(attribute_i.dic, attribute_ij)
                if standard = not_found then
                    insert(attribute_i.dic, attribute_ij)
                    insert(sresult, attribute_ij)
                else
                    insert(sresult, standard)
                end if
            end for
        end if
    end for
end

Figure 4.12: Data Value Standardization.
attribute_ij is found in the standard dictionary for attribute_i, the value is replaced by the standard value and added to sresult, the set of standardized results. If an attribute value from the retrieved set cannot be found in the dictionary, the value is added to the dictionary as the standard value and put into sresult.
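The standardization algorithm of Figure 4.12 can be sketched with a plain dictionary mapping each variant to its standard form. The dictionary contents below are invented for illustration; the self-registration of unseen values follows the algorithm's not-found branch:

```python
def standardize(values, dictionary):
    """Replace each value with its standard form; an unseen value becomes
    its own standard and is added to the dictionary (as in Figure 4.12)."""
    result = []
    for value in values:
        standard = dictionary.get(value)
        if standard is None:
            dictionary[value] = value   # new value becomes the standard
            standard = value
        result.append(standard)
    return result

# Example per-attribute dictionary (province.dic), contents assumed
province_dic = {"Sask.": "Saskatchewan", "SK": "Saskatchewan"}
standardize(["SK", "Sask.", "Manitoba"], province_dic)
```

After the call, "Manitoba" has been registered in the dictionary as its own standard, so later retrievals of the same value map consistently.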
The procedure lookup, shown in Figure 4.13, uses a data structure called a hash table. The attribute values and their standards are kept in linked lists as buckets. The hash function hash selects a slot in the bucket to keep an attribute value and its standard value. If an attribute value is passed in to check whether it has a standard value, the first step is to determine which bucket to search. Then a linear search is performed through all nodes in the bucket. If a standard value is found, it is returned. Otherwise, "not found" is returned. The advantage of the hash table method depends on the size of the bucket. The size of a bucket is normally a suitable power of 2 [43].
Procedure lookup
Input:  dictionary, a set of standard values
        attribute, an attribute value
Output: standard, a standard value

procedure lookup(dictionary, attribute)
begin
    bucket := put_dictionary_in_bucket(dictionary)
    index := hash(attribute)
    for each node in bucket[index] do
        if compare(bucket[index] -> item, attribute) = 0 then
            return bucket[index] -> standard
        end if
    end for
    return "not found"
end

Figure 4.13: Procedure lookup.
For example, if the size of the bucket is 64, the search is roughly 64 times faster than searching all values of the dictionary.
Figure 4.14 is an example of the hash table. The bucket has 8 slots and contains the values of Canadian provinces and their standard values. To make the diagram easier to illustrate, it is assumed that the hash function returns the same value 0 for "AB" and "BC", 3 for "MB", "NB" and "NF", 5 for "ON" and "PEI", and 8 for "PQ" and "SK". When the hash function is implemented, it is crucial to choose a good function that distributes values evenly.
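An illustrative bucketed hash table in the spirit of Figure 4.13 follows: values hash to one of a small, power-of-two number of buckets, and a linear search within the chosen bucket finds the standard value. The toy hash function and data are assumptions for the example, not the thesis's implementation:

```python
NUM_BUCKETS = 8                      # a small power of two, as in [43]

def bucket_index(value):
    """Toy hash: sum of character codes modulo the bucket count."""
    return sum(ord(ch) for ch in value) % NUM_BUCKETS

def build_buckets(pairs):
    """Distribute (variant, standard) pairs into chained buckets."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for variant, standard in pairs:
        buckets[bucket_index(variant)].append((variant, standard))
    return buckets

def lookup(buckets, value):
    """Search only the bucket the value hashes to (as in Figure 4.13)."""
    for variant, standard in buckets[bucket_index(value)]:
        if variant == value:
            return standard
    return "not found"
```

Only one bucket is scanned per lookup, which is the source of the speedup discussed above; a poorly chosen hash function would pile all entries into a few buckets and forfeit it.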
Table 4.4 gives an example for the attribute "province", showing variations in its values and its standard value. By convention, a province may be written in full or in abbreviation; for example, "Saskatchewan" might be abbreviated as "SK" or "Sask." or written in full as "Saskatchewan". The results retrieved from different databases might have different values. Without standardization, the multidatabase treats the values differently even if they refer to the same attribute.
[Figure content: bucket slots holding province entries and their standard values, e.g. NB -> New Brunswick, NS -> Nova Scotia, ON -> Ontario.]

Figure 4.14: A Hash Table Example.
Variation A            Variation B   Variation C   Standard
Alberta                Alta.         AB            Alberta
British Columbia       B.C.          BC            British Columbia
Manitoba               Man.          MB            Manitoba
New Brunswick          N.B.          NB            New Brunswick
Newfoundland           Nfld          NF            Newfoundland
Nova Scotia            N.S.          NS            Nova Scotia
Ontario                Ont.          ON            Ontario
Prince Edward Island   P.E.I.        PE            Prince Edward Island
Quebec                 P.Q.          PQ            Quebec
Saskatchewan           Sask.         SK            Saskatchewan
Northwest Territory    N.W.T.        NT            Northwest Territory
Yukon                  Yuk.          YT            Yukon

Table 4.4: Example of Provinces and their Variations.
The standardization algorithm resolves the value difference problem. If it finds a standard for an attribute value in the standard dictionary, the standard value is put into the result set. If a standard value cannot be found, the attribute value is inserted into the standard dictionary. This mechanism guarantees that every attribute value in the retrieved data set has a standard value.
4.5.2 Scale Conversion Functions
The standardization approach in Section 4.5.1 is admittedly inadequate because of the scale difference issue discussed in Section 3.3.2. For example, the attribute PRICE might be measured in Canadian dollars in one database but in US dollars in another. As a database is registered, a conversion function for attribute values from the local database to the global schema must be specified. When the database registration subsystem is activated, a database agent is created to retrieve the local schema. For each attribute, a conversion function is provided if the attribute type is number. The default for an attribute is 1, that is, no conversion is necessary for the attribute values. For example, the attribute PRICE can have a conversion function such as CND = USD x Exchange_rate. A fill-in area is provided after Exchange_rate on the HTML form, in which the database administrator can fill in the exchange rate. When the registration subsystem processes the registration form, a set of conversion functions is provided to make the attribute value conversions.
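A minimal sketch of this idea follows, assuming each numeric attribute registered for a local database carries a multiplicative conversion factor, with 1 as the default (no conversion). The attribute names and the exchange rate are assumptions for the example:

```python
CONVERSIONS = {
    # local attribute -> factor applied when loading into the global schema
    "PRICE": 1.35,     # e.g. USD -> CND at an assumed exchange rate
}

def convert(attribute, value):
    """Apply the registered conversion function; the default factor is 1."""
    return value * CONVERSIONS.get(attribute, 1)
```

Attributes without a registered function, such as an age or a code, pass through unchanged, matching the default of 1 described above.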
Chapter 5
Prototype Design and Testing
In the previous chapters the architecture of the KDID system has been presented and design issues have been described. A prototype of the KDID system has been implemented and tested. In this chapter, the design of the prototype is described and preliminary testing on the Natural Sciences and Engineering Research Council (NSERC) data is presented. First, in Section 5.1, the design of the multidatabase is explained. Each local database and the relationships among the local databases are described. Section 5.2 describes the database registration process, and Section 5.3 presents two typical knowledge discovery tasks. In Section 5.4, the speed analysis of the data retrieval from the participating databases and of the data value standardization is presented.
5.1 Constructing a Test Multidatabase System
A multidatabase system was created for testing the KDID system by partitioning an existing database. The original database was the Natural Sciences and Engineering Research Council (NSERC) database, an archival database of award and grant listings, grant allocations and award committees. This original NSERC database was partitioned into three related databases with different schemas and data. Section 5.1.1 presents the global schema. Section 5.1.2 describes an Oracle database of awards, Section 5.1.3 a Microsoft Access database of award committees,
and Section 5.1.4 a Mini SQL database on organizations and grants. Descriptions of the databases and the relations among them are presented in Section 5.1.5.

Attribute Name   Null       Type
AMOUNT           NOT NULL   NUMBER
AREA_CODE        NOT NULL   NUMBER
CNT2             NOT NULL   NUMBER
COMP_YR          NOT NULL   NUMBER
COMMITTEE        NOT NULL   NUMBER
DEPARTMENT       NOT NULL   VARCHAR
ORG_CODE         NOT NULL   NUMBER
PROJECT          NOT NULL   VARCHAR
RECIPIENT        NOT NULL   VARCHAR

Table 5.1: Global Schema of the AWARD Table.

Attribute Name   Null       Type
ORG_CODE         NOT NULL   NUMBER
ORGNAME          NOT NULL   VARCHAR
PROVINCE         NOT NULL   VARCHAR

Table 5.2: Global Schema of the ORGANIZATION Table.

5.1.1 Global Schema
A global schema for a multidatabase system provides users with an integrated view of the multiple databases. A user does not have to know the underlying structure or schema information of any participating database. Knowledge discovery is conducted in the front-end multidatabase.
For the example multidatabase system, part of the global schema information is shown in Tables 5.1 and 5.2.

The "PROVINCE" attribute in the ORGANIZATION table, shown in Table 5.2, is categorized, according to the region information, into a concept hierarchy, presented in Figure 5.1. The top level is Canada, and the second level can be divided according to geographical location, for example, the Maritimes, the West, Ontario, Quebec, the Yukon and Northwest areas, and areas outside Canada.
Canada
    Western
        British Columbia
        Prairies
            Alberta
            Saskatchewan
            Manitoba
    Ontario
    Quebec
    Atlantic
        Maritime
            New Brunswick
            Nova Scotia
            Prince Edward Island
        Newfoundland
Figure 5.1: A Concept Hierarchy for the PROVINCE Attribute.
5.1.2 The D1 Database

The D1 database is an Oracle database of awards containing the AWARD, DISCIPLINE, and AREA tables. The D1 database is maintained at the IRIS Center for Excellence, University of Regina, and resides on a SUN 4 Sparcstation 10 with 32 megabytes of main memory. It is connected to the network using a LANCE Ethernet DMA pseudo device.

The AWARD table contains information on the awards offered, the award amount, the recipient, and the area in which the university is located. The DISCIPLINE table contains data on the discipline titles, and each title has its own standard code. Area information is represented in the AREA table.

Detailed schema information for the D1 database is given in Appendix A.
5.1.3 The D2 Database
The D2 database is a Microsoft Access database on an IBM-compatible personal computer running Windows NT 4.0 with a 66 MHz Intel 486 processor and 16 megabytes of memory. A database server written using the Open DataBase Connectivity (ODBC) protocol runs in the background, listening for connections on a TCP socket.

The D2 database contains the SCHOLARSHIP, COMMITTEE, and GRANT_TYPE tables. The SCHOLARSHIP table has data similar to that in the AWARD table in the D1 database, with attribute name differences. For example, the attribute DEPT in the D1 database is called DEPARTMENT in the SCHOLARSHIP table. The COMMITTEE table describes the committee names and their standard codes. Table 5.3 is part of the schema of the GRANT_TYPE table.

Detailed schema information is given in Appendix B.
Attribute Name   Null       Type
GRANT_CODE       NOT NULL   TEXT
GRANT_ORDER      NOT NULL   NUMBER
GRANT_TITLE      NOT NULL   TEXT

Table 5.3: Schema of the GRANT_TYPE Table.

5.1.4 The D3 Database
The D3 database is a Mini SQL database. Mini SQL is a lightweight database management system designed to provide fast access to stored data with low memory requirements [45]. The database utilizes a well-known TCP socket and accepts multiple connections. The Mini SQL database utilizes memory-mapped I/O and cache techniques to offer rapid access to data.

The D3 database runs on DBLEARN.CS.UREGINA.CA, a SUN 40 MHz Sparcstation IPX with 16 megabytes of main memory and 198 megabytes of disk space.

The D3 database contains tables related to different organizations. The ORGANIZATION table has data on all the provinces and areas.

Detailed schema information is given in Appendix C.
Global Attribute   D1 (Oracle)              D2 (Access)
AREA_CODE          AREA_CODE NUMBER(14)     AREA_CODE NUMBER
DEPT               DEPT VARCHAR2(35)        DEPARTMENT TEXT
ORG_CODE           ORG_CODE NUMBER(14)      ORGCODE NUMBER

Table 5.4: Comparison Between the AWARD and SCHOLARSHIP Tables.

Figure 5.2: Relations among the Three Tables.

5.1.5 Relationships Between the Three Databases
The three databases have been designed with interleaving relationships between each other. The D1 database has an AWARD table, which has similarities to the SCHOLARSHIP table in the D2 database. Table 5.4 compares relevant attributes of the two tables. Figure 5.2 gives the relationships among the AWARD, SCHOLARSHIP and ORGANIZATION tables.

The ORGANIZATION table of the D3 database contains regional data on each organization. The AWARD table in the D1 database has an attribute "ORG_CODE", which has a reference in the ORGANIZATION table. The "ORGCODE" attribute of the SCHOLARSHIP table in the D2 database has the same reference.
5.2 Database Registration Process
Figure 5.3: Secure Site Certificate.
When the hypertext link for the on-line database registration is activated, a new window pops up. It shows a site certificate conforming to the X.509 standard. If the user accepts the conditions stated on the certificate, he/she can continue with the connection and registration. Figure 5.3 is a snapshot from the Netscape browser. The Netscape browser does not recognize the authority signing the site certificate because a test certificate was used. If a site needs recognition from the Netscape browser, the site is required to certify itself through some recognized authority. The certificate describes the site authority, the encryption method and the encryption level.

After the user accepts the certificate, the application registration form appears in the browser; otherwise, the connection fails. Figure 5.4 is a snapshot of the application form with user input data.

On the registration application form, the user is again reminded of the issue of connection security. If he/she feels any doubt, he/she can exit from the registration process. In Figure 5.4, the required information must be entered by the user. The
Figure 5.4: User Database Registration Application Form.

required information includes the database platform, the database name, the login name, the password, and the host that the database resides on. The optional information section allows the user to give a brief description of the database content, which will allow others to better understand the mappings. The KDID system cannot create a database agent if any part of the required information is missing. When the user completes all required information and submits the form, the data is encrypted. The database agent retrieves the participating database's schema information. All tables are retrieved from the database with the SQL command SELECT TABLE_NAME FROM USER_TABLES, assuming that the database supports the command. This command is supported by Oracle. If the registered database does not recognize the command, the local database administrator should supply the names of the tables in the database.
If the schema retrieval is successful, a two-part HTML form is generated automatically. Figure 5.5 gives a snapshot of an example HTML form.
Figure 5.5: Form for Schema Mapping.

As shown in Figure 5.5, the first part of the form describes the global schema. Each global attribute has a corresponding number and its intended meaning. The first global attribute is "AMOUNT" and its corresponding meaning is the amount of an award offered to a candidate. The second part of the form shows the local attribute information. For each local attribute, the user is required to enter the number of the corresponding global attribute. If there is no mapping between a participating attribute and any of the global attributes, the default is "N/A".

Once the user submits the form, the KDID system processes the supplied mappings and updates the global schema accordingly, adding in mappings for the new database. If the mapping supplied is "N/A", no update is executed for that attribute. Another web page appears informing the user that the registration information has been processed, and an acknowledgement message is issued by the KDID registration subsystem. Figure 5.6 is a screen snapshot of the acknowledgement message. When the user gets the mail acknowledgement, the registration process is finished.
5.3 Typical Knowledge Discovery Tasks

This section illustrates two typical knowledge discovery tasks for the KDID system. Both require information from the D1, D2, and D3 databases.
SELECT amount, area_code, disc_code, dept, orgname, province, cname
FROM AWARD a, ORGANIZATION b, COMMITTEE c
WHERE a.org_code = b.orgcode AND a.ctee_code = c.ctee_code
  AND (disc_code = "HARDWARE" OR disc_code = "SOFTWARE")

Figure 5.7: An SQL-like Query for Discovery Task 1.
SELECT amount, area_code, disc_code, dept, orgname, province, cname
FROM AWARD a, ORGANIZATION b, COMMITTEE c
WHERE a.org_code = b.orgcode AND a.ctee_code = c.ctee_code
  AND ((disc_code >= 23000 AND disc_code <= 23500)
    OR (disc_code >= 24000 AND disc_code <= 24500))

Figure 5.8: SQL Query for Task 1 After Transformation.
5.3.1 Discovery Task 1

The following is a typical discovery task (expressed in English for comprehensibility):

    Analyze the relationship between awards offered and the discipline area,
    where the discipline area can be either hardware related or software
    related.

To identify the relationship between the amount of an award and the discipline area, it is necessary to select the amount, area code, department name, organization code and committee information first. To specify the task, an SQL-like query can be constructed, as shown in Figure 5.7.

The query in Figure 5.7 includes high-level concepts such as "HARDWARE" and "SOFTWARE", which do not appear in the database as values for the DISC_CODE attribute. It is necessary to transform these high-level concepts to the primitive level of the data values present in the local database. Using concept hierarchies, the KDID system substitutes "HARDWARE" with the range 23000 to 23500 and "SOFTWARE" with the range 24000 to 24500. The transformed query is shown in Figure 5.8.
Figure 5.8 is the query against the global schema. It is translated into a full path
SELECT amount, area_code, disc_code, department, org_code, ctee_code
FROM AWARD
WHERE ((disc_code >= 23000 AND disc_code <= 23500))
   OR ((disc_code >= 24000 AND disc_code <= 24500))

Figure 5.9: Query 1 for Task 1 for the D1 Database.
SELECT orgname, province, orgcode
FROM ORGANIZATION

Figure 5.10: Query 2 for Task 1 for the D3 Database.
query according to the global and local schema mappings, in order to decompose it into a set of local queries.

Figures 5.9, 5.10, and 5.11 are the decomposed and transformed queries for the D1, D2, and D3 databases.

In Figure 5.7, the global attribute name for department is "dept", while in Figure 5.9 it is changed, according to the schema mapping, to "department". The attribute name "org_code" in the global schema corresponds to "org_code" in Figure 5.9 and to "orgcode" in Figure 5.10. This mapping overcomes a semantic difference between the local databases, as discussed in Chapter 4.
5.3.2 Discovery Task 2
Discovery task 2 is as follows:
    Find interesting relations between all awards, scholarships and disciplines,
    where the discipline can be either structural engineering or mechanical
    engineering.
This task is more complex than the first. First, it is necessary to retrieve the AWARD relation from the D1 database and the SCHOLARSHIP relation from the
SELECT cname, ctee_code
FROM COMMITTEE

Figure 5.11: Query 3 for Task 1 for the D2 Database.
D2 database. An SQL-like query is shown in Figure 5.12.
SELECT amount, area_code, disc_code, dept, orgname, province, cname
FROM AWARD a1, SCHOLARSHIP a2, ORGANIZATION b, COMMITTEE c
WHERE a1.org_code = b.orgcode AND a2.org_code = b.orgcode
  AND a1.ctee_code = c.ctee_code AND a2.ctee_code = c.ctee_code
  AND (disc_code = "STRUCTURAL ENGINEERING"
    OR disc_code = "MECHANICAL ENGINEERING")

Figure 5.12: An SQL-like Query for Discovery Task 2.
The high-level concepts "MECHANICAL ENGINEERING" and "STRUCTURAL ENGINEERING" are transformed into the ranges 00500 - 02000 and 07000 - 08500, respectively. According to the techniques described in Chapter 4, the query in Figure 5.12 is translated and decomposed into four local queries. Figure 5.13 is the query for the D1 database, Figure 5.14 and Figure 5.15 are the queries for the D2 database, and the query in Figure 5.16 is the query for the D3 database. The data retrieved by the queries in Figure 5.13 and Figure 5.14 are standardized and appended to one file.
5.3.3 Task Realization Using HTML Forms

The discovery tasks can be specified using HTML forms. Using HTML forms makes it easier for end users to perform knowledge discovery tasks without regard to the details of the query format. If the user had to compose an SQL query for a complex multidatabase system, it would be easy to make mistakes.

HTML-based forms need scripts to be activated. Figure 5.17 shows a snapshot of
SELECT amount, area_code, disc_code, department, org_code, ctee_code
FROM AWARD
WHERE ((disc_code >= 00500 AND disc_code <= 02000))
   OR ((disc_code >= 07000 AND disc_code <= 08500))

Figure 5.13: Query 1 for Task 2 for the D1 Database.
SELECT amount, area_code, disc_code, dept, org_code, ctee_code
FROM SCHOLARSHIP
WHERE ((disc_code >= 00500 AND disc_code <= 02000))

Figure 5.14: Query 2 for Task 2 for the D2 Database.
SELECT cname, ctee_code
FROM COMMITTEE

Figure 5.15: Query 3 for Task 2 for the D2 Database.
a sample form for discovery task 1. The user can select a concept hierarchy and target attributes. He/she can also specify the attribute threshold and the tuple threshold. The attribute threshold specifies the maximum number of distinct values that an attribute may have in the prime relation. In DB-Discover, the discovered result from a database is called the prime relation. An attribute may have many possible values if it is numerical, as is the AMOUNT attribute in the AWARD table, and it may also have many values if it is a discrete-valued attribute. The DB-Discover system reduces the number of distinct values for numerical attributes by mapping the values into a finite number of ranges. Detailed information on the attribute threshold is given in [17]. The default threshold for all attributes retrieved is 10 in Figure 5.17. The user has the option to change it to a value from 5 to 20. On the form in Figure 5.17, the user can also change the tuple threshold, the maximum number of tuples to retrieve in the prime relation. After the number of distinct values for each attribute in the relation has been reduced to the attribute thresholds in the first round of generalization, the number of tuples in the prime relation is compared to the tuple threshold. The relation is generalized repeatedly until the number of tuples is less than the tuple threshold.
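A highly simplified sketch of this repeated-generalization loop follows: numeric values are merged into ever-coarser ranges until the number of distinct values falls below the tuple threshold. DB-Discover's real algorithm climbs concept hierarchies; fixed-width binning here is a stand-in, and it assumes nonnegative values and a threshold greater than 1:

```python
def generalize(values, width):
    """Map each numeric value to the lower bound of the range containing it."""
    return [(v // width) * width for v in values]

def generalize_until(values, tuple_threshold, width=10):
    """Coarsen the binning until the distinct values fit the threshold."""
    while len(set(values)) >= tuple_threshold:
        values = generalize(values, width)
        width *= 2                     # coarser ranges on each round
    return values
```

Each pass halves the resolution of the ranges, so the loop terminates quickly; the same iterate-until-small-enough structure underlies the tuple-threshold behaviour described above.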
The user might want the total number of items for non-numeric data and the total
of the values for numeric data. On the form in Figure 5.17, the user has options to
SELECT orgname, province, org_code FROM ORGANIZATION
Figure 5.16: Query 4 for Task 2 for the D3 Database.
Figure 5.17: HTML Form for Discovery Task 1.
decide which attribute should be summed by the system. If an attribute is specified
to be summed, the values of that attribute are summed as they are retrieved from the
database.
Once the user specifies all parameters and fills in the predicates, he/she can submit
the form to conduct the discovery task. Figure 5.18 is a snapshot of the form in Figure
5.17 with user input data.
The intermediate steps and results, which include the communications between the
participating databases and the KDID system, data retrieval, table creation, data
insertion, and knowledge discovery, are not displayed for the user. The final result,
that is, the discovered knowledge, is displayed on another HTML page, which can be
scrolled or printed using the Netscape browser. Figure 5.19 is the snapshot of the
final result for the discovery task in Figure 5.7.
The results shown in Figure 5.19 are at a higher conceptual level than the data
stored in the multidatabase.
Figure 5.20 shows data retrieved directly from the multidatabase without any
generalization. The amounts shown in Figure 5.19 are generalized into ranges while
those in Figure 5.20 are numbers. As well, the AREA-CODES in Figure 5.19 are
more intuitive than the discrete numbers in Figure 5.20.
5.4 Performance Analysis
In this section, speed is analyzed with regard to the use of single versus multiple
dat abase agents and data value st andardizat ion using the lookup procedure.
5.4.1 Parallel Data Retrieval from Participating Databases
Two implementation techniques for data retrieval from all participating databases
were considered. In the sequential implementation, after the successful query of the
first database, the second database agent is created. The global query is decomposed
Figure 5.18: HTML Form for Discovery Task 1 with User Data.
Figure 5.19: Final Result for Discovery Task 1.
16530 864 24004 Computer Science        British Columbia  Animal Biology
16651 850 24005 Computing Science       British Columbia  Animal Biology
17000 862 24005 Comp. & Info. Sci.      British Columbia  Animal Biology
25178 864 24007 Computer Science        British Columbia  Animal Biology
12000 121 24004 Computer Science        British Columbia  Animal Biology
11000 862 24005 Computer Science        British Columbia  Animal Biology (deleted)
 7826 864 24004 Electrical Engineering  Alberta           Animal Biology
18530 861 24006 Computer Science        Alberta           Animal Biology (deleted)
18530 861 24006 Computer Science        Saskatchewan      Animal Biology
64300 850 24500 Computer Science        Saskatchewan      Animal Biology (deleted)
Figure 5.20: Retrieved Data without Generalization.
into a set of local queries, but only one database agent is activated and all other
agents are idle.
The parallel implementation is based on the observation that all database agents
can be created at the same time and no interaction among them is required.
The sequential and parallel retrieval methods have both been implemented and
empirical timing tests have been run for varied input sizes. In this section, the results
of those tests are presented.
The program sets up a signal function before it creates any database agents.
Number of    Time for Parallel    Time for Sequential    Performance Ratio
Tuples       Retrieval (sec)      Retrieval (sec)        (Sequential/Parallel)
120,846         199.038              1420.878                7.139
171,409         463.220              1770.956                3.823
230,630        1053.521              2079.330                1.974
317,353        1398.433              2777.899                1.986
549,562        2890.432              4621.859                1.599

Table 5.5: Time for Sequential and Parallel Retrieval.
Memory    Dictionary    Number of    Minimum       Maximum       Average
size      size          tuples       time (sec)    time (sec)    time (sec)
32K       23,788           23,788       0.155         0.317         0.189
32K       23,788        2,728,364      30.035        32.373        31.761
32K       23,788        4,092,556      46.727        51.569        49.068
32K       23,788        6,566,967      81.711        84.507        82.955

Table 5.6: Standardization Time with 32K Memory and a 23,788-Item Dictionary.
According to the number n of databases passed in, the program forks n children and
creates n agents. After all agents finish retrieving, the parent process gets a signal
and continues its execution. In the implementation the fork() system call is used.
fork() makes an identical copy of the calling program with a new process ID number.
fork() returns a zero to the new task that is the child process, and returns the process
ID of the child to the parent process. Given a discovery task, the tuple threshold has
been varied in order to change the retrieval size.
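The forking scheme described above can be sketched in C as follows. The function run_agent() is a hypothetical stand-in for the retrieval work a database agent performs, and for simplicity this sketch reaps the children with waitpid() rather than the signal handler the thesis describes installing before the fork.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

/* Hypothetical stand-in for the work a database agent performs. */
static void run_agent(int agent_id)
{
    printf("agent %d retrieving\n", agent_id);
}

/* Fork one child per participating database: fork() returns 0 in the
 * child and the child's PID in the parent, as described in the text.
 * The parent then waits for all agents to finish; returns the number
 * of children that exited normally. */
int parallel_retrieve(int n)
{
    int finished = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {            /* child: act as database agent i */
            run_agent(i);
            _exit(0);
        }
    }
    for (int i = 0; i < n; i++) {  /* parent: reap all agents */
        int status;
        if (waitpid(-1, &status, 0) > 0 && WIFEXITED(status))
            finished++;
    }
    return finished;
}
```

Because the agents share no state, no synchronization beyond the final wait is needed, which is what makes the parallel implementation straightforward.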
The test varied the number of tuples retrieved from the three participating databases
from 120,000 tuples to 540,000 tuples. The amount of data retrieved increased from
5 Megabytes to 100 Megabytes. Each of the queries sent out to the participating
database has three attributes. The timing results are presented in Table 5.5, and a
graph of these results is shown in Figure 5.21. For Table 5.5, the retrieval methods
are listed in the top row and the retrieved number of tuples in the left hand column.
Cells of the table represent retrieval time in seconds. The last column gives the ratio
of the sequential to parallel retrieval times.
As shown in the graph in Figure 5.21, for the same query, the parallel method is
two to seven times faster than the sequential method. We observe that the time
required for both the sequential and parallel query approaches increases as more
tuples are retrieved.
5.4.2 Data Value Standardization
In this subsection, the results obtained in resolving the data value difference problem
for various sizes of the dictionary and the input tuple set are presented. The tests
Figure 5.21: Test Time for Parallel and Sequential Retrieval.
Memory    Dictionary    Number of    Minimum       Maximum       Average
size      size          tuples       time (sec)    time (sec)    time (sec)
64K       47,576           23,788       0.141         0.315         0.177
64K       47,576        2,728,364      29.126        33.016        30.798

Table 5.7: Standardization Time with 64K Memory and a 47,576-Item Dictionary.
were conducted on a Silicon Graphics O2 with a 174 MHz processor and 64 Megabytes
of main memory. Three parameters are varied in the tests. The first is the allocated
memory size for the bucket to store the items in the dictionary. The size is limited by
the amount of memory available for dynamic allocation by the malloc function. The
dictionary size corresponds to the number of items of a dictionary for an attribute.
The number of tuples is varied from 23,788 to 6,566,967.
The timing results are presented in Tables 5.6, 5.7, and 5.8. In these tables, the
minimum, maximum and average times for 5 runs are presented. Time is recorded
in seconds. In Table 5.6, the memory allocated for the bucket is 32 Kbytes, and
the dictionary has 23,788 items. In Table 5.7, the bucket size is increased to 64
Kbytes, and the dictionary size is doubled to 47,576. The bucket size in Table 5.8
is not changed, but the dictionary size is increased to 127,576. The standardization
time for an attribute with 6 million values is slightly more than one minute. When
memory size is increased, only a small speedup occurs; for example, with 2,728,364
input tuples, the speedup from Tables 5.6 and 5.7 is 1.03. With the dictionary size
increased and the memory size remaining the same, the speed is slightly decreased.
The analysis of the timing results shows that the retrieved data can be standardized
quickly with the standardization algorithm presented in Section 4.5.1. The hash
function was not changed in the tests. The results show that a 23,788-item dictionary
and 32K of memory are adequate for standardizing up to 6.5 million tuples.
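The dictionary lookup at the heart of standardization can be illustrated with a small chained hash table that maps a variant spelling of a value to its standard form. This is a sketch of the general technique, not the KDID code: the hash function, bucket count, and names are all invented for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative chained hash table mapping a variant data value
 * (e.g. "Comp. Sci.") to its standard form ("Computer Science"). */
#define NBUCKETS 1024

struct entry {
    const char *variant, *standard;
    struct entry *next;
};

static struct entry *table[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 5381;                 /* djb2-style string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Add one dictionary item: a variant and its standard form. */
void dict_add(const char *variant, const char *standard)
{
    struct entry *e = malloc(sizeof *e);
    e->variant = variant;
    e->standard = standard;
    e->next = table[hash(variant)];
    table[hash(variant)] = e;
}

/* Standardize one retrieved value: return the dictionary's standard
 * form, or the value itself if it is not in the dictionary. */
const char *standardize(const char *value)
{
    for (struct entry *e = table[hash(value)]; e; e = e->next)
        if (strcmp(e->variant, value) == 0)
            return e->standard;
    return value;
}
```

Each retrieved tuple costs one hash computation plus a short chain walk, which is consistent with the near-linear times observed in Tables 5.6 through 5.8.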
Memory    Dictionary    Number of    Minimum       Maximum       Average
size      size          tuples       time (sec)    time (sec)    time (sec)
64K       127,576          23,788       0.142         0.308         0.148
64K       127,576       2,728,364      29.057        35.482        32.521

Table 5.8: Standardization Time with 64K Memory and a 127,576-Item Dictionary.
5.5 Discussion
This chapter has described how discovery tasks are processed by the KDID system,
as well as the process for registering a database on-line, and a performance analysis
for parallel versus sequential retrieval and data value standardization. Three test
databases were created to show how the KDID system works. The test data comes
from the NSERC database, and the data has been partitioned to simulate some of
the complexities of accessing multiple databases with KDID.
The on-line database registration process shows that the Internet connection between
the participating database and client is secure and that the schema information
retrieval is fast and easy as long as the required parameters are supplied. Once
a known type of database is about to be contacted, a database agent is created and
tries to locate the site information of that database on the Internet.
The user can specify discovery tasks using HTML forms; details of the composition
of the task are handled by the KDID. The user has options to specify the parameters
for the discovery task. The KDID system allows the user to perform discovery tasks
on multiple databases as if they were a single integrated database.
The analysis for the parallel versus the sequential retrieval shows that for large
databases, the parallel retrieval method saves time dramatically, thus improving the
performance of the KDID system.
Chapter 6
Conclusion
In this chapter, the thesis is summarized, the contributions are identified, and
suggestions for future research are given.
6.1 Summary
This thesis has presented the KDID system for knowledge discovery in Internet
databases. In particular, it has described the overall system architecture, specific
design issues, and implementation of major parts of the KDID system, as well as
on-line database registration, query decomposition, parallel retrieval, and network
database agents.
KDID is intended as a proof-of-concept system showing that the overall approach
is feasible. It has four major components: the Internet Front-end, the Interface
Module, the Multidatabase Module, and the KDD Application Module. The Internet
Front-end is the primary communication medium between the end users and the
KDID system. The Internet Front-end is accessed using a web browser, such as
Netscape Navigator. The browser produces a form data set corresponding to the
discovery task specified by the user. The Interface Module parses this form data set
and composes a global query. The Multidatabase Module translates the global query
into an internal query and then decomposes it into a set of local queries. It also
creates local database agents and monitors the data retrieval process. The retrieved
data is standardized and uploaded into a multidatabase. After all local queries have
been processed, the KDD application is activated, and discovered knowledge is passed
to the Internet Front-end for presentation to the end user.
The on-line database registration process makes it possible to query and manipulate
meta-data, that is, information about database schemas and their intended
meaning. The Internet connection between the client of the KDID system and a
local database server is a secure one based on the SSL protocol. The client and
server use the SSL Handshake Protocol described in [63]. SSL takes data to be transmitted,
fragments the data into manageable blocks, and encrypts them before any
transmission occurs. The user is asked about the required information concerning
the local database, and the KDID system creates a database agent to query the local
schema information. If the schema retrieval is successful, an HTML form is generated
automatically. The local user is asked to map the local attributes to their global
counterparts. According to the mappings supplied by the user, the integrated global
schema is updated.
The query decomposition algorithm ensures that no cross-database joins are performed.
If a cross-database join is present in the global query, the attributes in the
join condition are selected from each database separately. The join condition is executed
in the multidatabase after the data has been retrieved. Given a global query,
SELECT ATT FROM TABLES WHERE CONDS, a set of local queries are generated
in the form of SELECT ATT_DB.T FROM T WHERE CONDS_DB.T. ATT_DB.T is a
list of attributes belonging to a single database. If the list of CONDS_DB.T is empty,
the local query has no predicates and the keyword WHERE is omitted. The example in
Section 4.5.2 shows how the algorithm works.
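The local-query form described above can be sketched as a small string-building routine: given the attributes and predicates that belong to one database's table, it emits the SELECT statement, omitting WHERE when the condition list is empty. The function and parameter names are illustrative, not taken from the KDID source.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch of emitting one local query:
 *   SELECT <attrs> FROM <table> [WHERE <conds>]
 * The WHERE clause is omitted when the condition list is empty,
 * as the decomposition algorithm specifies. */
void build_local_query(char *out, size_t len, const char *table,
                       const char *attrs, const char *conds)
{
    if (conds && conds[0] != '\0')
        snprintf(out, len, "SELECT %s FROM %s WHERE %s",
                 attrs, table, conds);
    else
        snprintf(out, len, "SELECT %s FROM %s", attrs, table);
}
```

Called once per table with that database's attribute and condition lists, this produces queries of the same shape as Queries 1 through 4 in Figures 5.13 to 5.16.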
Data integration issues make constructing a multidatabase difficult. The types of
data integration, the reasons for the differences, and possible solutions are discussed in
Section 3.3.2. Mechanisms for resolving data value differences are proposed in Section
4.5 and an example shows that the mechanisms are feasible.
Three different database agents are implemented in the prototype KDID system.
The Oracle database agent is implemented using embedded SQL, the Mini SQL agent
is implemented using the API supplied with the Mini SQL system, and the Microsoft
Access agent is implemented using the ODBC protocol. All retrieved data are converted
into a standard format for further processing.
The section on data retrieval in parallel shows that this method is faster than the
sequential retrieval method.
6.2 Contributions
The research involved in this thesis makes the following original contributions.
1. The KDID system for knowledge discovery from different Internet databases
was proposed and a prototype system was implemented. Using a single, integrated
interface, an existing knowledge discovery technique was applied to data
collected from more than one relational database on the Internet. The system
makes it possible for administrators of relational databases scattered across the
Internet with similar contents to cooperate with each other, and KDID makes it
easier to summarize the complete set of related data.
2. On-line database registration makes meta-data manipulation feasible and relatively
easy. Previous research has assumed that the mappings between database
schemas are provided [59] [76] [81] [83]. Here, schema mapping is accomplished
by creating database agents that generate HTML forms to obtain schema mapping
information from the local database administrators.
3. Participation in the KDID system is easy and voluntary. Once a database
administrator has decided to participate, only the mappings between global and
local schemas at the central registry are updated. There is no need to change
any local database or to inform any other participants about the new database.
Similarly, an administrator can remove a database from KDID relatively easily,
although this thesis has not defined a process for deleting correspondences from
the general schema.
4. Retrieval of data from multiple databases can be conducted in parallel. In this
research, we showed the benefit that results from this approach in the KDID
system.
6.3 Areas for Future Research
The implemented KDID system can only conduct knowledge discovery in relational
databases. In future research work, other types of databases could be considered,
including object-oriented databases, deductive databases and temporal databases.
The data models describing these database types are more complex than the relational
model. As well, the variations in the structures of object-oriented and deductive
databases will complicate the process of decomposing queries and integrating results.
Another factor is that existing knowledge discovery tools mainly focus on relational
databases. Thus, knowledge discovery tools will need to be created or adapted for
use with object-oriented and deductive databases.
Although several integration problems were described and examples of each were
presented, a solution to only one of these problems, differences in data values, was
developed. Future research should focus on overcoming structural and semantic
differences.
Given the proliferation of databases, it would be useful to develop tools that
can automate the specification of database schema mappings. The on-line database
registration method only partially automates the process, because after the local
database schema information has been retrieved, the local database administrator is
required to supply the mapping. If heuristics were developed to map local schemas to
the global schema automatically, the process of schema mapping would be simplified.
Some research work has been conducted by [72] [85]. A framework for supporting
automated data integration is being developed.
Bibliography
[1] Agrawal, R., Imielinski, T., and Swami, A., "Mining Association Rules between Sets
of Items in Large Databases," Proceedings of the ACM SIGMOD International
Conference on Management of Data, pp. 207-216, Washington D.C., May 1993.
[2] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I., "Fast
Discovery of Association Rules," Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp.
307-328, AAAI/MIT Press, 1996.
[3] The Apache Group, "Apache HTTP Server Project," http://www.apache.org/.
[4] Arens, Y., Chen, C. Y., Hsu, C. N., and Knoblock, C. A., "Retrieving and
Integrating Data from Multiple Information Sources," International Journal of
Intelligent and Cooperative Information Systems, 2(2):127-158, 1993.
[5] Bal, H. E., Kaashoek, M. F., Tanenbaum, A. S., and Jansen, J., "Replication
Techniques for Speeding up Parallel Applications on Distributed Systems," Concurrency
Practice and Experience, 4(5):337-355, 1992.
[6] Barber, D. B., "Attribute Selection Strategies for Attribute-Oriented Generalization,"
M. Sc. thesis, University of Regina, 1997.
[7] Batini, C., Lenzerini, M., and Navathe, S., "A Comparative Analysis of Methodologies
for Database Schema Integration," ACM Computing Surveys, 18(4):323-364,
1986.
[8] Berners-Lee, T., and Connolly, D., "Hypertext Markup Language - 2.0," RFC
1866, MIT/W3C, http://ecco.bsee.swin.edu.au/text/html-spec, November 1995.
[9] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext Transfer Protocol - HTTP/1.0," http://www.ics.uci.edu/pub/ietf/http/rfe, 1996.
[10] Berry, M., "Large Scale Singular Value Computations," International Journal of
Supercomputer Applications, 6(1):13-49, 1992.
[11] Bestavros, A., Demand-Based Document Dissemination for the World-Wide Web,
Technical Report BU-CS-95-003, Computer Science Department, Boston University,
1995.
[12] Bestavros, A., Carter, R., Crovella, M. E., Cunha, C. R., Heddaya, A., and
Mirdad, S. A., Application-Level Document Caching in the Internet, Technical
Report BU-CS-95-002, Computer Science Department, Boston University, 1995.
[13] Borgida, A., Brachman, R. J., McGuinness, D. L., and Resnick, L. A., "CLASSIC:
A Structural Data Model for Objects," Proceedings ACM SIGMOD Symposium
on the Management of Data, pp. 58-67, 1989.
[14] Brachman, R. J. and Anand, T., "The Process of Knowledge Discovery in
Databases," Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds),
Advances in Knowledge Discovery and Data Mining, pp. 37-57, AAAI/MIT
Press, 1996.
[15] Breitbart, Y., Olson, P. L., and Thompson, G. R., "Database Integration in a
Distributed Heterogeneous Database System," Hurson, A. R., Bright, M. W., and
Pakzad, S., (eds), Multidatabase: an Advanced Solution for Global Information
Sharing, IEEE Computer Society Press, 1994.
[16] Buneman, P., Davidson, S. B., Hart, K., and Overton, C., "A Data Transformation
System for Biological Data Sources," Proceedings of International Conference
on Very Large Data Bases, pp. 158-169, 1995.
[17] Carter, C. L., Hamilton, H. J., and Cercone, N., The Software Architecture
of DBLEARN, Technical Report CS-94-04, Department of Computer Science,
University of Regina, January, 1994.
[18] Carter, C. L. and Hamilton, H. J., Performance Evaluation of Attribute-Oriented
Algorithms for Knowledge Discovery from Databases: Extended Report, Technical
Report 95-6, Department of Computer Science, University of Regina, 1995.
[19] Chankhunthod, A., Danzig, P., Neerdaels, C., Schwartz, M. F., and Worrell,
K. J., "A Hierarchical Internet Object Cache," USENIX 1996 Annual Technical
Conference, San Diego, CA, January, 1996.
[20] Chen, M., Han, J., and Yu, P., "Data Mining: An Overview from a Database
Perspective," IEEE Transactions on Knowledge and Data Engineering,
8(6):866-882, 1996.
[21] Crovella, M. E. and Carter, R. L., Dynamic Server Selection in the Internet,
Technical Report BU-CS-95-014, Computer Science Department, Boston University,
1995.
[22] Date, C. J., An Introduction to Database Systems, Addison-Wesley, 1990.
[23] Dayal, U. and Hwang, H. Y., "View Definition and Generalisation for Database
Integration in a Multidatabase System," IEEE Transactions on Software Engineering,
SE-10(6):628-644, 1984.
[24] Deen, S. M., Amin, R. R., and Taylor, M. C., "Data Integration in Distributed
Databases," Hurson, A. R., Bright, M. W., and Pakzad, S., (eds), Multidatabase:
an Advanced Solution for Global Information Sharing, IEEE Computer Society
Press, 1994.
[25] Deerwester, S., Dumais, S. T., Furnas, G. W., and Landauer, T. K., "Indexing
by Latent Semantic Analysis," Journal of the American Society for Information
Science, 41(6):391-407, 1990.
[26] Desai, B. C., An Introduction to Database Systems, West Publishing, 1990.
[27] Dumais, S. T., "Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval,"
Behavior Research Methods, Instruments and Computers, 23(2):229-236,
1991.
[28] Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., and Harshman,
R., "Using Latent Semantic Analysis to Improve Access to Textual Information,"
Proceedings of ACM CHI'88 Conference on Human Factors in Computing
Systems, pp. 281-285, 1988.
[29] El-Medani, G., A Visual Query Facility for Multimedia Databases, Technical
Report TR95-18, Department of Computing Science, University of Alberta, 1995.
[30] Fang, D., Hammer, J., McLeod, D., and Si, A., "Remote-Exchange: An Approach
to Controlled Sharing among Autonomous, Heterogeneous Database Systems,"
Proceedings of the IEEE Spring Compcon, IEEE, San Francisco, February
1991.
[31] Fang, D., Hammer, J., and McLeod, D., "An Approach to Behavior Sharing in
Federated Database Systems," M. T. Ozsu, U. Dayal, and P. Valduriez (eds),
Distributed Object Management, Morgan Kaufmann, 1993.
[32] Fayyad, U., Djorgovski, S., and Weir, N., "Automating the Analysis and Cataloging
of Sky Surveys," Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy,
R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 471-493,
AAAI/MIT Press, 1996.
[33] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., "From Data Mining to Knowledge
Discovery: An Overview," Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp.
1-34, AAAI/MIT Press, 1996.
[34] Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J., "Knowledge Discovery
in Databases: An Overview," Piatetsky-Shapiro, G. and Frawley, W. J.
(eds), Knowledge Discovery in Databases, pp. 1-27, AAAI/MIT Press, 1991.
[35] Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T., "The Vocabulary
Problem in Human-System Communication," Communications of the
ACM, 30(11):964-971, 1987.
[36] Gunter, C. A., "The Mixed Powerdomain," Theoretical Computer Science, 103:311-
334, 1992.
[37] Hammer, J., McLeod, D., and Si, A., "Object Discovery and Unification in a
Federated Database System," Proceedings of the Workshop on Interoperability
of Database Systems and Database Applications, pp. 3-18, Swiss Information
Society, University of Fribourg, Switzerland, October 1993.
[38] Hammer, J., Garcia-Molina, H., Labio, W., Widom, J., and Zhuge, Y., "The
Stanford Data Warehousing Project," Data Engineering Bulletin, 18(2):41-48,
June 1995.
[39] Hammer, J., Garcia-Molina, H., Ireland, K., Papakonstantinou, Y., Ullman, J.,
and Widom, J., "Information Translation, Mediation, and Mosaic-Based Browsing
in the TSIMMIS System," Proceedings of the ACM SIGMOD International
Conference on Management of Data, San Jose, California, June 1995.
[40] Han, J. and Fu, Y., "Attribute-Oriented Induction in Data Mining," Fayyad,
U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds), Advances in Knowledge
Discovery and Data Mining, pp. 399-421, AAAI/MIT Press, 1996.
[41] Han, J., Zaiane, O. R., and Fu, Y., Resource and Knowledge Discovery in Global
Information Systems: A Multiple Layered Database Approach, Technical Report
CMPT TR94-10, School of Computing Science, Simon Fraser University, Canada,
1994.
[42] Han, J., Fu, Y., and Ng, R., "Cooperative Query Answering Using Multiple
Layered Databases," Proceedings of the Second International Conference on Cooperative
Information Systems, Toronto, Canada, May 1994.
[43] Horspool, R. N., The Berkeley UNIX Environment, Prentice-Hall, 1992.
[44] Hsu, C. and Knoblock, C. A., "Discovering Robust Knowledge from Dynamic
Closed-World Data," Proceedings of the Thirteenth National Conference on Artificial
Intelligence, Portland, Oregon, 1996.
[45] Hughes, D. J., Mini SQL: A Lightweight Database Engine, Hughes Technologies
Pty, 1996.
[46] Hull, R., "Managing Semantic Heterogeneity in Databases: A Theoretical
Perspective," ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, PODS 1997, pp. 51-55, 1997.
[47] Hurson A. R., Bright, M. W., and Pakzad, S., Multidatabase: an Advanced
Solution for Global Information Sharing, IEEE Computer Society Press, 1994.
[48] Internet Resources, http://132.15.104.104/0kinawa/NetInfo/Resources.html.
[49] Knight-Ridder Information, "Gale Directory of Databases," http://www.rs.ch/
krinfo/products/datastar/sheet8/GDDB.HTM.
[50] Krochmal, J., LAN Applications, New Riders Publishing, Carmel, Indiana USA,
1993.
[51] Levy, A. Y., Mendelzon, A. O., Sagiv, Y., and Srivastava, D., "Answering Queries
Using Views," Proceedings ACM Symposium on Principles of Database Systems,
pp. 95-104, 1995.
[52] Levy, A. Y., Rajaraman, A., and Ordille, J. J., "Querying Heterogeneous Information
Sources Using Source Descriptions," Proceedings of International Conference
on Very Large Data Bases, pp. 251-262, 1996.
[53] Levy, A. Y., Rajaraman, A., and Ullman, J. D., "Answering Queries Using Limited
External Query Processors," Proceedings ACM Symposium on Principles of
Database Systems, pp. 227-237, 1996.
[54] Li, S. H. and Danzig, P. B., Two-Dimensional Visualization for Internet Resource
Discovery, Technical Report USC-CS-96, Computer Science Department,
University of Southern California, 1996.
[55] Li, S. H., and Danzig, P. B., "Vocabulary Problem in Internet Resource Discovery,"
Second International Workshop on Next Generation Information Technologies
and Systems, 1995.
[56] Li, S. H. and Danzig, P. B., Vintage: A Visual Information Retrieval Interface
Based on Latent Semantic Indexing, Technical Report USC-CS-96-xx, Computer
Science Department, University of Southern California, 1996.
[57] Libkin, L., "Approximation in Databases," Proceedings of the International Conference
on Database Theory, pp. 411-424, 1995.
[58] Linden, B., Oracle 7 Server SQL Language Reference Manual, Oracle Corporation,
December 1992.
[59] Litwin, W., Mark, L., and Roussopoulos, N., "Interoperability of Multiple Autonomous
Databases," ACM Computing Surveys, 22(3):267-293, 1990.
[60] Meng, W., and Yu, C., Query Processing in Multiple Database Systems, pp.
551-572, Addison-Wesley, Reading, MA, 1995.
[61] Mock, K., "Adaptive User Models for Intelligent Information Filtering," Proceedings
of the Third Golden West International Conference on Intelligent Systems,
Las Vegas, Nevada, 1994.
[62] Murthy, S. K., Kasif, S., and Salzberg, S., "A System for Induction of Oblique
Decision Trees," Journal of Artificial Intelligence Research, 2:1-32, 1994.
[63] Netscape Communications Corporation, "The Secure Sockets Layer Protocol
(SSL)," http://home.netscape.com/info/security-doc.html.
[64] Network Wizards, "Internet Domain Survey, January 1997," http://www.nw.com/zone/WWW/report.html.
[65] Nua Internet Survey, "Internet Survey February 1997," http://www.nua.ie/surveys/WhatsNew.html#February.
[66] Nural, S., Koksal, P., Ozcan, F., and Dogac, A., "Query Decomposition and
Processing in Multidatabase Systems," Proceedings of OODBMS Symposium of
the European Joint Conference on Engineering Systems Design and Analysis,
Montpellier, July 1996.
[67] Object-Orientation FAQ; see http://www.bilkent.edu.tr/Online/oofaq/oo-faq-S-3.5.htm.
[68] Obraczka, K., Danzig, P. B., and Li, S. H., "Internet Resource Discovery Services,"
IEEE Computer, 26(9):8-22, September 1993.
[69] Ozsu, M. T. and Valduriez, P., Principles of Distributed Database Systems,
Prentice-Hall, 1991.
[70] Paredaens, J., Van den Bussche, J., and Van Gucht, D., "Towards a Theory
of Spatial Database Queries," Proceedings of the 13th ACM Symposium on the
Principles of Database Systems, pp. 279-288, 1994.
[71] Read, R. L., Fussell, D. S., and Silberschatz, A., "A Multi-Resolution Relational
Data Model," Proceedings of the Eighteenth International Conference on Very
Large Data Bases, pp. 134-150, Vancouver, Canada, 1992.
[72] Reference Architecture for the Intelligent Integration of Information, version
2.0, draft, 1995. Developed by the I3 Program of DARPA; see http://
dc.isx.com/I3/html/briefs/I3brief.html#ref.
[73] Reiter, R., "Towards a Logical Reconstruction of Relational Database Theory,"
Brodie, M. L., et al. (eds), On Conceptual Modeling, 1984.
[74] Rivera, C. B. and Carter, C. L., A Tutorial Guide to DB-Discover, Version
2.0, Technical Report CS95-05, Department of Computer Science, University of
Regina, July, 1995.
[75] Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval,
McGraw-Hill, 1983.
[76] Sheth, A. P. and Larson, J. A., "Federated Database Systems for Managing Distributed,
Heterogeneous, and Autonomous Databases," ACM Computing Surveys,
22(3):183-236, 1990.
[77] Srikant, R. and Agrawal, R., "Mining Quantitative Association Rules in Large
Relational Tables," Proceedings of the ACM SIGMOD Conference on Management
of Data, Montreal, Canada, June 1996.
[78] Srikant, R. and Agrawal, R., "Mining Generalized Association Rules," Proceedings
of the 21st International Conference on Very Large Databases, Zurich,
Switzerland, September, 1995.
[79] Srikant, R. and Agrawal, R., "Fast Algorithms for Mining Association Rules,"
Proceedings of the 20th International Conference on Very Large Databases, Santiago,
Chile, September, 1994.
[80] Steindl, C., "Is Interoperability Achievable With ODBC?" Institute of Computer
Science, Johannes Kepler University Linz, Austria, 1996.
[81] Thomas, G., "Heterogeneous Distributed Database Systems for Production Use,"
ACM Computing Surveys, 22(3):237-266, 1990.
[82] Tsai, P. S. M. and Chen, A. L. P., "Concept Hierarchies for Database Integration
in a Multidatabase System," International Conference on Management of Data,
1994.
[83] Ullman, J. D., "Information Integration Using Logical Views," Proceedings of
International Conference on Database Theory, pp. 19-40, 1997.
[84] Vaghani, J., Ramamohanarao, K., Kemp, D. B., Somogyi, Z., Stuckey, P. J.,
Leask, T. S., and Harland, J., "The Aditi Deductive Database System," VLDB
Journal, 3(2):245-288, 1994.
[85] Wiederhold, G., "Foreword: Intelligent Integration of Information," Journal of
Intelligent Information Systems, 6(2/3):281-291, 1996.
[86] "What is X.509?" RSA Laboratories, Inc.; see http://www.rsa.com/rsalabs/faq/q165.html.
[87] Zhuge, Y., Garcia-Molina, H., Hammer, J., and Widom, J., "View Maintenance
in a Warehousing Environment," Proceedings of the ACM SIGMOD International
Conference on Management of Data, pp. 316-327, San Jose, California, June
1995.
Appendix A
Schema Information for the D1 Database
A.1 The AWARD Table
| Attribute Name | Null | Type          |
| AMOUNT         |      | NUMBER(14)    |
| AREA-CODE      |      | NUMBER(14)    |
| COMP-YR        |      | NUMBER(14)    |
| CTEE-CODE      |      | NUMBER(14)    |
| DEPT           |      | VARCHAR2(35)  |
| DISC-CODE      |      | NUMBER(14)    |
| FISCAL-YR      |      | NUMBER(14)    |
| GRANT-CODE     |      | VARCHAR2(6)   |
| INSTAL         |      | VARCHAR2(6)   |
| ORG-CODE       |      | NUMBER(14)    |
| PROJECT        |      | VARCHAR2(200) |
| RECP-NAME      |      | VARCHAR2(25)  |

Table A.1: Schema of the AWARD Table
A.2 The DISCIPLINE Table

| Attribute Name | Null     | Type         |
| DISC-CODE      | NOT NULL | NUMBER(14)   |
| DISC-TITLE     | NOT NULL | VARCHAR2(63) |

Table A.2: Schema of the DISCIPLINE Table

A.3 The AREA Table

| Attribute Name | Null | Type         |
| AREA-CODE      |      | NUMBER(14)   |
| AREA-TITLE     |      | VARCHAR2(65) |

Table A.3: Schema of the AREA Table
Appendix B
Schema Information of Database D2
B.1 The SCHOLARSHIP Table
| Attribute Name | Null     | Type   |
| AMOUNT         |          | NUMBER |
| AREA-CODE      |          | NUMBER |
| CNT2           |          | NUMBER |
| COMP-YR        |          | NUMBER |
| CTEE-CODE      |          | NUMBER |
| DEPARTMENT     |          | TEXT   |
| DISC-CODE      |          | NUMBER |
| FISCAL-YR      |          | NUMBER |
| GRANT-CODE     |          | TEXT   |
| INSTAL         |          | TEXT   |
| ORG-CODE       |          | NUMBER |
| PROJECT        |          | TEXT   |
| RECP-NAME      | NOT NULL | TEXT   |

Table B.1: Schema of the SCHOLARSHIP Table
B.2 The COMMITTEE Table

| Attribute Name | Null     | Type   |
| CNAME          | NOT NULL | TEXT   |
| CTEE-CODE      | NOT NULL | NUMBER |

Table B.2: Schema of the COMMITTEE Table

B.3 The GRANT-TYPE Table
Table B.3: Schema of the GRANT-TYPE Table
Appendix C
Schema Information of Database D3
C.1 The ORGANIZATION Table
| Attribute Name | Null     | Type     |
| ORG-CODE       | NOT NULL | INTEGER  |
| ORGNAME        | NOT NULL | CHAR(51) |
| PROVINCE       |          | CHAR(17) |

Table C.1: Schema of the ORGANIZATION Table
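The same logical attribute is stored with a different local type in each database: an organization code is an Oracle NUMBER(14) in D1, a NUMBER in D2, and an INTEGER in D3. A multidatabase front end must reconcile such differences before schemas can be integrated. The sketch below shows one hypothetical way to fold local types into a small set of global types; the mapping table and function names are illustrative, not the thesis's actual implementation.

```python
# Hypothetical mapping from the local type systems of D1, D2, and D3
# (Oracle NUMBER/VARCHAR2, NUMBER/TEXT, INTEGER/CHAR) to two global types.
GLOBAL_TYPE = {
    "NUMBER": "numeric", "INTEGER": "numeric",
    "TEXT": "string", "VARCHAR2": "string", "CHAR": "string",
}

def global_type(local_type):
    """Strip any length suffix such as (14) or (51), then look up the base type."""
    base = local_type.split("(")[0]
    return GLOBAL_TYPE.get(base, "unknown")

print(global_type("NUMBER(14)"), global_type("CHAR(51)"), global_type("TEXT"))
# numeric string string
```

A real integrator would also need length and precision reconciliation, but the base-type lookup above is the first step.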
Appendix D
Glossary
Attribute-oriented induction: An attribute-oriented induction algorithm
takes as input a relation retrieved from a database and generalizes the data guided
by a set of concept hierarchies.
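The generalization step can be sketched as follows: each attribute value is replaced by its parent concept in the hierarchy, and tuples that become identical are merged with a count. This is a minimal illustration, not the thesis's algorithm; the hierarchy and data are hypothetical.

```python
# Minimal sketch of attribute-oriented induction: climb each value one level
# in a concept hierarchy, then merge duplicate generalized tuples with a count.
from collections import Counter

# Hypothetical city -> province concept hierarchy.
concept_hierarchy = {
    "Regina": "Saskatchewan", "Saskatoon": "Saskatchewan",
    "Toronto": "Ontario", "Ottawa": "Ontario",
}

def generalize(tuples, hierarchy):
    """Replace each value with its parent concept and count merged tuples."""
    generalized = (tuple(hierarchy.get(v, v) for v in t) for t in tuples)
    return Counter(generalized)

rows = [("Regina",), ("Saskatoon",), ("Toronto",)]
print(sorted(generalize(rows, concept_hierarchy).items()))
# [(('Ontario',), 1), (('Saskatchewan',), 2)]
```

A full implementation would climb the hierarchy repeatedly until each attribute has few enough distinct values.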
Characteristic discovery task: A characteristic discovery task is one that
requires the finding of interesting relationships between various attributes of one or
more relations in the database.
Client-server model: A form of distributed computing that divides the application processing between a client and a server that are connected by a network.
Concept hierarchy: A concept hierarchy is a tree of concepts arranged hierar-
chically according to generality.
Data classification: Data classification classifies data based on the values of
certain attributes.
Database agent: A database agent forms a gateway to a local database.
Decision tree: A decision tree is generated from a training set. A classification algorithm takes as input the training set of attribute values, each labelled with a class, and produces the tree.
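The core of tree construction is choosing, at each node, the attribute that best separates the class labels. The toy sketch below scores attributes by how many training tuples fall into single-class groups; real algorithms use measures such as information gain. The training set and attribute names are hypothetical.

```python
# Toy illustration of decision-tree node selection: score each attribute by
# how many training tuples land in a value-group containing only one class.
def best_split(training_set, attributes):
    """Return the attribute whose value-groups are purest."""
    def purity(attr):
        groups = {}
        for row in training_set:
            groups.setdefault(row[attr], []).append(row["class"])
        # Count rows belonging to groups with a single class label.
        return sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return max(attributes, key=purity)

rows = [
    {"dept": "CS",   "year": 1996, "class": "funded"},
    {"dept": "CS",   "year": 1997, "class": "funded"},
    {"dept": "Math", "year": 1996, "class": "unfunded"},
]
print(best_split(rows, ["dept", "year"]))  # dept
```

Splitting on "dept" separates the classes perfectly here, so it would become the root test of the tree.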
Form: A form is a template for a form data set and an associated method and
action URL.
Form data set: A form data set is a sequence of (NAME, VALUE) pairs.
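When a form is submitted, the (NAME, VALUE) pairs are URL-encoded into the request. The standard-library call below makes this concrete; the field names are hypothetical.

```python
# A form data set is an ordered sequence of (NAME, VALUE) pairs; on submission
# the pairs are URL-encoded as name=value fields joined by '&'.
from urllib.parse import urlencode

form_data_set = [("task", "characteristic"), ("table", "AWARD"), ("min_support", "0.1")]
print(urlencode(form_data_set))
# task=characteristic&table=AWARD&min_support=0.1
```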
Full-path query: A query that specifies the database name and relation name for each attribute mentioned. The form of an attribute is database:relation.attribute.
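Because the form is fixed, a full-path attribute can be taken apart with two splits. A sketch, using a hypothetical attribute string built from the D1 schema:

```python
# Split a full-path attribute 'database:relation.attribute' into components.
def parse_full_path(attr):
    """Return (database, relation, attribute) for a full-path attribute."""
    database, rest = attr.split(":", 1)
    relation, attribute = rest.split(".", 1)
    return database, relation, attribute

print(parse_full_path("D1:AWARD.AMOUNT"))
# ('D1', 'AWARD', 'AMOUNT')
```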
Global query: A global query is a full-path query that may include constant
predicates and join conditions.
Global schema: A global schema is created by the integration of multiple
export schemas.
Homonym: Homonyms are different attributes having identical names.
Hypertext Markup Language: The Hypertext Markup Language is a simple
data format used to create hypertext documents that are portable from one platform
to another.
Hypertext Transfer Protocol: The Hypertext Transfer Protocol is an application-level protocol for distributed, collaborative, hypermedia information systems.
Internet: The Internet is the worldwide collection of inter-connected computer
networks and gateways that use the Internet Protocol (IP) and function as a single
cooperative network.
Internet database: An Internet database is a database with access provided
to Internet users.
Knowledge discovery: Knowledge discovery is the non-trivial process of iden-
tifying valid, novel, potentially useful, and ultimately understandable patterns in
data.
Knowledge discovery in Internet databases: Knowledge discovery in Internet databases concerns the application of techniques for knowledge discovery in databases to multiple databases available on the Internet.
Local database query: A local database query contains only the constant conditions; each join condition is decomposed into attributes to be retrieved.
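The decomposition of a global query into local database queries can be sketched as follows: constant predicates are routed to the database that owns the attribute, and each side of a join condition is added to the owning database's retrieval list. The data structures below are hypothetical, not the thesis's internal representation.

```python
# Sketch of global-query decomposition: constants go to the owning database's
# WHERE clause; each side of a join becomes an attribute to be retrieved.
def decompose(constants, joins):
    local = {}
    for (db, rel, attr), cond in constants:
        q = local.setdefault(db, {"select": set(), "where": []})
        q["select"].add(f"{rel}.{attr}")
        q["where"].append(f"{rel}.{attr} {cond}")
    for left, right in joins:
        for db, rel, attr in (left, right):
            local.setdefault(db, {"select": set(), "where": []})["select"].add(f"{rel}.{attr}")
    return local

queries = decompose(
    constants=[(("D1", "AWARD", "FISCAL-YR"), "= 1997")],
    joins=[(("D1", "AWARD", "ORG-CODE"), ("D3", "ORGANIZATION", "ORG-CODE"))],
)
print(sorted(queries))  # ['D1', 'D3']
```

The join itself is then evaluated globally over the retrieved attributes, which is exactly why the local queries carry only the constant conditions.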
Macro: A macro is a list of actions to be performed by the database system.
Module language: A module language is a small language for expressing SQL operations in pure SQL syntactic form.
Multidatabase system: A multidatabase system provides integrated access to autonomous, heterogeneous databases via a single, relatively simple request.
Multiple layered database: A multiple layered database is a database formed by generalization and transformation of information.
Schema Information Manager: The schema information manager coordinates
the global database agent and the local database agents.
Synonym: Synonyms are attributes in different databases that have the same
meaning although they may have different names.
Secure Sockets Layer: The Secure Sockets Layer transport protocol provides data encryption, server authentication, message integrity, and optional client authentication for a TCP/IP connection.
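The properties listed above are visible in the Python standard library's ssl module, shown here only to make them concrete (this is not part of the thesis's system): a default client context requires the server's certificate, giving server authentication on top of encryption.

```python
# A default client-side SSL context verifies the server certificate
# (server authentication) and checks the hostname; the handshake then
# negotiates keys for data encryption and message integrity.
import ssl

ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
# True True
```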
World Wide Web: The World Wide Web is a collection of mutually referencing
hypertext documents scattered across the Internet, and serviced by HTTP servers.