
Introduction

Gregory Piatetsky-Shapiro
GTE Laboratories Incorporated, 40 Sylvan Rd., Waltham, Massachusetts 02254

International Journal of Intelligent Systems, Vol. 7, 587-589 (1992)

The rapid growth of data and information creates both a need and an opportunity for extracting knowledge from databases. Scientific projects such as Earth observation satellites or human genome decoding are already producing gigabytes (and soon terabytes) of data. Increasing computerization of all aspects of business creates very large databases that can be analyzed to reveal important business knowledge. Data is also growing in complexity, with the addition of object-oriented databases, CAD/CAM, large knowledge bases, multimedia, and other nonrelational data.

Knowledge Discovery in Databases (KDD), defined as the nontrivial and efficient extraction of interesting high-level patterns from data, is an area of common interest for researchers in machine learning, statistics, intelligent databases, knowledge acquisition for expert systems, and data visualization. These researchers are united by the common KDD threads of discovering patterns, dealing with large amounts of data, handling noisy and incomplete information, using domain knowledge, and presenting the findings in an interactive, human-oriented way.

These KDD issues overlap with, yet differ from, issues in scientific discovery, which is concerned with finding scientific laws in observations of natural phenomena (see Ref. 1). Figure 1 illustrates the differences among various discovery domains with respect to attribute completeness (high when all relevant attributes are measured) and instance density (high when all possible instance values have been observed).
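
To make these two measures concrete, the sketch below computes rough versions of them for a tiny, invented table: attribute completeness as the fraction of relevant attributes that are actually recorded, and instance density as the fraction of possible value combinations that actually occur. The column names, the list of relevant attributes, and the attribute domains are all hypothetical, and the coverage-ratio formulation is only one simple reading of Figure 1, not a definition taken from the article.

    # A rough sketch (not from the article) treating both measures in Figure 1
    # as simple coverage ratios over a small, invented relational table.
    from itertools import product

    # Hypothetical data: each row is one observed instance.
    rows = [
        {"region": "east", "product": "A", "outcome": "paid"},
        {"region": "west", "product": "A", "outcome": "default"},
        {"region": "east", "product": "B", "outcome": "paid"},
    ]

    # Hypothetical domain knowledge: which attributes matter, and the values
    # each recorded attribute can take. "income" is relevant but not recorded.
    relevant_attributes = {"region", "product", "outcome", "income"}
    domains = {
        "region": {"east", "west"},
        "product": {"A", "B"},
        "outcome": {"paid", "default"},
    }

    # Attribute completeness: fraction of relevant attributes actually measured.
    measured = set(rows[0]) & relevant_attributes
    attribute_completeness = len(measured) / len(relevant_attributes)

    # Instance density: fraction of possible value combinations (over the
    # measured attributes) that actually occur in the data.
    attrs = sorted(measured)
    observed = {tuple(row[a] for a in attrs) for row in rows}
    possible = list(product(*(domains[a] for a in attrs)))
    instance_density = len(observed) / len(possible)

    print(f"attribute completeness = {attribute_completeness:.2f}")  # 0.75
    print(f"instance density       = {instance_density:.2f}")        # 0.38

On this toy table, completeness is 3/4 (income is missing) and density is 3/8 of the possible value combinations; as discussed below, business databases typically score low on both measures, while scientific data scores high.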

Scientific discovery usually deals with accurately measured numerical data, collected as a result of a well-designed experiment intended to limit the number of observed phenomena. The control over data collection (higher in experimental sciences, such as physics, and lower in observational sciences, such as astronomy) means that usually most of the relevant attributes are measured, and the control over the setting of independent variables means that most of the points in the potential instance space can be reached (high instance density). Scientific laws usually take the form of precise, quantitative, and concise formulas.

Figure 1. Scientific discovery versus database discovery. [Figure not reproduced; its axes are instance density and attribute completeness.]

Discovery in nonscientific databases usually deals with both numerical and categorical data. If the data was collected or entered by humans, it typically contains many errors. The data reflects the chaotic nature of the real world and is influenced by many different phenomena. The data is typically collected for some other business purpose, such as billing, and rarely has all the relevant fields (low attribute completeness). The database contains only a few of the rare cases (by definition), and most combinations of data values are not present (low instance density). It is not always possible to set the independent variables to the desired values, although that is what market researchers are trying to do. The discovered patterns are typically imprecise, qualitative, and rarely concise. Finally, business data frequently contains sensitive personal information, and discovery in such data may pose ethical or legal problems, such as invasion of privacy.

Despite these difficulties, recent years have seen a surge of interest in knowledge discovery in databases, in terms of research (as evidenced by the IJCAI-89 and AAAI-91 KDD workshops), commercially available tools for data analysis, and reported applications. The recent efforts in KDD span a variety of technical approaches, including decision trees, data dependency networks, discovery of quantitative laws, statistical methods, knowledge acquisition tools, and hybrid systems (see Ref. 2). Applications of KDD have been reported in such diverse areas as astronomy, drug side effects, loan analysis, LAN management, credit approval, insurance fraud detection, and molecular biology.

This special issue is based on extended versions of papers selected from the AAAI-91 workshop on Knowledge Discovery in Databases, held 14 July 1991 in Anaheim, California.

The first three articles deal with the discovery of dependencies between data attributes, taking database (first article), statistical (second article), and AI (third article) approaches. Kantola, Mannila, Räihä, and Siirtola present methods for the discovery of functional and inclusion dependencies; these methods work well when the set of attributes is relatively complete. Scheines and Spirtes analyze the problem of finding latent (missing) variables in data, offering a way to detect the incompleteness of the given attributes.
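
As a concrete illustration of the first of these topics, here is a minimal sketch, not the authors' algorithm, of a brute-force check for a single candidate functional dependency X -> Y: the dependency holds if any two rows that agree on the X attributes also agree on the Y attributes. The holds_fd function, the claims table, and its attribute names are invented for this example.

    # A minimal sketch (not the authors' algorithm): brute-force test of one
    # candidate functional dependency lhs -> rhs over a list of rows.
    from typing import Dict, List, Sequence, Tuple

    def holds_fd(rows: List[Dict[str, object]],
                 lhs: Sequence[str], rhs: Sequence[str]) -> bool:
        """True if all rows agreeing on the lhs attributes also agree on rhs."""
        seen: Dict[Tuple, Tuple] = {}
        for row in rows:
            key = tuple(row[a] for a in lhs)
            val = tuple(row[a] for a in rhs)
            if key in seen and seen[key] != val:
                return False      # same lhs values, different rhs values
            seen[key] = val
        return True

    # Invented example table.
    claims = [
        {"policy": 1, "holder": "Ann", "state": "MA"},
        {"policy": 2, "holder": "Bob", "state": "MA"},
        {"policy": 1, "holder": "Ann", "state": "MA"},
    ]

    print(holds_fd(claims, ["policy"], ["holder"]))  # True:  policy -> holder
    print(holds_fd(claims, ["state"], ["holder"]))   # False: state does not determine holder

Real dependency-discovery systems must prune the exponential space of candidate dependencies rather than enumerate them, but each candidate test has roughly this flavor.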

Shen addresses the discovery of regularities in a knowledge base (CYC). This topic will undoubtedly become more important as knowledge bases grow and become impossible to examine manually.

While the above methods work autonomously, the next article, by Grinstein, Sieg, Smith, and Williams, presents a visualization approach that is inherently interactive and relies on human perceptual abilities for discovery. Visualization methods work well on data with high instance density.

A debate arose during the AAAI-91 KDD workshop on whether the goal of KDD should be human-assisted computer discovery or computer-assisted human discovery. I believe that both approaches have their merits. The first approach, with the ultimate goal of automated computer discovery, is a highly desirable but long-term effort, while the second approach of computer-assisted human discovery has the goal of improving human-computer synergy. The latter approach, while less ambitious, is also more likely to produce useful systems in the near future.

Both approaches are merged in the following two articles, which present integrated systems that combine autonomous and interactive discovery. Klösgen describes EXPLORA, a system for statistical interpretation of survey and marketing data. Piatetsky-Shapiro and Matheus present a Knowledge Discovery Workbench for exploring business databases. Such integrated, interactive discovery systems are likely to become an essential part of any intelligent data analysis.

In the last article, Major and Riedinger describe an important deployed application for insurance fraud detection. What makes this application especially interesting is that no experts existed who knew how to analyze the electronic claims, making discovery from data the only way to solve the problem.

I am grateful to Ronald Yager for suggesting and encouraging this special issue. Many thanks to Bud Frawley, Chris Matheus, and Jan Zytkow, who helped to shape these ideas, and to Shri Goyal for his encouragement and support.

References

1. P. Langley, H. Simon, G. Bradshaw, and J. Zytkow, Scientific Discovery: Computational Explorations of the Creative Processes, MIT Press, Cambridge, MA, 1987.

2. G. Piatetsky-Shapiro and W. Frawley (Eds.), Knowledge Discovery in Databases, MIT/AAAI Press, Cambridge, MA, 1991.