
Knowledge Discovery Workbench for Exploring Business Databases

Gregory Piatetsky-Shapiro and Christopher J. Matheus
GTE Laboratories Incorporated, Waltham, Massachusetts 02254

We describe the Knowledge Discovery Workbench, an interactive system for database exploration. We then illustrate KDW capabilities in data clustering, summarization, classification, and discovery of changes. We also examine extracting dependencies from data and using them to order the multitude of data patterns. © 1992 John Wiley & Sons, Inc.

I. INTRODUCTION

To explore the different approaches to discovery in databases, we are developing a Knowledge Discovery Workbench (KDW). There is no single discovery method that is appropriate for all situations. Our ultimate goal is to integrate several discovery approaches, including data clustering, data visualization, summarization, classification, and discovery of changes (Fig. 1). Other systems that employ this integrated approach include INLEN [1] and ILS [2].

Our approach is to use domain knowledge throughout the system for providing the initial discovery focus, limiting search, evaluating the uncovered patterns, and presenting the discovered results in a meaningful way. We would also like to reuse the discovered knowledge within the system to allow for an incremental growth of knowledge and make the system more robust.

To handle large databases we rely on statistical estimation techniques [3,4]. Certain patterns in large databases can be reliably found by analyzing random samples. Statistical methods are also used to deal with pattern uncertainty.

Finally, our intention is to make KDW an interactive workstation for data exploration (see also Klosgen, this issue). A good combination of human judgment and intuition along with computer speed and memory offers a tremendous potential for discovery. This approach requires human-oriented, visual and natural-language based presentation of data and discovered knowledge.

In developing KDW, we have used several real database files containing hospital, insurance, and telephone customer data. Most of the examples in the rest of the article come from a file of patient summary data from a midsize hospital. This data was collected over several months in 1990 for 1000 patients classified into DRG (Diagnostic Related Group) 467. This is the most frequent DRG, referring to "Miscellaneous and Other Health-Related Factors."

[Figure 1. Architecture of a Knowledge Discovery Workbench: components such as summarization and dependency finding connect data to knowledge through human-oriented presentation.]

II. USING DOMAIN KNOWLEDGE IN KDW

Proper understanding of business databases requires significant domain knowledge about the aspects of the world captured in the data. This knowledge can be wide-ranging and can include economic/social/demographic trends, corporate policies, database design rules, legal requirements, and other restrictions of the real world. Some of this knowledge, appropriately encoded, can be used productively by a discovery system.

KDW uses several kinds of domain knowledge. The simplest kind can be called data-dictionary knowledge. It includes the database field descriptions, types, sizes, and possible values. This data comes from the database dictionary and is used to define a valid search space.

KDW can also use field-value taxonomies, which are groupings of field-value codes into classes. Such taxonomies provide a basis for more meaningful generalizations and human-oriented presentation. They can be obtained from data definition manuals, from relationships between different fields, or from domain experts.

At any time during a KDW discovery session, any subset of fields can be active. Which fields are active can be specified by selecting one of several prespecified models, and/or by manually clicking on the field name in the field selector menu. In addition, a prior statistical analysis phase excludes fields which show little or no variability in their values.

Finally, data fields typically have many interdependencies. These include the usual functional dependencies present in databases, such as ZIP-code determines area-code.* There is a much larger number of dependencies between values of the different fields, for example, if doctor-type = OBGYN then patient-sex = female. In addition, there are numerical equalities, PROFIT = REVENUE - EXPENSES, and inequalities, for example, PROFIT ≤ REVENUE.

*Even if the database was normalized, the discovery is usually done on a single data file, built from several original files. Thus the functional dependencies will manifest themselves as dependencies between fields in the same data file.

Some of these dependencies are fuzzy, for example, if admission-type = obstetrics, then the patient is of child-bearing age. Precise specification of this age is difficult. Instead, we would like to have a model of the distribution of ages within this range. Then, with some additional domain knowledge, the system would be able to detect, for example, that doctor Z has higher-risk patients, because he has a higher proportion of both older and younger patients. Such dependencies are not yet used by KDW.

A. Finding and Using Field Dependencies

One type of domain knowledge, field dependencies, is especially useful in organizing the potentially huge number of patterns that can be found in data. Several methods have recently been developed for the discovery of dependency networks. Cooper and Herskovits [5] describe a Bayesian algorithm for deriving a dependency network for discrete-valued fields. Methods for analyzing dependency networks and determining the directionality of links and the equivalence of different networks are presented in Geiger et al. [6]. A method for determining dependencies in numerical data is given in Glymour et al. [7]. Pearl and Verma [8] present a comprehensive approach to inferring causal models.

A problem with these approaches is that they rely on assumptions about the data distribution, such as normality and acyclicity of the dependency graph. They also do not provide a readily available quantitative measure of dependency and do not always indicate the direction of the dependence. We propose a method which directly and quantitatively measures the interdependence of data fields, does not make any distribution assumptions, and provides a direction to the dependency. This method is applicable to discrete-valued fields, but can also be applied to numeric fields which are discretized. The proposed method is (relatively) computationally efficient, having a complexity of O(F²N log N), where F is the number of fields and N is the number of data records.

The idea is to measure how much knowing the value of the first field can predict the value of the second field. Given two randomly selected tuples that agree on X value, we define the probabilistic dependency (pdep) of Y on X as the probability that those tuples also agree on Y value.

Formally, let X, Y be data fields, t₁, t₂ two randomly selected database tuples, and let t₁.X denote the X value in t₁. We define pdep(X, Y) as the conditional probability p(t₁.Y = t₂.Y | t₁.X = t₂.X).

Let N be the total number of tuples. Consider a data subset with X = x_i (whose size we denote f(x_i)), and let Y take values y_{i1}, ..., y_{ik} in that subset, with absolute frequencies f(y_{i1}), ..., f(y_{ik}), respectively. The measure pdep(X = x_i, Y) is the probability that two Y values randomly selected from that subset are equal, and is given by

$$\mathrm{pdep}(X = x_i, Y) = \sum_{j=1}^{k} p(Y = y_{i_j})^2 = \sum_{j=1}^{k} \left[ \frac{f(y_{i_j})}{f(x_i)} \right]^2 \qquad (1)$$

The dependence pdep(X, Y) is the proportional sum of pdep(X = x_i, Y) over all x_i, and is computed as

$$\mathrm{pdep}(X, Y) = \sum_{i} p(x_i)\,\mathrm{pdep}(X = x_i, Y) \qquad (2)$$

How well X can predict Y also depends on Y. If, for example, all (or almost all) values of Y are the same, then any X would be a good predictor of Y. We can measure this more precisely by defining pdmin(Y) as the probability that any two randomly selected tuples have equal Y values. We have pdmin(Y) = Σ_j p(Y = y_j)². Intuitively, any additional knowledge of X should only increase our ability to predict Y, and thus we should always have pdep(X, Y) ≥ pdmin(Y) (see Ref. 8a).

To account for the relationship between pdep(X, Y) and pdmin(Y) we can normalize pdep using a proportional reduction in variation, which results in Goodman and Kruskal's [8b] τ (tau) measure of association:

$$\tau(X, Y) = \frac{\mathrm{pdep}(X, Y) - \mathrm{pdmin}(Y)}{1 - \mathrm{pdmin}(Y)} \qquad (3)$$

The τ measure is always between 0 and 1. If τ(X, Y) > τ(Y, X), we infer that X → Y, and vice versa.
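For concreteness, these definitions translate directly into code. The following is a minimal Python sketch of ours (the article does not show the KDW implementation) that computes pdep, pdmin, and τ from two parallel columns of field values:

    from collections import Counter

    def pdep(xs, ys):
        # pdep(X, Y): probability that two random tuples agreeing on X
        # also agree on Y.  Eqs. (1)+(2) collapse to
        # (1/N) * sum over (x, y) of f(x, y)^2 / f(x).
        n = len(xs)
        fx = Counter(xs)                 # f(x_i)
        fxy = Counter(zip(xs, ys))       # joint frequencies f(x_i, y_j)
        return sum(c * c / fx[x] for (x, _), c in fxy.items()) / n

    def pdmin(ys):
        # pdmin(Y) = sum_j p(y_j)^2: chance two random tuples agree on Y.
        n = len(ys)
        return sum((c / n) ** 2 for c in Counter(ys).values())

    def tau(xs, ys):
        # Goodman-Kruskal tau, Eq. (3); assumes Y is not constant.
        pm = pdmin(ys)
        return (pdep(xs, ys) - pm) / (1.0 - pm)

For example, with xs holding doctor-type values and ys holding patient-sex values, tau(xs, ys) > tau(ys, xs) would suggest the direction doctor-type → patient-sex.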

After computing a particular value of pdep(X, Y), we want to know its significance. Goodman and Kruskal [8b] have developed an asymptotic sampling theory for cases when the data is a random sample from a larger population. For other cases, which are more typical, we can use the idea of randomization testing. Consider a file with only two fields X and Y, and let pdep(X, Y) = pd₀. We can randomly permute the Y values while keeping the X values in place. We can then estimate the probability of pdep(X, Y) ≥ pd₀ as the percentage of permutations where pdep(X, Y) ≥ pd₀, assuming that all permutations of Y values are equally likely.

Elsewhere [8a], we showed how to use the randomization approach to analyze, both analytically and experimentally, the significance of pdep(X, Y) given pdmin(Y). We determined that the χ² measure can be used to closely approximate the significance of pdep(X, Y). We also found that under randomization, the expected value of pdep(X, Y) is (where Xdist is the number of distinct X values)

$$E[\mathrm{pdep}(X, Y)] = \mathrm{pdmin}(Y) + [1 - \mathrm{pdmin}(Y)] \cdot \frac{X_{\mathrm{dist}} - 1}{N - 1} \qquad (4)$$
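The randomization test itself is easy to sketch (again ours, reusing pdep and pdmin from the sketch above; the number of trials is an arbitrary choice):

    import random

    def pdep_significance(xs, ys, trials=1000, seed=0):
        # Fraction of random permutations of the Y column whose pdep
        # meets or exceeds the observed value; a small fraction means
        # the observed dependency is unlikely under the null.
        rng = random.Random(seed)
        observed = pdep(xs, ys)
        ys = list(ys)
        hits = 0
        for _ in range(trials):
            rng.shuffle(ys)              # permute Y, keeping X in place
            if pdep(xs, ys) >= observed:
                hits += 1
        return hits / trials

    def expected_pdep(xs, ys):
        # Expected pdep under randomization, Eq. (4).
        n = len(xs)
        x_dist = len(set(xs))            # number of distinct X values
        pm = pdmin(ys)
        return pm + (1.0 - pm) * (x_dist - 1) / (n - 1)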


[Figure 2. An example of a dependency network in a real database, linking Length of Stay Norm, Expected Reimbursement, Insurance Plan, Primary Carrier, and Reimbursement Procedure, with link strengths between 0.94 and 1.0.]

To combine the interfield dependencies into a network (see Fig. 2) we use the following heuristic method (see also Ref. 8c). For each pair of active fields X, Y, if they are not independent at the desired level of significance according to the χ² test, then:

1. If τ(X, Y)/τ(Y, X) > 1 + ε, add link X → Y, strength = pdep(X, Y).
2. If τ(Y, X)/τ(X, Y) > 1 + ε, add link Y → X, strength = pdep(Y, X).
3. If 1/(1 + ε) ≤ τ(X, Y)/τ(Y, X) ≤ 1 + ε, add both links X → Y and Y → X.

Currently, we use ε = 0.01.
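A compact Python sketch of this heuristic (ours; the χ² independence screen is left as a stub, and the ratio tests are written as cross-multiplications to avoid division by zero):

    from itertools import combinations

    def dependency_links(data, fields, eps=0.01):
        # data maps field name -> column (parallel lists of values).
        links = []
        for x, y in combinations(fields, 2):
            # (a chi-square independence screen at the desired
            #  significance level belongs here)
            txy = tau(data[x], data[y])
            tyx = tau(data[y], data[x])
            if txy > (1 + eps) * tyx:                        # X -> Y
                links.append((x, y, pdep(data[x], data[y])))
            elif tyx > (1 + eps) * txy:                      # Y -> X
                links.append((y, x, pdep(data[y], data[x])))
            else:                                            # both ways
                links.append((x, y, pdep(data[x], data[y])))
                links.append((y, x, pdep(data[y], data[x])))
        return links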

Another problem in determining dependency networks is the elimination of transitive links. Given links X → Y, Y → Z, and X → Z, the last link should be eliminated if it is implied by the first two. A simple special case is when the strength of all links is equal to 1; then X → Z logically follows from X → Y, Y → Z. When the strength of all links is close to 1, then X → Z is probabilistically implied by the other links. A rigorous study of this problem has been done by Pearl [8c].

The dependency network can also be pruned manually, using domain expert knowledge about the dependent and independent variables. The data dictionary information on the primary and secondary keys can also be used.

Finally, an interesting possibility is to analyze not only the overall dependency measure between X and Y, but also the dependencies between the individual field values. Since there may be many distinct values, we need to cluster values with similar distributions. We are currently investigating this approach.

III. AN ANNOTATED DISCOVERY SESSION

KDW has a graphical, menu-driven, point-and-click interface, designed for ease of use. After the user selects a dataset, a data description and default discovery focus appear on the screen. The discovery focus, such as Revenue/Expenses, activates fields related to the focus and deactivates other fields. In addition to field relevance information, the focus also stores domain knowledge representing expectations about the way the data might be clustered and summarized. This knowledge is used to filter out known or obvious (and therefore uninteresting) patterns from the unexpected and noteworthy.

[Figure 3. KDW interface.]

There are many different types of patterns that can be found in data. KDW handles three types of discovery tasks: Clustering, Summarization, and Classification. They are illustrated in the following annotated session.

We are also exploring two more discovery tasks: discovery of anomalies and discovery of changes. Some preliminary ideas on these tasks are given in the following section.

A. Clustering and Data Visualization in KDW

The first task in the discovery process is to identify useful classes of instances. Unless these classes are prespecified by the domain knowledge, they must be discovered. This can be done either automatically using clustering algorithms or manually through the user’s interaction with data visualization tools. The KDW provides both capabilities.

The clustering algorithm currently implemented in KDW is designed specifically to detect multiple linear groupings of data points. A standard linear regression will not work in such cases, because the data fit not one line but several lines. Linear relationships are common in the data we are working with, especially among fields containing dollar amounts. Figure 4 shows an example of the types of linear clusters we are interested in detecting. In this example, the distribution of 250 data points is plotted over the two fields Expected Reimbursement and Gross Charges.

[Figure 4. Plot of Expected Reimbursement versus Gross Charges for 250 data points, showing multiple linear clusters.]

KDW's algorithm for linear clustering is as follows:

Linear Clustering Algorithm:
1. Examine all pairs of points. For each pair, compute the slope and intercept of the line between them, rounding off to the desired precision.
2. Group pairs with very similar (within some tolerance limit) slope and intercept into lines.
3. Discard lines with fewer than N points (where N is a parameter equal to or greater than 3).
4. Return the lines, ordered by decreasing number of points.
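A minimal Python sketch of this algorithm (ours; exact match after rounding stands in for the tolerance limit of step 2):

    from collections import defaultdict

    def linear_clusters(points, precision=2, min_points=3):
        # Bucket point pairs by rounded (slope, intercept); each bucket
        # collects the points lying (approximately) on one line.
        lines = defaultdict(set)
        for i, (x1, y1) in enumerate(points):
            for x2, y2 in points[i + 1:]:
                if x1 == x2:
                    continue                       # skip vertical pairs
                slope = round((y2 - y1) / (x2 - x1), precision)
                intercept = round(y1 - slope * x1, precision)
                lines[(slope, intercept)].update({(x1, y1), (x2, y2)})
        # Steps 3-4: drop sparse lines, order by decreasing support.
        found = [(line, pts) for line, pts in lines.items()
                 if len(pts) >= min_points]
        return sorted(found, key=lambda kv: len(kv[1]), reverse=True)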

Application of this algorithm to the data in Figure 4 has produced three lines:

• Expected Reimbursement = 19.74
• Expected Reimbursement = Gross Charges
• Expected Reimbursement = 0.54 * Gross Charges

The running time of this algorithm is O(N²), where N is the number of points. For a large number of points, an even faster version will take a random sample of points and compute all the lines in the sample. Then all new points will be matched against the found lines.

This algorithm is very simple and is limited to detecting line clusters. A possible refinement is to perform linear regression on each identified cluster. We intend to extend KDW's clustering capabilities as the need for more advanced methods arises.

In addition to automatic clustering algorithms, we expect the user to provide guidance in identifying classes of instances relevant to a discovery session. To assist the user in identifying useful clusters, KDW includes visualization tools for displaying and manipulating multidimensional data. These tools are based on a pair of public domain plotting/visualization packages, gnuplot and xlispstat, that have been integrated into the KDW environment. These tools permit the user to plot multiple fields, focus in on subsets of the data, and apply basic statistical algorithms to the data.


B. Data Summarization

Data summarization is the process of deriving a characteristic summary of a data subset that is interesting with respect to domain knowledge and the full data file. A related database research area is concerned with intensional query answers [9].

Summarization of a concept A in KDW is performed by scanning all tuples that satisfy A and computing, for all fields in parallel, statistics on their values. For a numeric field we currently compute its range of values, and for a nominal field a list of values, up to a user-determined limit. After all the instances have been scanned, we have a condition Cond(B) on each field B. For nominal fields it has the form B = b₁, ..., b_k, while for numeric fields it is b_min ≤ B ≤ b_max. This can also be interpreted as a rule: if A then Cond(B).

We measure the quality of this rule using the φ measure for 2×2 contingency tables [10]. Let a be the size of A, b be the size of Cond(B) (estimated using precomputed statistics [3]), and N be the total number of records. Since every tuple satisfying A satisfies Cond(B) by construction, the entries in this 2×2 table would be

                      A        not A      total
    Cond(B)           a        b - a      b
    not Cond(B)       0        N - b      N - b
    total             a        N - a      N

and the φ measure is

$$\phi = \frac{a(N - b)}{\sqrt{ab(N - a)(N - b)}}$$

The φ measure varies from zero (no correlation) to 1 (perfect match between A and Cond(B)). The statistical significance of the rule is estimated by using the χ² measure with one degree of freedom on the above table, or Fisher's exact test for small N.
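As an illustration of this process, a minimal Python sketch (ours; the value-list limit and the significance test are omitted):

    import math

    def summarize(records, concept, fields):
        # Collect a per-field condition Cond(B) over the tuples
        # satisfying the concept A: a (min, max) range for numeric
        # fields, the set of observed values for nominal fields.
        subset = [r for r in records if concept(r)]
        conds = {}
        for f in fields:
            vals = [r[f] for r in subset]
            if all(isinstance(v, (int, float)) for v in vals):
                conds[f] = (min(vals), max(vals))
            else:
                conds[f] = sorted(set(vals))
        return subset, conds

    def phi(a, b, n):
        # phi for the 2x2 table above: a = |A|, b = |Cond(B)|,
        # n = total records; assumes 0 < a <= b < n.
        return a * (n - b) / math.sqrt(a * b * (n - a) * (n - b))

Here concept could be, for example, lambda r: r["Expected Reimbursement"] == 19.74, matching the session below.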

For our example dataset, we have summarized the concept Expected Reimbursement = 19.74 with the following results:

    ** 999 records, 58 active fields
    summarized 181 instances in 400 milliseconds

    Field                          Value      Phi     Significance
    Primary Carrier is             2020       1.0     1.0
    Insurance Plan is              22         1.0     1.0
    Length of Stay Norm is         4          1.0     1.0
    Reimbursement Procedure is     DGRS556    0.993   >0.99999

Previously (see Fig. 2), we derived a dependency network for the fields Insurance Plan, Length of Stay Norm, Primary Carrier, and Reimbursement Procedure. From that network we see that the Primary Carrier field determines the values of the other three fields. This can be used to simplify the derived summarization: if we can assume the user knows the inherent data dependencies, then only the first condition should be presented to the user.

Other statistical methods of identifying differences can also be used. For numeric fields, we can compare the means and standard deviations of the fields in the subset with those in the full data set, and report the fields with sufficiently significant differences. For nominal fields, a χ² test on the most frequent values can be used. We are currently examining ways to integrate these methods with domain knowledge.

C. Classification

Given two or more clusters of data, a user might be interested in discovering characteristics that can be used to classify data points into one of the clusters. For example, on the plot in Figure 4 there are three linear clusters. A user might want to know which characteristics determine into which cluster a point will fall.

To answer this question KDW can construct a decision tree for classifying instances into two or more classes defined by a target concept. The target concept is simply a function that labels all points as belonging to one of the predefined clusters.

The decision tree algorithm uses an information theoretic measure for selecting the decision variables in the same way as ID3 [11]; a minimal sketch of this criterion appears after the rules below. The actual implementation is based on Frawley's Function Based Induction (FBI) algorithm [12]. The output from the tree (see Fig. 4) is converted into rules for presentation to the user:

Discovered Rules:

If Primary Carrier is 1040 Then Expected Reimbursement = 0.54 * Charges

If Primary Insurance Carrier is 2020 Then Expected Reimbursement = 19.74

If Primary Carrier is one of 21 3010 4010 4020 5001 5020 5030 Then Expected Reimbursement = Charges

If Primary Carrier is one of 0 5090 6020 6030 Then the classification is undetermined
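For illustration, a minimal information-gain computation in Python (ours; the article does not show the FBI implementation):

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels, in bits.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(column, labels):
        # Reduction in class entropy from splitting on one field, the
        # ID3-style criterion for choosing the next decision variable.
        n = len(labels)
        parts = {}
        for value, cls in zip(column, labels):
            parts.setdefault(value, []).append(cls)
        remainder = sum(len(p) / n * entropy(p) for p in parts.values())
        return entropy(labels) - remainder

The field with the largest gain becomes the next decision variable, and one rule is read off each leaf of the finished tree, as above.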

IV. OTHER DISCOVERY TASKS

An important task in analysis of business data is looking for the unusual. Things can be unusual in two ways: they can be different from the norm or they can be different from their previous values. While similar, these two tasks call for somewhat different approaches. Below, we examine them in more detail.

Page 10: Knowledge discovery workbench for exploring business databases

684 PIATETSKY-SHAPIRO AND MATHEUS

A. Discovery of Anomalies

A typical example of that task is to compare a group of similar doctors (or salesmen, or high-tech companies) and to identify those who stand apart from the average, either in a positive or a negative way. For example, an HMO director may look for obstetricians who perform more Caesarean sections than average. The problem with this straightforward approach is that the obstetrician with more C-sections may have patients who are more at risk for C-section, such as older women, or those who had a prior C-section. Thus, the C-section rate of a doctor should be adjusted for various risk factors of the patients, before comparing it to the average.

Current methods for risk adjustment usually rely on a logistic regression which produces an individual odds ratio for each risk factor; for example, if a woman had a prior C-section, then the chances of a second C-section increase X times, and if a woman is over 30, then the chances increase Y times. When considering a combination of factors, this method assumes that the factors are independent, and estimates that a woman over 30 with a prior C-section is XY times more likely to have it again. The independence assumption is usually not true, and other methods, such as belief networks, are needed to solve this problem.
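A tiny worked example of this multiplicative assumption (all numbers hypothetical):

    # Hypothetical odds ratios from a logistic regression:
    prior_csection_or = 3.0   # X: prior C-section multiplies the odds by 3
    over_30_or = 1.5          # Y: age over 30 multiplies the odds by 1.5

    base_odds = 0.25          # assumed baseline odds of a C-section

    # Under the independence assumption the factors simply multiply:
    combined_odds = base_odds * prior_csection_or * over_30_or   # X * Y
    probability = combined_odds / (1 + combined_odds)
    print(f"odds {combined_odds:.3f} -> probability {probability:.2f}")
    # odds 1.125 -> probability 0.53; if the two factors are correlated
    # (older women more often have a prior C-section), this overstates
    # the true combined risk.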

B. Discovery of Changes

Discovery of changes in operational data is important in any business, but especially in marketing and sales. A typical question asked by a marketer may be: What changes have occurred as a result of the recent price changes/promotions?

Such discovery is the goal of the recently developed systems CoverStory [13] and Spotlight [14]. They analyze supermarket scanner data by comparing product sales in the latest period with the previous one. In addition to overall figures, the sales are broken down by geographical regions and by subproducts. Whenever changes are found, they are explained in terms of a predefined, expert-derived causal model. A very nice feature of both systems is reporting the results as a business memo, produced with template-based natural language generation and standard business graphics. While limited to one type of discovery, both CoverStory and Spotlight are elegant and quite successful, and are currently used in a number of retail companies.

We have tested these ideas on hospital data. We used the χ² two-way classification test to compare all the discrete-valued fields for the first and second quarters in 1990, each quarter having about 500 records (a sketch of this comparison appears after the report below). We found a number of significant changes, among them:

    The values of SERVICE TYPE are different with significance 0.999999.
    The biggest changes are:
      proportion of SERVICE TYPE is Laboratory declined to 17.0% (86) from 32.8% (162)
      proportion of SERVICE TYPE is Radiology grew to 47.1% (238) from 28.3% (140)
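Such a comparison can be sketched in Python (ours, assuming SciPy's chi-square test on the 2×k contingency table of value counts per quarter):

    from collections import Counter
    from scipy.stats import chi2_contingency

    def compare_periods(values_q1, values_q2):
        # Build a 2 x k table of value counts for the two periods and
        # test whether the value distribution changed between them.
        c1, c2 = Counter(values_q1), Counter(values_q2)
        values = sorted(set(c1) | set(c2))
        table = [[c1[v] for v in values], [c2[v] for v in values]]
        chi2, p, dof, _ = chi2_contingency(table)
        significance = 1 - p
        # Per-value proportion shifts, for reporting the biggest changes.
        n1, n2 = len(values_q1), len(values_q2)
        shifts = {v: (c1[v] / n1, c2[v] / n2) for v in values}
        return significance, shifts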


[Figure 5. Data dependency graph, including the Patient Sex field.]

    The values of PATIENT SEX are different with significance 0.9999.
      proportion of PATIENT SEX = Female grew to 71.7% (362) from 57.7% (285)
      proportion of PATIENT SEX = Male declined to 28.3% (143) from 42.3% (209)

A prior analysis of dependencies gave the graph in Figure 5.

By analyzing the data on physicians whose practice patterns have changed between quarters, the system found doctor XX, who treated 8% (42) of patients last quarter but none in the current quarter. Summarization of XX records has revealed that he specialized in laboratory tests on male patients, and admitted on Thursdays at 3 P.M. Thus, the system was able to report:

    These changes are partly explained by doctor XX.
    In Q1 1990 doctor XX had 42 patients
    with PATIENT SEX = Male and SERVICE TYPE = Laboratory,
    while in Q2 1990 doctor XX had 0 patients.

The admission day of the week and time were judged to be irrelevant and not reported.

While this method has not been explored fully by us, it is a promising approach to making sense of the multitude of changes in data.

V. FUTURE DIRECTIONS

We have presented the Knowledge Discovery Workbench, an interactive, integrated environment for multiparadigm discovery in business databases. Our plans for technical development call for better integration of the existing modules, and for making more consistent use of the available domain knowledge, especially the field dependency networks, strong rules, and field-value taxonomies. We are also considering some recently developed algorithms for constructive induction [15,16].

Another priority is the integration of statistical methods, such as multivariate analysis, with knowledge-based approaches, with applications mainly to the discovery of changes.

While the current interactive interface is appropriate for short sessions, longer and more complex sessions require a way to record and manipulate long sequences of commands. Ultimately, we would like to develop a Database Discovery Language, with capabilities that combine the standard operations, the discovery tasks, and the graphical abilities.

The final test of practical usefulness is, of course, the actual use of the system. We expect that actual applications will include only a subset of the presented techniques, packaged with an appropriate user-friendly interface and a link to databases. The potential applications of the discovery techniques to medical, retail, and cellular call data have received a very positive response from a number of GTE business units and are now in the pilot development stage.

We are grateful to Bud Frawley, Mary McLeish, Samy Uthurusamy, and Jan Zytkow for their comments, and to Shri Goyal and Bill Griffin for their encouragement and support.

References

1. K. Kaufman, R. Michalski, and L. Kerschberg, "Mining for knowledge in databases: Goals and general description of the INLEN system," in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W. Frawley (Eds.), AAAI/MIT Press, 1991.

2. B. Silver, W. Frawley, G. Iba, J. Vittal, and K. Bradford, "ILS: A framework for multi-paradigmatic learning," in Proceedings of the 7th International Conference on Machine Learning, Morgan Kaufmann, San Mateo, CA, 1990, pp. 348-354.

3. G. Piatetsky-Shapiro, "Discovery and analysis of strong rules in databases," in Proceedings of Advanced Database System Symposium, Kyoto, Japan, 1989.

4. G. Piatetsky-Shapiro, "Discovery, analysis and presentation of strong rules," in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W. Frawley (Eds.), AAAI/MIT Press, 1991.

5. G. Cooper and E. Herskovits, A Bayesian Method for the Induction of Probabilistic Networks from Data, Stanford Knowledge Systems Laboratory Report KSL-91-02, 1991.

6. D. Geiger, A. Paz, and J. Pearl, "Learning causal trees from dependence information," in Proceedings of AAAI-90, pp. 770-776.

7. C. Glymour, R. Scheines, P. Spirtes, and K. Kelly, Discovering Causal Structure, Academic Press, Orlando, FL, 1987.

8. J. Pearl and T.S. Verma, "A theory of inferred causation," in Proceedings of the 2nd Int. Conf. on Principles of Knowledge Representation and Reasoning, Morgan Kaufmann, San Mateo, CA, pp. 441-452.

8a. G. Piatetsky-Shapiro, "Probabilistic data dependencies," in Proceedings of the Machine Discovery Workshop, Aberdeen, July 1992 (unpublished).

8b. L.A. Goodman and W.H. Kruskal, Measures of Association for Cross Classifications, Springer-Verlag, New York, 1979.

8c. J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, CA, 1988.

9. T. Imielinski, "Intelligent query answering in rule-based systems," Journal of Logic Programming, 4, 229-257 (1987).

10. R.L. Iman and W.J. Conover, A Modern Approach to Statistics, Wiley, New York, 1983.

11. J.R. Quinlan, "Induction of decision trees," Machine Learning, 1(1), 81-106 (1986).

12. W. Frawley, "Using functions to encode domain and contextual knowledge in statistical induction," in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W. Frawley (Eds.), AAAI/MIT Press, 1991.

13. J. Schmitz, G. Armstrong, and J.D.C. Little, "CoverStory: Automated news finding in marketing," in DSS Transactions, Linda Volino (Ed.), The Institute of Management Sciences, Providence, RI, 1990, pp. 46-54.

14. T. Anand and G. Kahn, "SPOTLIGHT: A data explanation system," in Proceedings of CAIA-92, IEEE Computer Society, Washington, DC, 1992.

15. C.J. Matheus, "Adding domain knowledge to SBL through feature construction," in Proceedings of AAAI-90, 1990, pp. 803-808.

16. R. Sutton and C.J. Matheus, "Learning polynomials by linear regression and feature construction," in Proceedings of Machine Learning '91, pp. 208-212 (unpublished).