Cover Feature

Mining Very Large Databases

Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, University of Wisconsin-Madison

The explosive growth of databases makes the scalability of data-mining techniques increasingly important. The authors describe algorithms that address three classical data-mining problems.


Established companies have had decades to accumulate masses of data about their customers, suppliers, and products and services. The rapid pace of e-commerce means that Web startups can become huge enterprises in months, not years, amassing proportionately large databases as they grow. Data mining, also known as knowledge discovery in databases,1 gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making.

Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. Retrieving data from disk is markedly slower than accessing data in RAM. Thus, to be efficient, the data-mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if—given a fixed amount of main memory—its runtime increases linearly with the number of records in the input database.

Recent work has focused on scaling data-mining algorithms to very large data sets. In this survey, we describe a broad range of algorithms that address three classical data-mining problems: market basket analysis, clustering, and classification.

MARKET BASKET ANALYSIS

A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity—for example, a customer's visit to a grocery store or an online purchase from a virtual store such as Amazon.com. Retailers accumulate huge collections of transactions by recording business activity over time. One common analysis run against a transactions database is to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it. Businesses can use knowledge of these patterns to improve the placement of items in a store or the layout of mail-order catalog pages and Web pages.

An itemset containing i items is called an i-itemset. The percentage of transactions that contain an itemset is called the itemset's support. For an itemset to be interesting, its support must be higher than a user-specified minimum; such itemsets are said to be frequent.

Figure 1 shows three transactions stored in a relational database system. The database has five fields: a transaction identifier, a customer identifier, the item purchased, its price, and the transaction date. The first transaction shows a customer who bought a computer, MS Office, and Doom. As an example, the 2-itemset {hard disk, Doom} has a support of 67 percent.
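The support of an itemset can be computed in a single pass over the transactions. The short Python sketch below illustrates the calculation on the data of Figure 1; the transaction encoding and the function name are our own, not part of the algorithms surveyed here.

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in itemset."""
        itemset = set(itemset)
        hits = sum(1 for t in transactions if itemset <= set(t))
        return hits / len(transactions)

    # Transactions from Figure 1 (items only).
    transactions = [
        {"computer", "MSOffice", "Doom"},
        {"hard disk", "Doom"},
        {"computer", "hard disk", "Doom"},
    ]

    print(support({"hard disk", "Doom"}, transactions))  # 0.666... (67 percent)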

Why is finding frequent itemsets a nontrivial problem? First, the number of customer transactions can be very large and usually will not fit in memory. Second, the potential number of frequent itemsets is exponential in the number of different items, although the actual number of frequent itemsets can be much smaller. The example in Figure 1 shows four different items, so there are 2^4 − 1 = 15 potential frequent itemsets. If the minimum support is 60 percent, only five itemsets are actually frequent. Thus, we want algorithms that are scalable with respect to the number of transactions and examine as few infrequent itemsets as possible. Efficient algorithms have been designed to address these criteria. The Apriori algorithm2 provided one early solution, which subsequent algorithms built upon.

APRIORI ALGORITHM

This algorithm computes the frequent itemsets in several rounds. Round i computes all frequent i-itemsets. A round has two steps: candidate generation and candidate counting. Consider the ith round. In the candidate generation step, the algorithm generates a set of candidate i-itemsets whose support has not yet been computed. In the candidate counting step, the algorithm scans the transaction database, counting the supports of the candidate itemsets.


After the scan, the algorithm discards candidates with support lower than the user-specified minimum and retains only the frequent i-itemsets.

In the first round, the generated set of candidate itemsets contains all 1-itemsets. The algorithm counts their support during the candidate counting step. Thus, after the first round, all frequent 1-itemsets are known. What are the candidate itemsets generated during the candidate generation step of round two? Naively, all pairs of items are candidates. Apriori reduces the set of candidate itemsets by pruning—a priori—those candidate itemsets that cannot be frequent, based on knowledge about infrequent itemsets obtained from previous rounds. The pruning is based on the observation that if an itemset is frequent, all its subsets must be frequent as well. Therefore, before entering the candidate counting step, the algorithm can discard every candidate itemset with a subset that is infrequent.

Consider the database in Figure 1. Assume that the minimum support is 60 percent—so an itemset is frequent if it is contained in at least two transactions. In round one, all single items are candidate itemsets and are counted during the candidate counting step. In round two, only pairs of items in which each item is frequent can become candidates. For example, the itemset {MSOffice, Doom} is not a candidate, since round one determined that its subset {MSOffice} is not frequent. In round two, therefore, the algorithm counts the candidate itemsets {computer, Doom}, {hard disk, Doom}, and {computer, hard disk}. In round three, no candidate itemset survives the pruning step. The itemset {computer, hard disk, Doom} is pruned a priori because its subset {computer, hard disk} is not frequent. Thus, with respect to a minimum support of 60 percent, the frequent itemsets in our sample database and their support values are

• {computer} 67 percent,
• {hard disk} 67 percent,
• {Doom} 100 percent,
• {computer, Doom} 67 percent, and
• {hard disk, Doom} 67 percent.

Apriori counts not only the support of all frequent itemsets, but also the support of those infrequent candidate itemsets that could not be eliminated during the pruning step. The set of all candidate itemsets that are infrequent but whose support is counted by Apriori is called the negative border. Thus, an itemset is in the negative border if it is infrequent, but all its subsets are frequent. In our example, the negative border consists of the itemsets {MSOffice} and {computer, hard disk}. All subsets of an itemset in the negative border are frequent; otherwise the itemset would have been eliminated by the subset-pruning step.
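To make the round structure concrete, the following Python sketch implements the two steps described above (candidate generation with subset pruning, followed by candidate counting) for the in-memory case; the helper names are ours, and a production implementation would stream the database from disk.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Return {itemset: support} for all frequent itemsets.

        transactions: list of sets of items; minsup: fraction in (0, 1].
        """
        n = len(transactions)
        # Round 1: count all 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {c: s / n for c, s in counts.items() if s / n >= minsup}
        result = dict(frequent)

        k = 2
        while frequent:
            prev = set(frequent)
            # Candidate generation: join frequent (k-1)-itemsets, then prune
            # every candidate that has an infrequent (k-1)-subset.
            candidates = set()
            for a in prev:
                for b in prev:
                    union = a | b
                    if len(union) == k and all(
                        frozenset(s) in prev for s in combinations(union, k - 1)
                    ):
                        candidates.add(union)
            # Candidate counting: one scan over the database.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {c: s / n for c, s in counts.items() if s / n >= minsup}
            result.update(frequent)
            k += 1
        return result

On the data of Figure 1 with a minimum support of 60 percent, this sketch returns exactly the five itemsets listed above.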

Optimizing Apriori

Apriori scans the database several times, depending on the size of the longest frequent itemset. Several refinements have been proposed that focus on reducing the number of database scans, the number of candidate itemsets counted in each scan, or both.

Partitioning. Ashok Savasere and colleagues3 developed Partition, an algorithm that requires only two scans of the transaction database. The database is divided into disjoint partitions, each small enough to fit in memory. In a first scan, the algorithm reads each partition and computes locally frequent itemsets on each partition using Apriori.

In the second scan, the algorithm counts the support of all locally frequent itemsets toward the complete database. If an itemset is frequent with respect to the complete database, it must be frequent in at least one partition; therefore the second scan counts a superset of all potentially frequent itemsets.
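A minimal sketch of this two-scan strategy, reusing the apriori function above and assuming the partitioning into memory-sized chunks is already given:

    def partition_frequent(partitions, minsup):
        """Two-scan Partition sketch: partitions is a list of transaction lists."""
        # Scan 1: locally frequent itemsets per partition.
        candidates = set()
        for part in partitions:
            candidates.update(apriori(part, minsup))
        # Scan 2: count every candidate against the complete database.
        total = sum(len(part) for part in partitions)
        counts = {c: 0 for c in candidates}
        for part in partitions:
            for t in part:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
        return {c: s / total for c, s in counts.items() if s / total >= minsup}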

Hashing. Jong Soo Park and colleagues4 proposed using probabilistic counting to reduce the number of candidate itemsets counted during each round of Apriori execution. This reduction is accomplished by subjecting each candidate k-itemset to a hash-based filtering step in addition to the pruning step.

During candidate counting in round k − 1, the algorithm constructs a hash table. Each entry in the hash table is a counter that maintains the sum of the supports of the k-itemsets that correspond to that particular entry of the hash table. The algorithm uses this information in round k to prune the set of candidate k-itemsets. After subset pruning as in Apriori, the algorithm can remove a candidate itemset if the count in its hash table entry is smaller than the minimum support threshold.
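The sketch below shows the idea for one round; the hash function, the table size, and the function names are our own illustrative choices, not those of the original paper.

    from itertools import combinations

    TABLE_SIZE = 1024  # illustrative; chosen to fit in memory

    def bucket(itemset):
        """Hash a k-itemset to a table entry."""
        return hash(frozenset(itemset)) % TABLE_SIZE

    def build_hash_table(transactions, k):
        """Count, per bucket, the k-itemsets occurring in the transactions.
        In the hashing approach this counting is piggybacked on the scan
        that counts candidates in round k - 1."""
        table = [0] * TABLE_SIZE
        for t in transactions:
            for s in combinations(sorted(t), k):
                table[bucket(s)] += 1
        return table

    def hash_filter(candidates, table, min_count):
        """Drop candidate k-itemsets whose bucket count is below the threshold."""
        return {c for c in candidates if table[bucket(c)] >= min_count}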

Sampling. Hannu Toivonen5 proposed a sampling-based algorithm that typically requires two scans of the database. The algorithm first takes a sample from the database and generates a set of candidate itemsets that are highly likely to be frequent in the complete database.


    TID   CID   Item        Price   Date
    101   201   Computer    1,500   1/4/99
    101   201   MSOffice      300   1/4/99
    101   201   Doom          100   1/4/99
    102   201   Hard disk     500   1/7/99
    102   201   Doom          100   1/7/99
    103   202   Computer    1,500   1/24/99
    103   202   Hard disk     500   1/24/99
    103   202   Doom          100   1/24/99

Figure 1. Database containing three transactions.



In a subsequent scan over the database, the algorithm counts these itemsets' exact supports and the support of their negative border. If no itemset in the negative border is frequent, then the algorithm has discovered all frequent itemsets. Otherwise, some superset of an itemset in the negative border could be frequent, but its support has not yet been counted. The sampling algorithm generates and counts all such potentially frequent itemsets in a subsequent database scan.
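A compact sketch of the control flow, reusing the apriori function above and assuming a hypothetical negative_border helper that returns the minimal infrequent supersets of a collection of frequent itemsets; the lowered sample threshold is also our own heuristic choice.

    import random

    def sample_and_verify(transactions, minsup, sample_size, negative_border):
        # Mine a random sample at a slightly lowered threshold so that itemsets
        # frequent in the complete database are unlikely to be missed.
        sample = random.sample(transactions, sample_size)
        likely = set(apriori(sample, 0.9 * minsup))
        border = negative_border(likely)

        # One scan over the complete database: exact counts for both sets.
        n = len(transactions)
        counts = {c: sum(1 for t in transactions if c <= t) for c in likely | border}
        frequent = {c for c, s in counts.items() if s / n >= minsup}

        # If a border itemset turned out to be frequent, a further scan is needed
        # to count its supersets; otherwise all frequent itemsets are found.
        needs_second_scan = bool(frequent & border)
        return frequent, needs_second_scan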

Dynamic itemset counting. Sergey Brin and colleagues6 proposed the Dynamic Itemset Counting algorithm. DIC partitions the database into several blocks marked by start points and repeatedly scans the database. In contrast to Apriori, DIC can add new candidate itemsets at any start point, instead of just at the beginning of a new database scan. At each start point, DIC estimates the support of all itemsets that are currently counted and adds a new itemset to the set of candidate itemsets if all of its subsets are estimated to be frequent.

If DIC adds all frequent itemsets and their negative border to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC will complete in two scans.

Extensions and generalizations

Several researchers have proposed extensions to the basic problem of finding frequent itemsets.

Is-a hierarchy. One extension considers an is-a hierarchy on database items. An is-a hierarchy defines which items are specializations or generalizations of other items. For instance, as shown in Figure 2, the items {computer, hard disk} in Figure 1 can be generalized to the item hardware. The extended problem is to compute itemsets that include items from different hierarchy levels.

The presence of a hierarchy modifies the notion of when an item is contained in a transaction: In addition to the items listed explicitly, the transaction contains their ancestors in the taxonomy. This allows the detection of relationships involving higher hierarchy levels, since an itemset's support can increase if an item is replaced by one of its ancestors.

Consider the taxonomy in Figure 2. The transaction {computer, MSOffice} contains not only the items computer and MSOffice, but also hardware and software. In Figure 1's sample database, the support of the itemset {computer, MSOffice} is 33 percent, whereas the support of the itemset {computer, software} is 67 percent.

One approach to computing frequent itemsets in the presence of a taxonomy is to conceptually augment each transaction with the ancestors of all items in the transaction. Any algorithm for computing frequent itemsets can now be used on the augmented database. Optimizations on this basic strategy have been described by Rakesh Agrawal and Ramakrishnan Srikant.7
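The augmentation step itself is straightforward; the sketch below encodes the taxonomy of Figure 2 as a parent map (our own encoding) and extends a transaction with all ancestors.

    # Parent map for the taxonomy in Figure 2; top-level items have no entry.
    PARENT = {
        "computer": "hardware",
        "hard disk": "hardware",
        "MSOffice": "software",
        "Doom": "software",
    }

    def augment(transaction, parent=PARENT):
        """Add every ancestor of every item in the transaction."""
        extended = set(transaction)
        for item in transaction:
            while item in parent:
                item = parent[item]
                extended.add(item)
        return extended

    print(augment({"computer", "MSOffice"}))  # adds hardware and software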

Sequential patterns. With each customer, we can associate a sequence of transactions ordered over time. The business goal is to find sequences of itemsets that many customers have purchased in approximately the same order.7,8 For each customer, the input database consists of an ordered sequence of transactions. Given an itemset sequence, the percentage of transaction sequences that contain it is called the itemset sequence's support.

A transaction sequence contains an itemset sequence if each itemset is contained in one transaction and the following holds: If the ith itemset in the itemset sequence is contained in transaction j in the transaction sequence, the (i + 1)st itemset in the itemset sequence is contained in a transaction with a number greater than j. The goal of finding sequential patterns is to find all itemset sequences that have a support higher than a user-specified minimum. An itemset sequence is frequent if its support is larger than this minimum.

In Figure 1, customer 201 is associated with the transaction sequence [{computer, MSOffice, Doom}, {hard disk, Doom}]. This transaction sequence contains the itemset sequence [{MSOffice}, {hard disk}].
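The containment test described above can be written directly; the function below is a sketch with names of our own choosing.

    def contains_sequence(transaction_seq, itemset_seq):
        """True if the ordered transactions contain the itemsets in order."""
        j = 0  # index of the next transaction allowed to match
        for itemset in itemset_seq:
            itemset = set(itemset)
            while j < len(transaction_seq) and not itemset <= set(transaction_seq[j]):
                j += 1
            if j == len(transaction_seq):
                return False
            j += 1  # the next itemset must appear in a later transaction
        return True

    # Customer 201 from Figure 1:
    seq = [{"computer", "MSOffice", "Doom"}, {"hard disk", "Doom"}]
    print(contains_sequence(seq, [{"MSOffice"}, {"hard disk"}]))  # True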

Calendric market basket analysis. Sridhar Ramaswamy and colleagues9 use the time stamp associated with each transaction to define the problem of calendric market basket analysis. Even though an itemset's support may not be large with respect to the entire database, it might be large on a subset of the database that satisfies certain time constraints.

Conversely, in certain cases, itemsets that are frequent on the entire database may gain their support from only certain subsets. The goal of calendric market basket analysis is to find all itemsets that are frequent in a set of user-defined time intervals.
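In its simplest form, this amounts to restricting the support computation to transactions whose time stamps fall in a given interval; a sketch, reusing the apriori function above and assuming each transaction carries a date:

    from datetime import date

    def frequent_in_interval(dated_transactions, start, end, minsup):
        """dated_transactions: list of (date, set_of_items) pairs."""
        window = [items for d, items in dated_transactions if start <= d <= end]
        return apriori(window, minsup) if window else {}

    # Example: only the transactions from the first week of January 1999.
    dated = [
        (date(1999, 1, 4), {"computer", "MSOffice", "Doom"}),
        (date(1999, 1, 7), {"hard disk", "Doom"}),
        (date(1999, 1, 24), {"computer", "hard disk", "Doom"}),
    ]
    print(frequent_in_interval(dated, date(1999, 1, 1), date(1999, 1, 7), 0.6))
    # only {Doom} is frequent in this window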

CLUSTERING

Clustering distributes data into several groups so that similar objects fall into the same group.

    Hardware: Computer, Hard disk
    Software: MSOffice, Doom

Figure 2. Sample taxonomy for an is-a hierarchy of database items.


In Figure 1's sample database, assume that to cluster customers based on their purchase behavior, we compute for each customer the total number and average price of all items purchased. Figure 3 shows clustering information for nine customers, distributed across three clusters. Customers in cluster one purchase few high-priced items, customers in cluster two purchase many high-priced items, and customers in cluster three purchase few low-priced items. Figure 3's data does not match Figure 1's because the earlier figure accommodated only a few transactions.
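Computing this per-customer summary from a transaction table is a simple aggregation; the sketch below uses a record layout of our own choosing rather than the exact schema of Figure 1.

    from collections import defaultdict

    def customer_profiles(rows):
        """rows: (customer_id, price) pairs, one per purchased item.
        Returns {customer_id: (item_count, average_price)}."""
        count = defaultdict(int)
        total = defaultdict(float)
        for cid, price in rows:
            count[cid] += 1
            total[cid] += price
        return {cid: (count[cid], total[cid] / count[cid]) for cid in count}

    rows = [(201, 1500), (201, 300), (201, 100), (201, 500), (201, 100),
            (202, 1500), (202, 500), (202, 100)]
    print(customer_profiles(rows))  # {201: (5, 500.0), 202: (3, 700.0)}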

The clustering problem has been studied in many fields, including statistics, machine learning, and biology. However, scalability was not a design goal in these applications; researchers always assumed the complete data set would fit in main memory, and the focus was on improving the clustering quality. Consequently, these algorithms do not scale to large data sets. Recently, several new algorithms with greater emphasis on scalability have been developed; they rely on summarized cluster representations, sampling, and data structures supported by database systems.

Summarized cluster representations

Tian Zhang and colleagues10 proposed Birch, which uses summarized cluster representations to achieve speed and scalability while clustering a data set. The Birch approach can be thought of as a two-phase clustering technique: Birch is used to yield a collection of coarse clusters, and then other (main-memory) clustering algorithms can be used on this collection to identify "true clusters." As an analogy, if each data point is a marble on a table top, we replace clusters of marbles by tennis balls and then look for clusters of tennis balls. While the number of marbles may be large, we can control the number of tennis balls to make the second phase feasible with traditional clustering algorithms whose goal is to recover complex cluster shapes. Other work on scalable clustering addressed Birch's limitations or applied the summarized cluster representation idea in different ways.

Birch and the Birch* framework. A cluster corresponds to a dense region of objects. Birch treats this region collectively through a summarized representation called its cluster feature. A cluster's CF is a triple consisting of the cluster's number of points, centroid, and radius, with the cluster's radius defined as the square root of the average squared distance of the cluster's points from its centroid. When a new point is added to a cluster, the new CF can be computed from the old CF; we do not need the set of points in the cluster. The incremental Birch algorithm exploits this property of a CF and maintains only the CFs of clusters, rather than the sets of points, while scanning the data. Cluster features are efficient for two reasons:

• They occupy much less space than the naive representation, which maintains all objects in a cluster.

• They are sufficient for calculating all intercluster and intracluster measurements involved in making clustering decisions. Moreover, these calculations can be performed much faster than using all the objects in clusters. For instance, distances between clusters, radii of clusters, CFs—and hence other properties of merged clusters—can all be computed very quickly from the CFs of individual clusters.

In Birch, the CF's definition relies on vector operations like addition, subtraction, centroid computation, and so on. Therefore, Birch's definition of CF will not extend to data sets consisting of character strings, say, for which these operations are not defined.
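One convenient way to maintain such a summary incrementally, an implementation detail not spelled out in the article, is to store the number of points together with the coordinate-wise sum and the sum of squared norms, from which the centroid and radius can be derived at any time. The class below is a sketch under that assumption.

    import math

    class ClusterFeature:
        """Incremental summary of a cluster of d-dimensional points."""

        def __init__(self, dim):
            self.n = 0                      # number of points
            self.ls = [0.0] * dim           # coordinate-wise linear sum
            self.ss = 0.0                   # sum of squared norms

        def add(self, point):
            self.n += 1
            self.ls = [a + b for a, b in zip(self.ls, point)]
            self.ss += sum(x * x for x in point)

        def centroid(self):
            return [a / self.n for a in self.ls]

        def radius(self):
            """Square root of the average squared distance to the centroid."""
            c = self.centroid()
            avg_sq_dist = self.ss / self.n - sum(x * x for x in c)
            return math.sqrt(max(avg_sq_dist, 0.0))

    cf = ClusterFeature(dim=2)
    for p in [(2, 1700), (3, 2000), (4, 2300)]:   # cluster 1 from Figure 3
        cf.add(p)
    print(cf.centroid(), cf.radius())   # centroid [3.0, 2000.0] and the radius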

In recent work, the CF and CF-tree concepts used in Birch have been generalized in the Birch* framework11 to derive two new scalable clustering algorithms for data in an arbitrary metric space. These new algorithms will, for example, separate the set {University of Wisconsin-Madison, University of Wisconsin-Whitewater, University of Texas-Austin, University of Texas-Arlington} into two clusters of Wisconsin and Texas universities.

Other CF work. Recently, Paul Bradley and colleagues12 used CFs to develop a framework for scaling up the class of iterative clustering algorithms, such as the K-Means algorithm. Starting with an initial data-set partitioning, iterative clustering algorithms repeatedly move points between clusters until the distribution optimizes a criterion function.

The framework functions by identifying sets of discardable, compressible, and main-memory points. A point is discardable if its membership in a cluster can be ascertained; the algorithm discards the actual points and retains only the CF of all discardable points.

A point is compressible if it is not discardable but belongs to a tight subcluster—a set of points that always share cluster membership.


    Cluster 1: <2, 1,700>   <3, 2,000>   <4, 2,300>
    Cluster 2: <10, 1,800>  <12, 2,100>  <11, 2,040>
    Cluster 3: <2, 100>     <3, 200>     <3, 150>

Figure 3. Sample set of clusters—data groups consisting of similar objects.


Such points can move from one cluster to another, but they always move together. Such a subcluster is summarized using its CF.

A point is a main-memory point if it is neither discardable nor compressible. Main-memory points are retained in main memory. The iterative clustering algorithm then moves only the main-memory points and the CFs of compressible points between clusters until the distribution optimizes the criterion function.

Gholamhosein Sheikholeslami and colleagues13 proposed WaveCluster, a clustering algorithm based on wavelet transforms. They first summarize the data by imposing a multidimensional grid on the data space. The number of points that map into a single cell summarizes all the points that mapped into the cell. This summary information typically fits in main memory. WaveCluster then applies the wavelet transform on the summarized data to determine clusters of arbitrary shapes.
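The grid summarization step is easy to sketch (the wavelet stage needs more than a few lines); the cell width and dictionary layout below are our own choices.

    from collections import Counter

    def grid_summary(points, cell_width):
        """Map each point to a grid cell and count points per cell."""
        cells = Counter()
        for p in points:
            cell = tuple(int(x // cell_width) for x in p)
            cells[cell] += 1
        return cells  # compact summary; the input points can now be discarded

    summary = grid_summary([(2, 1700), (3, 2000), (10, 1800), (2, 100)], cell_width=500)
    print(summary)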

Other approaches

Of the other proposed clustering algorithms for large data sets, we mention two sampling-based approaches and one based on database system support.

Raymond T. Ng and Jiawei Han14 proposed Clarans, which formulates the clustering problem as a randomized graph search. In Clarans, each node represents a partition of the data set into a user-specified number of clusters. A criterion function determines the clusters' quality. Clarans samples the solution space—all possible partitions of the data set—for a good solution. The random search stops after examining a prespecified number of local minima and returns the best one found.

Sudipto Guha and colleagues15 proposed CURE, a sampling-based hierarchical clustering algorithm that discovers clusters of arbitrary shapes. In the DBScan algorithm, Martin Ester and colleagues16 proposed a density-based notion of a cluster that also lets the cluster take an arbitrary shape.

The article "Chameleon: Hierarchical Clustering Using Dynamic Modeling" by George Karypis and colleagues (p. 68) covers these last two algorithms in detail.

CLASSIFICATION

Assume that we have identified, through clustering of the aggregated purchase information of current customers, three different groups of customers, as shown in Figure 3. Assume that we purchase a mailing list with demographic information for potential customers. We would like to assign each person in the mailing list to one of three groups so that we can send a catalog tailored to that person's buying patterns. This data-mining task uses historical information about current customers to predict the cluster membership of new customers.

Our database with historical information, also called the training database, contains records that have several attributes. One designated attribute is called the dependent attribute, and the others are called predictor attributes. The goal is to build a model that takes the predictor attributes as inputs and outputs a value for the dependent attribute.

If the dependent attribute is numerical, the problem is called a regression problem; otherwise it is called a classification problem. We concentrate on classification problems, although similar techniques apply to regression problems as well. For a classification problem, we refer to the attribute values of the dependent attribute as class labels. Figure 4 shows a sample training database with three predictor attributes: salary, age, and employment. Group is the dependent attribute.

Researchers have proposed many classification models:17 neural networks, genetic algorithms, Bayesian methods, log-linear and other statistical methods, decision tables, and tree-structured models—so-called classification trees. Classification trees, also called decision trees, are attractive in a data-mining environment for several reasons:

• Their intuitive representation makes the resulting classification model easy to understand.

• Constructing decision trees does not require any input parameters from the analyst.

• The predictive accuracy of decision trees is equal to or higher than that of other classification models.

• Fast, scalable algorithms can be used to construct decision trees from very large training databases.

Each internal node of a decision tree is labeled with a predictor attribute, called the splitting attribute, and each leaf node is labeled with a class label. Each edge originating from an internal node is labeled with a splitting predicate that involves only the node's splitting attribute.

    Record ID   Salary   Age   Employment   Group
    1           30K      30    Self         C
    2           40K      35    Industry     C
    3           70K      50    Academia     C
    4           60K      45    Self         B
    5           70K      30    Academia     B
    6           60K      35    Industry     A
    7           60K      35    Self         A
    8           70K      30    Self         A
    9           40K      45    Industry     C

Figure 4. Sample training database.


The splitting predicates have the property that any record will take a unique path from the root to exactly one leaf node. The combined information about splitting attributes and splitting predicates at a node is called the splitting criterion. Figure 5 shows a possible decision tree for the training database from Figure 4.

Decision tree construction algorithms consist of two phases: tree building and pruning. In tree building, the tree grows top-down in the following greedy way. Starting with the root node, the algorithm examines the database using a split selection method to compute the locally "best" splitting criterion. Then it partitions the database according to this splitting criterion and applies the procedure recursively. The algorithm then prunes the tree to control its size. Some decision tree construction algorithms separate tree building and pruning, while others interleave them to avoid the unnecessary expansion of some nodes. Figure 6 shows a code sample of the tree-building phase.

The choice of splitting criterion determines the quality of the decision tree, and it has been the subject of considerable research. In addition, if the training database does not fit in memory, we need a scalable data access method. One such method, the Sprint algorithm introduced by John Shafer and colleagues,18 uses only a minimum amount of main memory and scales a popular split selection method called CART. Another approach, the RainForest framework,19 scales a broad class of split-selection methods, but has main-memory requirements that depend on the number of different attribute values in the input database.

Sprint. This classification-tree construction algorithm removes all relationships between main memory and the data set size. Sprint builds classification trees with binary splits, and it requires sorted access to each attribute at each node. For each attribute, the algorithm creates an attribute list, which is a vertical partition of the training database D. For each tuple t ∈ D, the entry of t in the attribute list consists of the projection of t onto the attribute, the class label attribute, and the record identifier of t. The attribute list of each attribute is created at the beginning of the algorithm and sorted once in increasing order of attribute values.

At the root node, the algorithm scans all attribute lists once to determine the splitting criterion. Then it distributes each attribute list among the root's children through a hash-join with the attribute list of the splitting attribute. The record identifier, which is duplicated in each attribute list, establishes the connection between the different parts of the tuple. During the hash-join, the algorithm reads and distributes each attribute list sequentially, which preserves the initial sort order of the attribute list. The algorithm then recurses on each child partition.
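A sketch of the attribute-list representation, using column names adapted from Figure 4 (salary encoded in thousands); the in-memory layout is ours, whereas Sprint itself keeps these lists on disk.

    def build_attribute_lists(records, predictor_attrs, label_attr):
        """records: list of dicts keyed by attribute name.
        Returns {attribute: sorted list of (value, class_label, record_id)}."""
        lists = {}
        for attr in predictor_attrs:
            entries = [(r[attr], r[label_attr], rid) for rid, r in enumerate(records)]
            entries.sort(key=lambda e: e[0])   # sorted once, at the start
            lists[attr] = entries
        return lists

    records = [
        {"salary": 30, "age": 30, "employment": "Self", "group": "C"},
        {"salary": 40, "age": 35, "employment": "Industry", "group": "C"},
        {"salary": 70, "age": 50, "employment": "Academia", "group": "C"},
    ]
    lists = build_attribute_lists(records, ["salary", "age"], "group")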

RainForest. The RainForest framework19 operates from the premise that nearly all split-selection methods need only aggregate information to decide on the splitting criterion at a node. This aggregated information can be captured in a relatively compact data structure called the attribute-value class label group, or AVC group.

Consider the root node of the tree, and let D be the training database. The AVC set of predictor attribute A is the projection of D onto A, where counts of the individual class labels are aggregated. The AVC group at a node consists of the AVC sets of all predictor attributes. Consider the training database shown in Figure 4. The AVC group of the root node is shown in Figure 7.

    Salary   A   B   C        Age   A   B   C
    30K      0   0   1        30    1   1   1
    40K      0   0   2        35    2   0   1
    60K      2   1   0        45    0   1   1
    70K      1   1   1        50    0   0   1

Figure 7. AVC group of the root node for the sample input database in Figure 4.

The size of a node's AVC group is not proportional to the number of records in the training database, but rather to the number of different attribute values. Thus, in most cases, the AVC group is much smaller than the training database and usually fits into main memory.


Figure 5. Sample decision tree for a catalog mailing. The tree splits on Salary (<= 50K versus > 50K), Employment (Academia, Industry versus Self), and Age (<= 40 versus > 40); its leaves are labeled Group A, Group B, and Group C.

    Input: node n, data partition D, split selection method CL
    Output: decision tree for D rooted at n

    Top-Down Decision Tree Induction Schema (Binary Splits):
    BuildTree(Node n, data partition D, split selection method CL)
    (1) Apply CL to D to find the splitting criterion for n
    (2) if (n splits)
    (3)     Create children n1 and n2 of n
    (4)     Use best split to partition D into D1 and D2
    (5)     BuildTree(n1, D1, CL)
    (6)     BuildTree(n2, D2, CL)
    (7) endif

Figure 6. Code sample for the tree-building phase.


Knowing that the AVC group contains all the information any split-selection method needs, the problem of scaling up an existing split-selection method is now reduced to the problem of efficiently constructing the AVC group at each node of the tree.

One simple data access method works by performing a sequential scan over the training database to construct the root node's AVC group in main memory. The split-selection method then computes the split of the root node. In the next sequential scan, each record is read and appended to one child partition. The algorithm then recurses on each child partition.
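Building an AVC group in one sequential scan is a straightforward aggregation; the sketch below reuses the dictionary-based record layout from the Sprint example and is our own illustration rather than RainForest's actual data structures.

    from collections import defaultdict

    def build_avc_group(records, predictor_attrs, label_attr):
        """Return {attribute: {attribute_value: {class_label: count}}}."""
        avc = {a: defaultdict(lambda: defaultdict(int)) for a in predictor_attrs}
        for r in records:                       # one sequential scan
            for a in predictor_attrs:
                avc[a][r[a]][r[label_attr]] += 1
        return avc

    # With all nine Figure 4 records encoded this way, avc["salary"][60]["A"]
    # would be 2, matching the AVC set shown in Figure 7.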

Rajeev Rastogi and Kyuseok Shim20 developed an algorithm called Public that interleaves tree building and pruning. Public eagerly prunes nodes that need not be expanded further during tree building, thus saving on the expansion cost of some nodes in the tree.

Most current data-mining research assumes that data is static. In practice, data is maintained in data warehouses, which are updated continuously by the addition of records in batches. Given this scenario, we believe that future research must address algorithms for efficient model maintenance and methods to measure changes in data characteristics.

The current data-mining paradigm resembles that of traditional database systems. A user initiates data mining and awaits the complete result. But analysts are interested in quick, partial, or approximate results that can then be fine-tuned through a series of interactive queries. Thus, further research must focus on making data mining more interactive.

Finally, the Web is the largest repository of structured, semistructured, and unstructured data. The Web's dynamic nature, as well as the extreme variety of data types it holds, will challenge the research community for years to come. ❖

Acknowledgments

Venkatesh Ganti is supported by a Microsoft Graduate Fellowship. Johannes Gehrke is supported by an IBM Graduate Fellowship. The research for this article was supported by Grant 2053 from the IBM Corp.

References

1. U.M. Fayyad et al., eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, Calif., 1996.

2. R. Agrawal et al., "Fast Discovery of Association Rules," Advances in Knowledge Discovery and Data Mining, U.M. Fayyad et al., eds., AAAI/MIT Press, Menlo Park, Calif., 1996, pp. 307-328.

3. A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1995, pp. 432-444.

4. J.S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1995, pp. 175-186.

5. H. Toivonen, "Sampling Large Databases for Association Rules," Proc. 22nd Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1996, pp. 134-145.

6. S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1997, pp. 255-264.

7. R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 3-14.

8. H. Mannila, H. Toivonen, and A.I. Verkamo, "Discovering Frequent Episodes in Sequences," Proc. 1st Int'l Conf. Knowledge Discovery Databases and Data Mining, AAAI Press, Menlo Park, Calif., 1995, pp. 210-215.

9. S. Ramaswamy, S. Mahajan, and A. Silberschatz, "On the Discovery of Interesting Patterns in Association Rules," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 368-379.

10. T. Zhang, R. Ramakrishnan, and M. Livny, "Birch: An Efficient Data Clustering Method for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1996, pp. 103-114.

11. V. Ganti et al., "Clustering Large Datasets in Arbitrary Metric Spaces," Proc. 15th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1999, pp. 502-511.

12. P. Bradley, U. Fayyad, and C. Reina, "Scaling Clustering Algorithms to Large Databases," Proc. 4th Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 9-15.

13. G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 428-439.


14. R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 20th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1994, pp. 144-155.

15. S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp. 73-84.

16. M. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int'l Conf. Knowledge Discovery Databases and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 226-231.

17. D. Michie, D.J. Spiegelhalter, and C.C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester, UK, 1994.

18. J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. 22nd Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1996, pp. 544-555.

19. J. Gehrke, R. Ramakrishnan, and V. Ganti, "RainForest—a Framework for Fast Decision Tree Construction of Large Datasets," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 416-427.

20. R. Rastogi and K. Shim, "Public: A Decision Tree Classifier that Integrates Building and Pruning," Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, 1998, pp. 404-415.

Venkatesh Ganti is a PhD candidate at the University of Wisconsin-Madison. His primary research interests are the exploratory analysis of large data sets and monitoring changes in data characteristics. Ganti received an MS in computer science from the University of Wisconsin-Madison.

Johannes Gehrke is a PhD candidate at the University of Wisconsin-Madison. His research interests include scalable techniques for data mining, performance of data-mining algorithms, and mining and monitoring evolving data sets.

Raghu Ramakrishnan is a professor in the Computer Sciences Department at the University of Wisconsin-Madison. His research interests include database languages, net databases, data mining, and interactive information visualization.

Contact Ganti, Gehrke, and Ramakrishnan at the University of Wisconsin-Madison, Dept. of Computer Science, Madison, WI 53706; {vganti, johannes, raghu}@cs.wisc.edu.
