
Mining: A Database Perspective

Raghu Ramakrishnan

Univ. of Wisconsin-Madison

THE EDAM PROJECT University of Wisconsin-Madison

Data Mining

Classification: decision trees, regression, SVMs, Naïve Bayes, meta-learners, ensembles
Clustering: K-means, hierarchical methods, EM
MRDM/ILP pattern discovery: Horn rules; PRMs
Frequent item analysis: associations, sequential patterns
Time-series analysis: linear and nonlinear dynamics
Collaborative filtering
Text, multimedia mining

Contributing disciplines: ML/AI, Optimization, DB, Stats


Mining at a Crossroads

Data Mining has drawn upon ideas and people from many disciplines, and has grown rapidly.

As yet, there is no unifying vision of how these disciplines leverage each other. Stats folks still do stats, ML folks still do ML, DB folks still think about large datasets, and they rarely talk to one another.

What are the applications that will pay the piper?


About this Talk

A database perspective on data mining and its relationship to data management
  How can database-oriented thinking influence research and practice in data mining?
  What are the difficult problems with big payoffs?

The EDAM project at Wisconsin
  Analyzing streams of mass spectra and other spatio-temporal data
  Joint work with researchers in atmospheric aerosols and climatology at UW-Madison and Carleton College, funded by an NSF ITR


Outline

A database perspective
Recent extensions to relational systems
  OLAP: cube, sequence queries
  Data mining support
Relational approaches to mining
  Relational clustering
  MRDM/ILP
The EDAM project


A Database Perspective


All the World’s a Table

All data is in a database.
  If not, it’s not important.

Data mining is a class of analysis techniques that complements current SQL data analysis capabilities.
  Data is in a DBMS for reasons that go well beyond the analysis capabilities of the DBMS, even if these are often inadequate.

And if the past is any indication, the DB vendors will try to expand SQL to support whatever DM capabilities the market will pay for, and it’s not clear that this is the right architecture.


Scalability

Widely recognized as a characteristic DB concern, and the DB field provides useful techniques to deal with scale.
  BIRCH: scalable pre-clustering that borrows ideas from B+ trees
  RainForest: framework for scaling decision tree construction that borrows from hash joins
  (There are also scalable algorithms based on EM and bootstrapping.)

However, the focus has been on one aspect of scale:
  Size of training data

We also need scalability with respect to other problem dimensions:
  Size of hypothesis space
  Rate of data capture and analysis


Queries vs. Mining

From the point of view of the user, SQL queries are one way to explore and understand the data.
  But is it “data mining”?
  The various data mining techniques are no more (or less) than alternatives with different capabilities.

The query framework has some ideas worth borrowing and generalizing:
  Compositionality: more flexibility, more automation
  Usability: domain analysts, not tool experts
  Query optimization


A Different Mindset …

Sometimes, just looking at the problem from a different perspective may lead to useful reformulations:
  Frequent itemsets
  Relational clustering
  Stream analysis
  Labeling spectra
  Subset mining

“What does a query mean?” vs. “How do I characterize my data?”
  Hopefully, not mutually exclusive!

Can raise very different concerns
  E.g., coverage, accuracy (ML), confidence bounds (Stats) vs. query equivalence, compositionality (DB)
  Combining multiple sources of information (e.g., multiple tables)


Query Optimization

Driven by user’s query
  Goal is to find answers to this query efficiently

Search space for optimization
  Defined through equivalences to given query
  Exploits compositionality!

“Goodness” metric is estimated plan cost

Contrast this with the search spaces typical in, e.g., rule discovery or attribute selection
  These are data-driven, not query-driven
  Search space based on hypothesis refinement
  “Goodness” metric based on coverage of training set


Data Management Management

Data storage and archival
Privacy, sharing, collaboration

Focus has been on managing data; however:
  Queries can be stored in the DBMS
  Views, or tables defined by queries
  (Ownership, access control, re-optimization, caching)

We need more support for managing analyses:
  Managing analyses external to the DBMS
  Provenance of data and analysis
  Versioning and collaboration support
  Support for ongoing analyses: impact of data changes on analyses; monitoring; trend analysis over warehouses; deploying results into operational system


Data Co-Processor Architecture

[Figure: architecture diagram. Component labels: Indexer, Miner, Files, Logs, DBMS, Warehouse, RAID storage. Workload labels: updates, SQL queries, OLAP queries, text queries, queries/searches, small reads, large R/W, periodic offline activity. Other labels: SYNC, customized asynchronous replicas.]


Recent Extensions of Relational Queries


Star Schema

Fact table: Transactions(timekey, storekey, pkey, promkey, ckey, units, price)
Dimension tables: Time, Store, Customers, Products, Promotions


Multidimensional Analysis

            NY      CA      WI
Industry1   $1000   $2000   $1000
Industry2   $500    $1000   $500
Industry3   $3000   $3000   $3000

Dimension hierarchies:
  Industry > Category > Product
  Country=“USA” > State > City
  Year > Quarter > Month, Week > Day


Slice and Drill-Down

            San Francisco   San Jose   Los Angeles
Category1   $300            $300       $400
Category2   $300            $300       $400
Category3   $100            $800       $100

Dimension hierarchies:
  Industry=“Industry3” > Category > Product
  Country > State=“CA” > City
  Year > Quarter > Month, Week > Day


Comparison with SQL

-- Total sales by year and city
SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.city

-- Total sales by year
SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year

-- Total sales by city
SELECT SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.city


Visual Intuition: Cube

[Figure: data cube with three dimensions: Location (SH, SF, LA), Product (Product1 to Product6), and Time (M T W Th F S S). Sample cell values: 20, 30, 20, 15, 10, 50. Callout: 50 units of Product6 sold on Monday in LA. Arrows: roll-up to week, roll-up to category, roll-up to state.]


CUBE Operator

For k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.

CUBE pid, locid, timeid BY SUM Sales
  Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

SELECT SUM(S.sales)
FROM Sales S
GROUP BY grouping-list
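To make the 2^k count concrete, here is a minimal Python sketch that enumerates every subset of the dimensions and computes the corresponding roll-up. It is only an illustration of the semantics, not how a DBMS evaluates CUBE; the function name `cube` and the sample rows are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Return {grouping_list: {group_key: SUM(measure)}} for every subset of dims."""
    result = {}
    for k in range(len(dims) + 1):
        for grouping in combinations(dims, k):  # one GROUP BY per subset
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in grouping)
                totals[key] += row[measure]
            result[grouping] = dict(totals)
    return result

# Hypothetical Sales rows (not from the slides)
sales = [
    {"pid": 1, "locid": "LA", "timeid": 1, "sales": 50},
    {"pid": 1, "locid": "SF", "timeid": 1, "sales": 20},
    {"pid": 2, "locid": "LA", "timeid": 2, "sales": 30},
]
cube_result = cube(sales, ("pid", "locid", "timeid"), "sales")
```

The empty grouping list `()` holds the grand total, and `("locid",)` holds the roll-up by location. This naive version rescans the rows once per grouping; a real implementation would share work across groupings.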


Observation

When you need to consider several related or overlapping computations:
  Think of how to expose this space to the user, and to get user input on what part of the space might be interesting
    Marketing specialists can use OLAP interfaces to do very complex queries easily
  Think of how to optimize by exploiting commonality across computations


Querying Sequences

SQL-92 supports queries over relations.
  A relation is a (multi)set of records.
  No ordering of records in a relation!

Queries involving order are hard or impossible to express, and are typically evaluated inefficiently.
  Find weekly moving average of the DJIA.
  Compute % change of each stock during ’97, and then find stocks in the top 5% (those that changed most).

SQL:1999 supports the concept of windowing, which effectively orders tuples for query purposes.


SRQL (Ramakrishnan et al., SSDBM 98)

Proposed a sequencing operator as an extension to relational algebra.

Applied to a table R, with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.

Input table R:

g  s  v
3  4  a
3  6  b
3  6  c
3  9  b
2  1  a
4  3  d

Composite sequence:

ord  g  s  v
1    3  4  a
2    3  6  b
2    3  6  c
3    3  9  b
1    2  1  a
1    4  3  d
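The ord column in the table can be reproduced with a short Python sketch. It assumes, based only on the example rows, that ord is a per-group position in which rows with equal sequencing values share the same ordinal; the function name `sequence` is hypothetical and this is not SRQL's actual implementation.

```python
from itertools import groupby

def sequence(rows, group_key, seq_key):
    """Attach a per-group ordinal 'ord'; ties on seq_key share an ordinal."""
    ordered = sorted(rows, key=lambda r: (r[group_key], r[seq_key]))
    out = []
    for _, group in groupby(ordered, key=lambda r: r[group_key]):
        ordinal, previous = 0, object()  # sentinel never equals a real value
        for row in group:
            if row[seq_key] != previous:
                ordinal += 1
                previous = row[seq_key]
            out.append({"ord": ordinal, **row})
    return out

# The table R from the slide
R = [
    {"g": 3, "s": 4, "v": "a"},
    {"g": 3, "s": 6, "v": "b"},
    {"g": 3, "s": 6, "v": "c"},
    {"g": 3, "s": 9, "v": "b"},
    {"g": 2, "s": 1, "v": "a"},
    {"g": 4, "s": 3, "v": "d"},
]
composite = sequence(R, "g", "s")
```

The sketch emits groups in sorted order of g, whereas the slide lists group 3 first; the order among groups carries no meaning here.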


Example

Find the 2-day moving average of volume sold for each product:

SELECT product, day, AVG(vol) OVER 0 TO 1
FROM Sales
GROUP BY product
SEQUENCE BY day

In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences.

Observe how this generalizes SQL’s GROUP BY: illustrates power of composite sequences and aggregation.
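The moving-average query can be emulated in plain Python as a rough sketch. Two assumptions are made here that the slides do not confirm: the window offsets in OVER 0 TO 1 are taken as positions relative to the current row, with positive offsets pointing to later days, and each product has at most one row per day. The function name and sample rows are hypothetical.

```python
from collections import defaultdict

def moving_avg(rows, lo=0, hi=1):
    """rows: (product, day, vol). Averages vol over positions i+lo .. i+hi
    within each product's day-ordered sequence (assumes lo <= 0 <= hi)."""
    by_product = defaultdict(list)
    for product, day, vol in rows:
        by_product[product].append((day, vol))  # GROUP BY product
    out = []
    for product, seq in by_product.items():
        seq.sort()                               # SEQUENCE BY day
        for i, (day, _) in enumerate(seq):
            window = [v for _, v in seq[max(0, i + lo): i + hi + 1]]
            out.append((product, day, sum(window) / len(window)))
    return out

# Hypothetical Sales rows
sales_rows = [("tv", 1, 10), ("tv", 2, 20), ("tv", 3, 60), ("ham", 1, 4)]
averages = moving_avg(sales_rows)
```

Windows are truncated at the end of a sequence, so the last day of each product is averaged over a single row.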


Variants of Aggregation

We can now introduce “running sum” and other cumulative aggregate functions!
  OVER FIRST TO 0: This gives us “running” or “cumulative” aggregates.
  RANK() is CUMULATIVE COUNT(*)
  PERCENTILE() is (RANK()/COUNT(*))*100

Elegant way to express concepts like “give me the first few answers”.

SQL:1999 does all this and more (different syntax)
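A small Python sketch of these cumulative aggregates over one sequence, using the definitions above literally. The sample values are hypothetical and are assumed to be in sequence order already.

```python
from itertools import accumulate

values = [5, 3, 8, 1]  # hypothetical sequence, already ordered

# OVER FIRST TO 0: aggregate from the first position up to the current one
running_sum = list(accumulate(values))

# RANK() as a cumulative COUNT(*)
rank = list(accumulate(1 for _ in values))

# PERCENTILE() as (RANK()/COUNT(*))*100
percentile = [r / len(values) * 100 for r in rank]

# "Give me the first few answers": keep rows whose rank is at most 2
first_two = [v for v, r in zip(values, rank) if r <= 2]
```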


Observation

Still much more limited than time-series analysis and mining techniques available elsewhere

No support for streams


DBMS Support for Managing Mining Models


Why Integrate?

[Figure: workflow diagram. Labels: Data, Copy, Extract, Mine, Models, Consistency?]


Integration Objectives

For analysts (users):
  Avoid isolation of querying from mining
    Difficult to do “ad-hoc” mining
  Provide simple programming approach to creating and using DM models

For DM vendors:
  Make it possible to add new models
  Make it possible to add new, scalable algorithms


DM Concepts to Support

Representation of input (cases)
Representation of models
Specification of training step
Specification of prediction step

Should be independent of specific algorithms


Types of Columns

Single case!

Cust ID  Age  Marital Status  Wealth   Product Purchases
                                       Product  Quantity  Type
1        35   M               380,000  TV       1         Appliance
                                       Coke     6         Drink
                                       Ham      3         Food

Keys: Columns that uniquely identify a case
Attributes: Columns that describe a case
  Value: A state associated with the attribute in a specific case
Attribute Property: Columns that describe an attribute
  Unique for a specific attribute value (TV is always an appliance)
Attribute Modifier: Columns that represent additional “meta” information for an attribute
  Weight of a case, Certainty of prediction


Representing a DMM

Specifying a model
  Columns it should predict
  Algorithm to use
  Special parameters

Model is represented as a nested table
  Specification = Create table
  Training = Inserting data into the table
  Predicting = Querying the table


Training a DMM

Training a DMM requires passing it “known” cases
Use an INSERT INTO in order to “insert” the data into the DMM
  The DMM will usually not retain the inserted data
  Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model)

INSERT [INTO] <mining model name>
[(columns list)]
<source data query>


Making Predictions

SELECT [Customers].[ID],
       MyDMM.[Hair Color],
       PredictProbability(MyDMM.[Hair Color])
FROM MyDMM PREDICTION JOIN [Customers]
ON MyDMM.[Gender] = [Customers].[Gender] AND
   MyDMM.[Age] = [Customers].[Age]
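As a loose analogy for the create/insert/query life cycle of a DMM, the sketch below uses a toy Python class. Everything in it is hypothetical: the class name, the method names, and the choice of a simple frequency table as the "algorithm". It is meant only to show that training keeps model content rather than the cases, and that prediction behaves like a join on the input columns.

```python
from collections import Counter, defaultdict

class MiningModel:
    """Toy stand-in for a DMM that predicts one column from the input columns."""

    def __init__(self, input_cols, predict_col):  # "Specification = Create table"
        self.input_cols = input_cols
        self.predict_col = predict_col
        self.counts = defaultdict(Counter)         # model content, not raw cases

    def insert(self, cases):                        # "Training = Inserting data"
        for case in cases:
            key = tuple(case[c] for c in self.input_cols)
            self.counts[key][case[self.predict_col]] += 1

    def prediction_join(self, rows):                # "Predicting = Querying"
        for row in rows:
            seen = self.counts.get(tuple(row[c] for c in self.input_cols))
            if not seen:
                yield row, None, 0.0                # no matching training cases
                continue
            value, n = seen.most_common(1)[0]
            yield row, value, n / sum(seen.values())

model = MiningModel(["Gender", "Age"], "Hair Color")
model.insert([
    {"Gender": "F", "Age": 30, "Hair Color": "Black"},
    {"Gender": "F", "Age": 30, "Hair Color": "Black"},
    {"Gender": "F", "Age": 30, "Hair Color": "Brown"},
])
customers = [{"ID": 1, "Gender": "F", "Age": 30}, {"ID": 2, "Gender": "M", "Age": 50}]
predictions = [(row["ID"], value, prob)
               for row, value, prob in model.prediction_join(customers)]
```

A real mining model would generalize to unseen input combinations; this toy version only looks up exact matches.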


Research Directions: MRDM/ILP


MRDM Accomplishments

ILP origins, hypothesis discovery
Classification
Clustering
Frequent itemsets
Equational discovery
Subgroup discovery
Extensions of Bayesian nets to multiple relations via key-foreign key traversals


Issues

Can we indeed capture the semantics exactly for each of these classes of patterns/models?
  Taking into account the details of the underlying evaluation algorithm!
Is the performance comparable to specialized algorithms?
  Is it acceptable for a broad range of applications?


Positives

Impressive! Quite a range of patterns/models are shown to be expressible in this formalism
  Importantly, the added expressiveness allows new kinds of patterns to be naturally formulated by a user

There is a (more or less) common computational structure consisting of
  Space of patterns to search
  Measure of support for a pattern
  Enumeration and pruning strategy over search space

What tangible benefits can we derive from this generality?
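One familiar instance of this common structure is level-wise frequent itemset search, sketched below in Python as an illustration. The pattern space is the set of itemsets, support is the number of transactions containing the itemset, and pruning uses the fact that every subset of a frequent itemset is frequent. The function name and data are hypothetical, and this is the standard level-wise idea rather than any particular MRDM system.

```python
from itertools import combinations

def levelwise(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    def support(itemset):                        # measure of support
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]      # smallest patterns first
    frequent = {}
    while level:
        kept = {c: support(c) for c in level}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        # Enumerate: join two kept patterns that differ in exactly one item
        candidates = {a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1}
        # Prune: drop candidates with an infrequent immediate sub-pattern
        level = [c for c in candidates if all(c - {i} in kept for i in c)]
    return frequent

# Hypothetical transactions
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
frequent = levelwise(transactions, min_support=2)
```

Here the candidate {a, b, c} is never counted, because its subset {b, c} falls below the support threshold.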


Challenges, Opportunities

If ILP notation is roughly analogous to relational calculus, what is the appropriate algebra?
  Equivalences, compositionality
  Cost-based optimization to find “optimal” evaluation plans

What kind of user input/domain knowledge can be used to focus computation, or help with optimization?


Research Directions: Relational Clustering


Problem Statement

Goal: Discover clusters of attribute values
Data: A table T with attributes drawn from domains D1,…,Dn
  Thus, a tuple of T consists of a value from each domain, e.g., (a1,b2,c1)
  T could be an arbitrary view over several tables!

Note: We expect sizes of D1,…,Dn to be small

[Figure: values of attributes A (a1 to a4), B (b1 to b3), and C (c1 to c4), drawn as three columns.]


STIRR (Gibson, Kleinberg, Raghavan, VLDB 98)

Intuition: Want to detect that “Honda and Toyota are related because unusually high numbers of both were sold in August.”
  If we also find that many Hondas and Nissans are sold in Sept, and many dealers sell both Hondas and Acuras, this leads to a cluster best described as “late-summer sales of Japanese cars”

Approach: Techniques for spectral graph partitioning, generalized to hypergraphs.
  Attribute values as weighted vertices in a graph; edges based on co-occurrence.
  Weights propagate along links, leading to a non-linear dynamical system.


CACTUS (Ganti, Gehrke, Ramakrishnan, KDD 99)

Same motivation, different problem formulation and approach

Precise definition of cluster, deterministic algorithm that computes all clusters

Very efficient, scalable, SQL-based algorithm


Similarity Between Attributes

“Similarity” between a1 and b1:
  support(a1,b1) = number of tuples containing (a1,b1)
  a1 and b1 are strongly connected if support(a1,b1) is higher than expected
  {a1,a2,a3,a4} and {b1,b2} are strongly connected if all pairs are

[Figure: values of attributes A (a1 to a4), B (b1 to b4), and C (c1 to c4), drawn as three columns; one pair of values is marked “Not strongly connected”.]
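These definitions can be sketched in Python as follows. The slide says only "higher than expected"; the sketch assumes that the expected support is what independent, uniformly distributed attributes would give (number of tuples divided by the product of the two domain sizes) and that "higher" means exceeding it by a factor `alpha`. The threshold form, the default `alpha`, the function names, and the data are all assumptions for illustration.

```python
from collections import Counter
from itertools import product

def pair_supports(tuples, i, j):
    """support(a, b): number of tuples with value a in column i and b in column j."""
    return Counter((t[i], t[j]) for t in tuples)

def strongly_connected_pairs(tuples, i, j, alpha=2.0):
    """Pairs whose support exceeds alpha times the assumed expected support."""
    domain_i = {t[i] for t in tuples}   # domains taken from observed values
    domain_j = {t[j] for t in tuples}
    expected = len(tuples) / (len(domain_i) * len(domain_j))
    return {p for p, s in pair_supports(tuples, i, j).items() if s > alpha * expected}

def sets_strongly_connected(set_i, set_j, strong_pairs):
    """Two value sets are strongly connected if all cross pairs are."""
    return all((a, b) in strong_pairs for a, b in product(set_i, set_j))

# Hypothetical two-attribute table
T = [("a1", "b1")] * 6 + [("a1", "b2"), ("a2", "b1"), ("a2", "b2"), ("a3", "b3")]
strong_AB = strongly_connected_pairs(T, 0, 1)
```

With 10 tuples and 3 values per attribute, the assumed expected support is about 1.1, so only (a1, b1) with support 6 clears the threshold.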


Similarity Within an Attribute

simA(b1,b2): Number of values of A which are strongly connected with both b1 and b2

[Figure: values of attributes A (a1 to a4), B (b1 to b4), and C (c1 to c4), drawn as three columns.]

sim*(B)   thru A   thru C
(b1,b2)   4        2
(b1,b3)   0        2
(b1,b4)   0        0
(b2,b3)   0        2
(b2,b4)   0        0
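The similarity through A can be computed directly from the set of strongly connected (A, B) pairs, as in this small Python sketch. The input below is hypothetical: it assumes every value a1 to a4 is strongly connected with both b1 and b2 and with nothing else, which is one configuration that yields the "thru A" values 4 and 0 shown in the table.

```python
from itertools import product

def sim_through(strong_pairs, b1, b2):
    """Number of A-values strongly connected with both b1 and b2."""
    linked_to_b1 = {a for a, b in strong_pairs if b == b1}
    linked_to_b2 = {a for a, b in strong_pairs if b == b2}
    return len(linked_to_b1 & linked_to_b2)

# Assumed strongly connected (A, B) pairs: {a1..a4} x {b1, b2}
strong_pairs_AB = set(product(["a1", "a2", "a3", "a4"], ["b1", "b2"]))

sim_b1_b2 = sim_through(strong_pairs_AB, "b1", "b2")
sim_b1_b3 = sim_through(strong_pairs_AB, "b1", "b3")
```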


Cluster Definition

Region: A cross-product of sets of attribute values: C1 x … x Cn

C = C1 x … x Cn is a cluster iff
  1. Ci and Cj are strongly connected, for all i,j
  2. Ci is maximal, for all i
  3. Support(C) > expected

Ci: cluster projection of C on Ai
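A partial check of this definition can be sketched in Python. The sketch covers conditions 1 and 3 only; maximality (condition 2) is not checked. It assumes the caller supplies the strongly connected pairs per attribute pair and the expected support of the region, since the slide does not say how "expected" is computed. All names and data are hypothetical.

```python
from itertools import combinations, product

def region_support(tuples, region):
    """Number of tuples that fall inside the region C1 x ... x Cn."""
    return sum(1 for t in tuples if all(v in c for v, c in zip(t, region)))

def satisfies_conditions_1_and_3(tuples, region, strong, expected_support):
    """strong[(i, j)] is the set of strongly connected value pairs for columns i < j."""
    pairwise = all(
        (a, b) in strong[(i, j)]
        for i, j in combinations(range(len(region)), 2)
        for a, b in product(region[i], region[j])
    )
    return pairwise and region_support(tuples, region) > expected_support

# Hypothetical two-attribute example
data = [("a1", "b1")] * 5 + [("a2", "b2")]
strong = {(0, 1): {("a1", "b1")}}
ok = satisfies_conditions_1_and_3(data, [{"a1"}, {"b1"}], strong, expected_support=2)
not_ok = satisfies_conditions_1_and_3(data, [{"a1", "a2"}, {"b1"}], strong, expected_support=2)
```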


The CACTUS Algorithm

Summarize
  Inter-attribute summaries: scan dataset
  Intra-attribute summaries: query IA summaries
Clustering phase
  Compute cluster projections
  Level-wise synthesis of cluster projections to form candidate clusters
Validation
  Requires a scan of the dataset


Inter-Attribute Summaries

Supports of all strongly connected attribute value pairs from different attributes
  Similar in nature to “frequent” 2-itemsets
  So is the computation

[Figure: values of attributes A (a1 to a4), B (b1 to b4), and C (c1 to c4), drawn as three columns.]

IJ(A,B)   IJ(A,C)   IJ(B,C)
(a1,b1)   (a1,c1)   (b1,c1)
(a1,b2)   (a1,c2)   (b1,c2)
(a2,b1)   (a2,c1)   (b2,c1)
(a2,b2)   (a2,c2)   (b2,c2)
(a3,b1)             (b3,c1)
…                   …


Intra-Attribute Summaries

simA(B): Similarities through A of attribute value pairs of B

[Figure: values of attributes A (a1 to a4), B (b1 to b4), and C (c1 to c4), drawn as three columns.]

sim*(B)   thru A   thru C
(b1,b2)   4        2
(b1,b3)   0        2
(b1,b4)   0        0
(b2,b3)   0        2
(b2,b4)   0        0


Experimental Evaluation

Compare CACTUS with STIRR [GKR98]

Synthetic datasets
  Quasi-random data [GKR98: STIRR]
  Fix domain of each attribute
  Randomly generate tuples from these domains
  Identify clusters and plant additional (5%) data within the clusters
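The generation procedure can be sketched in Python as below. Details the slides leave open are filled in with assumptions: values are drawn uniformly, the extra 5% is computed relative to the number of random tuples, and each planted tuple picks one of the cluster regions uniformly at random. The function name and parameters are hypothetical.

```python
import random

def make_dataset(n_tuples, n_attrs, domain_size, clusters,
                 extra_fraction=0.05, seed=0):
    """Uniform random tuples plus extra tuples planted inside the given regions."""
    rng = random.Random(seed)
    data = [tuple(rng.randrange(domain_size) for _ in range(n_attrs))
            for _ in range(n_tuples)]
    n_extra = round(n_tuples * extra_fraction)
    for _ in range(n_extra):
        region = rng.choice(clusters)            # pick a cluster region
        data.append(tuple(rng.choice(values) for values in region))
    return data

# Two planted regions on two attributes with domain {0, ..., 99}
planted = [
    [range(0, 10), range(0, 10)],
    [range(10, 20), range(10, 20)],
]
dataset = make_dataset(1000, 2, 100, planted)
```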


Synthetic Datasets

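Planted clusters:
  {0,…,9} x {0,…,9}
  {10,…,19} x {10,…,19}

[Figure: two attributes with domain values 0 to 99; the ranges 0 to 9 and 10 to 19 are marked.]

Both CACTUS and STIRR identified the two clusters exactly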


Synthetic Dataset (contd.)

Planted clusters:
  {0,…,9} x {0,…,9} x {0,…,9}
  {10,…,19} x {10,…,19} x {10,…,19}
  {0,…,9} x {10,…,19} x {10,…,19}

[Figure: three attributes with domain values 0 to 99; the ranges 0 to 9 and 10 to 19 are marked.]

CACTUS identifies the 3 clusters

STIRR returns:
  {0,…,9} x {0,…,19} x {0,…,9}
  {10,…,19} x {0,…,19} x {10,…,19}


Scalability with #Tuples

[Figure: Time vs. #Tuples. x-axis: #Tuples (in millions), 1 to 5; y-axis: Time (in seconds), 0 to 2500; series: CACTUS, STIRR. #Attributes: 10; Domain Size: 100.]

CACTUS is 10 times faster


Scalability with #Attributes

[Figure: Time vs. #Attributes. x-axis: #Attributes (4, 6, 8, 10, 20, 30, 40, 50); y-axis: Time (in seconds), 0 to 5000; series: CACTUS, STIRR. 1 million tuples; Domain size: 100.]


Scalability with Domain Size

[Figure: Time vs. Domain Size. x-axis: #Attribute Values (50, 100, 200, 400, 600, 800, 1000); y-axis: Time (in seconds), 0 to 250; series: CACTUS, STIRR. 1 million tuples; #attributes: 4.]


Bibliographic Data

Database and theory bibliographic entries [Wie]: 38500 entries

Attributes: first author, second author, conference/journal, and year

Example cluster projections on the conference attribute:
  (1) ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
  (2) ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
  (3) PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …


ROCK (Guha, Rastogi, Shim, ICDE 99)

Each tuple is a node, and two nodes are linked if within a threshold distance.

Similarity between two nodes is the number of common neighbors.

ROCK does agglomerative hierarchical clustering based on similarity.
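The neighbor and common-neighbor computation described above can be sketched in Python. Only the link counts are computed; the agglomerative merging step is left out. The sketch assumes tuples are represented as non-empty sets of values, uses Jaccard distance as the distance measure, and does not count a point as its own neighbor; these choices and all names are assumptions for illustration.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two non-empty sets."""
    return 1 - len(a & b) / len(a | b)

def rock_links(points, threshold, distance=jaccard_distance):
    """Return {(i, j): number of common neighbors} for all pairs i < j."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if distance(points[i], points[j]) <= threshold:  # linked nodes
            neighbors[i].add(j)
            neighbors[j].add(i)
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}

# Hypothetical tuples as value sets
points = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "e"}, {"x", "y"}]
links = rock_links(points, threshold=0.5)
```

The first three points are pairwise neighbors, so each pair among them shares exactly one common neighbor, while the fourth point has no links to any of them.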


Research Directions: The EDAM Project


Example Tasks

Label a spectrum to identify elements
Find common elements across (subsets of) spectra
  Collected at multiple locations, and multiple conditions, and …
  At different times, and over time periods
Find subsets of spectra (e.g., based on time periods and locations) with
  Unusually common elements
  Interesting characteristics
  Correlations to other spectral streams
Want to be able to reconstruct analysis done a year ago and run it on different data
Want to share ongoing analysis with colleagues and track changes and their impact


[Slides omitted from this version]


Conclusions

Database systems hold a lot of the data people care about and want to mine, making them an important part of the mining environment
  Especially for ongoing analysis and collaboration

Beyond this, there are a number of ideas and techniques in the DB literature that can be applied more broadly
  Formulations of mining tasks
  Algorithms

Scalability is an important idea from databases
  But there are many more: compositionality, query-driven approach, set-oriented analyses
