![Page 1: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/1.jpg)
Mining: A Database Perspective
Raghu Ramakrishnan
Univ. of Wisconsin-Madison
![Page 2: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/2.jpg)
THE EDAM PROJECT University of Wisconsin-Madison
Data Mining
- Classification: decision trees, regression, SVMs, Naïve Bayes, meta-learners and ensembles
- Clustering: K-means, hierarchical methods, EM
- MRDM/ILP pattern discovery: Horn rules; PRMs
- Frequent item analysis: associations, sequential patterns
- Time-series analysis: linear and nonlinear dynamics
- Collaborative filtering
- Text, multimedia mining
[Figure: Data Mining drawn at the intersection of ML/AI, Optimization, DB, and Stats]
![Page 3: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/3.jpg)
Mining at a Crossroads
- Data mining has drawn upon ideas and people from many disciplines, and has grown rapidly.
- As yet, there is no unifying vision of how these disciplines leverage each other.
  - Stats folks still do stats, ML folks still do ML, DB folks still think about large datasets, and they rarely talk to each other.
- What are the applications that will pay the piper?
![Page 4: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/4.jpg)
About this Talk
- A database perspective on data mining and its relationship to data management
  - How can database-oriented thinking influence research and practice in data mining?
  - What are the difficult problems with big payoffs?
- The EDAM project at Wisconsin
  - Analyzing streams of mass spectra and other spatio-temporal data
  - Joint work with researchers in atmospheric aerosols and climatology at UW-Madison and Carleton College, funded by an NSF ITR
![Page 5: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/5.jpg)
Outline
- A database perspective
- Recent extensions to relational systems
  - OLAP: cube, sequence queries
  - Data mining support
- Relational approaches to mining
  - Relational clustering
  - MRDM/ILP
- The EDAM project
![Page 6: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/6.jpg)
A Database Perspective
![Page 7: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/7.jpg)
All the World’s a Table
- All data is in a database. If not, it’s not important.
- Data mining is a class of analysis techniques that complements current SQL data analysis capabilities.
  - Data is in a DBMS for reasons that go well beyond the analysis capabilities of the DBMS, even if these are often inadequate.
- And if the past is any indication, the DB vendors will try to expand SQL to support whatever DM capabilities the market will pay for, and it’s not clear that this is the right architecture.
![Page 8: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/8.jpg)
Scalability
- Widely recognized as a characteristic DB concern, and the DB field provides useful techniques to deal with scale.
  - BIRCH: scalable pre-clustering that borrows ideas from B+ trees
  - RainForest: framework for scaling decision tree construction that borrows from hash joins
  - (There are also scalable algorithms based on EM and bootstrapping)
- However, the focus has been on one aspect of scale:
  - Size of training data
- We also need scalability with respect to other problem dimensions:
  - Size of hypothesis space
  - Rate of data capture and analysis
![Page 9: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/9.jpg)
Queries vs. Mining
- From the point of view of the user, SQL queries are one way to explore and understand the data. But is it “data mining”?
  - The various data mining techniques are no more (or less) than alternatives with different capabilities.
- The query framework has some ideas worth borrowing and generalizing:
  - Compositionality: more flexibility, more automation
  - Usability: domain analysts, not tool experts
  - Query optimization
![Page 10: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/10.jpg)
A Different Mindset …
- Sometimes, just looking at the problem from a different perspective may lead to useful reformulations:
  - Frequent itemsets
  - Relational clustering
  - Stream analysis
  - Labeling spectra
  - Subset mining
- “What does a query mean?” vs. “How do I characterize my data?”
  - Hopefully, not mutually exclusive!
- Can raise very different concerns
  - E.g., coverage, accuracy (ML), confidence bounds (Stats) vs. query equivalence, compositionality (DB)
  - Combining multiple sources of information (e.g., multiple tables)
![Page 11: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/11.jpg)
Query Optimization
- Driven by user’s query
  - Goal is to find answers to this query efficiently
- Search space for optimization
  - Defined through equivalences to given query
  - Exploits compositionality!
- “Goodness” metric is estimated plan cost
- Contrast this with the search spaces typical in, e.g., rule discovery or attribute selection
  - These are data-driven, not query-driven
  - Search space based on hypothesis refinement
  - “Goodness” metric based on coverage of training set
![Page 12: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/12.jpg)
Data Management
- Data storage and archival
- Privacy, sharing, collaboration
- Focus has been on managing data; however:
  - Queries can be stored in the DBMS
  - Views, or tables defined by queries
  - (Ownership, access control, re-optimization, caching)
- We need more support for managing analyses:
  - Managing analyses external to the DBMS
  - Provenance of data and analysis
  - Versioning and collaboration support
  - Support for ongoing analyses: impact of data changes on analyses; monitoring; trend analysis over warehouses; deploying results into operational systems
![Page 13: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/13.jpg)
Data Co-Processor Architecture
[Figure: architecture diagram. Labels: Indexer; Miner; Files, Logs; DBMS; RAID storage; Warehouse; small reads; large R/W; periodic offline activity; queries/searches]
![Page 14: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/14.jpg)
[Figure: diagram. Labels: Updates; SQL Queries; OLAP Queries; Text Queries; SYNC; customized asynchronous replicas]
![Page 15: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/15.jpg)
Recent Extensions of Relational Queries
![Page 16: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/16.jpg)
Star Schema
- Fact table: Transactions(timekey, storekey, pkey, promkey, ckey, units, price)
- Dimension tables: Time, Store, Customers, Products, Promotions
![Page 17: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/17.jpg)
Multidimensional Analysis

| | NY | CA | WI |
|---|---|---|---|
| Industry1 | $1000 | $2000 | $1000 |
| Industry2 | $500 | $1000 | $500 |
| Industry3 | $3000 | $3000 | $3000 |

- Selection: Country=“USA”
- Dimension hierarchies shown in the figure:
  - Product: Industry, Category, Product
  - Location: Country, State, City
  - Time: Year, Quarter, Month / Week, Day
![Page 18: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/18.jpg)
Slice and Drill-Down

| | San Francisco | San Jose | Los Angeles |
|---|---|---|---|
| Category1 | $300 | $300 | $400 |
| Category2 | $300 | $300 | $400 |
| Category3 | $100 | $800 | $100 |

- Selections: Industry=“Industry3”, State=“CA”
- Dimension hierarchies shown in the figure:
  - Product: Industry, Category, Product
  - Location: Country, State, City
  - Time: Year, Quarter, Month / Week, Day
![Page 19: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/19.jpg)
Comparison with SQL

    SELECT SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid=T.timeid AND S.locid=L.locid
    GROUP BY T.year, L.city

    SELECT SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeid=T.timeid
    GROUP BY T.year

    SELECT SUM(S.sales)
    FROM Sales S, Locations L
    WHERE S.locid=L.locid
    GROUP BY L.city
![Page 20: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/20.jpg)
Visual Intuition: Cube
[Figure: data cube with axes Product (Product1 to Product6), Location (SH, SF, LA), and Time (M, T, W, Th, F, S, S). Highlighted cell: 50 units of Product6 sold on Monday in LA. Annotations: roll-up to week, roll-up to category, roll-up to state.]
![Page 21: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/21.jpg)
CUBE Operator
- For k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
- CUBE pid, locid, timeid BY SUM Sales
  - Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

        SELECT SUM(S.sales)
        FROM Sales S
        GROUP BY grouping-list
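The equivalence between CUBE and the 2^k GROUP BY queries can be sketched in a few lines of Python over an in-memory SQLite database. This is only an illustration under stated assumptions: the `cube` helper and the sample rows are hypothetical, and a real system would share work across the roll-ups rather than run each query independently.

```python
import sqlite3
from itertools import combinations

def cube(conn, dims, measure="sales", table="Sales"):
    """Emulate CUBE naively: one GROUP BY query per subset of dims (hypothetical helper)."""
    results = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            cols = ", ".join(subset)
            if subset:
                sql = f"SELECT {cols}, SUM({measure}) FROM {table} GROUP BY {cols}"
            else:
                # The empty grouping list is the grand total.
                sql = f"SELECT SUM({measure}) FROM {table}"
            results[subset] = conn.execute(sql).fetchall()
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INTEGER, locid INTEGER, timeid INTEGER, sales INTEGER)")
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 30), (2, 2, 2, 40)],
)
rollups = cube(conn, ("pid", "locid", "timeid"))
print(len(rollups))  # one entry per subset of the three dimensions
```

Running eight separate queries is exactly the redundancy that a CUBE implementation can avoid by exploiting commonality across the roll-ups.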
![Page 22: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/22.jpg)
Observation
- When you need to consider several related or overlapping computations:
  - Think of how to expose this space to the user, and to get user input on what part of the space might be interesting
    - Marketing specialists can use OLAP interfaces to do very complex queries easily
  - Think of how to optimize by exploiting commonality across computations
![Page 23: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/23.jpg)
Querying Sequences
- SQL-92 supports queries over relations. A relation is a (multi)set of records.
  - No ordering of records in a relation!
- Queries involving order are hard or impossible to express, and are typically evaluated inefficiently.
  - Find the weekly moving average of the DJIA.
  - Compute the % change of each stock during ‘97, and then find stocks in the top 5% (those that changed most).
- SQL:1999 supports the concept of windowing, which effectively orders tuples for query purposes.
![Page 24: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/24.jpg)
SRQL (Ramakrishnan et al., SSDBM 98)
- Proposed a sequencing operator as an extension to relational algebra.
- Applied to a table R, with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.

Input table R:

| g | s | v |
|---|---|---|
| 3 | 4 | a |
| 3 | 6 | b |
| 3 | 6 | c |
| 3 | 9 | b |
| 2 | 1 | a |
| 4 | 3 | d |

Resulting composite sequence:

| ord | g | s | v |
|---|---|---|---|
| 1 | 3 | 4 | a |
| 2 | 3 | 6 | b |
| 2 | 3 | 6 | c |
| 3 | 3 | 9 | b |
| 1 | 2 | 1 | a |
| 1 | 4 | 3 | d |
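A minimal Python sketch of what the sequencing operator computes on the example table, assuming (as the ord column suggests) that records with equal sequencing values within a group share a position. The function name `sequence` and its signature are hypothetical, not SRQL's actual interface.

```python
from itertools import groupby

def sequence(rows, group_key, seq_key):
    """Annotate each row with its ordinal position inside its group (hypothetical helper).

    Rows that tie on the sequencing value share the same position.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (group_key(r), seq_key(r)))
    for _, group in groupby(ordered, key=group_key):
        position = 0
        previous = object()  # sentinel that compares unequal to any sequencing value
        for row in group:
            if seq_key(row) != previous:
                position += 1
                previous = seq_key(row)
            out.append((position,) + tuple(row))
    return out

# The (g, s, v) rows of the example table.
R = [(3, 4, "a"), (3, 6, "b"), (3, 6, "c"), (3, 9, "b"), (2, 1, "a"), (4, 3, "d")]
seq = sequence(R, group_key=lambda r: r[0], seq_key=lambda r: r[1])
for row in seq:
    print(row)  # (ord, g, s, v); groups come out sorted by g, unlike the slide's listing
```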
![Page 25: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/25.jpg)
Example
- Find the 2-day moving average of volume sold for each product:

        SELECT product, day, AVG(vol) OVER 0 TO 1
        FROM Sales
        GROUP BY product
        SEQUENCE BY day

  - In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences.
- Observe how this generalizes SQL’s GROUP BY: illustrates power of composite sequences and aggregation.
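For comparison, the same kind of query can be written with standard SQL window functions. The sketch below uses Python's `sqlite3` module (window functions need SQLite 3.25 or later) and assumes that `OVER 0 TO 1` means "the current record and the one that follows it in the sequence"; that reading, and the sample rows, are assumptions rather than something the slides state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (product TEXT, day INTEGER, vol REAL)")
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("tv", 1, 10.0), ("tv", 2, 20.0), ("tv", 3, 60.0), ("ham", 1, 4.0), ("ham", 2, 8.0)],
)

# PARTITION BY plays the role of SRQL's GROUP BY, ORDER BY that of SEQUENCE BY,
# and the ROWS frame that of the OVER 0 TO 1 window.
moving_avg = conn.execute(
    """
    SELECT product, day,
           AVG(vol) OVER (PARTITION BY product ORDER BY day
                          ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS avg_vol
    FROM Sales
    ORDER BY product, day
    """
).fetchall()
for row in moving_avg:
    print(row)
```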
![Page 26: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/26.jpg)
Variants of Aggregation
- We can now introduce “running sum” and other cumulative aggregate functions!
  - OVER FIRST TO 0: this gives us “running” or “cumulative” aggregates.
  - RANK() is CUMULATIVE COUNT(*)
  - PERCENTILE() is (RANK()/COUNT(*))*100
- Elegant way to express concepts like “give me the first few answers”.
- SQL:1999 does all this and more (different syntax)
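The cumulative aggregates can be illustrated with a short Python sketch that treats a list as one already-ordered sequence. The function names are hypothetical; the definitions follow the slide's formulas, with rank taken as a cumulative count.

```python
from itertools import accumulate

def running_sum(values):
    """Cumulative SUM: aggregate over the window from the first record to the current one."""
    return list(accumulate(values))

def rank(values):
    """Cumulative COUNT(*): the 1-based position of each record in the sequence."""
    return list(accumulate(1 for _ in values))

def percentile(values):
    """(RANK() / COUNT(*)) * 100 for each record."""
    n = len(values)
    return [r / n * 100 for r in rank(values)]

print(running_sum([3, 1, 4]))
print(rank([3, 1, 4]))
print(percentile([5, 6, 7, 8]))
```

"Give me the first few answers" then amounts to keeping the records whose rank is at most some bound.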
![Page 27: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/27.jpg)
Observation
Still much more limited than time-series analysis and mining techniques available elsewhere
No support for streams
![Page 28: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/28.jpg)
DBMS Support for Managing Mining Models
![Page 29: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/29.jpg)
Why Integrate?
[Figure: diagram. Labels: Data; Extract; Copy; Mine; Models; Consistency?]
![Page 30: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/30.jpg)
Integration Objectives
- Analysts (users)
  - Avoid isolation of querying from mining
    - Difficult to do “ad-hoc” mining
  - Provide simple programming approach to creating and using DM models
- DM vendors
  - Make it possible to add new models
  - Make it possible to add new, scalable algorithms
![Page 31: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/31.jpg)
DM Concepts to Support
- Representation of input (cases)
- Representation of models
- Specification of training step
- Specification of prediction step

Should be independent of specific algorithms
![Page 32: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/32.jpg)
Types of Columns

| Cust ID | Age | Marital Status | Wealth | Product | Quantity | Type |
|---|---|---|---|---|---|---|
| 1 | 35 | M | 380,000 | TV | 1 | Appliance |
| | | | | Coke | 6 | Drink |
| | | | | Ham | 3 | Food |

Product, Quantity, and Type are the columns of the nested table Product Purchases. All three rows form a single case!

- Keys: columns that uniquely identify a case
- Attributes: columns that describe a case
  - Value: a state associated with the attribute in a specific case
- Attribute Property: columns that describe an attribute
  - Unique for a specific attribute value (TV is always an appliance)
- Attribute Modifier: columns that represent additional “meta” information for an attribute
  - Weight of a case, certainty of prediction
![Page 33: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/33.jpg)
Representing a DMM
- Specifying a model
  - Columns it should predict
  - Algorithm to use
  - Special parameters
- Model is represented as a nested table
  - Specification = Create table
  - Training = Inserting data into the table
  - Predicting = Querying the table
![Page 34: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/34.jpg)
Training a DMM
- Training a DMM requires passing it “known” cases
- Use an INSERT INTO in order to “insert” the data into the DMM
  - The DMM will usually not retain the inserted data
  - Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model)

        INSERT [INTO] <mining model name>
        [(columns list)]
        <source data query>
![Page 35: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/35.jpg)
Making Predictions
SELECT [Customers].[ID],
MyDMM.[Hair Color],
PredictProbability(MyDMM.[Hair Color])
FROM
MyDMM PREDICTION JOIN [Customers]
ON MyDMM.[Gender] = [Customers].[Gender] AND
MyDMM.[Age] = [Customers].[Age]
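The create / insert / query life cycle of a mining model can be mimicked with a toy Python class. Everything here is hypothetical: `ToyMiningModel` is not part of any DBMS API, and the "algorithm" is just a frequency count per combination of input values, chosen only to keep the sketch short.

```python
from collections import Counter, defaultdict

class ToyMiningModel:
    """Hypothetical stand-in for a mining model with a table-like interface."""

    def __init__(self, input_columns, predict_column):
        # Specification step: which columns are inputs, which one is predicted.
        self.input_columns = tuple(input_columns)
        self.predict_column = predict_column
        self._counts = defaultdict(Counter)

    def insert(self, cases):
        # Training step: analyze the cases; the cases themselves are not retained.
        for case in cases:
            key = tuple(case[c] for c in self.input_columns)
            self._counts[key][case[self.predict_column]] += 1

    def predict(self, case):
        # Prediction step: match the case to the model on the input columns.
        key = tuple(case[c] for c in self.input_columns)
        counts = self._counts.get(key)
        if not counts:
            return None, 0.0
        value, n = counts.most_common(1)[0]
        return value, n / sum(counts.values())

model = ToyMiningModel(["Gender", "Age"], "Hair Color")
model.insert([
    {"Gender": "F", "Age": 30, "Hair Color": "Black"},
    {"Gender": "F", "Age": 30, "Hair Color": "Black"},
    {"Gender": "F", "Age": 30, "Hair Color": "Brown"},
    {"Gender": "M", "Age": 40, "Hair Color": "Gray"},
])
print(model.predict({"Gender": "F", "Age": 30}))
```

The second element of the returned pair plays the role that PredictProbability plays in the query above.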
![Page 36: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/36.jpg)
Research Directions
MRDM/ILP
![Page 37: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/37.jpg)
MRDM Accomplishments
- ILP origins, hypothesis discovery
- Classification
- Clustering
- Frequent itemsets
- Equational discovery
- Subgroup discovery
- Extensions of Bayesian nets to multiple relations via key-foreign key traversals
![Page 38: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/38.jpg)
Issues
- Can we indeed capture the semantics exactly for each of these classes of patterns/models?
  - Taking into account the details of the underlying evaluation algorithm!
- Is the performance comparable to specialized algorithms?
  - Is it acceptable for a broad range of applications?
![Page 39: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/39.jpg)
Positives
- Impressive! Quite a range of patterns/models are shown to be expressible in this formalism
  - Importantly, the added expressiveness allows new kinds of patterns to be naturally formulated by a user
- There is a (more or less) common computational structure consisting of
  - Space of patterns to search
  - Measure of support for a pattern
  - Enumeration and pruning strategy over search space
- What tangible benefits can we derive from this generality?
![Page 40: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/40.jpg)
Challenges, Opportunities
- If ILP notation is roughly analogous to relational calculus, what is the appropriate algebra?
  - Equivalences, compositionality
  - Cost-based optimization to find “optimal” evaluation plans
- What kind of user input/domain knowledge can be used to focus computation, or help with optimization?
![Page 41: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/41.jpg)
Research Directions
Relational Clustering
![Page 42: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/42.jpg)
Problem Statement
- Goal: Discover clusters of attribute-values
- Data: A table T with attributes drawn from domains D1,…,Dn
  - Thus, a tuple of T consists of a value from each domain, e.g., (a1,b2,c1)
  - T could be an arbitrary view over several tables!
- Note: We expect sizes of D1,…,Dn to be small
[Figure: attribute values drawn as three columns of nodes: a1 to a4 (A), b1 to b3 (B), c1 to c4 (C)]
![Page 43: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/43.jpg)
STIRR (Gibson, Kleinberg, Raghavan, VLDB 98)
- Intuition: Want to detect that “Honda and Toyota are related because unusually high numbers of both were sold in August.”
  - If we also find that many Hondas and Nissans are sold in Sept, and many dealers sell both Hondas and Acuras, this leads to a cluster best described as “late-summer sales of Japanese cars”
- Approach: Techniques for spectral graph partitioning, generalized to hypergraphs.
  - Attribute values as weighted vertices in a graph; edges based on co-occurrence.
  - Weights propagate along links, leading to a non-linear dynamical system.
![Page 44: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/44.jpg)
CACTUS (Ganti, Gehrke, Ramakrishnan, KDD 99)
- Same motivation, different problem formulation and approach
- Precise definition of cluster, deterministic algorithm that computes all clusters
- Very efficient, scalable, SQL-based algorithm
![Page 45: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/45.jpg)
Similarity Between Attributes
- “Similarity” between a1 and b1: support(a1,b1) = number of tuples containing (a1,b1)
- a1 and b1 are strongly connected if support(a1,b1) is higher than expected
- {a1,a2,a3,a4} and {b1,b2} are strongly connected if all pairs are
[Figure: attribute values a1 to a4 (A), b1 to b4 (B), c1 to c4 (C) drawn as three columns of nodes, with edges between strongly connected values; one pair is marked “Not strongly connected”]
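A small Python sketch of the support and strong-connection test. The slide only says "higher than expected"; the sketch assumes the expected support of a value pair is what attribute independence with uniformly distributed values would give, |T| / (|Di| · |Dj|), and that a pair must exceed it by a factor `alpha`. The function name, `alpha`, and the sample data are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def strongly_connected_pairs(tuples, domains, alpha=2.0):
    """Return {((i, a), (j, b)): support} for value pairs whose support exceeds
    alpha times the support expected under independence (assumed definition)."""
    n = len(tuples)
    support = Counter()
    for t in tuples:
        # Count co-occurrences of values from every pair of distinct attributes.
        for i, j in combinations(range(len(t)), 2):
            support[((i, t[i]), (j, t[j]))] += 1
    strong = {}
    for ((i, a), (j, b)), count in support.items():
        expected = n / (len(domains[i]) * len(domains[j]))
        if count > alpha * expected:
            strong[((i, a), (j, b))] = count
    return strong

# Made-up data: two attributes with two values each.
T = [("a1", "b1")] * 6 + [("a1", "b2"), ("a2", "b2")]
domains = [{"a1", "a2"}, {"b1", "b2"}]
strong = strongly_connected_pairs(T, domains, alpha=2.0)
print(strong)  # only (a1, b1) co-occurs much more often than expected
```

As with frequent 2-itemsets, all pair counts come from a single scan of the data.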
![Page 46: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/46.jpg)
Similarity Within an Attribute
- simA(b1,b2): Number of values of A which are strongly connected with both b1 and b2

| sim*(B) | thru A | thru C |
|---|---|---|
| (b1,b2) | 4 | 2 |
| (b1,b3) | 0 | 2 |
| (b1,b4) | 0 | 0 |
| (b2,b3) | 0 | 2 |
| (b2,b4) | 0 | 0 |

[Figure: attribute values a1 to a4 (A), b1 to b4 (B), c1 to c4 (C) drawn as three columns of nodes, with edges between strongly connected values]
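Given the strongly connected pairs between A and B, the similarity through A follows directly from the definition. In this Python sketch the edge set is an assumption chosen to be consistent with the "thru A" column of the table (every value of A strongly connected to b1 and b2 only); the figure's actual edges are not recoverable from the transcript, and the function name is hypothetical.

```python
from itertools import combinations

def similarity_through(strong_pairs, b_values):
    """For each pair of B values, count the A values strongly connected to both.

    strong_pairs is a set of (a, b) pairs between attributes A and B.
    """
    connected = {b: {a for a, bb in strong_pairs if bb == b} for b in b_values}
    return {
        (x, y): len(connected[x] & connected[y])
        for x, y in combinations(b_values, 2)
    }

# Assumed edge set: a1..a4 are each strongly connected to b1 and to b2.
strong_AB = {(a, b) for a in ("a1", "a2", "a3", "a4") for b in ("b1", "b2")}
sim_A = similarity_through(strong_AB, ["b1", "b2", "b3", "b4"])
print(sim_A[("b1", "b2")], sim_A[("b1", "b3")])
```

No access to the original tuples is needed here, which is why the intra-attribute summaries can be computed by querying the inter-attribute summaries.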
![Page 47: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/47.jpg)
Cluster Definition
- Region: A cross-product of sets of attribute values: C1 x … x Cn
- C = C1 x … x Cn is a cluster iff
  1. Ci and Cj are strongly connected, for all i,j
  2. Ci is maximal, for all i
  3. Support(C) > expected
- Ci: cluster projection of C on Ai
![Page 48: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/48.jpg)
The CACTUS Algorithm
- Summarize
  - Inter-attribute summaries: scan dataset
  - Intra-attribute summaries: query IA summaries
- Clustering phase
  - Compute cluster projections
  - Level-wise synthesis of cluster projections to form candidate clusters
- Validation
  - Requires a scan of the dataset
![Page 49: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/49.jpg)
Inter-Attribute Summaries
- Supports of all strongly connected attribute value pairs from different attributes
  - Similar in nature to “frequent” 2-itemsets
  - So is the computation

| IJ(A,B) | IJ(A,C) | IJ(B,C) |
|---|---|---|
| (a1,b1) | (a1,c1) | (b1,c1) |
| (a1,b2) | (a1,c2) | (b1,c2) |
| (a2,b1) | (a2,c1) | (b2,c1) |
| (a2,b2) | (a2,c2) | (b2,c2) |
| (a3,b1) | | (b3,c1) |
| … | | … |

[Figure: attribute values a1 to a4 (A), b1 to b4 (B), c1 to c4 (C) drawn as three columns of nodes, with edges between strongly connected values]
![Page 50: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/50.jpg)
Intra-Attribute Summaries
- simA(B): Similarities through A of attribute value pairs of B

| sim*(B) | thru A | thru C |
|---|---|---|
| (b1,b2) | 4 | 2 |
| (b1,b3) | 0 | 2 |
| (b1,b4) | 0 | 0 |
| (b2,b3) | 0 | 2 |
| (b2,b4) | 0 | 0 |

[Figure: attribute values a1 to a4 (A), b1 to b4 (B), c1 to c4 (C) drawn as three columns of nodes, with edges between strongly connected values]
![Page 51: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/51.jpg)
Experimental Evaluation
- Compare CACTUS with STIRR [GKR98]
- Synthetic datasets
  - Quasi-random data [GKR98: STIRR]
    - Fix domain of each attribute
    - Randomly generate tuples from these domains
    - Identify clusters and plant additional (5%) data within the clusters
![Page 52: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/52.jpg)
Synthetic Datasets
- Clusters: {0,…,9} x {0,…,9} and {10,…,19} x {10,…,19}
- Both CACTUS and STIRR identified the two clusters exactly
[Figure: attribute domains 0 to 99, with the ranges 0 to 9 and 10 to 19 marked]
![Page 53: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/53.jpg)
Synthetic Dataset (contd.)
- Clusters:
  - {0,…,9} x {0,…,9} x {0,…,9}
  - {10,…,19} x {10,…,19} x {10,…,19}
  - {0,…,9} x {10,…,19} x {10,…,19}
- CACTUS identifies the 3 clusters
- STIRR returns:
  - {0,…,9} x {0,…,19} x {0,…,9}
  - {10,…,19} x {0,…,19} x {10,…,19}
[Figure: attribute domains 0 to 99, with the ranges 0 to 9 and 10 to 19 marked]
![Page 54: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/54.jpg)
Scalability with #Tuples
[Figure: Time (in seconds, axis 0 to 2500) vs. #Tuples (in millions, 1 to 5) for CACTUS and STIRR. #Attributes: 10; domain size: 100]
- CACTUS is 10 times faster
![Page 55: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/55.jpg)
Scalability with #Attributes
[Figure: Time (in seconds, axis 0 to 5000) vs. #Attributes (4, 6, 8, 10, 20, 30, 40, 50) for CACTUS and STIRR. 1 million tuples; domain size: 100]
![Page 56: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/56.jpg)
Scalability with Domain Size
[Figure: Time (in seconds, axis 0 to 250) vs. #Attribute values (50, 100, 200, 400, 600, 800, 1000) for CACTUS and STIRR. 1 million tuples; #attributes: 4]
![Page 57: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/57.jpg)
Bibliographic Data
- Database and theory bibliographic entries [Wie]: 38500 entries
- Attributes: first author, second author, conference/journal, and year
- Example cluster projections on the conference attribute:
  1. ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
  2. ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
  3. PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
![Page 58: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/58.jpg)
ROCK (Guha, Rastogi, Shim, ICDE 99)
- Each tuple is a node, and two nodes are linked if within a threshold distance.
- Similarity between two nodes is the number of common neighbors.
- ROCK does agglomerative hierarchical clustering based on similarity.
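The neighbor and common-neighbor computation can be sketched in Python as follows. The sketch assumes records are sets of items, that "within a threshold distance" is tested with Jaccard similarity at least `theta`, and that a record is not counted as its own neighbor; these are illustrative choices, and the agglomerative merging step is omitted.

```python
from itertools import combinations

def jaccard(x, y):
    """Jaccard similarity of two non-empty sets."""
    return len(x & y) / len(x | y)

def rock_links(records, theta):
    """Return {(i, j): number of common neighbors} for every pair of records.

    Two records are neighbors when their Jaccard similarity is at least theta
    (an assumed neighbor test).
    """
    n = len(records)
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(records[i], records[j]) >= theta:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i, j in combinations(range(n), 2)
    }

# Made-up records: the first three overlap pairwise, the last is isolated.
records = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {7, 8, 9}]
links = rock_links(records, theta=0.5)
print(links[(0, 1)], links[(0, 3)])
```

Counting common neighbors rather than using the raw pairwise similarity lets the decision to merge two records depend on their neighborhoods, not just on the two records themselves.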
![Page 59: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/59.jpg)
Research Directions
The EDAM Project
![Page 60: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/60.jpg)
Example Tasks
- Label a spectrum to identify elements
- Find common elements across (subsets of) spectra
  - Collected at multiple locations, and multiple conditions, and …
  - At different times, and over time periods
- Find subsets of spectra (e.g., based on time periods and locations) with
  - Unusually common elements
  - Interesting characteristics
  - Correlations to other spectral streams
- Want to be able to reconstruct analysis done a year ago and run it on different data
- Want to share ongoing analysis with colleagues and track changes and their impact
![Page 61: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/61.jpg)
[Slides omitted from this version]
![Page 62: Mining: A Database Perspective](https://reader035.vdocument.in/reader035/viewer/2022062409/568149cc550346895db6fb0a/html5/thumbnails/62.jpg)
Conclusions
- Database systems hold a lot of the data people care about and want to mine, making them an important part of the mining environment
  - Especially for ongoing analysis and collaboration
- Beyond this, there are a number of ideas and techniques in the DB literature that can be applied more broadly
  - Formulations of mining tasks
  - Algorithms
- Scalability is an important idea from databases
  - But there are many more: compositionality, query-driven approach, set-oriented analyses