Mining Knowledge about Changes, Differences, and Trends
Guozhu Dong Wright State University
Dayton, Ohio
Outline
• Introduction
– Knowledge discovery from databases (KDD)
– Knowledge about changes, differences, & trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Introduction -- KDD (1)
• Mountains of data, everywhere!
– Use them for better service, better cure, …
• Aims of KDD
– Mine valid, novel, potentially useful patterns
– Classifiers, clustering, associations, insights, …
• History
– Traditional scientific discovery = manual mining
– Ancestry of KDD: statistics, machine learning, pattern recognition, database, …
– Field started in 1990s
• Data forms
– Market basket data (transactions)
– Relational data
– Data cubes (relational + concept hierarchies)
Introduction – KDD (2)
• Main tasks for KDD
– Identifying “useful pattern types”
– Giving algorithms for mining them
– Finding ways for using them
• Our contributions are along these lines
Example knowledge patterns about changes, differences, & trends (CDT)
• Compare dataset A against dataset B, looking for patterns capturing CDT
– Cancer tissues vs normal tissues
– Loyal customers vs disloyal customers
– Data_1999 vs Data_2000
• Compare cells in a data cube, looking for similar cells with big measure differences
– “Gradients”
• Analyze trends in an MDML (multidimensional multi-level) manner on a set of time series in a data cube
Gene groups
Drug design
Emerging trends
Traditional approaches to “mining” CDT
• Compare histograms or pie charts of datasets
• Study time series, one or two at a time
• Summaries
• Limitations:
– Only offer high level view, on very few “factors/variables”
– But miss knowledge on many factor groups, many insights
[Figure: chart of East, West, and North by quarter (1st Qtr to 4th Qtr), vertical axis 0 to 90]
Gain a little, miss a lot
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 etc)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Emerging Patterns between Two Datasets
EP: patterns with a high frequency ratio between datasets
E.g. {g1=L, g2=H, g3=L}; freq ratio = infinite

Normal Tissues
g1 g2 g3 g4
L  H  L  H
L  H  L  L
H  L  L  H
L  H  H  L

Cancer Tissues
g1 g2 g3 g4
H  H  L  H
L  H  H  H
L  L  L  H
H  H  H  L
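The frequency ratio in the example can be checked with a few lines of Python. This is a minimal sketch, assuming each tissue sample is stored as a mapping from gene to its binned level; the helper names `support` and `growth_rate` are illustrative and do not come from the slides.

```python
import math

# Binned expression levels from the two tables on this slide.
GENES = ["g1", "g2", "g3", "g4"]
normal = [dict(zip(GENES, row.split())) for row in
          ["L H L H", "L H L L", "H L L H", "L H H L"]]
cancer = [dict(zip(GENES, row.split())) for row in
          ["H H L H", "L H H H", "L L L H", "H H H L"]]


def support(pattern, samples):
    """Fraction of samples that match every (gene, level) pair in pattern."""
    hits = sum(all(s[g] == v for g, v in pattern.items()) for s in samples)
    return hits / len(samples)


def growth_rate(pattern, target, background):
    """Frequency ratio of pattern in target vs background (inf if absent there)."""
    sup_t, sup_b = support(pattern, target), support(pattern, background)
    if sup_b == 0:
        return math.inf if sup_t > 0 else 0.0
    return sup_t / sup_b


pattern = {"g1": "L", "g2": "H", "g3": "L"}
print(support(pattern, normal))              # 0.5
print(support(pattern, cancer))              # 0.0
print(growth_rate(pattern, normal, cancer))  # inf
```

The pattern matches two of the four normal samples and none of the cancer samples, which is why its ratio is infinite.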
Colon tumor gene expression
• 40 tumor, 22 normal colon tissue samples
• 6500 genes/sample (Affymetrix Hum6000 micro-array gene chip)
g1 g2 g3 g4
20 90 25 80
24 95 23 28
80 20 25 85
25 89 85 25
Original GE data
Last page: binned data
100s of samples
1000s of dimensions
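The slide notes that the L/H values on the previous slide are a binned version of this raw data. The sketch below illustrates one way such binning could work. The single cutoff of 50 is an assumption made for illustration only; the slides do not state the actual discretization rule.

```python
# Raw expression values from the table above (one row per sample).
raw = [
    [20, 90, 25, 80],
    [24, 95, 23, 28],
    [80, 20, 25, 85],
    [25, 89, 85, 25],
]

CUTOFF = 50  # hypothetical threshold; the binning rule is not given in the slides


def binarize(row, cutoff=CUTOFF):
    """Map each value to 'H' (above cutoff) or 'L' (at or below it)."""
    return ["H" if v > cutoff else "L" for v in row]


binned = [binarize(r) for r in raw]
for r in binned:
    print(" ".join(r))
```

With this cutoff the four rows become L H L H, L H L L, H L L H, and L H H L, which matches the normal-tissue table on the previous slide.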
Top minimal EPs w/ infinite freq ratio
NormalEP                      FreqInNormal   CancerEP      FreqInCancer
{25 33 37 41 43 57 59 69}     77.3%          {2 10}        70%
{25 33 37 41 43 47 57 69}     77.3%          {3 10}        67.5%
{29 33 35 37 41 43 57 69}     77.3%          {10 20}       67.5%
{29 33 37 41 43 47 57 69}     77.3%          {10 21}       67.5%
…                                            …
{6 43 57}                     77.3%          {21 58}       65%
{6 47 57}                     77.3%          {15 40 56}    62.5%
{6 57 69}                     77.3%          {21 40 56}    62.5%
Papers using EP techniques: in Cancer Cell (cover, 3/02) & in Bioinformatics
Minimal EPs with infinite ratio (jumping EPs): all their proper subsets occur in both classes of tissues
EP Types of Particular Interest (1)
• Minimal jumping EPs for normal tissues
– Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues
– Restore these → cure colon cancer?
• Minimal jumping EPs for cancer tissues
– Bad gene groups that occur in some cancer tissues but never occur in normal tissues
– Disrupt these → cure colon cancer?
• → Possible targets for drug design?
• Good for classification (later)!
EP Types of Particular Interest (2)
• Emerging trends in timestamped DBs
– E.g. enrollment of US students in major Canadian univ’s increased by 86% during 99-02, to 5000
– This was news in US papers (Oct 02)
– Perhaps an opportunity for Canadian universities
• Note: Dominating trends are not opportunities (either you have won or you are out)
Related work
• Classification/discriminant rules
– We’re not limited to classification/high level rules
• Association rules
– We are more tightly coupled with objectives of application (divide data into “good” and “bad”)
• Changes in models of datasets
– Only compare fitted decision trees
• Other work usually assumes frequency threshold; we may not
EP Mining Algorithms
• Border-based approach (KDD 99)
– Produces border descriptions of desired collections of EPs (structured & concise)
– Manipulates borders to get answer
• Constraint-based approach (KDD 00)
– Look ahead, bound, prune
• Tree-based approach (Bailey et al, 01)
– Organize data in a tree manner to encourage sharing/reducing work
• Still room for improvement
High dimens
Borders describe large collections
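• <{12,13}, {12345,12456}>, with left bound L (min) = {12, 13} and right bound R (max) = {12345, 12456}, describes:
– Size 2: 12, 13
– Size 3: 123, 124, 125, 126, 134, 135
– Size 4: 1234, 1235, 1245, 1246, 1256, 1345
– Size 5: 12345, 12456
• Notation: 1345 stands for {1,3,4,5}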
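To make the border notation concrete, the following sketch enumerates every set that lies between some left-bound set and some right-bound set. The function names `expand_border` and `fs` are illustrative only, and the explicit expansion is used here purely to show what a border stands for.

```python
from itertools import combinations


def expand_border(left, right):
    """Return all sets X with lo <= X <= hi for some lo in left and hi in right."""
    found = set()
    for hi in right:
        for lo in left:
            if not lo <= hi:
                continue
            extra = sorted(hi - lo)
            # Add every combination of the remaining items to the lower set.
            for k in range(len(extra) + 1):
                for add in combinations(extra, k):
                    found.add(lo | frozenset(add))
    return found


def fs(digits):
    """Shorthand: fs("124") -> frozenset({1, 2, 4})."""
    return frozenset(int(d) for d in digits)


L = [fs("12"), fs("13")]
R = [fs("12345"), fs("12456")]
members = expand_border(L, R)
print(len(members))                                  # 16
print(fs("1345") in members, fs("1346") in members)  # True False
```

Two small bounds stand for sixteen itemsets here; the saving grows quickly as the right-bound sets get longer.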
Border-Diff: Effect
• <{{}},{1234}> - <{{}},{34,24,23}> = <{1,234},{1234}>
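Subsets of 1234: {}, 1, 2, 3, 4, 12, 13, 14, 23, 24, 34, 123, 124, 134, 234, 1234
Removed (subsets of 34, 24, or 23): {}, 2, 3, 4, 23, 24, 34
Remaining: 1, 12, 13, 14, 123, 124, 134, 234, 1234
• Similar to: [1,100] - [1,50] = (50,100]
• Good for: Jumping EPs; EPs in rectangle regions, …
• Don’t expand collections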
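The effect of the border difference on this slide can be verified by brute force. Note that this sketch deliberately expands both collections, which is exactly what Border-Diff is designed to avoid; it is a check of the stated result, not an implementation of the algorithm. The helper names are illustrative.

```python
from itertools import combinations


def subsets(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return {frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)}


def border_of(collection):
    """Minimal and maximal sets of a collection (its left and right bounds)."""
    mins = {x for x in collection if not any(y < x for y in collection)}
    maxs = {x for x in collection if not any(y > x for y in collection)}
    return mins, maxs


universe = subsets({1, 2, 3, 4})                               # <{{}}, {1234}>
removed = subsets({3, 4}) | subsets({2, 4}) | subsets({2, 3})  # <{{}}, {34,24,23}>
remaining = universe - removed

mins, maxs = border_of(remaining)
print(sorted(sorted(x) for x in mins))  # [[1], [2, 3, 4]]
print(sorted(sorted(x) for x in maxs))  # [[1, 2, 3, 4]]
print(len(remaining))                   # 9
```

The minimal sets {1} and {2,3,4} and the maximal set {1,2,3,4} agree with the border <{1,234},{1234}> given above.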
EP-based Classification
• Classification by aggregating power of EPs
NormalEP             FreqInNormal   CancerEP      FreqInCancer
{25 33 37 41 43}     80%            {2 10}        70%
{25 33 37 41 63}     77.3%          {3 10}        67.5%
{29 33 35 37 41}     77.3%          {10 20}       67.5%
{6 43 67}            77.3%          {21 58}       65%
{6 47 77}            77.3%          {15 40 56}    62.5%
{6 57 69}            60%            {21 40 56}    62.5%
• T = {2 6 10 25 33 37 41 43 47 57 69}
– Normal score (T) = 0.8 + 0.6 = 1.4
– Cancer score (T) = 0.7
– Class(T) = Normal
– May also normalize scores …
We gave several proposals since 1999
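The score arithmetic on this slide can be reproduced as follows. This is only a sketch of the simple aggregation shown here (summing the frequencies of the EPs contained in the sample); the published classifiers may weight or normalize scores differently, and the function name `score` is an assumption.

```python
# (pattern, frequency in its home class), copied from the table above.
normal_eps = [
    ({25, 33, 37, 41, 43}, 0.80),
    ({25, 33, 37, 41, 63}, 0.773),
    ({29, 33, 35, 37, 41}, 0.773),
    ({6, 43, 67}, 0.773),
    ({6, 47, 77}, 0.773),
    ({6, 57, 69}, 0.60),
]
cancer_eps = [
    ({2, 10}, 0.70),
    ({3, 10}, 0.675),
    ({10, 20}, 0.675),
    ({21, 58}, 0.65),
    ({15, 40, 56}, 0.625),
    ({21, 40, 56}, 0.625),
]


def score(sample, eps):
    """Sum the frequencies of all EPs fully contained in the sample."""
    return sum(freq for pattern, freq in eps if pattern <= sample)


T = {2, 6, 10, 25, 33, 37, 41, 43, 47, 57, 69}
normal_score = score(T, normal_eps)
cancer_score = score(T, cancer_eps)
label = "Normal" if normal_score > cancer_score else "Cancer"
print(round(normal_score, 3), round(cancer_score, 3), label)
```

Only {25 33 37 41 43} and {6 57 69} among the normal EPs, and {2 10} among the cancer EPs, are contained in T, which gives the scores 1.4 and 0.7.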
EP-based Classification
• Very high accuracy: Outperforms best of five other classifiers in 2/3 of 30 UCI datasets
• Outperforms SVM on gene expression data
• Variants
– Using different subsets of selected EPs
– Perhaps instance-driven for EP discovery and score computation
Why EP-based classifiers are good
• Use discriminating power of low support EPs, together with high support ones
• Use multi-feature conditions, not just single-feature conditions
• Select from larger pools of discriminative conditions
– Compare: The search space of patterns for decision trees is limited by early choices.
• Combine power of a diversified committee of “experts” (EPs)
• Decision is highly understandable
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Decision support in data cubes
• Used for learning from consolidated historical data:
– anomalies
– unusual factor combinations
• Focus on modeling & analysis of data for decision makers, not daily operations.
• Data organized around major subjects or factors, such as customer, product, time, sales.
• Contain huge number of summaries at different levels of details
• OLAP operators provided for data analysis
Wal-Mart success story
Initial idea: Codd et al 93
Data Cubes -- Base Cells
• Sales volume (measure) as a function of product, time, and location (dimensions)
[Figure: cube of base cells with axes Product, Location, Time]
Hierarchical summarization paths
Product: Industry, Category, Product
Location: Region, Country, City, Office
Time: Year, Quarter, Month / Week, Day
Base cells
Data Cubes: Derived Cells
[Figure: data cube with axes Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Location (U.S.A, Canada, Mexico, sum)]
Sum, count, avg, max, min, …
Derived cells, offering different levels of details, e.g. (TV,*,Mexico)
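A derived cell such as (TV,*,Mexico) is an aggregate over all base cells that agree with it on the non-`*` dimensions. The sketch below computes every derived cell for a tiny cube with `sum` as the measure; the sales figures are hypothetical and `build_cube` is an illustrative name.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base cells: (product, quarter, country) -> sales.
base = {
    ("TV", "1Qtr", "Mexico"): 10,
    ("TV", "2Qtr", "Mexico"): 15,
    ("TV", "1Qtr", "Canada"): 7,
    ("PC", "1Qtr", "Mexico"): 20,
}


def build_cube(base_cells):
    """Sum the measure into every cell obtained by replacing dimensions with '*'."""
    cube = defaultdict(int)
    for key, value in base_cells.items():
        # Each dimension is either kept or generalized to '*'.
        for mask in product([False, True], repeat=len(key)):
            cell = tuple("*" if m else k for k, m in zip(key, mask))
            cube[cell] += value
    return dict(cube)


cube = build_cube(base)
print(cube[("TV", "*", "Mexico")])  # 25
print(cube[("*", "*", "*")])        # 52
```

This full materialization is exponential in the number of dimensions, which is why the later slides focus on constraints and pruning.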
Gradient problem
• Find pairs of similar cells (conditions) having big changes in measure values
– Q: Find pairs of similar conditions having big changes in total sale price
– A: Sales of trucks in West went down 20% from 99 to 00; sales of (SUVs, East, June01) are 10% higher than (SUVs, West, June01); …
• Similar cells: ancestor/descendant pairs, sibling pairs
• Considered by Imielinski et al as the Cubegrade Problem
• No constraints → costly (see next slide)
Huge Space of Cuboids and Cells
[Figure: lattice of cuboids over dimensions A, B, C, from coarse to fine; *: ALL]
***
A** *B* **C
AB* A*C *BC
ABC
• Each node is a cuboid.
• Each cuboid represents a set of cells.
• Cuboids (and cells) form lattices
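The lattice on this slide can be generated mechanically: each cuboid keeps some subset of the dimensions and generalizes the rest to `*`, so n dimensions without hierarchies give 2^n cuboids. A small sketch, with the function name `cuboids` chosen for illustration:

```python
from itertools import combinations

dims = ["A", "B", "C"]


def cuboids(dimensions):
    """List cuboids level by level, from the apex (all '*') to the base cuboid."""
    levels = []
    for k in range(len(dimensions) + 1):
        level = []
        for kept in combinations(range(len(dimensions)), k):
            level.append("".join(d if i in kept else "*"
                                 for i, d in enumerate(dimensions)))
        levels.append(level)
    return levels


for level in cuboids(dims):
    print(" ".join(level))
```

For three dimensions this prints the four levels of the lattice, eight cuboids in total.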
Constrained Gradient Mining
• Csig: (cnt ≥ 100)
• Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)
• Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ 1.3)
      Dimensions                      Measures
cid   Yr   City   Cst_grp   Prd_grp   Cnt     Avg_price
c1    00   Van    Busi      PC        300     2100
c2    *    Van    Busi      PC        2800    1800
c3    *    Tor    Busi      PC        7900    2350
c4    *    *      busi      PC        58600   2250
c2 and c3 are siblings; c4 is an ancestor of c1, c2, c3
(c3, c2) satisfies Cgrad at threshold 1.3: 2350/1800 ≈ 1.31, whereas for (c4, c2) the ratio is 2250/1800 = 1.25
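The constraints can be evaluated directly on the four cells of the table. This sketch assumes that the comparison operator in both Csig and Cgrad is ≥ (the operator is not legible in the extracted text) and takes c2 as the probe cell, since c2 matches Cprb. All names are illustrative.

```python
# Cells from the table above: dimensions, Cnt, Avg_price.
cells = {
    "c1": (("00", "Van", "busi", "PC"), 300, 2100),
    "c2": (("*", "Van", "busi", "PC"), 2800, 1800),
    "c3": (("*", "Tor", "busi", "PC"), 7900, 2350),
    "c4": (("*", "*", "busi", "PC"), 58600, 2250),
}

MIN_CNT = 100    # Csig threshold
MIN_RATIO = 1.3  # Cgrad threshold; ">=" is an assumption


def significant(cid):
    """Csig: the cell's count reaches the significance threshold."""
    return cells[cid][1] >= MIN_CNT


def gradient_ratio(cg, cp):
    """avg_price(cg) / avg_price(cp)."""
    return cells[cg][2] / cells[cp][2]


probe = "c2"
results = {}
for cg in ("c1", "c3", "c4"):
    ratio = gradient_ratio(cg, probe)
    results[cg] = significant(cg) and ratio >= MIN_RATIO
    print(cg, round(ratio, 3), results[cg])
```

The ratios against c2 are about 1.167 for c1, 1.306 for c3, and 1.25 for c4.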
LiveSet-Driven Algorithm -- Main Idea --
• Compute iceberg of probe cells P using Csig & Cprb
• Use P and Cgrad to find gradients
– Traverse gradient cells in coarse-to-fine manner, using iceberg H-cubing (SIGMOD 01)
– Deal with all potential probe cells in one traversal (as live set of probe cells)
– Dynamically prune live set during traversal
LiveSet
• LiveSet(c): set of probe cells cp that may form a gradient-probe pair w/ some descendant of current cell c
– View current cell as a “set of potential gradient cells”
      Dimensions                    Measures
cid   Yr   City   Cstgrp   prdgrp   Cnt    avgprice
p1    00   Van    Edu      PC       100    1500
p2    99   Tor    *        PC       4000   1800
p3    *    Mon    Busi     PC       1500   8000
p4    *    Edm    *        Ski      2000   10000
p5    *    Whi    *        Ski      1000   10050
Current cell c1 = (*,*,Edu,*), cnt = 800
LiveSet(c1) = {p2, p4}
Csig: cnt ≥ 100
Cgrad(cg, cp): (cnt(cg)/cnt(cp) ≥ 2)
p1, …, p5: global probe cells
2-Way Pruning of Gradient Cells and Probe Cells Using LiveSet
• Prune current grad cell c if LiveSet(c) = {}
• Prune probe cells cp if cp can be ignored in searching c’s descendants
– Use min-max boundary check: if the constraint is cnt(cg)/cnt(cp) >= 2 and the cnt values in the liveset are 10, 18, 32, … (so min(cnt) = 10), then 19/10 < 2, so gradient cells w/ cnt <= 19 can be pruned
• Handle non anti-monotone constraints, using weaker constraint for pruning (SIGMOD 01)
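The min-max boundary check above reduces to a single comparison against the smallest probe count in the live set. A minimal sketch, with a hypothetical function name and hypothetical candidate counts:

```python
def prune_by_min_max(candidate_counts, live_set_counts, min_ratio=2):
    """Keep gradient-cell counts that could still satisfy
    cnt(cg) / cnt(cp) >= min_ratio for at least one live probe cell."""
    smallest_probe = min(live_set_counts)
    # Even the smallest probe count demands cnt(cg) >= min_ratio * smallest_probe.
    bound = min_ratio * smallest_probe
    return [c for c in candidate_counts if c >= bound]


live = [10, 18, 32]
candidates = [12, 19, 20, 45]  # hypothetical gradient-cell counts
print(prune_by_min_max(candidates, live))  # [20, 45]
```

Cells with counts 12 and 19 cannot reach ratio 2 even against the smallest probe count of 10, so they are dropped without examining the other probe cells.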
Pruning Probe Cells by Dimension Matching Analysis
• Derive LiveSet of child c2 from LiveSet of parent c1
– Since LiveSet(c2) ⊆ LiveSet(c1)
• Discard probe cells in LiveSet(c2) that are unmatchable with c2
cid   Yr   City   Cst_grp   Prd_grp   Cnt    Avg_price   # of mismatches (with c3)
P1    00   Van    Edu       PC        100    1500        1
P2    99   Tor    *         PC        4000   1800        1
P3    *    Mon    Busi      PC        1500   8000        1, 1*
LiveSet(c1) = {p1,p2,p3}, c1 = (00, Tor, *, *)
LiveSet(c2) = {p1,p2}, c2 = (00, Tor, *, PC)
An efficient H-cubing method using H-tree
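[Figure: H-tree. Below the root, successive levels hold Edu., Hhd., Bus.; then Jan., Mar., Jan., Feb.; then Tor., Van., Tor., Mon. Leaf nodes carry Aux-Info (A.I.), e.g. Sum: 1765, Cnt: 2, and point to bins. A header table lists each attribute value (Edu, Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with its sum and cnt, e.g. Edu Sum: 2285, and a side-link.]
H-tree: efficient way to organize data, & to promote sharing/reuse of computation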
H-cubing: Computing Cells Involving Dimension City
[Figure: the same H-tree and header table, plus a second header table H_Tor listing attribute values (Edu, Hhd, Bus, …, Jan, Feb, …) with sum, cnt, and side-link]
From (*, *, Tor) to (*, Jan, Tor)
Scalability on Number of Probe Cells
Scalability on Gradient Threshold
Scalability on Significance Threshold
Scalability on Number of Tuples
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Multi-Dimensional Trends Analysis of Sets of Time-Series -- Overview
• Consider applications having many time series
– Stocks, power grids, sensor nets, internet, gene expressions for toxicology, …
• Needs for MDML trends analysis
– Mining/monitoring unusual patterns/events, in MDML manner
• Regression cube for time series
– Store regression base cube
– Support MDML OLAP of regressions
• Results also useful for MDML data stream monitoring
Why MDML trends analysis
• Many time series
– E.g. prices of 10000s of stocks; one time series per stock
• Objectives
– Understand behavior of stocks/stock groups
– Find patterns of stock groups
– Monitor unusual events
– Find “groups of stocks” (variables) with interesting patterns (MDML search)
Regression based trends analysis
A time series: $(t_i, z_i)$, $i = 1..n$
Linear regression model is a linear fitting curve $z = a_0 + a_1 t$ with least square error
Can generalize regression to $z = a_0 + a_1 f_1(t) + a_2 f_2(t) + \dots + a_k f_k(t)$
Each $f_j$ is a fixed function of $t$
Common tool for trends analysis
But limited to situations where “variables” (groups of time series) are known
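For the simple linear case, the least-squares parameters have a closed form. The sketch below uses the standard textbook formulas; the sample series is hypothetical and `fit_line` is an illustrative name.

```python
def fit_line(points):
    """Least-squares fit of z = a0 + a1 * t to (t, z) pairs; returns (a0, a1)."""
    n = len(points)
    sum_t = sum(t for t, _ in points)
    sum_z = sum(z for _, z in points)
    sum_tt = sum(t * t for t, _ in points)
    sum_tz = sum(t * z for t, z in points)
    a1 = (n * sum_tz - sum_t * sum_z) / (n * sum_tt - sum_t ** 2)
    a0 = (sum_z - a1 * sum_t) / n
    return a0, a1


series = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]  # hypothetical, exactly z = 1 + 2t
print(fit_line(series))  # (1.0, 2.0)
```

The slope a1 is the trend indicator: its sign and size summarize whether the series is rising or falling and how fast.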
Regression cube for time series
• There is one initial time series per base cell
• Too costly to fully store all time series
• Regression base cube
– Only store regression parameters of base cells (4 values vs 10000s)
– Can we support MDML OLAP of regressions, using only the regression base cube, in lossless manner?
• Answer is yes, for both “roll up” on standard dimensions and on time dimension
Aggregation in Standard Dimensions
Two component cells
Aggregated cell
We can derive regression of aggregated cell from regression parameters of component cells
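One way to see why this is possible: least-squares fitting is linear in the observed values. Assuming the two component cells are observed at the same time points and the aggregated cell holds their pointwise sum, the aggregated parameters are simply the sums of the component parameters. The sketch below checks this numerically on hypothetical data; it illustrates the property and is not taken from the cited paper.

```python
def fit_line(points):
    """Least-squares fit of z = a0 + a1 * t to (t, z) pairs; returns (a0, a1)."""
    n = len(points)
    sum_t = sum(t for t, _ in points)
    sum_z = sum(z for _, z in points)
    sum_tt = sum(t * t for t, _ in points)
    sum_tz = sum(t * z for t, z in points)
    a1 = (n * sum_tz - sum_t * sum_z) / (n * sum_tt - sum_t ** 2)
    a0 = (sum_z - a1 * sum_t) / n
    return a0, a1


t = [1, 2, 3, 4, 5]
z1 = [2.0, 2.5, 3.9, 4.1, 5.2]  # hypothetical component cell 1
z2 = [9.0, 8.1, 7.4, 6.0, 5.5]  # hypothetical component cell 2

a0_1, a1_1 = fit_line(list(zip(t, z1)))
a0_2, a1_2 = fit_line(list(zip(t, z2)))

# Aggregated cell: pointwise sum of the two series, fitted from raw data.
agg = [x + y for x, y in zip(z1, z2)]
a0_direct, a1_direct = fit_line(list(zip(t, agg)))

# Derived from the component parameters only, without touching raw data.
a0_derived, a1_derived = a0_1 + a0_2, a1_1 + a1_2
print(abs(a0_direct - a0_derived) < 1e-9, abs(a1_direct - a1_derived) < 1e-9)
```

Both comparisons hold up to floating-point rounding, so the raw series are not needed for this roll-up.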
Aggregation in Time Dimension
Cells of 2 adjacent time intervals:
Aggregated cell
We can derive regression of aggregated cell from regression parameters of component cells
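A sketch of why the time roll-up can also be lossless. It assumes each cell stores four values, the interval endpoints and the two fitted parameters (tb, te, a0, a1), with one observation at every integer tick; this layout is an assumption for illustration. Because the least-squares normal equations hold exactly at the fitted parameters, the sums needed for a new fit can be recovered from the stored values and then added across intervals. The formulas in the cited paper may be organized differently.

```python
def fit_line(points):
    """Least-squares fit of z = a0 + a1 * t to (t, z) pairs; returns (a0, a1)."""
    n = len(points)
    sum_t = sum(t for t, _ in points)
    sum_z = sum(z for _, z in points)
    sum_tt = sum(t * t for t, _ in points)
    sum_tz = sum(t * z for t, z in points)
    a1 = (n * sum_tz - sum_t * sum_z) / (n * sum_tt - sum_t ** 2)
    a0 = (sum_z - a1 * sum_t) / n
    return a0, a1


def sums_from_params(tb, te, a0, a1):
    """Recover (n, sum t, sum t^2, sum z, sum t*z) from a stored fit,
    assuming one observation at each integer tick tb..te."""
    ticks = range(tb, te + 1)
    n = len(ticks)
    st = sum(ticks)
    stt = sum(t * t for t in ticks)
    # Normal equations: sum z = n*a0 + a1*sum t; sum t*z = a0*sum t + a1*sum t^2.
    sz = n * a0 + a1 * st
    stz = a0 * st + a1 * stt
    return n, st, stt, sz, stz


def merge_adjacent(cell_a, cell_b):
    """Fit for the union of two adjacent intervals, from stored parameters only."""
    n, st, stt, sz, stz = (x + y for x, y in
                           zip(sums_from_params(*cell_a), sums_from_params(*cell_b)))
    a1 = (n * stz - st * sz) / (n * stt - st ** 2)
    a0 = (sz - a1 * st) / n
    return a0, a1


first = [(1, 2.0), (2, 2.7), (3, 4.1), (4, 4.4)]  # hypothetical raw data
second = [(5, 6.3), (6, 6.1), (7, 8.0)]
cell_a = (1, 4) + fit_line(first)
cell_b = (5, 7) + fit_line(second)

derived = merge_adjacent(cell_a, cell_b)
direct = fit_line(first + second)
print(all(abs(d - e) < 1e-9 for d, e in zip(derived, direct)))
```

The merged parameters agree with a direct fit over all seven points, although `merge_adjacent` never sees the raw observations.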
Remarks on Regression Cube
Efficient storage; scalable (independent of number of tuples in data cells)
Lossless aggregation without accessing raw data
Fast and efficient aggregation
Regression models of data cells at all levels
Results cover a large and popular class of regression (linear, polynomial, and other models)
Concluding remarks
• Mining knowledge about changes, differences, & trends (CDT) is useful & exciting
• Traditional approaches focus on high level view
• We considered CDT mining in transactions, relations, & data cubes
• We used discovered CDT patterns for classification, niche mining, & bioinformatics & medical studies
• Future work: mining useful CDT knowledge for bioinformatics, bio-medicine, business, …
References: Changes, Differences, & Trends
• S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 2001.
• Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in relational databases. In Knowledge Discovery in Databases, AAAI/MITPress, 1991.
• G. Dong and K. Deshpande. Efficient mining of niches and set routines. In Pacific-Asia Conf. On Knowledge Discovery & Data Mining, 2001.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proc. of the 5th ACM SIGKDD Int'l Conf. On Knowledge Discovery and Data Mining, 1999.
• G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by aggregating emerging patterns. In Proc. 2nd Int'l Conf. on Discovery Science, Tokyo, 1999.
• V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Y. Loh. A framework for measuring changes in data characteristics. In PODS, 1999.
• J. Li, G. Dong, and K. Ramamohanarao. Instance-based classification by emerging patterns. In European Conf. of Principles and Practice of Knowledge Discovery in Databases, Lyon, France, 2000.
References: Changes, Difference and Trends (Cont’d)
• J. Li, G. Dong, K. Ramamohanarao. Making use of the most expressive jumping emerging patterns for classification. In Proc Pacific Asia Conf. on Knowledge Discovery & Data Mining, 2000.
• J. Li, K. Ramamohanarao, G. Dong. Combining the strength of pattern frequency and distance for classification. In Pacific-Asia KDD, 2001.
• J. Li, L. Wong. Identifying good diagnostic genes or genes groups from gene expression data by using the concept of emerging patterns. Bioinformatics. 18:725--734, 2002.
• Bing Liu, Wynne Hsu, Heng-Siew Han, and Yiyuan Xia. Mining changes for real-life applications. In DaWaK, 2000.
• Bing Liu, Wynne Hsu, and Yiming Ma. Discovering the set of fundamental rule changes. In KDD, 2001.
• Eng-Juh Yeoh, …, Jinyan Li, …, Limsoon Wong, James R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133-143, March 2002.
• X. Zhang, G. Dong, K. Ramamohanarao. Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets. In KDD, 2000.
References: Changes and Trends (Data Cubes)
• S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.
• S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
• Y. Chen, G. Dong, J. Han, B. W. Wah, J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB 2002.
• E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (on-line analytical processing) to user-analysts: an IT mandate. Tech Report, Codd Associates, 1993.
• G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB 2001.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
References: Changes and Trends (Data Cubes) (Cont’d)
• J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
• J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. SIGMOD'01.
• V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.
• T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Tech Report, Computer Science, Rutgers Univ, Aug. 2000.
• L. V.S. Lakshmanan, J. Pei, J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
• K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.
• S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
• Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
Extra Slides
• Just in case …
Base Cells: Tuples of a Relation
Product Location Time Sale
Printer Manhattan Jan 1999 100K
Laptop Queens Jan 1999 800K
… … … …
Data Cubes: OLAP OPs
[Figure: data cube with axes Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Location (U.S.A, Canada, Mexico, sum)]
Rollup, drilldown, slice/dice, pivot
Experimental Results
• Constraints:
– Csig is on cnt
– Cprb selects set of cells
– Cgrad(cg, cp): (avg_price(cg)/avg_price(cp)s)
• Data set
– 10 dimensions
– 10k-20k tuples
– Cardinality 10 for each dimension
– Measure range: 100-1000
• All-Pairs: One independent search per probe cell