mining knowledge about changes, differences, and trends guozhu dong wright state university dayton,...

Mining Knowledge about Changes, Differences, and Trends

Guozhu Dong Wright State University

Dayton, Ohio

2

Outline

• Introduction

– Knowledge discovery from databases (KDD)

– Knowledge about changes, differences, & trends

• Contributions

– Changes between datasets (KDD 99 & more)

– Changes in data cubes (VLDB 01 & SIGMOD 01)

– Trends in data cubes (VLDB 02)

• Concluding remarks

3

Introduction -- KDD (1)

• Mountains of data, everywhere!

– Use them for better service, better cures, …

• Aims of KDD

– Mine valid, novel, potentially useful patterns

– Classifiers, clustering, associations, insights, …

• History

– Traditional scientific discovery = manual mining

– Ancestry of KDD: statistics, machine learning, pattern recognition, databases, …

– Field started in the 1990s

• Data forms

– Market basket data (transactions)

– Relational data

– Data cubes (relational + concept hierarchies)

4

Introduction – KDD (2)

• Main tasks for KDD

– Identifying “useful pattern types”

– Giving algorithms for mining them

– Finding ways for using them

• Our contributions are along these lines

5

Example knowledge patterns about changes, differences, & trends (CDT)

• Compare dataset A against dataset B, looking for patterns capturing CDT

– Cancer tissues vs normal tissues

– Loyal customers vs disloyal customers

– Data_1999 vs Data_2000

• Compare cells in a data cube, looking for similar cells with big measure differences

– “Gradients”

• Analyze trends in an MDML (multidimensional multi-level) manner on a set of time series in a data cube

Gene groups

Drug design

Emerging trends

6

Traditional approaches to “mining” CDT

• Compare histograms or pie charts of datasets

• Study time series, one or two at a time

• Summaries

• Limitations:

– Only offer high level view, on very few “factors/variables”

– But miss knowledge on many factor groups, many insights

[Figure: bar chart of quarterly values (1st Qtr to 4th Qtr, axis from 0 to 90) for East, West, and North]

Gain a little

Miss a lot

7

Outline

• Introduction

– Knowledge discovery from databases

– Changes, differences, and trends

• Contributions

– Changes between datasets (KDD 99 etc.)




8

Emerging Patterns between Two Datasets

Normal Tissues

g1  g2  g3  g4
L   H   L   H
L   H   L   L
H   L   L   H
L   H   H   L

Cancer Tissues

g1  g2  g3  g4
H   H   L   H
L   H   H   H
L   L   L   H
H   H   H   L

EP: patterns with a high frequency ratio between datasets

E.g., {g1=L, g2=H, g3=L}: frequency 50% in normal tissues and 0% in cancer tissues, so the frequency ratio is infinite
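The frequency ratio of the example pattern can be checked directly on the two small tables. The following is a minimal sketch; the function names and the row encoding are illustrative assumptions, not part of the original work.

```python
GENES = ("g1", "g2", "g3", "g4")


def make_rows(lines):
    """Turn lines such as "L H L H" into dicts keyed by gene name."""
    return [dict(zip(GENES, line.split())) for line in lines]


def frequency(pattern, rows):
    """Fraction of rows that match every (gene, level) pair of the pattern."""
    hits = sum(1 for row in rows
               if all(row[g] == level for g, level in pattern.items()))
    return hits / len(rows)


def frequency_ratio(pattern, target, background):
    """Frequency in target divided by frequency in background."""
    f_target = frequency(pattern, target)
    f_background = frequency(pattern, background)
    if f_background == 0:
        # Present in target, absent in background: a "jumping" pattern.
        return float("inf") if f_target > 0 else 0.0
    return f_target / f_background


# Binned data copied from the slide.
normal = make_rows(["L H L H", "L H L L", "H L L H", "L H H L"])
cancer = make_rows(["H H L H", "L H H H", "L L L H", "H H H L"])

pattern = {"g1": "L", "g2": "H", "g3": "L"}
print(frequency(pattern, normal))                # 0.5
print(frequency(pattern, cancer))                # 0.0
print(frequency_ratio(pattern, normal, cancer))  # inf
```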

9

Colon tumor gene expression

• 40 tumor, 22 normal colon tissue samples

• 6500 genes/sample (Affymetrix Hum6000 micro-array gene chip)

g1 g2 g3 g4

20 90 25 80

24 95 23 28

80 20 25 85

25 89 85 25

Original GE data

Last page: binned data

100s of samples

1000s of dimensions

10

Top minimal EPs w/ infinite freq ratio

NormalEP                     FreqInNormal   CancerEP     FreqInCancer
{25 33 37 41 43 57 59 69}    77.3%          {2 10}       70%
{25 33 37 41 43 47 57 69}    77.3%          {3 10}       67.5%
{29 33 35 37 41 43 57 69}    77.3%          {10 20}      67.5%
{29 33 37 41 43 47 57 69}    77.3%          {10 21}      67.5%
…                                           …
{6 43 57}                    77.3%          {21 58}      65%
{6 47 57}                    77.3%          {15 40 56}   62.5%
{6 57 69}                    77.3%          {21 40 56}   62.5%

Papers using EP techniques appeared in Cancer Cell (cover, 3/02) and in Bioinformatics

Minimal EPs with infinite ratio (jumping EPs): all their proper subsets occur in both classes of tissues

11

EP Types of Particular Interest (1)

• Minimal jumping EPs for normal tissues

Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues

Restore these → cure colon cancer?

• Minimal jumping EPs for cancer tissues

Bad gene groups that occur in some cancer tissues but never occur in normal tissues

Disrupt these → cure colon cancer?

• → Possible targets for drug design?

• Good for classification (later)!

12

EP Types of Particular Interest (2)

• Emerging trends in timestamped DBs

– E.g., enrollment of US students in major Canadian universities increased by 86% during 99-02, to 5000

– This was news in US papers (Oct 02)

– Perhaps an opportunity for Canadian universities

• Note: dominating trends are not opportunities (either you have won or you are out)

13

Related work

• Classification/discriminant rules

– We’re not limited to classification/high level rules

• Association rules

– We are more tightly coupled with objectives of application (divide data into “good” and “bad”)

• Changes in models of datasets

– Only compare fitted decision trees

• Other work usually assumes a frequency threshold; we may not

14

EP Mining Algorithms

• Border-based approach (KDD 99)

– Produces border descriptions of desired collections of EPs (structured & concise)

– Manipulates borders to get answer

• Constraint-based approach (KDD 00)

– Look ahead, bound, prune

• Tree-based approach (Bailey et al, 01)

– Organize data in a tree manner to encourage sharing/reducing work

• Still room for improvement

High dimens

15

Borders describe large collections

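• <{12,13}, {12345,12456}>: left bound L (min), right bound R (max)

Collection described (every set between a member of L and a member of R; 1345 abbreviates {1,3,4,5}):

size 2: 12, 13
size 3: 123, 124, 125, 126, 134, 135
size 4: 1234, 1235, 1245, 1246, 1256, 1345
size 5: 12345, 12456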
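To make the border notation concrete, the sketch below expands a border into the collection it describes: every set that contains some member of L and is contained in some member of R. The function names are assumptions made for illustration. Expanding is exactly what the border-based approach avoids on real data; it is done here only because the example is tiny.

```python
from itertools import combinations


def fs(digits):
    """Shorthand: fs("1345") is the itemset {1, 3, 4, 5}."""
    return frozenset(int(d) for d in digits)


def border_collection(left, right):
    """All sets X with l <= X <= r for some l in left and some r in right."""
    found = set()
    for r in right:
        for l in left:
            if not l <= r:
                continue  # no set can lie between l and r
            free = sorted(r - l)
            for k in range(len(free) + 1):
                for extra in combinations(free, k):
                    found.add(l | frozenset(extra))
    return found


left = [fs("12"), fs("13")]
right = [fs("12345"), fs("12456")]
collection = border_collection(left, right)

print(len(collection))  # 16
for size in range(2, 6):
    names = sorted("".join(map(str, sorted(s)))
                   for s in collection if len(s) == size)
    print(size, names)
```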

16

Border-Diff: Effect

• <{{}},{1234}> - <{{}},{34,24,23}> = <{1,234},{1234}>

Subsets of 1234: {}, 1, 2, 3, 4, 12, 13, 14, 23, 24, 34, 123, 124, 134, 234, 1234

Removed (covered by 34, 24, or 23): {}, 2, 3, 4, 23, 24, 34

Remaining: 1, 12, 13, 14, 123, 124, 134, 234, 1234

• Similar to: [1,100] - [1,50] = (50,100]

• Good for: jumping EPs; EPs in rectangle regions, …

Don’t expand collections
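The difference above can be computed on the borders alone, without listing the subsets. The sketch below assumes the special case shown on the slide, where both borders have {} as their left bound. It grows the new left bound one covered set at a time and keeps only minimal sets. The function names are illustrative, and this is a sketch of the idea rather than the published procedure.

```python
def fs(digits):
    """Shorthand: fs("234") is the itemset {2, 3, 4}."""
    return frozenset(int(d) for d in digits)


def minimal_sets(sets):
    """Keep only the sets that have no proper subset in the collection."""
    unique = set(sets)
    return {s for s in unique if not any(other < s for other in unique)}


def border_diff(universe, covered):
    """Left bound of <{{}}, {universe}> - <{{}}, covered>.

    The result holds the minimal subsets of `universe` that are not
    contained in any set of `covered`. The right bound stays {universe}.
    """
    left = {frozenset()}
    for s in covered:
        gap = universe - s  # an escaping set needs at least one of these items
        left = minimal_sets(x | {item} for x in left for item in gap)
    return left


result = border_diff(fs("1234"), [fs("34"), fs("24"), fs("23")])
print(sorted("".join(map(str, sorted(s))) for s in result))  # ['1', '234']
```

Each pass only touches the current left bound, so the full collection is never materialized, in the spirit of "don't expand collections".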

17

EP-based Classification

• Classification by aggregating power of EPs

NormalEP            FreqInNormal   CancerEP     FreqInCancer
{25 33 37 41 43}    80%            {2 10}       70%
{25 33 37 41 63}    77.3%          {3 10}       67.5%
{29 33 35 37 41}    77.3%          {10 20}      67.5%
{6 43 67}           77.3%          {21 58}      65%
{6 47 77}           77.3%          {15 40 56}   62.5%
{6 57 69}           60%            {21 40 56}   62.5%

• T = {2 6 10 25 33 37 41 43 47 57 69}

– Normal score (T) = 0.8 + 0.6 = 1.4

– Cancer score (T) = 0.7

– Class(T) = Normal

– May also normalize scores …

We gave several proposals since 1999
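The scores for T can be reproduced with a few lines. This is a minimal sketch of the unnormalized scoring shown on the slide: sum the frequencies of the EPs of each class that are contained in T. The function names and the tie-breaking are assumptions made for illustration.

```python
# EPs and their frequencies, copied from the table on the slide.
normal_eps = [
    ({25, 33, 37, 41, 43}, 0.80),
    ({25, 33, 37, 41, 63}, 0.773),
    ({29, 33, 35, 37, 41}, 0.773),
    ({6, 43, 67}, 0.773),
    ({6, 47, 77}, 0.773),
    ({6, 57, 69}, 0.60),
]
cancer_eps = [
    ({2, 10}, 0.70),
    ({3, 10}, 0.675),
    ({10, 20}, 0.675),
    ({21, 58}, 0.65),
    ({15, 40, 56}, 0.625),
    ({21, 40, 56}, 0.625),
]


def score(instance, eps):
    """Sum the frequencies of all EPs that are subsets of the instance."""
    return sum(freq for pattern, freq in eps if pattern <= instance)


def classify(instance):
    """Pick the class with the larger score (ties go to Cancer here)."""
    normal = score(instance, normal_eps)
    cancer = score(instance, cancer_eps)
    return "Normal" if normal > cancer else "Cancer"


T = {2, 6, 10, 25, 33, 37, 41, 43, 47, 57, 69}
print(round(score(T, normal_eps), 3), round(score(T, cancer_eps), 3), classify(T))
```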

18

EP-based Classification

• Very high accuracy: outperforms the best of five other classifiers in 2/3 of 30 UCI datasets

• Outperforms SVM on gene expression data

• Variants

– Using different subsets of selected EPs

– Perhaps instance-driven for EP discovery and score computation

19

Why EP-based classifiers are good

• Use discriminating power of low support EPs, together with high support ones

• Use multi-feature conditions, not just single-feature conditions

• Select from larger pools of discriminative conditions

– Compare: the search space of patterns for decision trees is limited by early choices.

• Combine power of a diversified committee of “experts” (EPs)

• Decision is highly understandable

20

Outline

• Introduction







21

Decision support in data cubes

• Used for learning from consolidated historical data:

– anomalies

– unusual factor combinations

• Focus on modeling & analysis of data for decision makers, not daily operations.

• Data organized around major subjects or factors, such as customer, product, time, sales.

• Contain huge number of summaries at different levels of details

• OLAP operators provided for data analysis

Wal-Mart success story

Initial idea: Codd et al 93

22

Data Cubes -- Base Cells

• Sales volume (measure) as a function of product, time, and location (dimensions)

[Figure: data cube with axes Product, Location, and Time; base cells at the finest level]

Hierarchical summarization paths

Product     Location    Time
Industry    Region      Year
Category    Country     Quarter
Product     City        Month, Week
            Office      Day

Base cells

23

Data Cubes: Derived Cells

[Figure: data cube over Time (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Location (U.S.A, Canada, Mexico), with a sum margin along each dimension]

Sum, count, avg, max, min, …

Derived cells, offering different levels of details

(TV,*,Mexico)
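A derived cell such as (TV,*,Mexico) aggregates all base cells that agree with it on the non-* dimensions. The sketch below uses sum as the measure. The sales figures are made up for illustration, and `derived_cell` is a hypothetical helper name.

```python
# Hypothetical base cells keyed by (product, quarter, location).
# The sales figures are invented for this example.
base_cells = {
    ("TV", "1Qtr", "Mexico"): 12,
    ("TV", "2Qtr", "Mexico"): 15,
    ("TV", "1Qtr", "Canada"): 20,
    ("PC", "1Qtr", "Mexico"): 7,
    ("VCR", "3Qtr", "U.S.A"): 9,
}


def derived_cell(cells, pattern):
    """Sum the measure over base cells matching the pattern ("*" = ALL)."""
    return sum(value for key, value in cells.items()
               if all(p in ("*", k) for p, k in zip(pattern, key)))


print(derived_cell(base_cells, ("TV", "*", "Mexico")))  # 27
print(derived_cell(base_cells, ("*", "1Qtr", "*")))     # 39
print(derived_cell(base_cells, ("*", "*", "*")))        # 63
```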

24

Gradient problem

• Find pairs of similar cells (conditions) having big changes in measure values

– Q: Find pairs of similar conditions having big changes in total sale price

– A: Sales of trucks in West went down 20% from 99 to 00; sales of (SUVs, East, June01) are 10% higher than (SUVs, West, June01); …

• Similar cells: ancestor/descendant pairs, sibling pairs

• Considered by Imielinski et al as the Cubegrade problem

• No constraints → costly (see next slide)

25

Huge Space of Cuboids and Cells

***

**C   *B*   A**

AB*   A*C   *BC

ABC

Each node is a cuboid.

Each cuboid represents a set of cells.

Cuboids (and cells) form lattices

Coarse to fine

*: ALL

26

Constrained Gradient Mining

• Csig: (cnt ≥ 100)

• Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)

• Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ 1.3)

       Dimensions                      Measures
cid    Yr   City   Cst_grp   Prd_grp   Cnt      Avg_price
c1     00   Van    Busi      PC        300      2100
c2     *    Van    Busi      PC        2800     1800
c3     *    Tor    Busi      PC        7900     2350
c4     *    *      Busi      PC        58600    2250

c2, c3: siblings

c4: ancestor of c1, c2, c3

(c4, c2) satisfies Cgrad!
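The three constraints can be checked mechanically on the four cells. The sketch below assumes the comparison operators lost from the slide are "≥", and it treats two cells as similar when one generalizes the other or when they differ in exactly one non-* dimension. The helper names are illustrative. With the values as printed and a 1.3 threshold, the sibling pair (c3, c2) passes at about 1.31, while (c4, c2) comes to 1.25, so the threshold on the original slide may have been different.

```python
SIG_MIN_CNT = 100      # Csig, assuming cnt >= 100
GRAD_MIN_RATIO = 1.3   # Cgrad, assuming avg_price(cg)/avg_price(cp) >= 1.3

# name -> ((Yr, City, Cst_grp, Prd_grp), Cnt, Avg_price), from the table.
cells = {
    "c1": (("00", "Van", "Busi", "PC"), 300, 2100),
    "c2": (("*", "Van", "Busi", "PC"), 2800, 1800),
    "c3": (("*", "Tor", "Busi", "PC"), 7900, 2350),
    "c4": (("*", "*", "Busi", "PC"), 58600, 2250),
}


def is_probe(dims):
    """Cprb: city is Van and customer group is Busi; any product group."""
    _, city, cust_grp, _ = dims
    return city == "Van" and cust_grp == "Busi"


def related(a, b):
    """True for ancestor/descendant pairs and for sibling pairs."""
    diff = [(x, y) for x, y in zip(a, b) if x != y]
    if not diff:
        return False
    if all(x == "*" for x, _ in diff) or all(y == "*" for _, y in diff):
        return True          # one cell generalizes the other
    return len(diff) == 1    # siblings: one differing non-* dimension


def gradient_pairs(all_cells):
    """List (gradient cell, probe cell) pairs meeting all three constraints."""
    pairs = []
    for g, (g_dims, g_cnt, g_price) in all_cells.items():
        for p, (p_dims, p_cnt, p_price) in all_cells.items():
            if min(g_cnt, p_cnt) < SIG_MIN_CNT or not is_probe(p_dims):
                continue
            if related(g_dims, p_dims) and g_price / p_price >= GRAD_MIN_RATIO:
                pairs.append((g, p))
    return pairs


print(gradient_pairs(cells))
print(cells["c4"][2] / cells["c2"][2])
```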

27

LiveSet-Driven Algorithm -- Main Idea --

• Compute iceberg of probe cells P using Csig & Cprb

• Use P and Cgrad to find gradients

– Traverse gradient cells in coarse-to-fine manner, using iceberg H-cubing (SIGMOD 01)

– Deal with all potential probe cells in one traversal (as live set of probe cells)

– Dynamically prune live set during traversal

28

LiveSet

• LiveSet(c): set of probe cells cp that may form a gradient-probe pair with some descendant of current cell c

– View current cell as a “set of potential gradient cells”

       Dimensions                     Measures
cid    Yr   City   Cstgrp   prdgrp    Cnt     avgprice
P1     00   Van    Edu      PC        100     1500
P2     99   Tor    *        PC        4000    1800
P3     *    Mon    Busi     PC        1500    8000
P4     *    Edm    *        Ski       2000    10000
P5     *    Whi    *        Ski       1000    10050

P1, … P5: global probe cells

Current cell c1 = (*,*,Edu,*), cnt = 800

LiveSet(c1) = {p2, p4}

Csig: cnt ≥ 100

Cgrad(cg, cp): (cnt(cg)/cnt(cp) ≥ 2)

29

2-Way Pruning of Gradient Cells and Probe Cells Using LiveSet

• Prune current gradient cell c if LiveSet(c) = {}

• Prune probe cells cp if cp can be ignored in searching c’s descendants

– Use min-max boundary check: if the constraint is cnt(cg)/cnt(cp) ≥ 2 and the cnt values in the live set are 10, 18, 32, … (min(cnt) = 10), then 19/10 < 2, so gradient cells with cnt ≤ 19 can be pruned

• Handle non-anti-monotone constraints, using a weaker constraint for pruning (SIGMOD 01)
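The min-max boundary check can be sketched as follows. It relies on cnt never increasing from a cell to its descendants, so the count of the current cell is an upper bound for every gradient cell below it. The function names are assumptions, and `prune_live_set` is an illustrative reading of the probe-side pruning, not the published procedure.

```python
def can_prune_gradient_cell(cell_cnt, live_cnts, threshold=2.0):
    """True if no descendant can reach cnt(cg)/cnt(cp) >= threshold.

    The best case pairs the cell's own count with the smallest probe
    count in the live set. If even that ratio fails, the cell goes.
    """
    if not live_cnts:
        return True  # empty live set: nothing left to pair with
    return cell_cnt / min(live_cnts) < threshold


def prune_live_set(cell_cnt, live_cnts, threshold=2.0):
    """Keep the probe counts that a descendant could still satisfy."""
    return [cnt for cnt in live_cnts if cell_cnt / cnt >= threshold]


live = [10, 18, 32]
print(can_prune_gradient_cell(19, live))  # True: 19/10 < 2
print(can_prune_gradient_cell(20, live))  # False: 20/10 >= 2
print(prune_live_set(40, live))           # [10, 18]
```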

30

Pruning Probe Cells by Dimension Matching Analysis

• Derive LiveSet of child c2 from LiveSet of parent c1

– Since LiveSet(c2) ⊆ LiveSet(c1)

• Discard probe cells in LiveSet(c2) that are unmatchable with c2

       Dimensions                      Measures             # of mismatches
cid    Yr   City   Cst_grp   Prd_grp   Cnt     Avg_price    (with c3)
P1     00   Van    Edu       PC        100     1500         1
P2     99   Tor    *         PC        4000    1800         1
P3     *    Mon    Busi      PC        1500    8000         1, 1*

LiveSet(c1) = {p1,p2,p3}    c1=(00, Tor, *, *)

LiveSet(c2) = {p1,p2}       c2=(00, Tor, *, PC)

31

An efficient H-cubing method using H-tree

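[Figure: H-tree. Below the root, node labels by level are Edu., Hhd., Bus.; then Jan., Mar., Jan., Feb.; then Tor., Van., Tor., Mon. Leaf nodes carry Aux-Info (e.g., Sum: 1765, Cnt: 2) and bins. A header table lists each attribute value (Edu, Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with its sum and cnt (e.g., Edu Sum: 2285) and a side-link into the tree.]

H-tree: efficient way to organize data, & to promote sharing/reuse of computation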

32

H-cubing: Computing Cells Involving Dimension City

[Figure: the same H-tree and header table as on the previous slide, plus a second header table H_Tor with columns Attr. Val., sum, cnt, and Side-link, and entries Edu, Hhd, Bus, …, Jan, Feb, …]

From (*, *, Tor) to (*, Jan, Tor)

33

Scalability on Number of Probe Cells

34

Scalability on Gradient Threshold

35

Scalability on Significance Threshold

36

Scalability on Number of Tuples

37

Outline

• Introduction







38

Multi-Dimensional Trends Analysis of Sets of Time-Series -- Overview

• Consider applications having many time series

– Stocks, power grids, sensor nets, internet, gene expressions for toxicology, …

• Needs for MDML trends analysis

– Mining/monitoring unusual patterns/events, in MDML manner

• Regression cube for time series

– Store regression base cube

– Support MDML OLAP of regressions

• Results also useful for MDML data stream monitoring

39

Why MDML trends analysis

• Many time series

– E.g., prices of 10000s of stocks; one time series per stock

• Objectives

– Understand behavior of stocks/stock groups

– Find patterns of stock groups

– Monitor unusual events

– Find “groups of stocks” (variables) with interesting patterns (MDML search)

40

Regression based trends analysis

A time series: (t_i, z_i), i = 1..n

Linear regression model: the linear fitting curve z = a_0 + a_1 t with least square error

Can generalize regression to z = a_0 + a_1 f_1(t) + a_2 f_2(t) + … + a_k f_k(t)

Each f_j is a fixed function of t

Common tool for trends analysis

But limited to situations where “variables” (groups of time series) are known
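For the linear case, the least-squares coefficients have a closed form: a_1 = Σ(t_i - t̄)(z_i - z̄) / Σ(t_i - t̄)² and a_0 = z̄ - a_1 t̄. A minimal sketch follows; the function name and sample data are assumptions made for illustration.

```python
def fit_line(points):
    """Least-squares fit of z = a0 + a1 * t to a list of (t, z) pairs."""
    n = len(points)
    t_mean = sum(t for t, _ in points) / n
    z_mean = sum(z for _, z in points) / n
    # Sum of squared deviations of t; needs at least two distinct times.
    svs = sum((t - t_mean) ** 2 for t, _ in points)
    a1 = sum((t - t_mean) * (z - z_mean) for t, z in points) / svs
    a0 = z_mean - a1 * t_mean
    return a0, a1


# Points on the exact line z = 1 + 2t recover a0 = 1 and a1 = 2.
a0, a1 = fit_line([(1, 3), (2, 5), (3, 7), (4, 9)])
print(a0, a1)
```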

41

Regression cube for time series

• There is one initial time series per base cell

• Too costly to fully store all time series

• Regression base cube

– Only store regression parameters of base cells (4 values vs 10000s)

– Can we support MDML OLAP of regressions, using only the regression base cube, in lossless manner?

• Answer is yes, for both “roll up” on standard dimensions and on time dimension

42

Aggregation in Standard Dimensions

Two component cells

Aggregated cell

We can derive regression of aggregated cell from regression parameters of component cells
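One way to see why this works: assume the aggregated series is the pointwise sum of the component series over the same time points. The least-squares coefficients are linear in the z values, so the coefficients of the aggregated cell are the sums of the component coefficients. The sketch below checks this on made-up data; the names and numbers are illustrative assumptions.

```python
def fit_line(points):
    """Least-squares fit of z = a0 + a1 * t to a list of (t, z) pairs."""
    n = len(points)
    t_mean = sum(t for t, _ in points) / n
    z_mean = sum(z for _, z in points) / n
    svs = sum((t - t_mean) ** 2 for t, _ in points)
    a1 = sum((t - t_mean) * (z - z_mean) for t, z in points) / svs
    return z_mean - a1 * t_mean, a1


# Two made-up component series observed at the same time points.
series_a = [(1, 2.0), (2, 2.9), (3, 4.2), (4, 5.1)]
series_b = [(1, 10.0), (2, 9.5), (3, 9.1), (4, 8.2)]

# The aggregated cell holds the pointwise sum of the components.
aggregated = [(t, za + zb) for (t, za), (_, zb) in zip(series_a, series_b)]

a0_a, a1_a = fit_line(series_a)
a0_b, a1_b = fit_line(series_b)
derived = (a0_a + a0_b, a1_a + a1_b)  # uses the stored parameters only
direct = fit_line(aggregated)         # refits on the raw data, as a check

print(derived)
print(direct)
```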

43

Aggregation in Time Dimension

Cells of 2 adjacent time intervals:

Aggregated cell

We can derive regression of aggregated cell from regression parameters of component cells
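For the time dimension, a lossless merge is possible if each cell stores its interval endpoints along with its two coefficients. That is an assumed reading of the "4 values" per cell, with integer time ticks. The sums needed for a least-squares fit can be recovered from those four values, added across the two intervals, and refitted. This sketch is a derivation via sufficient statistics, not necessarily the closed form used in the original work.

```python
def fit_interval(t_start, values):
    """Fit z = a0 + a1*t on ticks t_start, t_start+1, ... (two or more values).

    Returns the four stored numbers (tb, te, a0, a1).
    """
    ts = range(t_start, t_start + len(values))
    n = len(values)
    t_mean = sum(ts) / n
    z_mean = sum(values) / n
    svs = sum((t - t_mean) ** 2 for t in ts)
    a1 = sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, values)) / svs
    return t_start, t_start + n - 1, z_mean - a1 * t_mean, a1


def stats(cell):
    """Recover n, sum t, sum t^2, sum z, sum t*z from (tb, te, a0, a1)."""
    tb, te, a0, a1 = cell
    n = te - tb + 1
    sum_t = n * (tb + te) / 2
    sum_tt = sum(t * t for t in range(tb, te + 1))
    t_mean = sum_t / n
    svs = sum_tt - n * t_mean ** 2
    sum_z = n * (a0 + a1 * t_mean)        # the fit passes through the means
    sum_tz = a1 * svs + t_mean * sum_z    # from the slope formula
    return n, sum_t, sum_tt, sum_z, sum_tz


def merge(cell1, cell2):
    """Regression of two adjacent intervals from their parameters only."""
    n, st, stt, sz, stz = (x + y for x, y in zip(stats(cell1), stats(cell2)))
    a1 = (n * stz - st * sz) / (n * stt - st ** 2)
    a0 = (sz - a1 * st) / n
    return cell1[0], cell2[1], a0, a1


# Made-up values for ticks 1..4 and 5..7.
first, second = [3.0, 4.1, 4.9, 6.2], [6.8, 8.1, 9.0]
merged = merge(fit_interval(1, first), fit_interval(5, second))
direct = fit_interval(1, first + second)  # refit on raw data, as a check
print(merged)
print(direct)
```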

44

Remarks on Regression Cube

Efficient storage; scalable (independent of number of tuples in data cells)

Lossless aggregation without accessing raw data

Fast and efficient aggregation

Regression models of data cells at all levels

Results cover a large and popular class of regression (linear, polynomial, and other models)

45

Concluding remarks

• Mining knowledge about change, differences, & trends (CDT) is useful & exciting

• Traditional approaches focus on high level view

• We considered CDT mining in transactions, relations, & data cubes

• We used discovered CDT patterns for classification, niche mining, & bioinformatics & medical studies

• Future work: mining useful CDT knowledge for bioinformatics, bio-medicine, business, …

46

References: Changes, Differences, & Trends

• S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 2001.

• Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in relational databases. In Knowledge Discovery in Databases, AAAI/MITPress, 1991.

• G. Dong and K. Deshpande. Efficient mining of niches and set routines. In Pacific-Asia Conf. On Knowledge Discovery & Data Mining, 2001.

• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proc. of the 5th ACM SIGKDD Int'l Conf. On Knowledge Discovery and Data Mining, 1999.

• G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by aggregating emerging patterns. In Proc. 2nd Int'l Conf. on Discovery Science, Tokyo, 1999.

• V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Y. Loh. A framework for measuring changes in data characteristics. In PODS, 1999.

• J. Li, G. Dong, and K. Ramamohanarao. Instance-based classification by emerging patterns. In European Conf. of Principles and Practice of Knowledge Discovery in Databases, Lyon, France, 2000.

47

References: Changes, Difference and Trends (Cont’d)

• J. Li, G. Dong, K. Ramamohanarao. Making use of the most expressive jumping emerging patterns for classification. In Proc Pacific Asia Conf. on Knowledge Discovery & Data Mining, 2000.

• J. Li, K. Ramamohanarao, G. Dong. Combining the strength of pattern frequency and distance for classification. In Pacific-Asia KDD, 2001.

• J. Li, L. Wong. Identifying good diagnostic genes or genes groups from gene expression data by using the concept of emerging patterns. Bioinformatics. 18:725--734, 2002.

• Bing Liu, Wynne Hsu, Heng-Siew Han, and Yiyuan Xia. Mining changes for real-life applications. In DaWaK, 2000.

• Bing Liu, Wynne Hsu, and Yiming Ma. Discovering the set of fundamental rule changes. In KDD, 2001.

• Eng-Juh Yeoh, …, Jinyan Li, …,Limsoon Wong, James R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133—143, March 2002.

• X. Zhang, G. Dong, K. Ramamohanarao. Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets. In KDD, 2000.

48

References: Changes and Trends (Data Cubes)

• S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.

• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.

• S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.

• Y. Chen, G. Dong, J. Han, B. W. Wah, J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB 2002.

• E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (on-line analytical processing) to user-analysts: an IT mandate. Tech Report, Codd Associates, 1993.

• G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB 2001.

• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.

49

References: Changes and Trends (Data Cubes) (Cont’d)

• J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.

• J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. SIGMOD'01.

• V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.

• T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Tech Report, Computer Science, Rutgers Univ, Aug. 2000.

• L. V.S. Lakshmanan, J Pei, J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.

• K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.

• S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.

• Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.

50

Extra Slides

• Just in case …

51

Base Cells: Tuples of a Relation

Product Location Time Sale

Printer Manhattan Jan 1999 100K

Laptop Queens Jan 1999 800K

… … … …

52

Data Cubes: OLAP OPs

[Figure: data cube over Time (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Location (U.S.A, Canada, Mexico), with a sum margin along each dimension]

Rollup, drilldown, slice/dice, pivot

53

Experimental Results

• Constraints:

– Csig is on cnt

– Cprb selects set of cells

– Cgrad(cg, cp): (avg_price(cg)/avg_price(cp) ≥ s)

• Data set

– 10 dimensions

– 10k-20k tuples

– Cardinality 10 for each dimension

– Measure range: 100-1000

• All-Pairs: One independent search per probe cell
