
Selected Research Results & Applications of WSU's Data Mining Research Lab

Guozhu Dong

PhD, Professor

Data Mining Research Lab

Wright State University

Data Mining Results and Applications Guozhu Dong 2

Outline

Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data cubes
Mining large collections of time series
Microarray concordance analysis
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results

Focus on the “bold” topics

Contrast data mining - What & Why?

Contrast - ``To compare or appraise in respect to differences'' (Merriam Webster Dictionary)

Contrast data mining - The mining of patterns and models contrasting two or more classes, conditions, or datasets.

Why: ``Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more'' (Darby Conley, Get Fuzzy, 2001)

Useful for understanding, prediction/classification, outlier detection, …


What can be contrasted?

Objects at different time periods: ``Compare ICDM papers published in 2006-2007 versus those in 2004-2005 to find emerging research directions''

Objects for different spatial locations: ``Find the distinguishing patterns of cars sold in the south, versus those sold in the north''

Objects across different classes: ``Find the key differences between normal colon tissues and cancerous colon tissues''

How do we contrast two datasets, without advanced mining tools?

Let D1 and D2 be the two datasets.

We usually find a prototypical case p1 for D1, and a prototypical case p2 for D2. Then we compare p1 against p2.

We may also compare the distribution of D1 against that of D2.

Such simplifications often miss the interesting contrast patterns.

Alternative names for contrast data mining/patterns

Contrast data mining is related to change mining, difference mining, discriminator mining, classification rule mining, …

Contrast patterns are related to these patterns: change patterns, class based association rules, contrast sets, concept drift, difference patterns, discriminative patterns, (dis)similarity patterns, emerging patterns, gradient patterns, high confidence patterns, (in)frequent patterns, …

How is contrast data mining used?

Domain understanding: ``Young children with diabetes have a greater risk of hospital admission, compared to the rest of the population''

Used for building classifiers: many different techniques, to be covered later. Also used for weighting and ranking instances.

Used for monitoring: ``Tell me when something unusual (unlike others in this class) arrives''

Understanding can help us do prevention; prediction can help us do treatment. An ounce of prevention is worth a pound of cure!

Emerging Patterns

Emerging Patterns (EPs) are contrast patterns between two classes of data whose support changes significantly between the two classes. Support = frequency. “Significant change” can be defined by:

big support ratio: supp2(X)/supp1(X) >= minRatio (similar to RiskRatio; advantage: allows patterns with small overall support)

big support difference: |supp2(X) - supp1(X)| >= minDiff (as defined by Bay+Pazzani 99)

If supp2(X)/supp1(X) = infinity, then X is a jumping EP. A jumping EP occurs in some members of one class but never occurs in the other class.

Here, X is the AND of a set of simple conditions. Extension to OR was also studied.

Example EP in microarray data for cancer

Binned data (each row is a tissue, each column a gene):

Normal Tissues      Cancer Tissues
g1 g2 g3 g4         g1 g2 g3 g4
L  H  L  H          H  H  L  H
L  H  L  L          L  H  H  H
H  L  L  H          L  L  L  H
L  H  H  L          H  H  H  L

EP example: X={g1=L,g2=H,g3=L}; suppN(X)=50%, suppC(X)=0

Use minimality to reduce number of mined EPs
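The support figures quoted for the example EP can be checked directly against the binned table. The following is a minimal sketch, not the lab's mining code: the helper names `support` and `growth_rate` are introduced here for illustration, and the convention of returning 0.0 when both supports are zero is an assumption.

```python
# Binned expression levels (L = low, H = high) copied from the slide's tables.
GENES = ("g1", "g2", "g3", "g4")
NORMAL = ["LHLH", "LHLL", "HLLH", "LHHL"]
CANCER = ["HHLH", "LHHH", "LLLH", "HHHL"]

def support(pattern, tissues):
    """Fraction of tissues that satisfy every gene=level condition in pattern."""
    index = {gene: i for i, gene in enumerate(GENES)}
    hits = sum(
        all(row[index[gene]] == level for gene, level in pattern.items())
        for row in tissues
    )
    return hits / len(tissues)

def growth_rate(supp_from, supp_to):
    """Support ratio supp_to / supp_from; infinite when the pattern jumps."""
    if supp_from == 0:
        # Assumed convention: a pattern absent from both classes has ratio 0.
        return float("inf") if supp_to > 0 else 0.0
    return supp_to / supp_from

X = {"g1": "L", "g2": "H", "g3": "L"}
supp_n = support(X, NORMAL)          # support among normal tissues
supp_c = support(X, CANCER)          # support among cancer tissues
ratio = growth_rate(supp_c, supp_n)  # change from cancer to normal
is_jumping = ratio == float("inf")
print(supp_n, supp_c, ratio, is_jumping)
```

X matches the first two normal tissues and no cancer tissue, so it is a jumping EP for the normal class.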

Top support minimal jumping EPs for colon cancer

Colon Cancer EPs
{1+ 4- 112+ 113+} 100%
{1+ 4- 113+ 116+} 100%
{1+ 4- 113+ 221+} 100%
{1+ 4- 113+ 696+} 100%
{1+ 108- 112+ 113+} 100%
{1+ 108- 113+ 116+} 100%
{4- 108- 112+ 113+} 100%
{4- 109+ 113+ 700+} 100%
{4- 110+ 112+ 113+} 100%
{4- 112+ 113+ 700+} 100%
{4- 113+ 117+ 700+} 100%
{1+ 6+ 8- 700+} 97.5%

Colon Normal EPs
{12- 21- 35+ 40+ 137+ 254+} 100%
{12- 35+ 40+ 71- 137+ 254+} 100%
{20- 21- 35+ 137+ 254+} 100%
{20- 35+ 71- 137+ 254+} 100%
{5- 35+ 137+ 177+} 95.5%
{5- 35+ 137+ 254+} 95.5%
{5- 35+ 137+ 419-} 95.5%
{5- 137+ 177+ 309+} 95.5%
{5- 137+ 254+ 309+} 95.5%
{7- 21- 33+ 35+ 69+} 95.5%
{7- 21- 33+ 69+ 309+} 95.5%
{7- 21- 33+ 69+ 1261+} 95.5%

EPs from Mao+Dong 05 (gene club + border-diff).

Colon cancer dataset (Alon et al, 1999 (PNAS)): 40 cancer tissues, 22 normal tissues. 2000 genes

These EPs have 95%--100% support in one class but 0% support in the other class.

Minimal: Each proper subset occurs in both classes.

Very few 100% support EPs.

There are ~1000 items with supp >= 80%.

Besides uses discussed earlier, another potential use of minimal jumping EPs:

Minimal jumping EPs for normal tissues: properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues. Restore these to cure colon cancer?

Minimal jumping EPs for cancer tissues: bad gene expression groups that occur in some cancer tissues but never occur in normal tissues. Disrupt these to cure colon cancer?

Possible targets for drug design?

Li+Wong 02 proposed the “gene therapy using EP” idea.

A paper using EPs was published in Cancer Cell (cover, 3/02). EPs have been applied in medical applications, for example for diagnosing acute lymphoblastic leukemia.

EP Mining Algorithms and Studies

Complexity result (Wang et al 05)
Border-differential algorithm (Dong+Li 99)
Gene club + border differential (Mao+Dong 05)
Constraint-based approach (Zhang et al 00)
Tree-based approach (Bailey et al 02, Fan+Kotagiri 02)
Projection based algorithm (Bailey et al 03)
ZBDD based method (Loekito+Bailey 06)
Equivalence class based (Li et al 07)

Can handle 200+ dimensions

Contrast pattern based classification -- history

Contrast pattern based classification: Methods to build or improve classifiers, using contrast patterns

CBA (Liu et al 98)
CAEP (Dong et al 99)
Instance based method: DeEPs (Li et al 00, 04)
Jumping EP based (Li et al 00), Information based (Zhang et al 00), Bayesian based (Fan+Kotagiri 03), improving scoring for >=3 classes (Bailey et al 03)
CMAR (Li et al 01)
Top-ranked EP based PCL (Li+Wong 02)
CPAR (Yin+Han 03)
Weighted decision tree (Alhammady+Kotagiri 06)
Rare class classification (Alhammady+Kotagiri 04)
Constructing supplementary training instances (Alhammady+Kotagiri 05)
Noise tolerant classification (Fan+Kotagiri 04)
One-class classification/detection of outlier cases (Chen+Dong 06)
…

Most follow the aggregating approach of CAEP.

EP-based classifiers: rationale

Consider a typical EP in the Mushroom dataset, {odor = none, stalk-surface-below-ring = smooth, ring-number = one}; its support increases from 0.2% in “poisonous” to 57.6% in “edible” (support ratio = 288).

Strong differentiating power: if a test case T contains this EP, we can predict T as edible with high confidence: 57.6/(57.6+0.2) = 99.6%.

A single EP is usually sharp in telling the class of only a small fraction (e.g. 3%) of all instances, so we need to aggregate the power of many EPs to make the classification.

EP based classification methods often outperform state of the art classifiers, including C4.5 and SVM. They are also noise tolerant.

CAEP (Classification by Aggregating Emerging Patterns)

Given a test case T, obtain T’s scores for each class by aggregating the discriminating power of the EPs contained in T; assign the class with the maximal score as T’s class. The discriminating power of EPs is expressed in terms of supports and growth rates. Prefer large supRatio, large support.

The contribution of one EP X (support weighted confidence):

$$\mathrm{strength}(X) = \mathrm{sup}(X) \cdot \frac{\mathrm{supRatio}(X)}{\mathrm{supRatio}(X) + 1}$$

Given a test T and a set E(Ci) of EPs for class Ci, the aggregate score of T for Ci is

$$\mathrm{score}(T, C_i) = \sum_{X \in E(C_i),\ X \subseteq T} \mathrm{strength}(X)$$

For each class, may use the median (or 85%) aggregated value to normalize, to avoid bias towards the class with more EPs.

CMAR aggregates “Chi2 weighted Chi2”.

How CAEP works? An example

Given a test case T={a,d,e}, how to classify T?

a c d e

b c d e

a b c d

a b d e

Class 2 (D2)

Class 1 (D1)

T contains EPs of class 1: {a,e} (50%:25%) and {d,e} (50%:25%), so Score(T, class 1) = 0.5*[0.5/(0.5+0.25)] + 0.5*[0.5/(0.5+0.25)] = 0.67;

T contains EPs of class 2: {a,d} (25%:50%), so Score(T, class 2) = 0.5*[0.5/(0.5+0.25)] = 0.33;

T will be classified as class 1 since Score1 > Score2.
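The arithmetic of this worked example can be reproduced in a few lines. This is a minimal sketch that takes the EP supports as stated on the slide rather than mining them from the training data. The names `strength` and `caep_score` are illustrative, and the handling of a jumping EP (zero support in the other class) is an assumption, since the slide's example contains none.

```python
def strength(sup_own, sup_other):
    """Support-weighted confidence: sup * supRatio / (supRatio + 1)."""
    if sup_other == 0:
        # Assumption: for a jumping EP the ratio is infinite, so the
        # fraction supRatio / (supRatio + 1) is taken as its limit, 1.
        return sup_own
    ratio = sup_own / sup_other
    return sup_own * ratio / (ratio + 1)

def caep_score(test_case, eps):
    """Sum the strengths of the EPs that are contained in the test case."""
    return sum(
        strength(sup_own, sup_other)
        for items, sup_own, sup_other in eps
        if items <= test_case
    )

# (itemset, support in own class, support in the other class), from the slide.
EPS_CLASS1 = [(frozenset("ae"), 0.50, 0.25), (frozenset("de"), 0.50, 0.25)]
EPS_CLASS2 = [(frozenset("ad"), 0.50, 0.25)]

T = frozenset("ade")
score1 = caep_score(T, EPS_CLASS1)
score2 = caep_score(T, EPS_CLASS2)
predicted = 1 if score1 > score2 else 2
print(round(score1, 2), round(score2, 2), predicted)
```

The sketch leaves out the per-class normalization (median or 85% value) mentioned earlier, which matters when one class has many more EPs than the other.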

DeEPs (Decision-making by Emerging Patterns)

An instance based (lazy) learning method, like k-NN; but does not use the normal distance measure.

For a test instance T, DeEPs:

First projects all training instances to contain only items in T
Discovers EPs from the projected data
Uses these EPs to get the training data that match some discovered EPs
Finally, uses the proportional size of matching data in a class C as T’s score for C

Advantage: disallows similar EPs from giving duplicate votes!
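The DeEPs procedure described above can be sketched on a toy dataset. This is a simplified illustration under stated assumptions, not the published algorithm: it considers only jumping EPs, enumerates candidate patterns by brute force over subsets of T (exponential in the size of T), and uses hypothetical training data. The names `occurs` and `deeps_scores` are introduced here.

```python
from itertools import combinations

# Hypothetical training data: each instance is a set of items.
TRAIN = {
    "pos": [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}],
    "neg": [{"a", "d", "e"}, {"c", "d"}, {"a", "e"}],
}

def occurs(pattern, instances):
    """True if some instance contains every item of the pattern."""
    return any(pattern <= inst for inst in instances)

def deeps_scores(test, train):
    """Score each class by the share of its instances matched by some jumping EP."""
    # Project every training instance onto the items of the test case.
    projected = {c: [inst & test for inst in insts] for c, insts in train.items()}
    # Candidate patterns: all non-empty subsets of the test case.
    candidates = [
        set(combo)
        for size in range(1, len(test) + 1)
        for combo in combinations(sorted(test), size)
    ]
    scores = {}
    for cls, own in projected.items():
        others = [p for c, insts in projected.items() if c != cls for p in insts]
        # Jumping EPs of cls: occur in its projected data, never in other classes.
        eps = [x for x in candidates if occurs(x, own) and not occurs(x, others)]
        # Each matching instance counts once, however many EPs it contains.
        matched = sum(any(ep <= inst for ep in eps) for inst in own)
        scores[cls] = matched / len(own)
    return scores

T = {"a", "b", "e"}
scores = deeps_scores(T, TRAIN)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Counting matched instances rather than summing per-pattern votes is what keeps near-identical EPs from voting twice: an instance covered by several overlapping EPs still contributes a single count.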

Why EP-based classifiers are good

Use the discriminating power of low support EPs (with high supRatio), in addition to the high support ones

Use multi-feature conditions, not just single-feature conditions

Select from larger pools of discriminative conditions. Compare: the search space of patterns for decision trees is limited by early greedy choices.

Aggregate/combine the discriminating power of a diversified committee of “experts” (EPs)

Decisions of such classifiers are highly explainable

Also Studied Contrast Pattern Mining for

Sequence family A vs sequence family B
Graph collection A vs graph collection B
Building a contrast pattern based clustering quality index
Constructing synthetic training data for classes with few training instances
…

More than 6 PhD dissertations
About 50 research papers
A tutorial given at IEEE ICDM 2007

Multi-dimensional multi-level data mining in data cubes

Data cube is used for discovering patterns captured in consolidated historical data for a company/organization: rules, anomalies, unusual factor combinations

Data cube is focused on modeling & analysis of data for decision makers, not daily operations.

Data organized around major subjects or factors, such as customer, product, time, sales.

Cube “contains” a huge number of MDML (multi-dimensional multi-level) summaries for “segments” or “sectors” at different levels of detail

Basic OLAP operations: Drill down, roll up, slice and dice, pivot

Data Cubes: Base Table & Hierarchies

Base table stores sales volume (measure), a function of product, time, & location (dimensions)

Hierarchical summarization paths (coarsest level first):

Product     Location    Time
Industry    Region      Year
Category    Country     Quarter
Product     City        Month, Week
            Office      Day

*: all (as top of each dimension)

(Figure: the cube, with a base cell marked.)

Data Cubes: Derived Cells

(Figure: a cube over product, time (1Qtr, 2Qtr, 3Qtr, 4Qtr) and location (Canada, Mexico), with "sum" cells along each dimension.)

Measures: sum, count, avg, max, min, std, …

Derived cells, different levels of detail, e.g. (TV,*,Mexico)
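A derived cell such as (TV,*,Mexico) is the aggregate of all base cells it generalizes. The sketch below illustrates this for the sum measure. The base cell values are hypothetical, and each dimension has only two levels here (a concrete value or `*`), not the full hierarchies from the previous slide. `build_cube` is an illustrative name.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base cells: (product, quarter, country) -> sales volume.
BASE = {
    ("TV", "1Qtr", "Mexico"): 10,
    ("TV", "2Qtr", "Mexico"): 20,
    ("TV", "1Qtr", "Canada"): 5,
    ("PC", "1Qtr", "Mexico"): 7,
}

def build_cube(base):
    """Sum the measure into every cell obtained by replacing dimensions with '*'."""
    cube = defaultdict(int)
    for cell, value in base.items():
        # Each dimension either keeps its value or is generalized to '*'.
        for mask in product((False, True), repeat=len(cell)):
            key = tuple("*" if starred else v for v, starred in zip(cell, mask))
            cube[key] += value
    return dict(cube)

cube = build_cube(BASE)
tv_mexico = cube[("TV", "*", "Mexico")]  # TV sales in Mexico over all quarters
grand_total = cube[("*", "*", "*")]      # the apex cell
print(tv_mexico, grand_total)
```

Materializing every cell this way grows exponentially with the number of dimensions, which is one reason selective materialization such as iceberg cubes is studied.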

Gradient mining in data cubes

Find syntactically similar cells with significantly different measure values

E.g.: (house,California,May,2008), total-sale = 100M vs (house,Iowa,May,2008), total-sale = 200M

*** This is made up to show the point ***
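A naive version of this search can be sketched as a pairwise scan. The sketch assumes that "syntactically similar" means differing in exactly one dimension value and that "significantly different" means a measure ratio of at least `min_ratio`. Both are simplifying assumptions, and the third cell is a hypothetical addition to the made-up example.

```python
from itertools import combinations

# Cells from the made-up example (total sale in $M), plus one hypothetical cell.
CELLS = {
    ("house", "California", "May", 2008): 100,
    ("house", "Iowa", "May", 2008): 200,
    ("house", "Iowa", "June", 2008): 190,  # hypothetical
}

def gradient_pairs(cells, min_ratio):
    """Pairs of cells that differ in one dimension with measure ratio >= min_ratio."""
    found = []
    for (cell1, m1), (cell2, m2) in combinations(cells.items(), 2):
        differing = sum(a != b for a, b in zip(cell1, cell2))
        low, high = sorted((m1, m2))
        if differing == 1 and low > 0 and high / low >= min_ratio:
            found.append((cell1, cell2, high / low))
    return found

pairs = gradient_pairs(CELLS, min_ratio=1.5)
for cell1, cell2, ratio in pairs:
    print(cell1, "vs", cell2, "ratio", ratio)
```

The scan is quadratic in the number of cells, so it only illustrates the problem statement. It is not a scalable cube algorithm.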

Other people studied: iceberg cubes, cells significantly different from neighbors, …

Multi-Dimensional Trends Analysis of Sets of Time-Series in Data Cubes

Consider applications having many time series: ECG curves, stocks, power grids, sensor networks, internet, gene expressions for toxicology study, …

Need MDML trends analysis: mining/monitoring unusual patterns/events, in MDML manner

E.g. find good sets of stocks with desired total risk/reward ratios

Regression cube for time series: store regression base cube; support MDML OLAP of regressions

Results also useful for MDML data stream monitoring

Example: Aggregating Set of Time Series

Two component cells

Aggregated cell

Deriving regression of aggregated cell from regression of component cells
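The slide does not spell out the derivation. One property that makes it possible, at least when the aggregated cell is the sum of component series observed at the same time points, is that least-squares coefficients are linear in the observed values: the fitted line of the sum is the sum of the fitted lines. The sketch below checks this numerically on hypothetical series. `fit_line` is an illustrative helper, not the lab's implementation.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit y = intercept + slope * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

TIMES = [0, 1, 2, 3]
SERIES_A = [1.0, 2.0, 2.0, 4.0]  # hypothetical component cell
SERIES_B = [3.0, 3.0, 5.0, 4.0]  # hypothetical component cell

a0, a1 = fit_line(TIMES, SERIES_A)
b0, b1 = fit_line(TIMES, SERIES_B)
derived = (a0 + b0, a1 + b1)  # computed from stored coefficients only

aggregated = [x + y for x, y in zip(SERIES_A, SERIES_B)]
direct = fit_line(TIMES, aggregated)  # refit on the raw aggregated series
print(derived, direct)
```

Under these assumptions a cube needs to store only the regression coefficients of the base cells to answer regression queries on summed cells, without revisiting the raw series.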

In-Network Detection of Shapes of Region-Based Events in Sensor Networks

(Figure: sensor nodes and event sensing.)

Each sensor can sense events, and talk with neighbors

Research Problems Studied

Detection of Region-Based Events: given a sensor network, when a region-based event occurs, report the spatial geometric information, which may include the boundaries and the shape of the region; positions of important points; important metrics: length, area, density…

Tracking of Region-Based Events: after initial detection of a region-based event, determine its spatial dynamic parameters (moving direction, speed, expansion rate of area, etc).

Computation is done in the sensor network, which is organized into an R-tree.

Multiple platforms/labs dataset concordance/consistency evaluation

Microarrays (supplied by different manufacturers) are used by different labs to measure gene expressions in tissues.

Without knowing the concordance between platform/lab conditions, it is hard to transfer knowledge (patterns/classifiers) from one lab to another.

We provide measures and techniques to address this problem, based on “discriminating gene/classifier transferability”.

Summarizing clusterings of documents

We often need to process large collections of documents (abstracts, articles, Google search results, …)

We need methods to help us quickly get a sense of the main themes of the documents

We gave methods to find “summary word sets” (cluster description sets) to describe clusterings of documents

Words in a summary set for a cluster should be typical in the cluster, and be rare in other clusters

Alternative Clustering

Clustering is usually performed on poorly understood datasets

Multiple clusterings (ways to group the data) may exist

Need methods to discover alternative clusterings

We gave algorithms to solve this problem, and introduced a new similarity measure between clusterings

Undesirable object converter mining

We have a class of desirable objects and a class of undesirable objects.

The goal is to mine “small sets of attribute changes, which when applied to undesirable objects, may change those objects’ class from undesirable to desirable.”

We considered two types of converter sets – personalized, and universal

We gave algorithms to mine them

Data mining for knowledge transfer

We have two application domains: a well understood one and a less understood one.

The goal is to mine knowledge that can be transferred from the well understood domain to the less understood domain, to solve problems in the less understood domain

Comparative summary of search results

We often perform multiple searches on the web or on a document collection.

There is an information overload, when we process the search results.

We developed tools to compare and summarize the search results to reduce the information overload.

Compare two searches -- examples:

Same key words searched at two time points
Same key words searched over two locations
etc.

Outline of Some Recent Works, Review

Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data cubes
Mining large collections of time series
Microarray concordance analysis using contrast patterns
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results

Thank you

List of papers available at http://www.cs.wright.edu/~gdong/

Email: guozhu.dong@wright.edu

Collaboration opportunities to work on your problems are welcome
