mining biological data

Mining Biological Data Jiong Yang, Ph. D. Visiting Assistant Professor UIUC [email protected]


Post on 22-Feb-2016


Page 1: Mining Biological Data

Mining Biological Data

Jiong Yang, Ph.D., Visiting Assistant Professor, UIUC [email protected]

Page 2: Mining Biological Data

Data is Everywhere

Page 3: Mining Biological Data

Data Mining is a Powerful Tool

Computational Biology E-Commerce Intrusion Detection Multimedia Processing Unstructured Data . . .

Data → Data Mining → Knowledge

Page 4: Mining Biological Data

Biological Data

Bioinformatics has become one of the most important applications of data mining.

DNA sequences

Protein sequences

Protein folding

Microarray data

…

Page 5: Mining Biological Data

Outline

Approximate sequential pattern mining

Coherent cluster: clustering by pattern similarity in a large data set

Page 6: Mining Biological Data

Frequent Patterns Model

A set of sequences of symbols:

a1,a2,a4

a2,a3,a5

a1,a4,a5,a6,a7

If a pattern occurs more than a certain number of times, then this pattern is considered important, e.g., a1,a4.

Widely studied:

Frequent itemset mining: Agrawal and Srikant (IBM Almaden)

FP-growth: Han (UIUC)

Stream data: Motwani (Stanford)

…

Page 7: Mining Biological Data

Apriori Property

Widely used in the data mining field

It holds for the support metric

All patterns form a lattice.

(a, b, d) is a super-pattern of (a, d) and a sub-pattern of (a, b, c, d).

The support metric defines a partial order on the lattice.

Support(a, b, d) <= min{Support(b, d), Support(a, d), Support(a, b)}

A level-wise search algorithm can be used
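The inequality above can be checked on the three example sequences from the Frequent Patterns Model slide. This is only an illustrative sketch: it assumes that a pattern occurs in a sequence when it is a (not necessarily contiguous) subsequence, and the helper names `is_subsequence` and `support` are hypothetical, not taken from the talk.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if the symbols of pattern appear in sequence in the same order."""
    it = iter(sequence)
    # "symbol in it" consumes the iterator, so order is respected.
    return all(symbol in it for symbol in pattern)

def support(pattern, sequences):
    """Number of sequences that contain the pattern (assumed definition)."""
    return sum(is_subsequence(pattern, s) for s in sequences)

# Example sequences from the Frequent Patterns Model slide.
sequences = [
    ("a1", "a2", "a4"),
    ("a2", "a3", "a5"),
    ("a1", "a4", "a5", "a6", "a7"),
]

pattern = ("a1", "a4", "a5")
# All sub-patterns obtained by dropping one symbol.
sub_patterns = list(combinations(pattern, len(pattern) - 1))
pattern_support = support(pattern, sequences)
sub_supports = {p: support(p, sequences) for p in sub_patterns}

# Apriori property: a pattern is never more frequent than its sub-patterns.
apriori_holds = pattern_support <= min(sub_supports.values())
```

Because of this property, a level-wise search only needs to extend patterns whose sub-patterns are all frequent.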

Page 8: Mining Biological Data

Shortcomings

Requires an exact match and fails to recognize possible substitution among symbols

A protein may mutate without changing its functionality.

A sensor may make some mistakes.

Different web pages may have similar contents.

A word may have many synonyms.

How can the symbol substitution be modeled?

Page 9: Mining Biological Data

Compatibility Matrix

Compatibility matrix of 5 symbols (rows: true value, columns: observed value)

        d1     d2     d3     d4     d5
d1      0.9    0.1    0      0      0
d2      0.05   0.8    0.05   0.1    0
d3      0.05   0      0.7    0.15   0.1
d4      0      0.1    0.1    0.75   0.05
d5      0      0      0.15   0      0.85

Page 10: Mining Biological Data

Compatibility Matrix

The compatibility matrix serves as a bridge between the observation and the underlying substance.

Each observed symbol is interpreted as an occurrence of a set of symbols with various probabilities.

An observed symbol combination is treated as an occurrence of a set of patterns with various degrees.

Obtain the compatibility matrix through

empirical study

domain expert

Page 11: Mining Biological Data

Match

A new metric, match, is then proposed to quantify the importance of a pattern.

The match of a pattern P in a subsequence s (with the same length) is defined as the conditional probability Prob(P | s).

The match of a pattern P in a sequence S is defined as the maximal match of P in every distinct subsequence in S.

A dynamic programming technique is used to compute the match of P in a sequence S.

Page 12: Mining Biological Data

Match

$M(d_1 d_2 \dots d_i,\; S_1 S_2 \dots S_j)$ is the maximum of $M(d_1 d_2 \dots d_i,\; S_1 S_2 \dots S_{j-1})$ and $M(d_1 d_2 \dots d_{i-1},\; S_1 S_2 \dots S_{j-1}) \times C(d_i, S_j)$

The match of a pattern P in a set of sequences is defined as the sum of the matches of P in each sequence.

A pattern is called a frequent pattern if its match exceeds a user-specified threshold min_match.

[Figure: dynamic programming table for the match of a pattern p in a sequence S; the cell values shown are 0.9, 0.045 and 0.09.]
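The recurrence for M can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions: the base cases (an empty pattern has match 1, a non-empty pattern has match 0 in an empty sequence) are not on the slide, the two-symbol compatibility values are made up, and `compat[true][observed]` is a hypothetical layout for C(true, observed).

```python
def match(pattern, sequence, compat):
    """Match of pattern in sequence, following the slide's recurrence.

    table[i][j] holds M(d1..di, S1..Sj).
    Assumed base cases: table[0][j] = 1 and table[i][0] = 0 for i > 0.
    """
    m, n = len(pattern), len(sequence)
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        table[0][j] = 1.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            skip = table[i][j - 1]  # S_j is not used by the pattern
            use = table[i - 1][j - 1] * compat[pattern[i - 1]][sequence[j - 1]]
            table[i][j] = max(skip, use)
    return table[m][n]

def match_in_database(pattern, sequences, compat):
    """Match in a set of sequences: sum of the per-sequence matches."""
    return sum(match(pattern, s, compat) for s in sequences)

# Hypothetical compatibility values, compat[true][observed].
compat = {
    "a": {"a": 0.9, "b": 0.2},
    "b": {"a": 0.1, "b": 0.8},
}

single = match("ab", "aab", compat)                    # 0.9 * 0.8
total = match_in_database("ab", ["aab", "ab"], compat)
```

The table has one cell per (pattern prefix, sequence prefix) pair, so the cost is proportional to the pattern length times the sequence length.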

Page 13: Mining Biological Data

Challenges

Previous work focuses on short patterns.

Long patterns require a large number of scans through the input sequence.

Expensive I/O cost

Performance vs. accuracy

Probabilistic approach

Page 14: Mining Biological Data

Chernoff Bound

Let X be a random variable whose range is R. Suppose that we have n independent observations of X and the observed mean is $\mu$. The Chernoff bound states that, with probability $(1-\delta)$, the true mean of X is at least $\mu - \epsilon$, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

With probability $(1-\delta)$, the true mean of X is at most $\mu + \epsilon$.
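As a rough numerical illustration of the bound, the margin can be computed directly from R, n and the confidence parameter. The function name `chernoff_epsilon` and the sample numbers are hypothetical, and the formula assumed is the square root of R² ln(1/δ) / (2n).

```python
import math

def chernoff_epsilon(value_range, n, delta):
    """Error margin for n observations of a variable with the given range.

    Assumes epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Illustrative numbers only: match values lie in [0, 1], so R = 1.
eps_small_sample = chernoff_epsilon(1.0, 1_000, 0.05)
eps_large_sample = chernoff_epsilon(1.0, 4_000, 0.05)
```

Quadrupling the sample size halves the margin, which is why the sample size (bounded by memory) controls how many patterns remain ambiguous.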

Page 15: Mining Biological Data

Approach

Three-stage approach to mine patterns with length l:

1. Find the match of individual symbols and take a sample set of sequences
2. Pattern discovery on samples
3. Ambiguous pattern determination

Pattern discovery on samples

Sample size: depends on memory size

Based on the samples, three types of patterns are determined.

Page 16: Mining Biological Data

Approach

Frequent pattern if match is greater than (min_match + $\epsilon$)

Ambiguous pattern if match is between (min_match - $\epsilon$) and (min_match + $\epsilon$)

Infrequent pattern otherwise
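The three-way split can be written as a small function. This is a sketch only: `classify_pattern` is a hypothetical name, the margin is passed in as a number, and treating the exact boundary values as ambiguous is an assumption.

```python
def classify_pattern(sample_match, min_match, epsilon):
    """Label a pattern from its match in the sample.

    Boundary values are treated as ambiguous (an assumption).
    """
    if sample_match > min_match + epsilon:
        return "frequent"
    if sample_match < min_match - epsilon:
        return "infrequent"
    return "ambiguous"

# Illustrative values: threshold 0.30 with a margin of 0.04.
labels = [classify_pattern(m, 0.30, 0.04) for m in (0.40, 0.32, 0.20)]
```

Only the patterns labelled ambiguous need to be verified against the full data set.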

Page 17: Mining Biological Data

Ambiguous Patterns

Ambiguous patterns: too many

Border collapse

We have the negative and positive borders of significant patterns.

Our goal is to collapse the border as fast as possible.

Page 18: Mining Biological Data

Ambiguous Patterns

(d1)

(d1,d2) (d1,d3) (d1,d4) (d1,d5)

(d1,d2,d3) (d1,d2,d4) (d1,d2,d5) (d1,d3,d4) (d1,d3,d5) (d1,d4,d5)

(d1,d2,d3,d4) (d1,d2,d3,d5) (d1,d2,d4,d5) (d1,d3,d4,d5)

(d1,d2,d3,d4,d5)

Page 19: Mining Biological Data

Ambiguous Patterns

(d1)

(d1,d2) (d1,d3) (d1,d4) (d1,d5)

(d1,d2,d3) (d1,d2,d4) (d1,d2,d5) (d1,d3,d4) (d1,d3,d5) (d1,d4,d5)

(d1,d2,d3,d4) (d1,d2,d3,d5) (d1,d2,d4,d5) (d1,d3,d4,d5)

(d1,d2,d3,d4,d5)

frequent

infrequent

Page 20: Mining Biological Data

Effects of 1-δ

[Figure, two panels. Left: number of ambiguous patterns (K), 0 to 140, versus confidence, 0.85 to 1. Right: error rate, 1.00E-06 to 1.00E+00, versus confidence, 0.85 to 1. Legend: With BorderCollapse, Without BorderCollapse.]

Page 21: Mining Biological Data

Approximate Pattern Mining

Reference:

Mining long sequential patterns in a noisy environment, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 406-417, 2002.

Other Work

Periodic Patterns (KDD2000, ICDM2001)

Statistically significant Patterns (KDD2001, ICDM2002)

Page 22: Mining Biological Data

Outline

Approximate sequential pattern mining

Coherent cluster: clustering by pattern similarity in a large data set

Page 23: Mining Biological Data

Coherent Cluster

In many applications, data can be of very high dimensionality.

Gene expression data: dozens to hundreds of conditions/samples

Customer evaluation: thousands or more merchants

Objective: discover peer groups

[Figure: data matrix with objects o1, …, oi, … as rows and attributes a1, …, aj, … as columns; entry dij is the value of object oi on attribute aj.]

Page 24: Mining Biological Data

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
139 69 0 69 139 139 139 139 69 0 0 69 110 0 69 0 0
0 69 69 69 110 110 110 110 69 0 69 69 110 110 69 0 69
139 110 0 69 69 110 139 139 139 0 69 69 139 69 69 0 0
139 110 0 69 110 110 110 139 110 0 69 110 139 69 69 0 69
208 179 110 69 110 110 110 161 161 0 69 69 110 0 69 0 69
0 0 0 69 69 139 161 179 139 0 69 0 110 0 69 0 69
0 0 0 0 110 110 110 69 110 0 0 0 69 0 69 0 69
179 161 69 69 69 110 69 110 110 0 0 69 0 0 69 0 69
69 110 69 110 110 161 110 69 139 69 69 110 110 139 110 69 110
69 0 69 69 110 139 110 0 0 0 69 69 110 69 69 0 0
139 161 110 110 139 179 139 110 139 69 69 69 110 110 110 69 69
179 179 161 139 161 195 161 161 161 110 161 161 139 139 161 110 110
179 240 161 195 195 256 220 208 240 139 195 195 195 161 195 161 110
161 161 69 110 139 161 139 110 161 69 110 139 69 69 110 69 69
208 283 240 248 264 304 283 283 283 195 220 240 240 240 248 195 208
161 195 110 139 195 248 179 161 220 110 179 195 161 179 208 110 110
139 161 139 161 139 179 161 139 69 69 139 69 69 179 179 110 69
304 326 304 322 326 350 340 376 318 248 314 283 314 318 326 264 264
69 69 0 69 110 110 69 0 69 0 69 69 139 69 69 0 0
283 208 220 277 289 326 289 289 248 220 271 240 271 294 277 230 208
337 383 383 413 414 403 381 393 343 350 369 358 347 358 356 314 289
161 161 220 195 161 195 161 110 110 110 195 179 179 69 139 110 110
208 195 220 161 139 161 161 110 139 110 195 195 195 69 161 139 139
248 230 330 300 277 240 240 179 195 220 277 289 240 240 220 161 161
264 300 289 264 277 277 289 277 300 248 283 271 294 256 264 271 283
230 240 289 264 240 256 220 208 220 248 271 256 256 240 220 179 208
439 442 464 456 451 422 417 403 432 510 438 442 450 462 419 476 476
256 230 208 240 230 248 240 283 248 220 230 230 220 240 248 220 240
374 322 322 300 330 356 361 333 369 376 369 374 369 343 361 393 399
139 195 161 139 161 139 161 139 179 110 110 139 139 139 110 161 161
230 277 256 248 264 271 248 240 256 220 230 230 256 208 208 240 230
494 470 498 488 477 460 466 484 449 532 485 473 464 487 477 492 484
326 248 240 289 300 294 289 264 277 248 283 283 277 283 277 271 283
179 139 110 69 69 110 69 69 69 69 69 69 69 69 110 69 69
326 411 397 383 371 347 314 277 330 264 289 283 304 264 264 340 343
161 220 220 220 208 208 161 161 208 179 195 179 179 161 139 161 139
220 271 248 230 240 248 240 179 248 208 208 220 230 220 179 230 230
220 271 230 208 161 195 161 161 195 161 208 195 220 161 179 195 220
179 195 110 161 139 179 161 179 161 69 110 139 139 139 161 139 161

17 conditions, 40 genes

Page 25: Mining Biological Data

Coherent Cluster

[Figure: expression level (0 to 600) versus condition (0 to 16).]

Page 26: Mining Biological Data

[The expression matrix from Page 24 is shown again on this slide.]

40 genes

Page 27: Mining Biological Data

Coherent Cluster

[Figure: expression level (0 to 600) versus condition, for conditions 3, 5, 9, 14 and 15.]

Co-regulated genes: YBL069W, YBL097W, YBR064W, YBR065C, YBR114W, YCL013W, YDR149C, YDR461W, YDR526C, YHR061C, YIL092W, YIR043C, YJL010C, YJL023C, YJL033W, YJL076W, YJR162C, YKL068W, YKL134C, YLR219W

Page 28: Mining Biological Data

Coherent Cluster

• Observations:

1. If mapped to points in high dimensional space, they may not be close to each other.

• Bias exists universally.

2. Only a subset of objects and a subset of attributes may participate.

3. Need to accommodate some degree of noise.

• Solution: subspace cluster, bicluster, coherent cluster

Page 29: Mining Biological Data

Subspace cluster

CLIQUE: Agrawal et al., IBM Almaden

Find a subset of dimensions and a subset of objects such that the objects are close to each other on that subset of dimensions.

The clusters may overlap

PROCLUS: Aggarwal et al., IBM T. J. Watson

Does not allow overlap

Page 30: Mining Biological Data

Bicluster

Developed in 2000 by Cheng and Church

Uses the mean squared residue

After discovering one cluster, replace the cluster with random data and find another

Not efficient and not accurate

Page 31: Mining Biological Data

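Coherent Cluster

Coherent cluster

Subspace clustering

Measure distance on mutual bias

Pair-wise disparity: for a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b},

$$D\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = \left| (d_{xa} - d_{ya}) - (d_{xb} - d_{yb}) \right|$$

where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b.

[Figure: the values dxa, dxb of object x and dya, dyb of object y plotted over attributes a and b, with the mutual bias of each attribute marked.]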
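The pair-wise disparity can be computed directly from the four entries. The sketch below assumes that D is the absolute difference of the two mutual biases; the function name `disparity` and the sample values are hypothetical.

```python
def disparity(d_xa, d_xb, d_ya, d_yb):
    """D value of a 2x2 (sub)matrix with rows x, y and columns a, b."""
    bias_a = d_xa - d_ya  # mutual bias of attribute a
    bias_b = d_xb - d_yb  # mutual bias of attribute b
    return abs(bias_a - bias_b)

# y is x shifted by a constant offset of 2: perfectly coherent.
d_shifted = disparity(1, 4, 3, 6)
# The offset differs by 1 between the two attributes.
d_noisy = disparity(1, 2, 3, 5)
```

A constant offset between two objects gives D = 0 even when their actual values are far apart, which is the case distance-based clustering misses.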

Page 32: Mining Biological Data

Coherent Cluster

A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.

An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.

A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.

Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
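The definition can be checked by brute force on a small matrix. This sketch tests every 2×2 submatrix and is meant only to make the definition concrete, not to stand in for the mining algorithm; the name `is_delta_coherent` and the sample matrices are hypothetical.

```python
from itertools import combinations

def is_delta_coherent(matrix, delta):
    """True if every 2x2 submatrix has a D value of at most delta."""
    rows = range(len(matrix))
    cols = range(len(matrix[0]))
    for x, y in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            bias_a = matrix[x][a] - matrix[y][a]
            bias_b = matrix[x][b] - matrix[y][b]
            if abs(bias_a - bias_b) > delta:
                return False
    return True

# Rows differ only by a constant offset, so every D value is 0.
shifted = [[1, 4, 2], [3, 6, 4], [11, 14, 12]]
# The third attribute breaks the pattern: biases are -1, -1, -3.
noisy = [[1, 4, 2], [2, 5, 5]]

shifted_ok = is_delta_coherent(shifted, 0)
noisy_ok_at_1 = is_delta_coherent(noisy, 1)
noisy_ok_at_2 = is_delta_coherent(noisy, 2)
```

The check grows with the number of object pairs times the number of attribute pairs, which hints at why a direct search over all submatrices is impractical.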

Page 33: Mining Biological Data

Coherent Cluster Challenges:

Finding subspace clustering based on distance itself is already a difficult task due to the curse of dimensionality.

The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.

The actual values of the objects in a coherent cluster may be far apart from each other.

Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

Page 34: Mining Biological Data

Coherent Cluster

Compute the maximum coherent attribute sets for each pair of objects

Compute the maximum coherent object sets for each pair of attributes

Two way pruning

Construct the lexicographical tree

Post-order traverse the tree to find maximum coherent clusters

Page 35: Mining Biological Data

Coherent Cluster

Observation: Given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff the mutual biases $(d_{o_1 a_i} - d_{o_2 a_i})$ of the attributes $a_i$ do not differ from each other by more than δ.

[Figure: objects o1 and o2 plotted over attributes a1 to a5.]

Mutual bias of (o1, o2):

a1   a2   a3   a4   a5
3    2    3.5  2    2.5

The mutual biases lie in the range [2, 3.5]. If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).
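The observation reduces the 2×k test to a range check on the mutual biases. A minimal sketch using the bias values from this slide (the function name `is_coherent_attribute_set` is hypothetical):

```python
def is_coherent_attribute_set(biases, delta):
    """True if no two mutual biases differ by more than delta."""
    return max(biases) - min(biases) <= delta

# Mutual biases of (o1, o2) on a1..a5, as shown on the slide.
biases = {"a1": 3, "a2": 2, "a3": 3.5, "a4": 2, "a5": 2.5}

cas_at_1_5 = is_coherent_attribute_set(biases.values(), 1.5)
cas_at_1_0 = is_coherent_attribute_set(biases.values(), 1.0)
```

With δ = 1.5 the full attribute set qualifies because the biases span exactly 1.5; with δ = 1 it does not, so only subsets of the attributes can form a coherent set.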

Page 36: Mining Biological Data

Coherent Cluster

Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.

[Figure: objects r1 and r2 plotted over attributes a1 to a5.]

Mutual bias of (r1, r2):

a1   a2   a3   a4   a5
3    2    3.5  2    2.5

δ = 1

[Figure: the attributes placed on an axis by mutual bias: a2 (2), a4 (2), a5 (2.5), a1 (3), a3 (3.5).]

The maximum coherent attribute sets define the search space for maximum coherent clusters.

Page 37: Mining Biological Data

Two Way Pruning

      a0    a1    a2
o0    1     4     2
o1    2     5     5
o2    3     6     5
o3    4     200   7
o4    300   7     6

delta = 1, nc = 3, nr = 3

MCAS

(o0,o2) → (a0,a1,a2)

(o1,o2) → (a0,a1,a2)

MCOS

(a0,a1) → (o0,o1,o2)

(a0,a2) → (o1,o2,o3)

(a1,a2) → (o1,o2,o4)

(a1,a2) → (o0,o2,o4)
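The MCAS and MCOS lists on this slide can be reproduced with a sort followed by a sliding window over the mutual biases. This is a sketch under assumptions: the sliding-window procedure is one plausible way to enumerate the maximal sets and is not stated on the slides, nc and nr are read as minimum set sizes, the function names are hypothetical, and the pruning step itself is not implemented.

```python
from itertools import combinations

def maximal_windows(labelled_biases, delta, min_size):
    """Maximal groups of labels whose biases differ by at most delta."""
    ordered = sorted(labelled_biases)
    n = len(ordered)
    found = []
    start = 0
    for end in range(n):
        # Shrink from the left until the window fits within delta.
        while ordered[end][0] - ordered[start][0] > delta:
            start += 1
        extendable = (end + 1 < n
                      and ordered[end + 1][0] - ordered[start][0] <= delta)
        if not extendable and end - start + 1 >= min_size:
            labels = sorted(label for _, label in ordered[start:end + 1])
            found.append(tuple(labels))
    return found

def pairwise_maximal_sets(matrix, delta, min_size):
    """For each pair of rows, the maximal coherent column sets."""
    result = {}
    for x, y in combinations(range(len(matrix)), 2):
        biases = [(matrix[x][j] - matrix[y][j], j)
                  for j in range(len(matrix[x]))]
        sets = maximal_windows(biases, delta, min_size)
        if sets:
            result[(x, y)] = sets
    return result

# Data matrix from the slide: rows o0..o4, columns a0..a2.
data = [[1, 4, 2], [2, 5, 5], [3, 6, 5], [4, 200, 7], [300, 7, 6]]

# Attribute sets per object pair (at least nc = 3 attributes).
mcas = pairwise_maximal_sets(data, delta=1, min_size=3)
# Object sets per attribute pair (at least nr = 3 objects): same routine
# applied to the transposed matrix.
transposed = [list(column) for column in zip(*data)]
mcos = pairwise_maximal_sets(transposed, delta=1, min_size=3)
```

Indices stand for the object and attribute names, so `(0, 2)` in `mcas` is the pair (o0,o2) and `(1, 2)` in `mcos` is the pair (a1,a2).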

Page 38: Mining Biological Data

Coherent Cluster

High expressive power

The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.

Efficient and highly scalable

Wide applications

Gene expression analysis

Collaborative filtering

[Figure: average response time (sec), 0 to 12000, versus number of conditions (10, 20, 50, 100, 200, 500), for traditional clustering and coherent clustering.]

Page 39: Mining Biological Data

Coherent Cluster

References:

Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.

Clustering by pattern similarity in large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 394-405, 2002.

Enhanced biclustering on expression data, Proceedings of the IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2003.

Other Work

STING (VLDB1997)

STING+ (ICDE1999, TKDE 2000)

CLUSEQ (CSB2002, ICDE2003)

Cluster Streams (ICDE2003)

Page 40: Mining Biological Data

Remarks

Similarity measure

Powerful in capturing high order statistics and dependencies

Efficient in computation

Robust to noise

Clustering algorithm

High accuracy

High adaptability

High scalability

High reliability