
Biclustering algorithms ISA and SAMBA

Slides with Amos Tanay

1 ABDBM © Ron Shamir


Biclustering

[Figure: expression matrix with genes as rows and conditions as columns]

• Clusters: global; a partition of genes according to a common expression pattern across all conditions
• Genes have multiple functions
• Conditions may be diverse
• Bicluster: subsets of genes and conditions
• Finer, local analysis


In this lecture
• Two current biclustering methodologies
• Iterative Signature Algorithm (ISA)
– Simple
– Randomized
• SAMBA
– Combinatorial basis
– Fast
• And maybe a little more


What makes a biclustering algorithm?
• Define what a bicluster is; score
• Algorithm for finding one bicluster
• Algorithm for finding all (many) biclusters
• Important themes:
– Normalization
– Redundancies


The Iterative Signature Algorithm (ISA)
• Developed at Naama Barkai's lab at WIS (J. Ihmels, S. Bergmann)
• Motivation:
– A bicluster is a set of genes and conditions that mutually define each other
– It is possible to refine an approximate bicluster by "stabilizing" it


Normalization
• Can we normalize simultaneously for both gene-dependent and condition-dependent trends?
• In the ISA we are not trying to.
• Given a genes x conditions matrix E with condition set U and gene set V, define:
– E^C: normalize each condition to mean 0, std 1
– E^G: normalize each gene to mean 0, std 1
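The two normalizations can be sketched in a few lines. This is a minimal illustration, not the ISA authors' code: the function names are hypothetical, the population standard deviation is assumed, and constant rows or columns are mapped to zeros by choice.

```python
from statistics import mean, pstdev

def zscore(values):
    """Scale a list of numbers to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    # Assumption: a constant row/column (sd == 0) is mapped to all zeros.
    return [(x - mu) / sd if sd > 0 else 0.0 for x in values]

def normalize_genes(E):
    """E^G: z-score each row (gene) across its conditions."""
    return [zscore(row) for row in E]

def normalize_conditions(E):
    """E^C: z-score each column (condition) across its genes."""
    cols = [zscore(list(col)) for col in zip(*E)]
    return [list(row) for row in zip(*cols)]

# Toy genes x conditions matrix (3 genes, 3 conditions).
E = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 9.0],
     [0.0, 1.0, 0.0]]
EG = normalize_genes(E)
EC = normalize_conditions(E)
```

Keeping two separately normalized copies of the matrix is what lets the ISA sidestep simultaneous normalization: gene scores are read from E^G and condition scores from E^C.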


What is a bicluster
• Assume all columns are independent; what is the distribution of
Σ_{j ∈ U'} e^G_{ij}
for a random condition set U' and gene i?
• Mean = 0, Std = sqrt(|U'|)
• Same for Σ_{i ∈ V'} e^C_{ij} and a gene set V'.
• In a bicluster, we expect independence not to hold.
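The mean and standard deviation quoted on this slide follow in one line, assuming the entries of row i of E^G are independent with mean 0 and variance 1 (the null assumption stated on the slide):

```latex
\mathbb{E}\Big[\sum_{j\in U'} e^{G}_{ij}\Big] = \sum_{j\in U'} \mathbb{E}\big[e^{G}_{ij}\big] = 0,
\qquad
\operatorname{Var}\Big[\sum_{j\in U'} e^{G}_{ij}\Big] = \sum_{j\in U'} \operatorname{Var}\big[e^{G}_{ij}\big] = |U'|
\;\Longrightarrow\;
\mathrm{Std} = \sqrt{|U'|}.
```

The same argument with the roles of genes and conditions exchanged gives standard deviation sqrt(|V'|) for the column sums over a gene set V'.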


What is a bicluster (2)
• Given a set of conditions U', define:
– ISA(U') = {v ∈ V s.t. Σ_{j ∈ U'} e^G_{vj} > T_G σ_{U'}}
• Given a set of genes V', define:
– ISA(V') = {u ∈ U s.t. Σ_{i ∈ V'} e^C_{iu} > T_C σ_{V'}}
• T_G, T_C: threshold parameters
• σ_{U'}, σ_{V'}: standard deviations
– Estimated from the data
• A (perfect) bicluster is a pair (U', V') s.t. ISA(V') = U' and ISA(U') = V'
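A minimal sketch of the two set operators follows. The function names are hypothetical, and one simplification is assumed: the standard deviations are set to their null values sqrt(|U'|) and sqrt(|V'|) rather than estimated from the data as the slide describes. The toy matrix stands in for both E^G and E^C.

```python
from math import sqrt

def isa_genes(EG, conds, t_g):
    """ISA(U'): genes whose summed E^G score over conds exceeds t_g * sigma."""
    sigma = sqrt(len(conds))  # assumption: null std of a sum of |U'| unit-variance terms
    return {i for i, row in enumerate(EG)
            if sum(row[j] for j in conds) > t_g * sigma}

def isa_conds(EC, genes, t_c):
    """ISA(V'): conditions whose summed E^C score over genes exceeds t_c * sigma."""
    sigma = sqrt(len(genes))  # assumption: null std, as above
    return {j for j in range(len(EC[0]))
            if sum(EC[i][j] for i in genes) > t_c * sigma}

# Toy scores with a planted bicluster on genes {0, 1} and conditions {0, 1}.
M = [[ 1.5,  1.4, -0.2, -0.1],
     [ 1.3,  1.6,  0.1, -0.3],
     [-0.2,  0.1,  0.3, -0.1],
     [ 0.0, -0.3, -0.2,  0.2]]

genes = isa_genes(M, {0, 1}, t_g=1.0)
conds = isa_conds(M, {0, 1}, t_c=1.0)
```

Here each operator maps one half of the planted pair to the other, so ({0, 1}, {0, 1}) satisfies the perfect-bicluster condition on this toy input.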


Searching for biclusters
• Define a directed graph: nodes = condition and gene subsets; arc X' → Y' iff ISA(X') = Y'
• A bicluster is a cycle of two nodes U' ⇄ V'
• An approximate bicluster is a larger cycle (but not too large).
• Alg: start from a random or known gene set, compute ISA until converging to an approximate bicluster:
– V_i = ISA(U_{i-1}), U_i = ISA(V_i)
– Converge at i when for all j > i-m, |U_i \ U_j| / |U_i ∪ U_j| < ε
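The iteration and its stopping rule can be sketched as below. This is an illustration under stated assumptions, not the published implementation: the helper names, the default parameters, and the use of the null standard deviation sqrt(set size) in the threshold are all choices made for the example.

```python
from math import sqrt

def rel_diff(a, b):
    """Size of (a minus b) over size of (a union b); 0 when both are empty."""
    union = a | b
    return len(a - b) / len(union) if union else 0.0

def threshold_set(scores, t, n_terms):
    """Indices whose score exceeds t * sqrt(n_terms) (null std, an assumption)."""
    return {idx for idx, s in enumerate(scores) if s > t * sqrt(n_terms)}

def isa_iterate(EG, EC, seed_genes, t_g=1.0, t_c=1.0, m=3, eps=0.05, max_iter=100):
    genes, conds = set(seed_genes), set()
    history = []  # condition sets from successive iterations
    for _ in range(max_iter):
        # Conditions from genes: column sums of E^C over the current gene set.
        col_scores = [sum(EC[i][j] for i in genes) for j in range(len(EC[0]))]
        conds = threshold_set(col_scores, t_c, len(genes))
        # Genes from conditions: row sums of E^G over the new condition set.
        row_scores = [sum(row[j] for j in conds) for row in EG]
        genes = threshold_set(row_scores, t_g, len(conds))
        history.append(conds)
        # Stop when the last m condition sets are all within eps of the current one.
        if len(history) >= m and all(rel_diff(conds, u) < eps for u in history[-m:]):
            break
    return genes, conds

M = [[ 1.5,  1.4, -0.2, -0.1],
     [ 1.3,  1.6,  0.1, -0.3],
     [-0.2,  0.1,  0.3, -0.1],
     [ 0.0, -0.3, -0.2,  0.2]]
result = isa_iterate(M, M, seed_genes={0})
```

Starting from the single seed gene 0, the toy run grows to the planted pair after one round and then stays there, which is what triggers the stopping rule.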


Adding weights
• Instead of sets, use vectors of gene and condition weights
• The ISA operator is generalized to a matrix multiplication followed by a threshold function

[Figure: two parallel flows]
• With sets: gene set → compute averages on conditions → compute Z-scores of conditions → keep conditions that survived the threshold
• With weights: gene weights → multiply by gene expression matrix → compute Z-scores of conditions → nullify weights below the threshold
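One weighted step can be sketched as a matrix-vector product followed by a threshold. The names are hypothetical, and two details are assumptions made for the example: z-scores are computed across the score vector itself, and surviving entries keep their z-score as the new weight.

```python
from statistics import mean, pstdev

def threshold_z(scores, t):
    """Z-score the scores, then nullify entries whose z-score is not above t."""
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return [0.0] * len(scores)
    z = [(s - mu) / sd for s in scores]
    return [x if x > t else 0.0 for x in z]

def weighted_step(EG, EC, gene_w, t_g, t_c):
    """Gene weights -> condition weights -> updated gene weights."""
    n_genes, n_conds = len(EC), len(EC[0])
    # Condition scores: transpose(E^C) times the gene weight vector.
    cond_scores = [sum(EC[i][j] * gene_w[i] for i in range(n_genes))
                   for j in range(n_conds)]
    cond_w = threshold_z(cond_scores, t_c)
    # Gene scores: E^G times the condition weight vector.
    gene_scores = [sum(EG[i][j] * cond_w[j] for j in range(n_conds))
                   for i in range(n_genes)]
    return threshold_z(gene_scores, t_g), cond_w

M = [[ 1.5,  1.4, -0.2, -0.1],
     [ 1.3,  1.6,  0.1, -0.3],
     [-0.2,  0.1,  0.3, -0.1],
     [ 0.0, -0.3, -0.2,  0.2]]
new_gene_w, cond_w = weighted_step(M, M, [1.0, 1.0, 0.0, 0.0], t_g=0.5, t_c=0.5)
```

With 0/1 weights and a hard keep/drop rule this reduces to the set version; graded weights let strongly co-varying genes and conditions count for more.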


Handling Redundancy
• Starting from different seeds yields different fixed points (bics)
• Using different thresholds changes the graph structure and gives more bics
• Need to filter similar solutions and report a short, non-redundant list of significant bics


ISA - applications
• Start from sets of genes with a known functional annotation
• Start from genes with binding sites of a transcription factor
• Start from a set of sequence orthologs
• See: Ihmels et al. Nat Gen 2002, Bergmann et al. Phys Rev E 2003, Bergmann et al. PLoS Biol 2004.


The basic signature algorithm (Nat. Genetics 02)


Using recurrence to evaluate solutions

• A bad initial gene set will also lead to some module
• How can we identify the good modules?
• Idea: for an input gene set A and a random set of other genes R, apply ISA(A) and ISA(R∪A) and compare them.
• If A represents part of a real transcription module, expect a large overlap in the resulting solutions.


Using recurrence (2)


a, A reference set of Ncore co-regulated genes was composed of genes encoding either ribosomal proteins (dashed lines) or proteins involved in amino acid biosynthesis (dashed/dotted line). The recurrent signature method was applied to this set as follows. First, a collection of input sets was derived by randomly adding genes to the reference set. Second, the signature algorithm was applied to the reference set and to the derived sets; this generates a reference signature and a collection of perturbed signatures, respectively. Last, the overlaps between the reference signature and the perturbed signatures were calculated. Shown is the average overlap as a function of the number of genes added to the reference set. The different lines correspond to different choices of Ncore, shown in parentheses.

b, The recurrent signature method was applied to three sequence-related reference sets. These sets include all of the genes that contain the binding sequences CGGN11CCG (for Gal4), TGACTC (for Gcn4) or TTN9GGAAA (for Mcm1) in a region of 600 bp upstream. Shown is the fraction of perturbed signatures whose overlap with the reference signature is greater than some threshold, as a function of this threshold. Note the large number of highly overlapping outputs for all three reference sets. By contrast, the profile corresponding to a random sequence is distinctly different, with no large overlaps. Thus, the ‘recurrence profile’ gives a clear indication of whether a given sequence functions as a regulatory control element.


A global analysis in yeast
• 1000 expression profiles
• Applied the signature algorithm (SA) with input gene sets:
– All target sets of 6-mers, 7-mers, 8-mers (~86K sets)
– All functional groups in MIPS
– All clusters in a hierarchical clustering of all genes
• Accepted only recurring modules.
• Results: 86 modules covering 2241 genes.


Genes in most modules participate in a module-specific cellular process



ISA – Pros/Cons
• Pros
– Simple, quite fast
– Elegant solution to the normalization problem
– Good empirical results in several cases
• Cons
– Threshold setting
– Finding good seeds
– Redundancies
– Non-normal behaviors


SAMBA: Statistical and Algorithmic Method for Bicluster Analysis

• Developed here (Tanay, Sharan, Shamir Bioinformatics 02)

• Outline:
– Develop efficient combinatorial techniques for biclustering large datasets
– Employ a statistical model for biclusters
– Allow integration of heterogeneous data


The SAMBA model

[Figure: the expression matrix viewed as a bipartite graph G=(U,V,E); figure labels: conditions, edge, no edge]
• Goal in the matrix: find high-similarity submatrices
• Goal in the graph: find dense subgraphs


The SAMBA approach
• Normalization: translate the GE matrix to a weighted bipartite graph using a statistical model for the data
• Bicluster model: heavy subgraphs
• How to find biclusters: combined hashing and local optimization
• Redundancies: find many biclusters at once, filter them in post-processing


From a statistical model to edge weights – a simple example

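• Background model: independent edges, each present with probability p < ½.
• H: a subgraph with n genes, m conditions, and k edges
• P-value = tail of the binomial distribution:

$$p(H) = \sum_{k'=k}^{nm} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^{k} (1-p)^{nm-k}$$

(The inequality uses p < ½: every term with k' ≥ k has p^{k'} (1-p)^{nm-k'} ≤ p^{k} (1-p)^{nm-k}, and the binomial coefficients sum to at most 2^{nm}.)

• Taking log base 2 of the bound:

$$\log_2 p(H) \le nm + k \log_2 p + (nm-k) \log_2 (1-p)$$

• Weight the graph: edges (1 + log p), non-edges (1 + log(1-p)); then subgraph weight ≥ log p-value.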
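The bound can be checked numerically on a small case. This is a sketch with hypothetical function names and made-up numbers; logs are taken base 2 so that the factor 2^{nm} contributes exactly nm.

```python
from math import comb, log2

def binomial_tail(n, m, k, p):
    """P-value: probability of at least k edges among n*m independent pairs."""
    total = n * m
    return sum(comb(total, kk) * p**kk * (1 - p)**(total - kk)
               for kk in range(k, total + 1))

def subgraph_weight(n, m, k, p):
    """Sum of edge weights (1 + log2 p) and non-edge weights (1 + log2(1 - p))."""
    total = n * m
    return k * (1 + log2(p)) + (total - k) * (1 + log2(1 - p))

# Toy case: 4 genes x 5 conditions, 15 of the 20 possible edges present, p = 0.2.
n, m, k, p = 4, 5, 15, 0.2
pval = binomial_tail(n, m, k, p)
weight = subgraph_weight(n, m, k, p)
```

With these sign conventions a more significant subgraph has a more negative weight; negating both weights turns the search for significant subgraphs into a search for heavy ones.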


Limitations of the uniform probability model

• Not all dense subgraphs are statistically significant.

• Different genes/conds have dissimilar noise characteristics.

• Noisy genes/conds have high probability of forming dense subgraphs.

• An extended likelihood ratio model: a background random graph model versus a bicluster random subgraph model.
• Will show: the likelihood model translates to a sum of weights over edges and non-edges.

A Degree Based Random Graph Model

• Each edge (u,v) occurs independently with probability p(u,v).
• p(u,v) depends on the degrees of both u and v
• Γ = { G'=(U,V,E') | deg(w, E') = deg(w, E) for all w ∈ U ∪ V }: the set of degree-preserving graphs on the same node sets.
• p(u,v) = Pr((u,v) ∈ E' | G' ∈ Γ)
• Approximated using a Monte Carlo process

[Figure legend: low-prob edges, medium-prob edges, high-prob edges]
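The slide does not say which Monte Carlo process is used. One common choice, assumed here purely for illustration, is to sample degree-preserving graphs by repeated edge swaps and record how often each pair (u,v) appears as an edge. The function name and all parameters are hypothetical.

```python
import random

def estimate_edge_probs(edges, n_samples=200, swaps_per_sample=50, seed=0):
    """Estimate p(u,v) as the frequency of (u,v) over sampled degree-preserving graphs."""
    rng = random.Random(seed)
    current = list(edges)      # edges as (condition, gene) pairs
    present = set(current)
    counts = {}
    for _ in range(n_samples):
        for _ in range(swaps_per_sample):
            a, b = rng.sample(range(len(current)), 2)
            (u1, v1), (u2, v2) = current[a], current[b]
            # Swap partners: (u1,v1),(u2,v2) -> (u1,v2),(u2,v1); all degrees are kept.
            if (u1, v2) in present or (u2, v1) in present:
                continue  # the swap would duplicate an existing edge
            present -= {(u1, v1), (u2, v2)}
            present |= {(u1, v2), (u2, v1)}
            current[a], current[b] = (u1, v2), (u2, v1)
        for e in present:
            counts[e] = counts.get(e, 0) + 1
    return {e: c / n_samples for e, c in counts.items()}

# Toy graph: condition c0 has degree 3, so it is adjacent to all three genes.
edges = [("c0", "g0"), ("c0", "g1"), ("c0", "g2"), ("c1", "g0"), ("c2", "g1")]
probs = estimate_edge_probs(edges)
```

In the toy graph the high-degree condition c0 keeps every one of its edges in all samples, while the edges of the degree-1 conditions are shared between two alternatives; this is the degree dependence the model is meant to capture.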


Likelihood Ratio Model

• Bicluster model assumption: edges occur independently with probability p_c
• Likelihood ratio score for a subgraph B with edge set E' (terms marked ∉ E' run over the non-edges of B):

$$L(B) = \prod_{(u,v)\in E'} \frac{p_c}{p(u,v)} \prod_{(u,v)\notin E'} \frac{1-p_c}{1-p(u,v)}$$

$$\log L(B) = \sum_{(u,v)\in E'} \log\frac{p_c}{p(u,v)} + \sum_{(u,v)\notin E'} \log\frac{1-p_c}{1-p(u,v)}$$

• Subgraph weight = log likelihood ratio
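The score can be computed as a plain sum of per-pair weights. A minimal sketch, with a hypothetical function name, made-up background probabilities, and an arbitrary choice of p_c:

```python
from math import log

def log_likelihood_ratio(conds, genes, edges, p, p_c):
    """Sum log(p_c / p(u,v)) over edges and log((1 - p_c) / (1 - p(u,v))) over non-edges."""
    total = 0.0
    for u in conds:
        for v in genes:
            if (u, v) in edges:
                total += log(p_c / p[(u, v)])
            else:
                total += log((1 - p_c) / (1 - p[(u, v)]))
    return total

conds, genes = ["c0", "c1"], ["g0", "g1"]
edges = {("c0", "g0"), ("c0", "g1"), ("c1", "g0")}   # 3 of the 4 possible pairs
p = {(u, v): 0.2 for u in conds for v in genes}     # hypothetical background model
score = log_likelihood_ratio(conds, genes, edges, p, p_c=0.9)
```

Because the score is additive over pairs, each pair's term can be precomputed once as an edge or non-edge weight, which is what turns bicluster search into a heavy-subgraph problem.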


Heaviest bipartite subgraph
• NP-complete (Dawande et al. 97, Hochbaum 98)
• (But: node biclique is polynomial!)
• Assumption: degree on the V side is bounded by d
• Start by finding heavy bicliques.
• Alg: use hashing to discover heavy subsets of conditions.


Finding Heaviest Biclique
[Figure: worked hashing example]
• Takes O(n·2^d) time and space.
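The hashing idea can be sketched as follows, assuming unit edge weights so that a biclique's weight is simply its number of edges. Each gene contributes every non-empty subset of its neighborhood to a hash table, which is where the n·2^d cost comes from. The function name and the toy data are hypothetical.

```python
from itertools import combinations

def heaviest_biclique(neighbors):
    """neighbors: dict mapping each gene to its set of conditions (size at most d)."""
    support = {}  # frozenset of conditions -> genes whose neighborhood contains it
    for gene, conds in neighbors.items():
        ordered = sorted(conds)
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                support.setdefault(frozenset(subset), []).append(gene)
    # Unit edge weights (an assumption): weight = |conditions| * |genes|.
    best = max(support, key=lambda s: len(s) * len(support[s]))
    return set(best), set(support[best]), len(best) * len(support[best])

neighbors = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "b"},
    "g4": {"c", "d"},
    "g5": {"a", "b", "d"},
}
biclique = heaviest_biclique(neighbors)
```

On the toy input the condition pair {a, b} is shared by four genes (8 edges), beating the triple {a, b, c} shared by two genes (6 edges).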


Using bicliques to find the heaviest biclusters

Assume: edge weight = 1, non-edge weight = -1

Notation:

$$w((U',V')) = \sum_{u \in U'} w((u,V'))$$

Lemma: If B=(U',V') is a maximum weight subgraph and X⊆U' then ∃v ∈ V' s.t. |N(v)∩X| ≥ |X|/2.

Pf: Note:

$$0 < w((X,V')) = \sum_{v \in V'} \big( |N(v) \cap X| - |X \setminus N(v)| \big) = \sum_{v \in V'} \big( 2\,|N(v) \cap X| - |X| \big)$$

so at least one term of the last sum is positive.

Corollary: If B=(U',V') is a maximum weight subgraph then |U'| ≤ 2d (apply the lemma with X = U' and use |N(v)| ≤ d).



• Lemma: in a max wt subgraph (U*,V*), ∀X⊆U* ∃Y⊆X, |Y|≥|X|/2 s.t. Y⊆N(v) for some v∈V*.

• Corollary: in a max wt subgraph (U*,V*), U* can be covered by at most log (2d) sets, each containing the neighborhood of some vertex in V*

Assume: edge weight = 1, non-edge weight = -1



A set of conditions in a maximal bicluster is the union of up to log(2d) subsets of gene neighborhoods.

• Exhaustive O((n·2^d)^{log(2d)}) time alg:
– Hash bicliques
– Enumerate all combinations of log(2d) neighborhoods N(v)
• Can be generalized to arbitrary edge/non-edge weights.


SAMBA’s implementation

• Phase I: find heavy bicliques - hash for each gene of deg<d all subsets of neighbors of size 4-6.

• Phase II: greedy expansion of heaviest bicliques containing each gene/cond

• Phase III: filter overlapping biclusters.


Evaluating Specificity
• Suppose the conditions partition into k classes of sizes c_1,…,c_k; Σ c_i = m
• A bic has b conditions, b_i from class i
• If b_t = max b_i, assign the bic to class t
• How good is the match of the bic to the classification? Hypergeometric score:

$$\Pr(B) = \sum_{j=b_t}^{b} \frac{\binom{c_t}{j}\binom{m-c_t}{b-j}}{\binom{m}{b}}$$
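The hypergeometric score is a short sum of binomial coefficients. A minimal sketch with hypothetical function names and made-up class sizes:

```python
from math import comb

def hypergeom_tail(m, c_t, b, b_t):
    """Probability that at least b_t of b conditions drawn from m fall in a class of size c_t."""
    upper = min(b, c_t)  # terms beyond this are zero
    return sum(comb(c_t, j) * comb(m - c_t, b - j)
               for j in range(b_t, upper + 1)) / comb(m, b)

def specificity_score(class_sizes, bic_counts):
    """Assign the bicluster to its majority class t and score the match."""
    m, b = sum(class_sizes), sum(bic_counts)
    t = max(range(len(bic_counts)), key=lambda i: bic_counts[i])
    return t, hypergeom_tail(m, class_sizes[t], b, bic_counts[t])

# Toy case: 20 conditions in classes of sizes 10, 6, 4;
# a bicluster with 6 conditions, 5 of them from the second class.
t, p_value = specificity_score([10, 6, 4], [1, 5, 0])
```

For the toy numbers the sum has two terms, (C(6,5)·C(14,1) + C(6,6)·C(14,0)) / C(20,6) = 85/38760, so a bicluster this concentrated in one class would rarely arise by chance.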

Specificity Test
GE data: Alizadeh et al. (00); 4026 genes, 96 human tissues; 9 classes of lymphoma, normal

[Figure: fraction of biclusters (y-axis) versus log(p-value) (x-axis) for SAMBA, Cheng-Church '00, and Random; an arrow marks the direction of better fit to the true classification]

Specificity (2)

[Figure: log likelihood (y-axis) versus log p-value (x-axis); + Lymphoma data (Alizadeh et al.), x Shuffled data]

Generate a random bipartite graph with the same degree sequence as the Alizadeh data; compute biclusters; plot p-value and likelihood (weight)

Heterogeneous data

[Figure: data sources at three levels; an additional figure label reads "1 + 1 = 0"]
• Transcription level: ChIP-chip, mRNA profiling
• Protein level: 2-hybrid, protein complex identification using mass spec
• Phenotype level: synthetic lethality, barcoded deletion libraries
• And so many more…

Tanay, Sharan, Kupiec, Shamir PNAS 04

A Heterogeneous Collection of Yeast Genomic Information
• Gene expression: ~1000 conditions, 27 publications
• TF binding profiles: 110 profiles from growth on YPD (Lee et al.)
• Phenotype profiles: 6 (30) profiles (Giaever et al.)
• Two-hybrid interactions: ~1000 (Uetz et al.)
• Protein complex interactions: ~4000 (Ho et al.)
• MIPS interactions: ~1000


From experiments to properties

[Figure: each experiment type is translated into discrete properties (p1, p2, …) of a gene g]
• p1 p2 p3 p4: strong induction, medium induction, medium repression, strong repression
• p1 p2: strong binding to TF T, medium binding to TF T
• p1 p2: high sensitivity, medium sensitivity
• p1 p2: high confidence interaction, medium confidence interaction
• p1 p2: strong complex binding to protein P, medium complex binding to protein P

Modular organization in yeast
[Figure: map generated automatically by SAMBA; ovals = modules, edges = module overlaps]
• Clustered organization
• Cluster = process
• Hierarchical
• "Bridges"

SAMBA – Pros/Cons
• Pros
– Fast
– Allows simultaneous normalization of genes and conditions
– Allows integration of heterogeneous data
– Well suited for query-based usage
• Cons
– Discretization
– Redundancies


Biclustering – interim summary
• A general data mining problem
• The key point: defining what a bicluster is
• Algorithms vary, depending on the nature of the bicluster model
• Open issues:
– What is the best objective / bic criterion?
– Search for bics in really huge matrices
– Handling redundancies
