Biclustering algorithms ISA and SAMBA
Slides with Amos Tanay
1 ABDBM © Ron Shamir
Biclustering
[Figure: gene expression matrix; axes: genes, conditions]
• Clusters: global; a partition of the genes according to a common expression pattern across all conditions
• Genes have multiple functions
• Conditions may be diverse
• Bicluster: subsets of genes and conditions
• Finer, local analysis
In this lecture
• Two current biclustering methodologies
• Iterative Signature Algorithm (ISA)
  – Simple
  – Randomized
• SAMBA
  – Combinatorial basis
  – Fast
• And maybe a little more
What makes a biclustering algorithm?
• Define what a bicluster is; score
• Alg for finding one bicluster
• Alg for finding all (many) biclusters
• Important themes:
  – Normalization
  – Redundancies
The Iterative Signature Algorithm (ISA)
• Developed at Naama Barkai’s lab at WIS (J. Ihmels, S. Bergmann)
• Motivation:
  – A bicluster is a set of genes and conditions that mutually define each other
  – It is possible to refine an approximate bicluster by “stabilizing” it
Normalization
• Can we normalize simultaneously for both gene-dependent and condition-dependent trends?
• In the ISA we do not try to.
• Given a genes × conditions matrix E with condition set U and gene set V, define:
  – E^C: normalize each cond to mean 0, std 1
  – E^G: normalize each gene to mean 0, std 1
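The two normalizations can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function name `normalize_both` and the random test matrix are assumptions, and rows or columns with zero variance (which would cause a division by zero) are not handled.

```python
import numpy as np

def normalize_both(E):
    """Return (E_C, E_G) for a genes x conditions matrix E."""
    E = np.asarray(E, dtype=float)
    # E_C: each condition (column) is scaled to mean 0, std 1 across genes.
    E_C = (E - E.mean(axis=0, keepdims=True)) / E.std(axis=0, keepdims=True)
    # E_G: each gene (row) is scaled to mean 0, std 1 across conditions.
    E_G = (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)
    return E_C, E_G

# Hypothetical data: 50 genes x 20 conditions of random values.
rng = np.random.default_rng(0)
E = rng.normal(loc=2.0, scale=3.0, size=(50, 20))
E_C, E_G = normalize_both(E)
print(E_C.mean(axis=0).round(6)[:3], E_G.std(axis=1).round(6)[:3])
```

Keeping two separately normalized copies, instead of one matrix normalized both ways, matches the slide's remark that the ISA does not attempt a simultaneous normalization.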
What is a bicluster
• Assume all columns are independent. What is the distribution of
  $\sum_{j \in U'} e^G_{ij}$
  for a random cond set U' and a gene i?
• Mean = 0, Std = $\sqrt{|U'|}$
• The same holds for $\sum_{i \in V'} e^C_{ij}$ for a cond j and a gene set V' (with Std = $\sqrt{|V'|}$).
• In a bicluster, we expect independence not to hold.
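A quick simulation can illustrate the null distribution claimed above. It is only a sketch under the slide's independence assumption: the entries are drawn as independent standard normals (an assumption standing in for normalized expression values), and the set size 16 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_conds = 16                      # |U'|, chosen arbitrarily for the illustration
# Each row plays the role of one gene's normalized values over the conditions in U'.
z = rng.standard_normal(size=(100_000, n_conds))
sums = z.sum(axis=1)

# Under independence the sums should have mean near 0 and std near sqrt(|U'|).
print("empirical mean:", sums.mean())
print("empirical std: ", sums.std())
print("sqrt(|U'|):    ", np.sqrt(n_conds))
```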
What is a bicluster (2)
• Given a set of conds U', define:
  – ISA(U') = {v ∈ V s.t. $\sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'}$}
• Given a set of genes V', define:
  – ISA(V') = {u ∈ U s.t. $\sum_{i \in V'} e^C_{iu} > T_C \sigma_{V'}$}
• T_G, T_C: threshold parameters
• σ_{U'}, σ_{V'}: standard deviations
  – Estimated from the data
• A (perfect) bicluster is a pair (U', V') s.t. ISA(V') = U' and ISA(U') = V'
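The two set operators can be sketched as below. This is an illustrative reading of the definitions, not the published implementation: the function names are hypothetical, and estimating σ as the standard deviation of the summed scores over all genes (or all conditions) is an assumption, since the slide only says the deviations are estimated from the data.

```python
import numpy as np

def isa_genes(E_G, conds, t_g):
    """ISA(U'): genes whose summed E^G score over `conds` exceeds t_g * sigma."""
    scores = E_G[:, conds].sum(axis=1)
    sigma = scores.std()  # assumption: sigma_U' estimated from the scores of all genes
    return np.flatnonzero(scores > t_g * sigma)

def isa_conds(E_C, genes, t_c):
    """ISA(V'): conditions whose summed E^C score over `genes` exceeds t_c * sigma."""
    scores = E_C[genes, :].sum(axis=0)
    sigma = scores.std()  # assumption: sigma_V' estimated from the scores of all conditions
    return np.flatnonzero(scores > t_c * sigma)

# Toy data (hypothetical): noise with a planted bicluster on genes 0..19, conds 0..7.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 40))
E[:20, :8] += 6.0
E_C = (E - E.mean(axis=0)) / E.std(axis=0)
E_G = (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)

genes = isa_genes(E_G, np.arange(8), t_g=2.0)
conds = isa_conds(E_C, genes, t_c=2.0)
print(len(genes), "genes;", len(conds), "conditions")
```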
Searching for biclusters
• Define a directed graph: nodes = condition and gene subsets; arc X' → Y' iff ISA(X') = Y'
• A bicluster is a cycle of two nodes U' ⇄ V'
• An approximate bicluster is a larger cycle (but not too large).
• Alg: start from a random or known gene set and apply ISA repeatedly until converging to an approximate bicluster:
  – V_i = ISA(U_{i-1}), U_i = ISA(V_i)
  – Converge at i when for all j > i − m, |U_i \ U_j| / |U_i ∪ U_j| < ε
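The iteration and its stopping rule can be sketched independently of how the ISA operator is computed. In this sketch the two operators are passed in as functions; `iterate_isa`, `drift`, and the toy majority-vote operators on a small binary matrix are all hypothetical stand-ins chosen to keep the example deterministic.

```python
def drift(a, b):
    """Size of (a minus b) over size of (a union b); 0 when both sets are empty."""
    union = a | b
    return len(a - b) / len(union) if union else 0.0

def iterate_isa(to_conds, to_genes, seed_genes, m=3, eps=0.05, max_iter=100):
    """Alternate the two operators until the last m condition sets agree (m >= 2)."""
    genes = frozenset(seed_genes)
    history = [frozenset(to_conds(genes))]          # U_0 from the seed gene set
    for _ in range(max_iter):
        genes = frozenset(to_genes(history[-1]))    # V_i = ISA(U_{i-1})
        conds = frozenset(to_conds(genes))          # U_i = ISA(V_i)
        history.append(conds)
        recent = history[-m:-1]                     # U_j for i-m < j < i
        if len(history) >= m and all(drift(conds, u) < eps for u in recent):
            break
    return genes, history[-1]

# Toy stand-in operators: keep a row/column when it is 1 on a majority of the input set.
M = [[1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1]]

def toy_to_conds(genes):
    return {c for c in range(5) if sum(M[g][c] for g in genes) > 0.5 * len(genes)}

def toy_to_genes(conds):
    return {g for g in range(5) if sum(M[g][c] for c in conds) > 0.5 * len(conds)}

# The seed contains one "wrong" gene (3); the iteration refines it away.
genes, conds = iterate_isa(toy_to_conds, toy_to_genes, seed_genes={0, 1, 3})
print(sorted(genes), sorted(conds))
```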
Adding weights
• Instead of sets, use vectors of gene and condition weights
• The operator ISA is generalized to become a matrix multiplication + threshold function
• Set version: gene set → compute averages on conds → compute Z-scores of conditions → keep conds that survived the threshold
• Weighted version: gene weights → multiply by gene expression matrix → compute Z-scores of conditions → nullify weights below the threshold
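The weighted half-step (weights, matrix multiplication, Z-scores, threshold) can be sketched as follows. The function name `weighted_step` and the toy data are assumptions, and the Z-score is taken across the projected condition scores, which is one plausible reading of the flow above.

```python
import numpy as np

def weighted_step(E, gene_weights, threshold):
    """Gene weights -> condition weights: multiply, z-score, nullify below threshold."""
    proj = gene_weights @ E                    # shape (n_conds,)
    z = (proj - proj.mean()) / proj.std()      # z-scores of the conditions
    return np.where(z > threshold, z, 0.0)     # weights below the threshold become 0

# Toy data (hypothetical): noise with a planted bicluster on genes 0..19, conds 0..7.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 40))
E[:20, :8] += 3.0

gene_weights = np.zeros(200)
gene_weights[:20] = 1.0                        # start from the planted genes
cond_weights = weighted_step(E, gene_weights, threshold=1.0)
print(np.flatnonzero(cond_weights))
```

The reverse half-step (condition weights to gene weights) would use the transposed matrix in the same way.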
Handling Redundancy
• Starting from different seeds yields different fixed points (bics)
• Using different thresholds changes the graph structure and gives more bics
• Need to filter similar solutions and report a short, non-redundant list of significant bics
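One simple way to filter similar solutions is a greedy pass over the biclusters in decreasing score order, dropping any bicluster whose cells overlap too much with one already kept. The slides do not specify the filter, so the Jaccard criterion on matrix cells, the 0.5 cutoff, and the scores used here are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_redundant(biclusters, scores, max_overlap=0.5):
    """Keep biclusters greedily by score, skipping those too similar to a kept one."""
    order = sorted(range(len(biclusters)), key=lambda i: scores[i], reverse=True)
    kept = []  # list of (index, set of (gene, cond) cells)
    for i in order:
        genes, conds = biclusters[i]
        cells = {(g, c) for g in genes for c in conds}
        if all(jaccard(cells, other) <= max_overlap for _, other in kept):
            kept.append((i, cells))
    return [biclusters[i] for i, _ in kept]

# Hypothetical biclusters as (gene set, condition set) pairs with made-up scores.
bics = [({1, 2, 3}, {0, 1}), ({1, 2, 3, 4}, {0, 1}), ({7, 8}, {5})]
scores = [10.0, 8.0, 5.0]
print(filter_redundant(bics, scores))
```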
ISA - applications
• Start from sets of genes with a known functional annotation
• Start from genes with binding sites of a transcription factor
• Start from a set of sequence orthologs
• See: Ihmels et al., Nat Genet 2002; Bergmann et al., Phys Rev E 2003; Bergmann et al., PLoS Biol 2004.
The basic signature algorithm (Nat. Genetics 02)
[Figure omitted]
Using recurrence to evaluate solutions
• A bad initial gene set will also lead to some module
• How can we identify the good modules?
• Idea: for an input gene set A and a random set R of other genes, apply ISA(A) and ISA(R ∪ A) and compare the results.
• If A represents part of a real transcription module, expect a large overlap between the resulting solutions.
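The recurrence test can be sketched as a loop over randomly perturbed input sets. Everything named here is hypothetical: `recurrence_overlaps` takes any signature function as an argument, overlap is measured as the fraction of the reference signature that is recovered (an assumption), and `toy_signature` is an artificial stand-in that returns a planted module whenever at least half of its input lies in that module.

```python
import random

def recurrence_overlaps(signature, ref_set, all_genes, n_added, n_trials=20, seed=0):
    """Overlap between signature(ref_set) and signature(ref_set plus random genes)."""
    rng = random.Random(seed)
    ref_set = set(ref_set)
    ref_sig = set(signature(ref_set))
    others = sorted(set(all_genes) - ref_set)
    overlaps = []
    for _ in range(n_trials):
        perturbed = ref_set | set(rng.sample(others, n_added))
        sig = set(signature(perturbed))
        # Assumption: overlap = fraction of the reference signature that recurs.
        overlaps.append(len(sig & ref_sig) / max(len(ref_sig), 1))
    return overlaps

# Artificial stand-in for the signature algorithm, for illustration only.
MODULE = set(range(10))

def toy_signature(genes):
    return set(MODULE) if 2 * len(genes & MODULE) >= len(genes) else set()

good = recurrence_overlaps(toy_signature, {0, 1, 2, 3, 4}, range(50), n_added=3)
bad = recurrence_overlaps(toy_signature, {20, 21, 22}, range(50), n_added=3)
print(sum(good) / len(good), sum(bad) / len(bad))
```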
Using recurrence (2)
Figure caption: a, A reference set of Ncore co-regulated genes was composed of genes encoding either ribosomal proteins (dashed lines) or proteins involved in amino acid biosynthesis (dashed/dotted line). The recurrent signature method was applied to this set as follows. First, a collection of input sets was derived by randomly adding genes to the reference set. Second, the signature algorithm was applied to the reference set and to the derived sets; this generates a reference signature and a collection of perturbed signatures, respectively. Last, the overlaps between the reference signature and the perturbed signatures were calculated. Shown is the average overlap as a function of the number of genes added to the reference set. The different lines correspond to different choices of Ncore, shown in parentheses. b, The recurrent signature method was applied to three sequence-related reference sets. These sets include all of the genes that contain the binding sequences CGGN11CCG (for Gal4), TGACTC (for Gcn4) or TTN9GGAAA (for Mcm1) in a region of 600 bp upstream. Shown is the fraction of perturbed signatures whose overlap with the reference signature is greater than some threshold, as a function of this threshold. Note the large number of highly overlapping outputs for all three reference sets. By contrast, the profile corresponding to a random sequence is distinctly different, with no large overlaps. Thus, the ‘recurrence profile’ gives a clear indication of whether a given sequence functions as a regulatory control element.
A global analysis in yeast
• 1000 expression profiles
• Applied the signature algorithm (SA) with input gene sets:
  – All target sets of 6-mers, 7-mers, 8-mers (~86K sets)
  – All functional groups in MIPS
  – All clusters in a hierarchical clustering of all genes
• Accepted only recurring modules.
• Results: 86 modules covering 2241 genes.
Genes in most modules participate in module-specific cellular process
ISA – Pros/Cons
• Pros
  – Simple, quite fast
  – Elegant solution to the normalization problem
  – Good empirical results in several cases
• Cons
  – Threshold setting
  – Finding good seeds
  – Redundancies
  – Non-normal behaviors
SAMBA: Statistical and Algorithmic Method for Bicluster Analysis
• Developed here (Tanay, Sharan, Shamir, Bioinformatics 02)
• Outline:
  – Develop efficient combinatorial techniques for biclustering large datasets
  – Employ a statistical model for biclusters
  – Allow integration of heterogeneous data
The SAMBA model
[Figure: the expression matrix viewed as a bipartite graph G = (U, V, E) between conditions and genes; each pair is either an edge or a non-edge]
Goal: find high-similarity submatrices, i.e., find dense subgraphs
The SAMBA approach
• Normalization: translate the GE matrix to a weighted bipartite graph using a statistical model for the data
• Bicluster model: heavy subgraphs
• How to find biclusters: combined hashing and local optimization
• Redundancies: find many biclusters at once, filter them in post-processing
From a statistical model to edge weights – a simple example
• Background model: independent edges, each present with prob. p < ½.
• H: a subgraph of n genes, m conds, k edges
• P-value = tail of the binomial distribution:
  $$p(H) = \sum_{k'=k}^{nm} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm}\, p^{k} (1-p)^{nm-k}$$
  $$\log_2 p(H) \le nm + k \log_2 p + (nm-k) \log_2 (1-p)$$
• Weight the graph: edges get (1 + log p), non-edges get (1 + log(1−p)), with logs taken base 2; then subgraph weight ≥ log p-value.
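The bound can be checked numerically on a small case. This sketch uses only the formulas above with base-2 logarithms; the function names and the example values (n = 4, m = 5, p = 0.2) are chosen arbitrarily for illustration.

```python
from math import comb, log2

def binom_tail(nm, k, p):
    """P-value of a subgraph: probability of at least k edges among nm pairs."""
    return sum(comb(nm, j) * p**j * (1 - p) ** (nm - j) for j in range(k, nm + 1))

def subgraph_weight(n, m, k, p):
    """Sum of weights: (1 + log2 p) per edge, (1 + log2(1 - p)) per non-edge."""
    nm = n * m
    return k * (1 + log2(p)) + (nm - k) * (1 + log2(1 - p))

n, m, p = 4, 5, 0.2   # arbitrary example with p < 1/2
for k in (5, 10, 15, 20):
    log_pval = log2(binom_tail(n * m, k, p))
    weight = subgraph_weight(n, m, k, p)
    print(k, round(weight, 2), round(log_pval, 2), weight >= log_pval)
```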
Limitations of the uniform probability model
• Not all dense subgraphs are statistically significant.
• Different genes/conds have dissimilar noise characteristics.
• Noisy genes/conds have a high probability of forming dense subgraphs.
• An extended likelihood ratio model: likelihood ratio = (bicluster random subgraph model) / (background random graph model)
• Will show: the likelihood model translates to a sum of weights over edges and non-edges
A Degree Based Random Graph Model
• Each edge (u,v) occurs independently with prob p(u,v).
• p(u,v) depends on the degrees of both u and v
• Γ = { G' = (U,V,E') | deg(w, E') = deg(w, E) for all w in U ∪ V }: the set of degree-preserving graphs on the same node sets.
• p(u,v) = Pr((u,v) in E' | G' in Γ)
• Approximated using a Monte Carlo process
[Figure legend: low-prob edges, medium-prob edges, high-prob edges]
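The slides do not say which Monte Carlo process is used. One common choice, shown here purely as an assumption, is a chain of random edge swaps that preserve every node's degree, with p(u,v) estimated as the fraction of sampled graphs containing (u,v). The function name and parameters are hypothetical, and consecutive samples from such a chain are correlated, so this is a rough sketch rather than a calibrated estimator.

```python
import random

def estimate_edge_probs(edges, n_samples=200, swaps_per_sample=50, seed=0):
    """Estimate p(u,v) by sampling degree-preserving rewirings of a bipartite graph."""
    rng = random.Random(seed)
    current = sorted(set(edges))          # list of (u, v) pairs
    present = set(current)
    counts = {}
    for _ in range(n_samples):
        for _ in range(swaps_per_sample):
            i = rng.randrange(len(current))
            j = rng.randrange(len(current))
            (u1, v1), (u2, v2) = current[i], current[j]
            new1, new2 = (u1, v2), (u2, v1)
            # Skip swaps that would duplicate an edge (this also covers i == j
            # and pairs of edges sharing an endpoint).
            if new1 in present or new2 in present:
                continue
            present -= {current[i], current[j]}
            present |= {new1, new2}
            current[i], current[j] = new1, new2
        for e in current:
            counts[e] = counts.get(e, 0) + 1
    return {e: c / n_samples for e, c in counts.items()}

# Hypothetical graph: conditions c1..c3 (U side), genes g1..g3 (V side).
edges = [("c1", "g1"), ("c1", "g2"), ("c1", "g3"), ("c2", "g1"), ("c3", "g2")]
probs = estimate_edge_probs(edges)
for e in sorted(probs):
    print(e, round(probs[e], 2))
```

Because every sampled graph has the same degrees as the input, the estimated probabilities around any node sum to that node's degree.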
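Likelihood Ratio Model
• Bicluster model assumption: edges occur independently with prob p_c
• Likelihood ratio score for a subgraph B = (U', V', E'):
  $$L(B) = \frac{\prod_{(u,v)\in E'} p_c \, \prod_{(u,v)\notin E'} (1-p_c)}{\prod_{(u,v)\in E'} p(u,v) \, \prod_{(u,v)\notin E'} (1-p(u,v))} = \prod_{(u,v)\in E'} \frac{p_c}{p(u,v)} \prod_{(u,v)\notin E'} \frac{1-p_c}{1-p(u,v)}$$
  $$\log L(B) = \sum_{(u,v)\in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v)\notin E'} \log \frac{1-p_c}{1-p(u,v)}$$
• Subgraph weight = log likelihood ratio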
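The translation from likelihood ratio to additive weights can be sketched directly from the formula. The function name, the value p_c = 0.9, and the small table of background probabilities are assumptions made for the example.

```python
from math import log

def bicluster_weight(conds, genes, edges, p, p_c):
    """Log likelihood ratio of a subgraph as a sum of edge and non-edge weights."""
    total = 0.0
    for u in conds:
        for v in genes:
            if (u, v) in edges:
                total += log(p_c / p[(u, v)])              # edge weight
            else:
                total += log((1 - p_c) / (1 - p[(u, v)]))  # non-edge weight
    return total

# Hypothetical background probabilities p(u,v) and observed edges.
p = {("u1", "v1"): 0.2, ("u1", "v2"): 0.5, ("u2", "v1"): 0.1, ("u2", "v2"): 0.3}
edges = {("u1", "v1"), ("u1", "v2"), ("u2", "v1")}
w = bicluster_weight(["u1", "u2"], ["v1", "v2"], edges, p, p_c=0.9)
print(round(w, 3))
```

With these weights, edges that are unlikely under the background model contribute large positive terms, and non-edges contribute negative terms.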
Heaviest bipartite subgraph
• NP-complete (Dawande et al. 97, Hochbaum 98)
• (But: the node biclique problem is polynomial!)
• Assumption: degree on the V side is bounded by d
• Start by finding heavy bicliques.
• Alg: use hashing to discover heavy subsets of conds.
Finding Heaviest Biclique
[Figure: worked numeric example]
• Takes O(n 2^d) time and space.
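The hashing idea can be sketched for the unit-weight case: every gene adds each subset of its neighborhood to a hash table, so each table entry accumulates the genes that are fully connected to that condition subset. With n genes of degree at most d this touches at most n·2^d subsets. The function name and the toy graph are assumptions, and the weight of a biclique is taken as its number of edges (every edge has weight 1).

```python
from itertools import combinations
from collections import defaultdict

def heaviest_biclique(neighbors):
    """neighbors: gene -> set of conditions. Returns (weight, conds, genes)."""
    table = defaultdict(list)   # condition subset -> genes adjacent to all of it
    for gene, conds in neighbors.items():
        ordered = sorted(conds)
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                table[subset].append(gene)
    # Unit weights: a biclique's weight is (#conditions) * (#genes).
    best = max(table, key=lambda s: len(s) * len(table[s]))
    return len(best) * len(table[best]), set(best), set(table[best])

# Hypothetical graph: gene -> conditions it has an edge to.
neighbors = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "b"},
    "g4": {"a", "b", "d"},
    "g5": {"d"},
}
weight, conds, genes = heaviest_biclique(neighbors)
print(weight, sorted(conds), sorted(genes))
```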
Using bicliques to find the heaviest biclusters
Assume: edge weight = 1, non-edge weight = −1
Note: $w((U',V')) = \sum_{u \in U'} w((u, V'))$
Lemma: If B = (U',V') is a maximum weight subgraph and X ⊆ U', then ∃ v ∈ V' s.t. |N(v) ∩ X| ≥ |X|/2.
Pf:
$$0 < w((X,V')) = \sum_{v \in V'} \left( |N(v) \cap X| - |\overline{N(v)} \cap X| \right) = \sum_{v \in V'} \left( 2\,|N(v) \cap X| - |X| \right)$$
so at least one term of the last sum is positive.
Corollary: If B = (U',V') is a maximum weight subgraph then |U'| ≤ 2d
Using bicliques to find the heaviest biclusters
Assume: edge weight = 1, non-edge weight = −1
• Lemma: in a max wt subgraph (U*,V*), ∀ X ⊆ U* ∃ Y ⊆ X, |Y| ≥ |X|/2, s.t. Y ⊆ N(v) for some v ∈ V*.
• Corollary: in a max wt subgraph (U*,V*), U* can be covered by at most log(2d) sets, each contained in the neighborhood of some vertex in V*
Using bicliques to find the heaviest biclusters
• A set of conditions in a maximal bicluster is the union of up to log(2d) subsets of gene neighborhoods.
• Exhaustive O((n 2^d)^{log(2d)}) time alg:
  – Hash bicliques
  – Enumerate all log(2d)-size N(v) combinations.
• Can be generalized to arbitrary edge/non-edge weights.
[Figure: U' covered by subsets u'', u''', …]
SAMBA’s implementation
• Phase I: find heavy bicliques: for each gene of deg < d, hash all subsets of neighbors of size 4-6.
• Phase II: greedy expansion of the heaviest bicliques containing each gene/cond
• Phase III: filter overlapping biclusters.
Evaluating Specificity
• Suppose the conditions partition into k classes of sizes c_1, …, c_k; Σ c_i = m
• A bic has b conditions, b_i from class i
• If b_t = max b_i, assign the bic to class t
• How good is the match of the bic to the classification? Hypergeometric score (k below is the summation index):
  $$\Pr(B) = \sum_{k=b_t}^{b} \frac{\binom{c_t}{k} \binom{m-c_t}{b-k}}{\binom{m}{b}}$$
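The hypergeometric score can be computed with exact binomial coefficients from the standard library. The function name and the small example (m = 10 conditions, a class of size 4, a bicluster of 3 conditions all from that class) are assumptions for illustration.

```python
from math import comb

def hypergeom_tail(m, c_t, b, b_t):
    """Probability that a random set of b conditions out of m contains at least
    b_t conditions from a class of size c_t."""
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible terms vanish on their own.
    hits = sum(comb(c_t, k) * comb(m - c_t, b - k) for k in range(b_t, b + 1))
    return hits / comb(m, b)

score = hypergeom_tail(m=10, c_t=4, b=3, b_t=3)
print(score)
```

A smaller score means the bicluster's conditions are concentrated in one class more than a random set of the same size would be.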
Specificity Test
• GE data: Alizadeh et al. (00); 4026 genes, 96 human tissues; 9 classes of lymphoma, normal
[Figure: fraction of biclusters (y-axis) vs. log(p-value) (x-axis); series: SAMBA, Cheng-Church ‘00, Random; annotation: better fit to true classification]
Specificity (2)
• Generate a random bipartite graph with the same degree sequence as the Alizadeh data; compute biclusters; plot p-value and likelihood (weight)
[Figure: log likelihood (y-axis) vs. log p-value (x-axis); + Lymphoma data (Alizadeh et al.), x Shuffled data]
Heterogeneous data
• Transcription level: ChIP-chip, mRNA profiling
• Protein level: 2-hybrid, protein complex identification using mass spec
• Phenotype level: synthetic lethality (1 + 1 = 0), barcoded deletion libraries
• and so many more…
Tanay, Sharan, Kupiec, Shamir, PNAS 04
Unified Modeling of Biological Information
[Figure: genes/proteins, properties, modules]
A Heterogeneous Collection of Yeast Genomic Information
• Gene expression: ~1000 conditions, 27 publications
• TF binding profiles: 110 profiles from growth on YPD (Lee et al.)
• Phenotype profiles: 6 (30) profiles (Giaever et al.)
• Two-hybrid interactions: ~1000 (Uetz et al.)
• Protein complex interactions: ~4000 (Ho et al.)
• MIPS interactions: ~1000
From experiments to properties
[Figure: the measurements for a gene g are discretized into properties p1, p2, …]
• Expression: strong induction, medium induction, medium repression, strong repression (p1 to p4)
• TF binding: strong binding to TF T, medium binding to TF T (p1, p2)
• Phenotype: high sensitivity, medium sensitivity (p1, p2)
• Interaction: high confidence interaction, medium confidence interaction (p1, p2)
• Complex: strong complex binding to protein P, medium complex binding to protein P (p1, p2)
A SAMBA module
[Figure: a module shown as genes × properties, with GO annotations; labeled genes: CPA1, CPA2]
Modular organization in yeast
[Figure: map generated automatically by SAMBA; ovals = modules, edges = module overlaps]
• Clustered organization
• Cluster = process
• Hierarchical
• “Bridges”
TF-function map
[Figure omitted]
SAMBA – Pros/Cons
• Pros
  – Fast
  – Allows simultaneous normalization of genes and conditions
  – Allows integration of heterogeneous data
  – Well suited for query-based usage
• Cons
  – Discretization
  – Redundancies
Biclustering – interim summary
• A general data mining problem
• The key point: defining what a bicluster is
• Algorithms vary, depending on the nature of the bicluster model
• Open issues:
  – What is the best objective/bic criterion?
  – Search for bics in really huge matrices
  – Handling redundancies