Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002


Page 1: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Motif Finding

Yueyi Irene Liu

CS374 Lecture

Oct. 17, 2002

Page 2: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Outline

• Background biology

• Motif-finding methods
– Word enumeration
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer

Page 3: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002
Page 4: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Regulation of Gene Expression

• Chromatin structure
• Transcription initiation
• Transcript processing and modification
• RNA transport
• Transcript stability
• Translation initiation
• Post-translational modification
• Protein transport
• Control of protein stability

Page 5: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Typical Structure of a Eukaryotic mRNA Gene

Page 6: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Control of Transcription Initiation

Page 7: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Motif

• A conserved pattern that is found in two or more sequences

• Can be found in
– DNA (e.g., transcription factor binding sites)
– Protein
– RNA

Page 8: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Models for Representing Motifs

• Regular expression
– Consensus
• TGACGCA
– Degenerate
• WGACRCA (W = A or T, R = A or G)

• Position Specific Matrix

Aligned sites:

TGACGCA
TGACGCA
AGACGCA
TGACACA
AGACGCA

     1    2    3    4    5    6    7
A   0.4   0    1    0   0.2   0    1
T   0.6   0    0    0    0    0    0
G    0    1    0    0   0.8   0    0
C    0    0    0    1    0    1    0
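The matrix above can be reproduced by counting bases column by column. The following is a minimal sketch, assuming the five aligned sites are the only input; the function name `position_specific_matrix` is illustrative and not from the lecture.

```python
from collections import Counter

def position_specific_matrix(sites):
    """Return per-position base frequencies for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [0.0] * width for base in "ATGC"}
    for pos in range(width):
        # Count how often each base appears in this column of the alignment.
        counts = Counter(site[pos] for site in sites)
        for base in "ATGC":
            matrix[base][pos] = counts[base] / len(sites)
    return matrix

sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
psm = position_specific_matrix(sites)
for base in "ATGC":
    print(base, psm[base])
```

Each column sums to 1, so a column can be read as a probability distribution over the four bases at that motif position.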

Page 9: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Where to look for motifs?

• Gene families: a set of genes controlled by a common transcription factor or common environmental stimulus

• How do you construct gene families?
– Microarray experiments

Page 10: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Microarrays

[Figure: known DNA sequences on a glass slide; mRNA isolated from the cells of interest and from a reference sample. Resulting data: a genes × experiments matrix, e.g.:]

3.25 3.01 1.30 0.70
6.73 2.89 0.92 0.67
1.14 1.15 0.60 0.23
2.12 6.12 0.07 0.02

Page 11: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Motif-finding Methods

• Goal: Look for motifs (5-15bp) in the data set

• Methods:– Word enumeration method– Gibbs sampling– Random projection– Phylogenetic footprinting– Reducer

Page 12: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Word Enumeration

• For every word w, calculate:
– Expected frequency based on the entire upstream region of the yeast genome
• E.g., P(ATTGA) = (0.4)^4 (0.1)^1, given P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
• Expected number of occurrences of ATTGA: n * P(ATTGA)
– Observed frequency in the data set
– Statistical significance of enrichment: Z = (O - E) / sqrt[np (1 - p)] ~ N(0, 1)

• Disadvantage: only considers exact words
• E.g., YCTGCA: TCTGCA and CCTGCA are counted as two separate words
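The enrichment statistic can be sketched in a few lines. This is an illustration only: it assumes n is the number of positions at which the word could start in the data set (the slide does not define n), counts overlapping occurrences, and uses made-up helper names.

```python
import math

def word_probability(word, base_probs):
    """Probability of a word under an independent-base background model."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def count_occurrences(word, sequences):
    """Observed count O, including overlapping occurrences."""
    w = len(word)
    return sum(
        1
        for seq in sequences
        for i in range(len(seq) - w + 1)
        if seq[i:i + w] == word
    )

def enrichment_z(word, sequences, base_probs):
    """Z = (O - E) / sqrt(n p (1 - p)), with n = number of candidate start positions (assumption)."""
    n = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    p = word_probability(word, base_probs)
    observed = count_occurrences(word, sequences)
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
print(word_probability("ATTGA", background))  # (0.4)^4 * (0.1)^1
```

A large positive Z suggests the word occurs more often in the data set than the background model predicts.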

Page 13: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Gibbs Sampling

• Matrix to capture a motif

• Goal: find the best ak to maximize the difference between motif and background base distribution.

[Figure: a set of sequences with motif start positions a1, a2, a3, a4, …, ak marked]

Liu, X

Page 14: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Gibbs Sampling (Lawrence, et al, 1993)

• Step 1: Pick random start positions, compute current motif matrix

• Step 2: Iterative update
– Take one sequence out, update motif matrix
– Calculate fitness score of each position of the left-out sequence
– Pick start position in the left-out sequence based on weight Ax
– Take out another sequence, …, until convergence

• Step 3: Reset starting position

Liu, X
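Steps 1 and 2 can be sketched as follows. This is a simplified illustration, not the published sampler: it assumes exactly one site per sequence, adds a pseudocount to avoid zero probabilities, runs a fixed number of iterations instead of testing for convergence, and omits Step 3. All function and parameter names are illustrative.

```python
import random

BASES = "ACGT"

def build_matrix(sequences, starts, width, skip, pseudocount=1.0):
    """Column-wise base frequencies from every current site except sequence `skip`."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(sequences, starts)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def fitness(segment, matrix, background):
    """A_x = Q_x / P_x: motif probability over background probability."""
    q = p = 1.0
    for j, base in enumerate(segment):
        q *= matrix[j][base]
        p *= background[base]
    return q / p

def gibbs_sample(sequences, width, background, iterations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for it in range(iterations):
        out = it % len(sequences)  # Step 2: take one sequence out in turn
        matrix = build_matrix(sequences, starts, width, skip=out)
        seq = sequences[out]
        weights = [
            fitness(seq[i:i + width], matrix, background)
            for i in range(len(seq) - width + 1)
        ]
        # Sample the new start position in proportion to the fitness scores.
        starts[out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["TTACGTGCAAT", "GGACGTGCTTA", "CACGTGCGGTT", "ATTTACGTGCA"]
uniform = {b: 0.25 for b in BASES}
print(gibbs_sample(seqs, 6, uniform))
```

Sampling the new position (rather than always taking the best-scoring one) is what lets the procedure escape poor starting configurations.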

Page 15: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Gibbs Sampling Initialization

Pick random start positions, compute motif matrix

[Figure: start positions a1, a2, a3, a4, …, ak and a1', a2', a3', a4', …, ak' marked on the sequences]

Liu, X

Page 16: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Gibbs Sampling Iteration Steps

1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif

[Figure: the first sequence is taken out and its start position a1' is unknown (?); the other sequences keep a2', a3', a4', …, ak']

Liu, X

Page 17: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Fitness Score

• Ax = Qx / Px
– Qx: probability of generating subsequence x from current motif
– Px: probability of generating subsequence x from background

Current motif:

     1    2    3
A   0.1  0.3  0.7
T   0.1  0.2  0.1
G   0.7  0.4  0.1
C   0.1  0.1  0.1

Background:

P(A) = P(T) = 0.4

P(G) = P(C) = 0.1

X = GGA:

Q? P?
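As a worked check of the question above, using only the motif matrix and background given on the slide: Q = 0.7 × 0.4 × 0.7 = 0.196 and P = 0.1 × 0.1 × 0.4 = 0.004, so A = Q / P = 49. The short sketch below computes the same values; the function name is illustrative.

```python
motif = {
    "A": [0.1, 0.3, 0.7],
    "T": [0.1, 0.2, 0.1],
    "G": [0.7, 0.4, 0.1],
    "C": [0.1, 0.1, 0.1],
}
background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

def fitness_score(x, motif, background):
    """Return (Q_x, P_x, A_x) for subsequence x."""
    q = 1.0
    p = 1.0
    for pos, base in enumerate(x):
        q *= motif[base][pos]   # probability under the current motif
        p *= background[base]   # probability under the background
    return q, p, q / p

q, p, a = fitness_score("GGA", motif, background)
print(q, p, a)
```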

Page 18: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

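Gibbs Sampling Iteration Steps

2) Pick new start position by sampling from the fitness score

[Figure: plot titled "Sample from Fitness Score"; x-axis: starting position of motif in sequence (0, 1, 2, …, 12, …); y-axis: fitness (0 to 5). The left-out sequence receives the new start position a1''; the other sequences keep a2', a3', a4', …, ak'.]

Liu, X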

Page 19: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Recent Development

• Random Projection

• Phylogenetic Footprinting

• Reducer

Page 20: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Random Projection (Buhler, 2002)

• (l, d)-motif problem:
– M is an (unknown) motif of length l
– Each occurrence of M is corrupted by exactly d point substitutions in random positions

• No known biological motifs are of (l, d)-motif form

CCcaAG
CCcgAG
CCgcAG
CCtaAG
CCtgAG
CtATgG
CCctAc
tCtTAG
CaAcAG
CCAgAa

Page 21: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Random Projection Algorithm

• Guiding principle: Some instances of a motif agree on a subset of positions.

• Use information from multiple motif instances to construct model.

M = ATGCGTC, a (7,2) motif

...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...

(instances labeled x(1), x(2), x(5), x(8) on the slide)

Buhler, J

Page 22: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

k-Projections

• Choose k positions in string of length l.

• Concatenate nucleotides at chosen k positions to form k-tuple.

• In l-dimensional Hamming space, projection onto k dimensional subspace.

ATGGCATTCAGATTC → TGCTGAT

l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13)

Buhler, J
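The projection in this example can be checked directly. A minimal sketch, assuming 1-based positions as on the slide; `project` and `random_projection` are illustrative names.

```python
import random

def project(word, positions):
    """Concatenate the letters of `word` at the given 1-based positions."""
    return "".join(word[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))
```

Two l-tuples that agree on every chosen position produce the same k-tuple, which is what makes the projection usable as a hash key.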

Page 23: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Random Projection Algorithm

• Choose a projection by selecting k positions uniformly at random.

• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.

• Recover motif from bucket containing multiple l-tuples.

Input sequence x(i): …TCAATGCACCTAT...

l-tuple TGCACCT → bucket TGCT

Buhler, J

Page 24: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Example

• l = 7 (motif size) , k = 4 (projection size)

• Choose projection (1,2,5,7)

Input sequence: ...TAGACATCCGACTTGCCTTACTAC...

Buckets:

ATCCGAC → bucket ATGC
GCCTTAC → bucket GCTC

Buhler, J

Page 25: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Hashing and Buckets

• Hash function h(x) obtained from k positions of projection.

• Buckets are labeled by values of h(x).

• Enriched buckets: contain more than s l-tuples, for some parameter s.

[Figure: buckets labeled ATTC, CATC, GCTC, ATGC]

Buhler, J
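The hashing step can be sketched as below. It is an illustration under assumptions: sequences are upper-cased strings, positions are 1-based, and a bucket counts as enriched when it holds more than s l-tuples, as stated above. The function names are not from the original slides.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Hash every l-tuple of every sequence by its letters at the projected positions."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)
            buckets[key].append((seq_index, start, lmer))
    return buckets

def enriched_buckets(buckets, s):
    """Keep buckets that contain more than s l-tuples."""
    return {key: items for key, items in buckets.items() if len(items) > s}

# The single input sequence from the example slide, projection (1, 2, 5, 7).
example = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], 7, (1, 2, 5, 7))
print([lmer for _, _, lmer in example["ATGC"]])

# The four (7,2) motif instances shown earlier, with their flanking bases.
instances = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
motif_buckets = hash_lmers(instances, 7, (1, 2, 5, 7))
print(sorted(enriched_buckets(motif_buckets, 3)))
```

With projection (1, 2, 5, 7), all four instances land in bucket ATGC because none of their substitutions fall on the projected positions.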

Page 26: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Motif Refinement

• How do we recover the motif from the sequences in the enriched buckets?

• k nucleotides are known from hash value of bucket.

• Use information in other l-k positions as starting point for local refinement scheme, e.g. EM or Gibbs sampler

[Figure: bucket ATGC containing ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC → local refinement algorithm → candidate motif ATGCGTC]

Buhler, J

Page 27: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Parameter Selection

• Projection size k
– Choose k small so several motif instances hash to same bucket. (k < l - d)
– Choose k large to avoid contamination by spurious l-mers. (4^k > t (n - l + 1), for t input sequences of length n)

• Bucket threshold s: (s = 3, s = 4)

Buhler, J
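The two constraints on k can be combined into a simple range check. A minimal sketch: the function name is illustrative, and the values l = 15, d = 4, t = 20, n = 600 are example inputs chosen here, not taken from the slide.

```python
def projection_sizes(l, d, t, n):
    """Return every k with 4**k > t * (n - l + 1) and k < l - d."""
    spurious = t * (n - l + 1)  # number of l-mers in t sequences of length n
    return [k for k in range(1, l - d) if 4 ** k > spurious]

# Example inputs (assumed for illustration): 20 sequences of length 600.
print(projection_sizes(l=15, d=4, t=20, n=600))
```

An empty result means no projection size satisfies both conditions for the given inputs.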

Page 28: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Recent Development

• Random Projection

• Phylogenetic Footprinting

• Reducer

Page 29: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Conservation of Regulatory Elements Upstream of ApoAI Gene

[Figure: aligned upstream sequences from mouse, rabbit, human, and chicken, with conserved regions marked: Hepatic site C, CCAAT box, TATA box]

Page 30: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

[Figure: phylogenetic tree with nodes labeled AAGCA, AAGCA, ACGCA, AAGCA, AAGCA]

Page 31: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Substring Parsimony Problem

Given:
• orthologous upstream sequences S1, …, Sn
• phylogenetic tree T of the n species
• size k of the motif, threshold d

Problem: Find all sets of substrings s1, …, sn of S1, …, Sn, each of size k, such that the parsimony score of s1, …, sn on T is at most d

Blanchette, M

Page 32: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Parsimony Score

Parsimony score of s1, …, sn on T: the minimum, over all possible labelings of internal nodes, of

$$\sum_{(u,v) \in E(T)} d(l(u), l(v))$$

• l(v) – label of node v

• d(l1, l2) – Hamming distance

[Figure: tree T with leaves s1, s2, s3, s4, s5, s6]

Blanchette, M

Page 33: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

String Parsimony Problem

S1: AAAGCATTC

S2: TACGCACCC

S3: GAAGCAGGG

k = 5

d = 1

[Figure: tree over S1, S2, S3 with substrings AAGCA (S1), ACGCA (S2), AAGCA (S3) at the leaves and internal nodes labeled AAGCA]

Page 34: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Algorithm: version I

• Root the tree at arbitrary internal node r

• Compute table Wu of size 4^k for each node u, where Wu[s] – best parsimony score for subtree rooted at u when u is labeled with s

$$W_u[s] = \begin{cases} +\infty & \text{if } u \text{ is a leaf and } s \text{ is not a substring of } S_u \\ 0 & \text{if } u \text{ is a leaf and } s \text{ is a substring of } S_u \\ \sum_{v \in Child(u)} \min_{t \in \Sigma^k} \big( W_v[t] + d(s,t) \big) & \text{if } u \text{ is not a leaf} \end{cases}$$

• Direct implementation of this recursion gives O(n∙k∙(4^{2k} + l)), where l – average sequence length

Blanchette, M
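The recursion can be tried on the three-sequence example from the previous slide. This is a direct, unoptimized sketch under assumptions: the tree shape ((S1, S2), S3) is a guess at the figure, trees are written as nested tuples with leaf names as strings, and leaf tables store only the finite (zero) entries, since infinite entries never win a minimum.

```python
from itertools import product
from operator import ne

ALPHABET = "ACGT"

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(map(ne, s, t))

def parsimony_table(node, leaf_seqs, k):
    """Return W_u as a dict mapping each k-mer s with finite score to W_u[s]."""
    if isinstance(node, str):
        seq = leaf_seqs[node]
        # Leaf: score 0 for substrings of S_u; all other strings are +infinity (omitted).
        return {seq[i:i + k]: 0 for i in range(len(seq) - k + 1)}
    child_tables = [parsimony_table(child, leaf_seqs, k) for child in node]
    table = {}
    for chars in product(ALPHABET, repeat=k):
        s = "".join(chars)
        # Internal node: sum over children of the best label t for that child.
        table[s] = sum(
            min(w + hamming(s, t) for t, w in child.items())
            for child in child_tables
        )
    return table

leaf_seqs = {"S1": "AAAGCATTC", "S2": "TACGCACCC", "S3": "GAAGCAGGG"}
tree = (("S1", "S2"), "S3")  # assumed topology
w_root = parsimony_table(tree, leaf_seqs, k=5)
best = min(w_root.values())
print(best, sorted(s for s, score in w_root.items() if score == best))
```

Root labels with score at most d identify candidate motifs; recovering the substrings themselves would additionally require tracing back the minimizing choices, which this sketch leaves out.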

Page 35: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Algorithm: version II

• Define X(u, v)[s] – best parsimony score for subtree consisting of edge (u,v) and the subtree rooted at v

$$X_{(u,v)}[s] = \min_{t \in \Sigma^k} \big( W_v[t] + d(s,t) \big)$$

$$W_u[s] = \sum_{v \in Child(u)} X_{(u,v)}[s]$$

[Figure: node u labeled s, with children v and w]

Blanchette, M

Page 36: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Algorithm: version II (continued)

• Update X(u, v) in phases: in phase p maintain set Bp of sequences t, such that X(u, v)[t] = p

• Define:
– Ra = {s: Wv[s] = a}
– N(s) = {t in Σ^k: d(s, t) = 1}

• Start in phase m and let Bm = Rm

• Update:

$$B_{p+1} = \Big( R_{p+1} \cup \bigcup_{s \in B_p} N(s) \Big) \setminus \bigcup_{j \le p} B_j$$

• Computation of X(u, v) takes O(k∙4^k)

Blanchette, M
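The phase-by-phase update amounts to a breadth-first search over Hamming neighbors, seeded by the sets Ra. A minimal sketch, assuming the table for the child v is given as a dict of its finite, non-negative integer entries and that m is the smallest score present; the names are illustrative.

```python
from itertools import product

ALPHABET = "ACGT"

def neighbors(s):
    """All strings at Hamming distance exactly 1 from s."""
    for i, c in enumerate(s):
        for b in ALPHABET:
            if b != c:
                yield s[:i] + b + s[i + 1:]

def edge_table(w_v):
    """Compute X[s] = min over t of (W_v[t] + d(s, t)) by phases."""
    by_score = {}  # R_a: strings whose score in W_v equals a
    for t, a in w_v.items():
        by_score.setdefault(a, set()).add(t)
    x = {}
    p = min(by_score)           # starting phase m
    current = set(by_score[p])  # B_m = R_m
    while current:
        for s in current:
            x[s] = p
        # B_{p+1}: R_{p+1} plus neighbors of B_p, minus everything already assigned.
        candidates = set(by_score.get(p + 1, ()))
        for s in current:
            candidates.update(neighbors(s))
        current = {t for t in candidates if t not in x}
        p += 1
    return x

w_v = {"AAA": 0, "CGT": 2, "TTT": 1}
x = edge_table(w_v)
print(x["AAA"], x["CGT"], x["TTT"], x["AAT"])
```

Each string is assigned once, and each assignment inspects its 3k neighbors, which is consistent with the O(k∙4^k) bound stated above.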

Page 37: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Improvements

• Reduce the size of Bp by skipping sequences whose contribution to X(u, v) would exceed threshold d
– In phase p, only care for X(u, v)[s] if a bound on the total score holds [equation lost in extraction; its terms include p, d, max, and a sum over w ∈ Child(u), w ≠ v]
– Leads to significant reductions in stages d/2 … d

• Reduce the number of substrings inserted in W at the leaves
– For substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded

Blanchette, M

Page 38: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Results

• Practical limit on k = 10

• There appeared to be a threshold d0 with very few solutions below and many above

• Algorithm found ~80% of known binding sites

• Performed better than ClustalW, MEME, Consensus

Blanchette, M

Page 39: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Recent Development

• Random Projection

• Phylogenetic Footprinting

• Reducer

Page 40: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Reducer (Bussemaker, et al 2001)

• Links motif finding to expression level

• $A_g = C + \sum_{u=1}^{M} F_u N_{ug}$
– Ag: gene expression level (logarithm of expression ratio)
– M: number of significant motifs
– Nug: number of occurrences of motif u in gene g
– C: baseline expression level (same for all genes)
– Fu: increase/decrease of expression level caused by presence of motif u

Page 41: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Reducer (Cont’d)

Expression vector

Log ratio of expression levels

Gene1 Gene2 Gene3 Gene4 … GeneN

1.3 -3.7 10.3 4.5 -2.3

Motif vector

Number of times that motif occurs in the upstream region of the gene

Gene1 Gene2 Gene3 Gene4 … GeneN

AAAAA 2 0 5 3 0

AAAAT 5 3 2 1 5

…

Liu, X

Page 42: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Reducer (Cont’d)

• Normalize expression (A) and motif (n) vectors

• Linear regression between A vector and every n vector to find the best fit n to A

• Step-wise regression to combine effects of motifs
– Subtract the effect of one motif
– Find the next best motif

Liu, X
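The three steps can be sketched on the toy vectors from the previous slide. This is a simplified illustration, not the published implementation: it treats the five listed values as a complete data set, normalizes each vector to zero mean and unit length, picks the motif with the largest absolute fit, and subtracts its contribution before the next round. All names are illustrative.

```python
import math

def normalize(v):
    """Shift to zero mean and scale to unit length (fails on constant vectors)."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def stepwise_motifs(expression, motif_counts, steps):
    """Return [(motif, fit), ...] chosen greedily, one motif per step."""
    residual = normalize(expression)
    unit = {m: normalize(n) for m, n in motif_counts.items()}
    remaining = set(unit)
    chosen = []
    for _ in range(min(steps, len(unit))):
        # For unit-length vectors, the regression slope equals the dot product.
        fits = {m: sum(r * x for r, x in zip(residual, unit[m])) for m in remaining}
        best = max(fits, key=lambda m: abs(fits[m]))
        chosen.append((best, fits[best]))
        remaining.remove(best)
        # Subtract the effect of the chosen motif before searching again.
        residual = [r - fits[best] * x for r, x in zip(residual, unit[best])]
    return chosen

expression = [1.3, -3.7, 10.3, 4.5, -2.3]
motif_counts = {"AAAAA": [2, 0, 5, 3, 0], "AAAAT": [5, 3, 2, 1, 5]}
print(stepwise_motifs(expression, motif_counts, steps=2))
```

On these toy numbers the first motif selected is AAAAA, whose counts track the expression vector closely; a positive fit corresponds to a motif associated with increased expression.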

Page 43: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Acknowledgement

• People from whom I borrowed slides:
– Xiaole Liu (Reducer)
– Olga Troyanskaya (Microarray)
– Jeremy Buhler (Random projections)
– Mathieu Blanchette (Phylogenetic footprinting)
– Various web sources

Page 44: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002
Page 45: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

[Figure: cDNA microarray workflow. cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray; hybridise mRNA target to microarray; excitation with laser 1 and laser 2 → emission → scanning; overlay images and normalise → analysis]

Page 46: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Information Content of Motifs

• Uncertainty

• Information = Hbefore - Hafter
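As an illustration of the definition above, the sketch below uses the Shannon entropy H = −Σ p log2 p as the uncertainty measure and assumes a uniform background, so that Hbefore = 2 bits per DNA position. Both choices are assumptions, since the slide's own uncertainty formula is not in the transcript. The columns are taken from the position specific matrix shown earlier in the lecture.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(column, background=(0.25, 0.25, 0.25, 0.25)):
    """Information = H_before - H_after for one motif position."""
    return entropy(background) - entropy(column)

# Columns (A, T, G, C) of the 7-position matrix for TGACGCA-like sites.
columns = [
    (0.4, 0.6, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0), (0, 0, 0, 1),
    (0.2, 0, 0.8, 0), (0, 0, 0, 1), (1, 0, 0, 0),
]
per_position = [information_content(col) for col in columns]
print([round(bits, 3) for bits in per_position], round(sum(per_position), 3))
```

Fully conserved positions carry the full 2 bits, while the two variable positions carry roughly 1.03 and 1.28 bits.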

Page 47: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Improvement on Original Gibbs sampler

• 0 ~ n copies of sites in each sequence

• Iterative masking to find multiple motifs

• Use higher order Markov models to improve motif specificity

Page 48: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Clinical Importance of Defects in Regulatory Elements

Burkitt’s Lymphoma

Page 49: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Statistical Methods

• Expectation Maximization (EM)
– MEME

• Gibbs sampling
– BioProspector
– AlignACE

Page 50: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Motifs are not limited to DNA

• RNA motifs
– RNA–RNA interaction motifs, e.g., intron-exon splice sites
– RNA–protein interaction motifs, e.g., binding of proteins to RNA polyA tail

• Protein motifs
– E.g., Helix-turn-helix motif

Page 51: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Sequence Logo

Page 52: Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002

Why is this Problem Hard?

• Motif information content low

• Hamming distance between each motif instance high