cs262 lecture 17, win07, batzoglou gene regulation and microarrays
Post on 20-Dec-2015
CS262 Lecture 17, Win07, Batzoglou
Overview
• A. Gene Expression and Regulation
• B. Measuring Gene Expression: Microarrays
• C. Finding Regulatory Motifs
Cells respond to environment
Cell responds to environment: various external messages
Genome is fixed – Cells are dynamic
• A genome is static
  Every cell in our body has a copy of the same genome
• A cell is dynamic
  Responds to external conditions
  Most cells follow a cell cycle of division
• Cells differentiate during development
• Gene expression varies according to:
  Cell type
  Cell cycle
  External conditions
  Location
slide credits: M. Kellis
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
Transcriptional Regulation
• Efficient place to regulate:
No energy wasted making intermediate products
• However, slowest response time
After a receptor notices a change:
1. Cascade message to nucleus
2. Open chromatin & bind transcription factors
3. Recruit RNA polymerase and transcribe
4. Splice mRNA and send to cytoplasm
5. Translate into protein
Transcription Factors Binding to DNA
Transcription regulation:
• Transcription factors bind DNA
• Binding recognizes DNA substrings:
• Regulatory motifs
Promoter and Enhancers
• Promoter necessary to start transcription
• Enhancers can affect transcription from afar
Gene Regulation with TFs
[Figure, three animation steps: a transcription factor (protein), a regulatory element and a gene on the DNA, and RNA polymerase; the last step adds a new protein.]
[Figure: a long stretch of raw genomic DNA sequence, with regions labeled as promoter motifs, 3' UTR motifs, exons, and introns.]
Example: A Human heat shock protein
• TATA box: positioning transcription start
• TATA, CCAAT: constitutive transcription
• GRE: glucocorticoid response
• MRE: metal response
• HSE: heat shock element
[Figure: promoter of heat shock hsp70, drawn from position -158 to 0 next to the GENE, with sites labeled TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1.]
The Cell as a Regulatory Network
• Genes = wires
• Motifs = gates
[Figure: regulatory circuit with inputs A, B, C. Gene D carries the rules "If C then D", "If B then NOT D", "If A and B then D" and outputs "Make D". Gene B carries the rule "If D then B" and outputs "Make B".]
DNA Microarrays
Measuring gene transcription in a high-throughput fashion
What is a microarray
• A 2D array of DNA sequences from thousands of genes
• Each spot has many copies of same gene
• Measure number of hybridizations per spot
Result:
• Thousands of “experiments” (one per gene) in one go
• Perform many microarrays for different conditions:
  Time during cell cycle
  Temperature
  Nutrient level
Goal of Microarray Experiments
• Measure level of gene expression across many different conditions:
  Expression matrix M, of size {genes} × {conditions}:
  $M_{ij}$ = expression level of gene $i$ in condition $j$
• Group genes into coregulated sets
  Observe cells under different conditions
  Find genes with similar expression profiles
  Potentially regulated by same TF
slide credits: M. Kellis
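As a minimal sketch of what "similar expression profiles" could mean, the snippet below compares rows of a small expression matrix with Pearson correlation. The choice of Pearson correlation, the gene names, and the values are all assumptions for illustration; the slides do not fix a similarity measure.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Hypothetical expression matrix: rows are genes, columns are conditions.
M = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 7.9],  # rises together with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}

for name in ("geneB", "geneC"):
    print(name, round(pearson(M["geneA"], M[name]), 3))
```

Genes whose profiles correlate strongly with each other would be candidates for the same coregulated set.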
Clustering vs. Classification
• Clustering
  Idea: groups of genes that share similar function have similar expression patterns
  • Hierarchical clustering
  • k-means
  • Bayesian approaches
  • Projection techniques
    • Principal Component Analysis
    • Independent Component Analysis
• Classification
  Idea: a cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)
  Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?
  • Support Vector Machines
  • Decision Trees
  • Neural Networks
  • K-Nearest Neighbors
Clustering Algorithms
• Hierarchical
  [Figure: points a to h in the plane, joined by a tree over the points]
• K-means
  [Figure: the same points partitioned around three centers c1, c2, c3]
slide credits: M. Kellis
Hierarchical clustering
• Bottom-up algorithm:
  Initialization: each point in a separate cluster
  At each step: choose the pair of closest clusters and merge them
• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
• Avoids the problem of specifying the number of clusters
[Figure: points a to h in the plane]
slide credits: M. Kellis
Distance between clusters
• Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
• Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
• Average-link method: $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
• Centroid method: $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
[Figure: four copies of points d to h, one per method]
slide credits: M. Kellis
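The bottom-up procedure and the four cluster distances can be sketched as follows. This is an illustrative, unoptimized version: Euclidean distance for D, the `agglomerate` name, and stopping at a target number of clusters `k` (rather than building the full tree) are assumptions made to keep the example short.

```python
import math
from itertools import combinations

def dist(x, y):
    return math.dist(x, y)  # Euclidean distance D(x, y), an assumed choice

def centroid(X):
    return tuple(sum(c) / len(X) for c in zip(*X))

# The four definitions of CD(X, Y) from the slide.
LINKAGE = {
    "single":   lambda X, Y: min(dist(x, y) for x in X for y in Y),
    "complete": lambda X, Y: max(dist(x, y) for x in X for y in Y),
    "average":  lambda X, Y: sum(dist(x, y) for x in X for y in Y) / (len(X) * len(Y)),
    "centroid": lambda X, Y: dist(centroid(X), centroid(Y)),
}

def agglomerate(points, k, method="single"):
    """Start with one cluster per point; merge the closest pair until k remain."""
    cd = LINKAGE[method]
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cd(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(pts, 3, "average"))
```

Recomputing every pairwise cluster distance at each step is wasteful; it is done here only so that each linkage rule stays a one-line definition.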
Results of Clustering Gene Expression
• CLUSTER is simple and easy to use
• De facto standard for microarray analysis
Time: $O(N^2 M)$
N: #genes, M: #conditions
K-Means Clustering Algorithm
• Each cluster $X_i$ has a center $c_i$
• Define the clustering cost criterion
  $COST(X_1, \dots, X_k) = \sum_{i} \sum_{x \in X_i} |x - c_i|^2$
• Algorithm tries to find clusters $X_1 \dots X_k$ and centers $c_1 \dots c_k$ that minimize COST
• K-means algorithm:
  Initialize centers
  Repeat:
    • Compute best clusters for given centers → attach each point to the closest center
    • Compute best centers for given clusters → choose the centroid of points in cluster
  Until the changes in COST are “small”
[Figure: points a to h partitioned around centers c1, c2, c3]
slide credits: M. Kellis
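A minimal sketch of the K-means loop described above, assuming Euclidean distance and points given as tuples. The function name, the optional `init` argument, the tolerance, and the toy points are illustrative choices, not part of the lecture.

```python
import math
import random

def kmeans(points, k, init=None, iters=100, tol=1e-9, seed=0):
    """Alternate assignment and centroid steps until COST stops decreasing."""
    rng = random.Random(seed)
    centers = list(init) if init else rng.sample(points, k)
    prev_cost = math.inf
    for _ in range(iters):
        # Best clusters for given centers: attach each point to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Best centers for given clusters: centroid of the points in each cluster.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        cost = sum(math.dist(p, centers[j]) ** 2
                   for j, members in enumerate(clusters) for p in members)
        if prev_cost - cost < tol:  # change in COST is "small"
            break
        prev_cost = cost
    return centers, clusters, cost

blob1 = [(0, 0), (0, 1), (1, 0)]
blob2 = [(10, 10), (10, 11), (11, 10)]
centers, clusters, cost = kmeans(blob1 + blob2, 2, init=[(0, 0), (10, 10)])
print(centers, cost)
```

Each pass touches every point, every center, and every coordinate once, which is where the per-iteration O(KNM) cost comes from.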
K-Means Algorithm
• Repeat … until convergence
Time: O(KNM) per iteration
N: #genes, M: #conditions
Mixture of Gaussians – Probabilistic K-means
• Data is modeled as a mixture of K Gaussians $N(\mu_1, \sigma^2 I), \dots, N(\mu_K, \sigma^2 I)$
  Prior probabilities $\pi_1, \dots, \pi_K$
• A different $\sigma_i$ for every Gaussian $i$, or even different covariance matrices, are possible, but learning becomes harder
  $P(x) = \sum_i P(x \mid N(\mu_i, \sigma^2 I))\, \pi_i$
  Use EM to learn parameters
Analysis of Clustering Data
• Statistical Significance of Clusters
Gene Ontology http://www.geneontology.org/
KEGG http://www.genome.jp/kegg/
• Regulatory motifs responsible for common expression
• Regulatory Networks
• Experimental Verification
Evaluating clusters – Hypergeometric Distribution
• N genes, p labeled ++, (N-p) labeled --
• Cluster: k genes, m labeled ++
• P-value of a single cluster containing k genes of which at least r are ++:
  $P(pos \ge r) = \sum_{m \ge r} \frac{\binom{p}{m} \binom{N-p}{k-m}}{\binom{N}{k}}$
  Each term is the probability that a random set of k genes has m ++ and k-m -- genes; the sum is the P-value that at least r genes are ++ in the cluster.
slide credits: M. Kellis
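The hypergeometric tail above can be evaluated directly with exact binomial coefficients. The function name and the example numbers below are hypothetical; the sum stops at min(k, p) because no more than that many positives can appear in the cluster.

```python
from math import comb

def hypergeom_pvalue(N, p, k, r):
    """P(at least r positives in a random set of k genes, with p of N genes positive)."""
    upper = min(k, p)
    hits = sum(comb(p, m) * comb(N - p, k - m) for m in range(r, upper + 1))
    return hits / comb(N, k)

# Hypothetical cluster: 50 genes, 10 of them annotated, out of 6000 genes with 100 annotated.
print(hypergeom_pvalue(6000, 100, 50, 10))
```

For very large N, a log-space evaluation would avoid huge intermediate integers, but exact integer arithmetic keeps this sketch simple.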
Regulatory Motif Discovery
[Figure: DNA of a group of co-regulated genes, with a common subsequence marked]
Find motifs within groups of co-regulated genes
slide credits: M. Kellis
Characteristics of Regulatory Motifs
• Tiny
• Highly Variable
• ~Constant size
  Because a constant-size transcription factor binds
• Often repeated
• Low-complexity-ish
Sequence Logos
Height of each letter proportional to its frequency
Height of all letters proportional to information content at that position
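A small sketch of how logo heights could be computed for one aligned motif column. It assumes the common definition of information content as 2 bits minus the Shannon entropy of the column (uniform background, no small-sample correction); the slides do not spell out the formula, so treat this as one reasonable reading.

```python
import math

def information_content(column, alphabet="ACGT"):
    """Bits of information at one aligned position, assuming a uniform background."""
    freqs = [column.count(a) / len(column) for a in alphabet]
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    return math.log2(len(alphabet)) - entropy

def logo_heights(column, alphabet="ACGT"):
    """Height of each letter: its frequency times the column's information content."""
    ic = information_content(column, alphabet)
    return {a: ic * column.count(a) / len(column) for a in alphabet}

# Hypothetical column: the letters seen at one motif position across four binding sites.
print(logo_heights("AACC"))
```

A perfectly conserved column reaches the full 2 bits, while a column with all four letters equally frequent has height zero.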
Problem Definition
Given a collection of promoter sequences $s_1, \dots, s_N$ of genes with common expression

Probabilistic
  Motif: $M_{ij}$; $1 \le i \le W$, $1 \le j \le 4$
  $M_{ij}$ = Prob[ letter j, pos i ]
  Find best M, and positions $p_1, \dots, p_N$ in sequences

Combinatorial
  Motif M: $m_1 \dots m_W$
  Some of the $m_i$’s blank
  Find M that occurs in all $s_i$ with $\le k$ differences
Discrete Formulations
Given sequences $S = \{x_1, \dots, x_n\}$
• A motif W is a consensus string $w_1 \dots w_K$
• Find motif W* with “best” match to $x_1, \dots, x_n$
Definition of “best”:
  $d(W, x_i)$ = min Hamming distance between W and any word in $x_i$
  $d(W, S) = \sum_i d(W, x_i)$
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T ($4^K$ possibilities)
  Find d( W, S )
Report W* = argmin( d(W, S) )
Running time: $O(K N 4^K)$
(where $N = \sum_i |x_i|$)
Advantage: Finds provably “best” motif W
Disadvantage: Time
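The pattern-driven search can be sketched in a few lines; it is only practical for small K because it enumerates all $4^K$ words. The function names and the toy sequences are hypothetical.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word of x."""
    K = len(W)
    return min(hamming(W, x[s:s + K]) for s in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

def pattern_driven(S, K):
    """Score every one of the 4^K words; return (best word, its total distance)."""
    candidates = ("".join(w) for w in product("ACGT", repeat=K))
    return min(((W, d_total(W, S)) for W in candidates), key=lambda t: t[1])

S = ["TTACGTTT", "GGACGTGG", "CCACCTCC"]
print(pattern_driven(S, 4))
```

Every sequence is assumed to be at least K letters long; ties between equally good words are broken by enumeration order.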
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi
Find d( W, S )
Report W* = argmin( d( W, S ) )
or, report a local improvement of W*
Running time: $O(K N^2)$
Advantage: Time
Disadvantage: If the true motif is weak and does not occur in data
then a random motif may score better than any instance of true motif
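For comparison, a sketch of the sample-driven variant: the candidate set is restricted to K-long words that actually occur in the data, which gives the $O(K N^2)$ bound. Names and toy sequences are hypothetical, and the optional local-improvement step is not shown.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_total(W, S):
    """d(W, S): for each sequence, the best Hamming match to W, summed over S."""
    K = len(W)
    return sum(min(hamming(W, x[s:s + K]) for s in range(len(x) - K + 1)) for x in S)

def sample_driven(S, K):
    """Score only K-long words that occur somewhere in S."""
    words = {x[s:s + K] for x in S for s in range(len(x) - K + 1)}
    return min(((W, d_total(W, S)) for W in sorted(words)), key=lambda t: t[1])

S = ["TTACGTTT", "GGACGTGG", "CCACCTCC"]
print(sample_driven(S, 4))
```

If every instance of the true motif is mutated, the true consensus is never among the candidates, which is exactly the weakness noted on the slide.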
MULTIPROFILER
• Extended sample-driven approach
Given a K-long word W, define:
Nα(W) = words W’ in S s.t. d(W,W’) ≤ α
Idea:
Assume W is occurrence of true motif W*
Will use Nα(W) to correct “errors” in W
MULTIPROFILER
Assume W differs from true motif W* in at most L positions
Define:
A wordlet G of W is an L-long pattern with blanks, differing from W
L is smaller than the word length K
Example:
K = 7; L = 3
W = ACGTTGA
G = --A--CG
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G,
       a. Modify W by the wordlet G → W’
       b. Compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above: smaller motif-finding problem; use exhaustive search
Expectation Maximization in Motif Finding
Expectation Maximization
[Figure: all K-long words, each classified as motif or background]
Algorithm (sketch):
1. Given genomic sequences find all k-long words
2. Assume each word is motif or background
3. Find likeliest
   Motif Model
   Background Model
   classification of words into either Motif or Background
Expectation Maximization
Given sequences $x_1, \dots, x_N$,
• Find all k-long words $X_1, \dots, X_n$
• Define motif model:
  $M = (M_1, \dots, M_K)$
  $M_i = (M_{i1}, \dots, M_{i4})$ (assume {A, C, G, T})
  where $M_{ij}$ = Prob[ letter j occurs in motif position i ]
• Define background model:
  $B = (B_1, \dots, B_4)$
  $B_j$ = Prob[ letter j in background sequence ]
[Figure: each word is drawn either from the motif model, with columns $M_1 \dots M_K$ over A, C, G, T, or from the background model B]
Expectation Maximization
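• Define
  $Z_{i1}$ = 1 if $X_i$ is motif; 0 otherwise
  $Z_{i2}$ = 0 if $X_i$ is motif; 1 otherwise
• Given a word $X_i = x[s] \dots x[s+k-1]$,
  $P[X_i, Z_{i1} = 1] = \lambda\, M_{1,x[s]} \cdots M_{k,x[s+k-1]}$
  $P[X_i, Z_{i2} = 1] = (1 - \lambda)\, B_{x[s]} \cdots B_{x[s+k-1]}$
  Let $\lambda_1 = \lambda$; $\lambda_2 = (1 - \lambda)$
[Figure: a word comes from the motif model with probability $\lambda$ and from the background model with probability $1 - \lambda$]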
Expectation Maximization
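Define:
  Parameter space $\theta = (M, B)$
  $\theta_1$: Motif; $\theta_2$: Background
Objective:
  Maximize log likelihood of model:
  $\log P(X_1, \dots, X_n, Z \mid \theta, \lambda) = \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log(\lambda_j P(X_i \mid \theta_j))$
  $= \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log P(X_i \mid \theta_j) + \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log \lambda_j$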
Expectation Maximization
• Maximize expected likelihood, in iteration of two steps:
  Expectation: find expected value of log likelihood:
  $E[\log P(X_1, \dots, X_n, Z \mid \theta, \lambda)]$
  Maximization: maximize expected value over $\theta$, $\lambda$
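Expectation Maximization: E-step
Expectation:
Find expected value of log likelihood:
  $E[\log P(X_1, \dots, X_n, Z \mid \theta, \lambda)] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log P(X_i \mid \theta_j) + \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log \lambda_j$
where expected values of Z can be computed as follows:
  $E[Z_{ij}] = \mathrm{Prob}[Z_{ij} = 1] = \frac{\lambda_j P(X_i \mid \theta_j)}{\lambda P(X_i \mid \theta_1) + (1 - \lambda) P(X_i \mid \theta_2)} = Z^*_{ij}$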
Expectation Maximization: M-step
Maximization:
Maximize expected value over $\theta$ and $\lambda$ independently
For $\lambda$, this has the following solution (we won’t prove it):
  $\lambda^{NEW} = \arg\max_{\lambda} \sum_{i=1}^{n} \left( Z^*_{i1} \log \lambda + Z^*_{i2} \log(1 - \lambda) \right) = \frac{1}{n} \sum_{i=1}^{n} Z^*_{i1}$
Effectively, $\lambda^{NEW}$ is the expected # of motifs per position, given our current parameters
Expectation Maximization: M-step
• For $\theta = (M, B)$, define
  $c_{jk}$ = E[ # times letter k appears in motif position j ]
  $c_{0k}$ = E[ # times letter k appears in background ]
• $c_{jk}$ values are calculated easily from Z* values
It then follows:
  $M^{NEW}_{jk} = \frac{c_{jk}}{\sum_{k'=1}^{4} c_{jk'}}$
  $B^{NEW}_{k} = \frac{c_{0k}}{\sum_{k'=1}^{4} c_{0k'}}$
To not allow any 0’s, add pseudocounts
Initial Parameters Matter!
Consider the following artificial example:
6-mers X1, …, Xn: (n = 2000)
  990 words “AAAAAA”
  990 words “CCCCCC”
  20 words “ACACAC”
Some local maxima:
  $\lambda$ = 49.5%; B = 100/101 C, 1/101 A; M = 100% AAAAAA
  $\lambda$ = 1%; B = 50% C, 50% A; M = 100% ACACAC
Overview of EM Algorithm
1. Initialize parameters $\theta = (M, B)$, $\lambda$:
   Try different values of $\lambda$ from $N^{-1/2}$ up to $1/(2K)$
2. Repeat:
   a. Expectation
   b. Maximization
3. Until change in $\theta = (M, B)$, $\lambda$ falls below $\epsilon$
4. Report results for several “good” $\lambda$
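The E-step and M-step above can be put together in a compact sketch over a list of K-long words. Several choices here are assumptions for illustration and not taken from the lecture: the motif model is initialized from a seed word (0.7 for the seed letter, 0.1 otherwise), the starting $\lambda$ is a single fixed value rather than a scan, the stopping rule watches only $\lambda$, and the toy word list is invented.

```python
import math

ALPHABET = "ACGT"

def em_motif(words, seed, lam=0.1, iters=50, pseudo=0.1, eps=1e-6):
    K = len(seed)
    # Assumed initialization: favor the seed letter at each motif position.
    M = [{a: (0.7 if a == seed[j] else 0.1) for a in ALPHABET} for j in range(K)]
    B = {a: 0.25 for a in ALPHABET}
    for _ in range(iters):
        # E-step: Z[i] = posterior probability that word i is a motif instance.
        Z = []
        for X in words:
            pm = lam * math.prod(M[j][X[j]] for j in range(K))
            pb = (1 - lam) * math.prod(B[a] for a in X)
            Z.append(pm / (pm + pb))
        # M-step: lambda is the mean of Z; M and B come from expected counts.
        new_lam = sum(Z) / len(words)
        c = [{a: pseudo for a in ALPHABET} for _ in range(K)]  # c_jk with pseudocounts
        c0 = {a: pseudo for a in ALPHABET}                     # c_0k with pseudocounts
        for X, z in zip(words, Z):
            for j, a in enumerate(X):
                c[j][a] += z
                c0[a] += 1 - z
        M = [{a: cj[a] / sum(cj.values()) for a in ALPHABET} for cj in c]
        B = {a: c0[a] / sum(c0.values()) for a in ALPHABET}
        converged = abs(new_lam - lam) < eps
        lam = new_lam
        if converged:
            break
    return M, B, lam

# Hypothetical data: 30 motif words among 70 unrelated background words.
background = ["TTTTTT", "GGGGGG", "CATGCA", "TGCATG", "GTACGT", "CCCCCC", "AAAAAA"]
words = ["ACGTAC"] * 30 + [w for w in background for _ in range(10)]
M, B, lam = em_motif(words, "ACGTAC")
consensus = "".join(max(col, key=col.get) for col in M)
print(consensus, round(lam, 3))
```

In practice one would restart from several seeds and several starting values of $\lambda$, as step 1 suggests, and compare the resulting likelihoods.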
Gibbs Sampling
• Given: x1, …, xN, motif length K, background B
• Find:
  Model M
  Locations a1, …, aN in x1, …, xN
  maximizing the log-odds likelihood ratio:
  $\sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k, x^i_{a_i + k})}{B(x^i_{a_i + k})}$
Gibbs Sampling
• AlignACE: first statistical motif finder
• BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences x1, …, xN
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence xi
   b. Recalculate model
   c. Pick a new location of motif in xi according to probability the location is a motif occurrence
Gibbs Sampling
Initialization:
• Select random locations $a_1, \dots, a_N$ in $x_1, \dots, x_N$
• For these locations, compute M:
  $M_{kj} = \frac{1}{N + B} \left( \beta_j + \sum_{i=1}^{N} \chi(x^i_{a_i + k} = j) \right)$
  where $\chi(\cdot)$ is 1 when its argument holds and 0 otherwise, $\beta_j$ are pseudocounts to avoid 0s, and $B = \sum_j \beta_j$
• That is, $M_{kj}$ is the number of occurrences of letter j in motif position k, over the total
Gibbs Sampling
Predictive Update:
• Select a sequence $x = x_i$
• Remove $x_i$, recompute model:
  $M_{kj} = \frac{1}{(N - 1) + B} \left( \beta_j + \sum_{s=1,\, s \ne i}^{N} \chi(x^s_{a_s + k} = j) \right)$
  where $\beta_j$ are pseudocounts to avoid 0s, and $B = \sum_j \beta_j$
Gibbs Sampling
Sampling:
For every K-long word $x_j, \dots, x_{j+k-1}$ in x:
  $Q_j$ = Prob[ word | motif ] = $M(1, x_j) \cdots M(k, x_{j+k-1})$
  $P_j$ = Prob[ word | background ] = $B(x_j) \cdots B(x_{j+k-1})$
Let
  $A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$
Sample a random new position $a_i$ according to the probabilities $A_1, \dots, A_{|x|-k+1}$.
[Figure: probability of each start position, plotted from 0 to |x|]
Gibbs Sampling
Running Gibbs Sampling:
1. Initialize
2. Run until convergence
3. Repeat 1,2 several times, report common motifs
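The three phases (initialization, predictive update, sampling) can be sketched as a single loop. This is a bare-bones illustration, not AlignACE or BioProspector: the function names, the uniform pseudocount `beta`, the fixed number of sweeps in place of a convergence test, and the toy sequences are all assumptions.

```python
import math
import random

ALPHABET = "ACGT"

def build_model(seqs, pos, K, skip=None, beta=0.5):
    """M[k][j]: pseudocount-smoothed frequency of letter j at motif position k."""
    used = [i for i in range(len(seqs)) if i != skip]
    total = len(used) + beta * len(ALPHABET)  # N + B, or (N - 1) + B when skipping
    M = []
    for k in range(K):
        counts = {a: beta for a in ALPHABET}
        for i in used:
            counts[seqs[i][pos[i] + k]] += 1
        M.append({a: counts[a] / total for a in ALPHABET})
    return M

def gibbs(seqs, K, background, sweeps=200, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(x) - K + 1) for x in seqs]   # initialization
    for _ in range(sweeps):
        for i, x in enumerate(seqs):
            M = build_model(seqs, pos, K, skip=i)           # predictive update
            # Odds Q_j / P_j for every K-long word starting at j in x.
            odds = [math.prod(M[k][x[j + k]] / background[x[j + k]] for k in range(K))
                    for j in range(len(x) - K + 1)]
            # choices() normalizes the weights, so this samples from A_j.
            pos[i] = rng.choices(range(len(odds)), weights=odds)[0]
    return pos, build_model(seqs, pos, K)

seqs = ["TTTACGTACTT", "GGACGTACGGG", "CACGTACCCCC", "AAAAACGTACA"]
bg = {a: 0.25 for a in ALPHABET}
pos, M = gibbs(seqs, 6, bg, sweeps=20)
print(pos)
```

Because the result depends on the random start, the sampler would normally be rerun with several seeds and the recurring motifs reported, as the slide recommends.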
Advantages / Disadvantages
• Very similar to EM
Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics
Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space
Repeats, and a Better Background Model
• Repeat DNA can be confused as motif
  Especially low-complexity CACACA…, AAAAA, etc.
Solution:
more elaborate background model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
Has been applied to EM and Gibbs (up to 3rd order)
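A short sketch of why a higher-order background helps with repeats: under a 1st-order model trained on CA-rich sequence, the word CACACA is far more probable as background than under a 0th-order model, so its motif-versus-background odds drop. The function names, the pseudocount, the uniform fallback for unseen contexts, and the training sequence are assumptions.

```python
import math
from collections import defaultdict

def train_background(seqs, order, pseudo=1.0, alphabet="ACGT"):
    """Estimate P(next letter | previous `order` letters) with pseudocounts."""
    counts = defaultdict(lambda: {a: pseudo for a in alphabet})
    for x in seqs:
        for t in range(order, len(x)):
            counts[x[t - order:t]][x[t]] += 1
    return {ctx: {a: c[a] / sum(c.values()) for a in alphabet}
            for ctx, c in counts.items()}

def log_prob(word, model, order, alphabet="ACGT"):
    """log P(word | background); the first `order` letters have no context and are skipped."""
    uniform = 1.0 / len(alphabet)
    total = 0.0
    for t in range(order, len(word)):
        probs = model.get(word[t - order:t])
        total += math.log(probs[word[t]] if probs else uniform)  # unseen context: uniform
    return total

seqs = ["CACACACACACACACA"]  # hypothetical low-complexity background
m0 = train_background(seqs, 0)
m1 = train_background(seqs, 1)
print(log_prob("CACACA", m0, 0), log_prob("CACACA", m1, 1))
```

With order 0 the only context is the empty string, so the model reduces to plain letter frequencies.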
Limits of Motif Finders
• Given upstream regions of coregulated genes:
  Increasing length makes motif finding harder: random motifs clutter the true ones
  Decreasing length makes motif finding harder: true motif missing in some sequences
Motif Challenge problem:
Find a (15,4) motif in N sequences of length
[Figure: an upstream region ending at position 0, where the gene starts, with the motif location unknown]
Example Application: Motifs in Yeast
Group:
Tavazoie et al. 1999, G. Church’s lab, Harvard
Data:
• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)
• 15 time points across two cell cycles
1. Clustering genes according to common expression
• K-means clustering -> 30 clusters, 50-190 genes/cluster
• Clusters correlate well with known function
2. AlignACE motif finding
• 600-long upstream regions
Motifs are preferentially conserved across evolution
[Figure: multiple alignment of the region between GAL10 and GAL1 in four yeast species (Scer, Spar, Smik, Sbay), with conserved columns starred. Labeled features: GAL4 sites, MIG1, TBP, factor footprint, conservation island.]
slide credits: M. Kellis
Is this enough to discover motifs? No.
Comparison-based Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
slide credits: M. Kellis
Known motifs are frequently conserved
• Across the human promoter regions, the Err motif:
  appears 434 times
  is conserved 162 times
  Conservation rate: 37%
[Figure: aligned human, dog, mouse, and rat promoters with Err motif instances marked]
• Compare to random control motifs
  – Conservation rate of control motifs: 6.8%
  – Err enrichment: 5.4-fold
  – Err p-value < $10^{-50}$ (25 standard deviations under binomial)
Motif Conservation Score (MCS)
slide credits: M. Kellis
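The numbers on this slide can be approximately reproduced with a few lines of arithmetic. The sketch assumes the "standard deviations under binomial" figure is a z-score of the conserved count against a binomial with the control conservation rate; the function name is hypothetical.

```python
import math

def conservation_stats(total, conserved, control_rate):
    """Conservation rate, fold enrichment over controls, and binomial z-score."""
    rate = conserved / total
    enrichment = rate / control_rate
    expected = total * control_rate
    sd = math.sqrt(total * control_rate * (1 - control_rate))
    z = (conserved - expected) / sd
    return rate, enrichment, z

rate, enrichment, z = conservation_stats(434, 162, 0.068)
print(f"rate={rate:.3f} enrichment={enrichment:.2f} z={z:.1f}")
```

With the Err counts this gives a rate of about 37%, an enrichment of roughly 5.5-fold, and a z-score of about 25, in line with the rounded values quoted above.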
Finding conserved motifs in whole genomes
M. Kellis PhD Thesis on yeasts, X. Xie & M. Kellis on mammals
1. Define seed “mini-motifs”
2. Filter and isolate mini-motifs that are more conserved than average
3. Extend mini-motifs to full motifs
4. Validate against known databases of motifs & annotations
5. Report novel motifs
slide credits: M. Kellis
Test 1: Intergenic conservation
[Figure: scatter plot of conserved count against total count for each pattern; CGG-11-CCG is labeled.]
slide credits: M. Kellis
Test 2: Intergenic vs. Coding
[Figure: scatter plot of intergenic conservation against coding conservation; CGG-11-CCG is labeled, and one region is marked "Higher Conservation in Genes".]
slide credits: M. Kellis
Test 3: Upstream vs. Downstream
[Figure: scatter plot of upstream conservation against downstream conservation, with labels "CGG-11-CCG", "Most Patterns", and "Downstream motifs?".]
slide credits: M. Kellis
Constructing full motifs
[Figure: pipeline from 2,000 mini-motifs to 72 full motifs. Labeled stages: Test 1, Test 2, Test 3, repeated Extend and Collapse steps, and Merge; example motif logos are shown along the way.]
slide credits: M. Kellis
Summary for promoter motifs
Columns: Rank, Discovered motif, Known TF motif, Tissue enrichment, Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1(*) Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY Yes Yes
22 TGCGCAnK Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
• 174 promoter motifs
  70 match known TF motifs
  115 expression enrichment
  60 show positional bias
  75% have evidence
• Control sequences
  < 2% match known TF motifs
  < 5% expression enrichment
  < 3% show positional bias
  < 7% false positives
Most discovered motifs are likely to be functional
[Table annotation: discovered motifs with no known TF match are marked “New”.]
slide credits: M. Kellis