REGULATORY MOTIF FINDING
Mohammed AlQuraishi
TALK OUTLINE
Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders’ Performance
CELL = FACTORY, PROTEINS = MACHINES
Biovisions, Harvard
DNA
“Coding” Regions: Instructions for making the machines
“Regulatory” Regions (Regulons): Instructions for when and where to make them
TRANSCRIPTIONAL REGULATION
Regulatory regions are composed of “binding sites”
“Binding sites” attract a special class of proteins, known as “transcription factors”
Bound transcription factors can inhibit DNA transcription
DNA REGULATION
Source: Richardson, University College London
CELL REGULATION
Transcriptional regulation is one of many regulatory mechanisms in the cell
Focus of Talk
Source: Mallery, University of Miami
STRUCTURAL BASIS OF INTERACTION
Key Feature: Transcription factors are not 100% specific when binding DNA
Not one sequence, but a family of sequences, with varying affinities
[Figure: example binding sequences with relative affinities 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]
TALK OUTLINE
Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders’ Performance
MOTIF FINDING
Basic Objective: Find regions in the genome that bind transcription factors
Many classes of algorithms, differing in:
Types of input data
Motif representation
INPUT DATA
Single sequence
Evolutionarily related set of sequences
Sequence + other data
Microarray expression profile
ChIP-chip
Others…
MOTIF REPRESENTATION
Probabilistic
Word-Based
Focus of Talk
MOTIF REPRESENTATION
Structural discussion immediately raises difficulties
Least Expressive: Single sequence
Most Expressive: 4^k-dimensional probability distribution
Independently assign a probability to each possible kmer
MOTIF REPRESENTATION
Standard Solution: Position-Specific Scoring Matrix (PSSM)
Assuming independence of positions, assign a probability to each base at each position
Fraught with problems… (Will revisit this)
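The contrast between the two representations can be made concrete with a minimal sketch. It builds a PSSM as column-wise base frequencies and scores a kmer as a product of per-position probabilities. The aligned sites, the function names, and the absence of pseudocounts are assumptions made for illustration, not anything taken from the talk.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sites):
    """Column-wise base frequencies of equal-length aligned sites."""
    k = len(sites[0])
    pssm = []
    for i in range(k):
        counts = Counter(site[i] for site in sites)
        pssm.append({b: counts[b] / len(sites) for b in BASES})
    return pssm

def kmer_probability(pssm, kmer):
    """Probability of a kmer under the independence assumption:
    a product of one probability per position."""
    p = 1.0
    for column, base in zip(pssm, kmer):
        p *= column[base]
    return p

# Hypothetical aligned binding sites, chosen only for illustration.
sites = ["AGTGGGAC", "AGTGGGAC", "AGTGCGAC", "AGTGCTAC"]
pssm = build_pssm(sites)
k = len(sites[0])

print(kmer_probability(pssm, "AGTGGGAC"))
# Free parameters: one probability per kmer (minus normalization)
# versus three free probabilities per position.
print("full distribution:", 4**k - 1)
print("PSSM:", 3 * k)
```

The parameter counts show why the PSSM is attractive in practice: it grows linearly in k, while the fully general distribution grows exponentially.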
TALK OUTLINE
Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders’ Performance
REFERENCE
Authors: Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou
Title: MotifCut: regulatory motifs finding with maximum density subgraphs
Publication: Bioinformatics Vol. 22 no. 14 2006, pages e150–e157
OVERVIEW
Motif Finding Algorithm (“MotifCut”)
Motivation
Oversimplicity of PSSMs
Intractability of more complex models
OVERSIMPLICITY OF PSSMS
Assumes independence between positions
~25% of TRANSFAC motifs have been shown to violate this assumption
Two Examples: ADR1 and YAP6
Generates potentially unseen motifs (kmers never observed among the real binding sites)
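A small sketch of the "unseen motifs" problem: when two positions are correlated, a PSSM built from the sites still assigns probability to combinations that never occur. The two-base sites below are hypothetical and chosen so the correlation is perfect.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

# Hypothetical sites: position 1 and position 2 are perfectly correlated
# (A is always followed by T, C is always followed by G).
sites = ["AT", "AT", "CG", "CG"]

# Column-wise frequencies, i.e. a PSSM that ignores the correlation.
pssm = []
for i in range(2):
    counts = Counter(site[i] for site in sites)
    pssm.append({b: counts[b] / len(sites) for b in BASES})

# Every kmer the PSSM can generate with non-zero probability.
generated = {}
for first, second in product(BASES, repeat=2):
    prob = pssm[0][first] * pssm[1][second]
    if prob > 0:
        generated[first + second] = prob

unseen = sorted(set(generated) - set(sites))
unseen_mass = sum(generated[kmer] for kmer in unseen)
print(unseen, unseen_mass)
```

In this toy case half of the probability mass lands on kmers that were never observed, which is the kind of leakage the slide refers to.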
BASIC FEATURES OF MOTIFCUT
Does not assume an underlying PSSM
Represents a motif with a graph structure
In principle maximally expressive
In practice not quite
Motif finding cast as maximum density subgraph
Subquadratic complexity
MOTIF GRAPH REPRESENTATION
Nodes are kmers
Edge weights are distances between kmers
Generative model: Frequency of kmer node equal to frequency of generating kmer
Distance definition is complicated (Will come back to)
Same kmer node can appear multiple times
[Figure: graph with kmer nodes AGTGGGAC, AGTGGGAC, AGTGCGAC, AGTGCTAC; edges labeled with pairwise distances 0, 1, 1, 1, 2, 2]
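The edge labels in the example graph are consistent with plain Hamming distances between the four kmers. A short sketch, assuming a complete graph with one node per kmer occurrence (so the repeated kmer gets two nodes):

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))

# The four kmer nodes from the example; the first kmer appears twice.
nodes = ["AGTGGGAC", "AGTGGGAC", "AGTGCGAC", "AGTGCTAC"]

# One edge per pair of nodes, keyed by node indices.
edges = {
    (i, j): hamming(nodes[i], nodes[j])
    for i, j in combinations(range(len(nodes)), 2)
}

for (i, j), d in edges.items():
    print(nodes[i], nodes[j], d)
```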
MOTIF FINDING
Find highest density subgraph
Density is defined as sum of edge weights per node
Somewhat limits representational power
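The density definition can be sketched directly, together with an exhaustive search that is only feasible for tiny graphs. The edge weights below are hypothetical, and larger weights are assumed to mean "more likely in the same motif".

```python
from itertools import combinations

def density(nodes, weights):
    """Sum of edge weights inside `nodes`, divided by the number of nodes."""
    total = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(nodes, 2))
    return total / len(nodes)

def densest_subgraph_bruteforce(all_nodes, weights):
    """Try every non-empty subset; exponential, for illustration only."""
    subsets = (
        subset
        for size in range(1, len(all_nodes) + 1)
        for subset in combinations(all_nodes, size)
    )
    best = max(subsets, key=lambda s: density(s, weights))
    return set(best), density(best, weights)

# Hypothetical weights: a, b, c are mutually similar; d is loosely attached.
weights = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.7,
    frozenset(("c", "d")): 0.1,
}

best_nodes, best_density = densest_subgraph_bruteforce(["a", "b", "c", "d"], weights)
print(best_nodes, best_density)
```

Adding d lowers the density because it contributes one more node but very little edge weight, so the search settles on the triangle.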
MOTIF FINDING
Read new sequence
Generate graph as previously described
Kmers are generated by shifting one base pair
Each kmer in the sequence gets a node, including identical kmers
Graph contains as many nodes as there are base pairs
Connect edges with weights based on distances between nodes
Find densest subgraph
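A minimal sketch of the graph construction step, assuming a sliding window of width k and plain Hamming distance as the raw distance between nodes. Strictly, a sequence of length L yields L - k + 1 windows, which is roughly one node per base pair. The sequence and function names are illustrative.

```python
from itertools import combinations

def sliding_kmers(sequence, k):
    """One (position, kmer) node per window; identical kmers are kept as separate nodes."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_distance_graph(sequence, k):
    nodes = sliding_kmers(sequence, k)
    edges = {
        (i, j): hamming(kmer_i, kmer_j)
        for (i, kmer_i), (j, kmer_j) in combinations(nodes, 2)
    }
    return nodes, edges

# Hypothetical toy sequence; "AC" occurs three times and gets three nodes.
nodes, edges = build_distance_graph("ACACAC", k=2)
print(nodes)
print(edges[(0, 2)], edges[(0, 1)])
```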
EDGE WEIGHTS
Heart of the algorithm, will focus on this
Semantics: Edge weight is the likelihood that two kmers are in the same motif
Use Hamming distance as a way to quantify distance between kmers
“Interpret” Hamming distance as a measure of the likelihood that two kmers are in the same motif:
F(Hamming distance) = likelihood that two kmers are in the same motif
EDGE WEIGHTS
Let’s make this a bit more precise:
But how to compute F?
Simulate it!
Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc…
“GENOME SIMULATION”
Background + Motifs
No genes, promoters, signaling sequences, etc.
Background Model
3rd order Markov model
Probability of next base depends on previous 3 bases
Modeled on the yeast genome
Incorporates GC bias
Motif Model
PSSM
Based on empirically observed information content of yeast motifs
“GENOME SIMULATION”
Use Markov model to generate background sequences 10k – 20k in length
Seed with 20 motifs generated by the PSSM
Result is a simulated yeast genome
We know which parts are the real motifs, and which are not
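The simulation setup can be sketched as follows. This is a toy version under stated assumptions: the Markov model is trained on a short made-up string rather than the yeast genome, the PSSM is hypothetical, and implanted sites are kept non-overlapping for simplicity.

```python
import random

BASES = "ACGT"

def train_markov(training_seq, order=3, pseudocount=1.0):
    """Count next-base occurrences for every context of `order` bases."""
    counts = {}
    for i in range(order, len(training_seq)):
        context = training_seq[i - order:i]
        row = counts.setdefault(context, {b: pseudocount for b in BASES})
        row[training_seq[i]] += 1
    return counts

def generate_background(model, length, order, rng):
    """Sample a sequence where each base depends on the previous `order` bases."""
    seq = [rng.choice(BASES) for _ in range(order)]
    uniform = {b: 1.0 for b in BASES}
    while len(seq) < length:
        row = model.get("".join(seq[-order:]), uniform)
        seq.append(rng.choices(BASES, weights=[row[b] for b in BASES])[0])
    return "".join(seq)

def sample_site(pssm, rng):
    """Draw one motif instance, one independent base per PSSM column."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pssm
    )

def implant_motifs(background, pssm, n_sites, rng):
    """Overwrite non-overlapping windows with sampled sites; return the start positions."""
    seq, k, starts = list(background), len(pssm), []
    while len(starts) < n_sites:
        s = rng.randrange(len(seq) - k + 1)
        if all(abs(s - t) >= k for t in starts):
            starts.append(s)
            seq[s:s + k] = sample_site(pssm, rng)
    return "".join(seq), sorted(starts)

rng = random.Random(0)
toy_training = "ACGTTGCAAGGCTTAACCGGTATATGCGC" * 10   # stand-in for real genomic sequence
strong = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}  # hypothetical PSSM column
pssm = [strong] * 6

model = train_markov(toy_training, order=3)
background = generate_background(model, 10_000, 3, rng)
genome, motif_starts = implant_motifs(background, pssm, 20, rng)
print(len(genome), motif_starts[:5])
```

Because `motif_starts` records where the sites were implanted, every kmer in the simulated sequence can later be labeled as a true motif or as background.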
EDGE WEIGHTS
Back to F:
α(k, l) is the number of true motifs of length k that are distance l away
β(k, l) is the number of non-motifs of length k that are distance l away
EDGE WEIGHTS
[Figure: kmers GGGGGG, GGGCGG, GGGGGG, GGGGGG, GTGGGG, labeled as True Motifs or False Motifs (Part of Background)]
Let’s perform calculation from the perspective of this motif
[Figure: GGGGGG with its neighbors GGGCGG and GTGGGG]
All ≤1 distance away (Hamming distance)
α(k = 6, l = 1) = 1
β(k = 6, l = 1) = 1
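The counting in this example can be sketched in a few lines. Two things here are assumptions rather than facts from the slides: which neighbor is the true motif and which is background (GGGCGG is taken as true, GTGGGG as background), and the final ratio α / (α + β), which is one natural way to turn the counts into an estimate but is not confirmed by the surviving text.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def count_neighbors(reference, true_sites, background_kmers, l):
    """alpha: true motif kmers at distance l from the reference;
    beta: background kmers at distance l from the reference."""
    alpha = sum(hamming(reference, s) == l for s in true_sites)
    beta = sum(hamming(reference, s) == l for s in background_kmers)
    return alpha, beta

reference = "GGGGGG"
true_sites = ["GGGCGG"]        # ASSUMPTION: labeled as a true motif
background_kmers = ["GTGGGG"]  # ASSUMPTION: labeled as background

alpha, beta = count_neighbors(reference, true_sites, background_kmers, l=1)

# ASSUMPTION: fraction of distance-l neighbors that are true motifs.
estimate = alpha / (alpha + beta)
print(alpha, beta, estimate)
```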
EDGE WEIGHTS
Computation provides an empirical estimate for F
Parameterized by two quantities:
k, the kmer length
l, the Hamming distance between two kmers
Fit to a sigmoidal function
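A rough sketch of what fitting a sigmoid to such estimates might look like for one fixed k. The data points are invented purely for illustration, the two-parameter logistic form is an assumption, and a coarse grid search stands in for whatever fitting procedure the authors used.

```python
import math

def sigmoid_weight(l, steepness, midpoint):
    """Decreasing logistic curve: near 1 for small distances, near 0 for large ones."""
    return 1.0 / (1.0 + math.exp(steepness * (l - midpoint)))

def fit_sigmoid(points):
    """Least-squares fit by grid search over steepness in [0.1, 5] and midpoint in [0, 8]."""
    best = None
    for s10 in range(1, 51):
        for m10 in range(0, 81):
            s, m = s10 / 10, m10 / 10
            err = sum((sigmoid_weight(l, s, m) - y) ** 2 for l, y in points)
            if best is None or err < best[0]:
                best = (err, s, m)
    return best

# Made-up (Hamming distance, estimated likelihood) pairs, not real results.
points = [(0, 0.98), (1, 0.90), (2, 0.60), (3, 0.20), (4, 0.05)]

error, steepness, midpoint = fit_sigmoid(points)
print(steepness, midpoint, error)
```

The fitted curve gives a smooth weight for any distance l, which is convenient when the empirical counts are noisy or sparse.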
EDGE WEIGHTS
Normalization step
Won’t go into details
This covers problem formulation
How is motif finding actually done?
MAXIMUM DENSITY SUBGRAPH
Standard graph theory method
Max-flow / min-cut
O(nm log(n²/m))
Need faster method
Developed heuristic approach that utilizes max-flow / min-cut method with modifications
MAXIMUM DENSITY SUBGRAPH
Remove all edges below a certain threshold
Pick one vertex (do this for every vertex)
Put back all neighboring edges for that vertex
Use standard algorithm to calculate densest subgraph
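The four steps above can be sketched as a loop over seed vertices. One substitution is made and should be read as such: instead of the exact max-flow / min-cut routine, the inner step uses greedy peeling (repeatedly dropping the lowest-degree node), a simple approximation chosen to keep the example short. The graph and threshold are hypothetical.

```python
def build_adjacency(weighted_edges):
    adj = {}
    for u, v, w in weighted_edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def weighted_density(nodes, adj):
    """Sum of edge weights inside `nodes` per node (each edge is seen twice, hence / 2)."""
    total = sum(w for u in nodes for v, w in adj[u].items() if v in nodes) / 2.0
    return total / len(nodes)

def greedy_peel(adj):
    """Stand-in for the exact algorithm: peel the weakest node, keep the best prefix."""
    current = set(adj)
    best, best_density = set(current), weighted_density(current, adj)
    while len(current) > 1:
        worst = min(
            sorted(current),
            key=lambda u: sum(w for v, w in adj[u].items() if v in current),
        )
        current.remove(worst)
        d = weighted_density(current, adj)
        if d > best_density:
            best, best_density = set(current), d
    return best, best_density

def seeded_search(adj, threshold):
    best, best_density = set(), 0.0
    for seed in adj:
        # Drop weak edges everywhere, but keep every edge touching the seed.
        local = {
            u: {v: w for v, w in nbrs.items() if w >= threshold or seed in (u, v)}
            for u, nbrs in adj.items()
        }
        candidate, d = greedy_peel(local)
        if d > best_density:
            best, best_density = candidate, d
    return best, best_density

# Hypothetical similarity-weighted graph.
adj = build_adjacency([
    ("a", "b", 0.9), ("a", "c", 0.8), ("b", "c", 0.7),
    ("c", "d", 0.1), ("d", "e", 0.2),
])
motif_nodes, motif_density = seeded_search(adj, threshold=0.5)
print(motif_nodes, motif_density)
```

Thresholding keeps each per-seed graph sparse, which is where the speedup over running the exact method on the full graph would come from.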
RESULTS
Synthetic Tests
Plenty of test cases
Measure performance as data set size grows
Avoid over-biasing on empirical data
Know real answer, can unambiguously test performance
Yeast Test
Gold standard data (Harbison et al., 2004)
SYNTHETIC TESTS
Varied:
Motif length
Information content
Simulated genome (as before)
Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7
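The true-positive criterion can be sketched as a Pearson correlation over the flattened matrices. This assumes the predicted and real PSSMs have the same length and are already aligned; how offsets were handled in the actual evaluation is not stated here. The example matrices are hypothetical.

```python
import math

BASES = "ACGT"

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def pssm_correlation(p, q):
    """Correlation between two equal-length PSSMs, compared entry by entry."""
    flat_p = [col[b] for col in p for b in BASES]
    flat_q = [col[b] for col in q for b in BASES]
    return pearson(flat_p, flat_q)

def is_true_positive(predicted, real, threshold=0.7):
    return pssm_correlation(predicted, real) > threshold

# Hypothetical two-column PSSMs.
real = [{"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05}, {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
close = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, {"A": 0.1, "C": 0.8, "G": 0.05, "T": 0.05}]
unrelated = [{"A": 0.05, "C": 0.05, "G": 0.1, "T": 0.8}, {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]

print(is_true_positive(close, real), is_true_positive(unrelated, real))
```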
SYNTHETIC TESTS RESULTS
YEAST TEST RESULTS
PERFORMANCE
TALK OUTLINE
Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders’ Performance
Shorter but drier (no pretty pictures)
REFERENCE
Authors: Patrick Ng, Niranjan Nagarajan, Neil Jones, and Uri Keich
Title: Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone
Publication: Bioinformatics Vol. 22 no. 14 2006, pages e393–e401
OVERVIEW
Twilight Zone
Non-negligible probability that a maximally scoring random motif would have a higher score than motifs that overlap the “real” motif
Motivation
Behavior of Motif Finders in Twilight Zone is poorly understood
Understanding would aid in development of Motif Finders
Sheds light on whether finding such motifs is theoretically possible
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
E-VALUE
E-value is defined in terms of information content
Information Content
[Equation: information content of an alignment, annotated with the terms below]
Alignment length
Number of sequences
Alphabet size
Frequency of jth letter at ith position
Background frequency of jth letter
E-value
Expected number of random alignments exhibiting an information content at least as high as that of the given alignment
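The equation itself did not survive in this transcript, so the sketch below uses the standard relative-entropy form built from the labeled terms: n times the sum over positions i and letters j of f_ij · log(f_ij / b_j). Treat that form, and the choice of natural logarithm, as assumptions. The alignments are hypothetical.

```python
import math
from collections import Counter

def information_content(alignment, background):
    """n * sum_i sum_j f_ij * log(f_ij / b_j) for an ungapped alignment.

    alignment  : list of equal-length strings (n sequences, length L)
    background : dict mapping each letter to its background frequency b_j
    """
    n, length = len(alignment), len(alignment[0])
    total = 0.0
    for i in range(length):
        counts = Counter(row[i] for row in alignment)
        for letter, count in counts.items():
            f = count / n  # frequency of this letter at position i
            total += f * math.log(f / background[letter])
    return n * total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

conserved = ["ACGT"] * 4             # every column perfectly conserved
scrambled = ["A", "C", "G", "T"]     # one column matching the background

print(information_content(conserved, uniform))
print(information_content(scrambled, uniform))
```

A perfectly conserved column contributes the most, while a column that mirrors the background contributes nothing.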
ANALYSIS
Generate 400 random datasets
Dataset = 40 sequences totaling 1485 bases
Implant a single motif of length 13 per dataset
High likelihood that motif finders would miss it
RESULTS
Reported E-value: 8 × 10^15
Very high, very statistically insignificant
In principle, theoretically impossible to find
Search results
Alignment covering ≥30% of motif found in 288/400 cases!
Data generated exactly in accordance with E-value model
WHAT’S GOING ON?
They don’t know, hand-wave it
Many “satellite” alignments boost up effective score
Difficult to characterize analytically
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
NEW METRIC: OPV
Also defined in terms of Information Content
OPV(s) (Overall p-value)
Probability that a random sample of the same size as the input set will contain an alignment with at least as much information content as s
Contrast
E-value: Expected number of alignments (in general)
OPV: Probability of finding an alignment in a dataset
ESTIMATION
Caveat
Random sample (no biasing)
Difficult to calculate analytically
Estimate empirically
General OPV
Finder-specific OPV
GENERAL OPV ESTIMATION
Generate 1600 random datasets
No implants
Run a collection of motif finders on each dataset
Pick highest scoring motif in each dataset
Out of all finders
Sort scores, then pick the score s0 at the 95% quantile
Score such that 95% of scores are below it, 5% above it
Meaning
95% of the time, highest scoring random motif scored less than s0
A score ≥ s0 arises from random data in ≤ 5% of datasets
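The quantile procedure can be sketched independently of any particular motif finder. Here the list of best scores per random dataset is replaced by Gaussian draws, which is purely a stand-in: in the real procedure each value would be the top score returned by the finders on one implant-free dataset.

```python
import random

def quantile_threshold(null_scores, q=0.95):
    """Score s0 such that a fraction q of the null scores lie below it."""
    ordered = sorted(null_scores)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

def empirical_opv(score, null_scores):
    """Fraction of random datasets whose best score is at least `score`."""
    return sum(s >= score for s in null_scores) / len(null_scores)

rng = random.Random(0)
# Stand-in for "highest scoring motif in each of 1600 random datasets".
null_scores = [rng.gauss(50.0, 5.0) for _ in range(1600)]

s0 = quantile_threshold(null_scores, q=0.95)
print(s0, empirical_opv(s0, null_scores))

def is_significant(score):
    # A reported motif is called significant only if it reaches the threshold.
    return score >= s0
```

The same two functions cover the finder-specific variant: only the way `null_scores` is produced changes, using a single finder instead of a collection.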
GENERAL OPV RESULTS
Run on previous 400 datasets
90% of correct runs (288/400) were classified as noise
Not good…
FINDER-SPECIFIC OPV ESTIMATION
Same as before, but use only one finder
Better biased toward the parameter space of the specific finder
FINDER-SPECIFIC OPV RESULTS
Tested it on Gibbs
Same 400 datasets
228 TPs
13 FPs
Much better…
USING OPV
Impractical
A priori generation is prohibitive given parameter space of motif finders
Per problem estimation is prohibitive
Requires ~100x more runs
Not theirs…
ANOTHER METRIC: ILR (INCOMPLETE LIKELIHOOD RATIO)
Not defined in terms of Information Content
[Equation: ILR, annotated with the terms below]
Number of sequences
Length of nth sequence
Length of motif
Probability of subsequence starting at m to be the motif (Motif PSSM)
Probability of subsequence starting at m to be background (Background PSSM)
Ratio of null hypothesis to OOPS hypothesis
OOPS: One occurrence per sequence
Intuition behind it
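Since the ILR equation is not preserved here, the following is a hedged reconstruction from the labeled terms: for each sequence, average the motif-to-background probability ratio over all possible start positions (one occurrence per sequence, uniform over positions), then combine sequences by summing logs. The exact form, the position-independent background, and all example values are assumptions.

```python
import math

def window_ratio(window, motif_pssm, background):
    """P(window | motif PSSM) / P(window | background), position by position."""
    ratio = 1.0
    for column, base in zip(motif_pssm, window):
        ratio *= column[base] / background[base]
    return ratio

def log_ilr(sequences, motif_pssm, background):
    """Sum over sequences of log(average ratio over all start positions m)."""
    width = len(motif_pssm)
    total = 0.0
    for seq in sequences:
        n_starts = len(seq) - width + 1
        average = sum(
            window_ratio(seq[m:m + width], motif_pssm, background)
            for m in range(n_starts)
        ) / n_starts
        total += math.log(average)
    return total

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical two-column motif favoring "AC"; no zero entries, so the log is defined.
motif_pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

with_motif = ["GGACGG", "TTACTT"]
without_motif = ["GGGGGG", "TTTTTT"]
print(log_ilr(with_motif, motif_pssm, background))
print(log_ilr(without_motif, motif_pssm, background))
```

The intuition is that a single strong window is enough to lift a sequence's average well above 1, while sequences without the motif pull the score down.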
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
MOTIF FINDING USING ILR
Used existing algorithms, ranked final output by ILR
Developed simple new algorithm that uses ILR as objective function
ILR MOTIF FINDING RESULTS
Promising…
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
One More Thing!
THANK YOU