
REGULATORY MOTIF FINDING

Mohammed AlQuraishi

TALK OUTLINE

Biology Background

Algorithmic Problem

Papers

New Motif Finding Algorithm (MotifCut)

Analysis of Motif Finders’ Performance

CELL = FACTORY, PROTEINS = MACHINES

Biovisions, Harvard

Instructions for making the machines

DNA

“Coding” Regions

Instructions for when and where to make them

“Regulatory” Regions (Regulons)

TRANSCRIPTIONAL REGULATION

Regulatory regions are composed of “binding sites”

“Binding sites” attract a special class of proteins, known as “transcription factors”

Bound transcription factors can inhibit DNA transcription

DNA REGULATION

Source: Richardson, University College London

CELL REGULATION

Transcriptional regulation is one of many regulatory mechanisms in the cell

Focus of Talk

Source: Mallery, University of Miami

STRUCTURAL BASIS OF INTERACTION


Key Feature: Transcription factors are not 100% specific when binding DNA

Not one sequence, but a family of sequences, with varying affinities

[Figure: example sequences with affinities 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]

TALK OUTLINE

Biology Background

Algorithmic Problem



MOTIF FINDING

Basic Objective: Find regions in the genome that bind transcription factors

Many classes of algorithms, which differ in:

Types of input data

Motif representation

INPUT DATA

Single sequence

Evolutionarily related set of sequences

Sequence + other data:

Microarray expression profile

ChIP-chip

Others…

MOTIF REPRESENTATION

Probabilistic

Word-Based

Focus of Talk


Structural discussion immediately raises difficulties



Least Expressive: Single sequence

Most Expressive: 4^k-dimensional probability distribution

Independently assign a probability to each possible kmer

Standard Solution: Position-Specific Scoring Matrix (PSSM)

Assuming independence of positions, assign a probability to each base at each position

Fraught with problems… (Will revisit this)
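A minimal sketch of the PSSM idea, assuming a pseudocount of 1 and a small set of hypothetical binding sites (neither comes from the talk). It estimates a per-position base distribution and scores a kmer as the product of per-position probabilities, which is exactly the independence assumption.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length sites."""
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for i in range(k):
        counts = Counter(s[i] for s in sites)
        pssm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pssm

def kmer_probability(pssm, kmer):
    """Probability of a kmer, treating positions as independent."""
    p = 1.0
    for column, base in zip(pssm, kmer):
        p *= column[base]
    return p

# Hypothetical sites, for illustration only.
sites = ["AGTGGGAC", "AGTGGGAC", "AGTGCGAC", "AGTGCTAC"]
pssm = build_pssm(sites)
p_seen = kmer_probability(pssm, "AGTGGGAC")    # kmer present in the input
p_unseen = kmer_probability(pssm, "AGTGGTAC")  # kmer absent from the input
print(p_seen, p_unseen)
```

Note that the unseen kmer still receives a nonzero probability, because the model mixes columns freely.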

TALK OUTLINE

Biology Background

Algorithmic Problem



REFERENCE

Authors: Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou

Title: MotifCut: regulatory motifs finding with maximum density subgraphs

Publication: Bioinformatics Vol. 22 no. 14 2006, pages e150–e157

OVERVIEW

Motif Finding Algorithm (“MotifCut”)

Motivation:

Oversimplicity of PSSMs

Intractability of more complex models

OVERSIMPLICITY OF PSSMS

Assumes independence between positions

~25% of TRANSFAC motifs have been shown to violate this assumption

Two Examples: ADR1 and YAP6


Generates potentially unseen motifs

BASIC FEATURES OF MOTIFCUT

Does not assume an underlying PSSM

Represents a motif with a graph structure

In principle maximally expressive

In practice not quite

Motif finding cast as maximum density subgraph

Subquadratic complexity

MOTIF GRAPH REPRESENTATION

Nodes are kmers

Edge weights are distances between kmers

Generative model: Frequency of kmer node equal to frequency of generating kmer

Distance definition is complicated (Will come back to)

Same kmer node can appear multiple times

[Figure: example graph with nodes AGTGGGAC, AGTGGGAC, AGTGCGAC, AGTGCTAC; edges labelled with distances 0, 1, 2, 1, 1, 2]
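A small sketch of the graph on the four kmers shown on this slide, using plain Hamming distance as the edge label. The node indexing and the dictionary layout are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))

# One node per kmer occurrence, so the repeated kmer gets two nodes.
kmers = ["AGTGGGAC", "AGTGGGAC", "AGTGCGAC", "AGTGCTAC"]

# Complete graph: one edge per unordered pair of nodes.
edges = {
    (i, j): hamming(kmers[i], kmers[j])
    for i, j in combinations(range(len(kmers)), 2)
}

for (i, j), d in edges.items():
    print(kmers[i], kmers[j], d)
```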

MOTIF FINDING

Find highest density subgraph

Density is defined as sum of edge weights per node

Somewhat limits representational power

MOTIF FINDING

Read new sequence

Generate graph as previously described:

Kmers are generated by shifting one base pair

Each kmer in the sequence gets a node, including identical kmers

Graph contains as many nodes as there are base pairs

Connect edges with weights based on distances between nodes

Find densest subgraph
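The node-generation step can be sketched as a sliding window. The sequence below is a made-up example; the function name is hypothetical.

```python
def sliding_kmers(sequence, k):
    """Return (start, kmer) for every window of length k, shifting by one base."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

# Hypothetical 16-base sequence containing the same 8-mer twice.
sequence = "AGTGGGACAGTGGGAC"
nodes = sliding_kmers(sequence, 8)

# Identical kmers at different positions remain separate nodes.
print(len(nodes), nodes[0], nodes[-1])
```

Strictly, a sequence of n bases yields n - k + 1 windows, which is close to n when the sequence is much longer than k.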

EDGE WEIGHTS

Heart of the algorithm, will focus on this

Semantics: Edge weight is the likelihood of two kmers to be in the same motif

Use Hamming distance as a way to quantify distance between kmers

[Figure: Hamming distance example; recovered labels: TT, AA, CC and 0, 1, 2, 3]


“Interpret” Hamming distance as a measure of the likelihood of two kmers to be in the same motif:

F(Hamming distance) = likelihood of two kmers to be in the same motif

EDGE WEIGHTS

Let’s make this a bit more precise:

But how to compute this likelihood?

Simulate it!

Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.

“GENOME SIMULATION”

Background + Motifs

No genes, promoters, signaling sequences, etc.

Background Model: 3rd order Markov model

Probability of next base depends on previous 3 bases

Modeled on the yeast genome

Incorporates GC bias

Motif Model: PSSM

Based on empirically observed information content of yeast motifs

“GENOME SIMULATION”

Use Markov model to generate 10k – 20k length sequences of background

Seed with 20 motifs generated by the PSSM

Result is a simulated genome of yeast

We know which parts are the real motifs, and which are not
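A rough sketch of this simulation, under loud assumptions: the training string stands in for the yeast genome, the motif PSSM is invented, and implanted sites are kept from overlapping. None of these details come from the paper.

```python
import random

BASES = "ACGT"

def train_markov(training, order=3):
    """Map each context of `order` bases to the bases observed after it."""
    table = {}
    for i in range(len(training) - order):
        table.setdefault(training[i:i + order], []).append(training[i + order])
    return table

def generate_background(table, length, rng, order=3):
    """Sample a sequence where each base depends on the previous `order` bases."""
    seq = list(rng.choice(sorted(table)))
    while len(seq) < length:
        followers = table.get("".join(seq[-order:]))
        # Fall back to a uniform draw for contexts never seen in training.
        seq.append(rng.choice(followers) if followers else rng.choice(BASES))
    return "".join(seq[:length])

def sample_site(pssm, rng):
    """Draw one motif instance, column by column, from a PSSM."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pssm
    )

def implant(background, pssm, count, rng):
    """Overwrite `count` non-overlapping windows with sampled motif instances."""
    seq, k, starts = list(background), len(pssm), []
    while len(starts) < count:
        s = rng.randrange(len(seq) - k + 1)
        if all(abs(s - t) >= k for t in starts):
            starts.append(s)
            seq[s:s + k] = sample_site(pssm, rng)
    return "".join(seq), sorted(starts)

# Hypothetical stand-ins for the yeast-derived models.
TRAINING = "ATGCGATATTAGCATTTAAGCGATATATCGTTAAGCATATTTGCAATTAGC" * 4
MOTIF_PSSM = [{b: (0.85 if b == c else 0.05) for b in BASES} for c in "TGACTC"]

rng = random.Random(0)
table = train_markov(TRAINING)
background = generate_background(table, 10_000, rng)
genome, starts = implant(background, MOTIF_PSSM, 20, rng)
print(len(genome), starts[:5])
```

Because `starts` records every implant position, each kmer in the simulated genome can be labelled as true motif or background.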

EDGE WEIGHTS

Back to the edge-weight likelihood:

α(k, l) is the number of true motifs of length k that are distance l away

β(k, l) is the number of non-motifs of length k that are distance l away

EDGE WEIGHTS

True Motifs

G G G G G G

G G G C G G

G G G G G G

G G G G G G

G T G G G G

False Motifs (Part of Background)

EDGE WEIGHTS

G G G G G G

G G G C G G

G G G G G G

G G G G G G

G T G G G G

All ≤1 distance away (Hamming distance)

α(k = 6, l = 1) = 1

β(k = 6, l = 1) = 1

Let’s perform calculation from the perspective of this motif

G G G G G G

G G G C G G

G T G G G G

EDGE WEIGHTS

Computation provides an empirical estimate of the edge-weight likelihood

Parameterized by two quantities:

k, the kmer length

l, the Hamming distance between two kmers

Fit to a sigmoidal function

EDGE WEIGHTS

Normalization step

Won’t go into details

This covers problem formulation

How is motif finding actually done?

MAXIMUM DENSITY SUBGRAPH

Standard graph theory method

Max-flow / min-cut

O(nm log(n^2/m))

Need faster method

Developed heuristic approach that utilizes max-flow / min-cut method with modifications


Remove all edges below a certain threshold


Pick one vertex (do this for every vertex)


Put back all neighboring edges for that vertex

MAXIMUM DENSITY SUBGRAPH

Use standard algorithm to calculate densest subgraph
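A sketch of one literal reading of the steps above. Two assumptions are swapped in and should be read as such: edge weights are treated as similarities (higher means more likely in the same motif), and the densest-subgraph step uses simple greedy peeling rather than the max-flow / min-cut method the paper relies on. All function names are hypothetical.

```python
def density(nodes, weights):
    """Sum of edge weights inside `nodes`, divided by the number of nodes."""
    if not nodes:
        return 0.0
    inside = sum(w for (u, v), w in weights.items() if u in nodes and v in nodes)
    return inside / len(nodes)

def greedy_densest(nodes, weights):
    """Greedy peeling: drop the lowest-degree node each round, keep the best set."""
    current = set(nodes)
    best, best_density = set(current), density(current, weights)
    while len(current) > 1:
        degree = {u: 0.0 for u in current}
        for (u, v), w in weights.items():
            if u in current and v in current:
                degree[u] += w
                degree[v] += w
        current.remove(min(sorted(current), key=lambda u: degree[u]))
        d = density(current, weights)
        if d > best_density:
            best, best_density = set(current), d
    return best, best_density

def heuristic_dense_subgraph(nodes, weights, threshold):
    """Threshold edges, restore each seed vertex's edges, search, keep the best."""
    strong = {e: w for e, w in weights.items() if w >= threshold}
    best, best_density = set(), 0.0
    for seed in nodes:
        edges = dict(strong)
        edges.update({e: w for e, w in weights.items() if seed in e})
        candidate, d = greedy_densest(nodes, edges)
        if d > best_density:
            best, best_density = candidate, d
    return best, best_density

# Toy graph: a tight triangle (0, 1, 2) plus two weakly attached nodes.
toy_nodes = [0, 1, 2, 3, 4]
toy_weights = {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9, (2, 3): 0.2, (3, 4): 0.1}
motif_nodes, motif_density = heuristic_dense_subgraph(toy_nodes, toy_weights, 0.5)
print(sorted(motif_nodes), motif_density)
```

Greedy peeling is only an approximation; it is used here to keep the example short and dependency-free.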

RESULTS

Synthetic Tests:

Plenty of test cases

Measure performance as data set size grows

Avoid over-biasing on empirical data

Know real answer, can unambiguously test performance

Yeast Test:

Gold standard data (Harbison et al., 2004)

SYNTHETIC TESTS

Varied:

Motif length

Information content

Simulated genome (as before)

Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7

SYNTHETIC TESTS RESULTS

YEAST TEST RESULTS

PERFORMANCE

TALK OUTLINE

Biology Background

Algorithmic Problem




Shorter but drier (no pretty pictures)

REFERENCE

Authors: Patrick Ng, Niranjan Nagarajan, Neil Jones, and Uri Keich

Title: Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone

Publication: Bioinformatics Vol. 22 no. 14 2006, pages e393–e401

OVERVIEW

Twilight Zone:

Non-negligible probability that a maximally scoring random motif would have a higher score than motifs that overlap the “real” motif

Motivation:

Behavior of Motif Finders in Twilight Zone is poorly understood

Understanding would aid in development of Motif Finders

Sheds light on whether it is theoretically possible

OBJECTIVES

Analyze existing standard (E-value)

Statistical significance of motifs in Twilight Zone

Examine and suggest new metrics

Employ new metric for motif finding

E-VALUE

E-value is defined in terms of information content

Information Content

E-value: Expected number of random alignments exhibiting an information content at least as high as that of the given alignment

Terms in the information content formula:

Alignment length

Number of sequences

Alphabet size

Frequency of jth letter at ith position

Background frequency of jth letter
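The formula itself did not survive on this slide. A sketch assuming the standard relative-entropy form, I = n * sum over positions i and letters j of f_ij * log(f_ij / b_j), with natural logarithms, built only from the terms listed above. The exact form and log base used in the paper may differ.

```python
import math
from collections import Counter

def information_content(alignment, background):
    """n * sum_i sum_j f_ij * log(f_ij / b_j) for a gapless alignment."""
    n = len(alignment)          # number of sequences
    length = len(alignment[0])  # alignment length
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = 0.0
    for i in range(length):
        counts = Counter(row[i] for row in alignment)
        for letter, count in counts.items():
            f = count / n       # frequency of this letter at position i
            total += f * math.log(f / background[letter])
    return n * total

uniform = {b: 0.25 for b in "ACGT"}

# Perfectly conserved alignment: every column contributes log(4) per sequence.
conserved = information_content(["ACGT"] * 4, uniform)

# Column matching the background exactly carries no information.
flat = information_content(["A", "C", "G", "T"], uniform)
print(conserved, flat)
```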

ANALYSIS

Generate 400 random datasets

Dataset = 40 sequences totaling 1485 bases

Implant a single motif of length 13 per dataset

High likelihood that motif finders would miss it

RESULTS

Reported E-value: 8 × 10^15

Very high, very statistically insignificant

In principle, theoretically impossible to find

Search results:

Alignment covering ≥30% of motif found in 288/400 cases!

Data generated exactly in accordance with E-value model

WHAT’S GOING ON?

They don’t know, hand-wave it

Many “satellite” alignments boost up effective score

Difficult to characterize analytically

OBJECTIVES




NEW METRIC: OPV

Also defined in terms of Information Content

OPV(s) (Overall p-value):

Probability that a random sample of the same size as the input set will contain an alignment with at least as much information content as s

Contrast:

E-value: Expected number of alignments (in general)

OPV: Probability of finding an alignment in a dataset

ESTIMATION

Caveat: Random sample (no biasing)

Difficult to calculate analytically

Estimate empirically:

General OPV

Finder-specific OPV

GENERAL OPV ESTIMATION

Generate 1600 random datasets

No implants

Run a collection of motif finders on each dataset

Pick the highest scoring motif in each dataset, out of all finders

Sort the scores, then pick the score s0 at the 95% quantile

s0 is the score such that 95% of scores are below it and 5% are above it

Meaning: 95% of the time, the highest scoring random motif scored less than s0

Obtaining a score ≥ s0 means ≤ 5% chance for the motif to be random
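The thresholding step can be sketched as an empirical quantile over null scores. The scores below are placeholder integers, not real finder output, and the nearest-rank quantile rule is an assumption; the paper may interpolate differently.

```python
def quantile_threshold(null_scores, percent=95):
    """Nearest-rank quantile: smallest score with at least `percent`% at or below it."""
    if not null_scores:
        raise ValueError("need at least one null score")
    ordered = sorted(null_scores)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def is_significant(score, s0):
    """Call a motif significant if it reaches the null threshold."""
    return score >= s0

# Placeholder: best score per random dataset (one entry per dataset).
null_best_scores = list(range(1, 101))
s0 = quantile_threshold(null_best_scores)
print(s0, is_significant(97, s0), is_significant(60, s0))
```

In the finder-specific variant, `null_best_scores` would come from a single finder rather than the best over a collection.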

GENERAL OPV RESULTS

Run on previous 400 datasets

90% of correct runs (288/400) were classified as noise

Not good…

FINDER-SPECIFIC OPV ESTIMATION

Same as before, but use only one finder

Better biased toward the parameter space of the specific finder

FINDER-SPECIFIC OPV RESULTS

Tested it on Gibbs

Same 400 datasets

228 TPs

13 FPs

Much better…

USING OPV

Impractical

A priori generation is prohibitive given parameter space of motif finders

Per problem estimation is prohibitive

Requires ~100x more runs

Not theirs…

ANOTHER METRIC: ILR (INCOMPLETE LIKELIHOOD RATIO)

Not defined in terms of Information Content

Terms in the ILR formula:

Number of sequences

Length of nth sequence

Length of motif

Probability of subsequence starting at m to be the motif (Motif PSSM)

Probability of subsequence starting at m to be background (Background PSSM)

Ratio of null hypothesis to OOPS hypothesis

OOPS: One occurrence per sequence

Intuition behind it
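The ILR formula is missing from the slide, so the following is only a sketch built from the listed terms. It assumes one motif occurrence per sequence with a uniform prior over start positions, a single background distribution, and reports the log of the OOPS likelihood over the null likelihood. The orientation of the ratio and the use of logs are assumptions.

```python
import math

def window_ratio(window, motif, background):
    """Motif probability over background probability for one window."""
    ratio = 1.0
    for column, base in zip(motif, window):
        ratio *= column[base] / background[base]
    return ratio

def log_ilr(sequences, motif, background):
    """Sum over sequences of log(mean window ratio over all start positions)."""
    w = len(motif)  # length of motif
    total = 0.0
    for seq in sequences:
        starts = len(seq) - w + 1  # number of possible start positions m
        if starts < 1:
            raise ValueError("sequence shorter than motif")
        mean_ratio = sum(
            window_ratio(seq[m:m + w], motif, background) for m in range(starts)
        ) / starts
        total += math.log(mean_ratio)
    return total

uniform = {b: 0.25 for b in "ACGT"}

# Hypothetical 2-column motif strongly preferring "AC".
ac_motif = [
    {b: (0.97 if b == c else 0.01) for b in "ACGT"} for c in "AC"
]
flat_motif = [dict(uniform), dict(uniform)]

with_site = log_ilr(["GGACGG"], ac_motif, uniform)
no_signal = log_ilr(["GGACGG"], flat_motif, uniform)
print(with_site, no_signal)
```

A motif model identical to the background gives a log ratio of zero, while a sequence containing a strong site pushes it above zero.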

OBJECTIVES




MOTIF FINDING USING ILR

Used existing algorithms, ranked final output by ILR

Developed simple new algorithm that uses ILR as objective function

ILR MOTIF FINDING RESULTS

ILR MOTIF FINDING RESULTS

Promising…

OBJECTIVES





One More Thing!

THANK YOU
