REGULATORY MOTIF FINDING Mohammed AlQuraishi

Post on 16-Jan-2016

TALK OUTLINE

Biology Background

Algorithmic Problem

Papers
  New Motif Finding Algorithm (MotifCut)
  Analysis of Motif Finders’ Performance

CELL = FACTORY, PROTEINS = MACHINES

Biovisions, Harvard

DNA

“Coding” Regions: instructions for making the machines

“Regulatory” Regions (Regulons): instructions for when and where to make them

TRANSCRIPTIONAL REGULATION

Regulatory regions are composed of “binding sites”

“Binding sites” attract a special class of proteins, known as “transcription factors”

Bound transcription factors can promote or inhibit DNA transcription

DNA REGULATION

Source: Richardson, University College London

CELL REGULATION

Transcriptional regulation is one of many regulatory mechanisms in the cell

Focus of Talk

Source: Mallery, University of Miami

STRUCTURAL BASIS OF INTERACTION

Key Feature: Transcription factors are not 100% specific when binding DNA

Not one sequence, but a family of sequences with varying affinities

[Figure: family of binding sequences labeled with the values 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]

TALK OUTLINE

Biology Background

Algorithmic Problem

Papers
  New Motif Finding Algorithm (MotifCut)
  Analysis of Motif Finders’ Performance

MOTIF FINDING

Basic Objective: Find regions in the genome that bind transcription factors

Many classes of algorithms, which differ in:
  Types of input data
  Motif representation

INPUT DATA

Single sequence

Evolutionarily related set of sequences

Sequence + other data
  Microarray expression profile
  ChIP-chip
  Others…

MOTIF REPRESENTATION

Probabilistic

Word-Based

Focus of Talk

MOTIF REPRESENTATION

Structural discussion immediately raises difficulties

Least Expressive: Single sequence

Most Expressive: 4^k-dimensional probability distribution
  Independently assign a probability to each possible kmer

MOTIF REPRESENTATION

Standard Solution: Position-Specific Scoring Matrix (PSSM)
  Assuming independence of positions, assign a probability to each base at each position

Fraught with problems… (Will revisit this)
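A minimal sketch of what the independence assumption means in practice, and of how much smaller a PSSM is than the fully expressive model. The 3-position matrix and the function name `pssm_probability` are hypothetical, chosen only for illustration.

```python
from math import prod

# Hypothetical 3-position PSSM: one dict of base probabilities per position.
pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def pssm_probability(kmer, pssm):
    """Probability of a kmer when every position is scored independently."""
    if len(kmer) != len(pssm):
        raise ValueError("kmer length must match the PSSM width")
    return prod(column[base] for column, base in zip(pssm, kmer))


k = len(pssm)
full_model_size = 4 ** k  # one probability per possible kmer
pssm_size = 4 * k         # one probability per base per position

print(pssm_probability("AGT", pssm), full_model_size, pssm_size)
```

The gap between `4 ** k` and `4 * k` grows quickly with k, which is why the PSSM is the standard compromise despite its limitations.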

TALK OUTLINE

Biology Background

Algorithmic Problem

Papers
  New Motif Finding Algorithm (MotifCut)
  Analysis of Motif Finders’ Performance

REFERENCE

Authors: Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou

Title: MotifCut: regulatory motifs finding with maximum density subgraphs

Publication: Bioinformatics Vol. 22 no. 14 2006, pages e150–e157

OVERVIEW

Motif Finding Algorithm (“MotifCut”)

Motivation
  Oversimplicity of PSSMs
  Intractability of more complex models

OVERSIMPLICITY OF PSSMS

Assumes independence between positions
  ~25% of TRANSFAC motifs have been shown to violate this assumption
  Two Examples: ADR1 and YAP6

Generates potentially unseen motifs
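The "unseen motifs" point can be illustrated with a toy case. The sketch below builds a PSSM from two hypothetical binding sites, `AC` and `GT`, and lists every kmer the matrix assigns non-zero probability; the sites are invented for illustration and do not come from the talk.

```python
from collections import Counter
from itertools import product

sites = ["AC", "GT"]  # hypothetical observed binding sites

# Per-position base frequencies estimated from the observed sites.
columns = [Counter(site[i] for site in sites) for i in range(len(sites[0]))]
pssm = [{base: col[base] / len(sites) for base in "ACGT"} for col in columns]

# Every kmer to which the PSSM assigns non-zero probability.
generated = {}
for kmer in map("".join, product("ACGT", repeat=len(pssm))):
    p = 1.0
    for col, base in zip(pssm, kmer):
        p *= col[base]
    if p > 0:
        generated[kmer] = p

# kmers the model treats as motif instances although they were never observed.
unseen = sorted(set(generated) - set(sites))
print(generated, unseen)
```

Because the two positions are perfectly correlated in the input but modeled independently, the PSSM gives the never-observed kmers the same probability as the observed ones.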

BASIC FEATURES OF MOTIFCUT

Does not assume an underlying PSSM

Represents a motif with a graph structure
  In principle maximally expressive
  In practice not quite

Motif finding cast as maximum density subgraph
  Subquadratic complexity

MOTIF GRAPH REPRESENTATION

Nodes are kmers

Edge weights are distances between kmers

Generative model: Frequency of kmer node equal to frequency of generating kmer

Distance definition is complicated (will come back to this)

Same kmer node can appear multiple times

[Figure: example graph with nodes AGTGGGAC, AGTGGGAC, AGTGCGAC, AGTGCTAC and edges labeled 0, 1, 2, 1, 1, 2]
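As a small illustration of the graph on this slide, the sketch below computes the pairwise Hamming distances between the four example kmers. The function and variable names are illustrative, and Hamming distance is used here only because the later slides introduce it as the way distance between kmers is quantified.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))


# The four kmer nodes from the slide; the first two are identical on purpose.
nodes = ["AGTGGGAC", "AGTGGGAC", "AGTGCGAC", "AGTGCTAC"]

# One edge per unordered pair of nodes, keyed by node index.
edges = {
    (i, j): hamming(nodes[i], nodes[j])
    for i, j in combinations(range(len(nodes)), 2)
}

for (i, j), d in edges.items():
    print(nodes[i], nodes[j], d)
```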

MOTIF FINDING

Find highest density subgraph

Density is defined as sum of edge weights per node

Somewhat limits representational power
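A minimal sketch of the density definition above: the sum of edge weights inside a node set, divided by the number of nodes. The graph, its weights, and the name `density` are hypothetical; the weights are treated as "same motif" likelihoods, so higher means more similar.

```python
def density(nodes, weights):
    """Sum of edge weights with both endpoints in `nodes`, per node."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    total = sum(w for (u, v), w in weights.items() if u in nodes and v in nodes)
    return total / len(nodes)


# Hypothetical weighted graph: a, b, c are tightly linked, d is loosely attached.
weights = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("b", "c"): 0.7,
    ("c", "d"): 0.1,
}

print(density({"a", "b", "c"}, weights))
print(density({"a", "b", "c", "d"}, weights))
```

Adding the weakly connected node `d` lowers the density, which is the sense in which the densest subgraph picks out a tight cluster of similar kmers.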

MOTIF FINDING

Read new sequence

Generate graph as previously described
  Kmers are generated by shifting one base pair
  Each kmer in the sequence gets a node, including identical kmers
  Graph contains about as many nodes as there are base pairs (n - k + 1 for a sequence of length n)
  Connect edges with weights based on distances between nodes

Find densest subgraph
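The node-generation step can be sketched as a sliding window. The function name `kmer_nodes` and the example sequence are hypothetical; each node keeps its start position so that identical kmers remain separate nodes.

```python
def kmer_nodes(sequence, k):
    """One (start, kmer) node per window, shifting one base pair at a time."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


# Hypothetical input: the same 8-mer occurs twice and yields two distinct nodes.
sequence = "AGTGGGACAGTGGGAC"
nodes = kmer_nodes(sequence, 8)

print(len(sequence), len(nodes))
print(nodes[0], nodes[-1])
```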

EDGE WEIGHTS

Heart of the algorithm, will focus on this

Semantics: Edge weight is the likelihood that two kmers are in the same motif

Use Hamming distance as a way to quantify distance between kmers

[Figure: example kmers at Hamming distances 0, 1, 2, 3]

“Interpret” Hamming distance as a measure of the likelihood that two kmers are in the same motif:
  F(Hamming distance) = likelihood that two kmers are in the same motif

EDGE WEIGHTS

Let’s make this a bit more precise:

But how to compute it?

Simulate it!
  Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.

“GENOME SIMULATION”

Background + Motifs
  No genes, promoters, signaling sequences, etc.

Background Model
  3rd order Markov model
    Probability of next base depends on previous 3 bases
  Modeled on the yeast genome
  Incorporates GC bias

Motif Model
  PSSM
  Based on empirically observed information content of yeast motifs

“GENOME SIMULATION”

Use Markov model to generate 10k–20k length sequences of background

Seed with 20 motifs generated by the PSSM

Result is a simulated yeast genome
  We know which parts are the real motifs, and which are not
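A rough sketch of the simulation described above, under loud assumptions: the transition table below uses one placeholder AT-rich distribution for every 3-base context instead of probabilities estimated from the yeast genome, the motif PSSM is invented, and all function names are hypothetical. Motifs are implanted at non-overlapping positions, which the slides do not specify.

```python
import random
from itertools import product

BASES = "ACGT"


def sample_base(rng, probs):
    """Draw one base from a dict of base probabilities."""
    return rng.choices(BASES, weights=[probs[b] for b in BASES])[0]


def simulate_background(length, transitions, rng):
    """3rd-order Markov chain: the next base depends on the previous 3 bases."""
    seq = [rng.choice(BASES) for _ in range(3)]
    while len(seq) < length:
        seq.append(sample_base(rng, transitions["".join(seq[-3:])]))
    return seq[:length]


def simulate_genome(length, n_motifs, transitions, pssm, rng):
    """Background sequence with PSSM-sampled motifs implanted at known starts."""
    seq = simulate_background(length, transitions, rng)
    width = len(pssm)
    # Choose distinct width-sized slots so implanted motifs never overlap.
    slots = rng.sample(range(length // width), n_motifs)
    starts = sorted(slot * width for slot in slots)
    for start in starts:
        seq[start:start + width] = [sample_base(rng, col) for col in pssm]
    return "".join(seq), starts


# Placeholder background: the same AT-rich distribution for every context.
transitions = {
    "".join(ctx): {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}
    for ctx in product(BASES, repeat=3)
}
# Hypothetical motif PSSM that strongly prefers the consensus GATTACA.
motif_pssm = [
    {b: (0.85 if b == c else 0.05) for b in BASES} for c in "GATTACA"
]

genome, motif_starts = simulate_genome(
    10_000, 20, transitions, motif_pssm, random.Random(0)
)
print(len(genome), motif_starts[:5])
```

Because the implant positions are returned alongside the sequence, every kmer can later be labeled as motif or background, which is what the edge-weight estimation relies on.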

EDGE WEIGHTS

Back to the edge weights:

α(k, l) is the number of true motifs of length k that are at Hamming distance l

β(k, l) is the number of non-motifs of length k that are at Hamming distance l

EDGE WEIGHTS

[Figure: example 6-mers GGGGGG, GGGCGG, GGGGGG, GGGGGG, GTGGGG, grouped into “True Motifs” and “False Motifs (Part of Background)”]

Let’s perform the calculation from the perspective of this motif: GGGGGG

All are ≤ 1 distance away (Hamming distance)
  α(k = 6, l = 1) = 1
  β(k = 6, l = 1) = 1
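The counting on this slide can be sketched as follows. Which of the two distance-1 neighbors is the true motif and which is background cannot be recovered from the transcript, so the labels below are an assumption made purely for illustration; the function names are hypothetical as well.

```python
def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def count_alpha_beta(reference, labeled_kmers, l):
    """Count true-motif (alpha) and non-motif (beta) kmers at distance l."""
    alpha = sum(
        1 for kmer, is_motif in labeled_kmers
        if is_motif and hamming(reference, kmer) == l
    )
    beta = sum(
        1 for kmer, is_motif in labeled_kmers
        if not is_motif and hamming(reference, kmer) == l
    )
    return alpha, beta


reference = "GGGGGG"
# Assumed labels: True = part of a real motif, False = background.
labeled_kmers = [("GGGCGG", True), ("GTGGGG", False)]

alpha, beta = count_alpha_beta(reference, labeled_kmers, l=1)
print(alpha, beta)
```

In a simulated genome the labels come for free, since the implant positions are known, so these counts can be accumulated over many reference kmers.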

EDGE WEIGHTS

Computation provides an empirical estimate of the likelihood that two kmers are in the same motif

Parameterized by two quantities:
  k, the kmer length
  l, the Hamming distance between two kmers

Fit to a sigmoidal function

EDGE WEIGHTS

Normalization step
  Won’t go into details

This covers problem formulation
  How is motif finding actually done?

MAXIMUM DENSITY SUBGRAPH

Standard graph theory method
  Max-flow / min-cut
  O(nm log(n^2/m))

Need faster method

Developed a heuristic approach that uses the max-flow / min-cut method with modifications

MAXIMUM DENSITY SUBGRAPH

Remove all edges below a certain threshold

Pick one vertex (do this for every vertex)

Put back all neighboring edges for that vertex

Use standard algorithm to calculate densest subgraph
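The four steps above can be sketched on a toy graph. This is not the authors' implementation: an exhaustive search over node subsets stands in for the max-flow / min-cut routine (feasible only for a handful of nodes), and the graph, threshold, and function names are all hypothetical.

```python
from itertools import combinations


def density(nodes, weights):
    """Sum of edge weights with both endpoints in `nodes`, per node."""
    nodes = set(nodes)
    total = sum(w for (u, v), w in weights.items() if u in nodes and v in nodes)
    return total / len(nodes) if nodes else 0.0


def densest_subgraph_bruteforce(vertices, weights):
    """Exhaustive search; a stand-in for the max-flow / min-cut method."""
    best, best_density = frozenset(), 0.0
    for size in range(2, len(vertices) + 1):
        for subset in combinations(vertices, size):
            d = density(subset, weights)
            if d > best_density:
                best, best_density = frozenset(subset), d
    return best, best_density


def heuristic_densest(weights, threshold):
    vertices = sorted({v for edge in weights for v in edge})
    # Step 1: remove all edges below the threshold.
    strong = {e: w for e, w in weights.items() if w >= threshold}
    best, best_density = frozenset(), 0.0
    # Step 2: pick one vertex at a time, for every vertex.
    for v in vertices:
        # Step 3: put back all edges touching the chosen vertex.
        local = dict(strong)
        local.update({e: w for e, w in weights.items() if v in e})
        # Step 4: run the densest-subgraph routine on the reduced graph.
        candidate, d = densest_subgraph_bruteforce(vertices, local)
        if d > best_density:
            best, best_density = candidate, d
    return best, best_density


weights = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7,
    ("c", "d"): 0.1, ("d", "e"): 0.2,
}
best, best_density = heuristic_densest(weights, threshold=0.5)
print(sorted(best), best_density)
```

Thresholding shrinks the edge set the exact routine has to process, while restoring one vertex's edges at a time keeps weak but relevant links available for that vertex's neighborhood.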

RESULTS

Synthetic Tests
  Plenty of test cases
  Measure performance as data set size grows
  Avoid over-biasing on empirical data
  Know the real answer, so can unambiguously test performance

Yeast Test
  Gold standard data (Harbison et al., 2004)

SYNTHETIC TESTS

Varied:
  Motif length
  Information content

Simulated genome (as before)

Correlated predicted PSSMs to real ones; counted as a true positive if correlation > 0.7

SYNTHETIC TESTS RESULTS

YEAST TEST RESULTS

PERFORMANCE

TALK OUTLINE

Biology Background

Algorithmic Problem

Papers
  New Motif Finding Algorithm (MotifCut)
  Analysis of Motif Finders’ Performance

Shorter but drier (no pretty pictures)

REFERENCE

Authors: Patrick Ng, Niranjan Nagarajan, Neil Jones, and Uri Keich

Title: Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone

Publication: Bioinformatics Vol. 22 no. 14 2006, pages e393–e401

OVERVIEW

Twilight Zone
  Non-negligible probability that a maximally scoring random motif would have a higher score than motifs that overlap the “real” motif

Motivation
  Behavior of Motif Finders in the Twilight Zone is poorly understood
  Understanding would aid in the development of Motif Finders
  Sheds light on whether it is theoretically possible

OBJECTIVES

Analyze existing standard (E-value)
  Statistical significance of motifs in Twilight Zone

Examine and suggest new metrics

Employ new metric for motif finding

E-VALUE

E-value is defined in terms of information content

Information Content

$$I = n \sum_{i=1}^{L} \sum_{j=1}^{A} f_{ij} \log \frac{f_{ij}}{b_j}$$

  L: alignment length
  n: number of sequences
  A: alphabet size
  f_{ij}: frequency of jth letter at ith position
  b_j: background frequency of jth letter

E-value
  Expected number of random alignments exhibiting an information content at least as high as that of the given alignment
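A minimal sketch of an information-content calculation built from the quantities this slide names (alignment length, number of sequences, per-position letter frequencies, background frequencies). The exact formula is not legible in the transcript; the form used below, the number of sequences times the sum over positions and letters of f·log(f/b), is a common definition and should be read as an assumption, as should the base-2 logarithm and the function name.

```python
from collections import Counter
from math import log2


def information_content(alignment, background):
    """n * sum over positions i and letters j of f_ij * log2(f_ij / b_j)."""
    n = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = 0.0
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        for letter, count in counts.items():  # letters with f = 0 add nothing
            f = count / n
            total += f * log2(f / background[letter])
    return n * total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Perfectly conserved alignment: every column contributes log2(1 / 0.25) bits.
print(information_content(["ACGT"] * 4, uniform))
# Column that matches the background exactly: no information.
print(information_content(["A", "C", "G", "T"], uniform))
```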

ANALYSIS

Generate 400 random datasets
  Dataset = 40 sequences totaling 1485 bases

Implant a single motif of length 13 per dataset

High likelihood that motif finders would miss it

RESULTS

Reported E-value: 8 × 10^15
  Very high, so very statistically insignificant
  In principle, theoretically impossible to find

Search results
  Alignment covering ≥ 30% of the motif found in 288/400 cases!

Data generated exactly in accordance with the E-value model

WHAT’S GOING ON?

They don’t know, and hand-wave it

Many “satellite” alignments boost the effective score
  Difficult to characterize analytically

OBJECTIVES

Analyze existing standard (E-value)
  Statistical significance of motifs in Twilight Zone

Examine and suggest new metrics

Employ new metric for motif finding

NEW METRIC: OPV

Also defined in terms of Information Content

OPV(s) (Overall p-value)
  Probability that a random sample of the same size as the input set will contain an alignment with at least as much information content as s

Contrast
  E-value: Expected number of alignments (in general)
  OPV: Probability of finding an alignment in a dataset

ESTIMATION

Caveat
  Random sample (no biasing)

Difficult to calculate analytically

Estimate empirically
  General OPV
  Finder-specific OPV

GENERAL OPV ESTIMATION

Generate 1600 random datasets
  No implants

Run a collection of motif finders on each dataset

Pick the highest-scoring motif in each dataset
  Out of all finders

Sort the scores, then pick the score s0 at the 95% quantile
  95% of scores are below it, 5% above it

Meaning
  95% of the time, the highest-scoring random motif scored less than s0
  Obtaining a score ≥ s0 means a ≤ 5% chance that the motif is random
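The threshold step can be sketched as an empirical quantile over null scores. Running real motif finders is out of scope here, so the 1600 "best scores from random datasets" below are drawn from an arbitrary Gaussian as a stand-in; the distribution, its parameters, and the function names are assumptions for illustration only.

```python
import random


def quantile_threshold(null_scores, percent=95):
    """Score s0 such that `percent`% of the null scores fall below it."""
    ordered = sorted(null_scores)
    index = min(len(ordered) - 1, len(ordered) * percent // 100)
    return ordered[index]


def is_significant(score, s0):
    """Call a motif significant when it scores at least the threshold."""
    return score >= s0


# Stand-in for the best motif score found in each of 1600 random datasets.
rng = random.Random(0)
null_scores = [rng.gauss(10.0, 2.0) for _ in range(1600)]

s0 = quantile_threshold(null_scores)
below = sum(score < s0 for score in null_scores)
print(s0, below / len(null_scores))
```

A finder-specific variant would use the same procedure with null scores taken from a single motif finder rather than the best score across a collection.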

GENERAL OPV RESULTS

Run on previous 400 datasets

90% of correct runs (288/400) were classified as noise

Not good…

FINDER-SPECIFIC OPV ESTIMATION

Same as before, but use only one finder

Better biased toward the parameter space of the specific finder

FINDER-SPECIFIC OPV RESULTS

Tested it on Gibbs

Same 400 datasets
  228 TPs
  13 FPs

Much better…

USING OPV

Impractical

A priori generation is prohibitive given the parameter space of motif finders

Per-problem estimation is prohibitive
  Requires ~100x more runs

Not theirs…

ANOTHER METRIC: ILR (INCOMPLETE LIKELIHOOD RATIO)

Not defined in terms of Information Content

[Equation: ILR, with labeled terms: number of sequences; length of nth sequence; length of motif; probability of subsequence starting at m to be the motif (motif PSSM); probability of subsequence starting at m to be background (background PSSM)]

Ratio of null hypothesis to OOPS hypothesis
  OOPS: One occurrence per sequence

Intuition behind it

OBJECTIVES

Analyze existing standard (E-value)
  Statistical significance of motifs in Twilight Zone

Examine and suggest new metrics

Employ new metric for motif finding

MOTIF FINDING USING ILR

Used existing algorithms, ranked final output by ILR

Developed simple new algorithm that uses ILR as objective function

ILR MOTIF FINDING RESULTS

Promising…

OBJECTIVES

Analyze existing standard (E-value)
  Statistical significance of motifs in Twilight Zone

Examine and suggest new metrics

Employ new metric for motif finding

One More Thing!

THANK YOU