Motif Discovery: Algorithm and Application Dan Scanfeld Hong Xue Sumeet Gupta Varun Aggarwal

Post on 22-Dec-2015

Page 1

Motif Discovery: Algorithm and Application

Dan Scanfeld

Hong Xue

Sumeet Gupta

Varun Aggarwal

Page 2

Objective: Motif discovery and use for deriving biological information

• Get sequences bound and unbound by the TF Nanog in human ES cells
• Find a motif using a motif-finding algorithm
• Genome-wide functional analysis using the motif to find biological patterns

Page 3

Why Nanog: Relevance to ES Cells

1 genome, 1 cell
>200 phenotypes, 10^13 cells

• Activate certain genes essential for cell growth.
• Repress a key set of genes needed for an embryo to develop.
• This key set of repressed genes activates entire networks for generating many different specialized cells and tissues.

Page 4

Objective: Motif discovery and use for deriving biological information

• Find a motif (Nanog) using a motif-finding algorithm
• Get sequences bound and unbound by the TF Nanog in human ES cells
• Genome-wide functional analysis using the motif to find biological signals

Page 5

Location Analysis (ChIP-chip) in Human ES Cells (Boyer et al., Cell 122: 947-956)

[Figure: ChIP-chip workflow: crosslink, fragment, enrich for Nanog, differentially label. Agilent 44k arrays (set of 10).]

Page 6

ChIP-chip Data Analysis

• Obtain intensities using GenePix
• Perform median normalization (set-normalized, negative-control-subtracted)
• Probe-set p-value: p = 0.005
• Sequences (500 bp), May 2004 genome release

[Figure: IP signal vs. WCE signal with thresholds P <= 0.001, P <= 0.005, and P <= 0.01; enrichment ratio vs. chromosomal position.]
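The median-normalization step named on this slide can be illustrated with a minimal sketch. It assumes two intensity channels per probe (IP and WCE) and simply rescales the IP channel so that the two medians agree before taking a per-probe ratio. The function names and the intensities are hypothetical, and this is not the Young Lab error model, which also handles negative-control subtraction and p-values.

```python
from statistics import median

def median_normalize(ip, wce):
    """Scale the IP channel so its median matches the WCE channel's median."""
    scale = median(wce) / median(ip)
    return [x * scale for x in ip]

def enrichment_ratios(ip, wce):
    """Per-probe enrichment ratio: normalized IP signal over WCE signal."""
    ip_norm = median_normalize(ip, wce)
    return [i / w for i, w in zip(ip_norm, wce)]

# Hypothetical intensities for five probes; the fourth probe is enriched.
ip = [200.0, 220.0, 180.0, 900.0, 210.0]
wce = [100.0, 110.0, 95.0, 105.0, 100.0]
ratios = enrichment_ratios(ip, wce)
```

After normalization, probes with typical signal have ratios near 1, so an enriched probe stands out regardless of the overall brightness of either channel.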

Page 7

Objective: Motif discovery and use for deriving biological information

• Find a motif (Nanog) using a motif-finding algorithm (state of the art)
• Get sequences bound and unbound by the TF Nanog in human ES cells
• Genome-wide functional analysis using the motif to find biological patterns

Page 8

Motif Finding Algorithm (MacIsaac et al., 2006)

• Use structural prior (database, MacIsaac et al.)
• Refinement: Expectation-Maximization (ZOOPS)
• Score of found motifs: classification on unseen data
• Significance testing on score: use of empirical p-value
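The empirical p-value mentioned on this slide can be sketched as follows. The slide does not say how the null scores are generated, so the null distribution here (scores of random motifs clustered around chance) is an assumption, as is the function name. The estimator is the common "add one" form, which never returns exactly zero.

```python
import random

def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score,
    with a +1 correction in numerator and denominator."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)

rng = random.Random(0)
# Hypothetical null: classification scores of 999 random motifs, near 0.5.
null_scores = [rng.gauss(0.5, 0.03) for _ in range(999)]
p = empirical_p_value(0.612, null_scores)
```

With 999 null samples the smallest attainable p-value is 0.001, so the number of null motifs bounds the resolution of the test.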

Page 9

Refinement: Expectation-Maximization

Differences from EM in Lab 1:

• Use of structural prior (beta = strength of prior)
• ZOOPS (zero or one per sequence) model
• 5th-order Markov model for background, trained over unbound sequences
• SVM for hypothesis testing
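A 5th-order Markov background model conditions each base on the five bases before it. The sketch below is one way to train such a model on unbound sequences and evaluate log P(S|B). The function names, the pseudocount, and the choice to skip the first five bases of each sequence are assumptions made for brevity, not details from the slides.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order=5, pseudocount=1.0):
    """Count (context -> next base) transitions and return a probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        # Unseen contexts fall back to a uniform distribution via the pseudocount.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def background_log_likelihood(seq, prob, order=5):
    """log P(seq | B), ignoring the first `order` bases for simplicity."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

# Hypothetical unbound training sequences.
unbound = ["ACGTACGTACGTACGT", "TTTTTTTTTTTTTTTT"]
prob = train_markov(unbound)
ll = background_log_likelihood("ACGTACGT", prob)
```

A higher-order background captures local sequence composition, so motif scores are less likely to reward patterns that are simply common in unbound DNA.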

Page 10

ZOOPS Model (Bailey & Elkan 1994)

• $B$: background model; $M$: motif model
• $\Lambda$: fraction of bound sequences (mixture-model parameter)

Sequences are drawn from the distribution

$$P(S) = \Lambda\, P(S \mid M) + (1 - \Lambda)\, P(S \mid B)$$

Hidden variable for EM: $Z_{ij} \in \{0, 1\}$, equal to 1 if position $j$ in sequence $i$ is bound by the TF and 0 if not.

E-step:

$$\Pr(Z_{ij} = 1) = \frac{\Lambda\, P(S_i \text{ bound at } j \mid M)}{(1 - \Lambda)\, P(S_i \mid B) + \Lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}$$

The denominator is $P(S_i)$, so the left-hand side is $P(M \text{ bound at } j \mid S_i)$.

M-step (same as before):

Updating $M$ (motif model): for each position $p$ of the motif model and each base $b$ (A, C, G, or T), with $\text{Base}_{ij}$ the base at position $j$ of the $i$th sequence,

$$\text{PWM}(p, b) = \sum_i \sum_j \Pr(Z_{i,\,j-p+1} = 1)\, [\text{Base}_{ij} = b] + \text{pseudocounts}$$

and then normalize over $b$.

Updating the background model: we don't update the background.

Updating $\Lambda$:

$$\Lambda = \frac{\sum_i \sum_j \Pr(Z_{ij} = 1)}{\text{number of sequences}}$$
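One EM iteration of the ZOOPS model on this slide can be sketched as below. Several simplifications are assumptions of the sketch rather than the authors' implementation: the background is 0th-order (a fixed base frequency table) instead of the 5th-order Markov model, the structural prior is omitted, P(S_i bound at j | M) is taken to include a uniform prior over start positions, and the toy sequences and starting matrix are hypothetical. Numerator and denominator of the E-step are both divided by P(S_i | B), so only motif-to-background ratios are needed.

```python
import math

BASES = "ACGT"

def e_step(seqs, pwm, bg, lam):
    """Return z[i][j], the posterior that sequence i is bound at start j."""
    w = len(pwm)
    z = []
    for s in seqs:
        n_pos = len(s) - w + 1
        # Ratio P(S_i bound at j | M) / P(S_i | B), with a uniform 1/n_pos
        # prior over start positions (assumption of this sketch).
        r = [math.prod(pwm[p][s[j + p]] / bg[s[j + p]] for p in range(w)) / n_pos
             for j in range(n_pos)]
        denom = (1.0 - lam) + lam * sum(r)
        z.append([lam * rj / denom for rj in r])
    return z

def m_step(seqs, z, w, pseudocount=0.1):
    """Re-estimate the PWM and Lambda from the posteriors; background is fixed."""
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for s, zi in zip(seqs, z):
        for j, zij in enumerate(zi):
            for p in range(w):
                counts[p][s[j + p]] += zij
    pwm = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    lam = sum(sum(zi) for zi in z) / len(seqs)
    return pwm, lam

# Hypothetical toy data: three sequences contain TAAT, one does not.
seqs = ["GCGCTAATGCGC", "CCTAATGGCGCG", "GGCGCGCTAATC", "GCGGCCGCGGCC"]
bg = {b: 0.25 for b in BASES}
# Hypothetical starting point with a weak preference for TAAT.
pwm0 = [{b: (0.4 if b == c else 0.2) for b in BASES} for c in "TAAT"]
pwm, lam = pwm0, 0.5
for _ in range(20):
    z = e_step(seqs, pwm, bg, lam)
    pwm, lam = m_step(seqs, z, len(pwm0))
```

Because each row of `z` sums to less than one, the leftover mass is the posterior that the sequence is unbound, which is what distinguishes ZOOPS from a one-occurrence-per-sequence model.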

Page 11

Hypothesis Testing

• Get motifs from EM
• Use 2 sets of bound and unbound sequences (train and test)
• Train a linear SVM on the train set
• Find the classification error on the test set: Error = Misclassifications / Total samples
• Score = 1 - error

[Figure: Bound (B) and unbound (UB) sequences form the train set and the test set. B + EM motif (M) give the classifier input, P(S|M)/P(S|B); the output is B or UB. The classifier is trained on the train set and tested on the test set.]
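The scoring procedure on this slide can be sketched with a single feature per sequence. To keep the example dependency-free, a one-dimensional threshold classifier stands in for the linear SVM; it is not an SVM and does not maximize a margin. The feature values below are hypothetical log-likelihood ratios log(P(S|M)/P(S|B)), and the function names are illustrative.

```python
def train_threshold(features, labels):
    """Pick the threshold on a 1-D feature that minimizes training error.
    Stand-in for the linear SVM: predicts bound (1) when feature > threshold."""
    pts = sorted(set(features))
    candidates = ([pts[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
                  + [pts[-1] + 1.0])

    def errors(t):
        return sum((f > t) != bool(y) for f, y in zip(features, labels))

    return min(candidates, key=errors)

def score(threshold, features, labels):
    """Score = 1 - error, with error = misclassifications / total samples."""
    wrong = sum((f > threshold) != bool(y) for f, y in zip(features, labels))
    return 1.0 - wrong / len(labels)

# Hypothetical log-likelihood ratios; label 1 = bound, 0 = unbound.
train_x = [2.1, 1.7, 0.9, -0.3, -1.2, 0.2]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [1.5, 0.4, 0.8, -0.5, 0.7, -1.0]
test_y = [1, 1, 1, 0, 0, 0]

threshold = train_threshold(train_x, train_y)
test_score = score(threshold, test_x, test_y)
```

Scoring on held-out sequences, rather than on the sequences used for EM, is what lets the score penalize motifs that also match unbound DNA.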

Page 12

Expectation-Maximization

When to stop? Will it overtrain?

Rules of thumb (when the likelihood increases very slowly):

• Second derivative is negative for a given number of iterations
• Euclidean distance is less than a given value

Over-training to the given sequences:

• EM maximizes the likelihood of the motif in the given sequences and disregards its likelihood in unbound sequences.
• Find the test classification error at each EM step using SVMs.
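The "second derivative is negative" rule of thumb can be sketched as a check on the discrete second difference of the log-likelihood curve. The defaults (a minimum of 60 iterations, negative for five iterations) follow the run details given later in the deck; requiring the five iterations to be consecutive, and the function name, are assumptions.

```python
def should_stop(log_likelihoods, min_iters=60, patience=5):
    """True once at least `min_iters` iterations have run and the discrete
    second difference of the log-likelihood was negative for the last
    `patience` iterations (growth is slowing down)."""
    n = len(log_likelihoods)
    if n < max(min_iters, patience + 2):
        return False
    recent = log_likelihoods[-(patience + 2):]
    second_diffs = [recent[i + 2] - 2 * recent[i + 1] + recent[i]
                    for i in range(patience)]
    return all(d < 0 for d in second_diffs)

# Hypothetical log-likelihood trace that rises and flattens out.
trace = [-(0.9 ** k) for k in range(70)]
stop_now = should_stop(trace)
```

A rule like this only detects that the likelihood has flattened; it says nothing about whether the motif generalizes, which is why the slides pair it with a test-set score.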

Page 13

Expectation-Maximization

A different methodology: 4 sets of data:

• Bound (for EM)
• Bound and unbound (train SVM)
• Bound and unbound (test SVM)
• Bound and unbound (validation)

At each EM iteration, train the SVM and find the test error.

Use two kinds of motifs:

• Best-test-error motif
• EM last-iteration motif

Choose the 10 best hypotheses. Use a larger validation set.

[Figure: From each set of initial points, EM runs to a final motif, with an SVM trained and its error recorded at each iteration.]
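Keeping both kinds of motif from one EM run amounts to recording the test score at every iteration and reading off two entries. A minimal sketch, assuming the per-iteration history is stored as (motif, test score) pairs; the data structure, function name, and scores are hypothetical.

```python
def select_motifs(history):
    """history: list of (motif, test_score), one entry per EM iteration.
    Returns (best_of_iteration, end_of_iteration), each as
    (iteration_index, motif, test_score)."""
    best_idx = max(range(len(history)), key=lambda i: history[i][1])
    last_idx = len(history) - 1
    return ((best_idx, *history[best_idx]), (last_idx, *history[last_idx]))

# Hypothetical run: the test score peaks before EM finishes.
history = [("m0", 0.52), ("m1", 0.58), ("m2", 0.61), ("m3", 0.57)]
best, end = select_motifs(history)
```

Both candidates would then be compared on the separate validation set, since the best-of-iteration motif was chosen using the test set and its test score is therefore optimistic.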

Page 14

Expectation-Maximization: Details of Run

• Transcription factor: Nanog
• Beta = [0 0.2 0.35 0.5 0.6 0.7 1] (strength of prior)
• 5 motifs per beta, by masking motifs
• Motif length: 8
• 25 bound sequences for EM
• 500 base pairs in each sequence
• 150 total train sequences (SVM) [low: noisy]
• 150 total test sequences (SVM) [low: noisy]
• 500 total validation sequences
• c = [1e-3, 0.05, 100.0] (SVM: budget for misclassifications)
• EM for a minimum of 60 iterations; second derivative is negative for five iterations

Page 15

Expectation-Maximization

[Figure: Representative score graphs during EM iterations. Panels: Beta 0.0, Beta 0.35, Beta 0.6, Beta 0.7. X-axis: EM iteration; Y-axis: score of motif.]

Page 16

Expectation-Maximization

Test and Validation Error of Refined Motifs

[Figure: Two panels, test classification score and validation classification score. Markers: * = end-of-iteration EM result, o = best-of-iteration. X-axis: beta value; Y-axis: score of motif.]

Page 17

Expectation-Maximization

When is it the best-of-iteration?

[Figure: For each run, the total number of iterations and the iteration at which the best-of-iteration motif occurred. X-axis: runs; Y-axis: iteration.]

Page 18

Expectation-Maximization

Results: 6 out of 7 top-ranking motifs were best-of-iteration and 1 was end-of-iteration (6 out of 10 as well).

Best motif: validation error over a set of 500 sequences. Score: 61.2%, error: 38.8%.

Position   1         2         3         4         5         6         7         8
A          0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
C          0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
G          0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
T          0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167
Consensus  T         A         A         T         T         A or G    C or G    T
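The consensus row can be reproduced from the matrix with a short sketch. The rule used here, listing every base whose probability is at least 0.4 at a position and otherwise the single most probable base, is an assumption chosen because it matches the consensus given on the slide; the slides do not state how the consensus was derived.

```python
# Best-motif matrix from the slide: one row per base, one column per position.
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268, 0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022, 0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620, 0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090, 0.526506, 0.000040, 0.080927, 0.482167],
}

def consensus(pwm, min_prob=0.4):
    """Per position, join every base with probability >= min_prob;
    fall back to the most probable base if none reaches the cutoff."""
    width = len(next(iter(pwm.values())))
    result = []
    for p in range(width):
        strong = [b for b in "ACGT" if pwm[b][p] >= min_prob]
        if not strong:
            strong = [max("ACGT", key=lambda b: pwm[b][p])]
        result.append("/".join(strong))
    return result

motif = consensus(PWM)
```

Positions 6 and 7 come out as two-base alternatives because the two leading bases there have nearly equal probability.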

Page 19

Assumptions and Caveats

• Random baseline: end-of-run motif in EM.
• Low number of sequences for test error.
• Bound sets may not actually be bound. It is better to use highly probable sequences as bound.
• All runs (including beta = 0) used the structural prior as the starting point.

Page 20

Objective: Motif discovery and use for deriving biological information

• Find a motif (Nanog) using a motif-finding algorithm
• Get sequences bound and unbound by the TF Nanog in human ES cells
• Genome-wide functional analysis using the motif to find biological patterns

Page 21

GSEA (Subramanian et al., 2005)

Gene Set Enrichment Analysis (GSEA) determines whether an a priori defined set of genes shows statistically significant differences between two biological states.
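The core statistic in GSEA is a running sum over a ranked gene list: it steps up at genes in the set and down at genes outside it, and the enrichment score is the largest deviation from zero. The sketch below follows that description with an exponent `weight` on the ranking score. The function name, gene labels, and scores are hypothetical, and the permutation test GSEA uses to assess significance is left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score. `ranked_genes` must be sorted by
    decreasing score; `scores` holds the matching ranking scores."""
    in_set = [g in gene_set for g in ranked_genes]
    if not any(in_set) or all(in_set):
        raise ValueError("gene set must contain some, but not all, ranked genes")
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, in_set) if h)
    running, best = 0.0, 0.0
    for s, h in zip(scores, in_set):
        # Step up (weighted) at set members, step down (uniform) otherwise.
        running += (abs(s) ** weight / hit_total) if h else (-1.0 / n_miss)
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list of six genes with their ranking scores.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
es_top = enrichment_score(ranked, scores, {"g1", "g2"})
es_bottom = enrichment_score(ranked, scores, {"g5", "g6"})
```

A set concentrated at the top of the list gives a score near +1 and a set concentrated at the bottom gives a score near -1; setting `weight=0` reduces the statistic to an unweighted Kolmogorov-Smirnov-style running sum.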

Page 22

GSEA Output

[Figure: GSEA output, showing the enrichment plot, the gene list, and gene set information.]

Page 23

GSEA Ranked List

• Set of promoter sequences for every human gene: 2000 bp upstream and 200 bp downstream of the transcription initiation site.
• Score each promoter for likelihood of the motif.
• Input this ranked list into GSEA.
• Search for gene sets enriched in the ranked list.
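Building the ranked list can be sketched as scoring each promoter with the motif and sorting. The slides do not specify the scoring function, so the one used here (the best log-odds of motif versus a 0th-order background over all windows) is an assumption, as are the gene names, the short toy promoters, and the toy matrix for TAAT.

```python
import math

BASES = "ACGT"

def best_site_score(seq, pwm, bg):
    """Maximum over all windows of the motif-vs-background log-odds."""
    w = len(pwm)
    return max(
        sum(math.log(pwm[p][seq[j + p]] / bg[seq[j + p]]) for p in range(w))
        for j in range(len(seq) - w + 1)
    )

def rank_promoters(promoters, pwm, bg):
    """Return (gene, score) pairs sorted from best to worst motif score."""
    scored = [(gene, best_site_score(seq, pwm, bg))
              for gene, seq in promoters.items()]
    return sorted(scored, key=lambda gs: gs[1], reverse=True)

# Hypothetical toy motif (TAAT) and toy promoters; real promoters here
# would be about 2200 bp long.
pwm = [{b: (0.85 if b == c else 0.05) for b in BASES} for c in "TAAT"]
bg = {b: 0.25 for b in BASES}
promoters = {
    "GENE_A": "GCGCTAATGC",
    "GENE_B": "GCGCGCGCGC",
    "GENE_C": "GCTAAGGCGC",
}
ranked = rank_promoters(promoters, pwm, bg)
```

The resulting gene order and scores are the two inputs a ranked-list GSEA run needs.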

Page 24

Results

Human embryonic stem cell genes OCT4, NANOG, STELLAR, and GDF3 are expressed in both seminoma and breast carcinoma (Ezeh et al., 2006).

Breast cancer gene set found at p-value 0.008.

Page 25

Implementation Details

• Young Lab error model for ChIP-chip data analysis
• Motif-finding algorithm in MATLAB:
  • Implemented Markov model
  • Implemented ZOOPS model
  • Integrated SVM Toolbox (by S. R. Gunn) with code
• Used structural prior from MacIsaac et al., 2006
• Used GSEA software for functional analysis

Page 26

Future Directions

• Algorithm: better use of classification error
• Maximize likelihood in bound + minimize likelihood in unbound (multi-objective optimization using GAs)
• Biological information: distance from transcription site, conservation
• Integrating expression data
• Cross-species motif search and functional analysis, maybe using GO terms
• Scoring: sequence length

Page 27

Acknowledgments

• Fraenkel Lab
• Young Lab
• Kenzie D. MacIsaac
• Dr. David Gifford (CSAIL)
• Dr. Richard Young (WIBR)
• Dr. Tommi Jaakkola (CSAIL)