detection of spaced motifs using submotif pattern mining s.m. yiu department of computer science the...

50
Detection of Spaced Motifs using Submotif Pattern Mining S.M. Yiu Department of Computer Science The University of Hong Kong Joint work with Ken Sung (Genome Institute of Singapore and NUS) Edward Wijaya, Rajaraman Kanagasabai (Institute for Infocomm Research) 1

Upload: emma-butler

Post on 03-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Detection of Spaced Motifs using Submotif Pattern Mining

S.M. YiuDepartment of Computer Science

The University of Hong Kong

Joint work with Ken Sung (Genome Institute of Singapore and NUS)

Edward Wijaya, Rajaraman Kanagasabai (Institute for Infocomm Research)

1

Outline

• Introduction to Bioinformatics Research & Motif Finding Problem

• Spaced Motifs and Our contributions

• Some Technical Background – String representation of Motifs

• The Model for Spaced Motif

– Gaps in Motifs

– Submotif notion

• SPACE

– Algorithm for Spaced Motifs

• Experimental Results & Conclusions

• Other Related Projects

2

Bioinformatics Research

Goal: To help bio-medical research community to discover biological meaningful knowledge

How CS helps: By providing computational efficient tools for analysis of huge amount of data and/or solving computational intensive problems.

May make use of techniques in Algorithms, Database, AI etc.

3

4

Typical Process of Bioinformatics Research

Biological Scenario

Math/computational model

Verify & Test the solution using real

data

Develop efficient & effective algorithms

(solution)

siRNA can be used to de-activate genes causing cancers[How to select good candidates is not easy!]

Formulate criteria for forming siRNA molecules

Develop a computational tool for selecting siRNA candidates

An example

Two issues:1)Difficult to derive a correct model [even the bio-medical researchers may not know the exact criteria];

2)Complexity of the problem and huge amount of data

Typical Process of Bioinformatics Research

Biological Scenario

Math/computational model

Verify & Test the solution using real

data

Develop efficient & effective algorithms

(solution)

5

The motif finding problem - a fundamental and critical problem in bio-medical research

- To activate a gene, there is a corresponding short sequence (called binding site) in our DNA needs to be first interacted (binded) by a particular transcription factor.

6

…..AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA…..

Binding site

Transcription Factor A

Gene A

activated to produce a protein

DNA sequence is considered as a sequence of {A, C, G, T}

7

•The binding sites that can interact with the same transcription factor are similar in pattern (but usually do not have the same pattern). The general pattern of a set of related binding sites is referred as the motif.

AGCTAAACCACGTGGCATGG

AGCCACGCGCGTGGCATGG

AGCTAAACCACGTGGCATGTGC

ACCCGTGCCACGTGGCATGG

…GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAAT……GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG……CCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTG……GTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAG…………………………………………………………………………………………………………………………………………

Given: A set of sequences that possibly contain the binding sites of the same transcription factor.What we know: the binding sites should occur abundantly (unlikely to occur by random) inside these sequences.The Motif Finding Problem: to identify a set of similar patterns that occur a lot in these sequences and predict the motif.

Importance of this motif finding problem

- Solving this problem provides critical information for the study of genetic diseases and drug design.- Motif finding becomes a daily task for bio-medical researchers.

Within the last few years, more than 30 new software tools were developed and more than 100 research papers were published!

8

Issues Representations of Motif

Different representations can model different motifs (binding sites) and have different descriptive power.

Scoring function To formalize the concept of “occurred abundantly” (not occur by random).

Complexity of the motif finding problem: some versions have been proved to be NP-hard.

Practical concern: Dataset with noise ………

Traditional motif models

“Assume that a motif is a contiguous substring in the

sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

Still a lot of known binding sites not being captured by these models

10

The Spaced Motif

Our contributions:- Traditional model assumes no gaps in motif or allows at most one gap. We derive a new model to allow multiple gaps with different lengths in motifs (spaced motifs)- Formulate the spaced motif finding problem as a frequent submotif mining problem (database technique)- We developed an effective algorithm (SPACE) to solve the problem.- Experimental results verified that the new model represents real motifs better than existing models.

11

Two common representations for motif:

1. String Representation

A motif is modeled as a length-L string- two variations: mismatch model and wildcard model

A (L, d)-mismatch motif M is a length-L string such that a Length-L string with at most d mismatches from M is considered as a binding site.

TTGACATCGACATTGACATTGAAAATGACATTGACAGTGACATTGACTTTGACCTTGACA

Motif: TTGACABinding sites: with at most 1 mismatch

This is a (6, 1)-mismatch motif!

Note: the mismatch can occur at any position!

1. String Representation

A motif is modeled as a length-L string- two variations: mismatch model and wildcard model

A (L, d)-wildcard motif M is a length-L string such that a binding site is one with at most d wildcards w.r.t. M.

TTGACATCGATATTGATATGGAAATTGAGATCGACATGGAGATTGACATAGATATCGAGA

In this case, the binding sites should be better modeled by a (6, 2)-wildcard motif: T*GA*A

2. Positional Weight Matrix (PWM) Representation

Gaps in motif

Examples of motifs that are not contiguous substrings in the sequence!!

e.g. ARCA-P (Liu and DeWulf, 2004)The motif is known to be “GTTAA n n n n n n GTTAA”

HAP1 (Scherling and Holmberg, 1996)The motif is known to be “CGG n n n TA n CGG n n n TA” Gaps (spacers)

Issues: We don’t know the number of gaps and the length of each gap! The search space may be huge if we try all combinations.

14

Previous work?

• Most previous work usually does not allow gaps in a motif (called monads).

• Some algorithms (like MITRA) allow gaps. But they only allow one gap.

In this work, we attempt to develop a better approach to locate motifs with arbitrary number of gaps of possibly different lengths.

Our Proposed Motif Model

We define a (spaced) motif M as a length-L string with characters {A, C, G, T, n} with at least c x L non-n characters (c is called the coverage).

E.g. L=15, c=l/2,M=ACGCnnGCGTnCTCA

Each maximal substring of consecutive n represents a gap/spacer.

Each maximal substring of {A, C, G, T} characters represents a segment.

16

Assumption: total gap length is relatively small w.r.t. L.

E.g. L=15, c=l/2,M=ACGCnnGCGTnCTCA

Note: we allow any number of segments (and gaps), and segments (as well as gaps) can have different lengths.

Notion of submotif [to capture similarity!]:

We fix a parameter ls, any length-ls substring within any segment of M is called a submotif.

e.g. if ls = 3, ACG; CGC; GCG; CGT; CTC; TCA are submotifs of M.

17

Instance (binding site) of a motifA length-L string I in a given sequence is called an instance (a binding site) of M if I contains all submotifs with at most d mismatch ((ls,d)-substrings) for each submotif.

e.g. Assume (ls, d) = (3,1)M=ACGCnnGCGTnCTCAI=ATGCcgGCATtCTTA I’=ATGCcgCCATtCTTA

I is an instance; I’ is not!

The problem: Given a set S of t sequences, find all (spaced) motifs M such that M has at least q instances.

18

Example of Motif

• If q = 4, M=ACGCnnGCGTnCTCA is a spaced motif as M has 4 instances in S

TTCAACGCACGCGTTCTCAGCTCAGCTGAACGTCGGTCACT

CGACATTCCACGGCTATGCCGGCATACGCACCGTACGCTCC

CGGCAATTACTCGTCAGTTCTACGTGCGCGACCCCATCCCA

ACTGCGCATGAACACTCCTGAGTGATCACTAAAGTTCGGTG

19

Idea of the SPACE Algorithm

• Input sequences S:TTGATACCGAAGATACCGATTAGAAATCACTCA

ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC

GAAGACCGTCATGAGAAATCGCATACACGAGCA

TTCACCCGATAAAAATAAGGCTGTCTGGACTAA

TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA

20

Then, for each other length-L substring in S, check if the sharing (ls, d)-substrings cover at least c x L characters. e.g. L = 20, ls = 5, d = 1, c = ½ (i.e. c x L = 10).S2: ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC

Finding motif candidates

-GAAGATACCGATTAGAAATC

-GAAAA TAAAA

-GAAAAGCAGTAGTAAAACTG

Consider any length-L substring in S.e.g. L = 20S1: TTGATACCGAAGATACCGATTAGAAATCACTCA“GAAGATACCGATTAGAAATC” starting at pos 9

21

S1:TTGATACCGAAGATACCGATTAGAAATCACTCAS2:ACTACAGAAAAGCAGTAGTAAAACTGTACAGTCS3:GAAGACCGTCATGAGAAATCGCATACACGAGCAS4:TTCACCCGATAAAAATAAGGCTGTCTGGACTAAS5:TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA

More examples:

Note that the set of shared (ls, d) substrings may not be the same for every case.

22

Extracting sharing (ls, d)-substrings– GAAGATACCGATTAGAAATC

– GAAAA TAAAA– GAAAAGCAGTAGTAAAACTG (1,13)

– GAAGA ATGAGAAATC– GAAGACCGTCATGAGAAATC (1,11,12,13,14,15,16)

– CGATAA ATAAG– CCCGATAAAAATAAGGCTGT (3,4,12)

– GAAGAC TAGAAA– GAAGACCAGCAGTAGAAAAA (1,2,13,14)

23

Mining the frequent pattern• Find the patterns which occur at least q

times.• Example (q=3):

– (1, 13)– (1, 11, 12, 13, 14, 15, 16)– (3, 4, 12)– (1, 2, 13, 14)

• The pattern is (1, 13)• If the patterns can give a coverage of >

c x L, then we found a spaced motif.• Hence, GAAGAnnnnnnnTAGAA• This step is done using a well-known

data mining technique.24

Generate the results and compute the significant score

• M = GAAGAnnnnnnnTAGAA

TTGATACCGAAGATACCGATTAGAAATCACTCAACTACAGAAAAGCAGTAGTAAAACTGTACAGTCGAAGACCGTCATGAGAAATCGCATACACGAGCATTCACCCGATAAAAATAAGGCTGTCTGGACTAATCGGAACAATTACGAAGAAAAGCAGTAGAAAAA

• The motif M is scored based on how unlikely this pattern can occur in random. [Details omitted!]

25

Summary of the algorithm

1. Finding motif candidates– For every length-L substring P of S

• Find all motif instances of P in S• Extract the sharing (ls,d)-substrings in all motif

instances• Mine patterns occuring more than q times• Generate the motif and its significant score

2. Sort the motifs based significant score3. Report the motifs

26

Some technical details

• A straight-forward implementation of the previous algorithm is slow.

• The bottleneck is on the step of finding motif candidates (it takes O(Ln2) time, where n is the total length of the sequences).

27

Idea: Window shifting• Suppose we know coverage between P and I

– P=GAAGATACCGATTAGAAATC– GAAGA ATGAGAAATC (1, 11, 12, 13, 14, 15, 16)– I=GAAGACCGTCATGAGAAATC– Coverage = 15

• We can find the coverage between P’ and I’ efficiently– P’=AAGATACCGATTAGAAATCG– ATGAGAAATCT (10, 11, 12, 13, 14, 15, 16)– I’=AAGACCGTCATGAGAAATCT– Coverage = 11

P

P’

First length-ls substring of P

Last length-ls substring of P’

Time complexity reduced to O(n2).

28

Idea: Pruning on CoverageLet Sa = S[a..a+L-1] be the motif candidate.

Tb = T[b..b+L-1] is the substring being considered.

Let C be the coverage of Tb on Sa.

We can get the upper bound for the coverage of Sa+p and Tb+p easily for any p > 0.

29

• Suppose we know coverage between P and I– Sa=CGAAGATACCGTTAGAAATC– GAAGA (2)– Tb=TGAAGACCGTCTGACCGATC– Coverage = 5

• If we move both substrings to the right by one character– Sa+1=GAAGATACCGTTAGAAATCG– GAAGA AATCG (2, 16)– Tb+1=GAAGACCGTCTGACCGATCG– Coverage = 10

The coverage of Sa+p and Tb+p is upper bounded by

C + (ls-1) + p

Experimental Results

30

Compare the effectiveness of our tool with 13 other existing software tools.

Testing Data:1)Motif Assessment Benchmark Datasets (Martin Tompa, 2005): 56 datasets constructed from 4 different species (fly, human, mouse, yeast).2)9 datasets extracted from the literature with some identified spaced motifs.

Evaluation Measures:Sensitivity (Sn): % of known binding sites identified by the tool. Specificity (PPV): % of predicted binding sites that match with the known binding sites.

Experimental Results – Benchmark Dataset

31

Benchmark Dataset – Comparison on Four Organisms with

the best performed softwareS

eSiM

CM

C

SeS

iMC

MC

An

nS

PE

C An

nS

PE

C

Imp

rob

izer

+ Y

MF

Imp

rob

izer

+ Y

MF

Wee

der

Wee

der

32

Experimental Results – Spaced motif real datasets

An exampleARCA-P Literature GTTAAnnnnnnGTTAA

SPACE GTTAnnnnnATGTTA MITRA GTTAACT

SPACE also performs better (in terms of both measures) in all cases.

33

Conclusions

- SPACE is found to be effective in locating spaced motifs (as well as motifs without gaps).

- After the paper was published, quite a number of biologists emailed us for the tool. Some major genome laboratories, such as The Computational Biology Research Center (CBRC) of Japan, are considering to list (& include a link to our web-based tool) our tool in their collection.

- We are extending our work to handle motifs for which the same gap may have different lengths across different instances.

34

35

Some Related Projects

Traditional motif models

“Assume that a motif is a contiguous substring in the

sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

SpacedMotifs

Motifs with Character

Dependence

36

Motifs with Nucleotide Dependence

- Joint work with Prof. Chin & Henry Leung- A new model to capture the dependency of nucleotides (characters) in a motif

e.g GGT

AA

CGG

TGACGGA

The SPSP (Scored Position specific Pattern) Model

The 5th, 6th, 7th characters are dependent, i.e., if 5th character is “T”, then the 6th and 7th characters must be “GA”.

37

Some on-going Bioinformatics Projects

Traditional motif models

“Assume that a motif is a contiguous substring in the

sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

SpacedMotifs

Motifs with Nucleotide Dependenc

e

Ensemble Approaches for Combining Results from different Motif Finders

38

Traditional motif models

“Assume that a motif is a contiguous substring

in the sequence”Additional Information: e.g. Negative Set – sequences known to be without the binding sites

SpacedMotifs

Motifs with Nucleotide Dependenc

e

Ensemble Approaches for Combining Results from different Motif FindersSingle Motif

Protein Motif Pairs Motif Modules

<Thank you!>

2. Positional Weight Matrix (PWM) Representation

A motif is modeled as a 4 x L matrix. -The binding site is of length L. -The 4 rows are labeled by “A”, “C”, “G”, “T”.-The jth column provides the probability of the occurrence of each nucleotide at position j of the binding site.

1 2 3 4 5 6A 0.1 0 0 1 0.1 0.8C 0 0.1 0 0 0.9 0.1G 0.1 0 1 0 0 0T 0.8 0.9 0 0 0 0.1

Positional Weight Matrix (PWM)TTGACATCGACATTGACATTGAAAATGACATTGACAGTGACATTGACTTTGACCTTGACA

Remark: Solution space is infinite, only suboptimal answers will be produced.

GGTAACGG

TGACGGA

P

log(1)-log(0.8)-log(0.2)-log(1)-

log(0.4)-log(0.6)-log(1)-

S

SPSP model

Remark: A length-11 string is considered as a binding site if (i) it matches with P and (ii) its score (sum of corresponding entries) is at most 3.1.

For example, the score of “CGGATGAATGG” is -log(1)+ -log(0.6) + -log(1) + -log(0.8) + -log(1) = 1.05 < 3.1. The score of “CGGACGGAAGG” is -log(1)+ -log(0.4) + -log(1) + -log(0.2) + -log(1) = 3.6 > 3.1. The string “TGGATGAATGG” does not match with P.

40

Significance Test & Scoring

Intuitively, a motif is significant if-The number of occurrences is a lot more than expected-The pattern is either very conserved or occurs in quite a number of input sequences.

(1)Let Occ_s(M, e) be total no. of observed occurrences of M with at most e mutations.Let E(M, e) be expected frequency of M with at most e mutations based on a set of background sequences

Occurrence score:X(M) = log [(Occ_s(M, e)/E(M, e) x total length of seq]

41

(2) For a sequence s with an occurrence of M, consider the most conserved instance (let e’ be the no. of mutations). Then, E(M, e’) x Len(s) is the expected freq. of this motif in s. Note that this value is small is the motif is very conserved.

Sequence-specific score:Y(M) = Sum [log (1/E(M, e’) x Len(s))] over all sequence s.Overall score = x(M) + y(M).

42

GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAATGCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCGCCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTGGTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAGCAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGCGTCTCTTGACCGCTTAATCCTAAAGGGAGCTATTAGTATCCGCACATGTGAACAGGAGCGCGAGAAAGCAATTGAAGCGAAGTTGACACCTAATAACT

Given a set of (regulatory) sequences that possibly bind to the same transcription factor, the problem is to search for the common binding site pattern (motif) that is over-represented in the sequences.

Unlikely to occur by random!

e.g. TT occurs 17 times, but length of each seq. = 49, total length ~ 350, expected no. of occurrences ~ 20.

43

GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAATGCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCGCCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTGGTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAGCAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGCGTCTCTTGACCGCTTAATCCTAAAGGGAGCTATTAGTATCCGCACATGTGAACAGGAGCGCGAGAAAGCAATTGAAGCGAAGTTGACACCTAATAACT

TTGACATCGACATTGACATTGAAAATGACATTGACAGTGACATTGACTTTGACCTTGACA

Motif: TTGACABinding sites: with at most 1 mismatch

It occurs 10 times, but expected number of occurrences is only:

(1/4)6 x 6 x 3 x 350 ~ 1.5

44

An example: hm17g

45

Synthetic Dataset1. Motif AGTTGTC with no spacer

2. Motif CCTGTnnnAGTTGTC containing 2 segments with spacers.

3. Motif ATCGTnnnTGACCnnnCTTTC containing 3 segments of length 5 with spacers.

4. Motif CGGCnnnnnnTCTAA containing 2 segments with a spacer.

0

0.2

0.4

0.6

0.8

1

1.2

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

nSn - MITRA

nSn - SPACE

nPPV - MITRA

nPPv - SPACE

nPC - MITRA

nPC - SPACE

nCC - MITRA

nCC - SPACE

Note: for each segment, the instances may have one mismatch.

46

Datasets• 9 Real datasets with known spaced motifs.

• Motif Assessment Benchmark Datasets (Tompa, 2005).– Consists of 56 datasets from 4 different species

(fly, human, mouse, yeast).– Each dataset contain 1 to 35 sequences.– Sequence length up to 3K bp.

• Six synthetic datasets containing implanted motif with variations in:– spacer length.– motif parts (segments) and – segments length.

47

Parameterization• Perform intersections with 12 runs of these

parameter combinations:– L = 8,15,20 (max motif length)– c = 0.5, 0.8 (coverage)– q = t, 0.5t (min support)– ls = 5 (substring length)– d = 1 (mismatches)

Experimental Results

48

Evaluation Measures

(Recall)

(Precision) =

(Performance Coef) =

. . (Correl Coef) =

( )( )( )( )

TPSn

TP FN

TPPPV

TP FP

TPPC

TP FN FP

TP TN FN FPCC

TP FN TN FP TP FP TN FN

49

Benchmark Dataset – Comparison on Four Organisms with

the best performed softwareS

eSiM

CM

C

SeS

iMC

MC

An

nS

PE

C An

nS

PE

C

Imp

rob

izer

+ Y

MF

Imp

rob

izer

+ Y

MF

Wee

der

Wee

der

50