sequence alignment by dynamic programming

Sequence Alignment by Dynamic Programming Bios 560R, Lecture 2 Overview The difficulty The scoring system Dynamic programming for pairwise alignment Genome sequencing (very brief) Overview of HGP Shotgun sequencing Next generation sequencing

Upload: willow-beck

Post on 02-Jan-2016


Sequence Alignment by Dynamic Programming

Bios 560R, Lecture 2

Overview
- The difficulty
- The scoring system
- Dynamic programming for pairwise alignment

Genome sequencing (very brief)
- Overview of HGP
- Shotgun sequencing
- Next generation sequencing

Human genome project (HGP) – some milestones

1977. Allan Maxam and Walter Gilbert at Harvard University and Frederick Sanger at the U.K. Medical Research Council (MRC) independently develop methods for sequencing DNA.

1988. NIH establishes the Office of Human Genome Research.

1995. Patrick Brown of Stanford and colleagues publish the first paper using a printed glass microarray of complementary DNA (cDNA) probes. Researchers at Whitehead and Généthon (led by Lander and Thomas Hudson at Whitehead) publish a physical map of the human genome containing 15,000 markers.

1996. NIH funds six groups to attempt large-scale sequencing of the human genome. Affymetrix makes DNA chips commercially available. An international consortium publicly releases the complete genome sequence of the yeast.

1998. a new company named Celera. NIH announces a new project to find SNPs.

2001. The HGP consortium publishes its working draft in Nature (15 February); Celera publishes its draft in Science (16 February).

2006. Sequence of all chromosomes finalized.

Science 9 February 2001: Vol. 291. no. 5507, p. 1195

Shotgun sequencing.

Shotgun: fires a number of small spherical pellets.

The genome DNA (~ 3 billion base pairs, in 22 autosomes and 2 sex chromosomes) is too long for sequencing directly.

The DNA is randomly broken into small pieces. The pieces are sequenced. The short sequences are then assembled into long sequences based on the overlaps between fragments.

Genome sequencing strategies

http://www.shotgun-insight.com/expressFibreVsPlastic/proFibre5_25yds80percent.jpg

Genome

Reads

Contigs(islands)

The Lander-Waterman model. Genomics. 1988 Apr;2(3):231-9.

Important result about coverage (direct quotes from the paper):

Genome sequencing strategies

“a greater risk of long-range misassembly”

“additional preliminary work”

Proc Natl Acad Sci U S A. 2002 March 19; 99(6): 3712–3716. HGP Celera

Genome sequencing strategies

Whole genome using next generation sequencing ?

Next generation sequencing generates much shorter reads at higher throughput.

nature biotechnology 26(10):1135. Sanger method Next Generation

Whole genome using next generation sequencing ?

Nature Reviews Drug Discovery 12, 358–369 (2013)

Application of next-generation sequencing (NGS) in various ways:

Between protein and gene sequences

- The gene sequence (DNA/RNA) is made of 4 characters (nucleotides).
- The protein sequence is made of 20 characters (amino acids).
- The DNA codons:
- When discussing alignment, we consider both DNA sequences and protein sequences.

Sequence alignment

• Alignment answers the question of whether two or more sequences are (evolutionarily) related.

• Example: A new gene is found in human. We wish to study its properties. To get a hint, we try to find its corresponding part in mouse. Among the tens of thousands of genes in mouse, which is the one that’s most related to this human gene? (BLAST.)

• Sequence segments more important to biological functions are more conserved. Align proteins sharing a function to find peptide sequences more important to the function. Align DNA sequences upstream of (functionally or evolutionarily) related genes to find segments that bind transcription factors.

Kapitonov VV, Jurka J PLoS Biology Vol. 3, No. 6, e181 

Sequence alignment

http://pevsnerlab.kennedykrieger.org/wiley/bestfit2.gif

Sequence alignment

• Alignment serves as the basis for determining evolutionary relationships.

• Example:

http://www.computational-genomics.net/case_studies/sabertooth_demo_37.png

Sequence alignment

In the design and/or data interpretation of sequence-based high-throughput technologies, sequence comparison is necessary:

- To find microarray probes that do not have similar sequence with other genes.

- To match sequences in high-throughput sequencing data to genome.

- To find motifs based on ChIP-chip/ChIP-seq data.

- ……

Sequence alignment

Difficulty in alignment

The huge search space (the number of possible alignments)!

Sequence 1: GSAQVK
Sequence 2: GNPKVK

GSAQVK-----    GSAQVK----
-----GNPKVK    ----GNPKVK

GSAQ--VK       -----GSAQVK
-G-NPKVK       GNPKVK-----

………………… How to pick the best?
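To give a feel for the size of the search space, here is a minimal Python sketch that counts alignments. It rests on an assumption not stated in the slides: every alignment is a sequence of columns, each column is either an aligned pair, a gap in sequence 1, or a gap in sequence 2, and alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is made up for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count gapped alignments of two sequences of lengths m and n.

    Assumption for this sketch: the last column is an aligned pair,
    a gap in sequence 2, or a gap in sequence 1.
    """
    if m == 0 or n == 0:
        return 1  # only gap columns remain, so there is exactly one way
    return (count_alignments(m - 1, n - 1)   # last column: aligned pair
            + count_alignments(m - 1, n)     # last column: gap in sequence 2
            + count_alignments(m, n - 1))    # last column: gap in sequence 1


print(count_alignments(6, 6))    # two 6-residue peptides such as GSAQVK / GNPKVK
print(count_alignments(20, 20))  # still short sequences, already a very large count
```

Under this counting, two sequences of length 6 already admit 8989 alignments, and the number grows exponentially with length.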

The scoring system

• A quasi-statistical model: the log-likelihood ratio

$$\log\frac{P(x,y\mid M)}{P(x,y\mid R)} = \log\prod_i \frac{P_{x_i y_i}}{P_{x_i}P_{y_i}} = \sum_i \log\frac{P_{x_i y_i}}{P_{x_i}P_{y_i}}$$

x: sequence segment 1, y: sequence segment 2
i: aligned position
P_{x_i y_i}: Prob(x_i and y_i are in aligned position)
P_{x_i}: Prob(x_i occurs)
P_{y_i}: Prob(y_i occurs)

M: this aligned model
R: x, y are independent sequences
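As a small illustration of the log-likelihood ratio, the sketch below scores an ungapped pair of segments by summing log(P_xy / (P_x P_y)) over aligned positions. The two-letter alphabet and all probabilities are invented toy values, not estimates from real data, and `log_odds_score` is a hypothetical helper name.

```python
import math

# Toy two-letter alphabet; every number here is invented for illustration.
single_prob = {"A": 0.5, "B": 0.5}
# Probability of each ordered aligned pair, stored once per unordered pair.
pair_prob = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "B"): 0.4}


def log_odds_score(x, y, pair_prob, single_prob):
    """Sum over aligned positions of log(P_xy / (P_x * P_y))."""
    if len(x) != len(y):
        raise ValueError("segments must have the same length (no gaps here)")
    total = 0.0
    for a, b in zip(x, y):
        p_ab = pair_prob[tuple(sorted((a, b)))]
        total += math.log(p_ab / (single_prob[a] * single_prob[b]))
    return total


print(log_odds_score("AAB", "ABB", pair_prob, single_prob))
```

A positive total means the aligned model M explains the pair better than the independence model R; a negative total means the opposite.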

The scoring system

Note: this kind of “model” is for illustration purposes only!

There are many possible models: M1, M2, ……, Mn

According to Bayes' rule,

$$P(M_i\mid x,y) = \frac{P(x,y\mid M_i)\,P(M_i)}{\sum_j P(x,y\mid M_j)\,P(M_j)} \propto P(x,y\mid M_i)$$

This is also the MLE, since we are using an uninformative prior: P(M_i) is the same for every model.

The scoring system

• Given the above, the best alignment model is the one that maximizes

$$\sum_i \log\frac{P_{x_i y_i}}{P_{x_i}P_{y_i}}$$

Notice that there are a limited number of possible (x_i, y_i) pairs. For amino acids, there are 20 choose 2 = 190 pairs of different amino acids and 20 pairs of identical amino acids, plus any amino acid against a gap (all given the same value: the “gap penalty”).

How do we get these values?

The scoring system

• These values are obtained from some high-confidence alignments:

GAPVKFC
GAPKKFC

$$\log\frac{P_{x_i y_i}}{P_{x_i}P_{y_i}}$$

The log likelihood ratio of the pair occurring as an aligned pair as opposed to an unaligned pair.

It can be obtained in multiple ways. Generally, such a matrix describes the rate that one character (amino acid or nucleotide) is replaced by another one during evolution.
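One simple way to obtain such values can be sketched in Python: count aligned pairs and single residues in trusted alignments, then take the log ratio. This is only a toy estimate under stated assumptions: the alignments are ungapped, an unordered pair of different residues is split evenly between its two orderings, and no pseudo-counts are used. Real matrices are built from many alignments; `estimate_log_odds` is a name invented for this illustration.

```python
import math
from collections import Counter


def estimate_log_odds(aligned_pairs):
    """Estimate log(P_ab / (P_a * P_b)) from ungapped aligned sequence pairs."""
    pair_counts = Counter()    # unordered aligned pairs
    single_counts = Counter()  # individual residues
    for x, y in aligned_pairs:
        for a, b in zip(x, y):
            pair_counts[tuple(sorted((a, b)))] += 1
            single_counts[a] += 1
            single_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_single = sum(single_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        if a != b:
            p_ab /= 2  # share the count between (a, b) and (b, a)
        p_a = single_counts[a] / n_single
        p_b = single_counts[b] / n_single
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


scores = estimate_log_odds([("GAPVKFC", "GAPKKFC")])
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

With a single short alignment only the observed pairs receive a score, which is why real substitution matrices rely on large collections of alignments.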

The scoring system

PAM (Point Accepted Mutation) matrices are calculated from closely related sequences. Finding aligned positions is simple because the two sequences differ at only a few positions. The number following “PAM” denotes evolutionary distance. The PAM1 matrix (one accepted point mutation per 100 amino acids) considers only direct mutations.

Raising the PAM1 mutation probability matrix to the N-th power, with proper normalization, yields the PAM-N matrix.
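The extrapolation step can be sketched with a toy matrix. The 2x2 "PAM1" below is invented (a real PAM1 is 20x20 and estimated from data); the sketch only shows that repeated multiplication of a mutation probability matrix keeps each row a probability distribution while spreading probability away from the diagonal. Converting the result into log-odds scores is a further step not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical two-letter "PAM1": each letter mutates to the other with probability 0.01.
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]
pam250_toy = mat_pow(pam1_toy, 250)

for row in pam250_toy:
    print([round(p, 4) for p in row])
```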

The scoring system

http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformatics_mscIT/slides/slides3/img033.gif

BLOSUM (BLOcks of Amino Acid SUbstitution Matrix).

It comes from blocks of aligned sequences with less similarity than those used by PAM.

The log-odds ratios are calculated directly from the observed alignments. Unlike PAM, there is no extrapolation.


The scoring system

http://en.wikipedia.org/wiki/Image:BLOSUM62.gif

Dynamic programming

• Now, for every pair of amino acids, we have a score that can be considered the log likelihood ratio. So what? Do we still need to perform endless computations to compare all models?

• The answer is of course NO.

• Different models share sub-models. We should not re-calculate the score of a shared sub-model.

• For non-optimal models, we do not need to know their exact score.

GSAQVK    G-SA-QVK    GSA-QVK
GNPKVK    GN-PK-VK    G-NPKVK

“Two sledgehammers of the algorithms craft, dynamic programming and linear programming”

Dynamic programming:

Break the overall optimization problem into overlapping smaller problems;

Solve each sub-problem once and reuse the results, thus reducing the computing cost (dramatically);

Often work backward.

Dynamic programming

A simple example:

Find the shortest path from S to E in the directed acyclic graph below.

Take node D as an example. The way to get to D is through B or C. So, dist(D) = min(dist(B) + 1, dist(C) + 3).

Algorithms, Dasgupta et al.

Dynamic programming

Linearization:

Algorithm:

Algorithms, Dasgupta et al.

Dynamic programming

Compare:

(1) Exhaustive approach
SABE: 1+6+2
SCABE: 2+4+6+2
SCDE: 2+3+1
SCABDE: 2+4+6+1+1

11 additions. Complexity grows exponentially with the size of the graph.

(2) DP approach
Dist(A)=min(1, 2+4)=1
Dist(C)=2
Dist(B)=min(dist(A)+6)=1+6=7
Dist(D)=min(dist(B)+1, dist(C)+3)=min(7+1, 2+3)=5
Dist(E)=min(dist(B)+2, dist(D)+1)=min(7+2, 5+1)=6

6 additions. Complexity grows linearly with the size of the graph.
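The DP computation for this graph can be written as a short Python sketch. The edge lengths are read off the path sums listed in the comparison (S→A 1, S→C 2, C→A 4, A→B 6, C→D 3, B→D 1, B→E 2, D→E 1); the node order S, C, A, B, D, E is one valid linearization chosen for this sketch, and `dag_shortest_paths` is an illustrative name.

```python
import math

# Directed acyclic graph: node -> list of (successor, edge length).
edges = {
    "S": [("A", 1), ("C", 2)],
    "C": [("A", 4), ("D", 3)],
    "A": [("B", 6)],
    "B": [("D", 1), ("E", 2)],
    "D": [("E", 1)],
    "E": [],
}
# A linearization (topological order): every edge points left to right.
order = ["S", "C", "A", "B", "D", "E"]


def dag_shortest_paths(edges, order, source):
    """Shortest distance from source to every node, one pass in topological order."""
    dist = {node: math.inf for node in order}
    dist[source] = 0
    for u in order:
        # dist[u] is final here because all predecessors of u came earlier.
        for v, length in edges[u]:
            dist[v] = min(dist[v], dist[u] + length)
    return dist


dist = dag_shortest_paths(edges, order, "S")
print(dist)  # dist["E"] is the length of the shortest S-to-E path
```

Each edge is examined exactly once, which is why the cost grows linearly with the size of the graph.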

Dynamic programming

Another example: a game of picking up matches.

There are 30 matches on the table. You start by picking up 1 to 3 matches, then your opponent picks up 1 to 3 matches. This goes on until the last match is picked up. The person who picks up the last match is the loser.

Last step: you want to leave the last match (1) to your opponent.

One step back: you want to leave 5 matches to your opponent. Whatever the opponent picks up (1 to 3), you can then leave exactly 1.

Two steps back: you want to leave 9 matches to your opponent.

Working backward, the numbers you want to leave to your opponent are 1, 5, 9, 13, 17, 21, 25, 29.

The first step: you want to leave 29 matches to your opponent, so you pick up 1.

Operations Research: mathematical programming, Winston WL
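The backward reasoning for the match game can be checked with a small Python sketch. It builds a table from the smallest pile upward: a pile size is losing for the player about to move if every allowed move hands the opponent a winning pile. The function name and parameters are chosen for this illustration only.

```python
def losing_positions(total=30, max_pick=3):
    """Pile sizes at which the player about to move loses with best play.

    Rule from the example: each turn removes 1..max_pick matches and
    whoever picks up the last match loses.
    """
    loses = [False] * (total + 1)  # loses[k]: the mover facing k matches loses
    loses[1] = True                # forced to pick up the last match
    for k in range(2, total + 1):
        # The mover wins if some move leaves the opponent a losing pile.
        # Moves that take every remaining match are skipped: they lose at once.
        can_win = any(loses[k - take]
                      for take in range(1, max_pick + 1)
                      if k - take >= 1)
        loses[k] = not can_win
    return [k for k in range(1, total + 1) if loses[k]]


print(losing_positions())  # the pile sizes you want to leave to your opponent
```

Each pile size is evaluated once and reuses the results for smaller piles, which is the same reuse of sub-problem results that the alignment algorithms rely on.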


Dynamic programming

Two critical components of dynamic programming sequence alignment:

(1) The assumption that the score of an aligned pair of amino acids is independent of other parts of the alignment.

(2) Due to the additive nature of the likelihood, any sub-alignment of the optimal overall alignment must also be optimal. Thus the search for the global optimum becomes a sequential search for local optima.

Dynamic programming

Any sub-model of the best model must be optimal. Example: if this is the best model,

GS-AQVK
G-NPKVK

then any part of it, for example the middle part

-AQ
NPK

must be at least as good as any other possible sub-model of the same segments. Otherwise, say if it is not as good as

A-Q
NPK

then

GSA-QVK
G-NPKVK

is better than the “best” model.

Dynamic programming is used when the overall solution can be achieved by recursively solving sub-problems.

It saves computing time by re-using the results of sub-models.

GSAQVK    G-SA-QVK    GSA-QVK
GNPKVK    GN-PK-VK    G-NPKVK

In contrast, the “brute-force” method, which simply lists all possible alignments and re-calculates the score for every model, solves the same sub-problems over and over again.

Dynamic programming

Needleman-Wunsch algorithm for global alignment

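$$M_{i,j} = \max\begin{cases} M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\ M_{i,j-1} + w & \text{(gap in sequence \#1)} \\ M_{i-1,j} + w & \text{(gap in sequence \#2)} \end{cases}$$

Create a matrix of dimension (m+1)x(n+1), where m and n are the lengths of the two sequences. Let M(0,0)=0. Cells in the first row and column have only one predecessor, so M(i,0) = i·w and M(0,j) = j·w. Iteratively calculate the other values in the matrix, while recording which way (align, gap-x, gap-y) led to the maximum for each cell. The alignment is recovered by tracing the recorded choices back from M(m,n).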

Example matrix to fill in: columns A W G H E E, rows P A W H E A.

Gap penalty: -8
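A minimal Python sketch of the Needleman-Wunsch procedure follows. Two things are assumptions of the sketch rather than part of the lecture: the scoring function `simple_score` (+5 for a match, -4 for a mismatch) stands in for a real substitution matrix such as BLOSUM or PAM, and ties are broken in favor of the diagonal move. Here "up" means a residue of x is aligned to a gap and "left" means a residue of y is aligned to a gap.

```python
def simple_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def needleman_wunsch(x, y, score=simple_score, gap=-8):
    """Global alignment; returns (best score, aligned x, aligned y)."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: x against gaps only
        M[i][0], back[i][0] = i * gap, "up"
    for j in range(1, n + 1):          # first row: y against gaps only
        M[0][j], back[0][j] = j * gap, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            choices = [
                (M[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (M[i - 1][j] + gap, "up"),
                (M[i][j - 1] + gap, "left"),
            ]
            M[i][j], back[i][j] = max(choices, key=lambda c: c[0])
    # Trace the recorded choices back from the bottom-right cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AWGHE", "AWHE"))
print(needleman_wunsch("GSAQVK", "GNPKVK"))
```

The table has (m+1)(n+1) cells and each cell looks at three neighbors, so the work is proportional to m·n rather than to the number of possible alignments.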

Needleman-Wunsch algorithm

http://www.matfys.kvl.dk/databehandling/f2004/eksempler/word/needleman-wunsch.png

Smith-Waterman algorithm

The previous algorithm looks for the best OVERALL alignment between two sequences. A more common task is to find the best SUB-SEQUENCE (local) alignments between two sequences.

$$M_{i,j} = \max\begin{cases} 0 \\ M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\ M_{i,j-1} + w & \text{(gap in sequence \#1)} \\ M_{i-1,j} + w & \text{(gap in sequence \#2)} \end{cases}$$
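The local variant can be sketched in Python as follows. As assumptions of this sketch, `simple_score` (+5 match, -4 mismatch) is a hypothetical stand-in for a substitution matrix and the gap penalty is -8. The differences from the global version are the extra 0 in the maximum, a first row and column of zeros, and a traceback that starts at the highest-scoring cell and stops when it reaches a 0.

```python
def simple_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def smith_waterman(x, y, score=simple_score, gap=-8):
    """Local alignment; returns (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]  # borders stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(0,
                          M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("PPAWHEQQ", "KKAWHEKK"))
```

Because a cell never drops below 0, a poorly matching region before or after the conserved segment does not drag the local score down.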

Smith-Waterman algorithm

http://www.bi.a.u-tokyo.ac.jp/~shimizu/bioinfo/s-w.gif

Comments

In biological sequences, a gap is often longer than 1. For example, the insertion of a new functional domain. A long gap may be penalized too much by the previous methods.

- assign two different penalties:
  gap initiation penalty.
  gap extension penalty.

Now at every position, 3 values, instead of 1, are kept.
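One common way to keep three values per position is sketched below in Python (score only, global alignment). The state names M, X, Y, the scoring function, and the penalty values (-8 to open a gap, -1 for each additional gap position) are assumptions of this sketch, not values from the lecture. M holds the best score ending in an aligned pair, X the best score ending with a residue of x against a gap, and Y the best score ending with a residue of y against a gap.

```python
NEG_INF = float("-inf")


def simple_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def affine_global_score(x, y, score=simple_score, gap_open=-8, gap_extend=-1):
    """Best global alignment score with separate gap open and extension penalties."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):      # leading gap in y of length i
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):      # leading gap in x of length j
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = score(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,     # start a new gap
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)   # lengthen an existing gap
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[m][n], X[m][n], Y[m][n])


print(affine_global_score("AWGGGHE", "AWHE"))
```

In this example the three-residue insertion costs -8 - 1 - 1 = -10 instead of the -24 a single fixed gap penalty of -8 would charge.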

Significance of an alignment?

- Bayesian approach: model comparison.
- Significance derived from the extreme value distribution.

Difficulty in higher dimensions

When we have two sequences to align, we can use dynamic programming with O(n^2) calculations. Can we use the same idea when we have more sequences?

It is hard. For example, when we have 3 sequences, instead of operating in a square, we will be operating in a cube. The number of calculations grows exponentially with the number of sequences.
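A rough count makes the growth concrete. The Python sketch below assumes the straightforward extension of the pairwise recurrence to k sequences of equal length n: the table has (n+1)^k cells, and each cell has 2^k - 1 possible predecessor moves (every non-empty choice of which sequences advance). The helper name `dp_cost` is invented for this illustration.

```python
def dp_cost(n, k):
    """Table size and moves per cell for aligning k sequences of length n."""
    cells = (n + 1) ** k   # one cell per combination of prefix lengths
    moves = 2 ** k - 1     # each sequence either advances or takes a gap, not all gaps
    return cells, moves


for k in (2, 3, 4, 5):
    cells, moves = dp_cost(100, k)
    print(f"{k} sequences: {cells} cells, {moves} moves per cell")
```

For sequences of length 100, going from 2 to 3 sequences already raises the table from about ten thousand cells to about a million.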

http://www.bioinf2.leeds.ac.uk/b/MA_sect.gif