Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004


Algorithms for Alignment of Genomic Sequences

Michael Brudno

Department of Computer Science, Stanford University

PGA Workshop 07/16/2004


Conservation Implies Function

Exon

Gene

CNS: Other Conserved


Edit Distance Model (1)

Weighted sum of insertions, deletions & mutations to transform one string into another

AGGCACA--CA        AGGCACACA
|  ||||  ||   or   |  ||  ||
A--CACATTCA        ACACATTCA


Edit Distance Model (2)

Given: x, y

Define: F(i,j) = Score of best alignment of x1…xi to y1…yj

Recurrence: F(i,j) = max ( F(i-1,j) – GAP_PENALTY,
                           F(i,j-1) – GAP_PENALTY,
                           F(i-1,j-1) + SCORE(xi, yj) )
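A minimal Python sketch of the recurrence above. It assumes a simple match/mismatch SCORE and a linear GAP_PENALTY; the values 1, -1 and 2 are illustrative and do not come from the slides. It fills the F table and returns only the final score; traceback of the alignment itself is omitted.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Return F(len(x), len(y)) for the edit-distance recurrence."""
    n, m = len(x), len(y)
    # F[i][j] = score of the best alignment of x[:i] to y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty          # x[:i] aligned to nothing
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty          # y[:j] aligned to nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,      # gap in y
                F[i][j - 1] - gap_penalty,      # gap in x
                F[i - 1][j - 1] + score,        # match or mutation
            )
    return F[n][m]


if __name__ == "__main__":
    print(global_alignment_score("AGGCACACA", "ACACATTCA"))
```

The two nested loops visit every cell once, which is where the quadratic cost for two sequences comes from.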


Edit Distance Model (3)

F(i,j) = Score of best alignment ending at i,j

Time O(n^2) for two seqs, O(n^k) for k seqs

[Figure: DP matrix over the two sequences below; cell F(i,j) is computed from its neighbors F(i-1,j), F(i,j-1) and F(i-1,j-1)]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC


Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story


Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

F(i,j) = max (F(i,j), 0)

Return all paths with a position i,j where F(i,j) > C

Time O(n^2) for two seqs, O(n^k) for k seqs
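A small Python sketch of the local variant, under the same illustrative scoring assumptions as before (match 1, mismatch -1, gap penalty 2, none of which come from the slides). The only change to the recurrence is the floor at 0. For brevity it reports the cells whose score exceeds the threshold C rather than tracing back the full paths.

```python
def local_alignment_hits(x, y, threshold, match=1, mismatch=-1, gap_penalty=2):
    """Return [(i, j, F(i,j))] for every cell with F(i,j) > threshold."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay at 0
    hits = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                              # restart the alignment here
                F[i - 1][j] - gap_penalty,
                F[i][j - 1] - gap_penalty,
                F[i - 1][j - 1] + score,
            )
            if F[i][j] > threshold:
                hits.append((i, j, F[i][j]))
    return hits


if __name__ == "__main__":
    # The shared word "ACG" is the only region scoring above 2.
    print(local_alignment_hits("TTTACGTTT", "GGACGGG", threshold=2))
```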


Heuristic Local Alignment

[Figure: two panels, BLAST and FASTA, each drawn over the sequence pair below]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC


CHAOS: CHAins Of Seeds

1. Find short matching words (seeds)

2. Chain them

3. Rescore chain
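Step 1 can be sketched as an index lookup. The snippet below is an illustrative Python version that assumes seeds are exact matching words of a fixed length k (the name `find_seeds` and the default k=4 are hypothetical, not taken from CHAOS). It indexes every k-mer of seq2 and looks up each k-mer of seq1.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k=4):
    """Return (i, j) start positions where seq1[i:i+k] == seq2[j:j+k]."""
    index = defaultdict(list)                   # k-mer -> positions in seq2
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    seeds = []
    for i in range(len(seq1) - k + 1):          # scan seq1 left to right
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TACGT", k=4))
```

Scanning seq1 left to right yields the seeds in the order in which a chaining step would want to consume them.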


CHAOS: Chaining the Seeds

• Find seeds at current location in seq1

• Find the previous seeds that fall into the search box

• Do a range query: seeds are indexed by their diagonal

• Pick a previous seed that maximizes the score of chain

[Figure: seq1 plotted against seq2, showing the current location in seq1, a seed, the distance cutoff and gap cutoff that bound the search box, and the range of search]

Time O(n log n), where n is number of seeds.
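The chaining steps above can be sketched in Python as follows. This is a simplified illustration, not the CHAOS implementation. It assumes seeds are (i, j) start positions of exact matches of length k, that the distance cutoff is measured along seq1, that the gap is the shift between diagonals, and that a chain scores matching bp minus gaps. It keeps processed seeds in a list sorted by diagonal and uses `bisect` for the range query; old seeds are never evicted, so this sketch does not reach the O(n log n) bound.

```python
from bisect import bisect_left, bisect_right, insort


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Return (best_score, chain) where chain is a list of (i, j) seeds."""
    if not seeds:
        return 0, []
    seeds = sorted(seeds)                       # process by position in seq1
    n = len(seeds)
    score = [k] * n                             # a lone seed scores k matches
    back = [None] * n
    by_diag = []                                # sorted (diagonal, seed index)
    for cur, (i, j) in enumerate(seeds):
        d = i - j
        # Range query: earlier seeds whose diagonal is within gap_cutoff.
        lo = bisect_left(by_diag, (d - gap_cutoff, -1))
        hi = bisect_right(by_diag, (d + gap_cutoff, n))
        for pd, p in by_diag[lo:hi]:
            pi, pj = seeds[p]
            if pi + k > i or pj + k > j:
                continue                        # must end before current seed
            if i - (pi + k) > distance_cutoff:
                continue                        # outside the search box
            cand = score[p] + k - abs(d - pd)   # matching bp minus gaps
            if cand > score[cur]:
                score[cur], back[cur] = cand, p
        insort(by_diag, (d, cur))
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end is not None:                      # follow back-pointers
        chain.append(seeds[end])
        end = back[end]
    return best, chain[::-1]


if __name__ == "__main__":
    print(chain_seeds([(0, 0), (5, 5), (10, 12), (40, 3)], k=4,
                      distance_cutoff=10, gap_cutoff=3))
```

In the example the seed at (40, 3) lies far off the diagonal of the others, so it falls outside every search box and is left out of the chain.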


CHAOS Scoring

• Initial score = # matching bp - gaps

• Rapid rescoring: extend all seeds to find optimal location for gaps


Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story


Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure: alignment diagram with labels x, y, z]


LAGAN: 1. FIND Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP


LAGAN: 2. CHAIN Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP


LAGAN: 3. Restricted DP

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP
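One simple way to picture step 3 is sketched below in Python. It assumes the chain from step 2 has been reduced to anchor points (i, j), each meaning "the alignment must pair the prefix x[:i] with y[:j]", and it runs the full DP only inside the boxes between consecutive anchors. The function names and the scoring values are illustrative assumptions, and restricting to boxes is a simplification rather than the exact region LAGAN searches.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=2):
    """Global alignment score of x and y, keeping one DP row at a time."""
    prev = [-j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j] - gap, cur[j - 1] - gap, prev[j - 1] + s))
        prev = cur
    return prev[-1]


def restricted_dp_score(x, y, anchors):
    """Score of the best alignment forced through every anchor point.

    anchors: list of (i, j), strictly increasing in both coordinates.
    """
    total = 0
    pi, pj = 0, 0
    # Align each box between consecutive anchors independently.
    for i, j in list(anchors) + [(len(x), len(y))]:
        total += nw_score(x[pi:i], y[pj:j])
        pi, pj = i, j
    return total


if __name__ == "__main__":
    print(restricted_dp_score("ACGTACGT", "ACGTACGT", [(4, 4)]))
```

The work is the sum of the box areas instead of the full n by m matrix, at the price of never finding an alignment that avoids an anchor.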


MLAGAN: 1. Progressive Alignment

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
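The order of pairwise alignments implied by a tree can be sketched with a post-order traversal. The Python snippet below is illustrative only: the tree shape ((Human, Baboon), (Mouse, Rat)) is an assumption, since the figure is not in the transcript, and `progressive_order` is a hypothetical helper that lists the merges without performing any alignment.

```python
def progressive_order(tree):
    """List the pairwise alignment steps for a binary tree of nested tuples."""
    steps = []

    def visit(node):
        if isinstance(node, str):               # a leaf is a single sequence
            return node
        left, right = node
        l_name = visit(left)                    # align the subtrees first
        r_name = visit(right)
        steps.append((l_name, r_name))          # then align their results
        return f"({l_name}/{r_name})"

    visit(tree)
    return steps


if __name__ == "__main__":
    tree = (("Human", "Baboon"), ("Mouse", "Rat"))
    for a, b in progressive_order(tree):
        print("align", a, "with", b)
```

Each step after the leaves aligns two existing alignments, which is why the later steps need the multi-anchoring described next.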


MLAGAN: 2. Multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: panels labeled XZ, YZ, and X/Y against Z]


Cystic Fibrosis (CFTR), 12 species

• Human sequence length: 1.8 Mb

• Total genomic sequence: 13 Mb

[Figure: phylogenetic tree of the 12 species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugu, Zebrafish]


CFTR (cont’d)

Method   Species            % Exons Aligned   TIME (sec)   MAX MEMORY (Mb)
LAGAN    Mammals            99.7%             550          90
LAGAN    Chicken & Fishes   96%               862          90
MLAGAN   Mammals            99.8%             4547         670
MLAGAN   Chicken & Fishes   98%


Automatic computational system for comparative analysis of pairs of genomes

http://pipeline.lbl.gov

Alignments (all pair combinations):

Human Genome (Golden Path Assembly)

Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)

Rat assemblies: January 2003, February 2003

----------------------------------------------------------

D. melanogaster vs D. pseudoobscura, February 2003


Tandem Local/Global Approach

• Finding a likely mapping for a contig (BLAT)


Progressive Alignment Scheme

[Flowchart; boxes listed in order, with yes/no decision branches between them]

Input: Human, Mouse and Rat genomes

Pairwise M/R mapping: aligned M&R fragments, unaligned M&R sequences

Map to Human Genome: mapping aligned fragments by union of M&R local BLAT hits on the human genome

Outcomes:

• H/M/R MLAGAN alignment

• M/R pairwise alignment

• M/H and R/H pairwise alignment

• Unassigned M&R DNA fragments


Computational Time

PC cluster of 23 dual 2.2 GHz Intel Xeon nodes.

Pair-wise rat/mouse – 4 hours

Pair-wise rat/human and mouse/human – 2 hours

Multiple human/mouse/rat – 9 hours

Total wall time: ~ 15 hours


Distribution of Large Indels

[Figure: plot of Count (y-axis, 0 to 200) against Indel length (x-axis, 100 to 550)]


Evolution Over a Chromosome


Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story


Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication


Local & Global Alignment

[Figure: two panels, Local and Global, each drawn over the sequence pair below]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC


Glocal Alignment Problem

Find least cost transformation of one sequence into another using new operations

• Sequence edits

• Inversions

• Translocations

• Duplications

• Combinations of above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC


Shuffle-LAGAN

A glocal aligner for long DNA sequences


S-LAGAN: Find Local Alignments

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


S-LAGAN: Build Homology Map

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


Building the Homology Map

[Figure: the four chain types a, b, c, d drawn relative to the current alignment]

Chain (using Eppstein-Galil); each alignment gets a score which is MAX over 4 possible chains.

Penalties are affine (event and distance components)

Penalties:

a) regular

b) translocation

c) inversion

d) inverted translocation
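A rough Python sketch of chaining with the four penalty types. Everything specific here is an assumption made for illustration: the `Hit` record, the rule that decides which of the four types links two local alignments (same or opposite strand, in or out of order in the second sequence), the event costs, and the distance cost of one point per 10 bases. It uses a plain quadratic scan rather than the Eppstein-Galil algorithm named on the slide, and it only requires the chain to be ordered in the first sequence.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "start1 end1 start2 end2 strand score")

# Hypothetical affine penalties: a fixed event cost plus a distance cost.
EVENT_PENALTY = {
    "regular": 0,
    "translocation": 20,
    "inversion": 15,
    "inverted translocation": 30,
}
BASES_PER_POINT = 10


def transition_type(prev, cur):
    """Classify the link prev -> cur (illustrative rule, see lead-in)."""
    same_strand = prev.strand == cur.strand
    in_order = prev.end2 <= cur.start2
    if same_strand:
        return "regular" if in_order else "translocation"
    return "inversion" if in_order else "inverted translocation"


def transition_penalty(prev, cur):
    distance = (cur.start1 - prev.end1) + abs(cur.start2 - prev.end2)
    return EVENT_PENALTY[transition_type(prev, cur)] + distance // BASES_PER_POINT


def glocal_chain(hits):
    """Best chain ordered in sequence 1 only; returns (score, [Hit, ...])."""
    if not hits:
        return 0, []
    order = sorted(hits, key=lambda h: h.end1)
    best = [h.score for h in order]
    back = [None] * len(order)
    for c, cur in enumerate(order):
        for p in range(c):
            prev = order[p]
            if prev.end1 > cur.start1:
                continue                        # overlaps in sequence 1
            cand = best[p] - transition_penalty(prev, cur) + cur.score
            if cand > best[c]:
                best[c], back[c] = cand, p
    end = max(range(len(order)), key=best.__getitem__)
    total, chain = best[end], []
    while end is not None:
        chain.append(order[end])
        end = back[end]
    return total, chain[::-1]


if __name__ == "__main__":
    a = Hit(0, 100, 0, 100, "+", 100)
    b = Hit(110, 200, 300, 390, "-", 90)
    c = Hit(210, 300, 110, 200, "+", 90)
    print(glocal_chain([a, b, c]))
```

In the example the opposite-strand hit `b` is kept in the chain because the score it adds outweighs the two rearrangement penalties around it.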


S-LAGAN: Build Homology Map

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


S-LAGAN: Global Alignment

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


S-LAGAN Results (CFTR)

Local

Glocal


S-LAGAN Results (CFTR)

Hum/Mus

Hum/Rat


S-LAGAN Results (IGF cluster)


S-LAGAN results (HOX)

• 12 paralogous genes

• Conserved order in mammals



S-LAGAN Results (Chr 20)

• Human Chr 20 v. homologous Mouse Chr 2.

• 270 Segments of conserved synteny

• 70 Inversions


S-LAGAN Results (Whole Genome)

           LAGAN     S-LAGAN
Total      37%       38%
Exon       93%       96%
Ups200     78%       81%
CPU Time   350 Hrs   450 Hrs

• Used Berkeley Genome Pipeline

• % Human genome aligned with mouse sequence

• Evaluation criteria from Waterston et al. (Nature 2002)


Rearrangements in Human v. Mouse

Preliminary conclusions:

• Rearrangements come in all sizes

• Duplications are less well conserved than other rearranged regions

• Simple inversions tend to be most common and most conserved


What is next? (Shuffle)

• Better algorithm and scoring

• Whole genome synteny mapping

• Multiple Glocal Alignment (!?)


Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story


Biological Story

• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development


Align Human, Mouse, Rat & Fugu


Detailed Alignment

hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001
mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001
rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938
fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001
mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001
rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938
fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174


Can we align human & fly???

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
 GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan  GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA


Putting it all together

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
 GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan  GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA


Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story


Acknowledgments

Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott, Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan

Berkeley: Inna Dubchak, Alexander Poliakov

Göttingen: Burkhard Morgenstern

Rat Genome Sequencing Consortium

http://lagan.stanford.edu/