Alignments and Comparative Genomics

Post on 21-Dec-2015


Page 1: Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia

Alignments and Comparative Genomics


Welcome to CS374!

Today:

• Serafim: Alignments and Comparative Genomics

• Omkar: Administrivia


Biology in One Slide – Twentieth Century

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

…and today


Complete DNA Sequences

Nearly 200 complete genomes have been sequenced


Evolution


Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication


Evolutionary Rates

[Figure: sequence changes passed to the next generation; some are tolerated (OK), some are not (X), and some are uncertain (Still OK?)]


Sequence conservation implies function

Alignment is the key to

• Finding important regions

• Determining function

• Uncovering the evolutionary forces


Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC


What is a good alignment?

Alignment: The “best” way to match the letters of one sequence with those of the other

How do we define “best”?

Alignment: A hypothesis that the two sequences come from a common ancestor through sequence edits

Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other


Scoring Function

• Sequence edits: AGGCCTC

    Mutations:  AGGACTC
    Insertions: AGGGCCTC
    Deletions:  AGG.CTC

Scoring Function:

    Match:    +m
    Mismatch: -s
    Gap:      -d

Score $F = (\#\,\text{matches}) \cdot m - (\#\,\text{mismatches}) \cdot s - (\#\,\text{gaps}) \cdot d$
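The scoring function above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score`, the use of `-` as the gap character, and the default values m = s = d = 1 are assumptions, not part of the slides.

```python
def alignment_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Score two gapped rows of equal length: +m per match, -s per mismatch, -d per gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            raise ValueError("a column cannot contain two gaps")
        if p == "-" or q == "-":
            score -= d          # one letter facing a gap
        elif p == q:
            score += m          # match
        else:
            score -= s          # mismatch
    return score


# The mutation example from the slide: six matches and one mismatch.
print(alignment_score("AGGCCTC", "AGGACTC"))  # 5
```

With unit costs, maximizing this score is closely related to minimizing the number of edits, which is the parsimony idea from the previous slide.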


How do we compute the best alignment?

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments:

$O(2^{M+N})$


Dynamic Programming

• Given two sequences $x = x_1 \ldots x_M$ and $y = y_1 \ldots y_N$

• Let $F(i, j)$ = score of the best alignment of $x_1 \ldots x_i$ to $y_1 \ldots y_j$

• Then $F(M, N)$ = score of the best alignment

Idea: Compute $F(i, j)$ for all $i$ and $j$. Do this by using $F(i-1, j)$, $F(i, j-1)$, $F(i-1, j-1)$


Dynamic Programming (cont’d)

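Notice three possible cases:

1. $x_i$ aligns to $y_j$

    $x_1 \ldots x_{i-1} \;\; x_i$
    $y_1 \ldots y_{j-1} \;\; y_j$

    $F(i, j) = F(i-1, j-1) + \begin{cases} m, & \text{if } x_i = y_j \\ -s, & \text{if not} \end{cases}$

2. $x_i$ aligns to a gap

    $x_1 \ldots x_{i-1} \;\; x_i$
    $y_1 \ldots y_j \;\; -$

    $F(i, j) = F(i-1, j) - d$

3. $y_j$ aligns to a gap

    $x_1 \ldots x_i \;\; -$
    $y_1 \ldots y_{j-1} \;\; y_j$

    $F(i, j) = F(i, j-1) - d$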


Dynamic Programming (cont’d)

• How do we know which case is correct?

Inductive assumption: $F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal

Then,

$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$

where $s(x_i, y_j) = m$ if $x_i = y_j$; $-s$ if not

[Figure: cell (i, j) is filled from its neighbors (i-1, j-1), (i-1, j), and (i, j-1)]


Example

x = AGTA    m = 1
y = ATA     s = 1
            d = 1

(match +1, mismatch -1, gap -1, following the sign convention of the scoring function)

F(i, j)     i =   0    1    2    3    4
                       A    G    T    A
j = 0             0   -1   -2   -3   -4
j = 1   A        -1    1    0   -1   -2
j = 2   T        -2    0    0    1    0
j = 3   A        -3   -1   -1    0    2

Optimal Alignment:

F(4, 3) = 2

AGTA
A-TA


The Needleman-Wunsch Algorithm

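1. Initialization.
    a. $F(0, 0) = 0$
    b. $F(0, j) = -j \cdot d$
    c. $F(i, 0) = -i \cdot d$

2. Main Iteration. Filling in partial alignments
    a. For each $i = 1 \ldots M$
         For each $j = 1 \ldots N$

         $F(i, j) = \max \begin{cases} F(i-1, j) - d & \text{[case 1]} \\ F(i, j-1) - d & \text{[case 2]} \\ F(i-1, j-1) + s(x_i, y_j) & \text{[case 3]} \end{cases}$

         $\mathrm{Ptr}(i, j) = \begin{cases} \text{UP}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{DIAG}, & \text{if [case 3]} \end{cases}$

3. Termination. $F(M, N)$ is the optimal score, and from $\mathrm{Ptr}(M, N)$ we can trace back the optimal alignment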
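The three steps above translate fairly directly into code. The following is a minimal sketch rather than a reference implementation: the function name, the list-of-lists tables, and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made for illustration.

```python
def needleman_wunsch(x: str, y: str, m: int = 1, s: int = 1, d: int = 1):
    """Global alignment by dynamic programming; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    # F[i][j]: best score for aligning x[:i] with y[:j]; ptr[i][j]: which case won.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: a prefix aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # 2. Main iteration: take the best of the three cases for every cell.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # x_i aligned to y_j
                (F[i - 1][j] - d, "UP"),          # x_i aligned to a gap
                (F[i][j - 1] - d, "LEFT"),        # y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: follow the pointers back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

On the example from the previous slide this reproduces F(4, 3) = 2 and the alignment AGTA / A-TA. Both time and memory are proportional to M times N, which is what makes the approach impractical at genome scale.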


Alignment on a Large Scale

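• Given a newly sequenced organism,

• Which subregions align with other organisms?

    Potential genes
    Other biological characteristics

• Assume we use Dynamic Programming:

    Our newly sequenced mammal: $3 \times 10^9$ letters
    The entire genomic database: $10^{10}$ to $10^{11}$ letters

    Filling the full DP matrix would take on the order of $3 \times 10^{19}$ to $3 \times 10^{20}$ cell updates.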


Index-based Local Alignment

Main idea:

1. Construct a dictionary of all the words in the query

2. Initiate a local alignment for each word match between query and DB

Running Time:

    Theoretical worst case: O(MN)
    Fast in practice

[Figure: words of the query matched against the DB]
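A rough sketch of the two steps of the main idea, restricted to exact word matches, might look as follows. The names `build_index` and `find_seeds`, and the word length parameter `k`, are illustrative assumptions; real tools use more compact index structures.

```python
from collections import defaultdict


def build_index(query: str, k: int) -> dict:
    """Step 1: map every length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query: str, db: str, k: int) -> list:
    """Step 2: scan the DB once and report (query_pos, db_pos) for each exact word match."""
    index = build_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], ()):
            seeds.append((i, j))
    return seeds


# Two short strings that share the words AGGT, GGTC and GTCC.
print(find_seeds("GTAAGGTCC", "GTTAGGTCC", k=4))  # [(3, 3), (4, 4), (5, 5)]
```

Each seed is a candidate starting point for a local alignment. The scan itself is linear in the DB length, and the O(MN) worst case arises only when nearly every position produces a match.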


Index-based Local Alignment — BLAST

Dictionary: All words of length k (~11)

Alignment initiated between exact-matching words (more generally, between words of alignment score ≥ T)

Alignment: Ungapped extensions until score falls below statistical threshold

Output: All local alignments with score > statistical threshold

[Figure: the DB is scanned against the dictionary of query words]


Index-based Local Alignment — BLAST

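[Figure: dot matrix with the sequence A C G A A G T A A G G T C C A G T along the top and a second sequence down the side; side letters as extracted: C C C T T C C T G G A T T G C G A]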

Example:

k = 4, T = 4

The matching word GGTC initiates an alignment

Extension to the left and right with no gaps until alignment falls < 50%

Output:

GTAAGGTCC
GTTAGGTCC
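The extension step can be sketched as below. Note the swap: instead of the slide's 50% rule, this sketch stops when the running score falls `x_drop` below the best score seen so far, which is a common drop-off heuristic. The function name, the parameters, and the `db` string (chosen so that it contains the slide's second output row) are all assumptions for illustration.

```python
def extend_ungapped(query, db, qi, dj, k, m=1, s=1, x_drop=2):
    """Extend an exact k-word match at query[qi:qi+k] == db[dj:dj+k] without gaps.

    Returns (score, query_start, db_start, length) of the best extension found.
    """
    def one_direction(step):
        # Walk away from the seed; remember the best cumulative score and how far it reached.
        best = run = best_len = 0
        n = 1
        while True:
            offset = (k - 1 + n) if step > 0 else -n
            a, b = qi + offset, dj + offset
            if a < 0 or b < 0 or a >= len(query) or b >= len(db):
                break
            run += m if query[a] == db[b] else -s
            if run > best:
                best, best_len = run, n
            elif best - run >= x_drop:
                break               # score dropped too far below the best: stop
            n += 1
        return best, best_len

    right_score, right_len = one_direction(+1)
    left_score, left_len = one_direction(-1)
    score = k * m + left_score + right_score
    return score, qi - left_len, dj - left_len, k + left_len + right_len


query = "ACGAAGTAAGGTCCAGT"      # top sequence from the slide
db = "AGCGTTAGGTCCTAGTC"         # hypothetical DB string containing GTTAGGTCC
score, qs, ds, length = extend_ungapped(query, db, qi=9, dj=7, k=4)
print(score, query[qs:qs + length], db[ds:ds + length])  # 7 GTAAGGTCC GTTAGGTCC
```

With these assumed inputs the seed GGTC grows into the nine-letter pair shown on the slide (eight matches, one mismatch).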


Gapped BLAST

[Figure: dot matrix with the sequence A C G A A G T A A G G T C C A G T along the top and the sequence C T G A T C C T G G A T T G C G A down the side]

Added features:

• Pairs of words can initiate alignment

• Nearby alignments are merged

• Extensions with gaps until score < T below best score so far

Output:

GTAAGGTCCAGT
GTTAGGTC-AGT


Example

Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
          1,726,556 sequences; 8,074,398,388 total letters

>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
 Length = 144487

 Score = 34.2 bits (17), Expect = 4.5
 Identities = 20/21 (95%)
 Strand = Plus / Plus

 Query: 4      tacaccccgattacaccccga 24
               ||||||| |||||||||||||
 Sbjct: 125138 tacacccagattacaccccga 125158

 Score = 34.2 bits (17), Expect = 4.5
 Identities = 20/21 (95%)
 Strand = Plus / Plus

 Query: 4      tacaccccgattacaccccga 24
               ||||||| |||||||||||||
 Sbjct: 125104 tacacccagattacaccccga 125124

>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
 Length = 139823

 Score = 34.2 bits (17), Expect = 4.5
 Identities = 20/21 (95%)
 Strand = Plus / Plus

 Query: 4    tacaccccgattacaccccga 24
             ||||||| |||||||||||||
 Sbjct: 3891 tacacccagattacaccccga 3911

http://www.ncbi.nlm.nih.gov


Efficient global alignment

[Figure: two sequences, S1 and S2]


Global alignment with the chaining approach

1. Find local alignments

2. Chain them into a rough global map

3. Align regions in-between
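Step 2 can be illustrated with a simple quadratic-time dynamic program over the local alignments. This is a sketch under stated assumptions, not the chaining procedure of any particular tool: the tuple layout `(x_start, x_end, y_start, y_end, score)` with exclusive ends and the function name `chain_anchors` are hypothetical, and no penalty is charged for the gaps between chained pieces.

```python
def chain_anchors(anchors):
    """Pick the highest-scoring subset of local alignments that is increasing in both sequences.

    anchors: list of (x_start, x_end, y_start, y_end, score), ends exclusive.
    """
    anchors = sorted(anchors)              # by x_start, so predecessors come first
    n = len(anchors)
    if n == 0:
        return []
    best = [a[4] for a in anchors]         # best chain score ending at each anchor
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            # j may precede i only if it ends before i starts in both x and y
            if anchors[j][1] <= anchors[i][0] and anchors[j][3] <= anchors[i][2]:
                if best[j] + anchors[i][4] > best[i]:
                    best[i] = best[j] + anchors[i][4]
                    prev[i] = j
    # Trace back from the best final anchor.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


hits = [(0, 10, 0, 10, 10), (12, 20, 30, 38, 8), (15, 25, 12, 22, 9), (30, 40, 25, 35, 7)]
print(chain_anchors(hits))
# [(0, 10, 0, 10, 10), (15, 25, 12, 22, 9), (30, 40, 25, 35, 7)]
```

The hit at (12, 20, 30, 38) is left out because it is out of order in y with respect to the later hit. The regions between consecutive chained anchors are what step 3 would then align with ordinary dynamic programming.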


LAGAN: 1. FIND Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

Mike Brudno, Chuong B Do, et al.


LAGAN: 2. CHAIN Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

Mike Brudno, Chuong B Do, et al.


LAGAN: 3. Restricted DP

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

Mike Brudno, Chuong B Do, et al.


Restricted DP (cont’d)

• What if a box is too large? Recursive application of LAGAN, with a more sensitive word search


Multiple Alignment


Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment

Example:

    x: AC-GCGG-C
    y: AC-GC-GAG
    z: GCCGC-GAG

Induces:

    x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
    y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG


Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m^k, m^l)$

$s(m^k, m^l)$: score of induced alignment $(k, l)$
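As a small illustration, the induced pairwise alignments and the sum-of-pairs score can be computed as follows. The function names and the unit scores (match +1, mismatch -1, gap -1) are assumptions chosen to mirror the pairwise scoring function from earlier in the lecture.

```python
from itertools import combinations


def induced_pair(row_a: str, row_b: str):
    """Project a multiple alignment onto two rows, dropping columns where both are gaps."""
    cols = [(p, q) for p, q in zip(row_a, row_b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)


def pair_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Pairwise score: +m per match, -s per mismatch, -d per gap."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            score -= d
        elif p == q:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b), m, s, d) for a, b in combinations(rows, 2))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z from the example slide
print(induced_pair(msa[0], msa[1]))  # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(msa))             # 2 + (-1) + 4 = 5
```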


Dynamic Programming for Multiple Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure: dynamic programming lattice with axes x, y, z]


Progressive Alignment

• Multiple Alignment is NP-complete

• Most used heuristic: Progressive Alignment

Algorithm:

Until all sequences are aligned:
    – Align two (multi-)sequences to each other, and treat the result as a new sequence

Example: aligning AACTGTA with AATGTC gives

    AACTGTA
    AA-TGTC

with “letters” (AA), (AA), (C-), (TT), (GG), (TT), (AC)

Running Time: $O(N L^2)$, where N: #seqs, L: length of a seq
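One step of the algorithm above, aligning two (multi-)sequences and treating the result as a new one, can be sketched by running the pairwise recurrence over alignment columns. This is an illustrative sketch: scoring a pair of columns by summing pairwise letter scores (with gap against gap costing nothing), the function names, and the unit costs are assumptions, and the order in which groups are merged is left to the caller.

```python
def column_score(col_a, col_b, m=1, s=1, d=1):
    """Sum-of-pairs score between two alignment columns (tuples of characters)."""
    total = 0
    for p in col_a:
        for q in col_b:
            if p == "-" and q == "-":
                continue                    # gap against gap costs nothing
            if p == "-" or q == "-":
                total -= d
            else:
                total += m if p == q else -s
    return total


def align_groups(A, B, m=1, s=1, d=1):
    """Align two multi-sequences (lists of equal-length gapped rows); return the merged rows."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    M, N = len(cols_a), len(cols_b)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = F[i - 1][0] + column_score(cols_a[i - 1], gap_b, m, s, d)
        ptr[i][0] = "UP"
    for j in range(1, N + 1):
        F[0][j] = F[0][j - 1] + column_score(gap_a, cols_b[j - 1], m, s, d)
        ptr[0][j] = "LEFT"
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j], ptr[i][j] = max(
                (F[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1], m, s, d), "DIAG"),
                (F[i - 1][j] + column_score(cols_a[i - 1], gap_b, m, s, d), "UP"),
                (F[i][j - 1] + column_score(gap_a, cols_b[j - 1], m, s, d), "LEFT"),
                key=lambda c: c[0],
            )
    # Trace back, emitting merged columns (a column from A, from B, or both).
    merged, i, j = [], M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            merged.append(cols_a[i - 1] + cols_b[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "UP":
            merged.append(cols_a[i - 1] + gap_b); i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1]); j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


pair = align_groups(["AACTGTA"], ["AATGTC"])
print(pair)                               # ['AACTGTA', 'AA-TGTC']
print(align_groups(pair, ["ACTGTA"]))     # the pair, now treated as one sequence, plus a third
```

The first call reproduces the example from the slide; the second call shows the result being reused as a single "sequence" whose letters are columns.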


MLAGAN: Progressive Alignment

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN), with needed generalizations for multi-anchoring & scoring edit distance

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]


Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication


Local & Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Local Global


Glocal Alignment Problem

Find least cost transformation of one sequence into another using shuffle operations

• Sequence edits

• Inversions

• Translocations

• Duplications

• Combinations of above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC


SLAGAN: 1. Find Local Alignments

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


SLAGAN: 2. Build Homology Map

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


SLAGAN: 2. Build Homology Map

Chain using Sparse Dynamic Programming

[Figure: local alignments joined by links labeled a, b, c, d]

Penalties:

a) regular

b) translocation

c) inversion

d) inverted translocation


SLAGAN: 3. Global Alignment

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts


SLAGAN Example: Chromosome 20

Human Chromosome 20 versus Mouse Chromosome 2

• 270 Segments of conserved synteny

• 70 Inversions


SLAGAN example: HOX cluster

• 10 paralogous genes

• Conserved order in Human/Mouse/Rat


Whole-genome alignment with SLAGAN

Two-step Shuffle

1. Shuffle for large-scale synteny map

2. Shuffle each syntenic region for microrearrangements


The ENCODE Project


ENCODE regions shuffled

Hum/Mus Hum/Rat


Constrained Elements in Alignments


Human-Mouse-Rat

Berkeley Genome Pipeline
http://pipeline.lbl.gov


More DNA is coming…