Identification of large-scale genomic rearrangements between closely related organisms Bob Mau 1,2, Aaron Darling 1,3, Fred Blattner 4,5, Nicole Perna 1,5 Departments of Animal Health and Biomedical Sciences 1, Oncology 2, Computer Science 3, Laboratory of Genetics 4, Genome Center 5, University of Wisconsin – Madison


Page 1: Identification of large-scale genomic rearrangements between closely related organisms Bob Mau 1,2, Aaron Darling 1,3, Fred Blattner 4,5, Nicole Perna

Identification of large-scale genomic rearrangements

between closely related organisms

Bob Mau1,2, Aaron Darling1,3, Fred Blattner4,5, Nicole Perna1,5

Departments of Animal Health and Biomedical Sciences1, Oncology2, Computer Science3, Laboratory of Genetics4, Genome Center5

University of Wisconsin – Madison

 

Page 2

The Amazing Variety of Diseases caused by E. coli strains

in Bacterial Pathogenesis: A Molecular Approach

“… is due to the fact different strains have acquired different sets of virulence genes. Most strains of E. coli are avirulent because they lack these virulence genes. E. coli is an excellent example of the maxim that it is the set of virulence genes carried by an organism that makes it a pathogen, not its species or genus designation.”

Page 3
Page 4

Categories of Bacterial Genome Evolution

• Local
   - Single Base Mutations
   - Indels (Small insertions and deletions)

• Global (Large-scale) Rearrangements
   - Inversions, translocations, inverted translocations

• Gene Gain and Loss
   - Horizontal or Lateral Transfer: Transformation, Transduction, and Conjugation
   - Phage Integration
   - Mobile Elements: Transposons and Insertion Sequences
   - Gene Duplication (Mediated by mobile elements)

Page 5

From the two E. coli genomes sequenced at the Blattner lab, we’ve identified:

• ~3900 genes common to both K-12 and O157:H7

• 528 genes unique to K-12

• 1387 genes unique to O157:H7

• 40% of these genes are of unknown function.

The primary reasons for these wholesale differences are: lateral transfer, phage integration, and one whopper of a duplication.

Page 6
Page 7

Strategy of Global Alignment of Two Highly Related Genomes:

STEP 1

Build Partially Sorted Suffix Arrays for genomes K and O.

Quickly find all 16-mer matches between genomes: (K1,O1), …, (Ki,Oi), …, (Kn,On)

STEP 2

Collapse consecutive pairs to form a collection of maximal exact matches (MEMs).

Use LIS algorithm to construct a collinear set of maximally ordered matches.

STEP 3

Extend across intervening regions via anchored alignments from individual MEM endpoints.

Intervening region types shown in the diagram: Unique Insert, Substitution
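Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain hash index instead of the partially sorted suffix arrays named on the slide, it looks at the forward strand only, and the LIS step maximizes the number of MEMs rather than their total length. The function names (`kmer_matches`, `collapse_to_mems`, `collinear_subset`) are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def kmer_matches(k_seq, o_seq, k=16):
    """All (i, j) with k_seq[i:i+k] == o_seq[j:j+k] (forward strand only).

    Assumption: a dict index stands in for the slide's suffix arrays.
    """
    index = defaultdict(list)
    for i in range(len(k_seq) - k + 1):
        index[k_seq[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(o_seq) - k + 1)
            for i in index.get(o_seq[j:j + k], ())]


def collapse_to_mems(pairs, k=16):
    """Merge seeds that are consecutive on one diagonal into (i, j, length)."""
    mems = []
    # Sorting by diagonal (i - j), then position, makes each run contiguous.
    for i, j in sorted(pairs, key=lambda p: (p[0] - p[1], p[0])):
        if mems:
            pi, pj, plen = mems[-1]
            # The next seed of a run starts one base after the previous seed.
            if pi - pj == i - j and i == pi + plen - k + 1:
                mems[-1] = (pi, pj, plen + 1)
                continue
        mems.append((i, j, k))
    return mems


def collinear_subset(mems):
    """Longest chain of MEMs whose starts increase in both genomes (LIS)."""
    # Ties in the first coordinate are ordered so they cannot chain together.
    ordered = sorted(mems, key=lambda m: (m[0], -m[1]))
    tails, tail_idx, prev = [], [], [None] * len(ordered)
    for n, (_, j, _) in enumerate(ordered):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tail_idx.append(n)
        else:
            tails[pos] = j
            tail_idx[pos] = n
        prev[n] = tail_idx[pos - 1] if pos else None
    chain = []
    n = tail_idx[-1] if tail_idx else None
    while n is not None:            # walk back through predecessor links
        chain.append(ordered[n])
        n = prev[n]
    return chain[::-1]


seeds = kmer_matches("ACGTAC", "TACGTA", k=4)
mems = collapse_to_mems(seeds, k=4)   # [(0, 1, 5)]: the shared run "ACGTA"
```

A production version would also seed on the reverse strand and weight the chain by match length; both are left out here to keep the sketch short.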

Page 8

K-12 vs O157:H7 MEM Stats

• 43,235 total MEMs (24 bps)
• 31,640 form maximal collinear subset
• The largest exact match is 2,632 bases
• 62 MEMs exceed 1000 bps
• Over 11,000 exceed 100 bps
• 18,212 single base differences (SNPs)
• Resulted in a segmentation of O157:H7 into 357 intervals of backbone or unique insert.

Page 9
Page 10

A Three-way Genomic Comparison: Parkhill et al., Nature

E. coli K-12 MG1655

S. Typhi CT18

S. Typhimurium LT2

Page 11

The “Traditional” WAY to view MEMs

$\{(a_0, b_0), (a_1, b_1), \ldots, (a_K, b_K)\}$ for $K+1$ genomes

For the reference genome $G_0$, $a_0 < b_0$ by convention.

For the non-reference genomes, $a_k < b_k$ means the match is oriented with $G_0$; $a_k > b_k$ means the match occurs on the opposite strand (reverse complement).

Page 12

A novel approach, wherein:

• Extensibility: works just as well for N as it does for 2 genomes, provided there is sufficient sequence similarity.

• Automatically identifies inversions, translocations, and inverted translocations

• Determines a maximal collinear subset within each locally collinear region, without recourse to an LIS step

• Very space efficient and very fast

Page 13

25 January 2001

Nature 409, 529-533 (2001) © Macmillan Publishers Ltd.

Genome sequence of enterohaemorrhagic Escherichia coli O157:H7

NICOLE T. PERNA, GUY PLUNKETT III, VALERIE BURLAND, BOB MAU, JEREMY D. GLASNER, DEBRA J. ROSE, GEORGE F. MAYHEW, PETER S. EVANS, JASON GREGOR, HEATHER A. KIRKPATRICK, GYÖRGY PÓSFAI, JEREMIAH HACKETT, SARA KLINK, ADAM BOUTIN, YING SHAO, LESLIE MILLER, ERIK J. GROTBECK, N. WAYNE DAVIS, ALEX LIM, EILEEN T. DIMALANTA, KONSTANTINOS D. POTAMOUSIS, JENNIFER APODACA, THOMAS S. ANANTHARAMAN, JIEYI LIN, GALEX YEN, DAVID C. SCHWARTZ, RODNEY A. WELCH & FREDERICK R. BLATTNER

The bacterium Escherichia coli O157:H7 is a worldwide threat to public health and has been implicated in many outbreaks of haemorrhagic colitis, some of which included fatalities caused by haemolytic uraemic syndrome. Close to 75,000 cases of O157:H7 infection are now estimated to occur annually in the United States. The severity of disease, the lack of effective treatment and the potential for large-scale outbreaks from contaminated food supplies have propelled intensive research on the pathogenesis and detection of E. coli O157:H7 (ref. 4). Here we have sequenced the genome of E. coli O157:H7 to identify candidate genes responsible for pathogenesis, to develop better methods of strain detection and to advance our understanding of the evolution of E. coli, through comparison with the genome of the non-pathogenic laboratory strain E. coli K-12 (ref. 5). We find that lateral gene transfer is far more extensive than previously anticipated. In fact, 1,387 new genes encoded in strain-specific clusters of diverse sizes were found in O157:H7. These include candidate virulence factors, alternative metabolic capacities, several prophages and other new functions, all of which could be targets for surveillance.

Nature © Macmillan Publishers Ltd 2001 Registered No. 785998 England.

Page 14

Multiple Oriented Offset

For each non-reference genome, determine the polarity with respect to $G_0$:

$$p_k = \pm 1$$

as well as the offset:

$$d_k = a_0 - p_k \, a_k$$

The Multiple Oriented Offset is the N vector:

$$\mathrm{Moo} = \{(d_1, p_1), \ldots, (d_N, p_N)\}$$
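A minimal sketch of the Multiple Oriented Offset computation, assuming a MEM is given as a list of (a_k, b_k) intervals with the reference genome first, and assuming the offset is d_k = a_0 - p_k * a_k (the operator did not survive on the slide, so the sign convention here is an assumption). The function name `multiple_oriented_offset` is hypothetical.

```python
def multiple_oriented_offset(mem):
    """Return [(d_1, p_1), ..., (d_N, p_N)] for one MEM.

    mem[0] is the interval (a_0, b_0) in the reference genome G0;
    mem[k] is the interval (a_k, b_k) in non-reference genome k.
    """
    a0, b0 = mem[0]
    if a0 >= b0:
        raise ValueError("reference interval must satisfy a0 < b0")
    moo = []
    for a_k, b_k in mem[1:]:
        # Polarity: +1 if oriented with G0, -1 if on the opposite strand.
        p_k = 1 if a_k < b_k else -1
        # Assumed sign convention: forward matches give a0 - a_k,
        # reverse matches give a0 + a_k.
        d_k = a0 - p_k * a_k
        moo.append((d_k, p_k))
    return moo


# Two MEMs lying on the same diagonals in both non-reference genomes.
mem_a = [(100, 150), (400, 450), (900, 850)]
mem_b = [(200, 230), (500, 530), (800, 770)]
moo_a = multiple_oriented_offset(mem_a)   # [(-300, 1), (1000, -1)]
moo_b = multiple_oriented_offset(mem_b)   # same value as moo_a
```

Under this convention the offset stays constant as a match is shifted along a forward or a reverse-complement diagonal, which is why the two example MEMs share one Moo.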

Page 15

Canonical MEM Equivalence Classes

By appending the interval in reference genome coordinates, $(a_0, b_0)$, to the Moo, the MEM is completely specified.

We aggregate MEMs by their generalized offset,

$$\mathrm{G.O.} = \sum_{k=1}^{K} |d_k| = \sum_{k=1}^{K} |a_0 - p_k \, a_k|,$$

inducing a partition on the set of MEMs. This defines a CMemEC:

$$\{\mathrm{Moo}, \{(a_0^1, b_0^1), (a_0^2, b_0^2), \ldots, (a_0^M, b_0^M)\}\}$$
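The grouping into equivalence classes might look like the sketch below. It keys each class on the full Moo vector and reports the generalized offset as the sum of |d_k|; the offset formula d_k = a_0 - p_k * a_k is an assumed sign convention, and all function names are hypothetical.

```python
from collections import defaultdict


def moo_of(mem):
    """Moo of one MEM; mem[0] is the reference interval (a_0, b_0)."""
    a0, _ = mem[0]
    result = []
    for a_k, b_k in mem[1:]:
        p_k = 1 if a_k < b_k else -1
        result.append((a0 - p_k * a_k, p_k))   # assumed sign convention
    return tuple(result)


def generalized_offset(moo):
    """Sum of absolute offsets over the non-reference genomes."""
    return sum(abs(d_k) for d_k, _ in moo)


def cmem_equivalence_classes(mems):
    """Map each Moo to the sorted reference intervals of its MEMs."""
    classes = defaultdict(list)
    for mem in mems:
        classes[moo_of(mem)].append(mem[0])
    return {moo: sorted(intervals) for moo, intervals in classes.items()}


mems = [
    [(100, 150), (400, 450), (900, 850)],
    [(200, 230), (500, 530), (800, 770)],     # same Moo as the first MEM
    [(300, 340), (2000, 2040), (700, 660)],   # different offset in genome 1
]
classes = cmem_equivalence_classes(mems)
```

Here the first two MEMs fall into one class with generalized offset 1300, and the third forms its own class. Only the reference intervals need to be stored per class, since the Moo recovers the coordinates in every other genome.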

Page 16
Page 17
Page 18

In this example, it’s abundantly clear from the plot that there are two large rearrangements, one around the origin and the other about the terminus of replication.

We could probably get by with modest extensions of existing methods (MUMmer or our earlier algorithm) to account for the large amounts of laterally transferred lineage-specific sequence.

Page 19

But, hey, biology ain’t easy ...

Page 20
Page 21

Figure 1: Simplest Block and Strip Diagram

Strips: G0 (Reference Strip), G1 (Strip 1), G2 (Strip 2), G3 (Strip 3), G4 (Strip 4)

Signed block orders shown in the diagram:

1 2 3 4 5 6 7

1 -7 5 6 4 3 2

-3 -2 -1 -7 5 6 4

-7 4 5 6 -3 -2 -1

1 -3 -2 4 5 6 7
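The signed block orders in the diagram can be read as signed permutations, where a negative sign marks a block on the opposite strand. As a small illustration (not the authors' code), the sketch below applies an inversion to a block order and counts breakpoints against the identity order. Treating the genome as linear, framed by 0 and n+1, is a simplifying assumption; the function names are hypothetical.

```python
def invert(blocks, i, j):
    """Reverse positions i..j (0-based, inclusive) and flip their signs."""
    return blocks[:i] + [-b for b in reversed(blocks[i:j + 1])] + blocks[j + 1:]


def breakpoints(blocks):
    """Count adjacencies that differ from the identity order 1..n.

    Assumption: the order is treated as linear and framed by 0 and n+1.
    """
    framed = [0] + list(blocks) + [len(blocks) + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if y - x != 1)


reference = [1, 2, 3, 4, 5, 6, 7]

# Inverting blocks 2..3 of the reference yields 1 -3 -2 4 5 6 7.
one_inversion = invert(reference, 1, 2)

# An inversion is its own inverse, so applying it again restores the order.
restored = invert(one_inversion, 1, 2)
```

A single inversion creates at most two breakpoints, which is what `breakpoints(one_inversion)` reports for this example.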

Page 22

Figure 2: Example with Variable Block Lengths

Axis markers: Cut pt., Terminus, Origin

Genomes: G0 (Reference), G1 (Genome 1), G2 (Genome 2), G3 (Genome 3), G4 (Genome 4), G5 (Genome 5)

Signed block orders shown in the diagram:

1 2 3 4 5 6 7

1 2 -3 4 -6 -5 7

1 2 3 4 6 -5 7

1 -3 -2 4 5 6 7

G4: Genome 4 1 2 -3 4 5 -6 7

1 2 3 4 5 6 7

Page 23

Figure 1: Large-scale Genomic Rearrangements

Diagram labels: Genome 1, Genome 2, Genome 3, Genome 4, Genome 5; Zero Pt., Terminus, Origin; Species Tree; MRCA

Page 24

Figure 3: Segmentation Graph S(G0)

Page 25

LOOk at the Picture and

Page 26

Sorted Merge Lists of Six Enterobacterial strains

Strain    Group          Species
MG1655    K-12           Escherichia coli
W3110     K-12           Escherichia coli
EDL933    O157:H7        Escherichia coli
Sakai     O157:H7        Escherichia coli
CT18      Typhi          Salmonella enterica
LT2       Typhimurium    Salmonella enterica

Six SMLs of bimers, one for each genome. A bimer is the lexicographically lesser of an n-mer (we use n=23) and its reverse complement, together with an orientation flag.
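The bimer definition above translates directly into code. The following is a minimal sketch, assuming upper-case A/C/G/T input and an in-memory Python list rather than whatever compact on-disk layout the real SMLs use; `sorted_bimer_list` and the (bimer, position, is_forward) layout are hypothetical names chosen for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sorted_bimer_list(genome, n=23):
    """Sorted list of (bimer, position, is_forward) for every n-mer.

    The bimer is the lexicographically lesser of the n-mer and its
    reverse complement; is_forward records which of the two was kept.
    """
    entries = []
    for pos in range(len(genome) - n + 1):
        mer = genome[pos:pos + n]
        rc = reverse_complement(mer)
        if mer <= rc:
            entries.append((mer, pos, True))
        else:
            entries.append((rc, pos, False))
    return sorted(entries)


# Tiny example with n=3: "ACG" at position 0 and "CGT" at position 1 are
# reverse complements of each other, so they share the bimer "ACG".
sml = sorted_bimer_list("ACGTT", n=3)
```

Because a sequence and its reverse complement map to the same bimer, matches on either strand end up adjacent after sorting, and the orientation flag recovers which strand each occurrence came from.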

Page 27
Page 28
Page 29
Page 30
Page 31

0 12 3 10 7 1 5 4 2 11 6 9 8 0

C20 C21 C22 C22.5 C23 C24 C25 C1 C2 C3 C4 C5 C6 C7

A Transformation of CO92 to KIM by Inversions Near the Origin

0 1 2 3 4 5 6 7 8 9 10 11 12 0

K5 K4 K3 K2 K1 K25 K24 K23 K22 K21 K20.5 K20 K19 K18

Page 32

0 12 3 10 7 1 5 4 2 11 6 9 8 0

C20 C21 C22 C22.5 C23 C24 C25 C1 C2 C3 C4 C5 C6 C7

A Transformation of CO92 to KIM by Inversions Near the Origin

0 1 2 3 4 5 6 7 8 9 10 11 12 0

K5 K4 K3 K2 K1 K25 K24 K23 K22 K21 K20.5 K20 K19 K18

0 8 9 6 11 2 4 5 1 7 10 3 12 0

0 1 5 4 2 11 6 9 8 7 10 3 12 0

0 1 11 2 4 5 6 9 8 7 10 3 12 0

0 1 3 10 7 8 9 6 5 4 2 11 12 0

0 1 3 2 4 5 6 9 8 7 10 11 12 0

0 1 3 2 4 5 6 9 8 7 10 11 12 0
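Each row above differs from the previous one by a single inversion of a contiguous run of blocks. The sketch below checks this for the rows as transcribed from the slide, treating the orders as unsigned (the slide shows no signs) and skipping the repeated final row. The helper name `reversal_between` is hypothetical.

```python
def reversal_between(src, dst):
    """Return (i, j) if reversing src[i..j] (inclusive) gives dst, else None."""
    if len(src) != len(dst) or src == dst:
        return None
    i = 0
    while src[i] == dst[i]:          # first position where the rows differ
        i += 1
    j = len(src) - 1
    while src[j] == dst[j]:          # last position where the rows differ
        j -= 1
    return (i, j) if src[i:j + 1][::-1] == dst[i:j + 1] else None


# Block orders copied from the slide, starting with the CO92 arrangement.
rows = [
    [0, 12, 3, 10, 7, 1, 5, 4, 2, 11, 6, 9, 8, 0],
    [0, 8, 9, 6, 11, 2, 4, 5, 1, 7, 10, 3, 12, 0],
    [0, 1, 5, 4, 2, 11, 6, 9, 8, 7, 10, 3, 12, 0],
    [0, 1, 11, 2, 4, 5, 6, 9, 8, 7, 10, 3, 12, 0],
    [0, 1, 3, 10, 7, 8, 9, 6, 5, 4, 2, 11, 12, 0],
    [0, 1, 3, 2, 4, 5, 6, 9, 8, 7, 10, 11, 12, 0],
]

# Inverted interval (0-based positions) for each consecutive pair of rows.
steps = [reversal_between(a, b) for a, b in zip(rows, rows[1:])]
```

Every entry of `steps` is a valid interval, so the five transitions are each explained by one inversion. The last row still differs from the KIM order 0 1 2 ... 12 0 in blocks 2-3 and 7-9.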

Page 33