Breakpoint Graphs and Ancestral Genome Reconstructions

Max A. Alekseyev and Pavel A. Pevzner
Department of Computer Science and Engineering
University of California at San Diego, U.S.A.
{maxal,ppevzner}@cs.ucsd.edu

Classification: Genome Rearrangements, Ancestral Genome Reconstruction, Molecular Evolution

Corresponding author: Pavel Pevzner
email: [email protected]
Mail address: 9500 Gilman Dr., La Jolla, CA 92093-0404, U.S.A.
Phone: 1-310-4976941
Fax: 1-858-5347029

Abstract

Recently completed whole genome sequencing projects marked the transition from gene-based phylogenetic studies to phylogenomics analysis of entire genomes. We developed an algorithm MGRA for reconstructing ancestral genomes and used it to study the rearrangement history of seven mammalian genomes: human, chimpanzee, macaque, mouse, rat, dog, and opossum. MGRA relies on the notion of the multiple breakpoint graphs to overcome some limitations of the existing approaches to ancestral genome reconstructions. MGRA also generates the rearrangement-based characters guiding the phylogenetic tree reconstruction when the phylogeny is unknown.


INTRODUCTION

The first attempts to reconstruct the genomic architecture of ancestral mammals predated the era of genomic sequencing and were based on the cytogenetic approaches (Wienberg and Stanyon, 1997). The rearrangement-based phylogenomic studies were pioneered by Sankoff and co-authors (Sankoff et al., 1992; Sankoff and Blanchette, 1998; Blanchette et al., 1997) and were based on analyzing the breakpoint distances. Moret et al. (2001) further optimized this approach and developed a popular GRAPPA software for rearrangement analysis. MGR, another genome rearrangement tool (Bourque and Pevzner, 2002), uses the genomic distances instead of breakpoint distances for ancestral reconstructions. Since genomic distances lead to more accurate ancestral reconstructions (Moret et al., 2002; Tang and Moret, 2003), GRAPPA has been modified for genomic distances as well. While MGR has been used in a number of phylogenomic studies (Bourque et al., 2005; Murphy et al., 2005; Pontius et al., 2007; Bulazel et al., 2007; Xia et al., 2007; Deuve et al., 2008; Cardone et al., 2008), both MGR and GRAPPA have limited ability to distinguish reliable from unreliable rearrangements and to address the “weak associations” problem in ancestral reconstructions (Bourque et al., 2004, 2005; Froenicke et al., 2006; Bourque et al., 2006).

Recently, Ma et al. (2006) made an important step towards reliable reconstruction of the ancestral genomes. In contrast to MGR and GRAPPA (which analyze both reliable and unreliable rearrangements), they have chosen to focus on the reliable breakpoint reconstruction in the ancestral genomes and to avoid assignments in the case of weak associations (complex breakpoints). This proved to be a valuable approach since, as it turned out, most breakpoints in the ancestral mammalian genomes can be reliably reconstructed. However, there are some limitations (discussed in Rocchi et al. (2006)) that this approach has to overcome to scale for large sets of genomes. First, while the Ma et al. (2006) inferCARs algorithm assumes that the phylogeny is known, it remains a subject of enduring debates even in the case of the primate–rodent–carnivore split (which is assumed to be resolved in Ma et al. (2006)). With the increase in the number of species, the reliability of the phylogeny will become an even bigger concern, thus raising the question of devising an approach that does not assume a fixed phylogeny but instead uses rearrangements as new characters for constructing phylogenetic trees (see Chaisson et al. (2006)). While MGR does not assume a fixed phylogeny, its heuristically derived weak associations are less reliable. The challenge then is to integrate the reliability of inferCARs with the flexibility of MGR. Another avenue to improve the inferCARs algorithm is to find out how to deal with complex breakpoints that create gaps in reconstructions.

Note that the Ma et al. (2006) approach focuses on the reliable ancestor reconstruction rather than on the specific rearrangements that happened in the course of the evolution. These are related but different problems that can both benefit from being incorporated into a single computational framework. Indeed, Ma et al. (2006) consider individual breakpoints and do not distinguish between particular types of rearrangements that generated a breakpoint of interest. In reality, the reversals and translocations operate on pairs of dependent breakpoints rather than individual breakpoints. Some rearrangements (and synteny associations) cannot be inferred from the analysis of single breakpoints but become tractable via analyzing the breakpoint graph.¹ As a result, while MGR constructs provably optimal scenarios in the absence of breakpoint re-use, it is not clear whether the same result holds for inferCARs.

Recently, Zhao and Bourque (2007) developed the EMRAE algorithm, which reconstructs both reliable rearrangements and ancestors, thus addressing the shortcomings of both MGR (difficulty in distinguishing between reliable and putative rearrangement events) and inferCARs (ancestor reconstruction only). However, EMRAE (in contrast to MGR) does not attempt to reconstruct the phylogenetic tree and is limited to unichromosomal genomes. Below we address some limitations of MGR, EMRAE and inferCARs by developing the Multiple Genome Rearrangements and Ancestors (MGRA) algorithm (available from http://www.cs.ucsd.edu/users/ppevzner/software.html). In particular,

• MGRA constructs provably optimal scenarios even when there is some breakpoint re-use and when other tools do not guarantee optimality.
• MGRA is suitable for ancestral reconstructions of multichromosomal genomes (in contrast to EMRAE).
• MGRA is conceptually simpler and orders of magnitude faster than MGR.
• MGRA is not limited to reconstructing ancestral genomes in the case of known phylogeny (like inferCARs and EMRAE). Instead, it can guide the rearrangement-based reconstruction of phylogenetic trees.
• MGRA does not require prior information about the approximate lengths of the branches of the phylogenetic trees (in contrast to inferCARs).

¹ The breakpoint graphs represent a popular technique for the rearrangement analysis since they reveal pairs of breakpoints representing footprints of the rearrangement events. See Chapter 10 of Pevzner (2000) for background information on genome rearrangements and breakpoint graphs.

Figure 1: a) Unichromosomal genome P = (+a +b −c) represented as a black-obverse cycle. b) Unichromosomal genome Q = (+a −b +c) represented as a green-obverse cycle. c) The breakpoint graph G(P,Q) with and without obverse edges.

To evaluate the performance of MGRA, we compared ancestral reconstructions generated by MGRA and inferCARs. Despite the fact that MGRA and inferCARs are very different algorithms, their reconstructions turned out to be remarkably similar (98.5% of synteny associations are identical). We further analyzed some differences between MGRA, inferCARs, and the cytogenetics approach.

METHODS

1 From Pairwise to Multiple Breakpoint Graphs

We start with analysis of rearrangements in circular genomes (i.e., genomes consisting of circular chromosomes) and later extend it to genomes with linear chromosomes. We assume that each genome is formed by the same set of synteny blocks, which are arranged differently in different genomes. We will find it convenient to represent a chromosome formed by synteny blocks $b_1, \dots, b_n$ as a cycle with $n$ directed labeled edges (corresponding to blocks) alternating with $n$ undirected unlabeled edges (connecting adjacent blocks). The directions of the edges correspond to signs (strand) of the blocks. We label the tail and head of a directed edge $b_i$ as $b_i^t$ and $b_i^h$ respectively (Fig. 1) and represent a genome as a set of disjoint cycles (one for each chromosome). The edges in each cycle alternate between two colors: one color (e.g., “black”) is used for undirected edges, while the other color (traditionally called “obverse”) is used for directed edges.

Figure 2: a) A 2-break on edges $(x_1, x_2)$ and $(y_1, y_2)$ from the same chromosome corresponds to either a reversal or a fission. b) A 2-break on edges $(x_1, x_2)$ and $(y_1, y_2)$ from different chromosomes corresponds to a translocation/fusion. c) A 2-break on edges $(y_1, y_2)$ and $(x_1, \infty)$ of a linear chromosome corresponds to a reversal affecting a chromosome end $x_1$ and creating a new chromosome end $y_1$. d) A 2-break on edges $(x_1, \infty)$ and $(y_1, \infty)$ from different chromosomes models a fusion. Fissions can be modeled as 2-breaks operating on an irregular loop edge $(\infty, \infty)$ and an arbitrary regular edge in the genome.

Let $P$ be a genome represented as a collection of alternating black-obverse cycles (a cycle is alternating if the colors of its edges alternate). For any two black edges $(x_1, x_2)$ and $(y_1, y_2)$ in the genome (graph) $P$ we define a 2-break rearrangement (first introduced as DCJ rearrangement in Yancopoulos et al. (2005) and recently studied in Bergeron et al. (2006); Lin and Moret (2008)) as replacement of these edges with either a pair of edges $(x_1, y_1)$, $(x_2, y_2)$, or a pair of edges $(x_1, y_2)$, $(x_2, y_1)$ (Fig. 2a,b). In the case of circular genomes, 2-breaks correspond to the standard rearrangement operations of reversals, fissions, or fusions/translocations (Fig. 2).²
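The 2-break definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_break`, the `variant` flag, and the encoding of vertices as strings such as `"ah"` (for $a^h$) are hypothetical choices made here, not part of MGRA.

```python
def two_break(black_edges, e1, e2, variant=0):
    """Apply a 2-break to the black edges e1 = (x1, x2) and e2 = (y1, y2).

    variant 0 creates (x1, y1), (x2, y2); variant 1 creates (x1, y2), (x2, y1).
    Edges are stored as frozensets because black edges are undirected.
    """
    (x1, x2), (y1, y2) = e1, e2
    edges = set(black_edges)
    edges.remove(frozenset(e1))  # raises KeyError if e1 is not a black edge
    edges.remove(frozenset(e2))
    if variant == 0:
        created = [(x1, y1), (x2, y2)]
    else:
        created = [(x1, y2), (x2, y1)]
    edges.update(frozenset(pair) for pair in created)
    return edges


# Black edges of the circular genome P = (+a +b -c) from Fig. 1:
# a^h-b^t, b^h-c^h, c^t-a^t (the string "ah" stands for a^h, and so on).
P = {frozenset(("ah", "bt")), frozenset(("bh", "ch")), frozenset(("ct", "at"))}

reversal = two_break(P, ("ah", "bt"), ("bh", "ch"), variant=0)
fission = two_break(P, ("ah", "bt"), ("bh", "ch"), variant=1)
```

With this encoding, `reversal` holds the black edges of (+a −b −c), i.e., block b is reversed, while `fission` holds the black edges of the two circular chromosomes (+a −c) and (+b). The same pair of edges thus yields either a reversal or a fission, depending on how the four endpoints are reconnected.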

Let $P$ and $Q$ be genomes on the same set of blocks $B$. The (pairwise) breakpoint graph $G(P,Q)$ is simply the superposition of genomes (graphs) $P$ and $Q$ (Fig. 1c). Formally, the breakpoint graph $G(P,Q)$ is defined on the set of vertices $V = \{b^t, b^h \mid b \in B\}$ with edges of three colors: obverse (connecting vertices $b^t$ and $b^h$), black (connecting adjacent blocks in $P$), and green (connecting adjacent blocks in $Q$). The black and green edges form the black-green alternating cycles that play an important role in analyzing rearrangements (Bafna and Pevzner, 1996). From now on we will ignore the obverse edges in the breakpoint graph so that it becomes simply a collection of (black-green) cycles (Fig. 1).

The 2-break distance $d_2(P,Q)$ between genomes $P$ and $Q$ is defined as the minimum number of 2-breaks required to transform one genome into the other. In contrast to the Genomic Distance Problem (Hannenhalli and Pevzner, 1995; Tesler, 2002a; Ozery-Flato and Shamir, 2003) (for linear multichromosomal genomes), the 2-Break Distance Problem for circular multichromosomal genomes has a trivial solution (Yancopoulos et al., 2005; Alekseyev and Pevzner, 2007): $d_2(P,Q) = b(P,Q) - c(P,Q)$, where $b(P,Q) = |B|$ is the number of synteny blocks in $P$ and $Q$, and $c(P,Q)$ is the number of black-green cycles in $G(P,Q)$.
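As a worked illustration of the formula $d_2(P,Q) = b(P,Q) - c(P,Q)$, the following minimal sketch counts black-green cycles for the two circular genomes of Fig. 1. The genome encoding (lists of signed block strings such as `"+a"`) and all function names are assumptions made for this example only.

```python
def black_edges(genome):
    """Adjacency edges of a circular genome given as a list of chromosomes,
    each a list of signed blocks such as "+a" or "-c"."""
    edges = set()
    for chromosome in genome:
        ends = []
        for block in chromosome:
            sign, name = block[0], block[1:]
            tail, head = name + "t", name + "h"
            # a "+" block is traversed tail -> head, a "-" block head -> tail
            ends.append((tail, head) if sign == "+" else (head, tail))
        for i, (_, exit_vertex) in enumerate(ends):
            entry_vertex = ends[(i + 1) % len(ends)][0]  # cyclic successor
            edges.add(frozenset((exit_vertex, entry_vertex)))
    return edges


def neighbor_map(edges):
    """Map each vertex to the other endpoint of its unique edge."""
    nbr = {}
    for edge in edges:
        u, v = tuple(edge)
        nbr[u], nbr[v] = v, u
    return nbr


def count_cycles(p_edges, q_edges):
    """Number of alternating black-green cycles in G(P, Q)."""
    black, green = neighbor_map(p_edges), neighbor_map(q_edges)
    seen, cycles = set(), 0
    for start in black:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:  # walk black edge, then green edge
            seen.add(v)
            u = black[v]
            seen.add(u)
            v = green[u]
    return cycles


def two_break_distance(p, q):
    n_blocks = sum(len(chromosome) for chromosome in p)
    return n_blocks - count_cycles(black_edges(p), black_edges(q))


P = [["+a", "+b", "-c"]]
Q = [["+a", "-b", "+c"]]
distance = two_break_distance(P, Q)
```

For these genomes the breakpoint graph is a single cycle on six vertices, so the sketch gives $3 - 1 = 2$, matching the two reversals (of b and of c) that turn P into Q.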

² In this paper we use the term reversal (common in bioinformatics literature) instead of the term inversion (common in biology literature). For circular chromosomes, fusions and translocations are not distinguishable, i.e., every fusion of circular chromosomes can be viewed as a translocation, and vice versa.

Figure 3: a) A phylogenetic tree $T$ with four linear genomes $P_1, P_2, P_3, P_4$ (represented as green, blue, red, and yellow graphs respectively) at the leaves. The obverse edges are not shown. b) The multiple breakpoint graph $G(P_1,P_2,P_3,P_4)$ is a superposition of graphs representing genomes $P_1, P_2, P_3, P_4$. The multidegrees of regular vertices vary from 1 (e.g., vertex $b^h$) to 3 (e.g., vertex $e^h$). c) The same phylogenetic tree $T$ with all intermediate genomes specified and a genome $X$ selected as a root. A $T$-consistent transformation of $X$ into $P_1, P_2, P_3, P_4$ can be viewed as a transformation of the quadruple $(X,X,X,X)$ into the quadruple $(P_1,P_2,P_3,P_4)$ where a rearrangement at each step is applied to some copies of the same genome in the quadruple. A particular such transformation takes the following steps:

$$(X,X,X,X) \xrightarrow{r_1} (X,X,Q_1,Q_1) \xrightarrow{r_2} (Q_3,Q_3,Q_1,Q_1) \xrightarrow{r_3} (Q_3,Q_3,Q_2,Q_2) \xrightarrow{r_4} (Q_3,Q_3,P_3,Q_2) \xrightarrow{r_5} (Q_3,Q_3,P_3,P_4) \xrightarrow{r_6} (P_1,Q_3,P_3,P_4) \xrightarrow{r_7} (P_1,P_2,P_3,P_4),$$

where $r_1$ is a reversal in two copies of $X$; $r_2$ is a fission in two copies of $X$; $r_3$ is a reversal in both copies of $Q_1$; $r_4$ is a fission in one copy of $Q_2$; $r_5$ is a reversal in the other copy of $Q_2$; $r_6$ is a reversal in one copy of $Q_3$; $r_7$ is a translocation in the other copy of $Q_3$.

Genomes labeled in the figure:
P1 = (+a −c −b)(+d +e +f)
P2 = (+d +e +b +c)(+a +f)
P3 = (+a −d)(−c −b +e −f)
P4 = (+d −a −c −b +e −f)
Q1 = (+a −d −c −b +e +f)
Q2 = (+a −d −c −b +e −f)
Q3 = (+a +b +c)(+d +e +f)
X = (+a +b +c +d +e +f)

A linear genome is a collection of linear chromosomes represented as sequences of signed synteny blocks. Each linear chromosome on $n$ blocks is represented as a path of $n$ directed obverse edges (encoding blocks and their direction) alternating with $n - 1$ undirected black edges (connecting adjacent blocks). In addition, we introduce an extra vertex $\infty$ and connect it by an undirected (irregular) black edge with every vertex representing a chromosomal end (hence, the degree of vertex $\infty$ is twice the number of linear chromosomes). A linear chromosome is an alternating path of black and obverse edges, starting and ending at the vertex $\infty$, and a linear genome is a collection of such paths. The 2-breaks involving irregular edges model the rearrangements affecting the chromosome ends (Fig. 2c,d).
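The representation of linear genomes can be illustrated with a short sketch that lists the black edges, including the irregular edges to $\infty$. The string encoding of blocks and the name `linear_black_edges` are assumptions introduced for this example; the genome used is $P_1$ = (+a −c −b)(+d +e +f) from Fig. 3.

```python
INF = "∞"  # the extra vertex attached to every chromosome end


def linear_black_edges(genome):
    """Black edges of a linear genome given as a list of chromosomes,
    each a list of signed blocks such as "+a" or "-c"."""
    edges = []
    for chromosome in genome:
        ends = []
        for block in chromosome:
            sign, name = block[0], block[1:]
            tail, head = name + "t", name + "h"
            ends.append((tail, head) if sign == "+" else (head, tail))
        edges.append((INF, ends[0][0]))            # irregular edge at the left end
        for left, right in zip(ends, ends[1:]):
            edges.append((left[1], right[0]))      # regular edge between neighbors
        edges.append((ends[-1][1], INF))           # irregular edge at the right end
    return edges


P1 = [["+a", "-c", "-b"], ["+d", "+e", "+f"]]
edges = linear_black_edges(P1)
irregular = [e for e in edges if INF in e]
regular = [e for e in edges if INF not in e]
```

Here `irregular` has four entries (twice the number of chromosomes, as stated above for the degree of $\infty$) and `regular` has $(3-1) + (3-1) = 4$ entries.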

Analyzing reversals, translocations, fusions, and fissions in linear genomes poses additional algorithmic challenges as compared to analyzing 2-breaks in circular genomes. However, rearrangement scenarios in linear genomes are well approximated by 2-break scenarios in circular genomes (Alekseyev, 2008). Hence, we use 2-breaks as a single substitute for reversals, translocations, fusions, and fissions, admitting that 2-breaks may violate linearity of the genomes by creating circular chromosomes.

While previous rearrangement studies (e.g., MGR) were limited to analyzing the pairwise breakpoint graphs, MGRA uses multiple breakpoint graphs (Caprara, 1999b), which simplify the rearrangement analysis. Let $P_1, \dots, P_k$ be genomes on the same set of synteny blocks $B$. Similarly to the pairwise breakpoint graph, the (multiple) breakpoint graph $G(P_1, \dots, P_k)$ is simply the superposition of genomes (graphs) $P_1, \dots, P_k$ on the same vertex set $V = \{b^t, b^h \mid b \in B\} \cup \{\infty\}$ (Fig. S20 and Fig. 3a,b). Fig. 4 shows the breakpoint graph on 1357 synteny blocks³ of six mammalian genomes: M (mouse), R (rat), D (dog), Q (macaque), H (human), and C (chimpanzee).

A vertex in the breakpoint graph is regular if it is different from $\infty$. Similarly, an edge is regular if both its endpoints are regular, and irregular otherwise. The edges of $G(P_1, \dots, P_k)$ are represented by undirected edges from the genomes $P_1, \dots, P_k$ of $k$ different colors (hence, the degree of each regular vertex is $k$). To simplify the notation, we will use $P_1, \dots, P_k$ also to refer to the colors of edges in the multiple breakpoint graph, and denote the set of all colors $C = \{P_1, \dots, P_k\}$. Furthermore, any non-empty subset of $C$ is called a multicolor. All edges connecting vertices $x$ and $y$ in the (multiple) breakpoint graph form the multi-edge $(x, y)$ of the multicolor represented by the colors of these edges (e.g., the multi-edge $(e^h, f^h)$ in Fig. 3b has multicolor $\{P_3, P_4\}$ shown as red and yellow edges). The number of multi-edges incident to a vertex (also equal to the number of adjacent vertices) is called the multidegree (note that the multidegree of a vertex may be smaller than its degree, e.g., the vertex $e^h$ in Fig. 3b has degree 4 and multidegree 3). Multi-edges correspond to adjacent synteny blocks that are conserved across multiple species and thus represent valuable phylogenetic characters (Sankoff and Blanchette, 1998).

A breakpoint in the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ is a vertex of multidegree greater than 1. A multiple breakpoint graph without breakpoints is an identity breakpoint graph $G(X, \dots, X)$ of some genome $X$. Alternatively, the identity breakpoint graph can be characterized as a breakpoint graph consisting of complete multi-edges (i.e., multi-edges of the multicolor $C$) that correspond to the synteny block adjacencies in $X$.
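The notions of multi-edge, multicolor, multidegree, and breakpoint can be checked on the four genomes of Fig. 3 with a small sketch. Everything below is an illustrative assumption (data layout, function names, restricting `breakpoints` to regular vertices); it is not the MGRA implementation.

```python
from collections import defaultdict

INF = "∞"


def adjacencies(genome):
    """Black edges (including irregular ones) of a linear genome."""
    edges = []
    for chromosome in genome:
        ends = []
        for block in chromosome:
            sign, name = block[0], block[1:]
            tail, head = name + "t", name + "h"
            ends.append((tail, head) if sign == "+" else (head, tail))
        edges.append((INF, ends[0][0]))
        edges.extend((l[1], r[0]) for l, r in zip(ends, ends[1:]))
        edges.append((ends[-1][1], INF))
    return edges


def multiple_breakpoint_graph(genomes):
    """Map each multi-edge {x, y} to its multicolor (set of genome names)."""
    multi = defaultdict(set)
    for color, genome in genomes.items():
        for x, y in adjacencies(genome):
            multi[frozenset((x, y))].add(color)
    return dict(multi)


def multidegree(graph, v):
    return sum(1 for edge in graph if v in edge)


def degree(graph, v):
    return sum(len(colors) for edge, colors in graph.items() if v in edge)


def breakpoints(graph):
    """Regular vertices of multidegree greater than 1."""
    regular = {v for edge in graph for v in edge if v != INF}
    return {v for v in regular if multidegree(graph, v) > 1}


genomes = {
    "P1": [["+a", "-c", "-b"], ["+d", "+e", "+f"]],
    "P2": [["+d", "+e", "+b", "+c"], ["+a", "+f"]],
    "P3": [["+a", "-d"], ["-c", "-b", "+e", "-f"]],
    "P4": [["+d", "-a", "-c", "-b", "+e", "-f"]],
}
G = multiple_breakpoint_graph(genomes)

X = [["+a", "+b", "+c", "+d", "+e", "+f"]]
identity = multiple_breakpoint_graph({"P1": X, "P2": X, "P3": X, "P4": X})
```

In this sketch the multi-edge $(e^h, f^h)$ of `G` has multicolor {P3, P4}, the vertex $e^h$ has degree 4 and multidegree 3, and $b^h$ has multidegree 1, in agreement with the examples quoted from Fig. 3b. The `identity` graph has no breakpoints among its regular vertices.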

2 Multiple Genome Rearrangement Problem

The key observation in studies of pairwise genome rearrangements is that every 2-break transformation of a “black” genome $P$ into a “green” genome $Q$ corresponds to a transformation of the breakpoint graph $G(P,Q)$ into the identity breakpoint graph $G(Q,Q)$ (Fig. S21) with 2-breaks on pairs of black edges (black 2-breaks). MGR (Bourque and Pevzner, 2002) implicitly applies a similar observation and attempts to come up with rearrangements that bring the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ closer to the identity multiple breakpoint graph $G(P_i, P_i, \dots, P_i)$ for $i$ varying from 1 to $k$. However, this approach does not allow one to utilize the internal edges of the phylogenetic tree for finding reliable rearrangements. Below we formalize the Multiple Genome Rearrangement Problem in terms of multiple breakpoint graphs. The key element of MGRA is finding a shortest transformation of the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ into an arbitrary identity multiple breakpoint graph $G(X, X, \dots, X)$ for some a priori unknown genome $X$. We first illustrate this concept with pairwise breakpoint graphs.

Let $G(P_1,P_2) \to G(X,X)$ be an $m$-step transformation of $G(P_1,P_2)$ into $G(X,X)$ by either black or green 2-breaks (in contrast to the standard breakpoint graph analysis based on black 2-breaks only).⁴ It is easy to see that every such transformation corresponds to a transformation $P_1 \to X \to P_2$ that uses $m$ black 2-breaks. Therefore, instead of searching for a shortest transformation $G(P_1,P_2) \to G(P_2,P_2)$, one can search for a shortest transformation of $G(P_1,P_2)$ into any identity breakpoint graph $G(X,X)$ without knowing $X$ in advance.
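One way to spell out this correspondence is sketched below. The symbols $m_1$, $m_2$, $\rho_i$, and $\gamma_j$ are notation introduced here for illustration and do not appear in the original text.

```latex
% Split the m steps into the black 2-breaks \rho_1, ..., \rho_{m_1} and the
% green 2-breaks \gamma_1, ..., \gamma_{m_2}, with m_1 + m_2 = m.
% Black 2-breaks change only black edges and green 2-breaks only green edges, so
\[
  P_1 \xrightarrow{\rho_1} \cdots \xrightarrow{\rho_{m_1}} X
  \qquad\text{and}\qquad
  P_2 \xrightarrow{\gamma_1} \cdots \xrightarrow{\gamma_{m_2}} X .
\]
% Every 2-break is undone by a 2-break on the two edges it created, hence
\[
  P_1 \xrightarrow{\rho_1} \cdots \xrightarrow{\rho_{m_1}} X
      \xrightarrow{\gamma_{m_2}^{-1}} \cdots \xrightarrow{\gamma_1^{-1}} P_2
\]
% is a transformation of P_1 into P_2 with m_1 + m_2 = m steps, which gives
\[
  d_2(P_1, P_2) \le m .
\]
% Choosing X = P_2 (so that m_2 = 0) attains m = d_2(P_1, P_2).
```

Under these assumptions, the length of a shortest transformation of $G(P_1,P_2)$ into an identity breakpoint graph equals $d_2(P_1,P_2)$, which is why $X$ need not be fixed in advance.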

³ The detailed information about synteny blocks and assembly builds is provided in the Supplementary File. Out of 1360 synteny blocks (kindly provided by Jian Ma) three synteny blocks represent intermixed segments of the chromosome X and other chromosomes (the mouse chromosome 7 and the rat chromosomes 15 and 20). Since these blocks are short (16, 47, and 17 KB respectively), we have discarded them to simplify the chromosome X analysis below.

For better illustration of the breakpoint graphs, the vertex ∞ is shown in multiple copies as black dots, each connected by a single multi-edge to regular vertices.

⁴ Switching from black rearrangements to a mixture of black and green rearrangements is a simple but powerful paradigm that proved to be useful in previous studies (Bafna and Pevzner, 1998; Tannier and Sagot, 2004).

[Figure 4: breakpoint graph of the six mammalian genomes; the vertex labels of the drawing (e.g., 1000h, 1001t) are omitted here.]

753t

754t

754h

755t

755h

756t

756h

757t

760t

757h

758h

764h

759t

759h

765t

76t

760h761t

769h

761h

762h

762t

767t

763t

764t

767h

765h

766t

768h

769t770t

771t

994t

801h

772t

774h

802t

772h

773t

798h

775t

775h

776t

799t

776h

777t

778t

777h

799h

800t

779h

780h

780t

77h

862h

781t

782h

783h

783t

784t

784h

785t

786t

785h

787t

786h

787h

788h

788t

789h

790t

791t

790h

792t

791h

793h

794t

794h

795h

796h

797h

797t

798t

79h

80t

81t

7h

8h

8t

800h

801t

802h

803h

803t

804t

804h

805h

805t

806t

807h

808h808t

809t809h

810t

80h

84t

83h

810h

811h

811t

812t

812h

813h

813t

814t

814h

815h

815t

816t

816h

817t

818t

817h

819t

818h

819h

81h

82h

82t

820h

821t

822h

821h

822t

823t

823h

824h

825t

854h

825h

826h

826t

855t

827t

827h

828t

859h

828h

829h

829t

860t

830t

83t

831h

832t

833t

832h

834t

833h

834h

835h

835t

836t

836h

837t

838t

837h

839t

838h

839h

840h 841t

857t

841h

842h

842t

856h

843t

843h

844t

845h

844h

845t

847t

846h

846t

847h

848h

848t

849t

849h

850h

850t

84h

85h

85t

851t

851h

852h852t

853t853h

854t

857h

858t

858h

859t

86t

860h

861h

861t

862t

863h

864h

864t

865t

865h

866h

867t

867h

891h

868h

869h

869t

870t

86h

87h87t

871t

873t

879h

873h

874t

875t

880t

874h

876t

875h

876h878h

878t

879t

88t88h

880h

881h

881t

882t

882h

883t

897h

883h

884t

885t

898t

884h

886t

885h

886h

888h

889h

890h

890t

89t

891t

892h

893t894t

893h

894h

895t

895h

896h

896t

897t

898h

899h

899t

900t

89h

90h

90t

9t

901t

901h

902h

902t

971h

903t

903h

904t905t

904h

905h

906t

906h

907h

907t

908t

909h

910h

910t

91t

911t

913h

914t

915t

914h

916t

915h

917h

918h

918t

919t

91h

92h

92t

921h922t

921t

925h

922h

923t

938h

923h

924h

924t

939t

926t

972h

926h

938t

928h

929h

929t

930t

93t

930h

931t

932h

933t

934t

933h

935t

934h

936h

936t

937t

942t

93h

94h

94t

940h

941h

942h

943h

943t

944t

944h

945h

945t

946t

946h

947h

947t

948t

948h

949h

949t

950t

95t

950h

951h

951t

952t

952h

953h

953t

954t

954h

956t

956h

957t

958h

957h

958t

959t

959h

960t

961t

95h

96h

96t

960h

962t

961h

962h

964h

965t

965h

966h

966t

967t

967h

968h

968t

969t

97t

990h

970t

972t

995h

973h

974h

974t

975t

975h

976t

977t

976h

978t

977h

978h

979h

979t

980t

97h

980h

981h

981t

982t

982h

983t984t

983h

984h

985t

985h

986t

987t

986h

988t

987h

988h

991t

991h

993t

993h

994h996t

995t

996h

997h

997t

998t

Chromosome colors:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X

Figure 4: The breakpoint graph G(M,R,D,Q,H,C) (obverse edges are not shown) of six mammalian genomes: Mouse (rededges), Rat (blue edges), Dog (green edges), macaQue (violet edges), Human (orange edges), and Chimpanzee (yellow edges).The graph has 1357 · 2 = 2714 vertices labeled as nt or nh (where n is a synteny block number) and colored in 23 colorsrepresenting chromosomes in the human genome.

[Figure 5: tree diagram with root X; branch labels: M + RDQHC, R + MDQHC, MR + DQHC, D + MRQHC, MRD + QHC, Q + MRDHC, HC + MRDQ, H + MRDQC, C + MRDQH.]

Figure 5: The phylogenetic tree T of six mammalian genomes: Mouse (red), Rat (blue), Dog (green), macaQue (violet), Human (orange), and Chimpanzee (yellow) with a root X on the MRD + QHC branch. The branches are directed towards X and labeled with the corresponding pairs of complementary T-consistent multicolors. The $\vec{T}$-consistent multicolor from each pair also labels the starting node of the corresponding directed branch. Note that the tree orientation may not necessarily correlate with the time scale, and the root genome X may not necessarily be a common ancestor of the leaf genomes.

In the case of k ≥ 2 genomes $P_1, P_2, \ldots, P_k$, 2-breaks can be applied to multi-edges in the multiple breakpoint graph $G(P_1, P_2, \ldots, P_k)$ of as many as $2^k - 2$ different multicolors formed by proper subsets of C. However, not every series of such 2-breaks makes sense in terms of ancestral genome reconstructions. A basic property of ancestral genome reconstructions is that 2-breaks on multi-edges of multicolor $Q \subset C$ can be applied only when all genomes corresponding to colors in Q are merged into a single genome. We give an alternative definition of this property as follows: a transformation (series of 2-breaks) S of the multiple breakpoint graph $G(P_1, P_2, \ldots, P_k)$ is strict if for any 2-breaks $\rho_1, \rho_2 \in S$ operating on multi-edges of multicolors $Q_1 \subsetneq Q_2$, $\rho_1$ precedes $\rho_2$ in S. The Multiple Genome Rearrangement Problem is reformulated as follows:

Multiple Genome Rearrangement Problem (MGRP). Given genomes $P_1, \ldots, P_k$, find a shortest strict series of 2-breaks that transforms the breakpoint graph $G(P_1, \ldots, P_k)$ into an identity breakpoint graph.
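The strictness condition can be illustrated with a minimal sketch. It assumes each 2-break is represented only by the multicolor it operates on, stored as a `frozenset` of one-letter genome names; the function name `is_strict` and this representation are illustrative assumptions, not part of MGRA.

```python
from typing import FrozenSet, Sequence

Multicolor = FrozenSet[str]  # assumed encoding: a set of one-letter genome names


def is_strict(multicolors: Sequence[Multicolor]) -> bool:
    """Check that no 2-break follows a 2-break on a proper superset multicolor.

    multicolors[i] is the multicolor of the i-th 2-break in the series.
    """
    for i, earlier in enumerate(multicolors):
        for later in multicolors[i + 1:]:
            if later < earlier:  # `<` on frozensets means "proper subset"
                return False
    return True


M, MR, MRD, HC = (frozenset(s) for s in ("M", "MR", "MRD", "HC"))
print(is_strict([M, HC, MR, MRD]))  # incomparable multicolors may interleave freely
print(is_strict([MR, M]))           # M is a proper subset of MR but is applied later
```

Only the ordering constraint is checked here; whether each 2-break is applicable to the graph is a separate question.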

Let T be an (unrooted) phylogenetic tree of the genomes $P_1, \ldots, P_k$ (Fig. 3a). The tree T consists of k leaf nodes (or simply leaves), k − 2 internal nodes, and 2k − 3 branches connecting pairs of nodes, so that the degree of each leaf is 1 while the degree of each internal node is 3.

Removing a branch from T breaks it into two subtrees, each of which is induced by the set of its own leaves. A multicolor consisting of all colors (leaves) of either of these induced subtrees is called T-consistent. Let G be the set of all T-consistent multicolors. Note that if a multicolor Q is T-consistent, then its complement $\overline{Q} = C \setminus Q$ is also T-consistent. Therefore, there is a one-to-one correspondence between the pairs of complementary T-consistent multicolors and the branches of T (Fig. 5).
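The correspondence between branches and pairs of complementary T-consistent multicolors can be sketched as follows. The tree is assumed to be given as a list of branches; the internal node names `a`, `b`, `c`, `d` are hypothetical labels for the four internal nodes of the tree in Fig. 5, and `t_consistent_pairs` is an illustrative helper rather than an MGRA routine.

```python
from collections import defaultdict

# Hypothetical encoding of the unrooted tree in Fig. 5 (k = 6 leaves, 2k - 3 = 9 branches).
BRANCHES = [("M", "a"), ("R", "a"), ("a", "b"), ("D", "b"), ("b", "c"),
            ("Q", "c"), ("c", "d"), ("H", "d"), ("C", "d")]
LEAVES = "MRDQHC"


def t_consistent_pairs(branches, leaves):
    """Return one pair {Q, complement of Q} of leaf sets per branch of the tree."""
    adj = defaultdict(set)
    for u, v in branches:
        adj[u].add(v)
        adj[v].add(u)
    all_colors = frozenset(leaves)
    pairs = set()
    for u, v in branches:
        # Collect the side of the tree containing u; v is pre-marked so that
        # the removed branch (u, v) is never crossed.
        seen, stack = {u, v}, [u]
        while stack:
            node = stack.pop()
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        q = frozenset((seen - {v}) & all_colors)
        pairs.add(frozenset({q, all_colors - q}))
    return pairs


pairs = t_consistent_pairs(BRANCHES, LEAVES)
for pair in sorted(pairs, key=lambda p: sorted("".join(sorted(q)) for q in p)):
    print(" + ".join("".join(sorted(q)) for q in pair))
```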

When a phylogenetic tree is given, MGRA addresses a restricted version of MGRP where 2-breaks are applied only to multicolors consistent with the phylogenetic tree.

Tree-Consistent Multiple Genome Rearrangement Problem (TCMGRP). Given genomes $P_1, \ldots, P_k$ at the leaves of a phylogenetic tree T, find a shortest strict series of T-consistent 2-breaks transforming the breakpoint graph $G(P_1, \ldots, P_k)$ into an identity breakpoint graph.

Note that the MGRP and TCMGRP problems in the case of three unichromosomal genomes correspond to the median problem, which is NP-complete (Caprara, 1999a; Tannier et al., 2008). While the existence of exact polynomial algorithms for solving MGRP and TCMGRP is unlikely, we describe a heuristic approach to “eliminating” breakpoints in $G(P_1, \ldots, P_k)$ that uses reliable rearrangements. In particular, MGRA optimally solves these problems in the case of semi-independent rearrangement scenarios with some breakpoint re-uses (see below).

We will find it convenient to fix a branch X of the tree T and assume that this branch contains a root X (viewed as yet another node), the precise location of which is to be determined later. The choice of X defines directions “towards” X on all branches of the tree T (Fig. 5). We label every leaf node $P_i$ of the directed tree T with the corresponding singleton multicolor $\{P_i\}$, and then recursively label each internal node with the union of the multicolors of the starting nodes of all incoming branches (e.g., in Fig. 5 a common endpoint of branches coming from the leaf nodes M and R is labeled as MR). The multicolors forming node labels of the tree T are called $\vec{T}$-consistent. Alternatively, $\vec{T}$-consistent multicolors can be defined as T-consistent multicolors whose induced subtrees do not contain X. Note that exactly one of the multicolors in each pair of complementary T-consistent multicolors is $\vec{T}$-consistent, and it labels the starting node of the corresponding directed branch in T (except for the multicolors corresponding to the branch X, both of which are $\vec{T}$-consistent).
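The recursive labeling just described can be sketched in a few lines. As before, the internal node names `a` to `d` and the helper `directed_labels` are hypothetical; the sketch inserts a root node named `"X"` on the chosen branch and labels every other node with the union of the labels of the nodes below it.

```python
from collections import defaultdict

# Hypothetical encoding of the tree in Fig. 5; the root branch is (b, c), i.e. MRD + QHC.
BRANCHES = [("M", "a"), ("R", "a"), ("a", "b"), ("D", "b"), ("b", "c"),
            ("Q", "c"), ("c", "d"), ("H", "d"), ("C", "d")]


def directed_labels(branches, root_branch):
    """Label each node with the leaf set of the subtree hanging below it."""
    adj = defaultdict(set)
    for u, v in branches:
        if {u, v} == set(root_branch):
            continue  # the root branch is replaced by two branches through X
        adj[u].add(v)
        adj[v].add(u)
    for end in root_branch:
        adj["X"].add(end)
        adj[end].add("X")

    labels = {}

    def label(node, parent):
        children = adj[node] - {parent}
        if not children:  # a leaf carries its own singleton multicolor
            labels[node] = frozenset({node})
        else:
            labels[node] = frozenset().union(*(label(c, node) for c in children))
        return labels[node]

    label("X", None)
    del labels["X"]  # X itself is not a node of the original tree
    return labels


labels = directed_labels(BRANCHES, ("b", "c"))
print(sorted("".join(sorted(q)) for q in labels.values()))
```

With the root on the MRD + QHC branch, both MRD and QHC appear among the labels, while for every other branch only one multicolor of the complementary pair does.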

MGRA transforms the genomes $P_1, \ldots, P_k$ into X along the directed branches of T, using 2-breaks on $\vec{T}$-consistent multicolors ($\vec{T}$-consistent 2-breaks). In terms of breakpoint graphs, MGRA eliminates breakpoints in $G(P_1, P_2, \ldots, P_k)$ with $\vec{T}$-consistent 2-breaks and transforms it into the identity breakpoint graph $G(X, \ldots, X)$.⁵ This transformation defines a reverse transformation of the genome X into the genomes $P_1, \ldots, P_k$ by $\vec{T}$-consistent 2-breaks (such as in Fig. 3c). MGRA keeps track of the rearrangements applied to the breakpoint graph $G(P_1, \ldots, P_k)$ during its transformation into an identity breakpoint graph $G(X, \ldots, X)$. The recorded rearrangements (in the reverse order) define a reverse transformation that passes through every internal node of the tree T and, thus, can be used to reconstruct the ancestral genomes at the internal nodes of T.

While the initial steps in the transformation of the breakpoint graph $G(P_1, \ldots, P_k)$ into an identity breakpoint graph usually correspond to reliable rearrangements, sooner or later one needs to employ less reliable heuristic arguments in order to complete the transformation. However, sometimes it is preferable to stop after reaching a certain level of reliability even if the transformation is not complete (and the TCMGRP problem is not solved). In this case we stop short of reconstructing the ancestral genomes since the transformation has not resulted in an identity breakpoint graph. In Supplement C we describe an alternative method (not requiring a solution of the TCMGRP problem) for reliable reconstruction of (parts of) ancestral genomes (similar to CARs from Ma et al. (2006)) at internal nodes of the phylogenetic tree.

RESULTS

3 MGRA Algorithm

Supplement A introduces the notion of independent (no breakpoint re-uses), semi-independent (breakpoint re-uses may occur only within single branches of the phylogenetic tree), and weakly-independent rearrangements (breakpoint re-uses are limited to adjacent branches of the phylogenetic tree). MGRA optimally solves the MGRP problem in the case of semi-independent 2-breaks and uses heuristics to move beyond the semi-independent assumption. Below we show that most 2-breaks in mammalian evolution are either independent, semi-independent, or weakly independent, resulting in reliable ancestral reconstructions.

3.1 Cycles and paths in the breakpoint graph

Visual inspection of the rather complex breakpoint graph in Fig. 4 (the giant component contains 630 vertices) reveals a large number of cycles and simple paths that are characteristic of independent and semi-independent rearrangement scenarios. MGRA uses the cycles/paths in the breakpoint graphs as guidance for finding reliable ancestral reconstructions.

⁵ The use of $\vec{T}$-consistent 2-breaks here is motivated by an important property: every $\vec{T}$-consistent transformation can be turned into a strict $\vec{T}$-consistent transformation by changing the order of 2-breaks. Therefore, we do not directly address the strictness requirement in MGRA, which first produces a $\vec{T}$-consistent transformation of the genomes $P_1, P_2, \ldots, P_k$ into the genome X and then reorders it into a strict transformation.

We note that the immediate result of a 2-break performed along a branch $Q + \overline{Q}$ in the phylogenetic tree T is a cycle of four multi-edges whose multicolors alternate between Q and $\overline{Q}$. All vertices in this cycle have multidegree 2 and represent breakpoints that were not reused. Even if one of these multi-edges is used in later rearrangements, the remaining three multi-edges still form an alternating path that serves as a footprint of the 2-break. This observation motivates a search for alternating paths and cycles in the breakpoint graphs. We introduce the following definitions to analyze such cycles/paths.

We define a simple vertex as a regular vertex of multidegree 2 and a simple multi-edge as a multi-edge connecting two simple vertices. Simple multi-edges form simple cycles/paths in the breakpoint graphs, i.e., cycles/paths in which the multicolors of consecutive multi-edges alternate between Q and $\overline{Q}$. Simple multi-edges/paths/cycles are called good if their multicolors are T-consistent.
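These definitions translate directly into code. The sketch below assumes the breakpoint graph is stored as a dictionary mapping an unordered vertex pair to the multicolor of the multi-edge between them, with the irregular vertex ∞ encoded as the string `"inf"`; both the encoding and the function names are assumptions made for illustration.

```python
from collections import Counter


def simple_multi_edges(multi_edges, infinity="inf"):
    """Keep the multi-edges whose two endpoints are regular and have multidegree 2."""
    degree = Counter(v for edge in multi_edges for v in edge)
    simple = {v for v, d in degree.items() if d == 2 and v != infinity}
    return {edge: q for edge, q in multi_edges.items() if edge <= simple}


def good_multi_edges(multi_edges, t_consistent, infinity="inf"):
    """Simple multi-edges whose multicolor is T-consistent."""
    return {edge: q
            for edge, q in simple_multi_edges(multi_edges, infinity).items()
            if q in t_consistent}


Q, Q_BAR = frozenset("MR"), frozenset("DQHC")
TOY_GRAPH = {
    # a cycle of four multi-edges alternating between MR and DQHC
    frozenset({1, 2}): Q, frozenset({2, 3}): Q_BAR,
    frozenset({3, 4}): Q, frozenset({4, 1}): Q_BAR,
    # vertex 5 has multidegree 3, so none of its multi-edges is simple
    frozenset({5, 6}): frozenset("M"), frozenset({5, 7}): frozenset("R"),
    frozenset({5, 8}): frozenset("DQHC"),
}
print(len(simple_multi_edges(TOY_GRAPH)))
print(len(good_multi_edges(TOY_GRAPH, {Q, Q_BAR})))
```

The toy graph is deliberately incomplete (vertices 6, 7, and 8 have a single multi-edge each); it only serves to show which multi-edges pass the filter.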

| Multicolors | Multi-edges | Simple vertices | Simple multi-edges | Simple paths |
|---|---|---|---|---|
| R + MDQHC * | 1173 | 1080 | 1036 | 235 |
| MR + DQHC * | 487 | 376 | 305 | 86 |
| D + MRQHC * | 473 | 368 | 310 | 105 |
| M + RDQHC * | 223 | 145 | 118 | 45 |
| MRD + QHC * | 208 | 135 | 111 | 37 |
| Q + MRDHC * | 162 | 130 | 120 | 33 |
| HC + MRDQ * | 140 | 104 | 87 | 30 |
| C + MRDQH * | 45 | 32 | 26 | 11 |
| H + MRDQC * | 15 | 8 | 6 | 3 |
| QC + MRDH | 9 | 6 | 6 | 1 |
| MRQ + DHC | 8 | 1 | 0 | 0 |
| MD + RQHC | 8 | 1 | 0 | 0 |
| QH + MRDC | 7 | 2 | 1 | 1 |
| RQ + MDHC | 7 | 4 | 4 | 1 |
| DC + MRQH | 6 | 4 | 4 | 1 |
| DQ + MRHC | 5 | 0 | 0 | 0 |
| ∅ + MRDQHC | 2 | 0 | 0 | 0 |
| MRC + DQH | 1 | 0 | 0 | 0 |

| Multicolors | Multi-edges | Simple vertices | Simple multi-edges | Simple paths |
|---|---|---|---|---|
| ∅ + MDQO | 1693 | 0 | 0 | 0 |
| O + MDQ | 45 | 4 | 0 | 0 |
| M + DQO | 42 | 4 | 0 | 0 |
| Q + MDO | 35 | 0 | 0 | 0 |
| D + MQO | 26 | 0 | 0 | 0 |
| MO + DQ | 26 | 5 | 2 | 1 |
| MD + QO | 19 | 7 | 4 | 1 |
| MQ + DO | 12 | 3 | 0 | 0 |

Table 1: Top table: The statistics of the breakpoint graph of the Mouse, Rat, Dog, macaQue, Human, and Chimpanzee genomes. For every pair of complementary multicolors, we show the number of multi-edges of these multicolors, the number of simple vertices that are incident to such multi-edges, the number of simple multi-edges, and the number of simple paths and cycles. The T-consistent multicolors are marked with an asterisk. Only 18 out of 32 possible multicolors are shown (the remaining 14 multicolors have zero corresponding multi-edges). Bottom table: The statistics of the breakpoint graph of the Mouse, Dog, macaQue, and Opossum genomes after MGRA Stages 1 and 2 on confident branches (Fig. S18, bottom).

Table 1 (top) describes the statistics of the breakpoint graph and illustrates how rearrangement analysis contributes to the construction of phylogenetic trees. Indeed, all three internal branches (correct tree partitions) are supported by large numbers of good paths/cycles and good multi-edges (86 and 305 for MR+DQHC, 37 and 111 for MRD+QHC, 30 and 87 for HC+MRDQ). Each of 32 incorrect partitions (only 8 of them are shown in Table 1 (top)) has at most one simple path/cycle and at most six simple multi-edges, an order of magnitude fewer than the non-trivial correct partitions. This observation illustrates that reconstruction of the correct tree topology is a simple exercise in this case (see Chaisson et al. (2006)). This and other statistics produced by MGRA (see below) may be used to determine the phylogenetic tree rather than to assume that it is given. In contrast to Cannarozzi et al. (2007), MGRA provides a large number of certificates supporting the tree topology in Fig. 5. Below we show how MGRA reconstructs the ancestral genomes.

3.2 MGRA Stage 1: Processing good cycles and paths

Alternating cycles represent well-studied objects in the case of pairwise breakpoint graphs. Every such cycle of length 2m is formed by (m − 1) 2-breaks (Alekseyev and Pevzner, 2008) in each most parsimonious scenario.⁶ Therefore, there is little difference between alternating cycles in the pairwise breakpoint graphs and good cycles in the multiple breakpoint graphs: indeed, the good cycles with alternating multicolors Q and $\overline{Q}$ in the breakpoint graph model the rearrangements separating the sets of genomes Q and $\overline{Q}$ in exactly the same way as in the pairwise genome comparison. We therefore argue that such alternating cycles (and the corresponding rearrangements) can be reliably assigned to the branch $Q + \overline{Q}$ in the phylogenetic tree T. This operation generalizes the notion of good rearrangements in MGR by extending them from cycles alternating multicolors $P_i$ and $\overline{P_i} = \{P_1, \ldots, P_{i-1}, P_{i+1}, \ldots, P_k\}$ to cycles alternating any complementary T-consistent multicolors. While MGR attempts to find rearrangements bringing $P_i$ closer to all genomes from $\overline{P_i}$ (i.e., rearrangements on the leaf branches of the phylogenetic tree), MGRA processes reliable rearrangements on all (both leaf and internal) branches of the phylogenetic tree (compare to Zhao and Bourque (2007)).
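The count of m − 1 2-breaks per cycle of length 2m can be made concrete with a small sketch that follows the order of operations shown in Fig. 6b. It assumes a cycle is given as a vertex list in which the edges $(x_1, x_2), (x_3, x_4), \ldots$ carry multicolor Q and the remaining edges carry $\overline{Q}$; the function `resolve_good_cycle` is a hypothetical helper, and other equivalent orders of 2-breaks exist.

```python
def resolve_good_cycle(cycle):
    """Turn an alternating cycle [x1, ..., x2m] into m complete multi-edges.

    Returns the list of 2-breaks on Q-colored edges (as pairs of edges) and
    the resulting complete multi-edges.
    """
    assert len(cycle) >= 2 and len(cycle) % 2 == 0
    two_breaks, complete = [], []
    anchor, rest = cycle[0], cycle[1:]  # rest always has odd length
    while len(rest) > 1:
        a, b, c = rest[0], rest[1], rest[2]
        # The 2-break replaces Q-edges (anchor, a) and (b, c) with (a, b) and (anchor, c).
        two_breaks.append(((anchor, a), (b, c)))
        complete.append((a, b))  # (a, b) now carries both Q and its complement
        rest = rest[2:]          # the shorter cycle starts with Q-edge (anchor, c)
    complete.append((anchor, rest[0]))
    return two_breaks, complete


breaks, complete = resolve_good_cycle([1, 2, 3, 4, 5, 6])
print(breaks)
print(complete)
```

For the six-vertex cycle this yields two 2-breaks, on $(x_1, x_2), (x_3, x_4)$ and then on $(x_1, x_4), (x_5, x_6)$, and three complete multi-edges.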

Similarly, good paths can also be assigned to branches of the phylogenetic tree by first transforming them into good cycles. Consider a good path $x_1, x_2, \ldots, x_m$ consisting of m − 1 multi-edges with T-consistent multicolors alternating between a multicolor Q of the multi-edge $(x_1, x_2)$ and its complement $\overline{Q}$. We extend this path by vertices $x_0$ and $x_{m+1}$ adjacent to its first and last vertices respectively, resulting in the path $p = (x_0, x_1, x_2, \ldots, x_{m+1})$. If the first and the last multi-edges in this path have the same $\vec{T}$-consistent multicolor, we perform a 2-break over the multi-edges $(x_0, x_1)$ and $(x_m, x_{m+1})$ to transform p into a good cycle $c = (x_1, x_2, \ldots, x_m)$ and a multi-edge $(x_0, x_{m+1})$ (Fig. 6a,c).⁷ If the first and/or last multi-edges of p are of a non-$\vec{T}$-consistent multicolor, we remove it/them to obtain a path flanked by a $\vec{T}$-consistent multicolor that is processed (if it is longer than one edge) as above. Note that processing good cycles/paths in the breakpoint graph can create new good cycles/paths. We therefore process the good cycles/paths iteratively until no more good cycles/paths remain.⁸
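The path-closing step can be sketched as follows. The extended path is assumed to be given as a vertex list together with the multicolor of each consecutive edge; `close_good_path` is a hypothetical helper that trims flanking edges of a non-$\vec{T}$-consistent multicolor and, if more than one edge remains, reports the cycle and the new multi-edge produced by the 2-break. The special cases involving the vertex ∞ are not modeled.

```python
def close_good_path(vertices, colors, directed_consistent):
    """vertices = [x0, ..., x_{m+1}]; colors[i] is the multicolor of (vertices[i], vertices[i+1]).

    Returns (cycle_vertices, new_edge) or None when no 2-break applies.
    """
    assert len(colors) == len(vertices) - 1
    lo, hi = 0, len(colors) - 1
    if colors[lo] not in directed_consistent:
        lo += 1  # drop the first flanking edge
    if colors[hi] not in directed_consistent:
        hi -= 1  # drop the last flanking edge
    if hi <= lo:
        return None  # at most one edge left: nothing to process
    if colors[lo] != colors[hi] or colors[lo] not in directed_consistent:
        return None
    # The 2-break on (v[lo], v[lo+1]) and (v[hi], v[hi+1]) closes the cycle
    # v[lo+1], ..., v[hi] and creates the multi-edge (v[lo], v[hi+1]).
    return vertices[lo + 1:hi + 1], (vertices[lo], vertices[hi + 1])


Q, Q_BAR = frozenset("MR"), frozenset("DQHC")
path = ["x0", "x1", "x2", "x3", "x4", "x5"]
print(close_good_path(path, [Q, Q_BAR, Q, Q_BAR, Q], {Q}))
print(close_good_path(path, [Q_BAR, Q, Q_BAR, Q, Q_BAR], {Q}))
```

In the second call both flanking edges are trimmed first, so the 2-break acts on $(x_1, x_2)$ and $(x_3, x_4)$.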

⁶ While this representation is not unique, all these representations are equivalent (i.e., they produce the same final result). Fig. 6b illustrates the transformation of a simple cycle on six vertices into three complete multi-edges with two 2-breaks.

⁷ In the special case when $x_0 = x_{m+1} = \infty$ and the flanking edges are of the same $\vec{T}$-consistent multicolor, we perform a fusion 2-break as shown in Fig. 6d. In the case of m = 1 (i.e., when p contains a single simple multi-edge), c represents a complete multi-edge rather than a cycle (Fig. 6c) and does not require further processing.

⁸ One can prove that the topology of the resulting graph does not depend on the order in which good cycles/paths are processed.

Figure 6: Top panel: Processing good paths using a $\vec{T}$-consistent red-blue multicolor. a) A good path on vertices $x_1, x_2, \ldots, x_6$ is transformed into a cycle on the same vertices by extending it into $x_0, x_1, x_2, \ldots, x_6, x_7$ and performing a 2-break on the multi-edges $(x_0, x_1)$ and $(x_6, x_7)$. b) Transformation of a good cycle on 6 vertices into complete multi-edges with a 2-break on the multi-edges $(x_1, x_2)$, $(x_3, x_4)$ followed by a 2-break on the multi-edges $(x_1, x_4)$, $(x_5, x_6)$. c) A 2-break on an irregular edge corresponds to a reversal involving chromosome ends. d) A 2-break on two irregular edges corresponds to a fusion. Bottom panel: Two ways of transforming a fair edge (x, y) into a good edge: by a 2-break on yellow edges (top) or by a 2-break on green edges (bottom). In either case, the follow-up processing of the generated simple path results in the same graph with the complete multi-edge (x, y).

Fig. 7 (top panel) shows the breakpoint graph after processing (i.e., removing) good cycles/paths and illustrates that it is significantly simplified compared to Fig. 4. The size of the giant component is reduced from 630 to 193 vertices, and the overall number of vertices (not counting vertices incident to complete multi-edges) is reduced tenfold from 2712 to 253. Fig. 7 (top panel) illustrates how MGRA improves upon MGR: MGR is able to reduce the same graph only three-fold to 924 vertices (414 vertices in the giant component) before it runs out of “good rearrangements”. While MGRA Stage 1 greatly reduces the rearrangement distance between the analyzed genomes, Table S5 (center panel) illustrates that it still does not reveal the ancestral genomes MR, MRD, HC, and QHC. Moreover, it is not clear how to derive these ancestors based on the rather complex topology of the breakpoint graph in Fig. 7 (top panel). MGRA Stage 2 introduces the notion of fair cycles/paths that allows one to reveal the rearrangements that violate the semi-independence assumption and to further simplify the graph in Fig. 7 (top panel).

The results of MGRA Stage 1 already reveal valuable insights about the ancestral genomes (even without MGRA Stage 2). To simplify the analysis of the Boreoeutherian ancestral reconstruction⁹ by MGRA Stage 1, we restrict the set of genomes to single representatives of rodents (mouse), carnivores (dog), and primates (macaque). The resulting breakpoint graph (with obverse edges shown) reveals many long unicolored paths formed by alternating obverse edges and complete multi-edges (Fig. S10). Such paths represent parts of different human chromosomes in the reconstructed ancestor genome. We compress every such path into a single rectangular vertex as shown in Fig. S11 (top panel), resulting in a rather small graph. We further show the chromosomal associations present in this graph in Fig. S12. We emphasize that MGRA Stage 1 reveals some subtle but reliable adjacencies that other ancestral reconstruction algorithms may miss. In particular, it reveals two adjacencies that are absent in any of the extant genomes and many adjacencies that are present in only one of the extant genomes.

⁹ We use the MRD node of the phylogenetic tree in Fig. 5 to approximate the Boreoeutherian ancestor. While this paper focuses on the Boreoeutherian ancestor, MGRA reconstructs ancestral genomes for every node of the phylogenetic tree. We emphasize that while reconstruction starts with the selection of the root branch (as in Fig. 5), the choice of this branch and the exact location of the root X on this branch are rather arbitrary and not correlated with a specific ancestral genome of interest (in contrast to the alternative “root-driven” approach described in Supplement C). As described in Section 3.4, the ancestral genomes are defined by the reverse transformation from the (whatever) root genome X to the leaf genomes.

Figure 7: The breakpoint graph G(M,R,D,Q,H,C) (the complete multi-edges are not shown) after MGRA Stage 1 (top panel) and after MGRA Stages 1-2 (bottom panel). The edge colors represent Mouse (red), Rat (blue), Dog (green), macaQue (violet), Human (orange), and Chimpanzee (yellow) genomes. Vertices are labeled and colored similarly to Fig. 4.

The compressed breakpoint graph reveals only 5 complete multi-edges connecting vertices of different colors: 12 + 22, 12 + 22, 3 + 21, 4 + 8, and 14 + 15. These are exactly the same 5 adjacencies 12a + 22a, 12b + 22b, 3 + 21, 4a + 8p, 14 + 15 revealed in Ma et al. (2006). It also reveals the CARs corresponding to the human chromosomes 2, 2, 5, 6, 7, 8, 9, 10, 10, 11, 17, 18, and X (represented as isolated boxes in Fig. S12), exactly the same as the ancestral chromosomes revealed by previous cytogenetic analysis (Froenicke et al., 2006) (2q, 2pq, 5, 6, 7a, 8q, 9, 10q, 11, 17, 18, X) with a single exception: the second segment from chromosome 10 is identified as an isolated chromosome by us and is tentatively assigned as 10p + 12a + 22a by Froenicke et al. (2006). However, Froenicke et al. (2006) acknowledged that the association of 10p and 12a is only weakly supported (indicated by a question mark in Froenicke et al. (2006)).¹⁰ Our analysis also rules out the associations 1 + 22, 5 + 19, 2 + 18, 1 + 10, and 20 + 2 suggested in Murphy et al. (2005) as weak associations and later criticized by Froenicke et al. (2006) as unreliable. Supplement D further focuses on the connected component of the breakpoint graph representing the human chromosomes 7, 16, and 19, where the cytogenetic approach disagrees with Ma et al. (2006).

3.3 MGRA Stage 2: Processing fair cycles and paths

M+ R+ MR+ D+ MRD+ Q+ HC+ H+ C+ DQ+ QH+ RD+

M+ ? 19 11 3
R+ 19 ? 21 8 2 3 4 2
MR+ 11 21 ? 19 7 2 1
D+ 3 8 19 ? 11 2 1 2
MRD+ 2 7 11 ? 4 1 2 2
Q+ 3 ? 4 1
HC+ 4 2 2 4 4 ? 1
H+ 1 ?
C+ 1 2 1 1 ?
DQ+ 1 ?
QH+ 2 ?
RD+ 2 2 ?

Table 2: The statistics of composite multi-edges (non-zero counts only) in the breakpoint graph G(M,R,D,Q,H,C) after MGRA Stage 1. Each pair of complementary multicolors is denoted by one of its representative multicolors (e.g., M+ stands for the complementary multicolors M + RDQHC). The bold row/column labels correspond to T-consistent (pairs of) multicolors. The grayed cell entries correspond to pairs of adjacent branches in the phylogenetic tree T and account for 87% of all composite multi-edges.

⁹ (cont.) Ideally, different choices of the root branch and locations of the root X itself will result in the same set of ancestral genomes.

¹⁰ We are not claiming that this association does not exist since it may be present in some of the 100+ genomes with available cytogenetic data. However, there is no support for this association in the six mammalian genomes. We remark that Ma et al. (2006) also did not find support for this association.

Figure 7 (top panel) reveals many pairs of vertices of multidegree three connected by a multi-edge. Each such multi-edge (x, y) corresponds to six vertices $x, x_1, x_2, y, y_1, y_2$ and five multi-edges $(x, y), (x, x_1), (x, x_2), (y, y_1), (y, y_2)$ (including cases with $x_i = \infty$, $y_i = \infty$, or $x_i = y_j$ for some $1 \le i, j \le 2$). A multi-edge (x, y) is called composite if edges $(x, x_1)$ and $(y, y_1)$ have the same multicolor $Q_1$ and edges $(x, x_2)$ and $(y, y_2)$ have the same multicolor $Q_2$. A composite multi-edge is called fair if $Q_1$ and $Q_2$ represent T-consistent multicolors (Fig. 6, bottom panel). Table 2 shows the statistics of composite multi-edges (depending on pairs of complementary multicolors $Q_1 + \overline{Q_1}$ and $Q_2 + \overline{Q_2}$) and reveals that (i) most composite multi-edges are fair and (ii) while some types of composite multi-edges are common (e.g., (M+,R+), (M+,MR+), (R+,MR+), (MR+,D+), (D+,QHC+), (MR+,QHC+)), others (e.g., (Q+,R+)) are either rare or absent. Table 2 illustrates the extremely biased statistics of composite multi-edges: the branches $Q_1 + \overline{Q_1}$ and $Q_2 + \overline{Q_2}$ corresponding to the multicolors $Q_1$ and $Q_2$ of a composite multi-edge are likely adjacent in the phylogenetic tree (compare to the weakly-independent rearrangements). Table 2 provides yet another illustration of the utility of MGRA for deriving phylogenetic trees. Indeed, it reveals valuable information about the topology of the phylogenetic tree (incident edges) that can be combined with the information (valid partitions) in Table 1 to infer the trees.
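The definitions of composite and fair multi-edges can be sketched in code. The graph is again assumed to be a dictionary from unordered vertex pairs to multicolors, with ∞ encoded as `"inf"`; the function names and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def composite_multi_edges(multi_edges, infinity="inf"):
    """Map each composite multi-edge (x, y) to its pair of multicolors {Q1, Q2}."""
    incident = defaultdict(dict)  # vertex -> {neighbor: multicolor}
    for edge, color in multi_edges.items():
        u, v = tuple(edge)
        incident[u][v] = color
        incident[v][u] = color
    result = {}
    for edge in multi_edges:
        x, y = tuple(edge)
        if infinity in (x, y) or len(incident[x]) != 3 or len(incident[y]) != 3:
            continue  # both endpoints must be regular vertices of multidegree 3
        at_x = {c for n, c in incident[x].items() if n != y}
        at_y = {c for n, c in incident[y].items() if n != x}
        if at_x == at_y:
            result[edge] = frozenset(at_x)
    return result


def fair_multi_edges(multi_edges, t_consistent, infinity="inf"):
    """Composite multi-edges whose multicolors Q1 and Q2 are both T-consistent."""
    return {e for e, qs in composite_multi_edges(multi_edges, infinity).items()
            if qs <= t_consistent}


fs = frozenset
TOY = {
    fs({"x", "y"}): fs("DQHC"), fs({"x", "x1"}): fs("M"), fs({"x", "x2"}): fs("R"),
    fs({"y", "y1"}): fs("M"), fs({"y", "y2"}): fs("R"),
    fs({"u", "w"}): fs("QHC"), fs({"u", "u1"}): fs("M"), fs({"u", "u2"}): fs("RD"),
    fs({"w", "w1"}): fs("M"), fs({"w", "w2"}): fs("RD"),
}
T_CONSISTENT = {fs(s) for s in ("M", "R", "D", "Q", "H", "C", "MR", "HC", "MRD", "QHC")}
print(sorted(sorted(e) for e in composite_multi_edges(TOY)))
print(sorted(sorted(e) for e in fair_multi_edges(TOY, T_CONSISTENT)))
```

In the toy graph both (x, y) and (u, w) are composite, but only (x, y) is fair because RD is not a T-consistent multicolor for the tree in Fig. 5.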

Every fair multi-edge (x, y) can be transformed into a good multi-edge by a 2-break (fair 2-break) either on multi-edges $(x, x_1)$ and $(y, y_1)$ (of multicolor $Q_1$) or on multi-edges $(x, x_2)$ and $(y, y_2)$ (of multicolor $Q_2$) (Fig. 6, bottom panel). In the former case, (x, y) is transformed into a good multi-edge of color $Q_2$, while in the latter case it is transformed into a good multi-edge of color $Q_1$. The resulting good paths (formed by fair 2-breaks) can be further processed as described in MGRA Stage 1. An important observation is that the final result of processing a fair multi-edge does not depend on whether we start with a 2-break on the $Q_1$ or the $Q_2$ multicolor (see Fig. 6, bottom panel). A cycle/path in the breakpoint graph is called fair if (i) all its edges are either good or fair and (ii) it can be transformed into a good cycle/path by some fair 2-breaks.

MGRA Stage 2 detects fair paths/cycles, transforms them into good paths/cycles by fair 2-breaks, and further processes the resulting good paths/cycles as in MGRA Stage 1. In some cases fair paths in Stage 2 should be chosen with caution since the choice of fair paths may influence ancestral reconstructions in some nodes (see Supplement E). Figure 7 (bottom panel) shows the breakpoint graph after processing fair cycles/paths and illustrates that it becomes so small that it can now be analyzed in a case-by-case fashion by brute-force analysis of every connected component.

3.4 Reconstructing Ancestral Genomes

After removing vertex ∞, the breakpoint graph (after MGRA Stage 2) consists of only 9 connected components (Fig. 7, bottom panel). Five out of 9 components contain vertices corresponding to both the start and the end of the same synteny block: 80, 610, 795, 1290, and 1300. This is surprising since generally the start and end of a synteny block are not expected to be present in the same (small) connected component unless this block was subject to a micro-inversion (Chaisson et al., 2006). Indeed, blocks 80, 610, 795, 1290, and 1300 turned out to be short (all under 500 Kb), with blocks 610, 1290, and 1300 even shorter than 100 Kb (block 1300 is 91 Kb in human, 41 Kb in mouse, and only 10 Kb in the dog genome), which is near the threshold of 50 Kb used in Ma et al. (2006) for generating reliable synteny blocks.
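The check for components that contain both ends of the same synteny block can be sketched as follows. Vertices are assumed to be strings of the form `nt` or `nh` as in Fig. 4, the vertex ∞ is encoded as `"inf"`, and the edge list in the example is a made-up toy, not data from Fig. 7.

```python
from collections import defaultdict


def blocks_with_both_ends(edges, infinity="inf"):
    """Return block numbers whose tail (nt) and head (nh) share a connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        if infinity in (u, v):
            continue  # vertex infinity is removed before computing components
        adj[u].add(v)
        adj[v].add(u)
    seen, found = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first search over one connected component
            node = stack.pop()
            for nxt in adj[node] - component:
                component.add(nxt)
                stack.append(nxt)
        seen |= component
        tails = {v[:-1] for v in component if v.endswith("t")}
        heads = {v[:-1] for v in component if v.endswith("h")}
        found |= tails & heads
    return found


toy_edges = [("79h", "80t"), ("80t", "81t"), ("81t", "80h"),
             ("83h", "84t"), ("84t", "inf")]
print(blocks_with_both_ends(toy_edges))
```

Blocks reported by such a check are candidates for micro-inversions or for removal from the input, as discussed in the text.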

The simplest way to deal with such short blocks is to simply remove them from the set of inputsynteny blocks (Supplement J). Such removal will not significantly affect the architecture of theancestral genomes (indeed, these blocks are well below the resolution of the cytogenetic approaches)while at the same time resolving 5 out of 9 remaining components in the graph. Supplement Fdescribes a different approach that attempts to find the positions and orientations of such shortsynteny block in the ancestors by processing complex breakpoints (MGRA Stage 3). We remark thatprocessing at MGRA Stage 3 is viewed as less reliable and the resulting associations are not consideredin the proposed ancestral reconstructions (see below).

Recall that a strict T-consistent rearrangement scenario uniquely defines ancestral genomes at all internal nodes of the phylogenetic tree T. However, because of the use of 2-breaks instead of reversals/translocations/fissions/fusions, the ancestral genomes initially obtained by MGRA may contain (a small number of) circular chromosomes. Whenever possible, MGRA linearizes them by rearranging 2-breaks in the transformation. While circular chromosomes may occasionally appear in the initial rearrangement scenario obtained by MGRA, their appearance is a result of either 2-breaks applied in the “wrong” order (which can be avoided by reordering the 2-breaks; see Pevzner (2000)) or a “shortcut” in processing hurdles that can be remedied by introducing additional 2-breaks (Hannenhalli and Pevzner, 1999). MGRA eliminates possible circular chromosomes in the reconstructed genomes at the post-processing stage. We emphasize that the outcome of MGRA is the set of ancestral (linear) genomes while the 2-break rearrangement scenario produced by MGRA is considered only as a starting point for constructing the reversals/translocations/fusions/fissions scenario. An optimal linear rearrangement scenario can be found by applying GRIMM to the ancestral genomes reconstructed by MGRA.
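To make the appearance of circular chromosomes concrete, the following minimal sketch applies 2-breaks to a toy genome and counts chromosomes without telomeres. It is not MGRA's linearization procedure. The representation (a genome as a set of adjacencies between block ends labeled `1t`, `1h`, ...) and all function names are assumptions made for illustration.

```python
def other_end(v):
    """Opposite end of the same block: '2t' <-> '2h'."""
    return v[:-1] + ("h" if v[-1] == "t" else "t")

def two_break(adjacencies, e1, e2, new1, new2):
    """Replace adjacencies e1, e2 by new1, new2 on the same four block ends."""
    adj = {frozenset(a) for a in adjacencies}
    old = {frozenset(e1), frozenset(e2)}
    new = {frozenset(new1), frozenset(new2)}
    if not old <= adj:
        raise ValueError("both removed adjacencies must be present")
    if set().union(*old) != set().union(*new):
        raise ValueError("a 2-break must reuse the same four block ends")
    return (adj - old) | new

def count_circular(blocks, adjacencies):
    """Count chromosomes that contain no telomere (circular chromosomes)."""
    nbr = {}
    for a, b in map(tuple, adjacencies):
        nbr[a], nbr[b] = b, a
    seen, circular = set(), 0
    for blk in blocks:
        if blk in seen:
            continue
        start = v = f"{blk}t"
        while True:
            seen.add(int(v[:-1]))
            w = other_end(v)       # cross the block
            if w not in nbr:       # telomere reached: linear chromosome
                break
            v = nbr[w]             # follow the adjacency
            if v == start:         # returned to the start: circular
                circular += 1
                break
    return circular

# Toy linear chromosome 1 2 3 (hypothetical data).
genome = {frozenset(("1h", "2t")), frozenset(("2h", "3t"))}
step1 = two_break(genome, ("1h", "2t"), ("2h", "3t"), ("1h", "3t"), ("2t", "2h"))
step2 = two_break(step1, ("1h", "3t"), ("2t", "2h"), ("1h", "2h"), ("2t", "3t"))
print(count_circular([1, 2, 3], step1), count_circular([1, 2, 3], step2))  # 1 0
```

In this toy case the first 2-break excises block 2 as a circular chromosome and the second reinserts it in the opposite orientation, so the two 2-breaks together amount to a reversal of block 2 and the final genome is linear.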

Fig. S14 illustrates the results of ancestral genome reconstruction for chromosome X for six mammalian genomes. Supplement H shows the pairwise rearrangement distances between the ancestral and leaf genomes, following the strict T-consistent transformation constructed by MGRA, and compares them to the genomic distances computed by GRIMM (Tesler, 2002b). The differences between these distances are rather small, suggesting that the T-consistent transformation found by MGRA is close to the most parsimonious.

4 Benchmarking MGRA

Benchmarking of the ancestral genome reconstruction algorithms may be challenging since the architecture of ancestral genomes is not known. While MGR, GRAPPA, inferCARs, and MGRA showed excellent performance on simulated datasets, these benchmarks were mainly designed for rearrangements generated according to the Random Breakage Model (RBM). Since MGRA improves on MGR and is guaranteed to produce optimal solutions for semi-independent scenarios, it is bound to provide even better results than MGR on such benchmarks. Supplement L compares MGRA and inferCARs on simulated data and illustrates that MGRA generates more accurate ancestral reconstructions for all choices of parameters. However, analyzing all these tools on simulated data may generate over-optimistic results since RBM does not reflect the realities of mammalian evolution (Murphy et al., 2005; van der Wind et al., 2004; Bailey et al., 2004; Zhao et al., 2004; Webber and Ponting, 2005; Hinsch and Hannenhalli, 2006; Ruiz-Herrera et al., 2006; Yue and Haaf, 2006; Kikuta et al., 2007; Mehan et al., 2007; Caceres et al., 2007; Gordon et al., 2007). We therefore decided to analyze the differences between MGRA and inferCARs reconstructions and to further track evidence for each such difference in a case-by-case fashion.
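For readers unfamiliar with RBM-style simulations, the sketch below generates a rearranged genome by applying reversals whose endpoints are chosen uniformly at random. It is a simplified illustration, not the benchmark used in Supplement L: it handles a single chromosome and reversals only, and the function names and parameters are hypothetical.

```python
import random

def random_reversals(n_blocks, n_reversals, seed=0):
    """Apply reversals with uniformly random endpoints to the identity
    signed permutation 1..n_blocks (single chromosome, reversals only)."""
    rng = random.Random(seed)
    genome = list(range(1, n_blocks + 1))
    for _ in range(n_reversals):
        # two distinct cut points among the n_blocks + 1 possible positions
        i, j = sorted(rng.sample(range(n_blocks + 1), 2))
        genome[i:j] = [-x for x in reversed(genome[i:j])]
    return genome

def breakpoints(genome):
    """Number of adjacencies that differ from the identity genome."""
    ext = [0] + list(genome) + [len(genome) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)

simulated = random_reversals(n_blocks=20, n_reversals=5, seed=1)
print(simulated, breakpoints(simulated))
```

Because every position is equally likely to be cut, breakpoint re-use arises only by chance in such a simulation, which is one reason why results on RBM-generated data can look better than results on real mammalian genomes.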

MGRA and inferCARs produce highly consistent ancestral reconstructions. For illustration purposes, we have chosen to focus on the reconstruction of the MRD ancestral genome (Fig. 5), remarking that the results for the other ancestor genomes are similar. As an input to inferCARs we provided six mammalian genomes and the same phylogenetic tree as used in Ma et al. (2006). The MRD genomes reconstructed by MGRA and inferCARs consist of 25 and 30 chromosomes (CARs) respectively.11

However, MGRA does not consider associations obtained at Stage 3 (Fig. 7, bottom panel) as reliable. Most of these associations correspond to micro-inversions and thus do not significantly affect the ancestral reconstructions.

11 inferCARs reconstructions slightly differ from those reported in Ma et al. (2006) since we use the synteny blocks from the latest builds of mammalian genomes (provided by Jian Ma). Similarly to Ma et al. (2006) and Kemkemer et al. (2006), we ignore very short CARs blocks in both inferCARs and MGRA reconstructions to simplify the analysis (see Table S14).


4.1 Comparison of two inferCARs reconstructions and using MGRA to improve inferCARs ancestral reconstructions

We start by comparing inferCARs with itself on two inputs: the original 6 mammalian genomes M, R, D, Q, H, C and the genomes M′, R′, D′, Q′, H′, C′ produced by MGRA Stage 1 (Fig. 7, top panel). We denote the reconstructed MRD genomes as MRDCARs and MRD′CARs respectively.

Since MGRA Stage 1 processes only good cycles/paths that are unambiguously present in every optimal rearrangement scenario, one can safely assume that any optimal ancestral reconstruction should include the rearrangements performed at Stage 1. Therefore, running inferCARs on the M, R, D, Q, H, C genomes should ideally produce the same results as running inferCARs on the “equivalent” M′, R′, D′, Q′, H′, C′ genomes. However, since inferCARs makes some greedy decisions and does not claim optimality, it does not guarantee to produce the same results on M, R, D, Q, H, C as compared to M′, R′, D′, Q′, H′, C′. Any such inconsistency would point to either somewhat less reliable CARs reconstructed by inferCARs or to reliable adjacencies missed by inferCARs. Therefore, inferCARs reconstructions can be potentially improved if MGRA Stage 1 runs before inferCARs as a pre-processing step.

Comparison of the reconstructed genomes MRDCARs and MRD′CARs indicates that while they share the overwhelming majority (99.0%) of reconstructed adjacencies, there are 13 adjacencies present in MRDCARs but absent in MRD′CARs and 13 adjacencies absent in MRDCARs but present in MRD′CARs (out of the 1325 reconstructed adjacencies). Fig. 8(top) displays the breakpoint graph between the corresponding MRDCARs and MRD′CARs reconstructions and reveals that the MRD′CARs reconstruction is arguably more reliable than the MRDCARs reconstruction. Indeed, Fig. 8(top) reveals that while most of the adjacencies (12 out of 13) present in MRDCARs but not in MRD′CARs correspond to “ambiguous joins” (in terms of Ma et al. (2006)), MRD′CARs contains 4 reliable adjacencies (i.e., resolved by MGRA Stage 1) that are nevertheless absent in MRDCARs.
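The comparison of two reconstructions reduces to set operations on their adjacencies. The sketch below is an illustration with a hypothetical helper name and toy data. The final line reproduces the 99.0% figure under the assumption (one reading of the text) that each reconstruction contains 1325 adjacencies, 13 of which are absent from the other.

```python
def compare_adjacencies(first, second):
    """Return (shared, only_in_first, only_in_second) counts for two
    reconstructions given as collections of unordered adjacencies."""
    a = {frozenset(x) for x in first}
    b = {frozenset(x) for x in second}
    return len(a & b), len(a - b), len(b - a)

# Toy reconstructions (hypothetical vertex labels).
recon_1 = [("1h", "2t"), ("2h", "3t"), ("3h", "4t")]
recon_2 = [("2t", "1h"), ("2h", "3t"), ("3h", "5t")]
print(compare_adjacencies(recon_1, recon_2))  # (2, 1, 1)

# Shared fraction for the numbers quoted in the text.
print(f"{(1325 - 13) / 1325:.1%}")  # 99.0%
```

Adjacencies are stored as unordered pairs so that `("1h", "2t")` and `("2t", "1h")` are counted as the same adjacency.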

To resolve the conflicts between inferCARs results on equivalent inputs we analyze each of these adjacencies ((658h, 652h), (871t, 873t), (770t, 771t), and (1014t, 1017h)) in a case-by-case fashion. For example, in the case of the (658h, 652h) adjacency, inferCARs failed to connect them, since the vertices 658h and 652h represent breakpoints of multidegree 3 (Fig. 8, bottom panels) and it is not immediately clear how to process them using “local” rules employed by inferCARs. inferCARs turns 658h into a CAR end in MRD although it is not a chromosome end in any of the six genomes. The breakpoint graph provides a clear view of the connection between 658h and 652h by revealing good paths connecting them (see Fig. 8, bottom panels).
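The notion of multidegree used above can be illustrated with a short sketch. It assumes that the multidegree of a vertex is the number of distinct multi-edges incident to it, where parallel edges of different genomes between the same pair of vertices form one multi-edge. The data layout, function names, and toy adjacencies are hypothetical and do not reproduce the actual component shown in Fig. 8.

```python
from collections import defaultdict

def multi_edges(genomes):
    """genomes: {color: iterable of adjacencies (pairs of vertex labels)}.
    Returns {vertex: {neighbor: multicolor}}, where the multicolor is the
    sorted string of genome colors sharing that adjacency."""
    nbrs = defaultdict(lambda: defaultdict(set))
    for color, adjacencies in genomes.items():
        for u, v in adjacencies:
            nbrs[u][v].add(color)
            nbrs[v][u].add(color)
    return {u: {v: "".join(sorted(colors)) for v, colors in d.items()}
            for u, d in nbrs.items()}

def multidegree(genomes, vertex):
    """Number of distinct multi-edges incident to the vertex."""
    return len(multi_edges(genomes).get(vertex, {}))

# Toy data: vertex "1h" has three different neighbors across six genomes.
toy = {
    "M": [("1h", "2t")], "R": [("1h", "2t")],
    "D": [("1h", "5t")],
    "Q": [("1h", "9h")], "H": [("1h", "9h")], "C": [("1h", "9h")],
}
print(multi_edges(toy)["1h"])   # {'2t': 'MR', '5t': 'D', '9h': 'CHQ'}
print(multidegree(toy, "1h"))   # 3
```

A vertex of multidegree 3, as in this toy case, is where purely local rules become ambiguous, since no single neighbor is supported by all genomes.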

4.2 Comparison of inferCARs and MGRA reconstructions

Fig. S19 displays the breakpoint graph of the three MRD reconstructions: MRDMGRA, MRDCARs, and MRD′CARs, and illustrates that the number of differences between MRDCARs and MRD′CARs (we consider the latter reconstruction to be more reliable) is comparable to the number of differences between MRDMGRA and MRD′CARs. Indeed, MRD′CARs differs from MRDCARs by 30 adjacencies and differs from MRDMGRA by 39 adjacencies. Since the large-scale architecture of MRDCARs was shown to be largely consistent with previous cytogenetic reconstructions (Ma et al., 2006) and since MRD′CARs (that is arguably even more reliable than MRDCARs) and MRDMGRA share at least 98.5% of all adjacencies, all these reconstructions can be viewed as largely consistent with the cytogenetics-based reconstructions. Remarkably, most differences between MGRA and inferCARs reconstructions are represented by “ambiguous joins” that MGRA labels as less reliable anyway (shown as dashed edges). In particular, inferCARs reports eight less reliable adjacencies as unambiguous (complete multi-edges with dashed purple edges in Fig. S19). However, most of them correspond to micro-inversions and have minor effects on the large-scale ancestral architectures (see Supplement I for detailed comparison of MGRA and inferCARs reconstructions). Table 3 shows the genomic distances from MRDMGRA and MRDCARs to each of the six leaf genomes and illustrates that MGRA results in a slightly more parsimonious scenario as compared to inferCARs (the total distance is 1503 for MGRA and 1518 for inferCARs).

Figure 8: Top panel: The breakpoint graph of the genomes MRDCARs (cyan) and MRD′CARs (orange) reconstructed by inferCARs (common adjacencies are not shown). Bold edges represent reliable adjacencies (resolved by MGRA Stage 1), while dashed edges represent “ambiguous joins” (see Ma et al. (2006)) made by inferCARs. Vertex colors are coded as in Fig. 4. Bottom panel: A most parsimonious transformation of one connected component (containing vertices 658h and 652h) of the breakpoint graph G(M,R,D,Q,H,C) from Fig. 4. The initial component (first panel) is transformed with a 2-break in primates (second panel), a 2-break in rodents (third panel), and two 2-breaks in dog resulting from processing of a good D + MRQCH path (fourth panel).

Figure 9: Left panel: The primate–rodent–carnivore controversy: an alternative between the primate–rodent (green tree) and the primate–carnivore clades (red tree). Right panel: The phylogenetic tree T of seven mammalian genomes: Mouse (red), Rat (blue), Dog (green), macaQue (violet), Human (orange), Chimpanzee (yellow), and Opossum (brown). Since the Opossum branch is subject to a controversy, the dashed branches represent possible variations while the solid branches are confident and do not depend on the Opossum branch.

            M     R     D     Q     H     C    Total
MRDCARs    303   656   180   124   122   133   1518
MRDMGRA    285   637   173   130   133   145   1503

Table 3: The genomic distances between the MRD reconstructions MRDCARs and MRDMGRA and the genomes M, R, D, Q, H, C.
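The totals in Table 3 can be re-derived directly from the per-genome distances. The short sketch below does so and also reports the per-genome difference; the dictionary keys `MRD_CARs` and `MRD_MGRA` are simply labels chosen here for the two table rows.

```python
# Distances copied from Table 3.
distances = {
    "MRD_CARs": {"M": 303, "R": 656, "D": 180, "Q": 124, "H": 122, "C": 133},
    "MRD_MGRA": {"M": 285, "R": 637, "D": 173, "Q": 130, "H": 133, "C": 145},
}

totals = {name: sum(row.values()) for name, row in distances.items()}
print(totals)  # {'MRD_CARs': 1518, 'MRD_MGRA': 1503}

# Positive values mean the MGRA reconstruction is closer to that leaf genome.
gain = {g: distances["MRD_CARs"][g] - distances["MRD_MGRA"][g] for g in "MRDQHC"}
print(gain)  # {'M': 18, 'R': 19, 'D': 7, 'Q': -6, 'H': -11, 'C': -12}
```

The overall difference of 15 rearrangements comes from the mouse, rat, and dog genomes, while the three primate genomes are slightly closer to the inferCARs reconstruction.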

4.3 The Primate–Rodent–Carnivore Split in Mammalian Evolution

Knowledge of the correct phylogeny is an important prerequisite for many comparative genomics approaches (Blanchette and Tompa, 2002; Kellis et al., 2003). However, even the basic features of the mammalian phylogeny (e.g., the primate–rodent–carnivore split) remain controversial (Fig. 9). While the morphology studies support the primate–rodent clade (Shoshani and McKenna, 1998), the early molecular studies supported the primate–carnivore clade (Graur, 1993; Kumar and Hedges, 1998; Reyes et al., 2000; Janke et al., 1994). Although starting from Murphy et al. (2001) the phylogeny based on the primate–rodent clade (Madsen et al., 2001; Poux et al., 2002; Amrine-Madsen et al., 2003; Thomas et al., 2003; Reyes et al., 2004) has become widely accepted, the question is far from being settled: recent studies provided arguments against the primate–rodent clade (Jorgensen et al., 2005; Arnason et al., 2002; Misawa and Janke, 2003; Cannarozzi et al., 2007; Huttley et al., 2007; Niimura and Nei, 2007; Huerta-Cepas et al., 2007). Below we analyze some rearrangement-based characters supporting both the primate–carnivore and (to a smaller extent) the primate–rodent clade. Similarly to other approaches, the rearrangement analysis reveals some pros and cons for each alternative but does not definitively resolve the long-standing controversy.

Chaisson et al. (2006) made an attempt to analyze mammalian phylogeny using micro-rearrangements in the CFTR region representing 0.06% of mammalian genomes. However, the small size of this region and ambiguities in revealing micro-rearrangements between distant mammals made it difficult to find micro-rearrangements that can “certify” the deep branches of the mammalian phylogenetic tree. Cannarozzi et al. (2007) made an attempt to analyze large-scale rearrangements (as opposed to micro-rearrangements) for reconstructing the mammalian evolutionary history. Their approach, while promising, left a number of questions unanswered. In particular, Cannarozzi et al. (2007) discussed only reversals and ignored translocations, fusions, and fissions. Also, they computed the reversal distances using an (unpublished) greedy algorithm. Since breakpoint re-use is prominent in mammalian evolution, greedy approaches are unlikely to provide an adequate rearrangement scenario. Finally, Cannarozzi et al. (2007) used the “distance-based” rather than “character-based” methods for computing the phylogenetic tree. It is well known that the performance of the “distance-based” methods deteriorates in the case of large breakpoint re-use typical for mammalian genomes.

Lunter (2007) criticized Cannarozzi et al. (2007) and wrote in April 2007: “It appears unjustified to continue to consider the phylogeny of primates, rodents, and canines as contentious”; Huttley et al. (2007) wrote in May 2007: “We have demonstrated with very high confidence that the rodents diverged before carnivores and primates” (see Niimura and Nei (2007); Huerta-Cepas et al. (2007) for other recent studies supporting the primate–carnivore clade). We therefore argue that the rearrangement-based study of the primate–rodent–carnivore controversy is timely.

To analyze the primate–rodent–carnivore controversy, we added the Opossum genome (Mikkelsen et al., 2007) to our rearrangement analysis.12 However, while the phylogenetic tree of the previously considered six mammalian genomes is well established, the position of the opossum genome in this tree is being debated (Fig. 9). Supplementary Table S13 presents the statistics of the breakpoint graph and reveals simple edges supporting the debated tree topologies. Among the non-confident branches, the MRO + DQHC branch (corresponding to the primate–carnivore clade) is supported by 50 multi-edges, while the DO + MRQHC branch (corresponding to the primate–rodent clade) is supported by 32 multi-edges. We emphasize that only 4 out of 50 multi-edges supporting the MRO + DQHC branch represent MRO multi-edges, resulting in a very small number of simple MRO + DQHC vertices (one simple vertex for the MRO + DQHC branch as compared to zero simple vertices for the DO + MRQHC branch).
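Counting the support for a branch treats a multicolor and its complement as evidence for the same split of the genomes. The sketch below illustrates this bookkeeping; the function names are hypothetical, and the count of 46 DQHC multi-edges is inferred here from the text (50 supporting multi-edges, of which 4 are MRO multi-edges) rather than taken from Table S13.

```python
from collections import Counter

ALL_GENOMES = frozenset("MRDQHCO")

def branch_of(multicolor, genomes=ALL_GENOMES):
    """Canonical key for the branch (split) supported by a multicolor
    and by its complementary multicolor."""
    part = frozenset(multicolor)
    if not part <= genomes:
        raise ValueError(f"unknown genome in multicolor {multicolor!r}")
    return frozenset({part, genomes - part})

def support(multicolors):
    """Count multi-edges per branch, merging complementary multicolors."""
    return Counter(branch_of(mc) for mc in multicolors)

# 4 MRO multi-edges plus (inferred) 46 DQHC multi-edges.
observed = ["MRO"] * 4 + ["DQHC"] * 46
counts = support(observed)
print(counts[branch_of("MRO")])                          # 50
print(branch_of("MRO") == branch_of("DQHC"))             # True
print(branch_of("DO") == branch_of("MRQHC"))             # True
```

Keying on the unordered pair of complementary color sets means that the order in which a split is written (MRO + DQHC or DQHC + MRO) does not affect the tally.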

To further address the uncertainty with the opossum branch we applied MGRA only to the non-controversial parts of the tree with the goal to find characters supporting each of two currently debated tree topologies. The debated tree topologies share (non-controversial) HC + MRDOQ, QHC + MRDO, and MR + DOQHC branches (as well as seven leaf branches corresponding to single genomes). We refer to these branches as confident and consider only good and fair paths that correspond to confident branches in MGRA analysis. To further compare the support for the primate–carnivore and the primate–rodent clades we ran MGRA to simplify this breakpoint graph. MGRA Stages 1-2 result in the breakpoint graph (Fig. S18(top)) that encodes rearrangements during mammalian radiation.

While running MGRA on all seven genomes was important for simplifying the initial breakpoint graph of seven genomes, it hardly makes sense to analyze all these genomes in the complex graph in Fig. S18(top). Indeed, we are not interested in subtle inconsistencies between mouse–rat and human–chimpanzee–macaque genomic architectures revealed by this graph. We therefore select single representatives of the primate (macaque–human–chimpanzee ancestor), rodent (mouse–rat ancestor), and carnivore (dog) as well as the outgroup (opossum genome) to simplify the analysis (see Fig. S18(top) for a similar analysis with the representatives corresponding to extant macaque, mouse, dog, and opossum genomes).

Table 1(bottom) allows one to analyze features supporting the primate–carnivore clade (26 multi-edges of MO + DQ multicolors) and the primate–rodent clade (12 multi-edges of MQ + DO multicolors). While the rearrangement-based support for the primate–carnivore clade is more significant than for the primate–rodent clade (26 versus 12 multi-edges), one cannot exclude a possibility that some complex breakpoint re-use events skewed the statistics in Table 1(bottom) in favor of the primate–carnivore clade (see Table 4). Since the elephant genome provides a better (less diverged) outgroup than the opossum genome, there is a hope that the completion of the elephant sequencing project may eventually lead to the resolution of the primate–rodent–carnivore controversy.

12 Adding the seventh genome increases the number of the synteny blocks to 1746 (by ≈ 30%) but reduces the coverage of the genomes by the synteny blocks from 89% to 79%.

We re-ran MGRA on the set of seven genomes, assuming the primate–carnivore topology.13 The resulting rearrangement distances as well as 2-breaks assigned to the MRO + DQHC branch (supporting the carnivore–primate split) are given in Supplement H.

Multicolors   Multi-edges         Simple vertices   Simple multi-edges   Simple paths+cycles

∅ + RPCO      0 + 627 = 627       0                 0 + 0 = 0            0 + 0 = 0
O + RPC       617 + 740 = 1357    1230              613 + 523 = 1136     93 + 109 = 202
R + PCO       280 + 173 = 453     346               121 + 173 = 294      52 + 28 = 80
C + RPO       142 + 246 = 388     284               142 + 96 = 238       46 + 33 = 79
P + RCO       49 + 124 = 173      98                49 + 31 = 80         18 + 13 = 31
RO + PC       5 + 46 = 51         1                 0 + 0 = 0            0 + 0 = 0
RP + CO       32 + 11 = 43        1                 0 + 0 = 0            0 + 0 = 0
RC + PO       16 + 8 = 24         2                 0 + 0 = 0            0 + 0 = 0

∅ + RPCO      0 + 1712 = 1712     0                 0 + 0 = 0            0 + 0 = 0
O + RPC       10 + 26 = 36        10                0 + 0 = 0            0 + 0 = 0
R + PCO       22 + 5 = 27         5                 0 + 0 = 0            0 + 0 = 0
C + RPO       0 + 18 = 18         0                 0 + 0 = 0            0 + 0 = 0
P + RCO       0 + 16 = 16         0                 0 + 0 = 0            0 + 0 = 0
RO + PC       4 + 16 = 20         6                 2 + 1 = 3            2 + 0 = 2
RP + CO       5 + 4 = 9           4                 0 + 0 = 0            0 + 0 = 0
RC + PO       5 + 4 = 9           4                 1 + 0 = 1            1 + 0 = 1

∅ + RPCO      0 + 1743 = 1743     0                 0 + 0 = 0            0 + 0 = 0
O + RPC       0 + 2 = 2           0                 0 + 0 = 0            0 + 0 = 0
C + RPO       0 + 2 = 2           0                 0 + 0 = 0            0 + 0 = 0
P + RCO       0 + 1 = 1           0                 0 + 0 = 0            0 + 0 = 0
RO + PC       9 + 10 = 19         12                3 + 3 = 6            5 + 0 = 5
RC + PO       8 + 8 = 16          10                2 + 3 = 5            3 + 1 = 4
RP + CO       7 + 8 = 15          10                4 + 2 = 6            4 + 0 = 4

Table 4: The statistics of the breakpoint graph of the ancestral Rodent (mouse–rat ancestor), ancestral Primate (macaque–human–chimpanzee ancestor), Carnivore (dog), and Opossum genomes. The Rodent and Primate ancestors were reconstructed using MGRA and were used as the genomes in the leaves of the phylogenetic tree for four species: Rodent, Primate, Dog, and Opossum. The statistics are shown before running MGRA (top table), after MGRA Stage 1 (middle table), and after MGRA Stage 2 (bottom table) run on the leaf branches (i.e., without assuming any particular topology of the phylogenetic tree). Compare to Tables 1(bottom) and S13.

DISCUSSION

Recently, Froenicke et al. (2006) expressed a concern about some differences between the rearrangement-based and cytogenetics-based approaches to ancestral genome reconstruction. The problem is that some important insights developed by the cytogenetics community still did not find their way into the genome rearrangement tools like MGR, GRAPPA, inferCARs, and EMRAE. While MGRA started as an attempt to close this gap, we quickly realized that the problem of merging the cytogenetics-based and rearrangement-based approaches is far from being simple. First, there is still no cytogenetics-based software that can be automatically applied to genome-scale datasets to enable an unbiased comparison of two approaches on the same dataset. Second, it is not clear how well the cytogenetics approach scales with increase in the resolution, e.g., with 1000+ synteny blocks from Ma et al. (2006).

Despite the low resolution of the cytogenetic data, the cytogenetics-based ancestral reconstructions are very accurate as there are relatively few discrepancies between the cytogenetics-based and the recent genomics-based high-resolution reconstructions (Bourque et al., 2005; Murphy et al., 2005; Ma et al., 2006). Moreover, the discrepancies are usually attributed to some arbitrary assignments of the genomics-based MGR algorithm (Froenicke et al., 2006) rather than errors in the cytogenetics analysis. Indeed, MGR was developed for finding the most parsimonious scenario rather than finding which rearrangements in this scenario are less reliable than others. The discrepancies between MGR and the cytogenetics-based reconstructions are likely to be a reflection of the “strength in numbers” principle rather than shortcomings of the genomics-based approaches: while the cytogenetic reconstructions are based on over 100 known cytogenetic maps, there are still only seven completed mammalian genomic sequences suitable for the rearrangement analysis. However, even with a small increase in the number of the genomes from 3-4 (as in Bourque et al. (2004, 2005)) to 6-7 (as in Murphy et al. (2005); Ma et al. (2006)), there are very few discrepancies between the cytogenetics-based and the genomics-based approaches (Rocchi et al., 2006). Despite a recent debate (Froenicke et al., 2006; Bourque et al., 2006), the cytogenetics and genomics-based approaches are converging and benefiting from the higher resolution of the genomics-based approaches. However, the key condition for such convergence is the availability of algorithms that improve upon the existing heuristics for distinguishing between strong and weak associations. We addressed this challenge by devising the MGRA algorithm that remedies some limitations of the previous approaches to ancestral reconstructions. Similarly to the algorithms recently proposed by Ma et al. (2008) and Chauve and Tannier (2008) (published after this paper was submitted), MGRA focuses on accurate rather than the most parsimonious ancestral reconstructions.

13 In contrast, Ma et al. (2006) assumed the primate–rodent topology.

ACKNOWLEDGEMENTS

We are indebted to Jian Ma for providing us with the synteny blocks for mammalian genomes from the latest builds and for numerous thoughtful discussions. We are thankful to Bill Murphy for a discussion about the primate–rodent–carnivore controversy as well as to Guillaume Bourque and Glenn Tesler for discussions on the algorithmic aspects of this project. We are also grateful to Lutz Froenicke and Claus Kemkemer for useful comments on the cytogenetics approach and CytoAncestor. This work was supported by the Howard Hughes Professor Award.


References

Alekseyev, M. A., 2008. Multi-Break Rearrangements and Breakpoint Re-uses: from Circular to Linear Genomes. Journal of Computational Biology, 15(8):1117–1131.

Alekseyev, M. A. and Pevzner, P. A., 2007. Whole Genome Duplications, Multi-Break Rearrangements, and Genome Halving Theorem. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 665–679.

Alekseyev, M. A. and Pevzner, P. A., 2008. Multi-Break Rearrangements and Chromosomal Evolution. Theoretical Computer Science, 395(2-3):193–202.

Amrine-Madsen, H., Koepfli, K.-P., Wayne, R. K., and Springer, M. S., 2003. A new phylogenetic marker, apolipoprotein B, provides compelling evidence for eutherian relationships. Molecular Phylogenetics and Evolution, 28(2):225–240.

Arnason, U., Adegoke, J. A., Bodin, K., Born, E. W., Esa, Y. B., Gullberg, A., Nilsson, M., Short, R. V., Xu, X., and Janke, A., et al., 2002. Mammalian mitogenomic relationships and the root of the eutherian tree. Proceedings of the National Academy of Sciences, 99(12):8151–8156.

Bafna, V. and Pevzner, P. A., 1996. Genome rearrangement and sorting by reversals. SIAM Journal on Computing, 25:272–289.

Bafna, V. and Pevzner, P. A., 1998. Sorting permutations by transpositions. SIAM J. Discrete Math., 11:224–240.

Bailey, J., Baertsch, R., Kent, W., Haussler, D., and Eichler, E., 2004. Hotspots of mammalian chromosomal evolution. Genome Biology, 5(4):R23.

Bergeron, A., Mixtacki, J., and Stoye, J., 2006. A Unifying View of Genome Rearrangements. Lecture Notes in Computer Science, 4175:163–173.

Blanchette, M., Bourque, G., and Sankoff, D., 1997. Breakpoint Phylogenies. Genome Informatics, 8:25–34.

Blanchette, M. and Tompa, M., 2002. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, 12(5):739–748.

Bourque, G. and Pevzner, P. A., 2002. Genome-Scale Evolution: Reconstructing Gene Orders in the Ancestral Species. Genome Research, 12(1):26–36.

Bourque, G., Pevzner, P. A., and Tesler, G., 2004. Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome Research, 14:507–516.

Bourque, G., Tesler, G., and Pevzner, P. A., 2006. The convergence of cytogenetics and rearrangement-based models for ancestral genome reconstruction. Genome Res., 16(3):311–313.

Bourque, G., Zdobnov, E. M., Bork, P., Pevzner, P. A., and Tesler, G., 2005. Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages. Genome Research, 15:98–110.

Bulazel, K., Ferreri, G., Eldridge, M., and O’Neill, R., 2007. Species-specific shifts in centromere sequence composition are coincident with breakpoint reuse in karyotypically divergent lineages. Genome Biology, 8(8):R170.

Caceres, M., Sullivan, R. T., and Thomas, J. W., 2007. A recurrent inversion on the eutherian X chromosome. Proceedings of the National Academy of Sciences, 104(47):18571–18576.

Cannarozzi, G., Schneider, A., and Gonnet, G., 2007. A Phylogenomic Study of Human, Dog, and Mouse. PLoS Computational Biology, 3(1):e2.

Caprara, A., 1999a. Formulations and hardness of multiple sorting by reversals. In RECOMB ’99: Proceedings of the third annual international conference on Computational molecular biology, pages 84–93, New York, NY, USA. ACM.

Caprara, A., 1999b. On the Tightness of the Alternating-Cycle Lower Bound for Sorting by Reversals. Journal of Combinatorial Optimization, 3(2-3):149–182.

Cardone, M., Jiang, Z., D’Addabbo, P., Archidiacono, N., Rocchi, M., Eichler, E., and Ventura, M., 2008. Hominoid chromosomal rearrangements on 17q map to complex regions of segmental duplication. Genome Biology, 9(2):R28.


Chaisson, M. J., Raphael, B. J., and Pevzner, P. A., 2006. Microinversions in mammalian evolution. Proceedings of the National Academy of Sciences, 103(52):19824–19829.

Chauve, C. and Tannier, E., 2008. A Methodological Framework for the Reconstruction of Contiguous Regions of Ancestral Genomes and Its Application to Mammalian Genomes. PLoS Computational Biology, 4(11):e1000234.

Deuve, J., Bennett, N., Britton-Davidian, J., and Robinson, T., 2008. Chromosomal phylogeny and evolution of the african mole-rats (bathyergidae). Chromosome Research, 16(1):57–74.

Feuk, L., MacDonald, J. R., Tang, T., Carson, A. R., Li, M., Rao, G., Khaja, R., and Scherer, S. W., 2005. Discovery of Human Inversion Polymorphisms by Comparative Analysis of Human and Chimpanzee DNA Sequence Assemblies. PLoS Genetics, 1:e56.

Froenicke, 2005. Origins of primate chromosomes – as delineated by zoo-fish and alignments of human and mouse draft genome sequences. Cytogenetic and Genome Research, 108(1-3):122–138.

Froenicke, L., Caldes, M. G., Graphodatsky, A., Muller, S., Lyons, L. A., Robinson, T. J., Volleth, M., Yang, F., and Wienberg, J., 2006. Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Res., 16(3):306–310.

Gordon, L., Yang, S., Tran-Gyamfi, M., Baggott, D., Christensen, M., Hamilton, A., Crooijmans, R., Groenen, M., Lucas, S., Ovcharenko, I., et al., 2007. Comparative analysis of chicken chromosome 28 provides new clues to the evolutionary fragility of gene-rich vertebrate regions. Genome Research, 17(11):1603–1613.

Graur, D., 1993. Towards a molecular resolution of the ordinal phylogeny of the eutherian mammals. FEBS Letters, 325(1-2):152–159.

Hannenhalli, S. and Pevzner, P., 1995. Transforming men into mice (polynomial algorithm for genomic distance problem). Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 581–592.

Hannenhalli, S. and Pevzner, P., 1999. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Journal of the ACM, 46:1–27.

Hinsch, H. and Hannenhalli, S., 2006. Recurring genomic breaks in independent lineages support genomic fragility. BMC Evolutionary Biology, 6:90.

Huerta-Cepas, J., Dopazo, H., Dopazo, J., and Gabaldon, T., 2007. The human phylome. Genome Biology, 8(6):R109.

Huttley, G. A., Wakefield, M. J., and Easteal, S., 2007. Rates of Genome Evolution and Branching Order from Whole Genome Analysis. Molecular Biology and Evolution, 24(8):1722–1730.

Janke, A., Feldmaier-Fuchs, G., Thomas, W. K., von Haeseler, A., and Paabo, S., 1994. The Marsupial Mitochondrial Genome and the Evolution of Placental Mammals. Genetics, 137(1):243–256.

Jorgensen, F., Hobolth, A., Hornshoj, H., Bendixen, C., Fredholm, M., and Schierup, M., 2005. Comparative analysis of protein coding sequences from human, mouse and the domesticated pig. BMC Biology, 3(1):2.

Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. S., 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937):241–254.

Kemkemer, C., Kohn, M., Kehrer-Sawatzki, H., Minich, P., Hogel, J., Froenicke, L., and Hameister, H., 2006. Reconstruction of the ancestral ferungulate karyotype by electronic chromosome painting (E-painting). Chromosome Research, 14(8):899–907.

Kikuta, H., Laplante, M., Navratilova, P., Komisarczuk, A. Z., Engstrom, P. G., Fredman, D., Akalin, A., Caccamo, M., Sealy, I., Howe, K., et al., 2007. Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Research, 17(5):545–555.

Kumar, S. and Hedges, S. B., 1998. A molecular timescale for vertebrate evolution. Nature, 392(6679):917–920.

Lin, Y. and Moret, B. M., 2008. Estimating true evolutionary distances under the DCJ model. Bioinformatics, 24(13):i114–122.

Lunter, G., 2007. Dog as an Outgroup to Human and Mouse. PLoS Computational Biology, 3(4):e74.


Ma, J., Ratan, A., Raney, B. J., Suh, B. B., Miller, W., and Haussler, D., 2008. The infinite sites model of genome evolution. Proceedings of the National Academy of Sciences, 105(38):14254–14261.

Ma, J., Zhang, L., Suh, B. B., Raney, B. J., Burhans, R. C., Kent, J. W., Blanchette, M., Haussler, D., and Miller, W., 2006. Reconstructing contiguous regions of an ancestral genome. Genome Research, 16(12):1557–1565.

Madsen, O., Scally, M., Douady, C. J., Kao, D. J., DeBry, R. W., Adkins, R., Amrine, H. M., Stanhope, M. J., de Jong, W. W., and Springer, M. S., et al., 2001. Parallel adaptive radiations in two major clades of placental mammals. Nature, 409(6820):610–614.

Mehan, M. R., Almonte, M., Slaten, E., Freimer, N. B., Rao, P. N., and Ophoff, R. A., 2007. Analysis of segmental duplications reveals a distinct pattern of continuation-of-synteny between human and mouse genomes. Human Genetics, 121(1):93–100.

Mikkelsen, T. S., Wakefield, M. J., Aken, B., Amemiya, C. T., Chang, J. L., Duke, S., Garber, M., Gentles, A. J., Goodstadt, L., Heger, A., et al., 2007. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature, 447(7141):167–177.

Misawa, K. and Janke, A., 2003. Revisiting the Glires concept–phylogenetic analysis of nuclear sequences. Molecular Phylogenetics and Evolution, 28(2):320–327.

Moret, B., Wyman, S., Bader, D. A., Warnow, T., and Yan, M., 2001. A new implementation and detailed study of breakpoint analysis. Pacific Symposium on Biocomputing, 583–594.

Moret, B. M. E., Siepel, A. C., Tang, J., and Liu, T., 2002. Inversion Medians Outperform Breakpoint Medians in Phylogeny Reconstruction from Gene-Order Data. Lecture Notes in Computer Science, 2452:521–536.

Murphy, W. J., Eizirik, E., Johnson, W. E., Zhang, Y. P., Ryder, O. A., and O’Brien, S. J., 2001. Molecular phylogenetics and the origins of placental mammals. Nature, 409(6820):614–618.

Murphy, W. J., Larkin, D. M., van der Wind, A. E., Bourque, G., Tesler, G., Auvil, L., Beever, J. E., Chowdhary, B. P., Galibert, F., Gatzke, L., et al., 2005. Dynamics of Mammalian Chromosome Evolution Inferred from Multispecies Comparative Map. Science, 309(5734):613–617.

Niimura, Y. and Nei, M., 2007. Extensive Gains and Losses of Olfactory Receptor Genes in Mammalian Evolution. PLoS ONE, 2(8):e708.

Ozery-Flato, M. and Shamir, R., 2003. Two Notes on Genome Rearrangement. Journal of Bioinformatics and Computational Biology, 1:71–94.

Pevzner, P. A., 2000. Computational Molecular Biology: An Algorithmic Approach. The MIT Press, Cambridge.

Pevzner, P. A. and Tesler, G., 2003. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proceedings of the National Academy of Sciences, 100:7672–7677.

Pontius, J. U., Mullikin, J. C., Smith, D. R., Lindblad-Toh, K., Gnerre, S., Clamp, M., Chang, J., Stephens, R., Neelam, B., Volfovsky, N., et al., 2007. Initial sequence and comparative analysis of the cat genome. Genome Research, 17(11):1675–1689.

Poux, C., van Rheede, T., Madsen, O., and de Jong, W. W., 2002. Sequence Gaps Join Mice and Men: Phylogenetic Evidence from Deletions in Two Proteins. Mol Biol Evol, 19(11):2035–2037.

Reyes, A., Gissi, C., Catzeflis, F., Nevo, E., Pesole, G., and Saccone, C., 2004. Congruent Mammalian Trees from Mitochondrial and Nuclear Genes Using Bayesian Methods. Molecular Biology and Evolution, 21(2):397–403.

Reyes, A., Gissi, C., Pesole, G., Catzeflis, F. M., and Saccone, C., 2000. Where Do Rodents Fit? Evidence from the Complete Mitochondrial Genome of Sciurus vulgaris. Mol Biol Evol, 17(6):979–983.

Robinson, T. J., Ruiz-Herrera, A., and Froenicke, L., 2006. Dissecting the mammalian genome – new insights into chromosomal evolution. Trends in Genetics, 22(6):297–301.

Rocchi, M., Archidiacono, N., and Stanyon, R., 2006. Ancestral genomes reconstruction: An integrated, multi-disciplinary approach is needed. Genome Res., 16(12):1441–1444.

Ruiz-Herrera, A., Castresana, J., and Robinson, T. J., 2006. Is mammalian chromosomal evolution driven by regions of genome fragility? Genome Biology, 7:R115.


Sankoff, D. and Blanchette, M., 1998. Multiple genome rearrangement and breakpoint phylogeny. Journal of Computational Biology, 5(3):555–570.

Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B. F., and Cedergren, R., 1992. Gene Order Comparisons for Phylogenetic Inference: Evolution of the Mitochondrial Genome. Proceedings of the National Academy of Sciences, 89(14):6575–6579.

Shoshani, J. and McKenna, M. C., 1998. Higher Taxonomic Relationships among Extant Mammals Based on Morphology, with Selected Comparisons of Results from Molecular Data. Molecular Phylogenetics and Evolution, 9(3):572–584.

Tang, J. and Moret, B. M., 2003. Scaling up accurate phylogenetic reconstruction from gene-order data. Bioinformatics, 19(Suppl. 1):i305–312.

Tannier, E. and Sagot, M.-F., 2004. Sorting by reversals in subquadratic time. Lecture Notes in Computer Science, 3109:1–13.

Tannier, E., Zheng, C., and Sankoff, D., 2008. Multichromosomal Genome Median and Halving Problems. Lecture Notes in Bioinformatics, 5251:1–13.

Tesler, G., 2002a. Efficient algorithms for multichromosomal genome rearrangements. J. Comput. Syst. Sci., 65:587–609.

Tesler, G., 2002b. GRIMM: genome rearrangements web server. Bioinformatics, 18(3):492–493.

Thomas, J. W., Touchman, J. W., Blakesley, R. W., Bouffard, G. G., Beckstrom-Sternberg, S. M., Margulies, E. H., Blanchette, M., Siepel, A. C., Thomas, P. J., McDowell, J. C., et al., 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424(6950):788–793.

van der Wind, A. E., Kata, S. R., Band, M. R., Rebeiz, M., Larkin, D. M., Everts, R. E., Green, C. A., Liu, L., Natarajan, S., Goldammer, T., et al., 2004. A 1463 Gene Cattle-Human Comparative Map With Anchor Points Defined by Human Genome Sequence Coordinates. Genome Research, 14(7):1424–1437.

Webber, C. and Ponting, C. P., 2005. Hotspots of mutation and breakage in dog and human chromosomes. Genome Research, 15(12):1787–1797.

Wienberg, J. and Stanyon, R., 1997. Comparative painting of mammalian chromosomes. Current Opinion in Genetics & Development, 7(6):784–791.

Xia, A., Sharakhova, M., and Sharakhov, I., 2007. Reconstructing an inversion history in the Anopheles gambiae complex. Lecture Notes in Bioinformatics, 4751:136–148.

Yancopoulos, S., Attie, O., and Friedberg, R., 2005. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics, 21:3340–3346.

Yue, Y. and Haaf, T., 2006. 7E olfactory receptor gene clusters and evolutionary chromosome rearrangements. Cytogenetic and Genome Research, 112:6–10.

Zhao, H. and Bourque, G., 2007. Recovering True Rearrangement Events on Phylogenetic Trees. Lecture Notes in Bioinformatics, 4751:149–161.

Zhao, S., Shetty, J., Hou, L., Delcher, A., Zhu, B., Osoegawa, K., de Jong, P., Nierman, W. C., Strausberg, R. L., and Fraser, C. M., et al., 2004. Human, Mouse, and Rat Genome Large-Scale Rearrangements: Stability Versus Speciation. Genome Research, 14:1851–1860.


Supplement A Independent, semi-independent, and weakly-independent rearrangements

Let P be a genome represented as a graph on black and obverse edges. For any m black edges in P, we define an m-break (or multi-break) rearrangement as the replacement of these edges with m other black edges forming a matching on the same vertices (see Alekseyev and Pevzner (2008); Alekseyev (2008)).
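The definition of an m-break can be illustrated with a minimal Python sketch. It assumes (for illustration only) that the black edges of a genome are stored as a set of unordered vertex pairs and that vertices are block ends labelled with strings such as "1t" and "1h"; the function name multi_break is hypothetical and is not part of MGRA.

```python
def multi_break(black_edges, removed, added):
    """Apply an m-break: replace m black edges by m other edges that
    form a matching on the same 2m vertices.

    Edges are unordered pairs of vertex labels (assumed representation).
    """
    edges = {frozenset(e) for e in black_edges}
    removed = {frozenset(e) for e in removed}
    added = {frozenset(e) for e in added}
    if not removed <= edges:
        raise ValueError("edges to remove must be present in the genome")
    if len(removed) != len(added):
        raise ValueError("an m-break replaces m edges with m edges")
    # The new edges form a matching on the same vertices exactly when
    # every endpoint of a removed edge is used once by the added edges.
    old_vertices = sorted(v for e in removed for v in e)
    new_vertices = sorted(v for e in added for v in e)
    if old_vertices != new_vertices:
        raise ValueError("new edges must form a matching on the same vertices")
    return (edges - removed) | added


# A circular genome (+1 +2) and a 2-break that reverses block 2.
genome = [("1h", "2t"), ("2h", "1t")]
reversed_block = multi_break(genome, genome, [("1h", "2h"), ("2t", "1t")])
```

In this sketch a 2-break is simply the case m = 2; the check on the vertex lists is what enforces the "matching on the same vertices" condition.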

Let P1, P2, . . . , Pk be genomes that evolved by some unknown multi-breaks following an unknown evolutionary tree (we allow any combination of 2-breaks, 3-breaks, and m-breaks for m > 3 in a single evolutionary scenario). Without loss of generality, we assume that there was at least one multi-break on every branch of the tree. One may classify an m-break as independent if it does not reuse breakpoints, i.e., creates exactly m new breakpoints. Similarly, we call the rearrangement scenario independent if all its multi-breaks are independent.

If all multi-breaks are independent, the following theorem applies (the proof follows the proof of a similar result in Chaisson et al. (2006)).

Theorem 1. If genomes P1, P2, . . . , Pk are produced by independent multi-breaks, then both the correct evolutionary tree for these genomes and the ancestral genomes in all its branching nodes may be reconstructed in polynomial time.

We now consider the case when genomes P1, P2, . . . , Pk are produced by semi-independent 2-breaks that may re-use breakpoints within single branches of the evolutionary tree (i.e., a semi-independent 2-break does not share breakpoints with any other 2-break on a different branch of T). We call the 2-break rearrangement scenario semi-independent if all its 2-breaks are semi-independent. Since any semi-independent 2-break scenario corresponds to an independent multi-break scenario (see Alekseyev and Pevzner (2008)), Theorem 1 implies:

Theorem 2. If genomes P1, P2, . . . , Pk are produced by semi-independent 2-breaks, then both the correct evolutionary tree for these genomes and the ancestral genomes in all its branching nodes may be reconstructed in polynomial time.

MGRA optimally solves the MGRP problem in the case of semi-independent 2-breaks and uses heuristics to move beyond the semi-independent assumption. In a new manuscript (footnote 14) we define the notion of weakly independent rearrangements that relaxes the semi-independent assumption by allowing breakpoint re-uses within selected pairs of incident branches in the phylogenetic tree (as opposed to a single branch in semi-independent scenarios). We demonstrate that the TCMGRP can be solved efficiently in the case of weakly independent scenarios (footnote 15). Below we show that most 2-breaks in mammalian evolution are either independent, or semi-independent, or weakly independent, resulting in reliable ancestral reconstructions. The theoretical analysis of weakly independent scenarios is not crucial for understanding MGRA and will be described elsewhere.

Supplement B Simultaneous T-consistent transformations

The problem of finding a shortest rearrangement scenario typically has many solutions. To characterize all genomes that may appear in shortest rearrangement scenarios between genomes A and B, we say that a genome Q is located between A and B if the rearrangement distances between these genomes satisfy the condition: d(A,Q) + d(Q,B) = d(A,B). In the case of a phylogenetic tree T with known genomes at all internal nodes, we say that a genome Q is located on a branch (A,B) of the phylogenetic tree T if it is located between nodes (genomes) A and B. Similarly, a genome Q is located on the tree T if it is located on a branch of T. A transformation between two genomes located on the tree T is called T-consistent if every intermediate genome in this transformation is also located on T.
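The condition d(A,Q) + d(Q,B) = d(A,B) can be checked directly on small examples. The sketch below is an illustration under stated assumptions, not MGRA's code: it handles only unichromosomal circular genomes written as lists of signed block numbers, and it takes d to be the 2-break distance computed as the number of blocks minus the number of alternating cycles in the breakpoint graph of the two genomes. The helper names (adjacencies, cycles, distance, located_between) are hypothetical.

```python
def adjacencies(chromosome):
    """Adjacencies of one circular chromosome given as signed block numbers."""
    def entry(b):  # end at which the block is entered
        return f"{abs(b)}t" if b > 0 else f"{abs(b)}h"

    def leave(b):  # end at which the block is left
        return f"{abs(b)}h" if b > 0 else f"{abs(b)}t"

    n = len(chromosome)
    return [(leave(chromosome[i]), entry(chromosome[(i + 1) % n]))
            for i in range(n)]


def cycles(a, b):
    """Number of alternating cycles in the breakpoint graph of a and b."""
    nbr_a, nbr_b = {}, {}
    for x, y in a:
        nbr_a[x], nbr_a[y] = y, x
    for x, y in b:
        nbr_b[x], nbr_b[y] = y, x
    seen, count = set(), 0
    for start in nbr_a:
        if start in seen:
            continue
        count += 1
        v = start
        while v not in seen:  # follow an a-edge, then a b-edge
            seen.add(v)
            w = nbr_a[v]
            seen.add(w)
            v = nbr_b[w]
    return count


def distance(a, b):
    """2-break distance for circular genomes on the same blocks."""
    return len(a) - cycles(a, b)


def located_between(a, q, b):
    return distance(a, q) + distance(q, b) == distance(a, b)


A = adjacencies([1, 2, 3])
Q = adjacencies([1, -2, 3])
B = adjacencies([1, -2, -3])
Q_off = adjacencies([1, 3, 2])
```

Here Q lies on a shortest scenario from A to B (reverse block 2, then block 3), whereas Q_off does not, so located_between(A, Q, B) holds and located_between(A, Q_off, B) fails.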

14 M. Alekseyev and P. Pevzner. “Breakpoint Re-uses and Efficient Ancestral Genome Reconstruction”. Mimeo.

15 In particular, while the reversal median problem is NP-complete for arbitrary scenarios (Caprara, 1999a), one can efficiently reconstruct the 2-break median for 3 genomes in case of weakly independent scenarios.


Let T be a tree with known genomes specified at every node. A tree T′ is homeomorphic to the tree T if it is derived from T by adding extra internal nodes (of degree 2) within branches of T and specifying some genomes at these added nodes. For example, Fig. 3c represents a tree homeomorphic to the tree in Fig. 3a with two extra nodes added to the branch (Q2,Q3). A homeomorphic tree T′ defines a T-consistent rearrangement scenario if genomes at every two adjacent nodes of T′ differ from each other by a single rearrangement and the total number of rearrangements along each branch is minimal. We now reformulate the problem of finding the most parsimonious rearrangement scenario as the problem of finding a T-consistent rearrangement scenario with the minimum number of nodes in the homeomorphic tree T′.

If X is an arbitrary genome (root) in T′ then the path $\mathrm{path}(X, P_i)$ from the root X to every leaf genome $P_i$ in T′ corresponds to a series of rearrangements transforming X into $P_i$ ($i = 1, 2, \dots, k$). A set of nodes C in T′ is called a cut if each path $\mathrm{path}(X, P_i)$ contains exactly one node from C (for $1 \le i \le k$). For example, the sets $\{X\}$ and $\{P_1, \dots, P_k\}$ represent cuts in T′ with the minimum and maximum number of nodes, respectively. Given a cut C and a leaf genome $P_i$, let $v_i^C$ be the (single) node in C located on the path $\mathrm{path}(X, P_i)$ and let $P_i^C$ be the genome assigned to the node $v_i^C$. Therefore, every cut C defines k genomes $P_1^C, P_2^C, \dots, P_k^C$.

One can orient branches of T′ in the direction from X to the leaves and define next(v) as the set of children of an internal node v (the number of children equals the degree of v minus one). Given an internal node v in a cut C, we define a new cut $\mathrm{next}_v(C)$ obtained from C by deleting the node v and adding the set of nodes next(v). A simultaneous T-consistent transformation of the root genome X into the leaf genomes $P_1, \dots, P_k$ is a series of cuts $\{X\} = C_0, C_1, \dots, C_d = \{P_1, \dots, P_k\}$ such that $C_{i+1} = \mathrm{next}_{v_i}(C_i)$ for some node $v_i \in C_i$ ($0 \le i < d$). It is easy to see that for every T-consistent transformation there exists a simultaneous T-consistent transformation. Below we give an equivalent definition of the simultaneous T-consistent transformation in terms of multiple breakpoint graphs; this definition motivates the MGRA algorithm, which attempts to find a shortest simultaneous T-consistent transformation.
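The series of cuts can be made concrete with a short Python sketch. It assumes a rooted tree stored as a dictionary that maps each internal node to the list of its children; the node names and the rule for choosing which node to expand (alphabetical order) are arbitrary illustrative choices, since the definition allows any internal node of the current cut.

```python
def next_cut(cut, v, children):
    """Replace an internal node v of the cut by the set of its children."""
    if v not in cut:
        raise ValueError("v must belong to the cut")
    if not children.get(v):
        raise ValueError("v must be an internal node")
    return (set(cut) - {v}) | set(children[v])


def cut_series(root, children):
    """Series of cuts from {root} down to the set of leaves."""
    cut = {root}
    series = [frozenset(cut)]
    while True:
        internal = sorted(v for v in cut if children.get(v))
        if not internal:  # the cut consists of leaves only
            break
        # Expanding the alphabetically first node is an arbitrary choice.
        cut = next_cut(cut, internal[0], children)
        series.append(frozenset(cut))
    return series


# Hypothetical tree: root X, internal nodes a and b, leaves P1, P2, P3.
children = {"X": ["a", "b"], "a": ["P1", "P2"], "b": ["P3"]}
series = cut_series("X", children)
```

For this tree the series is {X}, {a, b}, {P1, P2, b}, {P1, P2, P3}; each step removes one internal node, so the number of steps equals the number of non-leaf nodes.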

Any subset of edges from the multi-edge (x, y) represents a sub-multi-edge (x, y) of the multicolor formed by the colors of the edges in this subset. Any simultaneous T-consistent transformation of X into $P_1, \dots, P_k$ defines a transformation of the identity breakpoint graph $G(X, \dots, X)$ into $G(P_1, P_2, \dots, P_k)$ with a series of rearrangements applied to sub-multi-edges of T-consistent multicolors. Namely, we define the multiple breakpoint graph corresponding to the cut $C_i$ as $G_i = G(P_1^{C_i}, P_2^{C_i}, \dots, P_k^{C_i})$. It is easy to see that $G_{i+1}$ is obtained from $G_i$ by a single rearrangement $\rho$ applied to all copies of some genome Q in $P_1^{C_i}, P_2^{C_i}, \dots, P_k^{C_i}$. Alternatively, a transformation of $G_i$ into $G_{i+1}$ can be viewed as applying $\rho$ to T-consistent sub-multi-edges in $G_i$ (of the multicolor $\{P_j \mid P_j^{C_i} = Q\}$).

Supplement C Reconstructing reliable CARs

To reconstruct reliable adjacencies in the ancestor genome at a node of the phylogenetic tree, we select this node as a root node X. Then we start to eliminate breakpoints in the breakpoint graph G(P1, . . . , Pk) with reliable 2-breaks (for whatever definition of reliability) and stop when no further reliable 2-breaks exist. In the resulting breakpoint graph (that may still have some breakpoints) the multi-edges of multicolors containing X (as a subset of colors) represent the reliable block adjacencies in the target ancestor genome, and we generate CARs based only on such adjacencies. Note that this approach can reconstruct CARs in only one ancestor genome at a time, and multiple runs (with different root nodes X) are needed to reconstruct CARs in multiple ancestor genomes.

Supplement D Comparison of various ancestral reconstructions for a component of the breakpoint graph representing the human chromosomes 7, 16, and 19

Below we focus on the connected component of the breakpoint graph representing the human chromosomes 7, 16, and 19 where the cytogenetics approach disagrees with Ma et al. (2006). The advantage of the breakpoint graph approach is that it enables a simple analysis of this controversy since the analysis is reduced to “genomes” with only 6 synteny blocks after equivalent transformations performed by MGRA Stage 1 (Fig. S13). Indeed, the genomes represented by this component can now be viewed as:

Macaque: 4, 5, (2,−6), (3,−1)
Mouse: 2, 3, (4, 6), (1,−5)
Dog: 1, (4,−6,−5), (2,−3)

where block 1 is located on human chromosome 7, blocks 2 and 3 are located on human chromosome 16, and blocks 4, 5, and 6 are located on human chromosome 19 (Fig. S11, top panel, and Fig. S13). Ma et al. (2006) proposed the ancestral architecture with 4 chromosomes 1, 5, (2,−3), (6,−4) (no associations between chromosomes 7, 16, and 19), while Froenicke et al. (2006) proposed 4, 5, (1,−3), (2,−6) (associations 16 + 7 and 16 + 19). It is easy to see that the cytogenetics reconstruction is less parsimonious than the Ma et al. (2006) reconstruction. We therefore argue that the criticism of Ma et al. (2006) in Rocchi et al. (2006) regarding the missing association 7 + 16 is not fully justified since the whole-genome data do not support this association (footnote 16).

MGRA Stage 2 also generates a solution that improves on the cytogenetics reconstruction (Froenicke et al., 2006) and proposes the ancestral association 7 + 19. While both our solution and that of Ma et al. (2006) are more parsimonious than the cytogenetics reconstruction, we do not claim that these solutions are necessarily correct (the most parsimonious scenarios on 6 genomes are not necessarily the most parsimonious scenarios on 100+ genomes). The important point is that MGRA Stage 1 reduces the analysis of the 7/16/19 controversy to such a small example that all possible scenarios can be explored.

Supplement E Selecting fair multi-edges in MGRA Stage 2

As described in the main text, the order of selected fair multi-edges may affect the ancestral reconstructions at some nodes of the phylogenetic tree. Below we specify how MGRA Stage 2 selects such edges.

Note that if a fair multi-edge is not $\vec{T}$-consistent then this multi-edge can only be affected by 2-breaks on adjacent multi-edges. Although two such 2-breaks are possible, the ordering of these 2-breaks does not influence the final result (see Fig. 6, bottom panel). However, the situations when two fair multi-edges share a vertex (and both are $\vec{T}$-consistent) may be ambiguous since the final result of their processing may be affected by the order in which these edges are processed. MGRA Stage 2 starts by processing unambiguous fair edges first and selects the order of the remaining fair edges according to the following heuristics.

For a fixed $\vec{T}$-consistent multicolor Q, an “ideal” 2-break of multicolor Q should satisfy two conditions: (i) it increases the number of cycles alternating between every pair of colors from $Q + \bar{Q}$ (i.e., one color from Q while the other is from $\bar{Q}$), and (ii) it does not decrease the number of cycles alternating between the other pairs of colors. It is easy to see that if a 2-break on a $\vec{T}$-consistent multicolor Q increases the number of connected components in the breakpoint graph then both these conditions are satisfied. For each $\vec{T}$-consistent multicolor Q, MGRA Stage 2 finds all 2-breaks on multicolor Q that increase the number of connected components and performs them (in an arbitrary order).
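The connected-component criterion can be sketched in Python as follows. This is only an illustration under assumptions: the multiple breakpoint graph is stored as a list of (u, v, color) triples, the function names are hypothetical, and the code does not verify that the chosen multicolor is consistent with the tree; it only tests whether a given 2-break increases the number of connected components.

```python
def components(vertices, edges):
    """Number of connected components, computed with union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v, _color in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices})


def two_break(edges, e1, e2, colors):
    """Replace (x1,y1) and (x2,y2) by (x1,x2) and (y1,y2) in every
    color of the multicolor `colors`."""
    (x1, y1), (x2, y2) = e1, e2
    result = []
    for u, v, c in edges:
        if c in colors and {u, v} == {x1, y1}:
            result.append((x1, x2, c))
        elif c in colors and {u, v} == {x2, y2}:
            result.append((y1, y2, c))
        else:
            result.append((u, v, c))
    return result


def increases_components(vertices, edges, e1, e2, colors):
    after = two_break(edges, e1, e2, colors)
    return components(vertices, after) > components(vertices, edges)


# Toy graph on four vertices with three colors (genomes).
V = ["a", "b", "c", "d"]
E = [("a", "b", "P1"), ("c", "d", "P1"),
     ("a", "b", "P2"), ("c", "d", "P2"),
     ("a", "c", "P3"), ("b", "d", "P3")]
good = increases_components(V, E, ("a", "b"), ("c", "d"), {"P1", "P2"})
bad = increases_components(V, E, ("a", "b"), ("d", "c"), {"P1", "P2"})
```

In the toy graph the first 2-break makes the edges of P1 and P2 coincide with those of P3 and splits the graph into two components, while the second (the other way of rejoining the same four ends) leaves a single component.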

Supplement F MGRA Stage 3: Processing complex breakpoints

If one attempts to find the positions and orientations of short synteny blocks in the ancestors, there are two possibilities. If the signs of the blocks are inferred correctly then the same micro-inversion

16 We are not claiming that the analysis of the 7 + 16 association in Froenicke (2005); Robinson et al. (2006) is incorrect but instead argue that it is not supported by the data used in Ma et al. (2006).



96h

97t

970t

971t

971h972t

972h

973t

973h

974t

974h

975t

975h

976t

976h

977t

977h

978t

978h

979t

979h

980t

97h

98t

980h

981t

981h

982t

982h

983t

983h

984t

984h

985t

985h

986t

986h

987t

987h988t

988h

989t

989h

990t

98h

99t

990h

991t

991h

992t

992h

993t

993h

994t994h

995t

995h

996t

996h

997t997h

998t

998h

999t

Figure S10: The breakpoint graph of Mouse (red), Dog (green), and macaQue (violet) genomes after MGRA Stage 1 withcomplete multi-edges and obverse edges shown (in contrast to Fig. 7). The obverse edges reveal many unicolored pathsformed by alternating obverse edges and complete multi-edges. Vertices are labeled and colored similarly to Fig. 4.

31

Figure S11: Compact representation of the breakpoint graph of the Mouse (red), Dog (green), and macaQue (violet) genomes after MGRA Stage 1 (top) and Stages 1-2 (bottom) (compare to Fig. S10). Every unicolored alternating path of obverse edges and complete multi-edges (with possible exception of the initial and terminal synteny blocks) is represented as a rectangular vertex labeled by the overall length and number of synteny blocks in this alternating path. The numbers in parentheses as well as vertex colors indicate the corresponding human chromosome. The isolated vertices of total length shorter than 15 Mb are not shown. The observed edges are shown as dashed edges. The boxed selected component (top) is analyzed in Fig. S13.

Figure S12: Compact representation of the unicolored connected components in Fig. S11 (top).

Mouse: 555t (7) 20.1M / 46 (7) 28.5M / 52 (16) 1061h (16) 15.7M / 34 (19) 1189h (19) 1062t (16) 43.1M / 40 (16) 1190t (19) 590K / 4 (19) 1191h (19) 7.9M / 14 (19) 1192t (19)

Dog: 555t (7) 20.1M / 46 (7) 28.5M / 52 (16) 1061h (16) 15.7M / 34 (19) 1189h (19) 1062t (16) 43.1M / 40 (16) 1190t (19) 590K / 4 (19) 1191h (19) 7.9M / 14 (19) 1192t (19)

Macaque: 555t (7) 20.1M / 46 (7) 28.5M / 52 (16) 1061h (16) 1062t (16) 15.7M / 34 (19) 1189h (19) 1190t (19) 43.1M / 40 (16) 590K / 4 (19) 1191h (19) 1192t (19) 7.9M / 14 (19)

Figure S13: The regions of the Mouse, Dog, and Macaque genomes corresponding to the boxed component of the graph in Fig. S11, top panel.

happened independently on two different branches of the evolutionary tree. However, if the signs of the blocks are incorrect, manual re-examination of some blocks may be necessary. Recently, Ma et al. (2006) and Chaisson et al. (2006) emphasized the difficulties in detecting micro-inversions and improved on previous work in detecting micro-inversions (Feuk et al., 2005). While these papers resulted in two largely consistent sets of human-chimpanzee micro-inversions, there are still some differences between the sets/signs of human-chimpanzee micro-inversions generated by the algorithms in Ma et al. (2006) and Chaisson et al. (2006), indicating that some micro-inversions detected by these approaches may be unreliable. Detecting micro-inversions becomes even more difficult when one moves from human and chimpanzee to more distant mammals.

The manual analysis of block 1300 (Jian Ma, personal communication) revealed that it indeed represents two independent micro-inversions in rat and macaque, resulting in the arrangement +1299, −1300, +1301 in rat and macaque as opposed to the arrangement +1299, +1300, +1301 in the human, chimpanzee, dog, and mouse genomes. It also found small aligned regions between blocks 1299 and 1300 (block 1299a) as well as between blocks 1300 and 1301 (block 1301a). While the regions 1299a and 1301a are too short to pass any reasonable threshold on the synteny block size, they revealed the following arrangements: (+1299, −1300, −1299a, +1301a, +1301) in rat, (+1299, +1299a, −1301a, −1300, +1301) in macaque, and (+1299, +1299a, +1300, +1301a, +1301) in the human/chimpanzee/dog/mouse genomes.
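The argument above can be checked mechanically. The following is a minimal sketch (the helper names `invert`, `coarse`, and `show` are ours and purely illustrative) that applies two different inversions to the human/chimpanzee/dog/mouse arrangement and confirms that they yield the rat and macaque arrangements, which become indistinguishable once the short blocks 1299a and 1301a are dropped.

```python
# Each block is a (name, sign) pair; sign is +1 or -1.
ANCESTRAL = [("1299", 1), ("1299a", 1), ("1300", 1), ("1301a", 1), ("1301", 1)]

def invert(arrangement, i, j):
    """Reverse the segment arrangement[i..j] and flip the sign of each block in it."""
    segment = [(name, -sign) for name, sign in reversed(arrangement[i:j + 1])]
    return arrangement[:i] + segment + arrangement[j + 1:]

def coarse(arrangement, short=("1299a", "1301a")):
    """Drop the blocks that are too short to pass the synteny block threshold."""
    return [(name, sign) for name, sign in arrangement if name not in short]

def show(arrangement):
    return ", ".join(("+" if sign > 0 else "-") + name for name, sign in arrangement)

rat = invert(ANCESTRAL, 1, 2)      # inversion spanning 1299a and 1300
macaque = invert(ANCESTRAL, 2, 3)  # inversion spanning 1300 and 1301a

print(show(rat))          # +1299, -1300, -1299a, +1301a, +1301
print(show(macaque))      # +1299, +1299a, -1301a, -1300, +1301
print(show(coarse(rat)))  # +1299, -1300, +1301
```

The two inversions have different endpoints, so they are distinct events, yet at the resolution of the original synteny blocks both look like a single inversion of block 1300.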

We use similar arguments to process the remaining components of the breakpoint graph. For example, the simplest explanation for a component with two vertices 970t and 971t is a fission in dog that transforms the T-consistent split (mouse/rat/dog vs. human/chimpanzee/macaque) into an inconsistent split (mouse/rat vs. human/chimpanzee/macaque/dog). Note that the dog genome was subject to frequent fissions, resulting in nearly doubling the number of chromosomes as compared to the other five mammals. We remark that this processing at MGRA Stage 3 is viewed as less reliable, and the resulting associations are not considered in the proposed ancestral reconstructions.

Supplement G The architecture of the ancestral X chromosome

Figure S14: The architecture (up to micro-rearrangements) of 19 synteny blocks forming X chromosomes in the Dog, macaQue, Human, and Chimpanzee genomes as well as their common ancestral genomes (top panel). The Mouse and Rat X chromosomes (along with the MR ancestral X chromosome) are shown on a separate (bottom) panel since they display much higher fragmentation (46 synteny blocks).

Supplement H Rearrangements between the reconstructed ancestral genomes

Table S5 illustrates how the rearrangement distances between genomes at the leaves of the phylogenetic tree are reduced while progressing through Stages 1 and 2 of MGRA.

Before MGRA:
     M    R    D    Q    H    C
M    0  438  436  392  395  406
R         0  739  689  696  707
D              0  283  284  292
Q                   0  104  113
H                        0   22
C                             0

After MGRA Stage 1:
     M    R    D    Q    H    C
M    0   37   90   95   95   97
R         0   91   93   94   96
D              0   43   40   39
Q                   0   16   16
H                        0    7
C                             0

After MGRA Stage 2:
     M    R    D    Q    H    C
M    0    4    3    9    5    7
R         0    7    5    5    7
D              0    9    5    5
Q                   0    6    6
H                        0    4
C                             0

Table S5: The estimated pairwise genomic distances (based on the formula from Alekseyev (2008)) between the genomes before MGRA (first table), after MGRA Stage 1 (second table), and after MGRA Stage 2 (third table).

Table S6 shows the pairwise rearrangement distances between the ancestral and leaf genomes, following the strict T-consistent transformation constructed by MGRA, and compares them to the genomic distances computed by GRIMM (Tesler, 2002b). The differences between these distances are rather small, suggesting that the ~T-consistent transformation found by MGRA is close to the most parsimonious.

            M       R       D       Q       H       C      MR     MRD     QHC      HC
M           0     499     450     407     409     421   81:81     285     354     404
R                   0     800     749     753     765 436:384     637     701     748
D                           0     291     295     304     380 173:170     241     290
Q                                   0     110     117     334     130   54:53     107
H                                           0      23     336     133      59     7:6
C                                                   0     347     145      72   18:17
MR                                                          0 212:213     281     331
MRD                                                                 0   76:76     128
QHC                                                                         0   54:53
HC                                                                                  0

            M       R       O       D       Q       H       C      MR     MRO    MROD     QHC      HC
M           0     442     822     412     370     378     382   74:77     276     279     334     371
R                   0    1107     713     665     674     676 382:341     579     581     631     665
O                           0     714     675     682     682     761 587:586     591     637     673
D                                   0     245     253     256     351     156 150:148     210     246
Q                                           0      94      95     305     107     102   44:44      85
H                                                   0      21     315     118     112      50     9:9
C                                                           0     317     119     114      53   12:12
MR                                                                  0 212:215     215     269     306
MRO                                                                         0     7:9      69     109
MROD                                                                                0   63:65     103
QHC                                                                                         0   41:40
HC                                                                                                  0

Table S6: The pairwise rearrangement distances between the Human, Mouse, Rat, Dog, Chimpanzee, and macaQue genomes (top table) as well as the Opossum genome (bottom table) and their ancestral genomes MR, MRD, MRO, MROD, HC, and QHC reconstructed by MGRA. Each cell contains a number x or a pair of numbers x : y, where x is the genomic distance (computed by GRIMM (Tesler, 2002b)) and y is the number of 2-breaks between the genomes in the ~T-consistent transformation constructed by MGRA. The distances corresponding to the branches of the phylogenetic tree T (grayed in the original table) are the cells containing a pair x : y.

It is not surprising that some of the 2-break distances in Table S6 are smaller than the corresponding genomic distances. The explanation for this phenomenon is that 2-breaks have an “advantage” over the standard rearrangements in the presence of complex components (such as hurdles (Hannenhalli and Pevzner, 1999)) in linear genomes. Such components can typically be resolved with a smaller number of 2-breaks via temporary creation of circular chromosomes.

Table S7 shows the breakdown of intrachromosomal and interchromosomal rearrangements (generated by MGRA) between different branches of the phylogenetic tree. While the number of intrachromosomal 2-breaks is, on average, roughly twice the number of interchromosomal 2-breaks, some branches (D + MRQHC and MR + DQHC) reveal an elevated number of interchromosomal rearrangements (approaching and even exceeding the number of intrachromosomal rearrangements).

Branch     # intrachromosomal 2-breaks   # interchromosomal 2-breaks   Total
M+RDQHC                             53                            28      81
R+MDQHC                            294                            90     384
D+MRQHC                             92                            78     170
Q+MRDHC                             32                            21      53
H+MRDQC                              5                             1       6
C+MRDQH                             16                             1      17
MR+DQHC                             80                           133     213
HC+MRDQ                             40                            13      53
MRD+QHC                             55                            21      76
Total                              667                           386    1053

Table S7: The statistics of the 2-break scenario reconstructed by MGRA between the Mouse, Rat, Dog, macaQue, Chimpanzee, and Human genomes. For each branch of the phylogenetic tree, it gives the number of intrachromosomal 2-breaks (reversals and intrachromosomal translocations) and the number of interchromosomal 2-breaks (fissions/fusions and interchromosomal translocations).
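As a sanity check on Table S7, the short sketch below recomputes the column totals and lists the branches with an elevated share of interchromosomal 2-breaks. The counts are copied from the table; the cutoff of 0.8 on the interchromosomal/intrachromosomal ratio is an arbitrary illustrative choice of ours, not a value used by MGRA.

```python
# (intrachromosomal, interchromosomal) 2-break counts copied from Table S7.
TABLE_S7 = {
    "M+RDQHC": (53, 28), "R+MDQHC": (294, 90), "D+MRQHC": (92, 78),
    "Q+MRDHC": (32, 21), "H+MRDQC": (5, 1), "C+MRDQH": (16, 1),
    "MR+DQHC": (80, 133), "HC+MRDQ": (40, 13), "MRD+QHC": (55, 21),
}

def totals(table):
    """Return (intrachromosomal total, interchromosomal total, grand total)."""
    intra = sum(a for a, _ in table.values())
    inter = sum(b for _, b in table.values())
    return intra, inter, intra + inter

def elevated(table, cutoff=0.8):
    """Branches whose inter/intra ratio is at least `cutoff` (hypothetical cutoff)."""
    return [branch for branch, (intra, inter) in table.items()
            if inter / intra >= cutoff]

print(totals(TABLE_S7))    # (667, 386, 1053)
print(elevated(TABLE_S7))  # ['D+MRQHC', 'MR+DQHC']
```

The overall ratio 667/386 is about 1.7, which is what "roughly twice" refers to in the text.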

In the presence of the Opossum genome, MGRA assigns the following 2-breaks to the contested MRO+DQHC branch: three fissions (1547h with 710h, 1420h with 627h, and 1377h with 748t at MGRA Stage 2), five fusions (1548t with 1547h, 1668t with 1667h, 1531h with 1377h, 748t with 747h, and 957t with 924h at MGRA Stage 2), and one translocation (on edges (952t, 953t) and (951t, 952h) at MGRA Stage 3).

Supplement I Detailed comparison of MGRA and inferCARs reconstructions

To further compare the MGRA and inferCARs reconstructions, we constructed the breakpoint graph G(MRD_MGRA, MRD_CARs) (Fig. S15). The non-trivial components of G(MRD_MGRA, MRD_CARs) are formed by 2 cycles (on 4 vertices each) and 19 paths (on 55 vertices), out of which 8 paths consist of single purple edges and represent various CARs (constructed by inferCARs) that were connected by MGRA. Of the 19 paths, 5 are purple-purple paths (representing CARs that are connected in MRD_MGRA and disconnected in MRD_CARs), 11 are cyan-cyan paths (representing CARs that are connected in MRD_CARs and disconnected in MRD_MGRA), and 3 are purple-cyan paths. One of the two cycles as well as some paths in Fig. S15 represent different interpretations of micro-inversions (formed by synteny blocks that are located close to each other in some genomes) by the MGRA and inferCARs algorithms and do not affect the large-scale view of ancestral architectures.
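The classification of non-trivial components into cycles and paths can be illustrated with a small sketch. It assumes, for illustration only, that each genome is given as a list of linear chromosomes written as signed integers, and it labels the two edge colors "A" and "B" rather than purple and cyan; the function names are ours and are not part of MGRA or inferCARs.

```python
from collections import defaultdict

def adjacencies(genome):
    """Adjacencies of a genome as frozensets of block ends such as '2h' and '3t'."""
    result = set()
    for chrom in genome:
        ends = []
        for block in chrom:
            tail, head = f"{abs(block)}t", f"{abs(block)}h"
            ends.append((tail, head) if block > 0 else (head, tail))
        for (_, right), (left, _) in zip(ends, ends[1:]):
            result.add(frozenset((right, left)))
    return result

def classify_components(genome_a, genome_b):
    """Count non-trivial components of the breakpoint graph by type."""
    adj = {"A": adjacencies(genome_a), "B": adjacencies(genome_b)}
    shared = adj["A"] & adj["B"]            # trivial components are skipped
    neighbours = defaultdict(list)          # vertex -> [(other end, color)]
    for color, edges in adj.items():
        for edge in edges - shared:
            u, v = tuple(edge)
            neighbours[u].append((v, color))
            neighbours[v].append((u, color))
    seen, counts = set(), defaultdict(int)
    for start in list(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                        # collect one connected component
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(w for w, _ in neighbours[v])
        seen |= component
        ends = [v for v in component if len(neighbours[v]) == 1]
        if not ends:
            counts["cycle"] += 1
        else:                               # a path is named by its end colors
            colors = sorted(neighbours[v][0][1] for v in ends)
            counts["-".join(colors) + " path"] += 1
    return dict(counts)

print(classify_components([[1, 2, 3]], [[1, -2, 3]]))       # {'cycle': 1}
print(classify_components([[1, 2, 3, 4]], [[1, 2], [3, 4]]))  # {'A-A path': 1}
```

In the first call a single inverted block produces one cycle on 4 vertices, which is the kind of component attributed to micro-inversions above. In the second call a join present only in genome A produces a single-edge A-A path.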

Supplement J How stable are the ancestral reconstructions?

In order to test the stability of MGRA reconstructions with changing resolution (minimum size of the synteny blocks), we removed short synteny blocks from the original set of 1357 blocks for six genomes and compared the resulting reconstructions. While removing some synteny blocks unavoidably affects the ancestral reconstructions (e.g., some adjacencies may become “invisible”), it is important to verify that the number of changes is relatively small.

Note that removal of a synteny block may “enlarge” others by merging them (two blocks are merged as soon as they are adjacent in all 6 genomes). Therefore, we performed short block removal as an iterative procedure that removes the shortest block (w.r.t. the human genome) and then merges all pairs of consistently adjacent blocks into longer blocks. The procedure stops when the length of the shortest block exceeds the specified threshold. We further reconstructed the Boreoeutherian ancestors using the genomes with all short blocks removed.
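The iterative procedure described above can be sketched as follows. This is a simplified illustration rather than the code used in our experiments: genomes are assumed to be lists of linear chromosomes of signed integers, `length` maps each block to its length in the reference genome, and two blocks a, b are treated as consistently adjacent only when they appear as (+a, +b) or (−b, −a) in every genome. All function names are hypothetical.

```python
def successor_pairs(genome):
    """Pairs (a, b) where the head of block a is adjacent to the tail of block b."""
    pairs = set()
    for chrom in genome:
        for x, y in zip(chrom, chrom[1:]):
            if x > 0 and y > 0:
                pairs.add((x, y))
            elif x < 0 and y < 0:
                pairs.add((-y, -x))
    return pairs

def remove_block(genomes, block):
    """Delete a block from every genome, dropping chromosomes that become empty."""
    return {
        name: [c for c in ([x for x in chrom if abs(x) != block] for chrom in genome) if c]
        for name, genome in genomes.items()
    }

def filter_short_blocks(genomes, length, threshold):
    """Remove blocks shorter than `threshold`, merging consistently adjacent blocks."""
    genomes = {name: [list(c) for c in g] for name, g in genomes.items()}
    length = dict(length)
    while length:
        shortest = min(length, key=length.get)
        if length[shortest] >= threshold:
            break
        genomes = remove_block(genomes, shortest)
        del length[shortest]
        while True:  # merge one consistently adjacent pair at a time
            common = set.intersection(*(successor_pairs(g) for g in genomes.values()))
            if not common:
                break
            a, b = min(common)
            genomes = remove_block(genomes, b)  # block a absorbs block b
            length[a] += length.pop(b)
    return genomes, length

genomes = {"H": [[1, 2, 3, 4]], "M": [[1, -2, 3], [4]]}
lengths = {1: 500, 2: 50, 3: 400, 4: 300}  # illustrative lengths in Kb
print(filter_short_blocks(genomes, lengths, threshold=100))
# ({'H': [[1, 4]], 'M': [[1], [4]]}, {1: 900, 4: 300})
```

In the example, removing the 50 Kb block 2 makes blocks 1 and 3 adjacent in both genomes, so they are merged into a single 900 Kb block, while block 4 stays separate because it is not adjacent to block 1 in genome M.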

Figure S15: The breakpoint graph of the genomes MRD_CARs (cyan) and MRD_MGRA (purple). Bold purple edges represent reliable adjacencies obtained by MGRA Stage 1, while dashed purple edges (shown even if parts of complete multi-edges) represent adjacencies (between vertices incident to a split in M/R/D colors in Fig. 7, bottom panel) viewed as less reliable. Dashed cyan and orange edges represent ambiguous joins made by inferCARs.

Removing synteny blocks may result either in losing some ancestral adjacencies (e.g., breaking a single CAR into two CARs) or in introducing new adjacencies (as compared to the original ancestral reconstruction). To compare reconstructions on different sets of blocks, we selected the set of blocks shared between two reconstructions and computed the number of “missing” and “extra” adjacencies between the two ancestral reconstructions. The results for minimal block thresholds of 100K, 250K, and 500K are shown in Tab. S8, which illustrates that MGRA reconstructions are rather stable. For example, removing all 168 blocks shorter than 100 Kb (12% of all blocks) results in reconstructions that retain 99.5% of adjacencies compared to each other. Increasing the threshold to 250K results in removing 34% of all blocks but retains 98.5% of adjacencies.

MinBlockLength   #Blocks Left   #Adjacencies   Extra Adjacencies   Missing Adjacencies
100K                     1189           1161                   5                     6
250K                      903            871                  11                    17
500K                      711            676                  15                    24

Table S8: Comparison of MGRA reconstructions on the original 1357 synteny blocks (for 6 genomes) with MGRA reconstructions on the reduced set of synteny blocks (blocks shorter than the MinBlockLength threshold removed).
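The counting of “extra” and “missing” adjacencies can be sketched as below. The sketch assumes that both reconstructions are lists of CARs written as signed integers and that block identifiers are directly comparable between the two reconstructions (merged blocks would need to be mapped back to their original identifiers first); the function names are illustrative only.

```python
def block_ends(block):
    """Left and right end of a signed block, e.g. +3 -> ((3,'t'), (3,'h'))."""
    tail, head = (abs(block), "t"), (abs(block), "h")
    return (tail, head) if block > 0 else (head, tail)

def blocks_of(genome):
    return {abs(b) for chrom in genome for b in chrom}

def adjacencies(genome, keep):
    """Adjacencies of the genome after restricting it to the blocks in `keep`."""
    result = set()
    for chrom in genome:
        ends = [block_ends(b) for b in chrom if abs(b) in keep]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            result.add(frozenset((right, left)))
    return result

def compare_reconstructions(original, reduced):
    """Count shared, extra, and missing adjacencies on the shared set of blocks."""
    shared_blocks = blocks_of(original) & blocks_of(reduced)
    a = adjacencies(original, shared_blocks)
    b = adjacencies(reduced, shared_blocks)
    return {"shared": len(a & b), "extra": len(b - a), "missing": len(a - b)}

original = [[1, 2, 3, 4], [5, 6]]
reduced = [[1, 3, -4], [5], [6]]   # block 2 was removed before reconstruction
print(compare_reconstructions(original, reduced))
# {'shared': 1, 'extra': 1, 'missing': 2}
```

Here an adjacency counts as extra if it appears only in the reconstruction on the reduced block set, and as missing if it appears only in the original reconstruction after restriction to the shared blocks.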

Supplement K CytoAncestor software

To bridge the gap between the cytogenetics and the rearrangement-based approaches, we implemented the CytoAncestor software, which follows the logic of the cytogenetics approach described in Kemkemer et al. (2006). The tests of CytoAncestor revealed that the cytogenetics approach does not scale well with an increase in the number of synteny blocks. In particular, on the Ma et al. (2006) data CytoAncestor produces a Boreoeutherian ancestor that does not agree with the widely accepted cytogenetics reconstruction (Supplement K).17 MGRA Stage 1, in contrast to CytoAncestor, produces a reconstruction that is largely consistent with the current view of the Boreoeutherian ancestor.

Kemkemer et al. (2006) recently applied the cytogenetics approach to E-painting data using semi-manual data analysis. We implemented their algorithm and applied it to the Human, Mouse, and Dog data (1357 synteny blocks from Ma et al. (2006)). The goal of our analysis is to investigate whether CytoAncestor scales well when one moves from the cytogenetics resolution (typically 100-200 synteny blocks) to genomic resolution (1000+ synteny blocks).

We briefly describe the cytogenetics approach for the case of 3 genomes P1, P2, P3 with p1, p2, p3 chromosomes (see Kemkemer et al. (2006)). We use a synteny-triple (t1, t2, t3) to describe a synteny block located on chromosome t1 in P1, chromosome t2 in P2, and chromosome t3 in P3. Clearly, there exist at most p1 · p2 · p3 distinct synteny-triples. In reality the number of synteny-triples is much smaller than this maximum, and for 1357 synteny blocks in the Human, Mouse, and Dog genomes we have only 204 synteny-triples. The synteny-triples represent vertices in the synteny graph that are further connected by edges as described in Kemkemer et al. (2006) (Fig. S16). The connected components in the resulting graph represent the ancestral chromosomes and reveal the synteny associations. For example, the unicolored connected components representing human chromosomes 6, 9, 11, 17, 18, 20, and X all correspond to single chromosomes in the ancestor and are consistent with the now favored cytogenetics reconstruction. However, all other connected components disagree with the existing reconstruction (Froenicke et al., 2006). In particular, the giant multicolored component formed by human chromosomes 1+5+10+16+4+7+8+13 was never reported in previous cytogenetics studies and is likely to reflect the limitations of the cytogenetics approach when applied to a small number of species with many synteny blocks. We remark that with the same dataset, the rearrangement-based approaches inferCARs and MGRA produce ancestors that are largely consistent with the now favored cytogenetics reconstruction.

17 The results improve when one limits attention to very large synteny blocks (e.g., larger than 3 Mb), indicating that further studies are needed to extend the cytogenetics approach to high-resolution data.
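The construction of synteny-triples and the extraction of connected components can be sketched as follows. The grouping of blocks into triples follows the description above; the edge rule, however, is defined in Kemkemer et al. (2006) and is not reproduced here, so the sketch takes it as a caller-supplied predicate. The rule `share_two` used in the example (two triples are linked if they agree in at least two of three chromosomes) is a hypothetical stand-in chosen only to make the example run, and the toy data are invented.

```python
from collections import defaultdict

def synteny_triples(blocks):
    """Group blocks given as (length, t1, t2, t3) by their synteny-triple.

    Returns {triple: (total length, number of blocks)}.
    """
    summary = defaultdict(lambda: [0, 0])
    for length, *triple in blocks:
        entry = summary[tuple(triple)]
        entry[0] += length
        entry[1] += 1
    return {triple: tuple(entry) for triple, entry in summary.items()}

def components(vertices, linked):
    """Connected components of the graph on `vertices` with edges where linked(u, v)."""
    vertices = list(vertices)
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            if linked(u, v):
                parent[find(u)] = find(v)
    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)
    return list(groups.values())

def share_two(u, v):
    """HYPOTHETICAL edge rule, not the rule of Kemkemer et al. (2006)."""
    return sum(a == b for a, b in zip(u, v)) >= 2

# Toy data: (length in Mb, human chr, mouse chr, dog chr).
blocks = [(5.0, "1", "3", "6"), (2.0, "1", "3", "6"),
          (1.0, "1", "3", "7"), (4.0, "X", "X", "X")]
triples = synteny_triples(blocks)
print(len(triples))                                   # 3
print(sorted(len(c) for c in components(triples, share_two)))  # [1, 2]
```

With the actual edge rule substituted for `share_two`, each resulting component would correspond to one putative ancestral chromosome in the sense described above.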


11.7M / 7(13,8,22)

55.1M / 25(13,14,22)

74.8M / 24(14,12,8) 9.2M / 3

(14,14,8)

18.2M / 3(15,2,30) 26.9M / 10

(15,9,30)

26.0M / 20(15,7,3)

12.9M / 15(16,7,6)

12.4M / 9(16,16,6)

3.0M / 1(16,17,6)

11.7M / 4(16,8,2) 31.0M / 15

(16,8,5)

13.7M / 11(17,11,5) 57.5M / 37

(17,11,9)

6.4M / 3(18,1,1)

22.2M / 18(18,18,1)

33.1M / 15(18,18,7)

9.2M / 6(18,17,7)

26.7M / 29(19,7,1)

6.9M / 8(19,8,20)

56.3M / 10(20,2,24)

15.2M / 8(22,15,10)

Figure S16: Chromosomal associations between the Human, Mouse, and Dog genomes on 1357 synteny blocks revealed by the cytogenetics approach for all synteny-triples (top left), and synteny-triples longer than 300 Kb (top right), 1 Mb (bottom left), and 3 Mb (bottom right). Each vertex corresponds to a synteny-triple (t1, t2, t3) located on chromosomes t1 in Human, t2 in Mouse, and t3 in Dog. Each vertex is also labeled with the total size and number of the synteny blocks corresponding to the synteny-triple (t1, t2, t3). For example, a blue vertex labeled as 74.8M / 24 (14,12,8) describes 24 synteny blocks of total size 74.8 Mb that belong to the synteny-triple (14, 12, 8).

In an attempt to alleviate these shortcomings of CytoAncestor, we limited our attention to long synteny blocks by excluding synteny-triples that cover less than 300 Kb, 1 Mb, and 3 Mb from the dataset in Ma et al. (2006) (Fig. S16). While the size of the giant component decreases, even for synteny-triples of size 3 Mb and longer (typical cytogenetics resolution), most of the resulting synteny associations remain unrealistic.

Supplement L Benchmarking MGRA on simulated data

We benchmarked MGRA on various simulated datasets with a fixed phylogenetic tree shown in Fig. 5 (for illustration purposes, we refer to the leaves of the tree as M, R, D, Q, H, and C). In addition, we evaluated MGRA’s ability to reconstruct an unknown tree in the case of short internal branches.


In the first “constant branch length” simulation, we fixed the number of rearrangements on each branch to the same number varying from 25 to 250 and generated the leaf genomes by performing rearrangements on a fixed MRD genome consisting of 20 chromosomes with 75 synteny blocks each. The total number of synteny blocks in this simulation is close to the number of synteny blocks for six mammalian genomes studied in Ma et al. (2006). The leaf genomes were generated from the MRD genome by applying random 2-breaks (preserving linearity) along the branches of the tree. We further applied MGRA to the leaf genomes and compared the reconstructed MRD ancestral genome with the simulated one, counting the number of missing and incorrectly reconstructed adjacencies (Table S9).
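The simulation and scoring steps described above can be illustrated with a minimal Python sketch. It is not the authors' simulator: it assumes circular chromosomes and skips the linearity check mentioned in the text, and the function names (`circular_adjacencies`, `random_two_break`, `score`) are hypothetical. The scoring follows one plausible reading of Table S9, in which correct and missing adjacencies partition the true ancestral adjacencies and incorrect adjacencies are reconstructed ones absent from the true genome.

```python
import random


def circular_adjacencies(chromosomes):
    """Adjacencies of circular chromosomes given as lists of signed block ids.

    A block +x is traversed from its tail "xt" to its head "xh";
    a block -x is traversed from head to tail.
    """
    adjacencies = set()
    for chrom in chromosomes:
        for i, x in enumerate(chrom):
            y = chrom[(i + 1) % len(chrom)]
            exit_end = f"{abs(x)}{'h' if x > 0 else 't'}"
            entry_end = f"{abs(y)}{'t' if y > 0 else 'h'}"
            adjacencies.add(frozenset((exit_end, entry_end)))
    return adjacencies


def random_two_break(adjacencies, rng):
    """Replace two adjacencies {u,v}, {x,y} by {u,x}, {v,y} or {u,y}, {v,x}.

    Assumption: no linearity check is performed (the paper preserves linearity).
    """
    first, second = rng.sample(sorted(adjacencies, key=sorted), 2)
    u, v = sorted(first)
    x, y = sorted(second)
    if rng.random() < 0.5:
        rejoined = {frozenset((u, x)), frozenset((v, y))}
    else:
        rejoined = {frozenset((u, y)), frozenset((v, x))}
    return (adjacencies - {first, second}) | rejoined


def score(true_adjacencies, reconstructed):
    """Return (correct, missing, incorrect) adjacency counts."""
    correct = len(true_adjacencies & reconstructed)
    missing = len(true_adjacencies - reconstructed)
    incorrect = len(reconstructed - true_adjacencies)
    return correct, missing, incorrect


rng = random.Random(0)
ancestor = circular_adjacencies([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
leaf = ancestor
for _ in range(3):  # three random 2-breaks along one branch
    leaf = random_two_break(leaf, rng)
# Scoring a leaf against the ancestor stands in for scoring a reconstruction.
print(score(ancestor, leaf))
```

In this toy setting the leaf genome itself plays the role of a (poor) reconstruction; a real benchmark would score the output of the reconstruction tool instead.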

Below we focus on the simulation with branch length 125, which results in a rather difficult ancestral reconstruction problem with a high breakpoint re-use rate (see footnote 18) of 1.5. We remark that 125 rearrangements on each branch imply 5 · 125 = 625 rearrangements between the simulated H (“human”) and M (“mouse”) nodes in Fig. 5, a rather large number of rearrangements (as compared to the number of rearrangements between the real human and mouse genomes). Note that 625 rearrangements break the lion’s share of adjacencies between 1500 synteny blocks in the simulated genomes, making the ancestral reconstruction difficult. Nevertheless, MGRA produced an error-free reconstruction of the ancestral MRD genome in this case (with only 2 missing adjacencies). As expected, MGRA becomes less accurate and more fragmented when the genomes become extremely scrambled (e.g., 4% of adjacencies are incorrect and 9% of adjacencies are lost for the branch length 250).
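The breakpoint re-use rate used here is defined in the footnote as two times the 2-break distance divided by the number of breakpoints between two genomes. A small sketch of that formula follows; the function name and the toy inputs are hypothetical and are not taken from the simulations.

```python
def breakpoint_reuse_rate(two_break_distance, num_breakpoints):
    """Re-use rate = 2 * (2-break distance) / (number of breakpoints)."""
    if num_breakpoints <= 0:
        raise ValueError("the number of breakpoints must be positive")
    return 2 * two_break_distance / num_breakpoints


# Toy inputs: if each of 3 rearrangements creates two fresh breakpoints,
# there are 6 breakpoints and no re-use.
print(breakpoint_reuse_rate(3, 6))
# If the same 3 rearrangements leave only 4 breakpoints, some were re-used.
print(breakpoint_reuse_rate(3, 4))
```

Under this definition a rate of 1 means that no breakpoint was used twice, and larger values indicate more re-use.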

Table S9 also shows the results of inferCARs reconstructions and illustrates that MGRA generates more accurate ancestral reconstructions for all choices of parameters. In particular, for the simulation with branch length 125, about 2% of adjacencies reconstructed by inferCARs are incorrect. While this is a relatively small proportion of incorrect adjacencies (for a rather difficult ancestral reconstruction problem with high breakpoint re-use), MGRA produced an error-free and less fragmented reconstruction in this case.
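The percentages quoted in the text can be recomputed from the Table S9 counts. The sketch below copies the rows for branch lengths 125 and 250 and divides by the 1480 adjacencies of the simulated MRD genome; using 1480 as the denominator is an assumption about how the percentages were obtained, and the helper name `percent` is hypothetical.

```python
TOTAL_ADJACENCIES = 1480  # adjacencies in the simulated MRD genome (Table S9)

# (correct, missing, incorrect) counts copied from Table S9
TABLE_S9 = {
    125: {"MGRA": (1478, 2, 0), "inferCARs": (1425, 55, 28)},
    250: {"MGRA": (1343, 137, 62), "inferCARs": (1255, 225, 162)},
}


def percent(count, total=TOTAL_ADJACENCIES):
    """Share of the true adjacencies, in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


for branch_length, tools in TABLE_S9.items():
    for tool, (correct, missing, incorrect) in tools.items():
        print(branch_length, tool,
              "missing %:", percent(missing),
              "incorrect %:", percent(incorrect))
```

With these inputs the inferCARs error rate at branch length 125 comes out just under 2%, and the MGRA rates at branch length 250 are roughly 4% incorrect and 9% missing, in line with the text.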

                                        Reconstructed adjacencies
Branch  H-M breakpoint  Conserved       Correct            Missing            Incorrect
length  re-use          adjacencies     MGRA   inferCARs   MGRA   inferCARs   MGRA   inferCARs
25      1.14            1093            1480   1472        0      8           0      6
50      1.18            806             1480   1464        0      16          0      9
75      1.32            608             1479   1461        1      19          0      9
100     1.38            432             1478   1446        2      34          0      15
125     1.50            317             1478   1425        2      55          0      28
150     1.58            233             1463   1412        17     68          8      39
175     1.70            175             1460   1373        20     107         14     56
200     1.78            130             1448   1342        32     138         20     86
225     1.83            113             1429   1305        51     175         39     118
250     1.89            81              1343   1255        137    225         62     162

Table S9: Reconstruction of the MRD ancestor of six simulated genomes with the phylogenetic tree shown in Fig. 5 where all branches have the same length. The genomes were generated from a fixed MRD genome on 20 chromosomes and 1500 synteny blocks (with 1480 adjacencies), applying random 2-breaks (preserving linearity) along the branches of the tree. The second column refers to the breakpoint re-use rate between simulated genomes corresponding to H (“human”) and M (“mouse”) nodes. Some of the pairs of adjacent synteny blocks in the MRD genome remain adjacent in all six generated leaf genomes. The number of such conserved adjacencies is shown in the third column (hence, the effective number of the synteny blocks in each simulation is 1480 minus the number of conserved adjacencies). The fourth to ninth columns give the statistics of reconstructed adjacencies by classifying them into correct, missing, and incorrect (for both MGRA and inferCARs).

To evaluate the effect of more complex rearrangements (e.g., transpositions) on MGRA performance, we further added 3-breaks (happening with probability 0.1 at every step) to the set of simulated rearrangements. Table S10 illustrates that adding 3-breaks has only a minor effect on MGRA performance. Again, MGRA improves on inferCARs when both 2-breaks and 3-breaks are included in the simulations.

18 Similarly to Pevzner and Tesler (2003), we estimate the breakpoint re-use rate as two times the 2-break distance divided by the number of breakpoints between two genomes.

                                        Reconstructed adjacencies
Branch  H-M breakpoint  Conserved       Correct            Missing            Incorrect
length  re-use          adjacencies     MGRA   inferCARs   MGRA   inferCARs   MGRA   inferCARs
25      1.20            1092            1478   1467        2      13          0      11
50      1.23            804             1480   1466        0      14          0      9
75      1.36            587             1480   1449        0      31          0      17
100     1.42            429             1478   1430        2      50          0      29
125     1.49            315             1474   1417        6      63          6      33
150     1.63            233             1453   1387        27     93          15     53
175     1.75            186             1461   1375        19     105         11     58
200     1.80            116             1462   1350        18     130         14     77
225     1.85            105             1382   1304        98     176         68     121
250     1.90            61              1376   1256        104    224         60     148

Table S10: Simulations similar to those in Table S9, where rearrangements in addition to 2-breaks include 3-breaks occurring with probability 0.1.

In the second “variable branch length” simulation, we selected the number of rearrangements on each branch of the tree according to the values from Table S6(top) to better reflect the varying rates of rearrangements in different mammalian lineages. To model breakpoint re-use, we varied the number of initial synteny blocks from 1000 to 2000 (the smaller the number of blocks, the larger the breakpoint re-use). The benchmarking results for this simulation are shown in Table S11 (for the MRD genome).

                                        Reconstructed adjacencies
Number  H-M breakpoint  Conserved       Correct            Missing            Incorrect
of      re-use          adjacencies     MGRA   inferCARs   MGRA   inferCARs   MGRA   inferCARs
blocks
1000    1.52            99              893    909         87     71          21     55
1100    1.50            123             1021   1015        59     65          17     36
1200    1.47            188             1120   1114        60     66          9      36
1300    1.33            242             1228   1227        52     53          2      34
1400    1.32            282             1325   1338        55     42          0      26
1500    1.32            346             1446   1441        24     39          0      25
1600    1.32            381             1553   1532        27     48          2      29
1700    1.32            451             1652   1629        28     51          0      34
1800    1.27            532             1740   1739        40     41          0      27
1900    1.28            592             1862   1842        18     38          0      20
2000    1.25            622             1954   1938        26     42          0      32

Table S11: Reconstruction of the MRD ancestor of six simulated genomes with the phylogenetic tree shown in Fig. 5, where the length of branches is the same as in Table S6(top) and the number of synteny blocks varies from 1000 to 2000. Compare to Table S9.

Table S11 illustrates that for 1400 synteny blocks (roughly the number of synteny blocks identified in Ma et al. (2006)), MGRA is error-free but rather fragmentary with 55 missing adjacencies. We remark that such significant fragmentation was not observed when MGRA was applied to real data. A possible explanation is that many rearrangements in real scenarios are actually micro-rearrangements that are typically easier to analyze (unless they lead to breakpoint re-use). Our simulation does not model micro-rearrangements, thus making reconstruction of simulated genomes in this case somewhat more difficult than reconstruction of real genomes. We remark that while inferCARs generated a slightly less fragmented reconstruction than MGRA in this case, it generated 26 incorrect adjacencies.


Each entry gives multi-edges / paths.

1750 synteny blocks
MO + DH length   MO + DH     MH + DO   MD + HO
0                0 / 0       0 / 0     4 / 1
5                20 / 5      0 / 0     0 / 0
10               40 / 10     0 / 0     0 / 0
15               54 / 13     0 / 0     0 / 0
20               76 / 19     0 / 0     0 / 0
25               92 / 23     0 / 0     0 / 0
30               112 / 27    3 / 1     0 / 0
35               132 / 33    0 / 0     4 / 1
40               136 / 34    0 / 0     0 / 0
45               172 / 43    0 / 0     0 / 0
50               170 / 42    4 / 1     0 / 0

1500 synteny blocks
MO + DH length   MO + DH     MH + DO   MD + HO
0                4 / 1       0 / 0     0 / 0
5                16 / 4      7 / 2     0 / 0
10               24 / 6      0 / 0     0 / 0
15               48 / 12     0 / 0     0 / 0
20               76 / 19     0 / 0     0 / 0
25               96 / 24     0 / 0     0 / 0
30               114 / 29    0 / 0     0 / 0
35               129 / 32    4 / 1     0 / 0
40               122 / 30    0 / 0     3 / 1
45               140 / 34    0 / 0     0 / 0
50               130 / 32    0 / 0     0 / 0

1250 synteny blocks
MO + DH length   MO + DH     MH + DO   MD + HO
0                0 / 0       0 / 0     0 / 0
5                20 / 5      0 / 0     3 / 1
10               24 / 6      0 / 0     0 / 0
15               52 / 13     0 / 0     0 / 0
20               63 / 16     0 / 0     3 / 1
25               70 / 18     4 / 1     0 / 0
30               88 / 22     0 / 0     4 / 1
35               96 / 24     3 / 1     0 / 0
40               98 / 24     4 / 1     0 / 0
45               108 / 21    0 / 0     4 / 1
50               148 / 36    3 / 1     0 / 0

1000 synteny blocks
MO + DH length   MO + DH     MH + DO   MD + HO
0                0 / 0       4 / 1     8 / 2
5                16 / 4      0 / 0     9 / 2
10               20 / 5      3 / 1     0 / 0
15               31 / 8      0 / 0     0 / 0
20               56 / 14     4 / 1     0 / 0
25               55 / 13     7 / 2     0 / 0
30               82 / 21     0 / 0     6 / 2
35               58 / 14     9 / 3     0 / 0
40               65 / 15     0 / 0     0 / 0
45               72 / 18     0 / 0     0 / 0
50               105 / 25    0 / 0     3 / 1

Table S12: The statistics of the breakpoint graph of the simulated M, H, D, and O genomes (using the ((H,D)(M,O)) tree topology with 5 branches) on 1750 (first table), 1500 (second table), 1250 (third table), and 1000 (fourth table) synteny blocks. The tables represent the statistics after MGRA Stages 1-2 run on the confident leaf branches. The length of the branch separating M and O from H and D varied from 0 to 50. Compare to Tables 1(bottom), S13, and 4.

In the third simulation we investigated whether MGRA is capable of revealing the short branches of the phylogenetic tree in the “blind mode” when the tree is not known in advance. The goal is to evaluate whether the phylogenetic characters generated by MGRA (such as in Tables 1, S13, and 4) may be misleading in the case of very short branches. Such short branches typically incur very few rearrangements that may be difficult to reconstruct due to a variety of factors (e.g., breakpoint re-use or long branch attraction).

We simulated H, M, D, and O genomes and preserved the rearrangement distances between the Human, Mouse, Dog, and Opossum genomes shown in Table S6(bottom). We considered a tree on 4 leaves and two internal nodes XHD and XMO with branch distances d(H,XHD) = 110, d(D,XHD) = 140, d(M,XMO) = 260, d(O,XMO) = 560 and a varying length of the internal edge between XHD and XMO (from 0 to 50). We further simulated rearrangements on the 5 branches of the resulting tree according to the specified rearrangement distances. We performed 4 simulations (for 1000, 1250, 1500, and 1750 synteny blocks) resulting in various breakpoint re-use rates. Table S12 illustrates that MGRA reveals the correct topology in nearly all cases, with the exception of the cases when the length of the internal branch is close to zero.
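One simple way to read a row of Table S12 is to pick the split with the most multi-edges. The sketch below applies that rule to three rows of the 1750-block table. The argmax rule and the function name `supported_split` are illustrative assumptions, not MGRA's actual procedure for choosing a topology.

```python
def supported_split(multi_edges):
    """Return the split with strictly the most multi-edges, or None on a tie
    or when no split has any supporting multi-edge."""
    best = max(multi_edges.values())
    if best == 0:
        return None
    winners = [split for split, count in multi_edges.items() if count == best]
    return winners[0] if len(winners) == 1 else None


# Multi-edge counts copied from Table S12 (1750 synteny blocks),
# keyed by the length of the internal MO + DH branch.
ROWS_1750 = {
    0: {"MO+DH": 0, "MH+DO": 0, "MD+HO": 4},
    5: {"MO+DH": 20, "MH+DO": 0, "MD+HO": 0},
    30: {"MO+DH": 112, "MH+DO": 3, "MD+HO": 0},
}

for length, counts in ROWS_1750.items():
    print(length, supported_split(counts))
```

For these rows the rule recovers the simulated MO + DH split once the internal branch has length 5 or more, while the zero-length row points to a different split, which matches the exception noted in the text.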

Supplement M Paths in the breakpoint graph and the primate–rodent–carnivore split

We analyzed the paths in the breakpoint graph in Fig. S18(top) with the goal of finding a path that may support or reject the primate–carnivore split. Since the branch corresponding to this split is relatively short (≈ 7 million years), we do not expect to find many rearrangements supporting either the primate–carnivore or the primate–rodent split. Alternating MRO-DQHC paths would represent strong supporting evidence for the primate–carnivore split, while alternating DO-MRQHC paths would represent strong supporting evidence for the primate–rodent split. Not surprisingly, neither the original breakpoint graph nor the breakpoint graph after applying MGRA Stage 1 contains such paths, an indication that the branch corresponding to the split is indeed short. Fig. S17 enlarges a path of alternating O and MR edges in the breakpoint graph in Fig. S18(top) that groups opossums with rodents and represents the best supporting evidence for the primate–carnivore split (most vertices on this path represent chromosome endpoints in the D, Q, H, and C genomes). We emphasize that while this path is hard to explain under the assumption of the primate–rodent split, it does not represent a “proof” of the primate–carnivore split, since complex breakpoint re-uses combined with difficulties in finding the synteny blocks between distant mammals may skew the statistics of paths in the breakpoint graph.

[Figure S17 path diagram; vertex labels: 161h, 1267h, 814t, 108t, 1208t, 1111h, 747h, 927h.]

Figure S17: A path of the alternating MR and O edges from the breakpoint graph shown in Fig. S18(top). The vertices forming this path represent mostly chromosome endpoints in the other genomes.


[Figure S18: two breakpoint-graph panels (top and bottom); vertices are synteny block ends such as 1004h and 1005t; individual vertex labels omitted.]

Figure S18: Top panel: The breakpoint graph of the Mouse (red), Rat (blue), Dog (green), macaQue (violet), Human (orange), Chimpanzee (yellow), and Opossum (brown) genomes after MGRA Stages 1-2 on the confident branches. Compare to Fig. 7(bottom). Restricting this graph to 4 genomes M, D, Q, O and running MGRA on this smaller graph using only 4 confident branches M + DQO, D + MQO, Q + MDO, and O + MDQ results in the breakpoint graph G(M,D,Q,O) shown in the bottom panel. Bottom panel: The breakpoint graph G(M,D,Q,O) of the Mouse, Dog, macaQue, and Opossum genomes after MGRA Stages 1-2 on the confident branches.


Multicolors    Multi-edges         Simple     Simple              Simple
                                   vertices   multi-edges         paths+cycles
O + MRDQHC     561 + 738 = 1299    1120       559 + 434 = 993     125 + 92 = 217
R + MDQHCO     442 + 557 = 999     884        442 + 391 = 833     51 + 148 = 199
MR + DQHCO     226 + 177 = 403     288        104 + 126 = 230     44 + 24 = 68
D + MRQHCO     135 + 241 = 376     270        135 + 86 = 221      49 + 30 = 79
M + RDQHCO     138 + 64 = 202      128        39 + 64 = 103       25 + 16 = 41
QHC + MRDO     49 + 104 = 153      81         34 + 27 = 61        13 + 11 = 24
Q + MRDHCO     46 + 80 = 126       92         46 + 33 = 79        13 + 13 = 26
HC + MRDQO     38 + 66 = 104       70         32 + 23 = 55        12 + 8 = 20
C + MRDQHO     12 + 25 = 37        24         12 + 6 = 18         6 + 3 = 9
H + MRDQCO     9 + 18 = 27         18         9 + 6 = 15          3 + 2 = 5
MRO + DQHC     4 + 46 = 50         1          0 + 0 = 0           0 + 0 = 0
RO + MDQHC     31 + 2 = 33         1          0 + 0 = 0           0 + 0 = 0
DO + MRQHC     21 + 11 = 32        0          0 + 0 = 0           0 + 0 = 0
MRD + QHCO     15 + 7 = 22         1          0 + 0 = 0           0 + 0 = 0
HCO + MRDQ     5 + 2 = 7           0          0 + 0 = 0           0 + 0 = 0
DHC + MRQO     2 + 4 = 6           0          0 + 0 = 0           0 + 0 = 0
MO + RDQHC     0 + 5 = 5           0          0 + 0 = 0           0 + 0 = 0
DQO + MRHC     1 + 3 = 4           0          0 + 0 = 0           0 + 0 = 0
QC + MRDHO     0 + 3 = 3           0          0 + 0 = 0           0 + 0 = 0
QH + MRDCO     0 + 3 = 3           0          0 + 0 = 0           0 + 0 = 0
MRQ + DHCO     2 + 1 = 3           0          0 + 0 = 0           0 + 0 = 0
RDO + MQHC     1 + 1 = 2           0          0 + 0 = 0           0 + 0 = 0
DQC + MRHO     0 + 1 = 1           0          0 + 0 = 0           0 + 0 = 0
MRC + DQHO     0 + 1 = 1           0          0 + 0 = 0           0 + 0 = 0
DQH + MRCO     0 + 1 = 1           0          0 + 0 = 0           0 + 0 = 0
RQC + MDHO     0 + 1 = 1           0          0 + 0 = 0           0 + 0 = 0

Table S13: The statistics of the breakpoint graph for the Mouse, Rat, Dog, macaQue, Human, Chimpanzee, and Opossum genomes. For every pair of complementary multicolors, we show the number of multi-edges of these multicolors, the number of simple vertices that are incident to such multi-edges, the number of simple multi-edges, and the number of simple paths and cycles. The confident T-consistent multicolors are shown in bold.


Supplement N Additional Tables and Figures

CAR            Length
+1239          1.0 Mb
+72 +73 +74    11.3 Mb
+75            7.0 Mb
+76            0.5 Mb
+77 +78        0.3 Mb

Table S14: The list of short CARs (shorter than 15 Mb w.r.t. the Human genome) in the MGRA reconstruction of the Boreoeutherian ancestral genome.

[Figure S19 graph: vertices are synteny block ends followed by a number in parentheses, e.g. 1035t (15) and 970h (14); individual vertex labels omitted.]

Figure S19: The breakpoint graph of the genomes MRD_CARs (cyan) and MRD′_CARs (orange) reconstructed by inferCARs as well as MRD_MGRA (purple) reconstructed by MGRA. Bold purple edges represent reliable adjacencies obtained by MGRA Stage 1, while dashed purple edges (shown even if parts of complete multi-edges) represent adjacencies (between vertices incident to a split in M/R/D colors in Fig. 7, bottom panel) viewed as less reliable. Dashed cyan and orange edges represent ambiguous joins made by inferCARs.


[Figure S20: four panels a) to d) showing the genomes P, Q, and R on blocks a, b, c and the breakpoint graph G(P,Q,R) of the genomes P, Q, and R; vertices are the block ends at, ah, bt, bh, ct, ch.]

Figure S20: a) Unichromosomal genome P = (+a +b −c) represented as a black-obverse cycle. b) Unichromosomal genome Q = (+a −b +c) represented as a green-obverse cycle. c) Unichromosomal genome R = (+a −c −b) represented as a blue-obverse cycle. d) The (multiple) breakpoint graph G(P,Q,R) with and without obverse edges (compare to Fig. 1).
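The construction in Figure S20 can be sketched in Python. The code assumes the usual convention that a block +x is traversed from its tail xt to its head xh, treats each genome as a single circular chromosome, and omits the obverse edges; the function names `adjacencies` and `breakpoint_graph` are hypothetical.

```python
from collections import defaultdict


def adjacencies(circular_genome):
    """Adjacency edges of one circular chromosome, given as (sign, block) pairs."""
    edges = set()
    n = len(circular_genome)
    for i in range(n):
        sign1, block1 = circular_genome[i]
        sign2, block2 = circular_genome[(i + 1) % n]
        exit_end = block1 + ("h" if sign1 == "+" else "t")
        entry_end = block2 + ("t" if sign2 == "+" else "h")
        edges.add(frozenset((exit_end, entry_end)))
    return edges


def breakpoint_graph(genomes):
    """Map each adjacency edge to the set of genome colors that contain it."""
    graph = defaultdict(set)
    for color, genome in genomes.items():
        for edge in adjacencies(genome):
            graph[edge].add(color)
    return dict(graph)


P = [("+", "a"), ("+", "b"), ("-", "c")]  # P = (+a +b -c)
Q = [("+", "a"), ("-", "b"), ("+", "c")]  # Q = (+a -b +c)
R = [("+", "a"), ("-", "c"), ("-", "b")]  # R = (+a -c -b)

graph = breakpoint_graph({"P": P, "Q": Q, "R": R})
for edge, colors in sorted(graph.items(), key=lambda item: sorted(item[0])):
    print(sorted(edge), sorted(colors))
```

For these three genomes no adjacency is shared, so every vertex of the resulting graph is incident to one edge of each color.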

[Figure S21: breakpoint graphs G(P,Q), G(P,P), and G(Q,Q) on the vertices at, ah, bt, bh, ct, ch, with panels labeled by the genomes (+a +b −c), (+a −b +c), (+a −b −c), and (+a +b +c).]

Figure S21: Transformation of the breakpoint graph G(P,Q) of the “black” genome P = (+a +b −c) and “green” genome Q = (+a −b +c) (see Fig. 1) into the identity breakpoint graphs G(P,P) (with “green” 2-breaks) and G(Q,Q) (with “black” 2-breaks).
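The number of 2-breaks needed for such a transformation can be checked with a short sketch. It relies on the standard result for circular genomes that the 2-break distance equals the number of blocks minus the number of alternating cycles in the breakpoint graph; this formula is not stated in this section and is brought in here as an assumption. The function names are hypothetical, and the intermediate genome (+a −b −c) is one of the genome labels that appear in the figure.

```python
def adjacencies(circular_genome):
    """Adjacency edges of one circular chromosome, given as (sign, block) pairs."""
    edges = set()
    n = len(circular_genome)
    for i in range(n):
        sign1, block1 = circular_genome[i]
        sign2, block2 = circular_genome[(i + 1) % n]
        exit_end = block1 + ("h" if sign1 == "+" else "t")
        entry_end = block2 + ("t" if sign2 == "+" else "h")
        edges.add(frozenset((exit_end, entry_end)))
    return edges


def mates(edges):
    """Map every vertex to the other end of its adjacency edge."""
    mate = {}
    for edge in edges:
        u, v = tuple(edge)
        mate[u], mate[v] = v, u
    return mate


def two_break_distance(genome_p, genome_q):
    """Blocks minus alternating cycles (circular genomes on the same blocks)."""
    mate_p, mate_q = mates(adjacencies(genome_p)), mates(adjacencies(genome_q))
    seen, cycles = set(), 0
    for start in mate_p:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:  # walk one P-edge, then one Q-edge
            seen.add(v)
            w = mate_p[v]
            seen.add(w)
            v = mate_q[w]
    return len(genome_p) - cycles


P = [("+", "a"), ("+", "b"), ("-", "c")]    # P = (+a +b -c)
Q = [("+", "a"), ("-", "b"), ("+", "c")]    # Q = (+a -b +c)
MID = [("+", "a"), ("-", "b"), ("-", "c")]  # intermediate genome (+a -b -c)

print(two_break_distance(P, Q), two_break_distance(P, MID), two_break_distance(MID, Q))
```

Under these assumptions P and Q are two 2-breaks apart, with (+a −b −c) lying one step from each, which is consistent with the two-step transformations drawn in the figure.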
