
CrusView: A Java-Based Visualization Platform for Comparative Genomics Analyses in Brassicaceae Species [OPEN]

Hao Chen and Xiangfeng Wang*

School of Plant Sciences, University of Arizona, Tucson, Arizona, 85721

ORCID IDs: 0000-0003-1224-0688 (H.C.); 0000-0002-6406-5597 (X.W.).

In plants and animals, chromosomal breakage and fusion events based on conserved syntenic genomic blocks lead to conserved patterns of karyotype evolution among species of the same family. However, karyotype information has not been well utilized in genomic comparison studies. We present CrusView, a Java-based bioinformatic application utilizing Standard Widget Toolkit/Swing graphics libraries and a SQLite database for performing visualized analyses of comparative genomics data in Brassicaceae (crucifer) plants. Compared with similar software and databases, one of the unique features of CrusView is its integration of karyotype information when comparing two genomes. This feature allows users to perform karyotype-based genome assembly and karyotype-assisted genome synteny analyses with preset karyotype patterns of the Brassicaceae genomes. Additionally, CrusView is a local program, which gives its users high flexibility when analyzing unpublished genomes and allows users to upload self-defined genomic information so that they can visually study the associations between genome structural variations and genetic elements, including chromosomal rearrangements, genomic macrosynteny, gene families, high-frequency recombination sites, and tandem and segmental duplications between related species. This tool will greatly facilitate karyotype, chromosome, and genome evolution studies using visualized comparative genomics approaches in Brassicaceae species. CrusView is freely available at http://www.cmbb.arizona.edu/CrusView/.

The Brassicaceae (crucifer) plant family contains more than 3,700 species, including the model plant organism Arabidopsis (Arabidopsis thaliana); economically important crop species, such as Brassica rapa and Brassica napus; and close relatives of Arabidopsis used in abiotic stress research, such as Eutrema salsugineum and Schrenkiella parvula. Because Brassicaceae plants have high scientific and economic importance, several whole-genome sequencing projects of the species in this family have been recently launched (http://www.brassica.info). Moreover, Brassicaceae is also a good system for population genomics. The 1001 Arabidopsis Genomes Project (http://www.1001genomes.org/) plans to generate complete genome sequences for 1,001 Arabidopsis strains to study the associations between genetic variation and phenotypic diversity. The Value-directed Evolutionary Genomics Initiative project aims to understand the genome evolution of Brassicaceae species by sequencing several close relatives of Arabidopsis, such as Arabidopsis lyrata and Capsella rubella. Recent advances in high-throughput sequencing technology have greatly expedited these whole-genome sequencing projects of versatile nonmodel organisms.

Although increasingly longer reads can now be produced from high-throughput sequencing experiments, de novo assembler tools can only generate contig and/or scaffold sequences from high-throughput sequencing reads. These tools cannot generate complete chromosome sequences without genetic and/or physical maps that typically require years to create. This limitation makes chromosome-scale structural variation (i.e. translocation, inversion, deletion and insertion, and segmental and tandem duplication) and genomic macrosynteny analyses difficult to perform.

In both plants and animals, genomes of species within the same family have evolved with conserved karyotype patterns due to the rearrangements of large chromosomal segments. Chromosomal karyotypes can be obtained from comparative chromosomal painting (CCP) experiments by performing in situ hybridization experiments on bacterial artificial chromosome sequences between related species. The genome of each Brassicaceae member is composed of 24 conserved genomic blocks that have been considered as the basic units of chromosomal rearrangement during genome evolution (Lysak et al., 2006). The sizes of these conserved blocks range from several to dozens of megabases. Currently, karyotypes profiled by CCP experiments in approximately 20 Brassicaceae species are available; such karyotypes include those from Arabidopsis (n = 5), Hornungia alpina (n = 6), Eutrema spp. (n = 7), A. lyrata (n = 8), B. rapa (n = 10), and Polyctenium fremontii (n = 14). By utilizing the karyotype information in Brassicaceae, we have developed a tool, KGBassembler (for Karyotype-based Genome assembler for Brassicaceae), to finalize the assembly of chromosomes from scaffolds/contigs without relying on a genetic/physical map (Ma et al., 2012).

Over the past 2 years, complete whole-genome sequences of several Brassicaceae species have been released, including the aforementioned A. lyrata, S. parvula, B. rapa, and E. salsugineum (Dassanayake et al., 2011; Hu et al., 2011; Wang et al., 2011; Wright and Agren, 2011; Wu et al., 2012; Yang et al., 2013). These genomic resources have opened a new era of comparative genomics in Brassicaceae to better understand the genomic evolution (Cheng et al., 2012). Numerous tools and databases are available for performing comparative genomics analysis in plants. CoGe is a comparative genomics analysis platform that is now a part of the iPlant Collaborative Project (Goff et al., 2011). The CoGe database currently includes nearly 2,000 genome sequences of approximately 1,500 organisms, allowing users to perform online visual analyses of genome synteny and duplication events (Tang and Lyons, 2012). PLAZA and Vista are also Web-based databases that provide comparative analysis services on the genomic data deposited in the databases (Frazer et al., 2004; Van Bel et al., 2012). Other stand-alone bioinformatic applications for comparative genomic analysis, such as Easyfig and genoPlotR, are commonly used to generate synteny plots of given genome segments at a scale ranging from a single gene to one chromosome (Guy et al., 2010; Sullivan et al., 2011).

In this work, we present a Java-based bioinformatic application, CrusView, for performing visualized analyses of genome synteny and karyotype evolution in Brassicaceae species. CrusView features a user-friendly graphical user interface (GUI) implemented with Standard Widget Toolkit (SWT)/Swing graphics libraries and a SQLite database used to manage local genomic data. Compared with the most commonly used tools in comparative genomics, one of the unique features of CrusView is that available karyotype data of a Brassicaceae species are incorporated to facilitate karyotype-based chromosome assembly and analyses of chromosomal structural evolution. Compared with Web-based tools, the stand-alone CrusView tool was also designed to give users higher flexibility in analyzing currently unpublished genome data and integrating self-defined genomic information based on the users’ interests, such as gene families, gene duplications, chromosomal break points, Gene Ontology terms, and groups of orthologs/paralogs, with the genomic synteny maps. In addition, CrusView can generate images representing genomic synteny between two compared genomes in PNG/SVG/PDF high-resolution formats that are suitable for publication.

* Address correspondence to [email protected]. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Xiangfeng Wang ([email protected]).

[OPEN] Articles can be viewed online without a subscription. www.plantphysiol.org/cgi/doi/10.1104/pp.113.219444

Plant Physiology®, September 2013, Vol. 163, pp. 354–362, www.plantphysiol.org © 2013 American Society of Plant Biologists. All Rights Reserved.

RESULTS AND DISCUSSION

To demonstrate the basic functionality of CrusView, we prepared two example genomes and related data sets from Arabidopsis (n = 5) and E. salsugineum (n = 7) to perform visualized comparative genomics analyses. E. salsugineum (also known as salt cress and Thellungiella halophila) is a halophytic relative of Arabidopsis; it inhabits the seashore saline soils of eastern China. Because E. salsugineum and Arabidopsis share similar life cycles, morphological characters, and genetic composition, E. salsugineum has been widely used in plant salt-tolerance studies using the genetic systems and molecular tools previously established in Arabidopsis. The E. salsugineum genome (243 Mb) contains seven chromosomes and approximately 24,000 protein-coding genes (Yang et al., 2013). The karyotype maps derived from CCP experiments of both E. salsugineum and Arabidopsis are currently available (Lysak et al., 2006). We used these two genomes to demonstrate the karyotype-based genome assembly of the E. salsugineum chromosomes and the comparative analyses of E. salsugineum and Arabidopsis with integrated karyotype information.

Overview of the Functional Panels in CrusView

CrusView can be launched via Web start at http://www.cmbb.arizona.edu/crusview. The navigation panel includes quick buttons that perform basic operations in CrusView. The published karyotypes of 20 Brassicaceae species have been integrated into CrusView, and they are shown in the left “karyotype” panel. We will continue to collect published karyotypes generated from CCP experiments. Each time CrusView is launched, the program will automatically query the CrusView server to update the local karyotype database. Genomic data files from E. salsugineum and Arabidopsis can be imported into the SQLite database to run a demonstration for users who run CrusView for the first time. The primary visualization window shows the seven chromosomes of the primary E. salsugineum genome (Fig. 1). The protein-coding genes of E. salsugineum are designated with the corresponding colors based on the conserved genomic blocks in which they are located. The top right panel shows the color schemes and the letter labels for the 24 genomic blocks (A–X), while the bottom right panel shows the five chromosomes of the secondary Arabidopsis genome (Fig. 1). The information window displays the genomic annotations of the genes in the primary genome recorded in the Browser Extensible Data (BED) file, including the gene identifiers (IDs), chromosomal locations, genomic block IDs, orthologous group IDs, sequence similarities with the homologs in the secondary genome, gene functional descriptions, and other user-defined information (Fig. 1). The user can switch the primary and secondary genomes, zoom in/out of the chromosome images, perform a query for genes of interest, and invoke a chromosome-level comparison window using the quick buttons in the navigation panel.
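As a rough illustration of the kind of annotation record listed above, the following sketch parses one tab-delimited BED-style line into named fields. The column order after the three standard BED coordinates, the field names, and the sample values are assumptions made for this example; they are not taken from the CrusView file specification.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    chrom: str
    start: int
    end: int
    gene_id: str
    block_id: str          # one of the 24 genomic block letters (A-X)
    ortholog_group: str
    identity: float        # percent similarity to the homolog (assumed unit)
    description: str


def parse_bed_line(line: str) -> GeneRecord:
    """Parse one tab-delimited line; the column layout is hypothetical."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    chrom, start, end, gene_id, block, group, identity, desc = fields[:8]
    return GeneRecord(chrom, int(start), int(end), gene_id, block,
                      group, float(identity), desc)


# Made-up example line, not real annotation data.
example = "EsChr4\t1000\t2500\tgeneA\tI\tOG_example\t87.5\thypothetical protein"
record = parse_bed_line(example)
```

Keeping each annotation in a named field, rather than a positional list, makes later filtering by block ID or ortholog group straightforward.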


Visualized Karyotype Comparison between E. salsugineum and Arabidopsis

One of the unique functions of CrusView is that it can generate the digital karyotype of a genome, allowing users to visually compare the chromosomal karyotypes of the primary and secondary genomes. The A. lyrata (n = 8) genome represents an ancestral karyotype in the Brassicaceae family in which each member’s genome is composed of 24 conserved genomic blocks according to the karyotype analyses of several representative species in the family using CCP experiments (Lysak et al., 2006). Each conserved genomic block is a large chromosomal segment that can be represented by a group of Arabidopsis genes in synteny with their orthologs in the genomes of other Brassicaceae species. Thus, the Arabidopsis genes can be used as markers to infer the assignment of the 24 conserved genomic blocks to another species’ genome in Brassicaceae (Lysak et al., 2006; Yang et al., 2013). Our previously developed software program KGBassembler includes a pipeline to assign the genes in a Brassicaceae species genome to the 24 conserved genome blocks, with a color scheme and a letter label (A–X) based on the homology with Arabidopsis genes (Ma et al., 2012). Here, we elucidate this procedure using E. salsugineum as a newly sequenced genome based on three basic steps: first, the Arabidopsis amino acid sequences were mapped to the E. salsugineum scaffold sequences using BLAST, followed by the selection of the best aligned locations; second, the Arabidopsis genes mapped onto the E. salsugineum scaffolds were used to infer the conserved genomic blocks, followed by the assignment of the color schemes and letter labels of the 24 blocks to the E. salsugineum genes; and third, pseudochromosome sequences were generated based on the CCP-derived (n = 7) karyotype of E. salsugineum. This pipeline was integrated into CrusView and can be applied to any newly sequenced Brassicaceae species genome to perform karyotype-based genome assembly and generate digital karyotypes for comparison purposes.
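The first two steps can be pictured with a simplified sketch: keep the best-scoring alignment for each Arabidopsis marker gene, then label each scaffold with the block supported by most of its markers. This is a stand-in written for illustration, not the KGBassembler algorithm; the tuple layout, the majority-vote rule, and all gene, scaffold, and score values are assumptions.

```python
from collections import Counter, defaultdict


def best_hits(hits):
    """Keep the highest-scoring hit per query gene.

    hits: iterable of (query_gene, scaffold, start, score) tuples.
    """
    best = {}
    for query, scaffold, start, score in hits:
        if query not in best or score > best[query][2]:
            best[query] = (scaffold, start, score)
    return best


def assign_blocks(best, gene_to_block):
    """Label each scaffold with the block backed by most marker genes."""
    votes = defaultdict(Counter)
    for query, (scaffold, _start, _score) in best.items():
        block = gene_to_block.get(query)
        if block is not None:
            votes[scaffold][block] += 1
    return {scaffold: counts.most_common(1)[0][0]
            for scaffold, counts in votes.items()}


# Made-up alignment hits and block labels, for illustration only.
hits = [
    ("geneA", "scaffold_1", 100, 250.0),
    ("geneA", "scaffold_2", 900, 120.0),
    ("geneB", "scaffold_1", 5000, 300.0),
    ("geneC", "scaffold_2", 700, 180.0),
]
gene_to_block = {"geneA": "I", "geneB": "I", "geneC": "J"}

best = best_hits(hits)
scaffold_blocks = assign_blocks(best, gene_to_block)
```

A majority vote is only one possible rule; a scaffold spanning a block boundary would need per-gene labels rather than a single label per scaffold.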

Figure 1. Functional panels in the CrusView main screen. A, Navigation panel. B, List of available karyotypes in Brassicaceae species. C, Main window showing the primary genome (E. salsugineum). D, Color scheme and letter labels of the 24 conserved genomic blocks. E, Window showing the secondary genome (Arabidopsis). F, Gene annotation panel. G, Digital ancestral karyotype of A. lyrata. H, Digital karyotype of Arabidopsis. I, Digital karyotype of E. salsugineum.

In CrusView, the digital karyotypes of the primary and secondary genomes will greatly facilitate visualized genomic comparison and the identification of major chromosomal rearrangement events causing the genomic evolution of the chromosomal karyotype in the studied Brassicaceae species genome. For example, Arabidopsis chromosome 2 (AtChr2) resulted from the merging of E. salsugineum chromosome 4 (EsChr4) and the long arm (14–37 Mb) of EsChr3 (Fig. 1). Moreover, when compared with the ancestral karyotype of the eight A. lyrata chromosomes, users may study the different evolutionary paths of the karyotype in another species. For example, although AtChr1 resulted from the merging of A. lyrata AlChr1 and AlChr2, the structure of EsChr1 remains unchanged compared with AlChr1 (Fig. 1). Furthermore, users can search for the IDs of genes of interest or ortholog group IDs from the navigation panel and map their positions on the compared primary and secondary genomic karyotypes.
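The AtChr2 example can be expressed as a small set comparison. The sketch below uses the block letters that the text reports for EsChr4 (I and J) and the long arm of EsChr3 (K, G, and H); the function name, the segment labels, and the treatment of block order as irrelevant are assumptions of this illustration.

```python
def contributing_segments(target_blocks, source_segments):
    """Return the source segments that share blocks with the target.

    target_blocks: block letters found on the target chromosome.
    source_segments: dict mapping a segment name to its block letters.
    """
    shared = {}
    for name, blocks in source_segments.items():
        common = set(blocks) & set(target_blocks)
        if common:
            shared[name] = sorted(common)
    return shared


# Block composition as described in the text; block order is ignored here.
es_segments = {
    "EsChr4": ["I", "J"],
    "EsChr3_long_arm": ["K", "G", "H"],
}
at_chr2_blocks = ["K", "G", "H", "I", "J"]

origin = contributing_segments(at_chr2_blocks, es_segments)
```

Because the comparison works on block labels alone, it can flag candidate fusions but cannot detect inversions or other changes within a block.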

Visualized Fine Adjustment of Pseudochromosome Assembly in CrusView

The automatic generation of pseudochromosome sequences based on the KGBassembler algorithm may miss or misplace certain scaffolds that do not contain sufficient gene synteny information for inferring the assignment of conserved genomic blocks, which are either relatively short or contain too many repetitive sequences. Additionally, de novo scaffold assembly is usually interrupted at the edges of highly repeated centromere sequences. Thus, manual adjustment of the pseudochromosomes may be necessary. Unlike KGBassembler, in which users need to edit a text file for manual adjustment, CrusView allows users to perform visualized fine adjustment of pseudochromosome assembly in the GUI and to consider additional genomic information, such as positions of genetic markers, centromere-specific (CentO) tandem repeats, and the density of protein-coding genes during the adjustment. Users can directly load the project result produced in KGBassembler for visualized fine adjustment or use the “assembling” function in CrusView to assemble pseudochromosomes from the scaffold sequences. When the assembling function in CrusView is run for the first time, users must indicate the working folder containing the required input files described in “Materials and Methods” and an output folder to save the generated chromosome sequences. Users may set up necessary parameters in the “parameter panel” and save the parameters into a Windows Initialization (INI) configuration file that can be directly loaded to run the assembling function (Fig. 2). The details of the parameters are explained in the KGBassembler manual, and users may wish to apply different parameter settings to produce the optimal assembly, which is largely dependent on the quality of the scaffold sequences themselves as generated by de novo assembler tools.
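To show what loading such an INI file involves, here is a minimal sketch using Python's standard configparser module. The section name, the key names, and the values are hypothetical; the real parameter names are those documented in the KGBassembler manual.

```python
import configparser

# Hypothetical INI content; only the idea of a working folder and an
# output folder comes from the text, the key names are invented.
ini_text = """
[assembly]
working_dir = ./input
output_dir = ./output
min_scaffold_length = 10000
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

settings = config["assembly"]
working_dir = settings.get("working_dir")
output_dir = settings.get("output_dir")
min_scaffold_length = settings.getint("min_scaffold_length")
```

Storing the settings in a file like this lets the same assembly run be repeated or shared without re-entering values in the parameter panel.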

Figure 2. Genome assembling function. A, Digital karyotype of E. salsugineum. B, Unplaced short-scaffold sequences. C, Parameter panel. D, Menu bar. E, Main working panel for the manual curation of the genome assembly of E. salsugineum. F, Density of protein-coding genes on scaffolds. G, CentO tandem repeat. H, Genetic marker track.

To fine-tune the draft pseudochromosome sequences, CrusView allows users to add files containing genetic markers and CentO tandem repeats. In plants, CentO sequences are approximately 170-bp motifs that are tandemly arrayed and specifically located in the core centromeric regions (Benson, 1999). CentO repeats located at one end of a long scaffold are generally indicative of the centromeric end of that scaffold (Fig. 2). Moreover, the density of protein-coding genes is typically higher in the euchromatic regions of short and long arms than in the pericentromeric heterochromatic regions (Fig. 2). Thus, these types of information are very useful in assisting users to further inspect and adjust the scaffold layouts and orientations on the chromosomes as well as the genomic positions of the genetic markers. Users can simply perform drag-and-drop actions with a mouse to correct potentially misplaced scaffolds or to adjust the orientation of scaffolds. After a manual adjustment is performed, users can save the pseudochromosome sequences to a FASTA file and simultaneously generate the gene annotation file. Finally, users can use the “push to main screen” function to directly add the assembled pseudochromosome and perform further visualized comparative analyses.
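A gene-density track of the kind described above can be approximated by counting gene start positions in fixed-size windows along a scaffold. The sketch below is an illustration only; the window size, the use of start positions, and the sample coordinates are assumptions, not CrusView's actual computation.

```python
def gene_density(gene_starts, scaffold_length, window):
    """Count gene starts in consecutive windows along a scaffold.

    Positions are 0-based; the last window may be shorter than `window`.
    """
    n_windows = (scaffold_length + window - 1) // window
    counts = [0] * n_windows
    for pos in gene_starts:
        if 0 <= pos < scaffold_length:
            counts[pos // window] += 1
    return counts


# Made-up gene start positions on a 5,000-bp toy scaffold.
starts = [100, 900, 1500, 1800, 1900, 4200]
density = gene_density(starts, scaffold_length=5000, window=1000)
```

A run of low counts toward one end of a long scaffold would be consistent with that end lying closer to the pericentromeric region, which is the visual cue the text describes.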

Visualization of Genomic Synteny between Two Genomes

The “compare two genomes” function in CrusView can provide a visualization of genomic synteny for each pair of homologous chromosomes for the primary and secondary genomes. Chromosome-scale genomic synteny can be visualized in two ways: a chromosomal karyotype with homologous genes linked between the two chromosomes and a dot plot indicating chromosomal macrosynteny with duplication events (Fig. 3A). For example, a comparison of the karyotypes of EsChr4 and AtChr2 indicated that Arabidopsis chromosome 2 resulted from an event in which the entire chromosome 4 (genomic blocks I and J) merged with the long arm of chromosome 3 (genomic blocks K, G, and H) in E. salsugineum (Fig. 3A). In addition, the visualized chromosomal synteny with karyotype information can also allow users to examine the differences in the chromosome structures between the two genomes. For instance, the 18-Mb-long region from 27 to 35 Mb of block J on EsChr4 remains highly similar to the 17-Mb-long region from 13 to 20 Mb on AtChr2, whereas the 25-Mb-long I block of EsChr4 has seemingly expanded dramatically, with highly enriched repetitive sequences and transposable elements, compared with the corresponding approximately 17-Mb I block region on AtChr2. More interestingly, a small region of EsChr4 between positions 10 and 11 Mb was found to result from the inverted translocation of a region from AtChr2. Selecting a genomic region with the mouse invokes the information window, which lists the genes located in the region of interest. By clicking on a gene homologous to the corresponding Arabidopsis gene, users will be redirected to The Arabidopsis Information Resource database, which contains detailed gene function information.

Figure 3. Visualization of genome synteny and gene alignment. A, Panels for genome synteny visualization: a, navigation bar; b, primary genome; c, secondary genome; d, chromosome synteny; e, dot plot; f, genes in the selected area; g, action list; h, selection of segmental duplication; i, genes in the ortholog groups. B, Alignment of multiple gene members in the CDPK family showing tandem duplication events. C, Exon-level alignment of the SALT OVERLY SENSITIVE1 (SOS1) genes between Arabidopsis and E. salsugineum.

Chromosome-scale genomic synteny can also be visualized as a dot plot in CrusView to facilitate the identification of segmental duplication and tandem duplication events between the two compared species. From the dot-plot screen, users can select the regions containing duplication events of interest with the mouse to obtain information regarding the genes located in the selected regions (Fig. 3A). Right clicking the mouse will invoke a pull-down list of advanced actions, such as querying selected genes in the external The Arabidopsis Information Resource database to view detailed functional descriptions, retrieving gene sequences to a FASTA file, performing exon-level sequence alignment for a single gene, and aligning multiple genes in a user-defined synteny region using AJaligner. Figure 3B demonstrates a genomic region between 23.8 and 24.1 Mb on AtChr4 encompassing two tandem duplication events of the gene members in the calcium-dependent protein kinase (CDPK) family that may be involved in stress-responsive pathways in Arabidopsis. While AtCDPK27 and AtCDPK31 represent a pair of tandemly duplicated genes that correspond to the single-copy E. salsugineum gene Thhalv10028618m.g, AtCDPK21 and AtCDPK23 correspond to the single-copy gene Thhalv10028567m.g (Fig. 3B). An exon-level sequence alignment of a pair of orthologous genes of interest will reveal exon-level structural variations, amino acid variations, insertions and deletions, and single-nucleotide polymorphisms, as illustrated by the comparison of SALT OVERLY SENSITIVE1 in Arabidopsis and its E. salsugineum ortholog (Fig. 3C).
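The idea behind a synteny dot plot can be sketched in a few lines: each homologous gene pair becomes one point whose coordinates are the gene positions on the two chromosomes, so a primary gene with several nearby homologs on the other chromosome yields several points sharing one coordinate. The gene names and positions below are placeholders, and the function is a simplified illustration rather than CrusView's plotting code.

```python
from collections import Counter


def dot_plot_points(pairs, pos_primary, pos_secondary):
    """Turn homologous gene pairs into (x, y) dot-plot coordinates."""
    points = []
    for primary_id, secondary_id in pairs:
        if primary_id in pos_primary and secondary_id in pos_secondary:
            points.append((pos_primary[primary_id],
                           pos_secondary[secondary_id]))
    return sorted(points)


# Placeholder genes: "es1" has two nearby homologs on the other chromosome.
pos_primary = {"es1": 100, "es2": 500}
pos_secondary = {"at1": 200, "at2": 260, "at3": 900}
pairs = [("es1", "at1"), ("es1", "at2"), ("es2", "at3")]

points = dot_plot_points(pairs, pos_primary, pos_secondary)
# Number of points per primary-genome position; values above 1 hint at
# duplicated homologs on the secondary chromosome.
copies_per_primary = Counter(x for x, _y in points)
```

The resulting list of points can be passed to any plotting library; only the coordinate construction is shown here.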

Visualization of a User-Defined List of Genes, Duplication Events, and Copy Number Variations in a Genomic Synteny Plot

Using CrusView, users may visualize a group of genes of interest in the two compared genomes to determine their associations with genomic synteny and possible duplication events. We demonstrate this utility by analyzing the tandemly duplicated F-box superfamily that has been found to display great copy number variations between Arabidopsis (505 genes) and E. salsugineum (613 genes). First, the genes in E. salsugineum were assigned to the orthologous groups annotated in the OrthoMCL database (Li et al., 2003). Each ortholog group indicated by a unique ID contains the putative orthologous genes in Arabidopsis and E. salsugineum. We found that one of the ortholog groups (OG5_127192) that showed high variation in copy number contained 148 and 130 F-box genes in Arabidopsis and E. salsugineum, respectively. In plants, F-box genes constitute a large superfamily encoding an E3 ubiquitin ligase that is involved in substrate-specific protein degradation. Next, using the “predict tandem duplication” function in CrusView, highly homologous genes defined with a cutoff of 40% protein identity and located adjacent to each other within a 5-kb window were highlighted in green in the dot plot of EsChr3 and AtChr3 (Fig. 4). The protein identity cutoff and window size can both be adjusted by the user when predicting tandem duplications. Then, using the “keyword search” function, a group of genes of interest is displayed in the current dot plot. For instance, when searching ID OG5_127192, F-box genes classified in this ortholog group by OrthoMCL were highlighted in red in the same dot-plot image (Fig. 4). From the overlapping green dots (tandemly duplicated genes) and red dots (F-box genes in group OG5_127192), we observed a macrosyntenic block covering an approximately 5-Mb region on AtChr3 and an approximately 15-Mb region on EsChr3, encompassing 59 and 78 tandemly arrayed F-box genes in Arabidopsis and E. salsugineum, respectively (Fig. 4).
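The tandem duplication criterion described above (a protein identity cutoff and an adjacency window, both adjustable) can be sketched as follows. This is a simplified illustration under stated assumptions: the gap is measured from the end of one gene to the start of the next, only neighboring genes are compared, and the gene names, coordinates, and identity values are invented.

```python
def predict_tandem_pairs(genes, identity, min_identity=40.0, window=5000):
    """Flag neighboring genes that look like tandem duplicates.

    genes: list of (gene_id, start, end) on one chromosome.
    identity: dict mapping frozenset({id_a, id_b}) to percent identity.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    pairs = []
    for (id_a, _sa, end_a), (id_b, start_b, _eb) in zip(ordered, ordered[1:]):
        gap = start_b - end_a
        pid = identity.get(frozenset((id_a, id_b)), 0.0)
        if gap <= window and pid >= min_identity:
            pairs.append((id_a, id_b))
    return pairs


# Made-up genes and identities, for illustration only.
genes = [
    ("g1", 1000, 3000),
    ("g2", 4000, 6000),
    ("g3", 20000, 22000),
    ("g4", 23000, 25000),
]
identity = {
    frozenset(("g1", "g2")): 72.0,   # close together and similar
    frozenset(("g3", "g4")): 35.0,   # close together but below the cutoff
}

tandem = predict_tandem_pairs(genes, identity)
```

Raising `min_identity` or shrinking `window` makes the prediction stricter, which mirrors the two user-adjustable settings mentioned in the text.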

Similarly, users can add genomic information to the BED file to allow searching for self-defined keywords, such as Gene Ontology terms, gene functional descriptions, or gene families. CrusView also allows users to filter a list of genes or genomic positions of interest from the user-defined genomic information file, which can be displayed on the dot-plot synteny map. Users can define the color schemes for different gene groups on the plots using the setting function of CrusView. Finally, the digital karyotype maps, macrosynteny plots based on the 24 color-coded genomic blocks, and dot-plot synteny map showing duplication events and mapped genes of interest can be saved as publication-quality PNG/SVG/PDF images.
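A keyword search over user-defined annotations can be pictured as a case-insensitive substring match across all fields of each gene record. The record layout and gene names in this sketch are assumptions for illustration; only the ortholog group ID OG5_127192 is taken from the text.

```python
def keyword_search(records, keyword):
    """Return IDs of genes with any annotation field containing the keyword."""
    needle = keyword.lower()
    return [
        rec["gene_id"]
        for rec in records
        if any(needle in str(value).lower() for value in rec.values())
    ]


# Hypothetical annotation records.
records = [
    {"gene_id": "geneA", "ortholog_group": "OG5_127192",
     "description": "F-box protein"},
    {"gene_id": "geneB", "ortholog_group": "OG_other",
     "description": "protein kinase"},
]

group_hits = keyword_search(records, "OG5_127192")
family_hits = keyword_search(records, "f-box")
```

The matched gene IDs could then be mapped to positions and highlighted in a chosen color on the synteny plot.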

CONCLUSION

In this work, we developed a Java-based bioinformatic application, CrusView, using the SWT/Swing graphics libraries in Java and a SQLite database; this application was designed to facilitate research in comparative genomics. We demonstrated the basic functionality of CrusView by performing a visual comparison of the Arabidopsis and E. salsugineum genomes in the plant Brassicaceae (crucifer) family. Compared with other bioinformatic tools that have been developed for similar purposes, one of CrusView’s unique features is its incorporation of genomic karyotype information derived from CCP experiments. The karyotype of a species associated with the genome structure visualized in CrusView can greatly assist users in identifying chromosomal rearrangements, genomic synteny, and major duplication events among the related species. Thus, this unique CrusView feature may facilitate the understanding of karyotype, chromosome, and genome evolution based on a comparative genomics approach. Furthermore, by taking advantage of a species’ karyotype, CrusView provides a unique function to infer pseudochromosome sequences from scaffold sequences generated by de novo assemblers based on conserved genomic blocks. This feature is especially convenient for nonmodel species that lack a genetic and/or physical map. However, users should be aware that CrusView does not replace de novo assembler tools, and its performance in finalizing the assembly of a pseudochromosome sequence depends largely on the quality of the scaffolds and contigs produced from whole-genome shotgun sequencing projects.

CrusView also includes an array of utilities that can be used to visualize genome synteny and duplication events and to map a list of genes of interest associated with syntenic regions between the two analyzed genomes. Compared with database-based comparative genomics tools, CrusView is much more flexible in the ability to analyze unpublished genomes; it allows users to integrate self-defined genomic information, such as Gene Ontology classifications, gene families of interest, hot spots of chromosomal breakage/fusion points, high-frequency recombination sites, and tandem duplication to study their correlations with genomic variations and duplication events. User-defined information and genome synteny plots can be exported as high-resolution, publication-quality PNG/SVG/PDF images.

Karyotype mapping based on in situ hybridization experiments is a common genomic technique that is widely used in animals and plants. Conserved patterns of chromosomal rearrangements based on syntenic genomic blocks as basic units of chromosomal breakage and fusion events are commonly observed in the animal and plant kingdoms (Lysak et al., 2006; Ferguson-Smith and Trifonov, 2007). Therefore, although CrusView was primarily developed and preset based on the karyotype evolution patterns in the Brassicaceae family (primarily for the convenience of the Brassicaceae community), this software program may also be used to perform karyotype-based genome assembly or karyotype-assisted genome synteny analysis in other plant families or in other organisms for which karyotype data exist. If users wish to use the current version of CrusView for non-Brassicaceae species, they can access the “setting” function to define the color schemes and letter labels of the conserved genomic blocks based on the karyotype evolution patterns of the species of interest. Additionally, to promote the broad use of CrusView in other organisms, the source code of CrusView has been released through Sourceforge.net to allow academic users to freely download and modify the programs.

Figure 4. Mapping duplication events and genes of interest onto the dot-plot synteny map. A dot-plot synteny map of EsChr3 and AtChr3 is shown. The blue dots represent homologous gene pairs in the Arabidopsis and E. salsugineum genomes. The blue dots arranged along the diagonal line indicate a macrosynteny region. The aligned blue dots deviating from the diagonal line indicate segmental duplications. The green dots represent potential tandemly duplicated genes selected using a protein identity cutoff of 40% and a 5-kb window size. The red dots represent F-box genes selected by a keyword search. The overlapping red dots and green dots indicate the tandemly duplicated F-box genes on EsChr3 and AtChr3.

MATERIALS AND METHODS

Basic Input Files for CrusView

CrusView utilizes the Java Web Start function so that it can be launched through the CrusView homepage. When it is run for the first time, CrusView creates a “CrusView” folder on the user’s local computer and automatically installs the programs and basic data set in the folder. CrusView simultaneously creates a local SQLite database to manage the genomic data that the user wishes to analyze. The data files include a FASTA file containing chromosome or scaffold/contig sequences and a GFF file containing gene model annotation that will be imported into the SQLite database. The user must also prepare a BED file in the “bed” folder to provide additional information, such as ortholog group IDs, genome block IDs, and protein sequence identities between the primary and secondary genomes. To enable the advanced search function, the BED file may also include the user’s self-defined genomic information and functional descriptions added in the last column, such as Gene Ontology terms, gene families, and recombination hot spots. To analyze a specific group of genes of interest, the user can load a TXT file containing the gene IDs or genomic positions and their further descriptions into CrusView through the provided functions.
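To make the import step concrete, the following Python sketch loads the gene records of a GFF file into a SQLite table and queries them by genomic window. It is written in Python with the standard `sqlite3` module purely for illustration; the table name, column names, and helper function are hypothetical and do not reflect the schema used internally by CrusView.

```python
import sqlite3


def load_gff_genes(conn, gff_lines):
    """Insert all 'gene' features of a GFF3-style file into table `gene`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene ("
        "gene_id TEXT PRIMARY KEY, chrom TEXT, "
        "start_pos INTEGER, end_pos INTEGER, strand TEXT)"
    )
    rows = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep gene-level features only
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "ID" in attrs:
            rows.append((attrs["ID"], cols[0], int(cols[3]), int(cols[4]), cols[6]))
    conn.executemany("INSERT OR REPLACE INTO gene VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up annotation lines in nine-column GFF3 layout.
gff = [
    "##gff-version 3",
    "Chr3\texample\tgene\t1000\t2500\t.\t+\t.\tID=geneA;Name=demo",
    "Chr3\texample\tmRNA\t1000\t2500\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "Chr3\texample\tgene\t4000\t5200\t.\t-\t.\tID=geneB",
]
conn = sqlite3.connect(":memory:")
n_loaded = load_gff_genes(conn, gff)
in_window = conn.execute(
    "SELECT gene_id FROM gene "
    "WHERE chrom = ? AND start_pos >= ? AND end_pos <= ? ORDER BY start_pos",
    ("Chr3", 3000, 6000),
).fetchall()
print(n_loaded, in_window)
```

Keeping the annotation in a local database lets region queries such as the one above run without reparsing the GFF file each time.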

Input Files for Karyotype-Based Genome Assembly

For species with only scaffold sequences but with an available CCP-derived karyotype map, a karyotype-based genome assembly of pseudochromosomes from scaffold sequences is recommended. The KGBassembler is invoked by the “assembling” function in CrusView. The assembly function requires the following input files: a KARYOTYPE file containing CCP-based karyotype information obtained from the CrusView Web site or prepared by the user based on instructions, a PSL file containing Arabidopsis (Arabidopsis thaliana) genes aligned on the scaffolds, and a FASTA file containing scaffold sequences. The user can either provide a configuration file in Windows Initialization (INI) format or edit the “Parameter” tab in the CrusView interface to set up the necessary parameters for assembly. If a genetic map with gene marker information is prepared by the user as a GMM file in the designated format described in the CrusView manual, CrusView may also incorporate this information during the manual adjustment of the pseudochromosomes. To facilitate the prediction of scaffold orientations on the pseudochromosomes, the user may run the Tandem Repeats Finder program (Benson, 1999) to identify the scaffolds containing CentO sequences. CentO repeat locations formatted as a BED file can be loaded into CrusView as an additional track.
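As an illustration of how an INI-format configuration naming the three required inputs could be read and checked, consider the Python sketch below. The section name, key names, file names, and the `min_genes_per_scaffold` parameter are all hypothetical; the actual parameters accepted by the assembly function are documented in the CrusView manual.

```python
import configparser

# Hypothetical configuration; none of these names come from CrusView.
EXAMPLE_INI = """
[assembly]
karyotype = species.karyotype
alignments = at_genes_vs_scaffolds.psl
scaffolds = scaffolds.fasta
min_genes_per_scaffold = 3
"""

REQUIRED = ("karyotype", "alignments", "scaffolds")


def read_assembly_config(text):
    """Parse the INI text and verify that the three input files are named."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["assembly"]
    missing = [key for key in REQUIRED if key not in section]
    if missing:
        raise ValueError("missing required entries: " + ", ".join(missing))
    return {
        "karyotype": section["karyotype"],
        "alignments": section["alignments"],
        "scaffolds": section["scaffolds"],
        "min_genes_per_scaffold": section.getint(
            "min_genes_per_scaffold", fallback=1
        ),
    }


config = read_assembly_config(EXAMPLE_INI)
print(config)
```

Validating the required entries up front gives a clear error before any assembly work starts, which is the main benefit of a configuration file over setting the parameters interactively.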

After the KGBassembler has generated the pseudochromosome sequences, the user may use CrusView to perform fine adjustments to the orientations and orders of the scaffolds on the pseudochromosomes based on the additional information provided by the user, such as the density of protein-coding genes, user-customized genetic markers, and the locations of CentO tandem repeats on the scaffolds. CrusView has been implemented with an enhanced GUI that can be used to further adjust the pseudochromosome assembly using dragging-and-placing mouse actions. By clicking the “save assembly” button, the pseudochromosome sequences and gene annotation information will be saved in a FASTA file and a GFF file, respectively.

Conversion of a User’s Unpublished Genome Sequence and Self-Defined Gene Annotation to Input Files Compatible with CrusView

To help the user analyze a yet-to-be-published genome sequence, CrusView includes a function for preparing the necessary input files. The user must provide the genome/scaffold sequences in FASTA format, the gene annotation file in GFF or GTF format, and one additional karyotype file if the user wants to perform karyotype-based assembly of pseudochromosome sequences. The user is also prompted to submit protein sequences to the OrthoMCL online database (http://www.orthomcl.org) to assign the genes to the corresponding ortholog groups to facilitate genome comparisons, gene duplication analyses, and copy number variation analyses. To assign the 24 conserved genome block IDs to the genes, the user must provide a BLAST result of the protein sequences of the analyzed genome against Arabidopsis proteins. Additional genomic information that the user wishes to include will be integrated into the last column of the BED file to enable the keyword search function in CrusView.
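One plausible way to transfer genome block IDs from Arabidopsis through a BLAST result is to take the best-scoring Arabidopsis hit for each query protein and inherit its block. The Python sketch below assumes the standard 12-column tabular BLAST output (query, subject, percent identity, ..., E-value, bit score) and a hypothetical lookup table from Arabidopsis gene to block ID; the selection rule actually applied by CrusView is not specified here, and the gene IDs and block assignments are made up.

```python
def assign_blocks(blast_tab_lines, at_gene_to_block):
    """Map each query gene to the block ID of its best Arabidopsis hit.

    Queries whose best hit has no known block receive "0" (undetermined).
    """
    best = {}  # query -> (subject, bit score)
    for line in blast_tab_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete tabular record
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: at_gene_to_block.get(s, "0") for q, (s, _) in best.items()}


# Made-up hits: Es_g1 has two hits, Es_g2 hits a gene without a block.
blast = [
    "Es_g1\tAt_g10\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-150\t520",
    "Es_g1\tAt_g99\t41.0\t280\t160\t5\t1\t280\t10\t290\t1e-40\t150",
    "Es_g2\tAt_g77\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-100\t380",
]
at_blocks = {"At_g10": "F", "At_g99": "U"}  # hypothetical assignments
blocks = assign_blocks(blast, at_blocks)
print(blocks)
```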

Inference of Genomic Macrosynteny Based on Conserved Genomic Blocks

The genomes of the Brassicaceae species share 24 conserved genomic blocks (large chromosomal segments) designated A to X. The additional ID “0” is used by CrusView to label undetermined regions that are not assigned to any genomic block. The chromosomal locations of the 24 genomic blocks can be inferred from the CCP-derived karyotype. Each gene located within the same conserved genomic block is assigned a designated color code to illustrate the digital karyotype of the studied species. Genes sharing the same genomic block ID are considered to be in the same genomic macrosyntenic region. To analyze a genome lacking a CCP-derived karyotype or a genome in another family of plants or animals that has different conserved genomic blocks, the user can self-define the block IDs with hexadecimal color codes in the BED file.
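The block-based definition of macrosynteny can be sketched in Python as follows. The 24 block IDs and the “0” label come from the description above; the color values, gene IDs, and function names are made up for illustration and are not the presets shipped with CrusView.

```python
import re
import string
from collections import defaultdict

BLOCK_IDS = tuple(string.ascii_uppercase[:24])  # "A" through "X"
UNASSIGNED = "0"
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def validate_palette(palette):
    """Check that every block ID maps to a six-digit hexadecimal color."""
    for block, color in palette.items():
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"block {block!r}: invalid color code {color!r}")
    return palette


def macrosyntenic_groups(primary, secondary):
    """Group genes of two genomes by shared block ID, ignoring "0"."""
    groups = defaultdict(lambda: ([], []))
    for gene, block in primary.items():
        if block != UNASSIGNED:
            groups[block][0].append(gene)
    for gene, block in secondary.items():
        if block != UNASSIGNED:
            groups[block][1].append(gene)
    # Keep only blocks represented in both genomes.
    return {b: g for b, g in sorted(groups.items()) if g[0] and g[1]}


palette = validate_palette({"A": "#FF0000", "B": "#00A0FF", "0": "#CCCCCC"})
primary = {"At1": "A", "At2": "B", "At3": "0"}
secondary = {"Es1": "A", "Es2": "A", "Es3": "C"}
groups = macrosyntenic_groups(primary, secondary)
print(groups)
```

Only block A appears in both example genomes, so it is the only macrosyntenic group reported; genes labeled “0” never contribute to a group.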

Visualization of Chromosomal Karyotype, Genomic Synteny, and Gene Alignment

CrusView was implemented with the Java SWT/Swing libraries to develop the GUI and visualization functions. Visualization of the genomic data of an analyzed species can be performed at three levels: the genome level, the chromosome level, and the gene level. If the karyotype information has been associated with the studied genome, all of the chromosomes will be visualized with the 24 genomic block IDs with corresponding colors. The user can select any two chromosomes of interest in the two compared species to visualize chromosomal synteny. When comparing the karyotypes of two chromosomes, the pairs of orthologous genes between the two species are linked to indicate major chromosomal rearrangement events. CrusView also generates a dot plot for each pair of selected chromosomes to visualize tandem and segmental duplication events. The user may select a group of genes from the dot plot using a mouse framing action to trigger gene-level visualization. A multigene alignment within a designated genomic region (less than 1 Mb) between the two genomes and an exon-to-exon alignment of one pair of orthologous genes with single-nucleotide polymorphism information can be visualized.
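The relationship between homologous gene pairs, dot-plot coordinates, and the framing selection can be sketched in Python as follows. The function names, the use of gene start positions as coordinates, and the example data are assumptions for illustration; they do not describe how CrusView draws or selects points internally.

```python
def dot_plot_points(pairs, pos_x, pos_y):
    """Turn homologous gene pairs into (x, y, gene_x, gene_y) points.

    `pos_x` and `pos_y` map gene IDs to positions (bp) on the two
    chromosomes being compared; pairs with an unknown position are skipped.
    """
    return [
        (pos_x[a], pos_y[b], a, b)
        for a, b in pairs
        if a in pos_x and b in pos_y
    ]


def select_in_frame(points, x_min, x_max, y_min, y_max):
    """Return the points inside a rectangular frame (bounds inclusive)."""
    return [
        p for p in points
        if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max
    ]


# Made-up positions on two chromosomes.
pos_es = {"Es1": 100_000, "Es2": 250_000, "Es3": 900_000}
pos_at = {"At1": 120_000, "At2": 260_000, "At3": 2_000_000}
pairs = [("Es1", "At1"), ("Es2", "At2"), ("Es3", "At3"), ("Es4", "At1")]

points = dot_plot_points(pairs, pos_es, pos_at)
framed = select_in_frame(points, 0, 300_000, 0, 300_000)
print([(gx, gy) for _, _, gx, gy in framed])
```

The genes returned by the frame selection are the natural input for a gene-level view, such as a multigene alignment of the framed region.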

Output Image Files Generated from CrusView

One of the useful utilities of CrusView is to generate high-resolution images and save them in PNG/SVG/PDF formats for publication use. Such images include digital karyotypes, genome synteny plots, dot plots of two chromosomes, multigene alignments within a genomic region, exon-to-exon alignment plots, plots of genomic duplication events, and mapping of a list of genes of interest in the genomic synteny plots.

Software Availability

CrusView is publicly available online (http://www.cmbb.arizona.edu/crusview) and has been implemented as a Java Web Start application under Windows and Linux 32/64-bit systems with options for different memory sizes. Sample data sets from Arabidopsis and Eutrema salsugineum are provided to demonstrate the basic functions of CrusView. The software manual and a series of video tutorials for CrusView are also provided online (http://www.cmbb.arizona.edu/crusview/video_tutorial).

Received April 10, 2013; accepted July 20, 2013; published July 29, 2013.

Plant Physiol. Vol. 163, 2013. Copyright © 2013 American Society of Plant Biologists. All rights reserved.

LITERATURE CITED

Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573–580

Cheng F, Wu J, Fang L, Wang X (2012) Syntenic gene analysis between Brassica rapa and other Brassicaceae species. Front Plant Sci 3: 198

Dassanayake M, Oh DH, Haas JS, Hernandez A, Hong H, Ali S, Yun DJ, Bressan RA, Zhu JK, Bohnert HJ, et al (2011) The genome of the extremophile crucifer Thellungiella parvula. Nat Genet 43: 913–918

Ferguson-Smith MA, Trifonov V (2007) Mammalian karyotype evolution. Nat Rev Genet 8: 950–962

Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I (2004) VISTA: computational tools for comparative genomics. Nucleic Acids Res 32: W273–W279

Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 2: 34

Guy L, Kultima JR, Andersson SG (2010) genoPlotR: comparative gene and genome visualization in R. Bioinformatics 26: 2334–2335

Hu TT, Pattyn P, Bakker EG, Cao J, Cheng JF, Clark RM, Fahlgren N, Fawcett JA, Grimwood J, Gundlach H, et al (2011) The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet 43: 476–481

Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189

Lysak MA, Berr A, Pecinka A, Schmidt R, McBreen K, Schubert I (2006) Mechanisms of chromosome number reduction in Arabidopsis thaliana and related Brassicaceae species. Proc Natl Acad Sci USA 103: 5224–5229

Ma C, Chen H, Xin M, Yang R, Wang X (2012) KGBassembler: a karyotype-based genome assembler for Brassicaceae species. Bioinformatics 28: 3141–3143

Sullivan MJ, Petty NK, Beatson SA (2011) Easyfig: a genome comparison visualizer. Bioinformatics 27: 1009–1010

Tang H, Lyons E (2012) Unleashing the genome of Brassica rapa. Front Plant Sci 3: 172

Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol 158: 590–600

Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun JH, Bancroft I, Cheng F, et al (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43: 1035–1039

Wright SI, Agren JA (2011) Sizing up Arabidopsis genome evolution. Heredity (Edinb) 107: 509–510

Wu HJ, Zhang ZH, Wang JY, Oh DH, Dassanayake M, Liu BH, Huang QF, Sun HX, Xia R, Wu YR, et al (2012) Insights into salt tolerance from the genome of Thellungiella salsuginea. Proc Natl Acad Sci USA 109: 12219–12224

Yang R, Jarvis DE, Chen H, Beilstein M, Grimwood J, Jenkins J, Shu S, Prochnik S, Xin M, Ma C, et al (2013) The reference genome of the halophytic plant Eutrema salsugineum. Front Plant Sci 4: 46
