
Evolutionary Breakpoint Analyser (EBA)

Identification of evolutionary breakpoint regions and their classification using phylogenetic relationships.

User manual

Jitendra Narayan and Denis M. Larkin

Institute of Biological, Environmental and Rural Sciences (IBERS),

Aberystwyth University

Penglais, Aberystwyth

Ceredigion

SY23 1PW

Version 1.1 (09/06/2015)


Contents

Chapter 1
First steps
1.1 Introduction
1.2 Installation
1.2.1 System Requirements
1.2.2 Obtaining the tool
1.2.3 Installing the EBA framework
1.3 Mandatory Perl Modules
1.3.1 Install Perl Modules Manually
1.4 GenBank Taxonomy Database
1.5 Sample Data

Chapter 2
How to use EBA
2.1 The script EBA.pl
2.1.1 Mandatory parameters
2.1.2 Optional parameters
2.2 Input file formats
2.2.1 Homologous Synteny Blocks (HSB) files
2.2.2 Classification file
2.2.3 Chromosome size file
2.2.4 BetaScore file
2.3 Folder setting
2.3.1 Resolution Folders
2.3.2 Location of all resolutions folders

Chapter 3
Result and Intermediate Files
3.1 Intermediate files and folder
3.1.1 EBA_OutFiles Directory
3.1.2 EBA_ImageFiles Directory
3.1.3 ResultFiles Directory
3.1.4 Viz_EHFiles Directory

Chapter 4
Merging all resolutions
4.1 Merging resolution with prime
4.2 Final merged reuse
4.3 Final merged graph
4.4 Final merged classification pie chart
4.5.3 Final merged breakpoint line graph
4.5 Other graphs
4.5.1 Breakpoint line graph
4.5.2 Breakpoint histogram

Chapter 5
Point to remember

Chapter 1

First steps

1.1 Introduction

The Evolutionary Breakpoint Analyser (EBA) implements methods for identifying evolutionary breakpoint regions and classifying them by their phylogenetic relationships at different resolutions. The aim of this manual is to provide an overview of the organization, installation and configuration of the program.

The EBA code is written in Perl and is distributed under the GNU GPL License or the Artistic License 2.0. For more information, see http://www.gnu.org/copyleft/gpl.html and http://opensource.org/licenses/artistic-license-2.0

1.2 Installation

1.2.1 System Requirements

EBA is implemented in Perl and has been tested under Windows, Linux and Mac OS environments. It has the following requirements:

Windows, Linux or Mac OS

Perl

Perl Modules

1.2.2 Obtaining the tool

The Evolutionary Breakpoint Analyser (EBA) tool can be downloaded from the Aberystwyth University or BioinformaticsOnline (BOL) websites.

www.bioinformaticsonline.com/EBA

1.2.3 Installing the EBA framework

Before installing and running the EBA tool, check that Perl and the required Perl modules are present on your system. If you are working on Mac OS or Linux, Perl is normally already installed on your system/server. You can check by typing perl -v at the command prompt, which shows the following message if Perl is installed:


jitendra@jitendra-OptiPlex-390[EBAv1.1] perl -v

This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi (with 80 registered patches, see perl -V

for more detail)

Copyright 1987-2011, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found

in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this system using "man perl" or "perldoc perl". If

you have access to the Internet, point your browser at http://www.perl.org/, the Perl Home Page.

If you need more information, visit http://www.perl.org/. If your system does not show the above message, Perl is not installed; to install it, go to http://www.perl.org/get.html and follow the download and installation instructions. On a Windows machine you can also try Strawberry Perl (http://strawberryperl.com/) or CitrusPerl (http://www.citrusperl.com/).

To install the EBA framework, obtain the compressed EBA archive from the website mentioned above (see Section 1.2.2) and decompress it in the desired directory:

> cp -i EBAv1.1.tar.gz <directory>

> cd <directory>

> tar -xzvf EBAv1.1.tar.gz

where <directory> is the path of the desired directory (for example, /usr/local/). After decompressing the archive, a directory named EBA can be found inside <directory>; it contains all the required scripts.

1.3 Mandatory Perl Modules

In order to install and run the EBA framework script, the following mandatory modules must be installed on your operating system.

File::Path

Math::Round

List::Compare

GD

GD::Graph

GD::Text

GD::Image
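Before installing anything, it can help to see which of these modules the local Perl can already load. The following is a minimal shell sketch and is not part of the EBA distribution: the function name check_perl_modules and the PERL override variable are hypothetical, and the check relies only on perl's standard -M and -e switches.

```shell
# Report which of the given Perl modules can be loaded.
# PERL is a hypothetical override for the interpreter to test (default: perl).
check_perl_modules() {
    missing=0
    for module in "$@"; do
        # "perl -MName -e 1" exits with 0 only if the module loads.
        if "${PERL:-perl}" -M"$module" -e 1 >/dev/null 2>&1; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every module was found.
    [ "$missing" -eq 0 ]
}

# Example call with the modules listed above:
# check_perl_modules File::Path Math::Round List::Compare GD GD::Graph GD::Text GD::Image
```

Any module reported as MISSING can then be installed with one of the methods described in the rest of this section.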

The easiest way to install a Perl module is to use CPAN, the repository of Perl modules. It has an interactive shell, which is used below. To make sure the installation has the permissions it needs, call CPAN with sudo (from any directory).

$ sudo cpan

If this is the first time you are running CPAN, answer yes to all the configuration questions. Then install the required modules mentioned above with the install command:


cpan$ install File::Path

cpan$ install Math::Round

cpan$ install List::Compare

In Mac OS you need to follow these steps for GD:

GD is the last module and usually the most laborious to install. Ideally you should have all possible library support for GD, which means installing additional libraries. The most common dependencies are shown below; after installing them, check whether anything else is needed. GD usually requires libjpeg and libpng, and sometimes also FreeType, zlib and LibGD if they are not already available. Start by creating a working directory for the downloads:

$ mkdir tempDir

$ cd tempDir

Install libjpeg:

tempDir$ curl -O http://www.ijg.org/files/jpegsrc.v8d.tar.gz

tempDir$ tar -xzvf jpegsrc.v8d.tar.gz

tempDir$ cd jpeg-8d

tempDir/jpeg-8d$ ./configure

tempDir/jpeg-8d$ make

tempDir/jpeg-8d$ sudo make install

Install libpng:

tempDir/jpeg-8d$ cd ..

tempDir$ curl -O ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.2.tar.gz

tempDir$ tar -xzvf libpng-1.6.2.tar.gz

tempDir$ cd libpng-1.6.2

tempDir/libpng-1.6.2$ ./configure

tempDir/libpng-1.6.2$ make

tempDir/libpng-1.6.2$ sudo make install

Hopefully, you now have all the required dependencies. You can download GD and check whether the configuration you have so far is sufficient. You can download the GD library from its website; if GD's website is down, you can get the source from Bitbucket and use identical commands to install it:

tempDir$ curl -O https://bitbucket.org/libgd/gd-libgd/downloads/libgd-2.1.0-rc2.tar.gz

tempDir$ tar -xzvf libgd-2.1.0-rc2.tar.gz

tempDir$ cd libgd-2.1.0-rc2

tempDir/libgd-2.1.0-rc2/src$ ./configure

If everything goes well, the end of the configuration run shows something like this:

** Configuration summary for gd 2.1.0:

Support for PNG library: yes
Support for JPEG library: yes
Support for Freetype 2.x library: yes
Support for Fontconfig library: yes
Support for Xpm library: no
Support for pthreads: yes


Sometimes the summary will not show yes for the options listed above. In that case, check for FreeType (freetype-2.5.0.1) and Fontconfig on your OS; you would have to download, configure, make and install them, just like libpng and libjpeg. If everything is OK, then:

tempDir/pierrejoye-gd-libgd-5551f61978e3/src$ make

tempDir/pierrejoye-gd-libgd-5551f61978e3/src$ sudo make install

The last step is to install GD in Perl. Installing it through CPAN normally fails on OS X, but you can try it, as it may work in your case:

$ perl -MCPAN -e 'install GD'

If that is not successful, you need to install it manually. Go to the CPAN website, download the latest Perl GD implementation, and install it with commands similar to those above. If curl fails, copy and paste the URL into your browser.

tempDir$ curl -O http://www.cpan.org/authors/id/L/LD/LDS/GD-2.49.tar.gz

tempDir$ tar -xzvf GD-2.49.tar.gz

tempDir$ cd GD-2.49

tempDir/GD-2.49$ perl Makefile.PL

tempDir/GD-2.49$ make

tempDir/GD-2.49$ sudo make install

Done. You can now check your program; hopefully it will work perfectly.

Another easy way to get many of these libraries is a program called Fink, which is similar in nature to the CPAN installer but installs common GNU utilities. Fink is available from <http://sourceforge.net/projects/fink/>.

In Linux you need to follow these steps:

sudo apt-get install libgd2-xpm-dev

sudo cpan GD

sudo cpan GD::Graph

If you need to install an additional module, search the Perl packages with

sudo apt-cache search perl <Module Name>

If your module is available, you will get the name of its package distribution, and you can then install the module under that name by following the same steps mentioned above.

In Windows follow these steps:

C:\Documents and Settings\Owner>ppm

PPM interactive shell (2.2.0) - type 'help' for available commands.

PPM> install GD

Install package 'GD?' (y/N): y

Installing package 'GD'...

Downloading http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW. ...

Installing C:\Perl\site\lib\auto\GD\GD.bs

Installing C:\Perl\site\lib\auto\GD\GD.dll

Installing C:\Perl\site\lib\auto\GD\GD.exp

Installing C:\Perl\site\lib\auto\GD\GD.lib

Installing C:\Perl\html\site\lib\GD.html

Installing C:\Perl\site\lib\GD.pm

Installing C:\Perl\site\lib\qd.pl


Installing C:\Perl\site\lib\auto\GD\autosplit.ix

PPM>

You can download all the above-mentioned modules using CPAN or install them manually. Perl module installation instructions are available at http://www.cpan.org/modules/INSTALL.html. Manual installation needs to follow the rules described in Section 1.3.1.

If you are not very familiar with the command line, you can use the Perl Package Manager (PPM) graphical interface to install the GD module. If you are using ActiveState Perl, type ppm at the command prompt; it will open a graphical interface for Perl module installation. Search for GD and GD-Graph and install them by clicking the install button (Fig. 1.3.1).

Fig. 1.3.1: Perl package manager interface to install GD, GD-graph and other supporting Perl modules.

Some modules are not available as PPM packages or have reported errors on the PPM build system; this does not mean they do not work. You can use CPAN in ActiveState Perl, for example cpan <Module name>. Users can follow either of the above-mentioned methods (PPM or CPAN) to install any supporting Perl modules. More detail is available at http://www.activestate.com/blog/2010/10/how-install-cpan-modules-activeperl.

1.3.1 Install Perl Modules Manually

1.3.1.1 Download Perl module

Go to the CPAN website (http://www.cpan.org/) and search for the module that you wish to download. In this example, we search for, download and install the GD::Graph Perl module. We have downloaded GDGraph-1.47.tar.gz to /home/download.

> cd /home/download

> gzip -d GDGraph-1.47.tar.gz

> tar xvf GDGraph-1.47.tar

> cd GDGraph-1.47

1.3.1.2 Build the perl module

> perl Makefile.PL

Checking if your kit is complete...

Looks good

Writing Makefile for GDGraph-1.47.tar.gz

Writing Makefile for GDGraph-1.47.tar.gz

> make

> make test

1.3.1.3 Install the perl module

> make install

Typically, Perl modules depend on several other modules. Chasing all these dependencies one by one can be a painful and annoying task, so the manual method is not recommended, although it is useful when working offline. We recommend the CPAN method of installation, which downloads all dependencies automatically. The CPAN method can be used only if the system/server is connected to the Internet.

1.4 GenBank Taxonomy Database

The EBA program uses the GenBank taxonomy database to classify the studied species and convert them into the evolutionary breakpoint classification format (see 2.2.2). Therefore, the user should download the latest version of the GenBank taxonomy database from the NCBI server. The compressed dataset is located at ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz. It needs to be uncompressed into a taxdump folder, and that folder moved to the EBA root. The archive unpacks its files into the current directory, so the taxdump folder is created first:

> mkdir taxdump

> tar -zxvf taxdump.tar.gz -C taxdump

> mv -i taxdump /EBA

By doing all these steps, the installation of EBA is concluded and the package is ready to use (See Chapter

2 for instructions on how to use the script EBA.pl).

1.5 Sample Data

Since you may be new to EBA, we have provided sample input files, arranged in the expected folder structure, in the SAMPLE_DATA folder for testing purposes. You can inspect both the sample files and the folder structure there.

To test the installation, run the following command on the command line:

perl EBA.pl -n 6 -d SAMPLE_DATA -r Gallus_gallus -t 20 -p 100

It will generate all the intermediate files, and the final results in the ResultFiles folder. If you get any error running EBA on the test data, please contact us at [email protected] or [email protected]


Chapter 2

How to use EBA

2.1 The script EBA.pl

The evolutionary breakpoint framework, EBA, contains one main script called EBA.pl. This script coordinates the execution of the whole breakpoint identification and classification process. The following sections describe the mandatory and optional flags that can be used when calling the program.

The script EBA.pl has the following command line:

perl EBA.pl [-n | --number] <number> [-d | --directory] <dir> [-r | --reference] <refName> [-p | --prime] <prime> [-t | --threshold] <number>

It has mandatory parameters and also a set of optional parameters.

2.1.1 Mandatory parameters

[-n | --number] <number>: The total number of species the user is going to work with, not counting the reference species. For example, if the user is working with 7 species (including the reference species), they need to provide --number 6 or -n 6.

[-d | --directory] <dir>: All the homologous synteny block (HSB) files are expected to be in a single data directory (see 2.3). The data directory can have any name, so users can name it according to their requirements.

[-r | --reference] <refName>: The scientific name of the reference species. The first and second parts of the species name should be separated by a space or an underscore.

[-p | --prime] <prime>: The name of the primary resolution. The name of the resolution should be numeric (see 2.3).

[-t | --threshold] <number>: In order to calculate breakpoint reuse amongst species, the user needs to provide a threshold value for breakpoint filtration. The value is usually 20, but it may vary.

2.1.2 Optional parameters

[Optional] Add further options if required, or to override the default behaviour.

[-b | --beta]: The beta score is required to calculate the final classification scores. The EBA program can calculate the beta score for each species automatically, but users can also provide their own beta scores if they are not satisfied with EBA's values. Providing your own scores is useful when there are not enough resolutions to calculate the beta score. To have EBA calculate the beta score, provide the -b or --beta option. If the option is not provided, the user should supply beta scores in the "BetaScore" file at root (see 2.2.4).

[-c | --classify]: The phylogenetic relationship of all the species should be provided in evolutionary breakpoint format in the "classification.eba" file at root, if not calculated using the EBA tool (see 2.2.2 for the file format). If the user relies on the GenBank-based taxonomy classification, they can classify the species without using the -c or --classify option.

[-i | --increase] <number>: Sometimes the user needs to increase the size of the breakpoints in order to calculate breakpoint overlaps. The increment parameter is handy in such cases. It increases the breakpoint sizes by <number> on both sides. For example, to increase the breakpoint size by 1000 bp in total, provide -i 500 or --increase 500.

[-l | --lineage]: If the user is not interested in lineage-specific breakpoints, they can provide the -l or --lineage flag on the command line, which excludes the lineage from the final table.

[-x | --exclude]: If the user is not interested in classification groups that are defined by only one species, they can exclude those groups using the -x or --exclude parameter.

[-k | --keep]: The EBA tool produces intermediate files that can be used for later verification. If the user is interested in them, they can be kept with -k or --keep.

[-v | --validate]: If you are using EBA for the first time, it is recommended to keep the validation flag on with -v or --validate. Turning it off saves some time.

[-e | --engrave]: Use this flag to print the breakpoint coordinates of each target in the final classification file. By default EBA prints only the EBR classification with the corresponding coordinates.

[-logfile]<filename>: The user can write all the command line messages to a single file with the -logfile flag and a file name. The file will have a .log extension and can be found in the EBA root directory.
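The two numeric conventions above (-n excludes the reference species, and -i is half of the total extension because it is applied to both sides) are easy to get wrong. As a hedged illustration only, the shell function below assembles and prints a command line from more natural inputs; it does not run EBA. The function name eba_command and its argument order are hypothetical, and only flags described in this chapter are used.

```shell
# Print (do not run) an EBA.pl command line.
# $1: total number of species, including the reference
# $2: data directory            $3: reference species name
# $4: prime resolution          $5: threshold
# $6: total breakpoint extension in bp (optional, 0 or empty for none)
eba_command() {
    targets=$(( $1 - 1 ))                 # -n does not count the reference
    cmd="perl EBA.pl -n $targets -d $2 -r $3 -p $4 -t $5"
    if [ "${6:-0}" -gt 0 ]; then
        cmd="$cmd -i $(( $6 / 2 ))"       # -i is added on both sides
    fi
    echo "$cmd"
}

# Example: 7 species in total, breakpoints extended by 1000 bp overall.
# eba_command 7 SAMPLE_DATA Gallus_gallus 100 20 1000
```

For the example call, the printed command uses -n 6 and -i 500, matching the examples given for those two parameters.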

2.2 Input file formats

The input file format is a crucial part of any pipeline, so we have tried to keep the input file formats as simple as possible. This section explains all the input file formats accepted by the EBA framework.

2.2.1 Homologous Synteny Blocks (HSB) files

The HSB files contain information about valid homologous synteny blocks between the reference (RE) and target (TA) genomes. The HSB file is a TSV (Tab Separated Values) file that holds the RE and TA chromosomes, block coordinates, orientations and assembly status. The file should not contain a header, and only one tab-separated HSB is allowed per row. Each HSB pair is defined by the following columns:

Name       Definition
Column1    Reference (RE) name
Column2    Reference (RE) chromosome number
Column3    Start coordinate of the RE HSB
Column4    End coordinate of the RE HSB
Column5    Chromosome name of the target (TA) species
Column6    Start coordinate of the TA HSB
Column7    End coordinate of the TA HSB
Column8    Strand of the block
Column9    Target (TA) species name
Column10   Target (TA) assembly status [Scaffolds or Chromosome]

The HSB file should contain real homologous synteny block information, in which the start coordinate of each HSB is smaller than its end coordinate. The HSBs for a single species must not overlap each other.


The TA species name should be a binomial (scientific) name: the first part of the name identifies the genus to which the species belongs, and the second part identifies the species within the genus. For example, humans belong to the genus Homo and, within this genus, to the species Homo sapiens. Homo sapiens should be written in HSB files as homo_sapiens. In other words, the genus and species names should be separated by an underscore (_). The TA species name is not case sensitive.

The EBA script checks the target genome assembly information by looking at the TA chromosome column (Column10) of the HSB file. If the target genome is not completely assembled into chromosomes, the user should enter "Scaffolds"; if the TA genome is well assembled and you rely on the assembly, enter "Chromosome". The value is not case sensitive, so it can be written in capital, small or mixed letters. Keep in mind that a genome can be assembled to chromosomes or to scaffolds, but a mixture of both values is not allowed within a single species' HSB file.
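The format rules above lend themselves to an automatic check before running EBA. The following is a minimal sketch using awk from a shell function; it is not part of EBA, the name check_hsb is hypothetical, and it tests only the rules stated in this section (ten tab-separated columns, start smaller than end, and a single consistent Scaffolds or Chromosome value). It does not test for overlapping blocks.

```shell
# Validate one HSB file; prints one message per problem found.
# Exit status is 0 when no problem is found, 1 otherwise.
check_hsb() {
    awk -F '\t' '
        NF != 10 {
            printf "line %d: expected 10 tab-separated columns, found %d\n", NR, NF
            bad = 1; next
        }
        $3 + 0 >= $4 + 0 { printf "line %d: reference start is not smaller than end\n", NR; bad = 1 }
        $6 + 0 >= $7 + 0 { printf "line %d: target start is not smaller than end\n", NR; bad = 1 }
        {
            status = tolower($10)          # the value is not case sensitive
            if (status != "scaffolds" && status != "chromosome") {
                printf "line %d: column 10 must be Scaffolds or Chromosome\n", NR
                bad = 1
            }
            seen[status] = 1
        }
        END {
            if (seen["scaffolds"] && seen["chromosome"]) {
                print "file mixes Scaffolds and Chromosome"
                bad = 1
            }
            exit bad
        }
    ' "$1"
}

# Example: check_hsb SAMPLE_DATA/100/pig_100_final.txt
```

The path in the example call is illustrative; substitute any HSB file from your own data directory.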

Additionally, the name of the HSB file for an individual species must follow this naming pattern:

<NameOfSpecies>_<ResolutionNumber>_<OtherInfo>.txt

For example, the 100 Kb resolution HSB file for cattle (reference) and pig (target) must be named pig_100_final.txt
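A quick way to catch misnamed files is to test each name against the pattern. The sketch below is an assumption about how strictly the pattern should be read: it requires a non-empty species part, a purely numeric resolution, a non-empty trailing part and the .txt extension. The function name is_hsb_filename is hypothetical.

```shell
# Succeed if the file name matches <NameOfSpecies>_<ResolutionNumber>_<OtherInfo>.txt
is_hsb_filename() {
    printf '%s\n' "$1" | grep -Eq '^.+_[0-9]+_.+\.txt$'
}

# Example: report badly named files in the current resolution folder.
# for f in *.txt; do is_hsb_filename "$f" || echo "unexpected name: $f"; done
```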

2.2.2 Classification file

The EBA framework needs phylogenetic classification information for all the species used in the current analysis. The EBA tool provides the option either to calculate the standard GenBank classification or to use your own classification. If the user decides to calculate the classification of the studied species using EBA, the tool automatically generates the classification file in the required format; otherwise the user needs to provide a classification file named "classification.eba" at the root of the program. The classification file has the following format:

Name          Definition
GroupName     Name of the order or phylogenetic node.
Separator     Equals sign (=)
SpeciesName   Names of all the species which constitute that group.

Note: If the GroupName is Lineage, the SpeciesName should be left blank. Comments can be added to the classification file; they should start with a hash sign (#).

2.2.3 Chromosome size file

The EBA tool uses a chromosome-based approach to calculate classification scores. To do so, the tool needs the chromosome sizes of the reference (RE) genome. The RE chromosome size file is a tab-separated two-column file, in which the first column contains the RE chromosome name and the second column contains its maximum size. The following table summarizes the file format:

Name       Definition
Column1    The reference chromosome name
Column2    Maximum size of the reference chromosome
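One useful sanity check is to confirm that no reference block in an HSB file extends past the size recorded for its chromosome. The sketch below is not part of EBA; it assumes that the chromosome names in Column1 of the size file are written exactly as in Column2 of the HSB file, and the function name check_chr_sizes is hypothetical.

```shell
# Compare an HSB file against the reference chromosome size file.
# $1: chromosome size file (name<TAB>size)   $2: HSB file
check_chr_sizes() {
    awk -F '\t' '
        NR == FNR { size[$1] = $2 + 0; next }   # first file: load the sizes
        !($2 in size) {
            printf "%s line %d: chromosome %s has no size entry\n", FILENAME, FNR, $2
            next
        }
        $4 + 0 > size[$2] {
            printf "%s line %d: end %d exceeds size of chromosome %s\n", FILENAME, FNR, $4, $2
        }
    ' "$1" "$2"
}

# Example: check_chr_sizes chr_size.txt SAMPLE_DATA/100/pig_100_final.txt
```

No output means every reference block fits inside its chromosome.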


2.2.4 BetaScore file

The beta score is the probability of a missed breakpoint at a given resolution. This score is calculated only when at least three or four resolution datasets are available during the analysis. However, users can run the EBA tool on fewer resolutions by providing their own estimated beta score. In our analyses, we generally found an average score of 0.02 for missed breakpoints in each species. The BetaScore file has the following format:

Name       Definition
Column1    Resolution number of the target species.
Column2    Target species scientific name; the first and second parts should be separated by an underscore (_).
Column3    Beta score of the species.

Note: The resolution and the TA name are not separated by a tab; they are joined by a colon (:).

The beta scores for all studied resolutions and species are plotted in the root EBA directory for visual analysis. The beta score graph is saved as "betaScore.gif".
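When too few resolutions are available, a BetaScore file can be generated rather than typed by hand. The sketch below writes one line per species and resolution in the resolution:species form described in the note. It is an illustration only: it assumes the score column is separated from the joined field by a tab, and the function name write_betascore is hypothetical.

```shell
# Print BetaScore lines with the same score for every species and resolution.
# $1: beta score   $2: space-separated list of resolutions   rest: species names
write_betascore() {
    score=$1
    resolutions=$2
    shift 2
    for species in "$@"; do
        for res in $resolutions; do
            # resolution and species are joined by a colon, then the score
            printf '%s:%s\t%s\n' "$res" "$species" "$score"
        done
    done
}

# Example: write_betascore 0.02 "100 300" sus_scrofa homo_sapiens > BetaScore
```

The value 0.02 in the example is the average reported above; replace it with scores appropriate to your own data.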

2.3 Folder setting

The data folder, which contains all the HSBs files, should be organized in proper way to recognized by

the EBA script. Following are the set of rules that user need to keep in mind before running on new

dataset.

2.3.1 Resolution Folders

The resolution folders name must be number, and named after resolution at which this is analysed. For

example, if user is working with four resolutions namely 50, 100, 300, and 500 Kb then user need to

create the directory name by its resolution number and store it’s all HSBs files.

2.3.2 Location of all resolutions folders

All resolution folders (named after the studied resolutions) must be kept inside the main input data directory (project folder). This is mandatory because the EBA script looks for the HSB files of each resolution in separate folders. EBA infers the resolution of the HSB files from the name of their folder.
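
The folder rules above can be checked before a run with a short script. This is only a sketch: the function name `find_resolution_dirs` is hypothetical, and EBA's own folder detection may differ. The example builds a temporary project folder so that it can be run anywhere.

```python
import os
import tempfile


def find_resolution_dirs(data_dir):
    """Return the numeric sub-folder names of data_dir, sorted by resolution."""
    names = [
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name)) and name.isdigit()
    ]
    return sorted(names, key=int)


# Build a throw-away project folder with four resolution folders
# and one stray, non-numeric folder that should be ignored.
with tempfile.TemporaryDirectory() as project:
    for name in ["50", "100", "300", "500", "notes"]:
        os.mkdir(os.path.join(project, name))
    resolutions = find_resolution_dirs(project)

print(resolutions)
```

Any folder missing from the returned list is either not numeric or not located directly inside the project folder.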


Fig. 3.0: The folder hierarchy for the EBA tool. Resolution folder names should be numeric and named after the resolution. Only three resolution folders are shown in the image, but users can add any number of resolution folders. The executable script EBA.pl should be at the root, together with mandatory files such as “chr_size.txt” and “classification.eba”. For classification, users need to unzip the taxonomy database at the root.

Chapter 3

Result and Intermediate Files

3.1 Intermediate files and folder

The EBA tool writes all output inside each resolution directory by creating the following mandatory directories:

EBA/DataDirectory/ResolutionDirectory/EBA_OutFiles

EBA/DataDirectory/ResolutionDirectory/EBA_ImageFiles

EBA/DataDirectory/ResolutionDirectory/VizEHFiles

EBA/DataDirectory/ResolutionDirectory/ResultFiles

This section describes each of these outputs.

3.1.1 EBA_OutFiles Directory

The EBA_OutFiles directory contains all the intermediate files generated during breakpoint analysis. Each file name ends with an .eba[N] extension and follows this pattern:

<NameOfSpecies>_*.eba[N]


where [N] is a number indicating the order of the intermediate file, starting at 0 and ending at 8.

Note: The EBA_OutFiles folder is deleted automatically when the EBA program completes. To keep the intermediate folders and files, users need to add the -k or --keep flag when running the program.
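
When intermediate files are kept, the .eba[N] suffix can be used to list them in processing order. The sketch below only reads the trailing number. The file names in the example are hypothetical, because the middle part of the real names (the * in the pattern) is not specified here.

```python
import re

# Matches the trailing ".eba0" ... ".eba8" suffix of an intermediate file.
EBA_SUFFIX = re.compile(r"\.eba([0-8])$")


def intermediate_order(filename):
    """Return the order number N of an .eba[N] file, or None for other files."""
    match = EBA_SUFFIX.search(filename)
    return int(match.group(1)) if match else None


# Hypothetical directory listing; "x" stands in for the unspecified name part.
files = ["Mus_musculus_x.eba3", "Mus_musculus_x.eba0", "notes.txt"]
ordered = sorted(
    (name for name in files if intermediate_order(name) is not None),
    key=intermediate_order,
)
print(ordered)
```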

3.1.2 EBA_ImageFiles Directory

After breakpoint identification and analysis against all studied species, the script generates image and data files for future use. These images and their data files are written to the EBA_ImageFiles directory and follow this naming pattern:

NameOfImage.gif

NameOfImage.data

Each image file has one corresponding data file, which has the same name with a .data extension.

The following image files are present in the EBA_ImageFiles directory:

EBR_classification_fractions_<resolution name>.gif and .data

EBR_density_chromosomes_<resolution name>.gif and .data

Fraction_EBRs_and_gaps_<resolution name>.gif and .data

Number_unique_resuse_EBRs_<resolution name>.gif and .data

Pie_chart_classification_EBRs_<resolution name>.gif and .data

Pie_chart_unclassified_EBRs_<resolution name>.gif and .data

The *.data files contain the underlying data, so users can redraw the same images in a different style with any other program.
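
Because every image is expected to have a matching data file, a quick consistency check of a directory listing can be sketched as follows. The function name `unpaired_images` is hypothetical, and the resolution name 100 in the sample file names is an assumption made for illustration.

```python
import os


def unpaired_images(filenames):
    """Return .gif names that have no .data file with the same base name."""
    stems = {os.path.splitext(name)[0] for name in filenames if name.endswith(".data")}
    return sorted(
        name for name in filenames
        if name.endswith(".gif") and os.path.splitext(name)[0] not in stems
    )


# Hypothetical listing of an EBA_ImageFiles directory for resolution 100.
listing = [
    "EBR_density_chromosomes_100.gif",
    "EBR_density_chromosomes_100.data",
    "Pie_chart_classification_EBRs_100.gif",
]
missing = unpaired_images(listing)
print(missing)
```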

3.1.2.1 Cumulated Bar Graph

This cumulative bar graph shows the real and pseudo breakpoints in each studied species. The definitions and rules for real and pseudo breakpoints are given in the supplementary section of the paper.


Figure 3.1: Example cumulative bar graph for real and pseudo breakpoint data. The red bar identifies the number of real breakpoints. The stacked red bar indicates the pseudo breakpoints detected in that species.

3.1.2.2 Cumulated Stacked Bar Final Graph

The final stacked bar graph represents the total number of EBRs identified for a single classification group or species.


Figure 3.2: The red bar represents the total number of EBRs found to be reused in that lineage or order, whereas the green stacked bar denotes the unique EBRs in each species or group. The graph is sorted by species and group name.

3.1.2.3 Final classification line graph

The final classifications for each resolution are represented as a line graph, in which the breakpoints of each classification group and lineage are plotted against the reference chromosomes. The line graph is saved in the EBA_ImageFiles folder as “EBR_density_chromosomes_<resolution name>.gif”.

3.1.3 ResultFiles Directory

The ResultFiles directory contains the final result files of the breakpoint analysis. The results are stored in two files:

Result_<resolution name>.final and ResultReuse_<resolution name>.final


3.1.3.1 Result_<resolution name>.final file format

This file contains the final results of breakpoint identification and classification for all breakpoints. It is a TSV file that holds the coordinates, scores, species, and decisions for all breakpoints. It has the following columns:

Name	Definition

+Chromosome	Name of the reference (RE) chromosome.

+Widest EBR interval start(bp)	Start coordinate of the widest possible breakpoint region that overlaps with all studied species.

+Widest EBR interval end(bp)	End coordinate of the widest possible breakpoint region that overlaps with all studied species.

+Narrowest EBR interval start(bp)	Start coordinate of the narrowest breakpoint region, calculated from all the narrowest overlapping breakpoint regions.

+Narrowest EBR interval end(bp)	End coordinate of the narrowest breakpoint region, calculated from all the narrowest overlapping breakpoint regions.

+Species containing the EBR	Scientific names of all species overlapping the reference breakpoint region. The species names are separated by a colon (:).

+Classification group:assignment score	Classification scores for all possible order and lineage groups. Each order/lineage name and its score are separated by a colon (:), and the classification groups are separated from one another by spaces.

+Highest probability classification	The most probable breakpoint classification name with its corresponding score.

+Second highest probability classification	The second-best breakpoint classification and its score, separated by a colon (:).

+Ratio between the first and second probabilities	Ratio between the highest and second-highest breakpoint classification scores.

+No. species in the dataset	Total number of studied species.

+No. species with gap	Number of gaps overlapping the breakpoint region.

+No. species with the EBR	Total number of breakpoints overlapping the specific RE breakpoint region.

+Percentage informative species	Percentage of species used in the classification.

+Error probability	The error probability for each classification.

Note: Species names may appear in this file if users apply the -e | --engrave flag on the command line when running the program. The species names are dynamic and depend on the number of species used.
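
The “Classification group:assignment score” column packs several name:score pairs into one field. The sketch below shows one way to unpack such a field and to recompute the ratio between the two best scores. The group names and scores are invented for illustration, and computing the ratio as best score divided by second-best score is an assumption about how that column is defined.

```python
def parse_group_scores(field):
    """Turn 'GroupA:0.8 GroupB:0.1' into {'GroupA': 0.8, 'GroupB': 0.1}."""
    scores = {}
    for item in field.split():
        # Each group name and its score are separated by a colon.
        name, score = item.rsplit(":", 1)
        scores[name] = float(score)
    return scores


def top_two_ratio(scores):
    """Return (best group, second-best group, ratio of their scores).

    Needs at least two groups; a zero second-best score gives an infinite ratio.
    """
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[:2]
    ratio = best_score / second_score if second_score else float("inf")
    return best, second, ratio


# Hypothetical content of one "Classification group:assignment score" field.
scores = parse_group_scores("Rodentia:0.8 Primates:0.1 Eutheria:0.05")
result = top_two_ratio(scores)
print(result)
```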

3.1.3.2 ResultReuse_<res>.final file format

The occurrences of reused breakpoints among species are calculated from the final classification file. The reuse file stores the final classification, the reuse, uncertain, or unique status of each breakpoint, and the associated scores. The format of the reuse file is as follows:

Name	Definition

+Reference genome name	Name of the reference species. It can be modified according to the user's requirements.

+Reference chromosome	Chromosome name of the reference species.

+Final classification	Final classification name of the RE breakpoint region.

+Narrowest EBR interval start(bp)	Start coordinate of the reference breakpoint.

+Narrowest EBR interval end(bp)	End coordinate of the reference breakpoint.

+Classification	The breakpoint classification decision, which can be Unique, Reuse, or Uncertain.

+Scores	Classification score for the most probable breakpoint.

+Ratio between the first and second classification probabilities	Ratio between the first and second-best classification scores for a single breakpoint region.

+No. species in the dataset	Total number of species used in breakpoint classification and analysis.

+No. species with gap	Total number of gap regions.

+No. species with the EBR	Number of breakpoints overlapping the specific RE breakpoint region.

+Percentage informative species	Total percentage of species used to calculate the breakpoint scores.

+Ratio between the first and second classification probabilities adjusted for reuse EBRs	New ratios calculated for reuse cases using a certain set of rules.

Note: The reuse, unique, and uncertain categories are discussed in the supplementary information file.
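
A common first look at the reuse file is to count how many EBRs fall into each decision category. The sketch below assumes a tab-separated file without a header row and with the columns in the order listed above, so that the classification decision is the sixth column. The sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

CLASSIFICATION_COLUMN = 5  # sixth column in the table above (0-based index)


def count_decisions(handle):
    """Count Unique / Reuse / Uncertain decisions in a reuse file handle."""
    reader = csv.reader(handle, delimiter="\t")
    return Counter(
        row[CLASSIFICATION_COLUMN]
        for row in reader
        if len(row) > CLASSIFICATION_COLUMN  # skip blank or short lines
    )


# Hypothetical rows; only the first seven columns are filled in.
sample = io.StringIO(
    "Homo_sapiens\tchr1\tRodentia\t1000\t2000\tUnique\t0.9\n"
    "Homo_sapiens\tchr1\tPrimates\t5000\t6500\tReuse\t0.7\n"
    "Homo_sapiens\tchr2\tRodentia\t300\t900\tUnique\t0.8\n"
)
counts = count_decisions(sample)
print(counts["Unique"], counts["Reuse"], counts["Uncertain"])
```

If the real file starts with a header row, that row would be counted as a category of its own and should be skipped first.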

3.1.4 Viz_EHFiles Directory

Evolution Highway (EH) friendly visualization files are generated for visual confirmation of the results. All files in this folder are named after their species, using the following pattern:

<species_name>_viz_table.viz

Each file contains the breakpoint overlap information of all species with respect to the reference genome. The *.viz files have the following format:

Name Definitions

+Column1 Scientific name of the target (TA) species.

+Column2 Chromosome name of the reference (RE) species.

+Column3 Number of species with overlapping breakpoints.

+Column4 RE start coordinates

+Column5 RE end coordinates

Note: These files are useful for looking through species-specific breakpoint regions and checking the overlapping species.
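
For scripted inspection, one line of a *.viz file can be read into a named record, as sketched below. The tab separator and the sample values are assumptions, and `VizRow` and `parse_viz_line` are illustrative names rather than part of EBA.

```python
from collections import namedtuple

# One record per line, following the five columns listed above.
VizRow = namedtuple("VizRow", "species chromosome n_overlapping start end")


def parse_viz_line(line):
    """Parse one tab-separated *.viz line into a VizRow."""
    species, chromosome, n_overlapping, start, end = line.rstrip("\n").split("\t")
    return VizRow(species, chromosome, int(n_overlapping), int(start), int(end))


# Hypothetical line from Mus_musculus_viz_table.viz.
row = parse_viz_line("Mus_musculus\tchr1\t3\t1500000\t1650000\n")
print(row.species, row.end - row.start)
```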


Chapter 4

Merging all resolutions

4.1 Merging resolution with prime

All studied resolutions are merged to create a new hypothetical resolution. The merge uses a user-defined prime resolution (see 2.1.1) and creates a “Result_Merge.final” file in the root folder. The EBA script starts from the prime resolution and adds new EBR information to the existing break regions by checking the upper and lower resolutions. New EBR information (break coordinates and gap information) is added only if the specific break region is present in at least two different resolutions. The format of the “Result_Merge.final” file is the same as that of the Result_<resolution name>.final file (see 3.1.3.1).
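
The “present in at least two different resolutions” condition can be illustrated with a simplified sketch. It is not EBA's merging code: it only tests, for each region of the prime resolution, how many resolutions contain an overlapping region on the same chromosome. The data structure, the function names, and the coordinates are all assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def supported_regions(by_resolution, prime, min_resolutions=2):
    """Keep prime-resolution regions seen in at least min_resolutions resolutions."""
    kept = []
    for region in by_resolution[prime]:
        # Count the resolutions (the prime one included) with an overlapping region.
        support = sum(
            any(overlaps(region, other) for other in regions)
            for regions in by_resolution.values()
        )
        if support >= min_resolutions:
            kept.append(region)
    return kept


# Hypothetical break regions per resolution: (chromosome, start, end).
ebrs = {
    100: [("chr1", 1000, 2000), ("chr2", 500, 900)],
    300: [("chr1", 1500, 2600)],
    500: [("chr3", 100, 400)],
}
kept = supported_regions(ebrs, prime=100)
print(kept)
```

In this example the chr1 region is supported by resolutions 100 and 300 and is kept, while the chr2 region appears only at resolution 100 and is dropped.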

4.2 Final merged reuse

A final “ResultResue.final” file is created, which contains the unique, uncertain, and reuse breakpoint information. Its format is the same as that of the ResultReuse_<res>.final files (see 3.1.3.2).

4.3 Final merged graph

The merge folder also contains the same kinds of graphs as each resolution, with one additional graph named “Unique_reuse_EBRs_all_resolutions.gif”. This graph shows the final data with all breakpoint classification information.


Figure 4.1: The red bar represents the total number of EBRs found to be reused in a specific lineage or order, whereas the green stacked bar denotes the unique EBRs in each species or group. The graph is sorted by species and group name.

4.4 Final merged classification pie chart

The final EBR classification is plotted as a pie chart named “EBR_classification_fractions.gif”, in which uncertain EBRs are shown in red, reused EBRs in green, and unique EBRs in cyan.

4.5.3 Final merged breakpoint line graph

The final breakpoint classification data are represented as a line graph, which shows the number of breakpoints per chromosome for a specific classification group or lineage. The image file is named “EBR_denisity_chromosomes.gif”.


4.5 Other graphs

4.5.1 Breakpoint line graph

The number of breakpoints in each species for all resolutions is plotted as a line graph, which displays the number of breakpoints in each species with respect to the reference chromosomes.

Figure 4.2: The number of breakpoint occurrences in each species, represented as a line graph. The Y axis represents the rate of breakpoint occurrences irrespective of chromosome size, and the X axis shows the names of the reference chromosomes.

4.5.2 Breakpoint histogram

The breakpoint numbers are compared across all studied resolutions, which is useful for a comparative visual analysis of the breakpoints in each species at the studied resolutions. This histogram is named “No_unclassified_EBRs_per_resolution.gif”.


Figure 4.3: The number of breakpoints present in each species at all studied resolutions (in the above case, three resolutions). The Y axis represents the number of breakpoints, whereas the X axis represents the studied resolutions.

Chapter 5

Points to remember

Some important points about the EBA framework:

1. The list of homologous synteny blocks (HSBs) for a single species should not be duplicated in the HSB files. In addition, the end coordinate of each HSB should always be greater than its start coordinate.

2. Double-check the GenBank taxonomy classification file, as the taxonomy of some species is still doubtful and needs improvement or manual curation. Moreover, users who modify the “classification.eba” file should type species names carefully, because any duplication or typo can cause errors in the final EBR classification.

3. If users do not have enough resolutions to calculate the beta score with the EBA script, they are advised to use the default score of 0.02 in the betaScore file, which means that there is a 2% chance of a breakpoint being missed in a given species (see 2.2.4).

4. The EBA script uses a user-defined threshold value [-t | --threshold] (see 2.1.2) to calculate the reuse breakpoints among species and groups. We strongly recommend a threshold value of 20, but users are advised to check for a suitable threshold value themselves according to their input data and species taxonomy.

5. Species names should always be scientific names when GenBank taxonomy data are used for EBA classification.


6. Mandatory parameters:

[-n | --number] <number> Number of species to be compared.

[-d | --directory] <dir> Name of the directory containing the list of HSBs.

[-r | --reference] <refname> Name of the reference species [must be a scientific name].

[-p | --prime] <prime> Primary resolution name [the resolution name should be numeric].

[-t | --threshold] <number> Threshold value for reuse breakpoint filtration.

Try -h for more detail.
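
When EBA is launched from another script, the mandatory parameters can be assembled into an argument list first. The sketch below only builds and prints the command; it does not run it. Invoking the script as perl EBA.pl is an assumption, the helper name `build_eba_command` is hypothetical, and the example values (five species, a folder called DataDirectory, Homo_sapiens as reference, prime resolution 100, threshold 20) are illustrative.

```python
def build_eba_command(n_species, data_dir, reference, prime, threshold):
    """Assemble the mandatory EBA options as an argument list."""
    return [
        "perl", "EBA.pl",          # assumed way of starting the script
        "-n", str(n_species),      # number of species to compare
        "-d", data_dir,            # directory containing the HSB lists
        "-r", reference,           # reference species (scientific name)
        "-p", str(prime),          # prime resolution (numeric)
        "-t", str(threshold),      # threshold for reuse breakpoint filtration
    ]


command = build_eba_command(5, "DataDirectory", "Homo_sapiens", 100, 20)
print(" ".join(command))
```

Keeping the arguments as a list avoids quoting problems if the list is later passed to a process launcher such as Python's subprocess module.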