GPRO manual - gydb.org/documents/gpro_manual.pdf

GPRO MANUAL

INDEX

1 - Introduction and Overview

1.1 - NGS bioinformatic analysis

1.2 - About GPRO

1.3 - Current version and further implementations

2 - Acquiring and Installing GPRO

2.1 - Copyright and License

2.2 - System requirements

2.3 - Installation

2.4 - Warranty and technical support

3 - Software Layout and Computing Cluster

3.1 - Workbench and GUIs

3.2 - Computing cluster overview

4 - Mouse functions and tricks

4.1 - How to deal with GPRO depending on the GUI

4.2 - At directory space

4.3 - About the database editor and fasta explorer

4.4 - Dealing with the worksheet

4.5 - FTP explorer and pipeline GUIs

4.6 - Working with management and data mining utilities

5 - The Menu

5.1 - Tools and functions

5.1.1 - Databases

5.1.2 - Directory

5.1.3 - Editor

5.1.4 - Data preprocessing

5.1.5 - Functional analyses

5.1.6 - Alignment analyses

5.1.7 - Management

5.1.8 - Preferences

5.1.9 - Help

6 - Menu Tools: Editor

6.1 - Sequence editing

6.2 - Database editor

6.2.1 - Menu

6.2.1.1 - File

6.2.1.2 - Search and replace fasta labels

6.2.1.3 - Export and Remove seqs

6.2.1.4 - Find Motifs

6.2.1.5 - Undo

6.2.2 - Fasta Explorer

6.3 - TIME sequence editor

6.3.1 - Menu

6.3.1.1 - File

6.3.1.2 - Edit

6.3.1.3 - Translate

6.3.1.4 - Geometry

6.3.1.5 - Orientation

6.3.1.6 - Find ORFs

6.3.1.7 - Find motifs

7 - Menu Tools: Data preprocessing

7.1 - Data preprocessing

7.2 - Menu

7.2.1 - Converters

7.2.2 - Private user tools

7.2.3 - Processing and cleaning

7.2.4 - Quality analyses

7.3 - How to proceed with the preprocessing interface

8 - Menu Tools: Functional analyses

8.1 - Functional Analyses overview

8.2 - BLAST and HMM searches

8.2.1 - Format databases

8.2.2 - BLAST analyses

8.2.3 - HMM analyses

8.2.4 - Process BLAST outputs

8.2.5 - Process HMM outputs

8.3 - InterproScan

8.3.1 - Running Interproscan via GPRO

8.3.2 - Processing INTERPROSCAN outputs

8.4 - Augustus

8.4.1 - Running Augustus via GPRO

9 - Menu Tools: Alignment Analyses

9.1 - Sequence logos

9.2 - HMMs and consensus sequences

10 - Menu Tools: Management

10.1 - Management overview

10.2 - Files and folders

10.2.1 - Join folders

10.2.2 - Join files

10.2.3 - Split files

10.3 - Find sequences

10.4 - Alignments

10.5 - Join alignments

10.6 - Format alignments

11 - Menu Tools: Preferences

11.1 - Pipeline connection settings

11.2 - Worksheet preferences

11.3 - Evidence code weights

11.4 - Activate/License software

11.5 - Proxy connection settings

12 - Menu Tools: Help

12.1 - About GPRO

12.2 - Manual

12.3 - Report suggestions and bugs

12.4 - Check updates

13 - Worksheet annotation system ( I )

13.1 - Managing annotations for downstream analyses: Multihit and Single hit CSV files

13.2 - Worksheet overview

13.3 - Worksheet menu: File Tab

13.3.1 - Open worksheet

13.3.2 - Save

13.3.3 - Save as

13.3.4 - Download GenBank Accession files

13.3.5 - Download sequences by GI

13.3.6 - Set as default worksheet

13.3.7 - Unset as default worksheet

13.3.8 - Import

13.3.8.1 - Append worksheet

13.3.8.2 - Combine worksheets

13.3.8.3 - Clusters

13.3.9 - Export

13.3.9.1 - Export worksheet and FASTA

13.3.9.2 - Export annotation file

13.3.10 - Export categories and clusters

13.3.11 - Show/Hide columns

13.4 - Worksheet menu: Edit Tab

13.4.1 - Search and replace

13.4.2 - Undo

13.5 - Worksheet menu: Sorting/Filtering

13.5.1 - Sort

13.5.2 - Filter by position

14 - Worksheet annotation system ( II )

14.1 - Worksheet menu: Annotation Tab

14.1.1 - GO annotation

14.1.1.1 - Append GO terms

14.1.1.2 - Evidence code weights

14.1.1.3 - Display graph

14.1.1.4 - GO depth statistics

14.1.2 - Append InterPro data

14.1.3 - Append COG/KOG terms

14.1.4 - Apply annotation colors

14.1.5 - Switch database IDs

14.2 - Worksheet menu: Select Tab

14.2.1 - Select sequences by key terms

14.2.2 - Select sequences by expect or statistics values

14.2.3 - Selecting set of sequences differentiated by colors

14.2.4 - Selecting sequences by multiple criteria

14.2.5 - Delete checked sequences from the Worksheet

14.3 - Worksheet menu: Associate database

14.3.1 - Associate fasta sequences to your annotation CSV file

14.3.2 - Remove association between your fasta file and the CSV

14.4 - Worksheet menu: Statistics

14.4.1 - Numerical data statistics

14.4.2 - Categorical data statistics

14.5 - Metabolic pathways

14.6 - Worksheet menu: Transcriptome post-processing

14.6.1 - Filter best isoform

14.6.2 - Sequence trimming

15 - Acknowledgements and citing

16 - References

1 - Introduction and Overview

1.1 - NGS bioinformatic analysis

Bioinformatic analysis of omic data from next generation sequencing (NGS) usually consists of four essential steps:

o Pre-processing of raw data, including quality analysis, cleaning, trimming, clipping and/or demultiplexing where applicable

o Alignment approaches, consisting of mapping against a reference or de novo assembly of reads into contigs, isotigs, scaffolds, etc.

o Analysis and annotation

o Post-processing and downstream analysis
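To make the first step more concrete, the following is a minimal sketch of 3'-end quality trimming of a single read. It is not the GPRO implementation (the preprocessing pipeline relies on third-party tools); the function names, the Phred+33 encoding and the quality threshold of 20 are assumptions chosen for illustration.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q (hypothetical threshold)."""
    scores = phred_scores(quality)
    end = len(scores)
    # Walk back from the 3' end until a base of acceptable quality is found.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# Example read: 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

Real trimming tools usually add further criteria (sliding windows, adapter clipping, minimum read length), which this sketch leaves out.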

In recent years, NGS analysis based on the steps summarized above has become one of the most important and productive topics in bioinformatics in terms of the design and development of databases and software. However, NGS analysis is anything but easy. First, conventional PCs are not recommended for NGS analyses because of the high computational requirements of dealing simultaneously with thousands or millions of sequences. Therefore, if you are involved in an NGS project you will probably need a workstation or a more powerful resource such as a computing server (this is the usual hardware). Second, data formats and software protocols normally vary depending on the NGS platform employed (Solexa-Illumina, Abi-Solid, Roche, Ion Torrent, etc.) and on the goals and background of the project, which can be whole de novo genome sequencing, target genome/exome mapping, transcriptomics/RNA-seq, metagenomics, Chip-Seq, etc. So, although bioinformaticians must have an essential background in biology (normally in genetics and molecular biology), they must also have expertise in informatics, operating systems (usually Unix/Linux) and command syntax. On the other hand, much has been done in terms of development, but there is still much to do. Indeed, one of the most interesting challenges for the future of bioinformatics is to bring versatility and automation to the whole NGS analysis, so that any bio-researcher or lab technician can easily manage complex protocols and pipelines with just the skill level of a regular PC user. With this aim, we designed and launched GPRO.

1.2 - About GPRO

GPRO is a "bioinformatic proprietary project" or "an integrative professional solution" in continuous development for the genetic analysis and management of NGS (and other sequence) data and databases. It consists of two components: a stand-alone installable software coupled online with an infrastructure of computing pipelines. The software is a Java application structured as an Eclipse-like workbench of utilities managed by a central menu that implements sequence and database editors; a worksheet system for annotation and functional analysis; a suite of tools for data mining and management of files and folders; an FTP client; and a friendly-to-use collection of interfaces for managing the pipelines on a remote server. All actions of the software are quite intuitive. You can launch an analysis by just selecting a file or a folder, dragging it to a box of options and then clicking. It is nevertheless recommended to read the manual before beginning to work with GPRO. The online component of GPRO is a package of pipelines enabling users to run intensive computational jobs in remote private sessions. These jobs may be BLAST or HMM searches, mapping/assembling runs, exome analysis with SNP/indel calling, gene prediction, mobilome annotation, GO annotation, RNA-seq, metagenome analysis, downstream analysis, etc. Some of these are not yet ready, because the development and availability of pipelines is a work in continuous progress in which we follow protocols based on well-known free-source tools, together with a collection of tailored scripts and graphic interfaces (available in the GPRO software menu) to automate the whole flow of data.

By default, the distinct GPRO pipelines are installed on the high-end computing servers of GyDB (Llorens et al. 2011), which gives the package its name: the term “GPRO” is the acronym of “Gypsy Database PROfessional”. The default package also includes, for each GPRO licensee, a user account on the computing server with hard-disk space and computing time, as well as an FTP client with which users can easily transfer files from their PCs to their user accounts. Of course, you can acquire and install the GPRO package on your own server or workstation, but bear in mind that the package does not include third-party software. We provide the scripts and interfaces for easily running the pipelines, but you must download and install all the free-source tools on which each pipeline flowchart is built. If you are not an expert bioinformatician, do not worry about this: we give technical support for all the steps needed for installing and running GPRO.

In summary, GPRO is an ideal tool for laboratory experts interested in excellent bioinformatics while keeping their computing skills at the level of ordinary software users, because it implements multiple NGS functions accessible through easy-to-handle menus and an intuitive layout organized in graphical interfaces and easy-to-use mouse actions. GPRO is also useful for bioinformatics departments and/or sequencing services interested in providing their users with an integrative tool to navigate and manage the results and databases derived from annotations and NGS projects.

1.3 - Current version and further implementations

The current GPRO version is 1.1. To manage GPRO you will probably need to be familiar with the most basic concepts of bioinformatics and computational biology. This wikisite constitutes the manual of GPRO and will be updated in parallel with new versions of GPRO. We will try to upload practical examples, videos, etc. If you are new to the subject, it would also be good to read some essential bibliography on the topic, such as the following references (Durbin et al. 2009; Higgs and Attwood 2005). On the main web site of GPRO you can purchase the tool, find a trial version, and/or find additional information. Bear in mind that the GPRO project is a self-sustaining initiative that we maintain and update without grants, just with the funding support of clients. This means that the way to get and use GPRO is by purchasing a license (for more details go to the download site). The good news, however, is that the project is constantly upgraded and updated according to the comments, needs and feedback provided by our users.

2 - Acquiring and Installing GPRO

2.1 - Copyright and License

GPRO is the intellectual property of Biotech Vana SL (Biotechvana). The software is protected under copyright and intellectual property laws (including international copyright treaties, and other intellectual property treaties). Licensing of GPRO is subject to a commercial and private source agreement that should be accepted during installation of the tool in your PC. This agreement allows unrestricted use of the tool for academic and industrial research and services (online or to third parties) but does not permit the sale, rent, re-distribution, reverse engineering, decompiling, disassembly, or otherwise translation or analysis of any source code of the software, underlying ideas, algorithms or programming by any means without explicit authorization from Biotechvana.

2.2 - System requirements

GPRO is a Java application that runs on personal computers (PCs) and workstations as standalone software. The program is distributed as an installer for Windows XP/Vista/7 (32 bit and 64 bit), a self-extracting disk image for Mac OS X 10.5 or later (64 bit), and a compressed tarball archive for Linux 2.6 kernel series or later (32 bit and 64 bit).

Microsoft Windows

Windows XP/Vista/7/8

Intel Pentium 4 1.5 GHz or Athlon XP 1500+ processor or higher

2 GB RAM minimum

Linux

Linux distributions

Intel Pentium 4 1.5 GHz or Athlon XP 1500+ processor or higher

2 GB RAM minimum

Apple Mac OS X

Mac OS X 10.5 or later

Intel Core Duo processor or higher

2 GB RAM minimum

All systems require Java 1.5 or later. The latest version can be downloaded from the following URL: http://java.com/download/index.jsp. Please note that for projects containing a large number of sequences, a minimum of 4 GB of RAM is recommended for optimum performance.

2.3 - Installation

Microsoft Windows: GPRO for Microsoft Windows systems is distributed with an executable installer. To install it, double-click on the installer and follow the instructions on screen. This process automatically generates desktop and start menu shortcuts. To uninstall GPRO, it is recommended to use the “Add/Remove Program” option in the Windows Control Panel. A dialog box will display a list of programs. Choose GPRO and then click “Add/Remove”.

Apple MacOS X: The MacOS X version is installable via an Apple disk image (DMG) file. Double click the file to mount it and then drag GPRO to your Applications folder. To uninstall, as with any other Mac OS X application, drag it to the Trash.

Linux distributions: Installation on Linux systems is carried out by uncompressing a gzipped tarball (.tar.gz archive) in your home directory (or any directory for which you have write access). This will create a directory named “GPRO”. Inside you will find, among others, an executable file that you can run to open GPRO. To uninstall GPRO you need to delete the directory created by the installation process.
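The Linux installation step can also be scripted. The following is a minimal sketch, assuming the downloaded archive is called gpro.tar.gz (a hypothetical file name) and that it unpacks into a single "GPRO" directory as described above; the helper name extract_archive is ours, not part of GPRO.

```python
import tarfile
from pathlib import Path


def extract_archive(archive: Path, destination: Path) -> list[str]:
    """Extract a gzipped tarball into destination and return its top-level entries."""
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        # Only extract archives obtained from a trusted source.
        tar.extractall(path=destination)
    return sorted({name.split("/")[0] for name in names})


# Usage (hypothetical archive name, extracted into the home directory):
# print(extract_archive(Path("gpro.tar.gz"), Path.home()))
```

Uninstalling then amounts to deleting the directory that was created, as stated above.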

2.4 - Warranty and technical support

GPRO has been satisfactorily tested in the various computing systems for which we support the installation. Should your license prove defective, warranty assistance is provided via the technical support service. You can access this service via the "suggestions and bugs" utility implemented within the "Help" tab of the main GPRO menu or by accessing this URL (Technical support).

Warranty is free of charge for two years after acquiring the license. However, Biotechvana does not provide any warranty or assume any liability or responsibility for the use of and results obtained using this tool, expressed or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. In no event shall Biotechvana be liable for any damages including direct, indirect, incidental, consequential and loss of business profits including general, special, incidental or consequential damages arising out of the use or inability to use the program including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the program to operate with any other programs, even if Biotechvana has been advised of the possibility of such damages.

3 - Software Layout and Computing Cluster

3.1 - Workbench and GUIs

The GPRO software consists of a friendly-to-use central menu with distinct tools accessible via graphical user interfaces (GUIs), each usually displaying its own submenu of functionalities. GUIs vary depending on the invoked tool. As we will show in this manual, there are GUIs for a database editor and a sequence editor called TIME, for the worksheet tool for annotation and functional analysis, for all the pipelines and online servers, for the data management tool, and more. As for layout, GPRO is designed as an Eclipse-like workbench, shown in Figure 3.1, consisting of the following:

1. MAIN DESKTOP: this is the central working space, where the most important tools and interfaces of GPRO are graphically launched.

2. DIRECTORY: users can select any folder of their PC and set it as a directory, shown to the left of the main desktop, in order to store and organize files and folders through a dynamic file-tree hierarchy that allows a variety of actions and analyses based on both menu and mouse utilities. The directory can be shown and hidden via the menu.

3. FTP EXPLORER: this is a File Transfer Protocol (FTP) client that allows you to transfer files and folders (using the computer mouse) from the directory to your remote user account at the computing cluster, or vice versa.

4. FASTA EXPLORER: this is a window-based utility coupled with a "Database Editor", accessible via the menu, that allows users to visualize, search and select sequences from plain-text files in fasta format. It is really useful for managing large databases and RefSeq files because it implements an "informatic buffer" that lets you browse sequences one by one by clicking on the sequence names shown in the fasta explorer.

Figure 3.1. GPRO layout organization and interface implementation. Numbers indicate the four window-based sections as described in the text and visualized in the figure: (1) main desktop; (2) Directory; (3) FTP; (4) FASTA explorer.
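The manual does not document how the "informatic buffer" of the FASTA explorer works internally. One common way to browse a large fasta file one sequence at a time without loading it entirely is to index the byte offset of each header line; the sketch below illustrates that general idea under this assumption, with hypothetical function names.

```python
import tempfile
from pathlib import Path


def index_fasta(path: Path) -> dict[str, int]:
    """Map each sequence name to the byte offset of its '>' header line."""
    offsets: dict[str, int] = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:  # skip headers that carry no name
                    offsets[fields[0].decode()] = position
    return offsets


def fetch_sequence(path: Path, offset: int) -> str:
    """Read the record starting at offset and return its sequence without line breaks."""
    parts: list[str] = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break
            parts.append(line.strip().decode())
    return "".join(parts)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fasta"
    demo.write_text(">seq1 first record\nACGT\nAC\n>seq2\nGGTT\n")
    index = index_fasta(demo)
    second = fetch_sequence(demo, index["seq2"])
    print(sorted(index), second)
```

With such an index, clicking a sequence name only requires a seek to the stored offset, so the cost does not depend on the size of the file.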

3.2 - Computing cluster overview

The software is coupled via distinct menu tabs with an online computing infrastructure in continuous progress, installed on the high-end remote cluster of the GyDB project. This infrastructure can only be accessed via GPRO (which is the remote manager of the infrastructure) and includes hard-disk accounts for all GPRO users so that they can run intensive computing jobs in private sessions. At present, it is based on the following items:

50GB hard disk space, which will increase periodically to guarantee sufficient computational space to fit the requirements of the most demanding projects

A guaranteed quality of service with distributed CPU bandwidth for high-throughput computing analyses, providing users with the maximum available processing capacity on the cluster.

An SSH client for logging into a user's private account on the remote cluster and sending commands for launching automated analysis tools.

An FTP client system organized as a remote Filetree manager for transferring analysis files between a client computer and the remote cluster user's account. Users can upload sequence files for processing on the remote cluster and download generated result files to a local computer.

A pipeline for pre-processing and quality analysis of next generation sequencing (NGS) raw data based on distinct scripts and data format converters and the following tools and packages: CUTADAPT (Martin 2011), FASTX-TOOL-KIT, PRINSEQ-LITE (Schmieder and Edwards 2011) and FASTQC.

A pipeline protocol for functional analysis based on the BLAST (Altschul et al. 1997) and HMMER packages plus automatic annotation based on GO vocabulary (Gene Ontology Consortium 2008) and KEGG (Nakaya 2013) data.

A pipeline protocol for functional analysis based on the INTERPROSCAN (Quevillon et al. 2005) and the INTERPRO set of member databases (Hunter et al. 2012) plus automatic annotation

A server for de Novo Gene Prediction based on Augustus (Stanke et al. 2008)

A server for constructing HMM profiles based on HMMER

A server for Mapping/Assembling NGS raw data based on the following tools: BWA, BFAST and MIRA. The GPRO GUI for accessing this server is currently under development, which means that at present this server is only accessible via an SSH command shell client.

A pipeline for Genome/Exome analysis plus SNP/Indel Calling and Annotation based on the BFAST and BWA mappers plus PICARD, GATK-LITE (McKenna et al. 2010), and the snpEff annotator (Cingolani et al. 2012). The GPRO GUI for accessing this pipeline is currently under development, which means that at present this pipeline is only accessible via an SSH command shell client.

A post-processing pipeline for correction of frameshifts, with particular focus on those generated by 454 and IonTorrent homopolymer artifacts, based on a collection of scripts combined with the BLAST and HMMER packages, the INTERPROSCAN databases and the frameshift corrector HMM-FRAME (Zhang and Yung 2011). The GPRO GUI for accessing this pipeline is currently under development, which means that at present this pipeline is only accessible via an SSH command shell client.

A pipeline for Mobilome or Mobile Genetic Elements mapping and annotation based on the GyDB and other RefSeq databases and the following tools: REPEATMODELER, REPEATMASKER, RECON (Bao and Eddy 2002), REPEATSCOUT (Price et al. 2005), TANDEM REPEAT FINDER (Benson 1999), and RMBLAST. The GPRO GUI for accessing this pipeline is currently under development, which means that at present this pipeline is only accessible via an SSH command shell client.

A MySQL database architecture to store, manage and query any kind of relational, classificatory or GoldenPath database, including user-created databases, using the worksheet system of the GPRO software.
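The FTP item in the list above (uploading sequence files and downloading result files) can be illustrated with Python's standard ftplib module. This is only a sketch of the kind of transfer involved, not the GPRO client itself: the helper names are ours, and the host, credentials and paths in the commented usage are hypothetical.

```python
from ftplib import FTP
from pathlib import Path


def upload_file(ftp: FTP, local_path: Path, remote_dir: str) -> str:
    """Upload local_path into remote_dir over an open FTP session; return the remote path."""
    ftp.cwd(remote_dir)
    with open(local_path, "rb") as handle:
        ftp.storbinary(f"STOR {local_path.name}", handle)
    return f"{remote_dir.rstrip('/')}/{local_path.name}"


def download_file(ftp: FTP, remote_name: str, local_path: Path) -> None:
    """Download remote_name from the current remote directory into local_path."""
    with open(local_path, "wb") as handle:
        ftp.retrbinary(f"RETR {remote_name}", handle.write)


# Usage (hypothetical host, credentials and paths):
# with FTP("ftp.example.org") as ftp:
#     ftp.login("user", "password")
#     upload_file(ftp, Path("reads.fastq"), "/home/user/project")
#     download_file(ftp, "results.csv", Path("results.csv"))
```

Both helpers take an already opened session, so the connection details stay in one place and the transfer logic can be exercised without a network.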

4 - Mouse functions and tricks

4.1 - How to deal with GPRO depending on the GUI

GPRO is very easy and simple to use. Almost all file and folder actions on a GUI can be executed by selecting the file or folder with the mouse and then dragging it to the corresponding dialog box of the GUI being used. There are also functions accessible by right-clicking, which vary depending on the GUI. As these utilities might not be immediately obvious the first few times you use GPRO, this section gives an overview of the availability and usefulness of these tricks for each GUI.

4.2 - At directory space

The directory space of GPRO is organized following the typical dynamic file-tree hierarchy, so you can navigate folders containing further folders and files. Figure 4.1 presents an idealized perspective of all the right-click utilities of the mouse. Right-clicking on the directory space opens a dialog offering diverse actions. You can create a new file or folder, open a worksheet, and use the editors. You can also cut, copy, paste, delete and rename files via this dialog. Finally, at the bottom of this dialog, you will find a parser for extracting data from files formatted as GenBank accessions (indicated with a left-oriented arrow). This parser (indicated by a right-oriented arrow) lets you extract and create a sub-database with sequences, symbols and other coding or non-coding features, if they are annotated in the accession file. You can also select and extract sequence frames delimited by particular start and end positions within the accession. The tool is really useful for managing large files such as, for instance, a full-length human chromosome accession.

Figure 4.1. Right-click utilities when positioning the mouse on the directory space
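As an illustration of extracting a sequence frame delimited by start and end positions, the sketch below reads the ORIGIN block of a GenBank flat file and returns a 1-based, inclusive interval. It is a simplified stand-in for the GPRO parser, whose internals are not described here; the function names and the tiny demo record are ours.

```python
def read_origin(genbank_text: str) -> str:
    """Return the sequence stored in the ORIGIN block of a GenBank flat file."""
    bases: list[str] = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Each line starts with a position number followed by blocks of bases.
            bases.extend(line.split()[1:])
    return "".join(bases).upper()


def extract_frame(sequence: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive interval [start, end] of sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    return sequence[start - 1:end]


# Made-up 24 bp record used only for demonstration.
record = """LOCUS       DEMO                      24 bp    DNA
ORIGIN
        1 gatcctccat atacaacggt atct
//
"""
full_sequence = read_origin(record)
frame = extract_frame(full_sequence, 11, 20)
print(frame)
```

GenBank coordinates are 1-based and inclusive, which is why the slice starts at start - 1 and ends at end.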

4.3 - About the database editor and fasta explorer

Right-clicking on the Database editor opens a contextual menu with distinct functions for editing contents (cut, copy and paste). In addition, when you click on any sequence name listed in the FASTA Explorer, the database editor browses the edited file to reach that sequence, which appears highlighted in blue (Figure 4.2). The FASTA Explorer also allows you to open selected sequences in the TIME sequence editor, and to delete and rename selected sequences.

Figure 4.2. Database editor contextual menu and fasta explorer.

4.4 - Dealing with the worksheet

The worksheet system is an Excel-like template of rows and columns with which you can manage a variety of annotation, analysis and data mining actions. Rows represent sequences and columns the distinct annotation features (we will see this in more depth in subsequent sections of this manual). The worksheet can be handled via the worksheet menu or via mouse-dependent actions, which are additional to those available in the worksheet menu. Figure 4.3 shows an idealization of all the available mouse actions. For instance, to export a sub-annotation of your project into a new database you can select rows (sequences) and columns (some of their attributes) by manually clicking the checkboxes of each row and column of interest. You can also reposition a column by selecting it and dragging it from one position to another. Moreover, all worksheet cells (including column headers) can be edited. Right-clicking the mouse opens a small dialog with two options, “Columns” and “Rows”. The first offers a sub-dialog with additional actions for adding, selecting, removing, renaming, joining and splitting columns. The second leads to another sub-dialog permitting the same actions on rows, plus some additional options. You can place the mouse on one row and then right-click > Rows > Show BLAST multihit. This action lets you visualize the set of alternative best BLAST hits for the sequence in the selected row (if you are dealing with a multihit CSV file) and, if appropriate, switch the current hit for another with an alternative description. Do not worry if you are unclear about the details and functionalities of this option; we will explain it in more depth in subsequent sections of this manual. Similarly, you can select the best isoforms among a selection of sequences annotated according to the same GenBank accession or any other feature. Again, do not worry about this, as we will explain it later in much more detail. Finally, when you double-click on any row, that row appears below the worksheet as a notebook for detailed editing (as shown in the figure).

Figure 4.3. Worksheet functions and utilities dependent on mouse actions
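Since worksheets are CSV files, exporting a sub-annotation (selected rows and selected columns) can be pictured as a simple filter over a table. The sketch below shows the idea with Python's csv module; the column names "Sequence", "Description" and "Evalue" and the helper name are hypothetical and do not reflect the actual GPRO worksheet layout.

```python
import csv
import io
from typing import Callable


def export_subset(csv_text: str, wanted_columns: list[str],
                  keep_row: Callable[[dict], bool]) -> str:
    """Return CSV text holding only wanted_columns of the rows accepted by keep_row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=wanted_columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if keep_row(row):
            writer.writerow({name: row[name] for name in wanted_columns})
    return out.getvalue()


# Made-up worksheet with hypothetical column names.
worksheet = (
    "Sequence,Description,Evalue\n"
    "contig1,kinase,1e-50\n"
    "contig2,hypothetical protein,0.5\n"
    "contig3,transposase,1e-12\n"
)
subset = export_subset(
    worksheet,
    ["Sequence", "Description"],
    lambda row: float(row["Evalue"]) < 1e-5,  # keep only well-supported hits
)
print(subset)
```

In GPRO the same selection is made by ticking checkboxes; the predicate here simply plays the role of the checked rows.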

4.5 - FTP explorer and pipeline GUIs

If you launch any pipeline or server via the main menu, an FTP client called FTP explorer, which is linked to your user account at the remote computing cluster, will automatically appear to the right of the directory, together with a GUI that varies depending on the pipeline. As shown in Figure 4.4, if you want to run a pipeline and/or server analysis, your data files must be transferred from your PC to your user account on the server. You can do that by simply dragging files with the mouse from your GPRO directory to your user account. While the file transfer is in progress, a pop-up informs you about the status of the process. Once your files are in your user account on the server, you can launch the tool by selecting with the mouse the file to analyze (and other material if requested) and dragging it to the corresponding dialog box.

Figure 4.4. Transferring files from your PC to the remote computing cluster via FTP and actions to launch the pipeline and server tools via GUI

4.6 - Working with management and data mining utilities

GPRO provides a variety of tools for data mining and database management, which are run on your PC via the main menu. As shown in Figure 4.5, these tools also have their own GUIs, which are operated with mouse select-and-drag actions in the same way as in the previous examples.

Figure 4.5. Management and data mining utilities dependent on mouse select and drag actions.

5 - The Menu

5.1 - Tools and functions

GPRO implements a central main MENU (Figure 5.1), located horizontally at the top of the software layout, that integrates the following tabs:

Figure 5.1. Menu and functions.

A description of each tab follows, according to the numbers highlighted in the figure.

5.1.1 - Databases

This tab implements five utilities:

Select directory folder: for selecting a folder as a directory workspace.

Open single FASTA file: to open sequence and database files in fasta format from any location on your PC.

GenBank Accession: to retrieve GenBank sequence accessions from GenBank at NCBI.

Open worksheet: to open annotation worksheet files which are CSV files (plain files with comma-separated values).

New worksheet: To create a new empty worksheet.

5.1.2 - Directory

Using this tab, users can show or hide the directory at the left of the tool GUI.

5.1.3 - Editor

This tab gives the option to launch two editors.

The first is a "Database editor" associated with the worksheet and implemented with distinct utilities for editing, mining and management of plain fasta files.

The second is an implementation of the "TIME editor", an editor/analyzer of sequences of up to two gigabases.

5.1.4 - Data preprocessing

This tab gives you access to the GUI of a pipeline of "Preprocessing tools" for quality analysis and data preprocessing installed in the computing cluster, with which you can configure and design an ad hoc analysis flowchart based on distinct tools.

5.1.5 - Functional analyses

This tab launches the GUIs of three distinct functional analysis tools installed in the remote cluster server. The first is a "BLAST and HMM search Pipeline plus GO-annotation". The second is a "pipeline for INTERPRO based annotation". The third is a server for de novo gene finding based on "Augustus software".

5.1.6 - Alignment analyses

This tab implements two GUIs for multiple alignment tools:

The first runs a "Sequence Logos maker" based on multiple alignment inputs.

The second launches a HMMER server installed in the computing cluster for creating "HMM profiles and MRC sequences".

5.1.7 - Management

The tab "Management" launches a suite of scripts to manage files and folders in various ways. For instance, users can join, split and rearrange files, folders and contents. In addition, users can execute specific data mining searches in these files and folders, and export the results in new files and folders.

5.1.8 - Preferences

The tab "Preferences" is used to set preferences regarding diverse issues (for example, FTP and pipeline connection).

5.1.9 - Help

The tab "Help" gives access to the technical support service, the user guide (this wiki) and information about GPRO itself.

6 - Menu Tools: Editor

6.1 - Sequence editing

GPRO provides two editing programs, available via the Editor Tab (the third icon on the menu). The first is a database (plain text) editor designed for editing and manipulating database files (plain files in FASTA format), while the second is a sequence editor aimed at the molecular analysis of nucleotide and amino acid sequences.

6.2 - Database editor

The database editor lets you edit and browse sequences and their contents in plain fasta databases and files. Basic actions, such as copying, pasting and cutting sequences, can be done by right-clicking the mouse. As shown in Figure 6.1, the GUI of the database editor offers an intuitive menu (highlighted in red) and two graphical components: 1) the editing frame, where sequences can be edited at any time (users can in fact write any kind of text, but it is preferable to follow the FASTA format); and 2) the Fasta Explorer, which is a list of the names of all sequences contained in the edited file.

Figure 6.1 Database editor screenshot: 1) Database editor frame; 2) Fasta Explorer

6.2.1 - Menu

Almost all the utilities provided by the database editor are organized into a menu bar displayed at the top of the Editor (highlighted with a red rectangle in Figure 6.1). There follows a description of each tab according to the tools it provides.

6.2.1.1 - File

This tab is used to open, save and close files.

6.2.1.2 - Search and replace fasta labels

The database editor provides "Search" and "Search and replace" utilities over the sequence names using three distinct options ("Exact term", "Case sensitive" and "Regular expression"). The first (Exact term) lets the user search for sequences by their exact name. The second (Case sensitive) distinguishes uppercase from lowercase letters in the search. The third (Regular expression) interprets the search term as a pattern written in a formal language, allowing users to identify the labels that match the specification given by that pattern (for instance a consensus pattern).
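The three search options can be illustrated with a minimal Python sketch. It is not GPRO's implementation: the function names `match_label` and `replace_in_labels` are hypothetical, and the way the options combine (for example, "Exact term" meaning that the whole label must equal the term) is an assumption made for illustration.

```python
import re

def match_label(label, term, exact=False, case_sensitive=False, regex=False):
    """Return True if a FASTA label matches the term under the chosen options."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Plain terms are escaped so that characters such as "." or "|" are literal.
    pattern = term if regex else re.escape(term)
    if exact:
        # Assumption: "exact" means the whole label must match the term.
        return re.fullmatch(pattern, label, flags) is not None
    return re.search(pattern, label, flags) is not None

def replace_in_labels(fasta_text, term, replacement, **options):
    """Replace the term in header lines only; sequence lines are left untouched."""
    flags = 0 if options.get("case_sensitive") else re.IGNORECASE
    pattern = term if options.get("regex") else re.escape(term)
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and match_label(line[1:], term, **options):
            line = ">" + re.sub(pattern, replacement, line[1:], flags=flags)
        out.append(line)
    return "\n".join(out)
```

For instance, `match_label("MOV10_human", r"MOV\d+_", regex=True)` matches, while `match_label("MOV10_human", "mov10", case_sensitive=True)` does not.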

6.2.1.3 - Export and Remove seqs

This tool selects a set of sequences (or their fasta headers only) from a fasta file and either removes them or exports them to another file. Clicking on this tab opens a new window (Figure 6.2) showing a summary of all sequence names and the different options. You only need to enter a term of interest (for instance MOV, as in the figure) in the ‘Search’ box and then choose a filter option. The tool provides four options. Three of them are the usual "Exact match", "Case sensitive" and "Regular expression". The fourth option, "Append selections", lets you perform a new selection with a different key term and add the result to a previous selection. Any sequence whose label matches the search is highlighted in the summary. Finally, the tool lets you export or remove the selected sequences from the database file, as sequences or as fasta headers only.
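The select, append and export/remove logic described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (case-insensitive substring matching, the whole file held in memory); the names `parse_fasta`, `select` and `split_selection` are hypothetical and do not come from GPRO.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def select(records, term, previous=None):
    """Indices of records whose header contains the term.

    Passing an earlier selection as `previous` emulates "Append selections".
    """
    chosen = set(previous or ())
    chosen.update(i for i, (header, _) in enumerate(records)
                  if term.lower() in header.lower())
    return chosen

def split_selection(records, chosen):
    """Return (exported, remaining): the selected records and the rest."""
    exported = [r for i, r in enumerate(records) if i in chosen]
    remaining = [r for i, r in enumerate(records) if i not in chosen]
    return exported, remaining
```

Exporting headers only would then amount to keeping the first element of each exported tuple.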

6.2.1.4 - Find Motifs

This tool searches for sequence patterns (motifs) by parsing all sequences in a nucleotide or protein sequence file. The motifs found can be exported to a new text or CSV file.
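As a rough illustration of such a search, the following Python sketch scans every sequence for a motif written as a regular expression and writes the hits as CSV text. The column names and the 1-based inclusive coordinates are assumptions chosen for the example, not the format GPRO actually exports.

```python
import csv
import io
import re

def find_motifs(records, motif):
    """Yield (name, start, end, match) for every occurrence of the motif.

    `records` is an iterable of (name, sequence) pairs. Positions are 1-based
    and inclusive. The lookahead reports overlapping occurrences as well.
    """
    pattern = re.compile(f"(?=({motif}))", re.IGNORECASE)
    for name, seq in records:
        for m in pattern.finditer(seq):
            hit = m.group(1)
            yield name, m.start() + 1, m.start() + len(hit), hit

def motifs_to_csv(hits):
    """Serialize the hits as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequence", "start", "end", "match"])
    writer.writerows(hits)
    return buffer.getvalue()
```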

6.2.1.5 - Undo

To cancel or reverse the effects of a previous editing action.

Figure 6.2. "Export and remove Seqs" dialog.

6.2.2 - Fasta Explorer

As previously shown in Figure 6.1, the Fasta Explorer is a summary of all sequences contained in the edited file, based on their fasta headers. When you click on any header listed in the Fasta Explorer, the editor navigates the file and takes you to the position of the selected sequence within the file. The Fasta Explorer is ideal for searching and selecting particular sequences in large database files (for instance a file containing a full-length genome), which usually cannot be edited with conventional text editors because of their size. The Fasta Explorer lets you navigate the file sequence by sequence, so that you can easily edit the selected sequence.
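One common way to navigate a large FASTA file by header, shown here only as a sketch, is to record the byte offset of each header line once and then seek directly to the chosen record. Whether the Fasta Explorer works this way internally is not stated in this manual; the functions `index_fasta` and `read_record` are hypothetical.

```python
def index_fasta(handle):
    """Return (header, byte_offset) pairs for a FASTA file opened in binary mode.

    Only headers and offsets are kept in memory, not the sequences.
    """
    index = []
    offset = handle.tell()
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            index.append((line[1:].strip().decode(), offset))
        offset += len(line)
    return index

def read_record(handle, offset):
    """Jump to a stored offset and read that single record."""
    handle.seek(offset)
    header = handle.readline().decode().strip()
    lines = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # the next record starts here
        lines.append(line.decode().strip())
    return header, "".join(lines)
```

With a real file you would open it with `open(path, "rb")`, build the index once, and call `read_record` each time a header is selected.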

6.3 - TIME sequence editor

TIME (Munoz-Pomer et al. 2011) is the other editor that can be launched via the Editor Tab. TIME is a sequence editor that permits editing, displaying and molecular analysis of sequences of up to 2 x 10^9 bases (two gigabases) or amino acids, which will suffice for the largest chromosomes known to date [54]. TIME is implemented as a GPRO plug-in. Its GUI is organized in three components (Figure 6.3). First, the menu bar, which gives access to the distinct TIME functions and utilities. Second, the Sequence Editing Frame, where you can select, cut, paste and edit sequences and frames within sequences with the mouse. Third, the Results Table, where you can display a summary of the results of ORF or motif searches performed via the menu, or export their annotation to a CSV file or a fasta database.

Figure 6.3. Screenshot of TIME. 1) TIME menu; 2) Sequence Editing Frame; 3) Results Table

6.3.1 - Menu

TIME implements a horizontal menu consisting of the following tabs and functions:

6.3.1.1 - File

To create, open, save and close amino acid and nucleotide sequence FASTA files and quit the program. If the chosen file is a database containing multiple sequences, TIME will open a tab with a summary of all available sequences. Double-clicking on a sequence name in that summary shows the sequence in the Sequence Editing Frame.

6.3.1.2 - Edit

This tab contains the cut, copy and paste functions. With this tab you can also undo and redo each individual change using the “Undo” and “Redo” utilities, and unlock the sequence if you want to edit it (sequences are locked by default).

6.3.1.3 - Translate

As shown in Figure 6.4, this tab invokes a pop-up dialog allowing you to translate the sequence under analysis in all six reading frames (or only in some of them, by checking the boxes). The standard genetic code is used by default. However, clicking on “Edit” beside “Custom genetic code” will take you to the genetic code editor. In addition to editing the codon assignments, you can rename the code, save it to a file or open a previously saved custom code. The default colors for start and stop codons are blue and red, respectively, but they can be changed using the palettes shown in the dialog.

The Translate utility of TIME can also open Gene Runner’s translation table format (.trt files) and a native plain text format, which can be easily created in the editor of your choice. In these files, lines starting with the hash symbol (“#”) are interpreted as comments and ignored, with the exception of the first line, which holds the name of the code; however, this line is not mandatory. Each following line is formed by a codon (RNA and DNA are allowed), a hyphen and a greater-than symbol (“->”), followed by an amino acid symbol (according to the 1-letter IUPAC codes).
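To make the native plain text format more concrete, here is a minimal Python sketch of a reader for it, written from the description above. It is not TIME's own parser: the function names are hypothetical, the use of "*" for stop codons in the example and the "X" placeholder for unknown codons are assumptions, and details such as whitespace handling are guesses.

```python
def parse_genetic_code(text):
    """Parse 'codon->amino acid' lines; return (name, table).

    A first line starting with '#' is taken as the (optional) name of the
    code; any other '#' line is a comment.
    """
    name, table = None, {}
    for number, raw in enumerate(text.splitlines()):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if number == 0:
                name = line.lstrip("#").strip()
            continue
        codon, sep, amino = line.partition("->")
        if not sep:
            raise ValueError(f"line {number + 1}: expected 'codon->amino acid'")
        # RNA codons are stored as DNA so that both alphabets share one table.
        table[codon.strip().upper().replace("U", "T")] = amino.strip()
    return name, table

def translate(seq, table, unknown="X"):
    """Translate the first reading frame; codons missing from the table give `unknown`."""
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[i:i + 3], unknown)
                   for i in range(0, len(seq) - 2, 3))
```

A file containing the lines `# Toy code`, `AUG->M`, `TTT->F` and `TAA->*` would then translate `atgtttuaa` as `MF*`.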

Figure 6.4. TIME editor screenshot and pop-up for protein translation and genetic code.

6.3.1.4 - Geometry

This tab allows you to change from DNA to RNA and vice versa, and to view either RNA or DNA sequences as a single strand or a double strand.

6.3.1.5 - Orientation

To switch among the following options: reverse, complementary and reverse-complementary.
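For reference, the three orientations can be expressed in Python as follows. This is a generic sketch (function names are illustrative) that handles only the unambiguous bases A, C, G, T and U; other symbols are left unchanged.

```python
# Translation tables for the unambiguous bases, in both cases.
DNA = str.maketrans("ACGTacgt", "TGCAtgca")
RNA = str.maketrans("ACGUacgu", "UGCAugca")

def reverse(seq):
    """Same strand, read backwards."""
    return seq[::-1]

def complement(seq, rna=False):
    """Opposite strand, in the same left-to-right order."""
    return seq.translate(RNA if rna else DNA)

def reverse_complement(seq, rna=False):
    """Opposite strand, read 5' to 3'."""
    return complement(seq, rna)[::-1]
```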

6.3.1.6 - Find ORFs

Using this tab you can search for and retrieve ORFs in both forward and reverse frames, specifying a minimum ORF length. A report with all ORFs meeting the specified length condition, together with their coordinates, is then summarized in the Results Table of the editor (to the right, number 3 in Figure 6.3). Double-clicking the row of a summarized ORF selects and highlights that ORF in the Sequence Editing Frame. You can annotate the whole set of summarized ORFs (or a subset of them) by exporting those selected with the checkbox to the left of the description column (see Figure 6.3), as a CSV or as a fasta file. In the second case, the tool lets you export the ORFs as nucleotide or as protein sequences.
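The following Python sketch shows one simple way to search the six reading frames for ORFs above a minimum length. It is an illustration, not TIME's algorithm: it assumes that an ORF runs from an ATG to the next in-frame stop codon of the standard genetic code (stop included in the length), expects a DNA sequence, and reports coordinates on the forward strand.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(seq, min_length=30):
    """Return (strand, frame, start, end, length) for each ORF found.

    Start and end are 1-based, inclusive, and always given on the forward
    strand; frame is 1, 2 or 3 within the scanned strand.
    """
    seq = seq.upper()
    n = len(seq)
    results = []
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    end = i + 3  # the stop codon is included
                    if end - start >= min_length:
                        if strand == "+":
                            coords = (start + 1, end)
                        else:
                            # Map reverse-strand positions back to the forward strand.
                            coords = (n - end + 1, n - start)
                        results.append((strand, frame + 1, *coords, end - start))
                    start = None
    return results
```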

6.3.1.7 - Find motifs

You can search for particular protein or nucleotide motifs (binding sites, restriction sites, etc.) in the sequence under analysis using the tab “Find motifs” (Figure 6.5A). Results are shown in the Results Table in the same format as Find ORFs search results. You can also use the “Find motifs” tool to search for multiple patterns or motifs, either as single occurrences or as clusters of motifs. Clicking on the “Multiple Motif Editor” button at the bottom right of the “Find motif” dialog opens a new dialog (Figure 6.5B). There you can add and remove as many motifs as necessary, give a name to each, and then select the “Multiple motif” search mode at the top right of this dialog. The search can be performed for single occurrences (motifs are searched independently) or for clustered motifs (motifs falling together in a sequence frame). If you select the latter you can specify three additional search parameters: 1) the minimum cluster size (for instance a frame of 500 nucleotides); 2) the minimum number of motifs within the cluster; and 3) whether clusters may overlap (Overlapping clusters option) or not (Disjoint clusters). Note that you can also add motifs loaded from a file using the tab Load file. For instance, you can use a FASTA file listing restriction enzyme sites downloaded from the REBASE web site.

Figure 6.5. Find motifs screenshot. A) “Find Motifs” dialog. Clicking on the button at the bottom right opens the Multiple Motif Editor (B, to the right), where you can add motifs and perform searches for single occurrences or for clusters of motifs.
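A clustered search of this kind can be sketched in Python as below. The interpretation of the parameters is an assumption made for the example: the cluster size is treated as a window measured from the start of the first motif, a cluster is reported when at least `min_motifs` hits fit entirely in that window, and the disjoint mode simply resumes after the last hit of a reported cluster. TIME may define clusters differently.

```python
import re

def find_hits(seq, motifs):
    """motifs: dict of name -> regular expression.

    Returns sorted (start, end, name) tuples, 0-based and half-open.
    """
    hits = []
    for name, pattern in motifs.items():
        for m in re.finditer(f"(?=({pattern}))", seq, re.IGNORECASE):
            if m.group(1):  # ignore empty matches
                hits.append((m.start(), m.start() + len(m.group(1)), name))
    return sorted(hits)

def find_clusters(hits, window, min_motifs, disjoint=True):
    """Group consecutive hits that fit inside a window of the given size."""
    clusters = []
    i = 0
    while i < len(hits):
        limit = hits[i][0] + window
        j = i
        while j < len(hits) and hits[j][1] <= limit:
            j += 1
        members = hits[i:j]
        if len(members) >= min_motifs:
            clusters.append(members)
            # Disjoint mode skips the hits already used; overlapping mode
            # restarts from the very next hit.
            i = j if disjoint else i + 1
        else:
            i += 1
    return clusters
```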

7 - Menu Tools: Data preprocessing

7.1 - Data preprocessing

In Next Generation Sequencing (NGS), sequence preprocessing is the process of transforming raw reads into assembly-ready sequences while generating associated informative reports in parallel. Raw data preprocessing includes tasks such as converting the raw trace file from proprietary to standard form, deriving template information, base-calling, vector screening, quality evaluation and control, disk management, associated tracking and reporting operations, demultiplexing, sequence trimming/clipping and elimination of artifacts.

Preprocessing of raw data is thus a necessity that has given rise to a number of Unix-based free software tools using a wide range of paradigms. Managing and using these tools requires some skill with Linux commands. With this in mind, we implemented GPRO with a multi-functional, user-friendly interface (Figure 7.1) that lets users work with the most representative preprocessing tools installed in the remote server with user-level skills only (click-and-go actions) although we assume . You can access the GPRO preprocessing interface by clicking on the tab “Data Preprocessing” of the main menu, highlighted in Figure 7.1 below.

Figure 7.1. Interface for Data preprocessing

7.2 - Menu

Figure 7.2 outlines the menu of the preprocessing interface, showing the organization of the different solutions installed in our server to which GPRO currently provides access. Almost all of these preprocessing tools are free software tools designed by third parties, so you must cite them if you obtain publishable results.

Figure 7.2. Preprocessing menu.

Following is a brief description of each interface tab.

7.2.1 - Converters

This tab gives access to some scripts for format conversion (as shown in Figure 7.2). It has two tabs, "Color space" and "Nucleotide space". The first links to a script that converts SOLiD-based (color-space) fasta files coupled with quality files into either color-space fastq (csFastq) or conventional nucleotide-based fastq.
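To give an idea of what a color-space to nucleotide conversion involves, here is a minimal Python sketch of the standard SOLiD two-base decoding, in which each color (0 to 3) encodes the transition from the previous base to the next one. It is not the script installed in the server: the function name is hypothetical, quality values are not handled, and the choice to emit "N" after an unreadable color is an assumption.

```python
BASES = "ACGT"

def decode_color_space(read):
    """Decode a color-space read such as 'T0120' into nucleotides.

    The first character is the known primer base and is not part of the
    output. With A, C, G, T numbered 0 to 3, the next base is the previous
    base XOR the color.
    """
    previous = BASES.index(read[0].upper())
    decoded = []
    for position, color in enumerate(read[1:]):
        if color not in "0123":
            # An unreadable color breaks the chain: the rest cannot be decoded.
            decoded.append("N" * (len(read) - 1 - position))
            break
        previous ^= int(color)
        decoded.append(BASES[previous])
    return "".join(decoded)
```

Because every base depends on all previous colors, a single color error changes the rest of the decoded read, which is one reason color-space data are often kept as csFastq for downstream tools that support it.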

7.2.2 - Private user tools

If you have your own server coupled with GPRO, you also have a tab for running other proprietary source code tools or your personal scripts (if you need more details about how to proceed, please contact us).

7.2.3 - Processing and cleaning

This tab provides access to three distinct software packages for preprocessing and cleaning via the preprocessing interface. These are:

1. Cutadapt (Martin 2011), for removing primers and adapters from the sequences, among many other actions. For more details please visit the web site.

2. FASTX-Toolkit, a collection of tools, summarized in Figure 7.2, for fasta and fastq preprocessing.

3. Prinseq (Schmieder and Edwards 2011), a tool for filtering, reformatting and/or trimming sequence data. For more information visit the web site.

7.2.4 - Quality analyses

You can use this tab for performing quality analyses using FASTQC.

7.3 - How to proceed with the preprocessing interface

Managing any of the aforesaid tools via the interface is intuitive. As shown in Figure 7.3, launching the preprocessing interface also opens an FTP connection between your PC and your user account in the server pipeline. The first step is to drag the files you want to process from your PC to your user account.

Then create (by right-clicking) an output folder, which you can name as you wish, and select the file format (fastq, fasta or fasta + qual) in the interface box named format.

Subsequently, select in the menu the tool you are going to use (the interface will automatically show the applications and requirements of the selected tool). Then use the mouse to drag the input files you want to preprocess to the top box (small red line in the figure) and the output folder in which you want to receive the resulting files to the output folder box (larger red line).

Figure 7.3. Managing the preprocessing interface.

Finally, at the bottom of the interface there is an interactive form listing all command options and parameters (Figure 7.4) provided by the selected tool. Select the options or fill in the analysis parameters where required, and you are ready to launch the preprocessing analysis. Here you have two options. You can click the tab "Run program" (in Figure 7.3) to launch the analysis as configured, or you can click on the tab "Append and command", in which case your command string appears in the queue box below, allowing you to prepare further analyses. In this way you can launch the same command on multiple files simultaneously, since you only need to drag a new input file to the input box. More interestingly, if you keep the option "Use output file created by previous command as the next input file" selected, you can design an ad hoc preprocessing pipeline for a particular data file. That is, you design one command for demultiplexing your file, another for trimming the first 10 nucleotides at the 5´ end of the output of the previous command, another for eliminating from that output all sequences of insufficient quality (according to a threshold), and so on.
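The idea of a command queue in which each command takes the output of the previous one as its input can be sketched in Python as follows. This only builds the command strings and does not run anything; `build_queue`, the file naming scheme and the two example Cutadapt steps are illustrative assumptions and do not reproduce the commands GPRO generates.

```python
def build_queue(first_input, output_dir, steps, chain=True):
    """Build a list of command strings from (label, template) steps.

    Templates use the placeholders {input} and {output}. With chain=True the
    output of each step becomes the input of the next; with chain=False every
    step reads the original input file.
    """
    queue = []
    current = first_input
    for number, (label, template) in enumerate(steps, start=1):
        output = f"{output_dir}/step{number}_{label}.fastq"
        queue.append(template.format(input=current, output=output))
        if chain:
            current = output
    return queue

# Example steps using Cutadapt options: -u removes bases from the start of
# each read, -q trims low-quality ends, -m discards reads shorter than the
# given length after trimming.
steps = [
    ("trim5", "cutadapt -u 10 -o {output} {input}"),
    ("quality", "cutadapt -q 20 -m 30 -o {output} {input}"),
]

for command in build_queue("reads.fastq", "results", steps):
    print(command)
```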

Figure 7.4. Form for selecting commands and parameters.

8 - Menu Tools: Functional analyses

8.1 - Functional Analyses overview

The horizontal main menu of GPRO implements a drop-down list of different GUIs that let you run, from your PC, comparative searches against the most common reference databases using a variety of software pipelines and free software tools installed in the GPRO computing server, or in your own server if you installed the GPRO protocol in your own cluster. As shown in Figure 8.1, GPRO gives you the option of performing automatic annotation using BLAST and HMM searches or INTERPROSCAN (Quevillon et al. 2005). This section also allows you to make ab initio gene predictions based on your query sequence using the software Augustus (Stanke et al. 2008).

Figure 8.1. Functional analysis options.

8.2 - BLAST and HMM searches

BLAST and HMM searches have been predominantly used to characterize the function of newly described sequences, either nucleotide or amino acid. The best hit of a BLAST search is often taken as sufficient to establish a relationship of homology between query and subject. The quality of the annotation depends on the quality of the database used as the subject in the homology search. GPRO implements a pipeline for running the whole BLAST search process. You manage the analysis and its conditions from your PC, but the analysis is launched at the server. This means that you must transfer the files from your PC to your user account at the computing cluster. In particular, this pipeline consists of the NCBI-BLAST package (Altschul et al. 1997) and HMMER, with a collection of scripts managed by a GUI presenting five tabs: “Format databases”, “BLAST analyses”, “HMM analyses”, “Process BLAST outputs” and “Process HMM outputs”.

8.2.1 - Format databases

This tool is the first tab of the GUI for BLAST and HMM searches. It calls the “formatdb” program of the NCBI-BLAST package, allowing you to easily give BLAST format to any fasta file created from your own data or to any RefSeq database downloaded from an internet source (Silva, Repbase, NCBI-NR, Uniprot, etc.).

The procedure for formatting databases is shown in Figure 8.2 and detailed below according to the following steps:

1) Transfer the fasta file you want to format from your PC to your cluster account by dragging the file with the mouse from one place to the other (to the FTP explorer).

2) Then drag your input file with the mouse to the input box “Drop here fasta file from FTP explorer” displayed at the top of the GUI. If you did it successfully you will see a green icon to the right of the box.

3) Drag the output folder (a new folder or an existing one) in which you want to store the BLAST database files (once BLAST-formatted, your original file is represented by a set of three binary files) from the FTP explorer to the output box (below the input box). Again, if you did it successfully you will see a green icon to the right of the box.

4) Give the database a name and select the type of database (nucleotide or amino acid).

5) Compile the database.

6) The tool will automatically generate the three binary files (.phr, .pin and .psq for a protein database; .nhr, .nin and .nsq for a nucleotide database) within the output folder. As previously noted, these three files constitute the formatted database recognized by the BLAST package compiled in GPRO.
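For readers who want to relate the steps above to the command line, the following Python sketch builds a formatdb call of the legacy NCBI-BLAST package and lists the files it is expected to produce. The exact command that GPRO issues on the server is not documented here, so the argument choice is an assumption; the helper names are hypothetical, and the file list ignores the volume splitting that formatdb applies to very large databases.

```python
def formatdb_command(fasta_path, db_name, protein=True):
    """Argument list for formatdb: -i input file, -n database name,
    -p T for protein or -p F for nucleotide."""
    return ["formatdb", "-i", fasta_path, "-n", db_name,
            "-p", "T" if protein else "F"]

def expected_files(db_name, protein=True):
    """The three binary files of a formatted (single-volume) database."""
    prefix = "p" if protein else "n"
    return [f"{db_name}.{prefix}{ext}" for ext in ("hr", "in", "sq")]

print(" ".join(formatdb_command("myseqs.fasta", "blastdb/myseqs")))
print(expected_files("blastdb/myseqs"))
```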

Figure 8.2. Procedure for giving BLAST format to fasta files. In this way you can create your own RefSeq databases for further BLAST comparisons, using new sequences as queries against this RefSeq material.

8.2.2 - BLAST analyses

“BLAST analyses” is the second tab of the GUI for BLAST and HMM searches. It calls the distinct tools (summarized in Table 8.1) provided by the NCBI-BLAST package for similarity searches, using either nucleotide or protein fasta files as input queries against any (BLAST-formatted) RefSeq database. In other words, the tool allows you to manage the distinct BLAST programs in graphical mode and to characterize one or more fasta files with hundreds or thousands of sequences by simultaneous comparison with a RefSeq database.

Table 8.1. Search tools implemented in the NCBI-BLAST package

Program   Description
BLASTP    Identifies protein queries or finds protein sequence homologs in a protein database
BLASTX    Finds similar proteins to translated DNA queries in a protein database
BLASTN    Identifies DNA queries or finds DNA sequences similar to the queries
TBLASTN   Finds similar sequences to protein queries in a nucleotide database

Figure 8.3 illustrates the typical steps for launching a BLAST analysis with GPRO. Following is a description for each step.

1) Upload the query file you want to analyze from your PC to the FTP explorer using the mouse, as explained in the previous section.

2) Drag the query file from the FTP explorer to the input submission box.

3) Drag the folder containing your BLAST-formatted RefSeq databases to the database-input submission box. The names of the distinct RefSeq databases available in the submitted folder will then be listed alphabetically in the box. The name highlighted in blue in the figure indicates the RefSeq database selected to be searched by the query. You can change the RefSeq database by scrolling the list and selecting the one of interest with the mouse. Note that the figure marks with a red circle a link below the output folder box. This link gives you the option of using large RefSeq databases, such as NCBI NR, INTERPRO and others, which are available to all GPRO users and which we pre-compile and update periodically because of their public nature and large size.

4) Select with the mouse an output folder in which to deposit your BLAST results and drag it from the FTP explorer to the output submission box of the GUI.

5) Select the BLAST program (according to Table 8.1) and the options (E-value cut-off) to filter your search in the Option box at the top right of the GUI.

6) Enter your e-mail address to receive a notification when the job is complete. Bear in mind that searches performed with a large number of query sequences are usually computationally intensive. GPRO performs the distinct search jobs in remote unattended mode. This means that you can launch the search and leave the tool. The analysis will keep running even if you quit the pipeline and close GPRO.

7) Click run BLAST to run the analysis, but wait until a message appears confirming that the analysis has been launched. Information about your launched analysis will then appear in the summary at the bottom of the GUI. This information includes a column called “Actions” where you will see (indicated with a red arrow) a red icon accompanied by the word “Stop”. This button aborts the analysis at any moment if needed, so do not touch it unless you want to do that.
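As an illustration of what these choices correspond to on the command line, the sketch below picks the program of Table 8.1 from the query and database types and builds a call to blastall, the search program of the legacy NCBI-BLAST package. That GPRO uses blastall with exactly these arguments is an assumption; the function and dictionary names are hypothetical.

```python
# (query type, database type) -> program, following Table 8.1.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "nucleotide"): "tblastn",
}

def blastall_command(query, database, output, query_type, db_type, evalue=1e-5):
    """Argument list for blastall.

    -p program, -i query file, -d database, -e E-value cut-off,
    -m 7 requests XML output, -o output file.
    """
    program = PROGRAMS[(query_type, db_type)]
    return ["blastall", "-p", program, "-i", query, "-d", database,
            "-e", str(evalue), "-m", "7", "-o", output]

print(" ".join(blastall_command("queries.fasta", "blastdb/myseqs",
                                "results/queries.xml",
                                "nucleotide", "protein")))
```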

Figure 8.3. Launching a BLAST search. Note that submission boxes and text areas include a check icon that can show two color states: green, meaning data successfully submitted/typed, and red, meaning data not submitted or incorrectly typed.

8.2.3 - HMM analyses

“HMM analyses” is the third tab of the GUI for BLAST and HMM searches and provides an interface for performing HMM searches with HMMER based on two alternative options, HMMSCAN and HMMSEARCH. The first compares a sequence fasta file, used as query, against a database of HMMs. The second uses the HMM database as query against the sequence fasta file as subject. Figure 8.4 shows a screenshot of the HMMER GUI; the procedure is, in overall terms, quite similar to that described above for BLAST searches. Note that you can create your own HMM databases using the HMMER server accessible via the GUI in the Menu tab "Alignment analysis", or download any RefSeq HMM database from the internet. It is also worth remembering that, if you are a user of the GPRO cluster, you can share pre-compiled databases with other users through the repository system. If you need to do so, just import the database of interest by clicking on the link to the repository of pre-compiled databases highlighted in red in Figure 8.4. If the database you need is not there, let us know and we will try to add it to the list of common resources.
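The difference between the two options can be seen in the corresponding HMMER 3 command lines, sketched here in Python. Both programs take the HMM file first and the sequence file second; what changes is which side acts as the query. The helper name, the choice of `--tblout` for a tabular summary and the file names are assumptions for illustration, not the commands GPRO runs.

```python
def hmmer_command(mode, hmm_file, fasta_file, table_out, evalue=None):
    """Argument list for hmmscan or hmmsearch.

    hmmscan:   each sequence is a query against a database of HMMs
               (the HMM database must first be prepared with hmmpress).
    hmmsearch: each HMM is a query against the sequence file.
    --tblout writes a per-target tabular summary; -E sets the E-value cut-off.
    """
    if mode not in ("hmmscan", "hmmsearch"):
        raise ValueError("mode must be 'hmmscan' or 'hmmsearch'")
    args = [mode, "--tblout", table_out]
    if evalue is not None:
        args += ["-E", str(evalue)]
    return args + [hmm_file, fasta_file]

print(" ".join(hmmer_command("hmmscan", "profiles.hmm", "queries.fasta",
                             "results/hits.tbl", evalue=0.001)))
```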

Figure 8.4. HMM search screenshot

8.2.4 - Process BLAST outputs

Once a BLAST search is completed you will receive an e-mail notification, and the pipeline will report the result of the search in the folder selected as output folder. The BLAST result consists of a number of XML files, usually thousands, because the search delivers one XML file, with all hits detected, for each sequence of the query file. The XML files must therefore be processed in order to extract and export the results into a single, interpretable annotation file. You can do this using the “Process BLAST outputs” script (Figure 8.5), which is the fourth tab of the GPRO GUI for BLAST and HMM searches.

Management of “Process BLAST outputs” is similar to that previously explained for other GUIs. Briefly, use the mouse to drag the whole output folder into the input box labeled “Drop here BLAST XML result folder”, and then an additional empty folder into the box below, “Drop here output folder”. From that point on, the tool gives three options for processing the output XML files.

The first option retrieves and exports the BLAST results from all XML outputs to a single CSV file (the annotation file) consisting of as many rows as sequence queries and as many columns as annotation features. This CSV can be opened, visualized and managed via the annotation worksheet system of GPRO. Before running the script you can also apply an e-value cutoff, take as many best hits per sequence query as desired, filter positional redundancies and append Gene Ontology (GO) annotation terms (Gene Ontology Consortium 2008). If you do so, the script will also include annotations based on the INTERPRO (Hunter et al. 2012) and KEGG (Nakaya 2013) systems.

The second option generates the aforesaid CSV plus an additional FASTA database file in which the FASTA headers of the query sequences are labeled with the BLAST results, annotated according to the best BLAST hit.

The third option generates the CSV plus an additional FASTA database file with all the subject sequences detected by the queries, labeled with additional annotations according to the fasta names of the query sequences.

Both the second and third options need additional information in order to annotate the fasta files. When you click on either of these options, an additional form appears to the left, providing two additional utilities: “Fasta retrieval options” and “Additional retrieval option”.

“Fasta retrieval options” specifies the files from which the tool takes the information to append to the sequences in the fasta file generated alongside the CSV. Here you can use a fasta file (usually the same query file) if you decide to create a database of subject sequences annotated on the basis of the queries, or alternatively a BLAST-compiled database if you choose to annotate the query sequences on the basis of the subject information. In the first case, just drag the query file to the box. In the second, drag the folder in which you store your RefSeq databases and then select, from the list that appears, the database used in the analysis.

“Additional retrieval functions” is a utility that lets you decide whether to export full sequences or just the alignment core of BLAST similarity. Furthermore, you can ask the tool to retrieve the cores together with an additional number of flanking nucleotides (in the case of DNA sequences) or residues (in the case of protein sequences), both upstream and downstream.
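The core of the first option, collecting the best hits from BLAST XML output into CSV rows, can be sketched with Python's standard library. This is a simplified illustration and not the GPRO script: it reads one XML document at a time, takes only the first HSP of each hit, ignores queries without hits, and does not add GO, INTERPRO or KEGG terms. The function names and CSV column names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def best_hits(xml_text, max_evalue=1e-5, hits_per_query=1):
    """Return [query, hit_id, hit_description, evalue, bit_score] rows."""
    rows = []
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        kept = 0
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")  # first HSP reported for the hit
            if hsp is None:
                continue
            evalue = float(hsp.findtext("Hsp_evalue"))
            if evalue > max_evalue:
                continue
            rows.append([query, hit.findtext("Hit_id"), hit.findtext("Hit_def"),
                         evalue, float(hsp.findtext("Hsp_bit-score"))])
            kept += 1
            if kept == hits_per_query:
                break
    return rows

def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query", "hit_id", "hit_description", "evalue", "bit_score"])
    writer.writerows(rows)
    return buffer.getvalue()
```

To process a whole result folder you would call `best_hits` on the content of each XML file and concatenate the rows before writing the CSV.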
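Retrieving an alignment core with optional flanking residues amounts to a slice clamped to the sequence ends, as in this minimal Python sketch. The function name is hypothetical, and coordinates are assumed to be 1-based and inclusive, as BLAST reports them; for hits on the reverse strand the sketch returns the forward-strand segment without reverse-complementing it.

```python
def core_with_flanks(sequence, start, end, flank=0):
    """Return the aligned core plus up to `flank` residues on each side."""
    if start > end:
        start, end = end, start  # reverse-strand hits are reported with start > end
    left = max(start - 1 - flank, 0)          # clamp at the beginning
    right = min(end + flank, len(sequence))   # clamp at the end
    return sequence[left:right]
```

For example, `core_with_flanks("AAACCCGGGTTT", 4, 6, flank=2)` returns `AACCCGG`.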

Figure 8.5. Processing results from the XML outputs reported by the BLAST search.

8.2.5 - Process HMM outputs

“Process HMM outputs” is a script corresponding to the fifth tab of the GPRO pipeline GUI for BLAST and HMM searches. This script (Figure 8.6) allows you to export annotations and results from the output file generated by HMMER to a CSV file that can be opened, visualized and managed via the annotation worksheet system of GPRO. The procedure is very similar to that for processing BLAST outputs, with the differences that HMMER generates plain files as outputs instead of XMLs and that “Process HMM” only permits annotation in a CSV (it does not generate additional fasta files).

Figure 8.6. Processing mapping results from HMM outputs

8.3 - InterproScan

INTERPROSCAN is a software package that combines the different protein signature recognition methods native to the INTERPRO member databases (Hunter et al. 2012) into one resource, with look-up of the corresponding INTERPRO and GO annotation. For more details about INTERPROSCAN and the INTERPRO databases, please refer to its web site and documentation at EMBL-EBI.

8.3.1 - Running Interproscan via GPRO

By scrolling down the functional analysis tab of the main GPRO menu you can access a GUI for performing searches against all (or any) of the INTERPRO member databases using INTERPROSCAN. The GUI has two tabs, one for running INTERPROSCAN and the other for processing the XML output provided by this analysis. Figure 8.7 shows a screenshot of the GUI section provided by GPRO when launching the search.

Figure 8.7. Running INTERPROSCAN. The procedure is quite similar to those previously explained in this section for functional analysis, except that you should check the databases against which you want to perform your search. You can select all or any of them.

8.3.2 - Processing INTERPROSCAN outputs

Similarly to a BLAST search, INTERPROSCAN generates distinct XML outputs as a result. By clicking on the second tab of the INTERPROSCAN GUI you will access the interface of a script that allows you to annotate your INTERPRO results and GO codes into a single CSV file similar to that provided by the above-mentioned script for processing BLAST outputs. This CSV consists of as many rows as sequence queries and as many columns as annotation features, and can be opened, visualized and managed via the annotation worksheet system of GPRO. The procedure is similar to that of the other GPRO GUIs.

Figure 8.8. Processing INTERPROSCAN XMLs into a single CSV.

8.4 - Augustus

AUGUSTUS is a program for ab initio prediction of genes in eukaryotic genomic sequences. AUGUSTUS can predict alternative splicing and alternative transcripts, as well as 5'UTRs and 3'UTRs including introns, based on species-specific training sets. For more details about Augustus and the distinct species training sets it supports, please refer to its web site.

8.4.1 - Running Augustus via GPRO

By scrolling down the functional analysis tab of the main GPRO horizontal menu you can access a GUI to run Augustus. Figure 8.9 shows a screenshot of the GUI provided.

Figure 8.9. GUI provided by GPRO in order to run Augustus. The procedure is quite similar to those previously explained in this section for functional analysis, except that you need to select the species training set on which you will base your prediction from the drop-down list available in the box "species" within the GUI (indicated with a red arrow).

9 - Menu Tools: Alignment Analyses

9.1 - Sequence logos

Multiple alignments are central to the analysis of gene and protein sequence patterns and functional homologies through the construction of HMM profiles, consensus sequences and sequence logos. GPRO constructs sequence logos from both gapped and ungapped alignments using CheckAlign (Muñoz-Pomer et al. 2008), a logo-maker implementation following the methodology introduced by Schneider et al. (Schneider et al. 1986; Schneider and Stephens 1990) based on Information Theory.

By following the menu path Alignment analysis > Sequence Logos you will find a GUI (Figure 9.1) with a box in which to paste your multiple alignment in Fasta format (you can also upload the alignment from a file). Then select whether your alignment is based on DNA or protein sequences and choose a method for constructing the logo. If you want to obtain a statistically meaningful logo, select the Schneider method, which implements three options for applying corrections to alignments with a small number of aligned sequences. If you are just interested in visualizing the most prominent consensus common to your aligned sequences, you can try a logo approximation based on a relative frequency analysis. This is not particularly significant under Information Theory, but it may give you some clues for further analyses if the Shannon approach fails because of the high divergence of the aligned sequences.
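The difference between the two approaches can be illustrated with a short sketch. It computes the information content of a single alignment column in bits, optionally subtracting the commonly used approximate small-sample correction e(n) = (s - 1) / (2 ln 2 · n), and, separately, the plain relative frequencies. This is an illustration under stated assumptions and not the CheckAlign implementation: the function names are hypothetical, only one of the possible corrections is shown, and clamping negative values to zero is a convenience of this sketch.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4, small_sample=True):
    """Information content (bits) of one alignment column; gaps are ignored."""
    residues = [c for c in column.upper() if c != "-"]
    n = len(residues)
    if n == 0:
        return 0.0
    # Shannon entropy of the observed residue frequencies
    entropy = -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())
    # Approximate small-sample correction (assumption of this sketch)
    correction = (alphabet_size - 1) / (2 * math.log(2) * n) if small_sample else 0.0
    return max(math.log2(alphabet_size) - entropy - correction, 0.0)


def relative_frequencies(column):
    """Plain residue frequencies of a column, with no information measure."""
    residues = [c for c in column.upper() if c != "-"]
    n = len(residues)
    return {res: k / n for res, k in Counter(residues).items()} if n else {}


print(column_information("AAAA", small_sample=False))  # fully conserved DNA column
print(column_information("AAAA"))                      # same column, corrected
print(relative_frequencies("AAC-"))
```

With few sequences the correction lowers the height of every column, which is why a conserved column in a four-sequence alignment scores below the 2-bit maximum for DNA. For protein alignments `alphabet_size` would be 20.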

Figure 9.1. Screenshot of the GPRO logo maker tool. Upload or paste a multiple alignment, then click on the button "create logos" (in a circle) to create a logo representation (within a square).

9.2 - HMMs and consensus sequences

Hidden Markov Model (HMM) profiles (Eddy 1998) are probabilistic models capable of capturing position-specific information on the sequence consensus of a set of aligned sequences. In this regard, the most representative software to our knowledge is the HMMER package created by Sean Eddy. GPRO implements a GUI for constructing HMMs and Majority Rule Consensus (MRC) sequences from a multiple alignment input by running HMMER.
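As an illustration of what a majority rule consensus is, the sketch below derives one directly from the alignment columns. It is a generic example, not the procedure GPRO or HMMER follows: the function name, the 50% threshold, the "X" placeholder and the choice to skip gap-dominated columns are all assumptions.

```python
from collections import Counter


def majority_rule_consensus(aligned, threshold=0.5, ambiguous="X"):
    """Consensus of equal-length aligned sequences. A column contributes its
    most common residue when that residue's frequency exceeds `threshold`,
    otherwise `ambiguous`. Columns in which gaps are the majority are skipped."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if count / len(column) <= threshold:
            consensus.append(ambiguous)
        elif residue != "-":
            consensus.append(residue)
    return "".join(consensus)


print(majority_rule_consensus(["ACGT", "ACGA", "AC-A"]))
```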

Figure 9.2 shows a screenshot of the GPRO GUI for running a HMMER-based server installed in the remote computing cluster. Note that to run HMMER via GPRO you must be connected to the internet. The procedure is similar to those previously described for running the other GPRO pipelines.

1) Transfer the alignment or alignments you want to use as input from your PC to your server account via the FTP explorer. Note that the alignment format may be fasta or Stockholm. Then drop the alignment file into the input box. If you did it successfully you will see a green icon to the right of the box.

2) Optionally, check the option for creating an MRC sequence if you also want one.

3) Choose an output folder from the FTP explorer and drag it to the output box (again, if you did it successfully you will see a green icon to the right of the box). Finally, enter your e-mail address to receive a completion notification (the analysis can be computationally intensive and you may quit GPRO before the job finishes) and run the script.

Figure 9.2. Creating HMM profiles and consensus sequences using a HMMER server via GPRO

10 - Menu Tools: Management

10.1 - Management overview

GPRO offers you three distinct tools for performing data mining and managing files, folders and contents within folders. If you click on the menu tab "Management", a drop-down list will show three tool options: “Files and folders”, “Find sequences” and “Alignments”.

10.2 - Files and folders

This tool is organized in three sub-sections called "Join Folders", "Join Files" and "Split Files", which you can access just by clicking on their respective tabs at the top of the GUI.

10.2.1 - Join folders

“Join folders” is a script that allows you to reorganize folders using distinct key terms such as enzyme or annotation names (even if the folders are nested in other folders within the Directory). Figure 10.1 shows an example consisting of the following steps:

1) Use the mouse to drag the folder containing all the relevant folders from the Directory to the "input folder" text box.

2) Type as many “filter words” as needed in the box “By name” below the tag “Filter options” in order to define the criteria for organizing the folders (all folders named with an identical or similar term will be selected). Then press "add" and the words will pass to the text box below.

3) Check any of the three options “Exact term”, “Regular expression” or “Case sensitive”.

4) Use the mouse to select an output folder from the Directory, drop it into the output box at the bottom of the GUI and click on the button “Proceed” to run the script, which will select all folders matching your search terms and export them to the output folder.
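The effect of the three filter options in step 3 can be sketched as follows. How GPRO combines the options internally is not documented here, so the semantics below (substring match by default, whole-name match for “Exact term”, case-insensitive unless “Case sensitive” is checked) are assumptions, and the function names are hypothetical.

```python
import re


def name_matches(name, term, exact=False, regex=False, case_sensitive=False):
    """Decide whether a folder name is selected by one filter word."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = term if regex else re.escape(term)  # escape plain words
    if exact:
        return re.fullmatch(pattern, name, flags) is not None
    return re.search(pattern, name, flags) is not None


def group_folders(folder_names, terms, **options):
    """Map each filter word to the folder names it selects."""
    return {t: [f for f in folder_names if name_matches(f, t, **options)]
            for t in terms}


folders = ["RT_gypsy", "rt_copia", "INT_gypsy", "RNaseH"]
print(group_folders(folders, ["RT", "INT"]))
print(group_folders(folders, ["RT"], case_sensitive=True))
```

Each key of the returned dictionary corresponds to one output folder named after the filter word, as described in the caption of Figure 10.1.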

Figure 10.1. Join folders, using distinct folders with contents concerning LTR retroelement protein domains as an example: 1) drag the folder under analysis to the "input folder" area; 2) type “filter words” and options; 3) use the mouse to select an output folder and drop it into the output box; 4) run the script; 5) the script will export all folders, joined within other folders named according to the terms used as key words.

10.2.2 - Join files

The script “Join files” allows you to select files and group them in a single file and/or folder. The tool accepts only FASTA and XML files. Using “Join files” you can, for example, collect the distinct XML files of a BLAST search in a single XML file or retrieve common gene features of an annotation database divided into distinct files. The possibilities offered by this tool are the following:

Select files and place them in a single folder

Select XML files and place them in a single folder

Join files in a single file

Join XML files in a single XML file

In all cases, it is possible to filter the results using various criteria such as file name, extension and type of sequence (nucleotide or protein). The tool works even if the files are in distinct sub-folders of the selected folder. The procedure is similar to that of "Join folders", but in this case the tool manages files instead of folders: key terms can be used as including or excluding filters, either to export all matching files to a single folder or to create a single database file encompassing the contents of all matching files. Figure 10.2 provides a screenshot of the Join files process.

Figure 10.2. Join Files: 1) Drag the input folder whose contents you want to retrieve and reorganize from the Directory to the "Input folder" box; 2) Check the option, in this case "Select files and place them in a single folder"; 3) Type the filter; 4) Drop the output folder and finally run the script. Alternatively, if you select the option "Join files in a single file", the tool will collect the contents of all files and group them in a single file. In other words, this second option is an easy way to create DNA and protein databases.
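The "Join files in a single file" option amounts to a recursive, filtered concatenation. The sketch below shows the idea for FASTA files only (naively concatenating XML files would not yield a well-formed XML document). The function name `join_files`, its parameters and the case-insensitive name filter are assumptions made for this example.

```python
from pathlib import Path


def join_files(input_folder, output_file, name_filter="", extension=".fasta"):
    """Concatenate every matching file below input_folder (sub-folders
    included) into output_file and return the list of files used."""
    output_file = Path(output_file)
    matches = sorted(
        p for p in Path(input_folder).rglob("*" + extension)
        if name_filter.lower() in p.name.lower()
        and p.resolve() != output_file.resolve()  # never read the output itself
    )
    with output_file.open("w") as out:
        for path in matches:
            text = path.read_text()
            # make sure each file ends with a newline before the next header
            out.write(text if text.endswith("\n") else text + "\n")
    return matches


# Hypothetical usage: join_files("my_project", "my_project/RT_database.fasta", "RT")
```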

10.2.3 - Split files

The script “Split files” is the third tab available at the top of the "Files and folders" GUI and allows you to split a fasta file into different files using the labels of sequences within the original file as classificatory criterion. The options offered by this tool are the following:

Split files by sequence names

Split files into blocks

The selection of the sequences is performed over the fasta header, either according to one or more key terms in the names of the sequences or by blocks of sequences (1000 sequences, 10000 sequences, etc.). As shown in Figure 10.3, to use “Split files” drag your input into the corresponding text box (as in the other scripts). Then select one of the proposed split options. For instance, if you want to split your file according to distinct search terms (such as enzyme acronyms), type each term in the box “Split by sequence names” and then add it to the list at the left (you can add as many terms as desired). The script will then parse the file and divide it into as many files as search terms used, each one named after its term and containing the sequences whose fasta labels matched it. The script will also deliver an additional file (no match) containing the remaining sequences of the processed database.
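Both split modes can be sketched in a few lines. The helper names below are hypothetical, matching is assumed to be a case-insensitive substring test on the header, and a sequence matching several terms is assigned to the first one only; GPRO may resolve such cases differently.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_by_terms(records, terms):
    """Group records by the first term found in the header; the rest go to
    the 'no match' group."""
    groups = {term: [] for term in terms}
    groups["no match"] = []
    for header, seq in records:
        key = next((t for t in terms if t.lower() in header.lower()), "no match")
        groups[key].append((header, seq))
    return groups


def split_into_blocks(records, size):
    """Cut the records into consecutive blocks of at most `size` sequences."""
    records = list(records)
    return [records[i:i + size] for i in range(0, len(records), size)]


fasta = ">RT_1\nACGT\n>INT_1\nGG\nCC\n>env\nTT\n"
records = list(read_fasta(fasta))
print(split_by_terms(records, ["RT", "INT"]))
print(split_into_blocks(records, 2))
```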

Figure 10.3. Split files screenshot.

10.3 - Find sequences

This is a data mining tool for searching and extracting sequences from FASTA file databases using the labels or names of the sequences in the FASTA header as a search criterion. With this utility you can search a fasta file using one or more term options to export sequences and create new databases as described in Figure 10.4.

In this task:

1) Drag the input file from the Directory to the "Input file" text box. The FASTA headers of all sequences within the file will be listed in the sequence list dialog.

2) Type the label or name of the sequences you want to extract in the box "Select by term" below, using one search term or two or more terms separated by commas, and choose among the filter options "Exact term", "Case sensitive", "Regular expression" and "Append selection", or none (the default). Any items in your file that match your search terms are automatically highlighted in blue within the "Sequence list" dialog (note that to the left of the sequence list there are additional tabs to select all sequences, deselect them or invert the selection).

3) Drop an output file into the corresponding dialog and choose whether to overwrite it. If you do not select this option, the sequences will be added to the output file without removing its previous contents.

4) Click "Run" to retrieve and export your selection into a new database within the directory.
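The selection and export steps above can be sketched as follows. The names `select` and `export` are hypothetical, and the reading of "Append selection" as "add the new matches to the current selection" is an assumption of this example, as is the case-insensitive substring matching.

```python
def select(headers, query, current=(), append_selection=False):
    """Return the headers matched by any comma-separated term in `query`,
    in file order. With append_selection, keep the current selection too."""
    terms = [t.strip().lower() for t in query.split(",") if t.strip()]
    chosen = {h for h in headers if any(t in h.lower() for t in terms)}
    if append_selection:
        chosen |= set(current)
    return [h for h in headers if h in chosen]


def export(records, selected, path, overwrite=False):
    """Write the selected (header, sequence) records as FASTA. Without
    overwrite, the sequences are added after the existing contents."""
    with open(path, "w" if overwrite else "a") as out:
        for header, seq in records:
            if header in selected:
                out.write(f">{header}\n{seq}\n")


records = [("RT_gypsy", "ACGT"), ("INT_gypsy", "GGCC"), ("RT_copia", "TTAA")]
headers = [h for h, _ in records]
print(select(headers, "copia, int"))
```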

Figure 10.4. Find sequences script

10.4 - Alignments

Frequent tasks in multiple alignment methodology are changing the format of an alignment and trying to identify the set of motifs common to all aligned sequences. GPRO implements two scripts for these tasks using DNA/RNA and protein multiple alignment inputs.

10.4.1 - Join alignments

GPRO includes a Join Alignments script that allows you to automatically join different files containing distinct multiple alignments (one per domain) into a single alignment within a single file, arranged in a user-defined order. The tool has two requirements: the number and names of the sequences must be identical in each file to be joined, and the alignments must be provided in FASTA format. Figure 10.5 illustrates the join alignment process. The procedure is as follows:

1) Drop the alignment files to be joined into the corresponding text box dialog. This dialog lists the distinct files, defining the order in which the alignments will be joined. Files can be reordered (or removed) using the commands "move up" and "move down" at the right.

2) Drop an output file into the corresponding dialog.

3) Run the script.

4) If the number and names of the sequences are identical, the tool will join the sequences and report a single FASTA alignment with all gag-pol domains joined in the specified order for each common name.
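The two requirements and the joining step can be expressed compactly. In this sketch each alignment is assumed to be already parsed into a dictionary mapping sequence names to aligned sequences; the function name `join_alignments` is hypothetical.

```python
def join_alignments(alignments):
    """Concatenate alignments (dicts of name -> aligned sequence) in the
    given order, after checking that every file holds the same names."""
    names = list(alignments[0])
    for index, alignment in enumerate(alignments):
        if set(alignment) != set(names):
            raise ValueError(f"alignment {index} has different sequence names")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError(f"alignment {index} has rows of unequal length")
    return {name: "".join(a[name] for a in alignments) for name in names}


gag = {"HIV1": "MG-A", "RSV": "MGAA"}
pol = {"RSV": "PQ", "HIV1": "P-"}
print(join_alignments([gag, pol]))  # order of the list defines the join order
```

Because sequences are looked up by name, the order of the sequences inside each file does not matter; only the order of the files does.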

Figure 10.5. Join alignments. As an example, we used different files, each containing a multiple alignment based on the GAG (red), Protease (AP, green), Reverse transcriptase (RT, blue), Ribonuclease H (RNaseH, yellow), Integrase (INT, violet) and Envelope (ENV, orange) proteins encoded by distinct Retroviridae retroviruses. We will join these six alignments in a single gag-pol alignment, organized as described in the figure.

10.4.2 - Format alignments

This is an alignment format converter, a tool that allows users to upload a protein or nucleotide multiple alignment file in one format and convert it into other formats in one step. The utility accepts and converts the following formats: FASTA, Clustal, Pir, MSF, Phylip and Stockholm. Any of the formats can be used for input and output, and the output can be generated in one, several or all formats simultaneously. Figure 10.6 shows a scheme of the procedure.

1) Drag the file to be processed from the Directory to the input text box

2) Select the format of the input alignment and the format you wish to change it to (aln, msf, phy, pir, sto)

3) Drop an output folder into the corresponding dialog

4) Run the script
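As an example of what such a conversion involves, the sketch below renders an alignment as a minimal Stockholm 1.0 block. It is an illustration and not GPRO's converter: the function name is hypothetical, the alignment is assumed to be already parsed into name/sequence pairs, and names are assumed to contain no whitespace.

```python
def to_stockholm(alignment):
    """Render name -> aligned sequence pairs as a minimal Stockholm 1.0 block."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    width = max(len(name) for name in alignment)
    lines = ["# STOCKHOLM 1.0"]                      # mandatory header line
    lines += [f"{name.ljust(width)}  {seq}" for name, seq in alignment.items()]
    lines.append("//")                               # end-of-alignment marker
    return "\n".join(lines) + "\n"


print(to_stockholm({"seq1": "AC-T", "s2": "ACGT"}))
```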

Figure 10.6. Alignment format converter.

11 - Menu Tools: Preferences

11.1 - Pipeline connection settings

To access your account in the remote computing cluster of GPRO you must have internet access and configure your connection credentials in the dialog “Pipeline connection settings”, available in the tab “Preferences”, as shown in Figure 11.1A. Specifically, you must provide the e-mail given when registering as a GPRO user, then enter the IP of your server. The default IP is that of the GPRO server at the University of Valencia shown in the figure, but if you have installed the GPRO pipeline on your own server you must type your server IP. Then enter the port, your user name and your password. If the settings are correct, you will see a confirmation message when clicking on the button "Test connection settings", as shown in the figure. Alternatively, you can check whether your GPRO license is connected to the computing cluster by just trying to enter any of the pipeline sections available in the main menu (Data preprocessing, Functional analysis, etc.).

11.2 - Worksheet preferences

This is a dialog to configure or change the font preferences for text style used by the worksheet (Figure 11.1B).

Figure 11.1. A) Configuring the connection to the computing cluster via the Pipeline connection settings tab. B) Configuring the font style of the worksheet

11.3 - Evidence code weights

GPRO follows an algorithm for Gene Ontology (GO) annotation inspired by that of BLAST2GO (Conesa et al. 2005). You can configure "evidence code weights" for your GO annotation by accessing the utility with the same name available via “Preferences” (Figure 11.2). More details are available on the BLAST2GO web site.

Figure 11.2. Configuring “evidence code weights” for GO annotation

11.4 - Activate/License software

Use this dialog for activating/licensing your software. There are three options:

Activate trial mode: if you have not already purchased the software but you want to try it for free for 30 days.

Activate license: if you have acquired a license, you can activate it here to use the pipeline functions for 1 year without limitations.

Buy a license: click here for acquiring a GPRO license in our website.

Figure 11.3. Activation/License dialog.

11.5 - Proxy connection settings

A proxy is an intermediary server that manages connections between your computer and the Internet. If your computer accesses the Internet through a proxy, GPRO must be configured to use that proxy in order to use the pipeline functions.

If you know that your network is not using any proxies, leave the "Without proxy" option selected.

However, if you are behind a proxy, you can choose one of the three following methods:

If you do not know the proxy settings, you can choose "Use system proxy settings" to let GPRO use the default proxy settings already configured in your computer.

If you know your proxy settings, you can manually specify the proxy configuration. This is the preferred option when using a network proxy. User, password and FTP settings are optional. The port for an HTTP proxy is usually 8080 by default.

If you have a Proxy Automatic Configuration file (.pac) URL, you can use it for loading settings automatically from a remote file.

Figure 11.4. Configuring proxy server settings.

12 - Menu Tools: Help

12.1 - About GPRO

Information about GPRO version

12.2 - Manual

Shows the GPRO online Wiki

12.3 - Report suggestions and bugs

A form for submitting comments on GPRO's features and reporting bugs

12.4 - Check updates

Check if there is a newer version of GPRO

13 - Worksheet annotation system ( I )

13.1 - Managing annotations for downstream analyses: Multihit and Single hit CSV files

The GPRO pipelines report annotation outputs in a plain-text file commonly known as a CSV (comma-separated values) file; a CSV-like annotation file can be, for instance, an Excel document saved as CSV. This file can be navigated and interrogated for downstream analyses using a management system called the worksheet, which consists of a grid of editable cells arranged in numbered rows and columns. As shown in Figure 13.1, CSVs can be opened or created as a new worksheet using the menu tab "Databases" (at the top, in a red circle). If the CSV is already available in the directory of GPRO you can also open it by double-clicking on its icon (below, also in a red circle). Either way, when opening a CSV as a worksheet a pop-up will appear allowing you to adapt GPRO to the format of your CSV via two drop-down tabs at the top called Field Separator and Delimiter (note that CSVs contain columns of data that may be separated by spaces, commas, semicolons, etc.). By selecting or deselecting the option below, “worksheet contains multiple HSPs for each match”, the interface also allows you to decide whether the CSV should be opened as a multihit file, which is the default option, or as a single hit file (indicated with a blue arrow).

Multihit files are CSVs collecting more than one matched hit per query. An example helps to explain this. If you perform a BLAST search via the functional analyses pipeline and subsequently process the resulting XML files with the GUI called “Process BLAST outputs”, you can decide to take a number of best hits (for more details see the section Process BLAST outputs in the chapter Functional Analyses). The result of this procedure is a CSV containing, for each query, as many rows as best hits detected.

If you open a CSV without selecting the option “worksheet contains multiple HSPs for each match”, the worksheet will list all the hits for all the queries contained in the CSV. However, if you open the same CSV as a worksheet with the multihit option selected, GPRO will create two coupled CSVs that respectively constitute the “reference” and the “master” files of the annotation to open as a worksheet. Bear in mind, however, that if you only took one best hit during the previous processing of the XMLs you will only have one hit per query, whether or not you open the CSV with the multihit option selected.

The reference file has the same name you gave to the CSV, and the master file has the same name followed by the additional “_multiHit” tag (Figure 13.1, circled in blue). The reference file is automatically opened as a worksheet with as many rows as queries, showing only the most significant hit for each query (the best of the best hits) according to the obtained BLAST score and e-value. If your analysis was not a BLAST search, you can select the columns on which to base the best hit selection.
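The reduction from a multihit table to one row per query can be sketched as follows. The column names are those of the default GPRO CSV columns; the exact rule GPRO uses to combine score and e-value is not specified here, so ranking by highest score with ties broken by lowest e-value is an assumption, as are the function name and the semicolon separator of the example data.

```python
import csv
import io


def best_hits(rows, query_col="Sequence", score_col="Score", evalue_col="E-value"):
    """Keep one row per query: highest score, ties broken by lowest E-value."""
    best = {}
    for row in rows:
        key = (-float(row[score_col]), float(row[evalue_col]))
        query = row[query_col]
        if query not in best or key < best[query][0]:
            best[query] = (key, row)
    return [row for _, row in best.values()]


multihit = (
    "Sequence;Subj. mapping;Score;E-value\n"
    "q1;sA;50;1e-5\n"
    "q1;sB;80;1e-10\n"
    "q2;sC;30;0.01\n"
)
rows = list(csv.DictReader(io.StringIO(multihit), delimiter=";"))
for row in best_hits(rows):
    print(row["Sequence"], row["Subj. mapping"])
```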

If you click on a row (we will explain this in more detail in the next sections of this chapter), a pop-up will appear with a summary of all detected hits for that query, letting you see how many alternative hits each query has or replace the hit selected as representative with any other in the summary. This is possible because the reference file is linked to the master file, which is the original annotation containing all hits per query. From that point on, you need to keep both files together in the same directory: if you remove the master file you will lose the original annotation and the option to see all alternative hits for each query. If you want to open the annotation worksheet again you only need to open the reference CSV. You do not need to open the master (which is in effect a backup copy) unless you prefer to repeat the multihit process (because you did something wrong or because you want to make a selection of your data according to various criteria and/or items). If you directly open the master file with the multihit option selected you will again create two coupled files: a reference labeled “_multihit” and a master file labeled twice, as “_multihit_multihit”.

Figure 13.1.- Opening CSV files as multihit or single hit worksheets. CSVs can be opened using the menu tab Databases or, if the CSV is available in the directory, by double-clicking on its icon (in red circles above and below). A pop-up interface will then appear with options to adapt GPRO to the format of your CSV. Among these, GPRO lets you open your CSV as a Multihit (the default mode) or as a Single hit file (blue arrow). If you open your CSV using the Multihit option, your original CSV will be turned into two coupled files (in a blue circle) that you can consider as the reference (the file to work with) and the master (the backup).

13.2 - Worksheet overview

The worksheet is launched in the main desktop and implements a wide variety of functions to: a) create and remove rows and columns; b) import, export and combine databases on the basis of a selection of rows and/or columns; c) add, search and replace annotation terms based on commonly used taxonomies and vocabularies; d) organize and color the cells according to key terms such as mapping, annotation, function, statistics; e) perform functional annotation based on the Gene Ontology vocabulary (GO) and/or retrieve metabolic maps from the Kyoto Encyclopedia of Genes and Genomes (KEGG); f) switch accessions between RefSeq databases; g) perform data mining and downstream analyses; h) compute statistics, etc. In addition, the worksheet can be linked to a sequence FASTA database so that changes are made simultaneously in the worksheet and the database. Figure 13.2 shows a screenshot of the GPRO worksheet.

Figure 13.2.- Worksheet screenshot: A) menu bar and functions; B) Column headers; C) Grid of cells with numbered rows and columns, which can be selected by clicking on the corresponding checkboxes; D) Information concerning availability of a link between the worksheet and FASTA database.

The default columns implemented by GPRO in CSV outputs are detailed below. Note that you can add or remove columns and rows by right-clicking with the mouse anywhere on the worksheet. You can also move a column from one position to another by selecting it with the left mouse button and dragging it.

Sequence: Your sequence/query name or label

Subj. mapping: The database subject mapped using your query

GI: Gene identifier (if available) of the subject

Accession: Accession number (if available) of the subject

Species: Host species

Score: Score of the alignment between query and subject (the HSP or alignment core between the query and the subject)

E-value: Statistic associated with the alignment between query and subject. The lower the E-value, the more significant the result; for a perfect hit (one sequence against itself), BLAST usually assigns an E-value of "0"

Query-from: Query sequence start position in the HSP

Query-to: Query sequence end position in the HSP

Subject-from: Subject sequence start position in the HSP

Subject-to: Subject sequence end position in the HSP

Query frame: Frame of the query

Subject frame: Frame of the subject

Identities: Degree to which the query and the subject are invariant

Positives: Number and fraction of residues for which the alignment scores have positive values

Query length: Length of the query sequence

Subject length: Length of the subject sequence

Align length: Length of the HSP or core alignment between the query and the subject

Similarity: Sequence similarity between the query and the subject

Hsp/Query: Coverage between the HSP and the query

Hsp/Hit: Coverage between the HSP and the subject

GO#: Number of GO terms detected

GO: Summary of GO terms detected

Evidence codes: Evidence code of GO annotation

Enzyme codes: Enzyme codes based on KEGG classification

InterProScan: InterProScan classification equivalence for each GO term

Comments: To take notes about this sequence

13.3 - Worksheet menu: File Tab

As previously shown in Figure 13.2, the GPRO worksheet has its own menu of utilities in addition to those available via mouse actions. The first tab, “File” (Figure 13.3), provides a drop-down summary of utilities. A brief description of each utility follows.

13.3.1 - Open worksheet

To launch pre-existing CSV files as worksheets

13.3.2 - Save

To save changes performed in the opened CSV

13.3.3 - Save as

To save a worksheet as a new CSV

13.3.4 - Download GenBank Accession files

To download sequences (as annotation files or as sequences) from GenBank using GenBank Accessions selected from a worksheet column

13.3.5 - Download sequences by GI

Download sequences from GenBank using the Gene Identifier (GI) accession selected from a worksheet column with the possibility to define the full-length annotated sequence or a core with start and end positions determined by other columns of the worksheet.

13.3.6 - Set as default worksheet

To set a CSV file as default worksheet (It will be opened automatically when launching GPRO).

13.3.7 - Unset as default worksheet

To unset a CSV file as default worksheet

Figure 13.3.- File Tab of the worksheet GPRO menu. You can save a worksheet into a CSV or set or unset a CSV as a default worksheet. By selecting the options “Download GenBank Accession files” or “Download sequences by GI” two distinct pop ups will appear to automatically download sequences or annotation files based on GenBank Accession or its GI.

13.3.8 - Import

This utility allows you to join two or more CSVs (from two or more annotations) using three different options - “Append worksheet”, “Combine worksheets” and “Clusters”

13.3.8.1 - Append worksheet

You can use “Append worksheet” to merge two CSVs using one of them as a template (Figure 13.4), provided that all CSVs have the same number of columns and the same column names. Go to “Import” > “Append worksheet”. GPRO will open a window in which you can browse for the worksheet from which you want to import the data using the “Import” tab of the dialog. The grid of this dialog shows the names of the columns available in each worksheet. If the names are accompanied by a green icon in the “status” section, the column names are identical and you can proceed to join both worksheets. If any status icon appears in red, please revise the column names of the two worksheets so that they coincide before running the utility.
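
The column check and the row append can be pictured with a short Python sketch. It is only an illustration of the idea, not GPRO's own code: the function names are hypothetical, the data are toy values, and it assumes that columns are compared position by position.

```python
import csv
import io


def column_status(template_header, imported_header):
    """Pair each template column name with True (green) or False (red)."""
    status = []
    for position, name in enumerate(template_header):
        same = position < len(imported_header) and imported_header[position] == name
        status.append((name, same))
    return status


def append_worksheet(template_csv, imported_csv):
    """Return the template rows followed by the imported rows."""
    template = list(csv.reader(io.StringIO(template_csv)))
    imported = list(csv.reader(io.StringIO(imported_csv)))
    header, other_header = template[0], imported[0]
    same_width = len(header) == len(other_header)
    if not same_width or not all(ok for _, ok in column_status(header, other_header)):
        raise ValueError("column names differ; rename them before appending")
    # Skip the header of the imported worksheet so it is not duplicated.
    return template + imported[1:]


# Toy worksheets with identical column names.
template_csv = "Sequence,Function\nseq1,kinase\n"
imported_csv = "Sequence,Function\nseq2,protease\n"
merged = append_worksheet(template_csv, imported_csv)
```

With these toy inputs, `merged` holds one header row followed by the rows of both worksheets. A mismatch in any column name raises an error instead of silently misaligning the data.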

Figure 13.4.- Append worksheet. Open the CSV you want to use as template (outlined red in the figure).

13.3.8.2 - Combine worksheets

The utility “Combine worksheets” combines data from two CSVs into a single worksheet using a common column as a join reference (for instance, the column with the sequence name or identifier). Note that the difference between this tool and "Append worksheet" is that while the latter adds new rows (i.e. new sequences to annotate in your database project), "Combine worksheets" adds new columns with new information (taxonomy, ontology, etc.). The utility is useful for joining the results of two comparative analyses (for instance, two independent BLAST searches against two different Refseq databases). Figure 13.5 shows a graphical description of how to run Combine worksheets. To do this, select the option “Combine worksheets” within the utility “Import”. Then (1) a GUI will appear for you to browse for the CSVs you want to use as the template (or master) worksheet and as the related worksheet, respectively. (2) A drop-down dialog called "Key Column" is available for each of them so that you can select the column common to both worksheets (remember that this column must contain identical labels). (3) Below this dialog there is a list presenting the distinct columns of each file you want to join. Check the columns you want to combine for each worksheet, and (4) browse for an output CSV in which to save the new project and click OK to run the function.
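
Conceptually, this operation resembles a join on the key column. The Python sketch below illustrates that idea under stated assumptions: the function name, column names and values are hypothetical, and leaving cells empty when a master row has no match in the related worksheet is a choice made for this example, not documented GPRO behaviour.

```python
def combine_worksheets(master_rows, related_rows, master_key, related_key, related_columns):
    """Add the chosen columns of related_rows to master_rows via a shared key."""
    # Index the related worksheet by the label found in its key column.
    lookup = {row[related_key]: row for row in related_rows}
    combined = []
    for row in master_rows:
        match = lookup.get(row[master_key], {})
        new_row = dict(row)
        for column in related_columns:
            # Assumption: rows without a match receive an empty cell.
            new_row[column] = match.get(column, "")
        combined.append(new_row)
    return combined


# Toy worksheets represented as lists of dictionaries (one per row).
master = [
    {"Sequence": "seq1", "Hit def": "SA0815"},
    {"Sequence": "seq2", "Hit def": "SA0816"},
]
related = [{"Name": "seq1", "Taxonomy": "Firmicutes"}]
combined = combine_worksheets(master, related, "Sequence", "Name", ["Taxonomy"])
```

Here the number of rows is unchanged and only a new "Taxonomy" column is added, which is the difference from "Append worksheet" noted above.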

Figure 13.5.- Combine worksheets. This function allows you to export columns from one worksheet to another using a column that is common to both worksheets as a join reference (identical row labels or terms, for instance sequence names).

13.3.8.3 - Clusters

If you have previously identified cluster or family relationships (for example, paralogs, repeats or MGEs related to one another) among the sequences of your annotation, you can add this information to the worksheet using the utility “Clusters” and a cluster file (File:Clusterfile example.zip) in CSV format. This file contains the clustering information distributed in as many columns as there are names of related sequences (the members of a cluster) and as many rows as there are clusters (framed in blue in the example). Figure 13.6 details the process.
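
The layout of the cluster file (one row per cluster, one member name per cell) can be turned into a new worksheet column as in the following Python sketch. The function name and the `cluster_N` labels are hypothetical and chosen only for illustration. The actual label GPRO writes may differ.

```python
def cluster_column(cluster_rows, sequence_names):
    """Build a new column giving the cluster of each worksheet sequence."""
    member_to_cluster = {}
    for index, members in enumerate(cluster_rows, start=1):
        for name in members:
            if name:  # ignore empty cells of shorter clusters
                member_to_cluster[name] = f"cluster_{index}"
    # Sequences that belong to no cluster get an empty cell.
    return [member_to_cluster.get(name, "") for name in sequence_names]


# Toy cluster file: two clusters with three and two members.
cluster_rows = [
    ["seq1", "seq4", "seq7"],
    ["seq2", "seq5", ""],
]
new_column = cluster_column(cluster_rows, ["seq1", "seq2", "seq3", "seq4"])
```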

Figure 13.6.- Import clusters. This utility is useful for adding a new column providing known information about common relationships (function, taxonomy, paralogs, repeats, MGEs, etc.) among rows in the worksheet.

13.3.9 - Export

To export a set of annotated sequences, the worksheet allows you to perform a variety of selections on the basis of distinct row/column criteria (function, host, E-value, ontology, etc.) described in the next sections of this manual. When you are ready to export results, click on the "Export" option in the "File" command of the worksheet. You can choose one of three possibilities: "Export CSV & FASTA", "Export Annotation file" and “Export categories & clusters”.

13.3.9.1 - Export worksheet and FASTA

This option lets you export results in three modes: first, as a new worksheet; second, as a new worksheet coupled with an associated FASTA database in which the FASTA headers of the sequences are labeled according to annotation terms; and third, as a FASTA database with the annotated sequences. Figure 13.7 provides a graphical description. Note that the two modes that export FASTA data require a reference sequence database previously associated with the worksheet. To learn how to do this, see the menu tab "Associate database". In all exporting cases, you must check the rows and columns whose data you want to export.

Figure 13.7.- Exporting worksheet and sequence data. You can export a selected subset of your CSV into another CSV, or this subset coupled to a subset of FASTA sequences. To do so, click on the “File” tab and select the path “Export” > “Export CSV & FASTA”. You have three options: A) “Worksheet & fasta”, which lets you export a CSV coupled with a FASTA database if there is a previous worksheet-database association (to do this see the menu Tab “Associate Database”). B) “Worksheet only”, which only exports the selected subset into a new CSV. C) “FASTA only”, which exports the subset as a FASTA file (again, there must be a previous association between the worksheet and a reference sequence database). In all exporting cases, you must check the rows and columns whose data you want to export (step indicated with hand icons).

13.3.9.2 - Export annotation file

This section is for exporting annotated sequences (rows) from the worksheet using one or more columns as the reference for the annotation records and other columns as the features of each record. The exported output is a plain text file (named “Annotation” in all cases) with the annotation information usually organized in pairs of lines for each worksheet row (i.e. each annotated sequence), except for rows that share header information. As shown in the example below, the first pair is the annotation header and provides information about the organization of the annotation in the file.

Reference=Query def|Function|

Hit def|Hit accession|Score|e-value|Query from|

The first line of the header indicates which columns (separated by bars) have been selected as the annotation references. In the example above, these are "Query def" and "Function". The second line sets the order assigned to the distinct columns (separated by bars) you select as the subject. In the example above, these columns are called "Hit def", "Hit accession", "Score", "e-value" and "Query from". The remaining pairs correspond to the distinct annotated sequences (one pair for each sequence) according to the header organization. Four examples of annotation follow.

Reference=contig00720gene_4|stage iii sporulation protein j precursor|

lin2986|179848|608|8,06E-58|14|

Reference=contig00745gene_4|nitrate reductase beta chain|

SA2184|114419|2535|0|1|

Reference=contig00667gene_92|general stress protein 13|

SA0816|113083|409|2,34E-35|41|

Reference=contig00667gene_91|peptidylprolyl isomerase|

SA0815|113082|982|2,09E-101|1|

If some rows share the same header (for instance, duplicated or related ORFs within the same contig or scaffold), they will be grouped into a cluster with as many lines as there are rows sharing the header. See the example below.

Reference=contig00667gene_89|monovalent cation h+ antiporter|

SA0813_1|4920|2617|0|1|

SA0812|113080|574|2,12E-54|1|

SA0811|113079|525|9,02E-49|1|

SA0810|113078|2008|0|1|

Reference=contig00667gene_70|membrane protein|

SA0329|112610|994|1,67E-102|1|

SA0794|113062|1871|0|1|

SA0792|113060|207|5,95E-12|1|

Figure 13.8 presents a graphical description of this process and the format in which the data are annotated in the output file.
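
The record layout described in this section can be reproduced with a few lines of Python. This is a minimal sketch written from the examples above, not GPRO's exporter: the function name is hypothetical, and consecutive lines without blank separators are an assumption about the exact file layout.

```python
def format_annotation(rows, reference_columns, export_columns, sep="|"):
    """Render rows as 'Reference=' headers followed by feature lines."""
    # File header: names of the reference columns, then of the exported columns.
    lines = [
        "Reference=" + sep.join(reference_columns) + sep,
        sep.join(export_columns) + sep,
    ]
    groups = {}
    for row in rows:
        header = "Reference=" + sep.join(row[c] for c in reference_columns) + sep
        features = sep.join(row[c] for c in export_columns) + sep
        # Rows sharing a header are collected under a single header line.
        groups.setdefault(header, []).append(features)
    for header, feature_lines in groups.items():
        lines.append(header)
        lines.extend(feature_lines)
    return "\n".join(lines)


columns = ["Query def", "Function", "Hit def", "Hit accession", "Score", "e-value", "Query from"]
values = [
    ["contig00720gene_4", "stage iii sporulation protein j precursor", "lin2986", "179848", "608", "8,06E-58", "14"],
    ["contig00667gene_70", "membrane protein", "SA0329", "112610", "994", "1,67E-102", "1"],
    ["contig00667gene_70", "membrane protein", "SA0794", "113062", "1871", "0", "1"],
]
rows = [dict(zip(columns, row)) for row in values]
annotation = format_annotation(rows, columns[:2], columns[2:])
```

The two rows sharing the header `contig00667gene_70|membrane protein|` end up under one `Reference=` line, mirroring the clustered example shown earlier in this section.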

Figure 13.8.- Exporting annotations. To export your annotation results, the option "Export annotation file" lets you prepare and export the annotation via the worksheet. Select the path “Export” > “Export annotation file” available within the “File” worksheet command. The annotation format followed by GPRO is a summary of records organized into pairs of lines for each annotated row, except for rows sharing header information. The first line of each record is the item header, created by selecting one or more reference columns to refer to each item in the annotation. To make the header selection, move any column you want to use as header (one, two or more) from the dialog list called “Reference columns” to the adjacent area using the transfer arrow between these two dialogs. You can then reorganize the order of the reference columns in the annotation header using the vertical arrows. Select the columns you want to export as annotation features and move them from the dialog list “Export columns” to its adjacent area. The procedure is identical to that for the “Reference columns”, but in this case you have the additional option of joining the information from two columns into a single one. You can select the type of field separator you want to apply to separate columns in each annotation item (by default, a vertical bar). Finally, the program presents a preview of the exporting format at the bottom of the window. If this is correct, press OK to run the automatic annotation export.

13.3.10 - Export categories and clusters

This function allows the user to export sequence rows as categories (one file per category) or clusters (sequence pools created on the basis of common features and exported to a single file). The selection is performed by taking into consideration a term repeated in, or common to, distinct rows within the column you select as a reference (function, clades, host species, etc.). Figure 13.9 illustrates the process to be followed in order to export rows by categories.

Figure 13.9.- Exporting rows in categories. This tool allows you to export sequence rows by categories in distinct files (one file per category) on the basis of common features in a column selected as a reference, as follows. 1) Check the rows (the output units) and columns (the information associated with each row) you want to export and follow the path “Export” > “Categories & clusters” available in the “File” command of the worksheet menu. A dialog window will appear. 2) Select a destination folder where the generated files will be stored and the worksheet column you want to search for common terms. In addition, a summary of the distinct worksheet columns is available for you to add or remove columns. 3) Check the exporting option you want to use (in this case “Categories”) and press OK to run the utility. 4) The output folder “Export Categories” will contain all CSV files generated by this tool, divided into distinct files according to the recurrence of common terms in the searched column, here called “function”. 5) An example of a generated file displaying the rows grouped by the same “function” category (framed in red in the input worksheet). Here (File:Categories example.zip) you have an example of a template CSV and the output folder resulting from this analysis.

The process for exporting clusters of related rows within a single file is almost identical to that described in Figure 13.9 for exporting categories. Figure 13.10 depicts the procedure. If the worksheet in use is associated with a database, the “Export categories & clusters” function will also provide the corresponding FASTA files for the sequences of the categories and clusters obtained.
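
The grouping behind the “Categories” option can be sketched in Python as follows. This is an illustrative approximation only: the function names and the `<term>.csv` file naming are hypothetical, and the sketch builds the CSV text in memory instead of writing to a destination folder.

```python
import csv
import io
from collections import defaultdict


def group_by_term(rows, key_column):
    """Collect rows that share the same term in the reference column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_column]].append(row)
    return dict(groups)


def categories_as_csv(rows, key_column, columns):
    """Return one CSV text per category, keyed by a hypothetical file name."""
    files = {}
    for term, members in group_by_term(rows, key_column).items():
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=columns,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        writer.writerows(members)
        files[term + ".csv"] = buffer.getvalue()
    return files


# Toy worksheet rows with a "function" column used as the reference.
rows = [
    {"Sequence": "seq1", "function": "kinase"},
    {"Sequence": "seq2", "function": "protease"},
    {"Sequence": "seq3", "function": "kinase"},
]
files = categories_as_csv(rows, "function", ["Sequence", "function"])
```

In this picture, the “Clusters” option would use the same grouping step but write all groups into a single file.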

Figure 13.10.- Exporting rows in clusters. You can export sequence pools as clusters of rows created on the basis of common features in a column selected as a reference. The procedure is the same as that shown for exporting categories. 1) Check the rows and columns you want to export and follow the path “File”>“Export” >“Categories & clusters” in the worksheet menu. 2) Select a destination folder to deposit the output file into and the key column in the worksheet you want to search for common terms. 3) Check the exporting option (in this case “Clusters”) you want to use and press OK to run the utility. 4) An example of a “Clusters file” where sequences with the same function were clustered (framed in red in both “Clusters” and worksheet files).

13.3.11 - Show/Hide columns

This function allows the user to select which columns to show or hide in the worksheet. You can select the columns you want to show (or export) by manually clicking on the checkbox at the top of each selected column using the mouse (Figure 13.11).

Figure 13.11.- Show/Hide columns screenshot.

13.4 - Worksheet menu: Edit Tab

13.4.1 - Search and replace

This utility works identically to that previously discussed in the Database Editor but applies the search and replacement of labels within the Worksheet. By selecting this utility, a dialog is opened for you to choose between two utilities - "Search" or "Replace".

13.4.2 - Undo

Undo the last actions performed.

13.5 - Worksheet menu: Sorting/Filtering

13.5.1 - Sort

Worksheet contents can be organized according to the ascending or descending order established in a column. As shown in Figure 13.12A, use this function to select the column of reference and type of data (text or numerical data), then decide the ordering. The whole worksheet will be rearranged according to your choice.
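
The choice between text and numerical data matters because the same values are ordered differently in each case. The Python sketch below illustrates this with scores taken from the annotation examples in section 13.3.9.2. The function name is hypothetical, and converting comma decimals (as in "8,06E-58") to dots is an assumption based on how values appear in those examples.

```python
def sort_rows(rows, column, numeric=False, descending=False):
    """Sort worksheet rows by one column, as text or as numbers."""
    def key(row):
        value = row[column]
        if numeric:
            # Assumption: decimals may use a comma, as in "8,06E-58".
            return float(value.replace(",", "."))
        return value
    return sorted(rows, key=key, reverse=descending)


rows = [
    {"Hit def": "lin2986", "Score": "608"},
    {"Hit def": "SA2184", "Score": "2535"},
    {"Hit def": "SA0816", "Score": "409"},
]
as_text = [row["Score"] for row in sort_rows(rows, "Score")]
as_numbers = [row["Score"] for row in sort_rows(rows, "Score", numeric=True)]
```

Sorted as text, "2535" comes before "409" because the comparison is made character by character. Sorted as numbers, it comes last.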

13.5.2 - Filter by position

In many cases, you may have two separate annotations of the same genome, obtained for instance by searching against two different RefSeq databases. This function allows you to filter positional redundancies (using a minimum overlap) between both files using the start and end positions, and then keep in one annotation file whatever was not captured by the other, or vice versa. Figure 13.12B shows a screenshot of this function.
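
One way to picture the positional filter is as an interval-overlap test. The following Python sketch is an assumption-laden illustration and not GPRO's algorithm: the function names and the `start`/`end` keys are hypothetical, coordinates are taken as inclusive, and all rows are assumed to refer to the same reference sequence.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def filter_by_position(primary, secondary, min_overlap=1):
    """Keep rows of `secondary` that were not captured by `primary`."""
    kept = []
    for row in secondary:
        redundant = any(
            overlap_length(row["start"], row["end"], other["start"], other["end"]) >= min_overlap
            for other in primary
        )
        if not redundant:
            kept.append(row)
    return kept


# Toy annotations of the same sequence from two independent searches.
primary = [{"name": "geneA", "start": 1, "end": 100}]
secondary = [
    {"name": "geneA_like", "start": 50, "end": 150},
    {"name": "geneB", "start": 200, "end": 300},
]
unique_to_secondary = filter_by_position(primary, secondary, min_overlap=20)
```

Swapping the two arguments gives the "vice versa" case, that is, the rows of the first annotation that the second one did not capture.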

Figure 13.12.- Sorting/Filtering. A) Sort. B) Filter by position: filtering positional redundancies between two annotation files.

14 - Worksheet annotation system ( II )

14.1 - Worksheet menu: Annotation Tab

This Tab offers diverse options for functional annotation based on the most frequent ontology vocabularies and classification systems, or according to personalized criteria such as levels of significance (non-significant hits, mapped sequences, annotated sequences, etc.). Here you can also switch sequence IDs between distinct classification systems (GenBank, Ensembl, Uniprot, etc.).

14.1.1 - GO annotation

The Gene Ontology (GO) (Gene Ontology Consortium 2008) is a multi-disciplinary initiative created with the aim of providing a controlled vocabulary of terms for describing and annotating gene product data. GO is a component of the Open Biological and Biomedical Ontologies (OBO) for shared use of vocabularies across different biological and medical domains.

GO covers three domains:

Cellular component (C), which corresponds to the parts of a cell or its extracellular environment;

Molecular function (F), which collects the elemental activities of a gene product at the molecular level, such as binding or catalysis;

Biological process (P), which describes operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs and organisms.

By default, the device that processes BLAST outputs in the GPRO pipeline "BLAST and HMM search Pipeline plus GO-annotation" automatically adds Gene Ontology (GO) annotations to your BLAST results. However, if you do not need or want to perform BLAST searches (because they have already been done with another tool), you can use the GPRO worksheet to add GO terms plus KEGG enzyme codes (EC) to your data, provided that they are summarized row by row in a CSV and accompanied by at least one additional column with sequence IDs (such as those of GenBank, Uniprot, InterPro, etc.) that GPRO can process and associate with the respective GO IDs and terms.

The EC is a number assigned to a type of enzyme according to a scheme of standardized enzyme nomenclature found in ENZYME, the enzyme nomenclature database, and KEGG: Kyoto Encyclopedia of Genes and Genomes. InterProScan is an integrated database of predictive protein signatures (Quevillon et al. 2005) used for the classification and automatic annotation of proteins and genomes, available at EBI. To read more about the GO initiative, go to geneontology.org

The Clusters of Orthologous Groups (COGs) of prokaryotic proteins and their eukaryotic counterparts (KOGs) (Tatusov et al. 2003) are two collections of prokaryotic or eukaryotic proteins classified in ortholog groups of different species (or paralogs derived from duplication of a single gene within a genome).

14.1.1.1 - Append GO terms

Once you have uploaded a CSV with your gene data to the worksheet, you can add new columns containing the GO terms, GO IDs, Enzyme Codes (ECs) and InterProScan IDs by clicking on the tab "Append GO terms", as summarized in Figure 14.1.

Figure 14.1.- GO annotation. To add new columns containing GO terms to the worksheet: 1) open the dialog window of the function “Append GO terms” available in the worksheet tab called "Annotation"; 2) use the drop-down selectors called "Column name" and "Data type", respectively, to select the worksheet column containing the “IDs" of the mapped sequences and the type of IDs, which must be “GIs” or “Uniprot” IDs. If you mapped your sequences using GenBank accessions, use the function "Switch GI/accession", also available in the command "Annotation", to convert GenBank accessions to GIs; 3) use the mouse to select the annotation columns you want to append to the worksheet based on the GO system and its related nomenclatures (“GO”, “EC”, “InterProScan”); and 4) click OK. If you select the three features, GPRO will add three new columns to the worksheet (framed in red), providing annotation information for each sequence (row).

14.1.1.2 - Evidence code weights

GPRO follows a GO annotation algorithm inspired by that previously applied by BLAST2GO (Conesa et al. 2005). Using this tab you can assign distinct weights to the evidence codes of your GO annotation.

Figure 14.2.- Evidence codes manual configuration.

14.1.1.3 - Display graph

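GO annotation results can also be visualized as a directed acyclic graph (DAG) by selecting the option Display Graph available in the submenu of the Annotation Tab. By clicking on any GO term within the DAG, the tool links to the AmiGO browser of the GO Consortium to search for the selected term (Figure 14.3).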

Figure 14.3.- Displaying and browsing the DAG, whose nodes and edges can be moved or edited manually. By clicking on any particular node, the tool links to the AmiGO browser of the GO Consortium to search for the selected term.

14.1.1.4 - GO depth statistics

By selecting the tab "Annotation" -> right submenu "GO Depth statistics", the tool allows you to obtain Bar or Pie charts based on any of the three domains "Cellular component", "Biological process" or "Molecular function" (Figure 14.4), with distinct filters considering number of sequences, distance decay, node score, DAG level, graphic type, color, etc. A graphical representation will appear in the working space layout of GPRO (below). By right clicking on the image you can export it as an image or as a matrix table (option shaded in the pop-up at the bottom right of the figure) in a CSV for further graphical representation with any other tool (Excel, for instance).

Figure 14.4.- Creating graphical images based on the GO annotation at the DAG.

14.1.2 - Append InterPro data

This utility lets you add new information to your annotation based on the IDs of any or all of the InterPro-like databases covered by InterProScan (Quevillon et al. 2005). As shown in Figure 14.5, you only need to launch the utility, select a reference column, give a name to the new column to be created and then check any or all of the databases implemented by InterProScan. For more details about the InterPro initiative, go to the InterPro Site.

Figure 14.5.- Adding InterPro Database IDs to your annotation.

14.1.3 - Append COG/KOG terms

To do this, it is necessary to have previously mapped your sequences (via a BLAST search) to the Refseq COG and KOG databases integrated in the NCBI Conserved Domain Database (CDD) (Schug et al. 2002) available at the FTP of the NCBI. For details about how to perform a BLAST search see the section "BLAST and HMM Pipeline".

Once you have the CSV resulting from the COG/KOG automatic annotation, you can use the Worksheet utility "Annotation" to add COG or KOG terms to your CSV by clicking on the tabs "Append COG terms" (if you are annotating prokaryotic orthologs) or "Append KOG terms" (if you are dealing with eukaryotic orthologs). Then choose the column of reference (“GI”) and the type of data contained in this column (“gi” or “Protein names”). Finally, check the boxes below this dialog to choose the COG/KOG terms you want to append to the worksheet as two or three new columns, and click OK.

Figure 14.6.- Appending COG or KOG terms to the worksheet. The image summarizes the process for appending COG terms. The process for appending KOG terms is identical, except that you follow the submenu path below (circled in red).

14.1.4 - Apply annotation colors

As shown in Figure 14.7, this utility is for configuring specific preferences for your worksheet. It allows you to set the colors to be applied to the worksheet's rows according to the following annotation criteria:

Non-significant hits: rows containing null or E-values higher than the threshold value specified by the user (in white).

Significant hits: rows containing e-values lower than the threshold value specified by the user (in orange).

Mapped: rows containing significant hits and GO codes (in green).

Annotated: mapped rows that also contain Enzyme Codes (in steel blue).

Annotated plus: annotated rows that contain other annotation criteria (in dark goldenrod).

The box for the e-value threshold can be edited so that you can type the threshold of your choice.
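
The five criteria form a hierarchy in which each level builds on the previous one. The Python sketch below expresses that reading. It is an interpretation of the list above, not GPRO's implementation: the function name and the default threshold are hypothetical, and treating an e-value equal to the threshold as significant is an assumption, since the text does not cover that case.

```python
# Color names as listed in this section.
COLORS = {
    "non-significant": "white",
    "significant": "orange",
    "mapped": "green",
    "annotated": "steel blue",
    "annotated plus": "dark goldenrod",
}


def annotation_class(e_value, go_codes, enzyme_codes, other_annotation, threshold=1e-3):
    """Classify a worksheet row following the criteria listed above."""
    if e_value is None or e_value > threshold:
        return "non-significant"
    if not go_codes:
        return "significant"
    if not enzyme_codes:
        return "mapped"
    if not other_annotation:
        return "annotated"
    return "annotated plus"


# Toy row: a significant hit with a GO code but no enzyme code.
row_class = annotation_class(1e-50, "GO:0005524", "", "")
row_color = COLORS[row_class]
```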

Figure 14.7.- Apply annotation colors. Select “Apply annotation colors” in the Annotation tab and a dialog window opens. Select the columns to which you want to apply the available criteria and click OK. The resulting worksheet will display the rows colored according to these criteria.

14.1.5 - Switch database IDs

The GI is an identification number for nucleotide and protein sequences, while the accession number represents the database record of a sequence in GenBank, a database where nucleotide and protein sequences from more than 260,000 organisms are publicly available thanks to an international collaboration among the NCBI, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database and the DNA DataBank of Japan (DDBJ).

As shown in Figure 14.8, you can switch from an Accession Number to its corresponding Gene Identifier (GI) or vice versa, or from one database ID to another, by selecting "Annotation" -> "switch database format". A dialog will appear for you to choose a worksheet column in the dialog "select column from" and the type of data in the dialog "Format from". Then you have two options: a) select a preexisting column via the boxes "Select column to" and "Format to" if you want to replace the terms of this column; or b) create a new column with the new information, in which case you just need to give it a name.

The tool lets you carry out this process in two modes: either applying the changes directly to a CSV opened in the GPRO worksheet, or in batch mode (by selecting the option Select folder in the figure below) to process several CSVs simultaneously.

Figure 14.8.- Switch database IDs. You can change the GI terms to accession numbers or vice versa.

14.2 - Worksheet menu: Select Tab

To select specific rows of the worksheet (for export/annotation purposes) by using different selection criteria (terms, colors, etc.). Any of these can be combined with the "Export" function available in the first worksheet tab "File" to create subsets from your database or annotation.

14.2.1 - Select sequences by key terms

To select sequences (i.e. rows) using specific terms in any column, as described in Figure 14.9. Clicking on “Select key terms” opens a new dialog. Press the "Add" tab and enter as many terms as you want; for each term you can assign a color to the rows (sequences) carrying that label within a particular column, which you must specify. As a result, the rows matching the selected terms will be checked in the worksheet and highlighted with the assigned colors.

Figure 14.9.- Select sequences using key terms.

14.2.2 - Select sequences by expect or statistics values

This selection can only be performed on columns containing numerical data. Given a chosen cutoff value, you can select sequences using the statistical significance of your values as a criterion (for instance, a column containing e-values, as shown in Figure 14.10). Just enter a cutoff value and choose a numerical column of reference. The utility will color rows with lower values, rows with higher values and rows with non-significant hits differently, according to the colors established for these tasks. You can modify the color code by clicking on each color box, and you can tell GPRO to check in the worksheet the rows highlighted in any color (as shown in the figure). Then click OK to run the script.

Figure 14.10.- Select sequences based on e-value expects or statistics values.

14.2.3 - Selecting set of sequences differentiated by colors

This tab provides an additional utility for selecting previously colored rows (Figure 14.11). Just choose one of the colors (orange in the example) corresponding to the rows previously colored according to your criteria, and press OK to run the script.

Figure 14.11.- Selecting sequences sets by colors

14.2.4 - Selecting sequences by multiple criteria

This is for selecting sequences according to combined selection criteria based on the terms found in up to three columns. As shown in Figure 14.12, when you select this option a dialog table appears for you to select up to three columns and a search term for each column.
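
A combined selection of this kind can be sketched in Python as shown below. The sketch rests on assumptions that the manual does not state: the criteria are combined with a logical AND, and a term matches when it occurs anywhere in the cell, ignoring case. The function name is hypothetical and the rows are taken from the annotation examples in section 13.3.9.2.

```python
def select_by_criteria(rows, criteria):
    """Return the indexes of rows matching every (column, term) pair."""
    if len(criteria) > 3:
        raise ValueError("at most three columns can be combined")
    selected = []
    for index, row in enumerate(rows):
        # Assumption: all criteria must hold (AND), substring match, case-insensitive.
        if all(term.lower() in row[column].lower() for column, term in criteria.items()):
            selected.append(index)
    return selected


rows = [
    {"Query def": "contig00720gene_4", "Function": "stage iii sporulation protein j precursor"},
    {"Query def": "contig00745gene_4", "Function": "nitrate reductase beta chain"},
    {"Query def": "contig00667gene_92", "Function": "general stress protein 13"},
    {"Query def": "contig00667gene_91", "Function": "peptidylprolyl isomerase"},
]
checked = select_by_criteria(rows, {"Function": "protein", "Query def": "contig00667"})
```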

Figure 14.12.- Selecting sequences by multiple criteria

14.2.5 - Delete checked sequences from the Worksheet

To delete all previously selected rows in the worksheet (Figure 14.13). To do this you only need to check the sequences you want to eliminate either manually or using any of the distinct utilities available in the Tab "Select" and then click on the last utility of this menu (delete checked rows).

Figure 14.13.- Remove checked sequences (those selected in the red circle) from your worksheet

14.3 - Worksheet menu: Associate database

Using this tab you can associate a specific FASTA file with its automatic annotation in a CSV, using the worksheet and one or more columns as the common reference. To make the association, the contents of the selected columns must be found in the FASTA headers of the sequences within the FASTA file.

14.3.1 - Associate fasta sequences to your annotation CSV file

By clicking on the “Associate database” Tab, you open a new window for database association (Figure 14.14). Then use the “worksheet columns” option to select, in the left area, the reference columns for the association (in the example, “Sequence” and “Function”) and move them to the right area named "Sequence columns" (manually, by selecting and dragging them with the mouse). If required, choose the column separator character. Then browse for the FASTA file of interest in your directory. The script will present a preview of the titles of the selected worksheet sequence columns and of the FASTA header. Press OK to run the script. If the association is successful you will be notified (see the link circled in red below the worksheet GUI), and any change made in the worksheet column will be automatically applied to the FASTA header of the sequence associated with that worksheet row in the FASTA file.
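
The matching that makes an association possible can be sketched in Python as follows. This is a simplified illustration, not GPRO's script: the function names are hypothetical, the sequences are toy data, and it assumes that the selected column values, joined by the separator, must occur in the FASTA header.

```python
def parse_fasta(text):
    """Read FASTA text into a dictionary of header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def associate(rows, sequence_columns, fasta_records, sep="|"):
    """Link each row index to the FASTA header containing its label."""
    links = {}
    for index, row in enumerate(rows):
        label = sep.join(row[column] for column in sequence_columns)
        for header in fasta_records:
            if label in header:
                links[index] = header
                break
    return links


fasta_text = (
    ">contig00720gene_4|stage iii sporulation protein j precursor\nATGAAA\nTTT\n"
    ">contig00745gene_4|nitrate reductase beta chain\nATGCCC\n"
)
rows = [
    {"Sequence": "contig00745gene_4", "Function": "nitrate reductase beta chain"},
    {"Sequence": "contig99999gene_1", "Function": "unknown"},
]
records = parse_fasta(fasta_text)
links = associate(rows, ["Sequence", "Function"], records)
```

Rows without a matching header (the second toy row here) remain unlinked, which is why the column contents must appear in the FASTA headers before an association is attempted.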

Figure 14.14.- Associate a FASTA file with your CSV worksheet
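The matching rule behind this association can be sketched in a few lines of Python. This is only an illustration, not GPRO's implementation: the function names `parse_fasta` and `associate`, the "|" separator, and the rule that the joined column values must occur as a substring of the FASTA header are all assumptions made for the example.

```python
import csv
import io

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def associate(rows, fasta_records, key_columns, separator="|"):
    """Map each worksheet row index to the index of the first matching FASTA record.

    Assumed rule: a row matches a record when the selected column values,
    joined by the separator, occur in the FASTA header.
    """
    links = {}
    for i, row in enumerate(rows):
        key = separator.join(row[c] for c in key_columns)
        for j, (header, _) in enumerate(fasta_records):
            if key in header:
                links[i] = j
                break
    return links

# Hypothetical worksheet and FASTA contents.
csv_text = "Sequence,Function\ncontig1,kinase\ncontig2,transporter\n"
fasta_text = (
    ">contig1|kinase len=12\nATGAAATTTGGG\n"
    ">contig2|transporter len=9\nATGCCCTAA\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
records = parse_fasta(fasta_text)
links = associate(rows, records, ["Sequence", "Function"])
```

Rows whose key is not found in any header are simply left out of `links`, which mirrors the requirement that the selected column contents be present in the FASTA headers.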

14.3.2 - Remove association between your fasta file and the CSV

Associations between the worksheet and database file are maintained even if GPRO is closed. This option allows you to remove such an association.


14.4 - Worksheet menu: Statistics

The “Statistics” tab computes statistics on the distribution of the results in a worksheet column, which can be categorical or numerical depending on the column. Figure 14.15 below summarizes both possibilities.

14.4.1 - Numerical data statistics

To analyze quantitative data, select the path Statistics -> Numerical Data Statistics. A pop-up (Figure 14.15, left) appears in which you select the column of interest and, for e-value-based data, optionally a log10-based distribution of the results.
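To illustrate why a log10 scale helps with e-values, which span many orders of magnitude, here is a minimal Python sketch. The function `summarize` and the `floor` value used to clamp e-values of zero are assumptions made for this example and do not describe how GPRO computes its statistics.

```python
import math
import statistics

def summarize(values, log10=False, floor=1e-300):
    """Return basic descriptive statistics for a numerical column.

    With log10=True the values are log10-transformed first; zeros are
    clamped to `floor` (an assumed choice) because log10(0) is undefined.
    """
    if log10:
        values = [math.log10(max(v, floor)) for v in values]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

# Hypothetical e-values from an annotation worksheet.
evalues = [1e-50, 1e-20, 1e-5, 1e-3]
stats = summarize(evalues, log10=True)
```

On the raw scale the mean of these values is dominated by the largest e-value, whereas on the log10 scale each hit contributes its exponent.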

14.4.2 - Categorical data statistics

Figure 14.15 (right) shows the alternative path, Statistics -> Categorical Data Statistics, for analyzing the distribution of results based on qualitative data (e.g. names, species). The tool can draw the analysis as a bar or pie chart, in horizontal or vertical orientation, in any color, and in 2D or 3D mode, similarly to the statistics based on GO annotation discussed above. By right-clicking, you can export the image, or the data as a matrix table in a CSV file for further editing with other programs such as Excel.

Figure 14.15.- Analyzing the distribution of results per worksheet column, for numerical (left) and categorical (right) data
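The counting behind a categorical bar or pie chart, and the export of the counts as a CSV table, can be sketched as follows. This is a hedged illustration: `category_counts`, `counts_to_csv` and the two-column layout of the exported table are assumptions, not the format GPRO actually writes.

```python
import csv
import io
from collections import Counter

def category_counts(rows, column):
    """Count how often each non-empty category occurs in a worksheet column."""
    counts = Counter(row[column] for row in rows if row[column])
    return counts.most_common()  # most frequent category first

def counts_to_csv(counts, column):
    """Serialize the counts as a two-column CSV table (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column, "count"])
    writer.writerows(counts)
    return buffer.getvalue()

# Hypothetical worksheet rows; the empty value is ignored.
rows = [
    {"Species": "Homo sapiens"},
    {"Species": "Mus musculus"},
    {"Species": "Homo sapiens"},
    {"Species": ""},
]
counts = category_counts(rows, "Species")
table = counts_to_csv(counts, "Species")
```

The resulting text can be saved to a file and opened in a spreadsheet program for further editing.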

14.4.3 - Metabolic pathways

If you have previously performed a functional annotation that appended GO terms and enzyme codes to the worksheet, and you have internet access, you can use this tool to link to KEGG (Kyoto Encyclopedia of Genes and Genomes) and retrieve graphical figures of the metabolic maps in which the annotated enzymes are functionally involved. As shown in Figure 14.16, you only need to select the CSV columns containing the sequence names, the enzyme code and the evidence code of the sequences under study. Then click run and wait; the process takes some time depending on the number of sequences in your annotation. Once the process is done, GPRO opens a working space in the layout below the worksheet, with a summary for navigating the metabolic maps and, to the right, a display of the selected map. Below this layout there is a detail of the metabolic pathway displayed, and above it there are three tabs for working with the maps or for downloading all of them to a folder that is placed in the project folder.

Figure 14.16.- Retrieving metabolic maps from KEGG
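As a rough idea of the preparation step, the sketch below groups worksheet sequences by enzyme code and builds one pathway query per code. Everything here is an assumption made for illustration: the column names, the helper functions, and the URL pattern, which follows the public KEGG REST "link" operation as we understand it and is not taken from GPRO. The code only builds the query strings and performs no network access.

```python
from collections import defaultdict

# Assumed form of a KEGG REST query linking an EC number to pathways.
KEGG_LINK_URL = "https://rest.kegg.jp/link/pathway/ec:{ec}"

def group_by_enzyme(rows, name_col, ec_col):
    """Group sequence names by enzyme code, skipping rows without a code."""
    groups = defaultdict(list)
    for row in rows:
        ec = row[ec_col].strip()
        if ec.upper().startswith("EC:"):
            ec = ec[3:].strip()  # accept both "EC:2.7.1.1" and "2.7.1.1"
        if ec:
            groups[ec].append(row[name_col])
    return dict(groups)

def pathway_query_urls(groups):
    """Build one query URL per distinct enzyme code."""
    return {ec: KEGG_LINK_URL.format(ec=ec) for ec in sorted(groups)}

# Hypothetical annotation rows.
rows = [
    {"Sequence": "contig1", "Enzyme": "EC:2.7.1.1"},
    {"Sequence": "contig2", "Enzyme": "2.7.1.1"},
    {"Sequence": "contig3", "Enzyme": ""},
]
groups = group_by_enzyme(rows, "Sequence", "Enzyme")
urls = pathway_query_urls(groups)
```

Querying once per distinct enzyme code rather than once per sequence keeps the number of requests proportional to the number of enzymes, which is one reason the running time depends on the size of the annotation.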

14.6 - Worksheet menu: Transcriptome post-processing

This tab lets you perform downstream curation of transcriptome sequence data in two different ways.

14.6.1 - Filter best isoform

By selecting the option "Filter best isoform", you can read the annotation CSV file of the whole transcriptome under analysis and then set one or more classification filters to select the most representative sequence among the distinct cDNAs annotated per transcribed gene. The selection uses the algorithm shown in Figure 14.17, a normalized combination of the most relevant BLAST statistics: the high-scoring segment pairs (HSPs) of both the query and the hit, the similarity, the inverse of the E-value, and the sequencing depth. You can base the filter on any one of these criteria or on all of them.
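One way to picture a normalized combination of these statistics is the Python sketch below. It is not the algorithm of Figure 14.17: the min-max normalization within each gene, the equal weighting of the criteria, the field names and the `evalue_floor` used to avoid dividing by zero are all assumptions chosen for the example.

```python
def min_max(values):
    """Scale values to [0, 1]; identical values all map to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

METRICS = ("query_hsp", "hit_hsp", "similarity", "inv_evalue", "depth")

def best_isoforms(isoforms, metrics=METRICS, evalue_floor=1e-300):
    """Return {gene: id of the isoform with the highest mean normalized score}."""
    by_gene = {}
    for iso in isoforms:
        by_gene.setdefault(iso["gene"], []).append(iso)
    best = {}
    for gene, group in by_gene.items():
        columns = []
        for m in metrics:
            if m == "inv_evalue":
                # Inverse of the E-value, clamped so that 0.0 does not divide by zero.
                raw = [1.0 / max(iso["evalue"], evalue_floor) for iso in group]
            else:
                raw = [float(iso[m]) for iso in group]
            columns.append(min_max(raw))
        scores = [sum(col[i] for col in columns) / len(columns)
                  for i in range(len(group))]
        winner = max(range(len(group)), key=scores.__getitem__)
        best[gene] = group[winner]["id"]
    return best

# Hypothetical annotation rows for two genes.
isoforms = [
    {"id": "iso1", "gene": "geneA", "query_hsp": 300, "hit_hsp": 280,
     "similarity": 90, "evalue": 1e-50, "depth": 40},
    {"id": "iso2", "gene": "geneA", "query_hsp": 150, "hit_hsp": 140,
     "similarity": 85, "evalue": 1e-20, "depth": 55},
    {"id": "iso3", "gene": "geneB", "query_hsp": 200, "hit_hsp": 200,
     "similarity": 70, "evalue": 1e-10, "depth": 12},
]
best_all = best_isoforms(isoforms)
best_by_depth = best_isoforms(isoforms, metrics=("depth",))
```

Passing a subset of `metrics` corresponds to filtering on a single criterion, while the default combines all of them.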

You can also filter the clusters by positional redundancy, to detect all non-overlapping sets of isotigs/contigs of a partially characterized gene and then select the best isoform within each of these non-overlapping sets (in a red circle within the figure).


Figure 14.17.- Filter best isoform
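The idea of filtering by positional redundancy can be sketched as an interval-clustering problem. The sketch assumes that each contig carries the start and end coordinates of its alignment on the hit and a precomputed score; these field names, and the choice of hit coordinates as the measure of position, are hypothetical and not taken from GPRO.

```python
def non_overlapping_sets(contigs):
    """Cluster contigs whose hit intervals overlap.

    Contigs within one set overlap (directly or through a chain of
    overlaps); different sets cover disjoint regions of the hit.
    """
    ordered = sorted(contigs, key=lambda c: (c["hit_start"], c["hit_end"]))
    sets, current, current_end = [], [], None
    for c in ordered:
        if current and c["hit_start"] > current_end:
            sets.append(current)  # gap found: close the current set
            current, current_end = [], None
        current.append(c)
        current_end = c["hit_end"] if current_end is None else max(current_end, c["hit_end"])
    if current:
        sets.append(current)
    return sets

def best_per_set(sets, score_key="score"):
    """Pick the highest-scoring contig id within each non-overlapping set."""
    return [max(group, key=lambda c: c[score_key])["id"] for group in sets]

# Hypothetical contigs of one partially characterized gene.
contigs = [
    {"id": "c3", "hit_start": 200, "hit_end": 300, "score": 0.6},
    {"id": "c1", "hit_start": 1, "hit_end": 100, "score": 0.9},
    {"id": "c4", "hit_start": 250, "hit_end": 320, "score": 0.8},
    {"id": "c2", "hit_start": 50, "hit_end": 150, "score": 0.7},
]
sets = non_overlapping_sets(contigs)
winners = best_per_set(sets)
```

Here two sets are found, one covering positions 1 to 150 and one covering 200 to 320, and one representative is kept for each.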

14.6.2 - Sequence trimming

Using this utility you can upload both the FASTA file with your cDNA sequences and its associated annotation CSV file, and then trim the FASTA sequences according to one of two options:

a) Option "a" combines two algorithms based on the HSPs of both the query and the hit to detect and classify the sequences as full-length or partial cDNAs, depending on whether the query shares a user-defined core percentage with the subject hit. For instance, a sequence can be defined as full-length if the query shares a core of more than 80% with the subject. The tool assumes that you use appropriate subject models as reference.

b) Option "b" is simpler than option "a", as it considers only the ratio between the HSPs of the query and the subject to classify sequences as full-length or as partial domains, depending on the shared-core criterion defined by the user.
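The shared-core criterion can be illustrated with a short Python sketch. It assumes that the core is measured as the percentage of the subject length covered by the aligned HSPs; this definition, the function names and the two labels are simplifications for the example and are not the exact algorithms of options "a" and "b".

```python
def shared_core(hit_hsp_length, hit_length):
    """Percentage of the subject (hit) covered by the aligned HSPs (assumed definition)."""
    if hit_length <= 0:
        raise ValueError("hit_length must be positive")
    return 100.0 * hit_hsp_length / hit_length

def classify(hit_hsp_length, hit_length, core_threshold=80.0):
    """Label a query as full-length when the shared core exceeds the user threshold."""
    core = shared_core(hit_hsp_length, hit_length)
    return "full-length" if core > core_threshold else "partial"

# Hypothetical alignments against a subject of 500 residues.
label_long = classify(450, 500)   # core of 90%
label_short = classify(200, 500)  # core of 40%
label_edge = classify(400, 500)   # core of exactly 80%
```

With the 80% threshold from the example above, a core of exactly 80% is still labeled partial, because the criterion is "more than 80%".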

Finally, the tool lets you label the sequences as "Full-length cDNA", "Partial Sequence" or "Related Domain" depending on the core shared, and then trim each sequence to remove the regions upstream of the start codon and downstream of the stop codon, or the regions outside the defined core.
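Trimming to the start and stop codons can be sketched as follows. This is a simplified illustration that takes the first ATG as the start codon and the first in-frame stop codon after it; GPRO may locate the coding region differently (for example from the HSP coordinates), so the function `trim_to_orf` is an assumption rather than the tool's method.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def trim_to_orf(sequence):
    """Trim a cDNA to the span from the first ATG to the first in-frame stop codon.

    If no ATG is found the sequence is returned unchanged (uppercased);
    if no in-frame stop is found only the upstream region is removed.
    """
    seq = sequence.upper()
    start = seq.find("ATG")
    if start == -1:
        return seq
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return seq[start:pos + 3]
    return seq[start:]

# Hypothetical cDNA with flanking untranslated bases on both sides.
trimmed = trim_to_orf("ccggATGAAATTTTAAggcc")
no_stop = trim_to_orf("GGATGAAACC")
no_start = trim_to_orf("CCCGGG")
```

In the first example the four bases before ATG and the four bases after TAA are removed, leaving the coding region including its start and stop codons.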


Figure 14.18.- Classify and trim sequences into full-length or partial cDNAs


15 - Acknowledgements and Citing

15.1 - Acknowledgements

We thank all the GPRO users who are collaborating with us in improving and pushing forward this project. Version 1.0 of GPRO was supported in part by research and innovation loan IDI-20100007 from CDTI (Centro de Desarrollo Tecnológico Industrial) and by grants PTQ-09-01-00020 and PTQ-09-01-00670 from MICINN (Ministerio de Ciencia e Innovación) in Spain.

15.2 - Citing the last GPRO version

If you are going to publish findings obtained using GPRO and wish to cite the tool, we suggest citing the following publication:

R. Futami, A. Muñoz-Pomer, JM. Viu, L. Dominguez-Escribá, L. Covelli, GP. Bernet, JM. Sempere, A. Moya, C. Llorens. GPRO: the professional tool for annotation, management and functional analysis of omic databases. (2011) Biotechvana Bioinformatics: 2011-SOFT3


16 - REFERENCES

LLORENS,C., FUTAMI,R., COVELLI,L., DOMÍNGUEZ-ESCRIBÀ,L., VIU,J.M., TAMARIT,D., AGUILAR-RODRÍGUEZ,J., VICENTE-RIPOLLES,M., FUSTER,G., BERNET,G.P., MAUMUS,F., MUÑOZ-POMER,A., SEMPERE,J.M., LATORRE,A., MOYA,A. 2011. The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Res. 39:D70-4

DURBIN, R., EDDY, S., KROGH, A., MITCHISON, G. 2009. Biological Sequence Analysis. Thirteenth edn. New York: Cambridge University Press.

HIGGS, P., ATTWOOD, T. 2005. Bioinformatics and Molecular Evolution. Carlton, Victoria 3053, Australia: Blackwell Publishing Ltd.

MARTIN, M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. Vol1

SCHMIEDER, R. AND EDWARDS, R. 2011. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 27:863-864

ALTSCHUL, S.F., MADDEN, T.L., SCHÄFFER, A.A., ZHANG, J., ZHANG, Z., MILLER, W. AND D.J. LIPMAN. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-402

GENE ONTOLOGY CONSORTIUM. 2008. The Gene Ontology project in 2008. Nucleic Acids Res. 36:D440-D444

NAKAYA,A., KATAYAMA,T., ITOH,M., HIRANUKA,K., KAWASHIMA,S., MORIYA,Y., OKUDA,S., TANAKA,M., TOKIMATSU,T., YAMANISHI,Y., YOSHIZAWA,A.C., KANEHISA,M., GOTO,S. 2013. KEGG OC: a large-scale automatic construction of taxonomy-based ortholog clusters. Nucleic Acids Res. 41:D353-D357

QUEVILLON,E., SILVENTOINEN,V., PILLAI,S., HARTE,N., MULDER,N., APWEILER,R., LOPEZ,R. 2005. InterProScan: protein domains identifier. Nucleic Acids Res. 33:W116-W120

HUNTER,S., JONES,P., MITCHELL,A., APWEILER,R., ATTWOOD,T.K., BATEMAN,A., BERNARD,T., BINNS,D., BORK,P., BURGE,S., DE CASTRO,E., COGGILL,P., CORBETT,M., DAS,U., DAUGHERTY,L., DUQUENNE,L., FINN,R.D., FRASER,M., GOUGH,J., HAFT,D., HULO,N., KAHN,D., KELLY,E., LETUNIC,I., LONSDALE,D., LOPEZ,R., MADERA,M., MASLEN,J., MCANULLA,C., MCDOWALL,J., MCMENAMIN,C., MI,H., MUTOWO-MUELLENET,P., MULDER,N., NATALE,D., ORENGO,C., PESSEAT,S., PUNTA,M., QUINN,A.F., RIVOIRE,C., SANGRADOR-VEGAS,A., SELENGUT,J.D., SIGRIST,C.J., SCHEREMETJEW,M., TATE,J., THIMMAJANARTHANAN,M., THOMAS,P.D., WU,C.H., YEATS,C., YONG,S.Y. 2012. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40:D306-D312

STANKE, M., DIEKHANS, M., BAERTSCH, R., HAUSSLER, D. 2008. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 24(5):637-644

MCKENNA,A., HANNA,M., BANKS,E., SIVACHENKO,A., CIBULSKIS,K., KERNYTSKY,A., GARIMELLA,K., ALTSHULER,D., GABRIEL,S., DALY,M., DEPRISTO,M.A. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20(9):1297-1303

CINGOLANI,P., PLATTS,A., WANG,LE L., COON,M., NGUYEN,T., WANG,L., LAND,S.J., LU,X., RUDEN,D.M. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 6(2):80-92

ZHANG, Y., SUN, Y. 2011. HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors. BMC Bioinformatics. 12:198


BAO, Z., AND S.R. EDDY. 2002. Automated de novo Identification of Repeat Sequence Families in Sequenced Genomes. Genome Research. 12:1269-1276

PRICE, A.L., JONES, N.C., AND P.A. PEVZNER. 2005. De novo identification of repeat families in large genomes. To appear in Proceedings of the Annual International conference on Intelligent Systems for Molecular Biology (ISMB-05). Detroit, Michigan.

BENSON, G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research. 27:573-580

MUÑOZ-POMER, A., FUTAMI, R., COVELLI, L., DOMÍNGUEZ-ESCRIBÀ, L., BERNET, GP., SEMPERE, JM., MOYA, A., AND C LLORENS. 2011. TIME: a sequence editor for the molecular analysis of large DNA and protein sequence samples. Biotechvana Bioinformatics. 2011-SOFT2.

MUÑOZ-POMER, A., FUTAMI, R., SEMPERE, JM., MOYA, A., AND C LLORENS. 2011. CheckAlign 2.0. Biotechvana Bioinformatics. 2011-SOFT1.

SCHNEIDER,T.D., STORMO,G.D., GOLD,L. AND A. EHRENFEUCHT. 1986. Information content of binding sites on nucleotide sequences. J.Mol.Biol. 188:415-431

SCHNEIDER,T.D. AND R.M. STEPHENS. 1990. Sequence Logos - A New Way to Display Consensus Sequences. Nucleic Acids Research. 18:6097-6100

EDDY,S.R. 1998. Profile hidden Markov models. Bioinformatics. 14:755-763

CONESA, A., GOTZ, S., GARCIA-GOMEZ, J. M., TEROL, J., TALON, M., AND M ROBLES. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 21(18):3674-3676

TATUSOV RL, FEDOROVA ND, JACKSON JD, JACOBS AR, KIRYUTIN B, KOONIN EV, KRYLOV DM, MAZUMDER R, MEKHEDOV SL, NIKOLSKAYA AN, RAO BS, SMIRNOV S, SVERDLOV AV, VASUDEVAN S, WOLF YI, YIN JJ, NATALE DA. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 4:41

SCHUG J, DISKIN S, MAZZARELLI J, BRUNK BP, STOECKERT CJ, JR. 2002. Predicting gene ontology functions from ProDom and CDD protein domains. Genome Research. 12:648-655