DIYA: An annotation pipeline for any genomics lab


Upload: andrew-stewart

Post on 27-Jan-2015


Page 1: DIYA: An annotation pipeline for any genomics lab

Do It Yourself Annotator

An annotation pipeline for every genomics lab

Page 2: DIYA: An annotation pipeline for any genomics lab

Andrew Stewart *, Timothy Read

●Genomics Department, Biological Defense Research Directorate, Naval Medical Research Center, Rockville, Maryland, United States

●Distribution and source code are available at https://sourceforge.net/projects/diyg/

●Contact: [email protected]

Page 3: DIYA: An annotation pipeline for any genomics lab

●DIYA is an open source pipeline for the rapid annotation of genomic sequences. The software takes DNA contigs as input, either complete genomes or the result of shotgun sequencing of a genome library, and produces a fully annotated sequence as output.

●The DIYA pipeline is modular and easily extended with further forms of feature finding. Each module follows a similar structure, using a standard format for input and output as the conduit between stages in the pipeline. The system demonstrates the usefulness of BioPerl (http://bioperl.org) as a format conversion utility and parser. SGE support allows multiple sequences to be run in parallel.
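The "standard format as a conduit" idea can be sketched as a chain of stages that each take and return the same record type. This is a minimal illustration in Python, not DIYA's Perl code: the `Record` and `Feature` types, the stage names, and the toy ATG finder are all hypothetical stand-ins for the shared format and the real feature finders.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feature:
    kind: str    # e.g. "CDS", "tRNA"
    start: int   # 1-based, inclusive
    end: int
    strand: int  # +1 or -1
    source: str  # name of the stage that produced the feature


@dataclass
class Record:
    """Hypothetical stand-in for the shared format passed between stages."""
    seq_id: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


# Every stage has the same shape: Record in, Record out.
Stage = Callable[[Record], Record]


def find_start_codons(record: Record) -> Record:
    """Toy stage: mark every forward-strand ATG (not a real gene finder)."""
    pos = record.sequence.find("ATG")
    while pos != -1:
        record.features.append(Feature("start_codon", pos + 1, pos + 3, 1, "toy-atg"))
        pos = record.sequence.find("ATG", pos + 1)
    return record


def sort_features(record: Record) -> Record:
    """Toy cleanup stage: order features by position."""
    record.features.sort(key=lambda f: (f.start, f.end))
    return record


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    """Feed the record through each stage in order."""
    for stage in stages:
        record = stage(record)
    return record


annotated = run_pipeline(Record("contig1", "CCATGAAATGTT"),
                         [find_start_codons, sort_features])
```

Because every stage shares one signature, adding a new kind of feature finding only means appending another function to the list.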

Page 4: DIYA: An annotation pipeline for any genomics lab

Background

●“A sequencing center in every genomics lab”

●Thus, an annotation pipeline in every genomics lab

●Decentralization of sequencing technology creates a need for sequence analysis tools

Page 5: DIYA: An annotation pipeline for any genomics lab

Background

●Explosion of tools in the bioinformatics community

●Inconsistent formats, need for ‘pipelining’, BioPerl

Page 6: DIYA: An annotation pipeline for any genomics lab

Background: BDRD

●454 Life Sciences FLX sequencers

●Push data off onto servers

○Assembly

○Annotation

○Analysis

Page 7: DIYA: An annotation pipeline for any genomics lab

Outline of the pipeline

●diya-assemble-pseudocontig

●diya-glimmer

●diya-blast

●diya-rfam_scan

●diya-tRNAscan

●Auxiliary scripts

Page 8: DIYA: An annotation pipeline for any genomics lab

Installation requirements

●Software

○Perl v5+, SGE, MUMmer, Glimmer, BLAST, tRNAscan-SE, Infernal, rfamscan.pl

●Databases

○Protein Clusters, Rfam

●Perl libraries

○BioPerl, Getopt::Long, Data::Dumper, XML::Simple, etc.

Page 9: DIYA: An annotation pipeline for any genomics lab

Pipeline: diya.pl

●Controller script for the pipeline

●Manages configuration and project data table generation

●Fires off jobs to SGE
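How a controller might hand pipeline steps to SGE can be sketched as follows. This is an illustrative Python sketch, not diya.pl itself: it only builds `qsub` command lines (it does not run them), and the `--input` option and `.gbk` file names are hypothetical. The `qsub` options used (`-N`, `-cwd`, `-b y`, `-hold_jid`) are standard SGE options; chaining steps with `-hold_jid` is an assumption about how ordering could be enforced.

```python
import shlex
from typing import List, Optional


def qsub_command(job_name: str, command: List[str],
                 hold_for: Optional[str] = None) -> List[str]:
    """Build (but do not run) a qsub command line for one pipeline step."""
    args = ["qsub", "-N", job_name, "-cwd", "-b", "y"]
    if hold_for is not None:
        # Make this job wait until the named job has finished.
        args += ["-hold_jid", hold_for]
    return args + command


def plan_jobs(seq_ids: List[str], steps: List[str]) -> List[List[str]]:
    """One chain of dependent jobs per sequence; chains are independent of each other."""
    jobs = []
    for seq_id in seq_ids:
        previous = None
        for step in steps:
            name = f"{step}_{seq_id}"
            # "--input" and the .gbk file name are hypothetical placeholders.
            jobs.append(qsub_command(name, [step, "--input", f"{seq_id}.gbk"],
                                     hold_for=previous))
            previous = name
    return jobs


for job in plan_jobs(["s1", "s2"], ["diya-glimmer", "diya-blast"]):
    print(shlex.join(job))
```

Since the chains for different sequences do not depend on each other, the scheduler is free to run them in parallel.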

Page 10: DIYA: An annotation pipeline for any genomics lab

Pipeline: Assembly

●Generate a ‘pseudocontig’

●MUMmer v3.20 (http://mummer.sourceforge.net/)
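A pseudocontig can be pictured as the contigs joined into one sequence, with a coordinate map kept so that the join can be undone later. The Python sketch below is an illustration under stated assumptions, not DIYA's implementation: it assumes the contigs have already been ordered (for example from an alignment against a reference), and the 20-base `N` spacer is an arbitrary choice, since the slides do not say what DIYA inserts between contigs.

```python
from typing import Dict, List, Tuple

SPACER = "N" * 20  # assumed spacer; not taken from the slides

Layout = List[Tuple[str, int, int]]  # (name, start, end), 0-based half-open


def build_pseudocontig(contigs: List[Tuple[str, str]]) -> Tuple[str, Layout]:
    """Join ordered (name, sequence) pairs and record where each contig lands."""
    pieces: List[str] = []
    layout: Layout = []
    offset = 0
    for i, (name, seq) in enumerate(contigs):
        if i > 0:
            pieces.append(SPACER)
            offset += len(SPACER)
        pieces.append(seq)
        layout.append((name, offset, offset + len(seq)))
        offset += len(seq)
    return "".join(pieces), layout


def split_pseudocontig(pseudo: str, layout: Layout) -> Dict[str, str]:
    """Recover the original contigs from the pseudocontig and its layout."""
    return {name: pseudo[start:end] for name, start, end in layout}


pseudo, layout = build_pseudocontig([("c1", "ACGT"), ("c2", "GGCC")])
restored = split_pseudocontig(pseudo, layout)
```

Keeping the layout alongside the joined sequence is what makes a later disassembly step possible without re-aligning anything.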

Page 11: DIYA: An annotation pipeline for any genomics lab

Pipeline: Glimmer

●Prediction of gene coding regions

●Glimmer v3.02 (http://www.cbcb.umd.edu/software/glimmer/)

○g3-iterated.csh: two rounds of iteration

●Uses interpolated Markov models to distinguish between coding and non-coding regions
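The idea of scoring a sequence under a coding model versus a non-coding model can be shown with a much simpler relative of Glimmer's method. The Python sketch below uses a fixed-order Markov chain and a log-odds score; it is a deliberate simplification. Glimmer's interpolated Markov models combine several context lengths and work on reading frames, none of which is reproduced here, and the training strings are toy data.

```python
import math
from collections import defaultdict
from typing import Callable, List

ALPHABET = "ACGT"
Model = Callable[[str, str], float]  # (context, base) -> probability


def train_markov(seqs: List[str], order: int = 2, pseudocount: float = 1.0) -> Model:
    """Estimate P(base | previous `order` bases) with pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context: str, base: str) -> float:
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq: str, coding: Model, noncoding: Model, order: int = 2) -> float:
    """Positive scores favour the coding model, negative the non-coding model."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, base) / noncoding(ctx, base))
    return score


# Toy training data, chosen only to make the two models differ.
coding_model = train_markov(["ATGGCGGCGGCGGCG"])
noncoding_model = train_markov(["ATATATATATATAT"])
gc_rich_score = log_odds("GCGGCGGCG", coding_model, noncoding_model)
at_rich_score = log_odds("ATATATAT", coding_model, noncoding_model)
```

The pseudocount keeps unseen contexts from producing zero probabilities, so the log ratio is always defined.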

Page 12: DIYA: An annotation pipeline for any genomics lab

Pipeline: Blast

●BLAST v2.2.16 (ftp://ftp.ncbi.nih.gov/blast/)

●Two rounds of BLAST, against:

○Reference genome

○Protein Clusters database
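One way to combine two rounds of BLAST is to keep the best hit per query from the first round and send only the unmatched queries to the second. The Python sketch below illustrates that on BLAST's 12-column tabular output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The e-value cutoff and the rule that only unmatched queries go to the second round are assumptions; the slides do not say how DIYA combines the rounds.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float, float]  # (subject, evalue, bitscore)


def best_hits(tabular_lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Keep the highest-scoring hit per query from tab-separated BLAST output."""
    best: Dict[str, Hit] = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue  # too weak to count as a hit
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def needs_second_round(all_queries: List[str], first_round: Dict[str, Hit]) -> List[str]:
    """Queries with no acceptable first-round hit (assumed second-round input)."""
    return [q for q in all_queries if q not in first_round]


# Made-up example rows, for illustration only.
round_one = [
    "gene1\trefA\t98.5\t200\t3\t0\t1\t200\t1\t200\t2e-50\t380",
    "gene1\trefB\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-20\t120",
    "gene2\trefC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
hits = best_hits(round_one)
remaining = needs_second_round(["gene1", "gene2", "gene3"], hits)
```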

Page 13: DIYA: An annotation pipeline for any genomics lab

Pipeline: rfam_scan

●Identification of ncRNA (rRNA, tRNA)

●Infernal v0.81 (http://infernal.janelia.org/)

●Rfam (http://www.sanger.ac.uk/Software/Rfam/)

●rfamscan.pl v0.1 (http://www.sanger.ac.uk/Users/sgj/code/)

Page 14: DIYA: An annotation pipeline for any genomics lab

Pipeline: tRNAscan-SE

●Identification of tRNA

●tRNAscan-SE v1.23 (http://lowelab.ucsc.edu/tRNAscan-SE/)

Page 15: DIYA: An annotation pipeline for any genomics lab

Pipeline: Auxiliary scripts

●Locus tag reordering (cleanup)

●Protein extraction (i.e., PIPA input)

●Pseudocontig disassembly

●Hooks

○Load databases

○Report genome statistics

○WikiLIMS integration
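Two of the auxiliary tasks above, locus tag reordering and genome statistics reporting, are simple enough to sketch. The Python below is an illustration under assumptions, not DIYA's scripts: genes are plain dictionaries with `start` and `end` keys, the `PREFIX_0001` tag style is a common convention rather than something the slides specify, and the chosen statistics (contig count, total length, GC fraction, N50) are a guess at what such a report would include.

```python
from typing import Dict, List


def renumber_locus_tags(genes: List[Dict], prefix: str, width: int = 4) -> List[Dict]:
    """Return copies of the genes sorted by position, with sequential locus tags."""
    ordered = sorted(genes, key=lambda g: (g["start"], g["end"]))
    return [dict(g, locus_tag=f"{prefix}_{i:0{width}d}")
            for i, g in enumerate(ordered, start=1)]


def genome_stats(contigs: List[str]) -> Dict[str, float]:
    """Contig count, total length, GC fraction, and N50 for a set of sequences."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            # N50: the contig length at which half of the total is covered.
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "gc_fraction": gc / total if total else 0.0,
        "n50": n50,
    }


genes = [
    {"start": 500, "end": 900, "locus_tag": "X_0002"},
    {"start": 10, "end": 300, "locus_tag": "X_0007"},
]
retagged = renumber_locus_tags(genes, "DIYA")
stats = genome_stats(["GGCC", "ATAT", "GC"])
```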

Page 16: DIYA: An annotation pipeline for any genomics lab

Modularity

●Adding extra modules is rather simple

●Things to come...

○CRISPR elements

○Pseudogenes

○Prophages

Page 17: DIYA: An annotation pipeline for any genomics lab

Do It Yourself Genomics

●A project community and a collection of bioinformatics tools and applications for the analysis of genomic sequence data, intended to put these tools into the hands of small and medium-scale sequencing labs.

Page 18: DIYA: An annotation pipeline for any genomics lab

DIYG on disk

●OS (Linux) distribution with DIYG pre-installed

●Simplifies installation, compilation, and ‘prerequisite gathering’

●Run analysis directly on sequencer workstation?

●Easy deployment across a high performance computing cluster

Page 19: DIYA: An annotation pipeline for any genomics lab

DIYG: Virtual Machine

●Virtualization creates a complete, self-contained deployment of an operating system

●“Disposable” analysis machine

Page 20: DIYA: An annotation pipeline for any genomics lab

DIYG: Cloud Computing

●Ideal for labs without direct access to an HPC cluster

●Truly an annotation pipeline in every genomics lab

Page 21: DIYA: An annotation pipeline for any genomics lab

Deployment at BHSAI

●Make sequence annotation available to the wider DoD community

●Concerns about the Perl-based nature of DIYA

●Need to determine HPC guidelines

●Possible integration / hook into PIPA

Page 22: DIYA: An annotation pipeline for any genomics lab

Deployment at BHSAI

●Conventional installation (integration into existing systems, à la PIPA)

●Sourced from disk image

●Virtualization servers? (if available)