camera metagenomic annotation pipeline

CAMERA Annotation Pipelines (and related infrastructure) Brett Whitty 12/20/2007

Upload: brett-whitty

Post on 04-Aug-2015


Page 1: CAMERA metagenomic annotation pipeline

CAMERA Annotation Pipelines (and related infrastructure)

Brett Whitty

12/20/2007

Page 2: CAMERA metagenomic annotation pipeline

Overview

Compute Infrastructure

GOS/CAMERA ncRNA/ORF calling pipeline: rRNA finding pipeline, ORF calling

GOS (incremental) protein clustering

CAMERA Annotation Pipeline: Specifications, Implementation

Page 3: CAMERA metagenomic annotation pipeline

Compute Infrastructure

Page 4: CAMERA metagenomic annotation pipeline

CALIT2 Compute Grid

48 dual-core, dual-CPU 64-bit machines (192 SGE slots)

Red Hat-based ‘Rocks Clusters’ Linux distribution (see http://rocksclusters.org)

‘Rocks Rolls’: Bio-roll (/opt/Bio)

Used to image/install each node separately, including local Perl module installs (patches)

Page 5: CAMERA metagenomic annotation pipeline

sos.camera.calit2.net

Head node of sos cluster

SSH into here

Is not an SGE submit host

Page 6: CAMERA metagenomic annotation pipeline

SOS Cluster Global Mounts

/share/apps: applications (and related files) are installed here; analysis data should not be stored here

/home/thumper6: a global mount point, an 18T(!!!) storage volume on which all analysis data/results should be stored

/opt/Bio: tools such as clustalw, EMBOSS, hmmer and NCBI BLAST are installed under here

Page 7: CAMERA metagenomic annotation pipeline

SOS Local Mounts (on each grid node)

/state/partition1: local storage device on each grid node, available for local scratch space (438G)

/tmp: system tmp partition (7G)

Page 8: CAMERA metagenomic annotation pipeline

pg0-0.camera.calit2.net

SSH accessible only through head

Is an SGE submit host

Running apache and postgres servers

Page 9: CAMERA metagenomic annotation pipeline

pg0-0.camera.calit2.net

http://web1.camera.calit2.net/ergatis/

/var/www/cgi-bin/ergatis

/home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis

/var/www/html/ergatis

/home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis

Page 10: CAMERA metagenomic annotation pipeline

pg0-0.camera.calit2.net

CGI scripts run as the user 'apache' on pg0-0, but ‘apache’ has sudo permissions for user 'ergatis'

The two CGI scripts in the install which run RunWorkflow and KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm) have been modified, and 'sudo -u ergatis ' has been prepended to their normal execution strings

IdGenerator.pm has been modified to use JCVIIdGenerator.pm

Many of the settings in ergatis.ini have been changed from defaults, including disabling a number of the components

When updating the Ergatis CGI directory from the SVN repository, a backup copy should be set aside in advance

Page 11: CAMERA metagenomic annotation pipeline

SGE/Workflow Notes

Two SGE queues have been configured for ergatis: ergatis.q (192 slots) and ergatis-fast.q (144 slots)

ergatis.q is a subordinate queue of ergatis-fast.q

ergatis.q is set as the default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in /home/ergatis/.sge_request

Workflow version 3.0 is installed under /share/apps/workflow

Workflow requires that the SGE queue's prolog and epilog scripts be set to the following:

prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue

epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue

The queue configuration can be checked using the command 'qconf -sq ergatis.q'
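The prolog/epilog requirement above can be checked mechanically. The following is a minimal Python sketch, not part of the original install: it assumes the text printed by 'qconf -sq' is a list of "attribute value" lines (with optional backslash continuations), and the function names and sample text are made up for illustration.

```python
def parse_queue_config(text):
    """Parse 'attribute value' lines into a dict, joining backslash continuations."""
    config = {}
    joined = text.replace("\\\n", " ")
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace so values compare cleanly
            config[parts[0]] = " ".join(parts[1].split())
    return config


WORKFLOW_BIN = "/share/apps/workflow/bin"
HOOK_ARGS = "$host $job_owner $job_id $job_name $queue"
EXPECTED = {
    "prolog": f"{WORKFLOW_BIN}/prolog {HOOK_ARGS}",
    "epilog": f"{WORKFLOW_BIN}/epilog {HOOK_ARGS}",
}


def check_hooks(config):
    """Return the names of hook settings that differ from what Workflow requires."""
    return [name for name, value in EXPECTED.items() if config.get(name) != value]


# Hypothetical sample text standing in for real 'qconf -sq ergatis.q' output
sample = """qname ergatis.q
prolog   /share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
epilog   NONE
"""
print(check_hooks(parse_queue_config(sample)))
```

On a submit host the sample text would be replaced by the captured output of the qconf command; an empty list means both hooks match.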

Page 12: CAMERA metagenomic annotation pipeline

Ergatis Application Install

The main ergatis application install directory is under /share/apps/ergatis

The chado-v1r12b1 release is the current version installed

It is a direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI

Perl wrappers were modified via sed to the correct local directory structures

A proper install wasn't done because no working installer script was available at the time

/share/apps/ergatis/chado-v1r12b1 is symlinked to /share/apps/ergatis/current

Executables which some ergatis components use, but which are not installed with Ergatis (e.g.: JCVI internal scripts), are located under /share/apps/ergatis/bin

External tools which are not globally installed on sos are installed under /share/apps/ergatis/external_apps

Ergatis global directories (global_id_repository, global_saved_templates) are located under /share/apps/ergatis/ergatis_global

Page 13: CAMERA metagenomic annotation pipeline

Ergatis Data Locations

All ergatis data should be put under /home/thumper6/ergatis

Project repositories are located under /home/thumper6/ergatis/projects or symlink /share/apps/ergatis/projects

CAMERA project repository is /home/thumper6/ergatis/projects/camera

Databases are located under /home/thumper6/ergatis/db or symlink /share/apps/ergatis/db

Global scratch space is under /home/thumper6/ergatis/scratch or symlink /share/apps/ergatis/scratch

Page 14: CAMERA metagenomic annotation pipeline

ikelite.rocksclusters.org

Fewer machines than sos cluster (~20 slots?)

Initial test ergatis install was done here (similar directory structure to sos)

Completely distinct from sos cluster

Sandbox

Shibu, Weizhong Li and others run computes here (e.g.: clustering pipeline)

Page 15: CAMERA metagenomic annotation pipeline

Pipelines

Page 16: CAMERA metagenomic annotation pipeline

GOS/CAMERA Pipelines Overview

[Diagram: Metagenomic Reads; ncRNA/ORF Finding Pipeline; ORFs/peptides; Annotation Pipeline; Incremental Clustering Pipeline; Cluster Memberships]

Page 17: CAMERA metagenomic annotation pipeline

Challenges

All computes in pipeline must be performed on multi-sequence input/output files, as the filesystem cannot physically support 12M+ individual FASTA input/output files

Other partitioning solutions could work(?), but most tools support multiple sequence inputs anyway

Overall total space consumption was an issue when computes were running on TIGR grid, but this is not as much an issue (currently) on CALIT2 grid

Solution here was to keep all inputs/outputs gzipped during pipeline execution, at the cost of some performance loss (using things like zcat -f | with NCBI BLAST, etc.)
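The "keep everything gzipped, read it as a stream" idea can be sketched in a few lines. This is an illustrative Python example rather than pipeline code: the helper names and the demo file are assumptions, and it mimics the zcat -f behaviour of accepting either gzipped or plain input.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(path):
    """Open a file as text, decompressing if it starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-sequence FASTA file."""
    header, chunks = None, []
    with open_maybe_gzip(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Demo on a small gzipped multi-FASTA file written to a temporary directory
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "reads.fsa.gz")
with gzip.open(gz_path, "wt") as out:
    out.write(">read1\nACGT\nAC\n>read2\nGGTT\n")

records = list(read_fasta(gz_path))
print(records)
```

Records are produced one at a time, so a file holding millions of sequences never has to be unpacked on disk or held in memory at once.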

Page 18: CAMERA metagenomic annotation pipeline

GOS/CAMERA ncRNA and ORF Finding Pipeline

Page 19: CAMERA metagenomic annotation pipeline

GOS/CAMERA ncRNA and ORF Finding Pipeline Overview

[Diagram. Input: Reads. Steps: Find tRNAs; Extract tRNAs; Soft-Mask tRNAs; Find rRNAs; Extract rRNAs; Soft-Mask rRNAs; GOS ORF calling; ORF stats; ORF overlaps. Outputs: tRNAs FASTA; rRNAs FASTA; ORFs FASTA; Peptides FASTA; Metagene ORFs FASTA; Peptides FASTA]

Page 20: CAMERA metagenomic annotation pipeline

GOS/CAMERA ncRNA and ORF Finding Pipeline

CAMERA-specific Ergatis components

Page 21: CAMERA metagenomic annotation pipeline

camera_extract_trna

Page 22: CAMERA metagenomic annotation pipeline

CAMERA rRNA Finder Overview

BLAST vs. a database of coded pooled rRNA subunit sequences

BLAST prefilter step with loose parameters:

blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9

Reads with prefilter hits are searched using strict parameters:

blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T

Collapse aligned intervals of the same rRNA type and extract the highest scoring alignments from each region
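The final collapse step can be pictured as interval merging with a "keep the best scorer" rule. The following Python sketch is one possible reading of that step, not the camera_rrna_finder implementation: the Hit record, its field names and the example coordinates are assumptions, and overlap is taken to mean that a hit starts at or before the end of the current region.

```python
from collections import namedtuple

# Hypothetical record for one parsed BLAST alignment on a read
Hit = namedtuple("Hit", "read_id rrna_type start end score")


def best_hits_per_region(hits):
    """Merge overlapping hits of the same rRNA type on a read; keep the top scorer per region."""
    by_key = {}
    for hit in hits:
        by_key.setdefault((hit.read_id, hit.rrna_type), []).append(hit)

    best = []
    for key in sorted(by_key):
        group = sorted(by_key[key], key=lambda h: (h.start, h.end))
        region_end, region_best = None, None
        for hit in group:
            if region_best is not None and hit.start <= region_end:
                # Overlaps the current region: extend it and keep the higher score
                region_end = max(region_end, hit.end)
                if hit.score > region_best.score:
                    region_best = hit
            else:
                # Starts a new region: flush the previous one
                if region_best is not None:
                    best.append(region_best)
                region_end, region_best = hit.end, hit
        best.append(region_best)
    return best


hits = [
    Hit("r1", "23S", 55, 400, 310.0),
    Hit("r1", "23S", 350, 847, 512.0),
    Hit("r1", "16S", 900, 1000, 80.0),
    Hit("r1", "23S", 1200, 1300, 90.0),
]
for hit in best_hits_per_region(hits):
    print(hit)
```

Grouping by read and rRNA type first means a 16S hit never suppresses a 23S hit that overlaps it; whether the real component treats mixed-type overlaps this way is not stated on the slide.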

Page 23: CAMERA metagenomic annotation pipeline

camera_filter_blast

Page 24: CAMERA metagenomic annotation pipeline

camera_rrna_finder

Custom DB

Page 25: CAMERA metagenomic annotation pipeline

rRNA Finder DB

/usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa

5S: Sequences from Archaea, Bacteria and Eukaryota were obtained from the 5S Ribosomal RNA Database http://biobases.ibch.poznan.pl/5SData/

16S: Sequences for Archaea and Bacteria were obtained from the Green Genes 16S db http://greengenes.lbl.gov/

18S: Source was Doug Rusch's 18S database prepared for the GOS paper

23S: Source was Doug Rusch's 23S database prepared for the GOS paper

Page 26: CAMERA metagenomic annotation pipeline

rRNA Finder DB

Fasta headers were coded as follows:

>#S [D] ...original.header...

where # is one of (5, 16, 18, 23) and D is one of (A, B, E). The camera_rrna_finder component expects this format.
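A short Python sketch of reading this coded header format follows. It is illustrative only: it assumes the square brackets are literal, that the three fields are separated by single spaces, and that A, B and E stand for Archaea, Bacteria and Eukaryota (the slide lists the codes but not their expansion). The function name is made up.

```python
import re

# '>' + subunit code + 'S', a space, '[' + domain code + ']', a space, original header
CODED_HEADER = re.compile(r"^>(5|16|18|23)S \[([ABE])\] (.*)$")

# Assumed expansion of the one-letter domain codes
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(line):
    """Return subunit, domain and original header, or None if the line is not coded."""
    match = CODED_HEADER.match(line.rstrip("\n"))
    if match is None:
        return None
    return {
        "subunit": match.group(1) + "S",
        "domain": DOMAINS[match.group(2)],
        "original": match.group(3),
    }


print(parse_coded_header(">16S [B] example original header"))
print(parse_coded_header(">12S [B] not a recognised subunit"))
```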

Page 27: CAMERA metagenomic annotation pipeline

rRNA Finder DB

CD-HIT was run on the entire database to cluster sequences with high similarity, to reduce the database size but maintain a range of diverse sequences

Command line:

/usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i input_database.fsa -o output_database.fsa -c 0.8 -n 4

Consistency of clustering was checked with a Perl script to ensure no heterogeneous clustering (e.g.: 18S and 16S clustering together)

Clusters were consistent

Database size was reduced from 65,591 sequences to 1,329
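The consistency check could look something like the Python sketch below. The original Perl script is not shown, so this is only a guess at its logic: it assumes a CD-HIT style cluster listing in which each cluster starts with a ">Cluster N" line and each member line still carries the leading subunit code of the coded header. The sample text is invented.

```python
import re

# Matches the subunit code at the start of a member name, e.g. '>16S'
MEMBER_TYPE = re.compile(r">(5|16|18|23)S")


def mixed_clusters(clstr_text):
    """Return {cluster name: sorted subunit types} for clusters holding more than one type."""
    clusters, current = {}, None
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            current = line[1:].strip()
            clusters[current] = set()
        elif current is not None:
            match = MEMBER_TYPE.search(line)
            if match:
                clusters[current].add(match.group(1) + "S")
    return {name: sorted(types) for name, types in clusters.items() if len(types) > 1}


# Invented sample in the assumed listing format
sample_clstr = """>Cluster 0
0\t1500nt, >16S... *
1\t1480nt, >16S... at +/85.00%
>Cluster 1
0\t1800nt, >18S... *
1\t1490nt, >16S... at +/81.00%
"""
print(mixed_clusters(sample_clstr))
```

An empty result corresponds to the "clusters were consistent" outcome reported above.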

Page 28: CAMERA metagenomic annotation pipeline

rRNA Finder

Page 29: CAMERA metagenomic annotation pipeline

open_reading_frames

Page 30: CAMERA metagenomic annotation pipeline

ORF Overlaps/ORF Stats

Page 31: CAMERA metagenomic annotation pipeline

FASTA Headers

>HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03 /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=1088 /length=1088

>JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"

>JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"

>JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1 /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707 /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03 /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=902 /length=902"

>JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0 /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847 /length=847"
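These deflines follow a regular "identifier, then /key=value pairs" layout, with the parent read's defline nested in double quotes. A minimal Python sketch of a parser is shown here; it is not the pipeline's own code, and the function name is an assumption. The example header is abbreviated from the ORF header above.

```python
import re

# /key=value where value is either a double-quoted string or a run of non-space characters
FIELD = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_defline(defline):
    """Split a defline into its sequence ID and a dict of /key=value fields."""
    text = defline.lstrip(">")
    seq_id, _, rest = text.partition(" ")
    fields = {}
    for key, quoted, bare in FIELD.findall(rest):
        fields[key] = quoted or bare
    return seq_id, fields


orf_header = (
    '>JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 '
    '/read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 '
    '/5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 '
    '/read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /length=841"'
)
seq_id, fields = parse_defline(orf_header)
print(seq_id)
print(fields["length"], fields["3_prime_stop"])
print(fields["read_defline"])
```

Because the quoted alternative is tried first, the keys nested inside read_defline stay inside that one value and do not overwrite the ORF's own /length. The nested string can be passed back through the same regular expression if the read's fields are needed.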

Page 32: CAMERA metagenomic annotation pipeline

The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence

RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S

Page 33: CAMERA metagenomic annotation pipeline

Again, RNAmmer failed to identify rRNA sequence

Page 34: CAMERA metagenomic annotation pipeline

BLAST-based approach does a pretty good job of finding correct boundaries

These ORFs have >150 unmasked bases

Page 35: CAMERA metagenomic annotation pipeline

BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S

Page 36: CAMERA metagenomic annotation pipeline

GOS (Incremental) Clustering Pipeline

http://camera.venterinstitute.org/wiki/display/VISWCAMERA/Incremental+clustering%2C+work+flow+details

Page 37: CAMERA metagenomic annotation pipeline

Clustering Overview

[Diagram: All Public Proteins + GOS ORFs; Non-Redundant 90% Identity CD-HIT Sequence Representatives; Core Clusters; Longest Sequence Representatives; GOS v1.2; Historical Artifacts (with respect to annotation)]

Page 38: CAMERA metagenomic annotation pipeline

CAMERA Polypeptide Annotation Pipeline

Page 39: CAMERA metagenomic annotation pipeline

Thoughts on Specifications

Annotation rules should not be literally codified as Perl code (and only Perl code)!!! (especially when the “decision makers” never look at the code)

What tools do we trust?

What cutoffs do we use?

What evidence/data types do we consider?

These will (in some cases should) change over time

Page 40: CAMERA metagenomic annotation pipeline

More Thoughts

Specifications are easier to change than code, so code should be written to support change

But unless they’re defined first, the specifications will be a moving target

Page 41: CAMERA metagenomic annotation pipeline

(My) Design Objectives

Must be able to add/remove annotation data sources as the annotation SOP changes

Must be able to easily change the ways in which these annotation data types are applied/combined to produce final annotation

Must be able to change/expand the types of final annotation data we are producing

Page 42: CAMERA metagenomic annotation pipeline

Object-Oriented Design Approach

OOP in Perl == *, but lesser of two evils (don’t ask me what the other evil is, but it must be pretty evil)

Encapsulates possible sources of change and prevents them from affecting downstream components (like HACCP)

Polymorphism of $parser->parse($infile) producing annotation objects is nice

Re-use was not really a motive here

*Damian Conway in his OOP Perl book says using OOP in Perl yields a 5X performance hit

Page 43: CAMERA metagenomic annotation pipeline

Annotation Pipeline Overview

[Diagram: Annotation Tool(s); Annotation Source Data; Parser(s); Annotation Data Object(s); Annotation Rules; Final Annotation Data]

We can make changes to the annotation rules without necessarily having to re-run or re-parse the data

Page 44: CAMERA metagenomic annotation pipeline

Design Objectives for Parsers

A parser must:

Produce polypeptides with associated AnnotationData objects of a defined type

Produce AnnotationData objects with attributes specified in a consistent way. E.g.: all parsers should produce EC number attributes that look like ‘1.1.1.1’ -> ‘1.-.-.-’, not sometimes ‘1.-’. Multiple values should be split. Any clean-up or verification should be done before the AnnotationData object is created; if the data is invalid, the attribute should not be populated, or the object should not be created.

Produce annotation data objects that are independent of the source annotation data they were parsed from. E.g.: they have already been canonized as a type of ‘trusted annotation evidence type’ when they are created as AnnotationData objects. These trusted types are defined in the annotation SOP.

These features create a separation between how trusted evidence is defined (input data), and how the evidence is used to produce annotation (annotation rules)
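The EC number example above is easy to make concrete. The following Python sketch shows one way a parser might clean EC values before building an AnnotationData object; it is not the CAMERA Perl code, and the function name, the accepted separators and the validity rules (numeric first field, dashes only as trailing fields) are assumptions.

```python
import re

# A single EC field is either a number or a dash
EC_FIELD = re.compile(r"^(\d+|-)$")


def normalize_ec(raw):
    """Split a raw EC string into four-field EC numbers, dropping anything invalid."""
    results = []
    for token in re.split(r"[,;\s]+", raw.strip()):
        if not token:
            continue
        fields = token.split(".")
        if not 1 <= len(fields) <= 4:
            continue
        # Pad short forms such as '1.-' out to four fields
        fields += ["-"] * (4 - len(fields))
        if fields[0] == "-" or not all(EC_FIELD.match(f) for f in fields):
            continue
        # Once a dash appears, every later field must also be a dash
        seen_dash, valid = False, True
        for field in fields:
            if field == "-":
                seen_dash = True
            elif seen_dash:
                valid = False
        if valid:
            results.append(".".join(fields))
    return results


print(normalize_ec("1.-"))
print(normalize_ec("1.1.1.1, 2.7.-.-"))
print(normalize_ec("not_an_ec"))
```

An empty list is the signal to leave the EC attribute unpopulated, which keeps the verification on the parser side of the boundary rather than in the annotation rules.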

Page 45: CAMERA metagenomic annotation pipeline

AnnotationData Objects

AnnotationData

AnnotationData::Polypeptide

type: [some string]

attributes: common_name, gene_symbol, EC, GO, TIGR_role

Polypeptide

AnnotationData Object(s)

Page 46: CAMERA metagenomic annotation pipeline

AnnotationRules

AnnotationRules object implements the rules from the annotation SOP document

AnnotationRules::PredictedProtein takes a Polypeptide object with associated AnnotationData objects of varying type and applies the annotation rules to create a final AnnotationData object

Page 47: CAMERA metagenomic annotation pipeline

AnnotationRules

Rules are encoded as an array in the following format:

ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2

Where OPERATOR is one of:

= for assign attribute (if unassigned)

+ for append attribute

- for overwrite attribute

Any operator can be defined, since operators are applied via a hash of handler subroutines

Page 48: CAMERA metagenomic annotation pipeline

AnnotationRules::PredictedProtein

my @annotation_order = (
    ## equivalog level tigrfam hits
    'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::Equivalog|=|GO',
    'TIGRFAM::FRAG::Exception|=|GO',
    'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
    'TIGRFAM::FullLength::Domain|=|GO',
    'PandaBLASTP::Characterized|=|GO',
    'PRIAM|=|GO EC',
    ## equivalog level hits vs tigrfam frag
    'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    ## characterized high confidence blast hit
    'PandaBLASTP::Characterized|=|common_name gene_symbol',
    ## pfam and non-equivalog tigrfams
    'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
    …
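To show how an ordered rule list and a table of operator handlers fit together, here is a small Python sketch. It mirrors the TYPE|OPERATOR|ATTRIBUTES format but is not a port of AnnotationRules::PredictedProtein: the handler names, the shape of the evidence dictionary and the example evidence values are all assumptions made for illustration.

```python
def assign_if_unset(final, attr, values):
    """'=' operator: set the attribute only if nothing has claimed it yet."""
    if attr not in final:
        final[attr] = list(values)


def append_values(final, attr, values):
    """'+' operator: add values not already present."""
    current = final.setdefault(attr, [])
    for value in values:
        if value not in current:
            current.append(value)


def overwrite(final, attr, values):
    """'-' operator: replace whatever was there."""
    final[attr] = list(values)


# Operators are looked up in a table, so new ones can be added without touching apply_rules
HANDLERS = {"=": assign_if_unset, "+": append_values, "-": overwrite}


def apply_rules(rules, evidence):
    """Apply rules in order; evidence maps annotation type -> {attribute: [values]}."""
    final = {}
    for rule in rules:
        ann_type, operator, attr_spec = rule.split("|")
        data = evidence.get(ann_type)
        if data is None:
            continue
        for attr in attr_spec.split():
            if data.get(attr):
                HANDLERS[operator](final, attr, data[attr])
    return final


rules = [
    "TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role",
    "PRIAM|=|GO EC",
    "PandaBLASTP::Characterized|=|common_name gene_symbol",
]
# Made-up evidence for a single polypeptide with no TIGRFAM equivalog hit
evidence = {
    "PRIAM": {"EC": ["1.1.1.1"], "GO": ["GO:0004022"]},
    "PandaBLASTP::Characterized": {"common_name": ["alcohol dehydrogenase"]},
}
print(apply_rules(rules, evidence))
```

Because every rule uses '=', the first evidence type in the list that supplies an attribute wins, so the order of the array is itself the precedence policy.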

Page 49: CAMERA metagenomic annotation pipeline

CAMERA Annotation Pipeline

CAMERA-specific Ergatis components

Page 50: CAMERA metagenomic annotation pipeline

camera_annotation_parser

Page 51: CAMERA metagenomic annotation pipeline

camera_annotation_rules

Page 52: CAMERA metagenomic annotation pipeline

camera_annotation_rules

Page 53: CAMERA metagenomic annotation pipeline

CAMERA-specific Code in SVN

http://iwebsvn.tigr.org/listing.php?repname=ANNOTATION&path=%2FCAMERA%2F&rev=0&sc=1

Page 54: CAMERA metagenomic annotation pipeline

Future Development (My 2 cents)

Pipeline development must be driven by annotation SOP development work

Feedback on pipeline bugs must be vigilantly kept separate from feedback on annotation SOP bugs

First discuss and update the SOP, then modify the code

Cluster summary annotation

Shortest path here seems to be a combination of GO Slim and EC assignments?

GO consortium makes some scripts available for summarizing sets of GO assignments

If using the current code, a PolypeptideSet container class exists already. Cluster members can be added to a PolypeptideSet and that can be used as input to an AnnotationRules::FinalCluster object that is similar to the one for PredictedProtein, but with a different set of handler routines.

Incremental clustering pipeline

Good luck