camera metagenomic annotation pipeline

CAMERA Annotation Pipelines (and related infrastructure)

Brett Whitty

12/20/2007

Overview

Compute Infrastructure

GOS/CAMERA ncRNA/ORF calling pipeline

    rRNA finding pipeline

    ORF calling

GOS (incremental) protein clustering

CAMERA Annotation Pipeline

    Specifications

    Implementation

Compute Infrastructure

CALIT2 Compute Grid

48 dual-core, dual-CPU 64-bit machines

192 SGE slots

Redhat-based ‘Rocks Clusters’ Linux distribution (see http://rocksclusters.org)

‘Rocks Rolls’ Bio-roll (/opt/Bio)

Used to image/install each node separately, including local Perl module installs (patches)

sos.camera.calit2.net

Head node of sos cluster

SSH into here

Is not an SGE submit host

SOS Cluster Global Mounts

/share/apps: applications (and related files) are installed here; analysis data should not be stored here

/home/thumper6: a global mount point, an 18T(!!!) storage volume on which all analysis data/results should be stored

/opt/Bio: tools such as clustalw, EMBOSS, hmmer, ncbi blast are installed under here

SOS Local Mounts (on each grid node)

/state/partition1: local storage device on each grid node, available for local scratch space (438G)

/tmp: system tmp partition (7G)

pg0-0.camera.calit2.net

SSH accessible only through head

Is an SGE submit host

Running apache and postgres servers


http://web1.camera.calit2.net/ergatis/

/var/www/cgi-bin/ergatis

/home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis

/var/www/html/ergatis

/home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis


CGI scripts run as the user 'apache' on pg0-0, but 'apache' has sudo permissions for user 'ergatis'

The two CGI scripts in the install which run RunWorkflow and KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm) have been modified, and 'sudo -u ergatis ' has been prepended to their normal execution strings

IdGenerator.pm has been modified to use JCVIIdGenerator.pm

Many of the settings in ergatis.ini have been changed from defaults, including disabling a number of the components

When updating the Ergatis CGI directory from the SVN repository, a backup copy should be set aside in advance

SGE/Workflow Notes

Two SGE queues have been configured for ergatis:

ergatis.q (192 slots)

ergatis-fast.q (144 slots)

ergatis.q is a subordinate queue of ergatis-fast.q

ergatis.q is set as the default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in /home/ergatis/.sge_request

Workflow version 3.0 is installed under /share/apps/workflow

Workflow requires that the SGE queue's prolog and epilog scripts be set to the following:

prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue

epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue

The queue configuration can be checked using the command 'qconf -sq ergatis.q'
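The prolog/epilog check above can be automated. The following is a minimal Python sketch, not part of the Ergatis or Workflow install: it assumes the text printed by 'qconf -sq ergatis.q' is a list of "attribute value" lines, with long values continued using a trailing backslash. The function names and the sample text are hypothetical.

```python
# Expected hook settings, copied from the Workflow requirement above.
EXPECTED_ARGS = "$host $job_owner $job_id $job_name $queue"
EXPECTED = {
    "prolog": "/share/apps/workflow/bin/prolog " + EXPECTED_ARGS,
    "epilog": "/share/apps/workflow/bin/epilog " + EXPECTED_ARGS,
}


def parse_queue_config(text):
    """Parse 'attribute value' lines into a dict.

    Lines ending in a backslash are joined with the following line first.
    """
    joined = text.replace("\\\n", " ")
    config = {}
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace so continued lines compare cleanly.
            config[parts[0]] = " ".join(parts[1].split())
    return config


def check_workflow_hooks(text):
    """Return {attribute: True/False} for the prolog and epilog settings."""
    config = parse_queue_config(text)
    return {key: config.get(key) == want for key, want in EXPECTED.items()}


# Hypothetical sample resembling queue configuration output.
sample = """qname                 ergatis.q
prolog                /share/apps/workflow/bin/prolog $host $job_owner $job_id \\
                      $job_name $queue
epilog                NONE
"""
print(check_workflow_hooks(sample))
```

In practice the text would come from running the qconf command on a submit host; the sample string only stands in for that output.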

Ergatis Application Install

The main ergatis application install directory is under /share/apps/ergatis

The chado-v1r12b1 release is the current version installed

It is a direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI

Perl wrappers were modified via sed to the correct local directory structures

A proper install wasn't done because no working installer script was available at the time

/share/apps/ergatis/chado-v1r12b1 is symlinked to /share/apps/ergatis/current

Executables which some ergatis components use, but which are not installed with Ergatis (e.g.: JCVI internal scripts), are located under /share/apps/ergatis/bin

External tools which are not globally installed on sos are installed under /share/apps/ergatis/external_apps

Ergatis global directories (global_id_repository, global_saved_templates) are located under /share/apps/ergatis/ergatis_global

Ergatis Data Locations

All ergatis data should be put under /home/thumper6/ergatis

Project repositories are located under /home/thumper6/ergatis/projects or symlink /share/apps/ergatis/projects

CAMERA project repository is /home/thumper6/ergatis/projects/camera

Databases are located under /home/thumper6/ergatis/db or symlink /share/apps/ergatis/db

Global scratch space is under /home/thumper6/ergatis/scratch or symlink /share/apps/ergatis/scratch

ikelite.rocksclusters.org

Fewer machines than sos cluster (~20 slots?)

Initial test ergatis install was done here (similar directory structure to sos)

Completely distinct from sos cluster

Sandbox

Shibu, Weizhong Li and others run computes here (e.g.: clustering pipeline)

Pipelines

GOS/CAMERA Pipelines Overview

[Figure: overview diagram. Labels: Metagenomic Reads; ncRNA/ORF Finding Pipeline; ORFs/peptides; Annotation Pipeline; Incremental Clustering Pipeline; Cluster Memberships]

Challenges

All computes in the pipeline must be performed on multi-sequence input/output files, as the filesystem cannot physically support 12M+ individual FASTA input/output files

Other partitioning solutions could work(?), but most tools support multiple sequence inputs anyway

Overall total space consumption was an issue when computes were running on the TIGR grid, but this is not as much of an issue (currently) on the CALIT2 grid

Solution here was to keep all inputs/outputs gzipped during pipeline execution, at the cost of some performance loss (using things like zcat -f | with NCBI BLAST, etc.)
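As an illustration of working with gzipped multi-sequence files, here is a minimal Python sketch of a reader that accepts either a gzipped or a plain FASTA file, similar in spirit to piping through zcat -f. It is not code from the pipeline; the function name read_fasta is hypothetical.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped multi-FASTA file."""
    # Detect gzip by its two-byte magic number rather than by file extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open

    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.strip())
        if header is not None:
            yield header, "".join(chunks)
```

Because records are streamed one at a time, memory use stays flat regardless of how many sequences a file holds.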

GOS/CAMERA ncRNA and ORF Finding Pipeline

GOS/CAMERA ncRNA and ORF Finding Pipeline Overview

[Figure: pipeline flow diagram. Steps: Reads; Find tRNAs, Soft-Mask tRNAs, Extract tRNAs; Find rRNAs, Soft-Mask rRNAs, Extract rRNAs; GOS ORF calling; ORF stats; ORF overlaps. Outputs: tRNAs FASTA; rRNAs FASTA; ORFs FASTA; Peptides FASTA; Metagene ORFs FASTA; Peptides FASTA]

GOS/CAMERA ncRNA and ORF Finding Pipeline

CAMERA-specific Ergatis components

camera_extract_trna

CAMERA rRNA Finder Overview

BLAST vs. a database of coded pooled rRNA subunit sequences

BLAST prefilter step with loose parameters:

    blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9

Reads with prefilter hits are searched using strict parameters:

    blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T

Collapse aligned intervals of the same rRNA type and extract the highest scoring alignments from each region

camera_filter_blast

camera_rrna_finder
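The interval-collapsing step can be sketched as follows. This is a minimal Python illustration, not the camera_rrna_finder implementation: it assumes hits are already parsed into (read_id, rrna_type, start, end, score) tuples, merges overlapping hits of the same type on the same read, and reports each merged region with its best score. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def collapse_hits(hits):
    """Merge overlapping hits per (read, rRNA type); keep the best score per region.

    hits: iterable of (read_id, rrna_type, start, end, score) tuples.
    Returns a list of (read_id, rrna_type, region_start, region_end, best_score).
    """
    grouped = defaultdict(list)
    for read_id, rrna_type, start, end, score in hits:
        # Normalise coordinates so that lo <= hi regardless of strand.
        lo, hi = min(start, end), max(start, end)
        grouped[(read_id, rrna_type)].append((lo, hi, score))

    regions = []
    for (read_id, rrna_type), spans in sorted(grouped.items()):
        spans.sort()
        cur_lo, cur_hi, best = spans[0]
        for lo, hi, score in spans[1:]:
            if lo <= cur_hi:
                # Overlaps the current region: extend it and track the best score.
                cur_hi = max(cur_hi, hi)
                best = max(best, score)
            else:
                regions.append((read_id, rrna_type, cur_lo, cur_hi, best))
                cur_lo, cur_hi, best = lo, hi, score
        regions.append((read_id, rrna_type, cur_lo, cur_hi, best))
    return regions
```

Hits of different rRNA types are deliberately never merged with each other, matching the "same rRNA type" condition on the slide.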

Custom DB

rRNA Finder DB

/usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa

5S: Sequences from Archaea, Bacteria and Eukaryota were obtained from the 5S Ribosomal RNA Database http://biobases.ibch.poznan.pl/5SData/

16S: Sequences for Archaea and Bacteria were obtained from the Green Genes 16S db http://greengenes.lbl.gov/

18S: Source was Doug Rusch's 18S database prepared for the GOS paper

23S: Source was Doug Rusch's 23S database prepared for the GOS paper

rRNA Finder DB

Fasta headers were coded as follows:

>#S [D] ...original.header...

where # is one of (5, 16, 18, 23) and D is one of (A, B, E). The camera_rrna_finder component expects this format.
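A minimal Python sketch of reading this coded header format follows. It is not the camera_rrna_finder parser. It assumes the square brackets around D are literal characters and that A, B and E stand for Archaea, Bacteria and Eukaryota; both are assumptions, as are the names used here.

```python
import re

# ">#S [D] ...original.header..." with # in (5, 16, 18, 23) and D in (A, B, E).
CODED_HEADER = re.compile(r"^>(5|16|18|23)S \[([ABE])\] (.*)$")

# Assumed meaning of the single-letter domain codes.
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(line):
    """Return (subunit, domain, original_header) for a coded FASTA header line."""
    match = CODED_HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a coded rRNA header: %r" % line)
    subunit, domain, original = match.groups()
    return subunit + "S", DOMAINS[domain], original
```

Rejecting headers that do not match, rather than guessing, keeps a malformed database entry from being silently assigned the wrong rRNA type.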

rRNA Finder DB

CD-HIT was run on the entire database to cluster sequences with high similarity, reducing the database size while maintaining a range of diverse sequences

Command line:

    /usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i input_database.fsa -o output_database.fsa -c 0.8 -n 4

Consistency of clustering was checked with a Perl script to ensure no heterogeneous clustering (e.g.: 18S and 16S clustering together)

Clusters were consistent

Database size was reduced from 65,591 sequences to 1,329
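The consistency check was done with a Perl script that is not shown. As a rough Python sketch of the same idea, the function below scans CD-HIT .clstr text and reports any cluster whose members carry more than one rRNA type. It assumes each member line contains the start of the coded header (for example ">16S") after the sequence length; the function name and sample text are hypothetical.

```python
import re
from collections import defaultdict

# Member lines are assumed to contain the coded header prefix, e.g. ">16S".
MEMBER_TYPE = re.compile(r">(\d+S)")


def mixed_clusters(clstr_text):
    """Return {cluster_name: [types]} for clusters mixing more than one rRNA type."""
    types = defaultdict(set)
    current = None
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            current = line[1:].strip()
        elif current is not None:
            match = MEMBER_TYPE.search(line)
            if match:
                types[current].add(match.group(1))
    return {name: sorted(found) for name, found in types.items() if len(found) > 1}


# Hypothetical .clstr excerpt: cluster 1 mixes 16S and 18S on purpose.
sample = """>Cluster 0
0\t1500nt, >16S... *
1\t1480nt, >16S... at +/92.10%
>Cluster 1
0\t1800nt, >18S... *
1\t1490nt, >16S... at +/81.00%
"""
print(mixed_clusters(sample))
```

An empty result corresponds to the "clusters were consistent" outcome reported above.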

rRNA Finder

open_reading_frames

ORF Overlaps/ORF Stats

FASTA Headers

>HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03 /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=1088 /length=1088

>JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"

>JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"

>JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1 /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707 /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03 /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=902 /length=902"

>JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0 /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847 /length=847"
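The headers above follow an "ID /key=value ..." pattern in which some values are double-quoted and themselves contain spaces and slashes. A minimal Python sketch of parsing that pattern is shown below. It is an illustration only, not pipeline code; parse_defline is a hypothetical name, and quoted values are returned as plain strings without further parsing.

```python
import re

# A value is either a double-quoted string or a run of non-space characters.
ATTRIBUTE = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_defline(line):
    """Split a FASTA header into (sequence_id, {attribute: value})."""
    body = line.rstrip("\n").lstrip(">")
    seq_id, _, rest = body.partition(" ")
    attrs = {}
    for match in ATTRIBUTE.finditer(rest):
        key, quoted, bare = match.groups()
        # The quoted alternative is tried first, so nested /key=value text
        # inside quotes stays part of the enclosing value.
        attrs[key] = quoted if quoted is not None else bare
    return seq_id, attrs


header = ('>JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 '
          '/read_id=HOT_READ_85760722 /begin=0 /end=234 /5_prime_stop=0 '
          '/read_defline="/accession=DU750886.1 /length=841"')
seq_id, attrs = parse_defline(header)
print(seq_id, attrs["read_id"], attrs["read_defline"])
```

All values come back as strings; converting coordinates such as begin and end to integers is left to the caller.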

The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence

RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S

Again, RNAmmer failed to identify rRNA sequence

BLAST-based approach does a pretty good job of finding correct boundaries

These ORFs have >150 unmasked bases

BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S

GOS (Incremental) Clustering Pipeline

http://camera.venterinstitute.org/wiki/display/VISWCAMERA/Incremental+clustering%2C+work+flow+details

Clustering Overview

[Figure: clustering diagram. Labels: All Public Proteins + GOS ORFs; Core Cluster (repeated for several clusters); Longest Sequence Representatives; Non-Redundant 90% Identity CD-HIT Sequence Representatives; GOS v1.2]

Historical Artifacts (with respect to annotation)

CAMERA Polypeptide Annotation Pipeline

Thoughts on Specifications

Annotation rules should not be literally codified as Perl code (and only Perl code)!!! (especially when the “decision makers” never look at the code)

What tools do we trust? What cutoffs do we use? What evidence/data types do we consider?

These will (and in some cases should) change over time

More Thoughts

Specifications are easier to change than code, so code should be written to support change

But unless they’re defined first, the specifications will be a moving target

(My) Design Objectives

Must be able to add/remove annotation data sources as the annotation SOP changes

Must be able to easily change the ways in which these annotation data types are applied/combined to produce final annotation

Must be able to change/expand the types of final annotation data we are producing

Object-Oriented Design Approach

OOP in Perl == *, but lesser of two evils (don’t ask me what the other evil is, but it must be pretty evil)

Encapsulates possible sources of change and prevents them from affecting downstream components (like HACCP)

Polymorphism of $parser->parse($infile) producing annotation objects is nice

Re-use was not really a motive here

*Damian Conway, in his OOP Perl book, says using OOP in Perl yields a 5X performance hit

Annotation Pipeline Overview

[Figure: data flow diagram. Labels: Annotation Source Data; Annotation Tool(s); Parser(s); Annotation Data Object(s); Annotation Rules; Final Annotation Data]

We can make changes to the annotation rules without necessarily having to re-run or re-parse the data

Design Objectives for Parsers

A parser must:

Produce polypeptides with associated AnnotationData objects of a defined type

Produce AnnotationData objects with attributes specified in a consistent way

E.g.: All parsers should produce EC number attributes in the full four-field form, anywhere from ‘1.1.1.1’ to ‘1.-.-.-’, and never a truncated form such as ‘1.-’. Multiple values should be split. Any clean-up or verification should be done before the AnnotationData object is created; if the data is invalid, the attribute should not be populated, or the object should not be created.

Produce annotation data objects that are independent of the source annotation data they were parsed from

E.g.: They have already been canonized as a ‘trusted annotation evidence type’ when they are created as AnnotationData objects. These trusted types are defined in the annotation SOP.

These features create a separation between how trusted evidence is defined (input data), and how the evidence is used to produce annotation (annotation rules)
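The EC number example can be made concrete with a short Python sketch. This is an illustration of the stated requirement, not code from the parsers: it splits multiple values, pads each to four fields with '-', and drops anything invalid. The function names, the accepted separators (whitespace, commas, semicolons) and the digits-only rule for fields are assumptions.

```python
import re


def normalize_one(token):
    """Return a four-field EC number such as '1.-.-.-', or None if invalid."""
    fields = token.split(".")
    if not 1 <= len(fields) <= 4 or not fields[0].isdigit():
        return None
    seen_dash = False
    for field in fields[1:]:
        if field == "-":
            seen_dash = True
        elif field.isdigit() and not seen_dash:
            continue
        else:
            # Non-numeric field, or a number after a '-' placeholder.
            return None
    # Pad truncated forms like '1.-' out to the full four fields.
    return ".".join(fields + ["-"] * (4 - len(fields)))


def normalize_ec(raw):
    """Split a raw EC string into a list of valid four-field EC numbers."""
    result = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        normalized = normalize_one(token)
        if normalized is not None:
            result.append(normalized)
    return result


print(normalize_ec("1.1.1.1, 2.7.-"))
```

A caller would populate the EC attribute only when the returned list is non-empty, which matches the rule that invalid data should not produce an attribute.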

AnnotationData Objects

[Figure: object diagram. Labels: AnnotationData; AnnotationData::Polypeptide; Polypeptide; AnnotationData Object(s)]

AnnotationData fields shown in the diagram:

type: [some string]

attributes: common_name, gene_symbol, EC, GO, TIGR_role, …

AnnotationRules

AnnotationRules object implements the rules from the annotation SOP document

AnnotationRules::PredictedProtein takes a Polypeptide object with associated AnnotationData objects of varying type and applies the annotation rules to create a final AnnotationData object

AnnotationRules

Rules are encoded as an array in the following format:

ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2

Where OPERATOR is one of:

= for assign attribute (if unassigned)

+ for append attribute

- for overwrite attribute

Any operator can be defined, as operators are applied through a hash of handler subroutines

AnnotationRules::PredictedProtein

my @annotation_order = (
    ## equivalog level tigrfam hits
    'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::Equivalog|=|GO',
    'TIGRFAM::FRAG::Exception|=|GO',
    'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
    'TIGRFAM::FullLength::Domain|=|GO',
    'PandaBLASTP::Characterized|=|GO',
    'PRIAM|=|GO EC',
    ## equivalog level hits vs tigrfam frag
    'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    ## characterized high confidence blast hit
    'PandaBLASTP::Characterized|=|common_name gene_symbol',
    ## pfam and non-equivalog tigrfams
    'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
    …
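To show how an ordered rule list and a table of operator handlers fit together, here is a minimal Python sketch. It is not the AnnotationRules::PredictedProtein code, which is Perl; it only mirrors the described behaviour of the =, + and - operators. The evidence layout (a dict of annotation type to attribute values), the function names and the example values are all hypothetical.

```python
def assign(final, attr, values):
    """'=' operator: set the attribute only if it has no value yet."""
    if not final.get(attr):
        final[attr] = list(values)


def append(final, attr, values):
    """'+' operator: add values that are not already present."""
    bucket = final.setdefault(attr, [])
    for value in values:
        if value not in bucket:
            bucket.append(value)


def overwrite(final, attr, values):
    """'-' operator: replace whatever is there."""
    final[attr] = list(values)


# New operators can be supported by adding an entry to this table.
HANDLERS = {"=": assign, "+": append, "-": overwrite}


def apply_rules(rules, evidence):
    """Apply 'TYPE|OPERATOR|ATTR1 ATTR2' rules in order; return the final annotation."""
    final = {}
    for rule in rules:
        ann_type, operator, attr_list = rule.split("|")
        source = evidence.get(ann_type)
        if source is None:
            continue  # no evidence of this type for the polypeptide
        for attr in attr_list.split():
            values = source.get(attr)
            if values:
                HANDLERS[operator](final, attr, values)
    return final


rules = [
    "TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role",
    "PRIAM|=|GO EC",
    "PandaBLASTP::Characterized|=|common_name gene_symbol",
]
# Made-up evidence for one polypeptide.
evidence = {
    "PRIAM": {"EC": ["1.1.1.1"]},
    "PandaBLASTP::Characterized": {"common_name": ["alcohol dehydrogenase"], "EC": ["9.9.9.9"]},
}
print(apply_rules(rules, evidence))
```

Because every rule here uses '=', earlier rules take precedence, so rule order encodes evidence priority; the EC value from PandaBLASTP::Characterized is ignored since that rule lists only common_name and gene_symbol.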

CAMERA Annotation Pipeline

CAMERA-specific Ergatis components

camera_annotation_parser

camera_annotation_rules

CAMERA-specific Code in SVN

http://iwebsvn.tigr.org/listing.php?repname=ANNOTATION&path=%2FCAMERA%2F&rev=0&sc=1

Future Development (My 2 cents)

Pipeline development must be driven by annotation SOP development work

Feedback on pipeline bugs must be vigilantly kept separate from feedback on annotation SOP bugs

First discuss and update the SOP, then modify the code

Cluster summary annotation

Shortest path here seems to be a combination of GO Slim and EC assignments?

GO consortium makes some scripts available for summarizing sets of GO assignments

If using the current code, a PolypeptideSet container class exists already. Cluster members can be added to a PolypeptideSet, and that can be used as input to an AnnotationRules::FinalCluster object that is similar to the one for PredictedProtein, but with a different set of handler routines.

Incremental clustering pipeline

Good luck
