
Functional Annotation

Episode 2: Preliminary Results

The Group

27th Feb 2012

Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian, Shengyun Peng, Ashwath Kumar, Hamidreza Hassanzadeh

• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
– Breadth
– Depth


Flowchart (two slides; figures only)

PRELIMINARY RESULTS


Subject Organisms

fucK: encoding fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encoding a lipoprotein, protein D.

Strain  Species          Disease State  State Isolated  Hemolysis  Hpd  fuculose-kinase
M19107  H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501  H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127  H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621  H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639  H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709  H. influenzae    Pathogenic     NY              N          -    +


BLAST: Output and Parsing

• Once the results are received from the gene prediction tools, the predicted genes are BLASTed against different databases.
• The selected threshold: 0.005
• This is done automatically by ad hoc scripts using the BioPerl library, for all 6 strains.
• The results are then processed and certain metrics extracted for further analysis.
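
The filtering and metric-extraction steps above can be sketched roughly as follows. This is a minimal Python illustration, not the group's BioPerl scripts. It assumes that the 0.005 threshold is an E-value cutoff, that hits are in BLAST tabular format (12 standard columns, E-value in column 11), and that subject IDs follow the UniProt `db|ACCESSION|NAME_ORGANISM` convention. The function names are hypothetical.

```python
import csv
import io

EVALUE_THRESHOLD = 0.005  # assumption: the slide's threshold is an E-value cutoff


def filter_hits(tabular_text, threshold=EVALUE_THRESHOLD):
    """Keep rows of BLAST tabular output whose E-value is <= threshold."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, malformed and comment lines
        if float(row[10]) <= threshold:
            kept.append(row)
    return kept


def unique_organisms(hits):
    """Collect organism codes from UniProt-style subject IDs (sp|P12345|NAME_ORG)."""
    organisms = set()
    for row in hits:
        entry_name = row[1].split("|")[-1]
        if "_" in entry_name:
            organisms.add(entry_name.rsplit("_", 1)[1])
    return organisms
```

Counting `len(unique_organisms(hits))` per strain would give a coverage figure of the kind reported for the UniProt search, under the assumptions stated above.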


BLAST v/s UniProt: Coverage

Organism  # of unique organisms in the hits
M19107    2338
M19501    2332
M21127    2360
M21621    2364
M21639    2433
M21709    2154


BLAST v/s UniProt: M19107

Pie chart of hits by genus. Legend: Pasteurella, Ralstonia, Lactobacillus, Mus, Coxiella, Homo, Xylella, Legionella, Klebsiella, Erwinia, Arabidopsis, Rickettsia, Brucella, Rhizobium, Bordetella, Actinobacillus, Acinetobacter, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Shigella, Haemophilus, Bacillus, Vibrio, Staphylococcus, Burkholderia, Yersinia, Shewanella, Pseudomonas, Salmonella, Escherichia, Others

BLAST v/s UniProt: M21709

Pie chart of hits by genus. Legend: Listeria, Homo, Coxiella, Legionella, Erwinia, Xylella, Klebsiella, Arabidopsis, Rickettsia, Brucella, Rhizobium, Bordetella, Actinobacillus, Acinetobacter, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Shigella, Vibrio, Burkholderia, Staphylococcus, Haemophilus, Bacillus, Yersinia, Shewanella, Pseudomonas, Salmonella, Escherichia, Others

CONSERVED DOMAIN DATABASE (CDD)

Introduction

• CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.

• These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST.

• The PSSMs are meant to be used for compiling RPS-BLAST search databases only.

RPS-BLAST

• Reverse Position-Specific BLAST
• It searches a query sequence against a database of profiles (the opposite of PSI-BLAST).
• It uses a pre-computed lookup table for the profiles to allow the search to proceed faster (architecture dependent).
• The CD-Search databases for RPS-BLAST: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
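
To illustrate what searching a query against a profile means, the sketch below slides a toy position-specific score matrix along a sequence and reports the best-scoring window. It is a simplified Python illustration only: the matrix values and the function name are made up, and the real RPS-BLAST uses a lookup table and gapped extensions rather than this exhaustive ungapped scan.

```python
def best_pssm_match(query, pssm, default_score=-1):
    """Slide a PSSM along the query and return (best_score, start_index).

    pssm is a list of dicts, one per profile column, mapping residue -> score.
    Residues missing from a column receive default_score.
    """
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the profile")
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(col.get(res, default_score) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy three-column profile favouring the motif "GKS" (made-up scores).
toy_pssm = [{"G": 5, "A": 1}, {"K": 4, "R": 2}, {"S": 3, "T": 3}]
print(best_pssm_match("MAGKSTL", toy_pssm))  # prints (12, 2)
```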

Strategy

FORMATRPSDB

• Formatrpsdb is a utility that converts a collection of input sequences into a database suitable for use with RPS-Blast.

• Formatrpsdb is designed to perform the work of formatdb, makemat and copymat simultaneously, without generating the large number of intermediate files these utilities would need to create an RPS Blast database.

Build Database

• Title for database file
• Input file containing list of ASN.1 Scoremat filenames
• Create index files for database
• Threshold for extending hits for RPS database
• For scoremats that contain only residue frequencies, the scaling factor to apply when creating PSSMs
• Base name of output database
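
The options listed above could be assembled into a command line along these lines. This is a hedged sketch: the flag letters (`-t`, `-i`, `-o`, `-f`, `-S`, `-n`) are recalled from the legacy NCBI toolkit documentation rather than taken from the slide text, and the file names and numeric values are placeholders, so everything should be checked against the tool's own help output before use.

```python
import shlex


def build_formatrpsdb_command(title, scoremat_list, basename,
                              threshold=9.82, scale=100.0, index=True):
    """Assemble a formatrpsdb argument list (flag letters are an assumption)."""
    return [
        "formatrpsdb",
        "-t", title,                  # title for database file
        "-i", scoremat_list,          # file listing ASN.1 scoremat filenames
        "-o", "T" if index else "F",  # create index files for database
        "-f", str(threshold),         # threshold for extending hits (placeholder value)
        "-S", str(scale),             # scaling factor for frequency-only scoremats (placeholder)
        "-n", basename,               # base name of output database
    ]


# Placeholder names; pass cmd to subprocess.run to execute it.
cmd = build_formatrpsdb_command("Cog", "Cog.pn", "Cog")
print(shlex.join(cmd))
```

Building the argument list separately from running it keeps the command easy to inspect and log for each of the six strains.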

RUN RPS-BLAST

Results for CDD: COGs
Organism: M19107

Results for CDD: COGs
Organism: M21709

LipoP

• LipoP classifies genes into 4 classes:
– SpI: Signal peptide I
– SpII: Lipoprotein signal peptide
– TMH: N-terminal transmembrane helix (not very reliable; it is used to avoid TMHs being falsely predicted as signal peptides)
– CYT: Cytoplasmic (all the rest)
• The classification system in LipoP uses an HMM with four branches, one each for SpI, SpII, TMH and CYT.
• Protein sets for training and testing were extracted from SWISS-PROT.
• They consisted of lipoproteins, SPase I-cleaved proteins and cytoplasmic proteins from the two Gram-negative phyla Proteobacteria and Spirochetes.
• Transmembrane proteins were taken from the phyla Proteobacteria and Gracilicutes.

Output Example

    # M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
    # Cut-off=-3
    M19107_final_1488 LipoP1.0:Best SpI 1 1 11.1193
    M19107_final_1488 LipoP1.0:Margin SpI 1 1 11.320213
    M19107_final_1488 LipoP1.0:Class CYT 1 1 -0.200913
    M19107_final_1488 LipoP1.0:Class SpII 1 1 -1.80091
    M19107_final_1488 LipoP1.0:Signal CleavI 31 32 11.119 # PISHA|SDLNQ
    M19107_final_1488 LipoP1.0:Signal CleavI 30 31 -2.18348 # SPISH|ASDLN
    M19107_final_1488 LipoP1.0:Signal CleavII 19 20 -1.80091 # TALFS|CGLLI Pos+2=G

1. Sequence ID.
2. Type of prediction. Best means the highest-scoring class, Margin gives the difference between the best score and the second-best score, Class gives the score of the other classes, and Signal lines contain predicted cleavage sites.
3. Feature type.
4. Location in the sequence. For lines with a class prediction it is always 1. For cleavage sites it is the last amino acid of the signal peptide relative to the predicted cleavage site.
5. Location, same as above, except that for cleavage sites it is the first amino acid after the cleavage site.
6. Score. For the "Margin" type it is the difference between the best and the second-best class score.
7. For cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in position +2 is shown (which may determine whether the lipoprotein is attached to the inner or outer membrane). An aspartic acid (D) in position +2 after the cleavage site of a lipoprotein means that it is attached to the inner membrane; most other lipoproteins are attached to the outer membrane ("Testing the '+2 rule' for lipoprotein sorting in the Escherichia coli cell envelope with a new genetic selection", Seydel et al. (1999) Molecular Microbiology 34: 810-821).
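
The field layout described above lends itself to a small parser. The following Python sketch is illustrative only: the function names and the returned dictionary layout are hypothetical, and the '+2 rule' helper encodes just the rule of thumb quoted above (D at +2 suggests inner membrane), not a definitive localization call.

```python
def parse_lipop(text):
    """Parse LipoP short-format output into {sequence_id: summary dict}."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header lines repeat what the data lines already hold
        data, _, comment = line.partition("#")
        fields = data.split()
        if len(fields) < 6:
            continue
        seq_id, kind, feature, start, end, score = fields[:6]
        entry = results.setdefault(seq_id, {"best": None, "sites": []})
        kind = kind.split(":")[-1]  # "LipoP1.0:Best" -> "Best"
        if kind == "Best":
            entry["best"] = (feature, float(score))
        elif kind == "Signal":
            site = {"type": feature, "last": int(start), "first": int(end),
                    "score": float(score), "plus2": None}
            if "Pos+2=" in comment:
                site["plus2"] = comment.split("Pos+2=")[1].strip()[:1]
            entry["sites"].append(site)
    return results


def lipoprotein_location(plus2_residue):
    """Apply the '+2 rule' of thumb: D at +2 suggests inner membrane."""
    return "inner membrane" if plus2_residue == "D" else "outer membrane"
```

Counting entries whose best class is SpII and whose SpII cleavage site has D at +2 would yield an inner-membrane lipoprotein tally per strain, assuming the output format matches the example shown.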

Results

Strain  SpI  SpII  Inner Membrane Lipoproteins  TMH  CYT   Total
M19107  164  54    2                            241  1470  1929
M19501  176  60    3                            228  1293  1757
M21127  174  67    3                            244  1564  2049
M21621  178  64    2                            244  1413  1899
M21639  194  82    4                            267  2072  2615
M21709  144  53    2                            225  1383  1805

SignalP

Biological background

• Many different types of secretory signals are found. SignalP focuses on the prediction of classical signal peptides, by far the most common type, which are cleaved by signal peptidase I (SPase I).

• In bacteria, proteins carrying a signal peptide are targeted directly to the cell membrane.

SignalP

• SignalP 3.0 was the best method among PrediSi, SPEPlip, Signal-CF, Signal-3L and Signal-BLAST (Choo, K., Tan, T. & Ranganathan, S. BMC Bioinformatics 10, S2 (2009)).

• SignalP 4.0 performs even better, and hence was included in our method (Petersen, T. N., et al. "SignalP 4.0: discriminating signal peptides from transmembrane regions". Nature Methods 8:785-786, 2011).

SignalP

• SignalP 4.0 is a purely neural network–based method.
• Two types of networks in SignalP 4.0:
– SignalP-TM networks
– SignalP-noTM networks
• The decision to select a network: if SignalP-TM predicts four or more positions as being transmembrane positions, SignalP-TM is used for the final prediction; otherwise SignalP-noTM is used.
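
The selection rule stated above is simple enough to write down directly. This is a minimal Python sketch of the rule as worded on the slide; the function and parameter names are hypothetical and it does not reproduce SignalP's networks themselves.

```python
def choose_network(tm_positions_predicted, min_tm_positions=4):
    """Return which SignalP 4.0 network supplies the final prediction.

    tm_positions_predicted: number of positions that the SignalP-TM
    network labels as transmembrane for one protein.
    """
    if tm_positions_predicted >= min_tm_positions:
        return "SignalP-TM"
    return "SignalP-noTM"
```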

Results from SignalP

Organism  No. Signal Pep  Total Genes  Percentage
M19107    144             1929         7.47%
M19501    150             1757         8.54%
M21127    152             2049         7.42%
M21621    151             1899         7.95%
M21639    178             2615         6.81%
M21709    122             1805         6.76%

Comparison between LipoP and SignalP

• The results obtained from LipoP and SignalP were compared with the help of a script.

• Both SpI and SpII were taken from LipoP and all the positive outputs were taken from SignalP.

• They were also analyzed for similar cleavage sites.
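
A comparison of this kind can be sketched with set operations. The Python below is an assumption-laden illustration, not the script used by the group: it takes each tool's positive predictions as a hypothetical `{gene_id: cleavage_position}` mapping and treats a cleavage site as consistent when both tools report the same position.

```python
def compare_predictions(lipop, signalp):
    """Compare two {gene_id: cleavage_position} maps of positive predictions."""
    lipop_ids, signalp_ids = set(lipop), set(signalp)
    common = lipop_ids & signalp_ids
    consistent = {g for g in common if lipop[g] == signalp[g]}
    return {
        "lipop_unique": len(lipop_ids - signalp_ids),
        "signalp_unique": len(signalp_ids - lipop_ids),
        "common": len(common),
        "consistent_sites": len(consistent),
        "conflicting_sites": len(common - consistent),
    }


# Toy input with made-up gene IDs and cleavage positions.
lipop_calls = {"g1": 31, "g2": 19, "g3": 25}
signalp_calls = {"g1": 31, "g2": 22, "g4": 18}
print(compare_predictions(lipop_calls, signalp_calls))
```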

Comparison table

LipoP Unique, SignalP Unique and Common count genes predicted to have signal peptides; Consistent Sites and Conflicting Sites count cleavage sites detected.

Organism  LipoP Unique  SignalP Unique  Common  Negatives  Total # of Genes  Consistent Sites  Conflicting Sites
M19107    75            1               143     1710       1929              112               31
M19501    86            0               150     1521       1757              115               35
M21127    89            0               152     1808       2049              114               38
M21621    91            0               151     1657       1899              117               34
M21639    100           2               176     2337       2615              126               50
M21709    75            0               122     1608       1805              93                29

Venn diagrams (SignalP vs. LipoP), one per strain: M19107, M19501, M21127, M21621, M21639, M21709

Comparison between LipoP and SignalP

• Bottom line: as the Venn diagrams show, SignalP provided little new information compared with LipoP.


TMHMM
Prediction of transmembrane helices in proteins

Organism  No. Transmembrane Helices  Total Genes  Percentage
M19107    392                        1929         20.32%
M19501    385                        1757         21.91%
M21127    417                        2049         20.35%
M21621    413                        1899         21.75%
M21639    464                        2615         17.74%
M21709    361                        1805         20.00%

Member Database  Focus/Features
PFAM             divergent domains
PROSITE          functional sites
PRINTS           hierarchical definitions from superfamily to subfamily levels
TIGRFAMs         builds HMMs for functionally equivalent proteins
PIRSF            produces HMMs over the full length of a protein, with protein length restrictions to gather family members
HAMAP            profiles manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies
PANTHER          builds HMMs based on the divergence of function within families
SUPERFAMILY      uses structure, with SCOP as a basis for building HMMs
GENE3D           uses structure, with the CATH superfamilies as a basis for building HMMs

Member signature databases: similar coverage in size; different content

About
• A wrapper of sequence analysis applications
• Database and output file scanning
• Bulk data processing
• Efficient (parallel) internal architecture

Querying with InterProScan

InterProScan: Query Sequence

• Input
– Nucleotide* or protein sequences
– Recognized sequence formats: raw, FASTA or EMBL
– Reformat and translate (if necessary)

*Nucleotide sequences will be translated and scanned in all 6 frames without any further assumption.

• Running InterProScan (screenshot at <60 s)

• Output
– InterProScan makes results available in the following formats: raw, ebixml, xml, txt, html

• Parse InterProScan output (BioPerl)
– Bio::SeqIO::interpro

• Interpretation of output data (example)

Key                                 Interpretation
10683_1_ORF1                        the ID of the input sequence
024307F93E501F2C                    the CRC64 (checksum) of the protein sequence (supposed to be unique)
404                                 the length of the sequence (in AA)
HMMPfam                             the analysis method launched
PF03453                             the member database entry for this match
MoeA_N                              the member database description for the entry
1                                   the start of the domain match
163                                 the end of the domain match
1.49999999999999999E-56             the E-value of the match (reported by the member database method)
T                                   the status of the match (T: true, ?: unknown)
26-Feb-12                           the date of the run
IPR005110                           the corresponding InterPro entry (if iprlookup requested by the user)
MoeA, N-terminal and linker domain  the description of the InterPro entry
Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)  the GO (gene ontology) description for the InterPro entry
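
The fields interpreted above map naturally onto a record type. The sketch below is a Python illustration rather than the BioPerl route named on the slide; it assumes the raw output is one tab-separated line per match with the fields in the order listed, and the field and function names are hypothetical.

```python
from collections import namedtuple

RAW_FIELDS = ["seq_id", "crc64", "length", "method", "member_id", "member_desc",
              "start", "end", "evalue", "status", "run_date",
              "ipr_id", "ipr_desc", "go_terms"]
RawMatch = namedtuple("RawMatch", RAW_FIELDS)


def _to_float(text):
    """Convert an E-value string, returning None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_raw_line(line):
    """Turn one tab-separated raw-format line into a RawMatch record."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(RAW_FIELDS) - len(values))  # InterPro/GO columns may be absent
    match = RawMatch(*values[:len(RAW_FIELDS)])
    return match._replace(length=int(match.length), start=int(match.start),
                          end=int(match.end), evalue=_to_float(match.evalue))
```

Grouping the parsed records by `seq_id` would then give, per protein, the list of member-database matches and any linked InterPro and GO annotations.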

Preliminary Results

M19107

Total searched proteins: 1,769
Match: 1,716
Unmatch: 378
Total hits: 12,393

Diagram counts: 53, 325, 1,391

Next Up

• Major Challenge: Funneling all the annotation information into a consolidated GenBank/GFF3 entry.

• Level 2!


Level 2
Operons, Virulence Factors and Metabolic Pathways

VIRULENCE
Likelihood of a pathogen causing disease

H. haemolyticus

• As the species name implies, it is generally hemolytic on blood agar plates.

• The beta-hemolytic phenotype is routinely used in the clinical setting to distinguish H. haemolyticus from NTHi.

• Nonhemolytic H. haemolyticus strains are being isolated and consequently misidentified as NTHi.

Gene(s) encoding hemolysin: unknown (Xin Wang, Meningitis Laboratory, CDC)

Photograph from MicrobeLibrary.org


Virulence factors
• Refer to the traits, encoded by virulence genes, that equip pathogenic microbes to cause infection.

How?
– Attach selectively to host tissues
– Colonize parts of the host body
– Gain access to nutrients by invading or destroying host tissues
– Avoid host defenses

• Virulence factors include:
– Bacterial toxins
– Cell surface proteins that mediate bacterial attachment
– Cell surface carbohydrates and proteins that protect a bacterium
– Hydrolytic enzymes that may contribute to the pathogenicity of the bacterium


VFDB: Virulence Factor Database
• Set up in 2004
• Up-to-date information on validated VFs from 24 genera of medically important bacterial pathogens
• Detailed tabular comparison of virulence composition in terms of virulence genes and their composition
• Multiple alignment and statistical analysis of homologous VFs
• Graphical comparison of virulence genes
• VFs
– Adhesion & invasion
– Bacterial secretion systems & effectors
– Toxins
– Iron-acquisition system
• Pathogenicity island


Operon and Pathway Analysis

• As was pointed out by Alejandro Caro, usually a missing gene in an otherwise complete pathway reflects a hole in the annotation process.

• This path serves to fill such holes in the annotation process.
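
The idea of a pathway hole can be made concrete with a small check. The Python sketch below is only an illustration of the reasoning, not of how Pathway Tools implements its hole filler: the function name, the `max_missing` parameter and the toy pathway definition are all hypothetical.

```python
def find_pathway_holes(pathways, annotated_genes, max_missing=1):
    """Return {pathway: missing genes} for pathways that are nearly complete.

    pathways maps a pathway name to the genes it requires; a pathway is
    reported when it lacks between 1 and max_missing of those genes.
    """
    annotated = set(annotated_genes)
    holes = {}
    for name, required in pathways.items():
        missing = sorted(set(required) - annotated)
        if 0 < len(missing) <= max_missing:
            holes[name] = missing
    return holes


# Toy example: one gene of a three-gene pathway was not annotated.
toy_pathways = {"fucose utilization": ["fucI", "fucK", "fucA"]}
print(find_pathway_holes(toy_pathways, ["fucI", "fucA"]))
```

A reported hole is a candidate for re-examining the annotation, although it can also reflect a genuine gene loss, such as the fucK deletion noted for some isolates.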


DOOR (Database of prOkaryotic OpeRons)

• DOOR (Database of prOkaryotic OpeRons) is an operon database developed by the Computational Systems Biology Lab (CSBL) at UGA. The operons in the database are based on prediction.

• As of 2009, DOOR is the largest operon database available.

• Its prediction algorithm is consistently the best in all aspects, including sensitivity and specificity for both true positives and true negatives, and its overall accuracy reaches ~90%.

• Currently DOOR has operons for 971 prokaryotic genomes.

• Although most operons in DOOR are not verified by experiments, the developers are also trying to provide some limited literature information, which is extracted from ODB.

FOUR STRAINS IN DOOR

Strategy

THE PATHWAY TOOLS

A Glance at the End of Annotation

Enable
• Browsing of annotated genes
• Analysis of pathways

Database

"Do not use a DBMS when the initial investment in hardware, software, and training is too high."

- Shamkant Navathe, Georgia Institute of Technology

The Pathway Tools

"Pathway Tools is a production-quality software environment for creating a type of model-organism database called a Pathway/Genome Database (PGDB)"

• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis

The Pathway Tools

• Pros
– BioCyc Tier 1 and Tier 2 databases are highly curated
– Enables editing (curation) and querying of PGDBs
• Cons
– BioCyc has fewer genomes than other databases
– Some tools are only available in the local version (e.g. PathoLogic)

WHY “The Pathway Tools”?

• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis

The Pathway Tools

PathoLogic

The Pathway Tools Local Version (GUI)

PathoLogic

Inputs and outputs of the computational inference modules within PathoLogic
