Functional Annotation
Episode 2: Preliminary Results
The Group
27th Feb 2012
Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian,
Shengyun Peng, Ashwath Kumar,
Hamidreza Hassanzadeh
Recap
• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
– Breadth
– Depth
Flowchart
PRELIMINARY RESULTS
Subject Organisms
fucK: encodes fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encodes a lipoprotein, protein D.

Strain  Species          Disease State  State Isolated  Hemolysis  Hpd  fuculose-kinase
M19107  H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501  H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127  H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621  H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639  H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709  H. influenzae    Pathogenic     NY              N          -    +
BLAST: Output and Parsing
• Once the results are received from the gene prediction tools, we BLAST them against different databases
• The selected threshold: 0.005
• This is done automatically by ad hoc scripts using the BioPerl library, for all 6 strains
• The results are then processed and certain metrics are extracted for further analysis
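The filtering step described above can be sketched as follows. This is a minimal illustration, not the group's BioPerl scripts: it assumes the BLAST results are in the standard 12-column tabular layout and that the 0.005 threshold is an E-value cutoff (the slide does not say which statistic it applies to). The function name `best_hits` is hypothetical.

```python
import csv
import io

# Threshold from the slide; treating it as an E-value cutoff is an assumption.
EVALUE_CUTOFF = 0.005

def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} for the best hit per query.

    Assumes the standard 12-column tabular BLAST layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#") or len(row) < 12:
            continue  # skip blank, comment, or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit does not pass the threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Keeping only the highest-scoring hit per query is one possible metric; the same loop could just as well collect all passing hits per gene.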
BLAST v/s UniProt: Coverage
Organism  # of unique organisms in the hits
M19107    2338
M19501    2332
M21127    2360
M21621    2364
M21639    2433
M21709    2154
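A count like the one in this table could be produced along the following lines. This is a sketch under assumptions: it supposes each hit carries a UniProt-style description line with an `OS=` (organism) field, and the function names and example identifiers are hypothetical.

```python
import re
from collections import Counter

# In UniProt-style descriptions the source organism follows "OS=" and runs
# until the next two-letter "XX=" field or the end of the line.
OS_FIELD = re.compile(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)")

def unique_organisms(hit_descriptions):
    """Return the set of distinct organism names found in the hit descriptions."""
    organisms = set()
    for description in hit_descriptions:
        match = OS_FIELD.search(description)
        if match:
            organisms.add(match.group(1).strip())
    return organisms

def genus_counts(hit_descriptions):
    """Count hits per genus (first word of the organism name)."""
    counts = Counter()
    for description in hit_descriptions:
        match = OS_FIELD.search(description)
        if match:
            counts[match.group(1).split()[0]] += 1
    return counts
```

The per-genus counter is the kind of summary that would feed a chart of hits by genus.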
BLAST v/s UniProt: M19107
[Pie chart: genera represented among the UniProt BLAST hits for M19107. Legend: Pasteurella, Ralstonia, Lactobacillus, Mus, Coxiella, Homo, Xylella, Legionella, Klebsiella, Erwinia, Arabidopsis, Rickettsia, Brucella, Rhizobium, Bordetella, Actinobacillus, Acinetobacter, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Shigella, Haemophilus, Bacillus, Vibrio, Staphylococcus, Burkholderia, Yersinia, Shewanella, Pseudomonas, Salmonella, Escherichia, Others]
BLAST v/s UniProt: M21709
[Pie chart: genera represented among the UniProt BLAST hits for M21709. Legend: Listeria, Homo, Coxiella, Legionella, Erwinia, Xylella, Klebsiella, Arabidopsis, Rickettsia, Brucella, Rhizobium, Bordetella, Actinobacillus, Acinetobacter, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Shigella, Vibrio, Burkholderia, Staphylococcus, Haemophilus, Bacillus, Yersinia, Shewanella, Pseudomonas, Salmonella, Escherichia, Others]
CONSERVED DOMAIN DATABASE (CDD)
Introduction
• CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.
• These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST.
• The PSSMs are meant to be used for compiling RPS-BLAST search databases only.
RPS-BLAST
• Reverse Position-Specific BLAST
• It searches a query sequence against a database of profiles (the opposite of PSI-BLAST).
• It uses a precomputed lookup table for the profiles so that the search proceeds faster (architecture dependent).
• The CD-Search databases for RPS-BLAST: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
Strategy
FORMATRPSDB
• Formatrpsdb is a utility that converts a collection of input sequences into a database suitable for use with RPS-BLAST.
• Formatrpsdb is designed to perform the work of formatdb, makemat and copymat in a single step, without generating the large number of intermediate files these utilities would need to create an RPS-BLAST database.
Build Database
• Title for database file
• Input file containing list of ASN.1 Scoremat filenames
• Create index files for database
• Threshold for extending hits for RPS database
• For scoremats that contain only residue frequencies, the scaling factor to apply when creating PSSMs
• Base name of output database
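The options listed above could be assembled into a command line as sketched below. The flag letters are an assumption based on the legacy `formatrpsdb` usage text and should be checked against the installed version; the file names and numeric values in the usage example are hypothetical placeholders, not the values used in this project. The sketch only builds the argument list and does not run anything.

```python
import shlex

def formatrpsdb_command(title, scoremat_list, base_name, threshold, scale):
    """Assemble (but do not run) a formatrpsdb argument list.

    Flag letters are assumed from the legacy formatrpsdb usage text.
    """
    return [
        "formatrpsdb",
        "-t", title,            # title for database file
        "-i", scoremat_list,    # input file containing list of ASN.1 Scoremat filenames
        "-o", "T",              # create index files for database
        "-f", str(threshold),   # threshold for extending hits for RPS database
        "-S", str(scale),       # scaling factor for scoremats with only residue frequencies
        "-n", base_name,        # base name of output database
    ]

# Hypothetical usage: print the command for inspection before running it by hand.
command = formatrpsdb_command("Cog", "Cog.pn", "Cog", threshold=9.82, scale=100.0)
print(shlex.join(command))
```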
RUN RPS-BLAST
Results for CDD: COGs
Organism: M19107
[Chart: CDD COG results for M19107]
Results for CDD: COGs
Organism: M21709
[Chart: CDD COG results for M21709]
LipoP
• LipoP classifies genes into 4 classes:
– SpI: Signal peptide I
– SpII: Lipoprotein signal peptide
– TMH: N-terminal transmembrane helix (not very reliable; it is used to avoid TMHs being falsely predicted as signal peptides)
– CYT: Cytoplasmic (all the rest)
• The classification system in LipoP uses an HMM with four branches, one each for SpI, SpII, TMH and CYT.
• Protein sets for training and testing were extracted from SWISS-PROT.
• They consisted of lipoproteins, SPaseI-cleaved proteins and cytoplasmic proteins from the two Gram-negative phyla Proteobacteria and Spirochetes.
• Transmembrane proteins were taken from the phyla Proteobacteria and Gracilicutes.
Output Example
# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
# Cut-off=-3
M19107_final_1488 LipoP1.0:Best SpI 1 1 11.1193
M19107_final_1488 LipoP1.0:Margin SpI 1 1 11.320213
M19107_final_1488 LipoP1.0:Class CYT 1 1 -0.200913
M19107_final_1488 LipoP1.0:Class SpII 1 1 -1.80091
M19107_final_1488 LipoP1.0:Signal CleavI 31 32 11.119 # PISHA|SDLNQ
M19107_final_1488 LipoP1.0:Signal CleavI 30 31 -2.18348 # SPISH|ASDLN
M19107_final_1488 LipoP1.0:Signal CleavII 19 20 -1.80091 # TALFS|CGLLI Pos+2=G
Columns:
1. Sequence ID.
2. Type of prediction. "Best" is the highest-scoring class, "Margin" gives the difference between the best score and the second-best score, "Class" gives the score of the other classes, and "Signal" lines contain predicted cleavage sites.
3. Feature type.
4. Location in the sequence. For lines with a class prediction it is always 1. For cleavage sites it is the last amino acid of the signal peptide, i.e. the residue just before the predicted cleavage site.
5. Location, as above, except that for cleavage sites it is the first amino acid after the cleavage site.
6. Score. For the "Margin" type it is the difference between the best and the second-best class score.
7. For cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in position +2 is shown (which may determine whether the lipoprotein is attached to the inner or outer membrane). An aspartic acid (D) in position +2 after the cleavage site of a lipoprotein means that it is attached to the inner membrane; most other lipoproteins are attached to the outer membrane ("Testing the '+2 rule' for lipoprotein sorting in the Escherichia coli cell envelope with a new genetic selection", Seydel et al. (1999) Molecular Microbiology 34: 810-821).
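The output layout described above can be read programmatically. The following is a minimal sketch that parses the per-sequence summary line (the first `#` line of the example) and applies the +2 rule as stated in the key; the function names are hypothetical, and the regular expression is written against the single example shown here rather than against a format specification.

```python
import re

# Summary line, e.g.:
# "# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32"
SUMMARY = re.compile(
    r"^#\s+(?P<seq_id>\S+)\s+(?P<cls>SpII|SpI|TMH|CYT)\s+score=(?P<score>\S+)"
    r"(?:\s+margin=(?P<margin>\S+))?"
    r"(?:\s+cleavage=(?P<last>\d+)-(?P<first>\d+))?"
)

def parse_summary(line):
    """Parse a LipoP summary line; return None for any other line."""
    m = SUMMARY.match(line.strip())
    if m is None:
        return None
    return {
        "seq_id": m["seq_id"],
        "class": m["cls"],
        "score": float(m["score"]),
        "margin": float(m["margin"]) if m["margin"] else None,
        # (last residue of the signal peptide, first residue after the cut)
        "cleavage": (int(m["last"]), int(m["first"])) if m["last"] else None,
    }

def lipoprotein_membrane(plus2_residue):
    """Apply the '+2 rule' as stated on the slide: D at +2 means inner membrane."""
    if plus2_residue.upper() == "D":
        return "inner membrane"
    return "outer membrane (most likely)"
```

Lines such as `# Cut-off=-3` do not match the pattern and are returned as `None`, so a caller can simply skip them.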
Results
Species  Strain  SpI  SpII  Inner Membrane Lipoproteins  TMH  CYT   Total
Hh       M19107  164  54    2                            241  1470  1929
Hh       M19501  176  60    3                            228  1293  1757
Hh       M21127  174  67    3                            244  1564  2049
Hh       M21621  178  64    2                            244  1413  1899
Hh       M21639  194  82    4                            267  2072  2615
Hi       M21709  144  53    2                            225  1383  1805
SignalP
Biological background
• Many different types of secretory signals exist. SignalP focuses on predicting classical signal peptides, by far the most common type, which are cleaved by signal peptidase I (SPase I).
• In bacteria, signal peptides target proteins directly to the cell membrane.
SignalP
• SignalP 3.0 was the best-performing method in a comparison with PrediSi, SPEPlip, Signal-CF, Signal-3L and Signal-BLAST (Choo, K., Tan, T. & Ranganathan, S. BMC Bioinformatics 10, S2 (2009)).
• SignalP 4.0 is reported to perform better still, and hence was included in our method (SignalP 4.0: discriminating signal peptides from transmembrane regions. Thomas Nordahl Petersen, et al. Nature Methods, 8:785-786, 2011).
SignalP
• SignalP 4.0 is a purely neural network-based method.
• Two types of networks in SignalP 4.0:
– SignalP-TM networks
– SignalP-noTM networks
• Network selection: if SignalP-TM predicts four or more positions as being transmembrane positions, SignalP-TM is used for the final prediction; otherwise SignalP-noTM is used.
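The selection rule above amounts to a single comparison. A minimal sketch follows; the function name is hypothetical, and the count of predicted transmembrane positions is assumed to be available from the SignalP-TM network output.

```python
def choose_network(tm_positions_predicted):
    """Pick the network for the final prediction, following the rule on the slide.

    tm_positions_predicted: number of positions that SignalP-TM labels
    as transmembrane for the sequence in question.
    """
    if tm_positions_predicted >= 4:
        return "SignalP-TM"
    return "SignalP-noTM"
```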
Results from SignalP
Organism  No. Signal Pep  Total Genes  Percentage
M19107    144             1929         7.47%
M19501    150             1757         8.54%
M21127    152             2049         7.42%
M21621    151             1899         7.95%
M21639    178             2615         6.81%
M21709    122             1805         6.76%
Comparison between LipoP and SignalP
• The results obtained from LipoP and SignalP were compared using a script.
• Both SpI and SpII predictions were taken from LipoP, and all positive predictions were taken from SignalP.
• Genes predicted by both tools were also checked for agreement between the predicted cleavage sites.
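The comparison described above is essentially set arithmetic over gene identifiers. The sketch below is an illustration under assumptions, not the script that was used: each tool's positive predictions are assumed to be available as a mapping from gene ID to predicted cleavage position, and all names are hypothetical.

```python
def compare_predictions(lipop_sites, signalp_sites):
    """Compare two sets of signal peptide predictions.

    Both arguments map gene ID -> predicted cleavage position
    (for example, the last residue of the signal peptide).
    """
    lipop_ids, signalp_ids = set(lipop_sites), set(signalp_sites)
    common = lipop_ids & signalp_ids
    # A shared gene is "consistent" when both tools place the cut at the same position.
    consistent = {gene for gene in common if lipop_sites[gene] == signalp_sites[gene]}
    return {
        "lipop_unique": len(lipop_ids - signalp_ids),
        "signalp_unique": len(signalp_ids - lipop_ids),
        "common": len(common),
        "consistent_sites": len(consistent),
        "conflicting_sites": len(common) - len(consistent),
    }
```

Genes absent from both mappings are the negatives, so their count is the total number of genes minus the three positive categories.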
Comparison table
(LipoP Unique, SignalP Unique, Common: no. of genes predicted to have signal peptides. Consistent Sites, Conflicting Sites: no. of cleavage sites detected.)

Organism  LipoP Unique  SignalP Unique  Common  Negatives  Total # of Genes  Consistent Sites  Conflicting Sites
M19107    75            1               143     1710       1929              112               31
M19501    86            0               150     1521       1757              115               35
M21127    89            0               152     1808       2049              114               38
M21621    91            0               151     1657       1899              117               34
M21639    100           2               176     2337       2615              126               50
M21709    75            0               122     1608       1805              93                29
[Venn diagrams of SignalP vs. LipoP signal peptide predictions, one per strain (M19107, M19501, M21639, M21127, M21621, M21709); the counts are those given in the comparison table]
Comparison between LipoP and SignalP
• Bottom line: as the Venn diagrams show, SignalP did not provide much new information compared with LipoP.
TMHMM: Prediction of transmembrane helices in proteins
TMHMM
Organism  No. Transmembrane Helices  Total Genes  Percentage
M19107    392                        1929         20.32%
M19501    385                        1757         21.91%
M21127    417                        2049         20.35%
M21621    413                        1899         21.75%
M21639    464                        2615         17.74%
M21709    361                        1805         20.00%
Member signature databases: similar coverage in size; different content
Member Database: Focus/Features
• PFAM: divergent domains
• PROSITE: functional sites
• PRINTS: hierarchical definitions from superfamily to subfamily levels
• TIGRFAMs: builds HMMs for functionally equivalent proteins
• PIRSF: produces HMMs over the full length of a protein and applies protein length restrictions to gather family members
• HAMAP profiles: manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies
• PANTHER: builds HMMs based on the divergence of function within families
• SUPERFAMILY: structure-based; uses SCOP as a basis for building HMMs
• GENE3D: structure-based; uses the CATH superfamilies as a basis for building HMMs
About InterProScan
• A wrapper of sequence analysis applications
• Database and output file scanning
• Bulk data processing
• Efficient (parallel) internal architecture
Querying with InterProScan
InterProScan Query Sequence
• Input
– Nucleotide* or protein sequences
– Recognized sequence formats: raw, FASTA or EMBL
– Reformat and translate (if necessary)
*Nucleotide sequences will be translated and scanned in all 6 frames without any further assumption
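To make the six-frame translation mentioned in the footnote concrete, here is a small self-contained sketch. It is an illustration of the idea, not InterProScan's own code; it assumes the standard genetic code and marks any codon containing a non-ACGT character as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Scanning all six frames avoids assuming a strand or reading frame, at the cost of six times as many protein-level searches.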
Querying with InterProScan
• Running InterProScan (screenshot at <60 s)
• Output
– InterProScan makes results available in the following formats: raw, ebixml, xml, txt, html
• Parse InterProScan output (BioPerl)
– Bio::SeqIO::interpro
• Interpretation of output data (example)
Querying with InterProScan
Key: Interpretation
10683_1_ORF1              the ID of the input sequence.
024307F93E501F2C          the CRC64 (checksum) of the protein sequence (supposed to be unique).
404                       the length of the sequence (in AA).
HMMPfam                   the analysis method launched.
PF03453                   the member database entry for this match.
MoeA_N                    the member database description for the entry.
1                         the start of the domain match.
163                       the end of the domain match.
1.49999999999999999E-56   the E-value of the match (reported by the member database method).
T                         the status of the match (T: true, ?: unknown).
26-Feb-12                 the date of the run.
IPR005110                 the corresponding InterPro entry (if iprlookup was requested by the user).
MoeA, N-terminal and linker domain   the description of the InterPro entry.
Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)   the GO (Gene Ontology) description for the InterPro entry.
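The key above maps directly onto a small record parser. The sketch below assumes the raw output is tab-delimited with the fields in the order listed, and that the InterPro and GO columns may be missing; the field names are labels chosen here for illustration, not names defined by InterProScan.

```python
# Field labels follow the order of the key; they are illustrative names only.
FIELDS = (
    "seq_id", "crc64", "length", "method", "member_acc", "member_desc",
    "start", "end", "evalue", "status", "run_date",
    "interpro_acc", "interpro_desc", "go_terms",
)

def parse_raw_line(line):
    """Turn one tab-delimited raw output line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(FIELDS) - len(values))  # InterPro/GO columns may be absent
    record = dict(zip(FIELDS, values))
    for key in ("length", "start", "end"):
        record[key] = int(record[key])
    # Some methods may not report an E-value; keep None in that case.
    try:
        record["evalue"] = float(record["evalue"])
    except ValueError:
        record["evalue"] = None
    return record

# Usage with the example row from the key.
example = "\t".join([
    "10683_1_ORF1", "024307F93E501F2C", "404", "HMMPfam", "PF03453", "MoeA_N",
    "1", "163", "1.49999999999999999E-56", "T", "26-Feb-12", "IPR005110",
    "MoeA, N-terminal and linker domain",
    "Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)",
])
record = parse_raw_line(example)
```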
Preliminary Results
M19107
Total proteins searched: 1,769
Matched: 1,716
Unmatched: 378
Total hits: 12,393
[Chart values: 53, 325, 1,391]
Next Up
• Major Challenge: Funneling all the annotation information into a consolidated GenBank/GFF3 entry.
• Level 2!
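As an illustration of the consolidation challenge named above, the sketch below writes one GFF3 feature line that carries annotations from several tools in its attribute column. It is a sketch only: the sequence ID, source name, coordinates, gene ID and function names are hypothetical, and the attribute keys used (ID, Dbxref, Note) are standard GFF3 attribute names.

```python
from urllib.parse import quote

def gff3_attributes(attrs):
    """Build column 9, percent-encoding reserved characters such as ';', '=', '&' and ','."""
    return ";".join(
        f"{key}={quote(str(value), safe=' ():/')}" for key, value in attrs.items()
    )

def gff3_line(seqid, source, feature_type, start, end, strand, attrs, phase="0"):
    """Return one 9-column, tab-separated GFF3 feature line (score left as '.')."""
    columns = [
        seqid, source, feature_type, str(start), str(end),
        ".", strand, phase, gff3_attributes(attrs),
    ]
    return "\t".join(columns)

# Hypothetical example: one CDS annotated with an InterPro cross-reference.
line = gff3_line(
    "contig_1", "annotation_pipeline", "CDS", 100, 1314, "+",
    {
        "ID": "gene_0001",
        "Dbxref": "InterPro:IPR005110",
        "Note": "MoeA, N-terminal and linker domain",
    },
)
```

Putting each tool's result under its own attribute key keeps the sources distinguishable after merging.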
Level 2: Operons, Virulence Factors and Metabolic Pathways
VIRULENCE: Likelihood of a pathogen causing disease
H. haemolyticus
• As the species name implies, it is generally hemolytic on blood agar plates
• The beta-hemolytic phenotype is routinely used in the clinical setting to distinguish H. haemolyticus from NTHi
• Nonhemolytic H. haemolyticus strains are being isolated and misidentified as NTHi
Gene(s) encoding hemolysin: unknown (Xin Wang, Meningitis Laboratory, CDC)
Photograph from MicrobeLibrary.org
Virulence factors
• The traits, encoded by virulence genes, that equip pathogenic microbes to cause infection.
HOW?
– Attach selectively to host tissues
– Colonize parts of the host body
– Gain access to nutrients by invading or destroying host tissues
– Avoid host defenses
• Virulence factors include:
– Bacterial toxins
– Cell surface proteins that mediate bacterial attachment
– Cell surface carbohydrates and proteins that protect a bacterium
– Hydrolytic enzymes that may contribute to the pathogenicity of the bacterium
VFDB: Virulence Factor Database
• Set up in 2004
• Up-to-date information on validated VFs from 24 genera of medically important bacterial pathogens
• Detailed tabular comparison of virulence composition in terms of virulence genes and their composition
• Multiple alignment and statistical analysis of homologous VFs
• Graphical comparison of virulence genes
• VFs:
– Adhesion & invasion
– Bacterial secretion systems & effectors
– Toxins
– Iron-acquisition system
• Pathogenicity island
Operon and Pathway Analysis
• As Alejandro Caro pointed out, a missing gene in an otherwise complete pathway usually reflects a hole in the annotation process.
• This step serves to fill such holes in the annotation.
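The idea of spotting annotation holes can be sketched in a few lines: flag pathways that are almost complete in the annotated gene set. This is a toy illustration, not the method used by any particular tool; the pathway and gene names in the usage example are hypothetical, and the `max_missing` cutoff is an assumption.

```python
def pathway_holes(pathways, annotated_genes, max_missing=1):
    """Return {pathway: [missing genes]} for pathways that are nearly complete.

    pathways: mapping of pathway name -> iterable of required gene names.
    annotated_genes: gene names found by the annotation so far.
    max_missing: how many absent genes still count as a likely annotation hole.
    """
    annotated = set(annotated_genes)
    holes = {}
    for name, required in pathways.items():
        missing = set(required) - annotated
        if 0 < len(missing) <= max_missing:
            holes[name] = sorted(missing)
    return holes

# Hypothetical usage.
pathways = {
    "pathway_A": ["geneA1", "geneA2", "geneA3", "geneA4"],
    "pathway_B": ["geneB1", "geneB2"],
    "pathway_C": ["geneC1", "geneC2", "geneC3"],
}
annotated = ["geneA1", "geneA2", "geneA4", "geneB1", "geneB2"]
candidates = pathway_holes(pathways, annotated)
```

Genes reported this way are candidates for a targeted re-search of the genome rather than confirmed absences.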
DOOR (Database of prOkaryotic OpeRons)
• DOOR (Database of prOkaryotic OpeRons) is an operon database developed by the Computational Systems Biology Lab (CSBL) at UGA. The operons in the database are based on prediction.
• As of 2009, DOOR was the largest operon database available.
• Its prediction algorithm is reported to be consistently the best across measures, including sensitivity and specificity for both true positives and true negatives, with an overall accuracy of about 90%.
• DOOR currently has operons for 971 prokaryotic genomes.
• Although most operons in DOOR are not verified by experiments, its developers also try to provide some limited literature information, which is extracted from ODB.
FOUR STRAINS IN DOOR
Strategy
THE PATHWAY TOOLS
A Glance at the End of Annotation
Enable
• Browsing of annotated genes
• Analysis of pathways
Database
"Do not use a DBMS when the initial investment in hardware, software, and training is too high."
- Shamkant Navathe, Georgia Institute of Technology
The Pathway Tools
"Pathway Tools is a production-quality software environment for creating a type of model-organism database called a Pathway/Genome Database (PGDB)"
• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis
The Pathway Tools
• Pros
– BioCyc Tier 1 and Tier 2 databases are highly curated
– Enables editing (curation) and querying of PGDBs
• Cons
– BioCyc has fewer genomes than other databases
– Some tools are only available in the local version (e.g. PathoLogic)
WHY "The Pathway Tools"?
• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis
The Pathway Tools
PathoLogic
The Pathway Tools Local Version (GUI)
PathoLogic
Inputs and outputs of the computational inference modules within PathoLogic