Functional Annotation of Proteins via the CAFA Challenge
Lee Tien, Duncan Renfrow-Symon, Shilpa Nadimpalli, Mengfei Cao
COMP150PBT | Fall 2010
What’s the problem?
1. Huge bottleneck = finding a protein’s function when given a protein sequence
2. Incomplete, inaccurate, or inconsistent annotations are difficult to work with and can propagate
3. No good way to measure the accuracy of an annotation predictor
What is the CAFA Challenge?
What are Gene Ontology (GO) terms?
•GO = controlled vocabulary of “gene ontologies”
•Cover three domains:
▫Cellular component
▫Molecular function
▫Biological process
•Hierarchy:
▫Broad/general (e.g. “catalytic activity”)
▫Specific (e.g. “leukotriene-C4-synthase activity”)
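The broad-to-specific hierarchy can be illustrated with a toy lookup. This is only a sketch: the `PARENTS` table and the `ancestors` helper are hypothetical names, and the direct link between the two example terms is a simplification (the real ontology has intermediate terms between them).

```python
# Toy slice of the GO hierarchy: each term maps to its parent terms.
# Assumption: the direct link from the specific term to "catalytic activity"
# is a simplification; intermediate terms in the real ontology are omitted.
PARENTS = {
    "leukotriene-C4-synthase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable by following parent links upward."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


print(ancestors("leukotriene-C4-synthase activity"))
```

A protein annotated with a specific term is implicitly annotated with all of that term's broader ancestors, which is why a predictor can be partially right by naming a more general term.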
Outline of Our Approach
[Flowchart: CAFA targets (FASTA sequences) → BLAST, Pfam, SMURF?, Betawrap Pro?, other secondary structure predictor? → GO ids for each CAFA target]
Pfam: Protein Family Database
•Collection of protein families represented by:
▫Multiple sequence alignments
▫Hidden Markov Models
•Two sections of Pfam:
▫A: high-quality, manually-curated
▫B: large, automatically-generated
Sample Multiple Sequence Alignment
Sample Hidden Markov Model
BLAST: Basic Local Alignment Search Tool
•Goal: find homologous (i.e. derived from a common ancestor) sequences in a database
•Various BLAST programs:
▫blastp = query: protein, database: protein
▫blastn = query: nucleotide, database: nucleotide
▫blastx = query: translated nucleotide, database: protein
▫tblastn = query: protein, database: translated nucleotide
▫tblastx = query: translated nucleotide, database: translated nucleotide
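The program list above amounts to a lookup keyed on query type and database type. A minimal sketch of that lookup follows; `BLAST_PROGRAMS` and `pick_blast_program` are illustrative names, not part of any BLAST distribution.

```python
# (query type, database type) -> BLAST program, exactly as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def pick_blast_program(query, database):
    """Return the BLAST program name for a query/database type pair."""
    try:
        return BLAST_PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for query={query!r}, database={database!r}"
        )


# CAFA targets are protein sequences searched against protein databases.
print(pick_blast_program("protein", "protein"))
```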
SMURF: Structural Motifs Using Random Fields
•Determines whether a protein sequence contains one of the following supersecondary structures:
▫6-bladed propeller
▫7-bladed propeller
▫8-bladed propeller
▫Double blades (i.e. 6-6, 6-7, 6-8…)
•Developed at Tufts!
•Some propeller functions:
▫Often WD40 repeat – protein-protein interaction
▫Signaling, transcription, cell cycle
Smurf!
[Figure: 7-bladed propeller]
Final Database Structure
INPUT
•cafa_targets: cafa_id, uniprot_id, gi_access_id
RESULTS
•blast_results: cafa_id, pdb_id, refseq_id, e_value_score
•pfam_results: cafa_id, pfam_id
•smurf_results: cafa_id, template_id, p_value_score
MAPPING
•pdb_id → go_id
•refseq_id → uniprot_id
•uniprot_id → go_id
•pfam_id → go_id
•template_id → go_id
OUTPUT
•go_results: cafa_id, go_id, source, confidence
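One way to picture the results → mapping → output flow is a small SQLite sketch covering just the BLAST/PDB path. The column names come from the slide; everything else is an assumption made for illustration: the mapping table name `pdb_go`, the column types, the placeholder IDs, and leaving `confidence` empty (the slides do not say how it was computed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blast_results (cafa_id TEXT, pdb_id TEXT, refseq_id TEXT, e_value_score REAL);
CREATE TABLE pdb_go (pdb_id TEXT, go_id TEXT);  -- hypothetical name for the pdb_id -> go_id mapping
CREATE TABLE go_results (cafa_id TEXT, go_id TEXT, source TEXT, confidence REAL);
""")

# Placeholder rows only; these are not real CAFA, PDB, or GO identifiers.
conn.executemany(
    "INSERT INTO blast_results VALUES (?, ?, ?, ?)",
    [("T_demo", "PDB_A", None, 1e-30), ("T_demo", "PDB_B", None, 1e-5)],
)
conn.executemany(
    "INSERT INTO pdb_go VALUES (?, ?)",
    [("PDB_A", "GO:demo1"), ("PDB_B", "GO:demo1"), ("PDB_B", "GO:demo2")],
)

# Carry each BLAST hit through the mapping table into the output table.
# DISTINCT collapses GO ids reached through more than one PDB hit.
conn.execute("""
INSERT INTO go_results (cafa_id, go_id, source, confidence)
SELECT DISTINCT b.cafa_id, m.go_id, 'blast', NULL
FROM blast_results AS b
JOIN pdb_go AS m ON m.pdb_id = b.pdb_id
""")

rows = conn.execute(
    "SELECT cafa_id, go_id, source FROM go_results ORDER BY go_id"
).fetchall()
print(rows)
```

The Pfam and SMURF paths would follow the same pattern with their own mapping tables, each writing a different `source` value into `go_results`.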
Final Results Statistics
[Figure: Distribution of sequence hits by method. Counts shown: 789, 69, 12, 19, 4, 3,445, 1,356]
Of 8,904 unknown sequences…
•4,265 had at least one hit in PDB BLAST
•4,824 had at least one hit in Pfam
•104 had at least one hit in SMURF
In total, 5,694 unique sequences had at least one hit, a 63.9% success rate
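The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the seven loose counts on the slide (789, 69, 12, 19, 4, 3,445, 1,356) are the regions of a three-set overlap diagram for BLAST, Pfam, and SMURF; the slide does not label them, so the region names here are hypothetical and the assignment is inferred by brute force from the per-method totals.

```python
from itertools import permutations

TOTAL_TARGETS = 8904
METHOD_TOTALS = (4265, 4824, 104)  # PDB BLAST, Pfam, SMURF
REGION_COUNTS = [789, 69, 12, 19, 4, 3445, 1356]  # unlabeled counts from the slide

# Hypothetical region labels for a three-set overlap diagram.
REGIONS = [
    "blast_only", "pfam_only", "smurf_only",
    "blast_pfam", "blast_smurf", "pfam_smurf", "all_three",
]


def consistent_assignments():
    """Try every labeling of the counts; keep those matching the method totals."""
    found = []
    for perm in permutations(REGION_COUNTS):
        r = dict(zip(REGIONS, perm))
        blast = r["blast_only"] + r["blast_pfam"] + r["blast_smurf"] + r["all_three"]
        pfam = r["pfam_only"] + r["blast_pfam"] + r["pfam_smurf"] + r["all_three"]
        smurf = r["smurf_only"] + r["blast_smurf"] + r["pfam_smurf"] + r["all_three"]
        if (blast, pfam, smurf) == METHOD_TOTALS:
            found.append(r)
    return found


unique_hits = sum(REGION_COUNTS)           # sequences with at least one hit
success_rate = unique_hits / TOTAL_TARGETS
solutions = consistent_assignments()

print(unique_hits, f"{success_rate:.1%}")
for solution in solutions:
    print(solution)
```

Under that assumption, the counts sum to the reported 5,694 and only one labeling reproduces all three per-method totals: 3,445 sequences hit by both BLAST and Pfam only, 19 hit by all three methods, and 69 found by SMURF alone.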
Example Result
T38114
MDLDMNGGNKRVFQRLGGGSNRPTTDSNQKVCFHWRAGRCNRYPCPYLHRELPGPGSGPVAASSNKRVADESGFAGPSHR
RGPGFSGTANNWGRFGGNRTVTKTEKLCKFWVDGNCPYGDKCRYLHCWSKGDSFSLLTQLDGHQKVVTGIALPSGSDKLY
TASKDETVRIWDCASGQCTGVLNLGGEVGCIISEGPWLLVGMPNLVKAWNIQNNADLSLNGPVGQVYSLVVGTDLLFAGT
QDGSILVWRYNSTTSCFDPAASLLGHTLAVVSLYVGANRLYSGAMDNSIKVWSLDNLQCIQTLTEHTSVVMSLICWDQFL
LSCSLDNTVKIWAATEGGNLEVTYTHKEEYGVLALCGVHDAEAKPVLLCSCNDNSLHLYDLPSFTERGKILAKQEIRSIQ
IGPGGIFFTGDGSGQVKVWKWSTESTPILS
•BLAST: matches with PDB structures 2OVP, 3MKS, 2CNX, 1P22, 1NEX, 3N0E
▫Transcription, mitosis, methylation, protein binding
•Pfam: match to family PF00642
▫Zinc ion binding, nucleic acid binding
•SMURF: match to 7-bladed β-propeller template
▫WD domain (protein binding)
Possible Future Directions
•Improving functional annotation for β-propellers identified by SMURF
▫Analyze a training set of propeller proteins with known function to build a probabilistic model of protein function based on propeller type
•Addition of other structural prediction tools for motifs with known function
▫G protein-coupled receptors, membrane-bound proteins
•Expansion of BLAST search to include the full nr database
Questions?