Understanding Biological Function in Times of High Throughput and Low Output
Column B
protein interactions: 0.631846654696
gene expression: 0.631846654696
literature: 0.536536474892
profile-profile alignments: 0.369482390229
ortholog: 0.369482390229
sequence properties: 0.344164582244
phylogeny: 0.16184302545
sequence alignments: -0.0613005258638
other functional information: -0.0613005258638
machine learning based method: -0.284444077178
sequence-profile alignments: -0.591928776926

Column B
profile-profile alignments: 0.565957694461
literature: 0.50986822781
ortholog: 0.342814143146
sequence properties: 0.317496335162
protein interactions: 0.317496335162
gene expression: 0.317496335162
phylogeny: 0.135174778368
sequence alignments: -0.087968772946
other functional information: -0.087968772946
machine learning based method: -0.087968772946
sequence-profile alignments: -0.2131319159
Understanding Biological Function in Times of High Throughput and Low Output
Iddo Friedberg
Iowa State University
http://iddo-friedberg.net
@iddux
Big Data in my lab
Gene block evolution
Images and Genomes
Host/Microbiome
Database error and bias
Critical Assessment of Protein Function Annotations
Understanding methods
Understanding the data
Understanding Methods: The Critical Assessment of Protein Function Annotations
Pedja
Wyatt
Sean
Tal
Alex
Large Data Biology has a Bad Rap?
"So we now have a culture which is based on everything must be high-throughput. I like to call it low-input, high-throughput, no-output biology." Sydney Brenner
Motivation: The Knowledge Gap
The Gap Between Data and Information
Information
Data
Temperton & Giovannoni Curr. Opin. Microbiology (2012)
Errors Accumulate in Databases
Schnoes A et al (2009) PLoS Computational Biology, 5 (12)
Assigning Function to Proteins
Low-ish throughput
High throughput
Machine learning
Most Proteins are Annotated Electronically
Compiled from the GOA project, EBI, 6/2011
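To make the "annotated electronically" claim concrete, here is a minimal sketch of how the electronic share of annotations could be tallied from a GO annotation (GAF) file. It relies on the GAF convention that the evidence code is the seventh tab-separated column and that `IEA` marks electronic annotations. The function name `evidence_fractions` and the choice of which codes to treat as experimental are assumptions made for illustration, not the procedure used to compile the slide.

```python
from collections import Counter

# Evidence codes treated as "experimental" in this sketch (an assumption;
# adjust the set to match the definition you want to use).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def evidence_fractions(gaf_lines):
    """Return (electronic_fraction, experimental_fraction) for GAF lines.

    gaf_lines: iterable of strings in GAF format. Comment lines start
    with "!" and the evidence code is the 7th tab-separated column.
    """
    counts = Counter()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed rows
        counts[fields[6]] += 1

    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    electronic = counts["IEA"] / total
    experimental = sum(counts[code] for code in EXPERIMENTAL) / total
    return electronic, experimental
```

In practice the lines would come from an open GAF file (for example, one downloaded from the GOA project) passed straight to `evidence_fractions`.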
Problems
Most genes are annotated electronically
Databases have a high and growing error rate
Homology transfer is less effective
Solutions?
Assess accuracy of annotation software
Write better software
Challenges in Picking Targets
Can't use databases: circularity problem
Experimental groups have a small sharing timeframe
Function description too vague for precise GO annotation
There are unknown unknowns
Choose an annotated protein
Prediction method uses said annotation to predict function
Circular logic... is circular
Choosing Assessment Benchmarks
Timeline (Time axis): Challenge opens -> Submission deadline -> Assessment time
Function unknown -> Function still unknown -> Function still unknown
Function unknown -> Function still unknown -> Function known: Benchmark?
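The timeline above suggests a simple selection rule: a target becomes a benchmark only if it had no known function at the submission deadline and gained one by assessment time. The sketch below shows that rule under stated assumptions: `select_benchmarks` and its argument names are hypothetical, and "known function" is taken to mean a non-empty set of experimentally supported terms in a database snapshot.

```python
def select_benchmarks(targets, known_at_deadline, known_at_assessment):
    """Pick targets whose function became known between two snapshots.

    targets: iterable of protein identifiers released as prediction targets.
    known_at_deadline / known_at_assessment: dicts mapping a protein
        identifier to the set of experimentally supported terms in the
        snapshot taken at that time (missing key means nothing known).
    Returns a dict: benchmark protein -> terms known at assessment time.
    """
    benchmarks = {}
    for protein in targets:
        before = known_at_deadline.get(protein, set())
        after = known_at_assessment.get(protein, set())
        # Unknown at the deadline, known at assessment: usable benchmark.
        if not before and after:
            benchmarks[protein] = set(after)
    return benchmarks
```

Because predictors never see the later snapshot, this avoids the circularity problem of scoring a method against annotations it may have used as input.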
Figure: Molecular Function precision/recall, with BLAST and Naive baselines
Figure: Biological Process precision/recall, with BLAST and Naive baselines
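As a rough illustration of what a precision/recall point means for function prediction, the sketch below scores predicted terms against benchmark terms at one score threshold. It assumes a protein-centric convention (precision averaged over proteins with at least one prediction above the threshold, recall averaged over all benchmark proteins) and does no propagation of terms to their ontology ancestors; the slides do not specify these details, and the name `precision_recall` is hypothetical.

```python
def precision_recall(predictions, truth, threshold):
    """Protein-centric precision and recall at one score threshold.

    predictions: dict protein -> dict term -> confidence score in [0, 1].
    truth: dict protein -> set of true terms for that benchmark protein.
    Returns (precision, recall).
    """
    precisions = []
    recalls = []
    for protein, true_terms in truth.items():
        if not true_terms:
            continue  # a protein without true terms cannot be scored
        scored = predictions.get(protein, {})
        predicted = {t for t, s in scored.items() if s >= threshold}
        hits = len(predicted & true_terms)
        if predicted:
            # Precision is only defined where something was predicted.
            precisions.append(hits / len(predicted))
        recalls.append(hits / len(true_terms))

    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall
```

Sweeping `threshold` from 0 to 1 and plotting the resulting pairs traces out a curve of the kind shown for each method.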
Case Study: hPNPase
Gadi Schuster (Technion)
Successful Methods?
Figure: log(obs/exp), shown for Biological Process and Molecular Function
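The slides do not define "obs" and "exp". One plausible reading, offered here only as an assumption, is that "obs" is how often a method keyword (such as "literature" or "phylogeny") appears among the top-performing methods and "exp" is how often it appears among all participating methods. Under that reading, a minimal sketch of the score looks like this; the function name, the argument names, and the use of the natural logarithm are all hypothetical.

```python
import math


def log_obs_exp(top_count, top_total, all_count, all_total):
    """Log ratio of observed to expected keyword frequency.

    top_count / top_total: keyword frequency among top methods (observed).
    all_count / all_total: keyword frequency among all methods (expected).
    The natural log is an assumption; the slides do not state the base.
    """
    if top_total <= 0 or all_total <= 0 or all_count <= 0:
        raise ValueError("totals and the expected count must be positive")
    observed = top_count / top_total
    expected = all_count / all_total
    if observed == 0:
        return float("-inf")  # keyword never seen among top methods
    return math.log(observed / expected)
```

Positive values mean the keyword is over-represented among the top methods, negative values mean it is under-represented, and zero means no difference.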
CAFA2 vs. CAFA1
CAFA2 was held in 2014-2015
More targets (100,000 vs. 50,000)
More groups (56 vs 29)
100,000 targets
147 participants
Methods have improved
CAFA Conclusions & What's Next
Homology transfer still rules.
Combined methods work best
Molecular Function is easier to predict than Biological Process
Generally, the field can use improvement
Comparison of metrics is very much needed
Why do methods perform differently under different metrics?
Is there a best metric? What is best?
Databases are biased
Leaf terms, Molecular Function:
protein binding
protein homodimerization activity
zinc ion binding
transcription activator activity
chromatin binding
transcription repressor activity
transcription factor activity
two-component sensor activity
specific transcriptional repressor activity
DNA binding
calcium ion binding
identical protein binding
manganese ion binding
ATP binding
beta-galactoside alpha-2,3-sialyltransferase activity
magnesium ion binding
enzyme binding
electron carrier activity
structural constituent of ribosome
metal ion binding
David Ream (MU)
Alexander Thorman (MU)
Alexandra Schnoes (UCSF)
Protein Binding
Activity
Annotations per article
Schnoes et al. PLoS Comp Biol (2013)
Information content is inversely related to the number of proteins annotated per article
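One common way to quantify how informative an annotation is uses the information content of its term: the negative log of the term's relative frequency, so that a ubiquitous term like "protein binding" scores low and a rare, specific term scores high. The sketch below assumes that definition and ignores the ontology hierarchy; it is not necessarily the measure used in the cited study, and `information_content` is a hypothetical name.

```python
import math
from collections import Counter


def information_content(annotations):
    """Map each term to -log2 of its relative frequency.

    annotations: iterable of (protein, term) pairs.
    Frequent terms get low scores; rare terms get high scores.
    """
    counts = Counter(term for _, term in annotations)
    total = sum(counts.values())
    return {term: -math.log2(n / total) for term, n in counts.items()}
```

Averaging these scores over the annotations contributed by each article, and comparing that average with the number of proteins the article annotates, is one way to examine the relationship stated above.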