what is bioinformatics?. what are bioinformaticians up to, actually? manage molecular biological...
TRANSCRIPT
What is bioinformatics?
What are bioinformaticians up to, actually?
• Manage molecular biological data– Store in databases, organise, formalise, describe...
• Compare molecular biological data• Find patterns in molecular biological data
– phylogenies
– correlations (sequence / structure / expression / function / disease)
Goals:• characterise biological patterns & processes• predict biological properties
– low level data high level properties ⇒(eg., sequence function)⇒
Bioinformatics: neighbour disciplines
• Computational biology– Broader concept: includes computational ecology,
physiology, neurology etc...
• -omics:– Genomics– Transcriptomics– Proteomics
• Systems biology– Putting it all together...– Building models, identify control & regulation
Bioinformatics: prerequisites
• Bio- side:– Molecular biology– Cell biology – Genetics– Evolutionary theory
• -informatics side:– Computer science– Statistics– Theoretical physics
Molecular biology data...
>alpha-DATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCAAGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAGCACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGACGGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCCGGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA>alpha-AATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCCATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTGTCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTCTGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCAGTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACCATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAGTACCGTTAA
• DNA sequences
Molecular biology data...
• Amino acid sequences
• Protein structure:– X-ray crystallography
– NMR
Cell biology & proteomics data...
• Subcellular localization
Cell biology & proteomics data...
protein-protein interactions
DNA microarray technologyTranscriptomics: DNA microarray technology
Proteomics & transcriptomics data
Proteins encoded by periodically expressed genes: a functionally diverse protein category
Cell cycle regulation
Multiprotein complexes in yeast (Saccharomyces cerevisiae).
Coloured components are periodically expressed (dynamic). White components are always expressed (static).
Complexes contain both kinds of components (just-in-time assembly).
Dynamic complex formation during the yeast cell cycle.Science. 2005 Feb 4;307(5710):724-7.
de Lichtenberg U, Jensen LJ, Brunak S, Bork P.
Cell cycle regulation 2
Transcriptional regulation is poorly conserved!
Lars Juhl Jensen, Thomas Skøt Jensen, Ulrik de Lichtenberg, Søren Brunak and Peer Bork, Nature, Oct 5, 2006
Dynamic components overlap:
Phenotype data: human diseases
Prediction methods
• Homology / Alignment
• Simple pattern (“word”) recognition • Statistical methods
– Weight matrices: calculate amino acid probabilities– Other examples: Regression, variance analysis, clustering
• Machine learning– Like statistical methods, but parameters are estimated by
iterative training rather than direct calculation– Examples: Neural Networks (NN), Hidden Markov Models
(HMM), Support Vector Machines (SVM)
• Combinations
Simple pattern (“word”) recognition
Example: PROSITE entry PS00014, ER_TARGET:
Endoplasmic reticulum targeting sequence (”KDEL-signal”).
Pattern: [KRHQSA]-[DENQ]-E-L>
NB: only yes/no answers!
Statistical methods
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
• Estimate probabilities for nucleotides / amino acids• Information content in sequences; logos; Position-
Weight Matrices.• Quantitative answers.
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---IIE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD----TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---VASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Sequence profiles
Conserved
Non-conserved
Matching anything but G => large negative score
Anything can match
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Unlike Position-Weight Matrices, sequence profiles are able to handle gaps
HMM (Hidden Markov Models)
Which parts of the sequence are
– in the cytoplasm?
– embedded in the membrane (TM-helices)?
– outside?
Rule-of-thumb says amino acids in the TM helices are hydrophobic – but there are exceptions
TMHMM predicts topology from sequence using a statistical model
TMHMM: predicting the topology of α-helix transmembrane proteins
IKEEHVI IQAE
HEC
IKEEHVIIQAEFYLNPDQSGEF…..Window
Input Layer
Hidden Layer
Output Layer
Weights
Neural Networks: ”Architecture”
Neural networks
• Neural networks can learn higher order correlations!– What does this
mean?
A A => 0A C => 1C A => 1C C => 0
No linear function can learn this pattern
Neural Networks: advantages
How to estimate parameters for prediction?
Model selection
Linear Regression Quadratic Regression Join-the-dots
The test set method
The test set method
The test set method
The test set method
The test set method
Problem: sequences are related
• If the sequences in the test set are closely related to those in the training set, we can not measure true generalization performanceALAKAAAAM
ALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Protein sorting in eukaryotes
• Proteins belong in different organelles of the cell – and some even have their function outside the cell
• Günter Blobel was in 1999 awarded The Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"
Secretory proteins have a signal peptide
Initially, they are transported across the ER membrane
Protein sorting: secretory pathway / ER
Signal peptides
A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region.
Signal peptides differ between proteins, and can be hard to recognize.
SignalP
http://www.cbs.dtu.dk/services/SignalP/
Method for prediction of signal peptides from the amino acid sequence
Based on Neural Networks
The first SignalP paper (Nielsen H, et al., Protein Eng. 10[1]: 1-6, January 1997) was cited 3144 times in a 10 year period