prediction of protein disorder zsuzsanna dosztányi mta-elte momentum bioinformatics group...
TRANSCRIPT
Prediction of protein Prediction of protein disorderdisorder
Zsuzsanna DosztányiMTA-ELTE Momentum Bioinformatics GroupDepartment of Biochemistry Eotvos Lorand University, Budapest, [email protected]
Protein Structure/Function Paradigm
Dominant view: 3D structure is a prerequisite for protein function
But….
Heat stability Protease sensitivity Failed attempts to crystallize Lack of NMR signals Increased molecular volume “Freaky” sequences …
IDPs
Intrinsically disordered proteins/regions (IDPs/IDRs)
Do not adopt a well-defined structure in isolation under native-like conditions
Highly flexible ensembles Functional proteins
p53 tumor suppressortransactivation DNA-binding tetramerizationregulation
Wells et al. PNAS 2008; 105: 5762
TAD DBD TD RD
Disordered region Disordered region
Bioinformatics of protein disorder
Part 1 Prediction of protein disorder Databases Prediction of protein disorder
Part 2 Biology of disordered proteins Prediction of functional regions within IDPs
Datasets Ordered proteins in the PDB
over 100000 structures few 1000s folds
Some structures in the PDB classify as disordered! only adopt a well-defined structure in complex
in crystals, with cofactors, proteins, …
Disorder in the PDB Missing electron density regions from the PDB
NMR structures with large structural variations
Less than 10% of all positions
Usually short (<10 residues), often at the termini
Disprot
www.disprot.org
Current release: 6.02Release date: 05/24/2013Number of proteins: 694Number of disordered regions: 1539
Experimentally verified disordered
proteins collected from literature
(X-ray, NMR, CD, proteolysis, SAXS,
heat stability, gel filtration, …)
Additional databases Combining experiments and predictions
Genome level annotations
MobiDB: http://mobidb.bio.unipd.it D2P2: http://d2p2.pro IDEAL: http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL
Sequence properties of disordered proteins Amino acid compositional bias High proportion of polar and charged amino acids
(Gln, Ser, Pro, Glu, Lys) Low proportion of bulky, hydrophobhic amino acids
(Val, Leu, Ile, Met, Phe, Trp, Tyr) Low sequence complexity Signature sequences identifying disordered proteins
Protein disorder is encoded in the amino acid sequence
Prediction methods for protein disorder
Based on amino acid propensity scales or on simplified
biophysical models GlobPlot, FoldIndex, FoldUnfold, IUPred, UCON
Machine learning approaches PONDR VL-XT, VL3, VSL2; Disopred; POODLE S and L ; DisEMBL;
DisPSSMP; PrDOS, DisPro, OnD-CRF, POODLE-W, RONN
Over 50 methods
1.Amino acid propensity scaleGlobPlotCompare the tendency of amino acids:
to be in coil (irregular) structure. to be in regular secondary structure elements
Linding (2003) NAR 31, 3701
GlobPlot
From position specific predictions
Where are the ordered domains?
Longer disordered segments?
Noise vs. real data
GlobPlot: http://globplot.embl.de/
downhill regions correspond to putative domains (GlobDom)
up-hill regions correspond to predicted protein disorder
2. Physical principles
IUPred
If a residue cannot form enough favorable interactions within its sequential environment, it will not adopt a well defined structure it will be disordered
Dosztanyi (2005) JMB 347, 827
Based on an energy estimation method
Parameters calculated from statistics of globular proteins
No training on disordered proteins
IUPred The algorithm:
…PSVEPPLSQETFSDL WKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRVA PAPAAPTPAA...
Based only on the composition of environment of D’swe try to predict if it is in a disordered region or not:
Amino acid composition of environ-ment:
A – 10%C – 0%D – 12 %E – 10 %F – 2 % etc…
Estimate the interaction energy between the residue and its environment
Decide the probability of the residue being disordered based on this
DISOPRED2: …..AMDDLMLSPDDIEQWFTED…..
Assign label: D or O D O
F(inp)SVM with linear kernel
Ward (2004) JMB 337, 635
PONDR VSL2
Differences in short and long disorder amino acid composition methods trained on one type of dataset tested on
other dataset resulted in lower efficiencies
PONDR VSL2: separate predictors for short and long disorder combined
length independent predictions
Peng (2006) BMC Bioinformatics 7, 208
4. Metaservers:
Sequence
Disorder prediction methods
PONDR VLXT
PONDR VL3
PONDR VSL2
IUPred
FoldIndex
TopIDP
PredictionANN
Meta-predictor
Xue et al. Biochem Biophys Acta. 2010; 180: 996
Disordered regions and secondary structure Coil is an ordered, irregular structural element Disordered proteins usually do not contain stable secondary
structural elements (e.g. by CD)
They can contain transient secondary structure elements (by NMR)
Pure random coil never occurs Use secondary structure predictions methods for disordered
proteins with extreme caution Long segments without predicted secondary structure may
indicate proteins disorder (NORsnet)
Accuracy
•True positive: Disordered residues are predicted as
disordered
•False positive: Ordered residues predicted as disordered
•True negative: Ordered residues predicted as ordered
•False negative: Disordered residues predicted as ordered
75-90%
Prediction of protein disorder Disordered residues can be predicted from
the amino acid sequence ~ 80% at the residue level
Methods can be specific to certain type of disorder accordingly, accuracies vary depending on
datasets Predictions are based on binary
classification of disorder
Modularity in proteins
Many proteins contains multiple domains
Composed of ordered and disordered segments
Average length of a PDB chain is < 300
Average length of a human proteins ~ 500
Average length of cancer-related proteins > 900
Structural properties of full length proteins …