protein structure prediction: the holy grail of bioinformatics

Upload: afi

Post on 25-Feb-2016

TRANSCRIPT

  • Protein structure prediction: The holy grail of bioinformatics

  • Proteins: Four levels of structural organization:

    Primary structure

    Secondary structure

    Tertiary structure

    Quaternary structure

  • Primary structure = the linear amino acid sequence

  • Secondary structure = spatial arrangement of amino-acid residues that are adjacent in the primary structure

  • α helix = A helical structure whose chain coils tightly as a right-handed screw, with all the side chains sticking outward in a helical array. The tight structure of the α helix is stabilized by same-strand hydrogen bonds between -NH groups and -CO groups spaced at four amino-acid residue intervals.

  • The β-pleated sheet is made of loosely coiled β strands that are stabilized by hydrogen bonds between -NH and -CO groups from adjacent strands.

  • An antiparallel sheet. Adjacent strands run in opposite directions. Hydrogen bonds between NH and CO groups connect each amino acid to a single amino acid on an adjacent strand, stabilizing the structure.

  • A parallel sheet. Adjacent strands run in the same direction. Hydrogen bonds connect each amino acid on one strand with two different amino acids on the adjacent strand.

  • Silk fibroin

  • α helix

    β sheet (parallel and antiparallel)

    tight turns

    flexible loops

    irregular elements (random coil)

  • Tertiary structure = three-dimensional structure of protein

  • The tertiary structure is formed by the folding of secondary structures by covalent and non-covalent forces, such as hydrogen bonds, hydrophobic interactions, salt bridges between positively and negatively charged residues, as well as disulfide bonds between pairs of cysteines.

  • Quaternary structure = spatial arrangement of subunits and their contacts.

  • Holoproteins & Apoproteins (figure labels: apoprotein, prosthetic group, holoprotein)

  • Apohemoglobin = 2α + 2β

  • Prosthetic group: Heme

  • Hemoglobin = Apohemoglobin + 4 Heme

  • Christian B. Anfinsen (1916-1995)

    Sela M, White FH, & Anfinsen CB. 1959. The reductive cleavage of disulfide bonds and its application to problems of protein structure. Biochim. Biophys. Acta. 31:417-426.

  • Not all proteins fold independently. Chaperones.

  • Reducing agents:

    Ammonium thioglycolate (alkaline), pH 9.0-10

    Glyceryl monothioglycolate (acid), pH 6.5-8.2

  • Oxidant

  • What do we need to know in order to state that the tertiary structure of a protein has been solved?

    Ideally: We need to determine the position of all atoms and their connectivity.

    Less ideally: We need to determine the position of all Cα atoms (the backbone structure).

  • Protein structure: Limitations and caveats

    Not all proteins or parts of proteins assume a well-defined 3D structure in solution.

    Protein structure is not static; there are various degrees of thermal motion for different parts of the structure.

    There may be a number of slightly different conformations in solution.

    Some proteins undergo conformational changes when interacting with other molecules.

  • Experimental Protein Structure Determination

    X-ray crystallography: most accurate; in vitro; needs crystals; ~$100-200K per structure

    NMR: fairly accurate; in vivo; no need for crystals; limited to very small proteins

    Cryo-electron microscopy: imaging technology; low resolution

  • Why predict protein structure?

    Structural knowledge = some understanding of function and mechanism of action

    Predicted structures can be used in structure-based drug design

    It can help us understand the effects of mutations on structure and function

    It is a very interesting scientific problem (still unsolved in its most general form after more than 50 years of effort)

  • Secondary structure prediction

  • Historically, the first structure prediction methods predicted secondary structure

    Can be used to improve alignment accuracy

    Can be used to detect domain boundaries within proteins with remote sequence homology

    Often the first step towards 3D structure prediction

    Informative for mutagenesis studies

  • Protein Secondary Structures (Simplifications)

    α-HELIX

    β-STRAND

    COIL (everything else)

  • Assumptions

    The entire information for forming secondary structure is contained in the primary sequence

    Side groups of residues will determine structure

    Examining windows of 13-17 residues is sufficient to predict secondary structure

    α-helices are 5-40 residues long

    β-strands are 5-10 residues long

  • Predicting Secondary Structure From Primary Structure

    Accuracy 64-75%

    Higher accuracy for α-helices than for β-sheets

    Accuracy is dependent on protein family

    Predictions of engineered (artificial) proteins are less accurate

  • A surprising result!

    Chameleon sequences

  • The Chameleon sequence

    TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
    TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK

    Figure labels: sequence 1, sequence 2, α-helix, β-strand

    Replace both sequences with an engineered peptide (chameleon)

    Source: Minor and Kim. 1996. Nature 380:730-734

  • Measures of prediction accuracy

    Qindex and Q3

    Correlation coefficient

  • Qindex

    Qindex (Qhelix, Qstrand, Qcoil, Q3): percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations.

    Drawbacks: even a random assignment of structure can achieve a high score (Holley & Karplus 1991)

  • Correlation coefficient

    Cα = 1 (= 100%)

    pα: true positives

    oα: false positives (overpredicted)

    nα: true negatives

    uα: false negatives (underpredicted)
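
To make the two accuracy measures concrete, here is a minimal Python sketch. It assumes secondary structure is written as equal-length strings over H (helix), E (strand) and C (coil), and it assumes the correlation coefficient meant on the slide is the Matthews correlation computed from the four counts p, o, n and u. The function names and the example strings are illustrative, not taken from the slides.

```python
from math import sqrt

def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose state (H, E or C) is predicted correctly."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def correlation(observed: str, predicted: str, state: str) -> float:
    """Matthews correlation coefficient for a single state (assumed formula)."""
    p = n = o = u = 0
    for obs, pred in zip(observed, predicted):
        if pred == state and obs == state:
            p += 1  # true positive
        elif pred == state:
            o += 1  # false positive (overpredicted)
        elif obs == state:
            u += 1  # false negative (underpredicted)
        else:
            n += 1  # true negative
    denom = sqrt((n + u) * (n + o) * (p + u) * (p + o))
    # The coefficient is undefined when a row or column is empty; return 0.0 then.
    return (p * n - u * o) / denom if denom else 0.0

# Made-up example: 13 of 16 residues agree.
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEEEEC"
print(q3(observed, predicted))                           # 81.25
print(round(correlation(observed, predicted, "H"), 3))   # 0.745
```

Unlike Q3, the correlation coefficient penalizes over- and underprediction of a state separately, which is why it is less easily inflated by a trivial assignment.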

  • Methods of secondary structure prediction

  • First generation methods: single residue statistics

    Chou & Fasman (1974 & 1978): Some residues have particular secondary-structure preferences. Based on empirical frequencies of residues in α-helices, β-sheets, and coils.

    Examples: Glu: α-helix; Val: β-strand

  • Chou-Fasman method

    Authors         Year   % accuracy   Method
    Chou-Fasman     1974   50%          propensities of aa's in 2nd structures
    Garnier         1978   62%          interactions between aa's
    Levin           1993   69%          multiple seq. alignments (MSA)
    Rost & Sander   1994   72%          neural networks + MSA

    Name            P(H)   P(E)   P(turn)   f(i)    f(i+1)   f(i+2)   f(i+3)
    Alanine         142    83     66        0.06    0.076    0.035    0.058
    Arginine        98     93     95        0.07    0.106    0.099    0.085
    Aspartic Acid   101    54     146       0.147   0.11     0.179    0.081
    Asparagine      67     89     156       0.161   0.083    0.191    0.091
    Cysteine        70     119    119       0.149   0.05     0.117    0.128
    Glutamic Acid   151    37     74        0.056   0.06     0.077    0.064
    Glutamine       111    110    98        0.074   0.098    0.037    0.098
    Glycine         57     75     156       0.102   0.085    0.19     0.152
    Histidine       100    87     95        0.14    0.047    0.093    0.054
    Isoleucine      108    160    47        0.043   0.034    0.013    0.056
    Leucine         121    130    59        0.061   0.025    0.036    0.07
    Lysine          114    74     101       0.055   0.115    0.072    0.095
    Methionine      145    105    60        0.068   0.082    0.014    0.055
    Phenylalanine   113    138    60        0.059   0.041    0.065    0.065
    Proline         57     55     152       0.102   0.301    0.034    0.068
    Serine          77     75     143       0.12    0.139    0.125    0.106
    Threonine       83     119    96        0.086   0.108    0.065    0.079
    Tryptophan      108    137    96        0.077   0.013    0.064    0.167
    Tyrosine        69     147    114       0.082   0.065    0.114    0.125
    Valine          106    170    50        0.062   0.048    0.028    0.053

  • Amino Acid   Pα     Pβ     Pt
    Glu          1.51   0.37   0.74
    Met          1.45   1.05   0.60
    Ala          1.42   0.83   0.66
    Val          1.06   1.70   0.50
    Ile          1.08   1.60   0.50
    Tyr          0.69   1.47   1.14
    Pro          0.57   0.55   1.52
    Gly          0.57   0.75   1.56

  • Chou-Fasman Method

    Accuracy: Q3 = 50-60%
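
The idea of single-residue statistics can be illustrated with a short Python sketch. It uses the P(H) and P(E) values from the propensity table in this deck and averages them over a short window. This is a deliberately simplified illustration: the window size of 6, the threshold of 100 and the function names are assumptions, and the nucleation and extension rules of the full Chou-Fasman procedure are not implemented.

```python
# (P(H), P(E)) per residue, copied from the propensity table (values scaled by 100).
PROPENSITY = {
    "A": (142, 83), "R": (98, 93), "D": (101, 54), "N": (67, 89),
    "C": (70, 119), "E": (151, 37), "Q": (111, 110), "G": (57, 75),
    "H": (100, 87), "I": (108, 160), "L": (121, 130), "K": (114, 74),
    "M": (145, 105), "F": (113, 138), "P": (57, 55), "S": (77, 75),
    "T": (83, 119), "W": (108, 137), "Y": (69, 147), "V": (106, 170),
}

def window_propensity(seq: str, start: int, size: int = 6):
    """Average helix and strand propensity over seq[start:start + size]."""
    window = seq[start:start + size]
    helix = sum(PROPENSITY[aa][0] for aa in window) / len(window)
    strand = sum(PROPENSITY[aa][1] for aa in window) / len(window)
    return helix, strand

def classify(seq: str, start: int, size: int = 6) -> str:
    """Label a window H, E or C from its average propensities (simplified rule)."""
    helix, strand = window_propensity(seq, start, size)
    if helix > 100 and helix >= strand:
        return "H"
    if strand > 100 and strand > helix:
        return "E"
    return "C"

print(classify("EAMEAL", 0))  # H: rich in strong helix formers
print(classify("VIVYVI", 0))  # E: rich in strong strand formers
print(classify("GPGSPG", 0))  # C: Gly and Pro disfavor both
```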

  • Second generation methods: segment statistics

    Similar to single-residue methods, but incorporating additional information (adjacent residues, segmental statistics).

    Problems:

    Low accuracy: Q3 below 66% (results).

    Q3 of β-strands (E): 28% - 48%.

    Predicted structures were too short.

  • The GOR method

    Developed by Garnier, Osguthorpe & Robson

    Builds on Chou-Fasman Pij values

    Evaluates each residue PLUS the adjacent 8 N-terminal and 8 carboxyl-terminal residues

    Sliding window of 17 residues

    Underpredicts β-strand regions

    GOR method accuracy: Q3 = ~64%
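
The 17-residue sliding window can be sketched in a few lines of Python. The sketch below only shows how the window is formed and how per-position contributions could be summed for one state. The padding character, the function names and the tiny score table are hypothetical; the real GOR parameters are derived from known structures and are not reproduced here.

```python
HALF_WINDOW = 8  # 8 residues on each side of the central residue: 17 in total

def windows(seq: str, half: int = HALF_WINDOW, pad: str = "-"):
    """Yield the window centred on each residue, padded at the chain ends."""
    padded = pad * half + seq + pad * half
    for i in range(len(seq)):
        yield padded[i:i + 2 * half + 1]

def score_state(window: str, table: dict) -> float:
    """Sum the contributions of every window position for one state.

    `table` maps (offset from centre, residue) to a score; entries that are
    missing, including padding, contribute 0.
    """
    half = len(window) // 2
    return sum(table.get((k - half, aa), 0.0) for k, aa in enumerate(window))

# Hypothetical two-entry helix table, for illustration only.
toy_helix_table = {(0, "A"): 1.0, (-1, "T"): 0.5}

ws = list(windows("MKTAYIAKQR"))
print(ws[0])                                # --------MKTAYIAKQ
print(score_state(ws[3], toy_helix_table))  # 1.5 (central A plus T at offset -1)
```

In an actual predictor one such table would exist per state (helix, strand, coil), and the central residue would be assigned the state with the highest summed score.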

  • Third generation methods

    Third generation methods reached 77% accuracy. They rest on two new ideas:

    1. A biological idea: using evolutionary information based on conservation analysis of multiple sequence alignments.

    2. A technological idea: using neural networks.

  • Artificial Neural Networks

    An attempt to imitate the human brain (assuming that this is the way it works).

  • Neural network models

    Machine learning approach

    Provide training sets of structures (e.g. α-helices, non-α-helices)

    Computers are trained to recognize patterns in known secondary structures

    Provide test set (proteins with known structures)

    Accuracy ~70-75%

  • Reasons for improved accuracy

    Align sequence with other related proteins of the same protein family

    Find members that have a known structure

    If there are significant matches between structure and sequence, assign secondary structures to corresponding residues

  • New and Improved Third-Generation Methods

    Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.

    PHD (Q3 ~ 70%): Rost B, Sander C. (1993) J. Mol. Biol. 232, 584-599.

    PSIPRED (Q3 ~ 77%): Jones, D. T. (1999) J. Mol. Biol. 292, 195-202. Arguably remains the top secondary structure prediction method (won all CASP competitions since 1998).

  • Secondary Structure Prediction Summary

    1st Generation - 1970s: Q3 = 50-55% (Chou & Fasman, GOR)

    2nd Generation - 1980s: Q3 = 60-65% (Qian & Sejnowski, GORIII)

    3rd Generation - 1990s: Q3 = 70-80% (PHD, PSIPRED)

    Many 3rd+ generation methods exist:

    PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/

    JPRED - http://www.compbio.dundee.ac.uk/~www-jpred/

    PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html

    NNPRED - http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html

  • The sequence-structure gap

    More than 13,137,813 known protein sequences, 76,495 experimentally determined structures.

  • The sequence-structure gap (figure: number of sequences vs. number of structures)

    The gap is getting bigger.

  • Protein Secondary Structures (Simplifications)

    α-HELIX

    β-STRAND

    COIL (everything else)

  • Beyond Secondary Structure, Before Tertiary Structure

    Supersecondary structures (motifs): small, discrete, commonly observed aggregates of secondary structures, e.g. helix-loop-helix, β-α-β

    Domains: independent units of structure, e.g. β barrel, four-helix bundle

    The terms domain and motif are sometimes used interchangeably.

  • Helix-loop-helix

  • Beyond Secondary Structure, Before Tertiary Structure

    Folds: Compact folding arrangements of a polypeptide chain (a protein or part of a protein).

    The terms domain and fold are sometimes used interchangeably.

  • EF Fold

    Found in calcium-binding proteins such as calmodulin

  • Leucine Zipper

  • Rossmann Fold

    The beta-alpha-beta-alpha-beta subunit

    Often present in nucleotide-binding proteins

  • β sandwich, β barrel

  • α/β horseshoe

  • Four helix bundle

    24 amino acid peptide with a hydrophobic surface

    Assembles into 4 helix bundle through hydrophobic regions

    Maintains solubility of membrane proteins

  • TIM Barrel

  • PDB New Fold Growth

    The number of unique folds in nature is fairly small (possibly a few thousand)

    90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

    Figure legend: new fold, old fold

  • Protein data bank

    http://www.rcsb.org/pdb/

  • Protein 3D structure data

    The structure of a protein consists of the 3D (X, Y, Z) coordinates of each non-hydrogen atom of the protein. Some protein structures also include coordinates of covalently linked prosthetic groups, non-covalently linked ligand molecules, or metal ions. For some purposes (e.g. structural alignment) only the Cα coordinates are needed.

    Example of PDB format:

                                X        Y        Z   occupancy  temp. factor
    ATOM   18  N   GLY  27   40.315  161.004   11.211    1.00       10.11
    ATOM   19  CA  GLY  27   39.049  160.737   10.462    1.00       14.18
    ATOM   20  C   GLY  27   38.729  159.239   10.784    1.00       20.75
    ATOM   21  O   GLY  27   39.507  158.484   11.404    1.00       21.88

    Note: the ATOM records provide no explicit information about connectivity between atoms. The last two numbers (occupancy, temperature factor) relate to disorder of atomic positions in crystals.
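
As a rough illustration of how such records are read, the Python sketch below parses the four example lines and pulls out the Cα coordinates. It assumes the simplified, whitespace-separated layout shown on the slide; real PDB files use fixed column positions and carry extra fields such as a chain identifier, so a production parser should slice by column or use an established library. The class and function names are illustrative.

```python
import math
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float

def parse_atom_line(line: str) -> Atom:
    """Parse one simplified ATOM line (10 whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not a simplified ATOM record: {line!r}")
    _, serial, name, residue, res_seq, x, y, z, occ, temp = fields
    return Atom(int(serial), name, residue, int(res_seq),
                float(x), float(y), float(z), float(occ), float(temp))

def ca_coordinates(atoms):
    """Return the (x, y, z) coordinates of the C-alpha atoms only."""
    return [(a.x, a.y, a.z) for a in atoms if a.name == "CA"]

def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the file."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

LINES = [
    "ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11",
    "ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18",
    "ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75",
    "ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88",
]
atoms = [parse_atom_line(line) for line in LINES]
print(ca_coordinates(atoms))                   # [(39.049, 160.737, 10.462)]
print(round(distance(atoms[0], atoms[1]), 3))  # 1.495, the N to CA distance
```

Because connectivity is not stored explicitly, interatomic distances such as the one printed here are what viewers typically use to infer which atoms are bonded.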

  • Protein structure: Some computational tasks

    Building a protein structure model from X-ray data

    Building a protein structure model from NMR data

    Computing the energy for a given protein structure (conformation)

    Energy minimization: finding the structure with the minimal energy according to some empirical force fields

    Simulating the protein folding process (molecular dynamics)

    Structure visualization

    Computing secondary structure from atomic coordinates

    Protein superposition, structural alignment

    Protein fold classification

    Threading: finding a fold (prototype structure) that fits a sequence

    Docking: fitting ligands onto a protein surface by molecular dynamics or energy minimization

    Protein 3D structure prediction from sequence

  • Viewing protein structures

    When looking at a protein structure, we may ask the following types of questions:

    Is a particular residue on the inside or outside of a protein?

    Which amino acids interact with each other?

    Which amino acids are in contact with a ligand (DNA, peptide hormone, small molecule, etc.)?

    Is an observed mutation likely to disturb the protein structure?

    Standard capabilities of protein structure software:

    Display of protein structures in different ways (wireframe, backbone, sticks, spacefill, ribbon)

    Highlighting of individual atoms, residues or groups of residues

    Calculation of interatomic distances

    Advanced feature: superposition of related structures

  • Example: c-abl oncoprotein SH2 domain, display wireframe

  • Example: c-abl oncoprotein SH2 domain, display sticks

  • Example: c-abl oncoprotein SH2 domain, display backbone

  • Example: c-abl oncoprotein SH2 domain, display spacefill

  • Example: c-abl oncoprotein SH2 domain, display ribbons

  • Predicting protein 3D structure

    Goal: 3D structure from 1D sequence

    Approaches: fold recognition, homology modeling, ab initio

    Figure labels: an existing fold, a new fold

  • Homology modeling

    Based on two major observations (and some simplifications):

    The structure of a protein is uniquely defined by its amino acid sequence.

    Similar sequences adopt similar structures. (Distantly related sequences may still fold into similar structures.)

  • Homology modeling needs three items of input:

    The sequence of a protein with unknown 3D structure, the "target sequence."

    A 3D template: a structure having the highest sequence identity with the target sequence (>30% sequence identity)

    A sequence alignment between the target sequence and the template sequence
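
The 30% threshold can be checked directly from an alignment. The Python sketch below computes percent identity from two aligned sequences of equal length. The choice of denominator (columns where neither sequence has a gap) is an assumption, since several definitions of sequence identity are in use, and the two aligned sequences are made up for illustration.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_target, aligned_template)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

# Made-up alignment: 8 ungapped columns, 6 of them identical.
target   = "MKT-AYIAKQ"
template = "MRTGAYLA-Q"
identity = percent_identity(target, template)
print(identity)         # 75.0
print(identity > 30.0)  # True: above the threshold quoted above
```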

  • Homology Modeling: How it works

    Find template

    Align target sequence with template

    Generate model: add loops, add sidechains

    Refine model

  • Two zones of homology modeling [Rost, Protein Eng. 1999]

  • Automated Web-Based Homology Modelling

    SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html

    WHAT IF : http://www.cmbi.kun.nl/swift/servers/

    The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/

    3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/

    SDSC1 : http://cl.sdsc.edu/hm.html

    EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/

  • Fold recognition = Protein Threading

    Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?

  • Protein Threading

    The goal: find the correct sequence-structure alignment between a target sequence and its native-like fold in PDB

    Energy function: knowledge (or statistics) based rather than physics based

    Should be able to distinguish correct structural folds from incorrect structural folds

    Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments

  • Protein Threading: Basic premise

    Statistics from Protein Data Bank (~2,000 structures)

    Chances for a protein to have a structural fold that already exists in PDB are quite good.

    The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)

    90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

  • Protein Threading: Basic components

    Structure database

    Energy function

    Sequence-structure alignment algorithm

    Prediction reliability assessment

  • Protein Threading: structure database

    Build a template database

  • Process

    Threading: A protein fold recognition technique that involves incrementally replacing the sequence of a known protein structure with a query sequence of unknown structure. The new model structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.
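
The loop described above (place the query on a template, score the fit, repeat over all templates) can be shown with a toy Python sketch. Everything specific in it is a stand-in chosen for illustration: templates are reduced to strings of buried ('b') and exposed ('e') positions, the score simply rewards hydrophobic residues in buried positions and polar residues in exposed ones, and no gaps are allowed. Real threading programs use statistical potentials and gapped sequence-structure alignment.

```python
HYDROPHOBIC = set("AVILMFWC")  # crude hydrophobic/polar split, assumed for this toy

def fit_score(query: str, environments: str) -> int:
    """+1 for a hydrophobic residue in a buried ('b') position or a polar
    residue in an exposed ('e') position, -1 otherwise."""
    score = 0
    for aa, env in zip(query, environments):
        buried = env == "b"
        score += 1 if (aa in HYDROPHOBIC) == buried else -1
    return score

def best_template(query: str, templates: dict):
    """Try every ungapped placement of the query on every template.

    Returns (score, template name, offset) for the best fit, or None if the
    query is longer than every template.
    """
    best = None
    for name, envs in templates.items():
        for offset in range(len(envs) - len(query) + 1):
            score = fit_score(query, envs[offset:offset + len(query)])
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best

# Hypothetical template library: one environment string per known fold.
TEMPLATES = {
    "fold_A": "ebbebbeebb",
    "fold_B": "eeeebbbbee",
}
print(best_template("KLLEAV", TEMPLATES))  # (6, 'fold_A', 0)
```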

  • Fold recognition methods

    3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/

    Fugue http://www-cryst.bioc.cam.ac.uk/~fugue/

    HHpred http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred

  • ab-initio folding

    Goal: Predict structure from first principles

    Requires:

    A free energy function, sufficiently close to the true potential

    A method for searching the conformational space

    Advantages:

    Works for novel folds

    Shows that we understand the process

    Disadvantages:

    Applicable to short sequences only

  • Rosetta [Simons et al. 1997]

    http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php

  • Qian et al. (Nature: 2007) used distributed computing* to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent.

    *70,000 home computers for about two years.

  • Overall Approach (flowchart)

    Box labels: protein sequence, database searching, multiple sequence alignment, homology modelling, secondary structure prediction, fold recognition, sequence-structure alignment, ab-initio structure prediction, 3-D protein model

    Decision points: homologue in PDB (yes/no), predicted fold (yes/no)

  • ExPASy Proteomics Server: the Expert Protein Analysis System links to many protein prediction resources

    http://expasy.org/

  • RMSDmin

    The root mean square deviation (RMSD) is a measure of the average distance between the backbones of superimposed proteins. In the study of globular protein conformations, one customarily measures the similarity in three-dimensional structure by the RMSD of the Cα atomic coordinates after optimal rigid body superposition.

    A widely used way to compare the structures of biomolecules or solid bodies is to translate or rotate one structure with respect to the other to minimize the RMSD. This RMSDmin can be used as a distance measure between two proteins.
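
A minimal sketch of RMSDmin in Python follows. The slide does not name a superposition algorithm; the Kabsch algorithm (centre both coordinate sets, then find the optimal rotation from a singular value decomposition) is used here as one standard choice. The sketch assumes NumPy is available and that the two inputs are N x 3 arrays of matched Cα coordinates in the same order. The function names are illustrative.

```python
import numpy as np

def rmsd(P, Q) -> float:
    """RMSD between two matched coordinate sets, without any superposition."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(np.sqrt(((P - Q) ** 2).sum() / len(P)))

def rmsd_min(P, Q) -> float:
    """RMSD after optimal rigid-body superposition of P onto Q (Kabsch)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two N x 3 coordinate arrays of equal shape")
    # Translation: move both centroids to the origin.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Rotation: SVD of the 3 x 3 covariance matrix.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return rmsd(P0 @ R.T, Q0)

# Q is P rotated by 90 degrees about the z axis and then shifted.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
print(rmsd(P, Q) > 1.0)       # True: large before superposition
print(rmsd_min(P, Q) < 1e-8)  # True: the two shapes are identical
```

The reflection guard matters for proteins: without it, the optimal transformation could map a structure onto its mirror image, which is not a rigid-body motion.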

  • Notes

    No long range effects

    IgG binding domain of protein G

    Pro prefers the 1st residue in an alpha-helix

    Asp & Glu prefer the amino termini

    Arg and Lys prefer the carboxyl ends

    Based on 15 protein structures

    Simulate the brain. Selection of training sets is extremely important. Different protein families, only one or two representatives from each family.

    Remember to explain a bit more about SBDD