structural bioinformatics

Structural Bioinformatics

In this presentation…

Part 1 – Proteins & Proteomics

Part 2 – Protein Structure & Function

Part 3 – Analysis & Visualization

Part 4 – Protein Structure Prediction

Part 1 – Proteins & Proteomics

Proteins

• Proteins are the fundamental building blocks of life

• Enzymes are proteins that are molecular machines responsible for all the chemical transformations cells are capable of

• Those structures that are not made of proteins are produced by enzymes (which are proteins)

• A human contains on the order of 100,000 different proteins

• Proteins are of variable length and shape

Structural types and conceptual models

• Globular proteins are soluble in predominantly aqueous solvents such as the cytosol and extra-cellular fluids, and integral membrane proteins exist within the lipid-dominated environment of biological membranes

• Conceptual models of protein structure are valuable aids to understanding protein bioinformatics

Globular proteins

• The linear amino acid polymer forms a 3D structure by folding into a globular compact shape

• Globular proteins tend to be soluble in aqueous solvents and folding is dominated by the hydrophobic effect, which directs hydrophobic amino acid side-chains to the structural core of the protein, away from the solvent

Secondary structure

• Globular proteins usually contain elements of regular secondary structure, including α-helices and β-strands

• These are stabilized by hydrogen bonding and contribute most of the amino acids to globular protein cores

• Residues in regular secondary structures are given the symbol H, meaning helix, or E (or B), meaning extended or strand

Folding of a polypeptide chain into an α-helix

[Figure: side view of an α-helix. Rise per residue 0.15 nm (100° rotation per residue); pitch 0.54 nm (3.6 amino acid residues per turn). Labels: position of the polypeptide backbone consisting of Cα and peptide-bond C-N atoms; Cα atoms of consecutive amino acid residues; amino acid side-chains (R groups, R1 to R5); hydrogen bond]

In the α-helix the CO group of residue n is hydrogen bonded to the NH group of residue (n+4)

[Figure: cross-sectional view of an α-helix showing the positions of the side-chains (R groups) of the amino acids on the outside of the helix]
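The helix parameters quoted above are mutually consistent, and a short Python sketch can make the arithmetic explicit. The rise (0.15 nm) and rotation (100°) per residue come from the slide; the Cα radius of 0.23 nm is an assumed typical value that the slide does not give, and the function names are illustrative only.

```python
import math

RISE_NM = 0.15     # rise along the helix axis per residue (from the slide)
TURN_DEG = 100.0   # rotation about the helix axis per residue (from the slide)
RADIUS_NM = 0.23   # ASSUMPTION: typical C-alpha radius, not stated in the slide


def residues_per_turn(turn_deg=TURN_DEG):
    """Number of residues needed to complete one 360 degree turn."""
    return 360.0 / turn_deg


def pitch_nm(rise_nm=RISE_NM, turn_deg=TURN_DEG):
    """Axial distance covered by one full turn of the helix."""
    return rise_nm * residues_per_turn(turn_deg)


def ideal_helix_ca(n_residues, rise_nm=RISE_NM, turn_deg=TURN_DEG,
                   radius_nm=RADIUS_NM):
    """C-alpha coordinates (x, y, z) in nm of an idealized helix along z."""
    coords = []
    for i in range(n_residues):
        angle = math.radians(i * turn_deg)
        coords.append((radius_nm * math.cos(angle),
                       radius_nm * math.sin(angle),
                       i * rise_nm))
    return coords


def backbone_hbond_pairs(n_residues):
    """(n, n+4) index pairs: CO of residue n bonded to NH of residue n+4."""
    return [(n, n + 4) for n in range(n_residues - 4)]
```

Calling `residues_per_turn()` and `pitch_nm()` reproduces the 3.6 residues per turn and 0.54 nm pitch shown in the figure, and `backbone_hbond_pairs(6)` lists the two hydrogen bonds possible in a six-residue helix.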

Tertiary structure

• It is the full 3D atomic structure of a single peptide chain

• It can be viewed as the packing together of secondary structure elements, which are connected by irregular loops that lie predominantly on the protein surface

• Loop residues are given the symbol C to distinguish them from residues in helices or strands
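The H/E/C symbols make it possible to describe a structure as a simple string with one character per residue. As a rough illustration, the sketch below collapses such a string into secondary structure segments and reports the fraction of residues in each state. The function names and the example string are hypothetical and not taken from any particular tool.

```python
from itertools import groupby


def segments(ss):
    """Collapse a per-residue H/E/C string into (state, start, end) runs.

    Positions are 1-based and inclusive.
    """
    out = []
    pos = 1
    for state, run in groupby(ss):
        length = len(list(run))
        out.append((state, pos, pos + length - 1))
        pos += length
    return out


def composition(ss):
    """Fraction of residues in each of the three states H, E and C."""
    return {state: ss.count(state) / len(ss) for state in "HEC"}


# Hypothetical three-state assignment for a 10-residue peptide
example = "CCHHHHCEEC"
example_segments = segments(example)
```

For the example string, the helix is reported as the run `("H", 3, 6)` and the strand as `("E", 8, 9)`, with loop (C) residues in between.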

Tertiary Structures

Quaternary structures

• Several tertiary structures may pack together to form the biologically functional quaternary structure

Quaternary Structures

Integral membrane proteins

• These exist within biological lipid membranes and obey different structural principles compared with globular proteins

• They contain runs of generally hydrophobic amino acids, associated with membrane-spanning segments (often but not exclusively helices), connected by more hydrophilic loops that lie in aqueous environments outside the membrane

• Membrane proteins are very important components of cellular signaling and transport systems

Domains

• Proteins tend to have modular architecture and many proteins contain a number of domains, often with mixed types, for example mixed integral membrane and globular domains

Evolution

• In globular proteins, surface residues in loops evolve (change) more quickly than residues in the hydrophobic core

• In integral membrane proteins, the most slowly evolving residues are those in the membrane-spanning regions

Protein structure prediction

• Identifying all of the proteins in a human is one thing, but to truly understand a protein’s function scientists must discern its shape and structure

• The structural genomics initiative calls for use of quasi-automated x-ray crystallography to study normal and abnormal proteins

• Conventional structural biology is based on purifying a molecule, coaxing it to grow into crystals and then bombarding the sample with x-rays. X-rays bounce off the molecule’s atoms, leaving a diffraction pattern that can be interpreted to yield the molecule’s overall 3D shape

• A structural genomics initiative would depend on scaling up and speeding up the current techniques

• By figuring out which of the unknown proteins associate with previously identified ones, the CuraGen and University of Washington scientists were able to sort them into functional categories, such as energy generation, DNA repair and aging

• Even though yeast is an excellent prototype, Drosophila is a good choice when the aim is to study a multicellular organism

Other methods for protein prediction

• Another method for studying proteomes is called “guilt by association”: learning about the function of a protein by assessing whether it interacts with another protein whose role in a cell is known

• A group led by Stanley Fields of the University of Washington reported that they had deduced 957 interactions among 1,004 proteins in baker’s yeast [S. cerevisiae]

• A machine devised by Hochstrasser and his research group goes one step further than the robots. It would automatically extract the protein spots from the gels, use enzymes to chop the proteins into bits, feed the pieces into a laser mass spectrometer and transfer the information to a computer for analysis

• With or without robotic arms, 2-D gels have their problems. Besides being tricky to make, they do not resolve highly charged or low mass proteins very well

• They also do a poor job of resolving proteins with hydrophobic regions, such as those that span the cell membrane. This is a major limitation, because membrane-spanning receptors are important drug targets

• Fields and his colleagues first devised a widely used method for studying protein interactions called the yeast two-hybrid system, which uses known protein “baits” to find “prey” proteins that bind to the “baits”

• Another way to study proteins that has recently become available involves so-called protein chips. Ciphergen Biosystems, a biotechnology company in Palo Alto, is selling a range of strips for isolating proteins according to various properties, such as whether they dissolve in water or bind to charged metal atoms. Strips can then be placed in a chip reader, which includes a mass spectrometer, for identifying the proteins
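The “guilt by association” idea described above can be sketched as a vote over interaction partners: an uncharacterized protein is assigned the function most common among the proteins it interacts with. The following is a minimal illustration only; the protein names, the interaction list and the majority-vote rule are all assumptions made for the example, not the procedure used by the groups mentioned.

```python
from collections import Counter, defaultdict


def build_graph(interactions):
    """Turn a list of (protein_a, protein_b) pairs into an undirected graph."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def predict_function(protein, graph, known):
    """Most common known function among interaction partners, or None.

    Ties between equally common functions are broken arbitrarily.
    """
    partners = graph.get(protein, ())
    votes = Counter(known[p] for p in partners if p in known)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical two-hybrid results and annotations, for illustration only
interactions = [("P1", "A"), ("P1", "B"), ("P1", "C"), ("P2", "C")]
known = {"A": "DNA repair", "B": "DNA repair", "C": "energy generation"}
graph = build_graph(interactions)
p1_guess = predict_function("P1", graph, known)
```

Here `P1` interacts with two DNA repair proteins and one energy generation protein, so it is tentatively placed in the DNA repair category.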

What’s new

• Knowing the exact structural form of each of the proteins in the human proteome should, in theory, help drug designers devise chemicals to fit the slots on the proteins that either activate them or prevent them from interacting

• Such efforts, which are generally known as rational drug design, have not shown widespread success so far – but then only roughly one percent of all human proteins have had their structures determined

• After scientists catalogue the human proteome, it will be the proteins – not the genes – that will be all the rage

Part 2 – Protein Structure & Function

Structure and function

• Proteins rely upon the shapes and properties of key functional areas of their 3D structures to carry out biological functions

• Knowledge of protein structure is the key to understanding protein function and this is one reason for its importance in bioinformatics

[Figure panels: MUTZM, WTZM]

Structural and functional constraints

• Evolution accepts change to amino acid residues in proteins where they have a neutral or advantageous effect on protein structural stability or protein function

• Residues can be conserved for structural or functional reasons

• Amino acids are conserved where they are uniquely able to fulfill particular structural roles

• This often occurs with cysteine, glycine and proline

[Figure labels: TOLC 150RPIP, XRCC4 300FGF1]

Evolution of the overall protein fold

• If two naturally occurring protein sequences can be aligned to show more than 25 percent identity over an alignment of 80 or more residues, then they will share the same basic structure

• The Sander-Schneider formula gives the higher threshold percentage identities necessary to guarantee structural similarity for shorter alignments
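The length dependence of the identity threshold can be sketched in a few lines of Python. The constants below (290.15, the exponent -0.562, a plateau of 24.8 percent from 80 residues upward, and validity from 10 residues) are the commonly quoted form of the Sander-Schneider curve; they are not given in these slides, so treat them as an assumption to be checked against the original publication. The function names are illustrative.

```python
def sander_schneider_threshold(length):
    """Percent identity above which two aligned sequences are expected to
    share the same basic structure, for an alignment of `length` residues.

    ASSUMPTION: constants follow the commonly quoted form of the curve.
    """
    if length < 10:
        raise ValueError("curve is not defined for alignments under 10 residues")
    if length >= 80:
        return 24.8  # plateau for long alignments (roughly the 25 percent rule)
    return 290.15 * length ** -0.562


def likely_same_fold(percent_identity, length):
    """True if the alignment lies above the threshold curve."""
    return percent_identity > sander_schneider_threshold(length)
```

Under these assumptions a 30 percent identity alignment over 100 residues passes the test, while the same identity over only 20 residues does not, because the threshold for short alignments is much higher.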

Conservation of structure

• Protein structures tend to be conserved even when evolution has changed the sequence almost beyond recognition

• Structural knowledge is therefore a key factor in understanding protein evolution

Evolution of function

• While structure tends to be conserved by evolution, function is observed to change

• There are many examples of proteins whose sequence and structure are very similar, but which have different functions

• When function has changed, key functional residues change as well, and this is often clear in multiple sequence alignments

Multiple sequence alignment

• Understanding how structures evolve can help us understand multiple sequence alignments

• Key structural and functional residues are often observed to be conserved

• Insertions and deletions are seen to occur preferentially in hydrophilic surface loops by comparison with regular secondary structure elements

• Loops are also subject to faster mutational change

• Conservation of hydrophobic core residues in secondary structure elements is also common, as are conservation patterns associated with amphipathic helices
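The conservation and gap patterns described above can be read off a multiple sequence alignment column by column. The sketch below scores each column by the fraction of sequences carrying the most common residue, and separately reports the fraction of gaps. The scoring rule, the function names and the three-sequence alignment are assumptions chosen for illustration; real tools use more elaborate measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common residue.

    Gap characters ('-') never count as the conserved residue.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def gap_fraction(alignment):
    """Per-column fraction of gap characters, a crude indicator of loops."""
    return [column.count("-") / len(column) for column in zip(*alignment)]


# Hypothetical alignment of three short sequences
toy_alignment = ["MKCLG-AP",
                 "MRCIGSAP",
                 "MKCVG--P"]
toy_scores = column_conservation(toy_alignment)
toy_gaps = gap_fraction(toy_alignment)
```

In the toy alignment the cysteine and glycine columns are fully conserved, while the gapped sixth column is the kind of position that would be expected to fall in a surface loop.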

Part 3 – Analysis & Visualization

Software, data and WWW sites

• A large variety of software for structure visualization, alignment and analysis is available on the WWW

• All published protein structures are submitted to a public database. Database searches and downloads can be performed at various WWW sites

• RasMol, Chime and Cn3D are commonly used programs for viewing structural data

Structural and functional analysis of structures

• There is an enormous amount of software available for structural data analysis, and also several WWW sites holding pre-prepared analyses

• Functional sites in protein structures typically contain a few residues in defined spatial positions

• Software and databases have been developed to locate and search for similarity in such sites

Structural alignment

• It can be very difficult to find correct, biologically-meaningful alignments of very distantly related protein sequences because they contain only a very small proportion of identical monomers

• In such cases, structural information can help because evolution tends to change structure less

• Superimposing the backbones of similar structures implies structurally equivalent residues and this process is known as structural alignment

Structural similarity

• Structural alignment methods often produce a measure of structural similarity

• The most common of these is the RMSD, which is reported by most programs

• This is the root mean square deviation in position between the Cα atoms of aligned residues in optimal structural superposition
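As a minimal sketch of the RMSD calculation, the function below takes two equally long lists of Cα coordinates for the aligned residues and returns the root mean square of their pairwise distances. It assumes the two structures have already been optimally superposed; finding that superposition (for example with the Kabsch algorithm) is a separate step not shown here. The function name and coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired (x, y, z) coordinates.

    ASSUMPTION: the two coordinate sets are already superposed and listed
    in the same order, one entry per aligned residue.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical C-alpha positions for three aligned residues
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(1.0, 0.0, 0.0), (4.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
```

The result is in the same units as the input coordinates, so coordinates in ångströms give an RMSD in ångströms.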

Why classify protein structures?…

• Classification groups together proteins with similar structures and common evolutionary origins

• Examples
– CATH, available at http://www.biochem.ucl.ac.uk/bsm/cath
– SCOP, available at http://scop.mrc-lmb.cam.ac.uk/scop

Structural classes

• Proteins can be assigned to broad structural classes based on secondary structure content and other criteria

• CATH has four such broad classes, but SCOP uses more, giving a more detailed description of structural class

Fold or topology

• All classifications gather together proteins with the same overall fold or topology

• Proteins in the same fold or topology class contain more or less the same secondary structure elements (SSEs), connected in the same way and in similar relative spatial positions

Homologs and analogs

• Homologs (homologous proteins) are related by divergent evolution from a common ancestor, and have the same fold

• Analogs (analogous proteins) have the same fold, but other evidence for common ancestry is weak

Super-folds

• Super-folds are protein folds that seem likely to have arisen more than once in evolution

• They are thought to have advantageous physico-chemical properties

• They appear in SCOP and CATH as fold or topology levels containing several homologous super-families

• Examples are the TIM barrel and immunoglobulin fold

• Characteristics are that they tend to exhibit approximate symmetries, and are characterized by repeated super-secondary structures

Part 4 – Protein Structure Prediction

Why predict structure?…

• Structure prediction is interesting because experimental structure determination is still much slower than sequence determination

• Structure predictions help us to understand function and mechanism and can be used for rational drug design

• The early work of Levinthal and Anfinsen made structure prediction a fascinating scientific problem

Structure prediction methods

• Comparative modeling

• Secondary structure prediction

• Fold recognition

• Ab initio prediction

• Transmembrane segment prediction

Theoretical basis of comparative modeling

• Sequences with more than 25 percent identity over an alignment of 80 residues or more adopt the same basic structure

• This is the basis of prediction by comparative modeling

Ingredients

• All that is needed is an alignment between a sequence of unknown structure (target) and one or more of known structure (template(s)) with the above property

• Template structures can be found by standard sequence similarity search methods

• Lack of suitable template structures is the main limitation of the method, but structural genomics projects are likely to change this in coming years

• The accuracy of the alignment is crucial if a good prediction is to be obtained

The process of prediction

• Known structure(s) (templates) are used as the basis of prediction

• The process can then be viewed conceptually as comprising placement of conserved core residues, modeling of variable loops, side-chain positioning and optimization, and model refinement

• Conserved residues and some side-chain positions can be obtained directly from structural information in the templates

• Modeling of variable loops often makes use of the spare parts algorithm, and there are sophisticated algorithms for side-chain placement to obtain an optimally packed hydrophobic core

Protein prediction – I

Protein prediction – II

Accuracy of comparative modeling

• Accuracy is controlled almost entirely by the quality of the alignment

• Good alignments yield good predictions with most of the main software packages

• Of all prediction methods, comparative modeling produces the most accurate models

Secondary structure prediction

• It predicts the conformational state of each residue in three categories
– Helical
– Extended or strand
– Coil

Methods

• Many methods are based on ideas related to secondary structure propensity, which is a number reflecting the preference of a residue for a particular secondary structure

• Early methods had accuracies of around 60 percent (the percentage of residues predicted in the correct helical/extended/coil state)

• Examples of early methods are the Chou-Fasman rule-based method and the information-theoretical GOR method
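One common way to define a propensity, sketched below, is the frequency with which an amino acid is found in a given state divided by the frequency of that state over all residues: values above 1 indicate a preference for the state. This is an illustration of the general idea rather than the exact Chou-Fasman or GOR procedure, and the training sequences, state strings and function name are hypothetical.

```python
from collections import Counter


def propensities(sequences, structures, state="H"):
    """Propensity of each amino acid for `state`.

    propensity(aa) = P(state | aa) / P(state), estimated from paired
    sequence and per-residue H/E/C strings of known structures.
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must match in length")
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_total = sum(aa_total.values())
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError("state does not occur in the training data")
    background = n_state / n_total
    return {aa: (aa_in_state[aa] / aa_total[aa]) / background
            for aa in aa_total}


# Hypothetical training data: two tiny "proteins" with known states
train_seqs = ["AAEELLGP", "GPAEVVGG"]
train_ss = ["HHHHHHCC", "CCHHEECC"]
helix_propensity = propensities(train_seqs, train_ss, state="H")
```

With realistic training sets, residues such as alanine and glutamate would be expected to score above 1 for helix and glycine and proline below 1; the toy data here is far too small to draw such conclusions and only shows the mechanics.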

Multiple sequence information

• Using multiple alignments of related sequences can improve prediction accuracy enormously by revealing patterns of conservation indicative of certain secondary structures

Accuracy of state-of-the-art methods

• Current methods claim an average accuracy, over trusted test sets of proteins, of more than 70 percent of residues correctly predicted

• This increase in accuracy can be attributed to the availability of more structural data, and the use of more sophisticated algorithms or methods
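The accuracy figures quoted for secondary structure prediction are per-residue three-state accuracies: the percentage of residues whose predicted helical/extended/coil state matches the observed one. A minimal sketch of that calculation follows; the function name `q3` and the example strings are illustrative.

```python
def q3(predicted, observed):
    """Percentage of residues predicted in the correct H/E/C state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical prediction compared with the observed states
predicted_states = "HHHHCCEEEE"
observed_states = "HHHCCCEEEC"
accuracy = q3(predicted_states, observed_states)
```

In this example eight of the ten residues agree, so the prediction scores 80 percent.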

Prediction of trans-membrane segments

• Membrane-spanning segments in integral membrane proteins can be predicted with reasonable accuracy

• Most methods make use of a search for contiguous runs of hydrophobic residues that span a lipid membrane

• Some methods also predict the orientation (in-out) or topology of the membrane-spanning segments, but this is usually less accurate
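The search for contiguous hydrophobic runs is often implemented as a sliding-window average over a hydropathy scale. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6; the slides do not name a scale, so the scale, window and cutoff are assumptions reflecting commonly used defaults. It reports the 1-based range of window centers that exceed the cutoff, which marks the core of each predicted segment, and it does not attempt to predict in-out orientation.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def predict_tm_segments(seq, window=19, threshold=1.6):
    """Ranges (1-based, inclusive) of window centers above the threshold."""
    half = window // 2
    found = []
    start = None
    last = None
    for i, avg in enumerate(window_averages(seq, window)):
        center = i + half + 1  # 1-based position of the window's middle residue
        if avg >= threshold:
            if start is None:
                start = center
            last = center
        elif start is not None:
            found.append((start, last))
            start = None
    if start is not None:
        found.append((start, last))
    return found


# Hypothetical sequence: 20 leucines flanked by charged residues
toy_protein = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
toy_segments = predict_tm_segments(toy_protein)
```

For the toy sequence a single segment is reported, lying inside the leucine run at positions 21 to 40; a purely charged sequence yields no segments.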

Availability of tools

• Most of the secondary structure and trans-membrane segment prediction tools are available from the ExPASy WWW site, at http://www.expasy.ch

Fold recognition

• It aims to detect very distant structural and evolutionary relationships

• It aims to detect when a protein adopts a known fold even if it does not have significant sequence similarity to any protein of known structure

• Methods generally try to find the most compatible fold in a library of known folds using both sequence and structural information

• An alternative term for fold recognition is threading

Ab initio prediction

• These methods rely on first-principles calculations and are not yet sufficiently well developed to be of real use in practical structure prediction

Difficulties in modeling in silico

• Not all occurrences of a desired part or fragment are to be found and changed, but only a particular one

• Proteins are globular, not solid, objects. They behave differently with different drug molecules

• Penetration through the cell wall, and then through the nucleus to the DNA, is not possible, as this could affect the entire cell; the only way is to modulate it via mRNA

Membrane protein

Total entries in PDB: 20173
Proteins: 18162
Membrane proteins: only 8

• Membrane proteins are highly suitable targets for docking drugs

• They do not dissolve in water, although NMR has been tried extensively

• Crystallization of membrane proteins is a difficult task as they cause damage to other structures of the cell

• After the failure of crystallography and NMR, it is the turn of computers (for in silico protein modeling and drug design)

Approach to protein modeling

• The conventional protein modeling technique is to compute all the folding and side-chain arrangements and then visualize the final protein structure

• In a new method, the side-chain arrangements are computed separately first and the folding computations are then done in parallel. Finally, the complete information is integrated. This method proved to be a million times faster than the conventional method

Template library of fragments

• The process of protein modeling becomes much easier if all the common fragments or side-chains are developed and stored as a molecule template library

• This technique greatly reduces the time consumed and speeds up the visualization

• To date, about 180 templates of various organic compounds have been identified and developed at IIT Delhi

• All other compounds or molecules can be modeled by suitably assembling them together

• It would also be easier to compare different proteins with this approach

Protein folding

• It is a well known fact that any protein would fold as and when it reaches 30 nm

• Also, it has been found that due to the globular structure, proteins cannot take over 8 sheets and strands

• It has been seen that due to the heavy molecular weight, extra large protein molecules disintegrate

• Phosphates in DNA repel and hence form coil structures, which add to difficulty in folding and modeling them in silico

Folding through computers

• Keeping in view the possible number of sheet or strand attachments to a protein, which can occur at 45° intervals, in 3D there could be 26 possibilities of folding

• The folds could be simulated or modeled on a computer, but at one fold per minute it would take 2^26 minutes to fold a protein, which is about 50 years!!

• With 100 processors running throughout, this could be achieved in about 50 days
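Estimates of this kind are easy to check with a few lines of Python. The sketch below assumes that the number of fold combinations to enumerate is 2^26 (two states for each of the 26 possibilities), which is an interpretation rather than something the slide spells out, and that the work divides perfectly across processors. Under those assumptions the totals come out somewhat larger than the figures quoted above, so both are best read as order-of-magnitude estimates.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365


def enumeration_time(n_conformations, folds_per_minute=1.0, processors=1):
    """Return (days, years) needed to enumerate all conformations.

    ASSUMPTION: perfect parallel scaling, no communication overhead.
    """
    minutes = n_conformations / (folds_per_minute * processors)
    days = minutes / MINUTES_PER_DAY
    return days, days / DAYS_PER_YEAR


# ASSUMPTION: 2**26 combinations, one fold evaluated per minute
single_days, single_years = enumeration_time(2 ** 26)
parallel_days, parallel_years = enumeration_time(2 ** 26, processors=100)
```

The same function can be reused to see how the estimate changes with a faster fold rate or a different number of processors.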
