
Protein Structure

July 2, 2006

Learning objectives

• Understand the basis of the secondary structure prediction program PSIPRED.

• Introduce the concept of a neural network and how a neural network can help to predict secondary structure.

• Become familiar with motif-finding programs.

• Become familiar with the Protein Data Bank.

Workshop: analysis of p53 with PSIPRED and analysis of the PTEN protein with BLIMPS.

Some Prediction Methods

• Ab initio methods: based on physical properties of amino acids and bonding patterns

• Statistics of amino acid distributions in known structures: Chou-Fasman

• Position of amino acid and distribution: Garnier, Osguthorpe, Robson (GOR)

• Neural networks

GOR (Garnier, Osguthorpe, Robson)

Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, the eight residues on either side of aa_j are considered.

A helix propensity table records how strongly each residue type, at position j and at the 16 surrounding positions, is associated with residue j being helical. The helix propensity table therefore has 20 x 17 entries (20 amino acids by 17 window positions).

The score for each state at aa_j is calculated as the sum of the position-dependent propensities of all residues around aa_j, and the state with the highest score is predicted.
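The window sum described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the toy propensity tables are assumptions made for the example, and the values are placeholders rather than the published GOR parameters.

```python
# Sketch of GOR-style scoring. The table values are illustrative placeholders,
# not the published GOR parameters.
WINDOW = 8  # residues considered on either side of position j


def state_score(seq, j, table):
    """Sum position-dependent propensities over the 17-residue window around j.

    table maps (amino_acid, offset) -> propensity, with offset in -8..+8.
    Missing entries and positions past the sequence ends contribute 0.
    """
    total = 0.0
    for offset in range(-WINDOW, WINDOW + 1):
        k = j + offset
        if 0 <= k < len(seq):
            total += table.get((seq[k], offset), 0.0)
    return total


def predict_state(seq, j, tables):
    """Return the state whose propensity table gives the highest sum at j."""
    return max(tables, key=lambda state: state_score(seq, j, tables[state]))


# Toy tables (hypothetical): alanine favours helix, valine favours sheet.
toy_tables = {
    "H": {("A", o): 1.0 for o in range(-WINDOW, WINDOW + 1)},
    "E": {("V", o): 1.0 for o in range(-WINDOW, WINDOW + 1)},
    "T": {},
}
print(predict_state("AAAAAVAAAA", 5, toy_tables))
```

Each real table would hold 20 x 17 entries, one per amino acid and window offset.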


Psi-BLAST Predict Secondary Structure (PSIPRED)

Three stages:

1) Generation of sequence profile

2) Prediction of initial secondary structure

3) Filtering of predicted structure

PSIPRED

• Uses multiple aligned sequences for prediction.

• The sequence weights were created using a training set of folds with known structure.

• Uses a two-stage neural network to predict structure based on position-specific scoring matrices generated by PSI-BLAST (Jones, 1999).

• The first network converts a window of 15 amino acids into a raw score for h (helix), e (sheet), c (coil) or terminus.

• The second network filters the first output. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.

• Obtained a Q3 value of 70-78% (may be the highest achievable).

What is Q3?

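$$Q_3 = \frac{\text{Predicted Output}}{\text{Actual Output}} \times 100$$

Here "Predicted Output" is the number of residues whose predicted state (helix, sheet or coil) matches the actual state, and "Actual Output" is the total number of residues.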
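As a quick illustration of the Q3 calculation, the sketch below compares a predicted string with the actual one, residue by residue. The function name `q3` and the example strings are assumptions made for this example.

```python
def q3(predicted, actual):
    """Percentage of residues whose predicted state matches the actual state."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# One residue out of nine is wrong, so eight of nine are correct.
print(round(q3("hhhhehhhh", "hhhhhhhhh"), 1))
```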

Neural networks

• Computer neural networks are based on simulation of adaptive learning in networks of real neurons.

• Neurons connect to each other via synaptic junctions which are either stimulatory or inhibitory.

• Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.

Neural Networks (cont. 1)

• The computer version of the neural network involves identification of a set of inputs (amino acids in the sequence), which transmit through a network of connections.

• At each layer, inputs are numerically weighted and the combined result is passed to the next layer.

• Ultimately a final output, a decision (helix, sheet or coil), is produced.
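The layer-by-layer weighting described above can be sketched as a tiny feed-forward pass. This is a generic illustration, not PSIPRED's actual network: the sigmoid activation, the layer sizes and all weights below are assumptions chosen to keep the example small.

```python
import math


def layer(inputs, weights, biases):
    """Weight the inputs for each unit, add a bias, and squash with a sigmoid."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(unit_weights, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def decide(inputs, hidden_w, hidden_b, out_w, out_b,
           labels=("helix", "sheet", "coil")):
    """Pass inputs through one hidden layer and return the winning label."""
    hidden = layer(inputs, hidden_w, hidden_b)
    scores = layer(hidden, out_w, out_b)
    return labels[scores.index(max(scores))]


# Hypothetical weights: two inputs, two hidden units, three output units.
hidden_w, hidden_b = [[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]
out_w, out_b = [[3.0, -3.0], [-3.0, 3.0], [0.0, 0.0]], [0.0, 0.0, 0.0]
print(decide([1.0, 0.0], hidden_w, hidden_b, out_w, out_b))
```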

PSIPRED

90% of the training set (known structures) was used for training. The remaining 10% was used to evaluate the performance of the neural network during the training session.

PSIPRED

•During the training phase, selected sets of proteins of known structure are scanned, and if the decisions are incorrect, the input weightings are adjusted by the software to produce the desired result.

•Training runs are repeated until the success rate is maximized.

•Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible without duplications of structural types that may bias the decisions.
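The idea of adjusting weights whenever a decision is wrong, and repeating until the success rate stops improving, can be sketched with a simple multi-class perceptron rule. This rule is a stand-in chosen for brevity; the slides do not specify how PSIPRED updates its weights, and the feature vectors below are hypothetical.

```python
def predict(weights, features):
    """Return the label whose weight vector gives the highest weighted sum."""
    scores = {label: sum(w * x for w, x in zip(ws, features))
              for label, ws in weights.items()}
    return max(scores, key=scores.get)


def train(examples, labels, n_features, epochs=20, rate=0.1):
    """Repeat training runs, adjusting weights only after wrong decisions."""
    weights = {label: [0.0] * n_features for label in labels}
    for _ in range(epochs):
        errors = 0
        for features, target in examples:
            guess = predict(weights, features)
            if guess != target:
                errors += 1
                for i, x in enumerate(features):
                    weights[target][i] += rate * x  # strengthen the right answer
                    weights[guess][i] -= rate * x   # weaken the wrong answer
        if errors == 0:  # every training example is now decided correctly
            break
    return weights


# Toy training set (hypothetical features), one example per state.
examples = [([1, 0, 0], "H"), ([0, 1, 0], "E"), ([0, 0, 1], "C")]
trained = train(examples, ("H", "E", "C"), n_features=3)
print([predict(trained, f) for f, _ in examples])
```

In practice a held-out portion of the known structures would be scored after each run to judge when to stop.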

PSIPRED

•An additional component of the PSIPRED procedures involves sequence alignment with similar proteins.

•The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.)

•To predict secondary structure accurately, one should place little weight on the tolerant positions, which clearly contribute little to the structure, and strongly emphasize the intolerant positions.
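One rough way to see which positions are tolerant or intolerant is to measure how conserved each column of a multiple alignment is. The sketch below uses the fraction of sequences sharing the most common residue as a simple stand-in; PSIPRED itself takes this information from PSI-BLAST scoring matrices, and the toy alignment is hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue, per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores


# Column 1 is invariant (intolerant); column 3 varies freely (tolerant).
print(column_conservation(["ACD", "ACE", "AGF"]))
```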

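[Figure: PSIPRED network architecture. The input has 15 groups of 21 units (1 unit for each amino acid plus one specifying the end). Each row specifies an amino acid position and provides information on tolerant or intolerant positions. The three outputs are helix, strand or coil, which pass to the filtering network.]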
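The input layout of 15 groups of 21 units can be sketched as follows. This is a simplification: the sketch uses a 0/1 encoding of a single sequence, whereas PSIPRED feeds profile scores from PSI-BLAST into the units. The names `encode_window`, `HALF` and `UNITS` are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
HALF = 7                               # 7 residues either side gives a window of 15
UNITS = len(AMINO_ACIDS) + 1           # 20 residue units plus 1 "end" unit


def encode_window(seq, j):
    """Return a 15 x 21 list of 0/1 values for the window centred on j."""
    rows = []
    for k in range(j - HALF, j + HALF + 1):
        row = [0] * UNITS
        if 0 <= k < len(seq):
            row[AMINO_ACIDS.index(seq[k])] = 1  # raises ValueError for other letters
        else:
            row[-1] = 1  # window hangs past the chain terminus
        rows.append(row)
    return rows


window = encode_window("KDIQLLNVSYDPTRELYEQY", 0)
print(len(window), len(window[0]))
```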

Example of Output from PSIPRED

PSIPRED PREDICTION RESULTS

Key

Conf: Confidence (0=low, 9=high)

Pred: Predicted secondary structure (H=helix, E=strand, C=coil)

AA: Target sequence

Conf: 923788850068899998538983213555268822788714786424388875156215

Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC

AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD

10 20 30 40 50 60
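Output in this layout is easy to post-process. The sketch below, with hypothetical helper names, groups the Pred line into runs of identical states and averages the Conf digits over a run; the Pred string is copied from the example above.

```python
from itertools import groupby


def segments(pred):
    """Return (state, start, end) runs with 1-based inclusive positions."""
    runs, pos = [], 1
    for state, group in groupby(pred):
        length = len(list(group))
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


def mean_confidence(conf, start, end):
    """Average the 0-9 confidence digits over positions start..end (1-based)."""
    digits = [int(c) for c in conf[start - 1:end]]
    return sum(digits) / len(digits)


pred = "CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC"
for state, start, end in segments(pred):
    if state == "H":
        print("helix", start, end)
```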

Recognizing motifs in proteins.

PROSITE is a database of protein families and domains.

Most proteins can be grouped, on the basis of similarities in their sequences, into a limited number of families.

Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

PROSITE Database

Contains 1087 different proteins and more than 1400 different patterns/motifs or signatures.

A “signature” of a protein allows one to assign a protein to a specific family based on structure and/or function.

An example of an entry in PROSITE is: http://ca.expasy.org/cgi-bin/nicedoc.pl?PDOC50020

How are the profiles constructed in the first place?

ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..

Sequences are aligned manually by an expert in the field. Then a profile is created.

A-T-H-[DE]-X-V-X(4)-{ED}

This pattern is translated as: Ala, Thr, His, [Asp or Glu], any, Val, any, any, any, any, any but not Glu or Asp
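A pattern written in this syntax maps closely onto a regular expression. The sketch below handles only the elements used in these slides (single residues, X, [..], {..} and repeat counts); the function name and the test sequence are assumptions made for the example.

```python
import re

# One pattern element: a residue, X, [allowed] or {forbidden}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as A-T-H-[DE]-X-V-X(4)-{ED} to a regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {ED} means any residue except E or D
        elif core in ("x", "X"):
            core = "."                      # X means any residue
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
hit = re.search(regex, "ALRDFATHDDVCGKSMTAEA")  # made-up test sequence
print(regex, hit.start(), hit.group())
```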

Example of a PROSITE record

ID ZINC_FINGER_C3HC4; PATTERN.

PA C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]

PROSITE Database

FindProfile is a program that searches the PROSITE database. It uses dynamic programming to determine optimal alignments. If the alignment produces a high score, the match is reported. When a “hit” is obtained, the program gives an output that shows the region of the query that contains the pattern and a reference to the 3-D structure database, if available.

Example of output from FindProfile

Other algorithms that search for protein patterns.

• BLIMPS: a program that uses a query sequence to search the BLOCKS database (written by Bill Alford).

• BLOCKS: a database of multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks that comprise the BLOCKS database are made automatically by searching for the most highly conserved regions in groups of proteins documented in the PROSITE database and other databases.

Example of entry in BLOCKS database

ID   p99.1.2414; BLOCK
AC   BP02414A; distance from previous block=(29,215)
DE   PROTEIN ZINC-FINGER NUCLEAR FIN
BL   LCC; width=27; seqs=8; 99.5%=1080; strength=1292
RPT1_MOUSE|P15533 ( 101) EKLRLFCRKDMMVICWLCERSQEHRGH  62
Y129_HUMAN|Q14142 (  30) RVAELFCRRCRRCVCALCPVLGAHRGH 100
RFP_HUMAN|P14373  ( 101) EPLKLYCEEDQMPICVVCDRSREHRGH  49
RFP_MOUSE|Q62158  ( 110) EPLKLYCEQDQMPICVVCDRSREHRDH  51
RO52_HUMAN|P19474 (  97) ERLHLFCEKDGKALCWVCAQSRKHRDH  54
RO52_MOUSE|Q62191 ( 101) EKLHLFCEEDGQALCWVCAQSGKHRDH  52
TF1B_HUMAN|Q13263 ( 215) EPLVLFCESCDTLTCRDCQLNAHKDHQ  65
TF1B_MOUSE|Q62318 ( 216) EPLVLFCESCDTLTCRDCQLNAHKDHQ  65

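Annotations on the entry:

• strength: median of standardized scores for true positives

• distance=(29,215): min and max distance from the previous block

• DE line: family description

• Number after each segment: sequence weight (higher number is more distant)

• Number in parentheses: start position of the sequence segment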

How does BLIMPS search the BLOCKS database?

• It transforms each block into a position-specific scoring matrix (PSSM).

• Each PSSM column corresponds to a block position and contains values based on frequency of occurrence at that position.

• A comparison is made between the query sequence and the block by sliding the PSSM over the query.

• For every alignment, each sequence position receives a score.

• This sliding window procedure is repeated for all blocks in the database.
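The sliding comparison described above can be sketched as follows. This is a simplified illustration: it scores with raw column frequencies, whereas BLIMPS applies sequence weights and its own score scaling, and the toy block and query are hypothetical.

```python
from collections import Counter


def build_pssm(block):
    """One dict per block column, mapping residue -> relative frequency."""
    n = len(block)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*block)]


def best_hit(pssm, query):
    """Slide the PSSM along the query; return (best_score, 0-based start)."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best  # None when the query is shorter than the block


toy_block = ["ATHD", "ATHE", "ATHD"]  # ungapped aligned segments
pssm = build_pssm(toy_block)
print(best_hit(pssm, "GGATHDGG"))
```

Repeating `best_hit` for every block in a collection gives the database search.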

Example of a pattern search using BLIMPS

Note that any score less than 1000 may be due to chance. A score above 1000 is better than 95.5% of the true negatives.

3D structure data

• The largest 3D structure database is the Protein Data Bank (PDB).

• It contains over 15,000 records.

• Each record contains 3D coordinates for macromolecules.

• 80% of the records were obtained from X-ray diffraction studies, 16% from NMR, and the rest from other methods and theoretical calculations.

ATOM 1 N ARG A 14 22.451 98.825 31.990

ATOM 2 CA ARG A 14 21.713 100.102 31.828

ATOM 3 C ARG A 14 22.583 101.018 30.979

ATOM 4 O ARG A 14 22.105 101.989 30.391

ATOM 5 CB ARG A 14 21.424 100.704 33.208

ATOM 6 CG ARG A 14 20.465 101.880 33.215

ATOM 7 CD ARG A 14 20.008 102.147 34.637

ATOM 8 NE ARG A 14 18.999 103.196 34.718

ATOM 9 CZ ARG A 14 18.344 103.507 35.833

ATOM 10 NH1 ARG A 14 18.580 102.835 36.952

ATOM 11 NH2 ARG A 14 17.441 104.479 35.827

Part of a record from the PDB
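Coordinates like these can be read directly into a program. The sketch below splits the excerpt on whitespace, which works for the lines shown here; full PDB files use fixed-width columns, so this parsing approach is an assumption that may not hold for every record. The function names are hypothetical.

```python
import math


def parse_atom(line):
    """Split a whitespace-separated ATOM line, as in the excerpt, into fields."""
    fields = line.split()
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "chain": fields[4],
        "res_num": int(fields[5]),
        "xyz": tuple(float(value) for value in fields[6:9]),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


n_atom = parse_atom("ATOM 1 N ARG A 14 22.451 98.825 31.990")
ca_atom = parse_atom("ATOM 2 CA ARG A 14 21.713 100.102 31.828")
print(round(distance(n_atom, ca_atom), 2))
```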
