conformation of proteins


ANALYTICAL BIOCHEMISTRY 235, 1–10 (1996)
ARTICLE NO. 0084

REVIEW

Methods to Estimate the Conformation of Proteins and Polypeptides from Circular Dichroism Data

Norma J. Greenfield
Department of Neuroscience and Cell Biology, UMDNJ–Robert Wood Johnson Medical School, 675 Hoes Lane, Piscataway, New Jersey 08854-5635

Received August 23, 1995

Circular dichroism (CD) is an excellent method for analyzing the conformation of proteins and peptides in solution. This review compares various methods of obtaining structural information from CD data and details the advantages and pitfalls of each technique. Among the topics discussed are how the wavelength range of data acquisition affects the precision of the determination of protein conformation, how precisely the protein concentration must be determined for each method to give reliable answers, and what computer resources are necessary to use each method. © 1996 Academic Press, Inc.

Circular dichroism spectroscopy (CD)¹ is a technique valuable for analyzing the secondary structure of proteins in solution. This article is designed to acquaint nonexperts with some of the modern methods used to extract structural information from CD spectra. In the body of the review the following are discussed: i, what contributes to the CD spectrum of a protein or polypeptide; ii, how samples must be prepared for CD analysis; iii, what the commonly used methods for extracting secondary structural information from CD data are, and what computer resources are needed to use each method; iv, how precisely the protein concentration must be determined for each method to give reliable answers; and v, how the wavelength range affects the precision of the answers obtained using each method. For further reading, there are many excellent review articles in the literature on the theory and use of CD (1–6).

¹ Abbreviations used: CD, circular dichroism; P2, poly-L-proline II; SVD, singular value decomposition; CCA, convex constraint analysis.

THE ORIGIN OF CIRCULAR DICHROIC ACTIVITY OF PROTEINS

Circular dichroism is a phenomenon that results when chromophores in an asymmetrical environment interact with polarized light. In proteins the major optically active groups are the amide bonds of the peptide backbone and the aromatic side chains. Polypeptides and proteins have regions where the peptide chromophores are in highly ordered arrays, such as α-helices or β-pleated sheets. Depending on the orientation of the peptide bonds in the arrays, the optical transitions of the amide bond can be split into multiple transitions, the wavelengths of the transitions can be increased or decreased, and the intensity of the transitions can be enhanced or decreased. As a consequence, many common secondary structure motifs, such as the α-helix, β-pleated sheets, β-turns, and poly-L-proline II (P2), have very characteristic CD spectra. The left-handed helical P2 conformation is found in collagen and has recently been identified in short segments of some globular proteins (7, 8). Spectra of representative polypeptides with these conformations are shown in Fig. 1.

SAMPLE PREPARATION FOR CD MEASUREMENTS

For meaningful CD analyses, samples must be free of contaminating proteins, which might contribute to the final spectrum, and other optically active impurities, such as nucleotides or optically active buffers (e.g., glutamate). CD measurements should be made on samples with a maximum absorbance (including buffer) of 1.0 at the wavelength region of interest. Amide bonds have optical transitions in the ultraviolet region below 250 nm. One can obtain useful estimates of protein conformation from data obtained only between 240 and 200 nm (see below), which means that proteins may

0003-2697/96 $18.00. Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.


be examined in physiological buffers (i.e., phosphate-buffered saline with 1 to 2 mM EDTA and/or 1 to 2 mM dithiothreitol) at concentrations of approximately 0.1 to 0.4 mg/ml in cuvettes with a pathlength of 1 mm. Low concentrations of organic buffers, e.g., 2 mM Hepes, are also permissible. The information content of CD spectra, however, and therefore the precision of the structural estimates increase when the lower limit of the wavelength range is extended to the far ultraviolet. For the greatest precision it is recommended to collect data between 260 and 178 nm or even 168 nm (6, 9). For measurements below 195 nm it is necessary to use very transparent buffers, such as 10 mM potassium phosphate, and cuvettes with very short pathlengths (0.05 to 0.1 mm). Adler et al. (1) and Johnson (6) discuss sample preparation and circular dichroism instrumentation in full detail.

FIG. 1. Circular dichroism spectra of polypeptides in the α-helical, β-pleated sheet, β-turn, and P2 conformations. The α-helix, β-sheet, and β-turn spectra are redrawn from Brahms and Brahms (19); the P2 spectrum is that of poly-L-proline in 0.1 M acetic acid.

Circular dichroism is a quantitative spectroscopic technique. The various secondary structures have ellipticity bands with both characteristic wavelengths and magnitudes. Therefore, with the exception of nonconstrained least-squares analysis (see below) all of the methods of CD analysis require a precise knowledge of protein concentration. For example, the method of Bradford (10) is not acceptable because the results are dependent on the aromatic content of the protein, and detergents can interfere with the analyses. The aromatic absorption spectrum of a protein depends on its conformation, so measurements of the absorbance of native proteins at 280 nm may only be used to determine their concentrations when the extinction coefficients have been determined precisely. In addition, oxidized dithiothreitol and 2-mercaptoethanol and light scattering all can increase the apparent absorbance of a protein solution at 280 nm. Recommended methods of determining protein concentration include (i) quantitative amino acid analysis and (ii) determination of peptide backbone concentration by the measurement of biuret (11) (note that reducing agents interfere with this assay) or total nitrogen (12). It is also possible to use the aromatic spectrum of the protein to measure its concentration, provided the spectrum is obtained under denaturing conditions (13, 14).

METHODS TO ANALYZE PROTEIN CONFORMATION

There are many methods to extract protein conformation in solution from CD data in the literature. Basically, all of these methods assume that the spectrum of a protein can be represented by a linear combination of the spectra of the secondary structural elements, plus a noise term which includes the contribution of aromatic chromophores, as given in Eq. [1]:

$$\theta_\lambda = \sum_i F_i S_{\lambda i} + \text{noise} \qquad [1]$$

where $\theta_\lambda$ is the CD of the protein as a function of wavelength, $F_i$ is the fraction of each secondary structure, $i$, and $S_{\lambda i}$ is the ellipticity at each wavelength of each $i$th secondary structural element. In constrained fits the sum of all the fractional weights, $\sum_i F_i$, must be equal to 1.

The major methods of extracting structural information from CD spectra (more or less in historical order) are i, multilinear regression (15–19); ii, singular value decomposition (20, 21); iii, ridge regression (22); iv, convex constraint analysis (23–25); v, neural network analysis (26–28); and vi, the self-consistent method (8, 29, 30). These methods are detailed below. Representative computer programs using these methods, which will run on IBM-compatible computers, are available on a diskette (see the Appendix).

To compare the accuracy of the estimation of protein conformation from CD data, all of the above methods were used to evaluate the α-helical, total β-pleated sheet, and β-turn content of the same set of proteins. This set consisted of 16 proteins plus poly-L-glutamate that were suggested as standards by Sreerama and Woody (8, 30), who assigned their secondary structure from X-ray coordinates using the method of Kabsch and Sander (31). The results are shown in Table 1. The effects of truncating the data between 240 and 200 nm are also shown in Table 1. Sreerama and Woody (8, 30) analyzed the conformations of the proteins in Table 1 using some of the methods including singular value decomposition (SVD), ridge regression (the CONTIN program), variable selection (the VARSLC program), a self-consistent method (the SELCON program), and neural networks, alone and in combination. Their results for the individual methods are summarized in Table 1. Table 1 also shows the fits obtained using


TABLE 1

Comparisons of Methods of Analyzing Protein Conformation from Circular Dichroism Data

Program  Data base             Assignment (i)  Wavelength (nm)  α-Helix       β-Sheet       β-Turn        Reference
                                                                P      s      P      s      P      s

Linear regression: nonconstrained fit
MLR      4 peptides (a)        KS              240–178          0.91   0.13   0.43   0.21   0.07   0.16
MLR      4 peptides (a)        KS              240–200          0.92   0.14   0.74   0.16   0.23   0.16

Linear regression: constrained fit
G&F      Poly-L-lysine (b)     KS              240–208          0.92   0.13   0.61   0.18   ND     ND
LINCOMB  4 peptides (a)        KS              240–178          0.93   0.11   0.58   0.15   0.61   0.11
LINCOMB  4 peptides (a)        KS              240–200          0.94   0.11   0.71   0.13   0.53   0.14
LINCOMB  4 peptides (a)        KS              240–208          0.94   0.12   0.52   0.16   0.57   0.13
LINCOMB  15 proteins (c)       KS              240–190          0.95   0.12   0.83   0.26   0.14   0.19
LINCOMB  15 proteins (c)       LG              240–190          0.89   0.17   0.89   0.15   0.05   0.17
LINCOMB  17 proteins (d,e)     KS              240–178          0.94   0.09   0.62   0.14   0.21   0.13
LINCOMB  17 proteins (d,e)     KS              240–200          0.92   0.10   0.09   0.28   0.52   0.12
LINCOMB  33 proteins (f)       HJ              240–178          0.96   0.08   0.73   0.13   0.38   0.11
LINCOMB  33 proteins (f)       HJ              240–200          0.95   0.08   0.57   0.15   0.11   0.18
LINCOMB  33 proteins (f)       KS              240–200          0.95   0.08   0.56   0.16   0.25   0.17
LINCOMB  23 proteins (g)       KS              240–195          0.90   0.17   0.70   0.12  −0.27   0.22

Singular value decomposition
SVD      17 proteins (k,e)     KS              240–178          0.98   0.05   0.68   0.12   0.22   0.10
SVD      17 proteins (k,e)     KS              240–200          0.97   0.08   0.00   0.27  −0.56   0.27
SVD      17 proteins (k,e,j)   KS              240–200          0.96   0.07   0.43   0.14   0.04   0.13

Convex constraint algorithm
CCA      17 proteins (k)       KS              260–178          0.96   0.10   0.62   0.18   0.39   0.18
CCA      17 proteins (k)       KS              240–200          0.97   0.10   0.42   0.20   0.52   0.22

Ridge regression
CONTIN   17 proteins (k,e)     KS              260–178          0.93   0.11   0.56   0.15   0.58   0.08   (k)
CONTIN   17 proteins (k,e)     KS              240–200          0.95   0.13   0.60   0.15   0.74   0.07

Variable selection
VARSLC   17 proteins (k,e)     KS              260–178          0.97   0.07   0.81   0.10   0.60   0.07   (k)

Variable selection–self-consistent
SELCON   17 proteins (k,e)     KS              260–178          0.95   0.09   0.84   0.08   0.77   0.05   (l)
SELCON   17 proteins (k,e)     KS              260–190          0.94   0.09   0.73   0.09   0.84   0.05   (l)
SELCON   17 proteins (k,e)     KS              240–200          0.93   0.10   0.73   0.11   0.71   0.06
SELCON   33 proteins (m,e)     KS              260–178          0.93   0.09   0.91   0.07   0.53   0.09
SELCON   33 proteins (m,e)     KS              240–200          0.88   0.12   0.86   0.09   0.46   0.09
SELCON   33 proteins (m,e)     HJ              260–178          0.93   0.10   0.85   0.09   0.43   0.09
SELCON   33 proteins (m,e)     HJ              240–200          0.88   0.13   0.77   0.11   0.36   0.09

Neural nets
NN       17 proteins (k,e)     KS              260–178          0.93   0.10   0.73   0.11   0.82   0.05   (k)
K2D      19 proteins (h,e)     KS              240–200          0.95   0.09   0.77   0.10   ND

Mean conformation ± standard deviation of the 17 test proteins: α-helix, 0.36 ± 0.27; β-sheet, 0.20 ± 0.16; β-turn, 0.22 ± 0.08.

ND, not determined.

(a) The spectra of the α-helix (sperm whale myoglobin corrected for the contributions of turns and random coil and normalized to 1.0) in 0.1 M NaF, pH 7, β-sheet (poly(lys-leu)n, in 0.5 M NaF at pH 7), random coil (poly(pro-lys-leu-lys-leu)n in salt free solution), and β-turn (poly(ala2,gly)n in water multiplied by 0.5) (from Brahms and Brahms (19)).

(b) The spectra of poly-L-lysine in the α-helical, β-pleated sheet, and random-coil conformations (15).

(c) Standard curves for α-helix, β-structure, random coil, and β-turn extracted from 15 proteins by multilinear regression by Yang et al. (3).

(d) Standard curves for α-helix, total β-sheet, β-turn, and remainder extracted as described by Yang et al. (3) from the 16 proteins plus poly-L-glutamate utilized as standards by Sreerama and Woody (8, 30).

(e) Each protein analyzed was excluded in turn from the data set.

(f) Standard curves for α-helix, antiparallel β-sheet, parallel β-sheet, β-turn, and random coil extracted as described in footnote d from the spectra of 33 proteins (data supplied by A. Tourmadje and W. C. Johnson Jr.).

(g) Standard curves for α-helix, β-turn and/or parallel β-pleated sheet, aromatic and disulfides, unordered, and antiparallel β-pleated sheet extracted from a data base of 23 proteins by convex constraint analysis by Perczel et al. (25).

(h) Data base of Andrade et al. (27).

(i) The assignments of secondary structure are by the methods of Kabsch and Sander (31) [KS], Levitt and Greer (32) [LG], and Hennessey and Johnson (20) [HJ].

(j) The values for poly-L-glutamate were omitted from the calculations of the correlation coefficient and mean square error because the sum of its conformations was >4, a clearly impossible answer.

(k) Sreerama and Woody (30).

(l) Sreerama and Woody (8).

(m) Data supplied by A. Tourmadje and W. C. Johnson Jr.
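As a rough illustration of the two statistics reported in Table 1, the sketch below computes a correlation coefficient and an error term between estimated and X-ray-derived fractions. It rests on assumptions the article does not spell out: that P is an ordinary Pearson correlation and that s is a root-mean-square difference. The function names and the four-protein data set are hypothetical. The last two lines apply the article's rule of thumb (a method predicts a conformation poorly when s is not lower than the standard deviation of that conformation across the test proteins) to β-sheet values copied from Table 1.

```python
import math

def correlation_coefficient(estimated, found):
    """Pearson correlation between estimated and X-ray-derived fractions
    (assumed to be what the 'P' columns report)."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_f = sum(found) / n
    cov = sum((e - mean_e) * (f - mean_f) for e, f in zip(estimated, found))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_f = sum((f - mean_f) ** 2 for f in found)
    return cov / math.sqrt(var_e * var_f)

def rms_error(estimated, found):
    """Root-mean-square difference (assumed meaning of the 's' columns)."""
    n = len(estimated)
    return math.sqrt(sum((e - f) ** 2 for e, f in zip(estimated, found)) / n)

def predicts_poorly(s, sd_of_test_set):
    """Rule of thumb from the text: poor if s is not below the spread."""
    return s >= sd_of_test_set

# Hypothetical fractions for four proteins (illustration only, not real data).
found = [0.10, 0.30, 0.50, 0.70]
estimated = [0.15, 0.25, 0.55, 0.65]
r = correlation_coefficient(estimated, found)
s = rms_error(estimated, found)

# beta-sheet s values from Table 1: SVD at 240-200 nm (0.27) and SELCON at
# 260-178 nm (0.08), against the beta-sheet standard deviation of 0.16.
svd_truncated_poor = predicts_poorly(0.27, 0.16)
selcon_full_poor = predicts_poorly(0.08, 0.16)
```

Under these assumptions the truncated SVD fit is flagged as a poor predictor of β-sheet while the full-range SELCON fit is not, which matches the qualitative reading of the table.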


simple multilinear regression (the G&F, LINCOMB, and MLR programs) with a variety of peptide and protein data bases as references, convex constraint analysis (the CCA program), and a neural net retrieval program (K2D). When the data bases used by these methods were constructed using other methods of analyzing the secondary structure of proteins [i.e., the methods of Levitt and Greer (32) or Hennessey and Johnson (20)], the fits are also compared to the X-ray structures calculated by those methods.

Table 1 lists the correlation coefficients, P, and the mean square errors, s, between the estimated and found contents of each secondary structure. The standard deviations from the average values of each conformation are also shown in Table 1. When the value of s for a particular conformation is not lower than the standard deviation from the mean value, it indicates that the method does a poor job of predicting that conformation. All of the methods did a good job of predicting α-helix, but they varied greatly in their ability to estimate β-content and turns (see below).

The properties of several computer programs for analyzing CD spectra, which utilize each of the methods, are summarized in Table 2. Computer resources necessary for each analysis (i.e., type of microprocessor required), whether the program provides graphical comparison of the raw data and the fitted curve, the minimum wavelength range required for the analysis, and the relative time required for a single analysis are tabulated.

TABLE 2

Properties of Computer Programs for Analyzing the Secondary Structure of Proteins and Polypeptides in Solution (a)

Program   Minimum computer   Output fitted   Recommended wavelength   Minimum time of
name      required           curve?          range (nm)               analysis (min) (b)

MLR       Any PC             Yes             240–200 (c)              1
G&F       Any PC             Yes             240–208                  <1
LINCOMB   Any PC             Yes             240–200 (c)              1
CONTIN    Any PC             Yes             240–200 (c)              1
VARSLC    386 (d)            No              260–184                  10
SELCON    386 (d)            Yes             260–200 (c)              2
CCA       286                No (e)          240–200 (c)              2 (f)
K2D       286                Yes             240–200                  1

(a) Programs as supplied run on IBM-compatible computers. The programs which output the fitted curves require a CGA-compatible graphics card.

(b) Programs were tested on an IBM-compatible PC with an 80486 microprocessor operating at 33 MHz.

(c) Fits may be improved by collecting data to shorter wavelengths.

(d) Programs may also be compiled and run on any computer with a FORTRAN 77 compiler.

(e) Theoretical curves can be constructed by summing the basis spectra multiplied by their fractional weights.

(f) Time required for the deconvolution of 17 data sets containing 83 data points each into five basis curves.

METHODS FOR ANALYZING CD DATA

Measurements of Helical Content at a Single Wavelength

Measurements at single wavelengths are useful to follow the kinetics and thermodynamics of the folding of polypeptides and proteins. Representative equations for calculating helical content from the ellipticity at 222 nm (33) and at 208 nm (15) have been described. The major advantage of using single wavelengths is that data can be collected rapidly. The disadvantage is that the information content of measurements at a single wavelength is limited and other conformations such as β-sheet and turns and aromatic chromophores (1, 34) may interfere with the estimation of α-helical content.

Estimating the Secondary Structure of Proteins and Polypeptides from CD Spectra by Multilinear Regression

The simplest methods of analysis of protein secondary structure from CD spectra fit the data to be analyzed to the spectra of standards by the method of least squares (multiple linear regression). In the earliest work, the spectra of polypeptides in known conformations were used as standards (15, 19). Later, when the conformation of a large number of proteins had been determined from X-ray crystallographic analysis, the CD spectra of these proteins were deconvoluted into basis spectra for the α-helix, antiparallel and parallel β-pleated sheets, β-turn, and random conformations by multiple linear regression analysis, and these extracted curves were used as standards (3, 16–18, 35). In constrained least-squares fits (15) the sum of $F_i$ must equal 1 (100%). In nonconstrained fits (19), the coefficients may be normalized to 100% after the fit is obtained. Two computer programs for performing constrained least-squares analysis (the LINCOMB and G&F programs) and one for performing a nonconstrained analysis (the MLR program) are available on a diskette (see the Appendix).

Nonconstrained least-squares analysis (MLR). Nonconstrained least-squares analysis, i.e., multilinear regression, is the only method which can be used to estimate conformation when protein concentration is not known precisely. Using the spectra of the polypeptide models suggested by Brahms and Brahms (19) as standards, the method gives a reasonable estimate of α-helical content, and there is some correlation between the calculated and found amount of β-structure, but the estimate of β-turns is very poor. The method is adequate, however, to indicate whether organic solvents, membranes, or ligands increase or decrease the helicity of a peptide or protein.

Constrained least-squares analysis (G&F and LINCOMB). Constraining the sum of the fractional weights to equal 1 improves the estimate of β-sheet and β-turns when the method of least squares is employed. Use of the polypeptide standard curves of Brahms and Brahms (19) appears to give better estimates of protein secondary structure than use of standard curves extracted from the protein data bases (3, 18). There are, however, several drawbacks to using the simple method of least squares, which analyzes the CD at each wavelength independently, to extract reference curves from a collection of protein spectra. First, the contributions of aromatic groups and structures other than α-helix, β-sheets, turns, and random coils are ignored. Second, in several wavelength regions, the spectra of the various conformations are not well distinguished from one another, and this can lead to large errors in the deconvolution of the spectra into standard basis curves. In addition, at least four types of β-turns have been identified, and several of these have spectra which are very different from one another (19, 36, 37). Thus, it is simplistic to try to extract the spectrum of a "generic" turn from a data base using the method of least squares.

The estimates of protein structure obtained using the least-squares programs are not as good as those determined by more modern methods (see Table 1). The programs, however, do have their uses. First, they all output the calculated and experimental curves graphically, so that one can directly observe whether the calculated fit is a good match to the raw data. In addition, they may be used when data are obtained over a fairly limited wavelength range (240–200 nm) with only slight loss of accuracy. The use of polypeptide standards has the benefit that the fits obtained are not biased by the choice of proteins or the methods used to translate X-ray coordinates into secondary structures.

Singular Value Decomposition (SVD)

Hennessey and Johnson (20) suggested that SVD would be a better method of extracting information from a data base of protein CD spectra than the method of least squares. SVD is an eigenvector method of multicomponent analysis, which may be used to extract orthogonal basis curves from a set of spectra. After deconvolution, each basis curve, which has a unique shape, is related to a known mixture of secondary structures. The basis spectra are then used to analyze the conformation of unknown proteins. In SVD the sum of the fractional weights of each conformation is not constrained to equal 1. To have enough information to be used for conformational analysis, each basis spectrum must have unique maxima and minima called nodes. To ensure that the spectra have sufficient nodes for successful deconvolution, Hennessey and Johnson (20) caution that the method cannot be used with a data set truncated above 184 nm. When the data set of 16 proteins plus poly-L-glutamate is analyzed by SVD, the fits improve for α-helix, when compared to multilinear regression, are unchanged for total β-structure, and are poorer for the estimate of turns (see Table 1).

Convex Constraint Analysis

Perczel et al. (23–25) developed an algorithm called convex constraint analysis (CCA), which, similar to SVD, deconvolutes a data base of spectra into components, but has different criteria for defining the basis curves. In CCA, the sum of the fractional weights of each component spectrum is constrained to be equal to 1. In addition, a constraint called volume minimization is defined which allows a finite number of component curves to be extracted from a set of spectra without relying on spectral nodes. CCA does not use X-ray crystallographic data in the deconvolution procedure. Once the basis curves are obtained, they must be assigned to specific secondary structures by correlating the fractional weight of each basis spectrum with the fractional weight of each conformation of known proteins in the data set that has been deconvoluted.

When the test data set is deconvoluted into 6 curves using the CCA algorithm, two of the basis curves have spectra similar to the spectra of α-helical peptides and their fractional weights must be added to estimate the total helical content. The estimate of total helix is very good and is independent of the wavelength range examined. The estimates of β-turns and β-pleated sheets, however, are poor compared to the other methods. It is difficult to relate the basis spectra extracted from a protein data base by CCA to specific conformations, however, because the secondary structures in a protein are not truly independent of one another. For example, for the 16 proteins in Table 1, the correlation coefficient of the fraction of antiparallel β-pleated sheets with the fractions of β-turns is 0.34 and between the unordered fractions and β-turns is 0.42. Thus, it is almost impossible to decide whether a given basis spectrum corresponds to the β-turns, antiparallel β-sheet, or unordered conformations in the protein data set, without additional outside information about the CD spectra of these conformations.

While CCA is difficult to use for the a priori analysis of the secondary structure of unknown proteins, it is ideal for examining the spectra of proteins and polypeptides as a function of temperature, pH, or ligand binding. CCA easily determines the minimum number of CD spectra necessary to reconstruct all of the observed spectra and quantifies the fractional weight of each component spectrum in the data set.
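The constraint that the fractional weights sum to 1, which appears in both the constrained least-squares fits and CCA, can be made concrete with a minimal sketch. It is not the G&F, LINCOMB, or CCA program: it assumes only two components (labeled helix and coil here), so that substituting $F_b = 1 - F_a$ into Eq. [1] leaves a single unknown with a closed-form least-squares solution. The function names and the three-point reference spectra are hypothetical, and the numbers are arbitrary rather than measured ellipticities.

```python
def fit_two_component(observed, ref_a, ref_b):
    """Least-squares fractions of two components under F_a + F_b = 1.

    With F_b = 1 - F_a, Eq. [1] becomes
        observed - ref_b = F_a * (ref_a - ref_b) + noise,
    a one-parameter regression through the origin.
    """
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    resid = [o - b for o, b in zip(observed, ref_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("reference spectra are identical")
    f_a = sum(r * d for r, d in zip(resid, diff)) / denom
    return f_a, 1.0 - f_a

def reconstruct(f_a, f_b, ref_a, ref_b):
    """Fitted curve: weighted sum of the two reference spectra."""
    return [f_a * a + f_b * b for a, b in zip(ref_a, ref_b)]

# Hypothetical reference values at three wavelengths (arbitrary units).
ref_helix = [-30.0, -32.0, 70.0]
ref_coil = [-2.0, -5.0, -30.0]

# Synthetic noise-free "spectrum" that is 40% helix and 60% coil.
mixture = reconstruct(0.4, 0.6, ref_helix, ref_coil)
f_helix, f_coil = fit_two_component(mixture, ref_helix, ref_coil)
```

Because the constraint is built into the parameterization, the two fractions always sum to 1, but nothing here keeps them between 0 and 1; with noisy data or poorly chosen references a fraction can fall outside that range, which is one way a bad fit shows itself.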


Selection Methods

The basis curves obtained from multilinear regression, singular value decomposition, or convex constraint analysis of a set of CD spectra may change greatly depending on the choice of proteins used as standards in the data base. This occurs because some proteins, whose conformations are known, may have unusual CD spectra, due to aromatic amino acids, disulfide bridges, or rare conformations (38). To overcome these difficulties several authors suggested that selection procedures should be employed so that the only proteins used as standards have spectral characteristics similar to those of the unknown protein whose conformation is to be evaluated. Methods which utilize various selection procedures include ridge regression, variable selection, and neural networks.

Ridge regression analysis (CONTIN). Provencher and Glockner (22) proposed that the CD spectra of unknown proteins could be fit directly by a linear combination of the spectra of a large data base of proteins with known conformations. They developed a computer program, called CONTIN, which uses a variation of the method of least squares that is similar to a mathematical technique known as ridge regression. In their method, the contribution of each reference spectrum to that of the spectrum to be analyzed is kept small, unless it contributes to a good agreement between the theoretical best-fit curve and the raw data.

The CONTIN program gives a much better estimate of β-turns than simple multiple linear regression, SVD, or CCA (see Table 1) and truncating the data at 200 nm appears to have little effect on its prediction of protein conformation. The method still suffers, however, in that the fits depend on the choice of proteins in the data base of standards. Venyaminov et al. (39) suggest that estimates of conformational classes can be improved by including denatured proteins in the data base as references for the random conformation.

Singular value decomposition with variable selection (the VARSLC program). Manavalan and Johnson (21) showed that the technique of variable selection can significantly improve the estimate of protein conformation when combined with singular value decomposition (see Table 1). In the variable selection method (the VARSLC program), an initial data base of proteins with known spectra and secondary structures is selected. Some of the protein spectra are eliminated systematically to create new data bases with a smaller number of standards. Singular value decomposition is used on all of the reduced data sets to evaluate the conformation of the unknown protein. The results obtained using each set are then examined, and the ones fulfilling selection criteria for a good fit are averaged. The VARSLC program gives an excellent evaluation of protein conformation in solution. Its major disadvantage is that it is recommended that the program not be used unless data can be collected to at least 184 nm (6, 9). In addition, the program is relatively slow, since all of the combinations must be tested individually.

The self-consistent method (SELCON). Sreerama and Woody (8, 29, 30) have made modifications of the variable selection method which improve its speed and accuracy, which they call the self-consistent method (SELCON). In the SELCON program, first the proteins in the data base are arranged in order of increasing root-mean-square difference from the CD spectrum to be analyzed, and the spectra which are least like the spectrum of interest are systematically deleted as described by van Stokkum et al. (40). This increases the speed of finding the best solutions. Second, the program utilizes the observation that prediction improves when the protein analyzed is included in the basis set, since the solution is biased toward the test protein structure. An initial guess of the structure of the protein to be analyzed is made and this conformation is included in the data base which is deconvoluted using SVD. The secondary structure of the protein is then determined. The solution replaces the initial guess and the process is repeated until self-consistency is attained.

The SELCON program gives very good estimates of α-helix, β-structure, and β-turns of globular proteins and appears to work fairly well even when data are available only between 240 and 200 nm. Recently, Sreerama and Woody (29) modified the SELCON program so it can also determine the contribution of the P2 conformation to the spectra of globular proteins.

There is a caveat in using the SELCON program. The program, with its current data base of reference spectra, does a relatively poor job of predicting the structure of polypeptides with very high contents of β-pleated sheet, and it overestimates α-helix and underestimates β-sheet considerably. The errors may arise because the magnitude of the ellipticities of pure "infinite" β-pleated sheets found in polypeptides and some protein aggregates is much higher than the ellipticities of the short β-sheets found in the globular proteins used as standards in the data base.

Neural nets (K2D). A neural network is a computer program which can detect patterns and correlations in data. Bohm et al. (26) first proposed that neural networks could be used to analyze CD and that the use of such computational techniques could significantly improve the correlation between calculated and observed secondary structures. The application of neural networks to biochemical problems has been reviewed by Hirst and Sternberg (41). In neural networks there are three kinds of units: input units which receive signals from external sources and send signals to other units; output units which receive signals from other units and send signals to the environment; and hidden


units which receive inputs from other units and send output signals to other units, but do not directly receive data or output final results. A neural network is formed by organizing units into layers. There can be connections between units in the same layer and connections between units in different layers. The units are connected together by "neurons" and the connections are numerically weighted so that the data used as input will result in the correct output.

In the case of CD, the input patterns are the CD spectra and the output patterns are the fractional weights of the secondary structures. In neural network analysis there are two phases, the "learning" or "training" phase and the "recall" phase. In the learning phase connections are made between the points of the CD spectra and the secondary structure of standards and the weights of the connections are adjusted until the error between the calculated and actual secondary structures is minimized. In the recall phase, data not used in the learning phase are input and the corresponding output is calculated using the adjusted weights. In neural net analysis the learning phase can take many hours, but the recall phase takes seconds. Commercial software packages are available for performing neural net analyses (see Bohm et al. (26) and Sreerama and Woody (30) for sources).

The neural network of Bohm et al. (26) consisted of 83 units in the input layer (corresponding to CD at 83 wavelengths between 260 and 178 nm), a hidden layer with 45 neurons, and an output layer with 5 neurons representing the α-helix, antiparallel and parallel β-sheets, β-turns, and remainder. They found excellent prediction of α-helix and antiparallel β-sheet with correlation coefficients of 1.0 and 0.91, respectively. When the wavelength region was truncated between 250 and 200 nm, however, the prediction of α-helix remained excellent, but the prediction of β-sheet gave a negative correlation coefficient.

Sreerama and Woody (30) analyzed their test proteins in a manner similar to that described by Bohm et al. (26). Their best results were obtained using two hidden layers and these results are summarized in Table 1. They found that prediction could be improved if variable selection was used in constructing the network. However, such an approach is probably not for the average user of CD, since the calculations are very time consuming and require software which is not yet generally available.

Recently a somewhat different neural network procedure for analyzing CD data called proteinotopic mapping has been described (27, 28). The computer program, utilizing this method, is named K2D. It consists of a data base of weights and a recall program for determining α-helix and β-structure based on these weights. The program only utilizes data obtained between 240 and 200 nm as input and gives the best estimates of β-sheet when only data with a limited wavelength range are available. It is rapid to use, and it has the advantage that it outputs the theoretical curve, which can be compared with the raw data, but it does not evaluate β-turns.

The Estimation of Protein Tertiary Structure Class from Circular Dichroism Data

While CD is most often used to determine secondary structural characteristics, it has been suggested (42, 43) that it can be of some use in determining some elements of tertiary structure as well. Proteins have been divided into five structural classes on the basis of their secondary structure (42, 44): all-α (mainly α-helical), all-β (mainly β-pleated sheet), α + β (separate α-helix and β-sheet regions), α/β (intermixed α-helices and β-sheet regions), and random (predominantly unordered). Manavalan and Johnson (42) suggested that it should be possible to identify the structural class of a protein by visual inspection of its CD spectrum. They found that the all-α, α + β, and α/β proteins show pronounced negative CD bands at 222 and 208 nm and a positive band between 190 and 195 nm. They suggested that the all-α proteins could be distinguished from those containing some β-structure by the wavelength at which the CD changed from positive to negative below 180 nm. In all-α proteins the crossover is not until 172 nm, while it occurs at higher wavelengths in those with some β-structure. In addition, they suggested that α + β proteins could be distinguished from α/β proteins by the relative ratios of their bands at 222 and 208 nm. In the α + β type the 208-nm band is larger than the 222-nm band, while the relative intensities are reversed in α/β proteins. All β-proteins lack the characteristic α-helical peaks and can be divided into two types: those in which the spectrum resembled model β-polypeptides and those whose spectra resembled disordered polypeptides. Recently, Venyaminov and Vassilenko (43) have proposed that the mathematical technique of cluster analysis can be used to assign the class of a protein from its CD spectrum between 236 and 190 nm. A computer program to perform cluster analysis, DEF CLAS.EXE, is available from Dr. Venyaminov (see Appendix). When tested on 53 proteins (43) the program gave 100% accuracy in identifying all-α, α/β, and denatured proteins, 85% accuracy for identifying α + β proteins, and 75% accuracy for identifying all-β proteins. It should be noted, however, that the program performs poorly when tested on polypeptides which are 100% α-helical or 100% β-sheet, identifying them both as belonging to the α/β class.

SUMMARY AND RECOMMENDATIONS

All of the various methods of determining protein conformation from CD spectra give a reasonable esti-

/ 6m0f$$9413 02-06-96 17:31:24 aba AP-Anal Bio

Page 8: Conformation of Proteins

NORMA J. GREENFIELD8

available circular dichroism spectra of a wide variety of proteinsmate of helical content. The SELCON program (8, 29,which have been useful as standards, both for the work reported in30), using a data base of 17 references as standards,this review and for the work of countless other researchers. Gerald

appears to give very good correlations between pre- D. Fasman generously provided the compiled version of the CCA anddicted and found a-helix, b-sheet, and b-turn for globu- LINCOMB programs which have been of invaluable use. He also

provided the C source code and a new software analysis packagelar proteins. It is reasonably fast and is easy to use.with both the LINCOMB and CCA programs, which has a moreIts use is highly recommended for the estimation of thesophisticated user interface currently being developed by Dr. A. Per-structure of globular proteins in solution. However, itczel. Dr. Sergei Yu. Venyaminov generously provided a compiled

does a relatively poor job on estimating the structure version of the CONTIN program which runs on personal computers.of polypeptides, with very high contents of b-pleated Dr. Miguel Andrade unselfishly provided the C source code and com-

piled code and the weight table for the K2D neural network program.sheet (see above).Lawrence E. Greenfield assisted greatly in the compiling of the FOR-The K2D program (27, 28) gives a very good estimateTRAN programs. I also thank Sarah E. Hitchcock-DeGregori andof b-structure with data collected only between 240 and Barbara Brodsky for their support and critical reading of the manu-

200 nm and works well with polypeptides, but it does script. This work was supported by NIH Grants GM36326 andnot estimate b-turns. The CONTIN (22) program, on HL35726 to SEHD and by the CD facility at UMDNJ.the other hand, gives a good estimate of b-turns. It issuggested that both these programs be used in conjunc- APPENDIXtion with the SELCON program for the best overall

Computer Programs to Analyze Protein Conformationestimates of protein and polypeptide conformation.from CD DataIt should be emphasized that when using any of the

methods to analyze CD data, the program which gives The computer programs to analyze CD data, de-scribed in this review, are all available on diskette fromthe calculated spectrum that best matches the experi-

mental spectrum does not necessarily provide the best N.J.G. upon request. Included are directions for usingeach program and programs to convert data files ob-estimate of protein conformation. For example, the

CONTIN program almost always gives excellent tained on AVIV or JASCO spectrophotometers, orASCII files of ellipticity as a function of wavelength, toagreement between the experimental and calculated

CD curves, even when the fits are relatively poor com- the formats used by each of the programs. The follow-ing analysis programs are available:pared to other methods. On the other hand, the calcu-

lated curves obtained using the K2D program are often The LINCOMB program, supplied by Gerald D. Fas-man, Graduate Department of Biochemistry, Brandeisvery poor matches to the experimental data, although

the predictions of structure may be very good. When University (Waltham, MA), analyzes the CD of un-known spectra by fitting the spectra to that of stan-different methods of estimating protein structure give

widely varying results the estimates of conformation dards by the constrained method of least squares (seePerczel et al. (25). Five data sets are currently availableshould be regarded with suspicion.

When protein concentration is not known precisely, as standards. FASMAN.DAT contains the original ba-sis spectra of Perczel et al. (25) extracted by convexonly a nonconstrained least-squares analysis program,

such as the MLR program, may be used to analyze constraint analysis of 23 proteins, which was suppliedby Dr. Fasman with the LINCOMB program. BRAHMS.protein secondary structure (19). This method gives in-

ferior estimates, however, compared to all of the other DAT contains the polypeptide reference spectra ofBrahms and Brahms (19) and YANG.DAT contains themethods.

Although it appears to be less suitable for the routine basis spectra of Yang et al. (3) extracted from 15 pro-teins by multilinear regression. T&J33.DAT and S&determination of the secondary structure of proteins

than SELCON or CONTIN, the CCA (23–25) program R17.DAT contain reference basis spectra extracted bymultilinear regression (3) from data sets of the CDis an excellent method of deconvoluting sets of CD spec-

tra, e.g., to follow the effects of denaturants, ligands, spectra of 33 and 17 proteins, supplied by W. CurtisJohnson (9) and Robert W. Woody (8), respectively. Theor changes in temperature on protein and peptide con-

formation. LINCOMB program runs on 80286 computers andhigher with a color graphics card. When the LINCOMBprogram is used with the YANG.DAT data base, the

ACKNOWLEDGMENTS best fit obtained, where all of the fractional weightsare positive, is identical to the fit obtained using theI am heavily indebted to Robert W. Woody who provided the FOR-ESTIMATE program of Yang et al. (3). The PROSECTRAN source code and compiled copies of the SELCON program and

preprints of his recent articles describing both the SELCON method program, supplied with the AVIV CD spectrophotome-and applications of neural networks to the evaluation of circular ter, is a version of the ESTIMATE program.dichroism spectra. He also provided an early version of the program, MLR is a program that analyzes CD data by noncon-SELCON2, which evaluates the P2 conformation in proteins. W. Cur-

strained multilinear regression. It uses the same filetis Johnson Jr. graciously provided the FORTRAN source code andcompiled versions of his VARSLC program. He has also freely made formats and standards as the LINCOMB program (see

/ 6m0f$$9413 02-06-96 17:31:24 aba AP-Anal Bio

Page 9: Conformation of Proteins

CIRCULAR DICHROISM OF PROTEINS 9

above). The program runs on all IBM-compatible com- K2D is the neural net recall program of Andrade etal. (27) . It was contributed by Miguel Andrade, EMBL,puters, which have a color graphics card.

G&F calculates the percentage a-helix and b-struc- Heidelberg, Germany. The program is also available onthe world wide web at http://www.embl-heidelberg.de/ture by the original method of Greenfield and Fasman

(15) using poly-L-lysine as a reference. This program Çandrade/k2d.html. The program PLOTK2D will dis-play the output file graphically after the program isruns on all IBM PC-compatible computers. A color

graphics card is necessary to view the graphs. run. An 80386 or higher PC with a math coprocessoris recommended for use with the program. A colorTwo programs which perform the self-consistent

method of analyzing protein CD spectra of Sreerama graphics card is necessary to view the output with thePLOTK2D program.and Woody (8, 29, 30) were contributed by Dr. R. W.

Woody, Department of Biochemistry and Molecular Bi- Computer text files with instructions for estimatinghelical content from the ellipticity at 222 (33) and 208ology, Colorado State University (Fort Collins, CO).

SELCON (8) evaluates up to five conformations: a-he- nm (15) and for estimating protein concentration (11–14) are also included on the diskettes.lix, antiparallel and parallel b-sheets, turns, and re-

mainder. SELCON2 (29) evaluates a-helix, total b-structure, turns, P2, and remainder. An 80386 or REFERENCEShigher computer with a math coprocessor is recom-

1. Adler, A. J., Greenfield, N. J., and Fasman, G. D. (1972) Methods.mended for these programs. The FORTRAN codes, Enzymol. 27, 675–735.which are also supplied, can be compiled and run on 2. Woody, R. W. (1985) Peptides 7, 15–114.any computer with an F77 compiler. 3. Yang, J. T., Wu, C-S. C., and Martinez, H. M. (1986) Methods

The VARSLC program is an implementation of the Enzymol. 130, 208–269.variable selection method of Manavalan and Johnson 4. Johnson, W. C., Jr. (1985) Methods Biochem. Anal. 31, 61–163.(21), which was contributed by W. Curtis Johnson, De- 5. Johnson, W. C., Jr. (1988) Annu. Rev. Biophys. Chem. 17, 145–partment of Biochemistry and Biophysics, Oregon 166.State University (Corvallis, OR). A data base of 33 pro- 6. Johnson, W. C., Jr. (1990) Proteins Struct. Funct. Genet. 7, 205–

214.teins with ellipticities between 260 an 178 nm was also7. Woody, R. W. (1992) Adv. Biophys. Chem. 2, 37–79.supplied by Dr. Johnson for use as standards with the8. Sreerama, N., and Woody, R. W. (1993) Anal. Biochem. 209, 32–program. An 80386 or higher computer with a math

44.coprocessor is recommended for this program. The9. Tourmadje, A., Alcorn, S. W., and Johnson, W. C., Jr. (1992)FORTRAN code, which is also available, can be com-

Anal. Biochem. 200, 321–331.piled and run on any computer with an F77 compiler.10. Bradford, M. M. (1976) Anal. Biochem. 72, 248–254.When no proteins are excluded from the data base (i.e.,11. Goa, J. (1953) Scand. J. Clin. Lab. Invest. 5, 218–222.the number of combinations is set to 1), the VARSLC12. Lang, C. A. (1958) Anal. Chem. 30, 1692–1694.program performs the simple SVD procedure described13. Edelhoch, H. (1967) Biochemistry 6, 1948–1954.by Hennessey and Johnson (20).14. Gill, S. C., and von Hipple, P. H. (1989) Anal. Biochem. 182,CCAFAST and CCASLOW are versions of the convex

319–326.constraint analysis program of Perczel et al. (24, 25),15. Greenfield, N., and Fasman, G. D. (1969) Biochemistry 8, 4108–which were also supplied by Dr. Gerald D. Fasman. 4116.

CCAFAST requires a math coprocessor. CCASLOW 16. Saxena, V. P., and Wetlaufer, D. B. (1971) Proc. Natl. Acad. Sci.runs on all PCs but is very slow. An 80286 or higher USA 68, 969–972.computer with a math coprocessor is recommended for 17. Chen, Y-H., and Yang, J. T. (1971) Biochem. Biophys. Res. Com-this program. mun. 44, 1285–1291.

The CONTIN program, which performs the ridge re- 18. Chang, C. T., Wu, C-S. C., and Yang, J. T. (1978) Anal. Biochem.91, 13–31.gression technique of Provencher and Glockner (22, 38,

19. Brahms, S., and Brahms, J. (1980) J. Mol. Biol. 138, 149–178.39), was contributed by Dr. S. Yu. Venyaminov, Depart-20. Hennessey, J. P., and Johnson, W. C., Jr. (1981) Biochemistryment of Biochemistry and Molecular Biology, Mayo

20, 1085–1094.Foundation (Rochester, MN). This program works on21. Manavalan, P., and Johnson, W. C., Jr. (1987). Anal. Biochem.80286 or higher machines with a math coprocessor. A

167, 76–85.CD analysis package called CDSTRUC is available22. Provencher, S. W., and Glockner, J. (1981) Biochemistry 20, 33–from Dr. Venyaminov. In addition to the CONTIN pro-

37.gram it contains the ESTIMATE program of Yang et

23. Perczel, A., Hollosi, M., Tusnady, G., and Fasman, G. D. (1991)al. (3) and a version of the VARSLC program of Mana- Protein Eng. 4, 669–679.valan and Johnson (21) and the cluster analysis pro- 24. Perczel, A., Park, K., and Fasman, G. D. (1992) Proteins Struct.gram called DEF_CLAS.EXE (43). The conversion pro- Funct. Genet. 13, 57–69.grams that convert raw CD data to the CONTIN format 25. Perczel, A., Park, K., and Fasman, G. D. (1992) Anal. Biochem.

203, 83–93.will work with all of the programs in the package.


26. Bohm, G., Muhr, R., and Jaenicke, R. (1992) Protein Eng. 5, 191–195.

27. Andrade, M. A., Chacon, P., Merelo, J. J., and Moran, F. (1993) Protein Eng. 6, 383–390.

28. Merelo, J. J., Andrade, M. A., Prieto, A., and Moran, F. (1994) Neurocomputing 6, 443–454.

29. Sreerama, N., and Woody, R. W. (1994) Biochemistry 33, 10022–10025.

30. Sreerama, N., and Woody, R. W. (1994) J. Mol. Biol. 242, 497–507.

31. Kabsch, W., and Sander, C. (1983) Biopolymers 22, 2577–2637.

32. Levitt, M., and Greer, J. (1977) J. Mol. Biol. 114, 181–293.

33. Scholtz, J. M., Qian, H., York, E. J., Stewart, J. M., and Baldwin, R. L. (1991) Biopolymers 31, 1463–1470.

34. Chakrabartty, A., Kortemme, T., Padmanabhan, S., and Baldwin, R. L. (1993) Biochemistry 32, 5560–5565.

35. Bolotina, I. A., and Lugauskas, V. Y. (1986) Mol. Biol. 19, 1154–1166.

36. Woody, R. W. (1974) in Peptides, Polypeptides and Proteins (Blout, E. R., Bovey, F. A., Goodman, M., and Lotan, N., Eds.), pp. 338–360, Wiley, New York.

37. Perczel, A., and Fasman, G. D. (1992) Protein Sci. 1, 378–395.

38. Venyaminov, S. Y., Baikalov, K. A., Wu, C-S. C., and Yang, J. T. (1991) Anal. Biochem. 198, 250–255.

39. Venyaminov, S. Y., Baikalov, I. A., Shen, Z. M., Wu, C-S. C., and Yang, J. T. (1993) Anal. Biochem. 214, 17–24.

40. van Stokkum, I. H. M., Spoelder, H. J. W., Bloemendal, M., van Grondelle, R., and Groen, F. C. A. (1990) Anal. Biochem. 191, 110–118.

41. Hirst, J. D., and Sternberg, M. J. E. (1992) Biochemistry 31, 7211–7218.

42. Manavalan, P., and Johnson, W. C., Jr. (1983) Nature 305, 831–832.

43. Venyaminov, S. Y., and Vassilenko, K. S. (1994) Anal. Biochem. 222, 176–184.

44. Levitt, M., and Chothia, C. (1976) Nature 261, 552–558.
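ILLUSTRATIVE NOTE

The review distinguishes constrained least-squares fitting (fractional weights positive and summing to one, as in LINCOMB) from nonconstrained fitting (as in MLR), and recommends the latter when protein concentration is not known precisely. The following Python sketch may help to show why. It is not the code of either program. The reference spectra are invented numbers, the projected gradient descent solver and the final normalization of the nonconstrained weights are assumptions made for illustration, and all function names are hypothetical.

```python
# Hypothetical reference spectra (illustrative numbers only, NOT measured data):
# mean residue ellipticity in 10^3 deg cm^2 dmol^-1 at 195, 200, 208, 215, 222 nm.
BASIS = [
    [60.0, 30.0, -33.0, -32.0, -36.0],  # "helix-like" reference
    [30.0, 15.0, -8.0, -18.0, -12.0],   # "sheet-like" reference
    [-40.0, -25.0, -5.0, 2.0, 3.0],     # "unordered-like" reference
]


def project_to_simplex(values):
    """Closest point (Euclidean) whose entries are >= 0 and sum to 1."""
    ordered = sorted(values, reverse=True)
    running, shift = 0.0, 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        candidate = (running - 1.0) / count
        if value - candidate > 0.0:
            shift = candidate
    return [max(v - shift, 0.0) for v in values]


def mix(basis, fractions):
    """Spectrum of a mixture: weighted sum of the reference spectra."""
    return [sum(f * ref[i] for f, ref in zip(fractions, basis))
            for i in range(len(basis[0]))]


def fit_fractions(basis, spectrum, constrained=True, iterations=20000):
    """Least-squares weights found by (projected) gradient descent."""
    # A step of 1 / (2 * sum of squares) is small enough to guarantee descent.
    step = 1.0 / (2.0 * sum(x * x for ref in basis for x in ref))
    fractions = [1.0 / len(basis)] * len(basis)
    for _ in range(iterations):
        residual = [c - o for c, o in zip(mix(basis, fractions), spectrum)]
        gradient = [2.0 * sum(r * x for r, x in zip(residual, ref))
                    for ref in basis]
        fractions = [f - step * g for f, g in zip(fractions, gradient)]
        if constrained:
            # Enforce positive weights that sum to one after every step.
            fractions = project_to_simplex(fractions)
    return fractions


def normalize(fractions):
    """Rescale nonconstrained weights so that they sum to one."""
    total = sum(fractions)
    return [f / total for f in fractions]


TRUE_FRACTIONS = [0.6, 0.3, 0.1]
EXACT = mix(BASIS, TRUE_FRACTIONS)
# The same sample if its concentration were overestimated, so that every
# ellipticity comes out 20% too small.
UNDERSCALED = [0.8 * x for x in EXACT]

if __name__ == "__main__":
    print("constrained, exact spectrum:      ", fit_fractions(BASIS, EXACT))
    print("constrained, underscaled spectrum:", fit_fractions(BASIS, UNDERSCALED))
    print("nonconstrained + normalized:      ",
          normalize(fit_fractions(BASIS, UNDERSCALED, constrained=False)))
```

With these invented spectra, the constrained fit recovers the input fractions from the exact spectrum but is visibly distorted by the scaling error, because the weights are forced to sum to one while the data are too small. The nonconstrained weights absorb the scale factor and return the input ratios once normalized. Real data contain noise and reference spectra that are far from independent, which is consistent with the poorer estimates the review reports for the nonconstrained method.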
