
Page 1: Indiana University School of C571/C696 Chemical Information Tech. 2004, Lecture 7. Page 1 C571/C696 Chemical Information Technology David Wild djwild@indiana.edu

C571/C696 Chemical Information Tech. 2004, Lecture 7. Page 1

Indiana University School of

C571/C696 Chemical Information Technology

David Wild djwild@indiana.edu

http://www.informatics.indiana.edu/djwild

Representing 3D Structures


What we’ll cover today

• Sources of 3D information (X-ray, NMR)
• Experimental 3D databases
• Rotatable bonds & conformational flexibility
• Representing 3D structures using distance matrices
• Estimation of 3D structure on computer
• Conformational search and minimization
• 3D descriptors and fingerprints
• Types & sources of protein information
• How proteins are represented on computer
• PDB file format


Sources of 3D information

• X-ray Crystallography
• NMR Spectroscopy
• Computer-generated 3D structures

• X-ray and NMR methods apply to both small molecules and proteins


X-ray crystallography

• Exploits diffraction of X-rays by electron clouds
• Allows 3D location of atoms to be inferred
• Requires sample to be in crystalline form
• More info:
  – http://www-structure.llnl.gov/Xray/101index.html


X-ray crystallography

Taken from http://www-structure.llnl.gov/Xray/101index.html


NMR Spectroscopy

• Exploits magnetic fields created by quantum spin in nuclei
• Nuclear spin can switch state when radio waves are applied
• Different atoms and groups resonate at different frequencies
• Information can be pieced together to infer 3D structure
• More info:
  – http://www.rod.beavon.clara.net/nmr1.htm


Experimental 3D Databases – Cambridge Structural Database

• Holds 261,000 experimental X-ray structures (Jan 2004)

• Various tools for searching the database (some available free)

• More info at: http://www.ccdc.cam.ac.uk/


CSD Growth since 1970


Factors involved in 3D representation

• Rotatable bonds and conformational flexibility
• Sampling conformations or including flexibility in algorithms
• Measuring energy of conformations
• Representation of electronic and other characteristics


Rotatable bonds and conformational flexibility

• Most compounds have rotatable bonds. This means that the molecule can take on many 3D conformations.

• Molecules prefer low-energy states, so low-energy conformations are more likely

• How do we work out which bonds are rotatable?
• Do we pick one particular conformation (e.g. lowest energy), or pick several, or allow for flexibility?


Working definition of a rotatable bond

Any single bond which is:
  – Not part of a ring
  – Not terminal (e.g. methyl)
  – Not in a conjugated system (e.g. amide)
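The working definition can be turned into a small graph test. The sketch below is illustrative only: it assumes a heavy-atom graph (hydrogens omitted, so a methyl carbon is a terminal atom), and it expects the caller to supply the set of conjugated bonds, since detecting conjugation is beyond this example. The function name `rotatable_bonds` and its argument layout are hypothetical.

```python
from collections import defaultdict

def rotatable_bonds(bonds, conjugated=frozenset()):
    """Return the (i, j) bonds that pass the working definition.

    bonds: list of (i, j, order) tuples over heavy atoms only.
    conjugated: set of frozenset({i, j}) pairs flagged by the caller
    (e.g. an amide C-N bond); this sketch does not detect them itself.
    """
    neighbours = defaultdict(set)
    for i, j, _ in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    def in_ring(i, j):
        # The bond is in a ring if j can still be reached from i
        # when the bond i-j itself is ignored.
        seen, stack = {i}, [i]
        while stack:
            a = stack.pop()
            for b in neighbours[a]:
                if {a, b} == {i, j} or b in seen:
                    continue
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
        return False

    result = []
    for i, j, order in bonds:
        if order != 1:
            continue                      # only single bonds qualify
        if len(neighbours[i]) < 2 or len(neighbours[j]) < 2:
            continue                      # terminal bond
        if frozenset((i, j)) in conjugated:
            continue                      # flagged as conjugated
        if in_ring(i, j):
            continue                      # ring bond
        result.append((i, j))
    return result


# Butane heavy atoms 0-1-2-3: only the central bond is rotatable.
butane = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
# Ethylcyclohexane: ring atoms 0-5, ethyl group on atoms 6 and 7.
ring = [(k, (k + 1) % 6, 1) for k in range(6)]
ethylcyclohexane = ring + [(0, 6, 1), (6, 7, 1)]
```

In practice a cheminformatics toolkit would perceive rings and conjugation from the full structure; passing them in by hand keeps the example self-contained.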


Torsion (dihedral) angle

• The torsion angle (τ), also known as the dihedral angle, is the angle between the A-B bond and the C-D bond when viewed along the B-C bond, for four atoms connected in the order A-B-C-D

[Figure: four atoms A-B-C-D with the torsion angle τ marked]
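As a hedged illustration of the definition, the torsion angle can be computed from the four atom positions by projecting the A-B and C-D bonds onto the plane perpendicular to B-C and measuring the angle between the projections. The helper names below are made up for this sketch; the sign of the result only indicates the direction of rotation.

```python
from math import atan2, degrees, sqrt

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def torsion_angle(a, b, c, d):
    """Torsion angle A-B-C-D in degrees, in the range -180 to 180."""
    b0 = _sub(a, b)                       # vector from B to A
    b1 = _sub(c, b)                       # central bond, B to C
    b2 = _sub(d, c)                       # vector from C to D
    norm = sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)      # unit vector along B-C
    # Remove the components along B-C, leaving the projections
    # onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return degrees(atan2(_dot(_cross(b1, v), w), _dot(v, w)))


A, B, C = (1, 0, 0), (0, 0, 0), (0, 1, 0)
cis = torsion_angle(A, B, C, (1, 1, 0))      # D on the same side as A
trans = torsion_angle(A, B, C, (-1, 1, 0))   # D on the opposite side
perp = torsion_angle(A, B, C, (0, 1, 1))     # D at right angles to A
```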


Ring flexibility

• Chair & boat conformations
• Occur with non-aromatic rings (e.g. cyclohexane)

[Figure: chair and boat conformations]


3D representation on computer

• The Coordinate Table is an extension of the atom table which lists coordinates of atoms in 3D space relative to a defined origin

• The Distance Matrix gives distances (in ångströms) between all pairs of atoms. Its main use is in comparison of 3D structures. It can be derived from the coordinate table.

• These are usually stored in addition to a connection table.


Coordinate Table

Atom  Label        X         Y         Z
   1  C      -1.8920   -0.9920   -1.5760
   2  C      -1.3680   -2.1480   -0.9880
   3  C      -0.0760   -2.1440   -0.4640
   4  C       0.7080   -0.9840   -0.5200
   5  C       0.2000    0.1560   -1.1960
   6  C      -1.1080    0.1600   -1.6520
   7  C       2.0840   -1.0280    0.1040
   8  O       2.5320   -2.0320    0.6360
   9  O       2.8760    0.0240    0.1120
  10  O       0.7520    1.3320   -1.0840
  11  C       0.6680    2.0240    0.0320
  12  O       1.3000    3.0600    0.1520
  13  C      -0.2400    1.5760    1.1440

[Figure: structure diagram with heavy atoms numbered 1 to 13]


Distance Matrix (Å)

[Figure: structure diagram with atoms numbered 1 to 13 and two example distances marked, 4.8 Å and 3.5 Å]

       1    2    3    4    5    6    7    8    9   10   11   12   13
 1        1.4  2.4  2.8  2.4  3.8  4.8  4.2  1.4  2.4  2.7  2.9  4.3
 2             1.4  2.4  2.8  4.3  5.1  5.0  2.4  3.7  3.9  4.2  5.6
 3                  1.4  2.4  3.8  4.2  4.8  2.8  4.2  4.7  4.9  6.4
 4                       1.4  2.5  2.8  3.6  2.4  3.7  4.7  4.6  6.1
 5                            1.5  2.4  2.3  1.4  2.3  3.7  3.5  4.8
 6                                 1.3  1.2  2.5  2.8  4.4  3.9  5.0
 7                                      2.2  3.7  4.1  5.7  5.2  6.3
 8                                           2.8  2.5  4.2  3.5  4.3
 9                                                1.4  2.6  2.3  3.7
10                                                     2.2  1.3  2.5
11                                                          1.2  2.4
12                                                               1.5
13
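Deriving a distance matrix from a coordinate table is a direct application of the Euclidean distance formula. The following is a minimal sketch, using the first four atom coordinates from the aspirin MOL file in this lecture; the function name `distance_matrix` is an assumption of the example, and `math.dist` requires Python 3.8 or later.

```python
from math import dist

def distance_matrix(coords):
    """Full symmetric matrix of interatomic distances, in the input units."""
    return [[dist(p, q) for q in coords] for p in coords]


# First four ring carbons (in Å) from the aspirin MOL file.
coords = [
    (-1.8920, -0.9920, -1.5760),
    (-1.3680, -2.1480, -0.9880),
    (-0.0760, -2.1440, -0.4640),
    ( 0.7080, -0.9840, -0.5200),
]
dm = distance_matrix(coords)

# Print the upper triangle, rounded to 0.1 Å as on the slide.
for i, row in enumerate(dm):
    print(" ".join(f"{d:4.1f}" for d in row[i + 1:]))
```

Because the matrix is symmetric with a zero diagonal, only the upper triangle needs to be stored, which is how the slide presents it.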


3D Molecule file formats

• All tend to include coordinate/atom lookup table and connection table information

• Examples: MOL file (MDL), Sybyl MOL2 file (Tripos)


3D MOL file for Aspirin

  Chime 12290214053D

 21 21  0  0  1 V2000
   -1.8920   -0.9920   -1.5760 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3680   -2.1480   -0.9880 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0760   -2.1440   -0.4640 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7080   -0.9840   -0.5200 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2000    0.1560   -1.1960 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1080    0.1600   -1.6520 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0840   -1.0280    0.1040 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5320   -2.0320    0.6360 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.8760    0.0240    0.1120 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7520    1.3320   -1.0840 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.6680    2.0240    0.0320 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3000    3.0600    0.1520 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2400    1.5760    1.1440 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.8760   -0.9600   -1.9840 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9880   -3.0360   -0.9520 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.3000   -3.0600   -0.0040 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.4880    1.0840   -2.0560 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.5640    0.7800   -0.3240 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7600    0.6360    0.9320 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0080    2.3480    1.2880 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.3440    1.4320    2.0560 H   0  0  0  0  0  0  0  0  0  0  0  0
 13 21  1  0
 13 20  1  0
 13 19  1  0
 11 13  1  0
 11 12  1  0
 10 11  1  0
  9 18  1  0
  7  9  1  0
  7  8  1  0
  6 17  1  0
  5 10  1  0
  5  6  1  0
  4  7  1  0
  4  5  1  0
  3 16  1  0
  3  4  1  0
  2 15  1  0
  2  3  1  0
  1 14  1  0
  1  6  1  0
  1  2  1  0
M  END
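To show how the coordinate table and connection table sit inside a MOL file, here is a minimal reading sketch. It assumes the usual V2000 layout (three header lines, a counts line, then fixed-width atom and bond lines) and reads only the fields discussed in this lecture; `parse_molfile` is a hypothetical helper, not part of any library.

```python
def parse_molfile(text):
    """Read atoms and bonds from a V2000 MOL file given as one string."""
    lines = text.splitlines()
    counts = lines[3]                     # fourth line: atom and bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y and z occupy three 10-character fields; the element
        # symbol follows in columns 32-34.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # first atom, second atom (both 1-based) and bond type
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


# Two-atom fragment cut from the aspirin file, for illustration only.
sample = "\n".join([
    "",
    "  Chime 12290214053D",
    "",
    "  2  1  0  0  1 V2000",
    "   -1.8920   -0.9920   -1.5760 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -1.3680   -2.1480   -0.9880 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])
atoms, bonds = parse_molfile(sample)
```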


Computer estimation of 3D structure

• Programs take as input 2D structures (e.g. in SMILES) and output 3D structures

• There is no one correct 3D structure, since in three dimensions a molecule is conformationally flexible

• Methods may output one single conformation, or an ensemble of possible conformations


Fragment / Rule Based 3D Structure Generation

• Split 2D structure into small fragments matched to a pre-defined empirical database

• Generally use a combination of real fragment coordinates, theory and rules to generate the 3D structure

• Generally produce one or more low-energy conformations

• Examples: Concord, Corina, Omega


Distance Geometry Based Structure Generation

• Rapidly samples “conformational space” of molecule, looking for valid conformations based on distance bounds.

• Outputs an ensemble of possible conformations, which can then be scored, e.g. by energy

• For algorithm, see
  – http://www.daylight.com/meetings/summerschool01/course/basics/dist.html
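Two small pieces of the distance-bounds idea can be sketched without a full embedding step. The code below is not the algorithm from the linked course notes; it only illustrates, under simplifying assumptions, how upper bounds can be tightened with the triangle inequality and how a candidate conformation can be checked against lower and upper bounds. Function names and the example bounds are hypothetical.

```python
def smooth_upper_bounds(upper):
    """Tighten upper bounds so that u[i][j] <= u[i][k] + u[k][j] for all k."""
    n = len(upper)
    u = [row[:] for row in upper]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if u[i][k] + u[k][j] < u[i][j]:
                    u[i][j] = u[i][k] + u[k][j]
    return u

def satisfies_bounds(distances, lower, upper):
    """True if every interatomic distance lies within its bounds."""
    n = len(distances)
    return all(lower[i][j] <= distances[i][j] <= upper[i][j]
               for i in range(n) for j in range(i + 1, n))


# Three atoms in a chain: the 1-3 distance starts out unconstrained (99 Å).
upper = [[0.0, 1.5, 99.0],
         [1.5, 0.0, 1.5],
         [99.0, 1.5, 0.0]]
lower = [[0.0, 1.0, 1.0],
         [1.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
smoothed = smooth_upper_bounds(upper)

ok = [[0.0, 1.4, 2.4], [1.4, 0.0, 1.4], [2.4, 1.4, 0.0]]
too_long = [[0.0, 1.4, 3.5], [1.4, 0.0, 1.4], [3.5, 1.4, 0.0]]
```

A distance geometry program would go on to pick trial distances inside the bounds and convert them to coordinates; that step is omitted here.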


Concord

• Distributed by Tripos, Inc.
• One of the earliest structure generators
• Fragment / rule-based
• Produces low-energy, geometry-optimized conformation
• An industry standard
• More information:
  – http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/concord.html


Corina

• Created by Gasteiger lab in Germany
• Fragment / rule-based
• Similar to Concord
• More information, plus 1,000 free structure generations on the web, at:
  – http://www2.chemie.uni-erlangen.de/software/corina/free_struct.html


Omega

• Recently introduced by OpenEye
• Rule-based
• Systematically tests conformations, not stochastic
• Extremely fast generation of multiple low-energy conformations
• Can handle 100,000 compounds/processor/day
• Free academic use license
• More information at:
  – http://www.eyesopen.com/products/applications/omega.html


Rubicon

• Marketed by Daylight
• Mixture of Distance Geometry and SMARTS-based rules
• Rules can be user-defined
• For more information, see
  – http://www.daylight.com/products/rubicon.html


Structure Minimization

• Finding the conformer or conformers that have the lowest energy, and are therefore most likely to be found in nature (“conformational search”)
• May start with an existing non-optimized structure
• Can use standard optimization methods such as exhaustive search, simulated annealing, Monte Carlo, or genetic algorithms
• Can attempt to use ab initio derivation
• For more info see:
  – http://www.chem.swin.edu.au/modules/mod6/
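Exhaustive search is the simplest of the methods listed: evaluate the energy on a grid of torsion angles and keep the lowest. The sketch below uses a toy one-torsion energy function with made-up coefficients (`v1`, `v3`), chosen only so that the anti arrangement at 180° comes out lowest; a real search would use a force field and cover every rotatable bond.

```python
from math import cos, radians

def torsion_energy(tau_deg, v1=3.0, v3=3.0):
    """Toy torsional energy in arbitrary units (coefficients are illustrative)."""
    t = radians(tau_deg)
    return 0.5 * v1 * (1 + cos(t)) + 0.5 * v3 * (1 + cos(3 * t))

def exhaustive_search(energy, step=10):
    """Return the grid angle, in degrees, with the lowest energy."""
    return min(range(0, 360, step), key=energy)


best = exhaustive_search(torsion_energy)
print(best, round(torsion_energy(best), 3))
```

The cost of a grid search grows exponentially with the number of rotatable bonds, which is one reason stochastic methods such as simulated annealing, Monte Carlo and genetic algorithms are used for flexible molecules.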


3D small molecule databases and searching

• Databases store coordinate tables and often distance matrices

• Searching is a little different from 2D searching:
  – Needs to take into account conformational flexibility
  – Requirements are different

• Less common and less mature than 2D databases and searching

• See http://www.netsci.org/Science/Cheminform/feature06.html for a review


3D substructure (“pharmacophore”) search

• A pharmacophore is a set of features in 3D required for binding to a particular protein

• E.g. “find all of the molecules that have an OH group between 2 and 5 Å away from a carboxyl oxygen, both of which are 7-8 Å from a benzene ring”
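A query like the one above reduces to checking distance ranges between labelled feature points. The sketch below assumes each molecule has already been reduced to a list of (feature type, coordinates) pairs for a single conformation; the feature labels, the coordinates and the function `matches` are invented for illustration, and conformational flexibility is ignored.

```python
from itertools import product
from math import dist

def matches(features, query_types, constraints):
    """True if some choice of feature points satisfies every distance range.

    features: list of (type, (x, y, z)) for one conformation.
    query_types: feature types required by the query, in order.
    constraints: (a, b, low, high) tuples, where a and b index
    query_types and low/high are in Å. Assumes each type appears
    only once in query_types.
    """
    candidates = [[xyz for t, xyz in features if t == qt] for qt in query_types]
    for combo in product(*candidates):
        if all(low <= dist(combo[a], combo[b]) <= high
               for a, b, low, high in constraints):
            return True
    return False


query_types = ["OH", "carboxyl_O", "benzene_ring"]
constraints = [(0, 1, 2.0, 5.0),    # OH to carboxyl oxygen: 2-5 Å
               (0, 2, 7.0, 8.0),    # OH to ring: 7-8 Å
               (1, 2, 7.0, 8.0)]    # carboxyl oxygen to ring: 7-8 Å

hit = [("OH", (0.0, 0.0, 0.0)),
       ("carboxyl_O", (3.0, 0.0, 0.0)),
       ("benzene_ring", (1.5, 7.3, 0.0))]      # ring represented by its centroid
miss = [("OH", (0.0, 0.0, 0.0)),
        ("carboxyl_O", (3.0, 0.0, 0.0)),
        ("benzene_ring", (1.5, 3.0, 0.0))]
```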


3D Similarity Searching

• Can use 3D fingerprints based on pharmacophore “fragments”
  – See, e.g., Comparing 3D Pharmacophore Triplets and 2D Fingerprints for Selecting Diverse Compound Subsets. H. Matter and T. Pötter, J. Chem. Inf. Comput. Sci.; 1999; 39(6) pp 1211-1225

• Can be atom based, involving comparison of distance matrices
  – E.g. finding pairs of most-similar atoms between molecules, based on their distances from other atoms in the molecule

• But other forms are also used, e.g. using fields
  – See, e.g., Calculation of Structural Similarity by the Alignment of Molecular Electrostatic Potentials, D. Thorner, D. Wild, P. Willett, & M. Wright, Perspectives in Drug Discovery and Design, 9/10/11, 301-320, 1998

• May be used for searching databases or ranking small datasets
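As a rough illustration of a 3D fingerprint, the sketch below encodes each pair of pharmacophore features as (type, type, binned distance) and compares two molecules with the Tanimoto coefficient. This is a deliberate simplification: it uses pairs rather than the triplets of the cited paper, the 1 Å bin width is arbitrary, and the Tanimoto coefficient is assumed here as the similarity measure rather than taken from the slide.

```python
from itertools import combinations
from math import dist

def pair_fingerprint(features, bin_width=1.0):
    """Set of (type, type, distance bin) keys for one conformation."""
    fp = set()
    for (t1, p1), (t2, p2) in combinations(features, 2):
        a, b = sorted((t1, t2))
        fp.add((a, b, int(dist(p1, p2) // bin_width)))
    return fp

def tanimoto(fp1, fp2):
    """Shared keys divided by total distinct keys (1.0 for two empty sets)."""
    union = fp1 | fp2
    return len(fp1 & fp2) / len(union) if union else 1.0


mol_a = [("donor", (0.0, 0.0, 0.0)),
         ("acceptor", (3.2, 0.0, 0.0)),
         ("ring", (0.0, 5.5, 0.0))]
# Same geometry moved 1 Å along each axis: distances are unchanged.
mol_b = [(t, (x + 1.0, y + 1.0, z + 1.0)) for t, (x, y, z) in mol_a]
# Different geometry: only the donor-acceptor distance stays in the same bin.
mol_c = [("donor", (0.0, 0.0, 0.0)),
         ("acceptor", (3.7, 0.0, 0.0)),
         ("ring", (0.0, 7.5, 0.0))]

sim_ab = tanimoto(pair_fingerprint(mol_a), pair_fingerprint(mol_b))
sim_ac = tanimoto(pair_fingerprint(mol_a), pair_fingerprint(mol_c))
```

Because the keys depend only on interatomic distances, the fingerprint does not change when a molecule is translated or rotated, so no alignment step is needed before comparison.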


A debate! – 2D vs 3D similarity

Which is more effective…

… for retrieving molecules with similar biological activity?

… for retrieving molecules with similar 2D structures?

… for retrieving related molecules of interest to chemists?

… for ranking molecules for a particular target?


WDI - Mean Actives Retrieved in Top 300

[Bar chart: vertical axis from 0 to 60; series: 2D Finger, 3D Atom, 3D Fields]


Agrochemicals Dataset - Correlation between similarity and activity with four activities

[Bar chart: vertical axis from -0.1 to 0.6; activities A1, A2, A3, A4; series: 2D Finger, 3D Atom, 3D Field]


Any consensus?

Which is more effective…

… for retrieving molecules with similar biological activity?

Usually 2D

… for retrieving molecules with similar 2D structures?

2D

… for retrieving related molecules of interest to chemists?

Sometimes 2D, sometimes 3D (bioisosteres)

… for ranking molecules for a particular target?

Sometimes 2D, sometimes 3D


Other forms of 3D information

• Surface (van der Waals, Connolly, volume)
• Properties projected onto surface (electrostatics, hydrophobics)
• Fields (energy, force, electrostatic, steric, hydrophobic)
• Atom-based properties (charge, hydrophobicity, etc.)


What is a macromolecule?

• Any very large molecule (>1000 atoms)
• Usually made up of repeating building block molecules (amino acids, nucleic bases, etc) in a chain
• Polypeptides (amino acid building blocks)
• Proteins (amino acid building blocks)
• Nucleic acids (made up of bases)
• Polysaccharides (made up of sugars)
• We shall be focusing on polypeptides and proteins


Types of protein information

• Atomic (3D atom coordinates and bond information)
• Primary (Amino acid sequence)
• Secondary (Alpha helices, beta sheets, etc)
• Tertiary (3D folding of protein)
• Quaternary (dimers, protein families)


Atomic information

• 3D coordinates of all atoms in the protein

• Derived from X-ray crystallography or NMR Spectroscopy


Primary structure (Sequence)

• Lists amino acids in the order they appear in the chain
• Uses three-letter or one-letter abbreviations, e.g.:

Ser-Tyr-Ser-Met-Glu-His-Phe-Arg-Trp-Gly-Lys

S Y S M E H F R W G K

• Essentially “1-dimensional” representation of the protein

• Can be stored on computer as a text string
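Since the sequence is just text, converting between the two notations is a dictionary lookup. The sketch below covers the 20 standard amino acids only; the function name `to_one_letter` is an assumption of this example.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(sequence, sep="-"):
    """Convert e.g. 'Ser-Tyr-Ser' to 'SYS' (standard residues only)."""
    return "".join(THREE_TO_ONE[res.capitalize()] for res in sequence.split(sep))


one_letter = to_one_letter("Ser-Tyr-Ser-Met-Glu-His-Phe-Arg-Trp-Gly-Lys")
print(one_letter)
```

The `capitalize()` call lets the same function accept the upper-case residue names used in PDB files, for example `to_one_letter("PRO GLN ILE", sep=" ")`.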


Secondary structure

Certain groups of amino acids tend to form themselves into regular 3D shapes (α-helix, β-sheet, turn):

• α-helix – C=O and N-H groups hydrogen bond to the group four residues along the chain, forming a coil shape

• β-sheet – flat structure due to hydrogen bonding between two or more chains


Secondary structure (2)

• Secondary structural features can be fairly well predicted from primary structure, or they can be inferred from atom coordinates

• Primary sequence can be ‘tagged’ with secondary structure information

• E.g.

G A F T G E I S P G M I K D C G A T W V
β β β β β β β α α α α α α α


Tertiary structure

• How the protein chain is folded in three dimensions

• Information mostly derived from atomic coordinate information

• Extremely difficult to predict from scratch using computational methods

• May be predicted by finding proteins with similar primary and secondary structures that have known coordinates (homology modeling, threading).


Tertiary structure example (HIV)


Protein information representation

• Atomic – coordinate/connection table

• Primary – text string

• Secondary – text string

• Tertiary – set of points and vectors


File formats

• Tripos Sybyl MOL2
  – For storage of atomic coordinate information
  – Same as 3D small molecule file format

• PDB format
  – Special format for proteins
  – Complex and somewhat ill-defined
  – Allows representation of multiple types of information (primary, secondary, tertiary, atomic)


PDB file format

• Official guide: http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html

• Different sections to specify different kinds of information
  – Title
  – Primary structure
  – Heterogen
  – Secondary Structure
  – Connectivity Annotation
  – Miscellaneous
  – Crystallographic / Co-ordinate
  – Connectivity
  – Book-keeping

• Each section made up of keywords, one per line


PDB Title section

• HEADER – Type, date, ID code
• COMPND – Description of compound
• TITLE – Title of experiment used to produce structure
• AUTHOR
• JRNL – Reference publication
• REMARK – Comments


Primary structure section

• SEQRES – specifies amino acid sequence
• MODRES – specifies modifications to amino acids
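As an illustration of how SEQRES records carry the primary structure, the sketch below collects residue names per chain. It splits on whitespace for brevity, which assumes every record has a non-blank chain identifier; a stricter reader would use the fixed columns from the format guide. The sample records are taken from the HIV protease example later in this lecture, and `read_seqres` is a hypothetical helper.

```python
def read_seqres(lines):
    """Map chain identifier to its list of three-letter residue names."""
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        fields = line.split()
        # fields: record name, serial number, chain ID, residue count, residues...
        chain, residues = fields[2], fields[4:]
        chains.setdefault(chain, []).extend(residues)
    return chains


records = [
    "HEADER    PROTEIN                                 28-OCT-96",
    "SEQRES   1 A   99  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU VAL THR ILE",
    "SEQRES   2 A   99  LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASP THR",
    "SEQRES   1 B   99  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU VAL THR ILE",
]
sequences = read_seqres(records)
```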


Secondary structure section

• HELIX – specifies start & end of helical section
• SHEET – specifies start & end of a strand within a sheet
• TURN – specifies location of turn


Coordinates section

• ATOM – specifies coordinates for an atom in a residue
• HETATM – specifies coordinates for other atoms (e.g. in drug)
• TER – specifies end of list of coordinates for a chain
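ATOM and HETATM records are fixed-width, so fields are read by column position rather than by splitting on spaces. The following is a minimal sketch that reads only the fields used in this lecture, with column ranges as given in the PDB format guide; `parse_atom_line` is a hypothetical helper, and the two sample records come from the HIV protease example.

```python
def parse_atom_line(line):
    """Extract the main fields of one ATOM or HETATM record."""
    return {
        "record": line[0:6].strip(),      # ATOM or HETATM
        "serial": int(line[6:11]),        # atom serial number
        "name": line[12:16].strip(),      # atom name, e.g. CA
        "residue": line[17:20].strip(),   # residue name, e.g. PRO
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue sequence number
        "x": float(line[30:38]),          # coordinates in Å
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


lines = [
    "ATOM      1  N   PRO A   1       8.133 -13.258  12.706  1.00  0.00",
    "ATOM      2  CA  PRO A   1       9.325 -12.418  13.001  1.00  0.00",
]
atoms = [parse_atom_line(l) for l in lines if l.startswith(("ATOM", "HETATM"))]
```

Reading by column matters because neighbouring fields can run together without a space, for example when a coordinate is large and negative.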


Connectivity section

• CONECT – specifies connectivity between atoms (usually used for non amino-acids)


PDB file example HIV Protease

HEADER    PROTEIN                                 28-OCT-96
COMPND    HIV-1 PROTEASE COMPLEXED WITH THE INHIBITOR A77003 (R,S)
AUTHOR    GENERATED BY SYBYL, A PRODUCT OF TRIPOS ASSOCIATES, INC.
SEQRES   1 A   99  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU VAL THR ILE
SEQRES   2 A   99  LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASP THR
SEQRES   3 A   99  GLY ALA ASP ASP THR VAL LEU GLU GLU MET SER LEU PRO
SEQRES   4 A   99  GLY ARG TRP LYS PRO LYS MET ILE GLY GLY ILE GLY GLY
SEQRES   5 A   99  PHE ILE LYS VAL ARG GLN TYR ASP GLN ILE LEU ILE GLU
SEQRES   6 A   99  ILE CYS GLY HIS LYS ALA ILE GLY THR VAL LEU VAL GLY
SEQRES   7 A   99  PRO THR PRO VAL ASN ILE ILE GLY ARG ASN LEU LEU THR
SEQRES   8 A   99  GLN ILE GLY CYS THR LEU ASN PHE
SEQRES   1 B   99  PRO GLN ILE THR LEU TRP GLN ARG PRO LEU VAL THR ILE
SEQRES   2 B   99  LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASP THR
SEQRES   3 B   99  GLY ALA ASP ASP THR VAL LEU GLU GLU MET SER LEU PRO
SEQRES   4 B   99  GLY ARG TRP LYS PRO LYS MET ILE GLY GLY ILE GLY GLY
SEQRES   5 B   99  PHE ILE LYS VAL ARG GLN TYR ASP GLN ILE LEU ILE GLU
SEQRES   6 B   99  ILE CYS GLY HIS LYS ALA ILE GLY THR VAL LEU VAL GLY
SEQRES   7 B   99  PRO THR PRO VAL ASN ILE ILE GLY ARG ASN LEU LEU THR
SEQRES   8 B   99  GLN ILE GLY CYS THR LEU ASN PHE
ATOM      1  N   PRO A   1       8.133 -13.258  12.706  1.00  0.00
ATOM      2  CA  PRO A   1       9.325 -12.418  13.001  1.00  0.00
ATOM      3  C   PRO A   1       8.939 -10.978  13.283  1.00  0.00
ATOM      4  O   PRO A   1       7.813 -10.607  13.030  1.00  0.00
ATOM      5  CB  PRO A   1      10.211 -12.484  11.768  1.00  0.00
ATOM      6  CG  PRO A   1       9.219 -12.779  10.674  1.00  0.00
ATOM      7  CD  PRO A   1       8.271 -13.768  11.335  1.00  0.00
ATOM      8  H1  PRO A   1       7.974 -14.024  13.392  1.00  0.00


PDB file example HIV Protease (2)

ATOM   1844  CE2 PHE B  99       5.527 -13.746   8.735  1.00  0.00
ATOM   1845  CZ  PHE B  99       6.308 -12.665   8.239  1.00  0.00
ATOM   1846  OXT PHE B  99       5.672 -12.903  13.426  1.00  0.00
ATOM   1847  H   PHE B  99       5.668 -10.590  12.626  1.00  0.00
TER    1848      PHE B  99
HETATM 1849  C1        A   1      -3.676   0.038  -4.301  1.00  0.00
HETATM 1850  N21       A   1      -2.730  -0.070  -5.222  1.00  0.00
HETATM 1851  H28       A   1      -2.958   0.299  -6.126  1.00  0.00
HETATM 1852  C22       A   1      -1.389  -0.623  -4.962  1.00  0.00
HETATM 1853  H29       A   1      -1.369  -1.096  -3.981  1.00  0.00
HETATM 1854  C25       A   1      -1.031  -1.707  -6.000  1.00  0.00
HETATM 1855  H30       A   1      -1.021  -1.235  -6.985  1.00  0.00
HETATM 1856  C27       A   1      -2.085  -2.821  -6.044  1.00  0.00
HETATM 1857  H36       A   1      -1.845  -3.547  -6.818  1.00  0.00
HETATM 1858  H35       A   1      -3.079  -2.429  -6.267  1.00  0.00
HETATM 1859  H34       A   1      -2.140  -3.350  -5.091  1.00  0.00
HETATM 1860  C26       A   1       0.365  -2.310  -5.758  1.00  0.00
HETATM 1861  H33       A   1       0.450  -2.709  -4.748  1.00  0.00
HETATM 1862  H32       A   1       1.159  -1.573  -5.891  1.00  0.00
HETATM 1863  H31       A   1       0.564  -3.134  -6.440  1.00  0.00
HETATM 1864  C23       A   1      -0.360   0.506  -4.927  1.00  0.00
HETATM 1865  N37       A   1      -0.195   1.091  -3.733  1.00  0.00
HETATM 1866  H59       A   1      -0.715   0.711  -2.967  1.00  0.00
HETATM 1867  C38       A   1       0.602   2.329  -3.511  1.00  0.00
HETATM 1868  H60       A   1       1.052   2.671  -4.449  1.00  0.00
HETATM 1869  C46       A   1       1.713   2.066  -2.491  1.00  0.00
HETATM 1870  H68       A   1       1.221   1.950  -1.522  1.00  0.00


PDB file example HIV Protease (3)

CONECT 1943 1934 1941 1944
CONECT 1944 1943
CONECT 1945 1864
CONECT 1946 1849 1947 1960
CONECT 1947 1946 1948 1949 1950
CONECT 1948 1947
CONECT 1949 1947
CONECT 1950 1947 1951 1958
CONECT 1951 1950 1952
CONECT 1952 1951 1953 1954
CONECT 1953 1952
CONECT 1954 1952 1955 1956
CONECT 1955 1954
CONECT 1956 1954 1957 1958
CONECT 1957 1956
CONECT 1958 1950 1956 1959
CONECT 1959 1958
CONECT 1960 1946 1961 1962 1963
CONECT 1961 1960
CONECT 1962 1960
CONECT 1963 1960
CONECT 1964 1849
MASTER        0    0    0    0    0    0    0    0 1965    2  126   16
END


Protein Databases

• The PDB (www.pdb.org) is the main worldwide repository for the processing and distribution of 3D structure data for large molecules such as proteins and nucleic acids. At the time of this lecture (2004) it holds around 24,000 structures.

• Other databases (e.g. SwissProt http://au.expasy.org/sprot/) contain just sequence data for more proteins

• See also EBI: http://www.ebi.ac.uk/Databases/


PDB Growth


Follow-up

• Read chapter 2 of Leach & Gillet
• Read chapters 3 & 4 of Getting Started in Chemoinformatics