The Crystallographic Information Framework as a metadata library · 2015. 9. 4. · PDBx/mmCIF...

Page 1: The Crystallographic Information Framework as a metadata library · 2015. 9. 4. · PDBx/mmCIF 2.1.15 PDBExchange Dictionary supporting data files in the archive NMR-STAR.dic 2.1.x

Metadata for raw data from X-ray diffraction and other structural techniques

A Satellite Workshop to the 29th European Crystallographic Meeting

The Crystallographic Information Framework as a metadata library

Brian McMahon

International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, UK
[email protected]

CIF as a file format

1991 – Introduction of Crystallographic Information File format

Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a New Standard Archive File for Crystallography. Acta Cryst. A47, 655-685

1996 – Introduction of mmCIF

Bourne, P. E., Berman, H. M., McMahon, B., Watenpaugh, K. D., Westbrook, J. & Fitzgerald, P. M. D. (1997). The Macromolecular Crystallographic Information File (mmCIF). Methods Enzymol. 277, 571-590

2000 – Introduction of imgCIF/CBF

Bernstein, H. J. & Hammersley, A. P. (2006). Specification of the Crystallographic Binary File (CBF/imgCIF). In International Tables For Crystallography Vol. G: Definition and exchange of crystallographic data. Dordrecht: Springer

2016 – Introduction of CIF2.0

Bernstein, H. J., Bollinger, J. C., Brown, I. D., Gražulis, S., Hester, J. R., McMahon, B., Spadaccini, N. & Westrip, S. P. (2015). Specification of the Crystallographic Information File (CIF) format, version 2.0. J. Appl. Cryst. Submitted

The irrelevance of format

Mitchell, L. A. & Holliday, B. J. (2014). 6-Bromo-N-(6-bromopyridin-2-yl)-N-[4-(2,3-dihydrothieno[3,4-b][1,4]dioxin-5-yl)phenyl]pyridin-2-amine. Acta Cryst. E70, o797.

Rendered using publCIF [Westrip, S. P. (2010). J. Appl. Cryst. 43, 920-935]

Rendered using Mercury [Macrae, C. F. et al. (2008). J. Appl. Cryst. 41, 466-470]

The irrelevance of format

Gao, Y. R., Feng, N., Chen, T., Li, D. F. & Bi, L. J. (2015). Structure of the MarR family protein Rv0880 from Mycobacterium tuberculosis. Acta Cryst. F71, 741-745.

PDB structure 4YIF

Rendered using Jmol [Hanson, R. M. (2010). J. Appl. Cryst. 43, 1250-1260]

The irrelevance of format

Example MAR image file example.mar2300

imgCIF format CBF format

CIF as a data model

Year   Format innovation   Data description evolution
1991   CIF                 Initial decoupling of syntax/semantics
1995                       DDL1: machine-readable semantics
1996   mmCIF               Relational data model
2000   imgCIF              (Handling of binary data)
2013                       DDLm: dynamic (methods) data model
2016   CIF2.0              (Extended character set, data types)

CIF as a data model

• DDL (dictionary definition language) is the mechanism in the CIF world for describing relationships between defined data objects

• DDL (data description language) is the mechanism in the relational database world for characterising relationships between items in the database

• The two perform very similar functions

CIF as a relational data model

World Directory of Crystallographers (WDC)

WDC initially implemented as a STAR database

Updated as an InterBase RDBMS

CIF as a relational data model

Protein Data Bank

PDB schema based on the mmCIF dictionary

The family of categories (in the mmCIF dictionary) used to describe polymer chemical entities; from Fitzgerald, P. M. D. et al. (2006). Classification and use of macromolecular data. In International Tables For Crystallography Vol. G: Definition and exchange of crystallographic data. Dordrecht: Springer

Global schema map of the entire PDB relational database; from Schierz, A. C., Soldatova, L. N. & King, R. D. (2007). Nature Biotechnology, 25, 437-442
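To make the parallel between a CIF category and a relational table concrete, here is a minimal Python sketch. It is not the PDB's actual schema. It assumes only the standard-library sqlite3 module. The item names _entity.id, _entity.type and _entity.formula_weight come from the mmCIF dictionary, while the helper name category_to_table, the untyped SQL columns and the data values are invented for illustration.

```python
import sqlite3

# Looped mmCIF items for the 'entity' category; the values are invented for illustration.
ENTITY_ITEMS = ["_entity.id", "_entity.type", "_entity.formula_weight"]
ENTITY_ROWS = [
    ("1", "polymer", 18500.2),
    ("2", "water", 18.015),
]

def category_to_table(conn, items, rows):
    """Create one SQL table per CIF category, with one column per data item."""
    category = items[0].lstrip("_").split(".")[0]
    columns = [name.split(".", 1)[1] for name in items]
    # Identifiers come from the fixed list above, so simple quoting is enough here.
    column_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{category}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{category}" VALUES ({placeholders})', rows)
    return category

conn = sqlite3.connect(":memory:")
table = category_to_table(conn, ENTITY_ITEMS, ENTITY_ROWS)
# Once loaded, the category can be queried like any other relational table.
heavy = conn.execute(
    "SELECT id FROM entity WHERE formula_weight > 1000 ORDER BY id"
).fetchall()
```

The category name becomes the table name and the part of each data name after the dot becomes a column, which is the correspondence the mmCIF relational model relies on.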

Structure of a CIF dictionary definition

save__entity.formula_weight
    _item_description.description
;              Formula mass in daltons of the entity.
;
    _item.name                  '_entity.formula_weight'
    _item.category_id             entity
    _item.mandatory_code          no
    loop_
    _item_range.maximum
    _item_range.minimum            .    1.0
                                 1.0    1.0
    _item_type.code               float
save_

As it appears in the mmCIF dictionary (easy for computers to read)

As it appears on the PDB website (easier for people to read?)
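Because the definition is machine-readable, a program can check data values against it. The following Python sketch illustrates the idea. The definition is transcribed by hand into a dictionary whose layout is an assumption, as are the function names. The range rule used (bounds are exclusive unless maximum and minimum are equal) is my reading of the DDL2 convention and should be checked against the DDL2 documentation before being relied on.

```python
# Hand-transcribed from the _entity.formula_weight definition; None stands for
# the CIF '.' (no bound). The dictionary layout is an illustrative assumption.
FORMULA_WEIGHT_DEF = {
    "name": "_entity.formula_weight",
    "type": "float",
    "mandatory": False,
    "ranges": [(None, 1.0), (1.0, 1.0)],  # (maximum, minimum) pairs
}

def in_range(value, maximum, minimum):
    """Assumed DDL2 rule: equal bounds admit exactly that value; otherwise bounds are exclusive."""
    if maximum is not None and minimum is not None and maximum == minimum:
        return value == minimum
    above = minimum is None or value > minimum
    below = maximum is None or value < maximum
    return above and below

def validate(raw, definition):
    """Return a list of problems found for one raw CIF value (empty if valid)."""
    if raw in (".", "?"):
        return ["missing mandatory value"] if definition["mandatory"] else []
    try:
        value = float(raw)
    except ValueError:
        return [f"{raw!r} is not of type {definition['type']}"]
    if not any(in_range(value, mx, mn) for mx, mn in definition["ranges"]):
        return [f"{value} is outside the permitted range"]
    return []
```

Under these assumptions the two range rows together admit any formula weight of 1.0 or more. Values carrying a standard uncertainty in parentheses are not handled by this sketch.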

Structure of a CIF dictionary definition

data_cell_volume
    _name                      '_cell_volume'
    _category                    cell
    _type                        numb
    _type_conditions             esd
    _enumeration_range           0.0:
    _units                       A^3^
    _units_detail              'cubic angstroms'
    _definition
;              Cell volume V in angstroms cubed.

               V = a b c [1 - cos^2^(alpha) - cos^2^(beta) - cos^2^(gamma)
                          + 2 cos(alpha) cos(beta) cos(gamma) ] ^1/2^

               a     = _cell_length_a
               b     = _cell_length_b
               c     = _cell_length_c
               alpha = _cell_angle_alpha
               beta  = _cell_angle_beta
               gamma = _cell_angle_gamma
;

A more complex example. The relationships with other data items are described – but not in a form that a computer program can do anything with.
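To use the relationship in this definition, a programmer has to read the free-text formula and code it by hand. A minimal Python sketch of that hand transcription follows. The function name and the choice of degrees for the angles are assumptions made for illustration.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume in cubic angstroms; lengths in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # V = a b c [1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
    #            + 2 cos(alpha) cos(beta) cos(gamma)]^(1/2)
    return a * b * c * math.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)
```

For a cubic cell with a = b = c = 10 and all angles 90 degrees, the expression in brackets reduces to 1 and the volume is 1000 cubic angstroms, up to floating-point rounding.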

Structure of a CIF dictionary definition

In the new DDLm, however, dictionary definitions can carry methods that computer programs can interpret and execute. This allows data to be validated, or absent values to be derived, provided that the more primitive data values are present.

save_cell.volume
    _definition.id              '_cell.volume'
    loop_
      _alias.definition_id      '_cell_volume'
    _definition.update           2013-03-07
    _description.text
;
    Volume of the crystal unit cell.
;
    _name.category_id            cell
    _name.object_id              volume
    _type.purpose                Measurand
    _type.source                 Derived
    _type.container              Single
    _type.contents               Real
    _enumeration.range           0.0:
    _units.code                  angstrom_cubed
    loop_
      _method.purpose
      _method.expression
      Evaluation
;
    With v as cell_vector

    _cell.volume = v.a * ( v.b ^ v.c )
;
save_
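As a rough illustration of what executing such a method involves, the Python sketch below derives the cell volume only when it is absent from the data. It reads the dREL expression as a scalar triple product, taking `^` as the vector product and `*` as the scalar product. It assumes a common Cartesian convention (a along x, b in the xy plane) for building the cell vectors. The function names and the use of a plain dictionary keyed by dotted data names are illustrative choices, not part of DDLm or dREL.

```python
import math

def cell_vectors(a, b, c, alpha, beta, gamma):
    """Cartesian cell vectors with a along x and b in the xy plane (angles in degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    cx = c * cb
    cy = c * (ca - cb * cg) / sg
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return (a, 0.0, 0.0), (b * cg, b * sg, 0.0), (cx, cy, cz)

def cross(u, v):
    """Vector product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    """Scalar product of two 3-vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def derive_cell_volume(data):
    """Return _cell.volume, evaluating the method only if the value is absent."""
    if "_cell.volume" in data:
        return data["_cell.volume"]
    va, vb, vc = cell_vectors(
        data["_cell.length_a"], data["_cell.length_b"], data["_cell.length_c"],
        data["_cell.angle_alpha"], data["_cell.angle_beta"], data["_cell.angle_gamma"],
    )
    return dot(va, cross(vb, vc))  # v.a * ( v.b ^ v.c )
```

The same pattern, in which a stored value is returned if present and the method is evaluated from more primitive items otherwise, is what lets software either validate a reported value or retrieve a missing one.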

• Spadaccini, N. & Hall, S. R. (2012). Extensions to the STAR File Syntax. J. Chem. Inf. Model. 52, 1901-1906.

• Spadaccini, N. & Hall, S. R. (2012). DDLm: A New Dictionary Definition Language. J. Chem. Inf. Model. 52, 1907-1916.

• Spadaccini, N., Castleden, I. R., Du Boulay, D. & Hall, S. R. (2012). dREL: A Relational Expression Language for Dictionary Methods. J. Chem. Inf. Model. 52, 1917-1925.

Structure of a CIF dictionary definition

Another, more complicated example: structure factor

save_refln.F_complex
    . . .
    loop_
      _method.purpose
      _method.expression
      Evaluation
;
    With r as refln                      # reflection packet in scope
      fc = Complex (0., 0.)
      h = r.hkl
      Loop a as atom_site {              # Summation over atom sites
        x = a.fract_xyz
        f = (r.form_factor_table[a.type_symbol] *
             a.symmetry_multiplicity * a.occupancy)
        B = a.matrix_beta
        Loop s as symmetry_equiv {       # Summation over symmetry
          fc += f * Exp(-h*s.R*B*s.R*h) *
                ExpImag(TwoPi*(h*(s.R*x+s.T)))
        }
      }
      refln.F_complex = fc               # evaluate defined item
;
save_

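$$
F(\mathbf{h}) = \sum_{j}^{\text{asymmetric}} f_j \sum_{k}^{\text{symmetry}} e^{-\mathbf{h} R_k B_j R_k \mathbf{h}} \, e^{2\pi i\, \mathbf{h} \cdot (R_k \mathbf{x}_j + \mathbf{T}_k)}
$$

Here j runs over the atom sites of the asymmetric unit and k over the symmetry operations; the symbols f, B, x, R and T are those of the method text.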
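The double summation in this method can be mirrored in ordinary code. The Python sketch below is a simplified illustration under stated assumptions, not a transcription of the dictionary method. The form factor is taken as a single number per atom rather than looked up per reflection. The displacement term is read as h R B Rᵀ h, which is my interpretation of `h*s.R*B*s.R*h`. The dictionary keys and helper names are invented.

```python
import cmath
import math

def mat_vec(m, v):
    """Multiply a 3x3 matrix (nested tuples) by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def vec_mat(v, m):
    """Multiply a row 3-vector by a 3x3 matrix."""
    return tuple(sum(v[i] * m[i][j] for i in range(3)) for j in range(3))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def structure_factor(hkl, atoms, symops):
    """Sum f * exp(-h R B R^T h) * exp(2 pi i h.(R x + T)) over atoms and symmetry operations."""
    fc = 0j
    for atom in atoms:                       # summation over atom sites
        f = atom["form_factor"] * atom["multiplicity"] * atom["occupancy"]
        for rot, trans in symops:            # summation over symmetry
            hr = vec_mat(hkl, rot)           # row vector h R
            damping = math.exp(-dot(hr, mat_vec(atom["beta"], hr)))
            x_equiv = tuple(p + t for p, t in zip(mat_vec(rot, atom["xyz"]), trans))
            fc += f * damping * cmath.exp(2j * math.pi * dot(hkl, x_equiv))
    return fc

# Invented test case: one atom at x = 0.25 with identity and inversion operations.
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
INVERSION = ((-1, 0, 0), (0, -1, 0), (0, 0, -1))
ZERO_BETA = ((0.0, 0.0, 0.0),) * 3
atom = {"form_factor": 6.0, "multiplicity": 1, "occupancy": 1.0,
        "xyz": (0.25, 0.0, 0.0), "beta": ZERO_BETA}
symops = [(IDENTITY, (0.0, 0.0, 0.0)), (INVERSION, (0.0, 0.0, 0.0))]
f_100 = structure_factor((1, 0, 0), [atom], symops)
f_200 = structure_factor((2, 0, 0), [atom], symops)
```

In this invented case the two symmetry-related contributions cancel for the 100 reflection and add in phase for the 200 reflection, which gives a quick sanity check on the phase term.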

Existing CIF dictionaries

‘Official’: managed by COMCIFS

Dictionary name            DDL     Purpose
cif_core.dic               1.4.1   crystallographic core
cif_core_restraints.dic    1.4.1   restraints or constraints in least-squares refinement of crystal structures
cif_pd.dic                 1.4.1   powder diffraction
cif_ms.dic                 1.4.1   incommensurately modulated crystal structures
cif_rho.dic                1.4.1   results of electron density studies
cif_twinning.dic           1.4.1   crystallographic twinning
cif_img.dic                2.1.3   diffraction images
cif_sym.dic                2.1.3   crystallographic symmetry

meta-dictionaries
ddl_core.dic               1.4.1   dictionary definition language
ddl_core_2.1.6.dic         2.1.6   relational dictionary definition language
cifdic.register            1.4     register of dictionaries distributed by IUCr

Existing CIF dictionaries

‘Official’: managed by wwPDB

Dictionary name      DDL      Purpose
PDBx/mmCIF           2.1.15   PDB Exchange Dictionary supporting data files in the PDB archive
NMR-STAR.dic         2.1.x    NMR structures in the BioMagResBank archive
mmcif_nef.dic        2.1.x    NMR exchange format
mmcif_sas.dic        2.1.x    small-angle scattering
mmcif_em.dic         2.1.x    3D electron microscopy data
mmcif_img.dic        2.1.x    PDB maintained version of diffraction image data description
mmcif_sym.dic        2.1.x    PDB maintained version of crystallographic symmetry
mmcif_biosync.dic    2.1.x    features of synchrotron facilities and beamlines
mmcif_mdb.dic        2.1.x    homology models and homology modelling methodologies

meta-dictionaries
mmcif_ddl.dic        2.1.15   relational dictionary definition language

Existing CIF dictionaries

‘Private’: distributed by IUCr

Dictionary name    DDL     Purpose
cif_iucr.dic       1.4.1   private data items used by the IUCr in journal publishing
cif_ccdc.dic       1.4.1   private data items used by the Cambridge Crystallographic Data Centre

meta-dictionaries
cif_compat.dic     1.4.1   legacy CIF Dictionary of deprecated terms

Putative CIF dictionaries

Reserved namespaces registered with IUCr

Prefix                  Purpose/owner
B+S, BplusS, pdb2cif    H. J. Bernstein
CCP4                    CCP4 software suite
H5, NX                  support of HDF5 and NeXus integration
NIEHS                   Natl Inst. of Environmental Health Sciences
SSAD                    Sulfur SAD Database
acihd                   ACI Heidelberg
amcsd                   American Mineralogist Crystal Structure Database
anbf                    Australian National Beamline Facility
asd                     Active Site Database
bruker                  Bruker AXS
ccdc                    Cambridge Crystallographic Data Centre
cgraph                  Oxford Cryosystems Crystallographica package
cod                     Crystallography Open Database
crystmol                CrystMol software
csd                     Cambridge Structural Database
dft                     density functional theory calculations
ebi                     EBI macromolecular harvest deposition file
edchem                  Edinburgh University Chemistry
gsas                    GSAS powder refinement system
gsk                     GlaxoSmithKline
iims                    Integration of 3D Electron Microscopy with X-ray and NMR methods
itqb                    Inst. Tecnologia Quimica e Biologica da Univer. Nova de Lisboa, Oeiras, Portugal
iucr                    IUCr journal use
mdb                     database of model structures of biological molecules
montpellier             University of Montpellier
mpod, prop              Material Properties Open Database
msd                     EBI Molecular Structure Database Group
ndb                     Nucleic Acids Database
nottingham              University of Nottingham
oxford                  U. Oxford CRYSTALS software
parvati                 PARVATI validation server
pdb, pdbx               Protein Data Bank
phenix                  PHENIX software suite
publcif                 publCIF editor software
rayonix                 Rayonix (Mar USA) instruments
rcsb                    Research Collaboratory for Structural Bioinformatics
shelx                   SHELXL solution and refinement programs
tcod                    Theoretical Crystallography Open Database
vrf                     checkCIF validation reply form entries
wdc                     entries in World Directory of Crystallographers
xtal                    Xtal program system
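A reserved prefix lets software recognise which private namespace a data name belongs to without needing the corresponding dictionary. The Python sketch below shows one way this might be done. It assumes the naming pattern `_<prefix>_...` and case-insensitive matching, uses only a handful of the prefixes listed above, and the function name is invented. The data names in the usage note are examples given for illustration only.

```python
# A subset of the reserved prefixes listed above, lower-cased for comparison.
RESERVED_PREFIXES = {
    "ccdc": "Cambridge Crystallographic Data Centre",
    "iucr": "IUCr journal use",
    "pdb": "Protein Data Bank",
    "pdbx": "Protein Data Bank",
    "shelx": "SHELXL solution and refinement programs",
}

def reserved_prefix(data_name):
    """Return the reserved prefix that a data name starts with, or None."""
    stem = data_name.lstrip("_").lower()
    # Try longer prefixes first so that 'pdbx' is not shadowed by 'pdb'.
    for prefix in sorted(RESERVED_PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix + "_"):
            return prefix
    return None
```

With this sketch, a name such as `_shelx_res_file` is attributed to the `shelx` namespace, while a core name such as `_cell_length_a` matches no reserved prefix.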

Potential CIF dictionaries

Proposed, under development or abandoned

Dictionary name      Purpose
magCIF               magnetic structures; under development by Commission on Magnetic Structures
Xasformat, xasCIF    efforts to define a data standard for X-ray absorption spectroscopy; input from Commission on XAFS
MPOD                 Materials Properties Open Database ontology
Lattice topology     e.g. following descriptive model of TOPOS software

Data flow in crystallography

[Diagram: flow of data from experiment to journals and databases]

Stages: Experiment (synchrotron or laboratory); Data reduction; Structure solution and refinement (laboratory); IUCr journals; Other journals; Chemistry databases (CCDC); Biological structure databases (PDB)

Types of data:
• Raw experimental data (e.g. diffraction images)
• Reduced/processed data (e.g. structure factors)
• Derived data (e.g. coordinates, a.d.p.s)

Fate of the data: retained by scientist; archived at facility (~6 months); deposited; published/disseminated; validated

Data flow beyond crystallography

Scientific context for the experiment: Theory • Invention • Development • Synthesis • Preparation

Scientific resource: Technique • Probe (X-ray, neutron, electron) • Equipment • Beamline • Funding • Time • Scale

Motivation: Curiosity • Mission • Exploitation • Serendipity • Literature • Patents

Management: Funder • Research Framework • Principal Investigator • Accountability

IT infrastructure: Word size, Byte order • Encoding • Lossiness • Volume • Software dependencies • Bit-rot

Reliability: Equipment maintenance • Calibration • Regression testing • Statistical analysis

Archiving: Repositories • Digital Libraries • Curated databases

Credit: Citation • Reputation • Recognition, Prizes

Trust: Verification • Validation • Re-use • Authorisation

Knowledge: Semantic Web • Literature • Text mining • Theory generation

Acknowledgements

The activities described in the talks in today’s Symposium owe much to many collaborators over the years:

Alan Mighell, Alex Renshaw, Alexei Vagin, Allen Larson, Alun Ashton, Andre Authier, Andy Hammersley, Andy Howard, Arie Van Der Lee, Ashley Buckle, Ben Watts, Bill Clegg, Bob Hanson, Bob Sweet, Brian Matthews, Brian Toby, Charlie Bugg, Chris Nielsen, Colin Groom, Curt Haltiwanger, Dale Tronrud, Dave Duchamp, Dave Stampf, David Brown, David Watkin, David Watson, Doug Du Boulay, Doug Greer, Eldon Ulrich, Eleanor Dodson, Enrique Abola, Eric Gabe, Erica Yang, Ethan Merritt, Frances Bernstein, Frank Allen, George Ferguson, George Sheldrick, Gerard Bricogne, Gerard Kleywegt, Gotzon Madariaga, Greg Shields, Gunter Bergerhoff, Helen Berman, Herbert Bernstein, Howard Einspahr, Howard Flack, I. David Brown, Ian Bruno, James Hester, Jan Zelinka, Jean Richelle, John Huffman, Jim Kaduk, Joe Krahn, Joel Sussman, John Bollinger, John Helliwell, John Westbrook, Keith Watenpaugh, Kim Henrick, Lachlan Cranswick, Liz Lyon, Liz Potterton, Lynn Ten Eyck, Manfred Weiss, Mario Nardelli, Mark Koennecke, Martyn Winn, Matt Towler, Michael Scharf, Mike Dacombe, Mike Hoyland, Mike Hursthouse, Mois Aroyo, Nick Day, Nick England, Nick Spadaccini, Owen Johnson, Paul Edgington, Paul Mallinson, Paula Fitzgerald, Peter Grey, Peter Keller, Peter Murray-Rust, Peter Strickland, Phil Bourne, Phil Coppens, Ralf Grosse-Kunstleve, Richard Ball, Robert Downs, Sameer Velankar, Sandy Blake, Saulius Grazulis, Shoshana Wodak, Sidney Abrahams, Simon Coles, Simon Hodson, Simon Parsons, Simon Westrip, Sine Larsen, Steve Androulakis, Steve Bryant, Syd Hall, Ted Maslen, Tom Koetzle, Tom Terwilliger, Ton Spek, Tony Linden, Vicky Karen, Vivian Stojanoff, Weider Chang, Wolfgang Bluhm, Yvon Le Page

… and many more besides
