the crystallographic information framework as a metadata library · 2015. 9. 4. · pdbx/mmcif...
Metadata for raw data from X-ray diffraction and other structural techniques
A Satellite Workshop to the 29th European Crystallographic Meeting
The Crystallographic Information Framework as a metadata library
Brian McMahon
International Union of Crystallography, 5 Abbey Square, Chester CH1 [email protected]
CIF as a file format
1991 – Introduction of Crystallographic Information File format
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a New Standard Archive File for Crystallography. Acta Cryst. A47, 655-685
1996 – Introduction of mmCIF
Bourne, P. E., Berman, H. M., McMahon, B., Watenpaugh, K. D., Westbrook, J. & Fitzgerald, P. M. D. (1997). The Macromolecular Crystallographic Information File (mmCIF). Methods Enzymol. 277, 571-590
2000 – Introduction of imgCIF/CBF
Bernstein, H. J. & Hammersley, A. P. (2006). Specification of the Crystallographic Binary File (CBF/imgCIF). In International Tables For Crystallography Vol. G: Definition and exchange of crystallographic data. Dordrecht: Springer
2016 – Introduction of CIF2.0
Bernstein, H. J., Bollinger, J. C., Brown, I. D., Gražulis, S., Hester, J. R., McMahon, B., Spadaccini, N. & Westrip, S. P. (2015). Specification of the Crystallographic Information File (CIF) format, version 2.0. J. Appl. Cryst. Submitted
The irrelevance of format
Mitchell, L. A. & Holliday, B. J. (2014). 6-Bromo-N-(6-bromopyridin-2-yl)-N-[4-(2,3-dihydro-thieno[3,4-b][1,4]dioxin-5-yl)phenyl]pyridin-2-amine. Acta Cryst. E70, o797.
Rendered using publCIF [Westrip, S. P. (2010). J. Appl. Cryst. 43, 920-925]
Rendered using Mercury [Macrae, C. F. et al. (2008). J. Appl. Cryst. 41, 466-470]
The irrelevance of format
Gao, Y. R., Feng, N., Chen, T., Li, D. F. & Bi, L. J. (2015). Structure of the MarR family protein Rv0880 from Mycobacterium tuberculosis. Acta Cryst. F71, 741-745.
PDB structure 4YIF
Rendered using Jmol [Hanson, R. M. (2010). J. Appl. Cryst. 43, 1250-1260]
The irrelevance of format
Example MAR image file example.mar2300
imgCIF format CBF format
CIF as a data model
Year   Format innovation   Data description evolution
1991   CIF                 Initial decoupling of syntax/semantics
1995                       DDL1: machine-readable semantics
1996   mmCIF               Relational data model
2000   imgCIF              (Handling of binary data)
2013                       DDLm: dynamic (methods) data model
2016   CIF2.0              (Extended character set, data types)
CIF as a data model
• DDL (dictionary definition language) is the mechanism in the CIF world for describing relationships between defined data objects
• DDL (data description language) is the mechanism in the relational database world for characterising relationships between items in the database
• The two perform very similar functions
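To make the parallel concrete, here is a minimal sketch of how one CIF-style dictionary definition could be translated into a relational column definition. The `definition` dictionary, the `SQL_TYPES` mapping and both helper functions are hypothetical names invented for this illustration; they are not part of any CIF or PDB software, and a real schema generator would handle far more attributes.

```python
import sqlite3

# Hypothetical, simplified view of one dictionary definition. The keys echo
# mmCIF-style attributes (category, item name, type, lower bound).
definition = {
    "category": "entity",
    "item": "formula_weight",
    "type": "float",
    "minimum": 1.0,
}

# Assumed mapping from dictionary type codes to SQLite column types.
SQL_TYPES = {"float": "REAL", "int": "INTEGER", "char": "TEXT"}


def column_ddl(defn):
    """Translate one dictionary definition into a SQL column clause."""
    clause = f'{defn["item"]} {SQL_TYPES[defn["type"]]}'
    if defn.get("minimum") is not None:
        # The permitted range in the dictionary becomes a CHECK constraint.
        clause += f' CHECK ({defn["item"]} >= {defn["minimum"]})'
    return clause


def create_table(conn, defn):
    """Create a table named after the category, keyed by an 'id' column."""
    conn.execute(
        f'CREATE TABLE {defn["category"]} (id TEXT PRIMARY KEY, {column_ddl(defn)})'
    )


conn = sqlite3.connect(":memory:")
create_table(conn, definition)
conn.execute("INSERT INTO entity VALUES ('1', 18.015)")
```

In this sketch the dictionary plays the role that a schema definition plays in a relational database: the same description drives both table creation and value checking.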
CIF as a relational data model
World Directory of Crystallographers (WDC)
WDC initially implemented as a STAR database
Updated as an InterBase RDBMS
CIF as a relational data model
Protein Data Bank
PDB schema based on the mmCIF dictionary
The family of categories (in the mmCIF dictionary) used to describe polymer chemical entities; from Fitzgerald, P. M. D. et al. (2006). Classification and use of macromolecular data. In International Tables For Crystallography Vol. G: Definition and exchange of crystallographic data. Dordrecht: Springer
Global schema map of the entire PDB relational database; from Schierz, A. C., Soldatova, L. N. & King, R. D. (2007). Nature Biotechnology, 25, 437-442
Structure of a CIF dictionary definition
save__entity.formula_weight
    _item_description.description
;              Formula mass in daltons of the entity.
;
    _item.name                  '_entity.formula_weight'
    _item.category_id             entity
    _item.mandatory_code          no
    loop_
    _item_range.maximum
    _item_range.minimum
                                  .    1.0
                                  1.0  1.0
    _item_type.code               float
save_
As it appears in the mmCIF dictionary (easy for computers to read)
As it appears on the PDB website (easier for people to read?)
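As an illustration of what "easy for computers to read" means in practice, the sketch below checks a value against the range loop of the definition above. It assumes the usual reading of such range rows: each (maximum, minimum) pair is an open interval, `.` means no bound, and a row with equal bounds admits exactly that value. The function names and the `FORMULA_WEIGHT_RANGES` constant are hypothetical.

```python
def parse_bound(token):
    """Return None for the CIF 'inapplicable' marker '.', else a float."""
    return None if token == "." else float(token)


def in_range(value, ranges):
    """Check a value against (maximum, minimum) token pairs.

    Assumption: a pair with equal bounds permits exactly that value;
    any other pair is treated as an open interval.
    """
    for max_tok, min_tok in ranges:
        hi, lo = parse_bound(max_tok), parse_bound(min_tok)
        if hi is not None and lo is not None and hi == lo:
            if value == lo:
                return True
            continue
        if (lo is None or value > lo) and (hi is None or value < hi):
            return True
    return False


# The two rows of the range loop: (maximum, minimum).
FORMULA_WEIGHT_RANGES = [(".", "1.0"), ("1.0", "1.0")]

print(in_range(18.015, FORMULA_WEIGHT_RANGES))
print(in_range(0.5, FORMULA_WEIGHT_RANGES))
```

Under these assumptions the two rows together accept any formula weight greater than or equal to 1.0.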
Structure of a CIF dictionary definition
data_cell_volume
    _name                      '_cell_volume'
    _category                    cell
    _type                        numb
    _type_conditions             esd
    _enumeration_range           0.0:
    _units                       A^3^
    _units_detail              'cubic angstroms'
    _definition
;              Cell volume V in angstroms cubed.

               V = a b c [1 - cos^2^(alpha) - cos^2^(beta) - cos^2^(gamma)
                          + 2 cos(alpha) cos(beta) cos(gamma) ] ^1/2^

               a     = _cell_length_a
               b     = _cell_length_b
               c     = _cell_length_c
               alpha = _cell_angle_alpha
               beta  = _cell_angle_beta
               gamma = _cell_angle_gamma
;
A more complex example. The relationships with other data items are described – but not in a form that a computer program can do anything with.
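For comparison, the formula in the human-readable definition has to be re-coded by hand in every program that needs it. A minimal sketch of such a hand-coded version follows; the function name and argument order are choices made for this example, with lengths in angstroms and angles in degrees.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from cell lengths and cell angles (in degrees).

    Implements V = a b c [1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                          + 2 cos(alpha) cos(beta) cos(gamma)]^(1/2).
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# A cubic cell with a 10 angstrom edge, and a hexagonal cell.
print(cell_volume(10, 10, 10, 90, 90, 90))
print(cell_volume(3, 3, 5, 90, 90, 120))
```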
Structure of a CIF dictionary definition
In the new DDLm, by contrast, methods are introduced that can be interpreted and executed by computer programs. This allows data to be validated, or absent values to be derived, provided that more primitive data values are present.
save_cell.volume
    _definition.id               '_cell.volume'
    loop_
    _alias.definition_id         '_cell_volume'
    _definition.update            2013-03-07
    _description.text
;
    Volume of the crystal unit cell.
;
    _name.category_id             cell
    _name.object_id               volume
    _type.purpose                 Measurand
    _type.source                  Derived
    _type.container               Single
    _type.contents                Real
    _enumeration.range            0.0:
    _units.code                   angstrom_cubed
    loop_
    _method.purpose
    _method.expression
    Evaluation
;
    With v as cell_vector

    _cell.volume = v.a * ( v.b ^ v.c )
;
save_
• Spadaccini, N. & Hall, S. R. (2012). Extensions to the STAR File Syntax. J. Chem. Inf. Model. 52, 1901-1906.
• Spadaccini, N. & Hall, S. R. (2012). DDLm: A New Dictionary Definition Language. J. Chem. Inf. Model. 52, 1907-1916.
• Spadaccini, N., Castleden, I. R., Du Boulay, D. & Hall, S. R. (2012). dREL: A Relational Expression Language for Dictionary Methods. J. Chem. Inf. Model. 52, 1917-1925.
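The dREL method for the cell volume can be mirrored in a general-purpose language. The sketch below reads `^` as a vector cross product and `*` as a dot product, which is how the expression yields a scalar volume; that reading, the Cartesian convention for building the cell vectors (a along x, b in the xy plane) and all function names are assumptions of this example rather than anything prescribed by the dictionary.

```python
import math


def cell_vectors(a, b, c, alpha, beta, gamma):
    """Cartesian cell vectors, assuming a along x and b in the xy plane."""
    al, be, ga = (math.radians(x) for x in (alpha, beta, gamma))
    va = (a, 0.0, 0.0)
    vb = (b * math.cos(ga), b * math.sin(ga), 0.0)
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return va, vb, (cx, cy, cz)


def cross(u, v):
    """Vector cross product of two 3-vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    """Scalar product of two 3-vectors."""
    return sum(p * q for p, q in zip(u, v))


def cell_volume(va, vb, vc):
    """Scalar triple product a . (b x c), mirroring v.a * ( v.b ^ v.c )."""
    return dot(va, cross(vb, vc))


va, vb, vc = cell_vectors(5.0, 6.0, 7.0, 80.0, 95.0, 100.0)
print(cell_volume(va, vb, vc))
```

The point of the executable method is that this logic lives once, in the dictionary, instead of being re-implemented in each program.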
Structure of a CIF dictionary definition
Another, more complicated example: structure factor
save_refln.F_complex
. . .
    loop_
    _method.purpose
    _method.expression
    Evaluation
;
    With r as refln                    # reflection packet in scope
    fc = Complex (0., 0.)
    h = r.hkl
    Loop a as atom_site {              # Summation over atom sites
        x = a.fract_xyz
        f = (r.form_factor_table[a.type_symbol] *
             a.symmetry_multiplicity * a.occupancy)
        B = a.matrix_beta
        Loop s as symmetry_equiv {     # Summation over symmetry
            fc += f * Exp(-h*s.R*B*s.R*h) *
                  ExpImag(TwoPi*(h*(s.R*x+s.T)))
        }
    }
    refln.F_complex = fc               # evaluate defined item
;
save_
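$$F(\mathbf{h}) = \sum_{j}^{\text{asymmetric}} f_j(s) \sum_{k}^{\text{symmetry}} \exp\left(-\mathbf{h} R_k B_j R_k \mathbf{h}\right) \exp\left[2\pi i\, \mathbf{h}\cdot\left(R_k \mathbf{r}_j + \mathbf{t}_k\right)\right]$$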
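The structure-factor method above translates fairly directly into a conventional language. The following is a minimal sketch, not the dictionary's own implementation: each atom carries a single scalar `form_factor` in place of the per-reflection form-factor table, the anisotropic term is read as the quadratic form (hR) B (hR), and the dictionary keys and helper names are hypothetical.

```python
import cmath
import math


def row_times_matrix(v, m):
    """Row vector times 3x3 matrix."""
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]


def matrix_times_col(m, v):
    """3x3 matrix times column vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def dot(u, v):
    """Scalar product of two 3-vectors."""
    return sum(p * q for p, q in zip(u, v))


def structure_factor(hkl, atoms, symops):
    """Sum over atom sites and symmetry operations, as in the dREL method."""
    fc = 0j
    for atom in atoms:  # summation over atom sites
        f = atom["form_factor"] * atom["multiplicity"] * atom["occupancy"]
        for R, T in symops:  # summation over symmetry operations
            hR = row_times_matrix(hkl, R)
            # Assumed reading of -h*R*B*R*h: quadratic form in the rotated index.
            damping = math.exp(-dot(hR, matrix_times_col(atom["beta"], hR)))
            moved = [p + t for p, t in zip(matrix_times_col(R, atom["xyz"]), T)]
            fc += f * damping * cmath.exp(2j * math.pi * dot(hkl, moved))
    return fc


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ZERO_BETA = [[0.0] * 3 for _ in range(3)]


def make_atom(xyz):
    """Hypothetical atom record with unit occupancy and no thermal motion."""
    return {"form_factor": 6.0, "multiplicity": 1, "occupancy": 1.0,
            "beta": ZERO_BETA, "xyz": xyz}


# Two identical atoms at the origin and the body centre, identity symmetry only.
atoms = [make_atom([0.0, 0.0, 0.0]), make_atom([0.5, 0.5, 0.5])]
symops = [(IDENTITY, [0.0, 0.0, 0.0])]
print(abs(structure_factor([1, 1, 0], atoms, symops)))
```

With this body-centred toy arrangement, reflections with h + k + l odd cancel while those with h + k + l even add, which gives a quick sanity check on the phase term.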
Existing CIF dictionaries
‘Official’: managed by COMCIFS
Dictionary name            DDL     Purpose
cif_core.dic               1.4.1   crystallographic core
cif_core_restraints.dic    1.4.1   restraints or constraints in least-squares refinement of crystal structures
cif_pd.dic                 1.4.1   powder diffraction
cif_ms.dic                 1.4.1   incommensurately modulated crystal structures
cif_rho.dic                1.4.1   results of electron density studies
cif_twinning.dic           1.4.1   crystallographic twinning
cif_img.dic                2.1.3   diffraction images
cif_sym.dic                2.1.3   crystallographic symmetry
meta-dictionaries
ddl_core.dic               1.4.1   dictionary definition language
ddl_core_2.1.6.dic         2.1.6   relational dictionary definition language
cifdic.register            1.4     register of dictionaries distributed by IUCr
Existing CIF dictionaries
‘Official’: managed by wwPDB
Dictionary name      DDL      Purpose
PDBx/mmCIF           2.1.15   PDB Exchange Dictionary supporting data files in the PDB archive
NMR-STAR.dic         2.1.x    NMR structures in the BioMagResBank archive
mmcif_nef.dic        2.1.x    NMR exchange format
mmcif_sas.dic        2.1.x    small-angle scattering
mmcif_em.dic         2.1.x    3D electron microscopy data
mmcif_img.dic        2.1.x    PDB maintained version of diffraction image data description
mmcif_sym.dic        2.1.x    PDB maintained version of crystallographic symmetry
mmcif_biosync.dic    2.1.x    features of synchrotron facilities and beamlines
mmcif_mdb.dic        2.1.x    homology models and homology modelling methodologies
meta-dictionaries
mmcif_ddl.dic        2.1.15   relational dictionary definition language
Existing CIF dictionaries
‘Private’: distributed by IUCr
Dictionary name    DDL     Purpose
cif_iucr.dic       1.4.1   private data items used by the IUCr in journal publishing
cif_ccdc.dic       1.4.1   private data items used by the Cambridge Crystallographic Data Centre
meta-dictionaries
cif_compat.dic     1.4.1   legacy CIF Dictionary of deprecated terms
Putative CIF dictionaries
Reserved namespaces registered with IUCr
Prefix                  Purpose/owner
B+S, BplusS, pdb2cif    H. J. Bernstein
CCP4                    CCP4 software suite
H5, NX                  support of HDF5 and NeXus integration
NIEHS                   Natl Inst. of Environmental Health Sciences
SSAD                    Sulfur SAD Database
acihd                   ACI Heidelberg
amcsd                   American Mineralogist Crystal Structure Database
anbf                    Australian National Beamline Facility
asd                     Active Site Database
bruker                  Bruker AXS
ccdc                    Cambridge Crystallographic Data Centre
cgraph                  Oxford Cryosystems Crystallographica package
cod                     Crystallography Open Database
crystmol                CrystMol software
csd                     Cambridge Structural Database
dft                     density functional theory calculations
ebi                     EBI macromolecular harvest deposition file
edchem                  Edinburgh University Chemistry
gsas                    GSAS powder refinement system
gsk                     GlaxoSmithKline
iims                    Integration of 3D Electron Microscopy with X-ray and NMR methods
itqb                    Inst. Tecnologia Quimica e Biologica da Univer. Nova de Lisboa, Oeiras, Portugal
iucr                    IUCr journal use
mdb                     database of model structures of biological molecules
montpellier             University of Montpellier
mpod, prop              Material Properties Open Database
msd                     EBI Molecular Structure Database Group
ndb                     Nucleic Acids Database
nottingham              University of Nottingham
oxford                  U. Oxford CRYSTALS software
parvati                 PARVATI validation server
pdb, pdbx               Protein Data Bank
phenix                  PHENIX software suite
publcif                 publCIF editor software
rayonix                 Rayonix (Mar USA) instruments
rcsb                    Research Collaboratory for Structural Bioinformatics
shelx                   SHELXL solution and refinement programs
tcod                    Theoretical Crystallography Open Database
vrf                     checkCIF validation reply form entries
wdc                     entries in World Directory of Crystallographers
xtal                    Xtal program system
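A reserved prefix lets software recognise who owns a locally defined data name. The sketch below assumes the common convention that such names begin with an underscore, the prefix, and another underscore (for example `_ccdc_geom_bond_type`); the `RESERVED_PREFIXES` table is a small hand-copied subset of the list above and `reserved_owner` is a hypothetical helper.

```python
# A few entries copied from the register above; not the full list.
RESERVED_PREFIXES = {
    "ccdc": "Cambridge Crystallographic Data Centre",
    "iucr": "IUCr journal use",
    "shelx": "SHELXL solution and refinement programs",
    "pdbx": "Protein Data Bank",
}


def reserved_owner(data_name):
    """Return the registered owner if the name starts with a reserved prefix."""
    name = data_name.lstrip("_").lower()
    for prefix, owner in RESERVED_PREFIXES.items():
        # Require the trailing underscore so 'ccdcfoo' does not match 'ccdc'.
        if name.startswith(prefix + "_"):
            return owner
    return None


print(reserved_owner("_ccdc_geom_bond_type"))
print(reserved_owner("_cell_volume"))
```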
Potential CIF dictionaries
Proposed, under development or abandoned
Dictionary name      Purpose
magCIF               magnetic structures; under development by Commission on Magnetic Structures
Xasformat, xasCIF    efforts to define a data standard for X-ray absorption spectroscopy; input from Commission on XAFS
MPOD                 Materials Properties Open Database ontology
Lattice topology     e.g. following descriptive model of TOPOS software
Data flow in crystallography
Experiment (synchrotron or laboratory)
Structure solution and refinement (laboratory)
IUCr journals
Other journals
Chemistry databases (CCDC)
Biological structure databases (PDB)
Data reduction
Raw experimental data (e.g. diffraction images)
Reduced/processed data (e.g. structure factors)
Derived data (e.g. coordinates, a.d.p.s)
retained by scientist
archived at facility (~6 months)
deposited
published/disseminated
validated
Data flow beyond crystallography
Scientific context for the experiment
• Theory
• Invention
• Development
• Synthesis
• Preparation
Scientific resource
• Technique
• Probe (X-ray, neutron, electron)
• Equipment
• Beamline
• Funding
• Time
• Scale
Motivation
• Curiosity
• Mission
• Exploitation
• Serendipity
• Literature
• Patents
Management
• Funder
• Research Framework
• Principal Investigator
• Accountability
IT infrastructure
• Word size, Byte order
• Encoding
• Lossiness
• Volume
• Software dependencies
• Bit-rot
Reliability
• Equipment maintenance
• Calibration
• Regression testing
• Statistical analysis
Archiving
• Repositories
• Digital Libraries
• Curated databases
Credit
• Citation
• Reputation
• Recognition, Prizes
Trust
• Verification
• Validation
• Re-use
• Authorisation
Knowledge
• Semantic Web
• Literature
• Text mining
• Theory generation
Acknowledgements
The activities described in the talks in today’s Symposium owe much to many collaborators over the years:
Alan Mighell, Alex Renshaw, Alexei Vagin, Allen Larson, Alun Ashton, Andre Authier, Andy Hammersley, Andy Howard, Arie Van Der Lee, Ashley Buckle, Ben Watts, Bill Clegg, Bob Hanson, Bob Sweet, Brian Matthews, Brian Toby, Charlie Bugg, Chris Nielsen, Colin Groom, Curt Haltiwanger, Dale Tronrud, Dave Duchamp, Dave Stampf, David Brown, David Watkin, David Watson, Doug Du Boulay, Doug Greer, Eldon Ulrich, Eleanor Dodson, Enrique Abola, Eric Gabe, Erica Yang, Ethan Merritt, Frances Bernstein, Frank Allen, George Ferguson, George Sheldrick, Gerard Bricogne, Gerard Kleywegt, Gotzon Madariaga, Greg Shields, Gunter Bergerhoff, Helen Berman, Herbert Bernstein, Howard Einspahr, Howard Flack, I. David Brown, Ian Bruno, James Hester, Jan Zelinka, Jean Richelle, John Huffman, Jim Kaduk, Joe Krahn, Joel Sussman, John Bollinger, John Helliwell, John Westbrook, Keith Watenpaugh, Kim Henrick, Lachlan Cranswick, Liz Lyon, Liz Potterton, Lynn Ten Eyck, Manfred Weiss, Mario Nardelli, Mark Koennecke, Martyn Winn, Matt Towler, Michael Scharf, Mike Dacombe, Mike Hoyland, Mike Hursthouse, Mois Aroyo, Nick Day, Nick England, Nick Spadaccini, Owen Johnson, Paul Edgington, Paul Mallinson, Paula Fitzgerald, Peter Grey, Peter Keller, Peter Murray-Rust, Peter Strickland, Phil Bourne, Phil Coppens, Ralf Grosse-Kunstleve, Richard Ball, Robert Downs, Sameer Velankar, Sandy Blake, Saulius Grazulis, Shoshana Wodak, Sidney Abrahams, Simon Coles, Simon Hodson, Simon Parsons, Simon Westrip, Sine Larsen, Steve Androulakis, Steve Bryant, Syd Hall, Ted Maslen, Tom Koetzle, Tom Terwilliger, Ton Spek, Tony Linden, Vicky Karen, Vivian Stojanoff, Weider Chang, Wolfgang Bluhm, Yvon Le Page
… and many more besides