
Crystallography Open Database (COD)

Saulius Gražulis, Andrius Merkys, and Antanas Vaitkus

Contents

1 Introduction
2 A Short History of COD
3 Scope and Contents of the COD
4 COD Data Semantics and Selection
5 Accessing the COD
5.1 Web Access to the COD
5.2 Using the RESTful Interfaces
5.3 Querying SQL Database
6 COD Applications
7 Conclusions
References

Abstract

The Crystallography Open Database (COD, http://crystallography.net/) is, as of the time of writing, the largest open-access collection of mineral, metal-organic, organometallic, and small organic crystal structures, excluding biomolecules, which are stored separately in the Protein Data Bank (http://wwpdb.org/). Unlike other existing chemical crystal structure databases, the COD is fully open: all its structures may be downloaded, used, and re-disseminated without restriction, along with the results derived from them. Currently, the COD contains >385,000 records and is growing constantly, encompassing most structures published in the peer-reviewed academic press and donations by individual researchers. This article describes how data are organized in the COD and how the database can be queried, downloaded, and processed for various purposes.

S. Gražulis (✉) · A. Merkys · A. Vaitkus
Department of Protein-DNA Interactions, Vilnius University Institute of Biotechnology, Vilnius, Lithuania
e-mail: [email protected]

© Springer International Publishing AG, part of Springer Nature 2018
W. Andreoni, S. Yip (eds.), Handbook of Materials Modeling, https://doi.org/10.1007/978-3-319-42913-7_66-1

1 Introduction

X-ray crystallography is an extremely powerful method for determining the inner structure of condensed matter. Soon after the discovery of X-rays (Röntgen 1896) and the first records of their diffraction on crystalline samples (Friedrich et al. 1912a,b), the number of structures determined by this technique started to grow. An explanation of X-ray scattering from first principles (Bragg and Bragg 1913; Bragg 1913) allowed determination of structural models for a vast variety of solid materials in a uniform way, from simple inorganics to very large biomolecules. As more and more crystal structures appeared, it became evident that the numbers in their descriptions (such as crystal unit cell parameters and atomic coordinates), made uniform by the availability of a common scattering theory, possess great value in themselves, and efforts to collect them systematically were started. The first collections were in paper form (Hermann and Ewald 1931; IUCr 2017c; Kitajgorodskij 1955), and numeric data accompanied crystallographic publications in journals dedicated to this field from the very first publications (for instance, in the Acta Crystallographica journal started by the IUCr in 1948 (Clews and Cochran 1948)).

The growing availability and power of electronic computers allowed crystallographers to use them for structure determination and prompted the idea that crystal structure data can also be handled automatically (Brown and McMahon 2002). A first dedicated crystallographic database, the CSD, was established by the CCDC in 1965 (Groom and Allen 2014) to collect structures of small organic molecules and embraced computer-assisted methods for information storage and retrieval (Allen et al. 1979). Data about inorganic crystals (Kaduk 2002), alloys (White et al. 2002) and powder diffraction data (Kabekkodu et al. 2002) were historically kept in separate archives. Today, we have a whole range of databases, differing in their scope, size and licensing model, covering various aspects of crystallographic data (Table 1).

As seen from Table 1, various licensing models are employed to support the operation of the databases. About a third of all resources, including some of the oldest and the largest ones, use a subscription-based model, where a user of these databases must agree to a license and is restricted in what he or she may do with the data obtained from the resource. As long as the main vehicles of database dissemination were paper editions or magnetic tape reels that could be used only in computer centers, this situation seemed fairly acceptable. In the epoch of ubiquitous computer access and with the advent of the Internet, however, researchers expressed concerns that certain licensing clauses are overly restrictive. For example, the restriction on disseminating derived results was mentioned as an impediment to scientific work (Baldi et al. 2011; Andronico et al. 2011). As a result, several modern databases were created anew, following an open-access dissemination model, and in certain cases can be used in situations where licensing requirements are too restrictive (Sadowski and Baldi 2013). Among them, the Crystallography Open Database (COD) is currently the largest and the oldest open resource of small molecule crystal structures, providing access to data in mineralogy and chemical crystallography and placing its entire collection in the public domain.

Table 1 Overview of largest crystallographic databases, their subject areas, sizes and licensing models

| No. | Database | Records | License | Current URL | Est. | Reference |
|---|---|---|---|---|---|---|
| 1 | PDF | 380,000 | Subscription based | http://www.icdd.com/products/pdf4.htm | 1941 | Faber and Fawcett (2002) |
| 2 | CSD | 800,000 | Subscription based | http://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/ | 1965 | Groom et al. (2016) |
| 3 | PDB | 124,000 | Open access | http://www.rcsb.org/pdb | 1971 | Protein Data Bank (1971); Berman et al. (2012) |
| 4 | ICSD | 200,000 | Subscription based | https://icsd.fiz-karlsruhe.de/ | 1987 | Belsky et al. (2002) |
| 5 | NDB | 8600 | Open access | http://ndbserver.rutgers.edu/ | 1992 | Berman et al. (1992); Narayanan et al. (2014) |
| 6 | Pauling file | 290,000 | Subscription based | http://paulingfile.com, http://crystdb.nims.go.jp/index_en.html | 1995 | Villars et al. (1998, 2004) |
| 7 | IZA Zeolite database | 176 | Open access | http://www.iza-structure.org/databases/ | 1996 | Baerlocher et al. (2007) |
| 8 | CRYSTMET | 170,000 | Subscription based | http://www.TothCanada.com, https://cds.dl.ac.uk/cgi-bin/news/disp?crystmet | 1996 | White et al. (2002) |
| 9 | Bilbao server | | | http://www.cryst.ehu.es | 1997 | Aroyo et al. (2011) |
| 10 | AMCSD | 20,000 | Open access | http://rruff.geo.arizona.edu/AMS/amcsd.php | 2003 | Downs and Hall-Wallace (2003); Rajan et al. (2006) |
| 11 | COD | 367,000 | Public domain | http://www.crystallography.net/cod | 2003 | Gražulis et al. (2009, 2012) |
| 12 | PCOD | 1,000,000 | Public domain | http://www.crystallography.net/pcod | 2003 | Le Bail (2005) |
| 13 | MPOD | 300 | Public domain | http://mpod.cimav.edu.mx | 2010 | Pepponi et al. (2012) |
| 14 | B-IncStrDB (Bilbao Incommensurate Structures Database) | 140 | Open access | http://webbdcrista1.ehu.es/incstrdb/ | 2010 | Aroyo et al. (2006) |
| 15 | TCOD | 2,600 | Public domain | http://www.crystallography.net/tcod | 2013 | Merkys et al. (2017); Chateigner et al. (2015) |
| 16 | RRUFF | 47,000 | Open access | http://rruff.info/ | 2015 | Lafuente et al. (2015) |
| 17 | MAGNDATA (Bilbao Magnetic Structure Database) | 428 | Open access | http://webbdcrista1.ehu.es/magndata/ | 2015 | Perez-Mato et al. (2015) |

2 A Short History of COD

The COD project started as a community initiative, when crystallographers on the SDPD (Structure Determination by Powder Diffraction) mailing list discussed possible modes of crystallographic data dissemination. It was 2003: computers were becoming cheap, Internet connections widely available, and free/libre open source software (F/LOSS) ubiquitous. Armel Le Bail raised the question of whether it is possible, by joining community efforts, to build a crystallographic database that is entirely open and free for everyone to use. Answering that question, Michael Berndt (1964–2003) listed three conditions that were necessary and sufficient for community resource creation and curation: “A small team of engaged scientists with some experience in database and software design to coordinate the project; the authors (i.e., the scientific community = you) who provide the project with database entries /.../; free software (a) for maintaining the database, (b) for data evaluation and calculation of derived data.” With this plan in mind, the COD project started and turned out to be a viable alternative to the top-down, heavily funded database projects. From 2003 to 2007, the COD database master copy was maintained by Armel Le Bail at Le Mans University in France. In 2007 its collection of 50,000 records was ported to the Institute of Biotechnology in Vilnius, Lithuania, where software development for the COD and database maintenance continued. When the Institute of Biotechnology was merged with Vilnius University in 2011, COD development was continued by a joint team from the Vilnius University Institute of Biotechnology and the Faculty of Mathematics and Informatics.

Throughout these transfers of maintainership, the COD has been governed by an international COD Advisory Board (AB), listed on the COD Web site and operating via a mailing list. The COD AB establishes the COD data management policies and sets inclusion criteria for the COD data. In this way, continuity of database quality is maintained.

During the 10 years since 2007, the COD grew constantly and reached >385,000 records in 2017 (Fig. 1). This was made possible by the introduction of a new data deposition Web site (Fig. 2), which allows both manual and automatic uploads of data to the COD, and by the development of automated data collection and deposition software that deposits available structures to the COD automatically. This automation, in turn, is greatly facilitated by the Crystallographic Information Framework (CIF) (Hall et al. 1991; IUCr 2017b). The CIF was initially used to facilitate crystallographic paper publication and to reduce typing errors in data by providing automated means of crystallographic data processing (Brown and McMahon 2002). The introduction of electronic data handling in the publication process significantly reduced typing errors in data publication, a significant step towards reliable data reuse. Moreover, the availability of crystal structure descriptions in a standardized, machine-readable form as supplementary material for scientific publications greatly facilitated reuse of those data. As a result, the COD data acquisition subsystem can automatically ingest all necessary values and formulate structure description records, using information publicly available with the IUCr publications and from journals of some other publishers that make the necessary information publicly available. The same CIF framework makes it possible for the COD to present its entire data collection in a widely accepted, standardized form, so that researchers can use the same software to process the COD CIFs as they use for the outputs of structure determination programs or for files from journal Web pages.

Fig. 1 Growth of the COD database by year (horizontal axis: year, 2008–2017; vertical axis: COD record number, 0–400,000)

The data collection procedure conducted by the COD is not completely straightforward, though. Virtually all structures, even though represented in the standard CIF format next to the publications, lack essential metadata such as the publication bibliography; sometimes computed items such as cell volumes or space group names are missing or presented in a non-standardized form. Such information is automatically inserted by the COD data processing pipeline (Gražulis et al. 2009). Moreover, a non-negligible part of the supplementary files, although they contain the necessary data in a form similar to the CIF, do not strictly follow the CIF syntax. Since the number of such cases was too large to be corrected manually, an error-correcting CIF parser was implemented (Merkys et al. 2016). The same procedure is followed when data are deposited by researchers into the COD using the Web deposition interface (Gražulis et al. 2012). In this way the COD ensures that all structure descriptions that enter its collection are syntactically correct, i.e., conform to the syntax defined by the IUCr (2017a).


Fig. 2 The Crystallography Open Database Web site

With this setup, the COD is ready to grow further, to provide open access to crystal structure data for researchers and all interested parties, and to evolve to meet the challenges of the new millennium. The computing landscape changes rapidly, with new techniques, languages, formats and protocols coming and going every day, and computer architectures changing so fast that any reasonable scientific archive must outlive many generations of computer software and hardware. The basic principles of the COD design and the successful operation of the COD for more than a decade suggest that the methods chosen by the COD founders were sound and that the COD will successfully evolve into the future.

3 Scope and Contents of the COD

The COD collects machine-readable descriptions of crystal structures for inorganic compounds, minerals, small organic molecules, and metal-organic and organometallic compounds. Proteins, nucleic acids and their complexes, glycoproteins and the like are as a rule excluded from the COD, since they are systematically collected in an open-access database, the Protein Data Bank (PDB) (Berman et al. 2012). Most of the “small molecule” structures in the COD are refined under the assumption of independent atom parameters (using full-matrix least squares refinement) and a spherical atom model. This makes the COD suitable, for example, for generating restraints on molecular geometries and for refining larger molecules or molecular assemblies (Long et al. 2017a,b). We must note, however, that this assumption does not necessarily hold for all COD entries. For larger entries, or when disorder is present, restraints can be put by the authors on the thermal displacement parameters. For structures solved using powder diffraction techniques, restraints on bond lengths and angles can also be used. Finally, some structures in the COD are solved by hybrid methods, using powder diffraction to carry out Rietveld refinement and DFT to further refine atomic parameters; some structures are reported entirely based on DFT calculations. Obviously, determining bond length and angle parameters from restrained structures would result in circular reasoning, since the same restraints were already used during the structure refinement process. Thus, the user is advised to inspect structure determination parameters and to select those structures that are suitable for his or her work.

4 COD Data Semantics and Selection

To facilitate structure selection, the COD maintains a set of flags that describe the experimental and refinement techniques used for structure determination. In the COD SQL database, the “method” column of the “data” table describes the experimental technique, which can be “single crystal”, “powder diffraction” or “theoretical”. If the value of this column is NULL, the method is most probably single crystal diffraction. Unfortunately, in many structures the most popular method, “single crystal”, is not mentioned explicitly, so this assumption is to some degree a guess; but the structures solved by “powder diffraction” or “theoretical” methods are usually marked more accurately and are less numerous, so the guess should be reasonably safe. Structures marked as “theoretical” are in fact solved by DFT computations without using any structure-specific experimental data. These structures are of course more appropriate for a different database, the TCOD, which is dedicated to theoretical structures, and are most likely also deposited there. They ended up in the COD because they were provided as supplementary material to some papers and were not marked as theoretical; only later data curation revealed their determination method. Several important theoretical structures, e.g., from the DFT method error estimate studies (Lejaeghere et al. 2014), were deposited to the COD before the TCOD was fully operational, as they were deemed important enough to require permanent storage in a database. Since the COD policy is not to delete any records, so that COD IDs, once assigned, remain stable, such entries are marked with appropriate flags but not removed.

Further, the COD database tables contain several fields describing experimental techniques, taken from the IUCr Core CIF dictionary. The “radiation”, “radType” and “radSymbol” columns of the “data” table are derived directly from the CIF data items _diffrn_radiation_probe, _diffrn_radiation_type and _diffrn_radiation_xray_symbol, respectively. These data items allow distinguishing between structures obtained from X-ray, neutron and electron diffraction data (the “radiation” column can have the values “x-ray”, “neutron” or “electron” for the respective radiation types). Again, as with the “single crystal” value, the most popular radiation type, “x-ray”, is often not marked and is thus represented as a NULL value. We can expect that authors are more attentive when they submit a structure determined by a less common method, but some caution is of course appropriate.

When selecting records from the COD, one must keep in mind certain bookkeeping data items. Certain structures are deposited to the COD that are deliberately worse than the best possible interpretation; this is usually done in publications to demonstrate that the main interpretation of data offered by the authors is correct or indeed the best one. The COD policy is to include such structures (so that the claims of the paper can be easily verified) but to mark them as “suboptimal.” In COD CIFs, such structures are marked with the _cod_suboptimal_structure yes and _cod_related_optimal_struct data items, and in the COD “data” table they have a non-NULL value in the “optimal” column pointing to the related optimal structure. Unless explicit comparison of suboptimal and optimal structures is sought, only structures with NULL “optimal” values should be selected.

Another issue is structures that contain known problems. Again, the COD policy is not to remove such structures once they have been included in the COD, but to flag them appropriately. This flag is recorded in the “status” column of the COD database “data” table. Possible values for this column are “warnings”, “errors”, and “retracted”. The “warnings” level indicates that the structure might be correct after all but has strange features, an unusual description, or wrong metadata. The “errors” level marks structures that have been proven wrong by subsequent published observations or authors’ corrigenda, or that contain serious data consistency problems preventing correct interpretation of the structure. In all cases, _cod_error_description gives a human-readable description of the problem. Finally, “retracted” in the “status” column indicates that the structure was retracted and should not be used under any circumstances. The reasons for retraction may vary, but usually this flag indicates very serious problems up to outright scientific fraud, as was the case discovered in one IUCr investigation (Harrison et al. 2009); in such cases, the original publications are retracted as well.

The last thing to take care of is the presence of duplicated entries in the COD. Unfortunately, due to less stringent admission procedures in the earliest days of the COD, or due to programming or data encoding errors, the same structure is sometimes deposited to the COD more than once. Once again, when such a situation is detected, neither entry is removed from the COD; instead, one entry, usually the most complete one, is declared to be the “main” entry describing this structure, and the others are marked as “duplicates” using the _cod_duplicate_entry data item. If the main entry is missing some information that is present in the duplicates, this information is merged into the main entry and committed as a new revision. Duplicate entries are marked by a non-NULL “duplicateof” column in the “data” table. Thus, to select only those entries that are not marked as duplicates, one needs to select entries that have the “duplicateof” column set to NULL.

It must be noted that only technical duplicates are flagged as such in the COD, i.e., only structures that originate from the same original description and from the same publication. Two structures of the same compound reported in different publications are not considered duplicates and are stored as different COD records. Even when the same data file is published as supplementary material to two different publications, it is deposited under two different COD identifiers. The rationale here is that a COD record reports an instance of the crystal structure solution reported somewhere, and all such cases must be represented in the database. Further reduction of the multiple records is the responsibility of the COD user, and, indeed, different tasks will require different uniqueness criteria: in some cases these will be based on chemical identity, in other cases on crystal structure identity, and the COD must provide sufficient data for all such queries.

Collecting the above considerations into one SQL query, we can select all non-retracted experimental structures that are not marked as duplicates and have atomic coordinates with the query displayed in Listing 1; the query reports the number of such entries in the current COD SQL database and can be extended to narrow down the selection further based on crystal parameters.

Listing 1 Number of non-retracted experimental structures with coordinates in the COD that are not marked as duplicates

#!/bin/bash

# Count non-duplicate, non-retracted, non-theoretical entries that have coordinates.
mysql -h sql.crystallography.net cod -u cod_reader -t -e \
'select count(*), current_timestamp() from data
  where duplicateof is null
    and flags like "%has coordinates%"
    and (status is null or status != "retracted")
    and (method is null or method != "theoretical")'

+----------+---------------------+
| count(*) | current_timestamp() |
+----------+---------------------+
|   383573 | 2017-12-04 14:07:01 |
+----------+---------------------+
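The selection rules of this section can also be assembled programmatically. The following is a minimal sketch under stated assumptions, not part of the COD tooling: the helper names null_or_not and cod_selection_sql are hypothetical, and only the column names (duplicateof, optimal, flags, status, method) are taken from the text. Compared with Listing 1, it additionally skips suboptimal structures (non-NULL “optimal”) and entries flagged with “errors”. Since an SQL comparison with NULL is never true, each inequality is paired with an explicit “is null” test, as in Listing 1.

```shell
#!/bin/bash

# Print a predicate that keeps rows whose COLUMN is NULL or differs from VALUE.
# A bare 'column != "value"' would silently drop the rows where COLUMN is NULL.
# (Hypothetical helper, introduced only for this illustration.)
null_or_not() {
    local column="$1" value="$2"
    printf '(%s is null or %s != "%s")' "$column" "$column" "$value"
}

# Print a query that counts entries passing the filters discussed in the text.
# (Hypothetical helper; column names are those mentioned in this section.)
cod_selection_sql() {
    cat <<EOF
select count(*) from data
 where duplicateof is null
   and optimal is null
   and flags like "%has coordinates%"
   and $(null_or_not status retracted)
   and $(null_or_not status errors)
   and $(null_or_not method theoretical)
EOF
}

# Show the generated statement; nothing is sent to a server here.
cod_selection_sql

# To run it against the public server used in Listing 1:
#   mysql -h sql.crystallography.net cod -u cod_reader -t -e "$(cod_selection_sql)"
```

Whether entries with “warnings” or “errors” status should be kept depends on the task; dropping or adding a null_or_not line adjusts the selection accordingly.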

5 Accessing the COD

5.1 Web Access to the COD

The COD offers several methods to access its structure collection. The one that requires the least effort to learn is probably the Web query form (see Fig. 3). Multiple parameters can be specified, most of which should be self-explanatory; their exact meaning is the same as for the RESTful query fields and can be looked up in Table 2.

Fig. 3 The Web query form of the Crystallography Open Database

Table 2 RESTful interface search parameters and their descriptions

| Parameter | Description |
|---|---|
| format | The format in which the results will be returned |
| formula | The empirical chemical formula of the crystal. Chemical element symbols in the formula must be ordered according to the Hill notation and separated by a space symbol, e.g., “C8 H10 N4 O2” |
| el1, el2, ..., el8 | Chemical element symbols that must appear in the chemical formula |
| nel1, nel2, ..., nel4 | Chemical element symbols that must not appear in the chemical formula |
| strictmin, strictmax | The minimum/maximum number of distinct chemical elements that must appear in the chemical formula |
| amin, amax | The minimum/maximum value of the lattice parameter a |
| bmin, bmax | The minimum/maximum value of the lattice parameter b |
| cmin, cmax | The minimum/maximum value of the lattice parameter c |
| minZ, maxZ | The minimum/maximum Z value of the lattice |
| year | The year of publication of the crystal structure |

Results from a Web query are displayed in a separate browser page as an HTML table (Fig. 4); in addition, options are provided to download the list of resulting structures as a list of COD identifiers, as a list of download URLs, or as a CSV format table. For a small number of hits, a ZIP archive of all found CIFs is offered, but for a larger number of structures (typically more than several thousand), this option is not available in order to avoid excessive stress on the COD servers; instead, the user is advised to download the COD structures in full and pick the desired CIFs using the COD identifier list resulting from the search.
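Picking CIFs from a full local copy by COD identifier can be sketched as follows. This is an illustration only: it assumes that the local copy stores each seven-digit entry as ROOT/D/DD/DD/DDDDDDD.cif (for example 1/00/00/1000000.cif); this layout is an assumption that is not stated in this chapter and should be checked against the actual download. The function names, the directory /data/cod/cif, and the two sample identifiers are likewise hypothetical placeholders.

```shell
#!/bin/bash

# Map a seven-digit COD identifier to its path below ROOT,
# assuming the layout ROOT/D/DD/DD/DDDDDDD.cif (an assumption, see above).
cod_local_path() {
    local id="$1" root="${2:-cif}"
    if [[ ! "$id" =~ ^[0-9]{7}$ ]]; then
        echo "not a COD identifier: $id" >&2
        return 1
    fi
    printf '%s/%s/%s/%s/%s.cif\n' \
        "$root" "${id:0:1}" "${id:1:2}" "${id:3:2}" "$id"
}

# Read identifiers from standard input (one per line, e.g. the list saved
# from a search) and print the corresponding local paths.
cod_paths_from_list() {
    local root="$1" id
    while IFS= read -r id; do
        if [ -n "$id" ]; then
            cod_local_path "$id" "$root"
        fi
    done
}

# Example with two placeholder identifiers and a placeholder directory.
printf '1000000\n2017123\n' | cod_paths_from_list /data/cod/cif
```

The printed paths can then be handed to cp, tar, or any CIF-processing program; identifiers that do not consist of seven digits are reported on standard error and skipped.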

5.2 Using the RESTful Interfaces

The COD offers a RESTful interface that allows one to retrieve information about COD entries based on certain criteria as well as the crystal structure files themselves. REST (REpresentational State Transfer) is an architectural style of network-based programs that was outlined in the doctoral dissertation of Roy Fielding (2000). The main ideas of this architecture relevant for the COD are to use a client-server design (the COD server serves multiple clients), to make the COD server as stateless as possible (thus the same request to the COD server should yield identical results if repeated several times), to use standard connections based on the HTTP protocol and stable Web URIs, and to use standard formats (CIF, HTML) to exchange information. An interface based on the ideas of REST, a so-called RESTful interface, has the benefit of not requiring a specialized client program, since the queries can be executed by any piece of software capable of resolving URIs including, but not limited to, most Internet browsers.

Fig. 4 An example result page from a Crystallography Open Database Web query

COD RESTful search query URIs adhere to the HTTP GET query format, taking http://www.crystallography.net/cod/result as the base URI. For example, a query that returns a list of COD IDs associated with structures that contain Li and O atoms and were published in 2017 would take the form:

http://www.crystallography.net/cod/result?el1=Li&el2=O&year=2017&format=lst

As mentioned above, specialized software is not required, but it can ease the construction of the query strings. The same request rewritten to use the cURL program is given in Listing 2.

Listing 2 Querying the RESTful interface using cURL

#!/bin/bash

curl 'http://www.crystallography.net/cod/result' \
     -d 'el1=Li' \
     -d 'el2=O' \
     -d 'year=2017' \
     -d 'format=lst'


Several more examples of COD RESTful interface queries using cURL are given in Listings 3, 4, 5, and 6. A description of the query parameters used is given in Table 2. The full list of supported parameters and formats can be found at http://wiki.crystallography.net/RESTful_API/.

Listing 3 Count of structures that contain Fe atoms, but no O atoms

#!/bin/bash

curl 'http://www.crystallography.net/cod/result' \
     -d 'el1=Fe' \
     -d 'nel1=O' \
     -d 'format=count'

Listing 4 Information about entries that contain only Fe and N atoms in JSON format

#!/bin/bash

curl 'http://www.crystallography.net/cod/result' \
     -d 'el1=Fe' \
     -d 'el2=N' \
     -d 'strictmin=2' \
     -d 'strictmax=2' \
     -d 'format=json'

Listing 5 Text file with URLs of entries that have the “C O2” chemical formula

#!/bin/bash

curl 'http://www.crystallography.net/cod/result' \
     -d 'formula=C O2' \
     -d 'format=urls'

Listing 6 ZIP archive containing CIF files of entries that have cell lengths between 30 Å and 35 Å and Z number between 3 and 4

curl 'http://www.crystallography.net/cod/result' \
     -d 'amin=30&amax=35' \
     -d 'bmin=30&bmax=35' \
     -d 'cmin=30&cmax=35' \
     -d 'minZ=3&maxZ=4' \
     -d 'format=zip'
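A list of COD IDs, such as the one requested with format=lst in Listing 2, can be turned into addresses of individual CIF files. The following is a minimal sketch under two assumptions that are not stated in this chapter: that the list contains one seven-digit COD ID per line, and that each entry is served at a URL of the form http://www.crystallography.net/cod/<ID>.cif. The helper name cod_ids_to_cif_urls is hypothetical.

```shell
#!/bin/bash

# cod_ids_to_cif_urls is a hypothetical helper. It reads COD IDs from
# standard input, one per line, and prints one CIF URL per ID.
# Assumptions: IDs are seven-digit numbers, and entries are served as
# http://www.crystallography.net/cod/<ID>.cif. Carriage returns are
# dropped and lines that are not seven digits are skipped.
cod_ids_to_cif_urls() {
    tr -d '\r' |
        grep -E '^[0-9]{7}$' |
        sed 's|.*|http://www.crystallography.net/cod/&.cif|'
}

# Two IDs taken from the cucurbituril example in Listing 8.
printf '%s\n' 4320271 4320272 | cod_ids_to_cif_urls
```

For a handful of entries the resulting URLs could be passed to cURL one by one; for thousands of entries, downloading the whole collection remains the advised route.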

5.3 Querying SQL Database

SQL (Structured Query Language) is arguably the most powerful method of interrogating relational databases and offers more features than the COD Web page or even the COD RESTful interface. The Crystallography Open Database offers read-only access to its data tables so that SQL queries can be carried out by users or by third-party software. Covering SQL syntax and its use is beyond the scope of this chapter, but numerous textbooks and on-line references on SQL exist, as well as excellent documentation of several F/LOSS implementations of SQL (MySQL is one of them). In this text we provide just a few examples that demonstrate how SQL queries can be used for querying the COD out of the box.

The COD SQL tables are constructed automatically from the COD CIF collection. Tables are updated by the post-commit hooks of the Subversion repository, so the SQL tables should always be in sync with the CIF collection. In the COD, the dataflow is always from CIFs to the SQL database; all changes in tables must therefore first be recorded and versioned in the main repository. MySQL thus acts essentially as a fast search cache for the COD, making use of index tables and the query optimizer. The COD MySQL “data” table also contains the “svnrevision” column, which records the Subversion revision from which each row was produced. In addition, all COD MySQL tables are dumped nightly in text form and committed to the same Subversion repository as the CIF collection. These archives provide a means to reproduce queries that were run some time ago, should this be needed for the reproducibility of scientific computations.
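As an illustration of how the recorded revision could be looked up, the sketch below assembles an SQL statement for a single entry. The helper name cod_revision_sql is hypothetical; the table name “data” and the columns “file” and “svnrevision” are those mentioned in this section, and it is assumed here that the “file” column holds the numeric COD ID.

```shell
#!/bin/bash

# cod_revision_sql is a hypothetical helper: it prints an SQL statement
# that selects the Subversion revision recorded for one COD entry.
# Only purely numeric IDs are accepted, so that no arbitrary text ends
# up inside the SQL statement.
cod_revision_sql() {
    case $1 in
        '' | *[!0-9]*)
            echo "cod_revision_sql: not a numeric COD ID: $1" >&2
            return 1
            ;;
    esac
    printf 'select file, svnrevision from data where file = %s\n' "$1"
}

# Print the statement for one of the entries shown in Listing 8.
cod_revision_sql 4320271

# The statement could then be passed to the client used in the listings:
#   mysql -h sql.crystallography.net cod -u cod_reader -t \
#         -e "$(cod_revision_sql 4320271)"
```

Storing the returned revision number next to the results of a computation makes it possible to check later which state of the entry was used.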

The simplest query counts the number of records in the current revision of the COD (Listing 7). A more elaborate form of this query, which filters out structures that are usually unwanted, is provided in Listing 1. Further examples (Listings 8, 9, and 10) demonstrate how various chemical features can be queried. Specifically, Listing 9 shows how the COD MySQL server can be queried using regular expressions, an extension of the SQL language. These queries permit selections based on atom chemical types, among other possibilities.

Listing 7 Number of entries in the COD

#!/bin/bash

mysql -h sql.crystallography.net cod -u cod_reader -t -e \
    'select count(*), current_timestamp() from data'

+----------+---------------------+
| count(*) | current_timestamp() |
+----------+---------------------+
|   387948 | 2017-12-04 14:07:02 |
+----------+---------------------+

Listing 8 DOIs and publication years of structures of cucurbituril

#!/bin/bash

mysql -h sql.crystallography.net cod -u cod_reader -t -e \
    'select file, doi, year from data
     where chemname like "%cucurbituril%"'


+---------+---------------------------+------+
| file    | doi                       | year |
+---------+---------------------------+------+
| 2200062 | 10.1107/S1600536800019498 | 2001 |
| 4320271 | 10.1021/ic015520p         | 2001 |
| 4320272 | 10.1021/ic015520p         | 2001 |
| 4320689 | 10.1021/ic010362n         | 2001 |
| 4320690 | 10.1021/ic010362n         | 2001 |
| 4508668 | 10.1021/cg060062m         | 2006 |
| 4508669 | 10.1021/cg060062m         | 2006 |
+---------+---------------------------+------+

Listing 9 Number of hydrocarbons

#!/bin/bash

mysql -h sql.crystallography.net cod -u cod_reader -t -e \
    'select count(*), current_timestamp() from data
     where formula regexp
     "- C[[:digit:]]* H[[:digit:]]* -"'

+----------+---------------------+
| count(*) | current_timestamp() |
+----------+---------------------+
|     1250 | 2017-12-04 14:07:02 |
+----------+---------------------+
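The behaviour of the regular expression from Listing 9 can be explored without a database connection. The sketch below applies the same pattern to a few formula strings with grep. The sample strings are assumptions inferred from the pattern itself (element symbols separated by spaces and enclosed in “- ” and “ -”), and it is assumed that grep -E and the MySQL regexp operator treat this particular pattern alike. The helper name is_hydrocarbon is hypothetical.

```shell
#!/bin/bash

# is_hydrocarbon is a hypothetical helper that succeeds when its
# argument matches the pattern used in Listing 9: a formula made of
# C and H only, each optionally followed by a count.
is_hydrocarbon() {
    printf '%s\n' "$1" |
        grep -Eq -e '- C[[:digit:]]* H[[:digit:]]* -'
}

# Assumed sample formulae: benzene, methane, and ethanol.
for formula in '- C6 H6 -' '- C H4 -' '- C2 H6 O -'; do
    if is_hydrocarbon "$formula"; then
        echo "match:    $formula"
    else
        echo "no match: $formula"
    fi
done
```

The first two strings match, while the third does not, because the pattern requires the closing “ -” directly after the hydrogen count.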

Listing 10 Five most voluminous MOFs

#!/bin/bash

mysql -h sql.crystallography.net cod -u cod_reader -t -e \
    'select file, chemname, vol from data
     where chemname like "%MOF%"
     order by vol desc
     limit 5'

+---------+-------------+--------+
| file    | chemname    | vol    |
+---------+-------------+--------+
| 4111295 | mesoMOF-1   | 122163 |
| 1519417 | Y-ftw-MOF-3 | 111361 |
| 1519416 | Y-ftw-MOF-2 |  64231 |
| 7032763 | MOF-205-NO2 |  27851 |
| 7032762 | MOF-205-NH2 |  27846 |
+---------+-------------+--------+


6 COD Applications

Even though the COD is not as large as some older crystallographic databases, it has numerous applications due to its open nature. One immediate possibility where the COD excels is teaching. Using the COD one can give students some real-life data search and crystallographic applications, illustrate structures of various compounds, and provide insights into modern chemical research areas (Gražulis et al. 2015). Advantages of the COD are its extremely rapid release cycle (the database is updated daily), a permissive license that allows students to download arbitrary parts or even the whole database to their computers, and its availability on the Internet, where it can be accessed from inside or outside the classroom.

Another widely accepted application of the COD is material identification with the help of the powder diffraction method and the search-match procedure. The largest diffractometer vendors (among them Bruker, PANalytical, Rigaku) have adapted the COD collection for their software and ship it with their equipment, providing regular updates on the COD Web site or on their own pages. Since the COD is an open database, these updates are free of charge for the end users. The COD has currently accumulated enough mineral structures so that it can be used for the SOLSA project (http://solsa-mining.eu), where the database is used, together with other information sources, as a tool for material identification and data dissemination.

In bioinformatics and drug design, the COD is used as a source of open data for restraint libraries (Long et al. 2017a,b). It is also used in DataWarrior (Sander et al. 2015) as one of the sources of chemical information and in the OpenMolecules Web site (http://www.openmolecules.org/). Software testing benefits from the large collection of COD data, where different cases need to be examined and data need to be stored in regression tests. Finally, the COD is used in fundamental research to answer different questions about matter (see, e.g., recent works on MOFs (First and Floudas 2013), hydrogen storage (Breternitz and Gregory 2015), or characterization of 2D materials (Mounet et al. 2018)).

7 Conclusions

The more than decade-long history of the COD has demonstrated that it is possible to build a lasting, high-quality scientific database using an open-access licensing model. In its current state, the COD is useful for a range of academic and industrial applications. Most importantly, this open database provides everyone with access to knowledge in its own field of small molecule crystallography. At the same time, there are many obvious improvements that can be made. Clearly the COD needs a more comprehensive data collection. More community organization effort is needed to involve more people in data correction, collection, and ensuring the quality of the COD records. More links with the rest of the Internet data resources should be made, integrating the COD more closely into the Linked Open Data Cloud. None of these tasks seems to be outside the reach of current possibilities, and so one can expect that in due time, the COD will be expanded to include all these features.

Acknowledgements This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 689868.

References

Allen FH, Bellard S, Brice MD, Cartwright BA, Doubleday A, Higgs H, Hummelink T, Hummelink-Peters BG, Kennard O, Motherwell WDS, Rodgers JR, Watson DG (1979) The Cambridge crystallographic data centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr Sect B Struct Crystallogr Crystal Chem 35(10):2331–2339

Andronico A, Randall A, Benz RW, Baldi P (2011) Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. J Chem Inf Model 51:760–776

Aroyo MI, Perez-Mato JM, Capillas C, Kroumova E, Ivantchev S, Madariaga G, Kirov A, Wondratschek H (2006) Bilbao crystallographic server: I. Databases and crystallographic computing programs. Zeitschrift für Kristallographie – Crystalline Materials 221(1):15–27

Aroyo MI, Perez-Mato JM, Orobengoa D, Tasci E, de la Flor G, Kirov A (2011) Crystallography online: Bilbao crystallographic server. Bulg Chem Commun 43(2):183–197

Baerlocher C, McCusker LB, Olson DH (2007) Atlas of zeolite framework types, 6th revised edn. Elsevier, Amsterdam/London/New York/Oxford/Paris/Shannon/Tokyo

Baldi P (2011) Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. A response to the letter by the Cambridge crystallographic data centre. J Chem Inf Model 51:3029

Belsky A, Hellenbrandt M, Karen VL, Luksch P (2002) New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Crystallogr B 58:364–369

Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B (1992) The nucleic acid database: a comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 63:751–759

Berman HM, Kleywegt GJ, Nakamura H, Markley JL (2012) The protein data bank at 40: reflecting on the past to prepare for the future. Structure 20:391–396

Bragg WH (1913) The reflection of x-rays by crystals. (II) Proc R Soc A Math Phys Eng Sci 89(610):246–248

Bragg WH, Bragg WL (1913) The reflection of x-rays by crystals. Proc R Soc Lond A Math Phys Eng Sci 88:428–438

Breternitz J, Gregory D (2015) The search for hydrogen stores on a large scale; a straightforward and automated open database analysis as a first sweep for candidate materials. Crystals 5:617–633

Brown ID, McMahon B (2002) CIF: the computer language of crystallography. Acta Crystallogr B 58:317–324

Chateigner D, Grazulis S, Pérez O, Pepponi G, Lutterotti L (2015) COD, PCOD, TCOD, MPOD... open structure and property databases. http://www.ecole.ensicaen.fr/~chateign/danielc/abstracts/Chateigner_abstract_JNCO2013.pdf, accessed 2018-10-03

Clews CJB, Cochran W (1948) The structures of pyrimidines and purines. I. A determination of the structures of 2-amino-4-methyl-6-chloropyrimidine and 2-amino-4,6-dichloropyrimidine by x-ray methods. Acta Crystallogr 1(1):4–11

Downs RT, Hall-Wallace M (2003) The American mineralogist crystal structure database. Am Miner 88:247–250


Faber J, Fawcett T (2002) The powder diffraction file: present and future. Acta Crystallogr B 58(3 Part 1):325–332

Fielding RT (2000) Architectural styles and the design of network-based software architectures. Ph.D. thesis, University of California, Irvine

First EL, Floudas CA (2013) Mofomics: computational pore characterization of metal-organic frameworks. Microporous Mesoporous Mater 165:32–39

Friedrich W, Knipping P, Laue M (1912) Interferenzerscheinungen bei Röntgenstrahlen. Eine quantitative Prüfung der Theorie für die Interferenz-Erscheinungen bei Röntgenstrahlen. Bayerische Akademie der Wissenschaften, Mathematisch-Physikalische Klasse, Sitzungsberichte, pp 303–322

Friedrich W, Knipping P, Laue M (1912) Interferenzerscheinungen bei Röntgenstrahlen. Eine quantitative Prüfung der Theorie für die Interferenz-Erscheinungen bei Röntgenstrahlen, II. Bayerische Akademie der Wissenschaften, Mathematisch-Physikalische Klasse, Sitzungsberichte, pp 363–373

Gražulis S, Chateigner D, Downs RT, Yokochi AFT, Quirós M, Lutterotti L, Manakova E, Butkus J, Moeck P, Le Bail A (2009) Crystallography open database: an open-access collection of crystal structures. J Appl Crystallogr 42(4):726–729

Gražulis S, Daškevič A, Merkys A, Chateigner D, Lutterotti L, Quirós M, Serebryanaya NR, Moeck P, Downs RT, Le Bail A (2012) Crystallography open database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic Acids Res 40(D1):D420–D427

Gražulis S, Sarjeant AA, Moeck P, Stone-Sundberg J, Snyder TJ, Kaminsky W, Oliver AG, Stern CL, Dawe LN, Rychkov DA, Losev EA, Boldyreva EV, Tanski JM, Bernstein J, Rabeh WM, Kantardjieff KA (2015) Crystallographic education in the 21st century. J Appl Crystallogr 48(6):1964–1975

Groom CR, Allen FH (2014) The Cambridge structural database in retrospect and prospect. Angew Chem Int Ed 53:662–671

Groom CR, Bruno IJ, Lightfoot MP, Ward SC (2016) The Cambridge structural database. Acta Crystallogr B 72(2):171–179

Hall SR, Allen FH, Brown ID (1991) The crystallographic information file (CIF): a new standard archive file for crystallography. Acta Crystallogr A 47(6):655–685

Harrison WTA, Simpson J, Weil M (2009) Editorial. Acta Crystallogr E Struct Rep Online 66(1):e1–e2

Hermann C, Ewald PP (1931) Strukturbericht 1913-1928: Zeitschrift für Kristallographie, Kristallgeometrie, Kristallphysik, Kristallchemie. Akademische Verlagsgesellschaft, Leipzig

IUCr (2017) A formal grammar for CIF. https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax, accessed 2018-10-03

IUCr (2017) Crystallographic information framework. https://www.iucr.org/resources/cif, accessed 2018-10-03

IUCr (2017) Structure reports. https://www.iucr.org/publications/other/structure-reports, accessed 2018-10-03

Kabekkodu SN, Faber J, Fawcett T (2002) New powder diffraction file (pdf-4) in relational database format: advantages and data-mining capabilities. Acta Crystallogr B 58:333–337

Kaduk JA (2002) Use of the inorganic crystal structure database as a problem solving tool. Acta Crystallogr B 58(Pt 3 Pt 1):370–379

Lafuente B, Downs RT, Yang H, Stone N (2015) The power of databases: the RRUFF project. In: Highlights in mineralogical crystallography. W. De Gruyter, Berlin, pp 1–30

Le Bail A (2005) Inorganic structure prediction with grinsp. J Appl Crystallogr 38:389–395

Lejaeghere K, Van Speybroeck V, Van Oost G, Cottenier S (2014) Error estimates for solid-state density-functional theory predictions: an overview by means of the ground-state elemental crystals. Crit Rev Solid State Mater Sci 39:1–24

Long F, Nicholls RA, Emsley P, Gražulis S, Merkys A, Vaitkus A, Murshudov GN (2017) ACEDRG: a stereo-chemical description generator for ligands. Acta Crystallogr D 73(2):112–122


Long F, Nicholls RA, Emsley P, Gražulis S, Merkys A, Vaitkus A, Murshudov GN (2017) Validation and extraction of stereochemical information from small molecular databases. Acta Crystallogr D 73(2):103–111

Merkys A, Vaitkus A, Butkus J, Okulič-Kazarinas M, Kairys V, Gražulis S (2016) COD::CIF::Parser: an error-correcting CIF parser for the Perl language. J Appl Crystallogr 49(1):292–301

Merkys A, Mounet N, Cepellotti A, Marzari N, Gražulis S, Pizzi G (2017) A posteriori metadata from automated provenance tracking: integration of AiiDA and TCOD. J Cheminform 9(1):56

Mounet N, Gibertini M, Schwaller P, Campi D, Merkys A, Marrazzo A, Sohier T, Castelli IE, Cepellotti A, Pizzi G, Marzari N (2018) Novel two-dimensional materials from high-throughput computational exfoliation of experimentally known compounds. Nature Nanotechnology 13(3):246–252

Narayanan BC, Westbrook J, Ghosh S, Petrov AI, Sweeney B, Zirbel CL, Leontis NB, Berman HM (2014) The nucleic acid database: new features and capabilities. Nucleic Acids Res 42:D114–D122

Pepponi G, Gražulis S, Chateigner D (2012) MPOD: a material property open database linked to structural information. Nucl Instrum Methods Phys Res Sect B: Beam Interact Mater Atoms 284(0):10–14. E-MRS 2011 Spring Meeting, Symposium M: X-ray techniques for materials research-from laboratory sources to free electron lasers

Perez-Mato JM, Gallego SV, Tasci ES, Elcoro L, de la Flor G, Aroyo MI (2015) Symmetry-based computational tools for magnetic crystallography. Annu Rev Mater Res 45(1):217–248

Protein Data Bank (1971) Protein data bank. Nat New Biol 233:22–23

Rajan H, Uchida H, Bryan DL, Swaminathan R, Downs RT, Hall-Wallace M (2006) Building the American mineralogist crystal structure database: a recipe for construction of a small internet database. In: Sinha AK (ed) Geoinformatics: data to knowledge, Geological Society of America, Boulder, vol 397, pp 73–80

Röntgen WC (1896) On a new kind of rays. Nature 53:274–276

Sadowski P, Baldi P (2013) Small-molecule 3d structure prediction using open crystallography data. J Chem Inf Model 53:3127–3130

Sander T, Freyss J, von Korff M, Rufener C (2015) DataWarrior: an open-source program for chemistry aware data visualization and analysis. J Chem Inf Model 55(2):460–473

Villars P, Onodera N, Iwata S (1998) The linus pauling file (LPF) and its application to materials design. J Alloys Compd 279:1–7

Villars P, Cenzual K, Daams J, Chen Y, Iwata S (2004) Data-driven atomic environment prediction for binaries using the mendeleev number: part 1. Composition AB. J Alloys Compd 367(1–2):167–175. Proceedings of the VIII international conference on crystal chemistry of intermetallic compounds

White PS, Rodgers JR, Le Page Y (2002) Crystmet: a database of the structures and powder patterns of metals and intermetallics. Acta Crystallogr B 58(Pt 3 Pt 1):343–348

Kitajgorodskij AI (1955) Organicheskaya kristallokhimiya [Organic crystal chemistry], vol 1. Izdatel'stvo Akademii Nauk SSSR, Sept 1955