comparing molecules and solids across structural and ... · pdf filestructural and alchemical...

18
Comparing molecules and solids across structural and alchemical space Sandip De, 1,2, a) Albert P. Bart´ ok, 3 abor Cs´ anyi, 3 and Michele Ceriotti 2, b) 1) National Center for Computational Design and Discovery of Novel Materials (MARVEL) 2) Laboratory of Computational Science and Modelling, Institute of Materials, Ecole Polytechnique F´ ed´ erale de Lausanne, Lausanne, Switzerland 3) Engineering Laboratory, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, United Kingdom (Dated: 19 January 2016) Evaluating the (dis)similarity of crystalline, disordered and molecular compounds is a critical step in the development of algorithms to navigate automatically the configuration space of com- plex materials. For instance, a structural similarity metric is crucial for classifying structures, searching chemical space for better compounds and materials, and to drive the next generation of machine-learning techniques for predicting the stability and properties of molecules and mate- rials. In the last few years several strategies have been designed to compare atomic coordination environments. In particular, the Smooth Overlap of Atomic Positions (SOAP) has emerged as a natural framework to obtain translation, rotation and permutation-invariant descriptors of groups of atoms, driven by the design of various classes of machine-learned inter-atomic potentials. Here we discuss how one can combine such local descriptors using a Regularized Entropy Match (RE- Match) approach to describe the similarity of both whole molecular and bulk periodic structures, introducing powerful metrics that allow the navigation of alchemical and structural complexity within a unified framework. Furthermore, using this kernel and a ridge regression method we can also predict atomization energies for a database of small organic molecules with a mean absolute error below 1kcal/mol, reaching an important milestone in the application of machine-learning techniques to the evaluation of molecular properties. Keywords: Structural fingerprints, machine-learning, materials and molecules, non-linear dimen- sionality reduction I. INTRODUCTION The increase of available computational power, together with the development of more accurate and efficient simulation algorithms, have made it possible to reliably predict the properties of mate- rials and molecules of increasing levels of complex- ity. Furthermore, high-throughput computational screening of existing and hypothetical compounds promises to dramatically accelerate the develop- ment of materials with the better performances or custom-tailored properties 1–6 . These developments have made even more ur- gent the need for automated tools to analyze, classify 7–11 and represent 12–16 large amounts of structural data, as well as techniques to leverage this wealth of information to estimate inexpen- sively the properties of materials using machine- learning techniques, circumventing the need for computationally demanding quantum mechanical calculations 17–28 . At the most fundamental level, the crucial ingre- dient for all these techniques is a mathematical for- mulation of the concept of (dis)similarity between a) Electronic mail: sandip.de@epfl.ch b) Electronic mail: michele.ceriotti@epfl.ch; http://cosmo.epfl.ch/ atomic configurations, that can take the form of a distance - that can be used for dimensionality reduction or clustering - or of a kernel function, that could be used for ridge regression or auto- mated classification. 29–32 The most obvious choice for a metric to compare atomic structures would involve the Euclidean distance between the Carte- sian coordinates of the atoms, commonly known as root mean square displacement (RMSD) distance, that can be easily made invariant to relative trans- lations and rotations. It is however highly non- trivial to extend the RMSD to deal with situations in which atoms in the two structures cannot be mapped unequivocally with each other. The deter- ministic evaluation of a “permutationally invari- ant” RMSD scales combinatorially with the size of the molecules to be compared 33 , and introduces cusps at locations where the mapping of atom iden- tities changes. Furthermore, as we will discuss later on, the RMSD is perhaps the most straightfor- ward, but not necessarily the most flexible or effec- tive strategy to compare molecular and condensed- phase configurations. In the last few years, a large number of “finger- print” functions have been developed to represent the state of structures, or of groups of atoms within a structure. Structural descriptors have been de- veloped based on graph-theoretic procedures (e.g. SPRINTs 34 ), as well as on analogies with elec- arXiv:1601.04077v1 [cond-mat.mtrl-sci] 28 Dec 2015

Upload: doannhi

Post on 05-Mar-2018

215 views

Category:

Documents


3 download

TRANSCRIPT

Comparing molecules and solids acrossstructural and alchemical space

Sandip De,1, 2, a) Albert P. Bartok,3 Gabor Csanyi,3 and Michele Ceriotti2, b)

1)National Center for Computational Design and Discovery of Novel Materials (MARVEL)2)Laboratory of Computational Science and Modelling, Institute of Materials,Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland3)Engineering Laboratory, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ,United Kingdom

(Dated: 19 January 2016)

Evaluating the (dis)similarity of crystalline, disordered and molecular compounds is a criticalstep in the development of algorithms to navigate automatically the configuration space of com-plex materials. For instance, a structural similarity metric is crucial for classifying structures,searching chemical space for better compounds and materials, and to drive the next generationof machine-learning techniques for predicting the stability and properties of molecules and mate-rials. In the last few years several strategies have been designed to compare atomic coordinationenvironments. In particular, the Smooth Overlap of Atomic Positions (SOAP) has emerged as anatural framework to obtain translation, rotation and permutation-invariant descriptors of groupsof atoms, driven by the design of various classes of machine-learned inter-atomic potentials. Herewe discuss how one can combine such local descriptors using a Regularized Entropy Match (RE-Match) approach to describe the similarity of both whole molecular and bulk periodic structures,introducing powerful metrics that allow the navigation of alchemical and structural complexitywithin a unified framework. Furthermore, using this kernel and a ridge regression method we canalso predict atomization energies for a database of small organic molecules with a mean absoluteerror below 1kcal/mol, reaching an important milestone in the application of machine-learningtechniques to the evaluation of molecular properties.

Keywords: Structural fingerprints, machine-learning, materials and molecules, non-linear dimen-sionality reduction

I. INTRODUCTION

The increase of available computational power,together with the development of more accurateand efficient simulation algorithms, have made itpossible to reliably predict the properties of mate-rials and molecules of increasing levels of complex-ity. Furthermore, high-throughput computationalscreening of existing and hypothetical compoundspromises to dramatically accelerate the develop-ment of materials with the better performances orcustom-tailored properties1–6.

These developments have made even more ur-gent the need for automated tools to analyze,classify7–11 and represent12–16 large amounts ofstructural data, as well as techniques to leveragethis wealth of information to estimate inexpen-sively the properties of materials using machine-learning techniques, circumventing the need forcomputationally demanding quantum mechanicalcalculations17–28.

At the most fundamental level, the crucial ingre-dient for all these techniques is a mathematical for-mulation of the concept of (dis)similarity between

a)Electronic mail: [email protected])Electronic mail: [email protected];http://cosmo.epfl.ch/

atomic configurations, that can take the form ofa distance - that can be used for dimensionalityreduction or clustering - or of a kernel function,that could be used for ridge regression or auto-mated classification.29–32 The most obvious choicefor a metric to compare atomic structures wouldinvolve the Euclidean distance between the Carte-sian coordinates of the atoms, commonly known asroot mean square displacement (RMSD) distance,that can be easily made invariant to relative trans-lations and rotations. It is however highly non-trivial to extend the RMSD to deal with situationsin which atoms in the two structures cannot bemapped unequivocally with each other. The deter-ministic evaluation of a “permutationally invari-ant” RMSD scales combinatorially with the sizeof the molecules to be compared33, and introducescusps at locations where the mapping of atom iden-tities changes. Furthermore, as we will discuss lateron, the RMSD is perhaps the most straightfor-ward, but not necessarily the most flexible or effec-tive strategy to compare molecular and condensed-phase configurations.

In the last few years, a large number of “finger-print” functions have been developed to representthe state of structures, or of groups of atoms withina structure. Structural descriptors have been de-veloped based on graph-theoretic procedures (e.g.SPRINTs34), as well as on analogies with elec-

arX

iv:1

601.

0407

7v1

[co

nd-m

at.m

trl-

sci]

28

Dec

201

5

2

tronic structrure methods (e.g. Hamiltonian ma-trix, Hessian matrix, Overlap matrix of Gaussiantype Orbitals (GTO) or even Kohn-Sham eigen-values fingerprints33). Most of these approacheshave been introduced to provide a fast and reli-able estimate of the dissimilarity between struc-tures. Several other descriptors have been alsoused in machine learning, to predict properties ofmaterials and molecules circumventing the needfor an expensive electronic structure calculation.A non-comprehensive list of such methods includeCoulomb matrices22, bags of bonds28, “symmetryfunctions”35, scattering transformation applied ona linear superposition of atomic densities23.

A particularly promising approach to comparestructures in a way that is invariant to rota-tions, translations, and permutations of equiva-lent atoms, is to start from descriptors designed torepresent local atomic environments and that ful-fill these requirements, and combine them to yielda global measure of similarity between structures.This idea typically relies on finding the best matchbetween pairs of environments in the two config-urations22,33,36, and can also be traced back tomethods developed to compare images based onthe matching of local features37.

In the present work we start from a recently-developed strategy to define a similarity kernel be-tween local environments – the smooth overlap ofatomic positions (SOAP)38 – and discuss the dif-ferent ways one can process the set of all possiblematchings between atomic environments to gen-erate a global kernel to compare two structures.In particular, we introduce a regularized entropymatch (REMatch) strategy that is based on tech-niques in optimal-transport theory39, and that isboth more efficient and tunable than previously-applied methods. We discuss the relative merits ofdifferent approaches, and generalize this strategyto the comparison between structures with differ-ent numbers and kinds of atoms. We demonstratethe behavior of the different global kernels whenapplied to completely different classes of problems,ranging from elemental clusters, to bulk structures,to the conformers of oligopeptides and to a hetero-geneous database of small organic molecules. Wevisualize the behavior of the distance associatedwith these kernels using sketch-map13, a non-lineardimensionality reduction technique, and demon-strate the great promise shown by the straightfor-ward application of the REMatch-SOAP kernel tothe machine-learning of molecular properties. Fi-nally, we present our conclusions.

II. THEORY

Let us start by introducing the notation we willemploy in the rest of the paper. We will labelstructures to be compared by capital letters, use

a lowercase Latin letter to indicate the index of anatom, and when necessary use a Greek lowercaseletter to mark its chemical identity. For instance,the position of the i-th atom within the structureA will be labeled as xAi . The environment of thatatom, i.e. the abstract descriptor of the arrange-ment of atoms in its vicinity will be labelled witha calligraphic upper case letter, e.g. XAi , and thesub-set of such environment that singles out atoms

of species α will be indicated as XA,αi .Among the many descriptors of local environ-

ments that have been developed in the recentyears1–3,5,6,17–22,24–28,33,36, we will refer in partic-ular to the SOAP fingerprints38, that have beenproven to be a very elegant and robust strategy todescribe coordination environments in a way thatis naturally invariant with respect to translations,rotations and permutations of atoms. We will usethe notation k(X ,X ′) to indicate the similarity ker-nel between two environments – which one woulduse in a kernel ridge regression method31,32,40 –and d(X ,X ′)2 = 2 − 2k(X ,X ′) to indicate the(squared) kernel distance between the environ-ments – which one would use in a dimensionalityreduction method13,16. In what follows we will dis-cuss different ways by which environment kernelscan be combined to yield a a global similarity ker-nel between two structures K(A,B), and the asso-ciated squared distance D(A,B)2 = 2− 2K(A,B).

A. SOAP similarity kernels and local environmentdistance

We will first focus on the comparison betweenthe environment of two atoms in a pure compoundmade up of single atomic species α. The crucialingredient in making the comparison is a kernelfunction based on the distribution of atoms in thetwo environments. In the context of SOAP kernelsone represents the local density of atoms withinthe environment X as a sum of Gaussian functionswith variance σ2, centered on each of the neighborsof the central atom, as well as on the central atomitself:

ρX (r) =∑i∈X

exp

(− (xi − r)2

2σ2

). (1)

The SOAP kernel is then defined as the overlapof the two local atomic neighbour densities, inte-grated over all three-dimensional rotations R,

k(X ,X ′) =

∫dR

∣∣∣∣∫ ρX (r)ρX ′(Rr)dr

∣∣∣∣n . (2)

Note that in the n = 1 case the two integrals canbe switched, and therefore the kernel looses all an-gular information, so we focus on the n = 2 caseexclusively. For most applications it is helpful to

3

normalise the kernel so that the self-similarity ofany environment is unity, giving the final kernel

k(X ,X ′) = k(X ,X ′)/√k(X ,X )k(X ′,X ′) (3)

It is a remarkable property of the SOAP kernelthat the integration over all rotations can be car-ried out analytically. First the atomic neighbourdensity is expanded in a basis composed of spheri-cal harmonics and a set of orthogonal radial basisfunctions {gb(r)},

ρX (r) =∑blm

cblmgb(|r|)Ylm(r), (4)

then the rotationally invariant power spectrum isgiven by

p(X )b1b2l =∑m

(cb1l)†cb2l. (5)

Collecting the elements of the power spectrum intoa unit-length vector p(X ), the SOAP kernel isshown38 to be given by

k(X ,X ′) = p(X ) · p(X ′) (6)

eventually leaving a definition of the distance as

d (X ,X ′) =√

2− 2p(X ) · p(X ′) (7)

The SOAP kernel can be written in the form of adot product, therefore it is manifestly positive def-inite, which implies that the distance function (7)is a proper metric.

B. From local descriptors to structure matching

The vectors that enter the definition of the en-vironments are defined in such a way that theirdot product is the overlap of (smoothed) atomicdistributions. Given two structures with the samenumber N of atoms, we can compute an environ-ment covariance matrix that contains all the pos-sible pairing of environments

Cij(A,B) = k(XAi ,XBj

), (8)

This matrix contains the complete information onthe pair-wise similarity of all the environments be-tween the two systems. Based on it, one can in-troduce a global kernel to compare two structuresor molecules. We will discuss and compare fourdifferent approaches. All of them are meant to benormalized, i.e. the given expressions for K(A,B)

are to be divided by√K(A,A)K(B,B) whenever

such normalization is not automatically one.

a. Average structural kernel A first possibilityto compare two structures involves computing anaverage kernel

K(A,B) =1

N2

∑ij

Cij(A,B) =

=

[1

N

∑i

p(XAi )

1

N

∑j

p(XBj )

.(9)

One sees that K can be computed inexpensivelyby just storing the average SOAP fingerprint be-tween all environments of the two structures. Thiskernel is also positive-definite, being based on ascalar product41, and therefore induces a metric

D(A,B) =√

2− 2K(A,B). On the other hand,it is not a very sensitive metric: two very differ-ent structures can appear to be the same if theyare composed of environments that give the samefingerprint upon averaging.b. Best-match structural kernel Another pos-

sibility, that has been used previously with dif-ferent kinds of structural fingerprints22,42–44 is toidentify the best match between the environmentsof the two structures,

K(A,B) =1

Nmaxπ

∑i

Ciπi(A,B). (10)

which can be accomplished with an O(N3) effortusing the Munkres algorithm45. The correspond-ing distance has the properties of a metric, whichmeans it can still be safely used to assess similaritybetween structures and molecules. Unfortunately,this “best-match” kernel is not guaranteed to bepositive-definite, which makes it less than idealfor use in machine-learning applications. Further-more, the distance obtained by a best-match strat-egy is continuous, but has discontinuous derivativeswhenever the matching of environments changes.These problems can be solved or alleviated bymatching the environments based on a differentstrategy, that combines features of the average andthe best-match kernels.c. Regularized entropy match kernel The best

match problem can be also stated in an alternativeform, namely

K(A,B) = maxP∈U(N,N)

∑ij

Cij(A,B)Pij . (11)

where U(N,N) is the set of N ×N (scaled) doublystochastic matrices, whose rows and columns sumto 1/N , i.e.

∑i Pij =

∑j Pij = 1/N . We can then

borrow an idea that was recently introduced in thefield of optimal transport39 to regularize this prob-lem, adding a penalty that instead aims at maxi-mizing the information entropy for the matrix Psubject to the aforementioned constraints on its

4

marginals. Such “regularized-entropy match” (RE-Match) kernel is defined as

Kγ(A,B) = TrPγC(A,B),

Pγ = argminP∈U(N,N)

∑ij

Pij (1− Cij − γ lnPij), (12)

where the regularization is given by an entropyterm E(P) =

∑ij Pij lnPij . Pγ can be com-

puted very efficiently, with O(N2) effort, by theSinkhorn algorithm39 (see Appendix). For γ →0, the entropic penalty becomes negligible, andKγ(A,B) → K(A,B). For γ → ∞, one selectsthe P with the least information content, that isone with constant Pij = 1/N2. Hence, in this limit

Kγ(A,B)→ K(A,B).

d. Permutation structural kernel For the sakeof completeness, we also discuss a fourth option:rather than summing over all possible pairs of envi-ronments, one can consider each pairing of environ-ments separately, and sum over all the N ! possiblepermutations that define the pairings. In order tokill off more rapidly the combinations of environ-ments that contain bad matches, one can multiplythe kernels that appear in each pairing, and definea permutation kernel

K(A,B) =1

N !

∑π

∏i

Ciπi(A,B) = permC(A,B).

(13)This choice corresponds to the evaluation of thepermanent of the environment kernel matrix, andhas some appeal as it is guaranteed to yield apositive-definite kernel46. The evaluation of thepermanent of a matrix, however, has combinato-rial computational complexity47. Its application islimited to small molecules, and we will not discussit further in the present work.

C. Matching structures containing multiple species

When comparing structures that contain differ-ent atomic species, the first problem that has tobe addressed is that of extending the local envi-ronment metric so that the presence of multiplespecies is properly accounted for.

SOAP descriptors provide a very natural way todo this: a separate density can be built for eachatomic species

ραX (r) =∑i∈Xα

exp

(− (xi − r)2

2σ2

), (14)

and a (non-normalized) kernel be defined by

matching separately the different species:

k(X ,X ′) =

∫dR

∣∣∣∣∣∫ ∑

α

ραX (r)ραX ′(Rr)dr

∣∣∣∣∣2

=∑αβ

pαβ(X ) · pαβ(X ′).(15)

Here we have introduced “partial” power spectrathat encode information on the relative arrange-ment of pairs of species,

pαβ(X ) · pαβ(X ′) =

∫dR

∣∣∣∣∫ ραX (r)ρβX ′(Rr)dr

∣∣∣∣2 .(16)

The species-resolved power spectrum pαβ can bewritten as

p(X )αβb1b2l =∑m

(cαb1lm)†cβb2lm, (17)

and combines the expansion coefficients

ραX (r) =∑blm

cαblmgb(|r|)Ylm(r) (18)

of the two densities. The kernel in Eq. (15) canthen be normalized as in Eq. (3).

Note that the SO(3) power spectrum vectors con-tain mixed-species components due to the squaringof the density overlap within the rotational aver-age. These mixed terms guarantee that the kernelis sensitive to the relative correlations of differentspecies, although the overlap between the environ-ments of the different species is considered to bezero. One could however introduce a notion of “al-chemical similarity” between different species. Forinstance, when comparing structures of III-V semi-conductors one could disregard the chemical infor-mation on the identity of an atom as long as itbelongs to the same column of the periodic table.Such a notion can be readily implemented, defin-ing an alchemical similarity kernel καβ which is onefor pairs that should be considered interchangeable,and tend to zero for pairs that one wants to con-sider as completely unrelated. The expression thenbecomes

k(X ,X ′) =

∫dR

∣∣∣∣∣∣∫ ∑

αβ

καβραX (r)ρβX ′(Rr)dr

∣∣∣∣∣∣2

=∑

αβα′β′

pαβ(X )pα′β′(X ′)καα′κββ′ .

(19)

The original expression (15) can be recovered bysetting καβ = δαβ . Global similarity kernels canthen be transparently introduced to compare struc-tures composed of different atomic species, withstructure and alchemical composition treated onthe same footings and the possibility of adaptingthe definition of similarity to the system and ap-plication.

5

D. Matching structures with different numbers ofatoms

The definitions above can be readily extended tocompare structures containing different numbers ofatoms NA and NB . We discuss two possible strate-gies. When comparing crystalline, periodic struc-tures, it may be the case that one of the structurescorresponds to a slight distortion of the other, thatneeds a larger unit cell for a proper representation.Comparing the structures using the average ker-nel (9) does automatically the “right thing”, thatis performing the comparison in a way that is inde-pendent of the number of times the two structureshave to be replicated to match atom counts. Inthe case of the permutation kernel and of the best-match kernel, the most effective way to performthe comparison is to evaluate the least commonmultiple N of NA and NB , and replicate the envi-ronment similarity matrix to form a square matrix.One can then proceed to compute the permanent,or the linear assignment problem, based on suchreplicated matrix. The advantage of this proce-dure is that one does not need to explicitly findthe relation between the shape of the two unit cellsand replicate them to perform the comparison: theenvironment similarities can be evaluated includ-ing periodic replicas, and the minimum numberof comparisons will be naturally performed amongany pairs of structures. However, the least com-mon multiple can become very large, making eventhe best-match kernel (10) impractically demand-ing, although the cost can be reduced by exploit-ing the redundancy in the extended environmentcovariance matrix. As shown in the Appendix, theREMatch kernel (12) can be computed easily alsofor a rectangular matrix, which constitutes an ad-ditional advantage of formulating the environmentmatching problem in terms of a regularized trans-port optimization.

When comparing molecules or molecular frag-ments, it may be advisable to proceed differently.One could consider an idealized pool (“kit”) of iso-lated atoms from which a number of molecules willbe built, and use them to “top up” each moleculeso as to obtain a set of structures, all having thesame number of atomic environments. The “ref-erence kit” could be chosen dynamically for eachpair of molecules, or – when working with a welldefined database – fixed globally as the smallest setof atoms needed to generate all the relevant struc-tures. One can then define the kernel between anactual atomic environment and those within this“virtual atom reservoir”, k(X , ∅) = e−µ, in termsof a “chemical potential” parameter µ. Since theSO(3) fingerprints that underlie the definition ofthe SOAP kernel can also be evaluated for isolatedatoms48, it is also possible to introduce a natu-ral definition of the covariance between an environ-ment and an isolated atom, which has the advan-

tage that the global kernels will then vary smoothlyduring an actual atomization process.

E. Representing (al)chemical landscapes

In this work we will demonstrate the flexibil-ity, transferability and effectiveness of the frame-work we have just introduced to compare molec-ular and condensed-phase structures. To thisaim, we will build two dimensional maps thatrepresent proximity relations between the struc-tures – as assessed by the kernel-induced metric– using sketch-map13, a non-linear dimensional-ity reduction (NLDR) scheme specifically designedto deal with atomistic simulation data. As wewill demonstrate, the combination of SOAP-basedstructural metrics and NLDR representation pro-vides a broadly applicable protocol to generatean insightful representation of the structural andalchemical landscape of complex molecular andcondensed-phase systems. Of course, one coulduse the SOAP-based global kernels, or the corre-sponding distances, as the basis of other non-lineardimensionality reduction techniques, such as multi-dimensional scaling49 or diffusion maps12,16,50.

We refer the reader to the relevant litera-ture for a detailed explanation of the sketch-mapalgorithm13–15. The main idea derives from multi-dimensional scaling, and is based on optimizing anon-linear objective function

S2 =∑ij

[F [D(Xi, Xj)]− f [d(xi, xj)]]2

(20)

where {Xi} and {xi} correspond respectively tohigh-dimensional reference structures and to vec-tors in a low-dimensional space. The metric din low dimension is typically taken to be the Eu-clidean distance, whereas the metric in high dimen-sion could be more complex. In this case, Xi canbe regarded as an abstract descriptor of a struc-ture or molecule, and D is one of the kernel-baseddistance metrics discussed above. F and f are non-linear sigmoid functions of the form

F (r) = 1− (1 + (2a/b − 1)(r/σ)a)−b/a, (21)

which serve to focus the optimization of (20) onthe most significant, intermediate distances, disre-garding local distortion (e.g. induced by thermalfluctuations) and the relation between completelyunrelated portions of configuration landscape. Thechoice of the parameters in the sigmoid functions isdiscussed in Ref.15. Here we will label syntheticallyeach sketch-map representation using the notationσ-A B-a b where A and B denote the exponentsused for the high-dimensional function F , a and bdenote the exponents for the low-dimensional func-tion f , and σ the threshold for the switching func-tion.

6

FIG. 1. The figure compares the value of global structural similarities for different pairs of structures taken fromthe 80 local energy minima of C60 discussed in Ref.42 The structural similarities considered include the absolutedifference in energy per atom, the (permutation invariant33) RMSD per atom, and the best-match combination

of SOAP kernels computed with different cutoff distances (2A, 3.5A, 7A). The correlation between RMSD and Dbased on 2A-cutoff SOAP is enlarged, color-coded based on energy differences and annotated with selected pairsof structures corresponding to different distances.

III. EXAMPLES AND APPLICATIONS

After having described the theoretical and al-gorithmic background of our strategy to define astructural similarity kernel, let us present a seriesof applications. In order to demonstrate that ourapproach can be seamlessly applied to the most di-verse atomistic simulation problems, we have cho-sen examples of increasing complexity, from clus-ters, to crystalline and amorphous solids, to biolog-ical molecules and a database of small organic com-pounds, containing varying number of both atoms

and atomic species.

A. The energy landscape of C60 clusters

Let us start with a relatively simple test case.We consider the same set of 80 local minima forC60 discussed in Ref.42, which were obtained by ex-ploring the Density Functional Theory energy land-scape of C60 using the Minima Hopping 51 globalstructure search algorithm. Figure: 1 contrastsdifferent similarity matrices: the permutation-

7

FIG. 2. The figure compares the value of global structural distances induced by the average, best-match, andREMatch kernels discussed in Section II B, for 80 local-minimum structures of C60. On the diagonal we report thesketch-map projections of the structural landscape based on the three metrics, colored according to the energy ofeach structure, as obtained by Sandip et all 42. Eight representative structures and their positions on the Sketch-maps have been indicated with alphabets on color coded disks. The numeric value on the top of each structurerepresents their energy in eV, relative to the global minimum. SOAP descriptors were computed using a cutoff of3.5A and the Sketch-map parameters are indicated on the map according to the syntax described in the text.

invariant RMSD43, the absolute difference betweenthe potential energy, and the best-match distancesobtained from SOAP descriptors computed withdifferent environment cutoff. RMSD distance doesnot correlate very well with SOAP-based met-rics, particularly for the smaller cutoff value. TheD(2A)-RMSD correlation plot is enlarged, and al-lows us to discuss the source of this discrepancy.

Hollow fullerene-like structures (A, with referenceto the labeling in the figure) and compact struc-tures containing internal connections (A,G) are ex-tremely different from the point of view of theshort-range connectivity, but differ comparativelyless in terms of RMSD, since they are both fairlycompact. On the other hand, flake-like structuresbased on a honeycomb motif (F,E) have the same

8

basic first-neighbor connectivity as the defectivefullerene structures (C,D) but have much differentspatial extent. Then, one sees that the discrep-ancy between RMSD and small-cutoff D indicatesjust the focus on different structural features: theglobal arrangement of atoms in the first case, andthe local connectivity in the latter. In the case ofSOAP-based metrics, however it is easy to extendthe sensitivity of the metric to longer distances justby increasing the cutoff: by going from 2A to 3.5and 7, one sees that D and RMSD become progres-sively more correlated, as the focus shifts from thenearest-neighbor coordination to the overall geom-etry of the cluster.

It is worth stressing that the RMSD, albeit avery natural measure of structural similarity, is notnecessarily the best metric to compare configura-tions. To see why, consider the absolute energy dif-ference as a measure of similarity: even though onecan obviously have configurations with very differ-ent geometry and similar energies, in general onewould expect that on the contrary large energy dif-ferences should be associated with highly dissimi-lar structures in a given system – which is not thecase for RMSD. One sees that the intermediate-cutoff D(3.5A) shows a nice correlation betweenenergetic and structural differences.

These considerations underline a theme that willrecur in other examples: SOAP-based structuralmetrics offer a mathematically sound frameworkthat can be transparently adapted to focus on theaspects that are most relevant to a given appli-cation. For instance, power-spectrum based envi-ronment kernels are invariant to mirror symmetry,and therefore the derived metrics cannot distin-guish enantiomers. If one needed to do so, however,it would be sufficient to use a bispectrum-basedSOAP kernel38 – which corresponds to n = 3 in eq.(2) and is invariant to rotations but not to mirrorsymmetry operations – as the basis for obtaining aglobal comparison that is sensitive to chirality.

Having established a connection between tradi-tional structural similarity metrics and the best-match SOAP kernel, let us use the example of C60

to compare the three main strategies we proposeto build a global kernel: the average kernel K, thebest-match kernel K, and the regularized entropymatch kernel Kγ with an intermediate regulariza-tion parameter γ = 0.1.

The distance-distance correlation plot for eachpair of structures, that compares the distances in-duced by the three kernels, is reported in Fig.2.The D−D plot shows overall linear correlation ex-cept for very small values of D. This is expected asthe average kernel is under-determined, and couldin principle label two structures as identical eventhough they might composed of different environ-ments. The best-match kernel, therefore, providesbetter resolving power. As we will discuss in moredetail later on, the regularized best-match kernel

Dγ can be tuned to interpolate between these twoextremes. As an example, we chose here an inter-mediate value γ = 0.1: as shown in Fig. 2, theresulting distance correlates strongly with both Dand the conventional best-match distance D.

Fig.2 also shows annotated sketch-maps ob-tained based on the three metrics. Once the sketch-map parameters have been adjusted following theguidelines in Ref.15, the three maps are effectivelyequivalent – indicating that the three kernels givesimilar qualitative information on the similarity be-tween different structures. Given the much lowercomputational cost associated with the evaluationof the average kernel, this observation suggests itmight conveniently be used to preliminarily screena dataset before proceeding to a more accuratecomparison of similar structures based on the best-match, or REMatch distance.

B. Natural and hypothetical polymorphs of silicon

As a second example, let us move on to acondensed-phase application. Here we start froma database of 1274 bulk silicon structures contain-ing ideal and distorted structures from the phasediagram (e.g. diamond, simple hexagonal, β-tin,liquid and quenched amorphous structures). SOAPenvironment kernels with a 5 A cutoff distance wereused, and combined with a best-match strategy toobtain the (dis)similarity matrix54 We selected 100landmark configurations out of this data set (usingfarthest point sampling based on kernel distance)and built a sketch-map, on which the rest of con-figurations were projected. The outcome of suchmapping procedure is shown in Fig. 3, where pointsare colored according to the DFT atomic energy,and point sizes are scaled to a size proportional tovolume per atom. As seen in the Fig. 3 the map isextremely well correlated with both atomic energyand density. Furthermore, structures that were ob-tained by distorting and heating up structures com-ing from different portions of the phase diagram areclustered together: rough outlines have been drawnon the map to indicate different phases.

Although the map has been built using only ref-erence configurations from a few of the conven-tional Si phases, we have also projected on it (usingout-of-sample embedding) two sets of hypotheticalconfigurations obtained by minima hopping53 andby ab initio random structure search (AIRSS)52,55.These structures were not included in the land-marks selection phase. Still, the out-of-sampleembedding procedure correctly identifies not onlythat in most cases AIRSS structures differ signifi-cantly from stable phases of silicon, but also clus-ters together hypothetical polymorphs that sharecommon features. For instance, the AIRSS struc-tures outlined in the lower portion of the map areall taken from Ref.55. The structures were pro-

9

FIG. 3. Sketch-map of 1274 crystalline and amorphous silicon structures obtained by sampling different phasesfrom the phase diagram (disks), polymorphs obtained by ab initio random structure search52 (+ signs) and byminima hopping53 (× signs). The color and size of the points varies according to their atomic energy and atomicvolumes respectively. Regions of the plot which represents different phases have been outlined with dotted contours.

posed as possible metastable polymorphs arisingas a result of microexplosion (powerful ultrashortand tightly focused laser pulse) in crystalline cu-bic diamond silicon phase, hence their structuralmotif naturally carries resemblance with silicon di-amond phase. It is interesting to see that they in-deed are projected close to the diamond phase onthe map. All of the minima hopping low-densitySi polymorphs are also clustered together, which isconsistent with the fact that they are all based oncombinations of a few base motifs. Thus, Figure 3shows not only that SOAP-based structural simi-larity distances can be very effective in the studyof bulk crystalline structures, but also testify theextrapolative power of a sketch-map representationbased on such a metric.

C. Arginine Dipeptide

Having shown that SOAP-based structural sim-ilarity kernels are equally effective for clusters andfor bulk configurations of elemental materials, letus consider a case of a multi-species chemical com-pound. We selected a library of 5062 locally sta-ble conformers of arginine dipeptide (845 with and4217 without a Ca2+ counterion) from a publicdatabase of oligopeptides structures developed byMatti Ropo et al56. We used a cut-off of 3.5Ain the definition of environment SOAP kernels,and combined them using a best-match strategy.Since H atoms stay at almost fixed positions rela-tive to their neighboring atoms, we decided to in-clude them in the environment descriptors of otheratoms, but did not include them explicitly as cen-ters of atomic environments. This is another ex-ample of how SOAP-based structural metrics areeffective in a broad variety of contexts, but at thesame time can be easily and transparently refined

10

FIG. 4. Sketch-map representation of locally stable arginine dipeptide conformers, without (top) and with (bottom)a Ca2+ ion. Left-hand panels are colored according to the energy relative to the minimum energy form, whilethe smaller maps on the right are colored according to the values of different dihedral angles, as indicated in thelegend.

based on intuition, prior experience, or a clear un-derstanding of the objectives of the structural com-parison.

In Fig. 4 we show the sketch-map representationfor these two sets of structures, highlighting the

correlation between the location on the map andstructural and energetic properties of the conform-ers. In the absence of a complexing cation, thedipeptide can exist in a very large number of lo-cal minima, spanning a relatively narrow range of

11

FIG. 5. Sketch-map representation of stable configura-tions of Arginine dipeptide complexed with a Ca2+ ion.The structures that have undergone a proton trans-fer reaction relative to the neutral molecule have beenhighlighted, and a few representative snapshots of themolecular structure are also reported.

energies. The map shows very clearly partitioningof configuration space in four disconnected regions.Conventional wisdom57 assumes that the Cα dihe-dral angles φ and ψ are the most important de-scriptors of oligopeptide structure. One quicklyrealizes, however, that the order parameters cor-responding to the four lobes are connected to thecis-trans isomerization of the two peptide bonds.Within each of the lobes, configurations with dif-ferent φ-ψ dihedral angles are clearly clustered to-gether, but in this case they constitute features ofsecondary importance. This observation demon-strates the advantages of using a general-purposedescriptor, that does not rely on pre-conceived as-sumptions on the behavior of the molecule beingstudied, but instead captures automatically the in-trinsic structural hierarchy of minima in the con-figuration landscape.

The presence of a Ca2+ cation has a dramaticimpact on the landscape for the dipeptide. The dis-tribution of configurations becomes considerablymore sparse and span a broader range of energies.The strong electrostatic interaction with the cationmeans that there is not a clear separation any-more between the energy scale for φ-ψ flexibilityof the backbone and the isomerization of the pep-tide bonds.

A remarkable observation in this analysis is therealization that the presence of the cation catalyzedunexpected proton transfer reactions, that changethe chemical structure of the molecule. Configura-tions that underwent a chemical reaction are clus-tered on one side of the map (Fig. 5), with furtherinternal structure reflecting the fact that SOAP-based structural metrics treat on the same foot-

ing information on the chemical bonding and onthe conformational variability of the molecule. Itis again worth noting that by changing the cut-off value for the SOAP descriptors, one can “fo-cus” the structural metric on different molecularfeatures. A short cutoff of 2A makes the chem-ically different structures stand out more as out-liers – which would for instance be useful to detectautomatically this kind of unwanted transitions inan automatically-generated data set – while on thecontrary a longer cutoff would give more impor-tance to the difference between collapsed and ex-tended molecular conformers.

FIG. 6. Correlations between structural similarity dis-tances induced by the average kernel K, the best-matchkernel K, and regularized best-match kernels Kγ withdifferent regularization parameters γ. Distances arecomputed between pairs of 200 structures, randomlyselected from the QM7b database24,58.

D. Mapping (al)chemical space

As a final example of the evaluation of a struc-tural and alchemical similarity metric, and its useto represent complex ensembles of compounds, let

12

FIG. 7. Sketch-map representation of minimum-energy structures from a database of molecules containing up to7 heavy atoms (C O N S Cl), and saturated with hydrogen to a different degree24. Left-hand panels show the mapcolored according to the atomization energy as computed by DFT. In the right-hand images, the points are coloredaccording to the number of constituent C, O, N, S atoms. The top row corresponds to an alchemical kernel thattreats all species as different, the middle row treats all the heavy atoms as the same species, whereas the bottomrow introduces an alchemical kernel that depends on the difference in electronegativity between species.

us consider the QM7b database24. This set ofcompounds contains 7211 minimum-energy struc-tures for small organic compounds containing up toseven heavy atoms (C, N, O, S, Cl), saturated withH to different degrees. This database constitutes asmall fraction of a larger chemical library that con-tains millions of hypothetical structures screenedfor accessible synthetic pathways58.

This is an extremely challenging data setto benchmark a structural similarity metric:

molecules differ by number of atoms, chemical com-position, bonding and conformation. To simplifythe description, we decided to use SOAP descrip-tors with a cutoff of 3A, and to include H atomsin the environments but not as environment cen-ters, to simplify the description – considering alsothat in the case of arginine dipeptide this choice didnot prevent clear identification of isomers that onlydiffered by a proton transfer reaction. We useda best-match strategy to compare configurations,

13

and topped them up with isolated atoms up to themaximum number of each species that is presentin the database. This effectively corresponds tochoosing a “kit” (in other terms, a fully atomizedreference state) starting from which all of the com-pounds can be assembled.

This is a fairly extreme case for the application ofour idea of compounding local structure matchingto obtain a global structural metric, so it is worthreturning on a comparison of the different strate-gies we proposed. Fig. 6 compares average andbest-match distances, together with the REMatchusing different regularization parameters. Despitethe very different context, the outcome is similar towhat we observed in Figure 2 for C60 clusters. Theaverage kernel is reasonably well correlated withthe more demanding best-match kernel, althoughin most cases it has poorer resolution. By vary-ing γ, the regularized match distance Dγ variesbetween these two extremes, and for γ < 1 pro-vides a smooth, inexpensive approximation to thebest-match distance.

For the sake of simplicity (and given we reducedthe size of the environment covariance matrix C byconsidering only heavy atoms as environment cen-ters) we used the conventional best-match distancefor the rest of our analyses. As shown in Fig. 7,the SOAP-based metric nicely separates out “is-lands” with homogeneous composition in terms ofthe number of heavy atoms. Within each group ofatoms, one can recognize some sub-structure, withconfiguration roughly arranged in terms of the at-omization energy – which in turns strongly corre-lates with the degree of H saturation. As it canbe seen from inspection of the database (see theSI) in many cases one can notice that structureswith similar chemical skeleton (presence of cycles,chemical groups, etc.) are clustered close to eachother in the map. However, it is of course verydifficult to quantitatively assess how well the mapcorresponds to chemical intuition, and how muchdepartures from it are to be considered a failure ofthe metric, of the sketch-map procedure or of thenotion of “chemical intuition”.

Our objective here is more to demonstrate howthe fingerprint-based structural metric we intro-duced can cope with widely different classes ofproblems, and how it can treat on the same footingsalchemical and structural variability. As an exam-ple we have also computed the similarity matrixand mapped the QM7b landscape using a modi-fied alchemical similarity metric between the heavyatoms (we always take καH = δαH). First, weset καβ = 1 (which means we are treating speciesα and β as the same species) for all of heavyatoms. The clear separation of the map into is-lands with the same stoichiometry is lost. How-ever, there is now near-perfect correlation betweenposition on the map and atomization energy, andat the same time one can see some residual clus-

tering of molecules with similar composition. Thiscan be explained because information on the al-chemical identity of the atoms is encoded in theiratomic coordination and bond lengths. This isfor instance evident for sulfur, that has consider-ably larger bond lengths, leading to a clustering ofsulfur-containing compounds that is considerablybetter than for instance in the case of oxygen ornitrogen.

Obviously, assuming that all atom kinds are in-terchangeable is an extreme choice, and it is hardto imagine circumstances in which this “elementagnostic” metric would be advantageous over onethat exploited knowledge of the chemical identityof atoms. On the other hand, one could foreseeto encode information on the “alchemical similar-ity” using one of the many quantities chemists haveused historically to rationalize trends in reactivityacross the periodic table. As an example, we usedthe electronegativity Eα to define

καβ = e−(Eα−Eβ)2/2∆2

(22)

where ∆ is a parameter that determines how sen-sitive is the alchemical kernel to differences in elec-tronegativities. We used ∆ = 1 to generate thelast set of maps in Fig. 7. The map now separatesout quite accurately regions with homogeneous sto-ichiometry. Whereas in the καβ = δαβ the differ-ent “islands” were roughly arranged according toa square grid pattern corresponding to nO and nN

along two orthogonal directions, now stripe-shapedislands are arranged in 1D, following numbers of nO

and nC, with the number of nitrogen atoms comingout clustered in adjacent “stripes”, but less clear-cut partitioning than for the other two elements.This is perhaps unsurprising given that nitrogenhas an intermediate electronegativity between thatof oxygen and carbon, and the metric tries to sep-arate most efficiently the elements that differ mostbased on the alchemical similarity kernel.

This last example gives perhaps the most com-pelling demonstration of how a structural similar-ity metric based on a combination of SOAP ker-nels gives an effective, broadly applicable and eas-ily customizable strategy to assess the similarityof materials and molecules, and how a sketch-mapconstruction based on such metric provides an in-sightful representation of structural and alchemicallandscapes.

E. Learning molecular properties

In this paper we focused mainly on the definitionof a compound structural similarity kernel, andon characterizing its behavior by means of sketch-map representations. It is however important tokeep in mind that an effective tool to compareatomic structures can find application to a broad

14

range of problems - one of the most intriguing be-ing the inexpensive prediction of physical-chemicalproperties of materials and molecules. To demon-strate the great promise of REMatch-SOAP ker-nels for machine-learning of molecular properties,we used a standard kernel-ridge regression (KRR)method32 to reproduce the 14 properties that hadbeen reported in Ref.24 for the 7211 molecules wedescribed in the previous paragraph.

We randomly selected 5000 training structures,and used the remainder as an out-of-sample vali-dation set. After having computed the REMatch-SOAP kernel matrix K between all the structures,using a cutoff of 3A and a regularization parame-ter γ = 0.5 – in this case including also H atoms inthe list of environments – we computed the KRRweights vector

w =(Kξ

train + σ1)−1

ytrain. (23)

Here Ktrain and ytrain are the kernel matrix andproperty values restricted to the training set, ξ in-dicates entry-wise exponentiation to tune the spa-tial range of the kernel, and σ is a regularization hy-perparameter. The prediction of the properties for

the test set can then be obtained as ytest = Kξtestw,

where Ktest is the matrix containing the REMatch-SOAP kernels between the test points and thetraining points. The procedure was repeated 10times, and the average mean absolute error (MAE)and root mean square error (RMSE) on the test setwere computed.

We optimized the ξ and σ hyperparameters byminimising the MAE on the atomization energy,and then used the same values to perform a KRRfor all the other molecular properties. Since wedid not further adjust the choice of kernel andthe ξ exponent, all the properties could be esti-mated at the same time, as discussed e.g. in59.The results of this procedure are reported in Ta-ble I, and demonstrate the extraordinary perfor-mance of REMatch-SOAP for machine-learning ap-plications. For the atomization energy we can ob-tain a MAE of less than 1kcal/mol – a four-foldimprovement relative to previous results that werebased on a Coulomb matrix representation of struc-tures and a deep-neural-network learning strategy.What is more, even without separately tuning theKRR hyperparameters, we can improve or matchthe performance of prior methods for almost all ofthe properties, the only exceptions being some ofthe properties computed with semi-empirical meth-ods. The fact we can obtain such a dramatic im-provement using a standard regression technique isa testament to the effectiveness of our kernel. Thecrucial importance of the choice of descriptors isalso apparent by noting that a MAE of about 1.5kcal/mol was recently obtained by regression basedon a “bag of bonds” description of molecules, cou-pled with a Laplacian kernel28.

Property SD MAE RMSE MAE24 RMSE24

E (PBE0) 9.70 0.04 0.07 0.16 0.36α (PBE0) 1.34 0.05 0.07 0.11 0.18α (SCS) 1.47 0.02 0.04 0.08 0.12HOMO (GW) 0.70 0.12 0.17 0.16 0.22HOMO (PBE0) 0.63 0.11 0.15 0.15 0.21HOMO (ZINDO) 0.96 0.13 0.18 0.15 0.22LUMO (GW) 0.48 0.12 0.17 0.13 0.21LUMO (PBE0) 0.68 0.08 0.12 0.12 0.20LUMO (ZINDO) 1.31 0.10 0.15 0.11 0.18IP (ZINDO) 0.96 0.19 0.28 0.17 0.26EA (ZINDO) 1.41 0.13 0.18 0.11 0.18E∗

1st (ZINDO) 1.87 0.18 0.41 0.13 0.31E∗max (ZINDO) 2.82 1.56 2.16 1.06 1.76

Imax (ZINDO) 0.22 0.08 0.12 0.07 0.12

TABLE I. Mean absolute errors (MAEs) and root meansquare errors (RMSE) for the KRR estimation of 14molecular properties, together with previously pub-lished estimation24 for the same data set. The stan-dard deviation of the values of the properties acrossall 7211 molecules in the database is shown in the sec-ond column. Errors in the KRR estimation refer to atest set of 2200 randomly selected configurations, whilethe remaining structures were used for training. Prop-erty labels refer to the level of theory and molecularproperty, i.e. atomization energy (E), averaged molec-ular polarizability (α), HOMO and LUMO eigenvalues,ionization potential (IP), electron affinity (EA), firstexcitation energy (E1st), excitation frequency of maxi-mal absorption (Emax) and the corresponding maximalabsorption intensity (Imax). Energies, polarizabilitiesand intensities are in eV, A3 and arbitrary units, re-spectively.

Reaching chemical accuracy in the automatedprediction of atomization energies is an impor-tant milestone, and the fact that we could achievethat without fully exploring the flexibility of theREMatch-SOAP framework (e.g. by optimizingthe entropy regularization parameter, the envi-ronment cutoff, eliminating the outliers, combin-ing multiple layers of description or using a non-diagonal alchemical similarity matrix) is a testa-ment to the potential of our approach. Futurework will be devoted to analyzing the performance,convergence and limits for the machine-learningof molecular and materials’ properties using ourSOAP-based structural similarity kernel.

IV. CONCLUSIONS

Distances between atomic structures based oncombinations of local similarity kernels provide aflexible framework to define a metric in structuraland alchemical space. Atom-centered environmentinformation can be combined to provide a globalmeasure of (dis)similarity. An average kernel Kprovides an inexpensive strategy to do so, with acost that scales linearly with the size of the struc-

15

tures to be compared, but might under-estimatethe difference between two configurations – sincein principle two different structures might yieldzero D. Alternatively, one can compute the lo-cal kernel between every possible pair of environ-ments (which itself entails a cost scaling with thesquare of the number of environments), and then

build a compound kernel K by finding the best-match permutation of the environments – whichgives a metric with better resolving power, but en-tails solving a cubic-scaling linear assignment prob-lem. Introducing an entropy regularization makesit possible at the same time to reduce the size-scaling to quadratic, and to obtain a better be-haved, smoothly varying metric, that interpolates- depending on the regularization parameter - be-tween the average and best-match limit.

This strategy to compare atomic configurationsbuilds on the very general notion that complexbulk and molecular structures arise from the com-bination of local building blocks, and can be ap-plied seamlessly to systems as diverse as clus-ters, bulk phases of an element, conformation ofa biomolecules and an assembly of small chemicalcompounds with varying atom kinds and number.At the same time, the structure of the underlyingSOAP kernels allows for very effective fine-tuning.For instance, by choosing the cutoff radius overwhich atomic densities are compared between envi-ronments, one can make the metric more sensitiveto the first-neighbor chemical connectivity, or viceversa, include information on the long-range con-formation of flexible molecules. What is more, itis possible to treat structural and alchemical com-plexity on the same footing, by introducing an al-chemical similarity kernel that makes it possible tospecify whether atoms of different species shouldbe considered completely separate, or whether anotion of chemical distance (based e.g. on the dif-ference in electronegativity) should be introducedto give different weights to substitutions betweenelements with similar reactivity.

We also demonstrate that straightforward ap-plication of the REMatch-SOAP kernel to theridge-regression evaluation of molecular propertiesmatches or outperforms all previously-presentedapproaches, reaching chemical accuracy in the pre-diction of the atomization energies of a set of smallorganic molecules. We believe that in this respectwe are only scratching the surface of the potentialapplications of our approach to machine-learning,since these results were obtained without using anyof the more sophisticated techniques (e.g. intro-ducing a hierarchy of models to capture the vari-ance of properties at different structural scales)that have been shown to significantly improve thiskind of procedures when using different kinds ofstructural descriptors.

The similarity metric we introduce could find ap-plication as the workhorse of a number of simu-

lation protocols, machine-learning algorithms anddata mining strategies. For instance, it couldbe used to detect outliers in automated high-throughput screenings of materials, to cluster simi-lar configurations together, to accelerate the explo-ration of chemical and conformational space of ma-terials and molecules. Here, we show in particularhow it can be combined with a non-linear dimen-sionality reduction technique such as sketch-map,to give simple and insightful two-dimensional rep-resentation of a given molecular or structural dataset. As atomistic modelling adventures into larger-scale structures, and unsupervised exploration ofmaterials space, maps such as these can providea valuable tool to convey intuitive information oncomplex structural and alchemical landscapes, torationalize structure-property relations, and to pre-dict physical observables of novel compounds bytraining machine-learning models to libraries ofknown materials.

V. ACKNOWLEDGMENTS

S.D and M.C would like to acknowledge sup-port from the NCCR MARVEL. A.P.B. was sup-ported by a Leverhulme Early Career Fellowshipwith joint funding from the Isaac Newton Trust.We would like to thank M. Cuturi and C. Ortnerfor insightful discussion. We thank C. Pickard forproviding silicon structures found with AIRSS, andS. Goedecker and M. Amsler for sharing with usthe crystal structures of low-density silicon poly-morphs.

Appendix: Sinkhorn distance for structural similarity

Let us discuss briefly how the REMatch can beimplemented in practice. Consider for generalitythe N ×M environment similarity matrix C(A,B)between two structures with N and M atoms re-spectively. The expression (12) given in Section II

for the optimal-transport-inspired definition of Kgeneralizes straightforwardly to non-s quare matri-ces39:

Kγ(A,B) = TrPγTC(A,B),

Pγ = argminP∈U(M,N)

∑ij

Pij (1− Cij − γ lnPij), (A.1)

where P ∈ U(N,M) is a (scaled) doubly-stochasticN × M matrix for which

∑i Pij = 1/M and∑

j Pij = 1/N .The Sinkhorn algorithm finds the optimal Pγ by

the decomposition Pγ = diagu Kdiagv = K ◦uvT , where ◦ indicates the Hadamard product, andK is the entry-wise exponential of (C− 1)/γ, i.e.

P γij = uivi exp [(Cij − 1)/γ] . (A.2)

16

The balancing vectors u and v can be obtained bythe iteration

u←eN/Kv

v←eM/KTu.

(A.3)

where (eN )i = 1/N are scaled stochastic vectors,and the iteration can be initialized by setting v =eM .

One of the advantages of a regularized matchstrategy is that the kernel becomes a smooth func-tion of the environment kernels. Computing itsderivative ∂αK

γ with respect to a parameter α (aCartesian coordinate, for instance), is however notcompletely trivial. Such a derivative is in fact com-posed of two terms

∂αKγ(A,B) = TrPγ∂αC + Tr ∂αP

γC. (A.4)

The first term is easy to compute – provided thatone can obtain ∂αC, the derivative of all environ-ment kernels with respect to α. The second termcan be further broken down based on the Sinkhorndecomposition of Pγ :

∂αPγ = ∂αK ◦ uvT + K ◦ ∂α(uvT ) (A.5)

The critical issue here is that direct evaluation of∂α(uvT ) would involve performing a separate cal-culation for each derivative α, which could makethe approach prohibitively expensive when, for in-stance, one would want to compute derivatives withrespect to the coordinates of each atom.

By straightforward albeit tedious algebra, onecan reformulate the problem in such a way that thederivative can be computed cheaply for any varia-tional parameter, given ∂αC:

∂αKγ(A,B) = TrQT∂αC, (A.6)

with

Qij = uiKijvj

[1 +

1

γ(Cij + aj −Nuibi +Mvjcj)

].

(A.7)The Q matrix can be fully evaluated based on

intermediate terms that do not depend on δαC:

a =−Mv ◦ (K ◦C)Tu

b = (1−W)−T

[(K ◦C)v + K (v ◦ a)]

c =N(b ◦ u2

)K

W = diag(Nu2

)Kdiag

(Mv2

)KT ,

(A.8)

were with u2 = u ◦ u we indicate the entry-wisesquare. The only caveat here is that (1 −W) issingular, and so it cannot be straightforwardly in-verted. Nevertheless, b can be computed by thefixed-point iteration b ← WTb + y with y =[(K ◦C)v + K (v ◦ a)]. Due to the potential in-stability of the procedure, it is crucial however to

check the convergence on the overall value of ∂αKγ ,

and not to push the convergence to higher relativeaccuracy levels than those used for the original so-lution to the Sinkhorn balancing problem.

1L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl,and M. Scheffler, “Big data of materials science: Criti-cal role of the descriptor,” Phys. Rev. Lett. 114, 105503(2015).

2T. D. Huan, A. Mannodi-Kanakkithodi, and R. Ram-prasad, “Accelerated materials property predictions anddesign using motif-based fingerprints,” Phys. Rev. B 92,014106 (2015).

3V. Botu and R. Ramprasad, “Learning scheme to predictatomic forces and accelerate materials simulations,” Phys.Rev. B 92, 094306 (2015).

4A. G. Kusne, T. Gao, A. Mehta, L. Ke, M. C. Nguyen, K.-M. Ho, V. Antropov, C.-Z. Wang, M. J. Kramer, C. Long,and I. Takeuchi, “On-the-fly machine-learning for high-throughput experiments: search for rare-earth-free per-manent magnets,” Scientific Reports 4, 6367 EP – (2014).

5R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. vonLilienfeld, “Quantum chemistry structures and propertiesof 134 kilo molecules,” Scientific Data 1, 140022 EP –(2014).

6L.-F. m. c. Arsenault, A. Lopez-Bezanilla, O. A. vonLilienfeld, and A. J. Millis, “Machine learning for many-body physics: The case of the anderson impurity model,”Phys. Rev. B 90, 155136 (2014).

7A. Rodriguez and A. Laio, “Machine learning. Clusteringby fast search and find of density peaks.” Science 344,1492–6 (2014).

8R. Xu and I. Wunsch, D., “Survey of clustering algo-rithms,” Neural Networks, IEEE Transactions on 16, 645–678 (2005).

9G. Yu, J. Chen, and L. Zhu, “Data mining techniquesfor materials informatics: Datasets preparing and appli-cations,” in Knowledge Acquisition and Modeling, 2009.KAM ’09. Second International Symposium on, Vol. 2(2009) pp. 189–192.

10O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch,A. Tropsha, and S. Curtarolo, “Materials cartography:Representing and mining materials space using structuraland electronic fingerprints,” Chemistry of Materials 27,735–743 (2015), http://dx.doi.org/10.1021/cm503507h.

11P. V. Balachandran, J. Theiler, J. M. Rondinelli, andT. Lookman, “Materials prediction via classificationlearning,” Scientific Reports 5, 13285 EP – (2015).

12A. L. Ferguson, A. Z. Panagiotopoulos, P. G. Debenedetti,and I. G. Kevrekidis, “Systematic determination of or-der parameters for chain dynamics using diffusion maps.”Proc. Natl. Acad. Sci. USA 107, 13597–602 (2010).

13M. Ceriotti, G. a. Tribello, and M. Parrinello, “Fromthe Cover: Simplifying the representation of complex free-energy landscapes using sketch-map.” Proc. Natl. Acad.Sci. USA 108, 13023–8 (2011).

14G. A. Tribello, M. Ceriotti, and M. Parrinello, “Usingsketch-map coordinates to analyze and bias molecular dy-namics simulations.” Proc. Natl. Acad. Sci. USA 109,5196–201 (2012).

15M. Ceriotti, G. A. Tribello, and M. Parrinello, “Demon-strating the Transferability and the Descriptive Power ofSketch-Map,” J. Chem. Theory Comput. 9, 1521–1532(2013).

16M. a. Rohrdanz, W. Zheng, and C. Clementi, “Discover-ing mountain passes via torchlight: methods for the def-inition of reaction coordinates and pathways in complexmacromolecular reactions.” Annu. Rev. Phys. Chem. 64,295–316 (2013).

17M. Rupp, A. Tkatchenko, K.-R. Muller, and O. A. vonLilienfeld, “Fast and accurate modeling of molecular at-

17

omization energies with machine learning,” Phys. Rev.Lett. 108, 058301 (2012).

18W. J. Szlachta, A. P. Bartok, and G. Csanyi, “Accuracyand transferability of gaussian approximation potentialmodels for tungsten,” Phys. Rev. B 90, 104108 (2014).

19A. Lopez-Bezanilla and O. A. von Lilienfeld, “Model-ing electronic quantum transport with machine learning,”Phys. Rev. B 89, 235411 (2014).

20G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, andR. Ramprasad, “Accelerating materials property predic-tions using machine learning,” Scientific Reports 3, 2810EP – (2013).

21A. P. Bartok, M. J. Gillan, F. R. Manby, and G. Csanyi,“Machine-learning approach for one- and two-body cor-rections to density functional theory: Applications tomolecular and condensed water,” Phys. Rev. B 88, 054104(2013).

22M. Rupp, E. Proschak, and G. Schneider, “Kernel Ap-proach to Molecular Similarity Based on Iterative GraphSimilarity,” J. Chem. Inf. Model. 47, 2280–2286 (2007).

23M. Hirn, N. Poilvert, and S. Mallat, “Quantum EnergyRegression using Scattering Transforms,” arXiv Prepr.arXiv1502.02077 (2015).

24G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Mller, andO. A. von Lilienfeld, “Machine learning of molecular elec-tronic properties in chemical compound space,” New Jour-nal of Physics 15, 095003 (2013).

25J. C. Snyder, M. Rupp, K. Hansen, K.-R. Muller, andK. Burke, “Finding density functionals with machinelearning,” Phys. Rev. Lett. 108, 253002 (2012).

26S. A. Ghasemi, A. Hofstetter, S. Saha, and S. Goedecker,“Interatomic potentials for ionic systems with densityfunctional accuracy based on charge densities obtainedby a neural network,” Phys. Rev. B 92, 045131 (2015).

27O. A. von Lilienfeld, “First principles view on chemicalcompound space: Gaining rigorous atomistic control ofmolecular properties,” International Journal of QuantumChemistry 113, 1676–1689 (2013).

28K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis,O. A. von Lilienfeld, K.-R. Muller, and A. Tkatchenko,“Machine learning predictions of molecular proper-ties: Accurate many-body potentials and nonlocalityin chemical space,” The Journal of Physical Chem-istry Letters 6, 2326–2331 (2015), pMID: 26113956,http://dx.doi.org/10.1021/acs.jpclett.5b00831.

29J. B. Endelman, “Ridge regression and other kernelsfor genomic selection with r package rrblup,” The PlantGenome, 4 (2011), 10.3835/plantgenome2011.08.0024.

30S. An, W. Liu, and S. Venkatesh, “Face recognition us-ing kernel ridge regression,” in Computer Vision and Pat-tern Recognition, 2007. CVPR ’07. IEEE Conference on(2007) pp. 1–7.

31C. E. Rasmussen and C. K. I. Williams, Gaussian Pro-cesses for Machine Learning (Adaptive Computation andMachine Learning) (The MIT Press, 2005).

32T. Hastie, R. Tibshirani, and J. Friedman, The Ele-ments of Statistical Learning, Springer Series in Statistics(Springer New York, New York, NY, 2009).

33A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A.Lill, and S. Goedecker, “Metrics for measuring distancesin configuration spaces.” J. Chem. Phys. 139, 184118(2013).

34F. Pietrucci and W. Andreoni, “Graph theory meetsAb Initio molecular dynamics: Atomic structures andtransformations at the nanoscale,” Phys. Rev. Lett. 107,085504 (2011).

35J. Behler, “Atom-centered symmetry functions for con-structing high-dimensional neural network potentials,”The Journal of Chemical Physics 134, 074106 (2011),http://dx.doi.org/10.1063/1.3553717.

36L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji,

S. Rostami, S. A. Ghasemi, A. Sadeghi, M. Grauzinyte,C. Wolverton, and S. Goedecker, “A fingerprint basedmetric for measuring similarities of crystalline structures,”ArXiv e-prints (2015), arXiv:1507.06730 [physics.comp-ph].

37K. Grauman and T. Darrell, “The pyramid match kernel:Discriminative classification with sets of image features,”Proc. IEEE Int. Conf. Comput. Vis. II, 1458–1465 (2005).

38A. P. Bartok, R. Kondor, and G. Csanyi, “On repre-senting chemical environments,” Phys. Rev. B 87, 184115(2013).

39M. Cuturi, “Sinkhorn Distances: Lightspeed Computa-tion of Optimal Transport,” in Adv. Neural Inf. Process.Syst. 26, edited by C. J. C. Burges, L. Bottou, M. Welling,Z. Ghahramani, and K. Q. Weinberger (Curran Asso-ciates, Inc., 2013) pp. 2292–2300.

40B. Scholkopf, A. Smola, and K.-R. Muller, “NonlinearComponent Analysis as a Kernel Eigenvalue Problem,”Neural Comput. 10, 1299–1319 (1998).

41C. Berg, J. P. R. Christensen, and P. Ressel, HarmonicAnalysis on Semigroups, Graduate Texts in Mathematics,Vol. 100 (Springer New York, New York, NY, 1984).

42S. De, A. Willand, M. Amsler, P. Pochet, L. Genovese,and S. Goedecker, “Energy Landscape of Fullerene Mate-rials: A Comparison of Boron to Boron Nitride and Car-bon,” Physical Review Letters 106, 225502 (2011).

43A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A.Lill, and S. Goedecker, “Metrics for measuring distancesin configuration spaces,” The Journal of chemical physics139, 184118 (2013).

44S. De, B. Schaefer, A. Sadeghi, M. Sicher, D. G. Kan-here, and S. Goedecker, “Relation between the Dynam-ics of Glassy Clusters and Characteristic Features of theirEnergy Landscape,” Physical Review Letters 112, 083401(2014).

45H. W. Kuhn, “The Hungarian method for the assignmentproblem,” Nav. Res. Logist. Q. 2, 83–97 (1955).

46M. Cuturi, “Permanents, transportation polytopes andpositive definite kernels on histograms,” Int. Jt. Conf. Ar-tif. Intell. IJCAI , 732–737 (2007).

47Although stochastic algorithms do exist to compute it toa desired precision in polynomial time60.

48The density of atoms defined in equation (1) contains thecentral atom.

49T. F. Cox and M. A. A. Cox, Multidimensional scaling(CRC Press, 2010).

50R. R. Coifman, S. Lafon, a. B. Lee, M. Maggioni,B. Nadler, F. Warner, and S. W. Zucker, “Geometricdiffusions as a tool for harmonic analysis and structuredefinition of data: diffusion maps.” Proc. Natl. Acad. Sci.USA 102, 7426–31 (2005).

51S. Goedecker, “Minima hopping: An efficient searchmethod for the global minimum of the potential energysurface of complex molecular systems,” The Journal ofchemical physics 120, 9911–9917 (2004).

52C. J. Pickard and R. J. Needs, “Ab initio random struc-ture searching,” Journal of Physics: Condensed Matter23, 053201 (2011).

53M. Amsler, S. Botti, M. A. L. Marques, T. J. Lenosky,and S. Goedecker, “Low-density silicon allotropes for pho-tovoltaic applications,” Phys. Rev. B 92, 014101 (2015).

54Some of the structures in the overall data set have num-bers of atoms in the unit cell that would lead to a largeleast common multiple when repeating the environmentsimilarity matrix to form a square matrix. One can keepthe cost of computing the similarity matrix low by ex-ploiting the redundancy in the similarity matrix, or byapproximating K using a REMatch kernel with a smallentropy regularization.

55L. Rapp, B. Haberl, C. J. Pickard, J. E. Bradby, E. G.Gamaly, J. S. Williams, and A. V. Rode, “Experimentalevidence of new tetragonal polymorphs of silicon formed

18

through ultrafast laser-induced confined microexplosion,”Nature Communications 6 (2015).

56M. Ropo, C. Baldauf, and V. Blum, “Energy/structuredatabase of all proteinogenic amino acids and dipep-tides without and with divalent cations,” ArXiv e-prints(2015), arXiv:1504.03708 [q-bio.BM].

57G. Ramachandran, C. Ramakrishnan, and V. Sasisekha-ran, “Stereochemistry of polypeptide chain configura-tions,” Journal of Molecular Biology 7, 95 – 99 (1963).

58T. Fink and J.-L. Reymond, “Virtual exploration ofthe chemical universe up to 11 atoms of c, n, o,f: assembly of 26.4 million structures (110.9 million

stereoisomers) and analysis for new ring systems, stereo-chemistry, physicochemical properties, compound classes,and drug discovery,” Journal of Chemical Informationand Modeling 47, 342–353 (2007), pMID: 17260980,http://dx.doi.org/10.1021/ci600423u.

59R. Ramakrishnan and O. A. von Lilienfeld, “Many Molec-ular Properties from One Kernel in Chemical Space,”Chim. Int. J. Chem. 69, 182–186 (2015).

60M. Jerrum, A. Sinclair, and E. Vigoda, “A polynomial-time approximation algorithm for the permanent of amatrix with nonnegative entries,” J. ACM 51, 671–697(2004).