\documentclass[aip,jcp,reprint,amsmath,amssymb,nature]{revtex4-1}
\usepackage{epsfig,color,graphicx}
\usepackage{xr}
\bibliographystyle{naturemag}
\externaldocument[SI-]{SI}
\newcommand{\Ang}{\ensuremath{\mathring{\text{A}}}}
\newcommand{\red}{\color{red}}
\newcommand{\blue}{\color{blue}}
\newcommand{\old}{\color{black}}
\begin{document}
%\title{Machine learns the physics of intermolecular interactions}
\title{Machine learning algorithm finds analytical equations to describe the energetics of intermolecular interactions}
\author{Volodymyr Sakun}
\affiliation{Department of Civil Engineering, McGill University, 817 Sherbrooke St. West, Montreal, QC H3A 0C3, Canada}
\author{Rustam Z. Khaliullin}
\email{[email protected]}
\affiliation{Department of Chemistry, McGill University, 801 Sherbrooke St. West, Montreal, QC H3A 0B8, Canada}
%\date{\today}
\begin{abstract}
Machine learning is now widely applied to reconstruct \emph{ab initio} potential energy surfaces (PES).
Universal approximators such as artificial neural networks and Gaussian-process regressors excel at this task because they can accurately mimic even the most complex PES functions.
However, the high flexibility of these methods comes at the price of poor extrapolation ability and transferability, which stem from the complete neglect of the underlying physics of interaction between atoms.
Here, we propose a learning algorithm that performs evolutionary symbolic regression in the space of physically relevant functions to ``derive'' a simple yet accurate analytical equation for the potential energy of weakly interacting molecules.
By accounting for the physics of interaction, the model produces accurate results even in the PES regions devoid of training data.
Remarkably, when applied to interactions between water molecules, the machine finds that the best two-body representation of the quantum mechanical PES is a combination of the Coulomb and Lennard-Jones terms --- in agreement with human-derived force-field equations.
Furthermore, the automatic incorporation of many-body effects enables the machine to suggest corrections to these force fields.
The approach presented in this work can be generalized to find analytical representations of the PES for complex molecules and materials.
\end{abstract}
\maketitle
\section{Introduction}
%Long conference abstract: Machine learning algorithms are now widely applied to reconstruct potential energy surfaces (PES) using accurate quantum mechanical energies as training data. Artificial neural networks (ANN) and Gaussian-process regressors (GPR) excel at this task because of their highly flexible representation of PES and complete neglect of the underlying physics of interaction between atoms. This design makes ANN and GPR universal approximators. Unlike fixed-form force fields they can adapt and accurately mimic many PES regardless of the complexity of physical interactions. However, the high adaptability comes at a price of poor extrapolation ability and transferability. ANN and GPR require large training datasets and fail to describe the regions of PES where training data is sparse. Here, we developed another versatile PES-modeling machine learning method with an improved extrapolation ability by accounting for the physics of interactions while still keeping the PES model flexible. The learning algorithm presented in this work utilizes the evolutionary symbolic regression in the space of physically relevant functions. Unlike many existing methods, it adjusts both the explicit analytical representation of the PES model and its parameters to yield a physically appropriate and yet simple equation for the energy. When applied to interactions between molecules in gas-phase clusters, machine finds that the best two-body representation of many quantum mechanical PES is a combination of the Coulomb and Lennard-Jones-like terms---in agreement with human-derived force fields (e.g. TIP3P for water molecules). Furthermore, the automatic incorporation of many-body effects enables machine to suggest corrections to force fields. For example, the algorithm presented here derived simple analytical expressions that capture cooperativity effects in water clusters.
Machine learning shows enormous promise for modeling of materials and molecules~\cite{ceriotti-science}.
One of the many rapidly growing areas of its application is the development of interatomic potentials---functions that describe the dependence of the interaction energy between atoms on their positions and represent the key component required for all atomistic simulations.
In the last decade, several approaches based on artificial neural networks (ANNs) and Gaussian process regression (GPR) have been
developed to approximate interatomic potentials using accurate quantum mechanical energies as training data.
The main advantage of such machine-learned potentials (MLPs) is that they combine the high accuracy of a quantum mechanical description of interatomic interactions with the computational efficiency of simple analytical potentials.
%This rare combination of accuracy and cost has enabled high-quality MLP-based simulations of complex materials on unprecedented time and length scales~\cite{Khaliullin2010,Khaliullin2011,RZK0}.
This rare combination of accuracy and cost of MLPs has enabled simulations of complex materials on unprecedented time and length scales~\cite{Khaliullin2010,Khaliullin2011,RZK0}.
The overall quality of interatomic potentials is determined by three key properties: accuracy, transferability, and computational complexity.
%RZK0: Describe the main principles of local descriptors and similarity measure, so that the introduction below is clear.
The high accuracy of MLPs is a result of a very flexible form of the model energy function, parametric or nonparametric.
This flexibility allows the learning algorithm to adjust (hyper)parameters in the model in order to reproduce the known energies for atomic configurations in the training set and then \textit{interpolate} between them, thus predicting the energies of similar atomic structures.
Unfortunately, machine-learning algorithms are unable to \textit{extrapolate} the energy function to configurations that do not resemble those in the training set.
%the accuracy of the existing MLPs comes with poor transferability.
This problem of poor transferability arises not only for configurations that lie---according to some distance measure---\emph{outside} the training set but may also appear for configurations \emph{in between} training points.
%RZK: it is better to cite problematic case that to give hypothetical examples. Keep the conclusion that large datasets are required. Emphasize that the growth is factorial with the number of atom types.
A typical example of the former case is an MLP trained to reproduce the energies for a material in a certain pressure interval that fails dramatically for any pressure outside this range.
The latter situation occurs when the training set is too sparse for accurate interpolation. For example, an MLP trained on configurations collected from liquid phases might not be able to predict the energies of the corresponding crystal phases, even though the local structures in the liquid resemble those in the solid state.
In order to cope with the sparsity problem, training sets must include accurate quantum mechanical energies for a large number of representative atomistic configurations.%, making the training process computationally unfeasible.
The transferability issues of MLPs stem from the complete neglect of the physics of interaction between atoms.
Designed to be flexible universal approximators in the mathematical sense, MLPs represent the energy as an elaborate function of a large number of basis functions that have no physical significance. Typically, sigmoid and Gaussian basis functions are used in ANN and GPR learning, respectively.
%For example, ANNs rely on sigmoid basis functions, whereas GPR is a nonparametric expansion in a basis set of Gaussian functions.
The severity of the transferability problem has inspired numerous attempts to reintroduce the physical components into MLPs.
Hybrid approaches use physically appropriate analytical potentials, in which parameters are obtained with machine-learning algorithms~\cite{Ghasemi2015}.
While such schemes improve the transferability of a potential, the fixed analytical form reduces the accuracy of predictions.
%While the accuracy-transferability problem plagues many human-derived energy models it is particularly severe in the case of MLPs.
In this work, we propose a different approach to learning interatomic potentials with the aim of improving their transferability without compromising flexibility and accuracy.
The key idea is to use machine learning not only to adjust parameters within a fixed functional form but also to derive a physically appropriate analytical form of the potential.
%The key idea of the methodology presented here is to let the algorithm select, out of a multitude of physically appropriate functions, only those few that reproduce the training data best.
The key idea of the methodology described here is to represent the energy model by a multitude of physically appropriate trial functions and then let the learning algorithm select only those few that reproduce the training data best.
A large number of the trial energy functions ensures the accuracy of the model.
The physics built into the model and the tight control of overfitting improve its transferability.
Furthermore, applying Occam's razor to the learning process produces simple analytical energy expressions that resemble those describing fundamental physical laws.
The obtained physically motivated analytical equations can properly describe interactions between atoms outside the training set.
To demonstrate the feasibility of the outlined approach, we consider only intermolecular interactions, for which physically relevant energy models can be constructed from linear functions.
Our approach is closely connected to symbolic regression and feature selection algorithms.
Symbolic regression is an established method, based on evolutionary computation, for searching the space of mathematical expressions while minimizing various error metrics. Unlike traditional linear and nonlinear regression methods that fit parameters to an equation of a given form, symbolic regression searches for both the parameters and the form of the equations simultaneously. Initial expressions are formed by randomly combining mathematical building blocks such as algebraic operators $\{+,-,\div,\times\}$, analytical functions (for example, sine and cosine), constants, and state variables. New equations are formed by recombining previous equations and probabilistically varying their subexpressions. The algorithm retains equations that model the training data better than others and abandons unpromising solutions. After the equations reach a desired level of accuracy, the algorithm terminates, returning a set of equations that are most likely to correspond to the intrinsic mechanisms underlying the observed system.
\section{Methodology}
\subsection{Model energy equations}
For any molecular system, the potential energy as a function of distances between molecules satisfies two constraints: it approaches zero when all molecules are infinitely far from each other and it becomes infinite as the distance between any two atoms approaches zero.
%
\begin{eqnarray}
\lim_{\forall R_i \rightarrow \infty} E (\mathbf{R}) = 0 \\
\lim_{\exists R_i \rightarrow 0} E (\mathbf{R}) = \infty
\end{eqnarray}
%
Furthermore, the potential energy is invariant to translations and rotations of the entire system.
Using the multipole expansion as a rough guiding principle, we chose a sum of products of the inverse distances (POIDs) to model the energy:
%
\begin{equation}
\begin{split}
E_M (&\mathbf{R}; \mathbf{C}) = \sum_{k=1}^{M_1} \sum_{p=1}^{N} \frac{C_{pk}}{R_p^k} + \sum_{k,m=1}^{M_2} \sum_{p>q}^{N} \frac{C_{pkqm}}{R_p^k R_q^m} + \\
&+ \ldots +\sum_{k,m,n=1}^{M_G} \sum_{p>q>r}^{N} \frac{C_{pkqm\ldots rn}}{R_p^k R_q^m \ldots R_r^n} \equiv \mathbf{F}\cdot \mathbf{C} \label{eq:energy}
\end{split}
\end{equation}
%
where $\mathbf{R}$ is a vector of $N = \frac{1}{2}A(A-1)$ distances between $A$ atoms, $M_{g}$ is the maximum power for a term containing the product of $g$ distances, $G$ is the maximum number of distances in the denominator, $\mathbf{F}$ is a vector of $\sum_{g=1}^{G} (M_g N)^g /g!$ POIDs, which serve as \emph{features}, and $\mathbf{C}$ is the corresponding vector of tunable coefficients.
The model energy expression in Eq.~(\ref{eq:energy}) is analytically simple, easy to compute, sufficiently flexible to reproduce the physics of a variety of interactions beyond the pairwise approximation, and naturally satisfies the important physical constraints mentioned above.
In order to account for the equivalence of atoms of the same element and to make sure that the energy is invariant to permutation of atoms in the input, all distances are classified as intramolecular and intermolecular and the fitting coefficients for the physically equivalent features are restricted to be equal.
Taking this symmetry into account reduces the training time substantially.
For example, the energy of two water molecules depends on 15 distances, but only five of them are unique and require independent fitting coefficients: oxygen-oxygen, intermolecular oxygen-hydrogen, intramolecular oxygen-hydrogen, intermolecular hydrogen-hydrogen, and intramolecular hydrogen-hydrogen.
We refer to features that share the same fitting coefficient as equivalent features. The total number of independent fitting coefficients is equal to the number of non-equivalent groups of features. Features that include only one (two) distance(s) are called single- (double-) distance features.
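As an illustration of how the POID features in Eq.~(\ref{eq:energy}) can be enumerated in practice, the following Python sketch builds the single- and double-distance features for one configuration. The function name and the example values are ours, and the grouping of equivalent features into shared coefficients is omitted for brevity.
\begin{verbatim}
import numpy as np
from itertools import combinations

def poid_features(R, M1, M2):
    """POID features of the model energy equation for one
    configuration; R is a 1D array of interatomic distances."""
    F = []
    # single-distance features 1/R_p^k, k = 1..M1
    for p in range(len(R)):
        for k in range(1, M1 + 1):
            F.append(R[p] ** (-k))
    # double-distance features 1/(R_p^k * R_q^m), p > q
    for p, q in combinations(range(len(R)), 2):
        for k in range(1, M2 + 1):
            for m in range(1, M2 + 1):
                F.append(R[p] ** (-k) * R[q] ** (-m))
    return np.array(F)

# example: the 15 distances of a water dimer, M1 = 15, M2 = 7
R = np.random.uniform(1.0, 8.0, 15)
print(poid_features(R, 15, 7).shape)
\end{verbatim}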
\subsection{Training procedure}
The ordinary least squares (OLS) method finds the best parameters of the linear regression model~(\ref{eq:energy}) by minimizing the objective function $\Delta_{\text{OLS}}$ with respect to $\mathbf{C}$
%
\begin{eqnarray}
\Delta_{\text{OLS}}(\mathbf{C}) = \frac{1}{2} \sum_{i=1}^{T} \left( E(\mathbf{R}_{i}) - E_M (\mathbf{R}_{i}; \mathbf{C}) \right)^2 \label{eq:OF-OLS}
\end{eqnarray}
%
where $T$ is the number of datapoints in the training set, $E(\mathbf{R}_{i})$ is the target energy known for each datapoint $i$ up to a specified precision, and $E_M$ is the energy predicted by the model.
Unfortunately, this method is not appropriate if one needs to recover a minimalistic energy equation from numerical data. The undesirable behavior of the OLS regression is demonstrated with a simple example of the Lennard-Jones (LJ) potential
%
\begin{eqnarray}
E (R) = \frac{4}{R^{12}} - \frac{4}{R^{6}} \label{eq:LJenergy}
\end{eqnarray}
%
and an attempted recovery of this equation starting from Eq.~(\ref{eq:energy}) with fifteen suggested features ($M_1=15$ and $M_2=0$). If the numerical data are rounded even slightly, to $10^{-7}$ reduced LJ units, then the optimal OLS parameters differ from the original parameters by orders of magnitude (Figure~\ref{Fig:OLS}). The main reason for this behavior is that the OLS parameters are redundant and are essentially tuned to reproduce numerical noise rather than the physically relevant data.
%VLAD: plot coefficients powers from 1 to 15. Use bar graph, x-axis label "Power", y-axis "Coefficient, reduced LJ units".
%-0.0240097, 0.612791, -7.043, 48.2289, -219.296, 693.937, -1594.65, 2633.41, -3107.96, 2523.57, -1273.46, 266.353, 95.1984, -72.4919, 13.6115
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/OLS_Bar.eps}
\caption{Optimal fitting coefficients obtained with the OLS regression to reproduce Eq.~\ref{eq:LJenergy}.} \label{Fig:OLS}
\end{figure}
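The instability can be reproduced in a few lines of Python. The sketch below follows the setup described above (fifteen inverse powers, data rounded to $10^{-7}$ reduced LJ units); the sampling grid is our own choice and the exact coefficient values depend on it, but the qualitative blow-up mirrors Figure~\ref{Fig:OLS}.
\begin{verbatim}
import numpy as np

# Lennard-Jones reference energies rounded to 1e-7 reduced units
R = np.linspace(0.95, 3.0, 200)
E = np.round(4.0 / R**12 - 4.0 / R**6, 7)

# fifteen inverse-power features 1/R, 1/R^2, ..., 1/R^15
X = np.column_stack([R ** (-k) for k in range(1, 16)])

# ordinary least squares: minimize ||X C - E||^2
C, *_ = np.linalg.lstsq(X, E, rcond=None)

# instead of the sparse solution (-4 at power 6, +4 at power 12),
# the coefficients take large alternating values
for k, c in enumerate(C, start=1):
    print(f"power {k:2d}: {c: .4g}")
\end{verbatim}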
This example illustrates that the task of finding a minimalistic and physically relevant representation of the intermolecular potential requires a feature selection algorithm. %that reduces a set of features while simultaneously tuning their flexible parameters --- linear coefficients in our simple case.
The training procedure developed in this work includes several stages: initial feature selection, further fine-tuning, and optional backward feature elimination.
\subsubsection{First stage: initial feature selection}
In this work, we explore the applicability of two methods to perform initial feature selection: elastic net regularization (EN) and genetic algorithm (GA).
While both algorithms produce almost identical results for the linear model in Eq.~(\ref{eq:energy}), the advantage of GA is that it can be easily generalized to features with nonlinear fitting coefficients.
For example, exponentially decaying functions can be included into the model energy function to increase its flexibility.
%ZZZ: Comparison of the EN and GA performance
%ZZZ- Do we really need to have EN in the main text?
\textbf{Elastic net}. The EN method is designed to reduce the fitting coefficients of irrelevant features precisely to zero. This is achieved by minimizing an objective function that is obtained by augmenting the OLS function with two regularization terms that penalize large coefficients:
%
\begin{eqnarray}
\Delta_{\text{EN}}(\mathbf{C}) = \Delta_{\text{OLS}}(\mathbf{C}) + \lambda_1 \vert\vert \mathbf{C} \vert\vert_1 + \lambda_2 \vert\vert \mathbf{C} \vert\vert_2^2 \label{eq:OF-EN}
\end{eqnarray}
%
where $\lambda_1$ and $\lambda_2$ are the regularization strengths and $\vert\vert \mathbf{C} \vert\vert_k$ is the $k$-norm of vector $\mathbf{C}$ defined as
%
\begin{eqnarray}
\vert\vert \mathbf{C} \vert\vert_k = \left( \sum_{i} \vert C_i \vert^k \right)^{\frac{1}{k}} \label{eq:k-nrom}
\end{eqnarray}
%
%ZZZ - to results: As it was mentioned if there is a group of highly correlated variables, then the LASSO selects just one variable from a group and drop the others. This is not convenient because in our case LASSO can drop important features while keeping less important one. EN can overcome this problem since it contains two penalties: linear and quadratic. By adjusting the two regularization factors, it is possible to force EN to keep dependent features in the model. Later the unnecessary features will be removed by another algorithm. Therefore, EN allows to reduce the number of features from thousands to hundreds or even less.
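In practice, the minimization of Eq.~(\ref{eq:OF-EN}) can be carried out with standard libraries. The sketch below uses scikit-learn, whose ElasticNet estimator is parametrized by an overall strength (alpha) and a mixing ratio (l1\_ratio) instead of $\lambda_1$ and $\lambda_2$ directly; the settings and the synthetic data are illustrative rather than the values used in this work.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import ElasticNet

# F: (n_samples, n_features) matrix of POID features
# E: (n_samples,) vector of reference energies
rng = np.random.default_rng(0)
F = rng.uniform(0.1, 2.0, size=(500, 290))
E = F[:, 3] - 0.5 * F[:, 42] + 1e-4 * rng.standard_normal(500)

# illustrative regularization settings
model = ElasticNet(alpha=1e-4, l1_ratio=0.5, max_iter=50000)
model.fit(F, E)

# features surviving the first selection stage
selected = np.flatnonzero(model.coef_)
print(len(selected), "features kept out of", F.shape[1])
\end{verbatim}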
\textbf{Genetic algorithm.} In a generic genetic algorithm (GA), a population of candidate solutions to the optimization problem is evolved iteratively towards better solutions. Each solution is described by a chromosome --- a collection of genes --- that undergoes mutations and crossovers in the simulated evolution process. In our case, the analytical form of the energy equation is the chromosome whereas each POID term of the equation represents a gene. It is worth noting that the parameters $\mathbf{C}$ in the energy equation are adjusted with conventional optimization methods, not with the GA.
We employed a straightforward GA.
The initial population of equations, all with the same number of features, is seeded randomly.
The fitness of an equation is measured by the RMSE calculated for the training set.
All equations in the population are sorted according to their RMSE and the top 50\% are selected as the elite fraction.
Half of the new population is created by crossover (45\%) and mutation (5\%) from the elite equations, while the other half is re-generated randomly.
In the crossover process, two equations are selected at random from the elite fraction; a random 60\% of the POIDs are taken from the equation with the lower RMSE and the remaining 40\% from the other equation.
In the mutation process, between 5 and 25\% of all POIDs are selected randomly and replaced with other random POIDs.
The evolution stops after a preset number of generations fails to lower RMSE for the training set. Further details and numerical settings of the algorithm are presented in the Supplementary Information.
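A minimal Python sketch of this procedure is given below. The population size, chromosome size, and the crossover and mutation fractions follow the percentages quoted above, while the helper names (rmse\_of, evolve) and the stopping settings are ours.
\begin{verbatim}
import numpy as np

def rmse_of(genes, F, E):
    """Least-squares fit of the linear coefficients for a fixed
    set of POID features (genes = column indices of F); RMSE."""
    X = F[:, list(genes)]
    C, *_ = np.linalg.lstsq(X, E, rcond=None)
    return float(np.sqrt(np.mean((X @ C - E) ** 2)))

def evolve(F, E, chrom_size=20, pop_size=100, patience=20, seed=0):
    rng = np.random.default_rng(seed)
    nfeat = F.shape[1]
    rand_chrom = lambda: set(rng.choice(nfeat, chrom_size,
                                        replace=False))
    def pad(g):                       # keep chromosomes at full size
        while len(g) < chrom_size:
            g.add(int(rng.integers(nfeat)))
        return g
    pop = [rand_chrom() for _ in range(pop_size)]
    best, stall = (None, np.inf), 0
    while stall < patience:
        pop.sort(key=lambda g: rmse_of(g, F, E))
        elite = pop[: pop_size // 2]          # top 50% by RMSE
        new = []
        for _ in range(int(0.45 * pop_size)): # 45% crossover
            i, j = sorted(rng.choice(len(elite), 2, replace=False))
            better, worse = list(elite[i]), list(elite[j])
            child = set(rng.choice(better, int(0.6 * chrom_size),
                                   replace=False))
            for g in worse:                   # rest from the other parent
                if len(child) == chrom_size:
                    break
                child.add(g)
            new.append(pad(child))
        for _ in range(int(0.05 * pop_size)): # 5% mutation
            g = set(elite[rng.integers(len(elite))])
            for old in rng.choice(list(g), rng.integers(1, 6),
                                  replace=False):
                g.discard(old)                # replace 5-25% of the genes
                g.add(int(rng.integers(nfeat)))
            new.append(pad(g))
        while len(new) < pop_size:            # other half is random
            new.append(rand_chrom())
        pop = new
        gen_best = min(pop, key=lambda g: rmse_of(g, F, E))
        err = rmse_of(gen_best, F, E)
        if err < best[1]:
            best, stall = (gen_best, err), 0
        else:
            stall += 1
    return best
\end{verbatim}
The call evolve(F, E) returns the best chromosome found (a set of feature indices) together with its training RMSE.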
\subsubsection{Second stage: fine-tuning with greedy graph traversal algorithm}
The second stage of the training process is designed to deal with strong correlation (e.g., multicollinearity) between features and to improve the functional form of the energy equation derived by either EN or GA in the first stage.
It is worth noting that commonly-employed methods for resolving linear dependence issues (e.g. principal component analysis) are not applicable in our case because our main goal is to keep our features pure (i.e. uncontaminated by the admixture of other closely related nearly-dependent features).
Therefore, we designed a fine-tuning procedure that attempts to lower the RMSE of the model by swapping strongly-correlated features.
%RZK: Is RMSE defined by this point
The fine-tuning starts with the calculation of the correlation matrix (i.e. matrix of Pearson correlation coefficients) for all features, including those not chosen in the first stage.
The correlation coefficients are then used to assign a list of strongly-correlated equivalents to each feature.
Two features are considered strongly-correlated if their correlation coefficient is greater than a pre-selected threshold value.
Finally, a greedy graph traversal (GGT) algorithm, in which descendant nodes are created by replacing a feature with one of its highly-correlated equivalents, is used to improve the energy model.
The total number of nodes in the graph --- $\mathcal{O}(n^k)$, where $k$ is the number of selected features and $n$ is the average number of highly-correlated equivalents per feature --- is typically too large for an exhaustive search.
A heuristic approach to reduce the number of traversed nodes is described in Algorithm~\ref{SI-fig:ggta} in the Supplementary Information.
In essence, our GGT method inserts a newly created descendant into the waiting list only if this node has lower RMSE than all other nodes of the same generation (i.e. shortest distance to the root).
Otherwise, the child node and all its descendants are ignored.
There is only one waiting list for the graph and it is kept sorted according to RMSE.
The GGT traverses the graph by picking the best node from the waiting list and then checking if it still has the lowest RMSE within its generation.
The worst nodes are removed from the waiting list once it reaches the predefined maximum length.
As with any heuristic algorithm, our GGT approach is not guaranteed to find the best possible energy equation for a fixed number of features, but it nonetheless proves crucial for improving the GA results in a reasonable amount of time.
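The sketch below outlines the GGT procedure, reusing the rmse\_of helper from the genetic-algorithm sketch above. The correlation threshold and the maximum waiting-list length are illustrative, and, for simplicity, nodes that have already been visited are skipped.
\begin{verbatim}
import numpy as np

def correlated_equivalents(F, threshold=0.95):
    """For every feature, the indices of its strongly correlated
    equivalents (absolute Pearson correlation above threshold)."""
    corr = np.corrcoef(F, rowvar=False)
    return [np.flatnonzero(np.abs(corr[i]) > threshold).tolist()
            for i in range(F.shape[1])]

def greedy_graph_traversal(genes, F, E, rmse_of,
                           threshold=0.95, max_queue=50):
    equiv = correlated_equivalents(F, threshold)
    root = tuple(sorted(genes))
    queue = [(rmse_of(root, F, E), 0, root)]  # (RMSE, generation, node)
    best_per_gen = {0: queue[0][0]}
    best, seen = queue[0], {root}
    while queue:
        err, gen, node = queue.pop(0)         # best node in the waiting list
        if err > best_per_gen.get(gen, np.inf):
            continue                          # no longer best of its generation
        if err < best[0]:
            best = (err, gen, node)
        for i, feat in enumerate(node):       # children: swap one feature for
            for alt in equiv[feat]:           # a strongly correlated equivalent
                if alt in node:
                    continue
                child = tuple(sorted(node[:i] + (alt,) + node[i + 1:]))
                if child in seen:
                    continue
                seen.add(child)
                c_err = rmse_of(child, F, E)
                if c_err < best_per_gen.get(gen + 1, np.inf):
                    best_per_gen[gen + 1] = c_err
                    queue.append((c_err, gen + 1, child))
        queue.sort()                          # waiting list sorted by RMSE
        del queue[max_queue:]                 # drop the worst nodes
    return best[2], best[0]
\end{verbatim}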
%\textit{Selection criteria. These are alternative options that control the generation of child nodes:}
%\begin{itemize}
%\item Fast: each newly created node compares with best node and goes into queue only if it has better fitness. Algorithm works as one branch of depth first search in this case.
%\item Parent: if newly created node is better than its parent it goes to queue.
%\item Level: The best node is picked from the waiting list. If its RMSE is currently the best on its level then all child nodes are created from the this node in ZZZ order. If the fitness of a newly created node child node is the best on its level then it is inserted into the waiting list, which is always kept sorted according to the RMSE of the nodes.
%\item Slow: all nodes produced go to queue and algorithm becomes the breadth-first search. It scans through all graph and can find the best possible solution. However, its complexity --- $\mathcal{O}(n^k)$ where $k$ is the number of selected features and $n$ is the average number of highly-correlated features per selected feature --- makes it unmanageable in most cases.
%\end{itemize}
%A* is greedy depth first search graph algorithm with heuristic function that speeds up computation by decreasing the number of nodes to be evaluated. %General form of its fitness function if f(n) = g(n) + h(n) where g(n) is path cost function that in our case represent the level of node n in search graph and h(n) - heuristic function which is "shortest path" to the goal. Since our goal is fitness = 0, then we can use MSE as a parameter for h(n).
%Comments on performance and accuracy: Since the subset contains only 20 or less features in our case the algorithm works relatively fast. Its speed depends on MinCorrelation variable that influence the formation of alternate features lists. The closer this variable to 0 the greater lists and the lower performance. %When MinCorrelation approaches to 0 the number of children nodes increases, therefore increases the probability of improving the fitness.
%Comments on accuracy: focused on eliminating collinear variables in such a way that final fit will contain only one feature for each collinear group. It is not the ideal representation but it allows us having good enough fit minimize number of variables that would give a compact energy equation.
%%% FAST post-GA feature reduction
\subsubsection{Optional backward feature elimination}
The final number of selected features can be adjusted from the start in the GA method by changing the size of a chromosome.
While the final number of retained features in the EN algorithm is not known \emph{a priori}, it can also be adjusted by restarting the method with a different regularization strength.
However, in both cases, it is useful to estimate quickly how the quality of the fit depends on the number of predictors without repeating the entire training process.
Therefore, we implemented a fast feature elimination procedure that starts with an existing feature set, defines the worst feature as the one whose removal leads to the smallest increase in RMSE, drops the worst feature from the set, and performs a final fine-tuning based on the GGT algorithm. The elimination procedure is repeated until only one feature is left in the set.
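A sketch of this elimination loop, again in terms of the rmse\_of helper defined above, reads:
\begin{verbatim}
def backward_elimination(genes, F, E, rmse_of):
    """Repeatedly drop the feature whose removal increases the RMSE
    the least; returns the RMSE-vs-size path down to one feature."""
    genes = list(genes)
    path = [(tuple(genes), rmse_of(genes, F, E))]
    while len(genes) > 1:
        # try removing each feature in turn
        trials = [(rmse_of(genes[:i] + genes[i + 1:], F, E), i)
                  for i in range(len(genes))]
        err, worst = min(trials)     # smallest increase in RMSE
        del genes[worst]             # drop the worst feature
        # (a GGT fine-tuning pass would optionally be applied here)
        path.append((tuple(genes), err))
    return path
\end{verbatim}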
\subsection{Molecular systems and training sets}
%Lennard-Jones model.
We considered the following reference systems: two water molecules described by the TIP3P potential, two water molecules described at the MP2 level, and two- and three-molecule water models with geometries obtained from AIMD.
First of all, the data must contain enough records to cover all arrangements that occur in practice.
In other words, the data must contain enough configurations, with potential energies varying from the highest in absolute value down to zero.
Another requirement is that the data be spread evenly, with an equal number of records for each representative bin, defined in our case by the average distance between the centers of mass.
Thus, the first part of the algorithm takes the required number of records from each interval, thereby separating the training set from the test set.
Only the training set is used in the fitting procedure; the test set is used for prediction, for evaluating the accuracy of the fit, and for visualizing the results.
In order to analyze the efficiency of the prediction algorithm, the number of observations can be reduced to a required quantity.
Some algorithms require data standardization, so before doing anything else we perform feature scaling on the training set so that each feature has zero mean and unit variance.
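The preparation of the training and test sets can be sketched as follows; the bin width, the number of points per bin, and the helper names are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import StandardScaler

def split_by_com_bins(d_com, n_train_per_bin, bin_width=0.2, seed=0):
    """Split configurations into training and test sets so that
    every bin of the molecule-molecule center-of-mass distance
    contributes the same number of training points."""
    rng = np.random.default_rng(seed)
    bins = np.floor((d_com - d_com.min()) / bin_width).astype(int)
    train_idx, test_idx = [], []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        rng.shuffle(idx)
        n = min(n_train_per_bin, len(idx))
        train_idx.extend(idx[:n])
        test_idx.extend(idx[n:])
    return np.array(train_idx), np.array(test_idx)

# feature scaling to zero mean and unit variance,
# fitted on the training set only:
#   scaler  = StandardScaler().fit(F[train_idx])
#   F_train = scaler.transform(F[train_idx])
#   F_test  = scaler.transform(F[test_idx])
\end{verbatim}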
[Related to Models] For three water molecules, with 9 atoms, the total number of distances is $9\cdot 8/2 = 36$. Using the same set of powers ($M_1=15$ and $M_2=7$) without feature reduction, we would have $540+17640=18180$ features. However, reducing the number of features with the technique described above and discarding all intramolecular distances decreases the number of features to only $45+630=675$, which is much more manageable for fitting.
[Related to Models] However, the calculation of variance inflation factors showed that there is very strong multicollinearity between the features, which indicates that products of distances must be included in the model. Similarly to the single-distance model, if $M_1=0$ and $M_2=1$, the set contains 22 features and the new model has $22+5=27$ features in total. Another assumption that has been made is that our function depends only on intermolecular distances. After performing several fitting procedures, this assumption was confirmed by the results, since the fitting algorithm discards the intramolecular distances in almost all cases. This fact allowed us to reduce the number of features even further. To get a reasonable fit, the number of powers must be increased, which increases the number of features. For the model with $M_1=15$ and $M_2=7$, the number of features is $45+245=290$, which is more than acceptable for fitting. Without feature reduction, we would have $225+2940=3165$ features in total (the same order of magnitude as the number of parameters in ANN-based MLPs). In the case of two water molecules, a reasonable fit can be obtained using only single distances. However, for three water molecules, the full model should be used, since the simplified one cannot give an accurate fit.
\section{Results and discussion}
The last part of the algorithm summarizes the results in a report file, which includes $n$ models with the corresponding fitting coefficients, and creates plots that represent the prediction of each model using only the test set. It also compares our fit with a Gaussian process fit obtained using the same training set. In order to keep the number of features manageable, we have to restrict the maximum power to some reasonable value. Our experiments show that for single-distance features a maximum power of $M_1=15$ in Eq.~(\ref{eq:energy}) is sufficient, and for features that contain products of distances a maximum power of $M_2=7$ is sufficient. Increasing these numbers further does not give any significant improvement but augments the number of features and slows down the fitting procedure. The systems and models considered are summarized in Table~\ref{Tab:systems}.
\begin{table*}
\caption{Systems and models}\label{Tab:systems}
\begin{tabular*}{\textwidth}{l @{\extracolsep{\fill}} ccccc}
%\begin{tabular}{l*{6}{c}r}
\hline
System & A & B1 & B2 & C & D\\
\hline
Github nickname & set2.x & set6.x & set6.x & set4.x & \\
\hline
$G$ & 1 & 1 & 2 & 2 & \\
\hline
$M_1$ & 15 & 15 & 15 & 15 & \\
\hline
$M_2$ & 0 & 0 & 7 & 7 & \\
\hline
Chemical composition & (H$_2$O)$_2$ & (H$_2$O)$_2$ & (H$_2$O)$_2$ & (H$_2$O)$_3$ & \\
\hline
No. of unique features & & & & &\\
\hline
Ref. energy & TIP3P & MP2 & MP2 & BLYP+D/mo-TZV2P & \\
\hline
Configurations & Grid & $R_{OO}$ bins & $R_{OO}$ bins & AIMD\\
\hline
\end{tabular*}
\end{table*}
Refer to the SI for a discussion of LASSO---a special case of EN.
\subsection{System A: TIP3P water dimer}
First of all, we show that our algorithm correctly recovers a known analytical equation from sparse numerical data. We chose the TIP3P potential (Eq.~\ref{eq:tip3p}), developed to simulate liquid water, as the reference equation to be recovered.
%
\begin{eqnarray} \label{eq:tip3p}
E(R_i) & = & \frac{c_0}{R_{O_1O_2}^{12}}-\frac{c_1}{R_{O_1O_2}^6} + \frac{c_2}{R_{O_1O_2}} - \nonumber \\
& - & c_3(\frac{1}{R_{O_1H_{21}}}+\frac{1}{R_{O_1H_{22}}}+\frac{1}{R_{O_2H_{11}}}+\frac{1}{R_{O_2H_{12}}}) + \nonumber\\
& + & c_4(\frac{1}{R_{H_{11}H_{21}}}+\frac{1}{R_{H_{11}H_{22}}}+\frac{1}{R_{H_{12}H_{21}}}+\frac{1}{R_{H_{12}H_{22}}})
\end{eqnarray}
%
where $c_1=ZZZ$~a.u., ZZZ.
In order to test the algorithm and see how well it finds a simple expression, we first used Eq.~(\ref{eq:tip3p}) to generate data in the interval where the distance between the centers of mass of the two water molecules varies from 2 to 8~\Ang. We used a range of powers from $-1$ to $-15$ for single-distance features and the same range for features that contain products of distances. Since all intramolecular distances are constants, they were removed from the set. For the simulation we took only a random subset of configurations that equally represents each small bin in the chosen interval; the bin size was 0.2~\Ang. The initial number of features was $135 + 1008 = 1143$. After dimensionality reduction, the genetic algorithm started with $45 + 245 = 290$ features and 69 training points per bin, with the total number of training records equal to 2139. The initial chromosome size was 20 and the number of individuals in the population was 100. At the end of the run, the exact linear fit with 5 variables and the original coefficients was recovered.
ZZZ - Note relation between charges in our procedure.
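For reference, the dimer energy of Eq.~(\ref{eq:tip3p}) can be evaluated with a short Python helper; the coefficients $c_0,\ldots,c_4$ are passed in as inputs, since their numerical values are quoted separately, and the distance labels follow the equation.
\begin{verbatim}
def tip3p_dimer_energy(r, c):
    """Water dimer energy; r is a dict of intermolecular distances
    (in Angstrom) keyed as in the equation, c = (c0, ..., c4)."""
    oh = ['O1H21', 'O1H22', 'O2H11', 'O2H12']      # O-H pairs
    hh = ['H11H21', 'H11H22', 'H12H21', 'H12H22']  # H-H pairs
    return (c[0] / r['O1O2'] ** 12 - c[1] / r['O1O2'] ** 6
            + c[2] / r['O1O2']
            - c[3] * sum(1.0 / r[k] for k in oh)
            + c[4] * sum(1.0 / r[k] for k in hh))
\end{verbatim}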
\subsection{System A2: TIP3P + noise?}
\subsection{System B: water dimer}
%VLAD: all units must be the same throught the manuscript. For energy, use kJ/mol. 1 Hartree = 2625.5 kJ/mol
In this part, we focus on the two-water-molecule model. First, we analyze how the RMSE of the function obtained from the fitting procedure depends on the number of predictors. In the simplest case (system B1 in Table~\ref{Tab:systems}) we include only the single-distance terms of Eq.~(\ref{eq:energy}). Figure~\ref{Fig:B1}A shows that for fits with more than 6 variables the RMSE decreases only slightly. To examine what the machine learned from the dataset, it is instructive to compare the classic model given by Eq.~(\ref{eq:tip3p}) with the newly found fit. For 5 predictors, the results are summarized in Table~\ref{Tab:B1 coefficients}. Its RMSE of 0.0007068~Ha is even better than the 0.0007428~Ha obtained with the classic formula. Since the algorithm saves all explored states, we discovered that there are many states with a similar goodness of fit [ZZZ---they are physically equivalent]. The one described above is simply the best result obtained for this particular configuration (training points in the interval from 2.8 to 5.6~\Ang).
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/B1_Single_GA_PATH_RMSE.eps}
\includegraphics[width=0.45\textwidth]{media/B1_Single_Energy_5_predictors.eps}
\includegraphics[width=0.45\textwidth]{media/B1_Single_Error_5_predictors.eps}
\caption{Results of the GA + A* and GP algorithms for system B1. A. Dependence of the RMSE on the number of predictors. B. Dependence of the energy on the average distance between the centers of mass for 5 predictors. C. Dependence of the error on the average distance between the centers of mass for 5 predictors.}\label{Fig:B1}
\end{figure}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/B1_GA_energy_error_histogram_5_predictors.eps}
\caption{Energy error histogram. Fit for 5 predictors obtained by GA + A* search. System B1}\label{Fig:B1_histogram_5_predictors}
\end{figure}
\begin{table}
\caption{Final five-feature fit for system B1 obtained with GA and A* search.}\label{Tab:B1 coefficients}
\begin{tabular*}{0.45\textwidth}{c @{\extracolsep{\fill}} ccc}
%{ccc}
%\begin{tabular}{l*{6}{c}r}
\hline
Distance & Power, $n$ & Coefficient, $\text{Ha}\cdot\Ang^n$ \\
\hline
O-O & 15 & 10277.7 \\
\hline
O-H & 1 & -0.1348 \\
\hline
O-O & 1 & 0.2726 \\
\hline
H-H & 1 & 0.06622 \\
\hline
H-H & 7 & 0.04201 \\
\hline
\end{tabular*}
\end{table}
To analyze system B more deeply, we extended our previous model to include the first two groups of terms of Eq.~(\ref{eq:energy}) (system B2 in Table~\ref{Tab:systems}).
%ZZZ - note raltion between charges.
The second group of terms is useful if there are dependencies between the features generated from the first group; in this case, the double-distance terms improve the goodness of the fit.
%VLLAD: Label panels A, B, C in the two three-panel figures. Combine figures 4 and 8 into one panel.
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/B2_GA_PATH_RMSE.eps}
\includegraphics[width=0.45\textwidth]{media/B2_Energy_8_predictors.eps}
\includegraphics[width=0.45\textwidth]{media/B2_Error_8_predictors.eps}
\caption{Results of the GA + A* and GP algorithms for system B2. A. Dependence of the RMSE on the number of predictors. B. Dependence of the energy on the average distance between the centers of mass for 8 predictors. C. Dependence of the error on the average distance between the centers of mass for 8 predictors.}\label{Fig:B2}
\end{figure}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/B2_GA_energy_error_histogram_8_predictors.eps}
\caption{Energy error histogram for 8 predictors. Fit obtained by GA + A* search. System B2}\label{Fig:B2_histogram_12_predictors}
\end{figure}
% B2 VIP
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/B2_VIP_GA_PATH_RMSE.eps}
\caption{Dependence of RMSE on the number of predictors. GA + A* search. System B2 with fixed 5 best predictors from system B2}\label{Fig:B2_VIP_RMSE}
\end{figure}
For the same number of predictors, say 5, system B2 gives a better goodness of fit than system B1 (RMSE = 0.0007068~Ha for B1 and 0.0005661~Ha for B2).
The algorithm chose 3 out of 5 features that contain products of distances, which signifies that the single distances are not completely independent and that there exists some relationship between them. The plot of RMSE vs.\ the number of nonzero coefficients for system B2 (Figure~\ref{Fig:B2}A) shows that the curve flattens at approximately 12 predictors; increasing the number of predictors beyond 12 does not significantly improve the fit. The coefficients for that particular fit are shown in Table~\ref{Tab:B2 coefficients}, together with references to the figures that describe the features containing products of distances (Fig.~\ref{Fig:B2_Dist}). Plots of the average error and the average energy vs.\ the distance between the centers of mass for system B2 and the Gaussian process are shown in Figs.~\ref{Fig:B2}C and \ref{Fig:B2}B, respectively, and the energy error histogram is shown in Fig.~\ref{Fig:B2_histogram_12_predictors}.
\begin{table}[h]
\caption{Coefficients of the fit for system B2 based on the first two groups of terms of Eq.~(\ref{eq:energy}), with a description of the distances involved}
\label{Tab:B2 coefficients}
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{Dist 1} & \textbf{Power 1} & \textbf{Dist 2} & \textbf{Power 2} & \textbf{Coefficient, $\text{Ha}\cdot\Ang^n$} & \textbf{Figure} \\
\hline
H-H & -1 & - & - & 0.1517 & - \\
\hline
O-H & -1 & - & - & -0.1617 & - \\
\hline
O-O & -1 & O-H & -1 & 0.9413 & Fig. \ref{Fig:B2_Dist} a) \\
\hline
O-H & -1 & H-H & -1 & -0.4133 & Fig. \ref{Fig:B2_Dist} b) \\
\hline
H-H & -3 & - & - & 0.9059 & - \\
\hline
O-O & -3 & - & - & -5.6737 & - \\
\hline
H-H & -3 & - & - & -0.8635 & - \\
\hline
O-O & -3 & - & - & 6.2834 & - \\
\hline
O-H & -4 & H-H & -4 & 1.0343 & Fig. \ref{Fig:B2_Dist} b) \\
\hline
O-O & -15 & - & - & -1800 & - \\
\hline
H-H & -7 & - & - & 0.09381 & - \\
\hline
O-O & -1 & - & - & 0.007361 & - \\
\hline
\end{tabular}
\end{table}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/water_distances_1.eps}
\caption{Features description. Model B2. Panel a) shows the feature that contains the product of the O-O and O-H intermolecular distances. Panel b) shows the feature that contains the product of the O-H and H-H distances.}\label{Fig:B2_Dist}
\end{figure}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/water_distances_2.eps}
\caption{Features description. Model C. Panels a), b), c), and d) represent four different features that contain the products of distances O-O $\times$ O-O, O-H $\times$ O-H, O-O $\times$ O-H, and O-H $\times$ H-H, respectively.}\label{Fig:C Dist1}
\end{figure}
\subsection{System C: three water molecules}
Beyond four predictors, increasing the number of features does not give a significant improvement in the fit. Remarkably, the best fit chosen by the algorithm contains the same power for each distance within a given double-distance feature. This is likely not a coincidence: for fits with the number of predictors ranging from two to ten, all double-distance features contain products of distances raised to the same powers. Coefficients and feature descriptions are given in Table~\ref{Tab:C coefficients}.
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/C_GA_PATH_R2.eps}
\caption{Dependence of $R^2$ on the number of predictors. GA + A* search. System C}\label{Fig:C_R2}
\end{figure}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/C_Energy_4_predictors.eps}
\caption{Dependence of the energy on the average distance from the centers of mass of the molecules to the center of mass of the system. Fit for 4 predictors obtained by the GA + A* search and the Gaussian process. System C}\label{Fig:C_Energy_4_predictors}
\end{figure}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/C_Error_4_predictors.eps}
\caption{Dependence of the error on the average distance from the centers of mass of the molecules to the center of mass of the system. Fit for 4 predictors obtained by the GA + A* search and the Gaussian process. System C}\label{Fig:C_RMSE_4_predictors}
\end{figure}
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/C_GA_energy_error_histogram_4_predictors.eps}
\caption{Energy error histogram. Fit for 4 predictors obtained by GA + A* search. System C}\label{Fig:C_histogram_4_predictors}
\end{figure}
\begin{table}[h]
\caption{Coefficients for system C according to the first two groups of terms of Eq.~(\ref{eq:energy}), with a description of the corresponding distances and powers}
\label{Tab:C coefficients}
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{Dist 1} & \textbf{Power 1} & \textbf{Dist 2} & \textbf{Power 2} & \textbf{Coefficient, $\text{Ha}\cdot\Ang^n$} & \textbf{Figure} \\
\hline
O-H & -3 & O-H & -3 & 0.3495 & Fig. \ref{Fig:C Dist1} b) \\
\hline
O-O & -3 & O-O & -3 & 1.5937 & Fig. \ref{Fig:C Dist1} a) \\
\hline
O-O & -3 & O-H & -3 & -0.9107 & Fig. \ref{Fig:C Dist1} c) \\
\hline
O-H & -4 & H-H & -4 & -0.2013 & Fig. \ref{Fig:C Dist1} d) \\
\hline
\end{tabular}
\end{table}
\subsection{Comparison with Gaussian process}
Another interesting aspect is the comparison of our results with the Gaussian process. At the same time, we investigate one of our goals: can our algorithm predict in regions where there are no training points; in other words, can it be used to interpolate and extrapolate? To answer this question, we removed training points from some regions. In the following example, the simulation is performed in the region where the average distance between the centers of mass of the two water molecules spans [2.4..15]~\Ang. However, we removed all data in the [5..7]~\Ang{} region to see how interpolation works and in the edge region [9..15]~\Ang{} to observe extrapolation. It has to be mentioned that we kept test points in the entire [2.4..15]~\Ang{} region. The energies predicted by the Gaussian process are then compared with those of our method.
\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{media/B1_gap_Energy_6_predictors.eps}
\caption{\textit{Dependence of the energy on the average distance between the centers of mass. Fit for 6 predictors obtained by the GA + A* search and the Gaussian process. System B1. Trained regions [2.4..5] and [7..9]~\Ang, test region [2.4..15]~\Ang}}
\label{Fig:B1_gap_energy_6_predictors}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{media/B1_gap_Error_6_predictors.eps}
\caption{\textit{Dependence of the error on the average distance between the centers of mass. Fit for 6 predictors obtained by the GA + A* search and the Gaussian process. System B1. Trained regions [2.4..5] and [7..9]~\Ang, test region [2.4..15]~\Ang}}\label{Fig:B1_gap_RMSE_6_predictors}
\end{figure}
\begin{figure}[h]
\includegraphics[width=0.45\textwidth]{media/B1_gap_GA_energy_error_histogram_6_predictors.eps}
\caption{\textit{Energy error histogram. Fit for 6 predictors obtained by the GA + A* search. System B1. Trained regions [2.4..5] and [7..9]~\Ang, test region [2.4..15]~\Ang}}\label{Fig:B1_gap_histogram_6_predictors}
\end{figure}
Figures~\ref{Fig:B1_gap_energy_6_predictors} and \ref{Fig:B1_gap_RMSE_6_predictors} clearly show that in the trained regions both algorithms give good predictions: the error is relatively small and the predicted energy is almost identical to the true energy. In the [2.4..5]~\Ang{} region the error is larger, as is the deviation of the predicted energy, since at short distances the behavior of the two-water-molecule system is much more complex and the true energy function has a more complex shape. However, if we examine the two regions where there are no training points, we can see a big difference in the predictions. Our linear model is still good: the predicted energy values are close to the true energy and the level of the error is almost the same as in the adjacent trained region. The Gaussian process behaves differently: the error goes up and the predicted energies are also higher, especially in the edge region [9..15]~\Ang. The histogram in Fig.~\ref{Fig:B1_gap_histogram_6_predictors} shows that the error distribution is reasonable, since the most frequently occurring error values are close to zero.
Similarly, Figs. 11, 12, and 13 represent the corresponding results for the model that includes double-distance features (system B2) with 6 predictors and the same other parameters. The fit is even better than for the single-distance model, but the general behavior is similar.
Another important aspect is to examine the dependence of the goodness of fit on the number of training points. To do that, we used the model of system B2 with trained regions [2.4..5] and [7..9]~\Ang{} and test region [2.4..15]~\Ang. However, instead of using all training points, we performed 10 different fits with the number of training points ranging from 10\% of the initial number up to 100\% in steps of 10\%. Both the linear model and the Gaussian process were examined under equal conditions. The results are represented by Figs. 14 and 15. It can be clearly seen that the Gaussian process requires more training points to obtain a reasonable prediction. The linear model behaves better with a reduced number of training points and gives reasonable results even at 10\% of the initial set. All points were chosen randomly within each small interval of the average distance between the centers of mass of the two water molecules.
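The experiment can be summarized by a short loop; fit\_and\_predict is a hypothetical wrapper around either the linear model or the Gaussian process, and the simple random subsampling shown here stands in for the per-bin sampling described above.
\begin{verbatim}
import numpy as np

def learning_curve(F_train, E_train, F_test, E_test,
                   fit_and_predict, seed=0):
    """Test-set RMSE as a function of the fraction of the
    training set used (10%, 20%, ..., 100%)."""
    rng = np.random.default_rng(seed)
    n = len(E_train)
    curve = []
    for frac in np.arange(0.1, 1.01, 0.1):
        idx = rng.choice(n, int(frac * n), replace=False)
        E_pred = fit_and_predict(F_train[idx], E_train[idx], F_test)
        rmse = np.sqrt(np.mean((E_pred - E_test) ** 2))
        curve.append((frac, rmse))
    return curve
\end{verbatim}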
\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{media/B2_12_predictors_RMSE_vs_percentage_of_training_set_used.eps}
\caption{RMSE vs.\ the percentage of training points used. System B2 with 12 predictors and the Gaussian process. Trained and test regions [2.8..5.6]~\Ang}
\label{Fig:B2_fractions}
\end{figure}
\subsection{Water dimer dissociation curve: TIP3P, DFT, EvE}
EVolutionary (Analytical) Expression - EV(A)E
\subsection{Trimer cooperativity: plus-minus map}
\section{Conclusions}
In this work, we proposed a new machine learning strategy to reproduce the interaction energy between molecules based on a limited training set of accurate quantum mechanical data. Unlike existing machine learning techniques, the new methodology is designed to incorporate physically relevant information into the model while still retaining its high flexibility. In addition, the learning strategy is built around a strict control of overfitting and thus selects only the most important components of the model.
%This goal is achieved through a combination of modern artificial intelligence algorithms including genetic and elastic net algorithms.
It is demonstrated, using water molecules as an example, that this approach recovers fundamental physical laws that govern intermolecular interactions. As a result, the obtained models not only reproduce quantum mechanical energies in the region spanned by the training set configurations but are also capable of extrapolating well outside this region --- a remarkable achievement for machine learning in molecular modeling. Moreover, it is shown that incorporating physically motivated features into the model drastically reduces the required number of reference data points in the training set without sacrificing the quality of predictions.
%improves predictive power of machine learning for molecular/materials modeling.
In the future, the approach presented here will be extended to systems of strongly interacting atoms
%in gas-phase molecules, condensed phase systems, and materials
that may require incorporating more complex nonlinear physical features.
% reformulation of the procedure in the Bayesian framework analogous to that of GPR (kernel)
\section{Computational methods}
\textbf{Gaussian process regression.} To better understand the accuracy and the performance of our method, we compare its results with those obtained using other machine learning algorithms, such as artificial neural networks or Gaussian processes. For this work, the Gaussian process was chosen since it can ``mimic'' the shape of almost any complex function. We used GaussianProcessRegressor from sklearn.gaussian\_process, which implements Gaussian processes for machine learning as described by Rasmussen and Williams. During fitting, the hyperparameters of the kernel are optimized by maximizing the log-marginal-likelihood with the chosen optimizer. We used two kernels: RBF, the radial-basis-function or squared-exponential kernel, and a white kernel, whose main purpose is to add a noise component to the signal. These two kernels have two crucial parameters: the length scale and the noise level, respectively. By varying these parameters, it is possible to locate the local maxima of the log-marginal-likelihood and achieve the best possible fit. Either the optimizer provided by scipy.optimize.fmin\_l\_bfgs\_b or a user-defined optimizer with a specific signature can be used to tune the kernels and therefore the GP regressor. Since the objective depends on these two main variables, we used a stochastic hill-climbing algorithm with random restarts to find the optimal solution (Fig.~\ref{Fig:B1_Gaussian_path}).
\begin{figure}
\includegraphics[width=0.45\textwidth]{media/B1_Gaussian_path.eps}
\caption{Gaussian process optimization path obtained by stochastic hill climbing with 10 random restarts. The best Gaussian process result corresponds to the blue dot on the graph. System B1}\label{Fig:B1_Gaussian_path}
\end{figure}
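A minimal version of this Gaussian-process baseline is sketched below; the kernel bounds, the number of restarts, and the synthetic data are illustrative rather than the exact settings used for the figures.
\begin{verbatim}
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# F_train, E_train: training features and energies; F_test: test set
rng = np.random.default_rng(0)
F_train = rng.uniform(0.1, 2.0, size=(200, 5))
E_train = np.sin(F_train).sum(axis=1) + 1e-4 * rng.standard_normal(200)
F_test = rng.uniform(0.1, 2.0, size=(50, 5))

# squared-exponential kernel plus a white-noise component
kernel = (RBF(length_scale=1.0, length_scale_bounds=(1e-3, 1e3))
          + WhiteKernel(noise_level=1e-5,
                        noise_level_bounds=(1e-12, 1e-1)))

# kernel hyperparameters are optimized by maximizing the
# log-marginal-likelihood with L-BFGS-B; random restarts play the
# role of the stochastic hill climbing described in the text
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10,
                              normalize_y=True)
gp.fit(F_train, E_train)
E_pred, E_std = gp.predict(F_test, return_std=True)
\end{verbatim}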
\section{Acknowledgments}
The research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) through Discovery
Grants (RGPIN-2016-0505). The authors are grateful to Compute Canada and, in particular, the McGill HPC Centre for computer time.
\bibliography{learn-physics}
\section{LASSO}
We have explored the performance of the least absolute shrinkage and selection operator (LASSO) method for the feature selection stage of the training algorithm. LASSO is a special case of the elastic net regularization with the coefficient $\lambda_2$ of the quadratic penalty set to zero. We found that LASSO, based on both forward stepwise and least angle regression, tends to drop important features that are somewhat collinear with those already selected. As a result, the final list selected by LASSO contains too few features.
In a multiple regression model, multicollinearity indicates that the collinear independent variables are related to each other. The presence of multicollinearity within the set of features makes it difficult to assess the significance of individual independent features in the regression model. Our feature set is generated in such a way that collinearity and multicollinearity cannot be avoided, since it essentially consists of a set of distances raised to different powers. The variance inflation factors (VIF) confirm that multicollinearity issues exist in our set. The VIF allows a quick estimation of how much a variable contributes to the standard error of the regression model and becomes very large when significant multicollinearity issues exist. Multicollinearity can lead to less reliable probability values and larger confidence intervals.
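Variance inflation factors can be computed directly from the feature matrix; a short sketch using statsmodels (the synthetic data are illustrative) is:
\begin{verbatim}
import numpy as np
from statsmodels.stats.outliers_influence import (
    variance_inflation_factor)

# F: (n_samples, n_features) matrix of POID features (training set)
rng = np.random.default_rng(0)
F = rng.uniform(0.5, 2.0, size=(300, 10))
F[:, 9] = F[:, 0] ** 2 + 0.01 * rng.standard_normal(300)

vif = [variance_inflation_factor(F, i) for i in range(F.shape[1])]
for i, v in enumerate(vif):
    # a large VIF signals significant multicollinearity
    print(f"feature {i}: VIF = {v:.1f}")
\end{verbatim}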
\end{document}