General-Purpose Potentials for Organic Molecular Solids

Here you will find the bulk of my work on creating a general-purpose machine learning potential for organic molecular solids.

Note: the file CSD_GAP_model.json is Aditi's preliminary baselined potential.

Guide to the enclosed directories and files

General rule of thumb: notebooks whose names capitalise the first letter of each word are worth opening.

If there aren't any capitalised letters, the notebook will likely be messy.

There's quite a lot in this directory, but the notebook you should really open and read is Best_Baselined_Potential.ipynb

This notebook applies the methods and best practices I learnt from the other notebooks and creates the best-performing baselined potential.

If you want to recreate the best performing direct-fit potential, you can open the notebook Best_Direct_Fit_Potential.ipynb.

This notebook is less thoroughly commented than Best_Baselined_Potential.ipynb, but it applies exactly the same methodology, so if anything is unclear, please refer to that notebook and its comments.

The old_notebooks directory contains messy notebooks related to developing general-purpose potentials. The Best_* notebooks in the root directory contain the curated, important information, but the old notebooks are kept in old_notebooks for reference. To execute these notebooks properly, they have to be moved to the root directory first! Within old_notebooks, you will find the general_purpose_potential_part_x.ipynb notebooks, which document the process of developing and testing the direct-fit general-purpose potential.

The first three notebooks create an initial potential
The notebooks thereafter have various tests/optimisations and create newer models from what I learnt, in order to improve performance
Again, please note that the Best_X_Potential.ipynb notebooks contain the methodology that led to the best-performing potentials
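
As a convenience for the step of moving an old notebook to the root directory before running it, here is a minimal sketch. It copies rather than moves, so the original stays in old_notebooks. The function name copy_notebook_to_root is hypothetical and not part of this repository, and the sketch assumes it is run from the repository root.

```python
import shutil
from pathlib import Path


def copy_notebook_to_root(name, repo_root="."):
    """Copy old_notebooks/<name> into the repository root and return the new path.

    Hypothetical helper: assumes repo_root is the top-level directory of this
    repository and that it contains an old_notebooks subdirectory.
    """
    root = Path(repo_root)
    source = root / "old_notebooks" / name
    if not source.is_file():
        raise FileNotFoundError(f"No such notebook: {source}")
    target = root / name
    if target.exists():
        # Refuse to overwrite a notebook that already lives in the root.
        raise FileExistsError(f"Refusing to overwrite: {target}")
    shutil.copy2(source, target)
    return target


# Example call (pass the real file name of the notebook you want to run):
# copy_notebook_to_root("<notebook file name>.ipynb")
```

Delete the copy from the root directory once you are done to keep the directory tidy.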

The Learning_Curves.ipynb notebook creates learning curves using the first 20k structures from the PCovFPS-sorted initial training set and varying numbers of sparse points per species. This notebook also shows how the computation time of properties increases with increasing numbers of sparse points per species, as well as when computing representations with and without gradients.

The initial training set contains ~24.6k structures in total; see Best_Baselined_Potential.ipynb for how it was constructed.

Other (maybe) useful notebooks are:

create_PCA.ipynb, used to create a Kernel PCA map of training and test set structures, to show that there are many abnormal structures in the training set. Also shows that the binding energies of elements as determined from the training set structures and test set structures differ.
dftb_calcs.ipynb, used to calculate DFTB energies and forces for the CSD-1k test set and the initial training set of ~24.6k structures (11 x 2238 configurations, with the 11 most diverse from each crystal selected via FPS)
deepMD_potential.ipynb, used to prepare files for use with DeepMD. Very scrappy notebook, so it might be better to ask Davide Tisi about how to create a deepMD potential as he has more experience.
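
For readers unfamiliar with the farthest point sampling (FPS) mentioned in the dftb_calcs.ipynb entry, the idea is sketched below in plain Python. This is not the code used in the notebooks. The function name, the Euclidean distance and the toy two-dimensional features are illustrative assumptions; in practice each configuration would be described by a structural feature vector.

```python
import math


def farthest_point_sampling(features, n_select, start=0):
    """Greedily pick n_select indices so that the chosen points are spread out.

    Illustrative sketch: each new point is the one farthest (in Euclidean
    distance) from its nearest already-selected point.
    """
    if not 0 < n_select <= len(features):
        raise ValueError("n_select must be between 1 and len(features)")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(f, features[start]) for f in features]
    while len(selected) < n_select:
        # The next pick is the point farthest from the current selection.
        nxt = max(range(len(features)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected


# Toy feature vectors standing in for the configurations of one crystal
# (hypothetical data, for illustration only).
toy_features = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 2.0), (0.05, 0.05)]
picked = farthest_point_sampling(toy_features, n_select=3)
print(picked)  # [0, 3, 2]
```

Selecting the 11 most diverse configurations per crystal would correspond to calling such a routine with n_select=11 on the features of each crystal's configurations.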

Necessary packages to install

Please note that you will need the following packages to execute most of the notebooks:

ase
numpy
pandas
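pickle (part of the Python standard library, no installation needed)
itertools (part of the Python standard library, no installation needed)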
tqdm
sklearn (installed under the package name scikit-learn)
skmatter (previously called skcosmo)
librascal (installation instructions can be found at https://github.com/lab-cosmo/librascal)
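
A quick way to check whether the third-party packages above are importable in your environment is sketched below. The import names are assumptions based on common usage (scikit-learn imports as sklearn, librascal as rascal), and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names assumed for the third-party packages listed above.
REQUIRED_MODULES = ["ase", "numpy", "pandas", "tqdm", "sklearn", "skmatter", "rascal"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Anything reported as missing needs to be installed before running the notebooks.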
