Datasets for open forcefield parameterization and development
ThermoML data were compiled and filtered using the ThermoPyL tool developed by the Chodera Lab at MSKCC (https://github.com/choderalab/thermopyl)
FILTER PROCEDURE:
- Pull the full ThermoML archive
- Discard known erroneous data (j.fluid.2013.12.014 is the only one I know of so far)
- Define the properties of interest that pass the filter
- Allow only compounds containing C, O and H atoms to pass
- Generate SMILES strings from component names (CirPy, a Python interface to the NIH Chemical Identifier Resolver)
- Filter out SMILES strings containing "=" or "#" (removes compounds with double or triple bonds)
- Generate CAS numbers from component names (CirPy)
- Apply temperature and pressure filters (250 K - 400 K and 1 atm - 1000 atm)
- Keep only liquid-phase data points
- Separate the final large dataframe into subframes by property of interest
  - Remove data with no associated uncertainties from the subframes
- Generate counts by component and journal article for all dataframes
- Save each dataframe as a separate .csv text file
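
The bond, temperature, pressure, phase and uncertainty filters listed above can be sketched roughly as follows. This is a minimal illustration and not the actual ThermoPyL-based script: it uses plain dictionaries instead of dataframes to stay dependency-free. The field names (`smiles`, `temperature_K`, `pressure_atm`, `phase`, `uncertainty`), the `"Liquid"` phase label, the inclusive bounds and the example records are all assumptions made for the sketch.

```python
# Hypothetical record layout; the real dataframe columns may differ.
T_MIN_K, T_MAX_K = 250.0, 400.0      # temperature window from the filter procedure
P_MIN_ATM, P_MAX_ATM = 1.0, 1000.0   # pressure window from the filter procedure


def has_multiple_bond(smiles):
    """Return True if the SMILES string contains an explicit double or triple bond."""
    return "=" in smiles or "#" in smiles


def passes_filters(record):
    """Apply the bond, temperature, pressure, phase and uncertainty filters to one record."""
    if has_multiple_bond(record["smiles"]):
        return False
    # Bounds are treated as inclusive here (an assumption).
    if not (T_MIN_K <= record["temperature_K"] <= T_MAX_K):
        return False
    if not (P_MIN_ATM <= record["pressure_atm"] <= P_MAX_ATM):
        return False
    if record["phase"] != "Liquid":
        return False
    # Drop points that carry no reported uncertainty.
    return record["uncertainty"] is not None


# Made-up example records, for illustration only.
records = [
    {"smiles": "CCO", "temperature_K": 298.15, "pressure_atm": 1.0,
     "phase": "Liquid", "uncertainty": 0.002},
    {"smiles": "CC(=O)C", "temperature_K": 298.15, "pressure_atm": 1.0,
     "phase": "Liquid", "uncertainty": 0.001},   # rejected: double bond
    {"smiles": "CCCCO", "temperature_K": 450.0, "pressure_atm": 1.0,
     "phase": "Liquid", "uncertainty": 0.003},   # rejected: temperature too high
    {"smiles": "COC", "temperature_K": 300.0, "pressure_atm": 1.0,
     "phase": "Gas", "uncertainty": 0.004},      # rejected: not liquid
    {"smiles": "CCCO", "temperature_K": 300.0, "pressure_atm": 1.0,
     "phase": "Liquid", "uncertainty": None},    # rejected: no uncertainty
]

kept = [r for r in records if passes_filters(r)]
print([r["smiles"] for r in kept])
```

Note that a plain string check for "=" and "#" only catches explicitly written multiple bonds; whether aromatic systems written with lowercase atoms need separate handling depends on the SMILES returned by the resolver.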
Christopher I. Bayly developed a toy dataset of potential molecules of interest, which is deposited in the "AlkEthOH_distrib" subdirectory of the "Model Systems" directory. Its construction is described in the README.txt there, which should be converted to Markdown.