Data for NMR GNN

This contains the parsing scripts and data used for our GNN chemical shift predictor model.


pip install nmrgnn-data

Working in Python

Here's an example of how to load and work with the data in Python. The records are loaded as a TensorFlow dataset and can be iterated over in a for loop, as shown below.

import nmrdata
dataset = nmrdata.load_records('data/metabolite-records.tfrecord')
for record in dataset:
    # get single record
    break
print(record.keys())

dict_keys(['natoms', 'nneigh', 'features', 'nlist', 'positions', 'peaks', 'mask', 'name', 'class', 'index'])

Access positions as a NumPy array

record['positions'].numpy()

array([[ 0.83740795,  0.09760247,  0.2959486 ],
       [-0.562893  ,  0.00262405, -0.00434441],
       [-1.0725924 , -0.37873718,  0.9061929 ],
       [-0.75536764, -0.72710234, -0.8159687 ],
       [-1.0367495 ,  0.9557108 , -0.27988592],
       [ 1.2855262 , -0.8334997 ,  0.10487328],
       [ 1.3046683 ,  0.8834019 , -0.20681578]], dtype=float32)
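As a minimal sketch of working with these coordinates, the snippet below computes the distance between the first two atoms using only the standard library. The coordinates are copied from the output above; treating them as Cartesian coordinates in a single length unit is an assumption, and the variable names are illustrative rather than part of nmrdata.

```python
import math

# Coordinates copied from the example record above (first two atoms only).
positions = [
    (0.83740795, 0.09760247, 0.2959486),
    (-0.562893, 0.00262405, -0.00434441),
]

# Euclidean distance between atom 0 and atom 1.
d01 = math.dist(positions[0], positions[1])
print(f"distance between atoms 0 and 1: {d01:.3f}")
```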

Get chemical shifts

record['peaks'].numpy()

array([0.  , 0.  , 2.59, 2.59, 2.59, 0.  , 0.  ], dtype=float32)
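The zeros in this array appear to mark atoms without an assigned shift, and the record's mask entry presumably flags which values are meaningful. Both readings are assumptions, since the text here does not spell them out. Under that assumption, a minimal sketch for collecting the labeled atoms, using plain lists copied from the output above, could look like this:

```python
# Chemical shifts copied from the example output above.
peaks = [0.0, 0.0, 2.59, 2.59, 2.59, 0.0, 0.0]

# Hypothetical stand-in for record['mask']: treat nonzero shifts as labeled.
mask = [p != 0.0 for p in peaks]

# Map atom index -> shift for labeled atoms only.
labeled = {i: p for i, (p, m) in enumerate(zip(peaks, mask)) if m}
print(labeled)
```

With a real record, the mask stored in the record would replace the stand-in computed here.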

NumPy Error

If you see this error:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Try reinstalling NumPy:

pip uninstall -y numpy && pip install numpy

Parsing Scripts

To install with the parsing functionality, run:

conda install -c omnia openmm
pip install nmrgnn-data[parse]

Working with Data

Each command below prints additional information when run with the --help argument.

Find pairs

Find pairs of atoms with chemical shifts that are neighbors and sort them based on distance.

nmrdata find-pairs structure-test.tfrecords-data.tfrecord ALA-H ALA-N

Count Names

Get class/atom name counts:

nmrdata count-names structure-test.tfrecords-data.tfrecord


Check that records are consistent with embeddings

nmrdata validate-embeddings structure-test.tfrecords-data.tfrecord

Check that neighbor lists are consistent with embeddings

nmrdata validate-nlist structure-test.tfrecords-data.tfrecord

Check that peaks are reasonable (no nans, no extreme values, no bad masks)

nmrdata validate-peaks structure-test.tfrecords-data.tfrecord
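To illustrate the kind of checks described here, the following is a rough sketch in plain Python, not the actual implementation of nmrdata validate-peaks. The function name check_peaks and the max_abs threshold are hypothetical, as is the reading of a "bad mask" as one whose length differs from the peak list or that selects no atoms.

```python
import math

def check_peaks(peaks, mask, max_abs=1000.0):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if len(peaks) != len(mask):
        problems.append("mask length does not match peaks")
        return problems
    if not any(mask):
        problems.append("mask selects no atoms")
    for i, (p, m) in enumerate(zip(peaks, mask)):
        if not m:
            continue  # unlabeled atoms are ignored
        if math.isnan(p):
            problems.append(f"atom {i}: NaN shift")
        elif abs(p) > max_abs:  # hypothetical cutoff for an "extreme" value
            problems.append(f"atom {i}: extreme shift {p}")
    return problems

good = check_peaks([0.0, 2.59, 2.59], [False, True, True])
bad = check_peaks([float("nan"), 5000.0, 2.59], [True, True, True])
print(good)
print(bad)
```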

Output Labels

To extract labels ordered by PDB and residue:

nmrdata write-peak-labels test-structure-shift-data.tfrecord test-structure-shift-record-info.txt labels.txt

Making New Data

See the commands nmrparse shiftml, nmrparse metabolites, and nmrparse shiftx, which are parsers for various databases.

From RefDB Files

This requires a pickled Python object called data.pb to be in the directory. It is a list of dicts, each containing pdb_file (path to the PDB file), pdb (PDB ID), corr (path to the .corr file), and chain (which chain to use). Set chain to _ to use the first chain.
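A minimal sketch of building such a file with the standard pickle module is shown below. The keys follow the description above; the file paths, the PDB ID, and the temporary directory are placeholders, and whether the parser accepts exactly this layout should be verified against the parser itself.

```python
import os
import pickle
import tempfile

# Placeholder entries; paths and IDs are hypothetical.
entries = [
    {
        "pdb_file": "pdbs/1abc.pdb",  # path to the PDB file
        "pdb": "1ABC",                # PDB ID
        "corr": "corr/1abc.corr",     # path to the .corr file
        "chain": "_",                 # "_" means use the first chain
    },
]

# Stand-in for the directory that would be passed to parse-refdb.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "data.pb")
with open(path, "wb") as f:
    pickle.dump(entries, f)

# Read the file back to confirm it round-trips.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded[0]["chain"])
```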

nmrparse parse-refdb directory name --pdb_filter exclude_ids.txt


Please cite "Predicting Chemical Shifts with Graph Neural Networks":

  title={Predicting chemical shifts with graph neural networks},
  author={Yang, Ziyue and Chakraborty, Maghesree and White, Andrew D},
  journal={Chemical science},
  publisher={Royal Society of Chemistry}