This provides a basic registration system which can be used either as a python library or via a command-line interface.
Basic operations:
initdb
: resets the database. Note that this destroys any information which is already in the database, so be careful with it.register
: standardize the input molecule, calculates a hash for it, and adds them molecule to the database if it's not already there. returns the molregno (registry ID) of the newly registered molecule.query
: takes a molecule as input and checks whether or not a matching molecule is registered. returns molregnos (registry IDs) of the matching molecule(s), if anyretrieve
: takes one or more IDs and returns the registered structures for them
Please look at the INSTALL.md file.
- rdkit v2023.03.1 or later
- click
- psycopg2 (only if you want to use a postgresql database)
After installing the dependencies (above) and checking out this repo, run this command in this directory:
pip install --editable .
docker build -t lwreg .
# Run Jupyter notebook on the docker container
docker run -i -t -p 8888:8888 rdkit-lwreg /bin/bash -c "\
apt update && apt install libtiff5 -y && \
pip install notebook && \
jupyter notebook \
--notebook-dir=/lw-reg --ip='*' --port=8888 \
--no-browser --allow-root"
% lwreg initdb --confirm=yes
% lwreg register --smiles CCOCC
1
% lwreg register --smiles CCOCCC
2
% lwreg register --smiles CCNCCC
3
% lwreg register --smiles CCOCCC
ERROR:root:Compound already registered
% lwreg query --smiles CCOCCC
2
% lwreg retrieve --id 2
(2, '\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 6 5 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 5 C 5.196152 -0.000000 0.000000 0\nM V30 6 C 6.495191 0.750000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 4 1 4 5\nM V30 5 1 5 6\nM V30 END BOND\nM V30 END CTAB\nM END\n', 'mol')
In [1]: import lwreg
In [3]: lwreg.initdb(confirm=True)
Out[3]: True
In [4]: lwreg.register(smiles='CCO')
Out[4]: 1
In [5]: lwreg.register(smiles='CCOC')
Out[5]: 2
In [6]: from rdkit import Chem
In [7]: m = Chem.MolFromSmiles('CCOCC')
In [8]: lwreg.register(mol=m)
Out[8]: 3
In [9]: lwreg.register(mol=m)
---------------------------------------------------------------------------
IntegrityError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 lwreg.register(mol=m)
File ~/Code/lightweight-registration/lwreg/utils.py:249, in register(config, mol, molfile, molblock, smiles, escape, fail_on_duplicate, no_verbose)
247 cn = _connect(config)
248 curs = cn.cursor()
--> 249 mrn = _register_mol(tpl, escape, cn, curs, config, fail_on_duplicate)
250 if not no_verbose:
251 print(mrn)
File ~/Code/lightweight-registration/lwreg/utils.py:184, in _register_mol(tpl, escape, cn, curs, config, failOnDuplicate)
181 mhash, layers = hash_mol(sMol, escape=escape, config=config)
183 # will fail if the fullhash is already there
--> 184 curs.execute(
185 _replace_placeholders(
186 'insert into hashes values (?,?,?,?,?,?,?,?,?)'), (
187 mrn,
188 mhash,
189 layers[RegistrationHash.HashLayer.FORMULA],
190 layers[RegistrationHash.HashLayer.CANONICAL_SMILES],
191 layers[RegistrationHash.HashLayer.NO_STEREO_SMILES],
192 layers[RegistrationHash.HashLayer.TAUTOMER_HASH],
193 layers[RegistrationHash.HashLayer.NO_STEREO_TAUTOMER_HASH],
194 layers[RegistrationHash.HashLayer.ESCAPE],
195 layers[RegistrationHash.HashLayer.SGROUP_DATA],
196 ))
198 cn.commit()
199 except _violations:
IntegrityError: UNIQUE constraint failed: hashes.fullhash
In [10]: lwreg.query(smiles='CCOC')
Out[10]: [2]
In [11]: lwreg.query(smiles='CCOCC')
Out[11]: [3]
In [12]: lwreg.query(smiles='CCOCO')
Out[12]: []
In [13]: lwreg.retrieve(id=2)
Out[13]:
((2,
'\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 4 3 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 END BOND\nM V30 END CTAB\nM END\n',
'mol'),)
In [14]: lwreg.retrieve(ids=[2,3])
Out[14]:
((2,
'\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 4 3 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 END BOND\nM V30 END CTAB\nM END\n',
'mol'),
(3,
'\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 5 4 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 5 C 5.196152 -0.000000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 4 1 4 5\nM V30 END BOND\nM V30 END CTAB\nM END\n',
'mol'))
When using the Python API you have extensive control over the standardization and validation operations which are performed on the molecule.
Start with a couple of examples showing what the 'fragment' and 'charge' built-in standardizers do:
>>> config['standardization'] = 'fragment'
>>> Chem.MolToSmiles(lwreg.utils.standardize_mol(Chem.MolFromSmiles('CC[O-].[Na+]'),config=config))
'CC[O-]'
>>> config['standardization'] = 'charge'
>>> Chem.MolToSmiles(lwreg.utils.standardize_mol(Chem.MolFromSmiles('CC[O-].[Na+]'),config=config))
'CCO'
Now define a custom filter which rejects (by returning None) molecules which have a net charge and then use that:
>>> def reject_charged_molecules(mol):
... if Chem.GetFormalCharge(mol):
... return None
... return mol
...
>>> config['standardization'] = reject_charged_molecules
>>> Chem.MolToSmiles(lwreg.utils.standardize_mol(Chem.MolFromSmiles('CC[O-].[Na+]'),config=config))
'CC[O-].[Na+]'
Here's an example which fails:
>>> lwreg.utils.standardize_mol(Chem.MolFromSmiles('CC[O-]'),config=config) is None
True
We can chain standardization/filtering operations together by providing a list. The individual operations are run in order. Here's an example where we attempt to neutralise the molecule by finding the charge parent and then apply our reject_charged_molecules filter:
>>> config['standardization'] = ['charge',reject_charged_molecules]
>>> lwreg.utils.standardize_mol(Chem.MolFromSmiles('CC[O-]'),config=config) is None
False
>>> lwreg.utils.standardize_mol(Chem.MolFromSmiles('CC[N+](C)(C)C'),config=config) is None
True
That last one failed because the quarternary nitrogen can't be neutralized.
There are a collection of other standardizers/filters available in the module lwreg.standardization_lib