-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix decoding fragments containing square brackets #25
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @FNTwin, see comments
safe/converter.py
Outdated
|
||
matching_groups = re.findall(r"((?<=%)\d{2})|((?<!%)\d+)", inp) | ||
# Atom mapping case: avoid to capture brackets with :\d | ||
inp = re.sub("\[[^:\]]*?\]", "", inp) # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FNTwin, Could you revisit this regexp. I think it does not capture atom mapping properly:
import re
inp = "c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
re.sub("\[[^:\]]*?\]", "", inp) #c1cc2c(cc11CC[C:3]1)OCCO2
I think just removing anything anything between bracket would cover our cases: r"\[[^\]]*\d[^\]]*\]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removing the brackets indeed cover our cases (first version in the issue is exactly like that), but I added the atom mapping case just in case to be "more precise" on the substitution.
After testing I think is safe to avoid this case and just remove the brackets.
@@ -327,7 +323,8 @@ def encoder( | |||
) | |||
|
|||
scaffold_str = ".".join(frags_str) | |||
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str)) | |||
# don't capture atom mapping in the scaffold | |||
attach_pos = set(re.findall(r"(\[\d+\*\]|!\[[^:]*:\d+\])", scaffold_str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you double check this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that if we don't handle the atom mapping, the attach_pos
set will contain the atom mapping pieces and will give us wrong molecules in some cases.
As an example:
smiles="c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
scaffold_str = ".".join(frags_str) # c1cc2c(cc1[1*])O[13C]CO2.[C@@H]1([1*])CC[C:3][NH2+]1
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str)) # #{'[13C]CO2.[C@@H]1([1*])CC[C:3]', '[1*]'}
this will give us the safe string 'c1cc2c(cc13)O[13C]CO2.[C@@H]13CC[C:3][NH2+]1'
With my regex:
smiles="c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
scaffold_str = ".".join(frags_str) # c1cc2c(cc1[1*])O[13C]CO2.[C@@H]1([1*])CC[C:3][NH2+]1
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str)) # {'[1*]'}
that will give us in this case the exact safe string 'c1cc2c(cc13)O[13C]CO2.[C@@H]13CC[C:3][NH2+]1'
For more difficult cases of atom mapping like [C+:1]#[C:2]CC(C[C:20][O:21]CN)C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9]"
We will have the following attach_pos and final safe string:
modified_regex = {'[1*]'}
old_regex = {'[C:3]', '[N-:6]', '[C:4]', '[O:21]', '[H:9]', '[H:8]', '[C:7]', '[C+:1]', '[C:2]', '[1*]', '[O:5]', '[H:10]', '[C:20]'}
# modified regex
'[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])C[C:20]2.[O:21]2CN'
# old regex
"[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])C[C:20][1*].[O:21]([1*])CN"
@maclandrol Addressed the issues, hopefully the second issue is fine! |
This PR closes #24 |
Changelogs
Checklist:
feature
,fix
ortest
(or ask a maintainer to do it for you).Original discussion with explanation in: #24
Unrelated but it seems that we are not using isort for the formatting of the package import.