Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix decoding fragments containing square brackets #25

Merged
merged 6 commits into from
Jan 14, 2024

Conversation

FNTwin
Copy link

@FNTwin FNTwin commented Jan 8, 2024

Changelogs


Checklist:

  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted. Eventually consider making a new tutorial for new features.
  • Write concise and explanatory changelogs below.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

Original discussion with explanation in: #24

Unrelated but it seems that we are not using isort for the formatting of the package import.

@FNTwin FNTwin added the fix Fix a bug label Jan 8, 2024
@FNTwin FNTwin linked an issue Jan 8, 2024 that may be closed by this pull request
Copy link
Member

@maclandrol maclandrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @FNTwin, see comments


matching_groups = re.findall(r"((?<=%)\d{2})|((?<!%)\d+)", inp)
# Atom mapping case: avoid to capture brackets with :\d
inp = re.sub("\[[^:\]]*?\]", "", inp) # noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FNTwin, Could you revisit this regexp. I think it does not capture atom mapping properly:

import re
inp = "c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
re.sub("\[[^:\]]*?\]", "", inp) #c1cc2c(cc11CC[C:3]1)OCCO2

I think just removing anything anything between bracket would cover our cases: r"\[[^\]]*\d[^\]]*\]"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just removing the brackets indeed cover our cases (first version in the issue is exactly like that), but I added the atom mapping case just in case to be "more precise" on the substitution.
After testing I think is safe to avoid this case and just remove the brackets.

@@ -327,7 +323,8 @@ def encoder(
)

scaffold_str = ".".join(frags_str)
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str))
# don't capture atom mapping in the scaffold
attach_pos = set(re.findall(r"(\[\d+\*\]|!\[[^:]*:\d+\])", scaffold_str))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you double check this ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that if we don't handle the atom mapping, the attach_pos set will contain the atom mapping pieces and will give us wrong molecules in some cases.

As an example:

smiles="c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
scaffold_str = ".".join(frags_str) # c1cc2c(cc1[1*])O[13C]CO2.[C@@H]1([1*])CC[C:3][NH2+]1
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str)) #  #{'[13C]CO2.[C@@H]1([1*])CC[C:3]', '[1*]'}

this will give us the safe string 'c1cc2c(cc13)O[13C]CO2.[C@@H]13CC[C:3][NH2+]1'

With my regex:

smiles="c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
scaffold_str = ".".join(frags_str) # c1cc2c(cc1[1*])O[13C]CO2.[C@@H]1([1*])CC[C:3][NH2+]1
attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str)) #  {'[1*]'}

that will give us in this case the exact safe string 'c1cc2c(cc13)O[13C]CO2.[C@@H]13CC[C:3][NH2+]1'

For more difficult cases of atom mapping like [C+:1]#[C:2]CC(C[C:20][O:21]CN)C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9]"
image

We will have the following attach_pos and final safe string:

modified_regex = {'[1*]'}
old_regex = {'[C:3]', '[N-:6]', '[C:4]', '[O:21]', '[H:9]', '[H:8]', '[C:7]', '[C+:1]', '[C:2]', '[1*]', '[O:5]', '[H:10]', '[C:20]'}
# modified regex
'[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])C[C:20]2.[O:21]2CN'
# old regex
"[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])C[C:20][1*].[O:21]([1*])CN"

That will lead at the following mols:
old regex
image
new regex
image

@FNTwin
Copy link
Author

FNTwin commented Jan 12, 2024

@maclandrol Addressed the issues, hopefully the second issue is fine!

@maclandrol
Copy link
Member

This PR closes #24

@maclandrol maclandrol merged commit ef6b148 into main Jan 14, 2024
3 checks passed
@maclandrol maclandrol deleted the fix_bracket_fragments branch February 15, 2024 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Fix a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decoding fragments containing square brackets fail
2 participants