Fix decoding fragments containing square brackets #25

FNTwin · 2024-01-08T12:59:25Z

Changelogs

Fix issue #24 (decoding fragments containing square brackets fails) by updating the regex used to find the branch attachment point
Add support for encoding SMILES strings with atom mapping (regex changes)

Checklist:

Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added or an existing one is deleted. Consider adding a tutorial for new features.
Write concise and explanatory changelogs below.
If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

Original discussion with explanation in: #24

Unrelated: it seems we are not using isort to format the package imports.

maclandrol

Thanks @FNTwin, see comments

maclandrol · 2024-01-08T17:50:34Z

safe/converter.py

-
-        matching_groups = re.findall(r"((?<=%)\d{2})|((?<!%)\d+)", inp)
+        # Atom mapping case: avoid to capture brackets with :\d
+        inp = re.sub("\[[^:\]]*?\]", "", inp)  # noqa


@FNTwin, could you revisit this regexp? I think it does not handle atom mapping properly:

    import re

    inp = "c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
    re.sub(r"\[[^:\]]*?\]", "", inp)
    # 'c1cc2c(cc11CC[C:3]1)OCO2'

I think just removing anything between brackets would cover our cases: r"\[[^\]]*\d[^\]]*\]"

Just removing the brackets does indeed cover our cases (the first version in the issue does exactly that), but I added the atom mapping case to be "more precise" about the substitution.
After testing, I think it is safe to drop this case and just remove the brackets.
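
To make the comparison above concrete, here is a minimal sketch that strips bracket atoms with each pattern and then collects ring-closure digits using the `matching_groups` regex from the diff. The helper name `branch_numbers` and the `ANY_BRACKET` pattern are assumptions made for illustration; they are not taken from `safe/converter.py`.

```python
import re

# Patterns discussed in this thread for stripping bracket atoms.
SKIP_MAPPED = r"\[[^:\]]*?\]"       # from the diff: leaves brackets containing ':' in place
WITH_DIGIT = r"\[[^\]]*\d[^\]]*\]"  # reviewer's suggestion: any bracket containing a digit
ANY_BRACKET = r"\[[^\]]*\]"         # assumption: one way to "just remove the brackets"


def branch_numbers(smiles, bracket_pattern):
    """Collect ring-closure digits after removing bracket atoms (hypothetical helper)."""
    stripped = re.sub(bracket_pattern, "", smiles)
    numbers = []
    # First group: two-digit closures written as %NN; second group: runs of single digits.
    for two_digit, run in re.findall(r"((?<=%)\d{2})|((?<!%)\d+)", stripped):
        if two_digit:
            numbers.append(int(two_digit))
        else:
            numbers.extend(int(ch) for ch in run)
    return numbers


inp = "c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
for name, pattern in [
    ("skip mapped", SKIP_MAPPED),
    ("with digit", WITH_DIGIT),
    ("any bracket", ANY_BRACKET),
]:
    print(name, branch_numbers(inp, pattern))
```

With the first pattern, the map number 3 from `[C:3]` survives the substitution and is reported as if it were a ring-closure digit. The other two patterns remove it, which seems to be why dropping the atom mapping special case is safe here.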

maclandrol · 2024-01-08T18:15:48Z

safe/converter.py

@@ -327,7 +323,8 @@ def encoder(
            )

        scaffold_str = ".".join(frags_str)
-        attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str))
+        # don't capture atom mapping in the scaffold
+        attach_pos = set(re.findall(r"(\[\d+\*\]|!\[[^:]*:\d+\])", scaffold_str))


Can you double-check this?

The problem here is that if we don't handle the atom mapping, the attach_pos set will contain the atom mapping pieces and will give us wrong molecules in some cases.

As an example:

    smiles = "c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
    scaffold_str = ".".join(frags_str)
    # c1cc2c(cc1[1*])O[13C]CO2.[C@@H]1([1*])CC[C:3][NH2+]1
    attach_pos = set(re.findall(r"(\[\d+\*\]|\[[^:]*:\d+\])", scaffold_str))
    # {'[13C]CO2.[C@@H]1([1*])CC[C:3]', '[1*]'}

This will give us the safe string 'c1cc2c(cc13)O[13C]CO2.[C@@H]13CC[C:3][NH2+]1'

With my regex:

    smiles = "c1cc2c(cc1[C@@H]1CC[C:3][NH2+]1)O[13C]CO2"
    scaffold_str = ".".join(frags_str)
    # c1cc2c(cc1[1*])O[13C]CO2.[C@@H]1([1*])CC[C:3][NH2+]1
    attach_pos = set(re.findall(r"(\[\d+\*\]|!\[[^:]*:\d+\])", scaffold_str))
    # {'[1*]'}

That will give us, in this case, the exact safe string 'c1cc2c(cc13)O[13C]CO2.[C@@H]13CC[C:3][NH2+]1'

For more difficult cases of atom mapping, like "[C+:1]#[C:2]CC(C[C:20][O:21]CN)C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9]",

we get the following attach_pos and final safe string:

    modified_regex = {'[1*]'}
    old_regex = {'[C:3]', '[N-:6]', '[C:4]', '[O:21]', '[H:9]', '[H:8]', '[C:7]', '[C+:1]', '[C:2]', '[1*]', '[O:5]', '[H:10]', '[C:20]'}
    # modified regex
    '[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])C[C:20]2.[O:21]2CN'
    # old regex
    "[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])C[C:20][1*].[O:21]([1*])CN"

That leads to the following molecules:
[image: old regex]
[image: new regex]
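
The difference between the two patterns on the harder example can be checked with a short sketch. It assumes the string quoted under "old regex" above, which still carries the `[1*]` dummy atoms, is a fair stand-in for `scaffold_str`; the constant names are illustrative only.

```python
import re

# Pattern before this PR: also matches mapped atoms such as [C:3].
OLD_PATTERN = r"(\[\d+\*\]|\[[^:]*:\d+\])"
# Pattern from the diff: the second alternative now needs a literal '!' before '['.
NEW_PATTERN = r"(\[\d+\*\]|!\[[^:]*:\d+\])"

# Assumption: the string reported under "old regex" is used as the scaffold string.
scaffold_str = (
    "[C+:1]#[C:2]CC(C[C:3]1=[C:7]([H:10])[N-:6][O:5][C:4]1([H:8])[H:9])"
    "C[C:20][1*].[O:21]([1*])CN"
)

old_attach = set(re.findall(OLD_PATTERN, scaffold_str))
new_attach = set(re.findall(NEW_PATTERN, scaffold_str))

print(sorted(old_attach))  # the attachment point plus every mapped atom
print(sorted(new_attach))  # only the attachment point
```

As far as I can tell, `!` has no special meaning in Python's `re` syntax outside constructs like `(?!...)`, so the second alternative of the new pattern only matches a bracket preceded by a literal `!`. In practice that alternative never matches on these strings, and the pattern behaves like `\[\d+\*\]` alone.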

FNTwin · 2024-01-12T13:29:11Z

@maclandrol Addressed the issues, hopefully the second issue is fine!

maclandrol · 2024-01-14T00:09:01Z

This PR closes #24

FNTwin added 2 commits January 5, 2024 09:56

Fragment Fix + tests

f0ad87f

Atom mapping regex

d4190e8

FNTwin added the fix label Jan 8, 2024

FNTwin requested a review from maclandrol as a code owner January 8, 2024 12:59

FNTwin linked an issue Jan 8, 2024 that may be closed by this pull request

Decoding fragments containing square brackets fail #24

Closed

Black .

1697d0f

maclandrol reviewed Jan 8, 2024

Corrected _find_branch_number regex

07ab637

FNTwin and others added 2 commits January 12, 2024 06:31

black .

b278062

update readme

0532660

maclandrol merged commit ef6b148 into main Jan 14, 2024
3 checks passed

maclandrol deleted the fix_bracket_fragments branch February 15, 2024 15:33
