Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of Water Selection token #4839

Open
talagayev opened this issue Dec 16, 2024 · 8 comments · May be fixed by #4854
Open

Implementation of Water Selection token #4839

talagayev opened this issue Dec 16, 2024 · 8 comments · May be fixed by #4854
Labels
Component-Selections decision needed requires input from developers before moving further enhancement

Comments

@talagayev
Copy link
Member

talagayev commented Dec 16, 2024

Is your feature request related to a problem?

No, not directly. It has similarities to #4563 from the sense, that there was a discussion about selection of
GLYCAM and sugars, which makes sense. An additional feature, that could improve life would be a water
selection, thus a user could use a water token in select_atoms() to obtain water as an AtomGroup.

Describe the solution you'd like

Create a class WaterSelection in core/selection.py, which would have the token water and
add all possible water resnames to allow a selection through select_atoms().

which would create something like this:

u = mda.Universe("test.pdb")
ag = u.select_atoms("protein and water")

Describe alternatives you've considered

Leave it as currently is and select the water via it's resname

u = mda.Universe("test.pdb")
ag = u.select_atoms("protein and resname WAT")

Additional context

As for additional context, currently I am helping in implementing waterbridge interaction recognition in
ProLIF and there I noticed, that for the water selection the resname of the water is written out. Yes, in theory the user would
know the water resname and could write it out always, the main reason for this PR would be a quality of life improvement +
in the cases where you have multiple Topologies with different water resnames it could be easier to select them through an
universal water token.

I would be able to do the implementation pretty quickly, but would need an opinion how reasonable it is and how much
everyone agrees that this would be a reasonable quality of life improvement for users.

@orbeckst
Copy link
Member

Which water resnames would you use to detect water?

@orbeckst
Copy link
Member

Also, note that

ag = u.select_atoms("protein and water")

will likely give you an empty atomgroup because I don't know of any molecules that are water and protein — did you mean or?

@talagayev
Copy link
Member Author

talagayev commented Dec 16, 2024

Which water resnames would you use to detect water?

From the ones that I saw mostly it would be
'H2O', 'HOH', 'OH2', 'TIP', 'TIP2', 'TIP3', 'TIP4', 'T3P', 'T4P', 'T5P', 'SOL', 'WAT', 'HHO', 'OHH'

at least those are the ones that come up quite often, although I didn't encounter 'HHO' or 'OHH' that often myself, some
tools have them like in this tool:

https://github.com/KurtzmanLab/SSTMap/blob/master/sstmap/water_analysis.py

Other resname suggestions are also welcome, if I didn't cover any important ones :)

@talagayev
Copy link
Member Author

Also, note that

ag = u.select_atoms("protein and water")

will likely give you an empty atomgroup because I don't know of any molecules that are water and protein — did you mean or?

Yes my bad, would be or, didn't notice that mistake, my bad :(

@IAlibay
Copy link
Member

IAlibay commented Dec 16, 2024

So I will start by saying that I recognise this to be a useful thing for folks that aren't power users of MDAnalysis.

Generally I don't like how these keywords obfuscate what actually happens. I.e. I know plenty of time I've sat down with a student and gone "are you sure your residues actually got selected with the protein keyword", just for them to realise that maybe they had a special residue that wasn't in the standard list.

My personal take is that, like guessing by default, I would love to see keywords go away because they enable bad mistakes. That being said, I'll be honest and say that my opinion really isn't that strong. If it helps folks do what they need then I could live with it.

@talagayev
Copy link
Member Author

So I will start by saying that I recognise this to be a useful thing for folks that aren't power users of MDAnalysis.

Generally I don't like how these keywords obfuscate what actually happens. I.e. I know plenty of time I've sat down with a student and gone "are you sure your residues actually got selected with the protein keyword", just for them to realise that maybe they had a special residue that wasn't in the standard list.

My personal take is that, like guessing by default, I would love to see keywords go away because they enable bad mistakes. That being said, I'll be honest and say that my opinion really isn't that strong. If it helps folks do what they need then I could live with it.

True, I agree that it can lead to errors, since the user would rely more on the token, similar as you mention the cases with
the protein token.

Basically it would be correct vs practical and both have good points, where the correct way would be not having them, so that such errors do not happen and practical is with the protein token currently, which makes it more easier for a non power user, but would lead to potential mistakes at some point.

Although I could imagine moving completly away from keywords now would be tricky, since MDAnalysis users got used to the protein token, so it could be tricky to remove it from them.

I will then let you @IAlibay and @orbeckst and the other coredevs decide what the best option is :)

@orbeckst
Copy link
Member

orbeckst commented Dec 18, 2024

I am in favor of convenience selectors — I'd absolutely hate having to select proteins by index or chain; a selection keyword for water (and perhaps lipid #2082 (PR #4302) and carbohydrate #4563) seem reasonable. I feel that this is one of the few occasions where convenience is quite important.

Generally I don't like how these keywords obfuscate what actually happens.

That's fair, I'd just like to have my cake and eat it, too. 🍰

Perhaps we should be thinking along guesser contexts to set what these selections match. The downside is that MDA will behave differently depending on the context in which a universe was loaded but I think that will already be effectively happening with guessers.

@talagayev
Copy link
Member Author

talagayev commented Dec 18, 2024

I am in favor of convenience selectors — I'd absolutely hate having to select proteins by index or chain; a selection keyword for water (and perhaps lipid #2082 (PR #4302) and carbohydrate #4563) seem reasonable. I feel that this is one of the few occasions where convenience is quite important.

Generally I don't like how these keywords obfuscate what actually happens.

That's fair, I'd just like to have my cake and eat it, too. 🍰

Perhaps we should be thinking along guesser contexts to set what these selections match. The downside is that MDA will behave differently depending on the context in which a universe was loaded but I think that will already be effectively happening with guessers.

True, than I will start working on the PR.

@orbeckst As for #4563 there I mentioned the possible problem in PR #4790 that due to it having a lot of different resnames
it can lead to the case where a ligand would be recognized through the token, this is the reason why it is in the
draft phase, since I am not sure if as @IAlibay mentioned, such a token that covers that many resnames could
potentially lead to an increase of errors, so there I would require also an opinion if it is reasonable to have the resnames,
since currently they cover all of the possible resnames.

Now that I see since I already started doing keywords with the water and the carbohydrates I could also do #2082,
since I am not sure if PR #4302 is still still active 🤔

@orbeckst orbeckst added the decision needed requires input from developers before moving further label Dec 20, 2024
@talagayev talagayev linked a pull request Dec 21, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component-Selections decision needed requires input from developers before moving further enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants