Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: modernise HBA to use AG as objects internally instead of selection strings - code only PR (Issue #3933) #4533

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

lunamorrow
Copy link

@lunamorrow lunamorrow commented Mar 27, 2024

Fixes #3933

Changes made in this Pull Request:

  • Updated guess_acceptors, guess_hydrogens, guess_donators, group_categories and get_dh_pairs to remove reliance of HydrogenBondAnalysis on selection strings being passed internally

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4533.org.readthedocs.build/en/4533/

@pep8speaks
Copy link

pep8speaks commented Mar 27, 2024

Hello @lunamorrow! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 111:80: E501 line too long (111 > 79 characters)
Line 113:80: E501 line too long (113 > 79 characters)
Line 114:80: E501 line too long (113 > 79 characters)
Line 445:52: W291 trailing whitespace
Line 500:58: W291 trailing whitespace
Line 520:74: W291 trailing whitespace
Line 543:73: W291 trailing whitespace
Line 568:9: E265 block comment should start with '# '
Line 569:70: W291 trailing whitespace
Line 634:68: W291 trailing whitespace
Line 635:80: E501 line too long (80 > 79 characters)
Line 644:68: W291 trailing whitespace
Line 645:80: E501 line too long (80 > 79 characters)
Line 709:49: E128 continuation line under-indented for visual indent
Line 709:80: E501 line too long (80 > 79 characters)

Comment last updated at 2024-04-26 01:03:36 UTC

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

Copy link

github-actions bot commented Mar 27, 2024

Linter Bot Results:

Hi @lunamorrow! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/8841528655/job/24278766039


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@lunamorrow
Copy link
Author

Just leaving some comments here for whoever is assigned to check this PR:

  1. I will be submitting a second PR to depreciate the API once I receive any feedback on this one. I want to make sure I have made the appropriate changes here, and any adjustments necessary, before I make that second PR.
  2. A block of code is commented out in _get_dh_pairs as I was unsure whether the check on line 627 had to be repeated (also in guess_donors). In modernize HBA to use AG as objects internally instead of selection strings #3933 I have asked about this double up, but decided to proceed with what looked like the best, most robust option. I am also unsure if there is a reason that self.donors_sel is not set in _prepare the same way that self.hydrogens_sel and self.acceptors_sel are. I have added this functionality, which should make the check in _get_dh_pairs redundant.
  3. I have run the current test cases for test_hydrogenbonds_analysis.py and have 13 failed, 26 passed, 14 warnings and 5 errors. All of the errors are because This Universe does not contain charge information, which I do not think is due to my changes? Of the failures, 6 are from Can't perform '__eq__' between objects: 'AtomGroup' and 'str', so these will require for me to change the test cases and update this PR yes? The remaining failures are due to hydrogens_ag not being a recognised attribute (I am unsure how to fix this?), 2x Assertion Error (Also unsure on best way to address this) or double ups with the charge information errors.

@lunamorrow
Copy link
Author

lunamorrow commented Mar 27, 2024

Update: That big block of code that I commented out was necessary. Turns out that its removal was causing 2 failures. I am adding it back in now, and attempting to tackle those 2 assertion errors.

Question: I have added a commit to my branch, and I suspect I can then push updated code into this PR (i.e., as I fix things that were erroring and adjust/add tests)? Is this correct?

@orbeckst
Copy link
Member

@yuxuanzhuang can you please look after this PR or assign to someone else, please?

Copy link
Contributor

@yuxuanzhuang yuxuanzhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the in-line review that will fix some of your failed tests. Additionally, please fix other failed tests inside the test_hydrogenbonds_analysis.py, mostly related to the switching output of guess_hydrogens

# self.acceptors_ag = self.guess_acceptors() #NOTE: do I need this?
# self.hydrogens_ag = self.guess_hydrogens()
# self.donors_ag = self.guess_donors()

# Set atom selections if they have not been provided
if self.acceptors_sel is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to make sure acceptors_ag, hydrogens_ag, etc. always exist.

else:
    self.acceptors_ag = self.u.select_atoms(self.acceptors_sel)


# Select atom groups
self._acceptors = self.u.select_atoms(self.acceptors_sel,
updating=self.update_selections)
self._acceptors = self.guess_acceptors()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have self.acceptors_ag above so you don't need to guess it again. Simply parse self._acceptors = self.acceptors_ag

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops I missed this one, thank you for picking it up :)

@yuxuanzhuang
Copy link
Contributor

I also don't have experience handling depreciation before, so I might need help from @hmacdope :)

@yuxuanzhuang
Copy link
Contributor

Question: I have added a commit to my branch, and I suspect I can then push updated code into this PR (i.e., as I fix things that were erroring and adjust/add tests)? Is this correct?

Yes, that's correct! After you've added a commit to your branch, you can push the updated code to the same branch that is associated with this PR. Any new commits pushed to the branch will automatically appear in the PR (as you might already saw).

@lunamorrow
Copy link
Author

lunamorrow commented Mar 29, 2024

Thanks @yuxuanzhuang I am working through your comments now and will aim to push the updated branch shortly!

Quick question about this comment:

You need to make sure acceptors_ag, hydrogens_ag, etc. always exist.

else:
    self.acceptors_ag = self.u.select_atoms(self.acceptors_sel)

I need to avoid using select_atoms (using guess_acceptors instead), but could you please elaborate on what you mean? I have defined them in the initialiser as None, so that they are always there, but I cannot figure out how your code piece suggestion integrates.

Also, do I need to worry about the "This Universe does not contain charge information" errors? I do not think they have anything to do with my changes? I am worried they could be "hiding" other possible errors further down that aren't being reached before this error is thrown though...

@yuxuanzhuang
Copy link
Contributor

Regarding depreciation, current provided code examples will break e.g.

u = MDAnalysis.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein") # this will not be selection string.
hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
hbonds.run()

Should we consider accepting AtomGroup for hbonds.hydrogens_sel and other inputs as well?

@yuxuanzhuang
Copy link
Contributor

I need to avoid using select_atoms (using guess_acceptors instead), but could you please elaborate on what you mean? I have defined them in the initialiser as None, so that they are always there, but I cannot figure out how your code piece suggestion integrates.

The logic behind guess_acceptors and the provision of an implicit acceptors_sel IMO is that with the latter, you are certain of the atoms you want to use for calculating hydrogen bonds (so you only need to select_atoms. In guess_acceptors, the Universe needs to have enough information, such as charges and masses, to make guesses; however, it's not always the case.

@lunamorrow
Copy link
Author

I also don't have experience handling depreciation before, so I might need help from @hmacdope :)

Of course, all good! If you can just help me get this PR sorted and ready to go I'll be so grateful! The depreciation can be sorted later :)

I need to avoid using select_atoms (using guess_acceptors instead), but could you please elaborate on what you mean? I have defined them in the initialiser as None, so that they are always there, but I cannot figure out how your code piece suggestion integrates.

The logic behind guess_acceptors and the provision of an implicit acceptors_sel IMO is that with the latter, you are certain of the atoms you want to use for calculating hydrogen bonds (so you only need to select_atoms. In guess_acceptors, the Universe needs to have enough information, such as charges and masses, to make guesses; however, it's not always the case.

Oh ok yes, I understand what you mean now! Yeah I have been a bit unsure on how to tackle this in the "best" way. The guess_acceptors works the same as it used to, but I could add a layer of function that will consider the user input acceptors_sel if it is not None, instead of guessing based on Atom properties? Then change this line in _prepare (see below; add an else block) and pass a user defined self.acceptors_sel (if it exists) to define the Atoms in self.acceptors_ag? I am thinking I can set a selection string as an optional arg for guess_acceptors (and repeat this process for guess_hydrogens and guess_donors too)? Does this could like a good option?

        if self.acceptors_sel is None:
            self.acceptors_ag = self.guess_acceptors()
            self.acceptors_sel = self._group_categories(self.acceptors_ag) #NOTE: do I need this?

@lunamorrow
Copy link
Author

Regarding depreciation, current provided code examples will break e.g.

u = MDAnalysis.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein") # this will not be selection string.
hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
hbonds.run()

Should we consider accepting AtomGroup for hbonds.hydrogens_sel and other inputs as well?

Yes I agree. I am thinking a good work-around would be for users to define hbonds.hydrogens_sel, hbonds.donors_sel or hbonds.acceptors_sel when they initialise the object? Otherwise they could manually manipulate it before calling run() and it should all flow well (once I make the adjustments to guess_acceptors, guess_hydrogens and guess_donors as per above). I would like to keep hydrogens_sel for selection strings and hydrogens_ag for the AtomGroups separated? I think this will provide the best functionality/robustness overall?

@lunamorrow
Copy link
Author

@yuxuanzhuang I have made some changes, as I proposed above, which I will push shortly. I have pretty much eliminated the errors/failures bar these two.

MDAnalysis.exceptions.NoDataError: This Universe does not contain charge information

I suspect the charge one is due to your comment? Is there anything I can do to fix this?

In guess_acceptors, the Universe needs to have enough information, such as charges and masses, to make guesses; however, it's not always the case.

AttributeError: AtomGroup has no attribute replace.

I cannot find a call to replace in the tests throwing that error. Where can I find the root cause of this?

@yuxuanzhuang
Copy link
Contributor

I think there may be differences in the results when using guess_donors compared to _get_dh_pairs, because guess_donors includes a check for charge limitations. Note guess_donors is not used at all in the original implementation.

import MDAnalysis
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA

u = MDAnalysis.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.run()
print(hbonds.results.hbonds.shape)
> (10858, 6) # old results
> (4237, 6) # new results
hbonds._donors.n_atoms
> 375 # old results
> 74 # new results

will end up with a different result from the current implementation.

To facilitate testing, you can copy old hbond_analysis.py as hbond_analysis_old.py and from MDAnalysis.analysis.hydrogenbonds.hbond_analysis_old import HydrogenBondAnalysis as HBA_old.

I think this is currently not captured by the tests.

Apologies that I don't really have enough time during the Easter to make thorough suggestions on what should be done about it right now.

However, I would suggest to make sure you

  1. when doing select_atoms, also add updating=update_selections to make sure it will be an UpdatingAtomGroup
  2. make sure the things you parse around are not a string but an AtomGroup. AttributeError: AtomGroup has no attribute replace. this error is linked to putting an AtomGroup into select_atoms.
  3. you can do selections with group syntax e.g. donors_ag = hydrogens_ag.residues.atoms.select_atoms( f"({select}) and around {self.d_h_cutoff} group hydrogens_ag", hydrogens_ag=hydrogens_ag, updating=self.update_selections ).

@lunamorrow
Copy link
Author

Thanks @yuxuanzhuang, and I completely understand. I appreciate you taking time out of your Easter weekend to give me a hand. Unfortunately I am really struggling with this and I think I might need to take it back to the start and re-implement changes again from scratch.

I have spent a good few hours reading over and tweaking code (including adjusting for your suggested changes 1 and 2), and it's still not working nicely with the tests. I have a number of tests that are getting empty AtomGroups, when the selection provided should return something, which is causing a plethora of problems.

I'm getting a bit worried about finalising this fix before GSoC applications close, but I will keep this PR updated with any more questions as I go.

@hmacdope
Copy link
Member

hmacdope commented Apr 1, 2024

I'm getting a bit worried about finalising this fix before GSoC applications close, but I will keep this PR updated with any more questions as I go.

Don't stress about the timeline, there is no requirement that your PR is merged before the application deadline, only that we have had a chance to work with you and go back and forth a bit. :)

@hmacdope
Copy link
Member

hmacdope commented Apr 1, 2024

I also don't have experience handling depreciation before, so I might need help from @hmacdope :)

We can move this to a seperate PR when the time comes. :)

@lunamorrow
Copy link
Author

Ok great, that is such a relief to hear @hmacdope. I have been panicking a bit! I'll focus on my written application and keep chugging away at this fix while I wait to hear if I get accepted :)

We can move this to a seperate PR when the time comes. :)

Perfect, I'll touch base with you before starting that to get some more clear direction!

@hmacdope
Copy link
Member

hmacdope commented Apr 1, 2024

Yes I agree. I am thinking a good work-around would be for users to define hbonds.hydrogens_sel, hbonds.donors_sel or hbonds.acceptors_sel when they initialise the object?

Much preferred to setting them by hand.

@yuxuanzhuang
Copy link
Contributor

Side note while testing: slicing UpdatingAtomGroup will result in a static AtomGroup. -> need a workaround to return an UpdatingAtomGroup in e.g. guess_donors because donors_ag will be sliced by donors_ag[donors_ag.charges < max_charge]

u.select_atoms('protein', updating=True).__class__
> MDAnalysis.core.groups.UpdatingAtomGroup
u.select_atoms('protein', updating=True)[2:].__class__
> MDAnalysis.core.groups.AtomGroup

@orbeckst
Copy link
Member

orbeckst commented Apr 3, 2024

If you write this as as selection instead of slicing, can you maintain the updating group

donors_ag.select_atoms(f"prop charge < {max_charge}", updating=True)

@lunamorrow
Copy link
Author

lunamorrow commented Apr 4, 2024

Side note while testing: slicing UpdatingAtomGroup will result in a static AtomGroup. -> need a workaround to return an UpdatingAtomGroup in e.g. guess_donors because donors_ag will be sliced by donors_ag[donors_ag.charges < max_charge]

If you write this as as selection instead of slicing, can you maintain the updating group

donors_ag.select_atoms(f"prop charge < {max_charge}", updating=True)

This is great to know, thanks @orbeckst and @yuxuanzhuang! I'll make that change to guess_donors, which should help with some of the testing issues. I just checked guess_acceptors and guess_hydrogens, and the same is happening with the AtomGroup selections in these methods, so I will workshop some solutions for them. I'll get these changes done and a few more, then I'll push a new version and share any testing errors/failures for it that I cannot solve.

Just one question: will the use of select_atoms in this solution be appropriate in the method, since it will still be returning/passing an AtomGroup?

@lunamorrow
Copy link
Author

Hi @yuxuanzhuang @orbeckst @hmacdope. I'm jumping back on this PR as I have just finished off a bunch of university assessment. A couple weeks away from it was really beneficial for me to gain a fresh perspective and make fixes/changes relatively quickly. I have just committed updated versions of hbond_analysis and test_hydrogenbonds_analysis and I am pretty confident that I have implemented the changes as we discussed and the tests are running nicely. The code still needs some "cleaning" as I have left comments and some random prints here and there for while I was making changes, and some of the docstrings need a bit of TLC too. If everything looks good I will clean these up and commit a fresh version to be merged with version 3. Thanks! 😊

Copy link

codecov bot commented Apr 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.23%. Comparing base (804b607) to head (39e77c9).
Report is 79 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4533      +/-   ##
===========================================
- Coverage    93.64%   93.23%   -0.42%     
===========================================
  Files          168       12     -156     
  Lines        21254     1079   -20175     
  Branches      3918        0    -3918     
===========================================
- Hits         19904     1006   -18898     
+ Misses         892       73     -819     
+ Partials       458        0     -458     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@yuxuanzhuang yuxuanzhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need a bit more time to test on the new implementation, but I want to highlight an issue with the code examples in the current documentation. Currently, one can use:

hbonds = HydrogenBondAnalysis(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("resname ARG HIS LYS")
hbonds.acceptors_sel = hbonds.guess_acceptors("resname ARG HIS LYS")

However, this will fail with the current implementation because hbonds.acceptors_sel requires a string, whereas guess_acceptors returns an AtomGroup.

While it's acceptable to have a breaking change in the API, it would be beneficial to ensure that most of the existing code remains functional.

@@ -720,6 +741,8 @@ def _single_frame(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it somewhat inelegant that _get_dh_pairs, a frame-wise operation, performs atom selection to create new donors and hydrogens objects. Ideally, given that updating=self.update_selections applies to both hydrogens and donors, this should not be necessary.

@lunamorrow
Copy link
Author

lunamorrow commented Apr 25, 2024

Thanks @yuxuanzhuang and @orbeckst.

Currently, one can use:

hbonds = HydrogenBondAnalysis(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("resname ARG HIS LYS")
hbonds.acceptors_sel = hbonds.guess_acceptors("resname ARG HIS LYS")

However, this will fail with the current implementation because hbonds.acceptors_sel requires a string, whereas guess_acceptors returns an AtomGroup.

While it's acceptable to have a breaking change in the API, it would be beneficial to ensure that most of the existing code remains functional.

I understand what you mean. Just to make sure we are on the same page, I believe once these changes are implemented that I should be doing a new PR per @hmacdope's guidance that will issue a depreciation warning for some of these features and how the Class is currently used. If I ensure that the API/docstring examples are updated appropriately will that be ok? I cannot see a clean way to implement this change without making all the "guess" methods return AtomGroups instead of selection string.

There are some other examples which will lose functionality. Like being able to guess hydrogens or acceptors from a certain group, like a protein (see below). Since my implementation only uses guess_hydrogens, guess_acceptors and guess_donors if their respective selection string is None. Would it be suitable to make the resulting self._hydrogens, self._acceptors and self._donors mutable through a new method before run is called? Or I could implement a check for if any of self._hydrogens, self._acceptors and self._donors are empty AtomGroups, the respective guess_hydrogens, guess_acceptors or guess_donors could be called with the selection string passed in. Its a bit of a clunky fix though... Any suggestions on this would be really appreciated!

  import MDAnalysis
  from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA

  u = MDAnalysis.Universe(psf, trajectory)

  hbonds = HBA(universe=u)
  hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein")
  hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
  hbonds.run()

I have gone through the rest of your comments and change requests, and made those changes to the code. I will clean up the comments and docstrings now, then push a new version for your review. Thank you :)

@lunamorrow lunamorrow closed this Apr 25, 2024
@lunamorrow lunamorrow deleted the gsoc-1 branch April 25, 2024 05:21
@lunamorrow lunamorrow reopened this Apr 25, 2024
@yuxuanzhuang
Copy link
Contributor

Just a heads up, it's often necessary to rebase your code and resolve any conflicts with the latest develop branch. Here’s a helpful guide on how to do that: Rebasing Your Code.

Last time, I performed the rebase directly on GitHub, which might have caused some issues for you (apologies!) You could git pull the modification, if there's any on GitHub, before you push.

@lunamorrow
Copy link
Author

Thanks @yuxuanzhuang. I realised I had forgotten to rebase before pushing the previous code, when I saw you had done the rebase for me. I made sure to do so for the most recent one. That's ok, I was a bit crazy on my end for the most recent push but I think I have it all sorted now! I'll just need to push a new branch to this PR somehow for further changes.

@lunamorrow
Copy link
Author

Ok finally rebased and fixed up now. Everything should be sorted :)

Copy link
Contributor

@yuxuanzhuang yuxuanzhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the inline comment and the code below that gets different results after the fix.

import MDAnalysis
from MDAnalysisTests.datafiles import PSF, DCD
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis_old import HydrogenBondAnalysis as HBA_old

u = MDAnalysis.Universe(PSF, DCD)

hbonds_old = HBA_old(universe=u,
             hydrogens_sel='around 10 resid 5 and name H*',
             acceptors_sel='resid 5 and name O',
             update_selections=True)
hbonds_old.run()
print(f'old results: {hbonds_old.results.hbonds.shape}')

hbonds = HBA(universe=u,
             hydrogens_sel='around 10 resid 5 and name H*',
             acceptors_sel='resid 5 and name O',
             update_selections=True)
hbonds.run()
print(f'new results: {hbonds.results.hbonds.shape}')
> old results: (21, 6)
> new results: (6, 6)

Given all of your tests pass, it might be worth adding it to your test as well and making sure it also passes.

I am actually not a daily user of HBA (for a reason!) and I don't really know how people normally use it, so feel free to pitch in! @MDAnalysis/coredevs

It's worth noting that it runs so so so much faster!!!

hbonds = HBA(universe=u)
%timeit -r 2 -n 5 hbonds.run()
> old 2.85 s ± 8.15 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
> new 168 ms ± 6.62 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)

self._donors, self._hydrogens = self._get_dh_pairs()

def _single_frame(self):

box = self._ts.dimensions

# Update donor-hydrogen pairs if necessary
if self.update_selections:
self._donors, self._hydrogens = self._get_dh_pairs()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still needed for _single_frame because self._donors will be updated every frame with the change of self._hydrogens. You need to separate the part that creates self._hydrogens from it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lunamorrow The reason you get different results is that in the old code, self._donors is updated every frame based on self._hydrogens. However, if you don't execute _get_dh_pairs every frame, self._donors will become a non-updating AtomGroup based on the first frame.

@lunamorrow
Copy link
Author

Thanks for your quick feedback @yuxuanzhuang! :)

See the inline comment and the code below that gets different results after the fix.

That is really weird as far as the results differing between the old and updated HBA. I will have a dig around on my end this week and fix it up.

Given all of your tests pass, it might be worth adding it to your test as well and making sure it also passes.

Will do!

It's worth noting that it runs so so so much faster!!!

hbonds = HBA(universe=u)
%timeit -r 2 -n 5 hbonds.run()
> old 2.85 s ± 8.15 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
> new 168 ms ± 6.62 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)

Wow that is huge, and will have a really big impact for users performing HBA on large systems! That is so amazing to see the impact that this one change has had.

I'll wait to hear back from the core devs as far as the loss of some functionality.

@orbeckst
Copy link
Member

You're asking for other @MDAnalysis/coredevs to jump in and comment "as far as the loss of some functionality.". Can you please summarize the changes so we don't have to read everything in the PR?

  • What was changed?
  • What is behaving differently now (API, results, ...)?

Thanks.

@yuxuanzhuang
Copy link
Contributor

yuxuanzhuang commented Apr 29, 2024

I'm seeking insights on typical uses of HBA prior to and following the changes introduced by this PR. Below are some user scenarios categorized into two main aspects: 1) selections made for analysis and 2) information available in the Universe (omitted for now)

Selection

  1. Conduct HBA on the entire Universe (and perform post-analysis):

    # before and after
    hbonds = HBA(universe=u)
    
  2. Select static parts of the system:

    # before
    hbonds = HBA(universe=u)
    hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein")
    hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
    
    # after
    # the previous code is no longer functional
    ???
    

I think it's an important user case and needs workaround, e.g. allow hydrogens_sel and acceptors_sel accept both string and AtomGroup

  1. Dynamically select parts of the system:
    # before and after
    hbonds = HBA(universe=u,
                 hydrogens_sel='around 10 resid 5 and name H*',
                 acceptors_sel='resid 5 and name O',
                 update_selections=True)
    

Currently, this results in divergent results .

  1. Select between two parts of the system:
    # before and after
    hbonds = HBA(universe=u, between=['protein', 'water'])
    

sidenote: they are non-updating atom groups

Information contained in the Universe (omitted for now)

  1. Charges, masses, and bonds
  2. Charges and masses
  3. Bonds
  4. No details on any of these

Let me know if there are additional scenarios or details that might be relevant.

@IAlibay
Copy link
Member

IAlibay commented Apr 30, 2024

I think it's an important user case and needs workaround, e.g. allow hydrogens_sel and acceptors_sel accept both string and AtomGroup

Just a very brief 2 cent here - the current functionality cannot change or stop working until the 3.0 release. The approach we have usually taken is to temporarily support both input types (and make sure they give the same answer).

@lunamorrow
Copy link
Author

Just a very brief 2 cent here - the current functionality cannot change or stop working until the 3.0 release. The approach we have usually taken is to temporarily support both input types (and make sure they give the same answer).

I agree. The plan I made with @hmacdope is to roll these changes out with 3.0 and create a second PR after this one is done to issue a depreciation warning. Would the support of both types be to integrate now, and then remove this support with 3.0? I can make this happen if this is the route we want to take @orbeckst @yuxuanzhuang @IAlibay.

  1. Select static parts of the system:

    # before
    hbonds = HBA(universe=u)
    hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein")
    hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
    
    # after
    # the previous code is no longer functional
    ???
    

As @yuxuanzhuang said, the main issue is that guess_hydrogens, guess_acceptors and guess_donors are only called if their respective selection string is None, which means a selection like "protein" won't make it to them. This prevents Atoms for HBA being selected based on what structure they are a part of which is a pretty critical functionality. I will start on an elegant work around for this.

@MDAnalysis/coredevs would you prefer this work around to include enabling guess_hydrogens, guess_acceptors and guess_donors to accept strings again? Or am I ok to create another method to enable less specific selection strings like "protein" to be passed to these guesser methods? I am worried that enabling guess_hydrogens, guess_acceptors and guess_donors to accept strings again will make any speed improvements I have made thus far null.

@yuxuanzhuang
Copy link
Contributor

yuxuanzhuang commented May 1, 2024

prefer this work around to include enabling guess_hydrogens, guess_acceptors and guess_donors to accept strings again?

I'm considering to allow hydrogens_sel and acceptors_sel can accept both strings and AtomGroups instead. This would maintain the functionality of commands like hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein"). I don't even think you need to depreciate it after 3.0.0

@lunamorrow
Copy link
Author

I'm considering to allow hydrogens_sel and acceptors_sel can accept both strings and AtomGroups instead. This would maintain the functionality of commands like hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein"). I don't even think you need to depreciate it after 3.0.0

Ok easy, I can get started on that then. I have a crazy few days coming up, but I'll aim to push a revision with this functionality added over the weekend :)

@richardjgowers
Copy link
Member

just chiming in that accepting either AG or string as suggested here: #4533 (comment) is a good idea; strings are a nice way to start, but if you've got a really odd system or want to do quite specific analysis then being able to just hand the analysis an AG to work on directly is heaps easier than finding the magic string that recreates this AG

@orbeckst
Copy link
Member

Hello everyone here. Are you still working on this PR @lunamorrow ?

@orbeckst orbeckst added the close? Evaluate if issue/PR is stale and can be closed. label Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
close? Evaluate if issue/PR is stale and can be closed. Component-Analysis
Projects
None yet
Development

Successfully merging this pull request may close these issues.

modernize HBA to use AG as objects internally instead of selection strings
7 participants