Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to remove sites and connections from a topology #761

Merged
merged 20 commits into from
Sep 13, 2023

Conversation

chrisjonesBSU
Copy link
Contributor

@chrisjonesBSU chrisjonesBSU commented Sep 1, 2023

This is still a work in progress, but this PR does a couple things:

1. Within the Site class, we can access all the connections a site is directly involved in.

  • The Site.connections property is updated anytime Topology.add_connection is used as well as any time Topology.remove_connection is used.

EDIT: After some conversation; this PR instead adds a method to Toplogy that returns a dict of all the connections that belong to a site.

  • The idea here is to have more direct access to connections from a particle-up approach rather than topology-down approach. This way we can avoid larger nested for-loops.

2. Adds methods for removing a site and a connection from a topology.

This addresses #736 and #161 and should replace PR #370

This is still a WIP, I need to add unit tests and doc strings, and definitely need some feedback in terms of syntax, functionality and to make sure I'm following good pydantic practices. I'm interested what everyone else thinks.

Here is a gist showing an example.

@chrisjonesBSU chrisjonesBSU linked an issue Sep 1, 2023 that may be closed by this pull request
@chrisjonesBSU chrisjonesBSU added the WIP work in progress - do not merge label Sep 1, 2023
gmso/core/topology.py Fixed Show fixed Hide fixed
@CalCraven
Copy link
Contributor

Okay, so my main worry in this is it may render a lot of GMSO operations extremely slow. I'm going to do some time testing in a bit here, but just as an idea of how this may affect some of the GMSO operations. We're essentially storing a lot of redundant information of the topology, so we'll have to keep our eyes peeled if this is a significant impact to other operations.

For instance, in a typical polyethylene structure:
Each internal carbon site is part of: 4 bonds, 12 angles, 21 dihedrals, and 10 impropers. That's a lot of bookkeeping for something that is only necessary in specific instances.

My suggestion is to add a method to the topology that you can call on.

def get_connections_by_site(self, site, connsList=None):
    if connsList is None:
        connsList = ["bonds", "angles", "dihedrals", "impropers"]
    for connStr in connsList:
        countsDict[connectStr] = []
            for conn in getattr(self, connStr):
                if site in conn.connection_members:
                    countsDict[connStr].append(conn]

From here, if you wanted to delete a site, then you just get its connections and do self.remove_connection(connection).

I'm happy to hear more feedback, but I think as a whole it is much preferred to modify the topology on the mBuild side, where we don't have a ton of other overhead attached to the topology. Then once things are in GMSO, modifying the structure should be a rare occurrence. I also want to avoid slowing down building up the topology as much as possible, so we can keep the writers as streamlined and efficient as possible.

@CalCraven
Copy link
Contributor

Time testing:
With this PR

THIS PR TESTS for Simple Ethanol
It took 0.3415291670244187 seconds to convert to GMSO
It took 0.006205082987435162 seconds to identify_connections
It took 0.9126980409491807 seconds to load forcefield
It took 0.12859962496440858 seconds to apply forcefield
It took 0.23195625003427267 seconds to write top, gro, lammps

MAIN GMSO TESTS for Simple Ethanol
It took 0.3757550000445917 seconds to convert to GMSO
It took 0.006402791012078524 seconds to identify_connections
It took 0.9505396250169724 seconds to load forcefield
It took 0.13197216694243252 seconds to apply forcefield
It took 0.2722244169563055 seconds to write top, gro, lammps

################################################

THIS PR TESTS for 10000 Ethanols
It took 7.212720583076589 seconds to convert to GMSO
It took 58.01840241695754 seconds to identify_connections
It took 0.7380243750521913 seconds to load forcefield
It took 43.929414541926235 seconds to apply forcefield
It took 183.31577900005504 seconds to write top, gro, lammps

MAIN GMSO TESTS for 10000 Ethanols
It took 7.272310999920592 seconds to convert to GMSO
It took 56.75998824997805 seconds to identify_connections
It took 1.2427897499874234 seconds to load forcefield
It took 43.663405083003454 seconds to apply forcefield
It took 180.0288589169504 seconds to write top, gro, lammps

################################################

THIS PR TESTS for 300-mer Ethylene Glycol
It took 4.278543125023134 seconds to convert to GMSO
It took 1.785474207950756 seconds to identify_connections
It took 0.732234416063875 seconds to load forcefield
It took 19.215889790910296 seconds to apply forcefield
It took 10.554971750010736 seconds to write top, gro, lammps

MAIN GMSO TESTS for 300-mer Ethylene Glycol
It took 0.31096937495749444 seconds to convert to GMSO
It took 1.7851174169918522 seconds to identify_connections
It took 0.746961124939844 seconds to load forcefield
It took 19.24234045902267 seconds to apply forcefield
It took 10.819363666931167 seconds to write top, gro, lammps

So the big issue timewise seems to be a 10X reduction in speed for the polymer case when going from mBuild to GMSO. Although that may get worse because site.connections is not being updated after identify_connections is run.

@chrisjonesBSU
Copy link
Contributor Author

These are good points @CalCraven! Thanks for doing some performance analysis as well.

While I like the idea of having quick access to connections from a site's perspective, I'm not totally set on doing it the way the PR currently does (automatically doing the book-keeping behind the scenes). I also like your idea of using something like the get_connections_by_site() method above.

Although that may get worse because site.connections is not being updated after identify_connections is run.

It should be getting updated after running identify_connections()

For example in the below.

>>> import mbuild as mb
>>> methane = mb.load("C", smiles=True)
>>> methane_gmso = from_mbuild(methane)

>>> len(methane_gmso.sites[1].connections)
1
>>> for conn in methane_gmso.sites[1].connections:
...     print(type(conn))
<class 'gmso.core.bond.Bond'>

>>> methane_gmso.identify_connections()
>>> len(methane_gmso.sites[1].connections)
7

>>> for conn in methane_gmso.sites[1].connections:
...     print(type(conn))
<class 'gmso.core.bond.Bond'>
<class 'gmso.core.angle.Angle'>
<class 'gmso.core.improper.Improper'>
<class 'gmso.core.angle.Angle'>
<class 'gmso.core.angle.Angle'>
<class 'gmso.core.improper.Improper'>
<class 'gmso.core.improper.Improper'>

@chrisjonesBSU
Copy link
Contributor Author

I'm happy to hear more feedback, but I think as a whole it is much preferred to modify the topology on the mBuild side, where we don't have a ton of other overhead attached to the topology. Then once things are in GMSO, modifying the structure should be a rare occurrence.

I agree about avoiding modifying the topology on the GMSO side, but I think one common use case of doing this is removing hydrogen atoms. If we do that on the mBuild side before converting to a GMSO topology, then I believe SMARTS matching no longer works.

@chrisjonesBSU
Copy link
Contributor Author

Ok, I just removed the automatic book-keeping on a site's connections and instead added the get_connection_by_site method to Toplogy that @CalCraven recommended. I agree that this is better since the use cases of needing to access a site's specific connections won't be that common.

Here is the updated gist showing what the workflow might look like if someone wants to remove sites and their connections.

Let me know what you think. Once we all agree to the general direction of this, I can finish unit tests, doc strings, etc..

@codecov
Copy link

codecov bot commented Sep 5, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03% 🎉

Comparison is base (a3fa9f4) 91.85% compared to head (9646f3e) 91.89%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #761      +/-   ##
==========================================
+ Coverage   91.85%   91.89%   +0.03%     
==========================================
  Files          66       66              
  Lines        6702     6733      +31     
==========================================
+ Hits         6156     6187      +31     
  Misses        546      546              
Files Changed Coverage Δ
gmso/core/topology.py 95.97% <100.00%> (+0.24%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@CalCraven
Copy link
Contributor

Generally speaking this looks great.

I agree about avoiding modifying the topology on the GMSO side, but I think one common use case of doing this is removing hydrogen atoms. If we do that on the mBuild side before converting to a GMSO topology, then I believe SMARTS matching no longer works.

I see this as a possible workflow, but maybe not a preferred one. A better way would be to use a forcefield with no hydrogens. SMARTS should work fine if there are no hydrogens. For instance, the SMARTS string for methane may go from
C;X4(H)(H)(H)
to
[C;X0]
But I think it should still work, as long as the topology matches. And obviously you have a force field with no hydrogens.

Copy link
Contributor

@CalCraven CalCraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe you could put some tests together now. We definitely need a check that site is in self._sites, and raise an Error if it's already not there. I also think we could make some of these methods private, so users aren't accessing them all the time. I'd like to hear some feedback on that though. I'm generally of the opinion that Topology is too bulky in terms of methods and attributes, and probably could be cleaned up so only the most important user methods are exposed.

gmso/core/topology.py Show resolved Hide resolved
@chrisjonesBSU
Copy link
Contributor Author

chrisjonesBSU commented Sep 5, 2023

Generally speaking this looks great.

I agree about avoiding modifying the topology on the GMSO side, but I think one common use case of doing this is removing hydrogen atoms. If we do that on the mBuild side before converting to a GMSO topology, then I believe SMARTS matching no longer works.

I see this as a possible workflow, but maybe not a preferred one. A better way would be to use a forcefield with no hydrogens. SMARTS should work fine if there are no hydrogens. For instance, the SMARTS string for methane may go from C;X4(H)(H)(H) to [C;X0] But I think it should still work, as long as the topology matches. And obviously you have a force field with no hydrogens.

That's a good point about changing the SMARTS strings to not include hydrogens, but this may be unwieldy for someone using a large built in forcefield like oplsaa, or gaff via GAFF-Foyer. I just don't think the barrier to running united atom simulations with an "off-the-shelf" forcefield should be that high. Whether or not that is a good idea or not is up to the user IMO.

I think the complete workflow as intended works out nicely, at least with Hoomd. In the example of the gist, if I apply a forcefield, remove hydrogens, then write out the hoomd forcefield objects, I get exactly what I would expect. i.e. Only the carbon atom types are in the LJ parameters, and the angle and dihedral potentials that would have been there in the all-atom system are not written out after removing hydrogens.

@CalCraven
Copy link
Contributor

@chrisjonesBSU That's fair, I'm not very familiar with coarse-graining. I don't think I would trust an off the shelf force field developed WITH hydrogens, and expect it to still work when removing the hydrogens and setting all of the charges to 0. But maybe I'm paranoid and this has been well tested somewhere?

@chrisjonesBSU
Copy link
Contributor Author

Well, I'm not trying to make the argument that using AA force fields with UA systems is generally a good practice, and tested. But, I think having a workflow of starting with an AA forcefield and modifying parameters on the fly to fit whatever UA system and/or properties you're studying is a lot more approachable than first having to create a new xml file and SMARTS strings. I've come across a few papers where they essentially do this, and it is a great example of how something like gmso can be used as opposed to editing parameters in text files.

@chrisjonesBSU
Copy link
Contributor Author

@CalCraven After re-reading your comment in the issue, I replaced the get_connections_by_site method with iter_connections_by_site. It seems a bit cleaner than building up and returning a dictionary. Let me know what you think.

I'll start working on some unit tests and doc strings.

@chrisjonesBSU chrisjonesBSU removed the WIP work in progress - do not merge label Sep 8, 2023
@chrisjonesBSU
Copy link
Contributor Author

chrisjonesBSU commented Sep 8, 2023

You can disregard this (see comment below)

This should be good to go. There is still possibly one thing to consider. How do we want to handle it when someone uses remove_site or remove_connection and passes in a site or connection that isn't in the topology? We could not raise any errors, since the end result is the same, or should we check and raise an error? If we don't want to raise errors I think I'll have to use .discard() instead of .remove() in these methods.

Copy link
Contributor

@marjanalbooyeh marjanalbooyeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you Chris!

I added a minor comment and some questions.

gmso/core/topology.py Show resolved Hide resolved
gmso/core/topology.py Show resolved Hide resolved
@chrisjonesBSU
Copy link
Contributor Author

chrisjonesBSU commented Sep 8, 2023

I think maybe you could put some tests together now. We definitely need a check that site is in self._sites, and raise an Error if it's already not there. I also think we could make some of these methods private, so users aren't accessing them all the time. I'd like to hear some feedback on that though. I'm generally of the opinion that Topology is too bulky in terms of methods and attributes, and probably could be cleaned up so only the most important user methods are exposed.

I must have overlooked this (regarding my last comment about raising errors). I made it so that an error is raise by these remove methods as well as the iter_connections_by_site method.

I agree about not making the Topology class too bulky, but IMO it seems inconsistent to make the add_site and add_connection methods not private while having their remove counterparts private. Maybe a separate issue and PR could be made that revists which methods in Topology are private or not? The majority of the time, the add_ methods are used by the readers, so I could see a case for these being made private as well.

gmso/tests/test_topology.py Fixed Show fixed Hide fixed
@daico007
Copy link
Member

I have a question regarding expected behavior of the remove_connection method. If we remove a bond from the topology, should we also remove the downstream connections (i.e., angle, dihedral, and improper)?

Copy link
Member

@daico007 daico007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Regarding my previous question, I think since the Bond is not a pre-requisite for the creation of Angle/Dihedral/Improper, I think the current behavior should be fine (but it's open for discussion if anyone have any opinion about it).

@daico007
Copy link
Member

I am rerunning the failed bleeding test, need to see if that's something we need to address before merging this PR.

@chrisjonesBSU
Copy link
Contributor Author

I have a question regarding expected behavior of the remove_connection method. If we remove a bond from the topology, should we also remove the downstream connections (i.e., angle, dihedral, and improper)?

LGTM! Regarding my previous question, I think since the Bond is not a pre-requisite for the creation of Angle/Dihedral/Improper, I think the current behavior should be fine (but it's open for discussion if anyone have any opinion about it).

I think it's a good question. Really, the goal of these new methods isn't to encourage people to start modifying the topology in GMSO. My thought behind the remove_connection method was that it wraps the process up nicely to make sure the correct _connection attribute is being used. These if and elif statements could also be moved inside of the remove_site method as well.

Ultimately, I agree with @CalCraven that maybe these methods should be private along with the add methods (and possibly more in Topology). But I still think that should be a separate issue/PR

@daico007 daico007 merged commit 0c4c34c into mosdef-hub:main Sep 13, 2023
14 checks passed
@chrisjonesBSU chrisjonesBSU deleted the remove-site branch December 20, 2023 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is there a way to remove sites from a topology?
4 participants