Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Trajectories are not properly stored after mols.delete_all_frames() is called. #218

Open
mb2055 opened this issue Jul 25, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@mb2055
Copy link
Contributor

mb2055 commented Jul 25, 2024

Describe the bug
When periodically deleting the frames contained in a system (as is done in SOMD2 when checkpointing), any trajectories contained in a system after calling mols.delete_all_frames() will only contain a single frame. This issue also seems to cause a segmentation fault, but only around 50% of the time.

To Reproduce
Run the following script:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

the output will be the following:

mols currently has 5 frames
mols currently has 1 frames
mols currently has 1 frames
FATAL: exception not rethrown
Aborted (core dumped)

Expected behavior
The output should be

mols currently has 5 frames
mols currently has 5 frames
mols currently has 5 frames

(please complete the following information):

  • OS: Ubuntu 22.04.4 LTS
  • Version of Python: 3.12.4
  • Version of sire: 2024.3.0.dev
  • I confirm that I have checked this bug still exists in the latest released version of sire: yes

Additional context
Note that everything works as expected if mols.delete_all_frames() is not called:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames
@mb2055 mb2055 added the bug Something isn't working label Jul 25, 2024
@lohedges
Copy link
Contributor

lohedges commented Jul 25, 2024

Just to add that I have also been seeing...

FATAL: exception not rethrown
Aborted (core dumped)

about 50% of the time on interpreter shutdown on my feature_emle branch having merged in devel following the 2024.2.0 release. In this case I am not calling .delete_all_frames() anywhere. I just run a short block of dynamics and get the energy trajectory.

@chryswoods
Copy link
Contributor

That is interesting - the bug is likely happening because the cache is being improperly shared between the copy of mols in d, and the copy of mols that was output by d.commit(). In theory, it should not be shared, and so running mols.delete_all_frames() should not be deleting all frames for the mols in d, and so the right output from your script should be

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

Let me think about this and investigate further.

@lohedges
Copy link
Contributor

If you use the return_as_system=True kwarg to d.commit() then you can see that the underlying system (for the dynamics object) does still contain all of the frames, i.e.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit(return_as_system=True)
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

This still has issues with random crashes on interpreter exit, though.

Looking at the logic in the d.commit() it seems that the issue must be during the update of the original system:

    def commit(self, return_as_system: bool = False):
        if self.is_null():
            return

        self._update_from(
            state=self._get_current_state(include_coords=True, include_velocities=True),
            state_has_cv=(True, True),
            nsteps_completed=self._current_step,
        )

        self._sire_mols.set_energy_trajectory(self._energy_trajectory, map=self._map)

        self._sire_mols.set_ensemble(self.ensemble())

        if return_as_system:
            return self._sire_mols.clone()

        elif self._orig_mols is not None:
            from ..system import System

            if System.is_system(self._orig_mols):
                return self._sire_mols
            else:
                r = self._orig_mols.clone()
                r.update(self._sire_mols.molecules())
                return r
        else:
            return self._sire_mols.clone()

I'll take a look to see what's going wrong.

@lohedges
Copy link
Contributor

Okay, this seems to fix it:

diff --git a/src/sire/mol/_dynamics.py b/src/sire/mol/_dynamics.py
index cdb8a56b..965a67f9 100644
--- a/src/sire/mol/_dynamics.py
+++ b/src/sire/mol/_dynamics.py
@@ -1144,7 +1144,7 @@ class DynamicsData:
             from ..system import System

             if System.is_system(self._orig_mols):
-                return self._sire_mols
+                return self._sire_mols.clone()
             else:
                 r = self._orig_mols.clone()
                 r.update(self._sire_mols.molecules())

@chryswoods: Let me know if this is the correct solution.

I'll look at how to actually delete the frames for the purposes of chunking trajectories in somd2.

@lohedges
Copy link
Contributor

lohedges commented Oct 16, 2024

Hmm, but if you delete the frames from the underlying system object, then it doesn't work as expected again:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 0 frames
mols currently has 0 frames

@lohedges
Copy link
Contributor

Yeah, there's weirdness going on. For example, if you delete a single frame (the first frame in this case, but it doesn't matter):

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 4 frames
mols currently has 3 frames

Deleting two frames:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 3 frames
mols currently has 1 frames

I think deleting all frames must close the trajectory file handle somehow, so you are only ever left with one frame. That said, the coordinates of the molecule do appear to be updated, e.g.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()
    print(mols.coordinates()

Gives:

mols currently has 5 frames
( 13.3499 Å, 14.5876 Å, 12.5228 Å )
mols currently has 1 frames
( 13.4061 Å, 14.5797 Å, 12.6688 Å )
mols currently has 1 frames
( 13.2908 Å, 14.5102 Å, 12.5739 Å )

@lohedges
Copy link
Contributor

This is what happens when I print out the call to System::saveFrame:

System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 5 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames

So the function to save the frames is being called the correct number of times. I think the number of frames must be inconsistent somewhere.

@lohedges
Copy link
Contributor

Ah, I see, there's a SystemFrames object that stores snapshots from the live trajectory. I think this isn't in sync with the trajectory property in the system somehow. In any case, it's easy to adapt the chunking since, I believe, the trajectory is no longer streamed to file. I can just track the current number of frames in the system and only write the new frames to the chunk. I'll just do this and we can figure out the underlying issue here at a later point if it's too fiddly.

@lohedges
Copy link
Contributor

Yes, I've updated somd2 and it works perfectly. Can revisit this at a later date if needed.

@chryswoods
Copy link
Contributor

Yes, this looks like the issue. There's a lot of challenge and complexity in this code because it mixes objects which are copy on write (System, Molecule) with objects that appear to be copy on write, but are actually explicitly shared (Trajectory contents) and hidden objects that are explicitly shared and (almost) globally cached (the SystemFrames that are stored in the cache. To make it more complex, the cache auto-offloads to disk as each cache frame fills up.

The code was tested well for the normal case (generating a trajectory and saving it) but there's a lot of edge case bugs I guess when doing things like deleting individual trajectory frames from on implicitly shared copy, while the other copy needs a trajectory with all the frames in their original order. In these cases the code has to do a lot of juggling and memory movement.

It would be worth at a future point looking at what somd2 really needs in terms of trajectory management and adding some custom code that implements that scheme (e.g. in C++, so that the trajectory doesn't need to be copied or excessively moved around).

@chryswoods
Copy link
Contributor

Also, correct, the trajectory isn't now streamed to a file. The frames are held in a cache that auto-offloads to disk as needed. You only write to a file when you explicitly save the trajectory. The cache keeps about 512MB of trajectory in memory, if I remember correctly, with the rest going to a page cache on disk. The cache is very efficient, memory mapping pages and giving you good random access performance to individual frames (each frame is paged individually). There's also an intermediary cache and a little look ahead so that frames are pulled from disk to memory efficiently, and stay there as long as needed. In general, it should mean that you don't need to do manual trajectory management now during a simulation or have to worry about filling memory. Instead, you can choose what frames to save after the simulation is complete.

@lohedges
Copy link
Contributor

Great. This is what I'm now doing, i.e. writing the last N frames as a chunk at a checkpoint, then assembling the whole trajectory at the end. I guess I could probably keep the file open throughout and append, but this works well for now and avoids a partially written or corrupted file.

@lohedges
Copy link
Contributor

The random crash on interpreter exit is caused by the CacheData::cleanUpOnExit method. If I comment this out, then I see no crashes whatsoever. I'll take a look to see if I can fix the problem. It would be nice to fix this as part of #250.

@lohedges
Copy link
Contributor

Okay, it's entirely caused by the final bit of the block here where it waits for the threads to stop, then kills them if they don't. It looks like this is due to the 50ms timeout. On my Linux box, increasing this to 100ms causes no crashes. @chryswoods: Is this an acceptable solution? (I could make it even longer if needed.)

@chryswoods
Copy link
Contributor

Yes, this is fine. I suspect the threads are being slowed by file removal requests. This is at the interface of threads, system calls and undefined behaviour if killing a thread, so unsurprising it crashes. Waiting longer is the right solution. Maybe we could use some try joins or similar to wait for intervals of 100ms, and if the thread hasn't exited print a message to the user to say what is happening. For example it could be a 10 to 1 countdown with blast off being force termination the thread (and then hoping it doesn't crash)

@lohedges
Copy link
Contributor

That's a good idea, I'll do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants