-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Trajectories are not properly stored after mols.delete_all_frames()
is called.
#218
Comments
Just to add that I have also been seeing...
about 50% of the time on interpreter shutdown on my |
That is interesting - the bug is likely happening because the cache is being improperly shared between the copy of
Let me think about this and investigate further. |
If you use the import sire as sr
mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()
d = mols.dynamics(timestep="4fs", temperature="25oC")
for i in range(3):
d.run("1ps", frame_frequency="0.2ps")
mols = d.commit(return_as_system=True)
print(f"mols currently has {mols.num_frames()} frames")
mols.delete_all_frames() Gives:
This still has issues with random crashes on interpreter exit, though. Looking at the logic in the def commit(self, return_as_system: bool = False):
if self.is_null():
return
self._update_from(
state=self._get_current_state(include_coords=True, include_velocities=True),
state_has_cv=(True, True),
nsteps_completed=self._current_step,
)
self._sire_mols.set_energy_trajectory(self._energy_trajectory, map=self._map)
self._sire_mols.set_ensemble(self.ensemble())
if return_as_system:
return self._sire_mols.clone()
elif self._orig_mols is not None:
from ..system import System
if System.is_system(self._orig_mols):
return self._sire_mols
else:
r = self._orig_mols.clone()
r.update(self._sire_mols.molecules())
return r
else:
return self._sire_mols.clone() I'll take a look to see what's going wrong. |
Okay, this seems to fix it:
@chryswoods: Let me know if this is the correct solution. I'll look at how to actually delete the frames for the purposes of chunking trajectories in |
Hmm, but if you delete the frames from the underlying system object, then it doesn't work as expected again: import sire as sr
mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()
d = mols.dynamics(timestep="4fs", temperature="25oC")
for i in range(3):
d.run("1ps", frame_frequency="0.2ps")
mols = d.commit()
print(f"mols currently has {mols.num_frames()} frames")
d._d._sire_mols.delete_all_frames() Gives:
|
Yeah, there's weirdness going on. For example, if you delete a single frame (the first frame in this case, but it doesn't matter): import sire as sr
mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()
d = mols.dynamics(timestep="4fs", temperature="25oC")
for i in range(3):
d.run("1ps", frame_frequency="0.2ps")
mols = d.commit()
print(f"mols currently has {mols.num_frames()} frames")
d._d._sire_mols.delete_frame(0) Gives:
Deleting two frames: import sire as sr
mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()
d = mols.dynamics(timestep="4fs", temperature="25oC")
for i in range(3):
d.run("1ps", frame_frequency="0.2ps")
mols = d.commit()
print(f"mols currently has {mols.num_frames()} frames")
d._d._sire_mols.delete_frame(0)
d._d._sire_mols.delete_frame(0) Gives:
I think deleting all frames must close the trajectory file handle somehow, so you are only ever left with one frame. That said, the coordinates of the molecule do appear to be updated, e.g.: import sire as sr
mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()
d = mols.dynamics(timestep="4fs", temperature="25oC")
for i in range(3):
d.run("1ps", frame_frequency="0.2ps")
mols = d.commit()
print(f"mols currently has {mols.num_frames()} frames")
d._d._sire_mols.delete_all_frames()
print(mols.coordinates() Gives:
|
This is what happens when I print out the call to
So the function to save the frames is being called the correct number of times. I think the number of frames must be inconsistent somewhere. |
Ah, I see, there's a |
Yes, I've updated |
Yes, this looks like the issue. There's a lot of challenge and complexity in this code because it mixes objects which are copy on write (System, Molecule) with objects that appear to be copy on write, but are actually explicitly shared (Trajectory contents) and hidden objects that are explicitly shared and (almost) globally cached (the SystemFrames that are stored in the cache. To make it more complex, the cache auto-offloads to disk as each cache frame fills up. The code was tested well for the normal case (generating a trajectory and saving it) but there's a lot of edge case bugs I guess when doing things like deleting individual trajectory frames from on implicitly shared copy, while the other copy needs a trajectory with all the frames in their original order. In these cases the code has to do a lot of juggling and memory movement. It would be worth at a future point looking at what somd2 really needs in terms of trajectory management and adding some custom code that implements that scheme (e.g. in C++, so that the trajectory doesn't need to be copied or excessively moved around). |
Also, correct, the trajectory isn't now streamed to a file. The frames are held in a cache that auto-offloads to disk as needed. You only write to a file when you explicitly save the trajectory. The cache keeps about 512MB of trajectory in memory, if I remember correctly, with the rest going to a page cache on disk. The cache is very efficient, memory mapping pages and giving you good random access performance to individual frames (each frame is paged individually). There's also an intermediary cache and a little look ahead so that frames are pulled from disk to memory efficiently, and stay there as long as needed. In general, it should mean that you don't need to do manual trajectory management now during a simulation or have to worry about filling memory. Instead, you can choose what frames to save after the simulation is complete. |
Great. This is what I'm now doing, i.e. writing the last N frames as a chunk at a checkpoint, then assembling the whole trajectory at the end. I guess I could probably keep the file open throughout and append, but this works well for now and avoids a partially written or corrupted file. |
The random crash on interpreter exit is caused by the CacheData::cleanUpOnExit method. If I comment this out, then I see no crashes whatsoever. I'll take a look to see if I can fix the problem. It would be nice to fix this as part of #250. |
Okay, it's entirely caused by the final bit of the block here where it waits for the threads to stop, then kills them if they don't. It looks like this is due to the 50ms timeout. On my Linux box, increasing this to 100ms causes no crashes. @chryswoods: Is this an acceptable solution? (I could make it even longer if needed.) |
Yes, this is fine. I suspect the threads are being slowed by file removal requests. This is at the interface of threads, system calls and undefined behaviour if killing a thread, so unsurprising it crashes. Waiting longer is the right solution. Maybe we could use some try joins or similar to wait for intervals of 100ms, and if the thread hasn't exited print a message to the user to say what is happening. For example it could be a 10 to 1 countdown with blast off being force termination the thread (and then hoping it doesn't crash) |
That's a good idea, I'll do that. |
Describe the bug
When periodically deleting the frames contained in a system (as is done in SOMD2 when checkpointing), any trajectories contained in a system after calling
mols.delete_all_frames()
will only contain a single frame. This issue also seems to cause a segmentation fault, but only around 50% of the time.To Reproduce
Run the following script:
the output will be the following:
Expected behavior
The output should be
(please complete the following information):
Additional context
Note that everything works as expected if
mols.delete_all_frames()
is not called:The text was updated successfully, but these errors were encountered: