Add checkpointing #997
Comments
Use cases: An HPC provider (a national lab) limits HPC jobs to 24 hours, so checkpointing would be useful for running longer jobs. A separate motivation is that a job that takes 24.5 hours (longer than expected) is killed at the limit, and without a checkpoint its work is lost.
Work is underway to make critical structures in SST-Core serializable in preparation for checkpoint generation. We've identified that many structures can be serialized without visible API changes. Handlers, however (e.g., clock and event handlers), will need a visible change in definition as follows:
For backwards compatibility, the old definition would still work, but components using it would not be checkpointable. Other implementation notes:
A few more details on the serialization changes needed for checkpointing:

Serialization of "base" objects was originally implemented through a template for the given type, called serialize. As a convenience, operator& was also overloaded as a template to allow simpler syntax in the serialize_order functions.

To support pointer tracking, this structure has been changed somewhat. operator& still calls the serialize template, which now does all the pointer tracking, and the original serialize templates have been renamed to serialize_impl. There is a new function call on the serializer to turn pointer tracking on; once it is on, the serializer keeps track of all pointers. The data is serialized with the first instance of a pointer, and every subsequent instance just writes a tag referring to the first instance. On deserialization, the object is recreated at the first instance, and all other instances are given the pointer to the new object.

Added a new template operator| (operator or). It is used only for the very specific case of treating a non-pointer as a pointer, where the data is stored directly in an object (for example in a map or set) but other objects hold pointers to that data. This is needed for the ComponentInfo objects of SubComponents, where the ComponentInfo object is stored in the parent in a std::map<ComponentId_t, ComponentInfo> and the SubComponent has a pointer to the data in its parent. A limitation of this function is that the non-pointer data must be serialized first.

Made a serialize_impl template instantiation that handles non-polymorphic classes. This allows a non-polymorphic class to be serialized with only a serialize_order function, with no need to inherit from serializable.

We are considering adding an implementation of serialize_impl that handles classes for which std::is_trivially_copyable is true. In this case, there would be no need for a serialize_order function, and it would just use memcpy to serialize the event. We still need to evaluate whether is_trivially_copyable returns true for a class with a pointer as one of its data members. If it does, we won't be able to make this work, because the pointer would not point to the correct data after deserialization.
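The pointer-tracking scheme described in this comment can be sketched roughly as follows. This is an illustration only, not SST-Core's implementation: Payload, Holder, Packer, Unpacker, the Marker values, and the use of sequential ids as tags are all hypothetical choices made for the example. The sketch also includes a compile-time check showing that, in standard C++, a struct whose only member is a raw pointer satisfies std::is_trivially_copyable, which bears on the open question about the memcpy path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical payload type; not an SST class.
struct Payload { int value; };

// A struct holding a raw pointer is still trivially copyable, so the trait
// alone cannot tell a serializer that the pointee needs special handling.
struct Holder { Payload* p; };
static_assert(std::is_trivially_copyable<Holder>::value,
              "a raw pointer member does not prevent trivial copyability");

// Assumed stream encoding: null pointer, first instance, repeated instance.
enum Marker : std::int64_t { kNull = 0, kFirst = 1, kRepeat = 2 };

// Writes pointers into a flat stream, emitting the data only once per object.
class Packer {
public:
    void pack(const Payload* p) {
        if (p == nullptr) { out_.push_back(kNull); return; }
        auto it = seen_.find(p);
        if (it == seen_.end()) {
            // First instance: write a fresh tag followed by the data.
            const std::int64_t id = next_id_++;
            seen_.emplace(p, id);
            out_.push_back(kFirst);
            out_.push_back(id);
            out_.push_back(p->value);
        } else {
            // Subsequent instance: write only the tag of the first one.
            out_.push_back(kRepeat);
            out_.push_back(it->second);
        }
    }
    const std::vector<std::int64_t>& stream() const { return out_; }

private:
    std::unordered_map<const Payload*, std::int64_t> seen_;
    std::vector<std::int64_t> out_;
    std::int64_t next_id_ = 0;
};

// Rebuilds objects from the stream. The Unpacker owns the rebuilt objects,
// so returned pointers are valid only while the Unpacker is alive.
class Unpacker {
public:
    explicit Unpacker(std::vector<std::int64_t> in) : in_(std::move(in)) {}

    Payload* unpack() {
        const std::int64_t marker = in_.at(pos_++);
        if (marker == kNull) return nullptr;
        const std::int64_t id = in_.at(pos_++);
        if (marker == kRepeat) return rebuilt_.at(id);  // reuse the new object
        assert(marker == kFirst);
        owned_.push_back(std::unique_ptr<Payload>(
            new Payload{static_cast<int>(in_.at(pos_++))}));
        rebuilt_[id] = owned_.back().get();
        return owned_.back().get();
    }

private:
    std::vector<std::int64_t> in_;
    std::size_t pos_ = 0;
    std::unordered_map<std::int64_t, Payload*> rebuilt_;
    std::vector<std::unique_ptr<Payload>> owned_;
};

// Packs one object twice and checks that both restored pointers agree.
bool round_trip_preserves_sharing() {
    Payload shared{42};
    Payload other{7};
    Packer packer;
    packer.pack(&shared);
    packer.pack(&other);
    packer.pack(&shared);  // repeat: only a tag is written
    packer.pack(nullptr);

    Unpacker unpacker(packer.stream());
    Payload* a = unpacker.unpack();
    Payload* b = unpacker.unpack();
    Payload* c = unpacker.unpack();
    Payload* d = unpacker.unpack();
    return a == c && a != b && a != &shared &&
           a->value == 42 && b->value == 7 && d == nullptr;
}
```

Calling round_trip_preserves_sharing() exercises the whole path: the two restored pointers to the shared object compare equal to each other but not to the original address, which is the property a memcpy of Holder would not provide.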
Update on TimeVortex checkpointing: Ultimately, we plan to have events serialized with the Components they are targeting. This is to allow the partitioning after a restart to differ from the partitioning before the checkpoint. For the initial implementation, the TimeVortex will be serialized in place, and the Event::Handlers will be "fixed up" after restarting to point to the correct post-restart handler. This is done by having the Links report their handlers (tag and new pointer) to the Simulation_impl object so that it can exchange the old pointer (used as the tag in the checkpoint) for the new pointer.
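A minimal sketch of the fix-up step described above, under stated assumptions: Handler, Event, and HandlerRegistry are hypothetical stand-ins (the registry plays the role the comment assigns to Simulation_impl), and the tags are plain integers standing for pre-checkpoint pointer values. It is not the SST-Core code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an event handler; not SST's Event::Handler.
struct Handler { int link_id; };

// Hypothetical restored event: it carries the old handler address as a tag
// and has no usable handler pointer until the fix-up has run.
struct Event {
    std::uintptr_t handler_tag;  // pre-checkpoint pointer value
    Handler* handler;            // set by fix_up(); nullptr until then
};

// Collects (old tag, new pointer) pairs and patches restored events.
class HandlerRegistry {
public:
    // Called once per link after restart to report its handler.
    void report(std::uintptr_t old_tag, Handler* fresh) {
        assert(fresh != nullptr);
        new_by_tag_[old_tag] = fresh;
    }

    // Replaces each event's stale tag with the matching new pointer.
    // Returns how many events were patched; unknown tags are left as-is.
    std::size_t fix_up(std::vector<Event>& queue) const {
        std::size_t fixed = 0;
        for (Event& ev : queue) {
            auto it = new_by_tag_.find(ev.handler_tag);
            if (it != new_by_tag_.end()) {
                ev.handler = it->second;
                ++fixed;
            }
        }
        return fixed;
    }

private:
    std::unordered_map<std::uintptr_t, Handler*> new_by_tag_;
};

// Restores a small queue and checks that every known tag was resolved.
bool restart_fixes_handlers() {
    Handler h1{1};
    Handler h2{2};
    HandlerRegistry registry;
    registry.report(0x1000, &h1);
    registry.report(0x2000, &h2);

    std::vector<Event> queue = {{0x1000, nullptr}, {0x2000, nullptr},
                                {0x1000, nullptr}};
    return registry.fix_up(queue) == 3 &&
           queue[0].handler == &h1 && queue[1].handler == &h2 &&
           queue[2].handler == &h1;
}
```

Keeping the mapping in one place means the restored queue itself never has to interpret old addresses; it only looks them up.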
Update on statistics checkpointing: Support for checkpointing statistics was merged in PR #1098.
Add the ability to checkpoint and restart SST runs.