-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add celerity blockchain for task divergence checking #217
base: master
Are you sure you want to change the base?
Conversation
Check-perf-impact results: (5a19ced85f862a00d0114dd241122462) ❓ No new benchmark data submitted. ❓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
ae68e51
to
af692d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
af692d0
to
2dd8a09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
2dd8a09
to
6c3f128
Compare
ef86cd1
to
9ae8356
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the overall design, I mostly like it, but one thing I noticed is that there is a lot of code that's purely for testing which is in the actual "main logic" part. So far, I believe we mostly try to keep testing-only code in testing helpers, which makes it easier to see at a glance what is actually happening in a production run vs. what is done only in testing. We might want to move the testing-related functionality out of the main code.
d5d2e90
to
b88b60f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, other than the minor documentation and formatting things discussed offline.
I particularly appreciate moving most of the testing logic and functionality outside the main code path.
bb94e68
to
4af1341
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is becoming a valuable on-boarding debug feature.
I have not looked closely at the algorithm so far, but some more documentaiton / clarification on what the workflow around the state and the member functions of abstract_block_chain
is would help a lot.
4af1341
to
252ad19
Compare
Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72) ❓ No new benchmark data submitted. ❓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
252ad19
to
8a27177
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I did a first pass and added some comments, mostly about comprehensibility / naming.
388c399
to
b06ad9e
Compare
Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17) ❓ No new benchmark data submitted. ❓ |
fc46cfc
to
ebf4f9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things are coming together! GitHub decided to post some of my comments early for some reason... Well, here's the rest :P
By the way, we have a section on diverging execution in ours docs / website, it would be great if you could add a sentence there mentioning this check and how it can be enabled!
test/divergence_check_test_utils.h
Outdated
std::transform(div_test.begin(), div_test.end(), std::back_inserter(sizes), [](auto& div) { return div->m_task_records.size(); }); | ||
auto [min, max] = std::minmax_element(sizes.begin(), sizes.end()); | ||
|
||
std::vector<per_node_task_hashes> extended_lifetime_hashes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this is for..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extended_lifetime_hashes
are needed, because of how the single node divergence_test_communicator
works. The idea is following:
- each simulated node calls allgather sequentially. The send/receive buffer is saved internally as simple pointers.
- When all simulated nodes called the allgather the data is copied.
To prevent those send/receive buffers for per_node_task_hashes
from going out of scope we need to "extend" their lifetime. For this, I included the extended_lifetime_hashes
vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so I took a look at this again and this does not work: divergence_block_chain::collect_hashes
returns a per_node_task_hashes
by value, which itself contains a std::vector<task_hash>
. So extended_lifetime_hashes[0].data()
is no longer the same pointer as stored in divergence_test_communicator::m_gather_data
, and you're effectively reading already free'd memory. If you run the tests with AddressSanitizier (and without mimalloc), you will get an error for it.
I think the only way the collective operations can be properly mocked (that is, to have them block until all "ranks" have called them), is to put each block chain into a separate thread. The upside is that you won't need any of the pre_check
/ post_check
hackery. I would suggest to abstract all of that into a divergence_check_test_context
(which would subsume the divergence_test_communicator_provider
) that also wraps a number of divergence_block_chain
s, each in their own thread, and possibly even the task_test_context
for each.
ebf4f9a
to
3fe7ae2
Compare
3fe7ae2
to
3adceaa
Compare
3adceaa
to
6eb38e4
Compare
6eb38e4
to
1a6e3ee
Compare
|
||
std::unique_ptr<communicator> m_communicator; | ||
|
||
void divergence_out(const divergence_map& check_map, const int task_num); | ||
void reprot_divergence(const divergence_map& check_map, const int task_num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void reprot_divergence(const divergence_map& check_map, const int task_num); | |
void report_divergence(const divergence_map& check_map, const int task_num); |
MPI_Comm_dup(MPI_COMM_WORLD, &comm); | ||
m_divergence_check = std::make_unique<divergence_checker>(*m_task_recorder, std::make_unique<mpi_communicator>(comm), m_test_mode); | ||
#endif | ||
// if (m_cfg->is_recording()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
// // Sychronize all nodes before reseting shuch that we don't get into a deadlock | ||
// // With this barrier we can be shure that every node is fully finished and all mpi operations are done. (Sending ...) | ||
// MPI_Barrier(MPI_COMM_WORLD); | ||
// m_divergence_check.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, other than a few typos and commented out code.
} | ||
|
||
#if CELERITY_DIVERGENCE_CHECK | ||
// Sychronize all nodes before reseting shuch that we don't get into a deadlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few typos "Sychronize" "reseting" "shuch"
@@ -289,6 +294,7 @@ target_compile_definitions(celerity_runtime PUBLIC | |||
CELERITY_FEATURE_UNNAMED_KERNELS=$<BOOL:${CELERITY_FEATURE_UNNAMED_KERNELS}> | |||
CELERITY_DETAIL_HAS_NAMED_THREADS=$<BOOL:${CELERITY_DETAIL_HAS_NAMED_THREADS}> | |||
CELERITY_ACCESSOR_BOUNDARY_CHECK=$<BOOL:${CELERITY_ACCESSOR_BOUNDARY_CHECK}> | |||
CELERITY_DIVERGENCE_CHECK=$<BOOL:${CELERITY_DIVERGENCE_CHECK}> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs to be added to cmake/celerity-config.cmake.in
!
#if CELERITY_DIVERGENCE_CHECK | ||
// divergence checker needs recording | ||
m_recording = true; | ||
#else | ||
m_recording = parsed_and_validated_envs.get_or(env_recording, false); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not a big fan of CELERITY_RECORDING
as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, maybe we should just get rid of it in a small follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments!
Also since the clang-tidy check for this seems broken in CI: Please go over all new function definitions and make sure parameters that can be const are const.
|
||
private: | ||
std::thread m_thread; | ||
bool m_is_running = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_is_running
must be protected by a mutex!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just use an atomic.
divergence_block_chain& operator=(const divergence_block_chain&) = delete; | ||
divergence_block_chain& operator=(divergence_block_chain&&) = delete; | ||
|
||
bool check_for_divergence(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a comment on what a true
/ false
return value means. It reads like this would return true
when there was divergence, but the function actually throws in that case!
m_thread = std::thread(&divergence_checker::run, this); | ||
m_is_running = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there is a race between setting m_is_running = true
here and the check for m_is_running
in run()
. I suggest you reverse the order to fix this.
m_thread = std::thread(&divergence_checker::run, this); | |
m_is_running = true; | |
m_is_running = true; | |
m_thread = std::thread(&divergence_checker::run, this); |
#if CELERITY_DIVERGENCE_CHECK | ||
// divergence checker needs recording | ||
m_recording = true; | ||
#else | ||
m_recording = parsed_and_validated_envs.get_or(env_recording, false); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not a big fan of CELERITY_RECORDING
as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.
if(min_hash_count == 0) { | ||
if(max_hash_count != 0 && m_local_nid == 0) { | ||
check_for_deadlock(); | ||
} else if(max_hash_count == 0) { | ||
return true; | ||
} | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: What is happening here?
m_hashes_added = m_task_records.size(); | ||
} | ||
|
||
void divergence_block_chain::clear(const int min_progress) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming: This does not clear the chain (similar to what vector::clear
would do. Maybe something along the lines of erase_front
, prune_leading
, or similar?
Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:
|
This pull request adds a divergence checking mechanism for tasks.
It does so by periodically gathering hashes of all tasks from task_recording and comparing them. When a divergence is detected an error containing the diverged tasks and their full task record is printed like:
Additionally it also includes a rudimentary deadlock detection for nodes which are stuck by printing a warning after a given amount of time (eg 10 seconds):
All of this is automatically turned on by running the program with task recording enabled.