Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add celerity blockchain for task divergence checking #217

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

GagaLP
Copy link
Contributor

@GagaLP GagaLP commented Oct 2, 2023

This pull request adds a divergence checking mechanism for tasks.

It does so by periodically gathering hashes of all tasks from task_recording and comparing them. When a divergence is detected an error containing the diverged tasks and their full task record is printed like:

[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:

0x471b0f1db5e4b8e6 on nodes 1 
0xe9fbff654e3748e1 on nodes 0 

[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:

id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
         dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses: 
         bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies: 
         node: 0, kind: true-dep, origin: last-epoch

Additionally it also includes a rudimentary deadlock detection for nodes which are stuck by printing a warning after a given amount of time (eg 10 seconds):

[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.

All of this is automatically turned on by running the program with task recording enabled.

@github-actions
Copy link

github-actions bot commented Oct 2, 2023

Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/recorders.h Outdated Show resolved Hide resolved
include/recorders.h Outdated Show resolved Hide resolved
include/task.h Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
src/runtime.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/divergence_block_chain.h Outdated Show resolved Hide resolved
@GagaLP GagaLP requested review from psalz and fknorr October 9, 2023 12:00
@GagaLP GagaLP force-pushed the divergence-check branch 4 times, most recently from ef86cd1 to 9ae8356 Compare October 9, 2023 15:40
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the overall design, I mostly like it, but one thing I noticed is that there is a lot of code that's purely for testing which is in the actual "main logic" part. So far, I believe we mostly try to keep testing-only code in testing helpers, which makes it easier to see at a glance what is actually happening in a production run vs. what is done only in testing. We might want to move the testing-related functionality out of the main code.

include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/grid.h Outdated Show resolved Hide resolved
include/handler.h Outdated Show resolved Hide resolved
include/print_utils.h Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
test/debug_naming_tests.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
test/system/distr_tests.cc Outdated Show resolved Hide resolved
@GagaLP GagaLP force-pushed the divergence-check branch 2 times, most recently from d5d2e90 to b88b60f Compare October 12, 2023 11:45
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, other than the minor documentation and formatting things discussed offline.

I particularly appreciate moving most of the testing logic and functionality outside the main code path.

@GagaLP GagaLP force-pushed the divergence-check branch 2 times, most recently from bb94e68 to 4af1341 Compare October 12, 2023 14:13
Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is becoming a valuable on-boarding debug feature.

I have not looked closely at the algorithm so far, but some more documentaiton / clarification on what the workflow around the state and the member functions of abstract_block_chain is would help a lot.

test/divergence_check_test_utils.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/utils.h Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
test/divergence_check_test_utils.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
@psalz psalz added this to the 0.5.0 milestone Nov 15, 2023
Copy link

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/communicator.h Show resolved Hide resolved
include/communicator.h Show resolved Hide resolved
include/communicator.h Show resolved Hide resolved
include/communicator.h Outdated Show resolved Hide resolved
test/divergence_check_test_utils.h Outdated Show resolved Hide resolved
test/divergence_check_test_utils.h Outdated Show resolved Hide resolved
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I did a first pass and added some comments, mostly about comprehensibility / naming.

include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/communicator.h Outdated Show resolved Hide resolved
include/communicator.h Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/communicator.h Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
@GagaLP GagaLP force-pushed the divergence-check branch 2 times, most recently from 388c399 to b06ad9e Compare December 6, 2023 14:59
Copy link

github-actions bot commented Dec 6, 2023

Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

@GagaLP GagaLP force-pushed the divergence-check branch 2 times, most recently from fc46cfc to ebf4f9a Compare December 6, 2023 15:32
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

test/divergence_check_test_utils.h Outdated Show resolved Hide resolved
test/divergence_check_test_utils.h Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
test/system/distr_tests.cc Show resolved Hide resolved
test/system/distr_tests.cc Show resolved Hide resolved
test/system/distr_tests.cc Show resolved Hide resolved
test/system/distr_tests.cc Show resolved Hide resolved
include/recorders.h Outdated Show resolved Hide resolved
include/recorders.h Show resolved Hide resolved
include/runtime.h Outdated Show resolved Hide resolved
include/runtime.h Outdated Show resolved Hide resolved
include/mpi_communicator.h Show resolved Hide resolved
src/runtime.cc Outdated Show resolved Hide resolved
src/runtime.cc Outdated Show resolved Hide resolved
include/ranges.h Outdated Show resolved Hide resolved
include/grid.h Outdated Show resolved Hide resolved
include/communicator.h Show resolved Hide resolved
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things are coming together! GitHub decided to post some of my comments early for some reason... Well, here's the rest :P

By the way, we have a section on diverging execution in ours docs / website, it would be great if you could add a sentence there mentioning this check and how it can be enabled!

src/runtime.cc Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
include/divergence_block_chain.h Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
std::transform(div_test.begin(), div_test.end(), std::back_inserter(sizes), [](auto& div) { return div->m_task_records.size(); });
auto [min, max] = std::minmax_element(sizes.begin(), sizes.end());

std::vector<per_node_task_hashes> extended_lifetime_hashes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is for..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extended_lifetime_hashes are needed, because of how the single node divergence_test_communicator works. The idea is following:

  • each simulated node calls allgather sequentially. The send/receive buffer is saved internally as simple pointers.
  • When all simulated nodes called the allgather the data is copied.

To prevent those send/receive buffers for per_node_task_hashes from going out of scope we need to "extend" their lifetime. For this, I included the extended_lifetime_hashes vector.

Copy link
Member

@psalz psalz Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so I took a look at this again and this does not work: divergence_block_chain::collect_hashes returns a per_node_task_hashes by value, which itself contains a std::vector<task_hash>. So extended_lifetime_hashes[0].data() is no longer the same pointer as stored in divergence_test_communicator::m_gather_data, and you're effectively reading already free'd memory. If you run the tests with AddressSanitizier (and without mimalloc), you will get an error for it.

I think the only way the collective operations can be properly mocked (that is, to have them block until all "ranks" have called them), is to put each block chain into a separate thread. The upside is that you won't need any of the pre_check / post_check hackery. I would suggest to abstract all of that into a divergence_check_test_context (which would subsume the divergence_test_communicator_provider) that also wraps a number of divergence_block_chains, each in their own thread, and possibly even the task_test_context for each.

src/divergence_block_chain.cc Outdated Show resolved Hide resolved
src/divergence_block_chain.cc Outdated Show resolved Hide resolved
test/divergence_check_tests.cc Outdated Show resolved Hide resolved
github-actions[bot]

This comment was marked as outdated.

@GagaLP GagaLP changed the title Added celerity blockchain for task divergence checking Add celerity blockchain for task divergence checking Dec 19, 2023
github-actions[bot]

This comment was marked as outdated.


std::unique_ptr<communicator> m_communicator;

void divergence_out(const divergence_map& check_map, const int task_num);
void reprot_divergence(const divergence_map& check_map, const int task_num);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void reprot_divergence(const divergence_map& check_map, const int task_num);
void report_divergence(const divergence_map& check_map, const int task_num);

MPI_Comm_dup(MPI_COMM_WORLD, &comm);
m_divergence_check = std::make_unique<divergence_checker>(*m_task_recorder, std::make_unique<mpi_communicator>(comm), m_test_mode);
#endif
// if (m_cfg->is_recording()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment

// // Sychronize all nodes before reseting shuch that we don't get into a deadlock
// // With this barrier we can be shure that every node is fully finished and all mpi operations are done. (Sending ...)
// MPI_Barrier(MPI_COMM_WORLD);
// m_divergence_check.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment

Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, other than a few typos and commented out code.

}

#if CELERITY_DIVERGENCE_CHECK
// Sychronize all nodes before reseting shuch that we don't get into a deadlock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few typos "Sychronize" "reseting" "shuch"

@@ -289,6 +294,7 @@ target_compile_definitions(celerity_runtime PUBLIC
CELERITY_FEATURE_UNNAMED_KERNELS=$<BOOL:${CELERITY_FEATURE_UNNAMED_KERNELS}>
CELERITY_DETAIL_HAS_NAMED_THREADS=$<BOOL:${CELERITY_DETAIL_HAS_NAMED_THREADS}>
CELERITY_ACCESSOR_BOUNDARY_CHECK=$<BOOL:${CELERITY_ACCESSOR_BOUNDARY_CHECK}>
CELERITY_DIVERGENCE_CHECK=$<BOOL:${CELERITY_DIVERGENCE_CHECK}>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needs to be added to cmake/celerity-config.cmake.in!

Comment on lines +204 to +209
#if CELERITY_DIVERGENCE_CHECK
// divergence checker needs recording
m_recording = true;
#else
m_recording = parsed_and_validated_envs.get_or(env_recording, false);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we print a warning that recording is being force-enabled here? What about the user explicitly setting CELERITY_RECORDING=0? cc @PeterTh @fknorr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, maybe we should just get rid of it in a small follow-up PR.

Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments!

Also since the clang-tidy check for this seems broken in CI: Please go over all new function definitions and make sure parameters that can be const are const.

include/communicator.h Show resolved Hide resolved

private:
std::thread m_thread;
bool m_is_running = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_is_running must be protected by a mutex!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just use an atomic.

divergence_block_chain& operator=(const divergence_block_chain&) = delete;
divergence_block_chain& operator=(divergence_block_chain&&) = delete;

bool check_for_divergence();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment on what a true / false return value means. It reads like this would return true when there was divergence, but the function actually throws in that case!

Comment on lines +119 to +120
m_thread = std::thread(&divergence_checker::run, this);
m_is_running = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like there is a race between setting m_is_running = true here and the check for m_is_running in run(). I suggest you reverse the order to fix this.

Suggested change
m_thread = std::thread(&divergence_checker::run, this);
m_is_running = true;
m_is_running = true;
m_thread = std::thread(&divergence_checker::run, this);

Comment on lines +204 to +209
#if CELERITY_DIVERGENCE_CHECK
// divergence checker needs recording
m_recording = true;
#else
m_recording = parsed_and_validated_envs.get_or(env_recording, false);
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.

Comment on lines +9 to +16
if(min_hash_count == 0) {
if(max_hash_count != 0 && m_local_nid == 0) {
check_for_deadlock();
} else if(max_hash_count == 0) {
return true;
}
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: What is happening here?

m_hashes_added = m_task_records.size();
}

void divergence_block_chain::clear(const int min_progress) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: This does not clear the chain (similar to what vector::clear would do. Maybe something along the lines of erase_front, prune_leading, or similar?

@psalz
Copy link
Member

psalz commented Dec 20, 2023

Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:

  • Deadlock detection as-is would produce too many false positive warnings; not sure yet how to proceed on this.
  • Testing infrastructure invokes UB; needs multi-threading to properly mock blocking collective operations
  • We should have a test case (distr_test / integration test?) that exercises the case that one node submits a task while the other does not (the divergence then occurs between that task and the shutdown epoch).

@psalz psalz removed this from the 0.5.0 milestone Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants