-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix build and C++ tests for FreeBSD #10480
Conversation
@trivialfis Many C++ tests are failing on FreeBSD. It appears that all tracker C++ tests have a race condition where the tracker is stopped before workers can send a shutdown signal.
To get this error, I built XGBoost on FreeBSD 14.1 with the following patch: diff --git a/src/collective/comm.cc b/src/collective/comm.cc
index 543ece639..6c257a703 100644
--- a/src/collective/comm.cc
+++ b/src/collective/comm.cc
@@ -373,6 +373,7 @@ RabitComm::~RabitComm() noexcept(false) {
TCPSocket err_client;
return Success() << [&] {
+ LOG(CONSOLE) << "RabitComm::Shutdown(): ConnectTrackerImpl()";
return ConnectTrackerImpl(tracker_, timeout_, retry_, task_id_, &tracker, Rank(), World());
} << [&] {
return this->Block();
diff --git a/src/collective/tracker.cc b/src/collective/tracker.cc
index 9441ab449..822a283ab 100644
--- a/src/collective/tracker.cc
+++ b/src/collective/tracker.cc
@@ -355,6 +355,7 @@ Result RabitTracker::Bootstrap(std::vector<WorkerProxy>* p_workers) {
}
[[nodiscard]] Result RabitTracker::Stop() {
+ LOG(CONSOLE) << "RabitTracker::Stop()";
if (!this->Ready()) {
return Success();
}
@@ -366,6 +367,7 @@ Result RabitTracker::Bootstrap(std::vector<WorkerProxy>* p_workers) {
}
return Success() << [&] {
+ LOG(CONSOLE) << "RabitTracker::Stop(): Shutdown()";
// This should have the effect of stopping the `accept` call.
return this->listener_.Shutdown();
} << [&] {
diff --git a/tests/cpp/collective/test_allgather.cc b/tests/cpp/collective/test_allgather.cc
index 7764a2adc..71bf8203d 100644
--- a/tests/cpp/collective/test_allgather.cc
+++ b/tests/cpp/collective/test_allgather.cc
@@ -141,7 +141,7 @@ class Worker : public WorkerForTest {
} // namespace
TEST_F(AllgatherTest, Basic) {
- std::int32_t n_workers = std::min(7u, std::thread::hardware_concurrency());
+ std::int32_t n_workers = 2;
TestDistributed(n_workers, [=](std::string host, std::int32_t port, std::chrono::seconds timeout,
std::int32_t r) {
Worker worker{host, port, timeout, n_workers, r}; Shall we disable all tests using the tracker until we can figure out how to fix the race condition in the test harness? |
I will try to debug it. Should not be difficult once I can reproduce. |
@trivialfis Let me know if you need help with setting up FreeBSD VM. |
These issues are not new, it's just we started to have c++ tests for networking in this release. Will dive deeper. |
The networking issue should be fixed now. It's an invalid argument error in the tracker. Freebsd requires timeout to be -1 for infinite timeout, while other platforms are OK with negative values. |
I'm still running into a flaky hang for column sampler. Debugging. |
It's probably just my virtual machine (qemu). The OS seems overwhelmed by the number of threads in some tests. There's also a flaky test for column split, probably due to timeout; I haven't gotten to the bottom of it yet. If it fails on the CI, I will disable it for now and get back to it when I sort out some other priorities. |
@trivialfis Can we merge this for now? The CI is green. |
Closes #10468
Closes #10467
Closes #10466
cc @yurivict