System hangs during raft node initialization on macOS in situations where initialization should fail #281

Closed
2 of 3 tasks
maurermi opened this issue Jul 12, 2024 · 2 comments
Assignees
Labels
fix/bug Fixes errant behavior

Comments

@maurermi
Collaborator

Affected Branch

We have observed that basic_raft_cluster_failure_test hangs on macOS (seen in the macOS CI, as well as on an M3 Mac running macOS Sonoma). The hang is caused by the following block of code in util/raft/node.cpp:46-63:

        m_raft_instance = m_launcher.init(m_sm,
                                          m_smgr,
                                          m_raft_logger,
                                          m_port,
                                          m_asio_opt,
                                          params,
                                          m_init_opts);

        if(!m_raft_instance) {
            m_log->error("Failed to initialize raft launcher");
            return false;
        }

        m_log->info("Waiting for raft initialization");
        static constexpr auto wait_time = std::chrono::milliseconds(100);
        while(!m_raft_instance->is_initialized()) {
            std::this_thread::sleep_for(wait_time);
        }

On macOS, m_launcher.init() returns a non-null instance even in situations where the raft instance cannot be initialized successfully, so the null check passes and the waiting loop never terminates. This does not appear to happen on Linux (verified on Ubuntu).

This error occurs in the NuRaft codebase, so I propose two potential solutions:

  1. We are currently using NuRaft v1.3.0, whereas NuRaft is currently on version 2.1.0. We should investigate whether this problem has since been fixed and consider upgrading.
  2. We should add a timeout so that we never wait indefinitely for raft initialization. This is likely a wise addition whether or not we upgrade NuRaft.
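
As a rough illustration of the second option, here is a minimal sketch of a bounded polling wait. The helper name `wait_with_timeout` and its parameters are hypothetical (they exist in neither our codebase nor NuRaft), and the actual timeout value is left to whoever implements the fix.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of the project or of NuRaft.
// Polls `ready` every `poll_interval` until it returns true or until
// `timeout` has elapsed. Returns true if `ready` succeeded in time,
// false if the deadline passed first.
inline bool wait_with_timeout(const std::function<bool()>& ready,
                              std::chrono::milliseconds timeout,
                              std::chrono::milliseconds poll_interval) {
    // steady_clock is monotonic, so wall-clock adjustments cannot
    // shorten or extend the wait.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while(!ready()) {
        if(std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}
```

In node.cpp, the waiting loop could then pass a lambda that calls `m_raft_instance->is_initialized()` as `ready`, and log an error and return false when the helper reports a timeout, instead of spinning forever.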

Basic Diagnostics

  • I've pulled the latest changes on the affected branch and the issue is still present.

  • The issue is reproducible in docker

Description

In order to reproduce the issue, follow these steps:

  1. Run the basic_raft_cluster_failure_test on macOS

Code of Conduct

  • I agree to follow this project's Code of Conduct
@maurermi maurermi added the fix/bug Fixes errant behavior label Jul 12, 2024
@maurermi
Collaborator Author

Assigning this to @eolesinski

@HalosGhost
Collaborator

via #290
