Crash when running XGBoost via PySpark (socket.cc - Check failed) #11067

aardvarkk opened this issue Dec 6, 2024 · 4 comments

@aardvarkk

I'm running PySpark on macOS on an M3 Pro.

I have a consistent crash when calling fit on SparkXGBRegressor:

xgboost.core.XGBoostError: [15:45:17] /Users/runner/work/xgboost/xgboost/src/collective/socket.cc:133: Check failed: static_cast<std::int32_t>(conn.Domain()) == static_cast<std::int32_t>(addr.Domain()) (2 vs. 30) :
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x0000000118c20428 dmlc::LogMessageFatal::~LogMessageFatal() + 124
  [bt] (1) 2   libxgboost.dylib                    0x0000000118ce2814 xgboost::collective::Connect(xgboost::StringView, int, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l>>, xgboost::collective::TCPSocket*) + 360
  [bt] (2) 3   libxgboost.dylib                    0x0000000118ccb90c xgboost::collective::ConnectTrackerImpl(xgboost::collective::proto::PeerInfo, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l>>, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, xgboost::collective::TCPSocket*, int, int) + 144
  [bt] (3) 4   libxgboost.dylib                    0x0000000118cd0454 xgboost::collective::RabitComm::Bootstrap(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l>>, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 228
  [bt] (4) 5   libxgboost.dylib                    0x0000000118ccfe24 xgboost::collective::RabitComm::RabitComm(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l>>, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, xgboost::StringView) + 924
  [bt] (5) 6   libxgboost.dylib                    0x0000000118cd5ae4 xgboost::collective::CommGroup::Create(xgboost::Json) + 1320
  [bt] (6) 7   libxgboost.dylib                    0x0000000118cd6b30 xgboost::collective::GlobalCommGroupInit(xgboost::Json) + 96
  [bt] (7) 8   libxgboost.dylib                    0x0000000118caf2c8 XGCommunicatorInit + 84
  [bt] (8) 9   libffi.dylib                        0x00000001a31eb050 ffi_call_SYSV + 80

It seems like maybe it's some IPv4 vs IPv6 mismatch? That being said, I'm not at all clear what to do about it!
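One way to check the IPv4 vs IPv6 hunch: the "(2 vs. 30)" in the failed check appears to match the numeric values of `AF_INET` (2) and `AF_INET6` (30) on macOS. The snippet below is only a diagnostic sketch, not a fix. It uses the standard-library `socket.getaddrinfo` to show which address families a few hostnames resolve to on the local machine. The helper name `address_families` and the choice of hosts to probe are my own assumptions; which address XGBoost's tracker or Spark actually uses is not something this snippet can tell you.

```python
import socket


def address_families(host):
    """Return the set of address-family names that `host` resolves to.

    Hypothetical helper for illustration; it only wraps socket.getaddrinfo.
    """
    infos = socket.getaddrinfo(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {info[0].name for info in infos}


if __name__ == "__main__":
    # Numeric values are platform-specific; on macOS AF_INET6 is 30.
    print("AF_INET =", int(socket.AF_INET), "AF_INET6 =", int(socket.AF_INET6))
    # Assumption: these are the names most likely to matter for a local run.
    for host in ("127.0.0.1", "localhost", socket.gethostname()):
        try:
            print(host, "->", sorted(address_families(host)))
        except socket.gaierror as exc:
            print(host, "-> resolution failed:", exc)
```

If the machine's hostname resolves to an IPv6 address (or to both families) while another component connects over IPv4, that would be consistent with the domain mismatch in the check above, though it does not prove that this is the cause.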

@trivialfis
Member

cc @WeichenXu123 Do you recall why pyspark XGB isn't tested on macos? Also, is pyspark itself tested on apple silicon?

@WeichenXu123
Contributor

> cc @WeichenXu123 Do you recall why pyspark XGB isn't tested on macos? Also, is pyspark itself tested on apple silicon?

I need to check the CI setup. Databricks use cases only run on Ubuntu, so macOS is not a case we care about.

@trivialfis
Member

Thank you for sharing. I will add a warning that macOS and Windows are not supported.

@ayoub317
Contributor

ayoub317 commented Dec 29, 2024

Out of curiosity, when do you encounter such errors? Does it work when you spawn a local Spark session?
What type of cluster are your executors running on? Is the driver your local machine, and have you tried a setup where the driver is a machine within the cluster instead of your local machine?
