Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in h2o.getConnection(): No active connection to an H2O cluster. #57

Open
siavash-babaei opened this issue Sep 15, 2024 · 11 comments

Comments

@siavash-babaei
Copy link

siavash-babaei commented Sep 15, 2024

With tune::tune_grid, I bump into the "Warning: All models failed ...", which is due to "Error in h2o.getConnection(): No active connection to an H2O cluster ...". Although the output of h2o.getConnection() immediately before and after tune_grid suggest everything should be OK. The code is pretty much the standard example on reference site.

library(ggplot2)
library(tidymodels)
library(agua)
#> 
#> Attaching package: 'agua'
#> The following object is masked from 'package:workflowsets':
#> 
#>     rank_results
library(h2o)
#> 
#> ----------------------------------------------------------------------
#> 
#> Your next step is to start H2O:
#>     > h2o.init()
#> 
#> For H2O package documentation, ask for help:
#>     > ??h2o
#> 
#> After starting H2O, you can use the Web UI at http://localhost:54321
#> For more information visit https://docs.h2o.ai
#> 
#> ----------------------------------------------------------------------
#> 
#> Attaching package: 'h2o'
#> The following objects are masked from 'package:stats':
#> 
#>     cor, sd, var
#> The following objects are masked from 'package:base':
#> 
#>     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
#>     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
#>     log10, log1p, log2, round, signif, trunc

h2o::h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         50 minutes 51 seconds 
#>     H2O cluster timezone:       America/New_York 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.46.0.5 
#>     H2O cluster version age:    16 days 
#>     H2O cluster name:           H2O_started_from_R_siava_dqu656 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   13.28 GB 
#>     H2O cluster total cores:    32 
#>     H2O cluster allowed cores:  32 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.4.1 (2024-06-14 ucrt)
h2o::h2o.connect()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         50 minutes 54 seconds 
#>     H2O cluster timezone:       America/New_York 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.46.0.5 
#>     H2O cluster version age:    16 days 
#>     H2O cluster name:           H2O_started_from_R_siava_dqu656 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   13.28 GB 
#>     H2O cluster total cores:    32 
#>     H2O cluster allowed cores:  32 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.4.1 (2024-06-14 ucrt)

data(ames)

set.seed(4595)
data_split <- ames |>
  mutate(Sale_Price = log10(Sale_Price)) |>
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test <- testing(data_split)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)

ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) |>
  step_log(Gr_Liv_Area, base = 10) |>
  step_ns(Longitude, deg_free = tune("long df")) |>
  step_ns(Latitude, deg_free = tune("lat df"))

lm_mod <- linear_reg(penalty = tune()) |>
  set_engine("h2o")

lm_wflow <- workflow() |>
  add_model(lm_mod) |>
  add_recipe(ames_rec)

grid <- lm_wflow |>
  extract_parameter_set_dials() |>
  grid_regular(levels = 5)

h2o.getConnection()
#> IP Address: localhost 
#> Port      : 54321 
#> Name      : NA 
#> Session ID: _sid_b5d8 
#> Key Count : 0

ames_res <- tune_grid(
  lm_wflow,
  resamples = cv_splits,
  grid = grid,
  control = control_grid(save_pred = TRUE,
                         backend_options = agua_backend_options(parallelism = 5))
)
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.

h2o.getConnection()
#> IP Address: localhost 
#> Port      : 54321 
#> Name      : NA 
#> Session ID: _sid_b5d8 
#> Key Count : 0

agua::h2o_running(verbose = FALSE)
#> [1] TRUE

show_notes(.Last.tune.result)
#> unique notes:
#> ────────────────────────────────────────────────────────────────────────────────
#> Error in h2o.getConnection(): No active connection to an H2O cluster. Did you run `h2o.init()` ?

Latest RStudio is run with both normal and administrative privileges (same issue - although we should not need admin rights) on Windows 11 Pro x64 machine. JRE (1.8.0_421) and JDK (22) are installed, and all Java paths (JRE_HOME, JDK_HOME, JAVA_HOME = JRE_HOME) are all A-OK! I went so-far as to use rJava to manually initialize a Java VM instance with no effect. The command demo(h2o.kmeans) runs wihout a problem and produces expected results.

The only other thing of note is that running agua::h2o_start() produces weird output that it should not be. I get "permission denied error" on user folder which should be accessible even without admin rights (folder-access issue is the same with h2o package) and I get "no Java error" that shouldn't be there. Could this be the root of the issue: "h2o cluster connection is there as far as h2o package is concerned but agua (which adds supports for tidymodels) does not recognize Java and the existing h2o connection"?!

h2o::h2o.init()
#> 
#> H2O is not running yet, starting it now...
#> 
#> Note:  In case of errors look at the following log files:
#>     C:\Users\siava\AppData\Local\Temp\Rtmp08HkaV\filec5072151a2b/h2o_siava_started_from_r.out
#>     C:\Users\siava\AppData\Local\Temp\Rtmp08HkaV\filec50cc92881/h2o_siava_started_from_r.err
#> 
#> 
#> Starting H2O JVM and connecting:
#> Warning in file(con, "r"): cannot open file
#> 'C:\Users\siava\AppData\Local\Temp\Rtmp08HkaV\filec5072151a2b/h2o_siava_started_from_r.out':
#> Permission denied
#> Error in file(con, "r"): cannot open the connection
h2o::h2o.connect()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         3 seconds 213 milliseconds 
#>     H2O cluster timezone:       America/New_York 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.46.0.5 
#>     H2O cluster version age:    16 days 
#>     H2O cluster name:           H2O_started_from_R_siava_ijf963 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   14.14 GB 
#>     H2O cluster total cores:    32 
#>     H2O cluster allowed cores:  32 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.4.1 (2024-06-14 ucrt)

h2o::h2o.getConnection()
#> IP Address: localhost 
#> Port      : 54321 
#> Name      : NA 
#> Session ID: _sid_8c41 
#> Key Count : 0
agua::h2o_running(verbose = TRUE)
#> [1] TRUE

agua::h2o_start()
#> Warning: JAVA not found, H2O may take minutes trying to connect.

agua::h2o_remove_all()
agua::h2o_end()

agua::h2o_start()
#> Warning: JAVA not found, H2O may take minutes trying to connect.
#> Error in .h2o.__checkConnectionHealth(): H2O connection has been severed. Cannot connect to instance at http://localhost:54321/
#> Failed to connect to localhost port 54321 after 2246 ms: Couldn't connect to server
agua::h2o_start()
#> Warning: JAVA not found, H2O may take minutes trying to connect.
#> Warning in file(con, "r"): cannot open file
#> 'C:\Users\siava\AppData\Local\Temp\Rtmp08HkaV\filec503d932872/h2o_siava_started_from_r.out':
#> Permission denied
#> Error in file(con, "r"): cannot open the connection
agua::h2o_start()
#> Warning: JAVA not found, H2O may take minutes trying to connect.

h2o::h2o.getConnection()
#> IP Address: localhost 
#> Port      : 54321 
#> Name      : NA 
#> Session ID: _sid_92dd 
#> Key Count : 0
agua::h2o_running(verbose = TRUE)
#> [1] TRUE

agua::h2o_remove_all()
agua::h2o_end()
@simonpcouch
Copy link
Collaborator

Looks like the "JAVA not found" error from agua::h2o_start() is coming from a call to Sys.which("java"). What output do you see in that case? Do you see different output in a terminal with which java? I might focus on making sure the system knows where to find java using which as my first debugging route.

Also, some diagnostic messages read:

h2o::h2o.init()
#> Note:  In case of errors look at the following log files:
#>     C:\Users\siava\AppData\Local\Temp\Rtmp08HkaV\filec5072151a2b/h2o_siava_started_from_r.out
#>     C:\Users\siava\AppData\Local\Temp\Rtmp08HkaV\filec50cc92881/h2o_siava_started_from_r.err

Could you re-run h2o::h2o.init(), note the log files shown in the output, attempt to run your tuning code, and then paste the log file contents here?

@siavash-babaei
Copy link
Author

Plenty Appreciated ...

Sys.which("java") prints out C:\\PROGRA~2\\COMMON~1\\Oracle\\Java\\JAVA8P~2\\java.exe which is in Program Files (x86). It is different from the actual JAVA_HOME at C:\\Program Files\\Java\\jre1.8.0_421.

system("java -version") outputs the correct Java (I have no other installation):

java version "1.8.0_421"
Java(TM) SE Runtime Environment (build 1.8.0_421-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.421-b09, mixed mode)
[1] 0

h2o_siava_started_from_r.err

2024-09-15 22:56:23.287:INFO::main: Logging initialized @477ms to org.eclipse.jetty.util.log.StdErrLog

h2o_siava_started_from_r.out

22:56:23.069 [main] WARN  hex.tree.xgboost.XGBoostExtension - Cannot initialize XGBoost backend! Xgboost (enabled GPUs) needs: 
  - CUDA 8.0
XGboost (minimal version) needs: 
  - GCC 4.7+
Apple silicon is not supported yet
For more details, run in debug mode: `java -Dlog4j.configuration=file:///tmp/log4j.properties -jar h2o.jar`

09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: ----- H2O started  -----
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Build git branch: rel-3.46.0
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Build git hash: 62c1c2579b6eb97e8813659a5e7b919af8ae0c5f
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Build git describe: jenkins-3.46.0.4-21-g62c1c2579b6
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Build project version: 3.46.0.5
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Build age: 17 days
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Built by: 'jenkins'
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Built on: '2024-08-29 11:51:54'
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Found H2O Core extensions: [KrbStandalone, Infogram]
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Processed H2O arguments: [-name, H2O_started_from_R_siava_igg698, -ip, localhost, -web_ip, localhost, -port, 54321, -ice_root, C:/Users/siava/AppData/Local/Temp/Rtmp6lYapz, -allow_unsupported_java]
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Java availableProcessors: 32
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Java heap totalMemory: 15.33 GB
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Java heap maxMemory: 49.78 GB
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Java version: Java 1.8.0_421 (from Oracle Corporation)
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: JVM launch parameters: [-Xms16g, -Xmx56g, -ea]
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: JVM process id: 10040@SB-HOME-WIN
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: OS version: Windows 11 10.0 (amd64)
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Machine physical memory: 63.71 GB
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Machine locale: en_CA
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: X-h2o-cluster-id: 1726455382903
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: User name: 'siava'
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: IPv6 stack selected: false
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:0:0:1
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net0 (Microsoft 6to4 Adapter)
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net1 (Microsoft IP-HTTPS Platform Adapter)
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net2 (WAN Miniport (L2TP))
09-15 22:56:23.577 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth0 (Microsoft Kernel Debug Network Adapter)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net3 (WAN Miniport (SSTP))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan0 (Microsoft Wi-Fi Direct Virtual Adapter #2)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:ppp0 (WAN Miniport (PPPOE))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan1 (Microsoft Wi-Fi Direct Virtual Adapter)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth1 (Realtek Gaming 2.5GbE Family Controller)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net4 (Microsoft Teredo Tunneling Adapter)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth2 (WAN Miniport (IPv6))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth3 (WAN Miniport (Network Monitor))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: wlan2 (Intel(R) Wi-Fi 6E AX211 160MHz), 10.0.0.42
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: wlan2 (Intel(R) Wi-Fi 6E AX211 160MHz), 2607:fea8:55df:a1d0:0:0:0:2a02
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: wlan2 (Intel(R) Wi-Fi 6E AX211 160MHz), 2607:fea8:55df:a1d0:b401:575e:a521:46be
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: wlan2 (Intel(R) Wi-Fi 6E AX211 160MHz), 2607:fea8:55df:a1d0:69e9:33ca:ab6d:897e
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Possible IP Address: wlan2 (Intel(R) Wi-Fi 6E AX211 160MHz), fe80:0:0:0:b61d:f279:5063:6ab7%wlan2
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net5 (WAN Miniport (IKEv2))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:net6 (WAN Miniport (PPTP))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth4 (Bluetooth Device (Personal Area Network))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth5 (WAN Miniport (IP))
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth6 (Realtek Gaming 2.5GbE Family Controller-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth7 (Realtek Gaming 2.5GbE Family Controller-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth8 (Realtek Gaming 2.5GbE Family Controller-WFP 802.3 MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan3 (Intel(R) Wi-Fi 6E AX211 160MHz-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan4 (Intel(R) Wi-Fi 6E AX211 160MHz-Virtual WiFi Filter Driver-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan5 (Intel(R) Wi-Fi 6E AX211 160MHz-Native WiFi Filter Driver-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan6 (Intel(R) Wi-Fi 6E AX211 160MHz-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan7 (Intel(R) Wi-Fi 6E AX211 160MHz-WFP 802.3 MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan8 (Microsoft Wi-Fi Direct Virtual Adapter-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan9 (Microsoft Wi-Fi Direct Virtual Adapter-Native WiFi Filter Driver-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan10 (Microsoft Wi-Fi Direct Virtual Adapter-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan11 (Microsoft Wi-Fi Direct Virtual Adapter-WFP 802.3 MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan12 (Microsoft Wi-Fi Direct Virtual Adapter #2-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan13 (Microsoft Wi-Fi Direct Virtual Adapter #2-Native WiFi Filter Driver-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan14 (Microsoft Wi-Fi Direct Virtual Adapter #2-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:wlan15 (Microsoft Wi-Fi Direct Virtual Adapter #2-WFP 802.3 MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth9 (WAN Miniport (IP)-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth10 (WAN Miniport (IP)-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth11 (WAN Miniport (IPv6)-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth12 (WAN Miniport (IPv6)-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth13 (WAN Miniport (Network Monitor)-WFP Native MAC Layer LightWeight Filter-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Network interface is down: name:eth14 (WAN Miniport (Network Monitor)-QoS Packet Scheduler-0000)
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: H2O node running in unencrypted mode.
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Internal communication uses port: 54322
09-15 22:56:23.593 127.0.0.1:54321       10040        main  INFO water.default: Listening for HTTP and REST traffic on http://127.0.0.1:54321/
09-15 22:56:23.671 127.0.0.1:54321       10040        main  INFO water.default: H2O cloud name: 'H2O_started_from_R_siava_igg698' on localhost/127.0.0.1:54321, discovery address /231.206.248.127:59342
09-15 22:56:23.671 127.0.0.1:54321       10040        main  INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
09-15 22:56:23.671 127.0.0.1:54321       10040        main  INFO water.default:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 [email protected]'
09-15 22:56:23.671 127.0.0.1:54321       10040        main  INFO water.default:   2. Point your browser to http://localhost:55555
09-15 22:56:24.255 127.0.0.1:54321       10040        main  WARN org.apache.hadoop.util.Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
	at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:600) ~[h2o.jar:?]
	at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:621) ~[h2o.jar:?]
	at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:644) ~[h2o.jar:?]
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:741) [h2o.jar:?]
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79) [h2o.jar:?]
	at org.apache.hadoop.conf.Configuration.getStringCollection(Configuration.java:2268) [h2o.jar:?]
	at water.persist.PersistHdfs.injectS3ASecurityCredentialProviders(PersistHdfs.java:94) [h2o.jar:?]
	at water.persist.PersistHdfs.<clinit>(PersistHdfs.java:80) [h2o.jar:?]
	at hex.security.KerberosExtension.onLocalNodeStarted(KerberosExtension.java:60) [h2o.jar:?]
	at water.H2O.main(H2O.java:2446) [h2o.jar:?]
	at water.H2OStarter.start(H2OStarter.java:22) [h2o.jar:?]
	at water.H2OStarter.start(H2OStarter.java:51) [h2o.jar:?]
	at water.H2OApp.main(H2OApp.java:12) [h2o.jar:?]
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
	at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:520) ~[h2o.jar:?]
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:491) ~[h2o.jar:?]
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:568) ~[h2o.jar:?]
	... 9 more
09-15 22:56:24.255 127.0.0.1:54321       10040        main  INFO water.default: Kerberos not configured
09-15 22:56:24.255 127.0.0.1:54321       10040        main  INFO water.default: Log dir: 'C:\Users\siava\AppData\Local\Temp\Rtmp6lYapz\h2ologs'
09-15 22:56:24.255 127.0.0.1:54321       10040        main  INFO water.default: Cur dir: 'C:\Users\siava\OneDrive\Workspace\Repositories\Projects\Learning\R\Practical Neural Networks'
09-15 22:56:24.270 127.0.0.1:54321       10040        main  INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
09-15 22:56:24.270 127.0.0.1:54321       10040        main  INFO water.default: HDFS subsystem successfully initialized
09-15 22:56:24.270 127.0.0.1:54321       10040        main  INFO water.default: S3 subsystem successfully initialized
09-15 22:56:24.283 127.0.0.1:54321       10040        main  INFO water.default: GCS subsystem successfully initialized
09-15 22:56:24.283 127.0.0.1:54321       10040        main  INFO water.default: Drive subsystem not available
09-15 22:56:24.283 127.0.0.1:54321       10040        main  INFO water.default: Flow dir: 'C:/Users/siava/h2oflows'
09-15 22:56:24.343 127.0.0.1:54321       10040        main  INFO water.default: Cloud of size 1 formed [localhost/127.0.0.1:54321]
09-15 22:56:24.343 127.0.0.1:54321       10040        main  INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
09-15 22:56:24.343 127.0.0.1:54321       10040        main  INFO water.default: KrbStandalone extension initialized
09-15 22:56:24.343 127.0.0.1:54321       10040        main  INFO water.default: Infogram extension initialized
09-15 22:56:24.343 127.0.0.1:54321       10040        main  INFO water.default: Registered 2 core extensions in: 125ms
09-15 22:56:24.343 127.0.0.1:54321       10040        main  INFO water.default: Registered H2O core extensions: [KrbStandalone, Infogram]
09-15 22:56:24.490 127.0.0.1:54321       10040        main  INFO water.default: Registered: 277 REST APIs in: 147ms
09-15 22:56:24.490 127.0.0.1:54321       10040        main  INFO water.default: Registered REST API extensions: [Mojo 2 pipeline extensions, Amazon S3, Algos, Infogram, AutoML, Core V3, TargetEncoder, Core V4]
09-15 22:56:24.568 127.0.0.1:54321       10040        main  INFO water.default: Registered: 332 schemas in 78ms
09-15 22:56:24.568 127.0.0.1:54321       10040        main  INFO water.default: H2O started in 1662ms
09-15 22:56:24.568 127.0.0.1:54321       10040        main  INFO water.default: 
09-15 22:56:24.568 127.0.0.1:54321       10040        main  INFO water.default: Open H2O Flow in your web browser: http://localhost:54321
09-15 22:56:24.568 127.0.0.1:54321       10040        main  INFO water.default: 
09-15 22:56:24.615 127.0.0.1:54321       10040  8857875-42  INFO water.default: GET /, parms: {}
09-15 22:56:25.085 127.0.0.1:54321       10040  8857875-27  INFO water.default: GET /, parms: {}
09-15 22:56:55.223 127.0.0.1:54321       10040  8857875-29  INFO water.default: GET /, parms: {}
09-15 22:56:55.658 127.0.0.1:54321       10040  8857875-31  INFO water.default: GET /, parms: {}
09-15 22:56:56.281 127.0.0.1:54321       10040  8857875-34  INFO water.default: GET /3/Capabilities/API, parms: {}
09-15 22:56:56.281 127.0.0.1:54321       10040  8857875-34  INFO water.default: Locking cloud to new members, because water.api.schemas3.CapabilitiesV3
09-15 22:56:56.923 127.0.0.1:54321       10040  8857875-25  INFO water.default: GET /3/InitID, parms: {}
09-15 22:56:57.352 127.0.0.1:54321       10040  8857875-42  INFO water.default: POST /3/SessionProperties, parms: {key=job.fetch_timeout_ms, value=1000, session_key=_sid_b3b1}

@tomasfryda
Copy link
Collaborator

This might be caused by an h2o-3 bug (h2oai/h2o-3#16360). Unfortunately, we don't have automated test for Windows and the bug is caused by trying to read a file from R that is opened by the java process. (This works on macOS and linux but not on Windows.)

If that is the case, you could downgrade h2o-3 to 3.44.0.x or wait for next fix release (I already have a PR fixing that: h2oai/h2o-3#16369).

@siavash-babaei
Copy link
Author

siavash-babaei commented Sep 18, 2024

I switched to a linux machine and tested my own code. I still get the exact same problem. I decided to test leaving out the bits of code one-by-one. Anyway, in short. Removing the following line from .Rprofile fixed it:

cluster <- parallel::makeCluster(...)

The fact that I have to remove parallelization capabilities as whole is not good (say, parallel::stopCluster for just a section and then restarting won't do the trick), this seems to fix it for the moment.

There is nothing odd in the log file, though on the Linux system, I noticed a warning:

parallel::makeCluster(spec = 32)
Warning: program compiled against libxml 210 using older 209
Warning: program compiled against libxml 210 using older 209
...

Then again, it would be odd if both Ubuntu and Windows used libxml 209!!! So I am guessing whatever it is, it has to do with parallel and future, or rather with for.each! very puzzling ...

NOTE 1: future::plan(...) will have the same issue.
NOTE 2: At least on my machine, the latest package in Ubuntu 24 repo is 209. I could not get any of 210s to install.

@simonpcouch
Copy link
Collaborator

Ah, that makes sense that removing cluster <- parallel::makeCluster(...) addressed the issue.

agua_backend_options(parallelism = 5) tells agua to use H2O's within-model parallelism, distributing computations for a single model across 5 cores. cluster <- parallel::makeCluster(...) tells tidymodels to use its across-model parallelism, fitting n_models/n_cores models per core across n_cores cores. Setting both of these is an example of "nested parallelism," and is unsupported by agua, as connections to H2O can't be serialized and shipped off to cluster workers. So, either agua_backend_options(parallelism = 5) or cluster <- parallel::makeCluster(...) will work, but not both at the same time.

@simonpcouch
Copy link
Collaborator

In any case, the error message here should be more informative!

@siavash-babaei
Copy link
Author

siavash-babaei commented Sep 19, 2024

Ah well, the problem persists in Windows. I guess I have to resolve to using linux for this. In Windows, you got to remove doParallel::registerDoParallel() as well, which you do not need to do in linux. I think that will remove parallel processing from H2O as well (kindly correct me if I am wrong).

@tomasfryda
Copy link
Collaborator

@siavash-babaei H2O-3 is written for big data and is using MapReduce under the hood so it should use parallel processing if you don't disable it. IIRC agua_backend_options(parallelism = 1) makes training of the models sequential, i.e., one model after another, but each model is trained using MapReduce so it should use all resources available.

H2O runs within JVM with the exception of XGBoost so if you train XGBoost models you should not set H2O to use all memory (that's the memory that will be allocated for JVM heap) so we have a recommendation to set only 2/3 of RAM for H2O.

Model training is usually more memory demanding than inference so if you train on bigger data it's wise to do it sequentially since it will make it much less likely to run out of memory. If you train on smaller data, it's possible to use parallelism > 2 and benefit from it.

If the problem persists on Windows, you can use older H2O version (e.g. the one on CRAN) and that might help with the access denied bug.

@siavash-babaei
Copy link
Author

siavash-babaei commented Sep 19, 2024

Solved. It doesn't have anything to do with R/H2O version, Java, or Linux/Windows. It was right there but strangely I didn't see the obvious: You need to have a cluster object and every node needs to be initialized. The future package doesn't have a function to return the cluster object nor the ability to initialize nodes properly with complex code, but the parallel package does.

To use the model parallelism for tidymodels and data parallelism for h2o and agua, you need to mix-up parallel, future, and doFuture packages. The parallel packages is used to detect cores, make a cluster, and then load-up and initialize every node. The future and doFuture packages are used to plan a future and register the for.each backend, which is required to enable model-parallelism in tidymodels. Tidymodels specifically uses the future package while agua can work with future, parallel, or mc backends (please correct if wrong).

# H2O for machine learning
library(h2o)

# Initialize local H2O server and start
local_h2o <- h2o::h2o.init(startH2O = TRUE)
local_h2o |> print()

# The agua package provides tidymodels interface to the H2O platform and the
# H2O R package
library(agua)

# Adaptive Parallelism - Decided by H2O
h2o_thread_spec <- agua::agua_backend_options(parallelism = 0)

# To be used when using grid search, racing, or any of the iterative search
# methods in tidymodels.
grid_ctrl <- tune::control_grid(
  allow_par = TRUE,
  save_pred = TRUE,
  save_workflow = TRUE,
  event_level = "second",
  # chooses between "resamples" and "everything" automatically
  parallel_over = NULL,
  backend_options = h2o_thread_spec
)

# Parallel computation libraries
library(future)
library(doFuture)

# Speed up computation with parallel processing (optional)
n_cores <- parallel::detectCores(logical = TRUE) 
cluster <- parallel::makeCluster(spec = n_cores)
doFuture::registerDoFuture()
strategy <- future::plan(strategy = future::cluster, workers = cluster)

# load, initialize, and check status of h2o on a cluster node
node_h2o_init <- function() {
  library(h2o)
  library(agua)

  # doesn't start a new server if you've already started one
  cluster_init <- h2o.init(startH2O = FALSE)

  c(cluster_init = cluster_init, cluster_status = h2o.clusterIsUp())
}

# initialize and check each cluster node
cluster |>
  parallel::clusterCall(cl = _, fun = node_h2o_init) |>
  purrr::list_c()

@siavash-babaei
Copy link
Author

For me, Adding tidymodel parallelism like above (similar to ) does in fact reduce a tune_grid op from around 100 mins to about 20 mins. A pretty significant gain. Thank you

@qiushiyan
Copy link
Collaborator

Thanks @siavash-babaei . We should consider adding a vignette for this, probably combined with #41

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants