MPI depends entirely on each proc executing the same sequence of AllGather etc. calls at the same times. If any node doesn't, everything just waits and then probably times out with an error. When running a fixed -nogui run, there is no problem here.
But when running interactively, each node needs to get the user's commands to start, stop, step, Init, etc., so they can all stay in sync. Thus, we need an additional outer loop of communication where the proc > 0 nodes wait for commands and then run them, all the while checking to see if a stop command has come in.
Probably this should be done using something other than MPI, because it needs to be non-blocking and more dynamic. Someone with appropriate network communication knowledge should probably take this on.
I wouldn't use a different protocol, mostly because if we just use MPI for everything we only have to do the MPI_World setup once. With a different protocol it'll get complicated once we have cross-machine MPI with ssh setups etc.
Can't we just put an MPI.BCast from the root node (where the GUI runs) to all other procs into the GUI loop, so that it tells the other procs about the current user input (start, stop, etc.)? It should really be blocking, else you'll run into the same issues with timed-out AllReduces. A blocking broadcast will add roughly 10μs of latency, which is small enough not to be noticeable.
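To make the proposal concrete, here is a minimal sketch of the command loop in Python. It does not use real MPI: `FakeComm` is a hypothetical thread-based stand-in whose `bcast` only mimics a blocking broadcast from root, and `command_loop`, `run_demo` and the command strings are likewise made up for illustration. In the real code the broadcast would go through whatever MPI bindings the project uses.

```python
import queue
import threading


class FakeComm:
    """Thread-based stand-in for an MPI communicator (an assumption, not real MPI)."""

    def __init__(self, size):
        self.size = size
        # One inbox per rank; root pushes each broadcast value into all the others.
        self._inboxes = [queue.Queue() for _ in range(size)]

    def bcast(self, rank, value=None, root=0):
        """Blocking broadcast: root sends, every other rank waits for the value."""
        if rank == root:
            for r in range(self.size):
                if r != root:
                    self._inboxes[r].put(value)
            return value
        return self._inboxes[rank].get()  # blocks until root broadcasts


def command_loop(comm, rank, gui_commands, log):
    """Outer loop run by every rank; only rank 0 sees the (hypothetical) GUI input."""
    pending = iter(gui_commands) if rank == 0 else None
    while True:
        cmd = next(pending) if rank == 0 else None
        cmd = comm.bcast(rank, cmd, root=0)  # all ranks now agree on cmd
        log[rank].append(cmd)
        if cmd == "quit":
            break
        # Here every rank would run the same sequence of collective calls
        # (AllGather, AllReduce, ...) for this command.


def run_demo(size=4, gui_commands=("init", "step", "step", "stop", "quit")):
    """Run one thread per simulated rank and return the commands each rank saw."""
    comm = FakeComm(size)
    log = [[] for _ in range(size)]
    threads = [
        threading.Thread(target=command_loop, args=(comm, r, gui_commands, log))
        for r in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because every rank takes its next command from the same broadcast, all ranks execute the same commands in the same order, which is what keeps the later collective calls matched up. The sketch only handles commands between runs; noticing a stop in the middle of a long run would need a similar broadcast of a stop flag at each step.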