Description
The context of this proposal is this synchronisation issue.
The main problem with logging in parallelized operations is simply this: requests are
posted directly to an MLflow service without full information about the state of the service
at the time the request is ultimately acted on. I propose we resolve this as follows:
- Instead of posting requests directly to an MLflow service, a client posts (`put!`s)
  them to a first-in-first-out queue (a Julia `Channel`). Requesting calls return
  immediately unless the queue is full, so the performance of the parallel workload
  is not impacted.
- A single Julia `Task` dispatches requests (`take!`s them) from the front of the queue.
  Whenever a request can alter the service state (e.g., creating an experiment), the
  dispatcher waits for confirmation that the state change is complete before
  dispatching the next request.
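
To make the two points above concrete, here is a minimal sketch of the queue-plus-dispatcher pattern. It is not the POC and not part of MLFlowClient.jl: the names `Request`, `dispatch`, `confirm`, and `service_log` are hypothetical, and a plain vector stands in for the MLflow service. Only `Channel`, `put!`, `close`, `wait`, and `Threads.@spawn` are real Julia Base functionality.

```julia
# Hypothetical request type (an assumption for illustration only).
struct Request
    action::Function   # stand-in for the call that posts to the MLflow service
    mutates::Bool      # true if the request may alter service state
end

# Drain `queue` in FIFO order. Iterating a Channel calls `take!` repeatedly
# and stops once the channel is closed and empty.
function dispatch(queue::Channel{Request}; confirm = _ -> nothing)
    for request in queue
        result = request.action()
        # For state-changing requests, block until the change is confirmed
        # before taking the next request. `confirm` is a hypothetical hook.
        request.mutates && confirm(result)
    end
end

queue = Channel{Request}(32)                  # bounded buffer: put! blocks only when full
dispatcher = Threads.@spawn dispatch(queue)   # the single dispatching Task

service_log = String[]                        # stand-in for the service state

put!(queue, Request(() -> push!(service_log, "create experiment"), true))
put!(queue, Request(() -> push!(service_log, "log metric"), false))

close(queue)       # no further requests; buffered ones are still dispatched
wait(dispatcher)   # returns once the queue has been drained
```

Because only one task ever touches the service, requests reach it in the order they were queued, regardless of how many workers call `put!` concurrently.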
I imagine that we can insert the queue (buffer) without breaking the user-facing
interface of MLFlowClient.jl.
I have implemented a POC for this proposal and shared it with two maintainers, and can share with anyone else interested.