Task monitor daemon process may limit scalability #194
Labels
priority: medium
Medium priority
type: maintenance
Related to general repository maintenance
workload: weeks
Likely takes weeks to resolve
Is your feature request related to a problem? Please describe.
Currently, updating the run status in the database involves sending a Celery signal that is picked up by a single task monitor daemon process spawned by the main application. Status updates may be numerous when many workflow runs are managed in parallel, and they may also contain long log messages, so this architecture may become a serious bottleneck when scaling up run throughput.
Describe the solution you'd like
To improve scalability, status updates could be handled by worker processes instead. A status update could be posted to the broker queue and picked up by a worker rather than the task monitor in order to update the database. To ensure that ongoing workflow runs do not block status updates (effectively causing the service to be stuck indefinitely), a dedicated worker pool of at least size would need to be set aside for this purpose.
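To make the proposal more concrete, here is a minimal sketch of the idea using only the Python standard library. It is not the service's actual code: a `queue.Queue` stands in for the broker queue, a dict stands in for the database, and all names (`post_status_update`, `status_worker`, `start_status_pool`, `stop_status_pool`, `demo`) are hypothetical. In a real deployment the dedicated pool would presumably be a set of Celery workers bound to their own queue.

```python
import queue
import threading

# Assumption: a dict guarded by a lock plays the role of the database,
# and an in-process queue plays the role of the broker queue.
run_status = {}
db_lock = threading.Lock()
status_queue = queue.Queue()
STOP = object()  # sentinel telling a status worker to shut down


def post_status_update(run_id, state, log=""):
    """Called by a workflow worker: enqueue the update instead of signalling a monitor."""
    status_queue.put((run_id, state, log))


def status_worker():
    """Dedicated worker: takes status updates off the queue and writes them to the 'database'."""
    while True:
        item = status_queue.get()
        try:
            if item is STOP:
                return
            run_id, state, log = item
            with db_lock:
                record = run_status.setdefault(run_id, {"state": None, "log": []})
                record["state"] = state
                if log:
                    record["log"].append(log)
        finally:
            status_queue.task_done()


def start_status_pool(size):
    """Start a pool reserved for status updates, separate from workflow run workers."""
    workers = [threading.Thread(target=status_worker, daemon=True) for _ in range(size)]
    for worker in workers:
        worker.start()
    return workers


def stop_status_pool(workers):
    """Send one sentinel per worker, then wait for all of them to exit."""
    for _ in workers:
        status_queue.put(STOP)
    for worker in workers:
        worker.join()


def demo():
    workers = start_status_pool(size=2)
    for i in range(5):
        post_status_update(f"run-{i}", "RUNNING", "started")
    status_queue.join()  # block until every queued update has been written
    stop_status_pool(workers)
    return dict(run_status)
```

Because the status pool never executes workflow runs, long-running workflows cannot starve it. One caveat of this sketch: with more than one status worker, two updates for the same run may be applied out of order, so a real implementation would likely need per-run ordering (for example, a sequence number or timestamp check before writing).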
Describe alternatives you've considered
As an alternative to setting aside a dedicated worker pool for status updates, status updates could also be handled directly by the worker processes that are already handling the workflow runs.
Additional context
It is important that the chosen solution be conceptually compatible with a future callback mechanism for status updates (see #57, ga4gh/task-execution-schemas#121, ga4gh/workflow-execution-service-schemas#133 & ga4gh/cloud-interop-testing#98 (comment)).