What happened?
Currently, the ServiceBackend's implementation of collect distributed array submits a job group of worker jobs (one per partition) and waits for the job group to complete before reading the workers' results. For small analyses this is fine, but when a query has tens of thousands of partitions, scheduling and completing all of the worker jobs takes time, and reading back those results on the driver can become a bottleneck. Below is one possible solution to this problem:
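For illustration, here is a minimal sketch of the blocking flow described above. Every name in it (`BatchClient`, `submitJobGroup`, `waitForJobGroup`, `readPartitionResult`) is an illustrative stand-in, not the actual ServiceBackend or BatchClient API:

```scala
object BlockingCollect {
  // Illustrative stand-ins, not the real ServiceBackend/BatchClient API.
  final case class JobSpec(partition: Int)
  final case class JobGroupId(value: Long)

  trait BatchClient {
    def submitJobGroup(jobs: IndexedSeq[JobSpec]): JobGroupId
    def waitForJobGroup(id: JobGroupId): Unit
  }

  def collectDArray(
      nPartitions: Int,
      client: BatchClient,
      readPartitionResult: (JobGroupId, Int) => Array[Byte]
  ): IndexedSeq[Array[Byte]] = {
    // Submit one worker job per partition as a single job group.
    val group = client.submitJobGroup((0 until nPartitions).map(JobSpec(_)))

    // Block until *every* job in the group has completed...
    client.waitForJobGroup(group)

    // ...and only then read the results back on the driver, so result I/O
    // cannot overlap with job execution and is serialized behind the
    // slowest job in the group.
    (0 until nPartitions).map(i => readPartitionResult(group, i))
  }
}
```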
Expose log for job completions in a job group
The Query Driver should attempt to read worker job results while the stage is running. To do this, it needs the Batch API to provide an append-only log of completed jobs in a job group that the Query Driver can consume, instead of issuing O(jobs) job-status requests during each stage. This may already be possible with the current database schema; at worst, it can be achieved by adding an indexed column on jobs that records the position at which each job completed in its job group.
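As one hypothetical shape for that log (the entry fields and interface below are assumptions for illustration, not a design the issue commits to), each entry would record a job and the position it completed at, so a consumer can resume from any offset:

```scala
// Hypothetical shape of the append-only completion log. The Batch
// database would assign completionIndex monotonically within a job
// group (e.g. via the indexed column described above), so a consumer
// can page through completions from any offset without missing entries.
final case class JobCompletion(
  jobId: Long,
  completionIndex: Long, // position in the job group's completion order
  state: String          // e.g. "Success" or "Failed"
)

trait CompletionLog {
  // Return completions with completionIndex >= fromIndex, in order.
  def readFrom(fromIndex: Long): IndexedSeq[JobCompletion]
}
```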
Completion of this feature would require:
Carefully evaluating the Batch data model to determine whether any database changes are necessary to construct an append-only log of job completions in a job group from the state of the database
If changes are needed, designing and implementing a Batch front-end API endpoint to query the log
(Separately) Adding support for streaming the log in the Scala BatchClient and using it to read partition results before the job group completes (a possible consumption loop is sketched below)
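For the last item, the driver-side loop might look something like the following sketch. It reuses the hypothetical `CompletionLog` interface above; the polling cadence, the assumption that job i computes partition i, and `readPartitionResult` are all illustrative, not the real BatchClient:

```scala
object StreamingCollect {
  // Hypothetical driver loop that overlaps result reads with job
  // execution by consuming the completion log sketched above.
  def collectStreaming(
      nPartitions: Int,
      log: CompletionLog,
      readPartitionResult: Long => Array[Byte]
  ): IndexedSeq[Array[Byte]] = {
    val results = new Array[Array[Byte]](nPartitions)
    var nextIndex = 0L
    while (nextIndex < nPartitions) {
      val batch = log.readFrom(nextIndex)
      if (batch.isEmpty)
        Thread.sleep(500) // back off while no new completions are visible
      else
        batch.foreach { c =>
          // Read each partition's result as soon as its job completes
          // (assumes job i computes partition i; failure handling omitted).
          results(c.jobId.toInt) = readPartitionResult(c.jobId)
          nextIndex = c.completionIndex + 1
        }
    }
    results.toIndexedSeq
  }
}
```

The key property is that result reads begin as soon as the first jobs finish, rather than after the entire job group completes, and the driver makes one log request per page of completions instead of one status request per job.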
Version
0.2.132
Relevant log output
No response