Improve parallel read #74
base: main
cf686d3 to 3efb63f
I now have this running in production without any issues.
7bde9b5 to fe367ed
cf4880a to 57bf417
…d a Worker node failed
add basic unit testing for FeedWorkerProcess logic
add unit test for when command queue is full
57bf417 to 0c5c5ad
What's the actual problem here? That the reads run as Python code in threads and therefore run into the GIL? I always thought that because we "run everything as subprocess" we never hit that problem. This feels like a lot of complexity and I don't really see the gain here. Any chance you could make that gain clearer to me?
@jankatins the problem I was trying to solve is that, when running a parallel task, the commands for the internal sub pipelines need to be evaluated before the pipeline starts working. I had a file bucket with millions of files to process. In my case, the pipeline became so big that it was unable to start, probably because of memory consumption, or because the job was still reading the bucket's complete file list after more than an hour. This PR changes the parallel task behavior by moving the sub pipeline generation into a separate feed worker task. This PR is complex and I am not 100% sure it should be part of mara. It is a first attempt at implementing file-based micro-batch streaming via mara. I realized that it might not have been the best idea💡 I have had in recent years 😉
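To illustrate the idea described in the comment above, here is a minimal sketch of a feeder that generates commands lazily and hands them to workers through a bounded queue, so the full command list never has to exist in memory before work starts. This is not the code from this PR and not mara's API: it uses only Python's standard `queue` and `threading` modules, it uses threads rather than processes to stay short, and the names `feed`, `work`, `run_parallel`, and `list_files` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

# Sentinel telling a worker that no more commands will arrive.
_DONE = object()


def feed(commands: Iterable[str], command_queue: queue.Queue, n_workers: int) -> None:
    """Push commands one at a time; put() blocks while the queue is full."""
    for command in commands:
        command_queue.put(command)
    # One sentinel per worker so that every worker terminates.
    for _ in range(n_workers):
        command_queue.put(_DONE)


def work(command_queue: queue.Queue, handle: Callable[[str], None]) -> None:
    """Consume commands until the sentinel is seen."""
    while True:
        command = command_queue.get()
        if command is _DONE:
            break
        handle(command)


def run_parallel(commands: Iterable[str], handle: Callable[[str], None],
                 n_workers: int = 4, max_queued: int = 8) -> None:
    """Start the workers first, then feed them lazily through a bounded queue."""
    command_queue: queue.Queue = queue.Queue(maxsize=max_queued)
    workers = [threading.Thread(target=work, args=(command_queue, handle))
               for _ in range(n_workers)]
    for worker in workers:
        worker.start()
    feeder = threading.Thread(target=feed, args=(commands, command_queue, n_workers))
    feeder.start()
    feeder.join()
    for worker in workers:
        worker.join()


def list_files(n: int) -> Iterator[str]:
    """Hypothetical stand-in for lazily listing the files in a bucket."""
    for i in range(n):
        yield f"file_{i:07d}.csv"


if __name__ == "__main__":
    processed = []
    run_parallel(list_files(1000), processed.append)
    print(len(processed))
```

Because the queue is bounded, at most `max_queued` commands are pending at any time, regardless of how many files the listing eventually yields. The sketch leaves out error handling: if `handle` raises, that worker exits, and the feeder can block forever on a full queue, which is the kind of case the unit tests mentioned in the commit messages above (failed worker, full command queue) appear to target.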
See #75