PSA: Improved graph construction time for calls including from_delayed with dask-expr>=1.1.13 #52

Open
hendrikmakait opened this issue Sep 5, 2024 · 8 comments

Comments

@hendrikmakait

During today's Dask monthly meeting, people mentioned exceedingly long graph construction times related to delayed objects. I see that you are using from_delayed here. Please note that dask-expr>=1.1.13 includes a fix that improves graph construction time for large collections of delayed objects in dd.from_delayed by orders of magnitude (dask/dask-expr#1132).

I don't know your specific problem, so I'm not sure if this will fix it, but I wanted to point it out nonetheless in case it helps.
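
For context, here is a minimal sketch of the pattern that the fix targets, building a Dask DataFrame from a large list of delayed partitions. The loader and schema below are made up for illustration:

```python
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def load_partition(i):
    # Stand-in for whatever actually reads one partition of the dataset.
    return pd.DataFrame({"id": [i], "value": [float(i)]})

# Building this collection is the graph-construction step that
# dask-expr>=1.1.13 speeds up for large numbers of delayed objects.
parts = [load_partition(i) for i in range(5_000)]
meta = pd.DataFrame({"id": pd.Series(dtype="int64"),
                     "value": pd.Series(dtype="float64")})
ddf = dd.from_delayed(parts, meta=meta)
```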

@hombit
Collaborator

hombit commented Sep 5, 2024

@hendrikmakait Thank you, I can confirm that the scheduling time (the time between when I hit .compute() and when tasks start to stream) is significantly smaller after updating dask-expr from 1.1.11 to 1.1.13. For a simple pipeline with 5k partitions it went from 156s to 12s, and for 40k partitions it went from >3 hours to 12 minutes.
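
For reference, one rough way to isolate the client-side part of that delay is to materialize the task graph without running it; `ddf` below is a placeholder for the collection being computed, and this only captures a lower bound on the full scheduling time:

```python
import time

start = time.perf_counter()
# Force graph materialization (construction and lowering) without executing any tasks.
graph = ddf.__dask_graph__()
print(f"{len(graph)} tasks materialized in {time.perf_counter() - start:.1f}s")
```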

@hendrikmakait
Author

That's great to hear!

Please let us know if you notice any other performance bottlenecks. 12 minutes may still be somewhat long, depending on the total number of tasks.

@hombit
Collaborator

hombit commented Sep 9, 2024

@hendrikmakait 12 minutes is a significant amount of time; the "run time" was ~90 minutes on 60 3-thread workers. You can find more details in this notebook:
https://github.com/lincc-frameworks/notebooks_lf/blob/main/ztf_periodogram/SIMPLIFIED-ztf_periodogram_PSC.ipynb

It is also worth mentioning that the manager node used ~90GB of memory. The task graph had ~0.5M tasks.

When I ran a more complex notebook, the "lazy cells" (where I plan Dask computations before .compute()) took a dozen minutes. That is quite long, and we expect our users to implement analyses that are an order of magnitude more complicated.

@hendrikmakait
Author

hendrikmakait commented Sep 9, 2024

Is there a public dataset that one could access to reproduce this? If not, I recommend profiling your code with py-spy. We could do a sanity check if you can provide a profile in the speedscope format. Also, 90 GiB of memory on the scheduler seems excessive, in particular with "only" 500k tasks.
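
For example, one way to capture such a profile, assuming the graph-construction code is wrapped in a standalone script (here called `build_graph.py`):

```bash
py-spy record --format speedscope -o profile.speedscope.json -- python build_graph.py
```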

@hombit
Collaborator

hombit commented Sep 10, 2024

@hendrikmakait The data is public; I made a notebook that uses the HTTPS data storage. The dataset volume is pretty large, but the graph-building part doesn't really touch any data files, so the manager-node issues are reproducible without fetching any actual data. The notebook:
https://github.com/lincc-frameworks/notebooks_lf/blob/main/ztf_periodogram/ztf-periodogram-data.lsdb.io.ipynb

It took 90GB of RAM and 7 minutes to schedule the graph. (Am I using the right terminology here? What I mean is the time between when I hit .compute() and when tasks start to stream.)

I haven't tried py-spy yet, thanks for the recommendation!

@hombit
Collaborator

hombit commented Sep 11, 2024

@hendrikmakait I ran the code (up to the len(x.dask) point) and profiled it with py-spy and memray. I don't know much about Dask internals, so it's hard to tell what exactly is happening there, but I believe memray indicates that the graph size is actually ~80GB.
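
In case it helps others reproduce the measurement, here is a sketch of how that step can be profiled for memory with memray's Python API; `x` is the collection from the notebook, and the output file name is arbitrary:

```python
import memray

# Track allocations made while materializing the low-level task graph,
# i.e. graph construction only, without executing any tasks.
with memray.Tracker("graph_build.bin"):
    n_tasks = len(x.dask)
print(n_tasks)
# Render the report afterwards with: memray flamegraph graph_build.bin
```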

Here you can find the code and outputs for both profilers:
https://github.com/lincc-frameworks/notebooks_lf/tree/main/ztf_periodogram/profile-dask-graph

I'd really appreciate it if you could take a look. I believe you would have much better insight into this!

Update 2024-09-12: I tried to turn off optimizations with dask.config.set(array_optimize=None, dataframe_optimize=None, delayed_optimize=None), but it didn't change anything.

@hombit
Collaborator

hombit commented Sep 17, 2024

@hendrikmakait just a gentle reminder: could you please take a look at these profiler logs? Should I convert this to a Dask/dask-expr issue?

@hendrikmakait
Author

Thanks for creating the profiles! I've had a brief look at them and there's nothing that would point toward an obvious issue at first glance. I'm not sure if I'll have any time soon to dig deeper into this. It looks like I'd have to understand the graph structures you create in much more detail to point out issues.
