Performance comparisons: nc mesh vs conf. mesh #18
base: master
Conversation
Turning on the profiler for the following run shows that most of the run-time for the non-conforming mesh is dominated by sorting algorithms, while there are no sorting algorithms in the conforming-mesh case. I have the profiler data if you want to look at it with me.
Do you have the stack trace for
I see that HYPRE creates and destroys a cuSparse matrix every time the matvec function is called -- maybe that's where these sorts come from?
Or maybe when we do
That would explain why I see:
being called over and over.
I'm not sure the cuSparse constructor (with external data) or the action needs to do sorting. The transpose construction on the GPU, on the other hand, probably needs some sorting.
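To make the point above concrete: on the host, a CSR transpose can be built with a counting/bucket pass rather than a comparison sort, which is why the conforming-mesh path can be sort-free while GPU transpose routines often fall back to radix sorts. A minimal sketch in plain Python (illustrative only, not hypre's implementation):

```python
# Sketch (not hypre's code): transpose a CSR matrix by counting and
# scattering, with no comparison sort. GPU libraries often use sorts
# here instead, because this scatter is hard to parallelize without
# atomics.

def csr_transpose(nrows, ncols, row_ptr, col_idx, vals):
    """Return (row_ptr, col_idx, vals) of the ncols x nrows transpose."""
    nnz = row_ptr[-1]
    # Count entries per column of A (= per row of A^T).
    t_ptr = [0] * (ncols + 1)
    for j in col_idx:
        t_ptr[j + 1] += 1
    for j in range(ncols):
        t_ptr[j + 1] += t_ptr[j]
    # Scatter each entry (i, j, a) of A into row j of A^T.
    t_idx = [0] * nnz
    t_vals = [0.0] * nnz
    cursor = list(t_ptr)
    for i in range(nrows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            dest = cursor[j]
            t_idx[dest] = i
            t_vals[dest] = vals[k]
            cursor[j] += 1
    return t_ptr, t_idx, t_vals
```

The counting pass makes the whole transpose O(nnz + ncols), but the scatter writes to unpredictable locations, which is exactly what a GPU implementation would rather reorganize with a sort.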
Do you have
Yes, we call
Yes, you can avoid that by
If the transpose will be constructed either way, isn't it better to create and store the
If it's worth paying the cost of constructing
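The trade-off being discussed above can be sketched abstractly: pay the transpose-construction cost once, cache the result, and reuse it for every subsequent transpose-matvec, instead of rebuilding (and re-sorting) inside each call. All names below are illustrative, not hypre's API:

```python
# Hypothetical sketch of caching A^T for repeated transpose-matvecs.
# The matrix is stored as a dict-of-keys {(i, j): a_ij} purely for
# brevity; hypre uses (Par)CSR storage instead.

class CachedTransposeOperator:
    def __init__(self, matrix):
        self.matrix = matrix
        self._transpose = None           # built lazily, then reused

    def _get_transpose(self):
        if self._transpose is None:      # pay the construction cost once
            self._transpose = {(j, i): a for (i, j), a in self.matrix.items()}
        return self._transpose

    def mult_transpose(self, x):
        """y = A^T x, with vectors given as dicts index -> value."""
        t = self._get_transpose()
        y = {}
        for (i, j), a in t.items():
            y[i] = y.get(i, 0.0) + a * x.get(j, 0.0)
        return y
```

The cost is the extra storage for the cached transpose, which is the storage question raised in the next comments.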
What's the difference in storage between that and the
OK, I take part of it back. Storage-wise, local and global
@liruipeng, let us know when you have this ready for testing
CUDA_MAKE_OPTS=(
    "--with-cuda"
    "--with-gpu-arch=70"
    "--enable-device-memory-pool"
Do we know if this is as performant as using Umpire? Also, does this disable the use of unified memory?
No. We suggest using Umpire if you can. We will probably remove our device memory pool implementation in the near future, since even CUDA now has an "official" memory pool.
The default for hypre is no unified memory (ref. docs).
    --enable-unified-memory    Use unified memory for allocating the memory
                               (default is NO).
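For comparison with the `--enable-device-memory-pool` option in this PR, a sketch of what an Umpire-backed build might look like. The `--with-umpire*` flag names below are assumptions based on hypre's configure conventions; verify them against `./configure --help` for the hypre version in use:

```shell
# Assumed flags -- check ./configure --help before relying on these.
# UMPIRE_DIR is a placeholder for your Umpire install prefix.
CUDA_MAKE_OPTS=(
    "--with-cuda"
    "--with-gpu-arch=70"
    "--with-umpire"                               # route GPU allocations through Umpire
    "--with-umpire-include=${UMPIRE_DIR}/include"
    "--with-umpire-lib-dirs=${UMPIRE_DIR}/lib"
    "--with-umpire-libs=umpire"
)
```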
If I recall correctly, Umpire does improve performance a bit, though I'm not sure by what factor compared to the internal memory-pool framework.
@liruipeng do you know how much faster it is to use umpire?
> @liruipeng do you know how much faster it is to use umpire?
Hard to give a number. It highly depends on the application.
@tzanio @v-dobrev
Yes, overall we evaluate the action of
The discussion here is for the last step
@artv3 @tzanio I created a new branch: https://github.com/hypre-space/hypre/tree/parcsr_local_trans The way I imagined it to work is to call Give it a try and let me know. Thanks!
Thanks, @liruipeng!
Improved performance! Looking much better.
Folks, we seem to lose quite a bit of performance when going through the nc-mesh machinery.
This PR will generate the data for the plot below:
When using a conforming mesh, we see the following performance:
With hypre branch: parcsr_local_trans