-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added economic QR #903
Conversation
/build |
Thanks @mfzmullen ! This is really great work. Do you know if the instability was caused by something in particular (maybe the size or type)? |
I was doing m, n = 512x256 at the time. To be honest, I did not dig into it that much. My guess (truly guess, have not checked) is in the IFELSE of |
I'm struggling to find where this function is useful vs the other QR. Is it only if you want the H vectors and don't care about solving for QR? Can you add some notes to the docs on why a user would choose to use this? |
It looks like the tests are failing here:
I don't think you touched SVD directly and I don't see any usage of QR inside SVD. Any ideas what's causing this? |
@cliffburdick updated docs, let me know if the use case is not clear. I am replicating that error but am not sure why it exists yet. I was using the gtest_filter to only run tests on this code, my apologies I didn't run on the full suite (I didn't expect it to cause issues elsewhere). |
When I do a clean build of main and run
|
Are there nightly tests or anything that would help track down when this started failing? |
May not have time to look at this much more today, but using compute-sanitizer there seems to be an issue in the call to |
/build |
We have tests run on every commit, so it's strange this is showing up now. I will take a look at SVD since maybe it's been there all along and is just showing up with the right combination of parameters. |
Pushed a commit that fixes SVD on my end. It appears(?) that not enough memory was being allocated in d_space for the batched solver (based on the NVIDIA sample here) |
/build |
1 similar comment
/build |
Thanks for fixing an unrelated bug, and sorry that happened to show up here!. Merging |
/build |
@mfzmullen I'm seeing QR tests fail. Going to re-run this to see if this is the cause. |
Thanks. I have been trying to replicate locally but no luck so far, I have one more thing to try to replicate it. Any error logs you can share? |
It might have something to do with the GPU (L40), but it's very sporadic and isn't happening on every run. Here are the logs when it happens:
|
Oh, that is not a file I touched in my recent PR. That is a little surprising. Unfortunately I do not have an L40 to test on! I can still see if I can do anything to replicate it though on what I do have |
Don't worry about it. I'll keep an eye on it and try to reproduce here. |
Added an economic qr calling the cusolver orgqr methods based on appropriate type of the input. Did not replace
qr_solver
so as to keep existing functionality, but expanded on it. Modifying the existingqr
transform to perform an economic QR seems to lead to numeric instability, andQH * Q
is not near identity in that case, andQ*R
is not near the input operator. Thestatic_cast
s toint
in theorgqr_dispatch
functions prevent narrowing errors at compilation.To catch an edge case error, I modified the hashing function slightly to use the rank of the input as part of the hash to prevent
bad any_cast
when using a cached plan. Previously an error was thrown if one firsts performs a batched economic QR on a tensor of shape{b, m, n}
withb = 1
, then run a plan on a tensor with shape{m, n}
with no specified batch size. Not sure when anyone would do that except me when testing, but it is a simple error to prevent.