Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added economic QR #903

Merged
merged 6 commits into from
Mar 10, 2025
Merged

feat: added economic QR #903

merged 6 commits into from
Mar 10, 2025

Conversation

mfzmullen
Copy link
Contributor

Added an economic qr calling the cusolver orgqr methods based on appropriate type of the input. Did not replace qr_solver so as to keep existing functionality, but expanded on it. Modifying the existing qr transform to perform an economic QR seems to lead to numeric instability, and QH * Q is not near identity in that case, and Q*R is not near the input operator. The static_casts to int in the orgqr_dispatch functions prevent narrowing errors at compilation.

To catch an edge case error, I modified the hashing function slightly to use the rank of the input as part of the hash to prevent bad any_cast when using a cached plan. Previously an error was thrown if one firsts performs a batched economic QR on a tensor of shape {b, m, n} with b = 1, then run a plan on a tensor with shape {m, n} with no specified batch size. Not sure when anyone would do that except me when testing, but it is a simple error to prevent.

@cliffburdick
Copy link
Collaborator

/build

@cliffburdick
Copy link
Collaborator

Thanks @mfzmullen ! This is really great work. Do you know if the instability was caused by something in particular (maybe the size or type)?

@mfzmullen
Copy link
Contributor Author

mfzmullen commented Mar 7, 2025

I was doing m, n = 512x256 at the time. To be honest, I did not dig into it that much. My guess (truly guess, have not checked) is in the IFELSE of qr, where the check is only against zero elements and not elements within a few eps of 0 could be causing some issues. Despite that, using the cusolver method was an order of magnitude faster on my setup than my modified version of the existing qr, so I abandoned that.

@cliffburdick
Copy link
Collaborator

I was doing m, n = 512x256 at the time. To be honest, I did not dig into it that much. My guess (truly guess, have not checked) is in the IFELSE of qr, where the check is only against zero elements and not elements within a few eps of 0 could be causing some issues. Despite that, using the cusolver method was an order of magnitude faster on my setup than my modified version of the existing qr, so I abandoned that.

I'm struggling to find where this function is useful vs the other QR. Is it only if you want the H vectors and don't care about solving for QR? Can you add some notes to the docs on why a user would choose to use this?

@cliffburdick
Copy link
Collaborator

It looks like the tests are failing here:

[----------] 7 tests from SVDSolverTestNonHalfTypes/0, where TypeParam = cuda::std::__4::tuple<float, matx::cudaExecutor>
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDBasic
[       OK ] SVDSolverTestNonHalfTypes/0.SVDBasic (132 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDMLeqN
[       OK ] SVDSolverTestNonHalfTypes/0.SVDMLeqN (9 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDReducedMode
[       OK ] SVDSolverTestNonHalfTypes/0.SVDReducedMode (75 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDHostAlgoQR
[       OK ] SVDSolverTestNonHalfTypes/0.SVDHostAlgoQR (8 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDBasicBatched
[       OK ] SVDSolverTestNonHalfTypes/0.SVDBasicBatched (137 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDBasicBatchedSmallMGTN
terminate called without an active exception
Aborted (core dumped)

I don't think you touched SVD directly and I don't see any usage of QR inside SVD. Any ideas what's causing this?

@mfzmullen
Copy link
Contributor Author

@cliffburdick updated docs, let me know if the use case is not clear.

I am replicating that error but am not sure why it exists yet. I was using the gtest_filter to only run tests on this code, my apologies I didn't run on the full suite (I didn't expect it to cause issues elsewhere).

@mfzmullen
Copy link
Contributor Author

When I do a clean build of main and run ./matx_test --gtest_filter="*SVD*", I still get the error. Here it is in more detail

[----------] 7 tests from SVDSolverTestNonHalfTypes/4, where TypeParam = cuda::std::__4::tuple<double, matx::cudaExecutor>
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDBasic
[ OK ] SVDSolverTestNonHalfTypes/4.SVDBasic (26 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDMLeqN
[ OK ] SVDSolverTestNonHalfTypes/4.SVDMLeqN (6 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDReducedMode
[ OK ] SVDSolverTestNonHalfTypes/4.SVDReducedMode (12 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDHostAlgoQR
[ OK ] SVDSolverTestNonHalfTypes/4.SVDHostAlgoQR (5 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDBasicBatched
[ OK ] SVDSolverTestNonHalfTypes/4.SVDBasicBatched (51 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDBasicBatchedSmallMGTN
err: Failed to allocate memory. May be an asynchronous error from another CUDA call(700 != 0)
terminate called after throwing an instance of 'matx::detail::matxException'
what(): matxException (matxOutOfMemory: ) - /home/michael/adl/forks/MatX/include/matx/core/allocator.h:197

@mfzmullen
Copy link
Contributor Author

Are there nightly tests or anything that would help track down when this started failing?

@mfzmullen
Copy link
Contributor Author

May not have time to look at this much more today, but using compute-sanitizer there seems to be an issue in the call to cusolverDnSgesvdjBatched in svd_cuda with float test type.

@cliffburdick
Copy link
Collaborator

/build

@cliffburdick
Copy link
Collaborator

Are there nightly tests or anything that would help track down when this started failing?

We have tests run on every commit, so it's strange this is showing up now. I will take a look at SVD since maybe it's been there all along and is just showing up with the right combination of parameters.

@mfzmullen
Copy link
Contributor Author

Pushed a commit that fixes SVD on my end. It appears(?) that not enough memory was being allocated in d_space for the batched solver (based on the NVIDIA sample here)

@cliffburdick
Copy link
Collaborator

/build

1 similar comment
@cliffburdick
Copy link
Collaborator

/build

@cliffburdick
Copy link
Collaborator

Thanks for fixing an unrelated bug, and sorry that happened to show up here!. Merging

@cliffburdick cliffburdick merged commit 1e5c64a into NVIDIA:main Mar 10, 2025
1 check passed
@cliffburdick
Copy link
Collaborator

/build

@cliffburdick
Copy link
Collaborator

@mfzmullen I'm seeing QR tests fail. Going to re-run this to see if this is the cause.

@mfzmullen
Copy link
Contributor Author

Thanks. I have been trying to replicate locally but no luck so far, I have one more thing to try to replicate it. Any error logs you can share?

@cliffburdick
Copy link
Collaborator

Thanks. I have been trying to replicate locally but no luck so far, I have one more thing to try to replicate it. Any error logs you can share?

It might have something to do with the GPU (L40), but it's very sporadic and isn't happening on every run. Here are the logs when it happens:


[----------] 1 test from QR2SolverTestNonHalfTypes/0, where TypeParam = cuda::std::__4::tuple<float, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/0.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/0.QR2, where TypeParam = cuda::std::__4::tuple<float, matx::cudaExecutor> (60 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/0 (60 ms total)

[----------] 1 test from QR2SolverTestNonHalfTypes/1, where TypeParam = cuda::std::__4::tuple<double, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/1.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/1.QR2, where TypeParam = cuda::std::__4::tuple<double, matx::cudaExecutor> (16 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/1 (16 ms total)

[----------] 1 test from QR2SolverTestNonHalfTypes/2, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<float>, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/2.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/2.QR2, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<float>, matx::cudaExecutor> (14 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/2 (14 ms total)

[----------] 1 test from QR2SolverTestNonHalfTypes/3, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<double>, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/3.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/3.QR2, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<double>, matx::cudaExecutor> (16 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/3 (16 ms total)

@mfzmullen
Copy link
Contributor Author

Oh, that is not a file I touched in my recent PR. That is a little surprising. Unfortunately I do not have an L40 to test on! I can still see if I can do anything to replicate it though on what I do have

@cliffburdick
Copy link
Collaborator

Don't worry about it. I'll keep an eye on it and try to reproduce here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants