feat: added economic QR #903

mfzmullen · 2025-03-06T22:01:45Z

Added an economic QR that calls the cusolver orgqr methods, dispatched on the type of the input. I did not replace qr_solver, so existing functionality is kept and expanded on. Modifying the existing qr transform to perform an economic QR seems to lead to numerical instability: QH * Q is not near identity in that case, and Q*R is not near the input operator. The static_casts to int in the orgqr_dispatch functions prevent narrowing errors at compilation.
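For reference, the two checks mentioned above (QH * Q near identity, Q*R near the input) can be sketched on the CPU. This is only a minimal illustration, assuming a real-valued, full-column-rank m x n input with m >= n. It uses modified Gram-Schmidt rather than the cusolver orgqr path in this PR, and `Mat`, `QR`, `economic_qr`, `orthogonality_error`, and `reconstruction_error` are hypothetical names, not MatX API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix, for illustration only (not a MatX type).
struct Mat {
  std::size_t rows, cols;
  std::vector<double> d;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), d(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return d[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return d[i * cols + j]; }
};

struct QR { Mat Q; Mat R; };

// Economic QR via modified Gram-Schmidt (a stand-in technique, not orgqr):
// Q is m x n with orthonormal columns, R is n x n upper triangular.
// Assumes m >= n and full column rank.
QR economic_qr(const Mat& A) {
  assert(A.rows >= A.cols);
  const std::size_t m = A.rows, n = A.cols;
  QR f{A, Mat(n, n)};
  for (std::size_t k = 0; k < n; ++k) {
    double norm = 0.0;
    for (std::size_t i = 0; i < m; ++i) norm += f.Q(i, k) * f.Q(i, k);
    norm = std::sqrt(norm);
    assert(norm > 0.0);  // rank-deficient input is out of scope for this sketch
    f.R(k, k) = norm;
    for (std::size_t i = 0; i < m; ++i) f.Q(i, k) /= norm;
    // Remove the component along column k from every remaining column.
    for (std::size_t j = k + 1; j < n; ++j) {
      double dot = 0.0;
      for (std::size_t i = 0; i < m; ++i) dot += f.Q(i, k) * f.Q(i, j);
      f.R(k, j) = dot;
      for (std::size_t i = 0; i < m; ++i) f.Q(i, j) -= dot * f.Q(i, k);
    }
  }
  return f;
}

// Largest entry of |Q^T Q - I|; should be near machine precision.
double orthogonality_error(const Mat& Q) {
  double err = 0.0;
  for (std::size_t a = 0; a < Q.cols; ++a)
    for (std::size_t b = 0; b < Q.cols; ++b) {
      double s = 0.0;
      for (std::size_t i = 0; i < Q.rows; ++i) s += Q(i, a) * Q(i, b);
      err = std::max(err, std::abs(s - (a == b ? 1.0 : 0.0)));
    }
  return err;
}

// Largest entry of |Q R - A|; should also be near machine precision.
double reconstruction_error(const Mat& A, const QR& f) {
  double err = 0.0;
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < A.cols; ++j) {
      double s = 0.0;
      for (std::size_t k = 0; k < A.cols; ++k) s += f.Q(i, k) * f.R(k, j);
      err = std::max(err, std::abs(s - A(i, j)));
    }
  return err;
}
```

Calling `economic_qr` on a tall matrix and checking that both error functions return values close to machine epsilon mirrors the kind of validation described above. For complex inputs the transpose would become a conjugate transpose.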

To catch an edge-case error, I modified the hashing function slightly to use the rank of the input as part of the hash, which prevents a bad any_cast when using a cached plan. Previously an error was thrown if one first performed a batched economic QR on a tensor of shape {b, m, n} with b = 1, then ran a plan on a tensor of shape {m, n} with no specified batch size. Not sure when anyone would do that except me when testing, but it is a simple error to prevent.
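The cache-key issue can be illustrated with a small standalone sketch. Everything here (`QrParams`, `QrParamsHash`, `QrParamsEq`, `Plan`, `PlanCache`, `get_or_create`) is hypothetical and far simpler than MatX's real plan cache. It only shows why a key without the rank lets a {1, m, n} plan and an {m, n} plan collide on one entry, and how including the rank keeps them apart so the `std::any_cast` to the expected plan type succeeds.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical plan parameters; not MatX's actual structures.
struct QrParams {
  std::size_t rank;   // 2 for {m, n}, 3 for {b, m, n}
  std::size_t batch;  // an unbatched {m, n} input behaves like batch == 1
  std::size_t m, n;
};

struct QrParamsEq {
  bool operator()(const QrParams& a, const QrParams& b) const {
    // Comparing rank here is what keeps {1, m, n} and {m, n} distinct.
    return a.rank == b.rank && a.batch == b.batch && a.m == b.m && a.n == b.n;
  }
};

struct QrParamsHash {
  std::size_t operator()(const QrParams& p) const {
    std::size_t h = 0;
    auto mix = [&h](std::size_t v) {
      h ^= std::hash<std::size_t>{}(v) + 0x9e3779b9 + (h << 6) + (h >> 2);
    };
    mix(p.rank);
    mix(p.batch);
    mix(p.m);
    mix(p.n);
    return h;
  }
};

// A plan whose C++ type depends on the tensor rank, stored type-erased.
template <int Rank>
struct Plan {
  std::size_t m, n;
};

using PlanCache = std::unordered_map<QrParams, std::any, QrParamsHash, QrParamsEq>;

template <int Rank>
Plan<Rank>& get_or_create(PlanCache& cache, std::size_t batch, std::size_t m, std::size_t n) {
  assert(m > 0 && n > 0);
  const QrParams key{static_cast<std::size_t>(Rank), batch, m, n};
  auto it = cache.find(key);
  if (it == cache.end()) it = cache.emplace(key, Plan<Rank>{m, n}).first;
  // Throws std::bad_any_cast if the stored plan has a different type, which is
  // what would happen if the key ignored the rank.
  return std::any_cast<Plan<Rank>&>(it->second);
}
```

With this key, requesting a rank-3 plan with batch 1 and then a rank-2 plan with the same m and n creates two separate cache entries instead of reusing the first one with the wrong type.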

include/matx/operators/qr.h

cliffburdick · 2025-03-07T04:31:21Z

/build

cliffburdick · 2025-03-07T04:38:14Z

Thanks @mfzmullen ! This is really great work. Do you know if the instability was caused by something in particular (maybe the size or type)?

mfzmullen · 2025-03-07T15:20:17Z

I was using m, n = 512x256 at the time. To be honest, I did not dig into it that much. My guess (truly a guess, I have not checked) is that the IFELSE in qr could be causing some issues, since its check is only against exactly zero elements and not against elements within a few eps of 0. Regardless, the cusolver method was an order of magnitude faster on my setup than my modified version of the existing qr, so I abandoned that approach.
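To make the guess concrete, here is a generic sketch of the difference between an exact-zero test and a test against a few eps. It is not the actual IFELSE expression in qr, and `is_negligible` with its `scale` and `ulps` parameters is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true when |value| is within `ulps` machine epsilons of zero,
// relative to a caller-supplied magnitude `scale` (for example a column norm).
// An exact comparison such as `value == T(0)` would treat tiny round-off
// residue as nonzero and could then divide by it.
template <typename T>
bool is_negligible(T value, T scale, T ulps = T(4)) {
  assert(scale >= T(0));
  const T eps = std::numeric_limits<T>::epsilon();
  return std::abs(value) <= ulps * eps * scale;
}
```

For example, a double of magnitude 1e-17 at scale 1 is not equal to zero but is negligible under this test, while 1e-10 is not.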

cliffburdick · 2025-03-07T16:05:45Z

> I was doing m, n = 512x256 at the time. To be honest, I did not dig into it that much. My guess (truly guess, have not checked) is in the IFELSE of qr, where the check is only against zero elements and not elements within a few eps of 0 could be causing some issues. Despite that, using the cusolver method was an order of magnitude faster on my setup than my modified version of the existing qr, so I abandoned that.

I'm struggling to find where this function is useful vs the other QR. Is it only if you want the H vectors and don't care about solving for QR? Can you add some notes to the docs on why a user would choose to use this?

cliffburdick · 2025-03-07T16:07:34Z

It looks like the tests are failing here:

[----------] 7 tests from SVDSolverTestNonHalfTypes/0, where TypeParam = cuda::std::__4::tuple<float, matx::cudaExecutor>
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDBasic
[       OK ] SVDSolverTestNonHalfTypes/0.SVDBasic (132 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDMLeqN
[       OK ] SVDSolverTestNonHalfTypes/0.SVDMLeqN (9 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDReducedMode
[       OK ] SVDSolverTestNonHalfTypes/0.SVDReducedMode (75 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDHostAlgoQR
[       OK ] SVDSolverTestNonHalfTypes/0.SVDHostAlgoQR (8 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDBasicBatched
[       OK ] SVDSolverTestNonHalfTypes/0.SVDBasicBatched (137 ms)
[ RUN      ] SVDSolverTestNonHalfTypes/0.SVDBasicBatchedSmallMGTN
terminate called without an active exception
Aborted (core dumped)

I don't think you touched SVD directly and I don't see any usage of QR inside SVD. Any ideas what's causing this?

mfzmullen · 2025-03-07T16:39:47Z

@cliffburdick updated docs, let me know if the use case is not clear.

I can replicate that error but am not sure yet why it occurs. I was using gtest_filter to run only the tests for this code; my apologies for not running the full suite (I didn't expect it to cause issues elsewhere).

mfzmullen · 2025-03-07T18:02:28Z

When I do a clean build of main and run ./matx_test --gtest_filter="*SVD*", I still get the error. Here it is in more detail:

[----------] 7 tests from SVDSolverTestNonHalfTypes/4, where TypeParam = cuda::std::__4::tuple<double, matx::cudaExecutor>
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDBasic
[ OK ] SVDSolverTestNonHalfTypes/4.SVDBasic (26 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDMLeqN
[ OK ] SVDSolverTestNonHalfTypes/4.SVDMLeqN (6 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDReducedMode
[ OK ] SVDSolverTestNonHalfTypes/4.SVDReducedMode (12 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDHostAlgoQR
[ OK ] SVDSolverTestNonHalfTypes/4.SVDHostAlgoQR (5 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDBasicBatched
[ OK ] SVDSolverTestNonHalfTypes/4.SVDBasicBatched (51 ms)
[ RUN ] SVDSolverTestNonHalfTypes/4.SVDBasicBatchedSmallMGTN
err: Failed to allocate memory. May be an asynchronous error from another CUDA call(700 != 0)
terminate called after throwing an instance of 'matx::detail::matxException'
what(): matxException (matxOutOfMemory: ) - /home/michael/adl/forks/MatX/include/matx/core/allocator.h:197

mfzmullen · 2025-03-07T18:03:10Z

Are there nightly tests or anything that would help track down when this started failing?

mfzmullen · 2025-03-07T20:23:55Z

I may not have time to look at this much more today, but according to compute-sanitizer there seems to be an issue in the call to cusolverDnSgesvdjBatched in svd_cuda with the float test type.

cliffburdick · 2025-03-07T21:17:34Z

/build

cliffburdick · 2025-03-07T21:18:09Z

> Are there nightly tests or anything that would help track down when this started failing?

We have tests run on every commit, so it's strange this is showing up now. I will take a look at SVD since maybe it's been there all along and is just showing up with the right combination of parameters.

mfzmullen · 2025-03-10T14:06:52Z

Pushed a commit that fixes SVD on my end. It appears(?) that not enough memory was being allocated in d_space for the batched solver (based on the NVIDIA sample here).

cliffburdick · 2025-03-10T14:09:05Z

/build

cliffburdick · 2025-03-10T15:19:39Z

/build

cliffburdick · 2025-03-10T18:33:20Z

Thanks for fixing an unrelated bug, and sorry that happened to show up here! Merging.

cliffburdick · 2025-03-11T21:50:36Z

/build

cliffburdick · 2025-03-11T21:50:57Z

@mfzmullen I'm seeing QR tests fail. Going to re-run this to see if this is the cause.

mfzmullen · 2025-03-12T15:43:08Z

Thanks. I have been trying to replicate it locally but no luck so far; I have one more thing to try. Any error logs you can share?

cliffburdick · 2025-03-12T15:45:05Z

> Thanks. I have been trying to replicate locally but no luck so far, I have one more thing to try to replicate it. Any error logs you can share?

It might have something to do with the GPU (L40), but it's very sporadic and isn't happening on every run. Here are the logs when it happens:


[----------] 1 test from QR2SolverTestNonHalfTypes/0, where TypeParam = cuda::std::__4::tuple<float, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/0.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/0.QR2, where TypeParam = cuda::std::__4::tuple<float, matx::cudaExecutor> (60 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/0 (60 ms total)

[----------] 1 test from QR2SolverTestNonHalfTypes/1, where TypeParam = cuda::std::__4::tuple<double, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/1.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/1.QR2, where TypeParam = cuda::std::__4::tuple<double, matx::cudaExecutor> (16 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/1 (16 ms total)

[----------] 1 test from QR2SolverTestNonHalfTypes/2, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<float>, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/2.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/2.QR2, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<float>, matx::cudaExecutor> (14 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/2 (14 ms total)

[----------] 1 test from QR2SolverTestNonHalfTypes/3, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<double>, matx::cudaExecutor>
[ RUN      ] QR2SolverTestNonHalfTypes/3.QR2
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
/home/jenkins/workspace/unit-tests/test/00_solver/QR2.cu:107: Failure
The difference between mdiffQTQ() and SType(0) is 1, which exceeds .00001, where
mdiffQTQ() evaluates to 1,
SType(0) evaluates to 0, and
.00001 evaluates to 1.0000000000000001e-05.
[  FAILED  ] QR2SolverTestNonHalfTypes/3.QR2, where TypeParam = cuda::std::__4::tuple<cuda::std::__4::complex<double>, matx::cudaExecutor> (16 ms)
[----------] 1 test from QR2SolverTestNonHalfTypes/3 (16 ms total)

mfzmullen · 2025-03-12T15:48:20Z

Oh, that is not a file I touched in my recent PR. That is a little surprising. Unfortunately I do not have an L40 to test on! I can still see whether I can replicate it on what I do have, though.

cliffburdick · 2025-03-12T15:52:18Z

Don't worry about it. I'll keep an eye on it and try to reproduce here.

feat: added economic QR

bd16023

cliffburdick reviewed Mar 6, 2025

include/matx/operators/qr.h (outdated)

update docs

b57818c

mfzmullen added 2 commits March 7, 2025 09:28

fix size checks and delete commented code

4bcd15a

correct assertion

4872ca1

update docs to clarify use case

48364cb

fix memory alloc in gesvdjBatched

48ab4e3

cliffburdick approved these changes Mar 10, 2025

cliffburdick merged commit 1e5c64a into NVIDIA:main Mar 10, 2025
1 check passed
