
[FEA] Better error reporting #858

Closed

jwmelto opened this issue Feb 4, 2025 · 11 comments

Comments

@jwmelto

jwmelto commented Feb 4, 2025

Is your feature request related to a problem? Please describe.
I ran an FFT and got this opaque exception:

matxException (matxCufftError: error == CUFFT_SUCCESS)

This occurs at fft/fft_cuda.h:437 in v0.9.0

The exception is raised from the MATX_ASSERT macro, which does not carry enough detail to describe the failure condition.

Describe the solution you'd like
At a minimum, the actual error code is required to resolve the problem; the error string would also be helpful. Something like the ubiquitous cudaCheck macro would do.
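
For illustration, here is a minimal sketch of what such a check could look like for raw cuFFT calls. The CUFFT_CHECK name and formatting are hypothetical, not an existing MatX or cuFFT facility, and since cuFFT has no public error-string function, only the numeric code is printed:

#include <cufft.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical cudaCheck-style macro for cuFFT: report the numeric error code
// and the source location before aborting.
#define CUFFT_CHECK(call)                                             \
  do {                                                                \
    cufftResult err_ = (call);                                        \
    if (err_ != CUFFT_SUCCESS) {                                      \
      std::fprintf(stderr, "cuFFT error %d at %s:%d\n",               \
                   static_cast<int>(err_), __FILE__, __LINE__);       \
      std::abort();                                                   \
    }                                                                 \
  } while (0)

// Usage: CUFFT_CHECK(cufftExecC2C(plan, in, out, CUFFT_FORWARD));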

Describe alternatives you've considered
I'm currently at a loss as to how to get the error code out without modifying the MatX code directly.

Additional context
I can guess at the issue; I created a tensor over host memory and passed that into the fft. I surmise that it's a host/device memory issue.

using HostComplex = std::complex<float>;
using Complex = cuda::std::complex<float>;

auto exec = matx::cudaExecutor{};

auto wipeoff = matx::make_tensor<Complex>( { M, N } );
(wipeoff = /* details out of scope */).run(exec);

std::vector<HostComplex> data = /* get source data */;
auto dataP = reinterpret_cast<Complex*>( data.data() );
auto dataT = matx::make_tensor<Complex>( dataP, { N } );

// Here is where it fails
auto vals = matx::make_tensor<Complex>( wipeoff.Shape() );
(vals = matx::fft( (wipeoff * dataT) )).run(exec);

I took a look at v01.1.1, but my project had numerous compilation errors with that version. Further investigation is ongoing.

@cliffburdick
Collaborator

Hi @jwmelto, we need to update that location to provide the error number/string. We will add a fix for that soon.

In the meantime, it looks like your code is using a CUDA executor but is passing it host memory. std::vector allocates memory on the host heap that is neither managed memory nor host-pinned memory, so it cannot be accessed on the device. If you want to access that memory you will either need to copy the data to the device or copy the vector to host-pinned memory before using it. Something like this should work:

std::vector<HostComplex> data = /* get source data */;
auto dataP = reinterpret_cast<Complex*>( data.data() );

auto data_device = matx::make_tensor<Complex>(  { N },  matx::MATX_DEVICE_MEMORY);
cudaMemcpy(data_device.Data(), dataP, data_device.Bytes(), cudaMemcpyDefault);

// The previously failing FFT now reads device-accessible memory
auto vals = matx::make_tensor<Complex>( wipeoff.Shape() );
(vals = matx::fft( wipeoff * data_device)).run(exec);
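
If the data needs to stay resident on the host, the host-pinned route mentioned above is a rough alternative. This is a sketch only; it assumes matx::MATX_HOST_MEMORY allocates host-pinned (page-locked) memory, which is device-accessible on UVA systems but typically slower to read from a kernel than device memory:

// Sketch only: copy the vector into a pinned host tensor instead of device memory
auto data_pinned = matx::make_tensor<Complex>( { N }, matx::MATX_HOST_MEMORY );
std::memcpy(data_pinned.Data(), dataP, data_pinned.Bytes());  // host-to-host copy into the pinned buffer

auto vals = matx::make_tensor<Complex>( wipeoff.Shape() );
(vals = matx::fft( wipeoff * data_pinned )).run(exec);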

I will try to reproduce the error on our side.

@jwmelto
Author

jwmelto commented Feb 4, 2025

Thanks. It isn't quite clear what "magic" is handled by the MatX tensors and what I need to handle. In particular, the "locality" of memory is a complete mystery.

What I had before seemed grossly less efficient, but may work (updated for clarity):

auto dataT = matx::make_tensor<Complex>( { N } );
for (size_t idx = 0; idx < data.size(); ++idx) dataT(idx) = data[idx];

Does this construct mystically handle the data movement from host to device?

@cliffburdick
Collaborator

I agree, and this is something we need to improve. Ideally the user would never need to worry about where the memory is located and we would handle that for them. That's exactly what managed memory (the default) does for you, but it doesn't help if you have existing code using std::vector and no control over where it allocates. Technically you could pass an allocator to std::vector, but that's not an option in every case.

With this line: dataT(idx) = data[idx];
dataT is a tensor pointing to the exact same memory backing the vector since you initialized the tensor with its data() pointer. So that loop shouldn't be doing anything since it's assigning a value to and from the same location in the array.

With your dataT allocation, since it doesn't specify the memory space (like matx::MATX_DEVICE_MEMORY), it will default to managed memory. MatX will incorrectly think your pointer is a managed-memory pointer rather than a host-memory pointer, and it may handle the memory incorrectly. We could try to determine the pointer type through the CUDA API, but that is not perfect and we may get the wrong information from the pointer, so it's always better if the user is explicit.
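
For reference, the kind of runtime query alluded to above is presumably cudaPointerGetAttributes. A minimal sketch of what such a check could look like (not something MatX does today; dataP is the pointer from the earlier snippet, and on CUDA 11+ an unregistered host pointer is reported as cudaMemoryTypeUnregistered rather than as an error):

// Sketch: ask the CUDA runtime what kind of memory a pointer refers to.
cudaPointerAttributes attrs{};
cudaError_t st = cudaPointerGetAttributes(&attrs, dataP);
if (st == cudaSuccess && attrs.type == cudaMemoryTypeUnregistered) {
  // Plain pageable host memory (e.g. a std::vector's heap buffer):
  // not device-accessible, so it must be copied to device/managed memory first.
}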

When you do a straight assignment of A = B, MatX could likely detect this properly and do a cudaMemcpy behind the scenes, including for vectors, so the user wouldn't have to do it themselves. However, this will not work if you write a larger expression like A = B * 5 where B is not device-accessible, and you will get a crash.

Does this all make sense?

By the way I added better cuFFT error reporting here: #860

For your example I now get:

I have no name!@353d37a83437:/repro/tmp/MatX/build$ examples/fft_conv
error: (5 != 0): cuFFT: Driver or internal library error
terminate called after throwing an instance of 'matx::detail::matxException'
  what():  matxException (matxCufftError: ) - /repro/tmp/MatX/include/matx/transforms/fft/fft_cuda.h:436

In this case cuFFT only knows that its kernel crashed. For more information you can run compute-sanitizer:

========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 8 bytes
=========     at void matx::detail::matxOpT2Kernel<matx::detail::set<matx::tensor_t<cuda::std::__4::complex<float>, (int)2, matx::basic_storage<matx::raw_pointer_buffer<cuda::std::__4::complex<float>, matx::matx_allocator<cuda::std::__4::complex<float>>>>, matx::tensor_desc_t<cuda::std::__4::array<long long, (unsigned long)2>, cuda::std::__4::array<long long, (unsigned long)2>, (int)2>>, matx::detail::matxBinaryOp<matx::detail::tensor_impl_t<cuda::std::__4::complex<float>, (int)2, matx::tensor_desc_t<cuda::std::__4::array<long long, (unsigned long)2>, cuda::std::__4::array<long long, (unsigned long)2>, (int)2>, matx::detail::DenseTensorData<cuda::std::__4::complex<float>>>, matx::detail::tensor_impl_t<cuda::std::__4::complex<float>, (int)1, matx::tensor_desc_t<cuda::std::__4::array<long long, (unsigned long)1>, cuda::std::__4::array<long long, (unsigned long)1>, (int)1>, matx::detail::DenseTensorData<cuda::std::__4::complex<float>>>, matx::detail::BinOp<cuda::std::__4::complex<float>, cuda::std::__4::complex<float>, matx::detail::MulF<cuda::std::__4::complex<float>, cuda::std::__4::complex<float>>>>>>(T1, long long, long long)+0x4c0
=========     by thread (0,6,0) in block (0,0,0)
=========     Address 0x562b62993ce0 is out of bounds
=========     and is 45185528939296 bytes before the nearest allocation at 0x7f43f6000000 of size 2048 bytes

@cliffburdick
Collaborator

Hi @jwmelto, after talking about this internally, doing memory-type checks at runtime can be expensive and it's likely better to leave this to the user. This is what Python does when you have to explicitly call asnumpy, for example. I will add a section to the documentation about this.

@jwmelto
Author

jwmelto commented Feb 4, 2025

> With this line: dataT(idx) = data[idx];
> dataT is a tensor pointing to the exact same memory backing the vector since you initialized the tensor with its data() pointer. So that loop shouldn't be doing anything since it's assigning a value to and from the same location in the array.

My loop was using a "normal" tensor, not the pointer-initialized version. That does work, and I'm moving forward. As I contemplated reasons for the FFT to fail (when elsewhere in my code it did not), the memory mismatch did come to mind, but I was looking for a little more confirmation, like "invalid parameter" (that's what I get from cudaMemcpyAsync if passed a non-page-locked host pointer).

For a performance-sensitive operation, do you recommend explicitly doing the cudaMemcpy or is the tensor loop fast enough?

@cliffburdick
Collaborator

> With this line: dataT(idx) = data[idx];
> dataT is a tensor pointing to the exact same memory backing the vector since you initialized the tensor with its data() pointer. So that loop shouldn't be doing anything since it's assigning a value to and from the same location in the array.
>
> My loop was using a "normal" tensor, not the pointer-initialized version. That does work, and I'm moving forward. As I contemplated reasons for the FFT to fail (when elsewhere in my code it did not), the memory mismatch did come to mind, but I was looking for a little more confirmation, like "invalid parameter" (that's what I get from cudaMemcpyAsync if passed a non-page-locked host pointer).
>
> For a performance-sensitive operation, do you recommend explicitly doing the cudaMemcpy or is the tensor loop fast enough?

You definitely don't want to copy each item individually if you can avoid it. A cudaMemcpy will use the copy engines if available, and bulk transfers will be orders of magnitude faster than individual updates.
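
To make the contrast concrete, here is a sketch of the two approaches using the names from the earlier snippets (data, dataT, Complex, N); it is not a benchmark, just the shape of each:

// Per-element copy into a managed-memory tensor: every assignment is a
// host-side access of managed memory, which is slow for large N.
auto dataT = matx::make_tensor<Complex>( { N } );
for (size_t idx = 0; idx < data.size(); ++idx) dataT(idx) = data[idx];

// Bulk copy into a device tensor: one cudaMemcpy can use the copy engines and
// transfers the whole buffer in a single operation.
auto data_device = matx::make_tensor<Complex>( { N }, matx::MATX_DEVICE_MEMORY );
cudaMemcpy(data_device.Data(), reinterpret_cast<const Complex*>(data.data()),
           data_device.Bytes(), cudaMemcpyDefault);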

@jwmelto
Author

jwmelto commented Feb 4, 2025

Thanks. It's what I would have expected, but I've been wrong before.

@cliffburdick
Collaborator

> Thanks. It's what I would have expected, but I've been wrong before.

One last thing on your example: using a std::vector to back a tensor is technically undefined anyway. The reason is that if your vector grows, the standard library may reallocate the data in a completely different memory location, copy it, and free the old pointer. At that point the MatX tensor would point to invalid memory and could crash.
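
To make the hazard concrete, here is a hypothetical sequence in which the tensor's backing pointer is invalidated:

std::vector<HostComplex> data(N);
auto dataT = matx::make_tensor<Complex>( reinterpret_cast<Complex*>(data.data()), { N } );

data.push_back({});  // may reallocate: data.data() changes and the old buffer is freed
// dataT still refers to the old buffer; any further use is undefined behavior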

@jwmelto
Author

jwmelto commented Feb 5, 2025

If you see my last comment in the "Discussion", I was looking for a more "natural" container-assignment idiom. I found the pointer-initialized tensor constructor and thought that tensor-to-tensor assignment might invoke the movement to the device. I'm OK with doing the explicit declarations and copies, but learning the MatX idioms is slow going.

@cliffburdick
Collaborator

I added a small section to the documentation here:
https://github.com/NVIDIA/MatX/pull/862/files

Hopefully this will clarify things since I think you have a very common use case. If you've read through the quickstart guide and the "external libraries" section and things still aren't making sense, please let us know. We're always looking to add more docs.

@jwmelto
Author

jwmelto commented Feb 5, 2025

That looks good. I'll try to create discrete issues for each suggestion. Thanks for your help.

jwmelto closed this as completed Feb 5, 2025