
Added support for 2D operations and C2R/R2C FFT #305

Merged
merged 23 commits into main from add-2dsupport on Nov 7, 2024
Conversation

@wvbbreu (Collaborator) commented Oct 31, 2024

Description

This pull request adds support for several 2D operations (memset2D, memcpyHtoD2DAsync, and memcpyDtoH2DAsync), introduces the FFT1DRealToComplex and FFT1DComplexToReal classes for FFT transformations, and adds the cu::Device::getOrdinal() method to retrieve the current device ID.
Related tests are included in this pull request.

Related issues:

None

Instructions to review the pull request

  • Check that CHANGELOG.md has been updated if necessary

@wvbbreu wvbbreu requested a review from csbnw October 31, 2024 13:21
@wvbbreu wvbbreu self-assigned this Oct 31, 2024
@wvbbreu wvbbreu added the enhancement New feature or request label Oct 31, 2024
@wvbbreu wvbbreu requested a review from csbnw November 5, 2024 08:03
void zero(size_t size) { memset(static_cast<unsigned char>(0), size); }

const void *parameter() const  // used to construct parameter list for launchKernel()
{
  return &_obj;
}

void *parameter_copy() { return reinterpret_cast<void *>(_obj); }
Contributor
This function doesn't seem to be used, remove it?

Suggested change
void *parameter_copy() { return reinterpret_cast<void *>(_obj); }

@wvbbreu (Collaborator, Author) Nov 5, 2024

This function was intentionally added; I will demonstrate its use through a small example. Some libraries, such as dedisp, (unfortunately) perform a significant amount of manual pointer arithmetic. The dereference operator (cu::DeviceMemory::operator*) would be quite beneficial in such cases. However, it is not allowed here, as the memory in these situations is not managed memory (the CU_POINTER_ATTRIBUTE_IS_MANAGED attribute is not set), resulting in a runtime exception.

Option 1 (Dereference Operator)

unsigned int i = 1;
unsigned int nsamp_padded = 2345;
unsigned int nchan_fft_batch = 5678;

// idata should point to a device address at an element offset into d_data_x_nu
cu::DeviceMemory idata(static_cast<cufftReal*>(d_data_x_nu) + i * nsamp_padded * nchan_fft_batch);
// --> THROWS: "Cannot return memory of type CU_MEMORYTYPE_DEVICE as pointer."

Option 2 (cu::DeviceMemory Constructor with Offset)

Alternatively, the cu::DeviceMemory(DeviceMemory&, size_t offset, size_t size) constructor allows us to wrap the arithmetic within the DeviceMemory object itself. Here, offset is given in bytes, so we need to multiply by the word size (e.g., sizeof(cufftReal)). Even so, I’d prefer to do pointer arithmetic with a cufftReal pointer rather than manually determining the byte offset.

cu::DeviceMemory idata(d_data_x_nu, (i * nsamp_padded * nchan_fft_batch) * sizeof(cufftReal), d_data_x_nu.size());

Option 3 (cu::DeviceMemory::parameter)

As an alternative, we could use cu::DeviceMemory::parameter(). However, it has its limitations. This function returns a const pointer to the _obj pointer stored in cu::DeviceMemory, not the actual underlying device pointer, so it can’t be used directly. Personally, I would rather avoid using this type of extensive casting in production.

const void* const* param = reinterpret_cast<const void* const*>(d_data_x_nu.parameter());

cu::DeviceMemory idata(reinterpret_cast<CUdeviceptr>(
    static_cast<const cufftReal*>(*param) +
    i * nsamp_padded * nchan_fft_batch));

Option 4 (cu::DeviceMemory::parameter_copy)

A final approach is to return the actual device pointer by introducing a parameter_copy() method.

cu::DeviceMemory idata(static_cast<const cufftReal*>(d_data_x_nu.parameter_copy()) + i * nsamp_padded * nchan_fft_batch);

What is your view on this?

Contributor

Option 4 looks much cleaner than option 3, but I really think that the name parameter_copy should be changed. It shouldn't contain an underscore (see also this page), but more importantly, copy is somewhat misleading. I think object or pointer would be better names.

Another thing that comes to mind is that a feature like this may be useful for the other classes in cu::, for instance when combining cudawrappers code with 'plain' CUDA code. So this function could even be added to cu::Wrapper?

@loostrum, @john-romein, what is your view here?

@john-romein (Contributor) Nov 5, 2024

Please stay as close as possible to the CUDA driver API. If you need to do arithmetic on a device memory pointer, simply cast the cu::DeviceMemory object to a CUdeviceptr (a cast operator inherited from cu::Wrapper<CUdeviceptr>) and do the arithmetic on the CUdeviceptr, which is an integer type on which arithmetic is allowed.

@wvbbreu (Collaborator, Author) Nov 5, 2024

I had not considered that approach, but it does work. The only downside is that, for non-char-sized types, it requires two casts. For example:

float* x = reinterpret_cast<float*>(static_cast<CUdeviceptr>(some_dev_object));
// Move x pointer
float* y = x + 3;

@john-romein @csbnw Would we rather opt for this approach than introducing a new method?

Contributor

Although this is likely to work in practice, it is not strictly conforming. It will not work in the (unlikely) case that the size of a host memory pointer is smaller than the size of a device memory pointer. The proper solution would be static_cast<CUdeviceptr>(some_dev_object) + 3 * sizeof(float).
Apart from that, I do not consider performing two casts a problem, as it is very clear what happens this way.

@wvbbreu (Collaborator, Author) Nov 5, 2024

The problem with performing arithmetic on a CUdeviceptr is that in HIP the underlying type is hipDeviceptr_t, which is internally declared as void*. Since arithmetic on a void pointer is not allowed, this results in the following compiler error.

error: arithmetic on a pointer to void
  549 |             static_cast<CUdeviceptr>(some_dev_object) + (3 * sizeof(float))

In contrast, CUDA allows this because the underlying type is unsigned long long. This issue will likely also affect other applications that use pointer arithmetic and have not yet been compiled with HIP.

Currently, the dereference operator is only allowed on managed memory. This commit
relaxes this requirement a bit by also allowing access to non-managed memory. This enables casts like this:

cu::DeviceMemory mem(1024);
float* ptr = static_cast<float*>(mem);

thereby avoiding an intermediate cast to CUdeviceptr.
@wvbbreu (Collaborator, Author) commented Nov 6, 2024

I came up with a solution that relaxes the requirements of cu::DeviceMemory::operator* a bit. Instead of only allowing de-referencing on managed memory, access to non-managed/device memory is also allowed (see example). This change has no impact on the existing cudawrappers API. Access to unallocated memory will still be handled through a sanity check (checkPointerAccess).

Before

cu::DeviceMemory dev_mem(1024);
cu::DeviceMemory idata(reinterpret_cast<CUdeviceptr>(reinterpret_cast<float*>(static_cast<CUdeviceptr>(dev_mem)) + 2));

After

cu::DeviceMemory dev_mem(1024);
cu::DeviceMemory idata(reinterpret_cast<CUdeviceptr>(static_cast<float*>(dev_mem) + 2));

@wvbbreu wvbbreu requested a review from csbnw November 6, 2024 14:33
@csbnw (Contributor) left a comment

Two minor suggestions, other than that I think it's ready to be merged.

@wvbbreu wvbbreu requested a review from csbnw November 7, 2024 10:23
@wvbbreu wvbbreu merged commit fd7f102 into main Nov 7, 2024
6 checks passed
@wvbbreu wvbbreu deleted the add-2dsupport branch November 7, 2024 11:57