Version 0.6.2 RC1: Stream callback semantics change, bug fixes #476

eyalroz · 2023-02-28T17:26:19Z

eyalroz
Feb 28, 2023
Maintainer

The most significant change in this version regards the way callbacks/host functions are supported. This change is motivated mostly as preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.

So far, a callback was any object invokable with a cuda::stream_t parameter. From now on, we support two kinds of callback:

A plain function - not a closure, which may be invoked with a pointer to an arbitrary type: cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)
An object invokable with no parameters - a closure, to which one cannot provide any additional information: cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)
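To make the difference between the two kinds concrete, here is a minimal sketch using a toy, CPU-only queue. Everything in it (toy_stream, its run() method, add_ten) is a hypothetical stand-in written for illustration, not the library's implementation, and the parameter lists are simplified; the only thing borrowed from CUDA is the shape of a host function, which takes a single void* of user data.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a stream's queue of host work.
// Hypothetical, for illustration only; not the library's code.
class toy_stream {
public:
    using host_function = void (*)(void*);

    // Kind 1: a plain function plus a pointer to data the caller owns.
    void host_function_call(host_function function, void* user_data)
    {
        assert(function != nullptr);
        queue_.push_back({function, user_data});
    }

    // Kind 2: an object invokable with no parameters. Only its address is
    // stored (no per-callback copy is made), so the caller must keep the
    // object alive for as long as the queue may run it.
    template <typename Invokable>
    void host_invokable(Invokable& invokable)
    {
        queue_.push_back({&invoke<Invokable>, &invokable});
    }

    // Run everything enqueued so far, in order. Nothing is freed here,
    // so the same entries can be run again.
    void run() const
    {
        for (const auto& e : queue_) { e.function(e.user_data); }
    }

private:
    struct entry { host_function function; void* user_data; };

    // Recovers the invokable's type and calls it.
    template <typename Invokable>
    static void invoke(void* type_erased)
    {
        (*static_cast<Invokable*>(type_erased))();
    }

    std::vector<entry> queue_;
};

// A plain function for kind 1: all it knows arrives through user_data.
inline void add_ten(void* user_data) { *static_cast<int*>(user_data) += 10; }
```

In this sketch, a caller would pass add_ten together with the address of an int for the first kind, and a named lambda (an lvalue that outlives the queue) for the second. Because neither path copies or frees the callable, running the queue a second time is safe.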

These two forms let us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. Also, the old mechanism contradicted our principle of not adding layers of abstraction over what CUDA itself provides.
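The allocate-at-enqueue, free-at-launch pattern can be sketched roughly as follows. This is an assumption-laden illustration of the general pattern, not the library's previous code: the names enqueued_call, invoke_and_delete and enqueue_with_heap_copy are hypothetical, and the callable takes no parameters here for brevity.

```cpp
#include <cassert>
#include <utility>

using host_function = void (*)(void*);

// What would be handed to the stream: a plain function and its user data.
struct enqueued_call { host_function function; void* user_data; };

// Trampoline run at "launch": invokes the heap-held callable, then frees it.
// After this returns, user_data dangles, so the call must never run twice.
template <typename Callable>
void invoke_and_delete(void* type_erased)
{
    auto* callable = static_cast<Callable*>(type_erased);
    (*callable)();
    delete callable;
}

// "Enqueue": moves the callable to the heap so it outlives the caller's scope.
template <typename Callable>
enqueued_call enqueue_with_heap_copy(Callable callable)
{
    auto* heap_copy = new Callable(std::move(callable));
    return { &invoke_and_delete<Callable>, heap_copy };
}
```

The single-shot nature is the point: invoking the returned call once is fine, but a second invocation would use freed memory, which is why this pattern cannot serve a callback that has to fire repeatedly.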

Of course, the release also has the "usual" long list of minor fixes.

Changes to existing API

Avoid the fancy-shmancy heap allocation for stream callbacks #473 Redesign of host function / callback enqueue and launch mechanism, see above
cuda::kernel::get() should not take an arbitrary context #459 cuda::kernel::get() now takes a device, not a context - since it can't really do anything useful for non-primary contexts (the primary context is where apriori-compiled kernels are available)

API additions

Have memory::type_of treat non-CUDA-allocated memory as a non-error #468 Added a non-CUDA memory type enum value; can now check the memory type of any pointer without throwing an error.
stream_t::enqueue::memset() should support regions #472 Can now pass cuda::memory::region_t's when enqueueing copy operations on streams (and thus also cuda::span<T>'s)
Make copy_parameters_t user-facing and beef it up #466 Can now perform copies using cuda::memory::copy_parameters_t<N> (for N=2 or 3), a wrapper of the CUDA driver's richest parameters structure with multiple convenience functions, for maximum configurability of a copy operation. But - this structure is not currently "fool-proof", so use with care and initialize all relevant fields.
Allow obtaining a pointer's device and context without wrapping it #463 Can now obtain a raw pointer's context and device without first wrapping it in a cuda::pointer_t
Support the memory barrier "stream memory operation" #452 Support enqueuing a memory barrier on a stream (one of the "batch stream memory operations")

Bug fixes

device::get() marked noexcept #475 device::get() no longer incorrectly marked as noexcept
Array-to-raw-mem copy function must either get a context or determine it #467 Array-to-raw-memory copy function now determines the context for the target area, and a new variant of the function takes the context as a parameter.
Missing definition for allocate_managed() in src/cuda/api/context.hpp #455 Added the missing definition of allocate_managed() in context.hpp
Set flags for the flush-remote-writes stream operation #453 Now actually setting the flags when enqueueing a flush_remote_writes() operation on a stream (this is one of the "batch stream memory operations")
Memory leak: Allocation without release in cuda::memory::virtual::set_access_mode #450 Fixed an allocation-without-release in cuda::memory::virtual::set_access_mode
apriori_compiled_kernel_t::get_attribute should be marked inline #449 apriori_compiled_kernel_t::get_attribute() was missing an inline decoration
cuda::profiling::mark::range_start and range_end call create_attributions the wrong way #448 cuda::profiling::mark::range_start() and range_end() were calling create_attributions() the wrong way

Cleanup and warning avoidance

Member initialisation order error. #443 Aligned member initialization order(s) in array_t with their declaration order.

Compatibility

Support obtaining a pointer's device in CUDA 9.1 and earlier #462 Can now obtain a pointer's device in CUDA 9.x (not just 10.0 and later)
Restore CUDA 9.x compatibility #304 Some CUDA 9.x incompatibilities have been fixed

Other changes

Make more comparison operators constexpr and noexcept #471 Made a few more comparison operators constexpr

This discussion was created from the release Version 0.6.2 RC1: Stream callback semantics change, bug fixes.
