Host-Device Unnecessary Copying #5483
-
Hi guys,

I'm currently realizing a pipeline P1 on a static buffer B1 on the device over and over again until a certain condition is met. The condition is checked through another pipeline P2, which takes B1 as input and writes to a smaller output buffer B2.

When running P1, the only data movement of B1 is the initial host->device copy/allocation before the first realize and a final device->host copy after a copy_to_host() call, as expected. For P2, I need a copy_to_host() of B2 after each realization to check the end condition. As expected, after realizing P2 the device_dirty flag of B2 is set, which allows the copy_to_host() call. However, a host->device copy is always performed before P2 executes, when it shouldn't be. Interestingly, if I call B2.set_device_dirty() before the copy call, a device->host copy is performed before the host->device copy!

Since the data in B2 depends only on B1, ideally there should be a single allocation/copy on the device the first time (which I'm even doing outside the main loop) and a device->host copy at every copy_to_host() call. I also tried B2.set_device_dirty(false), which made no difference.

It is worth noting that both P1 and P2 use update definitions. For P1 the first definition is a Halide::undef() for in-place realization. For P2 the first definition is an initialization (i.e., P2(...) = 0) and the second actually computes the results based on B1.

Just in case, below I've written some pseudo-code to better visualize what my algorithm does:
I imagine that Halide might see the update definition and assume that it should copy the host data, even with the host_dirty flag not set. I'm curious, since this does not happen for P1 (maybe something related to Halide::undef() and in-place realization?). What can I do to prevent Halide from performing the host->device copy of B2 before every realization? Thanks for the help.
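To reason about which copies to expect, it can help to model the dirty-flag bookkeeping in isolation. The following is a toy sketch in plain C++, not Halide's runtime code: `ToyBuffer`, `toy_realize_on_device`, and `run_loop` are hypothetical names, and the model simply assumes that an upload happens only when the host copy is marked dirty and a download only when the device copy is marked dirty, as the question describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a buffer with host and device storage.
// This is NOT Halide::Buffer; all names here are hypothetical.
struct ToyBuffer {
    std::vector<int> host;
    std::vector<int> device;
    bool host_dirty = false;    // host holds data the device has not seen
    bool device_dirty = false;  // device holds data the host has not seen
    int h2d_copies = 0;         // number of host->device copies performed
    int d2h_copies = 0;         // number of device->host copies performed

    explicit ToyBuffer(std::size_t n) : host(n, 0), device(n, 0) {}

    // Upload only when the host side is marked dirty.
    void copy_to_device() {
        if (host_dirty) {
            device = host;
            host_dirty = false;
            ++h2d_copies;
        }
    }

    // Download only when the device side is marked dirty.
    void copy_to_host() {
        if (device_dirty) {
            host = device;
            device_dirty = false;
            ++d2h_copies;
        }
    }
};

// Stand-in for realizing a device pipeline into `out`:
// sync pending host writes, overwrite the device data, mark it dirty.
void toy_realize_on_device(ToyBuffer &out, int value) {
    out.copy_to_device();
    for (int &v : out.device) v = value;
    out.device_dirty = true;
}

// Runs `iterations` rounds of "realize on device, then read on host".
// With touch_host set, the host side is marked dirty before each round,
// as if the host copy had been written between realizations.
ToyBuffer run_loop(int iterations, bool touch_host) {
    ToyBuffer b2(4);
    for (int i = 0; i < iterations; ++i) {
        if (touch_host) b2.host_dirty = true;
        toy_realize_on_device(b2, i);
        b2.copy_to_host();
    }
    return b2;
}
```

In this model, `run_loop(5, false)` performs five downloads and no uploads, while `run_loop(5, true)` adds one upload per round. Under the stated assumption, a repeated host->device copy points at something marking the host side dirty between realizations, which is worth checking in the real loop.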
-
I don't understand why you have B2.copy_to_device just before inspecting B2 on CPU. Is that supposed to be B2.copy_to_host? I'm not seeing any unusual copies when I attempt to reproduce this. Can you post a full repro? Here's what I have:
-
On another issue, I found that the stages of a function with update definitions are executed separately. With debugging ON, two executions happened for P2 (note: I named Func g("g")):

CUDA: halide_cuda_run (user_context: 0x7ffd11f431d8, entry: kernel_g_s0_x_v0___block_id_x, blocks: 13x1x1, threads: 10x1x1, shmem: 0
CUDA: halide_cuda_run (user_context: 0x7ffd11f431d8, entry: kernel_g_s1_x_v0___block_id_x, blocks: 13x1x1, threads: 10x1x1, shmem: 0

Highlighted are what I assume are the name of the Func, "g", and the definition or stage, "s". Is this the default behavior of Halide, to perform a single
-
That's a function of the schedule. The simplest way to run them together would be to use Func::in to make a wrapper Func, and replace:
with
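As a rough analogy for the two launches discussed above, the sketch below contrasts running the initialization and update stages as two separate loops with running both stages per element inside one loop, which is the loop structure one would expect if g were computed inside a wrapper's loop. It is plain C++ with no Halide calls; `two_pass`, `fused`, and the `launches` counter are hypothetical names, and this is not the schedule from the reply.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two separate loop nests: one for the initialization stage (s0) and one
// for the update stage (s1). Each increment of `launches` stands in for
// one kernel launch.
std::vector<int> two_pass(const std::vector<int> &in, int &launches) {
    std::vector<int> g(in.size());
    ++launches;  // stage s0: g(x) = 0
    for (std::size_t x = 0; x < g.size(); ++x) g[x] = 0;
    ++launches;  // stage s1: g(x) += in(x)
    for (std::size_t x = 0; x < g.size(); ++x) g[x] += in[x];
    return g;
}

// One loop nest: both stages of g run for a single element inside the
// loop of a wrapper, which then stores the result.
std::vector<int> fused(const std::vector<int> &in, int &launches) {
    std::vector<int> out(in.size());
    ++launches;  // single loop nest covering s0, s1, and the wrapper
    for (std::size_t x = 0; x < out.size(); ++x) {
        int g = 0;    // s0 for this element
        g += in[x];   // s1 for this element
        out[x] = g;   // wrapper writes the value out
    }
    return out;
}
```

Both functions produce the same values; only the number of passes over the data differs (two versus one).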