Incorrect device placement for multi device transfers #19891

Closed · rsuderman opened this issue Feb 3, 2025 · 7 comments

@rsuderman (Contributor)

A tensor parallelism 4 model is failing to place transfers onto the correct devices. The IR linked below generates copies to the same device that execution runs on.

https://gist.github.com/rsuderman/8b08489da40673087d00e2e716b02fae

This comes from compiling this MLIR file:
https://gist.github.com/rsuderman/1af7b0c558ec1dfb59352369045c1cc4
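To make the failure mode concrete, here is a minimal sketch of the bad vs. expected placement (hypothetical SSA names %t and %size; the ops are borrowed from the stream dialect IR in the gists):

    // Bad: the destination for what should be a cross-device transfer is
    // allocated on the executing device itself (@__device_3), so the copy
    // never leaves that device.
    %bad, %bad_tp = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_3>) await(%t) => !stream.resource<transient>{%size} => !stream.timepoint

    // Expected: the destination is allocated on the target device, and the
    // copy is paired with a flush to that device.
    %good, %good_tp = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) await(%t) => !stream.resource<transient>{%size} => !stream.timepoint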

@benvanik (Collaborator) commented Feb 3, 2025

And that's with #19872? What's the IR before iree-stream-schedule-allocation?

@benvanik (Collaborator) commented Feb 3, 2025

(Also, 4 GB transients are not only big, but if you're really doing 4 GB copies across devices you're going to have a bad time. Are you sure your allocs are 4 GB?)

@rsuderman (Contributor, Author)

> And that's with #19872? What's the IR before iree-stream-schedule-allocation?

Looks like the transfers should be happening to each device. I think the resource analysis is failing in this case:

https://gist.github.com/rsuderman/52573861fae608832a7f37974eedde0e

And here is the IR after the pass, for the relevant region:

https://gist.github.com/rsuderman/5dc1d5ed55821999406f62ff8fb2f0dd

@benvanik (Collaborator) commented Feb 3, 2025

Ah, the transfer target analysis is likely getting confused by the concurrent region. I'll fix the PR.
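Roughly, a minimal sketch of the shape involved (hypothetical %src, %dst, and %size, modeled on the fixed IR below): the per-device copy and its flush live inside a stream.cmd.concurrent region, so an analysis that only walks the immediate ops of the execution region can miss the real transfer target:

    stream.cmd.concurrent {
      // The actual transfer target only appears on the nested flush op.
      stream.cmd.copy %src[%c0], %dst[%c0], %size : !stream.resource<transient>{%size} -> !stream.resource<transient>{%size}
      stream.cmd.flush to(#hal.device.affinity<@__device_0>) %dst[%c0 for %size] : !stream.resource<transient>{%size}
    }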

@rsuderman (Contributor, Author)

> (Also, 4 GB transients are not only big, but if you're really doing 4 GB copies across devices you're going to have a bad time. Are you sure your allocs are 4 GB?)

Yeah, we need to reduce those. This was just a demo program pushed with large execution times, so Alex is going to bring them down by a factor.

@benvanik (Collaborator) commented Feb 3, 2025

#19872 has been updated to handle concurrent regions. With the attached input IR I now get the expected IR:

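    // One 4 GiB transient destination is allocated per device, plus a larger
    // 12 GiB transient on @__device_3 where the dispatches run.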
    %result_28, %result_timepoint_29 = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
    %result_30, %result_timepoint_31 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_1>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
    %result_32, %result_timepoint_33 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_2>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
    %result_34, %result_timepoint_35 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_3>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
    %result_36, %result_timepoint_37 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_3>) await(%1) => !stream.resource<transient>{%c12884901888} => !stream.timepoint
    %11 = stream.timepoint.join max(%result_timepoint_29, %result_timepoint_31, %result_timepoint_33, %result_timepoint_35, %result_timepoint_37) => !stream.timepoint
    %12 = stream.cmd.execute on(#hal.device.promise<@__device_3>) await(%11) => with(%0 as %arg3: !stream.resource<external>{%c4294967296}, %__hoisted_tensor_512x4096x4x1xf16_2 as %arg4: !stream.resource<constant>{%c50331648}, %result_28 as %arg5: !stream.resource<transient>{%c4294967296}, %result_30 as %arg6: !stream.resource<transient>{%c4294967296}, %result_32 as %arg7: !stream.resource<transient>{%c4294967296}, %result_34 as %arg8: !stream.resource<transient>{%c4294967296}, %result_36 as %arg9: !stream.resource<transient>{%c12884901888}) {
      stream.cmd.copy %arg3[%c0], %arg9[%c0], %c4294967296 : !stream.resource<external>{%c4294967296} -> !stream.resource<transient>{%c12884901888}
      stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_pack_f16 {
        ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
        wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
      }
      stream.cmd.concurrent {
        stream.cmd.dispatch @main$async_dispatch_1::@main$async_dispatch_1_mmt4d_65536x512x4096x8x4x1_f16xf16xf32(%c1_i32, %c0_i32, %c0_i32 : i32, i32, i32) {
          ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
          ro %arg4[%c0 for %c50331648] : !stream.resource<constant>{%c50331648},
          wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
        }
        stream.cmd.dispatch @main$async_dispatch_1::@main$async_dispatch_1_mmt4d_65536x512x4096x8x4x1_f16xf16xf32(%c1_i32, %c16777216_i32, %c2_i32 : i32, i32, i32) {
          ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
          ro %arg4[%c0 for %c50331648] : !stream.resource<constant>{%c50331648},
          wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
        }
      }
      stream.cmd.dispatch @main$async_dispatch_9::@main$async_dispatch_9_unpack_f32(%c1_i32 : i32) {
        ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
        wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
      }
      stream.cmd.dispatch @main$async_dispatch_16::@main$async_dispatch_16_unpack_elementwise_524288x2048_f32xf32xf16(%c0_i32, %c1_i32 : i32, i32) {
        ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
        wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
      }
      stream.cmd.dispatch @main$async_dispatch_20::@main$async_dispatch_20_pack_f16 {
        ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
        wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
      }
      stream.cmd.dispatch @main$async_dispatch_21::@main$async_dispatch_21_mmt4d_65536x1024x2048x8x4x1_f16xf16xf32 {
        ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
        ro %arg4[%c0 for %c50331648] : !stream.resource<constant>{%c50331648},
        wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
      }
      stream.cmd.dispatch @main$async_dispatch_22::@main$async_dispatch_22_unpack_elementwise_524288x4096_f32xf16(%c0_i32, %c0_i32 : i32, i32) {
        ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
        wo %arg8[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
      }
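      // Broadcast: copy the @__device_3 result into each peer's transient and
      // flush it to the destination device.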
      stream.cmd.concurrent {
        stream.cmd.copy %arg8[%c0], %arg5[%c0], %c4294967296 : !stream.resource<transient>{%c4294967296} -> !stream.resource<transient>{%c4294967296}
        stream.cmd.flush to(#hal.device.affinity<@__device_0>) %arg5[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
        stream.cmd.copy %arg8[%c0], %arg6[%c0], %c4294967296 : !stream.resource<transient>{%c4294967296} -> !stream.resource<transient>{%c4294967296}
        stream.cmd.flush to(#hal.device.promise<@__device_1>) %arg6[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
        stream.cmd.copy %arg8[%c0], %arg7[%c0], %c4294967296 : !stream.resource<transient>{%c4294967296} -> !stream.resource<transient>{%c4294967296}
        stream.cmd.flush to(#hal.device.promise<@__device_2>) %arg7[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
      }
    } => !stream.timepoint
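In short: one transient is allocated per device, all dispatches execute on @__device_3, and the closing stream.cmd.concurrent region pairs each stream.cmd.copy with a stream.cmd.flush that names the real destination device, rather than the same-device copies from the original report.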

@benvanik (Collaborator) commented Feb 3, 2025

Closing out as it seems to be fixed in the PR now! Let me know if you still see issues.

@benvanik closed this as completed Feb 3, 2025