Incorrect device placement for multi device transfers #19891
Comments
And that's with #19872? What's the IR before iree-stream-schedule-allocation?
(Also, 4 GB transients are not only big, but if you're really doing 4 GB copies across devices you're going to have a bad time. Are you sure your allocs are 4 GB?)
Looks like the transfers should be happening to each device. I think the resource analysis is failing in this case: https://gist.github.com/rsuderman/52573861fae608832a7f37974eedde0e Here is the "after" IR for the rough region: https://gist.github.com/rsuderman/5dc1d5ed55821999406f62ff8fb2f0dd
Ah, the transfer target analysis is likely getting confused by the concurrent region.
Yeah, we need to reduce those. This was just a demo program pushed to have large execution times, so Alex is going to bring it down by a factor.
#19872 has been updated to handle concurrent regions. With the input IR attached, I now get the expected IR:
%result_28, %result_timepoint_29 = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
%result_30, %result_timepoint_31 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_1>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
%result_32, %result_timepoint_33 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_2>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
%result_34, %result_timepoint_35 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_3>) await(%1) => !stream.resource<transient>{%c4294967296} => !stream.timepoint
%result_36, %result_timepoint_37 = stream.resource.alloca uninitialized on(#hal.device.promise<@__device_3>) await(%1) => !stream.resource<transient>{%c12884901888} => !stream.timepoint
%11 = stream.timepoint.join max(%result_timepoint_29, %result_timepoint_31, %result_timepoint_33, %result_timepoint_35, %result_timepoint_37) => !stream.timepoint
%12 = stream.cmd.execute on(#hal.device.promise<@__device_3>) await(%11) => with(%0 as %arg3: !stream.resource<external>{%c4294967296}, %__hoisted_tensor_512x4096x4x1xf16_2 as %arg4: !stream.resource<constant>{%c50331648}, %result_28 as %arg5: !stream.resource<transient>{%c4294967296}, %result_30 as %arg6: !stream.resource<transient>{%c4294967296}, %result_32 as %arg7: !stream.resource<transient>{%c4294967296}, %result_34 as %arg8: !stream.resource<transient>{%c4294967296}, %result_36 as %arg9: !stream.resource<transient>{%c12884901888}) {
  stream.cmd.copy %arg3[%c0], %arg9[%c0], %c4294967296 : !stream.resource<external>{%c4294967296} -> !stream.resource<transient>{%c12884901888}
  stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_pack_f16 {
    ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
    wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
  }
  stream.cmd.concurrent {
    stream.cmd.dispatch @main$async_dispatch_1::@main$async_dispatch_1_mmt4d_65536x512x4096x8x4x1_f16xf16xf32(%c1_i32, %c0_i32, %c0_i32 : i32, i32, i32) {
      ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
      ro %arg4[%c0 for %c50331648] : !stream.resource<constant>{%c50331648},
      wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
    }
    stream.cmd.dispatch @main$async_dispatch_1::@main$async_dispatch_1_mmt4d_65536x512x4096x8x4x1_f16xf16xf32(%c1_i32, %c16777216_i32, %c2_i32 : i32, i32, i32) {
      ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
      ro %arg4[%c0 for %c50331648] : !stream.resource<constant>{%c50331648},
      wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
    }
  }
  stream.cmd.dispatch @main$async_dispatch_9::@main$async_dispatch_9_unpack_f32(%c1_i32 : i32) {
    ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
    wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
  }
  stream.cmd.dispatch @main$async_dispatch_16::@main$async_dispatch_16_unpack_elementwise_524288x2048_f32xf32xf16(%c0_i32, %c1_i32 : i32, i32) {
    ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
    wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
  }
  stream.cmd.dispatch @main$async_dispatch_20::@main$async_dispatch_20_pack_f16 {
    ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
    wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
  }
  stream.cmd.dispatch @main$async_dispatch_21::@main$async_dispatch_21_mmt4d_65536x1024x2048x8x4x1_f16xf16xf32 {
    ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
    ro %arg4[%c0 for %c50331648] : !stream.resource<constant>{%c50331648},
    wo %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888}
  }
  stream.cmd.dispatch @main$async_dispatch_22::@main$async_dispatch_22_unpack_elementwise_524288x4096_f32xf16(%c0_i32, %c0_i32 : i32, i32) {
    ro %arg9[%c0 for %c12884901888] : !stream.resource<transient>{%c12884901888},
    wo %arg8[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
  }
  stream.cmd.concurrent {
    stream.cmd.copy %arg8[%c0], %arg5[%c0], %c4294967296 : !stream.resource<transient>{%c4294967296} -> !stream.resource<transient>{%c4294967296}
    stream.cmd.flush to(#hal.device.affinity<@__device_0>) %arg5[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
    stream.cmd.copy %arg8[%c0], %arg6[%c0], %c4294967296 : !stream.resource<transient>{%c4294967296} -> !stream.resource<transient>{%c4294967296}
    stream.cmd.flush to(#hal.device.promise<@__device_1>) %arg6[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
    stream.cmd.copy %arg8[%c0], %arg7[%c0], %c4294967296 : !stream.resource<transient>{%c4294967296} -> !stream.resource<transient>{%c4294967296}
    stream.cmd.flush to(#hal.device.promise<@__device_2>) %arg7[%c0 for %c4294967296] : !stream.resource<transient>{%c4294967296}
  }
} => !stream.timepoint
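For anyone reading the sizes in the IR above, here is a small Python sketch that converts the `%c...` byte constants into binary units. The byte counts are copied from the IR; the `human` helper and the dictionary labels are illustrative names only, not anything from IREE.

```python
# Byte counts copied from the %c... size constants in the IR above.
# The labels are informal descriptions, not IREE terminology.
SIZES = {
    "transient (%result_28 .. %result_34)": 4294967296,
    "transient (%result_36)": 12884901888,
    "constant (%arg4)": 50331648,
}


def human(n: int) -> str:
    """Format a byte count using binary units (GiB, MiB, KiB)."""
    for unit, shift in (("GiB", 30), ("MiB", 20), ("KiB", 10)):
        if n >= 1 << shift:
            return f"{n / (1 << shift):g} {unit}"
    return f"{n} B"


for label, size in SIZES.items():
    print(f"{label}: {human(size)}")
```

This confirms the transients really are 4 GiB each, with one 12 GiB transient and a 48 MiB constant.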
Closing out as it seems to be fixed in the PR now! Let me know if you still see issues.
A model with tensor parallelism 4 is failing to place transfers onto the correct devices. Note that the IR linked below generates copies to the same device that execution is on.
https://gist.github.com/rsuderman/8b08489da40673087d00e2e716b02fae
This is based on compiling this mlir file:
https://gist.github.com/rsuderman/1af7b0c558ec1dfb59352369045c1cc4
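One rough way to eyeball placement in a dump like this is to list the device each `stream.cmd.execute` runs on next to the devices named by `stream.cmd.flush to(...)`. The Python sketch below does that with regular expressions. It is a hypothetical helper, not part of IREE, and it assumes the textual op forms `stream.cmd.execute on(#hal.device.<kind><@name>)` and `stream.cmd.flush to(#hal.device.<kind><@name>)` as they appear in the IR pasted in this thread. The sample text is abbreviated from that IR, with `...` standing in for elided operands.

```python
import re

# Patterns assume the textual forms seen in this thread's IR dumps.
EXECUTE_RE = re.compile(r"stream\.cmd\.execute\s+on\(#hal\.device\.\w+<@(\w+)>\)")
FLUSH_RE = re.compile(r"stream\.cmd\.flush\s+to\(#hal\.device\.\w+<@(\w+)>\)")


def execute_devices(ir_text: str) -> list[str]:
    """Devices named in stream.cmd.execute on(...) clauses, in textual order."""
    return EXECUTE_RE.findall(ir_text)


def flush_targets(ir_text: str) -> list[str]:
    """Devices named in stream.cmd.flush to(...) clauses, in textual order."""
    return FLUSH_RE.findall(ir_text)


# Abbreviated sample based on the IR in this thread ("..." marks elided text).
SAMPLE_IR = """
%12 = stream.cmd.execute on(#hal.device.promise<@__device_3>) await(%11) => with(...) {
  stream.cmd.flush to(#hal.device.affinity<@__device_0>) %arg5[...] : ...
  stream.cmd.flush to(#hal.device.promise<@__device_1>) %arg6[...] : ...
  stream.cmd.flush to(#hal.device.promise<@__device_2>) %arg7[...] : ...
} => !stream.timepoint
"""

print("execute on:", execute_devices(SAMPLE_IR))
print("flush to:  ", flush_targets(SAMPLE_IR))
```

If the flush targets only ever name the executing device, or no flushes to other devices appear at all, that would be consistent with the misplacement described here. This is only a text-level check and says nothing about what the analysis itself concluded.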