Randomized failure of PTX JIT Compilation #528

Open
PolyRocketMatt opened this issue Aug 11, 2024 · 3 comments


PolyRocketMatt commented Aug 11, 2024

Describe the bug

When I try to run my application on the latest TornadoVM build, it occasionally throws an error indicating that PTX JIT compilation failed:

Unable to compile task 300502d1-daec-4e34-b335-8fde2503eb00.mxm - addFloat
The internal error is: [Error During the Task Compilation]

How To Reproduce

My Main.java simply runs the following:

public class Main {

    public static void main(String[] args) {
        new Main().run();
    }

    public void run() {
        FloatArray nativeBuffer = getFrom...();
        FloatArray nativeResultBuffer = getFrom...();
        int size = 1024;

        // Create a task-graph with multiple tasks. Each task points to an existing Java method
        // that can be accelerated on a GPU/FPGA
        TaskGraph taskGraph = new TaskGraph(UUID.randomUUID().toString())
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, nativeBuffer, nativeResultBuffer) // Transfer data from host to device only in the first execution
                .task("mxm", AdditionTask::addFloat, nativeBuffer, nativeResultBuffer, 1.0f, size)             // Each task points to an existing Java method
                .transferToHost(DataTransferMode.EVERY_EXECUTION, nativeResultBuffer);     // Transfer data from device to host

        // Create an immutable task-graph
        ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();

        // Create an execution plan from an immutable task-graph
        try (TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {

            // Run the execution plan on the default device
            TornadoExecutionResult executionResult = executionPlan.execute();

            if (executionResult.isReady()) {
               ...
            }
        } catch (TornadoExecutionPlanException ex) {
            ex.printStackTrace();
        }
    }
}
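
For completeness, the snippet above assumes the usual TornadoVM API imports (package names taken from the TornadoVM 1.x API; adjust them if your version differs):

import java.util.UUID;

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.TornadoExecutionResult;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.exceptions.TornadoExecutionPlanException;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;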

The AdditionTask class looks like this:

public class AdditionTask implements BufferTask {

    public static void addFloat(@NotNull FloatArray input, @NotNull FloatArray output,
                                float value, int size) {
        for (@Parallel int i = 0; i < size; i++)
            output.set(i, 1.0f);
    }
}

In this case, BufferTask is just an empty interface.

Expected behavior

I expect the code to run without throwing any compilation errors. At the moment it only succeeds in some runs.

Computing system setup (please complete the following information):

  • OS: Windows 10
  • CUDA: cuda_11.7.r11.7/compiler.31294372_0
  • PTX: How do I obtain this?
  • TornadoVM commit id: 8db121e

Additional context

The attached log was generated with the --debug flag from a run of my original program; the minimal reproducible example above should still be a valid proxy for it.

debug.txt


@jjfumero (Member) commented:

Hi @PolyRocketMatt, thanks for the report.

To see the generated PTX code, you can use the --printKernel option of the tornado command.
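
Alternatively, the same thing can be requested programmatically on the execution plan. A quick sketch follows; I am assuming the withPrintKernel() option here, so double-check the exact method name against the API version you are on:

try (TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {
    // Assumption: withPrintKernel() asks TornadoVM to dump the generated kernel
    // (PTX for the NVIDIA backend) to standard output before execution.
    executionPlan.withPrintKernel().execute();
}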

Just having a quick look at the issue, I am not sure the @NotNull annotation is supported. It might be that the IR is not clean (i.e., not ready to be consumed by the PTX code generator) because the annotation introduces extra nodes. Can you check without it?
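
For reference, the check would be the same kernel with the annotations stripped, so that only TornadoVM's own @Parallel annotation remains on the loop:

public class AdditionTask implements BufferTask {

    // Same kernel as above, only without the @NotNull annotations.
    public static void addFloat(FloatArray input, FloatArray output,
                                float value, int size) {
        for (@Parallel int i = 0; i < size; i++) {
            output.set(i, 1.0f);
        }
    }
}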

@jjfumero jjfumero self-assigned this Aug 11, 2024
@jjfumero jjfumero added the PTX label Aug 11, 2024
PolyRocketMatt (Author) commented Aug 11, 2024

Hi, thanks for the fast response.

I have tried compiling and running the code without the @NotNull annotation, but it still only works occasionally. This is the generated PTX code, which targets PTX ISA version 7.6:

.version 7.6
.target sm_86
.address_size 64

.visible .entry 90375540_e9cf_461e_b9a2_d7dc8d46e67e_mxm_addfloat_arrays_floatarray_arrays_floatarray_1_0_1024(.param .u64 .ptr .global .align 8 kernel_context, .param .u64 .ptr .global .align 8 input, .param .u64 .ptr .global .align 8 output, .param .align 8 .u64 value, .param .align 8 .u64 size) {
        .reg .s64 rsd<3>;
        .reg .u64 rud<5>;
        .reg .s32 rsi<6>;
        .reg .pred rpb<2>;
        .reg .u32 rui<5>;

BLOCK_0:
        ld.param.u64    rud0, [kernel_context];
        ld.param.u64    rud1, [output];
        mov.u32 rui0, %nctaid.x;
        mov.u32 rui1, %ntid.x;
        mul.wide.u32    rud2, rui0, rui1;
        cvt.s32.u64     rsi0, rud2;
        mov.u32 rui2, %tid.x;
        mov.u32 rui3, %ctaid.x;
        mad.lo.s32      rsi1, rui3, rui1, rui2;

BLOCK_1:
        mov.s32 rsi2, rsi1;
LOOP_COND_1:
        setp.lt.s32     rpb0, rsi2, 1024;
        @!rpb0 bra      BLOCK_3;

BLOCK_2:
        add.s32 rsi3, rsi2, 6;
        cvt.s64.s32     rsd0, rsi3;
        shl.b64 rsd1, rsd0, 2;
        add.u64 rud3, rud1, rsd1;
        st.global.f32   [rud3], 0F3F800000;
        add.s32 rsi4, rsi0, rsi2;
        mov.s32 rsi2, rsi4;
        bra.uni LOOP_COND_1;

BLOCK_3:
        ret;
}

PolyRocketMatt (Author) commented:

I'm not sure whether this is the solution, but after removing the UUID from the name of the task graph (and consequently from the generated PTX kernel name), everything seems to behave stably. I'll do some further experimentation to see whether this consistently fixes the errors I was getting.
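
Concretely, the only change is the name passed to the TaskGraph constructor; everything else in the plan stays the same (sketch below, with "s0" just as an arbitrary fixed name):

// Fixed, deterministic task-graph name instead of UUID.randomUUID().toString().
TaskGraph taskGraph = new TaskGraph("s0")
        .transferToDevice(DataTransferMode.FIRST_EXECUTION, nativeBuffer, nativeResultBuffer)
        .task("mxm", AdditionTask::addFloat, nativeBuffer, nativeResultBuffer, 1.0f, size)
        .transferToHost(DataTransferMode.EVERY_EXECUTION, nativeResultBuffer);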

@jjfumero jjfumero added bug Something isn't working compiler labels Aug 13, 2024