Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FPGA result is not correct in JDK 22 branch, due to casting of the address type #517

Open
stratika opened this issue Jul 31, 2024 · 0 comments
Labels
bug Something isn't working jdk22

Comments

@stratika
Copy link
Collaborator

stratika commented Jul 31, 2024

Describe the bug

I tried the experimental branch of TornadoVM working with JDK 22, and noticed that the result of the generated OpenCL kernels for FPGAs are not result.

How To Reproduce

I have built TornadoVM as follows:

git checkout feat/jdk22_2nd_iteration
./bin/tornadovm-installer --jdk sapmachine-jdk-22 --backend opencl && source setvars.sh
tornado --version

returns

version=1.0.7-dev
branch=feat/jdk22_2nd_iteration
commit=dec5a87

Backends installed: 
	 - opencl

and I have used the oneAPI base toolkit to see also the Intel FPGA in emulation mode:

tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
	OPENCL --  [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
		Global Memory Size: 7.8 GB
		Local Memory Size: 48.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [1024]
		Max WorkGroup Configuration: [1024, 1024, 64]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
	OPENCL --  [Intel(R) OpenCL HD Graphics] -- Intel(R) Graphics [0x46a6]
		Global Memory Size: 24.8 GB
		Local Memory Size: 64.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [512]
		Max WorkGroup Configuration: [512, 512, 512]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
	OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i9-12900H
		Global Memory Size: 31.0 GB
		Local Memory Size: 32.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [8192]
		Max WorkGroup Configuration: [8192, 8192, 8192]
		Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
	OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
		Global Memory Size: 31.0 GB
		Local Memory Size: 256.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [67108864]
		Max WorkGroup Configuration: [67108864, 67108864, 67108864]
		Device OpenCL C version: OpenCL C 1.2

Then, I ran the DFTMT.java class from the dynamic examples (uk.ac.manchester.tornado.examples.dynamic) on the FPGA emulation platform, and got the output as follows:

tornado --threadInfo --printKernel --jvm="-Ds0.t0.device=0:3" -m tornado.examples/uk.ac.manchester.tornado.examples.dynamic.DFTMT --params="1024 default 1"
WARNING: Using incubator modules: jdk.incubator.vector
Initialization time:  2021238628 ns

Version running: default ! 
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void computeDFT(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *inreal, __global uchar *inimag, __global uchar *outreal, __global uchar *outimag, __global uchar *inputSize)
{
  int i_4, i_36, i_5, i_6, i_10, i_11, i_12, i_31, i_32, i_33; 
  ulong ul_13, ul_0, ul_15, ul_2, ul_34, ul_1, ul_3, ul_35; 
  float f_21, f_22, f_23, f_24, f_17, f_18, f_19, f_20, f_14, f_16, f_9, f_7, f_8, f_29, f_30, f_25, f_26, f_27, f_28; 

  // BLOCK 0
  ul_0  =  (ulong) inreal;
  ul_1  =  (ulong) inimag;
  ul_2  =  (ulong) outreal;
  ul_3  =  (ulong) outimag;
  i_4  =  get_global_size(0);
  i_5  =  get_global_id(0);
  // BLOCK 1 MERGES [0 5 ]
  i_6  =  i_5;
  // BLOCK 2
  f_7  =  (float) i_6;
  // BLOCK 3 MERGES [2 4 ]
  f_8  =  0.0F;
  f_9  =  0.0F;
  i_10  =  0;
  #pragma unroll 4
  for(;i_10 < 1024;)
  {
    // BLOCK 4
    i_11  =  i_10 + 6;
    i_12  =  i_11 << 2;
    ul_13  =  ul_0 + i_12;
    f_14  =  *((__global float *) ul_13);
    ul_15  =  ul_1 + i_12;
    f_16  =  *((__global float *) ul_15);
    f_17  =  *((__global float *) ul_13);
    f_18  =  *((__global float *) ul_15);
    f_19  =  (float) i_10;
    f_20  =  f_19 * 6.2831855F;
    f_21  =  f_20 * f_7;
    f_22  =  f_21 / 1024.0F;
    f_23  =  native_sin(f_22);
    f_24  =  native_cos(f_22);
    f_25  =  f_24 * f_18;
    f_26  =  fma(f_23, f_17, f_25);
    f_27  =  f_9 - f_26;
    f_28  =  f_16 * f_23;
    f_29  =  fma(f_24, f_14, f_28);
    f_30  =  f_8 + f_29;
    i_31  =  i_10 + 1;
    f_8  =  f_30;
    f_9  =  f_27;
    i_10  =  i_31;
  }  // B4
  // BLOCK 5
  i_32  =  i_6 + 6;
  i_33  =  i_32 << 2;
  ul_34  =  ul_2 + i_33;
  *((__global float *) ul_34)  =  f_8;
  ul_35  =  ul_3 + i_33;
  *((__global float *) ul_35)  =  f_9;
  i_36  =  i_4 + i_6;
  i_6  =  i_36;
  // BLOCK 6
  return;
}  //  kernel

Task info: s0.t0
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 1
	Global work offset: [0]
	Global work size  : [1024]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [16]

Total time:  441707758 ns 

-6.5101576 vs -1.5097469

Validation:  FAIL 

Expected behavior

I did a diff with the kernel for the same test as generated in the develop branch. The first observation is that there is a casting on the values used to calculate the address (ul_14) that is used for loading the data. See variable l_12, in the following screenshot.

image

I guess, the expected behavior would be to avoid this casting.

Computing system setup (please complete the following information):

  • OS: Ubuntu 23.10
  • OpenCL version: OpenCL 1.2
  • Driver: 2023.16.10.0.17_160000
  • TornadoVM commit id: dec5a87 from branch feat/jdk22_2nd_iteration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working jdk22
Projects
None yet
Development

No branches or pull requests

2 participants