
[Question] How does caching work in CUDA? #262

Closed · BA8F0D39 opened this issue Nov 14, 2020 · 14 comments

@BA8F0D39 (Contributor)

How does caching work for a simple kernel such as adding two vectors? This is with the arrayfire-rust 3.7.2 CUDA backend.

let dims = arrayfire::Dim4::new(&[4, 4, 1, 1]);
let a = arrayfire::randu::<f32>(dims);
let mut b = arrayfire::randu::<f32>(dims);

let mut c = a.clone();
loop {
    b = b + 0.02f32;
    c = arrayfire::add(&b, &a, false);
}

Running the code generates 100 cubins in ~/.arrayfire/.

Why does ArrayFire generate so many different kernels just for adding two vectors? And why does the code run much slower during the first 100 iterations?

@9prady9 (Member) commented Nov 14, 2020

Can you please tell me how many kernels you are seeing? Try export AF_TRACE=jit and run the Rust example; that should print out the kernels being generated.

Apart from JIT kernels, most functions' kernels are compiled and cached on the first run. It shouldn't take 100 iterations to run at full speed; a single iteration at the beginning is all it needs to compile and cache the functions/JIT code. If you are noticing such a slowdown, it must be something else.

I will run the code you shared and check. It should ideally generate a single kernel, since both statements in the while loop are JITed, and eval should be triggered by memory-pressure heuristics or when explicitly called by the user.
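
For reference, the trace can be enabled from a shell before launching the example like so (the cargo invocation is illustrative; any way of running the binary works):

    export AF_TRACE=jit
    cargo run --release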

@BA8F0D39 (Contributor, Author) commented Nov 14, 2020

I figured it out: the print function (arrayfire::print_gen) generates 100 different kernels. After generating 100 kernels, it stops generating more and runs faster.

    let dims = arrayfire::Dim4::new(&[4, 4, 1, 1]);
    let a = arrayfire::randu::<f32>(dims);
    let mut b = arrayfire::randu::<f32>(dims);

    let mut c = a.clone();
    loop {
        b = b + 0.02f32;
        c = arrayfire::add(&b, &a, false);
        arrayfire::print_gen("c".to_string(), &c, Some(6));
    }


[unified][1605388051][006181] [ ../src/api/unified/symbol_manager.cpp:141 ] Attempting: Default System Paths
[unified][1605388051][006181] [ ../src/api/unified/symbol_manager.cpp:144 ] Found: libafcpu.so.3
[unified][1605388051][006181] [ ../src/api/unified/symbol_manager.cpp:151 ] Device Count: 1.
[unified][1605388051][006181] [ ../src/api/unified/symbol_manager.cpp:141 ] Attempting: Default System Paths
[unified][1605388051][006181] [ ../src/api/unified/symbol_manager.cpp:144 ] Found: libafopencl.so.3
[platform][1605388052][006181] [ ../src/backend/common/DependencyModule.cpp:99 ] Attempting to load: libforge.so
[platform][1605388052][006181] [ ../src/backend/common/DependencyModule.cpp:102 ] Found: libforge.so
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:218 ] Found 2 OpenCL platforms
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:230 ] Found 1 devices on platform NVIDIA CUDA
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:235 ] Found device GeForce GTX 1060 3GB on platform NVIDIA CUDA
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:230 ] Found 1 devices on platform Intel(R) CPU Runtime for OpenCL(TM) Applications
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:235 ] Found device Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz on platform Intel(R) CPU Runtime for OpenCL(TM) Applications
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:240 ] Found 2 OpenCL devices
[platform][1605388052][006181] [ ../src/backend/opencl/device_manager.cpp:335 ] Default device: 0
[unified][1605388052][006181] [ ../src/api/unified/symbol_manager.cpp:151 ] Device Count: 2.
[unified][1605388052][006181] [ ../src/api/unified/symbol_manager.cpp:141 ] Attempting: Default System Paths
[unified][1605388052][006181] [ ../src/api/unified/symbol_manager.cpp:144 ] Found: libafcuda.so.3
[unified][1605388052][006181] [ ../src/api/unified/symbol_manager.cpp:151 ] Device Count: 1.
[unified][1605388052][006181] [ ../src/api/unified/symbol_manager.cpp:206 ] AF_DEFAULT_BACKEND: cuda
[platform][1605388052][006181] [ ../src/backend/common/DependencyModule.cpp:99 ] Attempting to load: libforge.so
[platform][1605388052][006181] [ ../src/backend/common/DependencyModule.cpp:102 ] Found: libforge.so
[platform][1605388052][006181] [ ../src/backend/cuda/device_manager.cpp:428 ] CUDA Driver supports up to CUDA 11.1 ArrayFire CUDA Runtime 10.0
[platform][1605388052][006181] [ ../src/backend/cuda/device_manager.cpp:495 ] Found 1 CUDA devices
[platform][1605388052][006181] [ ../src/backend/cuda/device_manager.cpp:521 ] Found device: GeForce GTX 1060 3GB (2.95 GB | ~3754.03 GFLOPs | 9 SMs)
[platform][1605388052][006181] [ ../src/backend/cuda/device_manager.cpp:556 ] AF_CUDA_DEFAULT_DEVICE: 
[platform][1605388052][006181] [ ../src/backend/cuda/device_manager.cpp:575 ] Default device: 0(GeForce GTX 1060 3GB)
ArrayFire v3.7.2 (CUDA, 64-bit Linux, build 218dd2c)
Platform: CUDA Runtime 10.0, Driver: 455.32.00
[0] GeForce GTX 1060 3GB, 3019 MB, CUDA Compute 6.1
Info String:
ArrayFire v3.7.2 (CUDA, 64-bit Linux, build 218dd2c)
Platform: CUDA Runtime 10.0, Driver: 455.32.00
[0] GeForce GTX 1060 3GB, 3019 MB, CUDA Compute 6.1
Arrayfire version: (3, 7, 2)
Name: GeForce_GTX_1060_3GB
Platform: CUDA
Toolkit: v10.0
Compute: 6.1
Revision: 218dd2c
[mem][1605388052][006181] [ ../src/backend/cuda/memory.cpp:158 ] nativeAlloc:    1 KB 0x7f3bfc800000
[mem][1605388052][006181] [ ../src/backend/cuda/memory.cpp:158 ] nativeAlloc:    1 KB 0x7f3bfc800400
c
[4 4 1 1]
[mem][1605388052][006181] [ ../src/backend/cuda/memory.cpp:158 ] nativeAlloc:    1 KB 0x7f3bfc800800
[mem][1605388052][006181] [ ../src/backend/cuda/memory.cpp:158 ] nativeAlloc:    1 KB 0x7f3bfc800c00
[jit][1605388052][006181] [ ../src/backend/cuda/compile_module.cpp:430 ] {8849204984755236620            : loaded from /home/usertest/.arrayfire/KER8849204984755236620_CU_61_AF_37.cubin for GeForce GTX 1060 3GB }
[jit][1605388052][006181] [ ../src/backend/cuda/compile_module.cpp:430 ] {11230367656483596278           : loaded from /home/usertest/.arrayfire/KER11230367656483596278_CU_61_AF_37.cubin for GeForce GTX 1060 3GB }
    1.037432     0.760924     1.560485     0.665100 
    0.629204     1.180188     1.327124     1.659123 
    1.896739     0.829636     0.231963     1.501766 
    0.603847     0.917672     0.968704     0.750666 

c
[4 4 1 1]
[jit][1605388052][006181] [ ../src/backend/cuda/compile_module.cpp:430 ] {13980104363090480971           : loaded from /home/usertest/.arrayfire/KER13980104363090480971_CU_61_AF_37.cubin for GeForce GTX 1060 3GB }
    1.057432     0.780924     1.580485     0.685100 
    0.649204     1.200188     1.347124     1.679123 
    1.916739     0.849636     0.251963     1.521766 
    0.623847     0.937672     0.988704     0.770666 

c
[4 4 1 1]
[jit][1605388052][006181] [ ../src/backend/cuda/compile_module.cpp:430 ] {4009459142369369917            : loaded from /home/usertest/.arrayfire/KER4009459142369369917_CU_61_AF_37.cubin for GeForce GTX 1060 3GB }
    1.077432     0.800924     1.600485     0.705100 
    0.669204     1.220188     1.367124     1.699123 
    1.936739     0.869636     0.271963     1.541766 
    0.643847     0.957672     1.008704     0.790666 

[... trace truncated: every subsequent iteration prints c and loads a different KER*_CU_61_AF_37.cubin from /home/usertest/.arrayfire/ for GeForce GTX 1060 3GB ...]
@BA8F0D39 (Contributor, Author) commented:

Wait a minute. Why does the print function generate kernels?

@9prady9 (Member) commented Nov 16, 2020

I am able to reproduce the behavior, although I don't think your conclusions are entirely correct: the print functionality is not a kernel. The kernel activity only appears when you call print because ArrayFire functions are asynchronous by default; until print is called or an eval is auto-triggered, the JIT nodes are not evaluated, and thus no kernel is looked up for actual execution.

As far as the trace log goes, every message from the first one onward says it loaded an already-cached kernel, so it is not compiling anything; most likely because it did so once in the past on the system you are running this program on.

I am looking into it and will update my findings here as soon as I can. Thanks for sharing the trace output of the program.

9prady9 added Bug and removed Question labels, Nov 16, 2020
@BA8F0D39 (Contributor, Author) commented Nov 16, 2020

GPU-to-CPU data transfer also generates 100 different kernels.

Maybe it is the host() function that the cache lookup fails on?

Also, why does a data transfer generate CUDA kernels?


    let dims = arrayfire::Dim4::new(&[4, 1, 1, 1]);
    let a = arrayfire::randu::<f32>(dims);
    let mut b = arrayfire::randu::<f32>(dims);

    let mut c = a.clone();
    let mut c_cpu: [f32; 4] = [0.0, -4.1, 1.7, -0.9];
    loop {
        b = b + 0.02f32;
        c = arrayfire::add(&b, &a, false);
        c.host(&mut c_cpu);
    }

@9prady9 (Member) commented Nov 17, 2020

As I pointed out earlier, it is not host/print that is generating any kernels; they trigger the JIT evaluations, which just makes it appear as though the host/print call is doing it. You can call c.eval() instead of c.host() or print(&c) and you will see similar output.

My guess is that each iteration is somehow triggering a separate JIT evaluation although it shouldn't; perhaps the print, host, or eval calls themselves are causing it. Ideally, even if they trigger a load from disk, that shouldn't happen on every iteration. I can only say for certain after looking into it. I shall update my findings here as soon as I can.

Edited:

Nevertheless, I think avoiding sync/eval/host kinds of calls inside such a loop should definitely avoid triggering JIT evaluation. You can add an eval call after the loop, which ensures the JIT evaluates only once for this logic. But if you have to fetch results to the host every few iterations inside the loop, you can wrap the host call in that condition so that the JIT evaluates only when the condition is met, as sketched below.
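
A minimal sketch of that conditional fetch in Rust, assuming a 4x4 array and an illustrative interval of 1000 iterations:

    let dims = arrayfire::Dim4::new(&[4, 4, 1, 1]);
    let a = arrayfire::randu::<f32>(dims);
    let mut b = arrayfire::randu::<f32>(dims);
    let mut c = a.clone();
    let mut c_cpu = [0.0f32; 16];

    for i in 0..100_000 {
        b = b + 0.02f32;
        c = arrayfire::add(&b, &a, false);
        // Fetch to host only occasionally; the JIT evaluates only on these passes.
        if i % 1000 == 0 {
            c.host(&mut c_cpu);
        }
    }
    c.eval(); // one final evaluation after the loop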

@BA8F0D39 (Contributor, Author) commented:

I am using the host() function inside the while loop to dump data from the GPU's RAM to my SSD, because the data can't fit in the GPU's RAM. It would be nice if the saveArray function were implemented in Rust to write data to the filesystem.

@9prady9 (Member) commented Nov 18, 2020

@BA8F0D39 I have raised a feature request to follow the progress of the disk-saving API - #263

Although those functions aren't available yet, I believe you can use the serde feature I added recently to serialize and deserialize arrays. It is not in the current stable release (crate) yet, but you can use it from the master branch of the GitHub repository directly.
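
For that serde route, a rough sketch (hypothetical: check the exact feature name and derive support against the crate docs on master; bincode is just one choice of format):

    // Assumes the serde feature derives Serialize/Deserialize for Array
    // and that bincode is added as a dependency.
    let bytes = bincode::serialize(&c).expect("serialize failed");
    std::fs::write("c.bin", &bytes).expect("write failed");

    let restored: arrayfire::Array<f32> =
        bincode::deserialize(&std::fs::read("c.bin").expect("read failed"))
            .expect("deserialize failed");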

I was able to reproduce the behavior of more than one kernel being generated upstream. I don't have any updates as of now.

If the size of your data is the concern, then you should use some kind of condition to dump the data to disk rather than doing it on every iteration, which is very inefficient. When the host call is wrapped in such a condition, kernels aren't evaluated in each iteration.

9prady9 added Upstream and removed Bug labels, Jan 5, 2021
@9prady9 (Member) commented Jan 5, 2021

@BA8F0D39 Sorry about the delay. I have figured out why so many kernels are being generated. It is not a bug per se, but a side effect of how our JIT workflow currently works.

Let's take the following code (please see the comments for details):

        dim4 dims(4, 4);
        const array a = randu(dims);
        array b = randu(dims);
        array c = a;
        af::sync(); // added only to draw a clear boundary between JIT work
                    // before the loop and inside it; otherwise it does nothing

        for (int i = 0; i < 10; i++) {
            b = b + 0.022f; // a JIT operation: JIT_NODE = JIT_NODE + SCALAR_NODE
            b.eval();       // because b is eval'ed, its growing tree collapses to a buffer node
            c = b + a;      // a single JIT node that adds two buffer nodes
            af_print(c);
        }

Here's what happens from, let's say, iteration N to iteration N+1, and so on.

Iteration N

b ---> |---> 0.022
       |---> b ---> |---> 0.022
                    |---> b ---> ...

Iteration N+1

b ---> |---> 0.022
       |---> b ---> |---> 0.022
                    |---> b ---> |---> 0.022
                                 |---> b ---> ...

Now, if I remove the eval on b, c also becomes an iterative JIT tree that builds upon the previous iteration, so each iteration effectively produces a different JIT kernel.

        for (int i = 0; i < 10; i++) {
            b = b + 0.022f; // still: JIT_NODE = JIT_NODE + SCALAR_NODE
            c = b + a;      // now this also becomes an iterative JIT_NODE = JIT_NODE + JIT_NODE
            af_print(c);
        }

The number 100 is just due to our implementation: we limit the JIT tree depth to a maximum of 100, at which point an eval is auto-triggered.

We will discuss internally whether this JIT workflow can be further improved. But rest assured: if eval or af_print is not called inside the loop, this JIT workflow has little performance impact. As I said earlier, we are constantly looking to improve JIT performance, and we will look into this behavior as well.
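
For arrayfire-rust users, the eval-inside-the-loop pattern looks roughly like this (a sketch mirroring the C++ above):

    let dims = arrayfire::Dim4::new(&[4, 4, 1, 1]);
    let a = arrayfire::randu::<f32>(dims);
    let mut b = arrayfire::randu::<f32>(dims);

    for _ in 0..10 {
        b = b + 0.022f32;
        b.eval(); // collapse b's growing JIT tree into a buffer node
        let c = arrayfire::add(&b, &a, false); // a single JIT node over two buffers
        arrayfire::print_gen("c".to_string(), &c, Some(6));
    }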

Thank you for using ArrayFire, and Happy New Year :)

@9prady9 (Member) commented Jan 5, 2021

@BA8F0D39 I will try moving this to GitHub Discussions, since it is not a bug in code, neither in the wrapper nor upstream.

Update: Apparently, this can't be done yet due to community/community#2924 (comment)

@BA8F0D39 (Contributor, Author) commented Jan 8, 2021

@9prady9

Thanks for the hard work.

This is for arrayfire 3.7.3.

In my debug code, I use print_gen to view the values of the matrices, and I don't really care about the performance of the loop, so generating many kernels is fine for debugging.

In my release code, I need to dump matrices from GPU to CPU, and the host function will call eval on the matrices. Is it possible to do an asynchronous eval, so that it doesn't block the JIT and the eval can happen in the background?

You said that the JIT depth is set to 100, but in my code the JIT can generate more than 100,000 kernels for a single matrix operation. Is the detection of previously generated kernels failing?

@9prady9 (Member) commented Jan 11, 2021

@BA8F0D39

You said that the JIT depth is set to 100, but in my code the JIT can generate more than 100,000 kernels for a single matrix operation. Is the detection of previously generated kernels failing?

Maybe I wasn't clear, and that added to the confusion. When I say JIT depth, I mean the height of the JIT tree; you may think of it like a code AST. This tree's height/depth is limited to 100. When the height goes beyond the limit, the corresponding Array gets evaluated automatically, generating one or more kernels depending on the code you have written. For example, if all the lines in a given section of code are JIT operations, only a single kernel is generated, not one kernel per operation. Hope this clears the confusion. If there are 100k kernels cached on your system, they could be from any of the following: 1) stale caches invalidated by previous versions; 2) regular functions (non-JIT operations), which also cache their kernels.

In my release code, I need to dump matrices from GPU to CPU, and the host function will call eval on the matrices. Is it possible to do an asynchronous eval, so that it doesn't block the JIT and the eval can happen in the background?

From what I gathered, your use case actually requires dumping matrices to disk on every iteration. In that case, I think you need a different mechanism in your application for such matrix dumping. Let me explain why.

The main purpose of eval is to ensure that a JITed Array is ready for a subsequent non-JITed function call that expects a buffer/memory. Note that this doesn't mean the thread calling eval is blocked; af::sync() is different from eval(). eval() just translates the JIT tree (dynamically generated from user code) into a kernel and launches that kernel asynchronously. Most users don't need to call eval explicitly, because it is done automatically based on heuristics that include memory pressure. Simply put, eval() is not a blocking call.

array::host(), on the other hand, is a blocking call, because there is no implicit way for ArrayFire to know how long the host pointer will remain valid. Hence my earlier statement that you have to use a different technique to dump data asynchronously.
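
To make that distinction concrete in Rust terms (a sketch; c is assumed to be a 4x4 Array<f32> as in the earlier snippets, on device 0):

    let mut host_buf = [0.0f32; 16];
    c.eval();              // asynchronous: launches the fused JIT kernel and returns
    arrayfire::sync(0);    // blocking: waits for all queued work on device 0
    c.host(&mut host_buf); // blocking: waits for c to be ready, then copies device -> host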

I think you could maintain a separate queue handled by a different thread: whenever data from the main thread is ready, push the corresponding array onto the queue, and let the queue-handler thread copy the data to the host and dump it to disk. I personally would avoid such a queue, because one then has to handle issues arising from differing data production and consumption rates. It's not hard, just some extra logic; see the sketch below.
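
One way to realize that queue without moving ArrayFire arrays across threads (a sketch: the device-to-host copy stays on the main thread and only plain f32 buffers are queued; the file name and raw little-endian format are illustrative):

    use std::io::Write;
    use std::sync::mpsc;
    use std::thread;

    let (tx, rx) = mpsc::channel::<Vec<f32>>();

    // Writer thread: consumes host buffers and appends them to a file.
    let writer = thread::spawn(move || {
        let mut file = std::fs::File::create("dump.bin").expect("create failed");
        for buf in rx {
            for v in &buf {
                file.write_all(&v.to_le_bytes()).expect("write failed");
            }
        }
    });

    // In the compute loop, after each result is ready:
    //     let mut host_buf = vec![0.0f32; 16];
    //     c.host(&mut host_buf);
    //     tx.send(host_buf).unwrap();
    // The disk I/O then never blocks the compute thread.

    drop(tx); // close the channel so the writer thread exits
    writer.join().unwrap();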

Another approach is to use ArrayFire events. You can mark an event after the target operation, then move that event to the other thread along with the source array. In the other thread, you can block on the event until the required data is ready. This avoids any queue, but there may be some extra performance cost from using events. Which approach fares better needs to be tested.

@BA8F0D39 (Contributor, Author) commented:

@9prady9
How does the JIT know which sections of code to turn into a single kernel?

Is it possible to force the JIT to generate a single kernel for each Rust function?

Is it possible to mark sections of the code so that the JIT generates a single kernel per section, like a jit! macro?

jit!{
    c = arrayfire::add(&b, &a, false);
    e = arrayfire::mul(&d, &c, false);
    q = arrayfire::add(&v, &e, false);
}

jit!{
    z = arrayfire::div(&q, &s, false);
    w = arrayfire::add(&x, &z, false);
}

Does the JIT generate a new kernel when a branch statement is encountered?

@9prady9 (Member) commented Jan 13, 2021

How does the JIT know which sections of code to turn into a single kernel?

An illustrative example (not C++ code, just an algorithm):

1. a = b + c; // JIT operation
2. d = sin(b); // JIT operation
3. e = a && d; // JIT operation
4. v = erode(e, ... ); // not JIT, a handwritten kernel
5. u = 1 - v; // JIT operation

Lines 1, 2 & 3 are all combined into a single kernel; the output of that kernel is then fed into the erode function. Line 5 is again an arithmetic operation, which is JIT; it creates a separate kernel.

Basically, most domain-specific functions (computer vision, image processing, statistics, ML, signal processing, linear algebra) are not JIT. Such functions are essentially asynchronous barriers: they won't block the thread, but they cause any JITed inputs of the function in question to be evaluated automatically, so that the relevant buffer pointers are ready for the function to operate on.
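
In arrayfire-rust terms, that example could look like the following (a sketch; the erode call and its mask are illustrative stand-ins for any non-JIT function):

    let dims = arrayfire::Dim4::new(&[8, 8, 1, 1]);
    let b = arrayfire::randu::<f32>(dims);
    let c = arrayfire::randu::<f32>(dims);
    let mask = arrayfire::constant(1.0f32, arrayfire::Dim4::new(&[3, 3, 1, 1]));

    let a = arrayfire::add(&b, &c, false); // JIT
    let d = arrayfire::sin(&b);            // JIT
    let e = arrayfire::mul(&a, &d, false); // JIT: a, d, e fuse into one kernel
    let v = arrayfire::erode(&e, &mask);   // not JIT: forces evaluation of e first
    let u = arrayfire::sin(&v);            // JIT again: begins a second kernel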

Is it possible to force the JIT to generate a single kernel for each Rust function?

Calling the method Array::eval() will do it, but that is not advised, as it would defeat the asynchronous nature of the ArrayFire API.

Is it possible to mark sections of the code so that the JIT generates a single kernel per section, like a jit! macro?

There is no such jit macro, but something similar can be done with existing functionality:

let c = arrayfire::add(&b, &a, false);
let e = arrayfire::mul(&d, &c, false);
let q = arrayfire::add(&v, &e, false);
q.eval();

Does the JIT generate a new kernel when a branch statement is encountered?

Not sure I understand the question. What kind of branch statement are you asking about? Vectorized operations don't usually have any branch instructions.

BA8F0D39 closed this as completed Feb 3, 2021