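This example demonstrates a simple element-wise vector multiplication using three separate device-memory-backed tensors A, B, and C, where C = A .* B (each element of C is the product of the corresponding elements of A and B).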
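For orientation, here is a minimal sketch of what an element-wise multiply on device memory can look like. It is not the source of example_1.cu, which is not reproduced in this README. The names elementwise_multiply, multiply_on_device, and CUDA_CHECK are hypothetical, and the use of float elements and 256 threads per block are assumptions made for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel: each thread computes one element of C = A .* B.
__global__ void elementwise_multiply(const float* a, const float* b,
                                     float* c, size_t n) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {  // guard threads past the end of the vectors
        c[i] = a[i] * b[i];
    }
}

// Hypothetical host helper: copies A and B to the device, runs the kernel,
// and copies C back to the host.
std::vector<float> multiply_on_device(const std::vector<float>& a,
                                      const std::vector<float>& b) {
    assert(a.size() == b.size());
    const size_t n = a.size();
    std::vector<float> c(n);
    if (n == 0) return c;
    const size_t bytes = n * sizeof(float);

    // Three separate device allocations, one per tensor.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));
    CUDA_CHECK(cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice));

    // Assumed launch configuration: 256 threads per block, rounded-up grid.
    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((n + threads - 1) / threads);
    elementwise_multiply<<<blocks, threads>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());

    // This copy waits for the kernel to finish before reading the result.
    CUDA_CHECK(cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    return c;
}

int main() {
    const std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};
    const std::vector<float> b = {10.0f, 20.0f, 30.0f, 40.0f};
    const std::vector<float> c = multiply_on_device(a, b);
    for (float v : c) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}
```

The sketch keeps A, B, and C in three separate allocations to mirror the description above. A single file like this can be compiled directly with nvcc, independently of the CMake build described below.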
The run.sh script in the examples folder provides an environment capable of building all the examples. I suggest running from this environment, but any CUDA-enabled environment should work.
Use the following commands to build the example:
mkdir build
cd build
cmake ..
make -j
Run the example with the command ./example_1
You should see output similar to the following (the reported time will vary with your GPU):
I have no name!@72063c2be218:/scratch/projects/gpuStarterResources/examples/example_1/build$ make -j
Consolidate compiler generated dependencies of target example_1
[ 50%] Building CUDA object CMakeFiles/example_1.dir/example_1.cu.o
[100%] Linking CUDA executable example_1
[100%] Built target example_1
I have no name!@72063c2be218:/scratch/projects/gpuStarterResources/examples/example_1/build$ ./example_1
Average elapsed time per iteration is: 3372.13us
to profile with nsys, try the following:
nsys profile -o example_1 ./example_1
to profile with ncu, try the following:
ncu --set=full --import-source=true -c 2 -f -o example_1 ./example_1
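The "Average elapsed time per iteration" figure in the output above could be produced in several ways. What example_1 actually times is not shown in this README, so the following is only a sketch of one common approach: bracketing a loop of kernel launches with CUDA events. The names average_kernel_time_us, elementwise_multiply, and CUDA_CHECK are hypothetical, and the vector length and iteration count are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel: each thread computes one element of C = A .* B.
__global__ void elementwise_multiply(const float* a, const float* b,
                                     float* c, size_t n) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] * b[i];
}

// Returns the average time per kernel launch in microseconds, measured with
// CUDA events over `iterations` launches (iterations must be positive).
float average_kernel_time_us(const float* d_a, const float* d_b, float* d_c,
                             size_t n, int iterations) {
    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((n + threads - 1) / threads);

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time startup costs are not included.
    elementwise_multiply<<<blocks, threads>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaEventRecord(start));
    for (int it = 0; it < iterations; ++it) {
        elementwise_multiply<<<blocks, threads>>>(d_a, d_b, d_c, n);
    }
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));  // wait for all launches to finish
    CUDA_CHECK(cudaGetLastError());

    float elapsed_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&elapsed_ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return elapsed_ms * 1000.0f / static_cast<float>(iterations);
}

int main() {
    const size_t n = 1u << 20;  // assumed vector length
    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));
    // Input values do not matter for timing, so the buffers are just zeroed.
    CUDA_CHECK(cudaMemset(d_a, 0, bytes));
    CUDA_CHECK(cudaMemset(d_b, 0, bytes));

    const float us = average_kernel_time_us(d_a, d_b, d_c, n, 100);
    std::printf("Average elapsed time per iteration is: %.2fus\n", us);

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    return 0;
}
```

Event timing of this kind covers only the work submitted between the two recorded events. If the example's iteration also includes host-to-device copies or allocation, its reported number will be correspondingly larger than a kernel-only measurement.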