CUDA programming exercise
- Naive Matrix Multiplication
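  - For a 32x32 matrix multiplication with single-precision floats, elapsed time on the host is 0.000131 s
  - Elapsed time on the device is 0.000019 s when run with 32x32 threads (one thread per output element), roughly 7 times faster than the host
  - The matrix size is limited by the maximum number of threads per block, which is 1024 (32x32) in this setup with CUDA toolkit 10; the limit is a property of the device's compute capability rather than of the toolkit version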
- Advanced Matrix Multiplication
- Flocking Simulation
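
For the naive matrix multiplication item, here is a minimal sketch of what a single-block, one-thread-per-element kernel might look like. It is not the exercise's actual code. The kernel name `naiveMatMul`, the constant `N`, the fill values, and the use of CUDA events for timing are all assumptions made for illustration. The sketch also queries `maxThreadsPerBlock`, which is the limit that caps the matrix size in this approach.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define N 32  // assumed matrix dimension; N * N must fit in one thread block

// Hypothetical kernel: each thread computes one element of C = A * B
// for row-major N x N matrices, using a single thread block.
__global__ void naiveMatMul(const float *a, const float *b, float *c, int n) {
    int row = threadIdx.y;
    int col = threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}

int main() {
    // The per-block thread limit is a device property, so query it.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    if (N * N > prop.maxThreadsPerBlock) {
        fprintf(stderr, "N = %d does not fit in a single block\n", N);
        return 1;
    }

    const size_t bytes = N * N * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) {  // assumed test data
        hA[i] = 1.0f;
        hB[i] = 2.0f;
    }

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Time only the kernel with CUDA events (an assumed timing method).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(N, N);  // 32 x 32 = 1024 threads in one block
    cudaEventRecord(start);
    naiveMatMul<<<1, block>>>(dA, dB, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // With A filled with 1 and B with 2, every element of C equals 2 * N.
    int ok = 1;
    for (int i = 0; i < N * N; ++i) {
        if (fabsf(hC[i] - 2.0f * N) > 1e-5f) ok = 0;
    }
    printf("kernel time: %f ms, result %s\n", ms, ok ? "correct" : "WRONG");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return ok ? 0 : 1;
}
```

Because the whole matrix is mapped onto one block, raising `N` above 32 makes `N * N` exceed the 1024-thread limit and the launch fails. Handling larger matrices would need a grid of several blocks, with indices computed from `blockIdx` as well as `threadIdx`.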