cuda

CUDA programming exercises

Naive Matrix Multiplication
- For a 32x32 matrix multiplication with float numbers, the elapsed time on the host is 0.000131 s
- The elapsed time on the device is 0.000019 s when run with 32x32 threads
- Because the whole matrix is computed by a single thread block, its size is limited by the maximum number of threads per block, which is 1024 (tested with CUDA toolkit 10)
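
The one-thread-per-element approach described above can be sketched as follows. This is a minimal illustration, not the code in `multiply.cu`: the names `matMulNaive` and `multiplyNaive` are hypothetical, the matrices are assumed to be square and stored row-major, and error handling is kept to a minimum.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// One thread computes one element of C = A * B for square n x n matrices.
// The kernel assumes a single block of n x n threads.
__global__ void matMulNaive(const float* a, const float* b, float* c, int n) {
    int row = threadIdx.y;
    int col = threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        sum += a[row * n + k] * b[k * n + col];
    }
    c[row * n + col] = sum;
}

// Hypothetical host wrapper: copies the inputs to the device, launches one
// n x n block, and copies the result back. Returns the CUDA error status.
cudaError_t multiplyNaive(const float* hA, const float* hB, float* hC, int n) {
    // A single block holds at most 1024 threads, so n cannot exceed 32.
    assert(n > 0 && n * n <= 1024);
    size_t bytes = sizeof(float) * n * n;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    matMulNaive<<<1, dim3(n, n)>>>(dA, dB, dC, n);

    // Report launch errors; the blocking copy also waits for the kernel.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Calling `multiplyNaive(a, b, c, 32)` uses all 1024 threads of the block; any larger `n` trips the assertion, which is the size limit noted in the list above.
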
Advanced Matrix Multiplication
- Split the matrix into tiles, with each tile assigned to a block
- The threads of a block stage their tile in shared memory instead of reading every operand directly from global memory
- For a 1024x1024 matrix with a tile size of 32x32, it takes 14.500724 s on the host and 0.000022 s on the device
- An Nsight profile screenshot is provided in nsight.png
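
The tiling scheme described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the repository's implementation: `matMulTiled`, `multiplyTiled` and the constant `TILE` are hypothetical names, the matrices are square and row-major, and out-of-range tile entries are padded with zeros so that `n` need not be a multiple of the tile size.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge; one block of TILE x TILE threads per tile

// Each block computes one TILE x TILE tile of C, walking along the shared
// dimension one tile at a time and staging the operands in shared memory.
__global__ void matMulTiled(const float* a, const float* b, float* c, int n) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile, padding with zeros.
        tileA[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? a[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? b[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before use

        for (int k = 0; k < TILE; ++k) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();  // finish reading before the next load overwrites
    }
    if (row < n && col < n) {
        c[row * n + col] = sum;
    }
}

// Hypothetical host wrapper around the tiled kernel.
cudaError_t multiplyTiled(const float* hA, const float* hB, float* hC, int n) {
    assert(n > 0);
    size_t bytes = sizeof(float) * n * n;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int tiles = (n + TILE - 1) / TILE;
    matMulTiled<<<dim3(tiles, tiles), dim3(TILE, TILE)>>>(dA, dB, dC, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Kernel launches are asynchronous, so a host-side timer around the launch should be stopped only after `cudaDeviceSynchronize()` returns (or the kernel should be timed with CUDA events); otherwise the measurement covers little more than the launch itself.
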
Flocking Simulation
- Based on the Reynolds Boids algorithm
- With two levels of optimization: a uniform grid, and a uniform grid with semi-coherent memory access
- Results were collected with 1k, 10k and 100k boids (particles)
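
The uniform-grid idea can be sketched as follows: each boid is binned into a grid cell, and the boid indices are sorted by cell so that a neighbor search only has to visit boids in adjacent cells. This is a minimal sketch, not the code in `cuda_flocking`. The names `computeCellIndices` and `sortBoidsByCell` and the parameters `gridMin`, `cellWidth` and `res` are assumptions made for illustration, and the sort uses Thrust. A more coherent variant would presumably also reorder the position and velocity arrays into the sorted order; that step is not shown.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Map each boid position to the linear index of the grid cell containing it.
// The grid is a cube of res x res x res cells starting at gridMin.
__global__ void computeCellIndices(const float3* pos, int* cell, int n,
                                   float3 gridMin, float invCellWidth, int res) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Clamp so that boids outside the grid land in a border cell.
    int cx = min(max((int)((pos[i].x - gridMin.x) * invCellWidth), 0), res - 1);
    int cy = min(max((int)((pos[i].y - gridMin.y) * invCellWidth), 0), res - 1);
    int cz = min(max((int)((pos[i].z - gridMin.z) * invCellWidth), 0), res - 1);
    cell[i] = cx + cy * res + cz * res * res;
}

// Hypothetical host helper: returns the boid indices ordered by grid cell.
thrust::host_vector<int> sortBoidsByCell(const thrust::host_vector<float3>& hPos,
                                         float3 gridMin, float cellWidth, int res) {
    int n = (int)hPos.size();
    assert(n > 0 && cellWidth > 0.0f && res > 0);

    thrust::device_vector<float3> dPos = hPos;
    thrust::device_vector<int> dCell(n);
    thrust::device_vector<int> dBoid(n);
    thrust::sequence(dBoid.begin(), dBoid.end());  // 0, 1, ..., n-1

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    computeCellIndices<<<blocks, threads>>>(
        thrust::raw_pointer_cast(dPos.data()),
        thrust::raw_pointer_cast(dCell.data()),
        n, gridMin, 1.0f / cellWidth, res);

    // After sorting, boids that share a cell are contiguous in dBoid.
    thrust::stable_sort_by_key(dCell.begin(), dCell.end(), dBoid.begin());
    return thrust::host_vector<int>(dBoid);
}
```

With the boids grouped by cell, a per-cell start and end offset into the sorted array is enough to enumerate the candidates for each boid's neighborhood instead of scanning all boids.
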
