Project 1: Jian Ru #5

Open · wants to merge 7 commits into master
46 changes: 41 additions & 5 deletions README.md
@@ -1,10 +1,46 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Jian Ru
* Tested on: Windows 10, i7-4850 @ 2.30GHz 16GB, GT 750M 2GB (Personal)

### (TODO: Your README)
---
### Results

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* Parameters
* Number of particles: 40,000
* Blocks: 40, 1, 1
* Threads: 128, 1, 1
* Rule distances: 5.0, 3.0, 5.0
* Rule scales: 0.01, 0.1, 0.1
* Scene scale: 100.0
* Delta time: 0.2
![result](images/demo1.gif)
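
The parameter values above would typically live as compile-time constants in the simulation source; a minimal sketch (the macro names are assumptions for illustration, not copied from this PR's kernel code):

```cuda
// Illustrative constants matching the parameters listed above
// (names are assumed, not taken from this PR's source).
#define blockSize 128         // threads per block: 128, 1, 1

#define rule1Distance 5.0f    // cohesion radius
#define rule2Distance 3.0f    // separation radius
#define rule3Distance 5.0f    // alignment radius

#define rule1Scale 0.01f
#define rule2Scale 0.1f
#define rule3Scale 0.1f

#define scene_scale 100.0f    // scene scale
#define DT 0.2f               // simulation time step
```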

---
### Analysis

* Simulation Time vs. Number of Particles
* For the brute-force version, the simulation time grows polynomially as the particle count increases. This is expected: although each thread does O(n) work, the number of threads also grows with n, and the GPU cannot run all of them in parallel at once. The total time therefore still grows polynomially, just less steeply than a sequential implementation would (see the kernel sketch after the plot below).
* The scattered and coherent grid versions still show slight super-linear growth, but it is far gentler and looks almost linear. This is expected because each particle has far fewer neighbours to examine in each step. Statistically, the number of neighbours per particle grows linearly with the particle count, but the number of threads increases at the same time, so the overall cost should be a little more expensive than O(n).
![sp](images/st_pc.png)
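
A minimal sketch of the brute-force neighbour loop discussed above, assuming the usual one-thread-per-boid structure (the kernel name, parameters, and glm types are illustrative, not copied from this PR):

```cuda
#include <glm/glm.hpp>

// Each thread handles one boid and scans all N others, so one simulation step
// costs O(n) work per thread and O(n^2) work in total.
__global__ void kernUpdateVelocityBruteForce(int N, const glm::vec3 *pos,
                                             const glm::vec3 *vel1, glm::vec3 *vel2) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= N) {
    return;
  }
  glm::vec3 change(0.0f);
  for (int j = 0; j < N; ++j) {  // the O(n) scan described above
    if (j == index) {
      continue;
    }
    // accumulate cohesion, separation, and alignment contributions from pos[j] and vel1[j]
  }
  vel2[index] = vel1[index] + change;  // write into a second buffer to avoid read/write races
}
```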

* Simulation Time vs. Block Size
* The relationship between simulation time and block size looks somewhat noisy, but the overall behavior is expected. Since the GPU is guaranteed to execute each block on a single SM, packing more threads that access the same memory region with a similar access pattern into one block should improve performance through a higher cache hit rate. However, putting too many threads in a single block can hurt performance if an SM cannot execute all of the block's threads at once. The launch configuration used for these measurements is sketched after the plot below.
![sb](images/bs_st.png)
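
A sketch of how the block size feeds the kernel launch, reusing the brute-force kernel sketched earlier (the host function and variable names are assumptions): the grid is sized so that every particle gets exactly one thread regardless of the block size chosen.

```cuda
void stepSimulationBruteForce(int N, glm::vec3 *dev_pos,
                              glm::vec3 *dev_vel1, glm::vec3 *dev_vel2) {
  const int blockSize = 128;                                // block-size value being tested
  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);  // ceil(N / blockSize) blocks
  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos, dev_vel1, dev_vel2);
  cudaDeviceSynchronize();  // only needed so timing captures the kernel, not just the launch
}
```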

* Coherent Grid vs. Scattered Grid
* In my experiments, the coherent grid performs better than the scattered grid. This is expected: even though reordering the position and velocity arrays has a cost, the gain from the increased cache hit rate outweighs the cost of the copy and the extra kernel launches. Adjacent threads tend to share neighbouring cells, so they access the same memory regions as they execute. Even a single thread benefits from a higher cache hit rate, because after sorting, the data for all particles in the same cell sit in one contiguous memory region. A sketch of the reordering kernel follows.
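
A sketch of the reordering kernel that produces the coherent layout described above (the kernel and buffer names are assumptions): after the particle indices are sorted by grid cell, the position and velocity data are copied into cell order so that threads walking one cell read a single contiguous memory region.

```cuda
__global__ void kernReshuffleData(int N, const int *particleArrayIndices,
                                  const glm::vec3 *pos, const glm::vec3 *vel,
                                  glm::vec3 *posSorted, glm::vec3 *velSorted) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= N) {
    return;
  }
  int src = particleArrayIndices[index];  // where this particle's data lived before sorting
  posSorted[index] = pos[src];
  velSorted[index] = vel[src];
}
```
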
Binary file added images/bs_st.png
Binary file added images/demo1.gif
Binary file added images/st_pc.png
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
Expand Up @@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)