Late submission #26

Open · wants to merge 33 commits into base: master
Commits (33)
b59a72c · init · Sep 17, 2016
58aeff5 · working · Sep 17, 2016
8d5de44 · added debug · Sep 17, 2016
3f6b0cd · trying to figure out why particleArrayIndices mapping throws things o… · Sep 17, 2016
79edb25 · working (I think?) · Sep 17, 2016
f8498fe · assessment: it's working but slow with all grids being searched. · Sep 17, 2016
c1c5760 · added if (particleArrayIndices)... · Sep 17, 2016
e134801 · pulled out the selfIndex crap · Sep 18, 2016
2ae2a71 · gave up on searchAbout and aggregate refactor · Sep 18, 2016
1e9beea · change signature on ...inGrids · Sep 18, 2016
04a3b75 · maybe works??? But very slow · Sep 18, 2016
84d8e4e · caught out of bounds offset bug · Sep 18, 2016
7ecbbe1 · got 2 working · Sep 19, 2016
48f9148 · got 2 working, really · Sep 19, 2016
71f184a · init · Sep 19, 2016
3ca7fdc · working · Sep 19, 2016
e41c7cb · still debugging · Sep 19, 2016
65bdf30 · working, but performance is an issue · Sep 19, 2016
c5ccdc2 · working, but performance is an issue 2 · Sep 19, 2016
8822a6d · everything is sort of working except coherent. · Sep 20, 2016
89ccee8 · init · Sep 20, 2016
f525b21 · donegit checkout -b workinggit checkout -b working! · Sep 20, 2016
e05dec3 · add gifs · Sep 20, 2016
32d64ef · add gifs · Sep 20, 2016
6450e02 · re-recorded gif · Sep 20, 2016
1de47a8 · re-recorded gif · Sep 20, 2016
9342882 · Added GIF to read · ethanabrooks · Sep 20, 2016
4255704 · Started README · ethanabrooks · Sep 27, 2016
9d00fa1 · First draft README. · ethanabrooks · Sep 27, 2016
f205ba9 · added performance · Sep 27, 2016
a31445e · added images · Sep 27, 2016
7ea3eae · Finish README · ethanabrooks · Sep 27, 2016
e88a6ab · Now done · ethanabrooks · Sep 27, 2016
Files changed
Binary file added Performance.pdf
Binary file added Performance.xlsx
Binary file added Performance_Page_1.png
Binary file added Performance_Page_2.png
48 changes: 45 additions & 3 deletions README.md
@@ -1,10 +1,52 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

![Running with all possible optimizations](https://github.com/lobachevzky/Project1-CUDA-Flocking/blob/working/project1.gif)

* Ethan Brooks
* Tested on: Windows 7, Intel(R) Xeon(R), GeForce GTX 1070 8GB (SIG Lab)

# Flocking Simulation

## Summary

This project simulates flocking behavior by graphically depicting "boids" (colored points in a 3D environment) that obey three rules:

1. Boids try to fly towards the center of mass of neighboring boids.
2. Boids try to keep a small distance away from other objects (including other boids).
3. Boids try to match velocity with nearby boids.

These rules cause groups of boids to coalesce into flocks, with all the boids in a flock flying in parallel, close to one another, at similar velocities.

## Optimizations

This project includes three implementations:

1. A naive one that compares each boid to every other boid.
2. One that only compares boids within neighborhoods.
3. A further optimization of 2 that minimizes memory access.

### Implementation 1
Though this implementation does utilize the GPU, launching one kernel thread per boid so that all boids are updated in parallel, each of these threads compares its assigned boid with every other boid in the simulation. Thus every thread performs O(n) work, where n is the total number of boids.
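
As a rough sketch, the naive kernel might look like the following; the kernel and buffer names here are illustrative, not necessarily the project's actual identifiers:

```cuda
// Minimal sketch of the naive approach: one thread per boid, each
// scanning all N boids. All names (pos, vel1, vel2) are hypothetical.
__global__ void kernUpdateVelocityBruteForce(int N, const float3 *pos,
                                             const float3 *vel1,
                                             float3 *vel2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;

  float3 v = vel1[i];
  for (int j = 0; j < N; ++j) {   // O(N) work in every thread
    if (j == i) continue;
    // accumulate the three rule contributions (cohesion, separation,
    // alignment) for boids within each rule's distance here
  }
  vel2[i] = v;  // write to a second buffer to avoid read/write races
}
```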

### Implementation 2
Since all three flocking rules only apply to boids within a certain proximity, it is possible to cut down the number of comparisons by discretizing the 3D space into cells and only comparing boids within the same or adjacent cells. Our implementation maps boids to cells based on their (x, y, z) positions, so given a boid and its position, we can easily compute the id of the cell in which it resides. We then consider that cell together with its neighbors in all three directions, a 3x3x3 block of 27 cells in total.
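
For illustration, the position-to-cell mapping might be computed as below; the grid parameters (`gridMin`, `cellWidth`, `res`) are assumptions made for the sketch:

```cuda
// Hypothetical helpers: map a boid position to a flattened cell id.
__device__ int gridIndex3Dto1D(int x, int y, int z, int res) {
  // flatten 3D cell coordinates into a single array index
  return x + y * res + z * res * res;
}

__device__ int cellIndexOf(float3 p, float3 gridMin, float cellWidth,
                           int res) {
  // shift into grid space, then divide by cell width to get coordinates
  int x = (int)((p.x - gridMin.x) / cellWidth);
  int y = (int)((p.y - gridMin.y) / cellWidth);
  int z = (int)((p.z - gridMin.z) / cellWidth);
  return gridIndex3Dto1D(x, y, z, res);
}
```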

The next task is to identify the boids within each of these 27 cells. A naive approach would scan all boids in the simulation and keep those that fall within the 27 cells, but this would defeat the purpose, since we would be back to the linear scan of Implementation 1. To avoid this, we build a second buffer of indices into the boid arrays, sorted by cell, and map each cell to a contiguous range within this buffer. Given a cell, the map tells us which range of the second buffer to search, and following the indices in that range gives us exactly the boids inside the cell.
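
A sketch of how that second buffer and the per-cell ranges could be built, assuming a Thrust key-value sort; the buffer names are hypothetical:

```cuda
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort boid indices by cell id so each cell's boids form one contiguous
// run in dev_boidIndices.
void sortBoidsByCell(int N, int *dev_cellIndices, int *dev_boidIndices) {
  thrust::device_ptr<int> keys(dev_cellIndices);
  thrust::device_ptr<int> vals(dev_boidIndices);
  thrust::sort_by_key(keys, keys + N, vals);
}

// One thread per boid: a cell's range starts where the sorted cell id
// differs from its left neighbor and ends where it differs on the right.
__global__ void kernIdentifyCellStartEnd(int N, const int *cellIndices,
                                         int *cellStart, int *cellEnd) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  int cell = cellIndices[i];
  if (i == 0 || cell != cellIndices[i - 1]) cellStart[cell] = i;
  if (i == N - 1 || cell != cellIndices[i + 1]) cellEnd[cell] = i;
}
```

Given a cell id `c`, a search kernel can then visit exactly the boids in that cell by looping over `boidIndices[cellStart[c]]` through `boidIndices[cellEnd[c]]`.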

Our time complexity for each kernel invocation is still O(n), but now n is the number of boids in the neighboring cells rather than in the entire simulation, a small fraction of the total. In practice, as the chart below shows, the number of boids in neighboring cells remains relatively constant, and execution time increases only slightly as the total number of boids grows.

### Implementation 3
The previous approach has one major weakness: each cell is mapped to an array of _indices_ into the boid buffers. When searching a cell we therefore have to follow an index for each boid, and since a given boid is typically in the vicinity of several others, the same scattered read is repeated many times. On the GPU, such memory access is slow. To minimize it, we instead _sort the boids themselves_; that is, we sort the positions and velocities by cell. We still build a second buffer sorted by cell index, but this time we use it to rearrange the actual position and velocity buffers.
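
The rearrangement itself can be a single gather pass, sketched below with hypothetical names:

```cuda
// Gather positions and velocities into cell-sorted order so that one
// cell's boids are contiguous in memory. This single extra indirection
// per boid replaces the per-neighbor indirections of Implementation 2.
__global__ void kernReshuffleCoherent(int N, const int *sortedBoidIndices,
                                      const float3 *pos, const float3 *vel,
                                      float3 *posCoherent,
                                      float3 *velCoherent) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  int src = sortedBoidIndices[i];  // where boid i's data lived originally
  posCoherent[i] = pos[src];
  velCoherent[i] = vel[src];
}
```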

This implementation does require us to re-sort the boids every time step, since boids change position and do not necessarily stay in the same cell. However, sorting is cheap on the GPU, where the work is spread across thousands of parallel threads, and we perform this step only once per timestep, whereas the indirect memory accesses of Implementation 2 happened once per neighbor lookup.

Another advantage of this approach is that it places adjacent boids close to each other in memory. In Implementation 2, the indices we followed could point anywhere in the position and velocity buffers, and therefore to disparate locations in memory. In contrast, this implementation stores the positions and velocities for a given cell next to each other, which speeds up memory access through locality of reference: reads near recently accessed addresses are far more likely to be cached.

The memory improvements are evident in the following graph:
![A comparison of performance across implementations](https://github.com/lobachevzky/Project1-CUDA-Flocking/blob/master/Performance_Page_1.png)
In this graph, time per frame is averaged across 1000 frames.

### Optimization across block sizes
Finally, we experimented with different block sizes. Each experiment was run with 2^15 boids. The results are shown below:
![A comparison of performance across block sizes](https://github.com/lobachevzky/Project1-CUDA-Flocking/blob/master/Performance_Page_2.png)
Binary file added demo-gif.gif
Binary file added project1.gif
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

 cuda_add_library(src
   ${SOURCE_FILES}
-  OPTIONS -arch=sm_20
+  OPTIONS -arch=sm_52
 )