Project 1: Jian Ru #5

Open · wants to merge 7 commits into master
46 changes: 41 additions & 5 deletions README.md
@@ -1,10 +1,46 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Jian Ru
* Tested on: Windows 10, i7-4850 @ 2.30GHz 16GB, GT 750M 2GB (Personal)

### (TODO: Your README)
---
### Results

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* Parameters
* Number of particles: 40,000
* Blocks: 40, 1, 1
* Threads: 128, 1, 1
* Rule distances: 5.0, 3.0, 5.0
* Rule scales: 0.01, 0.1, 0.1
* Scene scale: 100.0
* Delta time: 0.2
![result](images/demo1.gif)
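
The parameter values above would typically live as compile-time constants in the simulation source; a minimal sketch (the macro names are assumptions for illustration, not copied from this PR's kernel code):

```cuda
// Illustrative constants matching the parameters listed above
// (names are assumed, not taken from this PR's source).
#define blockSize 128         // threads per block: 128, 1, 1

#define rule1Distance 5.0f    // cohesion radius
#define rule2Distance 3.0f    // separation radius
#define rule3Distance 5.0f    // alignment radius

#define rule1Scale 0.01f
#define rule2Scale 0.1f
#define rule3Scale 0.1f

#define scene_scale 100.0f    // scene scale
#define DT 0.2f               // simulation time step
```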

---
### Analysis

* Simulation Time vs. Number of Particles
* For the brute-force version, the simulation time grows polynomially as the particle count increases. This is expected: although each thread does O(n) work, the number of threads also grows with n, and the GPU cannot run all of them in parallel at once. The total time therefore still grows polynomially, just less steeply than a sequential implementation would (see the kernel sketch after the plot below).
* The scattered and coherent grid versions still show slight super-linear growth, but it is far gentler and looks almost linear. This is expected because each particle has far fewer neighbours to examine in each step. Statistically, the number of neighbours per particle grows linearly with the particle count, but the number of threads increases at the same time, so the overall cost should be a little more expensive than O(n).
![sp](images/st_pc.png)
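
A minimal sketch of the brute-force neighbour loop discussed above, assuming the usual one-thread-per-boid structure (the kernel name, parameters, and glm types are illustrative, not copied from this PR):

```cuda
#include <glm/glm.hpp>

// Each thread handles one boid and scans all N others, so one simulation step
// costs O(n) work per thread and O(n^2) work in total.
__global__ void kernUpdateVelocityBruteForce(int N, const glm::vec3 *pos,
                                             const glm::vec3 *vel1, glm::vec3 *vel2) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= N) {
    return;
  }
  glm::vec3 change(0.0f);
  for (int j = 0; j < N; ++j) {  // the O(n) scan described above
    if (j == index) {
      continue;
    }
    // accumulate cohesion, separation, and alignment contributions from pos[j] and vel1[j]
  }
  vel2[index] = vel1[index] + change;  // write into a second buffer to avoid read/write races
}
```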

* Simulation Time vs. Block Size
* The relationship between simulation time and block size looks somewhat noisy, but the overall behavior is expected. Since the GPU is guaranteed to execute each block on a single SM, packing more threads that access the same memory region with a similar access pattern into one block should improve performance through a higher cache hit rate. However, putting too many threads in a single block can hurt performance if an SM cannot execute all of the block's threads at once. The launch configuration used for these measurements is sketched after the plot below.
![sb](images/bs_st.png)
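
A sketch of how the block size feeds the kernel launch, reusing the brute-force kernel sketched earlier (the host function and variable names are assumptions): the grid is sized so that every particle gets exactly one thread regardless of the block size chosen.

```cuda
void stepSimulationBruteForce(int N, glm::vec3 *dev_pos,
                              glm::vec3 *dev_vel1, glm::vec3 *dev_vel2) {
  const int blockSize = 128;                                // block-size value being tested
  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);  // ceil(N / blockSize) blocks
  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos, dev_vel1, dev_vel2);
  cudaDeviceSynchronize();  // only needed so timing captures the kernel, not just the launch
}
```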

* Coherent Grid vs. Scattered Grid
* In my experiments, the coherent grid performs better than the scattered grid. This is expected: even though reordering the position and velocity arrays has a cost, the gain from the increased cache hit rate outweighs the cost of the copy and the extra kernel launches. Adjacent threads tend to share neighbouring cells, so they access the same memory regions as they execute. Even a single thread benefits from a higher cache hit rate, because after sorting, the data for all particles in the same cell sit in one contiguous memory region. A sketch of the reordering kernel follows.
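
A sketch of the reordering kernel that produces the coherent layout described above (the kernel and buffer names are assumptions): after the particle indices are sorted by grid cell, the position and velocity data are copied into cell order so that threads walking one cell read a single contiguous memory region.

```cuda
__global__ void kernReshuffleData(int N, const int *particleArrayIndices,
                                  const glm::vec3 *pos, const glm::vec3 *vel,
                                  glm::vec3 *posSorted, glm::vec3 *velSorted) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= N) {
    return;
  }
  int src = particleArrayIndices[index];  // where this particle's data lived before sorting
  posSorted[index] = pos[src];
  velSorted[index] = vel[src];
}
```
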
Binary file added images/bs_st.png
Binary file added images/demo1.gif
Binary file added images/st_pc.png
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
Expand Up @@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)