diff --git a/README.md b/README.md
index cad1abd..1db65cf 100644
--- a/README.md
+++ b/README.md
@@ -5,13 +5,76 @@ CUDA Rasterizer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jian Ru
+* Tested on: Windows 10, i7-4850 @ 2.3GHz 16GB, GT 750M 2GB (Personal)
 
-### (TODO: Your README)
+### Overview
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+**Summary:** A basic software rasterizer implemented in CUDA
+
+**Features:**
+
+* Basic pipeline
+  * Vertex assembly and vertex shader
+  * Primitive assembly
+  * Rasterization
+  * Depth test
+  * Fragment shader
+* SSAO
+* Tile-based rasterization
+* Texture mapping with perspective-correct interpolation and bilinear filtering
+* Back face culling
+
+**Result:**
+
+![](renders/demo.gif)
+
+### Performance Analysis
+
+* SSAO
+  * Implemented after due date. Please do not count it into my grade because I don't want to use my late days.
+  * Screen space ambient occlusion darkens area where occlusion is likely to happen. The main idea is that you take a bunch of samples in the upper hemisphere and at run time, for each pixel, you transform the samples into eye space, randomly rotate them about z-axis, and align the z-axis of their local frame with the eye-space normal of the pixel. Then you project the samples onto screen space and compare their depth with the depth of the pixel at that point (need to use LINEAR depth). A sample is occluded if the pixel is before it (less deep). The goal is to compute the ratio of samples being occluded and darken pixel color accordingly.
+  * SSAO is not physically accurate but it is usually visually plausible and cheap enough for real-time rendering. The effect is usually subtle but does enhance realism of object appearance.
+  * In the test, I used 64 samples and a 4x4 noise texture. The cost is about 3 ms. This cost scales with render target resolution and is not affected by scene complexity.
+  ![](renders/ssao_perf.png)
+  
+  | 2_cylineder_engine | di |
+  | --- | --- |
+  | ![](renders/ssao_engine.png) | ![](renders/ssao_di.png) |
+  
+  | SSAO On | SSAO Off |
+  | --- | --- |
+  | ![](renders/ssao_demo1.png) | ![](renders/ssao_demo2.png) |
+* Tile-based rasterization
+  * The following graph shows the performance of tile-based (with/without culling) and per-primitive rasterization tested on various 3D models. It shows that tile-based approach has better performance on models that have fewer but bigger triangles but it is slightly slower than the basic approach when the scene is composed of small triangles. Even if so, tile-based approach is more stable in terms of execution time and totally eliminates write conflicts. The drawback is that we need more memory to mantain a list of triangles for each tile. The size of extra memory required can be quite large if we choose to be safe and we may risk data corruption if we make it too small. Back face culling is important here because it can significantly reduce the number of triangles each tile may overlap and thus reduce the size of triangle list and improves performance.
+  ![](renders/ras_perf.png)
+  * Performance vs. tile size
+    * I experimented several possible tile sizes and found that an 8x8 tile size works the best on my computer
+    ![](renders/ts_perf.png)
+    * According to the perf report from NSight, my tile-based rasterization kernel is expensive and resource demanding. It uses 63 register per thread (I tried to optimize on register usage but could not reduce the number further. I guess nvcc may have done the optimization already) and contains relatively complex logics. Given this nature, incrasing tile size doesn't improve occupancy before each SM has almost saturated (occupancy data from NSight agrees on this explanation), however, it increases the number of triangles each tile needs to process. Increasing execution complexity without incrasing occupancy, that is why increasing tile size from 8 doesn't improve performance. On the other hand, due to the active block limit of my GPU, further decrease tile size cannot increase the number of active blocks on each SM (occupancy data from NSight shows decreased occupancy from 50% to 25%). For a tile size of 4, each block contains only 16 threads, which is less than warp size. So half of the threads in a warp are doing nothing. This is also another reason not to decrease tile size further.
+    ![](renders/occupancy_graph.png)
+    
+    | Tile Size 4 | Tile Size 8 |
+    | --- | --- |
+    | ![](renders/ts4_occupancy.png) | ![](renders/ts8_occupancy.png) |
+    
+    | Tile Size 16 | Tile Size 32 |
+    | --- | --- |
+    | ![](renders/ts16_occupancy.png) | ![](renders/ts32_occupancy.png) |
+* Bilinear interpolation vs. Nearest Neighbor
+  * Based on experimentation, bilinear texture filtering has small hit on performance but increase texture quality especially when the object is close to the camera.
+  ![](renders/interp_perf.png)
+  
+  | Bilinear | Nearest Neighbor |
+  | --- | --- |
+  | ![](renders/bilerp.png) | ![](renders/nn.png) |
+
+* Perspective correctness
+  * Since the Barycentric coordinates of triangles are calculated in screen space, we cannot use it to perform linear interpolation directly. The main reason is that perspective projection doesn't preserver distance (parallel lines will converge to a single point after perspective projection). Luckily, people have found out that doing linear interpolation on the reciprocals of vertex depths (in eye space) does give perspective correct results for depth. For other vertex attributes, we can linearly interpolate each attribute scaled by the reciprocal of corresponding depth value and multiply the result by the correctly interpolated depth. So this leads to the implementation of perspective correct interpolation of vertex attributes (the effect is obvious when you have large textured triangles).
+  
+  | With Perspective Correction | Without Perspective Correction |
+  | --- | --- |
+  | ![](renders/perspective_correct.png) | ![](renders/perspective_incorrect.png) |
 
 
 ### Credits
diff --git a/external/lib/win/glfw3.lib b/external/lib/win/glfw3.lib
index efdd643..a00c640 100644
Binary files a/external/lib/win/glfw3.lib and b/external/lib/win/glfw3.lib differ
diff --git a/external/lib/win/glfw3dll.lib b/external/lib/win/glfw3dll.lib
index 365cdba..51e92ee 100644
Binary files a/external/lib/win/glfw3dll.lib and b/external/lib/win/glfw3dll.lib differ
diff --git a/gltfs/checkerboard/checkerboard.gltf b/gltfs/checkerboard/checkerboard.gltf
new file mode 100644
index 0000000..41cd769
--- /dev/null
+++ b/gltfs/checkerboard/checkerboard.gltf
@@ -0,0 +1,301 @@
+{
+  "accessors": {
+    "accessor_index_0": {
+      "bufferView": "bufferView_1",
+      "byteOffset": 0,
+      "byteStride": 0,
+      "componentType": 5123,
+      "count": 6,
+      "type": "SCALAR",
+      "min": [
+        0
+      ],
+      "max": [
+        3
+      ]
+    },
+    "accessor_position": {
+      "bufferView": "bufferView_0",
+      "byteOffset": 0,
+      "byteStride": 0,
+      "componentType": 5126,
+      "count": 4,
+      "min": [
+        -1,
+        0,
+        -1
+      ],
+      "max": [
+        1,
+        0,
+        1
+      ],
+      "type": "VEC3"
+    },
+    "accessor_uv": {
+      "bufferView": "bufferView_0",
+      "byteOffset": 48,
+      "byteStride": 0,
+      "componentType": 5126,
+      "count": 4,
+      "type": "VEC2",
+      "min": [
+        0.00009999999747378752,
+        0.00009999999747378752
+      ],
+      "max": [
+        0.9998999834060669,
+        0.9998999834060669
+      ]
+    },
+    "accessor_normal_generated": {
+      "bufferView": "bufferView_0",
+      "byteOffset": 80,
+      "byteStride": 0,
+      "componentType": 5126,
+      "count": 4,
+      "type": "VEC3",
+      "min": [
+        0,
+        1,
+        0
+      ],
+      "max": [
+        0,
+        1,
+        0
+      ]
+    }
+  },
+  "asset": {
+    "generator": "OBJ2GLTF",
+    "premultipliedAlpha": true,
+    "profile": {
+      "api": "WebGL",
+      "version": "1.0"
+    },
+    "version": "1.0"
+  },
+  "buffers": {
+    "buffer_0": {
+      "type": "arraybuffer",
+      "byteLength": 140,
+      "uri": "data:application/octet-stream;base64,AACAvwAAAAAAAIA/AACAPwAAAAAAAIA/AACAPwAAAAAAAIC/AACAvwAAAAAAAIC/F7fROHL5fz9y+X8/cvl/P3L5fz8Xt9E4F7fROBe30TgAAAAAAACAPwAAAAAAAAAAAACAPwAAAAAAAAAAAACAPwAAAAAAAAAAAACAPwAAAAAAAAEAAgAAAAIAAwA="
+    }
+  },
+  "bufferViews": {
+    "bufferView_0": {
+      "buffer": "buffer_0",
+      "byteLength": 128,
+      "byteOffset": 0,
+      "target": 34962
+    },
+    "bufferView_1": {
+      "buffer": "buffer_0",
+      "byteLength": 12,
+      "byteOffset": 128,
+      "target": 34963
+    }
+  },
+  "images": {
+    "Checkered": {
+      "uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAhwAAAIcCAIAAAAynOArAAAACXBIWXMAAAsTAAALEwEAmpwYAAALjUlEQVR42u3ZMU5jZxuGYY5tJtiWEGnRrGNSjKZmD6np2Uf2geQGiWWkQyILCQVgG5njvxhpVnAX/3GuawevnuLWp2/4559/lsvl2cTNZrO///77zz///P3336d+yx9//PHXX3+tViujGMUoRpmWi4uLxWq1OoFV5vP5zzT++++/U7/l7e1ttVqt12ujGMUoRpmW1Wo1m/oNAPz/EBUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAzGI2m83n86mfcQIn/LLb7YZhMIpRjGKUyXl/fx82m81yuTyBY8ZxfH19PR6PUz/kcDhcXl6en58bxShGMcq0fHx8DCewx0/39/c3NzfjOE79kO12e3d39/j4aBSjGMUo07Lf7xdXV1cvLy8nsMrxeBzH8fPz8wRuGYYTib1RjGKU/9Qo4zj6qAcgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAmcW3b9/e3t6mfsZutzscDrvd7ng8Tv2W7Xb79evX79+/G8UoRjHKtOz3++H5+Xm5XE79kmEYnp6eNpvNbDb5t9f19fXt7e3FxYVRjGIUo0zLly9fFuv1erVaTf2S+Xy+WCweHh6mfsjZ2dmPHz+Wy+V6vTaKUYxilGlZrVb+VADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJAZjGbzebz+dTPOIETftntdsMwGMUoRjHK5Ly/vw+bzWa5XJ7AMeM4vr6+Ho/HqR9yOBwuLy/Pz8+NYhSjGGVaPj4+hhPY46f7+/ubm5txHKd+yHa7vbu7e3x8NIpRjGKUadnv94urq6uXl5cTWOV4PI7j+Pn5eQK3DMOJxN4oRjHKf2qUcRx91AOQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAzOLbt29vb29TP2O32x0Oh91udzwep37Ldrv9+vXr9+/fjWIUoxhlWvb7/fD8/LxcLqd+yTAMT09Pm81mNpv82+v6+vr29vbi4sIoRjGKUably5cvi/V6vVqtpn7JfD5fLBYPDw9TP+Ts7OzHjx/L5XK9XhvFKEYxyrSsVit/KgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgs5jNZvP5fOpnnMAJv+x2u2EYjGIUoxhlct7f34fNZrNcLk/gmHEcX19fj8fj1A85HA6Xl5fn5+dGMYpRjDItHx8fwwns8dP9/f3Nzc04jlM/ZLvd3t3dPT4+GsUoRjHKtOz3+8XV1dXLy8sJrHI8Hsdx/Pz8PIFbhuFEYm8UoxjlPzXKOI4+6gHIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJAZvHt27e3t7epn7Hb7Q6Hw263Ox6PU79lu91+/fr1+/fvRjGKUYwyLfv9fnh+fl4ul1O/ZBiGp6enzWYzm03+7XV9fX17e3txcWEUoxjFKNPy5cuXxXq9Xq1WU79kPp8vFouHh4epH3J2dvbjx4/lcrler41iFKMYZVpWq5U/FQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQWcxms/l8PvUzTuCEX3a73TAMRjGKUYwyOe/v78Nms1kulydwzDiOr6+vx+Nx6occDofLy8vz83OjGMUoRpmWj4+P4QT2+On+/v7m5mYcx6kfst1u7+7uHh8fjWIUoxhlWvb7/eLq6url5eUEVjkej+M4fn5+nsAtw3AisTeKUYzynxplHEcf9QBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgs7i4uFgul1M/Y7vdfnx8fHx8jOM49Vv2+/1vv/1mFKMYxShTHOV/51nv6/az2UUAAAAASUVORK5CYII="
+    }
+  },
+  "materials": {
+    "material_Material.001": {
+      "name": "Material.001",
+      "extensions": {},
+      "values": {
+        "ambient": [
+          0,
+          0,
+          0,
+          1
+        ],
+        "diffuse": "texture_Checkered",
+        "emission": [
+          0,
+          0,
+          0,
+          1
+        ],
+        "specular": [
+          0.5,
+          0.5,
+          0.5,
+          1
+        ],
+        "shininess": 96.078431,
+        "transparency": 1
+      },
+      "technique": "technique0"
+    }
+  },
+  "meshes": {
+    "mesh_checkerboard": {
+      "name": "checkerboard",
+      "primitives": [
+        {
+          "attributes": {
+            "POSITION": "accessor_position",
+            "TEXCOORD_0": "accessor_uv",
+            "NORMAL": "accessor_normal_generated"
+          },
+          "indices": "accessor_index_0",
+          "material": "material_Material.001",
+          "mode": 4
+        }
+      ]
+    }
+  },
+  "nodes": {
+    "rootNode": {
+      "children": [],
+      "meshes": [
+        "mesh_checkerboard"
+      ],
+      "matrix": [
+        1,
+        0,
+        0,
+        0,
+        0,
+        1,
+        0,
+        0,
+        0,
+        0,
+        1,
+        0,
+        0,
+        0,
+        0,
+        1
+      ]
+    }
+  },
+  "samplers": {
+    "sampler_0": {
+      "magFilter": 9729,
+      "minFilter": 9986,
+      "wrapS": 10497,
+      "wrapT": 10497
+    }
+  },
+  "scene": "scene_checkerboard",
+  "scenes": {
+    "scene_checkerboard": {
+      "nodes": [
+        "rootNode"
+      ]
+    }
+  },
+  "textures": {
+    "texture_Checkered": {
+      "format": 6407,
+      "internalFormat": 6407,
+      "sampler": "sampler_0",
+      "source": "Checkered",
+      "target": 3553,
+      "type": 5121
+    }
+  },
+  "extensionsUsed": [],
+  "animations": {},
+  "cameras": {},
+  "techniques": {
+    "technique0": {
+      "attributes": {
+        "a_position": "position",
+        "a_normal": "normal",
+        "a_texcoord_0": "texcoord_0"
+      },
+      "parameters": {
+        "modelViewMatrix": {
+          "semantic": "MODELVIEW",
+          "type": 35676
+        },
+        "projectionMatrix": {
+          "semantic": "PROJECTION",
+          "type": 35676
+        },
+        "normalMatrix": {
+          "semantic": "MODELVIEWINVERSETRANSPOSE",
+          "type": 35675
+        },
+        "ambient": {
+          "type": 35666
+        },
+        "diffuse": {
+          "type": 35678
+        },
+        "emission": {
+          "type": 35666
+        },
+        "specular": {
+          "type": 35666
+        },
+        "shininess": {
+          "type": 5126
+        },
+        "transparency": {
+          "type": 5126
+        },
+        "position": {
+          "semantic": "POSITION",
+          "type": 35665
+        },
+        "normal": {
+          "semantic": "NORMAL",
+          "type": 35665
+        },
+        "texcoord_0": {
+          "semantic": "TEXCOORD_0",
+          "type": 35664
+        }
+      },
+      "program": "program0",
+      "states": {
+        "enable": [
+          2884,
+          2929
+        ]
+      },
+      "uniforms": {
+        "u_modelViewMatrix": "modelViewMatrix",
+        "u_projectionMatrix": "projectionMatrix",
+        "u_normalMatrix": "normalMatrix",
+        "u_ambient": "ambient",
+        "u_diffuse": "diffuse",
+        "u_emission": "emission",
+        "u_specular": "specular",
+        "u_shininess": "shininess",
+        "u_transparency": "transparency"
+      }
+    }
+  },
+  "programs": {
+    "program0": {
+      "attributes": [
+        "a_position",
+        "a_normal",
+        "a_texcoord_0"
+      ],
+      "fragmentShader": "fragmentShader0",
+      "vertexShader": "vertexShader0"
+    }
+  },
+  "shaders": {
+    "vertexShader0": {
+      "type": 35633,
+      "uri": "data:text/plain;base64,cHJlY2lzaW9uIGhpZ2hwIGZsb2F0Owp1bmlmb3JtIG1hdDQgdV9tb2RlbFZpZXdNYXRyaXg7CnVuaWZvcm0gbWF0NCB1X3Byb2plY3Rpb25NYXRyaXg7CnVuaWZvcm0gbWF0MyB1X25vcm1hbE1hdHJpeDsKYXR0cmlidXRlIHZlYzMgYV9wb3NpdGlvbjsKdmFyeWluZyB2ZWMzIHZfcG9zaXRpb25FQzsKYXR0cmlidXRlIHZlYzMgYV9ub3JtYWw7CnZhcnlpbmcgdmVjMyB2X25vcm1hbDsKYXR0cmlidXRlIHZlYzIgYV90ZXhjb29yZF8wOwp2YXJ5aW5nIHZlYzIgdl90ZXhjb29yZF8wOwp2b2lkIG1haW4odm9pZCkgewogIHZlYzQgcG9zID0gdV9tb2RlbFZpZXdNYXRyaXggKiB2ZWM0KGFfcG9zaXRpb24sMS4wKTsKICB2X3Bvc2l0aW9uRUMgPSBwb3MueHl6OwogIGdsX1Bvc2l0aW9uID0gdV9wcm9qZWN0aW9uTWF0cml4ICogcG9zOwogIHZfbm9ybWFsID0gdV9ub3JtYWxNYXRyaXggKiBhX25vcm1hbDsKICB2X3RleGNvb3JkXzAgPSBhX3RleGNvb3JkXzA7Cn0K"
+    },
+    "fragmentShader0": {
+      "type": 35632,
+      "uri": "data:text/plain;base64,cHJlY2lzaW9uIGhpZ2hwIGZsb2F0Owp1bmlmb3JtIHZlYzQgdV9hbWJpZW50Owp1bmlmb3JtIHNhbXBsZXIyRCB1X2RpZmZ1c2U7CnVuaWZvcm0gdmVjNCB1X2VtaXNzaW9uOwp1bmlmb3JtIHZlYzQgdV9zcGVjdWxhcjsKdW5pZm9ybSBmbG9hdCB1X3NoaW5pbmVzczsKdW5pZm9ybSBmbG9hdCB1X3RyYW5zcGFyZW5jeTsKdmFyeWluZyB2ZWMzIHZfcG9zaXRpb25FQzsKdmFyeWluZyB2ZWMzIHZfbm9ybWFsOwp2YXJ5aW5nIHZlYzIgdl90ZXhjb29yZF8wOwp2b2lkIG1haW4odm9pZCkgewogIHZlYzMgbm9ybWFsID0gbm9ybWFsaXplKHZfbm9ybWFsKTsKICB2ZWM0IGRpZmZ1c2UgPSB0ZXh0dXJlMkQodV9kaWZmdXNlLCB2X3RleGNvb3JkXzApOwogIHZlYzMgZGlmZnVzZUxpZ2h0ID0gdmVjMygwLjAsIDAuMCwgMC4wKTsKICB2ZWMzIHNwZWN1bGFyID0gdV9zcGVjdWxhci5yZ2I7CiAgdmVjMyBzcGVjdWxhckxpZ2h0ID0gdmVjMygwLjAsIDAuMCwgMC4wKTsKICB2ZWMzIGVtaXNzaW9uID0gdV9lbWlzc2lvbi5yZ2I7CiAgdmVjMyBhbWJpZW50ID0gdV9hbWJpZW50LnJnYjsKICB2ZWMzIHZpZXdEaXIgPSAtbm9ybWFsaXplKHZfcG9zaXRpb25FQyk7CiAgdmVjMyBhbWJpZW50TGlnaHQgPSB2ZWMzKDAuMCwgMC4wLCAwLjApOwogIGFtYmllbnRMaWdodCArPSB2ZWMzKDAuMiwgMC4yLCAwLjIpOwogIHZlYzMgbCA9IHZlYzMoMC4wLCAwLjAsIDEuMCk7CiAgZGlmZnVzZUxpZ2h0ICs9IHZlYzMoMS4wLCAxLjAsIDEuMCkgKiBtYXgoZG90KG5vcm1hbCxsKSwgMC4pOwogIHZlYzMgaCA9IG5vcm1hbGl6ZShsICsgdmlld0Rpcik7CiAgZmxvYXQgc3BlY3VsYXJJbnRlbnNpdHkgPSBtYXgoMC4sIHBvdyhtYXgoZG90KG5vcm1hbCwgaCksIDAuKSwgdV9zaGluaW5lc3MpKTsKICBzcGVjdWxhckxpZ2h0ICs9IHZlYzMoMS4wLCAxLjAsIDEuMCkgKiBzcGVjdWxhckludGVuc2l0eTsKICB2ZWMzIGNvbG9yID0gdmVjMygwLjAsIDAuMCwgMC4wKTsKICBjb2xvciArPSBkaWZmdXNlLnJnYiAqIGRpZmZ1c2VMaWdodDsKICBjb2xvciArPSBzcGVjdWxhciAqIHNwZWN1bGFyTGlnaHQ7CiAgY29sb3IgKz0gZW1pc3Npb247CiAgY29sb3IgKz0gYW1iaWVudCAqIGFtYmllbnRMaWdodDsKICBnbF9GcmFnQ29sb3IgPSB2ZWM0KGNvbG9yICogZGlmZnVzZS5hLCBkaWZmdXNlLmEgKiB1X3RyYW5zcGFyZW5jeSk7Cn0K"
+    }
+  },
+  "skins": {},
+  "extensions": {}
+}
\ No newline at end of file
diff --git a/gltfs/flower/flower.gltf b/gltfs/flower/flower.gltf
index a59e3d2..2b3edb2 100644
Binary files a/gltfs/flower/flower.gltf and b/gltfs/flower/flower.gltf differ
diff --git a/renders/bilerp.png b/renders/bilerp.png
new file mode 100644
index 0000000..2796587
Binary files /dev/null and b/renders/bilerp.png differ
diff --git a/renders/demo.gif b/renders/demo.gif
new file mode 100644
index 0000000..4ebf0ca
Binary files /dev/null and b/renders/demo.gif differ
diff --git a/renders/demo_checker.gif b/renders/demo_checker.gif
new file mode 100644
index 0000000..a298d83
Binary files /dev/null and b/renders/demo_checker.gif differ
diff --git a/renders/demo_duck.gif b/renders/demo_duck.gif
new file mode 100644
index 0000000..33812dc
Binary files /dev/null and b/renders/demo_duck.gif differ
diff --git a/renders/demo_engine.gif b/renders/demo_engine.gif
new file mode 100644
index 0000000..2ea2760
Binary files /dev/null and b/renders/demo_engine.gif differ
diff --git a/renders/demo_truck.gif b/renders/demo_truck.gif
new file mode 100644
index 0000000..18d5a6b
Binary files /dev/null and b/renders/demo_truck.gif differ
diff --git a/renders/interp_perf.png b/renders/interp_perf.png
new file mode 100644
index 0000000..0a74e41
Binary files /dev/null and b/renders/interp_perf.png differ
diff --git a/renders/nn.png b/renders/nn.png
new file mode 100644
index 0000000..e87992d
Binary files /dev/null and b/renders/nn.png differ
diff --git a/renders/occupancy_graph.png b/renders/occupancy_graph.png
new file mode 100644
index 0000000..8fcdb9b
Binary files /dev/null and b/renders/occupancy_graph.png differ
diff --git a/renders/perspective_correct.png b/renders/perspective_correct.png
new file mode 100644
index 0000000..67e01a9
Binary files /dev/null and b/renders/perspective_correct.png differ
diff --git a/renders/perspective_incorrect.png b/renders/perspective_incorrect.png
new file mode 100644
index 0000000..ed7c47f
Binary files /dev/null and b/renders/perspective_incorrect.png differ
diff --git a/renders/proj4_perf_data.xlsx b/renders/proj4_perf_data.xlsx
new file mode 100644
index 0000000..50b666c
Binary files /dev/null and b/renders/proj4_perf_data.xlsx differ
diff --git a/renders/ras_perf.png b/renders/ras_perf.png
new file mode 100644
index 0000000..f607af6
Binary files /dev/null and b/renders/ras_perf.png differ
diff --git a/renders/ssao_demo1.png b/renders/ssao_demo1.png
new file mode 100644
index 0000000..6496886
Binary files /dev/null and b/renders/ssao_demo1.png differ
diff --git a/renders/ssao_demo2.png b/renders/ssao_demo2.png
new file mode 100644
index 0000000..ec9f2ab
Binary files /dev/null and b/renders/ssao_demo2.png differ
diff --git a/renders/ssao_di.png b/renders/ssao_di.png
new file mode 100644
index 0000000..176e2ff
Binary files /dev/null and b/renders/ssao_di.png differ
diff --git a/renders/ssao_engine.png b/renders/ssao_engine.png
new file mode 100644
index 0000000..98e07f7
Binary files /dev/null and b/renders/ssao_engine.png differ
diff --git a/renders/ssao_perf.png b/renders/ssao_perf.png
new file mode 100644
index 0000000..0898ce6
Binary files /dev/null and b/renders/ssao_perf.png differ
diff --git a/renders/ssao_perf_data.xlsx b/renders/ssao_perf_data.xlsx
new file mode 100644
index 0000000..bfd4cb4
Binary files /dev/null and b/renders/ssao_perf_data.xlsx differ
diff --git a/renders/ts16_occupancy.png b/renders/ts16_occupancy.png
new file mode 100644
index 0000000..26ed31b
Binary files /dev/null and b/renders/ts16_occupancy.png differ
diff --git a/renders/ts32_occupancy.png b/renders/ts32_occupancy.png
new file mode 100644
index 0000000..a144d62
Binary files /dev/null and b/renders/ts32_occupancy.png differ
diff --git a/renders/ts4_occupancy.png b/renders/ts4_occupancy.png
new file mode 100644
index 0000000..428ef6f
Binary files /dev/null and b/renders/ts4_occupancy.png differ
diff --git a/renders/ts8_occupancy.png b/renders/ts8_occupancy.png
new file mode 100644
index 0000000..8b98af4
Binary files /dev/null and b/renders/ts8_occupancy.png differ
diff --git a/renders/ts_perf.png b/renders/ts_perf.png
new file mode 100644
index 0000000..20eb05c
Binary files /dev/null and b/renders/ts_perf.png differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a57f69f..00edee0 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/src/main.cpp b/src/main.cpp
index a36b955..d51859e 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -17,6 +17,7 @@
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
+float sampleKernelRadius = 1.f;
 
 int main(int argc, char **argv) {
     if (argc != 2) {
@@ -62,21 +63,40 @@ int main(int argc, char **argv) {
     return 0;
 }
 
-void mainLoop() {
-    while (!glfwWindowShouldClose(window)) {
+void mainLoop()
+{
+	float runCudaTimeMs;
+	float accumRunCudaTimeMs = 0.f;
+	float averageRunCudaTimeMs;
+	cudaEvent_t start, stop;
+	cudaEventCreate(&start);
+	cudaEventCreate(&stop);
+
+	while (!glfwWindowShouldClose(window)) {
         glfwPollEvents();
+
+		cudaEventRecord(start);
         runCuda();
+		cudaEventRecord(stop);
+		cudaEventSynchronize(stop);
+		cudaEventElapsedTime(&runCudaTimeMs, start, stop);
+		accumRunCudaTimeMs += runCudaTimeMs;
 
         time_t seconds2 = time (NULL);
 
-        if (seconds2 - seconds >= 1) {
+        if (seconds2 - seconds >= 1)
+		{
+			averageRunCudaTimeMs = accumRunCudaTimeMs / fpstracker;
+			accumRunCudaTimeMs = 0.f;
 
             fps = fpstracker / (seconds2 - seconds);
             fpstracker = 0;
             seconds = seconds2;
         }
 
-        string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS";
+        string title = "CIS565 Rasterizer | " + utilityCore::convertIntToString((int)fps) + " FPS | " +
+			utilityCore::convertFloatToString(averageRunCudaTimeMs) + " ms | " +
+			"Radius: " + utilityCore::convertFloatToString(sampleKernelRadius);
         glfwSetWindowTitle(window, title.c_str());
 
         glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
@@ -95,6 +115,11 @@ void mainLoop() {
 //-------------------------------
 //---------RUNTIME STUFF---------
 //-------------------------------
+enum WhichVarToChange
+{
+	None,
+	SampleKernelRadius
+} whichVarToChange;
 float scale = 1.0f;
 float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
@@ -103,23 +128,23 @@ void runCuda() {
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
     dptr = NULL;
 
-	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
-		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+	glm::mat4 P = glm::perspective(45.f, (float)width / (float)height, .1f, 100.f);
 
 	glm::mat4 V = glm::mat4(1.0f);
 
 	glm::mat4 M =
 		glm::translate(glm::vec3(x_trans, y_trans, z_trans))
 		* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
-		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
+		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f))
+		//* glm::scale(glm::vec3(1.f, 1.f, 1.f));
+		* glm::scale(glm::vec3(.01f, .01f, .01f));
 
 	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
 	glm::mat4 MV = V * M;
 	glm::mat4 MVP = P * MV;
 
     cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
+	rasterize(dptr, MVP, MV, MV_normal, P);
     cudaGLUnmapBufferObject(pbo);
 
     frame++;
@@ -130,7 +155,10 @@ void runCuda() {
 //----------SETUP STUFF----------
 //-------------------------------
 
-bool init(const tinygltf::Scene & scene) {
+bool init(const tinygltf::Scene & scene)
+{
+	whichVarToChange = None;
+
     glfwSetErrorCallback(errorCallback);
 
     if (!glfwInit()) {
@@ -327,6 +355,28 @@ void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods
     if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
         glfwSetWindowShouldClose(window, GL_TRUE);
     }
+	else if (key == GLFW_KEY_1 && action == GLFW_PRESS)
+	{
+		whichVarToChange = (whichVarToChange == SampleKernelRadius) ? None : SampleKernelRadius;
+	}
+	else if (key == GLFW_KEY_UP && action == GLFW_PRESS)
+	{
+		switch (whichVarToChange)
+		{
+		case SampleKernelRadius:
+			sampleKernelRadius += .1f;
+			break;
+		}
+	}
+	else if (key == GLFW_KEY_DOWN && action == GLFW_PRESS)
+	{
+		switch (whichVarToChange)
+		{
+		case SampleKernelRadius:
+			sampleKernelRadius -= .1f;
+			break;
+		}
+	}
 }
 
 //----------------------------
@@ -394,6 +444,6 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 
 void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
 {
-	const double s = 1.0;	// sensitivity
+	const double s = 0.5;	// sensitivity
 	z_trans += (float)(s * yoffset);
 }
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 4e3504b..20ecce0 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -6,6 +6,9 @@
  * @copyright University of Pennsylvania & STUDENT
  */
 
+#include <random>
+#include <functional>
+#include <chrono>
 #include <cmath>
 #include <cstdio>
 #include <cuda.h>
@@ -18,6 +21,14 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
+
+#define TILE_SIZE 8
+#define TILE_TRI_LIST_SCALE 0.1f
+#define TRI_LIST_SCALE_THRESHOLD 100
+
+#define ROUND_UP_DIV(x, n) (((x) + (n) - 1) / (n))
+
+
 namespace {
 
 	typedef unsigned short VertexIndex;
@@ -46,7 +57,8 @@ namespace {
 		// glm::vec3 col;
 		 glm::vec2 texcoord0;
 		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
+		 int diffuseTexWidth;
+		 int diffuseTexHeight;
 		// ...
 	};
 
@@ -56,16 +68,20 @@ namespace {
 	};
 
 	struct Fragment {
-		glm::vec3 color;
+		//glm::vec3 color;
 
 		// TODO: add new attributes to your Fragment
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
+		glm::vec3 eyePos;	// eye space position used for shading
+		glm::vec3 eyeNor;
+		VertexAttributeTexcoord texcoord0;
+		TextureData* dev_diffuseTex;
+		int diffuseTexWidth;
+		int diffuseTexHeight;
+		int shouldShade;
+		float ssaoFactor;
 		// ...
 	};
 
@@ -83,6 +99,8 @@ namespace {
 		VertexAttributeTexcoord* dev_texcoord0;
 
 		// Materials, add more attributes when needed
+		int diffuseTexWidth;
+		int diffuseTexHeight;
 		TextureData* dev_diffuseTex;
 		// TextureData* dev_specularTex;
 		// TextureData* dev_normalTex;
@@ -107,7 +125,25 @@ static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
-static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+static float *dev_depth = NULL;	// you might need this buffer when doing depth test
+
+// Tile-based rasterization
+static int numTilesX = 0;
+static int numTilesY = 0;
+static int triListSize = 0;
+static int *dev_primCounts = nullptr; // one entry per tile
+static int *dev_tileTriLists = nullptr;
+
+// SSAO
+extern float sampleKernelRadius;
+static const int numSamples = 64;
+static const int noiseTexSize1D = 4;
+// random points in a upper hemisphere whose z axis will be rotated to align
+// with pixel normal in eye space
+static glm::vec3 *dev_samples = nullptr;
+// A 4x4 noise texture. Each pixel contains an x-axis randomly rotated in the x-y plane
+// Will be tiled across the entire frame buffer
+static glm::vec3 *dev_noiseTex = nullptr;
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
@@ -116,7 +152,8 @@ __global__
 void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
+    int index = x + y * w;
+	int outIdx = w - x - 1 + (h - y - 1) * w;
 
     if (x < w && y < h) {
         glm::vec3 color;
@@ -124,10 +161,10 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
         color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
         color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
         // Each thread writes one pixel location in the texture (textel)
-        pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
+        pbo[outIdx].w = 0;
+		pbo[outIdx].x = color.x;
+		pbo[outIdx].y = color.y;
+		pbo[outIdx].z = color.z;
     }
 }
 
@@ -140,14 +177,199 @@ void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
     int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+    if (x < w && y < h && fragmentBuffer[index].shouldShade) {
+        //framebuffer[index] = fragmentBuffer[index].color;
 
 		// TODO: add your fragment shader code here
+		// Phone shading
+		const glm::vec3 lightDir = glm::normalize(glm::vec3(1.f, 1.f, 1.f)); // in eye space
+		const glm::vec3 ambientColor(1.f, 1.f, 1.f);
+		glm::vec3 diffuseColor;
+		const glm::vec3 specularColor(1.f, 1.f, 1.f);
+		const float specExp = 20.f;
+		const float Ka = 0.1f;
+		const float Kd = 0.5f;
+		const float Ks = 0.5f;
+
+		if (fragmentBuffer[index].dev_diffuseTex)
+		{
+			int width = fragmentBuffer[index].diffuseTexWidth;
+			int height = fragmentBuffer[index].diffuseTexHeight;
+			glm::vec2 uv = fragmentBuffer[index].texcoord0;
+			float fx = uv.x * width - 0.5f;
+			float fy = uv.y * height - 0.5f;
+			float wx = fx - glm::floor(fx);
+			float wy = fy - glm::floor(fy);
+			int x = glm::min(width - 2, glm::max(0, static_cast<int>(fx)));
+			int y = glm::min(height - 2, glm::max(0, static_cast<int>(fy)));
+			int i00 = y * width + x;
+			int i10 = y * width + x + 1;
+			int i01 = (y + 1) * width + x;
+			int i11 = (y + 1) * width + x + 1;
+			const TextureData *tex = fragmentBuffer[index].dev_diffuseTex;
+			glm::vec3 p00(tex[i00 * 3] / 255.f, tex[i00 * 3 + 1] / 255.f, tex[i00 * 3 + 2] / 255.f);
+			glm::vec3 p10(tex[i10 * 3] / 255.f, tex[i10 * 3 + 1] / 255.f, tex[i10 * 3 + 2] / 255.f);
+			glm::vec3 p01(tex[i01 * 3] / 255.f, tex[i01 * 3 + 1] / 255.f, tex[i01 * 3 + 2] / 255.f);
+			glm::vec3 p11(tex[i11 * 3] / 255.f, tex[i11 * 3 + 1] / 255.f, tex[i11 * 3 + 2] / 255.f);
+			diffuseColor = (1.f - wy) * ((1.f - wx) * p00 + wx * p10) + wy * ((1.f - wx) * p01 + wx * p11);
+		}
+		else
+		{
+			diffuseColor = glm::vec3(1.f, 1.f, 1.f);
+		}
+
+		glm::vec3 eyePos = fragmentBuffer[index].eyePos;
+		glm::vec3 eyeNor = fragmentBuffer[index].eyeNor;
 
+		glm::vec3 h = glm::normalize(lightDir - eyePos);
+		float costhetah = glm::max(0.f, glm::dot(h, eyeNor));
+		float costheta = glm::max(0.f, glm::dot(lightDir, eyeNor));
+
+		framebuffer[index] =
+			fragmentBuffer[index].ssaoFactor * (
+			Ka * ambientColor +
+			Kd * diffuseColor * costheta +
+			Ks * specularColor * powf(costhetah, specExp));
+			//glm::vec3(fragmentBuffer[index].ssaoFactor);
     }
 }
 
+__device__ glm::mat3 coordinateSystemZFromZX(const glm::vec3 &z, const glm::vec3 &x)
+{
+	glm::vec3 t = glm::normalize(x - glm::dot(z, x) * z);
+	glm::vec3 b = glm::cross(z, t);
+	return glm::mat3(t, b, z);
+}
+
+__global__ void ssaoBlur(int width, int height, int blurSize, Fragment *fragmentBuffer)
+{
+	extern __shared__ float s_ssaoFactors[];
+
+	int padding = blurSize >> 1;
+	int startX = blockIdx.x * (blockDim.x - padding * 2) - padding;
+	int startY = blockIdx.y * (blockDim.y - padding * 2) - padding;
+	int gx = startX + threadIdx.x;
+	int gy = startY + threadIdx.y;
+	
+	if (gx >= 0 && gx < width && gy >= 0 && gy < height)
+	{
+		s_ssaoFactors[threadIdx.y * blockDim.x + threadIdx.x] = fragmentBuffer[gy * width + gx].ssaoFactor;
+	}
+	else
+	{
+		s_ssaoFactors[threadIdx.y * blockDim.x + threadIdx.x] = 0.f;
+	}
+
+	__syncthreads();
+
+	if (threadIdx.x >= padding && threadIdx.x < blockDim.x - padding &&
+		threadIdx.y >= padding && threadIdx.y < blockDim.y - padding &&
+		gx < width && gy < height)
+	{
+		float average = 0.f;
+		int base = -padding;
+
+		for (int i = 0; i < blurSize; ++i)
+		{
+			for (int j = 0; j < blurSize; ++j)
+			{
+				average += s_ssaoFactors[(threadIdx.y + base + i) * blockDim.x + (threadIdx.x + base + j)];
+			}
+		}
+
+		fragmentBuffer[gy * width + gx].ssaoFactor = average / static_cast<float>(blurSize * blurSize);
+	}
+}
+
+/**
+ * Compute occlusion values for each pixel
+ */
+__global__ void computeSSAO(int width, int height,
+	glm::mat4 proj,
+	int NoiseTexSize1D, const glm::vec3 *noiseTex,
+	float sampleKernelRadius, int numSamples, const glm::vec3 *samples,
+	Fragment *fragmentBuffer)
+{
+	int x = blockIdx.x * blockDim.x + threadIdx.x;
+	int y = blockIdx.y * blockDim.y + threadIdx.y;
+	int index = x + (y * width);
+
+	if (x < width && y < height)
+	{
+		Fragment &frag = fragmentBuffer[index];
+		glm::vec3 eyePos = frag.eyePos;
+		glm::vec3 eyeNrm = frag.eyeNor;
+		glm::vec3 right = noiseTex[(x & (NoiseTexSize1D - 1)) + NoiseTexSize1D * (y & (NoiseTexSize1D - 1))];
+		glm::mat3 tbn = coordinateSystemZFromZX(eyeNrm, right);
+
+		float occlusion = 0.f;
+		const float deepestDepth = -FLT_MAX * .5f; // the camera is looking into -z
+
+		for (int i = 0; i < numSamples; ++i)
+		{
+			// transform the sample into eye space
+			glm::vec3 sample = tbn * samples[i];
+			sample = eyePos + sampleKernelRadius * sample;
+
+			// project the sample onto screen space
+			glm::vec4 ss = proj * glm::vec4(sample, 1.0);
+			ss.x = (ss.x / ss.w * .5f + .5f) * width;
+			ss.y = (ss.y / ss.w * .5f + .5f) * height;
+
+			// get pixel depth at sample postion
+			// depth value needs to be linear for the range check to actually make sense
+			int sx = static_cast<int>(ss.x + .5f);
+			int sy = static_cast<int>(ss.y + .5f);
+			bool isNotValid = sx < 0 || sx >= width || sy < 0 || sy >= height || !fragmentBuffer[sy * width + sx].shouldShade;
+			float sampleDepth = isNotValid ? deepestDepth : fragmentBuffer[sy * width + sx].eyePos.z;
+			float rangeCheck = fabs(eyePos.z - sampleDepth) < sampleKernelRadius ? 1.f : 0.f;
+			occlusion += ((sampleDepth > sample.z) ? 1.f : 0.f) * rangeCheck;
+		}
+
+		frag.ssaoFactor = 1.f - occlusion / static_cast<float>(numSamples);
+	}
+}
+
+/**
+ * Generate samples and noise texture on CPU.
+ * Copy results into GPU buffers.
+ */
+void ssaoInit()
+{
+	auto seed = std::chrono::high_resolution_clock::now().time_since_epoch().count();
+	auto u01 = std::bind(std::uniform_real_distribution<float>(0.f, 1.f), std::mt19937(seed));
+	auto u_11 = std::bind(std::uniform_real_distribution<float>(-1.f, 1.f), std::mt19937(seed));
+
+	std::vector<glm::vec3> samples(numSamples);
+	for (int i = 0; i < numSamples; ++i)
+	{
+		glm::vec3 s(u_11(), u_11(), u01());
+		s = glm::normalize(s);
+		float scale = static_cast<float>(i) / static_cast<float>(numSamples);
+		scale *= scale;
+		scale = (1.f - scale) * .1f + scale;
+		s *= scale;
+		samples[i] = s;
+	}
+
+	int noiseTexSize2D = noiseTexSize1D * noiseTexSize1D;
+	std::vector<glm::vec3> noiseTex(noiseTexSize2D);
+	for (int i = 0; i < noiseTexSize2D; ++i)
+	{
+		glm::vec3 x(u_11(), u_11(), 0.f);
+		x = glm::normalize(x);
+		noiseTex[i] = x;
+	}
+
+	cudaFree(dev_samples);
+	cudaMalloc(&dev_samples, numSamples * sizeof(glm::vec3));
+	cudaMemcpy(dev_samples, samples.data(), numSamples * sizeof(glm::vec3), cudaMemcpyHostToDevice);
+
+	cudaFree(dev_noiseTex);
+	cudaMalloc(&dev_noiseTex, noiseTexSize2D * sizeof(glm::vec3));
+	cudaMemcpy(dev_noiseTex, noiseTex.data(), noiseTexSize2D * sizeof(glm::vec3), cudaMemcpyHostToDevice);
+}
+
 /**
  * Called once at the beginning of the program to allocate memory.
  */
@@ -162,7 +384,17 @@ void rasterizeInit(int w, int h) {
     cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
     
 	cudaFree(dev_depth);
-	cudaMalloc(&dev_depth, width * height * sizeof(int));
+	cudaMalloc(&dev_depth, width * height * sizeof(float));
+
+	numTilesX = ROUND_UP_DIV(w, TILE_SIZE);
+	numTilesY = ROUND_UP_DIV(h, TILE_SIZE);
+	int numTiles = numTilesX * numTilesY;
+	
+	cudaFree(dev_primCounts);
+	cudaMalloc(&dev_primCounts, numTiles * sizeof(int));
+	cudaFree(dev_tileTriLists);
+
+	ssaoInit();
 
 	checkCUDAError("rasterizeInit");
 }
@@ -358,25 +590,26 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 			auto itMeshName = N.meshes.begin();
 			auto itEndMeshName = N.meshes.end();
 
-			for (; itMeshName != itEndMeshName; ++itMeshName) {
-
+			for (; itMeshName != itEndMeshName; ++itMeshName)
+			{
 				const tinygltf::Mesh & mesh = scene.meshes.at(*itMeshName);
 
 				auto res = mesh2PrimitivesMap.insert(std::pair<std::string, std::vector<PrimitiveDevBufPointers>>(mesh.name, std::vector<PrimitiveDevBufPointers>()));
 				std::vector<PrimitiveDevBufPointers> & primitiveVector = (res.first)->second;
 
 				// for each primitive
-				for (size_t i = 0; i < mesh.primitives.size(); i++) {
+				for (size_t i = 0; i < mesh.primitives.size(); i++)
+				{
 					const tinygltf::Primitive &primitive = mesh.primitives[i];
 
 					if (primitive.indices.empty())
 						return;
 
 					// TODO: add new attributes for your PrimitiveDevBufPointers when you add new attributes
-					VertexIndex* dev_indices;
-					VertexAttributePosition* dev_position;
-					VertexAttributeNormal* dev_normal;
-					VertexAttributeTexcoord* dev_texcoord0;
+					VertexIndex* dev_indices = nullptr;
+					VertexAttributePosition* dev_position = nullptr;
+					VertexAttributeNormal* dev_normal = nullptr;
+					VertexAttributeTexcoord* dev_texcoord0 = nullptr;
 
 					// ----------Indices-------------
 
@@ -519,7 +752,10 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 					// You can only worry about this part once you started to 
 					// implement textures for your rasterizer
 					TextureData* dev_diffuseTex = NULL;
-					if (!primitive.material.empty()) {
+					int diffuseTexWidth = 0, diffuseTexHeight = 0;
+
+					if (!primitive.material.empty())
+					{
 						const tinygltf::Material &mat = scene.materials.at(primitive.material);
 						printf("material.name = %s\n", mat.name.c_str());
 
@@ -535,8 +771,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 									cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
 									
 									// TODO: store the image size to your PrimitiveDevBufPointers
-									// image.width;
-									// image.height;
+									diffuseTexWidth = image.width;
+									diffuseTexHeight = image.height;
 
 									checkCUDAError("Set Texture Image data");
 								}
@@ -576,6 +812,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						dev_normal,
 						dev_texcoord0,
 
+						diffuseTexWidth,
+						diffuseTexHeight,
 						dev_diffuseTex,
 
 						dev_vertexOut	//VertexOut
@@ -595,6 +833,16 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 	// 3. Malloc for dev_primitives
 	{
 		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
+		int numTiles = numTilesX * numTilesY;
+		if (totalNumPrimitives > TRI_LIST_SCALE_THRESHOLD)
+		{
+			triListSize = static_cast<int>(TILE_TRI_LIST_SCALE * totalNumPrimitives);
+		}
+		else
+		{
+			triListSize = totalNumPrimitives;
+		}
+		cudaMalloc(&dev_tileTriLists, numTiles * triListSize * sizeof(int));
 	}
 	
 
@@ -623,8 +871,8 @@ void _vertexTransformAndAssembly(
 	int numVertices, 
 	PrimitiveDevBufPointers primitive, 
 	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
-
+	int width, int height)
+{
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
 	if (vid < numVertices) {
@@ -633,10 +881,34 @@ void _vertexTransformAndAssembly(
 		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
+		glm::vec3 pos = primitive.dev_position[vid];
+		glm::vec3 nrm = primitive.dev_normal[vid];
+
+		glm::vec3 eyePos = glm::vec3(MV * glm::vec4(pos, 1.f));
+		glm::vec3 eyeNor = glm::normalize(MV_normal * nrm);
+
+		glm::vec4 clipPos = MVP * glm::vec4(pos, 1.f);
+		glm::vec4 outPos(
+			(clipPos.x / clipPos.w + 1.f) * .5f * static_cast<float>(width),
+			(clipPos.y / clipPos.w + 1.f) * .5f * static_cast<float>(height),
+			clipPos.z,
+			clipPos.w);
 
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
-		
+		primitive.dev_verticesOut[vid].pos = outPos; // x, y in screen space. z in NDC
+		primitive.dev_verticesOut[vid].eyePos = eyePos;
+		primitive.dev_verticesOut[vid].eyeNor = eyeNor;
+		if (primitive.dev_texcoord0)
+		{
+			primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
+		}
+		if (primitive.dev_diffuseTex)
+		{
+			primitive.dev_verticesOut[vid].dev_diffuseTex = primitive.dev_diffuseTex;
+		}
+		primitive.dev_verticesOut[vid].diffuseTexWidth = primitive.diffuseTexWidth;
+		primitive.dev_verticesOut[vid].diffuseTexHeight = primitive.diffuseTexHeight;
 	}
 }
 
@@ -655,12 +927,13 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES)
+		{
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
 
 		// TODO: other primitive types (point, line)
@@ -669,11 +942,207 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 }
 
 
+__global__ void _rasterize(int numPrims, int width, int height, const Primitive *primitives, int *depthBuff, Fragment *fragments)
+{
+	int pid = blockDim.x * blockIdx.x + threadIdx.x;
+
+	if (pid < numPrims)
+	{
+		const Primitive &prim = primitives[pid];
+
+		glm::vec3 tri[3] =
+		{
+			glm::vec3(prim.v[0].pos),
+			glm::vec3(prim.v[1].pos),
+			glm::vec3(prim.v[2].pos)
+		};
+
+		if (!isFrontFacing(tri))
+		{
+			return;
+		}
+
+		AABB bbox = getAABBForTriangle(tri);
+
+		int xmin = glm::min(width - 1, glm::max(0, static_cast<int>(bbox.min.x)));
+		int xmax = glm::min(width - 1, glm::max(0, static_cast<int>(bbox.max.x)));
+		int ymin = glm::min(height - 1, glm::max(0, static_cast<int>(bbox.min.y)));
+		int ymax = glm::min(height - 1, glm::max(0, static_cast<int>(bbox.max.y)));
+
+		for (int x = xmin; x <= xmax; ++x)
+		{
+			for (int y = ymin; y <= ymax; ++y)
+			{
+				glm::vec2 pix(x, y);
+				glm::vec3 abc = calculateBarycentricCoordinate(tri, pix);
+
+				if (isBarycentricCoordInBounds(abc))
+				{
+					// TODO
+					// write fragment (x, y) if it passes depth test
+					// For persepctive correct interpolation, we need to interpolate the reciprocal of
+					// vertex depths before doing perspective division in order to obtain the correct
+					// depth values. Interpolation of other vertex attributes also need special treatment
+					// rather than iterpolating using screen-space Barycentric coordinates directly.
+					// clipPos.w should be used instead of clipPos.z because, clipPos.w is the vertex's
+					// depth in eye space multiplied by a constant. When it is used for interpolation,
+					// the constant will eventually being cancelled out. Nonetheless, clipPos.z equals
+					// the vertex's depth in eye space times a constant and then offset by -1. As a
+					// result, it is not porpotional to the actual depth value of the vertex.
+					int idx = y * width + x;
+					float oneOverZ0 = 1.f / prim.v[0].pos.w;
+					float oneOverZ1 = 1.f / prim.v[1].pos.w;
+					float oneOverZ2 = 1.f / prim.v[2].pos.w;
+					float oneOverPixDepth = getFloatAtCoordinate(abc, oneOverZ0, oneOverZ1, oneOverZ2);
+					float pixDepth = 1.f / oneOverPixDepth;
+					float pixDepthNDC = getFloatAtCoordinate(abc, tri[0].z * oneOverZ0, tri[1].z * oneOverZ1, tri[2].z * oneOverZ2);
+
+					if (pixDepthNDC > -1.f && pixDepthNDC < 1.f)
+					{
+						int iPixDepth = static_cast<int>(pixDepthNDC * INT_MAX);
+						int iOldDepth = atomicMin(&depthBuff[idx], iPixDepth);
+
+						if (iPixDepth < iOldDepth)
+						{
+							//fragments[idx].color = glm::vec3(1.f, 1.f, 1.f);
+							fragments[idx].eyePos =
+								getVec3AtCoordinate(abc, prim.v[0].eyePos * oneOverZ0, prim.v[1].eyePos * oneOverZ1, prim.v[2].eyePos * oneOverZ2) * pixDepth;
+							fragments[idx].eyeNor = glm::normalize(
+								getVec3AtCoordinate(abc, prim.v[0].eyeNor * oneOverZ0, prim.v[1].eyeNor * oneOverZ1, prim.v[2].eyeNor * oneOverZ2) * pixDepth);
+							if (prim.v[0].dev_diffuseTex)
+							{
+								fragments[idx].texcoord0 =
+									getVec2AtCoordinate(abc, prim.v[0].texcoord0 * oneOverZ0, prim.v[1].texcoord0 * oneOverZ1, prim.v[2].texcoord0 * oneOverZ2) * pixDepth;
+								fragments[idx].dev_diffuseTex =
+									prim.v[0].dev_diffuseTex;
+								fragments[idx].diffuseTexWidth = prim.v[0].diffuseTexWidth;
+								fragments[idx].diffuseTexHeight = prim.v[0].diffuseTexHeight;
+							}
+							fragments[idx].shouldShade = 1;
+						}
+					}
+				}
+			}
+		}
+	}
+}
+
+
+__global__ void fillTileTriLists(int numPrims, int numTilesX, int numTilesY, int triListSize, const Primitive *primitives, int *primCounts, int *tileTriLists)
+{
+	int pidx = blockDim.x * blockIdx.x + threadIdx.x;
+
+	if (pidx < numPrims)
+	{
+		const Primitive &prim = primitives[pidx];
+		glm::vec3 tri[3] =
+		{
+			glm::vec3(prim.v[0].pos),
+			glm::vec3(prim.v[1].pos),
+			glm::vec3(prim.v[2].pos)
+		};
+
+		if (!isFrontFacing(tri))
+		{
+			return;
+		}
+
+		AABB bbox = getAABBForTriangle(tri);
+
+		int tileXMin = glm::max(0, static_cast<int>(floorf((bbox.min.x + .5f) / TILE_SIZE)));
+		int tileXMax = glm::min(numTilesX - 1, static_cast<int>(floorf((bbox.max.x + .5f) / TILE_SIZE)));
+		int tileYMin = glm::max(0, static_cast<int>(floorf((bbox.min.y + .5f) / TILE_SIZE)));
+		int tileYMax = glm::min(numTilesY - 1, static_cast<int>(floorf((bbox.max.y + .5f) / TILE_SIZE)));
+
+		for (int y = tileYMin; y <= tileYMax; ++y)
+		{
+			for (int x = tileXMin; x <= tileXMax; ++x)
+			{
+				AABB tileBound =
+				{
+					{ float(x * TILE_SIZE) - .5f, float(y * TILE_SIZE) - .5f, 0.f },
+					{ float((x + 1) * TILE_SIZE) - .5f, float((y + 1) * TILE_SIZE) - .5f, 0.f },
+				};
+
+				if (triAABBIntersect(tileBound, tri))
+				{
+					int tidx = y * numTilesX + x;
+					int offset = atomicAdd(&primCounts[tidx], 1);
+					tileTriLists[tidx * triListSize + offset] = pidx;
+				}
+			}
+		}
+	}
+}
+
+
+__global__ void tileBasedRasterize(
+	int numPrims, int width, int height, int numTilesX, int numTilesY, int triListSize,
+	const Primitive *primitives, int *primCounts, int *tileTriLists,
+	float *depthBuff, Fragment *fragments)
+{
+	float l_depthBuff = 1.f;
+	Fragment l_fragment{};
+
+	int pixIdxX = blockIdx.x * TILE_SIZE + threadIdx.x;
+	int pixIdxY = blockIdx.y * TILE_SIZE + threadIdx.y;
+	int pixIdx = pixIdxY * width + pixIdxX;
+	int tileIdx = blockIdx.y * numTilesX + blockIdx.x;
+	int numTris = primCounts[tileIdx];
+	int *triList = tileTriLists + triListSize * tileIdx;
+
+	if (pixIdxX >= width || pixIdxY >= height)
+	{
+		return;
+	}
+
+	for (int i = 0; i < numTris; ++i)
+	{
+		const Primitive &prim = primitives[triList[i]];
+		glm::vec3 tri[3] =
+		{
+			glm::vec3(prim.v[0].pos),
+			glm::vec3(prim.v[1].pos),
+			glm::vec3(prim.v[2].pos)
+		};
+		glm::vec3 abc = calculateBarycentricCoordinate(tri, glm::vec2(pixIdxX, pixIdxY));
+
+		if (isBarycentricCoordInBounds(abc))
+		{
+			float oneOverZ0 = 1.f / prim.v[0].pos.w;
+			float oneOverZ1 = 1.f / prim.v[1].pos.w;
+			float oneOverZ2 = 1.f / prim.v[2].pos.w;
+			float oneOverPixDepth = getFloatAtCoordinate(abc, oneOverZ0, oneOverZ1, oneOverZ2);
+			float pixDepth = 1.f / oneOverPixDepth;
+			float pixDepthNDC = getFloatAtCoordinate(abc, tri[0].z * oneOverZ0, tri[1].z * oneOverZ1, tri[2].z * oneOverZ2);
+
+			if (pixDepthNDC > -1.f && pixDepthNDC < l_depthBuff)
+			{
+				l_depthBuff = pixDepthNDC;
+				l_fragment.eyePos =
+					getVec3AtCoordinate(abc, prim.v[0].eyePos * oneOverZ0, prim.v[1].eyePos * oneOverZ1, prim.v[2].eyePos * oneOverZ2) * pixDepth;
+				l_fragment.eyeNor = glm::normalize(
+					getVec3AtCoordinate(abc, prim.v[0].eyeNor * oneOverZ0, prim.v[1].eyeNor * oneOverZ1, prim.v[2].eyeNor * oneOverZ2) * pixDepth);
+				l_fragment.texcoord0 =
+					getVec2AtCoordinate(abc, prim.v[0].texcoord0 * oneOverZ0, prim.v[1].texcoord0 * oneOverZ1, prim.v[2].texcoord0 * oneOverZ2) * pixDepth;
+				l_fragment.dev_diffuseTex =
+					prim.v[0].dev_diffuseTex;
+				l_fragment.diffuseTexWidth = prim.v[0].diffuseTexWidth;
+				l_fragment.diffuseTexHeight = prim.v[0].diffuseTexHeight;
+				l_fragment.shouldShade = true;
+			}
+		}
+	}
+
+	depthBuff[pixIdx] = l_depthBuff;
+	fragments[pixIdx] = l_fragment;
+}
+
 
 /**
  * Perform rasterization.
  */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal, const glm::mat4 &P) {
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
@@ -697,11 +1166,11 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				_vertexTransformAndAssembly<<<numBlocksForVertices, numThreadsPerBlock>>>(p->numVertices, *p, MVP, MV, MV_normal, width, height);
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
-				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
-					(p->numIndices, 
+				_primitiveAssembly<<<numBlocksForIndices, numThreadsPerBlock>>>(
+					p->numIndices, 
 					curPrimitiveBeginId, 
 					dev_primitives, 
 					*p);
@@ -713,16 +1182,37 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
+
 	// TODO: rasterize
+	int numTiles = numTilesX * numTilesY;
+	cudaMemset(dev_primCounts, 0, numTiles * sizeof(int));
+
+	const int blockSize1d = 128;
+	int numBlocks = (totalNumPrimitives + blockSize1d - 1) / blockSize1d;
+	fillTileTriLists<<<numBlocks, blockSize1d>>>(totalNumPrimitives, numTilesX, numTilesY, triListSize, dev_primitives, dev_primCounts, dev_tileTriLists);
+
+	dim3 numBlocks3(numTilesX, numTilesY, 1);
+	dim3 blockSize3(TILE_SIZE, TILE_SIZE, 1);
+	cudaFuncSetCacheConfig(tileBasedRasterize, cudaFuncCachePreferL1);
+	tileBasedRasterize<<<numBlocks3, blockSize3>>>(totalNumPrimitives, width, height, numTilesX, numTilesY, triListSize, dev_primitives, dev_primCounts, dev_tileTriLists, dev_depth, dev_fragmentBuffer);
+	checkCUDAError("tileBasedRasterize");
+
+	//cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+	//initDepth<<<blockCount2d, blockSize2d>>>(width, height, dev_depth);
+	//_rasterize<<<numBlocks, blockSize1d>>>(totalNumPrimitives, width, height, dev_primitives, dev_depth, dev_fragmentBuffer);
 
+	computeSSAO<<<blockCount2d, blockSize2d>>>(width, height, P, noiseTexSize1D, dev_noiseTex, sampleKernelRadius, numSamples, dev_samples, dev_fragmentBuffer);
+	checkCUDAError("computeSSAO");
 
+	int blurSize = noiseTexSize1D;
+	int padding = blurSize >> 1;
+	dim3 blurBlockSize3(sideLength2d + 2 * padding, sideLength2d + 2 * padding, 1);
+	ssaoBlur<<<blockCount2d , blurBlockSize3, blurBlockSize3.x * blurBlockSize3.y * sizeof(float)>>>(width, height, blurSize, dev_fragmentBuffer);
+	checkCUDAError("ssaoBlur");
 
     // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+	render<<<blockCount2d, blockSize2d>>>(width, height, dev_fragmentBuffer, dev_framebuffer);
 	checkCUDAError("fragment shader");
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
     sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
@@ -767,5 +1257,17 @@ void rasterizeFree() {
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+	cudaFree(dev_primCounts);
+	dev_primCounts = nullptr;
+
+	cudaFree(dev_tileTriLists);
+	dev_tileTriLists = nullptr;
+
+	cudaFree(dev_samples);
+	dev_samples = nullptr;
+
+	cudaFree(dev_noiseTex);
+	dev_noiseTex = nullptr;
+
     checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterize.h b/src/rasterize.h
index 560aae9..ab90173 100644
--- a/src/rasterize.h
+++ b/src/rasterize.h
@@ -20,5 +20,5 @@ namespace tinygltf{
 void rasterizeInit(int width, int height);
 void rasterizeSetBuffers(const tinygltf::Scene & scene);
 
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal);
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal, const glm::mat4 &P);
 void rasterizeFree();
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index 46c701e..daf054c 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -99,3 +99,94 @@ float getZAtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 tri[3])
            + barycentricCoord.y * tri[1].z
            + barycentricCoord.z * tri[2].z);
 }
+
+// For perspective correct linear interpolation
+__host__ __device__ static
+float getFloatAtCoordinate(const glm::vec3 barycentricCoord, float a0, float a1, float a2)
+{
+	return barycentricCoord.x * a0 + barycentricCoord.y * a1 + barycentricCoord.z * a2;
+}
+
+__host__ __device__ static
+glm::vec3 getVec3AtCoordinate(const glm::vec3 &abc, const glm::vec3 &v1, const glm::vec3 &v2, const glm::vec3 &v3)
+{
+	return abc.x * v1 + abc.y * v2 + abc.z * v3;
+}
+
+__host__ __device__ static
+glm::vec2 getVec2AtCoordinate(const glm::vec3 &abc, const glm::vec2 &v1, const glm::vec2 &v2, const glm::vec2 &v3)
+{
+	return abc.x * v1 + abc.y * v2 + abc.z * v3;
+}
+
+__host__ __device__ static
+void projectPointOntoAxis(const glm::vec2 &axis, const glm::vec2 &p, float &mind, float &maxd)
+{
+	float d = glm::dot(axis, p);
+	mind = d < mind ? d : mind;
+	maxd = d > maxd ? d : maxd;
+}
+
+__host__ __device__ static
+bool triAABBIntersect(const AABB &box, const glm::vec3 tri[3])
+{
+	glm::vec2 c((box.max.x + box.min.x) * .5f, (box.max.y + box.min.y) * .5f);
+	float hx = (box.max.x - box.min.x) * .5f;
+	float hy = (box.max.y - box.min.y) * .5f;
+	glm::vec2 b0(-hx, -hy);
+	glm::vec2 b1(hx, -hy);
+	glm::vec2 b2(hx, hy);
+	glm::vec2 b3(-hx, hy);
+	glm::vec2 v0 = glm::vec2(tri[0]) - c;
+	glm::vec2 v1 = glm::vec2(tri[1]) - c;
+	glm::vec2 v2 = glm::vec2(tri[2]) - c;
+	glm::vec2 e0 = v1 - v0;
+	glm::vec2 e1 = v2 - v1;
+	glm::vec2 e2 = v0 - v2;
+	glm::vec2 axes[5] =
+	{
+		{ 1.f, 0.f },
+		{ 0.f, 1.f },
+		{ -e0.y, e0.x },
+		{ -e1.y, e1.x },
+		{ -e2.y, e2.x },
+	};
+
+	for (int i = 0; i < 5; ++i)
+	{
+		float boxMin = FLT_MAX, boxMax = -FLT_MAX;
+		float triMin = FLT_MAX, triMax = -FLT_MAX;
+
+		// project box onto the axis
+		projectPointOntoAxis(axes[i], b0, boxMin, boxMax);
+		projectPointOntoAxis(axes[i], b1, boxMin, boxMax);
+		projectPointOntoAxis(axes[i], b2, boxMin, boxMax);
+		projectPointOntoAxis(axes[i], b3, boxMin, boxMax);
+
+		// project triangle onto the axis
+		projectPointOntoAxis(axes[i], v0, triMin, triMax);
+		projectPointOntoAxis(axes[i], v1, triMin, triMax);
+		projectPointOntoAxis(axes[i], v2, triMin, triMax);
+
+		bool noOverlap = triMin > boxMax || triMax < boxMin;
+		if (noOverlap) // a separation axis has been found == no intersection
+		{
+			return false;
+		}
+	}
+
+	return true;
+}
+
+__host__ __device__ bool isFrontFacing(const glm::vec3 tri[3], bool ccwIsFront = true)
+{
+	float z = (tri[1].x - tri[0].x) * (tri[2].y - tri[0].y) - (tri[1].y - tri[0].y) * (tri[2].x - tri[0].x);
+	if (ccwIsFront)
+	{
+		return z > 0.f;
+	}
+	else
+	{
+		return z < 0.f;
+	}
+}
\ No newline at end of file
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index c995fae..084fb39 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -8,5 +8,5 @@ set(SOURCE_FILES
 
 cuda_add_library(util
    ${SOURCE_FILES}
-   OPTIONS -arch=sm_52
+   OPTIONS -arch=sm_61
    )
diff --git a/util/utilityCore.cpp b/util/utilityCore.cpp
index b40ffe7..200ee7f 100644
--- a/util/utilityCore.cpp
+++ b/util/utilityCore.cpp
@@ -7,6 +7,7 @@
  */
 
 #include <iostream>
+#include <iomanip>
 #include <cstdio>
 #include <glm/gtc/matrix_transform.hpp>
 #include <glm/gtc/matrix_inverse.hpp>
@@ -37,6 +38,13 @@ std::string utilityCore::convertIntToString(int number) {
     return ss.str();
 }
 
+std::string utilityCore::convertFloatToString(float number)
+{
+	std::stringstream ss;
+	ss << std::fixed << std::setprecision(2) << number;
+	return ss.str();
+}
+
 glm::vec3 utilityCore::clampRGB(glm::vec3 color) {
     if (color[0] < 0) {
         color[0] = 0;
diff --git a/util/utilityCore.hpp b/util/utilityCore.hpp
index a67db68..9a0ccc5 100644
--- a/util/utilityCore.hpp
+++ b/util/utilityCore.hpp
@@ -33,6 +33,7 @@ extern bool epsilonCheck(float a, float b);
 extern std::vector<std::string> tokenizeString(std::string str);
 extern glm::mat4 buildTransformationMatrix(glm::vec3 translation, glm::vec3 rotation, glm::vec3 scale);
 extern std::string convertIntToString(int number);
+extern std::string convertFloatToString(float number);
 extern std::istream &safeGetline(std::istream &is, std::string &t); //Thanks to http://stackoverflow.com/a/6089413
 
 //-----------------------------