diff --git a/README.md b/README.md
index cad1abd..30bf8f7 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,96 @@
-CUDA Rasterizer
+# University of Pennsylvania, CIS 565: GPU Programming and Architecture.
+Project 4: CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
+## User resources
+- **Name:** David Grosman.
+- **Tested on:** Microsoft Windows 7 Professional, i7-5600U @ 2.6GHz, 256GB, GeForce 840M (Personal laptop).
+- **Linked Project with:** "cudadevrt.lib".
+Note: You must include "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64\cudadevrt.lib;" in your Project Properties' Linker Input's Additional Dependencies field to get the Project running. This is because I am using thrust::remove_if() for my back-face culling phase.
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
+### Preview
+![](renders/DUMP.gif)
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## Project description
+This Project's purpose was to gain some experience with writing Graphics code that would benefit as much as possible from CUDA as possible. In fact, a rasterized graphics pipeline, similar to the OpenGL pipeline can be processed by custom kernels handling each primitives, vertices and fragments per kernel.
+In this project, I have implemented the following features/pipeline stages:
 
-### (TODO: Your README)
+* Vertex shading. (`_vertexTransformAndAssembly` in `rasterize.cu`)
+* Primitive assembly with support for triangles read from buffers of index and
+  vertex data. (`_primitiveAssembly` in `rasterize.cu`)
+* Rasterization. (`_rasterize` in `rasterize.cu`)
+* Fragment shading. Using Blinn-Phong with ambient, diffuse and specular light effects. (`render` in `rasterize.cu`)
+* A depth buffer to store and depth test fragments.
+* Fragment-to-depth-buffer writing (**with** atomics for race avoidance).
+* Toon Shading by Post processing the fragment buffer divided into "tiles" and using Sobel Filter.
+* Backface culling, optimized using thrust's stream compaction.
+* UV texture mapping with bilinear texture filtering and perspective correct texture coordinates.
+* Support for rasterizing additional primitives:
+   * Lines or line strips (Press `W` to toggle between model drawing and wireframe mode).
+   * Points (Press `P` to toggle between model drawing and point-cloud mode).
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### My Pipeline
+* Cleared the fragment and depth buffers with some default values in `initFrameBuffers()`
+* Vertex shading:
+  * `VertexIn[n] vs_input -> VertexOut[n] vs_output`
+  * Transformed each vertex from local space to pixel space.
+* Primitive assembly.
+  * `VertexOut[n] vs_output -> Triangle[t] primitives`
+  * Assemble a list of vertices into triangle, line or point primitives for rasterization.
+* Back-face culling.
+  * Reduce the number of primitives that will be processed through the remainding pipeline.
+    * We simply use the primitives' vertices eye-space normals to determine whether a primitive is faced towards the camera or not.
+* Rasterization.
+  * `Triangle[t], Line[l] or Point[p] primitives -> Fragment[m] rasterized`
+  * Parallelized over each primitive
+    * If it is a triangle, we only scan over the box around the triangle (`getAABBForTriangle`).
+    * If it is a line, we use Bresenham line algorithm which handles all 4 quadrants.
+    * If it is a point, we simply convert its coordinates to integer values and color the corresponding pixel.
+  * Linearly interpolate eye-space position and normal vertices attributes. 
+  * Perform Perspective-Correct interpolation of texture coordinates attributes.
+* Fragments to depth buffer.
+  * `Fragment[m] rasterized -> Fragment[width][height] depthbuffer`
+    * `depthbuffer` is for storing and depth testing fragments.
+  * This is done before fragment shading, which prevents the fragment
+    shader from changing the depth of a fragment.
+  * Since multiple primitives may write fragments to the same pixel in the depth buffer,
+    races must be avoided by using CUDA atomics. This is why we convert your depth value
+    to a fixed-point `int`, and use `atomicMin` to store it into an `int`-typed depth
+    buffer `intdepth`.
+* Sobel Filter.
+  * Post-Process the fragment buffer to outline the final model's edges which will be highlighted in black for toon-shading. 
+  * Simply apply a Sobel kernel on the depth map to figure out over which pixels it is discontinuous: this is where the models edegs are.
+* Fragment shading.
+  * `Fragment[width][height] depthbuffer ->`
+  * Added Blinn-Phong shader with ambient, diffuse and specular light effects.
+  * Implemented UV texture mapping with bilinear texture filtering.
+* Fragment to framebuffer writing.
+  * `-> vec3[width][height] framebuffer`
+  * Simply copies the colors out of the depth buffer into the framebuffer (to be displayed on the screen).
+    
+![](renders/TextureCoordinates.jpg)
+The above diagram shows the correctness of our perspective correct texture coordinates interpolation.
 
+### Performance Analysis
+![](renders/PipeStagePerfs.JPG)
+
+From the graph above, we notice that:
+  1. The rasterization stage is the most costly since it must determine whether a pixel is part of a primitive or not.
+  2. The Intialization of the Frame buffers, their final transfer to OpenGL and the Sobel Mapping used for toon-shading always take the same amount of time since they compute the exact same operation on exactly the same sized-data (every pixel).
+  3. The vertex shader is the least costly operation. This is because it is not only an embarassingly parallel problem but it also mostly involve FPU computations which the GPU is very good at.
+
+![](renders/BackFacePerf.JPG)
+
+As seen in the table above, backface culling isn't improving much the overall performance of the rasterizer. This is because there are a lot of data that needs to be sorted and removed from the primitives array (depending on whether a primitive is back-face culling). I would expect the operation to be much faster in OpenGL where it may be processed directly by the hardware itself.
+
+### Postview
+![](renders/DUMP2.gif)
 
 ### Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
 * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
+* [Bresenham Line Drawing Algorithm](http://groups.csail.mit.edu/graphics/classes/6.837/F99/grading/asst2/turnin/rdror/Bresenham.java): Source-code which I placed in my Code directly. Note that I implemented line and more advanced scanning algorithms in my undergraduate studies but couldn't find my code...
+* [Sobel Kernel] (https://blog.saush.com/2011/04/20/edge-detection-with-the-sobel-operator-in-ruby/): Useful materials for understanding the concepts behind sobel kernels.
+* [Perspective-Correct texture coordinates] (http://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/perspective-correct-interpolation-vertex-attributes): Good ressource for understanding Perspective-correct interpolation.
+* [Bilinear-Filtering] (https://en.wikipedia.org/wiki/Bilinear_filtering)
diff --git a/gltfs/checkerboard/checkerboard.gltf b/gltfs/checkerboard/checkerboard.gltf
new file mode 100644
index 0000000..a333738
--- /dev/null
+++ b/gltfs/checkerboard/checkerboard.gltf
@@ -0,0 +1,301 @@
+{
+  "accessors": {
+    "accessor_index_0": {
+      "bufferView": "bufferView_1",
+      "byteOffset": 0,
+      "byteStride": 0,
+      "componentType": 5123,
+      "count": 6,
+      "type": "SCALAR",
+      "min": [
+        0
+      ],
+      "max": [
+        3
+      ]
+    },
+    "accessor_position": {
+      "bufferView": "bufferView_0",
+      "byteOffset": 0,
+      "byteStride": 0,
+      "componentType": 5126,
+      "count": 4,
+      "min": [
+        -1,
+        0,
+        -1
+      ],
+      "max": [
+        1,
+        0,
+        1
+      ],
+      "type": "VEC3"
+    },
+    "accessor_uv": {
+      "bufferView": "bufferView_0",
+      "byteOffset": 48,
+      "byteStride": 0,
+      "componentType": 5126,
+      "count": 4,
+      "type": "VEC2",
+      "min": [
+        0.00009999999747378752,
+        0.00009999999747378752
+      ],
+      "max": [
+        0.9998999834060669,
+        0.9998999834060669
+      ]
+    },
+    "accessor_normal_generated": {
+      "bufferView": "bufferView_0",
+      "byteOffset": 80,
+      "byteStride": 0,
+      "componentType": 5126,
+      "count": 4,
+      "type": "VEC3",
+      "min": [
+        0,
+        1,
+        0
+      ],
+      "max": [
+        0,
+        1,
+        0
+      ]
+    }
+  },
+  "asset": {
+    "generator": "OBJ2GLTF",
+    "premultipliedAlpha": true,
+    "profile": {
+      "api": "WebGL",
+      "version": "1.0"
+    },
+    "version": "1.0"
+  },
+  "buffers": {
+    "buffer_0": {
+      "type": "arraybuffer",
+      "byteLength": 140,
+      "uri": "data:application/octet-stream;base64,AACAvwAAAAAAAIA/AACAPwAAAAAAAIA/AACAPwAAAAAAAIC/AACAvwAAAAAAAIC/F7fROHL5fz9y+X8/cvl/P3L5fz8Xt9E4F7fROBe30TgAAAAAAACAPwAAAAAAAAAAAACAPwAAAAAAAAAAAACAPwAAAAAAAAAAAACAPwAAAAAAAAEAAgAAAAIAAwA="
+    }
+  },
+  "bufferViews": {
+    "bufferView_0": {
+      "buffer": "buffer_0",
+      "byteLength": 128,
+      "byteOffset": 0,
+      "target": 34962
+    },
+    "bufferView_1": {
+      "buffer": "buffer_0",
+      "byteLength": 12,
+      "byteOffset": 128,
+      "target": 34963
+    }
+  },
+  "images": {
+    "Checkered": {
+      "uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAhwAAAIcCAIAAAAynOArAAAACXBIWXMAAAsTAAALEwEAmpwYAAALjUlEQVR42u3ZMU5jZxuGYY5tJtiWEGnRrGNSjKZmD6np2Uf2geQGiWWkQyILCQVgG5njvxhpVnAX/3GuawevnuLWp2/4559/lsvl2cTNZrO///77zz///P3336d+yx9//PHXX3+tViujGMUoRpmWi4uLxWq1OoFV5vP5zzT++++/U7/l7e1ttVqt12ujGMUoRpmW1Wo1m/oNAPz/EBUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAzGI2m83n86mfcQIn/LLb7YZhMIpRjGKUyXl/fx82m81yuTyBY8ZxfH19PR6PUz/kcDhcXl6en58bxShGMcq0fHx8DCewx0/39/c3NzfjOE79kO12e3d39/j4aBSjGMUo07Lf7xdXV1cvLy8nsMrxeBzH8fPz8wRuGYYTib1RjGKU/9Qo4zj6qAcgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAmcW3b9/e3t6mfsZutzscDrvd7ng8Tv2W7Xb79evX79+/G8UoRjHKtOz3++H5+Xm5XE79kmEYnp6eNpvNbDb5t9f19fXt7e3FxYVRjGIUo0zLly9fFuv1erVaTf2S+Xy+WCweHh6mfsjZ2dmPHz+Wy+V6vTaKUYxilGlZrVb+VADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJAZjGbzebz+dTPOIETftntdsMwGMUoRjHK5Ly/vw+bzWa5XJ7AMeM4vr6+Ho/HqR9yOBwuLy/Pz8+NYhSjGGVaPj4+hhPY46f7+/ubm5txHKd+yHa7vbu7e3x8NIpRjGKUadnv94urq6uXl5cTWOV4PI7j+Pn5eQK3DMOJxN4oRjHKf2qUcRx91AOQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAzOLbt29vb29TP2O32x0Oh91udzwep37Ldrv9+vXr9+/fjWIUoxhlWvb7/fD8/LxcLqd+yTAMT09Pm81mNpv82+v6+vr29vbi4sIoRjGKUably5cvi/V6vVqtpn7JfD5fLBYPDw9TP+Ts7OzHjx/L5XK9XhvFKEYxyrSsVit/KgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgs5jNZvP5fOpnnMAJv+x2u2EYjGIUoxhlct7f34fNZrNcLk/gmHEcX19fj8fj1A85HA6Xl5fn5+dGMYpRjDItHx8fwwns8dP9/f3Nzc04jlM/ZLvd3t3dPT4+GsUoRjHKtOz3+8XV1dXLy8sJrHI8Hsdx/Pz8PIFbhuFEYm8UoxjlPzXKOI4+6gHIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJAZvHt27e3t7epn7Hb7Q6Hw263Ox6PU79lu91+/fr1+/fvRjGKUYwyLfv9fnh+fl4ul1O/ZBiGp6enzWYzm03+7XV9fX17e3txcWEUoxjFKNPy5cuXxXq9Xq1WU79kPp8vFouHh4epH3J2dvbjx4/lcrler41iFKMYZVpWq5U/FQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQWcxms/l8PvUzTuCEX3a73TAMRjGKUYwyOe/v78Nms1kulydwzDiOr6+vx+Nx6occDofLy8vz83OjGMUoRpmWj4+P4QT2+On+/v7m5mYcx6kfst1u7+7uHh8fjWIUoxhlWvb7/eLq6url5eUEVjkej+M4fn5+nsAtw3AisTeKUYzynxplHEcf9QBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgs7i4uFgul1M/Y7vdfnx8fHx8jOM49Vv2+/1vv/1mFKMYxShTHOV/51nv6/az2UUAAAAASUVORK5CYII="
+    }
+  },
+  "materials": {
+    "material_Material.001": {
+      "name": "Material.001",
+      "extensions": {},
+      "values": {
+        "ambient": [
+          0,
+          0,
+          0,
+          1
+        ],
+        "diffuse": "texture_Checkered",
+        "emission": [
+          0,
+          0,
+          0,
+          1
+        ],
+        "specular": [
+          0.5,
+          0.5,
+          0.5,
+          1
+        ],
+        "shininess": 96.078431,
+        "transparency": 1
+      },
+      "technique": "technique0"
+    }
+  },
+  "meshes": {
+    "mesh_checkerboard": {
+      "name": "checkerboard",
+      "primitives": [
+        {
+          "attributes": {
+            "POSITION": "accessor_position",
+            "TEXCOORD_0": "accessor_uv",
+            "NORMAL": "accessor_normal_generated"
+          },
+          "indices": "accessor_index_0",
+          "material": "material_Material.001",
+          "mode": 4
+        }
+      ]
+    }
+  },
+  "nodes": {
+    "rootNode": {
+      "children": [],
+      "meshes": [
+        "mesh_checkerboard"
+      ],
+      "matrix": [
+        1,
+        0,
+        0,
+        0,
+        0,
+        1,
+        0,
+        0,
+        0,
+        0,
+        1,
+        0,
+        0,
+        0,
+        0,
+        1
+      ]
+    }
+  },
+  "samplers": {
+    "sampler_0": {
+      "magFilter": 9729,
+      "minFilter": 9986,
+      "wrapS": 10497,
+      "wrapT": 10497
+    }
+  },
+  "scene": "scene_checkerboard",
+  "scenes": {
+    "scene_checkerboard": {
+      "nodes": [
+        "rootNode"
+      ]
+    }
+  },
+  "textures": {
+    "texture_Checkered": {
+      "format": 6407,
+      "internalFormat": 6407,
+      "sampler": "sampler_0",
+      "source": "Checkered",
+      "target": 3553,
+      "type": 5121
+    }
+  },
+  "extensionsUsed": [],
+  "animations": {},
+  "cameras": {},
+  "techniques": {
+    "technique0": {
+      "attributes": {
+        "a_position": "position",
+        "a_normal": "normal",
+        "a_texcoord_0": "texcoord_0"
+      },
+      "parameters": {
+        "modelViewMatrix": {
+          "semantic": "MODELVIEW",
+          "type": 35676
+        },
+        "projectionMatrix": {
+          "semantic": "PROJECTION",
+          "type": 35676
+        },
+        "normalMatrix": {
+          "semantic": "MODELVIEWINVERSETRANSPOSE",
+          "type": 35675
+        },
+        "ambient": {
+          "type": 35666
+        },
+        "diffuse": {
+          "type": 35678
+        },
+        "emission": {
+          "type": 35666
+        },
+        "specular": {
+          "type": 35666
+        },
+        "shininess": {
+          "type": 5126
+        },
+        "transparency": {
+          "type": 5126
+        },
+        "position": {
+          "semantic": "POSITION",
+          "type": 35665
+        },
+        "normal": {
+          "semantic": "NORMAL",
+          "type": 35665
+        },
+        "texcoord_0": {
+          "semantic": "TEXCOORD_0",
+          "type": 35664
+        }
+      },
+      "program": "program0",
+      "states": {
+        "enable": [
+          2884,
+          2929
+        ]
+      },
+      "uniforms": {
+        "u_modelViewMatrix": "modelViewMatrix",
+        "u_projectionMatrix": "projectionMatrix",
+        "u_normalMatrix": "normalMatrix",
+        "u_ambient": "ambient",
+        "u_diffuse": "diffuse",
+        "u_emission": "emission",
+        "u_specular": "specular",
+        "u_shininess": "shininess",
+        "u_transparency": "transparency"
+      }
+    }
+  },
+  "programs": {
+    "program0": {
+      "attributes": [
+        "a_position",
+        "a_normal",
+        "a_texcoord_0"
+      ],
+      "fragmentShader": "fragmentShader0",
+      "vertexShader": "vertexShader0"
+    }
+  },
+  "shaders": {
+    "vertexShader0": {
+      "type": 35633,
+      "uri": "data:text/plain;base64,cHJlY2lzaW9uIGhpZ2hwIGZsb2F0Owp1bmlmb3JtIG1hdDQgdV9tb2RlbFZpZXdNYXRyaXg7CnVuaWZvcm0gbWF0NCB1X3Byb2plY3Rpb25NYXRyaXg7CnVuaWZvcm0gbWF0MyB1X25vcm1hbE1hdHJpeDsKYXR0cmlidXRlIHZlYzMgYV9wb3NpdGlvbjsKdmFyeWluZyB2ZWMzIHZfcG9zaXRpb25FQzsKYXR0cmlidXRlIHZlYzMgYV9ub3JtYWw7CnZhcnlpbmcgdmVjMyB2X25vcm1hbDsKYXR0cmlidXRlIHZlYzIgYV90ZXhjb29yZF8wOwp2YXJ5aW5nIHZlYzIgdl90ZXhjb29yZF8wOwp2b2lkIG1haW4odm9pZCkgewogIHZlYzQgcG9zID0gdV9tb2RlbFZpZXdNYXRyaXggKiB2ZWM0KGFfcG9zaXRpb24sMS4wKTsKICB2X3Bvc2l0aW9uRUMgPSBwb3MueHl6OwogIGdsX1Bvc2l0aW9uID0gdV9wcm9qZWN0aW9uTWF0cml4ICogcG9zOwogIHZfbm9ybWFsID0gdV9ub3JtYWxNYXRyaXggKiBhX25vcm1hbDsKICB2X3RleGNvb3JkXzAgPSBhX3RleGNvb3JkXzA7Cn0K"
+    },
+    "fragmentShader0": {
+      "type": 35632,
+      "uri": "data:text/plain;base64,cHJlY2lzaW9uIGhpZ2hwIGZsb2F0Owp1bmlmb3JtIHZlYzQgdV9hbWJpZW50Owp1bmlmb3JtIHNhbXBsZXIyRCB1X2RpZmZ1c2U7CnVuaWZvcm0gdmVjNCB1X2VtaXNzaW9uOwp1bmlmb3JtIHZlYzQgdV9zcGVjdWxhcjsKdW5pZm9ybSBmbG9hdCB1X3NoaW5pbmVzczsKdW5pZm9ybSBmbG9hdCB1X3RyYW5zcGFyZW5jeTsKdmFyeWluZyB2ZWMzIHZfcG9zaXRpb25FQzsKdmFyeWluZyB2ZWMzIHZfbm9ybWFsOwp2YXJ5aW5nIHZlYzIgdl90ZXhjb29yZF8wOwp2b2lkIG1haW4odm9pZCkgewogIHZlYzMgbm9ybWFsID0gbm9ybWFsaXplKHZfbm9ybWFsKTsKICB2ZWM0IGRpZmZ1c2UgPSB0ZXh0dXJlMkQodV9kaWZmdXNlLCB2X3RleGNvb3JkXzApOwogIHZlYzMgZGlmZnVzZUxpZ2h0ID0gdmVjMygwLjAsIDAuMCwgMC4wKTsKICB2ZWMzIHNwZWN1bGFyID0gdV9zcGVjdWxhci5yZ2I7CiAgdmVjMyBzcGVjdWxhckxpZ2h0ID0gdmVjMygwLjAsIDAuMCwgMC4wKTsKICB2ZWMzIGVtaXNzaW9uID0gdV9lbWlzc2lvbi5yZ2I7CiAgdmVjMyBhbWJpZW50ID0gdV9hbWJpZW50LnJnYjsKICB2ZWMzIHZpZXdEaXIgPSAtbm9ybWFsaXplKHZfcG9zaXRpb25FQyk7CiAgdmVjMyBhbWJpZW50TGlnaHQgPSB2ZWMzKDAuMCwgMC4wLCAwLjApOwogIGFtYmllbnRMaWdodCArPSB2ZWMzKDAuMiwgMC4yLCAwLjIpOwogIHZlYzMgbCA9IHZlYzMoMC4wLCAwLjAsIDEuMCk7CiAgZGlmZnVzZUxpZ2h0ICs9IHZlYzMoMS4wLCAxLjAsIDEuMCkgKiBtYXgoZG90KG5vcm1hbCxsKSwgMC4pOwogIHZlYzMgaCA9IG5vcm1hbGl6ZShsICsgdmlld0Rpcik7CiAgZmxvYXQgc3BlY3VsYXJJbnRlbnNpdHkgPSBtYXgoMC4sIHBvdyhtYXgoZG90KG5vcm1hbCwgaCksIDAuKSwgdV9zaGluaW5lc3MpKTsKICBzcGVjdWxhckxpZ2h0ICs9IHZlYzMoMS4wLCAxLjAsIDEuMCkgKiBzcGVjdWxhckludGVuc2l0eTsKICB2ZWMzIGNvbG9yID0gdmVjMygwLjAsIDAuMCwgMC4wKTsKICBjb2xvciArPSBkaWZmdXNlLnJnYiAqIGRpZmZ1c2VMaWdodDsKICBjb2xvciArPSBzcGVjdWxhciAqIHNwZWN1bGFyTGlnaHQ7CiAgY29sb3IgKz0gZW1pc3Npb247CiAgY29sb3IgKz0gYW1iaWVudCAqIGFtYmllbnRMaWdodDsKICBnbF9GcmFnQ29sb3IgPSB2ZWM0KGNvbG9yICogZGlmZnVzZS5hLCBkaWZmdXNlLmEgKiB1X3RyYW5zcGFyZW5jeSk7Cn0K"
+    }
+  },
+  "skins": {},
+  "extensions": {}
+}
diff --git a/renders/BackFacePerf.JPG b/renders/BackFacePerf.JPG
new file mode 100644
index 0000000..aea36d5
Binary files /dev/null and b/renders/BackFacePerf.JPG differ
diff --git a/renders/DUMP.gif b/renders/DUMP.gif
new file mode 100644
index 0000000..faa6c76
Binary files /dev/null and b/renders/DUMP.gif differ
diff --git a/renders/DUMP2.gif b/renders/DUMP2.gif
new file mode 100644
index 0000000..962dcb8
Binary files /dev/null and b/renders/DUMP2.gif differ
diff --git a/renders/PerfAnalysis.xlsx b/renders/PerfAnalysis.xlsx
new file mode 100644
index 0000000..e1d23db
Binary files /dev/null and b/renders/PerfAnalysis.xlsx differ
diff --git a/renders/PipeStagePerfs.JPG b/renders/PipeStagePerfs.JPG
new file mode 100644
index 0000000..16331e1
Binary files /dev/null and b/renders/PipeStagePerfs.JPG differ
diff --git a/renders/TextureCoordinates.jpg b/renders/TextureCoordinates.jpg
new file mode 100644
index 0000000..7ecad3b
Binary files /dev/null and b/renders/TextureCoordinates.jpg differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a57f69f..ce4f1b6 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )
diff --git a/src/main.cpp b/src/main.cpp
index a36b955..1749e5e 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -7,7 +7,7 @@
  */
 
 
-
+#include <util/timer.h>
 #include "main.hpp"
 
 #define STB_IMAGE_IMPLEMENTATION
@@ -17,6 +17,7 @@
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
+static tinygltf::Scene* pScene = NULL;
 
 int main(int argc, char **argv) {
     if (argc != 2) {
@@ -24,19 +25,20 @@ int main(int argc, char **argv) {
         return 0;
     }
 
-	tinygltf::Scene scene;
 	tinygltf::TinyGLTFLoader loader;
 	std::string err;
 	std::string input_filename(argv[1]);
 	std::string ext = getFilePathExtension(input_filename);
 
+	pScene = new  tinygltf::Scene();
+
 	bool ret = false;
 	if (ext.compare("glb") == 0) {
 		// assume binary glTF.
-		ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
+		ret = loader.LoadBinaryFromFile(pScene, &err, input_filename.c_str());
 	} else {
 		// assume ascii glTF.
-		ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
+		ret = loader.LoadASCIIFromFile(pScene, &err, input_filename.c_str());
 	}
 
 	if (!err.empty()) {
@@ -54,11 +56,12 @@ int main(int argc, char **argv) {
     fpstracker = 0;
 
     // Launch CUDA/GL
-    if (init(scene)) {
+	if (init(*pScene)) {
         // GLFW main loop
         mainLoop();
     }
 
+	delete pScene;
     return 0;
 }
 
@@ -121,7 +124,7 @@ void runCuda() {
     cudaGLMapBufferObject((void **)&dptr, pbo);
 	rasterize(dptr, MVP, MV, MV_normal);
     cudaGLUnmapBufferObject(pbo);
-
+	
     frame++;
     fpstracker++;
 }
@@ -158,6 +161,7 @@ bool init(const tinygltf::Scene & scene) {
     initTextures();
     initCuda();
 	initPBO();
+	Timer::initializeTimer();
 
 	// Mouse Control Callbacks
 	glfwSetMouseButtonCallback(window, mouseButtonCallback);
@@ -179,8 +183,7 @@ bool init(const tinygltf::Scene & scene) {
 		}
 	}
 
-
-	rasterizeSetBuffers(scene);
+	setSceneBuffers(scene);
 
     GLuint passthroughProgram;
     passthroughProgram = initShader();
@@ -306,8 +309,12 @@ void deleteTexture(GLuint *tex) {
     *tex = (GLuint)NULL;
 }
 
-void shut_down(int return_code) {
+void shut_down(int return_code)
+{
+	clearSceneBuffers();
     rasterizeFree();
+	Timer::shutdownTimer();
+
     cudaDeviceReset();
 #ifdef __APPLE__
     glfwTerminate();
@@ -323,10 +330,44 @@ void errorCallback(int error, const char *description) {
     fputs(description, stderr);
 }
 
-void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) {
-    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
+void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) 
+{
+    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS)
+	{
         glfwSetWindowShouldClose(window, GL_TRUE);
     }
+	else if (key == GLFW_KEY_W && action == GLFW_PRESS)
+	{
+		switch (pScene->drawModelType)
+		{
+		case tinygltf::Scene::Triangle:
+			pScene->drawModelType = tinygltf::Scene::Line;
+			break;
+		case tinygltf::Scene::Line:
+			pScene->drawModelType = tinygltf::Scene::Triangle;
+			break;
+		default:
+			break;
+		}
+		clearSceneBuffers();
+		setSceneBuffers(*pScene);
+	}
+	else if (key == GLFW_KEY_P && action == GLFW_PRESS)
+	{
+		switch (pScene->drawModelType)
+		{
+		case tinygltf::Scene::Triangle:
+			pScene->drawModelType = tinygltf::Scene::Point;
+			break;
+		case tinygltf::Scene::Point:
+			pScene->drawModelType = tinygltf::Scene::Triangle;
+			break;
+		default:
+			break;
+		}
+		clearSceneBuffers();
+		setSceneBuffers(*pScene);
+	}
 }
 
 //----------------------------
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 4e3504b..92e16d1 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -10,23 +10,33 @@
 #include <cstdio>
 #include <cuda.h>
 #include <cuda_runtime.h>
+
 #include <thrust/random.h>
+#include <thrust/remove.h>
+#include <thrust/device_ptr.h>
+#include <thrust/device_vector.h>
+
 #include <util/checkCUDAError.h>
 #include <util/tiny_gltf_loader.h>
+#include <util/timer.h>
+
 #include "rasterizeTools.h"
 #include "rasterize.h"
+
+#include <glm/gtc/type_ptr.hpp>
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
 namespace {
 
+	typedef unsigned char BufferByte;
+
 	typedef unsigned short VertexIndex;
 	typedef glm::vec3 VertexAttributePosition;
 	typedef glm::vec3 VertexAttributeNormal;
 	typedef glm::vec2 VertexAttributeTexcoord;
 	typedef unsigned char TextureData;
-
-	typedef unsigned char BufferByte;
+	typedef glm::vec3 Color;
 
 	enum PrimitiveType{
 		Point = 1,
@@ -43,33 +53,38 @@ namespace {
 
 		 glm::vec3 eyePos;	// eye space position used for shading
 		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
-		 glm::vec2 texcoord0;
-		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
-		// ...
+		
+		 VertexAttributeTexcoord texcoord0;
 	};
 
-	struct Primitive {
+	struct Primitive
+	{
 		PrimitiveType primitiveType = Triangle;	// C++ 11 init
+		
 		VertexOut v[3];
+
+		int texWidth, texHeight;
+		TextureData* dev_diffuseTex = NULL;
 	};
 
 	struct Fragment {
-		glm::vec3 color;
+		Color color;
 
 		// TODO: add new attributes to your Fragment
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
-		// ...
+		glm::vec3 eyePos;	// eye space position used for shading
+		glm::vec3 eyeNor;
+		
+		int texWidth, texHeight;
+		TextureData* dev_diffuseTex;
+		VertexAttributeTexcoord texcoord0;
+		float sobelXY;
 	};
 
-	struct PrimitiveDevBufPointers {
+	struct PrimitiveDevBufPointers
+	{
 		int primitiveMode;	//from tinygltfloader macro
 		PrimitiveType primitiveType;
 		int numPrimitives;
@@ -83,6 +98,7 @@ namespace {
 		VertexAttributeTexcoord* dev_texcoord0;
 
 		// Materials, add more attributes when needed
+		int texWidth, texHeight;
 		TextureData* dev_diffuseTex;
 		// TextureData* dev_specularTex;
 		// TextureData* dev_normalTex;
@@ -93,73 +109,38 @@ namespace {
 
 		// TODO: add more attributes when needed
 	};
-
 }
 
-static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
+static int SOBEL_GRID_SIZE = 3;
 
+static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
 static int width = 0;
 static int height = 0;
 
 static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
+static Primitive *dev_primitivesCulled = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
-static glm::vec3 *dev_framebuffer = NULL;
+static Color *dev_framebuffer = NULL;
 
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
-/**
- * Kernel that writes the image to the OpenGL PBO directly.
- */
-__global__ 
-void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
-
-    if (x < w && y < h) {
-        glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
-        // Each thread writes one pixel location in the texture (textel)
-        pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
-    }
-}
-
-/** 
-* Writes fragment colors to the framebuffer
-*/
-__global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
-
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
-
-		// TODO: add your fragment shader code here
-
-    }
-}
-
 /**
  * Called once at the beginning of the program to allocate memory.
  */
-void rasterizeInit(int w, int h) {
+void rasterizeInit(int w, int h)
+{
     width = w;
     height = h;
+
 	cudaFree(dev_fragmentBuffer);
 	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+
     cudaFree(dev_framebuffer);
-    cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
-    cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+    cudaMalloc(&dev_framebuffer,   width * height * sizeof(Color));
+    cudaMemset(dev_framebuffer, 0, width * height * sizeof(Color));
     
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
@@ -167,8 +148,25 @@ void rasterizeInit(int w, int h) {
 	checkCUDAError("rasterizeInit");
 }
 
+/**
+* Called once at the end of the program to free CUDA memory.
+*/
+void rasterizeFree() 
+{
+	cudaFree(dev_fragmentBuffer);
+	dev_fragmentBuffer = NULL;
+
+	cudaFree(dev_framebuffer);
+	dev_framebuffer = NULL;
+
+	cudaFree(dev_depth);
+	dev_depth = NULL;
+
+	checkCUDAError("rasterize Free");
+}
+
 __global__
-void initDepth(int w, int h, int * depth)
+void initFrameBuffers(int w, int h, int * depth, Fragment *fragmentBuffer)
 {
 	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
 	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
@@ -177,6 +175,7 @@ void initDepth(int w, int h, int * depth)
 	{
 		int index = x + (y * w);
 		depth[index] = INT_MAX;
+		fragmentBuffer[index].color = Color(0.33, 0.33, 0.33);
 	}
 }
 
@@ -226,9 +225,12 @@ void _nodeMatrixTransform(
 
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
+	if (vid < numVertices) 
+	{
 		position[vid] = glm::vec3(MV * glm::vec4(position[vid], 1.0f));
-		normal[vid] = glm::normalize(MV_normal * normal[vid]);
+		
+		if (normal)
+			normal[vid] = glm::normalize(MV_normal * normal[vid]);
 	}
 }
 
@@ -292,12 +294,12 @@ void traverseNode (
 	}
 }
 
-void rasterizeSetBuffers(const tinygltf::Scene & scene) {
+void setSceneBuffers(const tinygltf::Scene & scene) {
 
 	totalNumPrimitives = 0;
 
 	std::map<std::string, BufferByte*> bufferViewDevPointers;
-
+	checkCUDAError("Set BufferView Device Mem");
 	// 1. copy all `bufferViews` to device memory
 	{
 		std::map<std::string, tinygltf::BufferView>::const_iterator it(
@@ -313,9 +315,10 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 			}
 
 			const tinygltf::Buffer &buffer = scene.buffers.at(bufferView.buffer);
-
+			checkCUDAError("Set BufferView Device Mem");
 			BufferByte* dev_bufferView;
 			cudaMalloc(&dev_bufferView, bufferView.byteLength);
+			checkCUDAError("Set BufferView Device Mem");
 			cudaMemcpy(dev_bufferView, &buffer.data.front() + bufferView.byteOffset, bufferView.byteLength, cudaMemcpyHostToDevice);
 
 			checkCUDAError("Set BufferView Device Mem");
@@ -344,6 +347,20 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 			}
 		}
 
+		int primitiveDrawMode = 0;
+		switch (scene.drawModelType)
+		{
+		case tinygltf::Scene::Point:
+			primitiveDrawMode = TINYGLTF_MODE_POINTS;
+			break;
+		case tinygltf::Scene::Line:
+			primitiveDrawMode = TINYGLTF_MODE_LINE;
+			break;
+		case tinygltf::Scene::Triangle:
+		default:
+			primitiveDrawMode = TINYGLTF_MODE_TRIANGLES;
+			break;
+		}
 
 		// parse through node to access mesh
 
@@ -368,15 +385,17 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 				// for each primitive
 				for (size_t i = 0; i < mesh.primitives.size(); i++) {
 					const tinygltf::Primitive &primitive = mesh.primitives[i];
+					
+					(const_cast<tinygltf::Primitive*> (&primitive))->mode = primitiveDrawMode;
 
 					if (primitive.indices.empty())
 						return;
 
 					// TODO: add new attributes for your PrimitiveDevBufPointers when you add new attributes
-					VertexIndex* dev_indices;
-					VertexAttributePosition* dev_position;
-					VertexAttributeNormal* dev_normal;
-					VertexAttributeTexcoord* dev_texcoord0;
+					VertexIndex* dev_indices = NULL;
+					VertexAttributePosition* dev_position = NULL;
+					VertexAttributeNormal* dev_normal = NULL;
+					VertexAttributeTexcoord* dev_texcoord0 = NULL;
 
 					// ----------Indices-------------
 
@@ -389,7 +408,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 					int numIndices = indexAccessor.count;
 					int componentTypeByteSize = sizeof(VertexIndex);
 					int byteLength = numIndices * n * componentTypeByteSize;
-
+					checkCUDAError("Set Index Buffer");
 					dim3 numThreadsPerBlock(128);
 					dim3 numBlocks((numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 					cudaMalloc(&dev_indices, byteLength);
@@ -407,7 +426,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 
 					// ---------Primitive Info-------
-
+					
 					// Warning: LINE_STRIP is not supported in tinygltfloader
 					int numPrimitives;
 					PrimitiveType primitiveType;
@@ -426,7 +445,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						break;
 					case TINYGLTF_MODE_LINE:
 						primitiveType = PrimitiveType::Line;
-						numPrimitives = numIndices / 2;
+						numPrimitives = 3 * (numIndices / 3);
 						break;
 					case TINYGLTF_MODE_LINE_LOOP:
 						primitiveType = PrimitiveType::Line;
@@ -489,24 +508,27 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 							dev_attribute = (BufferByte**)&dev_texcoord0;
 						}
 
-						std::cout << accessor.bufferView << "  -  " << it->second << "  -  " << it->first << '\n';
-
-						dim3 numThreadsPerBlock(128);
-						dim3 numBlocks((n * numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-						int byteLength = numVertices * n * componentTypeByteSize;
-						cudaMalloc(dev_attribute, byteLength);
-
-						_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-							n * numVertices,
-							*dev_attribute,
-							dev_bufferView,
-							n,
-							accessor.byteStride,
-							accessor.byteOffset,
-							componentTypeByteSize);
-
-						std::string msg = "Set Attribute Buffer: " + it->first;
-						checkCUDAError(msg.c_str());
+						if (dev_attribute)
+						{
+							std::cout << accessor.bufferView << "  -  " << it->second << "  -  " << it->first << '\n';
+
+							dim3 numThreadsPerBlock(128);
+							dim3 numBlocks((n * numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+							int byteLength = numVertices * n * componentTypeByteSize;
+							cudaMalloc(dev_attribute, byteLength);
+
+							_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
+								n * numVertices,
+								*dev_attribute,
+								dev_bufferView,
+								n,
+								accessor.byteStride,
+								accessor.byteOffset,
+								componentTypeByteSize);
+
+							std::string msg = "Set Attribute Buffer: " + it->first;
+							checkCUDAError(msg.c_str());
+						}
 					}
 
 					// malloc for VertexOut
@@ -518,6 +540,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 					// You can only worry about this part once you started to 
 					// implement textures for your rasterizer
+					int texWidth, texHeight;
 					TextureData* dev_diffuseTex = NULL;
 					if (!primitive.material.empty()) {
 						const tinygltf::Material &mat = scene.materials.at(primitive.material);
@@ -534,9 +557,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 									cudaMalloc(&dev_diffuseTex, s);
 									cudaMemcpy(dev_diffuseTex, &image.image.at(0), s, cudaMemcpyHostToDevice);
 									
-									// TODO: store the image size to your PrimitiveDevBufPointers
-									// image.width;
-									// image.height;
+									texWidth = image.width;
+									texHeight = image.height;
 
 									checkCUDAError("Set Texture Image data");
 								}
@@ -576,13 +598,14 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						dev_normal,
 						dev_texcoord0,
 
+						texWidth,
+						texHeight,
 						dev_diffuseTex,
 
 						dev_vertexOut	//VertexOut
 					});
 
 					totalNumPrimitives += numPrimitives;
-
 				} // for each primitive
 
 			} // for each mesh
@@ -591,7 +614,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 	}
 	
-
+	checkCUDAError("Free BufferView Device Mem");
 	// 3. Malloc for dev_primitives
 	{
 		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
@@ -600,7 +623,7 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 	// Finally, cudaFree raw dev_bufferViews
 	{
-
+		checkCUDAError("Free BufferView Device Mem");
 		std::map<std::string, BufferByte*>::const_iterator it(bufferViewDevPointers.begin());
 		std::map<std::string, BufferByte*>::const_iterator itEnd(bufferViewDevPointers.end());
 			
@@ -612,11 +635,39 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 		checkCUDAError("Free BufferView Device Mem");
 	}
-
-
 }
 
+void clearSceneBuffers()
+{
+	// deconstruct primitives attribute/indices device buffer
+	checkCUDAError("Set BufferView Device Mem");
+	auto it(mesh2PrimitivesMap.begin());
+	auto itEnd(mesh2PrimitivesMap.end());
+	for (; it != itEnd; ++it)
+	{
+		for (auto p = it->second.begin(); p != it->second.end(); ++p)
+		{
+			cudaFree(p->dev_indices);
+			checkCUDAError("Set BufferView Device Mem");
+			cudaFree(p->dev_position);
+			checkCUDAError("Set BufferView Device Mem");
+			cudaFree(p->dev_normal);
+			checkCUDAError("Set BufferView Device Mem");
+			cudaFree(p->dev_texcoord0);
+			checkCUDAError("Set BufferView Device Mem");
+			cudaFree(p->dev_diffuseTex);
+			checkCUDAError("Set BufferView Device Mem");
+			cudaFree(p->dev_verticesOut);
+			//TODO: release other attributes and materials
+		}
+	}
+	mesh2PrimitivesMap.clear();
 
+	checkCUDAError("Set BufferView Device Mem");
+	cudaFree(dev_primitives);
+	dev_primitives = NULL;
+	checkCUDAError("Set BufferView Device Mem");
+}
 
 __global__ 
 void _vertexTransformAndAssembly(
@@ -628,15 +679,30 @@ void _vertexTransformAndAssembly(
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
 	if (vid < numVertices) {
-
-		// TODO: Apply vertex transformation here
+		
+		// vtxPos: Where the original positions buffer read from model are.
+		glm::vec4 vtxPos = glm::vec4(primitive.dev_position[vid], 1.0f);
+		
 		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
-		// Then divide the pos by its w element to transform into NDC space
+		glm::vec4 projVtxPos = MVP * vtxPos;
+		// Then divide the pos by its w element to transform into NDC space. 
+		const float projW = projVtxPos.w;
+		projVtxPos /= projVtxPos.w; // All elements are now between (-1, -1, -1) ~ (1, 1, 1).
 		// Finally transform x and y to viewport space
+		glm::vec3 pixelPos(
+		(float)width	* 0.5f * (-projVtxPos.x + 1.0f),
+		(float)height	* 0.5f * (-projVtxPos.y + 1.0f),
+						  0.5f * (projVtxPos.z + 1.0f));
 
-		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
+		primitive.dev_verticesOut[vid].pos = glm::vec4(pixelPos, projW);
+		primitive.dev_verticesOut[vid].eyePos = glm::vec3(MV * vtxPos);
+		
+		if (primitive.dev_normal)
+			primitive.dev_verticesOut[vid].eyeNor = MV_normal * primitive.dev_normal[vid];
 		
+		if (primitive.dev_texcoord0)
+			primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
 	}
 }
 
@@ -652,23 +718,467 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 
 	if (iid < numIndices) {
 
-		// TODO: uncomment the following code for a start
-		// This is primitive assembly for triangles
+		int pid = iid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES)
+		{
+			pid = iid / (int)primitive.primitiveType;
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+			dev_primitives[pid + curPrimitiveBeginId].primitiveType = primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+			
+			dev_primitives[pid + curPrimitiveBeginId].texWidth = primitive.texWidth;
+			dev_primitives[pid + curPrimitiveBeginId].texHeight = primitive.texHeight;
+			dev_primitives[pid + curPrimitiveBeginId].dev_diffuseTex = primitive.dev_diffuseTex;
+		}
+		else if (primitive.primitiveMode == TINYGLTF_MODE_LINE)
+		{
+			const int ptIdx = iid % 3;
+			if (ptIdx == 0)
+			{
+				dev_primitives[pid + curPrimitiveBeginId].primitiveType = primitive.primitiveType;
+				dev_primitives[pid + curPrimitiveBeginId].v[0]
+					= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+
+				dev_primitives[pid + curPrimitiveBeginId + 2].primitiveType = primitive.primitiveType;
+				dev_primitives[pid + curPrimitiveBeginId + 2].v[1]
+					= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+			}
+			else if (ptIdx == 1)
+			{
+				dev_primitives[pid + curPrimitiveBeginId].primitiveType = primitive.primitiveType;
+				dev_primitives[pid + curPrimitiveBeginId].v[0]
+					= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+
+				dev_primitives[pid + curPrimitiveBeginId - 1].primitiveType = primitive.primitiveType;
+				dev_primitives[pid + curPrimitiveBeginId - 1].v[1]
+					= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+			}
+			else //if (ptIdx == 2)
+			{
+				dev_primitives[pid + curPrimitiveBeginId].primitiveType = primitive.primitiveType;
+				dev_primitives[pid + curPrimitiveBeginId].v[0]
+					= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+
+				dev_primitives[pid + curPrimitiveBeginId - 1].primitiveType = primitive.primitiveType;
+				dev_primitives[pid + curPrimitiveBeginId - 1].v[1]
+					= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+			}
+		}
+		else if (primitive.primitiveMode == TINYGLTF_MODE_POINTS)
+		{
+			pid = iid;
+			
+			dev_primitives[pid + curPrimitiveBeginId].primitiveType = primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[0]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
+	}
+}
+
+#define MIN(a,b) (((a)<(b))?(a):(b))
+#define MAX(a,b) (((a)>(b))?(a):(b))
+
+__global__
+void _rasterize(int width, int height, int numPrimitives, int * depth, Primitive* dev_primitives, Fragment *fragmentBuffer)
+{
+	// id for cur primitives vector
+	int pid = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (pid < numPrimitives)
+	{
+		Primitive& primitive = dev_primitives[pid];
+		if (primitive.primitiveType == PrimitiveType::Triangle)
+		{
+			const glm::vec3 triPixelPos[] =
+			{ glm::vec3(primitive.v[0].pos), glm::vec3(primitive.v[1].pos), glm::vec3(primitive.v[2].pos) };
+
+			AABB& triAabb = getAABBForTriangle(triPixelPos);
+			triAabb.min.x = MIN( MAX(triAabb.min.x, 0), width);
+			triAabb.min.y = MIN( MAX(triAabb.min.y, 0), height);
+
+			triAabb.max.x = MIN( MAX(triAabb.max.x, 0), width);
+			triAabb.max.y = MIN( MAX(triAabb.max.y, 0), height);
+
+			for (size_t aabbX = triAabb.min.x; aabbX < triAabb.max.x + 1; aabbX++)
+			{
+				for (size_t aabbY = triAabb.min.y; aabbY < triAabb.max.y + 1; aabbY++)
+				{
+					glm::vec2 triAabbPos(aabbX, aabbY);
+					glm::vec3 bCoeffs = calculateBarycentricCoordinate(triPixelPos, triAabbPos);
+					if (isBarycentricCoordInBounds(bCoeffs))
+					{
+						const int pixelIdx = aabbX + aabbY * width;
+
+						const float depthZ = getZAtCoordinate(bCoeffs, triPixelPos);
+						const int depthBufferVal = INT_MAX * -depthZ; // depthZ will be between 0 (near) and -1 (far).
+						atomicMin(&depth[pixelIdx], depthBufferVal);
+						if (depth[pixelIdx] == depthBufferVal)
+						{
+							const VertexOut* triVtx = &primitive.v[0];
+							Fragment& frag = fragmentBuffer[pixelIdx];
+
+							frag.texWidth = primitive.texWidth;
+							frag.texHeight = primitive.texHeight;
+							frag.dev_diffuseTex = primitive.dev_diffuseTex;
+
+							frag.eyeNor = bCoeffs[0] * triVtx[0].eyeNor
+								+ bCoeffs[1] * triVtx[1].eyeNor
+								+ bCoeffs[2] * triVtx[2].eyeNor;
+
+							frag.eyePos = bCoeffs[0] * triVtx[0].eyePos
+								+ bCoeffs[1] * triVtx[1].eyePos
+								+ bCoeffs[2] * triVtx[2].eyePos;
+
+							//frag.texcoord0	= bCoeffs[0] * triVtx[0].texcoord0
+							//					+ bCoeffs[1] * triVtx[1].texcoord0
+							//					+ bCoeffs[2] * triVtx[2].texcoord0;
+
+							// Perspective-Correct Texturing.
+							float perspZ = 1.0f / (bCoeffs[0] / primitive.v[0].pos.w
+								+ bCoeffs[1] / primitive.v[1].pos.w
+								+ bCoeffs[2] / primitive.v[2].pos.w);
+							frag.color = Color(1.0f, 1.0f, 1.0f);
+							frag.texcoord0 = bCoeffs[0] * triVtx[0].texcoord0 / primitive.v[0].pos.w
+								+ bCoeffs[1] * triVtx[1].texcoord0 / primitive.v[1].pos.w
+								+ bCoeffs[2] * triVtx[2].texcoord0 / primitive.v[2].pos.w;
+							frag.texcoord0 *= perspZ;
+						}
+					}
+
+				}
+			}
+		}
+		else if (primitive.primitiveType == PrimitiveType::Line)
+		{
+			// Bresenham Algorithm
+			// Taken from:
+			// http://groups.csail.mit.edu/graphics/classes/6.837/F99/grading/asst2/turnin/rdror/Bresenham.java
+			glm::vec2 vStart(primitive.v[0].pos);
+			glm::vec2 vEnd(primitive.v[1].pos);
+			
+			vStart.x	= MIN( MAX( vStart.x, 0 ), width);
+			vStart.y	= MIN( MAX( vStart.y, 0 ), height);
+			
+			vEnd.x		= MIN( MAX( vEnd.x, 0), width);
+			vEnd.y		= MIN( MAX( vEnd.y, 0), height);
+
+			float dx = vEnd.x - vStart.x; if (dx < 0) dx = -dx;
+			float dy = vEnd.y - vStart.y; if (dy < 0) dy = -dy;
+
+			int incx = 1; if (vEnd.x < vStart.x) incx = -1;
+			int incy = 1; if (vEnd.y < vStart.y) incy = -1;
+
+			int xStart = (int)(vStart.x + 0.5f); 
+			int yStart = (int)(vStart.y + 0.5f);
+			int xEnd = (int)(vEnd.x + 0.5f);
+			int yEnd = (int)(vEnd.y + 0.5f);
+			
+			// If slope is outside the range [-1,1], swap x and y
+			bool xy_swap = false;
+			if (dy > dx) {
+				xy_swap = true;
+				int temp = xStart;
+				xStart = yStart;
+				yStart = temp;
+				temp = xEnd;
+				xEnd = yEnd;
+				yEnd = temp;
+			}
 
+			// If line goes from right to left, swap the endpoints
+			if (xEnd - xStart < 0) {
+				int temp = xStart;
+				xStart = xEnd;
+				xEnd = temp;
+				temp = yStart;
+				yStart = yEnd;
+				yEnd = temp;
+			}
 
-		// TODO: other primitive types (point, line)
+			int x,                       // Current x position
+				y = yStart,                  // Current y position
+				e = 0,                   // Current error
+				m_num = yEnd - yStart,         // Numerator of slope
+				m_denom = xEnd - xStart,       // Denominator of slope
+				threshold = m_denom / 2;  // Threshold between E and NE increment 
+
+			for (x = xStart; x < xEnd; x++) {
+				if (xy_swap)
+				{
+					const int pixelIdx = y + x * width;
+					fragmentBuffer[pixelIdx].color = Color(0.0f, 0.0f, 1.0f);
+				}
+				else
+				{
+					const int pixelIdx = x + y * width;
+					fragmentBuffer[pixelIdx].color = Color(0.0f, 0.0f, 1.0f);
+				}
+
+				e += m_num;
+
+				// Deal separately with lines sloping upward and those
+				// sloping downward
+				if (m_num < 0) {
+					if (e < -threshold) {
+						e += m_denom;
+						y--;
+					}
+				}
+				else if (e > threshold) {
+					e -= m_denom;
+					y++;
+				}
+			}
+
+			if (xy_swap)
+			{
+				const int pixelIdx = y + x * width;
+				fragmentBuffer[pixelIdx].color = Color(0.0f, 0.0f, 1.0f);
+			}
+			else
+			{
+				const int pixelIdx = x + y * width;
+				fragmentBuffer[pixelIdx].color = Color(0.0f, 0.0f, 1.0f);
+			}
+
+		}
+		else if (primitive.primitiveType == PrimitiveType::Point)
+		{
+			
+			glm::vec2 vPoint(primitive.v[0].pos);
+			if (0 < vPoint.x && vPoint.x < width && 0 < vPoint.y && vPoint.y < height)
+			{
+				static int i = 0;
+
+				const int pixelIdx = ((int)vPoint.x) + ((int)vPoint.y) * width;
+				fragmentBuffer[pixelIdx].color = Color(0.0f, 1.0f, 0.0f);
+			}
+		}
+	}
+}
+
+// Taken from https://blog.saush.com/2011/04/20/edge-detection-with-the-sobel-operator-in-ruby/
+__global__
+void _buildSobelMap(int w, int h, int* dev_depth, Fragment *fragmentBuffer)
+{
+	const int pixelX = (blockIdx.x * blockDim.x) + threadIdx.x;
+	const int pixelY = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	if (0 < pixelX && pixelX < w && 0 < pixelY && pixelY < h)
+	{
+		const float filterX[3][3] = { { -1, -2, -1 }, { 0, 0,  0 }, { 1, 2, 1 } };
+		const float filterY[3][3] = { { 1, 0, -1 }, { 2, 0, -2 }, { 1, 0, -1 } };
+
+		float sobelX = 0.0f; float sobelY = 0.0f;
+		for (int xOffset = -1; xOffset <= 1; xOffset += 1)
+		{
+			for (int yOffset = -1; yOffset <= 1; yOffset += 1)
+			{
+				if ((0 < pixelX + xOffset) && (pixelX + xOffset < w) && (0 < pixelY + yOffset) && (pixelY + yOffset < h))
+				{
+					const int pixelIdx = (pixelX + xOffset) + ((pixelY + yOffset) * w);
+					const Fragment& frag = fragmentBuffer[pixelIdx];
+
+					sobelX += filterX[xOffset + 1][yOffset + 1] * dev_depth[pixelIdx] / (1.0f*INT_MAX);
+					sobelY += filterY[xOffset + 1][yOffset + 1] * dev_depth[pixelIdx] / (1.0f*INT_MAX);
+				}
+			}
+		}
+		const int pixelIdx = pixelX + pixelY * w;
+		Fragment& frag = fragmentBuffer[pixelIdx];
+		frag.sobelXY = sobelX * sobelX + sobelY * sobelY;
+	}
+}
+
+__global__
+void _buildSobelMapWithSharedMemory(int w, int h, int* dev_depth, Fragment *fragmentBuffer)
+{
+	const int pixelX = (blockIdx.x * blockDim.x) + threadIdx.x;
+	const int pixelY = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	if (0 < pixelX && pixelX < w && 0 < pixelY && pixelY < h)
+	{
+		const float filterX[3][3] = { { -1, -2, -1 }, { 0, 0, 0 }, { 1, 2, 1 } };
+		const float filterY[3][3] = { { 1, 0, -1 }, { 2, 0, -2 }, { 1, 0, -1 } };
+
+		float sobelX = 0.0f; float sobelY = 0.0f;
+		for (int xOffset = -1; xOffset <= 1; xOffset += 1)
+		{
+			for (int yOffset = -1; yOffset <= 1; yOffset += 1)
+			{
+				if ((0 < pixelX + xOffset) && (pixelX + xOffset < w) && (0 < pixelY + yOffset) && (pixelY + yOffset < h))
+				{
+					const int pixelIdx = (pixelX + xOffset) + ((pixelY + yOffset) * w);
+					const Fragment& frag = fragmentBuffer[pixelIdx];
+
+					sobelX += filterX[xOffset + 1][yOffset + 1] * dev_depth[pixelIdx] / (1.0f*INT_MAX);
+					sobelY += filterY[xOffset + 1][yOffset + 1] * dev_depth[pixelIdx] / (1.0f*INT_MAX);
+				}
+			}
+		}
+		const int pixelIdx = pixelX + pixelY * w;
+		Fragment& frag = fragmentBuffer[pixelIdx];
+		frag.sobelXY = sobelX * sobelX + sobelY * sobelY;
 	}
+}
+
+/**
+* Writes fragment colors to the framebuffer
+*/
+__device__
+Color tex2D(int w, int h, TextureData* tex2D, int texelX, int texelY)
+{
+	texelX = (texelX < w) ? texelX : w - 1;
+	texelY = (texelY < h) ? texelY : h - 1;
+
+	const int texelIdx = 3 * (texelX + texelY * w);
+	if (tex2D != NULL)
+	{
+		TextureData* tex = &tex2D[texelIdx];
+		return Color(tex[0] / 255.f, tex[1] / 255.f, tex[2] / 255.f);
+	}
+	else
+	{
+		return Color(0.0f, 0.0f, 1.0f);
+	}
+}
+
+__device__
+Color getTextureColor(const Fragment& frag)
+{
+	int texel[] = { frag.texcoord0.x * frag.texWidth, frag.texcoord0.y * frag.texHeight };
+
+	TextureData* tex = frag.dev_diffuseTex;
+	
+	Color diffuseTexColor = tex2D(frag.texWidth, frag.texHeight, tex, texel[0], texel[1]);
+	return diffuseTexColor;
+}
+
+__device__
+Color getTextureBilinearColor(const Fragment& frag)
+{
+	float texeluv[]		= { frag.texcoord0.x * frag.texWidth - 0.5, frag.texcoord0.y * frag.texHeight - 0.5 };
+	int texelxy[]		= { floor(texeluv[0]), floor(texeluv[1]) };
+
+	float texel_diff[]	= { texeluv[0] - texelxy[0], texeluv[1] - texelxy[1] };
+	float texel_OppDiff[] = { 1 - texel_diff[0], 1 - texel_diff[1] };
+
+	TextureData* tex = frag.dev_diffuseTex;
+	Color tex00 = tex2D(frag.texWidth, frag.texHeight, tex,	  texelxy[0],	texelxy[1]);
+	Color tex10 = tex2D(frag.texWidth, frag.texHeight, tex, texelxy[0]+1,	texelxy[1]);
+	Color tex01 = tex2D(frag.texWidth, frag.texHeight, tex,   texelxy[0],	texelxy[1]+1);
+	Color tex11 = tex2D(frag.texWidth, frag.texHeight, tex, texelxy[0]+1,	texelxy[1]+1);
+
+	Color result = (texel_OppDiff[0] * tex00 + texel_diff[0] * tex10) * texel_OppDiff[1]
+				+ (texel_OppDiff[0] * tex01 + texel_diff[0] * tex11) * texel_diff[1];
+	return result;
+}
+
+__global__
+void render(int w, int h, int* depth, Fragment *fragmentBuffer, Color *framebuffer)
+{
+	const glm::vec3 lightPos		= glm::vec3(0, 5, 0);
+	const Color matAmbientClr		= Color(0.25f, 0.25f, 0.25f);
+	const glm::vec3 matDiffuseClr	= Color(0.3f, 0.3f, 0.3f);
+	const float lightDiffuseI		= 0.65f;
+	const glm::vec3 matSpecColor	= Color(0.03f, 0.03f, 0.03f);
+	const float matSpecShininess	= 2.5f;
+	const float lightSpecI			= 0.75f;
+	
+	const float toonShadingVariance = 12;
+
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int pixelIdx = x + (y * w);
 	
+	if (0 < x && x < w && 0 < y && y < h)
+	{
+		const Fragment& frag = fragmentBuffer[pixelIdx];
+		Color& pixelClr = framebuffer[pixelIdx];
+		{
+			if (depth[pixelIdx] != INT_MAX && frag.dev_diffuseTex)
+			{
+				glm::vec3 unitEyeNormal = glm::normalize(frag.eyeNor);
+				glm::vec3 lightVec = glm::normalize(lightPos - frag.eyePos);
+				float diffCoeff = MIN(MAX(0, glm::dot(lightVec, unitEyeNormal)), 1.0f);
+
+				glm::vec3 lightReflV = -glm::normalize(glm::reflect(lightVec, unitEyeNormal));
+				float specCoeff = MIN( MAX(0.0f, glm::dot(lightReflV, glm::normalize(-frag.eyePos))), 1.0f);
+				specCoeff = pow(specCoeff, matSpecShininess);
+
+				//pixelClr = getTextureColor(frag);
+				pixelClr	= matAmbientClr 
+							+ lightDiffuseI * diffCoeff * getTextureBilinearColor(frag)
+							+ lightSpecI * specCoeff * matSpecColor;
+
+				// The cel - shading process starts with a typical 3D model. Where cel-shading differs
+				// from conventional rendering is in its non - photorealistic illumination model.
+				// Conventional(smooth) lighting values are calculated for each pixel and then quantized
+				//to a small number of discrete shades to create the characteristic flat look:
+				pixelClr = glm::ceil(pixelClr * toonShadingVariance) / toonShadingVariance;
+				//Black "ink" outlines and contour lines can be created using a variety of methods.
+				// We use the Sobel method to create a shared memory buffer.
+				if (frag.sobelXY > 1.0f) pixelClr = glm::vec3(0.0f, 0.0f, 0.0f);
+			}
+			else
+			{
+				pixelClr = frag.color;
+			}
+		}
+	}
 }
 
+/**
+* Kernel that writes the image to the OpenGL PBO directly.
+*/
+__global__
+void sendImageToPBO(uchar4 *pbo, int w, int h, Color *image) {
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * w);
 
+	if (x < w && y < h) {
+		glm::vec3 color;
+		color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
+		color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
+		color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+		// Each thread writes one pixel location in the texture (textel)
+		pbo[index].w = 0;
+		pbo[index].x = color.x;
+		pbo[index].y = color.y;
+		pbo[index].z = color.z;
+	}
+}
+
+struct IsPrimitiveBackFacing
+{
+	__host__ __device__
+	bool operator()(const Primitive& p)
+	{
+		switch (p.primitiveType)
+		{
+			case PrimitiveType::Line:
+			{
+				glm::vec3 frontFacingDegree = 0.5f * glm::vec3(p.v[1].eyeNor + p.v[0].eyeNor);
+				return frontFacingDegree.z < 0;
+			}
+			break;
+			case PrimitiveType::Triangle:
+			{
+				glm::vec3 normal = glm::normalize(glm::cross(
+				glm::vec3(p.v[1].eyePos - p.v[0].eyePos),
+				glm::vec3(p.v[2].eyePos - p.v[0].eyePos)));
+
+				float frontFacingDegree = glm::dot(glm::vec3(-p.v[0].eyePos), normal);
+				return frontFacingDegree < 0;
+			}
+			break;
+			case PrimitiveType::Point:
+			default:
+				return false;
+		}
+	}
+};
 
 /**
  * Perform rasterization.
@@ -676,13 +1186,22 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
-    dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
-		(height - 1) / blockSize2d.y + 1);
+    dim3 blockCount2d((width  - 1) / blockSize2d.x + 1, (height - 1) / blockSize2d.y + 1);
 
 	// Execute your rasterization pipeline here
 	// (See README for rasterization pipeline outline.)
+	Timer::resetTimer(true);
+	Timer::playTimer();
+	{
+		cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
+		initFrameBuffers << <blockCount2d, blockSize2d >> >(width, height, dev_depth, dev_fragmentBuffer);
+	}
+	Timer::pauseTimer();
+	Timer::printTimer("InitFB: ", 1.0f);
 
 	// Vertex Process & primitive assembly
+	Timer::resetTimer(true);
+	Timer::playTimer();
 	{
 		curPrimitiveBeginId = 0;
 		dim3 numThreadsPerBlock(128);
@@ -708,64 +1227,78 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 				checkCUDAError("Primitive Assembly");
 
 				curPrimitiveBeginId += p->numPrimitives;
+				
 			}
 		}
-
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
-	
-	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
-
-
-
-    // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
-	checkCUDAError("fragment shader");
-    // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
-    checkCUDAError("copy render result to pbo");
-}
-
-/**
- * Called once at the end of the program to free CUDA memory.
- */
-void rasterizeFree() {
+	Timer::pauseTimer();
+	Timer::printTimer("VP: ", 1.0f);
 
-    // deconstruct primitives attribute/indices device buffer
+	//Back-Face culling
+	Timer::resetTimer(true);
+	Timer::playTimer();
+	{
+		totalNumPrimitives = curPrimitiveBeginId;
+		//printf("Pre-Cull: %d;", totalNumPrimitives);
+		cudaMalloc(&dev_primitivesCulled, totalNumPrimitives * sizeof(Primitive));
+		cudaMemcpy(dev_primitivesCulled, dev_primitives, totalNumPrimitives * sizeof(Primitive), cudaMemcpyDeviceToDevice);
+
+		thrust::device_ptr<Primitive> dPtr_primitivesBegin(dev_primitivesCulled);
+		thrust::device_ptr<Primitive> dPtr_primitivesEnd = dPtr_primitivesBegin + totalNumPrimitives;
+		dPtr_primitivesEnd = thrust::remove_if(dPtr_primitivesBegin, dPtr_primitivesEnd, IsPrimitiveBackFacing());
+		totalNumPrimitives = dPtr_primitivesEnd - dPtr_primitivesBegin;
+		checkCUDAError("backface culling");
+
+		//printf(" Post-Cull: %d\n", totalNumPrimitives);
+	}
+	Timer::pauseTimer();
+	Timer::printTimer("BackFace: ", 1.0f);
 
-	auto it(mesh2PrimitivesMap.begin());
-	auto itEnd(mesh2PrimitivesMap.end());
-	for (; it != itEnd; ++it) {
-		for (auto p = it->second.begin(); p != it->second.end(); ++p) {
-			cudaFree(p->dev_indices);
-			cudaFree(p->dev_position);
-			cudaFree(p->dev_normal);
-			cudaFree(p->dev_texcoord0);
-			cudaFree(p->dev_diffuseTex);
+	if (totalNumPrimitives > 0)
+	{
+		// TODO: rasterize
+		Timer::resetTimer(true);
+		Timer::playTimer();
+		{
+			dim3 numThreadsPerBlock(256);
+			dim3 numBlocksForPrims((totalNumPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-			cudaFree(p->dev_verticesOut);
+			_rasterize << < numBlocksForPrims, numThreadsPerBlock >> > (width, height, totalNumPrimitives, dev_depth, dev_primitivesCulled, dev_fragmentBuffer);
+		}
+		Timer::pauseTimer();
+		Timer::printTimer("Rasterize: ", 1.0f);
 
-			
-			//TODO: release other attributes and materials
+		// Sobel Filter
+		Timer::resetTimer(true);
+		Timer::playTimer();
+		{
+			dim3 numThreadsPerBlock(SOBEL_GRID_SIZE, SOBEL_GRID_SIZE, 1);
+			dim3 numBlocksForSobel((width - 1) / numThreadsPerBlock.x + 1, (height - 1) / numThreadsPerBlock.y + 1);
+			_buildSobelMap << < numBlocksForSobel, numThreadsPerBlock >> > (width, height, dev_depth, dev_fragmentBuffer);
 		}
+		Timer::pauseTimer();
+		Timer::printTimer("SobelMap: ", 1.0f);
 	}
+	
+	Timer::resetTimer(true);
+	Timer::playTimer();
+	{
+		// Copy depthbuffer colors into framebuffer
+		render << <blockCount2d, blockSize2d >> >(width, height, dev_depth, dev_fragmentBuffer, dev_framebuffer);
+		checkCUDAError("fragment shader");
+	}
+	Timer::pauseTimer();
+	Timer::printTimer("FS: ", 1.0f);
 
-	////////////
-
-    cudaFree(dev_primitives);
-    dev_primitives = NULL;
-
-	cudaFree(dev_fragmentBuffer);
-	dev_fragmentBuffer = NULL;
-
-    cudaFree(dev_framebuffer);
-    dev_framebuffer = NULL;
-
-	cudaFree(dev_depth);
-	dev_depth = NULL;
-
-    checkCUDAError("rasterize Free");
+	Timer::resetTimer(true);
+	Timer::playTimer();
+	{
+		// Copy framebuffer into OpenGL buffer for OpenGL previewing
+		sendImageToPBO << <blockCount2d, blockSize2d >> >(pbo, width, height, dev_framebuffer);
+		checkCUDAError("copy render result to pbo");
+	}
+	Timer::pauseTimer();
+	Timer::printTimer("ToOGL: ", 1.0f);
+	cudaFree(dev_primitivesCulled);
 }
diff --git a/src/rasterize.h b/src/rasterize.h
index 560aae9..056d40b 100644
--- a/src/rasterize.h
+++ b/src/rasterize.h
@@ -18,7 +18,10 @@ namespace tinygltf{
 
 
 void rasterizeInit(int width, int height);
-void rasterizeSetBuffers(const tinygltf::Scene & scene);
+void rasterizeFree();
+
+void setSceneBuffers(const tinygltf::Scene & scene);
+void clearSceneBuffers();
 
 void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal);
-void rasterizeFree();
+
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index c995fae..c511f82 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -4,9 +4,11 @@ set(SOURCE_FILES
     "glslUtility.hpp"
     "glslUtility.cpp"
     "tiny_gltf_loader.h"
+    "timer.h"
+    "timer.cpp"
    )
 
 cuda_add_library(util
    ${SOURCE_FILES}
-   OPTIONS -arch=sm_52
+   OPTIONS -arch=sm_50
    )
diff --git a/util/timer.cpp b/util/timer.cpp
new file mode 100644
index 0000000..075d65f
--- /dev/null
+++ b/util/timer.cpp
@@ -0,0 +1,164 @@
+
+#include <cstdio>
+#include <chrono>
+
+#include "timer.h"
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
+#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+static void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+#if ERRORCHECK
+	cudaDeviceSynchronize();
+	cudaError_t err = cudaGetLastError();
+	if (cudaSuccess == err) {
+		return;
+	}
+
+	fprintf(stderr, "CUDA error");
+	if (file) {
+		fprintf(stderr, " (%s:%d)", file, line);
+	}
+	fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+#  ifdef _WIN32
+	getchar();
+#  endif
+	exit(EXIT_FAILURE);
+#endif
+}
+
+MyTimer* Timer::m_myTimer = NULL;
+
+class MyTimer
+{
+public:
+	using Clock = std::chrono::high_resolution_clock;
+	using TimePoint = std::chrono::time_point<Clock>;
+
+public:
+	MyTimer()
+	{
+		m_refCount = 0;
+		m_useGPU = true;
+		m_elapsedTimeInms = 0.0f;
+
+		cudaEventCreate(&m_start);
+		cudaEventCreate(&m_stop);
+
+		m_startTime = Clock::now();
+		m_stopTime = Clock::now();
+	}
+
+	~MyTimer()
+	{
+		cudaEventDestroy(m_start);
+		cudaEventDestroy(m_stop);
+	}
+
+public:
+
+	void resetTimer(bool useGPU = true) 
+	{
+		m_useGPU = useGPU;
+		m_elapsedTimeInms = 0.0f;
+	}
+
+	void playTimer()
+	{
+		if (m_refCount++ == 0)
+		{
+			if (m_useGPU)
+			{
+				cudaEventRecord(m_start);
+			}
+			else
+			{
+				m_startTime = Clock::now();
+			}
+		}
+	}
+
+	bool pauseTimer()
+	{
+		bool bPaused = false;
+		if (--m_refCount == 0)
+		{
+			float newElapsedTime = 0.0f;
+			if (m_useGPU)
+			{
+				cudaEventRecord(m_stop);
+				cudaEventSynchronize(m_stop);
+				cudaEventElapsedTime(&newElapsedTime, m_start, m_stop);
+			}
+			else
+			{
+				m_stopTime = Clock::now();
+				newElapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(m_stopTime - m_startTime).count();
+			}
+			m_elapsedTimeInms += newElapsedTime;
+			bPaused = true;
+		}
+		return bPaused;
+	}
+
+	float printTimer(const char* timerHeader, float timerFactor)
+	{
+		float elapsedTime = timerFactor * m_elapsedTimeInms;
+		printf("%s - Elapsed Time:%f ms.\n", timerHeader, elapsedTime);
+		return elapsedTime;
+	}
+
+private:
+	size_t m_refCount;
+	bool m_useGPU;
+	float m_elapsedTimeInms;
+private:
+	cudaEvent_t m_start;
+	cudaEvent_t m_stop;
+
+private:
+	TimePoint m_startTime;
+	TimePoint m_stopTime;
+};
+
+Timer::Timer()
+{
+}
+
+Timer::~Timer()
+{
+}
+
+void Timer::initializeTimer()
+{
+	if (m_myTimer == NULL)
+		m_myTimer = new MyTimer;
+}
+
+void Timer::shutdownTimer()
+{
+	if (m_myTimer != NULL)
+		delete m_myTimer;
+}
+
+void Timer::resetTimer(bool useGPU)
+{
+	m_myTimer->resetTimer(useGPU);
+}
+
+void Timer::playTimer()
+{
+	m_myTimer->playTimer();
+}
+
+void Timer::pauseTimer()
+{
+	m_myTimer->pauseTimer();
+}
+
+void Timer::printTimer(const char* timerHeader, float timerFactor)
+{
+	m_myTimer->printTimer(timerHeader, timerFactor);
+}
\ No newline at end of file
diff --git a/util/timer.h b/util/timer.h
new file mode 100644
index 0000000..7ab61db
--- /dev/null
+++ b/util/timer.h
@@ -0,0 +1,23 @@
+#pragma once
+
+class MyTimer;
+
+class Timer
+{
+public:
+	Timer();
+	~Timer();
+
+public:
+	static void initializeTimer();
+	static void shutdownTimer();
+
+public:
+	static void resetTimer(bool useGPU = true);
+	static void playTimer();
+	static void pauseTimer();
+	static void printTimer(const char* timerHeader, float timerFactor);
+
+private:
+	static MyTimer* m_myTimer;
+};
diff --git a/util/tiny_gltf_loader.h b/util/tiny_gltf_loader.h
index 3746ac8..4398613 100644
--- a/util/tiny_gltf_loader.h
+++ b/util/tiny_gltf_loader.h
@@ -321,29 +321,40 @@ typedef struct {
   char pad[7];
 } Asset;
 
-class Scene {
- public:
-  Scene() {}
-  ~Scene() {}
-
-  std::map<std::string, Accessor> accessors;
-  std::map<std::string, Animation> animations;
-  std::map<std::string, Buffer> buffers;
-  std::map<std::string, BufferView> bufferViews;
-  std::map<std::string, Material> materials;
-  std::map<std::string, Mesh> meshes;
-  std::map<std::string, Node> nodes;
-  std::map<std::string, Texture> textures;
-  std::map<std::string, Image> images;
-  std::map<std::string, Shader> shaders;
-  std::map<std::string, Program> programs;
-  std::map<std::string, Technique> techniques;
-  std::map<std::string, Sampler> samplers;
-  std::map<std::string, std::vector<std::string> > scenes;  // list of nodes
-
-  std::string defaultScene;
-
-  Asset asset;
+class Scene
+{
+public:
+	enum PrimitiveType{
+		Point = 1,
+		Line = 2,
+		Triangle = 3
+	};
+
+public:
+
+	Scene() : drawModelType(PrimitiveType::Triangle) {}
+	~Scene() {}
+
+	std::map<std::string, Accessor> accessors;
+	std::map<std::string, Animation> animations;
+	std::map<std::string, Buffer> buffers;
+	std::map<std::string, BufferView> bufferViews;
+	std::map<std::string, Material> materials;
+	std::map<std::string, Mesh> meshes;
+	std::map<std::string, Node> nodes;
+	std::map<std::string, Texture> textures;
+	std::map<std::string, Image> images;
+	std::map<std::string, Shader> shaders;
+	std::map<std::string, Program> programs;
+	std::map<std::string, Technique> techniques;
+	std::map<std::string, Sampler> samplers;
+	std::map<std::string, std::vector<std::string> > scenes;  // list of nodes
+
+	std::string defaultScene;
+	
+	Asset asset;
+
+	PrimitiveType drawModelType;
 };
 
 enum SectionCheck