diff --git a/README.md b/README.md index a82ea0f..c3f1841 100644 --- a/README.md +++ b/README.md @@ -3,211 +3,30 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +siqi Huang Tested on: Windows 7, Inter(R) Core(TM) i7-4870 HQ CPU@ 2.5GHz; GeForce GT 750M(GK107) (Personal Computer) -### (TODO: Your README) +PART I: +This part is for cpu scan and compaction. The relative code is in cpu.cu. -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +PART II: +This part is for Naive GPU scan, and the code is in naive.cu. This is not done using shared memory. So multiple kernel is called in the process, and the num is log2(n-1). In each kernel, each thread add a different according to their current state. -Instructions (delete me) -======================== +PART III: +This part is for Efficient scan and its compaction. This is also not done using shared memory. So I still using multiple kernel. And the num is 2*log(n-1) because both up and down scan takes log(n-1). Each thread work differently in the kernel according to their state. The compaction is like this, first get the 0-1 value of the input, then get the scan, then put right value to right place and copy back. -This is due Sunday, September 13 at midnight. +PART IV: +This part is for thrust scan. The thrust is an enbeded function, so I first copy the input to a vector, then initialize the vector, put it in the GPU and get the result back. -**Summary:** In this project, you'll implement GPU stream compaction in CUDA, -from scratch. This algorithm is widely used, and will be important for -accelerating your path tracer project. +PART V(extra): +This part is for radix sort. The input is n number from 0 to n-1. In my case it is initially sorted from top to bottom. The process of the sort is like this. First get the number of digits, which is the digits of n-1. Then for each digits sort them in parallel. After log(n-1) step, we get the result and copy back. To change the input, you can simply change the number, or customize you own number, but be sure the numbers are consecutive. -Your stream compaction implementations in this project will simply remove `0`s -from an array of `int`s. In the path tracer, you will remove terminated paths -from an array of rays. +Comparsion of several scan methods: +![](image/compare.png) +This image shows the time used for each scan method in ms. The CPU uses the same time, do not know if the cuda API is also valid when no kernel is invoked. The thrust basically uses no time, I do not implement it, it may use the shared memory. For the naive scan, it is very surprising to be faster than efficient scan. But since we do not use shared memory, the time to invoke multiple kernel may be much higher than just invoking one. We notice that when the input is small, the naive scan is almost twice as fast as efficient. Then I notice in my implementation, the efficient scan runs excatly on more time of kernel than naive scan(both up and down scan for efficient scan). So if we use shared memory and reduce this disparity, I believe their difference is small. When the input is large in the last case, the efficient case is smaller than half of naive case, which means in one single kernel, the efficient case is faster than the naive case. -In addition to being useful for your path tracer, this project is meant to -reorient your algorithmic thinking to the way of the GPU. On GPUs, many -algorithms can benefit from massive parallelism and, in particular, data -parallelism: executing the same code many times simultaneously with different -data. - -You'll implement a few different versions of the *Scan* (*Prefix Sum*) -algorithm. First, you'll implement a CPU version of the algorithm to reinforce -your understanding. Then, you'll write a few GPU implementations: "naive" and -"work-efficient." Finally, you'll use some of these to implement GPU stream -compaction. - -**Algorithm overview & details:** There are two primary references for details -on the implementation of scan and stream compaction. - -* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx) - for Scan, Stream Compaction, and Work-Efficient Parallel Scan. -* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html). - -Your GPU stream compaction implementation will live inside of the -`stream_compaction` subproject. This way, you will be able to easily copy it -over for use in your GPU path tracer. - - -## Part 0: The Usual - -This project (and all other CUDA projects in this course) requires an NVIDIA -graphics card with CUDA capability. Any card with Compute Capability 2.0 -(`sm_20`) or greater will work. Check your GPU on this -[compatibility table](https://developer.nvidia.com/cuda-gpus). -If you do not have a personal machine with these specs, you may use those -computers in the Moore 100B/C which have supported GPUs. - -**HOWEVER**: If you need to use the lab computer for your development, you will -not presently be able to do GPU performance profiling. This will be very -important for debugging performance bottlenecks in your program. - -### Useful existing code - -* `stream_compaction/common.h` - * `checkCUDAError` macro: checks for CUDA errors and exits if there were any. - * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer. -* `main.cpp` - * Some testing code for your implementations. - - -## Part 1: CPU Scan & Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/cpu.cu`, implement: - -* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum. -* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using - the `scan` function. -* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan` - function. Map the input array to an array of 0s and 1s, scan it, and use - scatter to produce the output. You will need a **CPU** scatter implementation - for this (see slides or GPU Gems chapter for an explanation). - -These implementations should only be a few lines long. - - -## Part 2: Naive GPU Scan Algorithm - -In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan` - -This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet -taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses -shared memory, but is limited to operating on very small arrays! Instead, write -this using global memory only. As a result of this, you will have to do -`ilog2ceil(n)` separate kernel invocations. - -Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA -code in the online version of Chapter 39 are known to have a few small errors -(in superscripting, missing braces, bad indentation, etc.) - -Since the parallel scan algorithm operates on a binary tree structure, it works -best with arrays with power-of-two length. Make sure your implementation works -on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory -- your intermediate array sizes will need to be rounded to the next power of -two. - - -## Part 3: Work-Efficient GPU Scan & Stream Compaction - -### 3.1. Scan - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::scan` - -All of the text in Part 2 applies. - -* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2. -* Beware of errors in Example 39-2. -* Test non-power-of-two sized arrays. - -### 3.2. Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::compact` - -For compaction, you will also need to implement the scatter algorithm presented -in the slides and the GPU Gems chapter. - -In `stream_compaction/common.cu`, implement these for use in `compact`: - -* `StreamCompaction::Common::kernMapToBoolean` -* `StreamCompaction::Common::kernScatter` - - -## Part 4: Using Thrust's Implementation - -In `stream_compaction/thrust.cu`, implement: - -* `StreamCompaction::Thrust::scan` - -This should be a very short function which wraps a call to the Thrust library -function `thrust::exclusive_scan(first, last, result)`. - -To measure timing, be sure to exclude memory operations by passing -`exclusive_scan` a `thrust::device_vector` (which is already allocated on the -GPU). You can create a `thrust::device_vector` by creating a -`thrust::host_vector` from the given pointer, then casting it. - - -## Part 5: Radix Sort (Extra Credit) (+10) - -Add an additional module to the `stream_compaction` subproject. Implement radix -sort using one of your scan implementations. Add tests to check its correctness. - - -## Write-up - -1. Update all of the TODOs at the top of this README. -2. Add a description of this project including a list of its features. -3. Add your performance analysis (see below). - -All extra credit features must be documented in your README, explaining its -value (with performance comparison, if applicable!) and showing an example how -it works. For radix sort, show how it is called and an example of its output. - -Always profile with Release mode builds and run without debugging. - -### Questions - -* Roughly optimize the block sizes of each of your implementations for minimal - run time on your GPU. - * (You shouldn't compare unoptimized implementations to each other!) - -* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and - Thrust) to the serial CPU version of Scan. Plot a graph of the comparison - (with array size on the independent axis). - * You should use CUDA events for timing. Be sure **not** to include any - explicit memory operations in your performance measurements, for - comparability. - * To guess at what might be happening inside the Thrust implementation, take - a look at the Nsight timeline for its execution. - -* Write a brief explanation of the phenomena you see here. - * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is - it different for each implementation? - -* Paste the output of the test program into a triple-backtick block in your - README. - * If you add your own tests (e.g. for radix sort or to test additional corner - cases), be sure to mention it explicitly. - -These questions should help guide you in performance analysis on future -assignments, as well. - -## Submit - -If you have modified any of the `CMakeLists.txt` files at all (aside from the -list of `SOURCE_FILES`), you must test that your project can build in Moore -100B/C. Beware of any build issues discussed on the Google Group. - -1. Open a GitHub pull request so that we can see that you have finished. - The title should be "Submission: YOUR NAME". -2. Send an email to the TA (gmail: kainino1+cis565@) with: - * **Subject**: in the form of `[CIS565] Project 2: PENNKEY` - * Direct link to your pull request on GitHub - * In the form of a grade (0-100+) with comments, evaluate your own - performance on the project. - * Feedback on the project itself, if any. +Performance: +![](image/analysis1.png) +![](image/analysis2.png) +Those two images are for the 32768 input case. And the output for all input case are in the following files: +"output/512.txt","output/2048.txt","output/8192.txt","output/32768.txt" +The test case for radix sort is in the bottom, which contains power of 2 test(256) and non power of 2 test(130) diff --git a/image/analysis1.png b/image/analysis1.png new file mode 100644 index 0000000..441fb23 Binary files /dev/null and b/image/analysis1.png differ diff --git a/image/analysis2.png b/image/analysis2.png new file mode 100644 index 0000000..ec25458 Binary files /dev/null and b/image/analysis2.png differ diff --git a/image/compare.png b/image/compare.png new file mode 100644 index 0000000..02b14ff Binary files /dev/null and b/image/compare.png differ diff --git a/output/2048.txt b/output/2048.txt new file mode 100644 index 0000000..355f231 --- /dev/null +++ b/output/2048.txt @@ -0,0 +1,93 @@ + +**************** +** SCAN TESTS ** +**************** + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 32 0 ] +==== cpu scan, power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 49321 49353 ] +==== cpu scan, non-power-of-two ==== +Time used in scan on CPU 0.002624 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 49267 49273 ] + passed +==== naive scan, power-of-two ==== +Time used in naive scan on GPU 0.047008 ms + passed +==== naive scan, non-power-of-two ==== +Time used in naive scan on GPU 0.04736 ms + passed +==== work-efficient scan, power-of-two ==== +Time used in efficient scan on GPU 0.090016 ms + passed +==== work-efficient scan, non-power-of-two ==== +Time used in efficient scan on GPU 0.16592 ms + passed +==== thrust scan, power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed +==== thrust scan, non-power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact with scan ==== +Time used in scan on CPU 0.002592 ms +Time used in compaction on CPU 15.2762 ms + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== +Time used in efficient scan on GPU 0.088608 ms +Time used in compaction on GPU 19.9872 ms + passed +==== work-efficient compact, non-power-of-two ==== +Time used in efficient scan on GPU 0.088288 ms +Time used in compaction on GPU 19.9553 ms + passed +==== ---------------radix sort power of two(256)---------------- ==== +Time used in efficient scan on GPU 0.05936 ms +Time used in efficient scan on GPU 0.059488 ms +Time used in efficient scan on GPU 0.058848 ms +Time used in efficient scan on GPU 0.058912 ms +Time used in efficient scan on GPU 0.061504 ms +Time used in efficient scan on GPU 0.059008 ms +Time used in efficient scan on GPU 0.059904 ms +Time used in efficient scan on GPU 0.058976 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,14 +7,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,16 +7,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,18 +7,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,20 +7,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,22 +7,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,24 +7,248,249,250,251,252,253,254,255, +==== ---------------radix sort power of two(256)---------------- ==== +==== ------------radix sort none power of two(130)------------- ==== +Time used in efficient scan on GPU 0.058688 ms +Time used in efficient scan on GPU 0.058464 ms +Time used in efficient scan on GPU 0.058688 ms +Time used in efficient scan on GPU 0.058752 ms +Time used in efficient scan on GPU 0.057152 ms +Time used in efficient scan on GPU 0.058464 ms +Time used in efficient scan on GPU 0.05808 ms +Time used in efficient scan on GPU 0.0592 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129, +==== ------------radix sort none power of two(130)------------- ==== +请按任意键继续. . . \ No newline at end of file diff --git a/output/32768.txt b/output/32768.txt new file mode 100644 index 0000000..dbc7c5c --- /dev/null +++ b/output/32768.txt @@ -0,0 +1,93 @@ + +**************** +** SCAN TESTS ** +**************** + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 7 0 ] +==== cpu scan, power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 803684 803691 ] +==== cpu scan, non-power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 803630 803660 ] + passed +==== naive scan, power-of-two ==== +Time used in naive scan on GPU 0.381408 ms + passed +==== naive scan, non-power-of-two ==== +Time used in naive scan on GPU 0.370976 ms + passed +==== work-efficient scan, power-of-two ==== +Time used in efficient scan on GPU 0.540896 ms + passed +==== work-efficient scan, non-power-of-two ==== +Time used in efficient scan on GPU 0.541696 ms + passed +==== thrust scan, power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed +==== thrust scan, non-power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact with scan ==== +Time used in scan on CPU 0.002432 ms +Time used in compaction on CPU 18.6609 ms + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== +Time used in efficient scan on GPU 0.552384 ms +Time used in compaction on GPU 24.6243 ms + passed +==== work-efficient compact, non-power-of-two ==== +Time used in efficient scan on GPU 0.56576 ms +Time used in compaction on GPU 20.6086 ms + passed +==== ---------------radix sort power of two(256)---------------- ==== +Time used in efficient scan on GPU 0.058144 ms +Time used in efficient scan on GPU 0.058336 ms +Time used in efficient scan on GPU 0.059232 ms +Time used in efficient scan on GPU 0.05792 ms +Time used in efficient scan on GPU 0.05856 ms +Time used in efficient scan on GPU 0.058656 ms +Time used in efficient scan on GPU 0.058752 ms +Time used in efficient scan on GPU 0.058272 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,14 +7,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,16 +7,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,18 +7,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,20 +7,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,22 +7,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,24 +7,248,249,250,251,252,253,254,255, +==== ---------------radix sort power of two(256)---------------- ==== +==== ------------radix sort none power of two(130)------------- ==== +Time used in efficient scan on GPU 0.058688 ms +Time used in efficient scan on GPU 0.058208 ms +Time used in efficient scan on GPU 0.061856 ms +Time used in efficient scan on GPU 0.057888 ms +Time used in efficient scan on GPU 0.06144 ms +Time used in efficient scan on GPU 0.058528 ms +Time used in efficient scan on GPU 0.062752 ms +Time used in efficient scan on GPU 0.059776 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129, +==== ------------radix sort none power of two(130)------------- ==== +请按任意键继续. . . \ No newline at end of file diff --git a/output/512.txt b/output/512.txt new file mode 100644 index 0000000..51920b3 --- /dev/null +++ b/output/512.txt @@ -0,0 +1,94 @@ + + +**************** +** SCAN TESTS ** +**************** + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 11 0 ] +==== cpu scan, power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 12265 12276 ] +==== cpu scan, non-power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 12216 12221 ] + passed +==== naive scan, power-of-two ==== +Time used in naive scan on GPU 0.036192 ms + passed +==== naive scan, non-power-of-two ==== +Time used in naive scan on GPU 0.03632 ms + passed +==== work-efficient scan, power-of-two ==== +Time used in efficient scan on GPU 0.068352 ms + passed +==== work-efficient scan, non-power-of-two ==== +Time used in efficient scan on GPU 0.068288 ms + passed +==== thrust scan, power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed +==== thrust scan, non-power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact with scan ==== +Time used in scan on CPU 0.002432 ms +Time used in compaction on CPU 15.9923 ms + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== +Time used in efficient scan on GPU 0.067168 ms +Time used in compaction on GPU 19.6162 ms + passed +==== work-efficient compact, non-power-of-two ==== +Time used in efficient scan on GPU 0.066048 ms +Time used in compaction on GPU 19.7871 ms + passed +==== ---------------radix sort power of two(256)---------------- ==== +Time used in efficient scan on GPU 0.058272 ms +Time used in efficient scan on GPU 0.058624 ms +Time used in efficient scan on GPU 0.058848 ms +Time used in efficient scan on GPU 0.058592 ms +Time used in efficient scan on GPU 0.058336 ms +Time used in efficient scan on GPU 0.058624 ms +Time used in efficient scan on GPU 0.058592 ms +Time used in efficient scan on GPU 0.05904 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,14 +7,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,16 +7,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,18 +7,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,20 +7,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,22 +7,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,24 +7,248,249,250,251,252,253,254,255, +==== ---------------radix sort power of two(256)---------------- ==== +==== ------------radix sort none power of two(130)------------- ==== +Time used in efficient scan on GPU 0.058656 ms +Time used in efficient scan on GPU 0.058528 ms +Time used in efficient scan on GPU 0.059104 ms +Time used in efficient scan on GPU 0.058656 ms +Time used in efficient scan on GPU 0.058816 ms +Time used in efficient scan on GPU 0.058528 ms +Time used in efficient scan on GPU 0.058496 ms +Time used in efficient scan on GPU 0.057888 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129, +==== ------------radix sort none power of two(130)------------- ==== +请按任意键继续. . . \ No newline at end of file diff --git a/output/8192.txt b/output/8192.txt new file mode 100644 index 0000000..d9ed37d --- /dev/null +++ b/output/8192.txt @@ -0,0 +1,93 @@ + +**************** +** SCAN TESTS ** +**************** + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 4 0 ] +==== cpu scan, power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 200667 200671 ] +==== cpu scan, non-power-of-two ==== +Time used in scan on CPU 0.002592 ms + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 200610 200643 ] + passed +==== naive scan, power-of-two ==== +Time used in naive scan on GPU 0.068576 ms + passed +==== naive scan, non-power-of-two ==== +Time used in naive scan on GPU 0.068832 ms + passed +==== work-efficient scan, power-of-two ==== +Time used in efficient scan on GPU 0.142912 ms + passed +==== work-efficient scan, non-power-of-two ==== +Time used in efficient scan on GPU 0.142816 ms + passed +==== thrust scan, power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed +==== thrust scan, non-power-of-two ==== +Time used in thrust scan on GPU 0 ms + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== cpu compact with scan ==== +Time used in scan on CPU 0.002432 ms +Time used in compaction on CPU 14.8358 ms + [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== +Time used in efficient scan on GPU 0.139968 ms +Time used in compaction on GPU 19.58 ms + passed +==== work-efficient compact, non-power-of-two ==== +Time used in efficient scan on GPU 0.140608 ms +Time used in compaction on GPU 20.0698 ms + passed +==== ---------------radix sort power of two(256)---------------- ==== +Time used in efficient scan on GPU 0.061664 ms +Time used in efficient scan on GPU 0.058688 ms +Time used in efficient scan on GPU 0.061056 ms +Time used in efficient scan on GPU 0.058528 ms +Time used in efficient scan on GPU 0.05744 ms +Time used in efficient scan on GPU 0.061696 ms +Time used in efficient scan on GPU 0.058464 ms +Time used in efficient scan on GPU 0.05856 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,14 +7,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,16 +7,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,18 +7,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,20 +7,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,22 +7,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,24 +7,248,249,250,251,252,253,254,255, +==== ---------------radix sort power of two(256)---------------- ==== +==== ------------radix sort none power of two(130)------------- ==== +Time used in efficient scan on GPU 0.058688 ms +Time used in efficient scan on GPU 0.059264 ms +Time used in efficient scan on GPU 0.05824 ms +Time used in efficient scan on GPU 0.058944 ms +Time used in efficient scan on GPU 0.058336 ms +Time used in efficient scan on GPU 0.058624 ms +Time used in efficient scan on GPU 0.058464 ms +Time used in efficient scan on GPU 0.058464 ms +0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29, +30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56 +,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,8 +3,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,10 +7,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,12 +7,128,129, +==== ------------radix sort none power of two(130)------------- ==== +请按任意键继续. . . \ No newline at end of file diff --git a/src/main.cpp b/src/main.cpp index 7308451..58c6009 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,10 +11,12 @@ #include #include #include +#include #include "testing_helpers.hpp" +#include int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; + const int SIZE = 1 << 15; const int NPOT = SIZE - 3; int a[SIZE], b[SIZE], c[SIZE]; @@ -26,6 +28,7 @@ int main(int argc, char* argv[]) { printf("****************\n"); genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; printArray(SIZE, a, true); zeroArray(SIZE, b); @@ -83,6 +86,7 @@ int main(int argc, char* argv[]) { // Compaction tests genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; printArray(SIZE, a, true); int count, expectedCount, expectedNPOT; @@ -118,4 +122,25 @@ int main(int argc, char* argv[]) { count = StreamCompaction::Efficient::compact(NPOT, c, a); //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); -} + + //Radix sort test + printDesc("---------------radix sort power of two(256)----------------"); + int *input1=new int[256],*output=new int[256]; + for(int i=0;i<256;++i) input1[i]=255-i; + //input1[0]=4;input1[1]=7;input1[2]=2;input1[3]=6;input1[4]=3; + //input1[5]=5;input1[6]=1;input1[7]=0; + StreamCompaction::Radix::radix(256,output,input1); + for(int i=0;i<256;++i) std::cout< #include #include +#include +#include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) - +#define blockSize 128 /** * Check for CUDA errors; print and exit if there was a problem. */ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..b517aa5 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,5 +1,8 @@ #include #include "cpu.h" +#include "common.h" +//#include +#include namespace StreamCompaction { namespace CPU { @@ -9,7 +12,20 @@ namespace CPU { */ void scan(int n, int *odata, const int *idata) { // TODO - printf("TODO\n"); + if(n==0) return ; + cudaEvent_t start, stop; + cudaEventCreate(&start); + cudaEventCreate(&stop); + cudaEventRecord(start); + odata[0]=0; + for(int i=1;i #include "common.h" #include "efficient.h" - +#include namespace StreamCompaction { namespace Efficient { // TODO: __global__ +__global__ void upSwapOnGPU(int *idata,int step,int n,int newN){ + int index=blockIdx.x*blockDim.x+threadIdx.x; + if(index=n) idata[index]=0; + if((index+1)%(step*2)==0) idata[index]+=idata[index-step]; + } +} + +__global__ void downSwapOnGPU(int *idata,int step,int n,int newN){ + int index=blockIdx.x*blockDim.x+threadIdx.x; + if(index>>(dev_idata,step,n,newN); + step*=2; + } + step/=2; + while(step!=0){ + downSwapOnGPU<<>>(dev_idata,step,n,newN); + step/=2; + } + cudaEventRecord(stop); + + cudaEventSynchronize(stop); + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + std::cout<<"Time used in efficient scan on GPU "<>>(n,dev_idata,dev_tmp1); + cudaMemcpy(tmp1,dev_tmp1,n*sizeof(int),cudaMemcpyDeviceToHost); + scan(n,tmp2,tmp1); + cudaMemcpy(dev_tmp2,tmp2,n*sizeof(int),cudaMemcpyHostToDevice); + getCompact<<>>(dev_tmp1,dev_tmp2,dev_odata,n); + cudaMemcpy(odata,dev_odata,n*sizeof(int),cudaMemcpyDeviceToHost); + int count=tmp2[n-1]+tmp1[n-1]; + + cudaEventRecord(stop); + cudaEventSynchronize(stop); + + delete tmp1; + delete tmp2; + cudaFree(dev_tmp1); + cudaFree(dev_tmp2); + cudaFree(dev_idata); + cudaFree(dev_odata); + + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + std::cout<<"Time used in compaction on GPU "<=step) odata[index]=idata[index]+idata[index-step]; + else odata[index]=idata[index]; + } +} + void scan(int n, int *odata, const int *idata) { // TODO - printf("TODO\n"); + int step=1,count=0; + int *dev_odata,*dev_idata; + cudaEvent_t start, stop; + cudaEventCreate(&start); + cudaEventCreate(&stop); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + dim3 blockPerGrid=((n+blockSize-1)/blockSize); + cudaEventRecord(start); + while(step>>(n,dev_odata,dev_idata,step); + else scanOnGPU<<>>(n,dev_idata,dev_odata,step); + count++; + step*=2; + } + cudaEventRecord(stop); + cudaEventSynchronize(stop); + + if(count%2==1) cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + else cudaMemcpy(odata, dev_idata, n*sizeof(int), cudaMemcpyDeviceToHost); + for(int i=n-1;i>0;--i){ + odata[i]=odata[i-1]; + } + odata[0]=0; + cudaFree(dev_odata); + cudaFree(dev_idata); + + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + std::cout<<"Time used in naive scan on GPU "< +#include +#include "radix.h" +#include "efficient.h" +#include "common.h"; +#include + +namespace StreamCompaction { +namespace Radix { + +void scan(int n, int *odata, const int *idata){ + StreamCompaction::Efficient::scan(n,odata,idata); +} + +__device__ int getDigit(int n,int pos){ + int result=0; + for(int i=0;i>>(n,dev_idata,dev_b,i); + else getDigits<<>>(n,dev_odata,dev_b,i); + Reverse<<>>(n,dev_b,dev_e); + cudaMemcpy(host_e,dev_e,n*sizeof(int),cudaMemcpyDeviceToHost); + scan(n,host_f,host_e); + cudaMemcpy(dev_f,host_f,n*sizeof(int),cudaMemcpyHostToDevice); + int totalFalse=host_e[n-1]+host_f[n-1]; + getT<<>>(n,dev_f,dev_t,totalFalse); + getPos<<>>(n,dev_b,dev_t,dev_f,dev_d); + if(i%2==1) switchPos<<>>(n,dev_idata,dev_odata,dev_d); + else switchPos<<>>(n,dev_odata,dev_idata,dev_d); + } + + if(num%2==1) cudaMemcpy(odata,dev_odata,n*sizeof(int),cudaMemcpyDeviceToHost); + else cudaMemcpy(odata,dev_idata,n*sizeof(int),cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_b); + cudaFree(dev_t); + cudaFree(dev_d); + cudaFree(dev_f); + cudaFree(dev_e); +} + +} +} diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h new file mode 100644 index 0000000..daf69cd --- /dev/null +++ b/stream_compaction/radix.h @@ -0,0 +1,8 @@ +#pragma once + +namespace StreamCompaction { +namespace Radix { + void scan(int n, int *odata, const int *idata); + void radix(int n, int *odata, const int *idata); +} +} \ No newline at end of file diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..f09cede 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -16,6 +16,23 @@ void scan(int n, int *odata, const int *idata) { // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + std::vector in; + for(int i=0;i dv_in(in.begin(),in.end()); + thrust::device_vector dv_out(n,0); + thrust::exclusive_scan(dv_in.begin(),dv_in.end(),dv_out.begin()); + std::vector out(dv_out.begin(),dv_out.end()); + cudaEventRecord(stop); + cudaEventSynchronize(stop); + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + std::cout<<"Time used in thrust scan on GPU "<