-
Notifications
You must be signed in to change notification settings - Fork 13
etotheipi/CUDA-Image-Processing
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Image Processing in C++ using CUDA Ridiculously fast morphology and convolutions using an NVIDIA GPU! Additional: cudaImageHost<type> and cudaImageDevice<type> Automate all the "standard" CUDA memory operations needed for any numeric data type. The can be used to learn how to allocate and move basic memory between host and device, and can be used to soften the learning curve when implementing CUDA programs (memory allocation/copy issues are frustrating to track down) ----- Orig Author: Alan Reiner Date: 04 September, 2010 Email: [email protected] ----- This a BETA version of a complete morph/convolve image processing library using CUDA. CUDA is an NVIDIA programming interface for harnessing your NVIDIA graphics card to do computations. For problems that are highly parallelizable (like convolution or morphology) you can get speedups of many orders of magnitude. For instance, to convolve a 4096x4096 mask with a 15x15 point-spread fn, the CPU would take anywhere from 5 to 20 seconds. Using this library with an NVIDIA GTX 460, the operation takes less than 0.1 seconds! You can see nearly identical speed improvements with morphological operations. To use this library, you will need the CUDA 3.2 toolkit, and you might have to move a few libraries into your linker path. And of course, you need a CUDA-enabled graphics card (Fermi recommended, others will prob work; more on that below). This library is designed to be used for performing a sequence of image operations on a single image/mask, only copying the data in and out of the device once. Memory copies are expensive, computation is basically free... I envision this library could be useful for GIMP plugins, or ATR systems. Just remember, CUDA is NOT open-source, and the licensing will not be favorable to OSS projects. ***IMPORTANT NOTE*** I hate makefiles. And w/ CUDA, they got more complicated, so I am using the canned common/common.mk makefile that comes with the CUDA SDK. THIS MAKEFILE ERASES *.cpp FILES WHEN CALLING "make clean" I don't know how to fix this. Perhaps someone else knows. Just be sure to check-in your cpp code before cleaning (though, this project no longer contains any .cpp files) ----- Supported Hardware: This code was designed for NVIDIA devices of Compute Capability 2.0+ which is any NVIDIA GTX 4XX series card (Fermi). The code *should* compile and run on 1.X devices, but the code is not optimized for them so there may be a huge performance hit. Maybe not. (SEE NOTE BELOW about compiling for 1.X GPUs) I believe NVIDIA 8800 GT is the earliest NVIDIA card that supports CUDA, and it would be Compute Capability 1.0. (correction: I believe all NVIDIA 8XXX cards support CUDA 1.0) CUDA was created by NVIDIA, for NVIDIA. It will not work ATI cards. If you want to take up GPU-programming on ATI, the only two options I know are OpenGL and OpenCL. However, the maturity of those programming interfaces are unclear (but at least such programs can be run on both NVIDIA and ATI) ----- Installing and running: This directory should contain everything you need to compile the image convolution code, besides the CUDA 3.1 toolkit which can be downloaded from the following website: http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html In addition to installing the toolkit, you might need to add libcutil*.a and libshrutil*.a to your LD path. In linux, it is sufficient to copy them to your /usr/local/cuda/lib[64] directory. (note these two libraries are not formally part of NVIDIA CUDA toolbox, but come with the SDK as a set of tools for simplifying many CUDA tasks, like checking error codes) I personally work with space-separated files for images, because they are trivial to read and write from C++. Additionally, I use MATLAB to read/write these files too, which makes it easy create input for the CUDA code, and verify output. There is no reason this code won't work on Windows, though I've never tried it. I strongly urge anyone trying to learn CUDA to download the CUDA SDK. It contains endless examples of every CUDA feature in existence. Once you know what to do, you can get the syntax from there, which more often than not is very ugly. ----- To work with "older" NVIDIA cards (8XXX GT, 9XXX GT, GTX 2XX): Open commom/common.mk and around line 150, uncomment the GENCODE_SM10 line, which enables dual-compilation for CUDA Compute Capability 1.X devices. The reason I commented this out is that 1.X devices don't support printf(), which is my main method for debugging. Therefore, leaving this line in the Makefile prevents the code from compiling when I am attempting to debug. ----- Planned Updates - I have basic HOST and DEVICE memory automated with cudaImageHost<type> and cudaImageDevice<type>... but texture, constant, and page-locked memory still requires manual operation. I will create similar classes that handle the allocation and copying of such data (especially textures because there's a lot of fluff to preparing that data) - Need to modify the kernel functions (or invocations) to handle masks that have arbitrary dimensions. Currently, it is assumed that the input has dimensions that are multiples of 32. - In the future, I plan add preprocessors which branch the compiled code based on CUDA architecture. It may be as simple as choosing new block sizes, or reworking some of the algorithms. The point is, the code is currently written with Compute Capability 2.0+ in mind, and who knows what happens when run on 1.X device. (NOTE: one major difference between 1.X and 2.X is that 1.X is optimized for 24-bit integer operations, 2.x is optimized for 32-bit integer operations. Using the wrong ones for the given architecture can produce factor of 8-32 slowdown, this most definitely needs to be addressed in a preprocessor somewhere) - Right now my focus has been on morphology, though regular convolution HAS been implemented. I will implement a more-exhaustive array of image processing functions, such as more filters and FFTs. - I'm pretty sure that all operations need separate input/output buffers I need to check that and then see if I can avoid it. ------------------------------------------------------------------------------- ----- User Instructions - to use this library in your project This library is designed for the user to put data on an "ImageWorkbench", which stores the data in device memory, perform thousands of operations, then copy the data back to the host. Use cudaImageHost to store and move data around in HOST memory, and initiate a ImageWorkbench object. The user is welcome to create/use cudaImageDevice objects (to store and manage DEVICE memory), but it should be unnecessary, since the workbench handles that for you. A typical usage paradigm might be like the following: // Allocate and populate HOST memory cudaImageHost imageIn, imageOut; imageIn.readFromFile("picture.txt", 1024, 1024); // Initialize the workbench -- this copies the img to DEVICE memory ImageWorkbench theBench(imageIn); theBench.Dilate(); theBench.Erode(); theBench.PruningSweep(); theBench.Open(); theBench.Dilate(); theBench.Dilate(); theBench.Median(); theBench.Dilate(); // Use CountChanged method to determine how many pixels changed in // the previous operation. Frequently, we may want to apply a sequence // of operations until the image is no longer changing int nChanged = -1; while(nChanged != 0) { theBench.ThinningSweep(); nChanged = theBench.CountChanged(); } If the user wishes to use supplied structuring elements, you create a cudaImageHost object using {-1, 0, +1} for {OFF, DONTCARE, ON}, and then call: int seIndex = ImageWorkbench::addStructElt(seImageHost); This permanently stores the SE in a master (static list), and you only need to supply the returned index to use it: theBench.Dilate(seIndex); theBench.Open(seIndex); theBench.FindAndRemove(seIndex); Every operation so far leaves the buffer management to the IWB object. However, every method has optional two arguments for specifying source & destination buffers, if desired. The valid buffers are also accessed by index: 'A', 'B', or N >= 1. If a number is supplied, the IWB object will use the Nth buffer in its extraBuffers_ vector. If the buffer does not exist, it will create it. Keep in mind that A and B are #defines and do not need quotes. // No arguments uses 3x3 optimized morphology, (A --> B) theBench.Dilate(); // Supply structuring element, still (A --> B) theBench.Erode(seIndex); // Default 3x3 median calc on buffer A, result goes to buf 1 (A --> 1) theBench.Median(A, 1); // Copy in an external image to buffer 2 (Extern --> 2) theBench.copyHostToBuffer(newImageSameSize, 2); // Union buffers 1 and 2, store the result in buffer B (1U2 --> B) theBench.Union(1, 2, B); // Subtract B from buffer 1, store in 2 (1-B -> 2) theBench.Subtract(B, 1, 2); // Now morphological-open the result with SE and copy back to A theBench.Open(seIndex, 2, A); // DONE! // The last operation could've been replaced with the following two lines theBench.Open(seIndex, 2, B); theBench.flipBuffers(); // switches buffer ptrs so B is now A and vv In order to maintain IWB consistency, any manual buffer management needs to end with the result in buffer A, or in buffer B followed by a call to flipBuffers(). This is because subsequent operations always assume the last output was in buffer A. ------------------------------------------------------------------------------- ----- Developer Stuff - to expand this library The paradigm for developing this library differs substantially from simply using the library. There are two main reasons for this: 1) The developer needs his own set of buffers so that he has workspace but doesn't overwrite the buffers that the user may have created himself 2) The developer needs to ensure that every "standard" operations uses buffer A as input, buffer B as output AND THAT BOTH BUFFERS DISTINCTLY REPRESENT BEFORE-AND-AFTER VERSIONS of the operation. Therefore, the developer CANNOT just implement morph-close operation as: Dilate(seIndex); Erode(seIndex); because a subsequent call to CountChanged would only give the number of pixels that changed by the Erode operation. This is more important than it sounds, because many algorithms are based on observing when a cycle of operations reaches steady state, and if the developer doesn't respect this behavior, the user would have to redundantly implement the sequence himself to be able to do these kinds of checks. This is especially true of thinning and pruning, where you are performing 8 erosions and subtractions, and need to stop when the entire sequence of operations no longer changes any pixels. Therefore, the developer should ALWAYS use ZFunctions where possible. These functions always require source and destination, allow access to TEMP buffers and do NOT flip the buffers when called. In other words, the developer uses a different set of tools, and does everything manually. Second thing to note is that the getBufPtrAny command has the following range of valid inputs: {A=-1000, B=-1001, 1, 2, 3, 4, ..., -1, -2, -3, -4 ...} | PRIMARY | EXTRA | TEMPORARY | The USER has access to PRIMARY & EXTRA (getBufferPtr won't accept neg vals) The DEVELOPER has access to all, but should only use PRIMARY & TEMP The difference between EXTRA and TEMPORARY is the the developer may create a sequence of commands, which themselves may also use temporary buffers. This causes a recursion problem, where the developer has no way to protect his own buffers from being overwritten by a method that he is calling. Therefore, lock-flags were implemented so that the workbench can find the next available TEMP buffer. This retrieval "locks" the buffer, and it MUST be "released" at the end of the function. Much like new&delete, there needs to be a "release" for every "get". This example demonstrates how to write new methods safely for ImageWorkbench class: void A_Complicated_Morph_Operation(int seIndex) { // Retrieve three temporary buffers from the workbench int tmpBufIndex1 = getTempBuffer(); int tmpBufIndex2 = getTempBuffer(); int tmpBufIndex3 = getTempBuffer(); // Always read in from A ZErode(A, tmpBufIndex1); ZMedian(seIndex, tmpBufIndex1, tmpBufIndex2); ZSubtract(tmpBufIndex2, A, tmpBufIndex3); ZUnion(A, tmpBufIndex3, tmpBufIndex2); ZDifference(tmpBufIndex2, tmpBufIndex1, tmpBufIndex3); // Last morph op should always put result in B, and then flip ZFindAndRemove( seIndex, tmpBufIndex3, B); flipBuffers(); // Tell the workbench to "unlock" all the TEMP buffers we used releaseTempBuffer(tmpBufIndex1); releaseTempBuffer(tmpBufIndex2); releaseTempBuffer(tmpBufIndex3); } As per the descriptions above, this means that all new GPU kernels should take as input, pointers to input, output and all temporary memory locations and the workbench will supply those pointers when wrapping the GPU kernels. This also means, that when implementing new methods in IWB, there should actually be three new methods created: NAME (arg1, arg2, ...); // calls below with src=A, dst=B, then flip NAME (arg1, arg2, ..., int srcBuf, int dstBuf); // no flip ZNAME(arg1, arg2, ..., int srcBuf, int dstBuf); // no flip, TEMP access Some might argue that the ZNAME functions aren't necessary, since the only real difference they have compared to NAME is that they can access TEMP buffers. There may be a better way to do this, and in fact, it may not be necessary to shield the user from the TEMP buffers at all (just don't use them, please). Since most of the ZFunctions are written by #define macros anyway, I'll leave them alone for now.
About
Developing a complete set of GPU-accelerated image processing tools, including convolution and morphology
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published