Assorted GPU apriltag changes from your friends at FRC900 #39

kjaget · 2024-12-19T21:14:32Z

I know this is a huge number of diffs. I should have started it a while back but momentum got in the way and it has ended up here. So this is one of those "the best time to plant a tree was 20 years ago, the second best is today" kind of efforts. I fully expect lots of back and forth before merging, no worries. And there's certainly no obligation to take any of it. Hopefully the results of another set of eyes digging in will be useful but I won't be offended either way :)

First and foremost - I haven't looked into how to build this for your setup, so I know there will be issues with it as-is. I'd be happy to try, just not sure how much else of your environment I'd need to duplicate. Instructions welcome on that front.

It's a lot of code, here's my brain dump of the assorted changes and some rationale for them.
Right now for mono8 camera inputs I’m getting about 7-8msec runtime for decode on an Orin Nano. Perf is pretty similar on the Xavier NX which surprises me … the GPU seems pretty full on both, and I’d have expected better GPU perf on Orin. Maybe memory bandwidth limits? In any case, we're successfully running 2x 2MP 60FPS mono cameras at camera frame rate so we're happy.

Almost all of the changes where code is broken out into separate include files is to make life easier for integrating with our external code. There’s probably a bit more work there (there’s copy-pasted code in our repo that could likely be extracted into common files) but it is certainly usable now.

Updated the code to be templated on the image input format. Our cameras are either BGRA8 (8-bit blue, green, red + alpha channel) or 8-bit or 16-bit monochrome images. This change was a way to efficiently deal with the differences. Most of the changes are in the initial input->grayscale conversion but there are a few optimizations for the mono8 case since, for example, the input->grayscale conversion in that case is a no-op.
These changes were isolated to the initial image copy to device memory and conversion, so threshold.cc was templated on input type.

Part of this was also splitting the decimation kernel from the grayscale conversion code. This helped perf for mono8 inputs, since the grayscale conversion code can be skipped entirely. And the grayscale code can be split off into a separate stream since its results aren’t needed until the end of the cuda work … it can be held by sync primitives and scheduled to run when GPU work is minimal, getting some parallelism on the CUDA side. The latter part can be improved, but since I’m mainly using mono inputs this is harder for me to test and tune.

Modified the code to handle non-multiple-of-8 input heights. I did this in the cheapest way possible. All internal buffers are allocated after rounding up to the next multiple of 8. The decode function, however, copies the exact image size, meaning that some of the end of the GPU buffer will not be written in cases where the actual input isn’t a multiple of 8 The GPU image buffer is zeroed in its constructor so the data isn’t undefined. Having a few rows of all 0 pixels in the input doesn’t seem to make a difference in the results and was an easy way to support our camera resolutions.

I copied over code I had which collects timing info. It also marks ranges of code with ntvx markers so it is easy to see what is going on when using visual tools such as nsys-ui. The printout code needs to be hooked up more cleanly (I was working to integrate glog code with ROS console printing but haven’t had a chance to make it work yet … once I do it’ll be easier to have a clean solution).
This means a lot of the existing timing code events are redundant, but I haven’t yet cleaned up any of it.

Did a bunch of work making sure everything runs on the correct CUDA stream, and at the same time moved everything off the default stream. This means all memcpy / memset code is async aside from some initialization. Big changes here include

Instead of sync memcpy calls into local int vars which are then passed to subsequent cub:: calls, queue up an async memcpy into a HostMemory array of size 1. Then use cuda sync calls to hold the cub:: calls using those values from being submitted until after the memcpy is finished.
Move a memset() off into a separate stream so it can run in parallel with other code execution
The input->grayscale conversion code is also moved to a separate stream.
There were a few cases where it looked like cub:: code was unintentionally being run on the default stream, that should be fixed.

Changed 64-bit values in GPU code to 32-bit where possible. This provided a small but measurable speedup in some cases. It’s especially important for double->float changes, but also size_t -> uint32_t had a bit of an impact.

Switch to BGR input in threshold.cc Add support for multiple input types to gpu apriltag detection Make GpuDetector templated on an input format, use this to pick the input pixel depth as well as the initial conversion to greyscale Add cuda event timing Optimize mono8 inputs - remove a copy from color to gray since color is actually greyscale input already Handle non multiple of 8 sized heights Remove unused bayer image format Add cuda event timings to get markers in nvsight profile Clean up timing labels More todo comments for potential optimizations Have cuda spin rather than blocking when submitting work. This seems to help with the observed ~500usec kernel launch times, espcially with cub:: work. Make cuda event timing print in the order the events were added rather than alphabetically Add support for BGRA8 Looks like the stereolabs ZED camera outputs BRGA8, with the alpha hardcoded to 255. Wish we could just have it publish a BGR8 image instead to prevent pushing around useless data but don't see a way. So add support for BRGA8 as an input format instead. Lots of optmizations Move everything onto a non-default cuda stream. This mainly meant making the int / size_t results of kernels that are used as arguments for subsequent kernel launches into HostMemory async memcpys rather than copying directly into a single int variable. This way the copies can be truly async. Similary, moved the cuda memset() call into a separate stream so it runs in parallel with device compute work. Split up color->grey and input->decimated image processing. Doing so has a number of benefits. First off, the color->grey can be run on a separate stream since it is independent of the rest of the cuda work (the output grey image is used by CPU decoding towards the end of the pipe). It also helps thread utilization, since the combined process had ~25% utilization (most of the threads were idle while 1/4 were doing the decimate). Finally, the Mono8 case could be optimized because the input is the same as the greyscale, so the entire color->grey / memcpy d2h could be skipped in favor of a pointer assignment. Initial work on making device code use 32 bit values rather than 64 wherever possible. This includes size_t -> uint32_t and double->float in some cases. There's potentially more here to find but this commit handled some of the low hanging fruit. Make threshold code templated on image format Remove it from the top level gpu detector. This simplifies the code a lot by removing a bunch of unneeded template code from the main detector class. Fix buffer overrun cleanup Gpuapriltag mono16 (#1) * Add 16-bit mono as a valid image input to gpu apriltag code * Fix copy-paste error * Fixes from testing on live camera * More updates for mono16 gpu apriltag processing * Blind change to maybe transparently deal with endian issues for 16 bit images * Tested and fixed 16-bit apriltag input (in sim, at least) * Fixes from testing on live robot, works with both versions of ov2311 cam * Remove debugging code

justinT21 · 2025-01-16T20:05:41Z

This looks amazing! Sorry for the late response: I didn't see this until now. We probably won't have time to take a look at this deeply for at least a few weeks

kjaget added 3 commits December 19, 2024 15:05

Assorted cleanup in prep for pull request

10bf3b2

Formatting

8084246

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assorted GPU apriltag changes from your friends at FRC900 #39

Assorted GPU apriltag changes from your friends at FRC900 #39

kjaget commented Dec 19, 2024

justinT21 commented Jan 16, 2025

Assorted GPU apriltag changes from your friends at FRC900 #39

Are you sure you want to change the base?

Assorted GPU apriltag changes from your friends at FRC900 #39

Conversation

kjaget commented Dec 19, 2024

justinT21 commented Jan 16, 2025