Assorted GPU apriltag changes from your friends at FRC900 #39
I know this is a huge number of diffs. I should have started it a while back but momentum got in the way and it has ended up here. So this is one of those "the best time to plant a tree was 20 years ago, the second best is today" kind of efforts. I fully expect lots of back and forth before merging, no worries. And there's certainly no obligation to take any of it. Hopefully the results of another set of eyes digging in will be useful but I won't be offended either way :)
First and foremost - I haven't looked into how to build this for your setup, so I know there will be issues with it as-is. I'd be happy to try, just not sure how much else of your environment I'd need to duplicate. Instructions welcome on that front.
It's a lot of code, here's my brain dump of the assorted changes and some rationale for them.
Right now for mono8 camera inputs I’m getting about 7-8 ms runtime for decode on an Orin Nano. Perf is pretty similar on the Xavier NX, which surprises me … the GPU seems pretty full on both, and I’d have expected better GPU perf on Orin. Maybe memory bandwidth limits? In any case, we're successfully running 2x 2MP 60FPS mono cameras at camera frame rate so we're happy.
Almost all of the changes where code is broken out into separate include files are there to make life easier for integrating with our external code. There’s probably a bit more work to do there (there’s copy-pasted code in our repo that could likely be extracted into common files) but it is certainly usable now.
Updated the code to be templated on the image input format. Our cameras are either BGRA8 (8-bit blue, green, red + alpha channel) or 8-bit or 16-bit monochrome images. This change was a way to efficiently deal with the differences. Most of the changes are in the initial input->grayscale conversion but there are a few optimizations for the mono8 case since, for example, the input->grayscale conversion in that case is a no-op.
These changes were isolated to the initial image copy to device memory and conversion, so threshold.cc was templated on input type.
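To give a rough idea of what "templated on the image input format" means, here is a minimal host-side sketch in plain C++. It is not the code from this PR: the names `InputFormat`, `GrayConverter` and `ToGray` are made up for illustration, the BGRA weights are the usual integer approximation of the BT.601 luma coefficients, and the mono16 path assumes little-endian samples and simply keeps the high byte. The real conversion runs in a CUDA kernel; the sketch only shows how the per-format differences can be resolved at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical format tags; the names used in the PR may differ.
enum class InputFormat { Mono8, Mono16, BGRA8 };

// One specialization per input format, selected at compile time.
template <InputFormat F>
struct GrayConverter;

template <>
struct GrayConverter<InputFormat::Mono8> {
  static constexpr std::size_t kBytesPerPixel = 1;
  // Already 8-bit grayscale, so the pixel is passed through unchanged.
  static std::uint8_t Convert(const std::uint8_t* p) { return p[0]; }
};

template <>
struct GrayConverter<InputFormat::Mono16> {
  static constexpr std::size_t kBytesPerPixel = 2;
  // Assumes little-endian 16-bit samples; keep the most significant byte.
  static std::uint8_t Convert(const std::uint8_t* p) { return p[1]; }
};

template <>
struct GrayConverter<InputFormat::BGRA8> {
  static constexpr std::size_t kBytesPerPixel = 4;
  // Integer luma approximation: weights 29 + 150 + 77 sum to 256. Alpha is ignored.
  static std::uint8_t Convert(const std::uint8_t* p) {
    const std::uint32_t b = p[0], g = p[1], r = p[2];
    return static_cast<std::uint8_t>((29u * b + 150u * g + 77u * r) >> 8);
  }
};

// Converts a tightly packed input buffer to 8-bit grayscale.
template <InputFormat F>
std::vector<std::uint8_t> ToGray(const std::vector<std::uint8_t>& input,
                                 std::size_t pixel_count) {
  using C = GrayConverter<F>;
  assert(input.size() == pixel_count * C::kBytesPerPixel);
  std::vector<std::uint8_t> gray(pixel_count);
  for (std::size_t i = 0; i < pixel_count; ++i) {
    gray[i] = C::Convert(input.data() + i * C::kBytesPerPixel);
  }
  return gray;
}
```

In this sketch the mono8 path still copies the buffer to keep the example short. The point of the mono8 optimization described above is that real code can skip the conversion step entirely and use the input buffer directly.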
Part of this was also splitting the decimation kernel from the grayscale conversion code. This helped perf for mono8 inputs, since the grayscale conversion code can be skipped entirely. And the grayscale code can be split off into a separate stream since its results aren’t needed until the end of the cuda work … it can be held by sync primitives and scheduled to run when GPU work is minimal, getting some parallelism on the CUDA side. The latter part can be improved, but since I’m mainly using mono inputs this is harder for me to test and tune.
Modified the code to handle non-multiple-of-8 input heights. I did this in the cheapest way possible. All internal buffers are allocated after rounding up to the next multiple of 8. The decode function, however, copies the exact image size, meaning that some of the end of the GPU buffer will not be written in cases where the actual input isn’t a multiple of 8. The GPU image buffer is zeroed in its constructor so the data isn’t undefined. Having a few rows of all 0 pixels in the input doesn’t seem to make a difference in the results and was an easy way to support our camera resolutions.
I copied over code I had which collects timing info. It also marks ranges of code with NVTX markers so it is easy to see what is going on when using visual tools such as nsys-ui. The printout code needs to be hooked up more cleanly (I was working to integrate glog code with ROS console printing but haven’t had a chance to make it work yet … once I do it’ll be easier to have a clean solution).
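The padding approach can be sketched on the host side roughly like this. Again this is an illustration rather than the code in the PR: `RoundUpTo8`, `PaddedImage` and `MakePadded` are hypothetical names, and a `std::vector` stands in for the zeroed GPU buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rounds v up to the next multiple of 8 (values already aligned are unchanged).
constexpr std::uint32_t RoundUpTo8(std::uint32_t v) { return (v + 7u) & ~7u; }

// Hypothetical stand-in for the GPU image buffer.
struct PaddedImage {
  std::uint32_t width;
  std::uint32_t height;         // actual input height
  std::uint32_t padded_height;  // allocation height, a multiple of 8
  std::vector<std::uint8_t> data;
};

// Allocates a zero-filled buffer with the padded height, then copies only the
// rows that exist in the input. The trailing padding rows stay at 0.
PaddedImage MakePadded(const std::vector<std::uint8_t>& src,
                       std::uint32_t width, std::uint32_t height) {
  assert(src.size() == static_cast<std::size_t>(width) * height);
  PaddedImage img{width, height, RoundUpTo8(height), {}};
  img.data.assign(static_cast<std::size_t>(width) * img.padded_height, 0);
  std::copy(src.begin(), src.end(), img.data.begin());
  return img;
}
```

For example, a 3-row input ends up in an 8-row buffer whose last 5 rows are all zero, which matches the "few rows of all 0 pixels" behaviour described above.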
This means a lot of the existing timing code events are redundant, but I haven’t yet cleaned up any of it.
Did a bunch of work making sure everything runs on the correct CUDA stream, and at the same time moved everything off the default stream. This means all memcpy / memset code is async aside from some initialization. Big changes here include
Changed 64-bit values in GPU code to 32-bit where possible. This provided a small but measurable speedup in some cases. It’s especially important for double->float changes, but also size_t -> uint32_t had a bit of an impact.