Assorted GPU apriltag changes from your friends at FRC900 #39

Open
wants to merge 3 commits into main

Conversation

@kjaget commented Dec 19, 2024

I know this is a huge number of diffs. I should have started it a while back but momentum got in the way and it has ended up here. So this is one of those "the best time to plant a tree was 20 years ago, the second best is today" kind of efforts. I fully expect lots of back and forth before merging, no worries. And there's certainly no obligation to take any of it. Hopefully the results of another set of eyes digging in will be useful but I won't be offended either way :)

First and foremost - I haven't looked into how to build this for your setup, so I know there will be issues with it as-is. I'd be happy to try, just not sure how much else of your environment I'd need to duplicate. Instructions welcome on that front.

It's a lot of code, here's my brain dump of the assorted changes and some rationale for them.
Right now for mono8 camera inputs I’m getting about 7-8 msec runtime for decode on an Orin Nano. Perf is pretty similar on the Xavier NX, which surprises me … the GPU seems pretty full on both, and I’d have expected better GPU perf on Orin. Maybe memory bandwidth limits? In any case, we're successfully running 2x 2MP 60FPS mono cameras at camera frame rate, so we're happy.

Almost all of the changes where code is broken out into separate include files are there to make life easier for integrating with our external code. There’s probably a bit more work there (there’s copy-pasted code in our repo that could likely be extracted into common files), but it is certainly usable now.

Updated the code to be templated on the image input format. Our cameras are either BGRA8 (8-bit blue, green, red + alpha channel) or 8-bit or 16-bit monochrome images. This change was a way to efficiently deal with the differences. Most of the changes are in the initial input->grayscale conversion but there are a few optimizations for the mono8 case since, for example, the input->grayscale conversion in that case is a no-op.
These changes were isolated to the initial image copy to device memory and conversion, so threshold.cc was templated on input type.
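
Roughly, the templating idea looks like this (a simplified sketch with made-up names, not the literal threshold.cc code): the kernel is templated on the input pixel type and the per-pixel conversion is an overload, so the mono8 instantiation collapses to a plain copy.

```cpp
// Simplified sketch, made-up names - not the actual threshold.cc code.
#include <cstdint>
#include <cuda_runtime.h>

struct BGRA8 { uint8_t b, g, r, a; };

__device__ inline uint8_t toGray(const BGRA8 p) {
  // Integer approximation of the usual luma weights (77 + 150 + 29 = 256)
  return static_cast<uint8_t>((77u * p.r + 150u * p.g + 29u * p.b) >> 8);
}

__device__ inline uint8_t toGray(const uint8_t p) {
  return p;  // mono8 input is already grayscale
}

template <typename PixelT>
__global__ void convertToGray(const PixelT *in, uint8_t *out, uint32_t n) {
  const uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = toGray(in[i]);
  }
}
```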

Part of this was also splitting the decimation kernel from the grayscale conversion code. This helped perf for mono8 inputs, since the grayscale conversion code can be skipped entirely. And the grayscale code can be split off into a separate stream since its results aren’t needed until the end of the cuda work … it can be held by sync primitives and scheduled to run when GPU work is minimal, getting some parallelism on the CUDA side. The latter part can be improved, but since I’m mainly using mono inputs this is harder for me to test and tune.
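
A rough sketch of the stream split (hypothetical kernel and variable names; the real kernel bodies live elsewhere in the detector):

```cpp
// Rough sketch, hypothetical names. Decimation stays on the main detection stream;
// the full-resolution grayscale conversion runs on its own stream and is only waited
// on right before the CPU decode step needs the gray image.
#include <cstdint>
#include <cuda_runtime.h>

__global__ void decimateKernel(const uint8_t *in, uint8_t *out, uint32_t w, uint32_t h);
__global__ void grayKernel(const uint8_t *in, uint8_t *out, uint32_t n);

void runFrame(const uint8_t *h_input, uint8_t *d_input, uint8_t *d_decimated,
              uint8_t *d_gray, uint8_t *h_gray, uint32_t w, uint32_t h,
              cudaStream_t mainStream, cudaStream_t grayStream,
              cudaEvent_t inputReady, cudaEvent_t grayDone) {
  const uint32_t n = w * h;
  const dim3 block(256);
  const dim3 grid((n + block.x - 1) / block.x);

  // Upload once on the main stream; both streams consume the same device buffer
  cudaMemcpyAsync(d_input, h_input, n, cudaMemcpyHostToDevice, mainStream);
  cudaEventRecord(inputReady, mainStream);

  decimateKernel<<<grid, block, 0, mainStream>>>(d_input, d_decimated, w, h);

  cudaStreamWaitEvent(grayStream, inputReady, 0);  // don't read d_input before the upload lands
  grayKernel<<<grid, block, 0, grayStream>>>(d_input, d_gray, n);
  cudaMemcpyAsync(h_gray, d_gray, n, cudaMemcpyDeviceToHost, grayStream);
  cudaEventRecord(grayDone, grayStream);

  // ... the rest of the GPU pipeline continues on mainStream ...

  cudaEventSynchronize(grayDone);  // deferred until CPU decode actually needs h_gray
}
```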

Modified the code to handle non-multiple-of-8 input heights. I did this in the cheapest way possible. All internal buffers are allocated after rounding up to the next multiple of 8. The decode function, however, copies the exact image size, meaning that some of the end of the GPU buffer will not be written in cases where the actual input isn’t a multiple of 8. The GPU image buffer is zeroed in its constructor, so the data isn’t undefined. Having a few rows of all-0 pixels in the input doesn’t seem to make a difference in the results and was an easy way to support our camera resolutions.
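
Sketch of the round-up-vs-exact-copy idea (made-up names):

```cpp
// Sketch, made-up names. Buffers are sized to the next multiple of 8 rows and zeroed
// once, but each frame only copies the real image, so the trailing rows stay 0 for
// non-multiple-of-8 heights.
#include <cstdint>
#include <cuda_runtime.h>

constexpr uint32_t roundUpTo8(uint32_t x) { return (x + 7u) & ~7u; }

uint8_t *allocImageBuffer(uint32_t width, uint32_t height) {
  const size_t paddedBytes = size_t{width} * roundUpTo8(height);  // e.g. 1085 rows -> 1088
  uint8_t *d_img = nullptr;
  cudaMalloc(&d_img, paddedBytes);
  cudaMemset(d_img, 0, paddedBytes);  // padding rows stay all-0
  return d_img;
}

void copyFrame(uint8_t *d_img, const uint8_t *h_img, uint32_t width, uint32_t height,
               cudaStream_t stream) {
  // Exact-size copy: the rounded-up tail of the buffer is never written
  cudaMemcpyAsync(d_img, h_img, size_t{width} * height, cudaMemcpyHostToDevice, stream);
}
```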

I copied over code I had which collects timing info. It also marks ranges of code with NVTX markers so it is easy to see what is going on when using visual tools such as nsys-ui. The printout code needs to be hooked up more cleanly (I was working to integrate glog code with ROS console printing but haven’t had a chance to make it work yet … once I do it’ll be easier to have a clean solution).
This means a lot of the existing timing code events are redundant, but I haven’t yet cleaned up any of it.
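
The pattern is basically NVTX ranges wrapped around CUDA event pairs, something like this sketch (made-up names; the NVTX header path can differ between toolkit versions):

```cpp
// Minimal sketch, made-up names - not the actual timing class from this PR.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

void timedSection(cudaStream_t stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  nvtxRangePushA("threshold");      // shows up as a named range in nsys-ui
  cudaEventRecord(start, stream);
  // ... kernel launches on `stream` ...
  cudaEventRecord(stop, stream);
  nvtxRangePop();

  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("threshold: %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
```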

Did a bunch of work making sure everything runs on the correct CUDA stream, and at the same time moved everything off the default stream. This means all memcpy / memset code is async aside from some initialization. Big changes here include:

  • Instead of sync memcpy calls into local int vars which are then passed to subsequent cub:: calls, queue up an async memcpy into a HostMemory array of size 1. Then use cuda sync calls to prevent the cub:: calls that use those values from being submitted until after the memcpy has finished (see the sketch after this list).
  • Move a memset() off into a separate stream so it can run in parallel with other code execution
  • The input->grayscale conversion code is also moved to a separate stream.
  • There were a few cases where it looked like cub:: code was unintentionally being run on the default stream, that should be fixed.
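
Here's a sketch of the async-count pattern from the first bullet (hypothetical names; the HostMemory class is stood in for by a single-element pinned allocation made once at init, and cub::DeviceRadixSort::SortKeys is just an example consumer of the count):

```cpp
// Sketch, hypothetical names - not the literal code from the diff.
#include <cstdint>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void sortByCount(const uint32_t *d_count,   // count produced by an earlier kernel
                 uint32_t *h_count_pinned,  // pinned, size-1 host array (allocated once at init)
                 uint32_t *d_keys_in, uint32_t *d_keys_out,
                 void *d_temp, size_t temp_bytes,
                 cudaStream_t stream, cudaEvent_t countReady) {
  // Queue the 4-byte D2H copy on the work stream; because the destination is pinned,
  // this is genuinely asynchronous instead of silently synchronizing.
  cudaMemcpyAsync(h_count_pinned, d_count, sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(countReady, stream);

  // ... other independent work can be queued here ...

  // Hold back only until the copy has landed, not until the whole stream drains
  cudaEventSynchronize(countReady);
  cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                 *h_count_pinned, 0, 32, stream);
}
```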

Changed 64-bit values in GPU code to 32-bit where possible. This provided a small but measurable speedup in some cases. It’s especially important for double->float changes, but size_t -> uint32_t also had a bit of an impact.
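
A trivial example of the kind of change involved (hypothetical kernel, not from the diff):

```cpp
// Hypothetical kernel: 32-bit indices and single-precision math in device code
// instead of size_t / double.
#include <cstdint>
#include <cuda_runtime.h>

__global__ void scalePoints(const float *in, float *out, uint32_t n, float scale) {
  const uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;  // was size_t
  if (i < n) {
    out[i] = in[i] * scale;  // float math; a double literal here would force fp64
  }
}
```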

Switch to BGR input in threshold.cc

Add support for multiple input types to gpu apriltag detection

Make GpuDetector templated on an input format, use this to
pick the input pixel depth as well as the initial conversion
to greyscale

Add cuda event timing
Optimize mono8 inputs - remove a copy from color to gray since color is
actually greyscale input already
Handle non multiple of 8 sized heights
Remove unused bayer image format

Add cuda event timings to get markers in Nsight profile

Clean up timing labels

More todo comments for potential optimizations

Have cuda spin rather than blocking when submitting work. This seems
to help with the observed ~500usec kernel launch times, especially with
cub:: work.
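
(For reference, the usual way to get spin-wait behavior is the device-flag call below; this is a sketch of the mechanism, not the literal change in the diff.)

```cpp
// Spinning burns a CPU core while waiting, but avoids the wake-up latency of
// yield/blocking sync that can inflate apparent launch/sync times.
#include <cuda_runtime.h>

void configureDevice() {
  // Must be called before the CUDA context for the device is created
  cudaSetDeviceFlags(cudaDeviceScheduleSpin);
}
```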

Make cuda event timing print in the order the events were added
rather than alphabetically
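
(Sketch of the insertion-order idea with a hypothetical class, not the PR's actual timing code: keeping label/event pairs in a vector preserves the order they were recorded, where a std::map keyed on the label would print alphabetically.)

```cpp
// Hypothetical class, not the PR's actual timing code.
#include <cstdio>
#include <cuda_runtime.h>
#include <string>
#include <utility>
#include <vector>

class EventTimings {
public:
  void record(const std::string &label, cudaStream_t stream) {
    cudaEvent_t ev;
    cudaEventCreate(&ev);
    cudaEventRecord(ev, stream);
    events_.emplace_back(label, ev);  // vector keeps insertion order
  }

  void print() const {
    if (events_.size() < 2) {
      return;
    }
    cudaEventSynchronize(events_.back().second);
    for (size_t i = 1; i < events_.size(); ++i) {
      float ms = 0.0f;
      cudaEventElapsedTime(&ms, events_[i - 1].second, events_[i].second);
      printf("%s: %.3f ms\n", events_[i].first.c_str(), ms);
    }
  }

private:
  std::vector<std::pair<std::string, cudaEvent_t>> events_;
};
```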

Add support for BGRA8

Looks like the stereolabs ZED camera outputs BGRA8, with the
alpha hardcoded to 255.  Wish we could just have it publish a
BGR8 image instead to prevent pushing around useless data but
don't see a way. So add support for BGRA8 as an input format
instead.

Lots of optimizations

Move everything onto a non-default cuda stream. This mainly meant making
the int / size_t results of kernels that are used as arguments for
subsequent kernel launches into HostMemory async memcpys rather than
copying directly into a single int variable.  This way the copies can be
truly async.

Similarly, moved the cuda memset() call into a separate stream so it runs
in parallel with device compute work.

Split up color->grey and input->decimated image processing.  Doing so
has a number of benefits. First off, the color->grey can be run on a
separate stream since it is independent of the rest of the cuda work
(the output grey image is used by CPU decoding towards the end of the
pipe).  It also helps thread utilization, since the combined process had
~25% utilization (most of the threads were idle while 1/4 were doing the
decimate).  Finally, the Mono8 case could be optimized because the input
is the same as the greyscale, so the entire color->grey / memcpy d2h
could be skipped in favor of a pointer assignment.

Initial work on making device code use 32 bit values rather than 64
wherever possible. This includes size_t -> uint32_t and double->float in
some cases. There's potentially more here to find but this commit
handled some of the low hanging fruit.

Make threshold code templated on image format

Remove it from the top level gpu detector. This simplifies the code a
lot by removing a bunch of unneeded template code from the main detector
class.

Fix buffer overrun

cleanup

Gpuapriltag mono16 (#1)

* Add 16-bit mono as a valid image input to gpu apriltag code

* Fix copy-paste error

* Fixes from testing on live camera

* More updates for mono16 gpu apriltag processing

* Blind change to maybe transparently deal with endian issues for 16 bit images

* Tested and fixed 16-bit apriltag input (in sim, at least)

* Fixes from testing on live robot, works with both versions of ov2311 cam

* Remove debugging code
@justinT21
Contributor

This looks amazing! Sorry for the late response: I didn't see this until now. We probably won't have time to take a look at this deeply for at least a few weeks.
