vision_msgs standardization #43
Hi @peci1, thanks for your insightful comments. I agree with most of your points. As a general comment, we'll have to find a tradeoff between a cleaner message design (e.g. splitting fields out into separate messages) and ease of use. Here are some specific comments:
I don't have any strong opinions on this one. Personally, I've never filled that field.
I wouldn't say it's a "lie". The fields are optional, so when a node receives a message where the field is empty, it should simply ignore it. I'm not convinced by this one.
I'll continue to fight for keeping it a string field. The pros and cons have already been discussed exhaustively in #25. My main argument is that in some use cases, the string carries meaning on its own and can be read without a database lookup.
Great suggestion!
That's interesting! I believe it's always much better to extract standard interfaces (like the ones here) than to maintain custom messages. Since you said "converge closer": what is still preventing you from switching to vision_msgs?
Let's try to keep these messages smaller and more pointed...
Yes. Standard detections are in a 2D image frame. Most AI methods right now cannot do 3D (you have a bounding box on an image; it doesn't make sense to have a 3D representation for 2D pixel coordinates). We're working in the pixel frame in AI, not in world coordinates. UUIDs could be useful, I agree. The usefulness of UUIDs, @mintar, is when using multiple cameras doing tasks. If you just increment a counter, then there could be multiple detections carrying the same ID across cameras.
I don't think that's in the cards; again, these are pixel coordinates, which are most meaningful in the image frame.
This is common practice across geometry_msgs as well; I think we can safely do this based on the consistency argument.
Agreed
That seems oddly specific and also not really that easy for ROS 2. We have similar concepts of latching, but there's no boolean switch anymore; it's a range of QoS settings. I don't think it ultimately makes sense to try to enforce this; it might not make sense for all applications.
Thanks for all the discussion, everyone. Good to see this repo getting some love :)
I went through our detection messages, and this is what I found.

We have a field explicitly storing the objectness. I'm not sure if this is a generally useful thing or if it's just our particular need.

We also like to store which sensor is the origin of the detection - the frame_id can change over time (e.g. by doing cutouts and such), but we still like to know which sensor captured the source image. But maybe it's just for convenience and we could always find the source sensor some other way, e.g. by traversing the TF tree (which is, however, harder during replay than just reading a field).

We also store the estimated 3D position after going through localization (depth and size estimation). Here, I'm not sure if it's still a relevant part of vision_msgs or if it should rather be a part of something else. Detection3D would seem like the ideal "storage" for the localization result, but as long as it is tied to a single pointcloud as a source, it doesn't make sense. Our localization uses a source image and several pointclouds. Which of them should be stored as the source?

As detectors can output other things than class probabilities (e.g. the objectness mentioned above, or some other estimated properties), wouldn't it make sense to generalize this a bit and allow the kind of "modular" data storage PointCloud2 uses, with a byte buffer and a PointField-style metadata description (which could be published on a separate latched topic)? But I'm not sure how often it happens that detectors output more kinds of data. This would definitely future-proof the message definition, but the question is whether it isn't just a workaround for not having a good definition :) In the case of PointCloud2, I think it really was a good decision, as at the time of its creation, nobody probably thought about "multi-layer" lidars and the need to store the "ring" from which a point comes... and so on.
Okay, that's a fair point. Using a 3D Box message for storing 2D pixel coordinates doesn't look like a good idea in the end. What about re-using sensor_msgs/RegionOfInterest? It seems close to BoundingBox2D, even semantically, but not exactly the same.

I was thinking a bit more about the (UU)IDs, and I figured out I had a wrong concept in my head about their usage here. I somehow thought that it would be an ID of each particular detection, so that e.g. a localization algorithm can store the set of detections that contributed to the estimated pose. But going through it more thoroughly, I figured out the IDs relate not to the detections, but to the classes. That makes Steve's point invalid in my view, because you'd probably run five detectors, but all of them with the same config, which would mean that ID=5 would mean a drill in both cameras 1 and 2.

What would still argue a bit in favor of UUIDs is the case when you run multiple detectors on a single image. If both detectors had hypothesis IDs as numbers starting from zero, you wouldn't be able to distinguish between them by looking just at the hypotheses array. But I can't imagine the mess of going through all our bagfiles and having to do database lookups to find all drills (if the UUIDs differed in each run). For this particular case, it seems to me that it'd be ideal to use IDs of the form "namespace/class_number" or "namespace/class_name", which could be a single string, two strings, or a string and an int. The namespace is needed to distinguish e.g. "yolo/drill" from "otherdetector/drill", which could be trained on different datasets and thus report different kinds of objects as a drill. Wait, this namespace information is kind of available in VisionInfo, so maybe, in the end (through the rabbit hole and back), a hypothesis ID could be just an int or string. But it would deserve much more thorough documentation mentioning this kind of thought path...
CameraInfo is supposed to be published with each image, because it can change (e.g. for zooming cameras). VisionInfo should not change (except for the case of lifelong learning detectors). That's the reason why I suggested latching. In ROS 1, that would definitely make sense. In ROS 2, I don't know, I haven't tried it yet.
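For reference, the closest ROS 2 equivalent to a ROS 1 latched topic is a publisher with transient-local durability QoS. A minimal rclpy sketch (node and topic names are just placeholders):

```python
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, HistoryPolicy, DurabilityPolicy
from vision_msgs.msg import VisionInfo


class VisionInfoPublisher(Node):
    def __init__(self):
        super().__init__('vision_info_publisher')
        # Transient-local durability keeps the last message around so that
        # late-joining subscribers still receive it -- the ROS 2 analogue of latching.
        qos = QoSProfile(depth=1,
                         history=HistoryPolicy.KEEP_LAST,
                         durability=DurabilityPolicy.TRANSIENT_LOCAL)
        self.pub = self.create_publisher(VisionInfo, 'vision_info', qos)


def main():
    rclpy.init()
    rclpy.spin(VisionInfoPublisher())
```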
Ouch...
I think that's not really a good move. If we "recommend" something, we should just say that you should "do" that thing. If we're telling people to use UUIDs, then we should just use the proper UUID message. As ROS gets more mature, some standards need to get more strict to enforce good behavior.

Detections from some arbitrary AI detection algorithm have no way of assigning a meaningful ID; they're themselves just returning some hash to specifically identify one thing vs. another. My understanding of this package is that it carries the output of a detector / segmentation / classification task, to then be consumed by later stages that will track those detections over time/space or use them in an environmental model. Having truly unique IDs is probably really important to that problem in general, and the class names should be what's used to communicate what the object is. If the AI detector that is publishing this message can also track the objects such that a single object is correlated between frames, then names start to matter a little more, but you could still just reuse the same UUID.

So what else is the string-based identifier used for that could add more value? I think that's the question this boils down to.
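For context, here is a minimal sketch of how a detector could fill a ROS UUID field, assuming the messages adopted unique_identifier_msgs/UUID (the helper name is hypothetical; the message and its 16-byte `uuid` field are real):

```python
import uuid

from unique_identifier_msgs.msg import UUID


def make_uuid_msg() -> UUID:
    """Generate a random (version 4) UUID and pack its 16 raw bytes into the message."""
    msg = UUID()
    msg.uuid = list(uuid.uuid4().bytes)
    return msg
```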
I don't think anything needs to change. I'm fine with it as is.
I think we're all on the same page here.
UUIDs

@peci1 wrote:
You probably looked at the Kinetic/Melodic version of the messages. For reference, here is the current definition (with the ObjectHypothesisWithPose fields expanded inline):
Detection3D.msg:

```
std_msgs/Header header

vision_msgs/ObjectHypothesisWithPose[] results
    # The unique ID of the object class. To get additional information about
    # this ID, such as its human-readable class name, listeners should perform a
    # lookup in a metadata database. See vision_msgs/VisionInfo.msg for more detail.
    string id
    float64 score
    geometry_msgs/PoseWithCovariance pose

vision_msgs/BoundingBox3D bbox

sensor_msgs/PointCloud2 source_cloud

# If this message is a tracking result, this field is set to true.
bool is_tracking

# ID used for consistency across multiple detection messages. This value will
# likely differ from the id field set in each individual ObjectHypothesis.
# If you set this field, be sure to also set is_tracking to True.
string tracking_id
```

The arguments for and against UUIDs, as presented in #25 and reiterated by @SteveMacenski, basically boil down to:

Pro UUID:
Contra UUID:
Presumably, if you run multiple object detector nodes, they will all publish their results on separate topics. If you are worried about collisions, the aggregator node could simply prefix each ID.
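A sketch of what such prefixing could look like (the namespace/ID scheme is just an illustration):

```python
def prefixed_id(detector_ns: str, hypothesis_id: str) -> str:
    """Combine a detector namespace with its hypothesis ID, e.g. 'yolo' + 'drill' -> 'yolo/drill'."""
    return f'{detector_ns}/{hypothesis_id}'
```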
BoundingBoxes

@peci1 wrote:

Yes, we had that discussion before, but @Kukanani was adamant that we need the theta field. I believe BoundingBox2D should definitely stay in this package, because it's pixel-based. But BoundingBox3D could still move to geometry_msgs.
Thanks for the discussion. It would be helpful if a general-enough representation were found that efficiently covers typical use cases. I would suggest organizing the messages around common vision tasks, keeping the representation footprint small for typical use cases while keeping the messages general enough to allow other use cases as well.

For example, for a 2D detector the output would be a list of bounding boxes with some confidences or class probabilities. More flexibility (for a reasonable price) could be added by explicitly adding a class_id[] field so that a subset of classes can be captured in scores and class_id. This would be helpful when there is a large number of classes we model/detect and we want to keep e.g. the 10 most likely. This would add 4+ bytes. As the task really is image -> list of detections, I would keep the header only in the list-of-detections message. With 4 float32 values (or 5 with theta) to define the bounding box with subpixel precision, it would be ~36 bytes for a single 2D detection. This can't go that much lower, I would say.

I see it as a problem that there currently isn't any lightweight message for a bounding box detection or a list of these, which is the typical output of a 2D detector. Having a 3D pose with covariance as part of a 2D detection seems like overkill to me; it is 288 bytes to represent the covariance alone (and an additional 56 bytes for the pose). This also mixes 2D with 3D for no obvious reason - there is a 3D detection message, after all. There could as well be a 3D bounding box in the 2D detection. A similar situation is with the image (~ 35 bytes). The source image is presumably already being sent on some other topic, and matching the detection result with the source image is therefore simple - you just sync the two topics. So I would omit it completely, definitely from the type for a single detection, Detection2D, or use just a compressed image buffer which can be empty.

Regarding tracking IDs - conceptually, it seems to me that these should be a separate thing built on top of these 2D detections.
I just realized where some of the 2D/3D confusion may come from. It seems that now the detection messages are organized more around which input modalities were used to generate them (2D image / 3D lidar) rather than around what information is being produced (2D bounding boxes / 3D poses with bounding boxes). I think the latter would be more useful since, in general, there can be any combination of modalities on input and the latter says what can be done further with the outputs.
I agree with you that the messages are quite complex at the moment (maybe too complex) and it's worth thinking about having a simpler variant.

However, I wouldn't worry too much about a couple of bytes. A single 640x480 RGB image is about 900 kB, so it doesn't matter too much whether the detection message for that image is 0.1 kB or 0.5 kB. A much better argument for removing weird fields that will probably not be filled in a large chunk of use cases is that it will make the messages clearer, less ambiguous and easier to use.
Yes, I think you are right. I have a good example for this: DOPE uses 2D images as input, but it produces 6DoF pose estimations and 3D bounding boxes, so I used Detection3D for its output.

In summary, would you agree that the following changes solve the problems?

```diff
diff --git i/msg/Detection2D.msg w/msg/Detection2D.msg
index e13fedf..c8f87d8 100644
--- i/msg/Detection2D.msg
+++ w/msg/Detection2D.msg
@@ -7,19 +7,12 @@
 Header header
 
 # Class probabilities
-ObjectHypothesisWithPose[] results
+ObjectHypothesis[] results
 
 # 2D bounding box surrounding the object.
 BoundingBox2D bbox
 
-# The 2D data that generated these results (i.e. region proposal cropped out of
-# the image). Not required for all use cases, so it may be empty.
-sensor_msgs/Image source_img
-
-# If true, this message contains object tracking information.
-bool is_tracking
-
 # ID used for consistency across multiple detection messages. This value will
 # likely differ from the id field set in each individual ObjectHypothesis.
-# If you set this field, be sure to also set is_tracking to True.
+# This field may be empty.
 string tracking_id
```

That would result in the following Detection2D.msg.
(I assume you meant `vision_msgs/ObjectHypothesis[] results` in the bottom code block.)
Yes, that was a copy-paste error. I've edited my comment above and fixed it, thanks!
I think that the same thing can be done with Detection3D.

I would consider allowing a generic binary source blob so that a source-data digest of any kind can be included. I know it has drawbacks, but it could prevent people from devising their own message types just to include the compressed image crops, point clouds, radar data etc. they happen to have in their scenario, with minimal penalty for those who don't need this.
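As an illustration of the idea, a sketch of packing an arbitrary ROS message into a generic byte blob plus a format tag; the `source_data`/`source_format` field names are hypothetical, but the rclpy serialization helpers are real:

```python
from rclpy.serialization import serialize_message, deserialize_message
from sensor_msgs.msg import CompressedImage


def pack_source(img: CompressedImage):
    """Return (blob, format) suitable for a hypothetical generic source-data field."""
    return list(serialize_message(img)), 'sensor_msgs/CompressedImage'


def unpack_source(blob, fmt) -> CompressedImage:
    """Recover the original message from the blob, checking the format tag."""
    assert fmt == 'sensor_msgs/CompressedImage'
    return deserialize_message(bytes(blob), CompressedImage)
```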
So, trying to accumulate the changes that have some momentum in order to move this conversation forward to specific actions. I'm going to have a list below of owners and actions we should take to split these discussions out into appropriate locations so we can take action:
Topics discussed that we're probably not going to take action on
@tpet can you briefly summarize tangible changes you'd like to have made?
Done: #47
Could you clarify what you mean by that sentence? I don't understand who's suggesting to remove what. Is this a response to my comment #43 (comment)?

@Kukanani @SteveMacenski: A general question about procedure:
No, no changes on released distributions; the ROS 1 interfaces are done. The state of things now is how they'll be. This is only for new ROS 2 distributions moving forward, since we cannot break ABI in the middle of an established distribution (especially Noetic as the last LTS ROS 1 distribution). All migrations have largely been completed to Noetic by now, so this would be a non-trivial random task appearing out of the blue that would impact a surprisingly large user base.
OK, thanks for the clarification. So all PRs should be targeted against the ROS 2 branch?
Composition with object hypothesis has been merged, the UUID discussions are ongoing, and we haven't started the discussions on the data decoupling yet.
No, each object hypothesis needs its own pose, and so does the bounding box. This is because the bounding box pose describes the bounding box itself (its center in the header frame), while each hypothesis pose describes that object's own origin, which is generally not the bbox center.

Even if it were always in the center of the bounding box, the orientation doesn't have to match. E.g. a flat object could be a keyboard in its "normal" orientation, or a book lying on its side. So you need different poses for each object hypothesis.
@mintar I would tend to define the messages around tasks and their respective outputs, such as 2D/3D bounding box object detection. I meant the Detection2D/3D messages to represent bounding box detection, which is not as clearly visible from the names as it could be. In that case, the task of identifying 3D reference frames (without a bounding box) is simply a different task and should have a different message (or you could just disregard the bounding box size information).
I don't understand how the orientation couldn't match if the bounding box orientation is defined by a full 6DOF pose, same as in ObjectHypothesisWithPose, but without covariance.
Why do you think the messages here should be restricted to bounding box object detection? I believe they should continue to allow full 6DoF pose estimation. This is how they are defined and used at the moment, for example by DOPE.
My point was that the pose in ObjectHypothesisWithPose is different from the bounding box pose. For example, here's a mesh of one of the objects from the standard YCB benchmark dataset (image omitted here). Clearly, the mesh origin is not in the center of the bounding box.

To elaborate on the "book vs keyboard" example:
Of course, that's still assuming we're doing 6DoF pose estimation. The whole point of the "book vs. keyboard" example is that even if we move all the mesh origins to their respective bbox centers, we still need separate poses. If we were just doing some sort of bounding box detection without pose, we wouldn't need the poses in the Detection3D, that's correct.

TL;DR: The bounding box and each object class need their own poses, since they refer to different reference frames.
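To make the example concrete, here is a rough sketch of such a detection in Python, using the Noetic-era field names quoted earlier in this thread; all numbers are made up:

```python
from vision_msgs.msg import Detection3D, ObjectHypothesisWithPose


def flat_object_detection() -> Detection3D:
    det = Detection3D()
    # One flat bounding box (roughly 0.45 x 0.15 x 0.03 m) detected in the header frame.
    det.bbox.center.position.x = 1.0
    det.bbox.size.x, det.bbox.size.y, det.bbox.size.z = 0.45, 0.15, 0.03

    keyboard = ObjectHypothesisWithPose()
    keyboard.id = 'keyboard'
    keyboard.score = 0.6
    # Assume the keyboard mesh origin happens to coincide with the bbox center.
    keyboard.pose.pose = det.bbox.center

    book = ObjectHypothesisWithPose()
    book.id = 'book'
    book.score = 0.4
    book.pose.pose.position.x = 1.0
    # A book mesh is usually modelled standing upright, so lying flat it is
    # rotated 90 degrees about x relative to the same bounding box.
    book.pose.pose.orientation.x = 0.7071068
    book.pose.pose.orientation.w = 0.7071068

    det.results = [keyboard, book]
    return det
```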
I agree with @mintar, I see no real reason to change this. An orientation in 3D is important so the bounding boxes can be non-axis-aligned. The 2D case is less used in current technologies, but there are networks which do give oriented bounding boxes.

The only thing left that I think there's consensus on is the decoupling of data (e.g. removing the PointCloud2 / Image fields from the messages). If someone wants those together, they can use an approximate or exact time message filter to grab both of those with corresponding headers.
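For illustration, a minimal rclpy sketch of re-associating detections with their source images via an approximate-time synchronizer (topic names are placeholders):

```python
import rclpy
import message_filters
from rclpy.node import Node
from sensor_msgs.msg import Image
from vision_msgs.msg import Detection2DArray


class DetectionImageSync(Node):
    def __init__(self):
        super().__init__('detection_image_sync')
        image_sub = message_filters.Subscriber(self, Image, 'camera/image_raw')
        det_sub = message_filters.Subscriber(self, Detection2DArray, 'detections')
        # Pair messages whose header stamps lie within 50 ms of each other.
        self.sync = message_filters.ApproximateTimeSynchronizer(
            [image_sub, det_sub], queue_size=10, slop=0.05)
        self.sync.registerCallback(self.on_pair)

    def on_pair(self, image, detections):
        self.get_logger().info(
            f'{len(detections.detections)} detections for image stamped '
            f'{image.header.stamp.sec}.{image.header.stamp.nanosec:09d}')


def main():
    rclpy.init()
    rclpy.spin(DetectionImageSync())
```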
That's a misunderstanding. I think that there should be messages for standard vision tasks, organized based on what output is produced. E.g., there should be messages for bounding box detection, which could be named Detection2D and Detection3D (BoundingBoxDetection2D and BoundingBoxDetection3D would be clearer).
The bounding box pose in Detection3D is a full 6DoF pose. You can neglect the size of the bounding box if it is not needed.

A static transformation from the bounding box to the mesh origin doesn't have to be represented in each instance of the message, IMHO.

I would say that they would need completely different bounding boxes, as the relative bounding box dimensions along the x-y-z axes would be completely different in these two examples.
Please look in this repository, much of what you describe already exists.
The PR for decoupling from data has been created: #53.
Even after concluding the new UUID discussion as wontfix, I still have some doubts about the concept of tracking IDs inside the detection messages.

Strictly speaking, a detector outputs detections, by which I understand bounding boxes or similar things. But it does not (in my view) associate the bounding boxes with any objects (just with classes). Association to objects is the work of a tracker. Currently, it seems to me that the detection messages mix these two roles.

Take another look: the DetectionXD message contains a list of class probabilities, i.e. it represents uncertain information. However, the tracking ID expresses a definite association with a particular object.

Wouldn't it be better to decouple the tracking information from the detection message and create a "parent" message for tracking? This would also nicely solve the question of what to put in the tracking ID when no tracking is performed.
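A toy sketch of the kind of node that would own IDs under this split: the detector publishes plain detections, and a separate tracker assigns persistent track IDs (pure Python, no ROS API, purely illustrative):

```python
import itertools
import math


class NaiveTracker:
    """Toy example: give each detection a persistent track ID by nearest-centre matching."""

    def __init__(self, max_dist=50.0):
        self.max_dist = max_dist          # pixels
        self.tracks = {}                  # track_id -> last known (x, y) centre
        self._next_id = itertools.count()

    def update(self, centres):
        """`centres` is a list of (x, y) bbox centres from one detection array.
        Returns one track ID per input centre."""
        ids = []
        for x, y in centres:
            best_id, best_d = None, self.max_dist
            for tid, (tx, ty) in self.tracks.items():
                d = math.hypot(x - tx, y - ty)
                if d < best_d:
                    best_id, best_d = tid, d
            if best_id is None:
                best_id = next(self._next_id)   # start a new track
            self.tracks[best_id] = (x, y)
            ids.append(best_id)
        return ids
```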
Sure, but for some detectors the bounding box may not be linked to the specific class of object. One object detection system I worked with did the following:
In this case, the bounding box was the bbox of the point cluster, and the results array contained all possible classifications of that cluster with their poses. In short, I believe the bbox pose and the pose of each classification are generally separate things, and don't need to be connected via fixed transformations, so we need all of them in the message.
Yes, that sounds like an excellent suggestion!
Personally, I would strongly prefer to include the Detection message in the Track message. The extra bandwidth is negligible, and it's much easier to use than using a synchronizer to subscribe to two different topics and then iterate the arrays in those messages in parallel.
But couldn't you just use multiple bounding boxes, each with a single object hypothesis? That would still be much less overhead compared to the current state.
For a single input (image / point cloud) you get multiple of these bounding box detections (similarly to e.g. instance segmentation), so you kind of get objects (not only classes), e.g. you "know" that these are different identities even if you don't link them across multiple timesteps.
I think that could easily become not so negligible, definitely in the current setup, where you have a pose with covariance, which is very verbose. If I understand the proposal correctly, to stream outputs for a single track, I would output messages ever increasing in size. E.g., after tracking for 1 minute at 30 Hz, you would end up with messages of size > 30 * 60 * (36+7) * 8 = 619200 bytes each (7 values for the pose, 36 for the covariance). So having some sort of tracking ID at the single-detection level has some advantages. You can easily collect these messages to get the whole track, both in online and offline/batch processing cases.
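Spelling out that back-of-the-envelope calculation:

```python
rate_hz, seconds = 30, 60
values_per_pose = 7 + 36      # pose (7 float64s) + 6x6 covariance (36 float64s)
bytes_per_value = 8           # float64
print(rate_hz * seconds * values_per_pose * bytes_per_value)  # 619200
```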
My understanding is that per one input image/cloud/..., the detector outputs one DetectionXDArray message with all the detections for that input.
That's why I'm inclined not to include the detection messages inside tracks, because a track is the result of processing many detections. Maybe, if it's convenient, we could keep there a field with references to the contributing detections.
I don't think anyone was advocating for storing the entire Detection history in the Track message. Instead, the point was just that Tracks are based on Detections over time. I certainly am against storing detection history, anyway. But it is also possible that a single "current location of a specific object" as provided by one tracker could be based on multiple detections (from different detectors). Examples: sensor fusion, any object being tracked by multiple cameras spread across a volume.
Because in practice, the difference is small enough that a single message type works for both.
This is starting to sound like a philosophical debate. One more thing: as a general rule, I greatly prefer basing these discussions on actual use cases rather than just debating what the "platonic ideals" of messages should be.
Closing due to inactivity. |
Hi, I came here from the Discourse post, and I expected there would be some prepared space for discussion. I haven't found it, so I'm creating it here :) If I misunderstood how ideas/comments should be added, please tell me where to post. Generally, I'm not a fan of all-in-one issues, but as you asked more for opinions than for resolute statements/bug reports, it would be complicated to post them one at a time. So I'll try to separate the ideas by paragraphs, but the thoughts are intertwined sometimes.
2D?: One general idea - do you think the 2D cases are so important as to deserve their own messages? There are suggestions that full 3D information should be used whenever possible. And even now you're using e.g. the 3D ObjectHypothesisWithPose in Detection2D. I bet that bit-size is not a limiting factor in most vision detection applications. I know there will be the hassle of theta vs. quaternion, but if you think it through, it's not that bad. An angle theta representing yaw around z boils down to the quaternion (0, 0, sin(theta/2), cos(theta/2)). And in a lot of further processing, you would turn theta into sin/cos values anyway. See also my further points about helper functions and decoupling from source data.

Helper functions: If you fear users would need to write boilerplate code (as was expressed in #25), this can be fixed by suitable helper functions inside vision_msgs. You can provide these functions in libraries (as you do here for bounding box construction, or as sensor_msgs does). gencpp can go one step further and provide the functionality right in the generated header files (via the plugin infrastructure: it can add constructors, custom class methods, operators, and other free functions). Some off-the-shelf examples (see also the sketch after this list):
- BoundingBox(Point, Vector3, theta) (would do the trig to correctly fill the quaternion in the 2D case),
- double BoundingBox::theta() (would do the inverse),
- BoundingBox(Point, Vector3) (construct a 2D/3D axis-aligned box from center and extents),
- BoundingBox(Point, Point) (construct a 2D/3D axis-aligned box from the min point and max point),
- std::string ObjectHypothesis::stringID() (convert the UUID to a string; see my point on UUID),
- static UniqueID ObjectHypothesis::generateID().
AFAIK, genpy doesn't have such a plugin mechanism, but providing a helper module still seems like a sufficient way.
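As a sketch of the theta-related helpers mentioned in the list above (function names are hypothetical):

```python
import math


def quaternion_from_theta(theta):
    """(x, y, z, w) quaternion for a yaw of `theta` radians around the z axis."""
    return (0.0, 0.0, math.sin(theta / 2.0), math.cos(theta / 2.0))


def theta_from_quaternion(x, y, z, w):
    """Inverse of the above, for pure-yaw quaternions."""
    return 2.0 * math.atan2(z, w)


# Example: a 90-degree yaw survives the round trip.
q = quaternion_from_theta(math.pi / 2)
assert abs(theta_from_quaternion(*q) - math.pi / 2) < 1e-9
```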
Decouple from data: Some messages here include a field carrying some source data. I understand it is very practical on one hand (we do it too in our custom messages), but on the other hand it generates issues like #20 (and possibly future issues like sensor_msgs/PointCloud support etc.), and these have no good solution with the data inside the message, except for composition on a higher level. I propose to remove the source data fields from vision_msgs, and possibly leave a note in the message definition that the user should either use a message synchronizer to get the image-detection pairs, or create a custom type that contains both the image (cloud, ...) and the detection message. It's easy to use topic_tools/relay_field to extract the desired parts afterwards.

There are situations where you do not want to relate the detection to the whole image, but you want to e.g. have the cutout from the image where the detection occurred. This would work the same in case one detection per image is published (you'd just publish the cutout on a new topic). But it would become a bit more complicated in case the detector can return multiple detections per image and you would use a DetectionArray. Here, matching by header would no longer be enough. Or - if taken rigorously - it could be, because making a cutout from an image formally invalidates its frame_id, so each cutout should have its own frame_id, and then you could make the detection-image mapping.

Some more rationale: someone might want to run detections on an organized 3D pointcloud from a depth camera and return 2D detections (because he's e.g. only interested in bounding boxes in the RGB image in the end). And a practical issue we faced: on a low-bandwidth link, we wanted to send detections, but as I mentioned earlier, the image is already a part of our messages (similarly to how it is done here now). We figured out we did not want to transmit the image over the slow link, so we ended up not filling any values in the image fields, and now we have detections which claim to be detected on 0x0 px images. That doesn't really seem like a conceptual approach. Still, I'm not really 100% sure on this point; it seems to me decoupling is the "more correct" way, but it has its pros and cons.

The bad thing is that using composition to create a DetectionsWithImagesArray message would not subsequently allow using topic_tools/relay_field, because you would need to relay primitive-array fields, which I guess is not supported.

UUID: Maybe it's time to revisit the proposal to use UUIDs (#25). Recently, ros-perception/radar_msgs#3 agreed on using them for a very similar thing (RadarTrack, which is the same as a tracked detection in terms of vision_msgs). (Actually, there's still one review pending for it to be merged.)
geometry_msgs: I'd really like to see both BoundingBox classes in geometry_msgs. That makes much more sense (but is a much longer run :) ). Bounding boxes are used outside of vision, too, e.g. in laser filtering. Eigen also has its AlignedBox class in the Geometry module. And, taking into account my point about 2D and helper functions, it might be sufficient to transfer only the 3D box and get rid of the 2D box. I could help with this one because I already did some PRs regarding bounding boxes to MoveIt and Eigen...
Duplicate headers: The *Array messages contain a header, but then they contain the array of Detections, each of which also carries a header. That seems either unnecessary, or not well documented (to help the user distinguish what values to fill in each header). You could either remove the headers from the non-array versions and state that even single-object detectors have to publish an array, or you could make stamped and unstamped versions of the messages.

ObjectHypothesisWithPose should use composition: The ObjectHypothesisWithPose message should rather use composition to include an ObjectHypothesis object. I know this adds a bit more typing when creating and filling the message, but it allows using tools specialized at processing ObjectHypothesis messages, whereas the current implementation would require the tool to explicitly support both the variant with pose and without it, even though it might not be interested in the pose at all. I know there are some counter-examples in common_msgs (mainly the stamped/unstamped message versions), but I somehow always disliked the fact that they were not interchangeable when I was not interested in the timestamps or frames.
VisionInfo latching: I suggest explicitly mentioning that VisionInfo should be published as a latched topic.
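In ROS 1 that would be a one-line change on the publisher side (topic name is just an example):

```python
import rospy
from vision_msgs.msg import VisionInfo

# latch=True re-delivers the last VisionInfo message to every new subscriber.
pub = rospy.Publisher('vision_info', VisionInfo, queue_size=1, latch=True)
```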
That's it for now. Do not treat these suggestions as some must-haves, it's just feedback which should help this package converge closer to something we could really use instead of our custom messages.