
How to reproduce the MobilePose v2 result? Which diagonal edge is used for normalization? #71

Learningm opened this issue Aug 19, 2022 · 4 comments

@Learningm

Hi, I am interested in this amazing work, but I wonder how to reproduce the MobilePose v2 result.

How should I understand the loss 'per-vertex MSE normalized by diagonal edge length'? What do you mean by diagonal edge length, 2D or 3D? I guess it should be 2D because the output keypoints are 2D, but which diagonal edge exactly? The cuboid has six faces, so there are 12 face diagonals, plus the space diagonals of the 3D cuboid itself, and the space diagonals no longer have equal length once they are projected into 2D space.
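
For what it's worth, here is one plausible reading of that loss in PyTorch: divide both the predicted and ground-truth keypoints by the diagonal of the tight 2D box around the ground-truth keypoints, then take the MSE. The choice of that box diagonal as the normalizer is purely my assumption, not something the paper confirms.

```python
import torch

def normalized_vertex_mse(pred_kpts, gt_kpts, eps=1e-8):
    """Per-vertex MSE normalized by a 2D diagonal length.

    pred_kpts, gt_kpts: (B, 9, 2) tensors of 2D keypoints in pixels.
    The normalizer is the diagonal of the tight 2D box around the
    ground-truth keypoints -- an assumption, not confirmed by the paper.
    """
    mins = gt_kpts.min(dim=1).values              # (B, 2)
    maxs = gt_kpts.max(dim=1).values              # (B, 2)
    diag = torch.norm(maxs - mins, dim=-1)        # (B,)

    scale = diag.clamp(min=eps)[:, None, None]    # avoid division by zero
    return torch.nn.functional.mse_loss(pred_kpts / scale, gt_kpts / scale)
```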

I guess the training pipeline should be:

  1. Train the 2D detector on 2D bounding-box data.
  2. Use the 2D detector (ground-truth or predicted boxes both seem fine) to generate a cropping region, crop the image, adjust the ground-truth keypoints according to the crop, then use the backbone to predict the 9 2D keypoints and compute the loss (see the sketch after this list).
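
To make step 2 concrete, here is a minimal sketch of the cropping and keypoint adjustment, assuming a detector box (x0, y0, x1, y1) in full-image pixels and 9 ground-truth keypoints in the same frame (the function and variable names are mine, not from the repo):

```python
import cv2
import numpy as np

def crop_and_adjust(image, box, keypoints, out_size=224):
    """Crop the detector box, resize it to out_size, and move the 2D
    keypoints into the resized crop's pixel frame."""
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    crop = cv2.resize(image[y0:y1, x0:x1], (out_size, out_size))

    # Shift to the crop origin, then scale to the resized crop.
    scale = np.array([out_size / (x1 - x0), out_size / (y1 - y0)])
    kpts = (np.asarray(keypoints, dtype=np.float32) - np.array([x0, y0])) * scale
    return crop, kpts
```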

Could you explain this part (which diagonal edge is used) in more detail? Thank you very much.

@Mechazo11

Hi @Learningm, I know this is an old issue, but have you figured out how, after computing the 9 2D keypoints, they were lifted to 3D space using EPnP?
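
Not speaking for the authors, but a common way to do that lifting is OpenCV's EPnP solver, assuming you know the object's 3D box dimensions (so you can write down the centroid plus eight corners in the object frame) and the camera intrinsics, and that your corner ordering matches the network's:

```python
import cv2
import numpy as np

def lift_with_epnp(kpts_2d, box_size, K):
    """Recover the object pose from the 9 predicted 2D keypoints via EPnP.

    kpts_2d:  (9, 2) keypoints in image pixels (centroid first, then corners)
    box_size: (w, h, d) of the object's 3D bounding box -- assumed known
    K:        3x3 camera intrinsic matrix
    """
    w, h, d = box_size
    # Model points in the object frame: centroid + 8 corners. The corner
    # ordering below is arbitrary and must match the keypoint convention
    # the network was trained with.
    corners = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    model_pts = np.vstack([np.zeros((1, 3)), corners]).astype(np.float64)

    ok, rvec, tvec = cv2.solvePnP(model_pts, np.asarray(kpts_2d, np.float64),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # object-to-camera rotation and translation
```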

@Learningm

> Hi @Learningm, I know this is an old issue, but have you figured out how, after computing the 9 2D keypoints, they were lifted to 3D space using EPnP?

@Mechazo11 I didn't figure it out. It's hard to reimplement with so few details given in the paper.

@Mechazo11

@Learningm I am going to try this direction; sharing it here to ask whether it makes sense to you too.

We start by passing a 224x224x3 PyTorch tensor of an object crop. Let's call the origin of this image crop $O$. Since the crop is part of the full image, we also know $O$'s coordinates in the global image frame. Global here means the coordinate frame of the entire image, not just the crop.

After ingesting the tensor, the network gives me 9 2D offsets, normalized by dividing them by the length of the diagonal of the image crop. The first row is the centroid and the rest are the eight corners (in whatever order they were defined).

To compute the loss I also calculate the normalized offsets of the 9 ground-truth 2D keypoints from $O$. The loss should then be the MSE or smooth L1 score between the two sets of offsets, plus any penalty term that is excluded from backpropagation.

What do you think of this approach? My idea is that rather than doing a coordinate shift, we capitalize on the well-known top-left origin that is commonly associated with 2D images.
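
If I read that description correctly, the loss part would look roughly like this in PyTorch (names are hypothetical); offsets are measured from the crop's top-left corner $O$ and normalized by the crop diagonal:

```python
import math
import torch
import torch.nn.functional as F

def offset_loss(pred_offsets, gt_kpts_crop, crop_size=224, smooth_l1=True):
    """Compare predicted and ground-truth offsets from the crop origin O.

    pred_offsets: (B, 9, 2) network output, already normalized by the crop diagonal
    gt_kpts_crop: (B, 9, 2) ground-truth keypoints in crop pixel coordinates,
                  i.e. offsets from the top-left corner O of the crop
    """
    diag = math.sqrt(2.0) * crop_size        # diagonal of the square crop
    gt_offsets = gt_kpts_crop / diag         # normalize the ground truth the same way
    if smooth_l1:
        return F.smooth_l1_loss(pred_offsets, gt_offsets)
    return F.mse_loss(pred_offsets, gt_offsets)
```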

@Learningm

@Mechazo11 I'm afraid I can't comment on your approach, since I have moved on to other topics instead of continuing in this direction.

As far as I know, this recent work on pose estimation looks pretty good: https://github.com/NVlabs/FoundationPose. Hope it helps!
