Why is a body skeleton required as input for the first-stage structure model? #3

Open
HiSultryMan opened this issue Oct 15, 2023 · 3 comments

Comments

@HiSultryMan

Can we just use text as input to enforce the joint learning of image appearance, spatial relationships, and geometry in a unified network?

@alvinliu0
Contributor

Hi, many thanks for your interest in our work!

  1. It's possible to use only text as input for joint learning. In our preliminary experiments, we tried a model that takes only the text prompt as input and managed to learn RGB, depth, and normal jointly.

  2. We use the body skeleton as input mainly for two reasons:

    1. As a control signal that users can obtain very easily, the body skeleton can be drawn or dragged via keypoint locations, which enables freer and more controllable generation;
    2. We follow the settings of current SOTA text-to-human generation methods, i.e., ControlNet, T2I-Adapter, and HumanSD, and take the 17-keypoint COCO skeleton for a fair setting and comparison (a rough sketch of how such a skeleton becomes a conditioning image is shown after this list).
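
As referenced above, here is a minimal illustrative sketch of how a 17-keypoint COCO skeleton could be rasterized into a pose conditioning image in the ControlNet / T2I-Adapter / HumanSD style. This is not the preprocessing code used in this repo; the limb pairs, colors, and example keypoint values are assumptions:

```python
# Illustrative sketch: rasterize 17 COCO keypoints into an RGB pose conditioning image.
import numpy as np
import cv2

# COCO order: 0 nose, 1-2 eyes, 3-4 ears, 5-6 shoulders, 7-8 elbows,
# 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles.
COCO_LIMBS = [
    (5, 7), (7, 9),      # left arm
    (6, 8), (8, 10),     # right arm
    (11, 13), (13, 15),  # left leg
    (12, 14), (14, 16),  # right leg
    (5, 6), (11, 12),    # shoulders, hips
    (5, 11), (6, 12),    # torso
    (0, 5), (0, 6),      # head to shoulders
]

def draw_skeleton(keypoints, size=(512, 512)):
    """Draw (x, y, visibility) keypoints and limbs onto a black canvas."""
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for a, b in COCO_LIMBS:
        xa, ya, va = keypoints[a]
        xb, yb, vb = keypoints[b]
        if va > 0 and vb > 0:  # draw a limb only if both endpoints are visible
            cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), (0, 255, 0), 4)
    for x, y, v in keypoints:
        if v > 0:
            cv2.circle(canvas, (int(x), int(y)), 6, (255, 0, 0), -1)
    return canvas

# Hypothetical pose a user could draw or drag via keypoint locations.
pose = np.array(
    [[256, 80, 2]] + [[0, 0, 0]] * 4 +            # nose visible, eyes/ears omitted
    [[200, 150, 2], [312, 150, 2],                # shoulders
     [180, 230, 2], [332, 230, 2],                # elbows
     [170, 310, 2], [342, 310, 2],                # wrists
     [220, 300, 2], [292, 300, 2],                # hips
     [215, 400, 2], [297, 400, 2],                # knees
     [212, 490, 2], [300, 490, 2]], dtype=float)  # ankles
cond_image = draw_skeleton(pose)  # this map is what a pose-conditioned model sees
```

The same rasterized map can then be injected wherever the first-stage structure model expects its skeleton conditioning input.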

Best

@HiSultryMan
Author

Thank you for your reply. Do you get a similarly considerable improvement when using only text as input for joint learning? A purely text-to-image model is much simpler to use.

@alvinliu0
Contributor

Yes, there is considerable improvement over the baselines given only text. When available, incorporating additional pose guidance provides more structural information and better visual quality, as also verified in ControlNet and T2I-Adapter. If you don't want to input a pose, a naive extension would be to use an LLM for text-to-pose and then use HyperHuman for generation (a rough sketch of this idea follows below). We will explore this in future work, as explained in the limitations and future work part of the last section.
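
For concreteness, that naive extension could be glued together roughly as follows; `query_llm` and `hyperhuman_generate` are hypothetical placeholders, not functions from this repo or any specific LLM API:

```python
# Hypothetical glue for the "LLM text-to-pose, then pose-conditioned generation" idea.
# `query_llm` and `hyperhuman_generate` are stand-ins supplied by the caller.
import json

POSE_INSTRUCTION = (
    "Return ONLY a JSON list of 17 [x, y, visibility] triplets in COCO keypoint "
    "order (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) "
    "for a 512x512 image of: {prompt}"
)

def text_to_pose(prompt, query_llm):
    """Ask the LLM for a plausible 17-keypoint pose that matches the text prompt."""
    raw = query_llm(POSE_INSTRUCTION.format(prompt=prompt))
    keypoints = json.loads(raw)
    assert len(keypoints) == 17, "expected one [x, y, visibility] triplet per COCO keypoint"
    return keypoints

def text_only_generate(prompt, query_llm, hyperhuman_generate):
    """Text-only interface: derive a pose first, then run pose-conditioned generation."""
    keypoints = text_to_pose(prompt, query_llm)
    return hyperhuman_generate(prompt=prompt, pose_keypoints=keypoints)
```

Whether an LLM can place keypoints plausibly enough for this to work well is, of course, the open question left to future work.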

Best
