Why is a body skeleton required as input for the first-stage structure model? #3
Comments
Hi, many thanks for your interest in our work!
Best
Thank you for your reply. Do you get a similarly considerable improvement when using only text as input for joint learning? I ask because a purely text-to-image model is much simpler to use.
Yeah, there is considerable improvement over baselines given only text. Where applicable, incorporating additional pose guidance provides more structural guidance and better visual quality, as also verified in ControlNet and T2I-Adapter. If you don't want to input a pose, a naive extension might be to use an LLM for text-to-pose and then use HyperHuman for generation. We will explore this in future work, as explained in the limitations and future work discussion in the last section. Best
Can we just use text as input to enforce the joint learning of image appearance, spatial relationship, and geometry in a unified network?