Why is a body skeleton required as input for the first-stage structure model? #3

Open
HiSultryMan opened this issue Oct 15, 2023 · 3 comments

Comments

@HiSultryMan

Can we just use text as input to enforce the joint learning of image appearance, spatial relationships, and geometry in a unified network?

@alvinliu0
Contributor

Hi, many thanks for your interest in our work!

  1. It's possible to use only text as input for joint learning. In our preliminary experiments, we tried a model that takes only the text prompt as input and managed to learn RGB, depth, and normal jointly.

  2. We use the body skeleton as input mainly for two reasons:

    1. As a control signal that users can obtain very easily, the body skeleton can be drawn or dragged via keypoint locations, which enables freer and more controllable generation;
    2. We follow the settings of current SOTA text-to-human generation methods, i.e., ControlNet, T2I-Adapter, and HumanSD, and take the 17-keypoint COCO skeleton for a fair setting and comparison (a rough sketch of how such a skeleton becomes a conditioning image is shown after this list).
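
As referenced above, here is a minimal illustrative sketch of how a 17-keypoint COCO skeleton could be rasterized into a pose conditioning image in the ControlNet / T2I-Adapter / HumanSD style. This is not the preprocessing code used in this repo; the limb pairs, colors, and example keypoint values are assumptions:

```python
# Illustrative sketch: rasterize 17 COCO keypoints into an RGB pose conditioning image.
import numpy as np
import cv2

# COCO order: 0 nose, 1-2 eyes, 3-4 ears, 5-6 shoulders, 7-8 elbows,
# 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles.
COCO_LIMBS = [
    (5, 7), (7, 9),      # left arm
    (6, 8), (8, 10),     # right arm
    (11, 13), (13, 15),  # left leg
    (12, 14), (14, 16),  # right leg
    (5, 6), (11, 12),    # shoulders, hips
    (5, 11), (6, 12),    # torso
    (0, 5), (0, 6),      # head to shoulders
]

def draw_skeleton(keypoints, size=(512, 512)):
    """Draw (x, y, visibility) keypoints and limbs onto a black canvas."""
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for a, b in COCO_LIMBS:
        xa, ya, va = keypoints[a]
        xb, yb, vb = keypoints[b]
        if va > 0 and vb > 0:  # draw a limb only if both endpoints are visible
            cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), (0, 255, 0), 4)
    for x, y, v in keypoints:
        if v > 0:
            cv2.circle(canvas, (int(x), int(y)), 6, (255, 0, 0), -1)
    return canvas

# Hypothetical pose a user could draw or drag via keypoint locations.
pose = np.array(
    [[256, 80, 2]] + [[0, 0, 0]] * 4 +            # nose visible, eyes/ears omitted
    [[200, 150, 2], [312, 150, 2],                # shoulders
     [180, 230, 2], [332, 230, 2],                # elbows
     [170, 310, 2], [342, 310, 2],                # wrists
     [220, 300, 2], [292, 300, 2],                # hips
     [215, 400, 2], [297, 400, 2],                # knees
     [212, 490, 2], [300, 490, 2]], dtype=float)  # ankles
cond_image = draw_skeleton(pose)  # this map is what a pose-conditioned model sees
```

The same rasterized map can then be injected wherever the first-stage structure model expects its skeleton conditioning input.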

Best

@HiSultryMan
Author

Thank you for your reply. Do you get a similarly considerable improvement when using only text as input for joint learning? A purely text-to-image model is much simpler to use.

@alvinliu0
Contributor

Yes, there is considerable improvement over the baselines given only text. When available, incorporating additional pose guidance provides more structural information and better visual quality, as also verified in ControlNet and T2I-Adapter. If you don't want to input a pose, a naive extension would be to use an LLM for text-to-pose and then use HyperHuman for generation (a rough sketch of this idea follows below). We will explore this in future work, as explained in the limitations and future work part of the last section.
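
For concreteness, that naive extension could be glued together roughly as follows; `query_llm` and `hyperhuman_generate` are hypothetical placeholders, not functions from this repo or any specific LLM API:

```python
# Hypothetical glue for the "LLM text-to-pose, then pose-conditioned generation" idea.
# `query_llm` and `hyperhuman_generate` are stand-ins supplied by the caller.
import json

POSE_INSTRUCTION = (
    "Return ONLY a JSON list of 17 [x, y, visibility] triplets in COCO keypoint "
    "order (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) "
    "for a 512x512 image of: {prompt}"
)

def text_to_pose(prompt, query_llm):
    """Ask the LLM for a plausible 17-keypoint pose that matches the text prompt."""
    raw = query_llm(POSE_INSTRUCTION.format(prompt=prompt))
    keypoints = json.loads(raw)
    assert len(keypoints) == 17, "expected one [x, y, visibility] triplet per COCO keypoint"
    return keypoints

def text_only_generate(prompt, query_llm, hyperhuman_generate):
    """Text-only interface: derive a pose first, then run pose-conditioned generation."""
    keypoints = text_to_pose(prompt, query_llm)
    return hyperhuman_generate(prompt=prompt, pose_keypoints=keypoints)
```

Whether an LLM can place keypoints plausibly enough for this to work well is, of course, the open question left to future work.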

Best
