Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet, one of the most popular approaches to this task, adds an auxiliary conditioning module to the architecture. To improve alignment between the generated image and the control signal, ControlNet++ proposes a cycle consistency loss that refines the correspondence between controls and outputs, but applies it only at the final denoising steps, even though the main image structure is laid out at early generation stages. To address this limitation, we propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Specifically, we train lightweight control prediction probes (small convolutional networks) to reconstruct the input control signal (e.g., edges or depth) from intermediate UNet features at every denoising step. We show that such probes extract reliable signals even from very noisy latents and use them to generate pseudo ground-truth controls during training. This enables an alignment loss that minimizes the difference between the predicted and target condition throughout the whole diffusion process. Our experiments demonstrate that the method improves both control alignment and generation fidelity. Combined with established training techniques such as ControlNet++, it achieves strong performance across different condition types, including edge and depth maps.
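The core mechanism is straightforward to sketch: a small convolutional probe maps intermediate UNet features back to a control map, and an alignment loss compares that prediction with the input control at every denoising step. The snippet below is an illustrative sketch only; the probe architecture, channel sizes, and the plain MSE loss are assumptions and do not reproduce the exact implementation in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ControlProbe(nn.Module):
    """Lightweight convolutional head that predicts a control map
    (e.g., depth or edges) from intermediate UNet features."""

    def __init__(self, in_channels: int, out_channels: int = 1):
        super().__init__()
        # Illustrative architecture; the real probes may differ.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor, target_hw) -> torch.Tensor:
        pred = self.net(feats)
        # Upsample the prediction to the resolution of the target control map.
        return F.interpolate(pred, size=target_hw, mode="bilinear", align_corners=False)


def alignment_loss(probe: ControlProbe, unet_feats: torch.Tensor,
                   target_control: torch.Tensor) -> torch.Tensor:
    """Distance between the probe's prediction and the input control,
    computed at an arbitrary (possibly very noisy) denoising step."""
    pred = probe(unet_feats, target_control.shape[-2:])
    return F.mse_loss(pred, target_control)
```

In practice, such a loss would be added to the standard ControlNet++ training objective with an appropriate weight rather than used on its own.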
git clone https://github.com/control/InnerControl.git
pip3 install -r requirements.txt
pip3 install clean-fid
pip3 install torchmetrics
All the organized data has been uploaded to Hugging Face and will be downloaded automatically during training or evaluation. You can preview it in advance to check the data samples and the required disk space via the following links (see also the preview sketch after the table).
Task | Training Data 🤗 | Evaluation Data 🤗 |
---|---|---|
LineArt, HED | Data, 1.14 TB | Data, 2.25 GB |
Depth | Data, 1.22 TB | Data, 2.17 GB |
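If you only want to inspect a few samples without downloading the full datasets, a streaming preview with the `datasets` library works; the repository id below is a placeholder for the dataset links in the table:

```python
from datasets import load_dataset

# "ORG/DATASET_NAME" is a placeholder -- use the dataset id from the links above.
ds = load_dataset("ORG/DATASET_NAME", split="train", streaming=True)

# Streaming avoids downloading the multi-terabyte training sets up front.
sample = next(iter(ds))
print(sample.keys())
```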
We provide an example of applying our pretrained model to generate images in the notebook.
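As a rough orientation, inference follows the usual ControlNet pattern. The sketch below assumes the released ControlNet weights are in diffusers format; the checkpoint path, base model, control image, and prompt are placeholders rather than the exact setup from the notebook.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder paths: point these at the downloaded checkpoint and a control image.
controlnet = ControlNetModel.from_pretrained("checkpoints/depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = Image.open("example_depth.png")  # precomputed depth/edge map
result = pipe("a cozy living room", image=control_image, num_inference_steps=20).images[0]
result.save("generated.png")
```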
By default, we conduct training on 8 A100-80G GPUs. You can change the number of GPUs used in the train/config.yaml file. If you lack sufficient computational resources, you can reduce the batch size and increase gradient accumulation accordingly (e.g., halving the per-GPU batch size while doubling the gradient accumulation steps keeps the effective batch size unchanged).
You can then directly perform reward-alignment fine-tuning:
bash train/aligned_depth.sh
bash train/aligned_hed.sh
bash train/aligned_linedrawing.sh
Please download the model weights and place them into the corresponding subfolders of the checkpoints directory:
model | ControlNet weights | Align model |
---|---|---|
LineArt | model | model |
Depth | model | model |
HED (SoftEdge) | model | model |
Please make sure the folder structure is consistent with the test scripts; you can then evaluate each model with:
bash eval/eval_depth.sh
bash eval/eval_hed.sh
bash eval/eval_linedrawing.sh
To evaluate CLIP and FID:
bash eval/eval_clip.sh
bash eval/eval_fid.sh
For FID evaluation, you should additionally save the dataset images into a separate folder.
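For reference, clean-fid (installed above) can compute FID directly between two image folders; the folder names below are placeholders, and eval/eval_fid.sh presumably wraps a similar call:

```python
from cleanfid import fid

# Placeholder folders: generated samples vs. dataset images saved beforehand.
score = fid.compute_fid("outputs/generated", "outputs/real")
print(f"FID: {score:.2f}")
```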
We sincerely thank the Hugging Face, ControlNet, ControlNet++, and Readout Guidance communities for their open-source code and contributions. Our project would not be possible without these amazing works.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@misc{konovalova2025heedinginnervoicealigning,
title={Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback},
author={Nina Konovalova and Maxim Nikolaev and Andrey Kuznetsov and Aibek Alanov},
year={2025},
eprint={2507.02321},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.02321},
}