Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet, one of the most popular approaches to this task, adds an auxiliary conditioning module to the architecture. To improve alignment between the generated image and the control signal, ControlNet++ proposes a cycle consistency loss that refines the correspondence between controls and outputs, but applies it only at the final denoising steps, even though the main image structure is laid out at early generation stages. To address this limitation, we propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Specifically, we train lightweight control prediction probes (small convolutional networks) to reconstruct the input control signal (e.g., edges or depth) from intermediate UNet features at every denoising step. We show that such probes extract reliable signals even from very noisy latents and use them to generate pseudo ground-truth controls during training. This enables an alignment loss that minimizes the difference between the predicted and target condition throughout the whole diffusion process. Our experiments demonstrate that the method improves both control alignment and generation fidelity. Combined with established training techniques such as ControlNet++, it achieves strong performance across different condition types, including edge and depth maps.
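The core mechanism is straightforward to sketch: a small convolutional probe maps intermediate UNet features back to a control map, and an alignment loss compares that prediction with the input control at every denoising step. The snippet below is an illustrative sketch only; the probe architecture, channel sizes, and the plain MSE loss are assumptions and do not reproduce the exact implementation in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ControlProbe(nn.Module):
    """Lightweight convolutional head that predicts a control map
    (e.g., depth or edges) from intermediate UNet features."""

    def __init__(self, in_channels: int, out_channels: int = 1):
        super().__init__()
        # Illustrative architecture; the real probes may differ.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor, target_hw) -> torch.Tensor:
        pred = self.net(feats)
        # Upsample the prediction to the resolution of the target control map.
        return F.interpolate(pred, size=target_hw, mode="bilinear", align_corners=False)


def alignment_loss(probe: ControlProbe, unet_feats: torch.Tensor,
                   target_control: torch.Tensor) -> torch.Tensor:
    """Distance between the probe's prediction and the input control,
    computed at an arbitrary (possibly very noisy) denoising step."""
    pred = probe(unet_feats, target_control.shape[-2:])
    return F.mse_loss(pred, target_control)
```

In practice, such a loss would be added to the standard ControlNet++ training objective with an appropriate weight rather than used on its own.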
git clone https://github.com/control/InnerControl.git
pip3 install -r requirements.txt
pip3 install clean-fid
pip3 install torchmetrics
All the organized data has been uploaded to Hugging Face and will be downloaded automatically during training or evaluation. You can preview it in advance to check the data samples and the required disk space via the following links (see also the preview sketch after the table).
Task | Training Data 🤗 | Evaluation Data 🤗 |
---|---|---|
LineArt, HED | Data, 1.14 TB | Data, 2.25 GB |
Depth | Data, 1.22 TB | Data, 2.17 GB |
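If you only want to inspect a few samples without downloading the full datasets, a streaming preview with the `datasets` library works; the repository id below is a placeholder for the dataset links in the table:

```python
from datasets import load_dataset

# "ORG/DATASET_NAME" is a placeholder -- use the dataset id from the links above.
ds = load_dataset("ORG/DATASET_NAME", split="train", streaming=True)

# Streaming avoids downloading the multi-terabyte training sets up front.
sample = next(iter(ds))
print(sample.keys())
```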
We provide an example of applying our pretrained model to generate images in the notebook.
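As a rough orientation, inference follows the usual ControlNet pattern. The sketch below assumes the released ControlNet weights are in diffusers format; the checkpoint path, base model, control image, and prompt are placeholders rather than the exact setup from the notebook.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder paths: point these at the downloaded checkpoint and a control image.
controlnet = ControlNetModel.from_pretrained("checkpoints/depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = Image.open("example_depth.png")  # precomputed depth/edge map
result = pipe("a cozy living room", image=control_image, num_inference_steps=20).images[0]
result.save("generated.png")
```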
By default, we conduct training on 8 A100-80G GPUs. You can change the number of GPUs used in the train/config.yaml file. If you lack sufficient computational resources, you can reduce the batch size and increase gradient accumulation accordingly (e.g., halving the per-GPU batch size while doubling the gradient accumulation steps keeps the effective batch size unchanged).
You can then directly perform reward-alignment fine-tuning:
bash train/aligned_depth.sh
bash train/aligned_hed.sh
bash train/aligned_linedrawing.sh
Please download the model weights and place them into the corresponding subfolders of the checkpoints directory:
model | ControlNet weights | Align model |
---|---|---|
LineArt | model | model |
Depth | model | model |
HED (SoftEdge) | model | model |
Please make sure the folder structure is consistent with the test scripts; you can then evaluate each model with:
bash eval/eval_depth.sh
bash eval/eval_hed.sh
bash eval/eval_linedrawing.sh
To evaluate CLIP and FID:
bash eval/eval_clip.sh
bash eval/eval_fid.sh
For FID evaluation, you should additionally save the dataset images into a separate folder.
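For reference, clean-fid (installed above) can compute FID directly between two image folders; the folder names below are placeholders, and eval/eval_fid.sh presumably wraps a similar call:

```python
from cleanfid import fid

# Placeholder folders: generated samples vs. dataset images saved beforehand.
score = fid.compute_fid("outputs/generated", "outputs/real")
print(f"FID: {score:.2f}")
```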
We sincerely thank the Hugging Face, ControlNet, ControlNet++, and Readout Guidance communities for their open-source code and contributions. Our project would not be possible without these amazing works.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@misc{konovalova2025heedinginnervoicealigning,
title={Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback},
author={Nina Konovalova and Maxim Nikolaev and Andrey Kuznetsov and Aibek Alanov},
year={2025},
eprint={2507.02321},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.02321},
}