initial commit

CompVis · Dec 4, 2024 · eefe7e2
commit eefe7e2
Showing 94 changed files with 7,392 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,65 @@
+<h2 align="center">🧹CleanDIFT: Diffusion Features without Noise</h2>
+<div align="center"> 
+  <a href="https://nickstracke.dev/" target="_blank">Nick Stracke</a><sup>*</sup> · 
+  <a href="https://stefan-baumann.eu/" target="_blank">Stefan A. Baumann</a><sup>*</sup> · 
+  <a href="https://bsky.app/profile/koljabauer.bsky.social" target="_blank">Kolja Bauer</a><sup>*</sup> · 
+  <a href="https://ffundel.de/" target="_blank">Frank Fundel</a> · 
+  <a href="https://ommer-lab.com/people/ommer/" target="_blank">Björn Ommer</a>
+</div>
+<p align="center"> 
+  <b>CompVis Group @ LMU Munich</b> <br/>
+  <sup>*</sup> Equal Contribution
+</p>
+
+[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://compvis.github.io/CleanDIFT/)
+[![Paper](https://img.shields.io/badge/arXiv-PDF-b31b1b)](https://compvis.github.io/CleanDIFT/static/pdfs/cleandift.pdf)
+[![Weights](https://img.shields.io/badge/HuggingFace-Weights-orange)](https://huggingface.co/CompVis/cleandift)
+
+
+
+This repository contains the official implementation of the paper "CleanDIFT: Diffusion Features without Noise".
+
+We propose CleanDIFT, a novel method to extract noise-free, timestep-independent features by enabling diffusion models to work directly with clean input images. Our approach is efficient, training on a single GPU in just 30 minutes.
+
+![teaser](./docs/static/images/teaser_fig.png)
+
+
+## 🚀 Usage
+### Setup
+Clone the repo and install the requirements via `pip install -r requirements.txt`; then you're ready to go.
+
+### Training
+
+To train your own feature extractor, run `python train.py`. The training script expects your data in `./data` as a single-level directory of images named `filename.jpg`, each with a corresponding JSON file `filename.json` that contains the key `caption`.
+
+### Feature Extraction
+
+For feature extraction, please refer to one of the notebooks at [`notebooks`](https://github.com/CompVis/CleanDIFT/tree/main/notebooks). We demonstrate how to extract features and use them for semantic correspondence detection and depth prediction. 
+
+Our checkpoints are fully compatible with the `diffusers` library. If you already have a pipeline using SD 1.5 or SD 2.1 from `diffusers`, you can simply replace the U-Net state dict:
+
+```python
+from diffusers import UNet2DConditionModel
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+
+unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
+ckpt_pth = hf_hub_download(repo_id="CompVis/cleandift", filename="cleandift_sd21_unet.safetensors")
+unet.load_state_dict(load_file(ckpt_pth), strict=True)
+```
+
+
+## 🎓 Citation
+
+If you use this codebase or otherwise find our work valuable, please cite our paper:
+
+```bibtex
+@misc{stracke2024cleandift,
+  title={CleanDIFT: Diffusion Features without Noise},
+  author={Nick Stracke and Stefan Andreas Baumann and Kolja Bauer and Frank Fundel and Björn Ommer},
+  year={2024},
+  eprint={????},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV}
+}
+```
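The data layout that the README describes for `train.py` (a single-level `./data` directory with `filename.jpg` images and matching `filename.json` files containing a `caption` key) can be sanity-checked before training. The following is a minimal sketch using only the standard library. The helper name `collect_image_caption_pairs` is hypothetical and not part of this commit, and the actual `src.dataloader.DataModule` may read the data differently.

```python
import json
from pathlib import Path


def collect_image_caption_pairs(dataset_dir):
    """Return sorted (image_path, caption) pairs found directly in dataset_dir.

    Hypothetical helper, not part of the repository. Raises if an image has
    no matching JSON file or if the JSON file lacks a "caption" key.
    """
    pairs = []
    # Only the top level is scanned, matching the "single-level" layout.
    for img_path in sorted(Path(dataset_dir).glob("*.jpg")):
        json_path = img_path.with_suffix(".json")
        if not json_path.is_file():
            raise FileNotFoundError(f"missing caption file for {img_path.name}")
        with json_path.open(encoding="utf-8") as f:
            meta = json.load(f)
        if "caption" not in meta:
            raise KeyError(f"{json_path.name} has no 'caption' key")
        pairs.append((img_path, meta["caption"]))
    return pairs
```

Calling `collect_image_caption_pairs("./data")` before launching training surfaces missing or malformed caption files early instead of partway through an epoch.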
diff --git a/configs/sd15_feature_extractor.yaml b/configs/sd15_feature_extractor.yaml
@@ -0,0 +1,63 @@
+seed: 42
+max_val_steps: 10
+val_freq: 100
+checkpoint_freq: 100
+checkpoint_dir: ./checkpoints
+lr: 1e-5
+max_steps: null
+
+grad_accum_steps: 1
+
+data:
+  _target_: src.dataloader.DataModule
+  dataset_dir: ./data
+  batch_size: 8
+  img_size: 512
+
+model:
+  _target_: src.sd_feature_extraction.StableFeatureAligner
+  sd_version: sd15
+  t_max: 999 # Max timestep used during training
+  num_t_stratification_bins: 3
+  train_unet: true
+  learn_timestep: true
+  use_text_condition: true
+
+  ae:
+    _target_: src.ae.AutoencoderKL
+    repo: stable-diffusion-v1-5/stable-diffusion-v1-5
+  mapping:
+    _target_: src.utils.MappingSpec
+    depth: 2
+    width: 256
+    d_ff: 768
+    dropout: 0.0
+  adapter_layer_class: src.sd_feature_extraction.FFNStack
+  adapter_layer_params:
+    depth: 3
+    ffn_expansion: 1
+    dim_cond: ${..mapping.width}
+  feature_extractor_cls: src.sd_feature_extraction.SD15UNetFeatureExtractor
+  feature_dims:
+    mid: 1280
+    us1: 1280
+    us2: 1280
+    us3: 1280
+    us4: 1280
+    us5: 1280
+    us6: 1280
+    us7: 640
+    us8: 640
+    us9: 640
+    us10: 320
+
+
+lr_scheduler:
+  name: constant_with_warmup
+  num_warmup_steps: 2000
+  num_training_steps: null
+  scheduler_specific_kwargs: {}
+
+hydra:
+  job:
+    chdir: false
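The `lr_scheduler` block above selects `constant_with_warmup` with `num_warmup_steps: 2000` and a base `lr` of `1e-5`. Assuming the name follows the usual convention of the scheduler of that name in `diffusers` and `transformers` (linear ramp from 0 to the base rate during warmup, constant afterwards), the resulting learning rate can be sketched as below. How `train.py` actually builds the scheduler is not shown in this part of the diff, and the function name is illustrative.

```python
def constant_with_warmup_lr(step, base_lr=1e-5, num_warmup_steps=2000):
    """Learning rate at a given optimizer step: linear warmup, then constant.

    Illustrative sketch of the assumed schedule, using the config values
    above as defaults.
    """
    if step < num_warmup_steps:
        # Linear ramp: 0 at step 0, base_lr at step num_warmup_steps.
        return base_lr * step / max(1, num_warmup_steps)
    return base_lr


for step in (0, 500, 1000, 2000, 5000):
    print(step, constant_with_warmup_lr(step))
```

Under this assumption, the first 2000 steps run at a reduced learning rate, which is worth keeping in mind when comparing short training runs.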
diff --git a/configs/sd21_depth_prober.yaml b/configs/sd21_depth_prober.yaml
@@ -0,0 +1,19 @@
+# @package _global_
+
+model:
+  _target_: src.depth.DepthPred
+  loss:
+    _target_: src.depth.SigLoss
+  model_config_path: ./configs/sd21_feature_extractor.yaml
+  diffusion_image_size: 768
+  channels: 1280
+  base_model_timestep: 199
+  use_base_model_features: false
+  adapter_timestep: null
+  interpolate_features: NONE
+
+hydra:
+  job:
+    chdir: false
+
+
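The depth prober config above points its loss at `src.depth.SigLoss`. The name suggests a scale-invariant log loss of the kind commonly used for monocular depth estimation, but that is an assumption: the class body is not part of this section, and the repository's implementation may differ in weighting, masking, or scaling. A minimal pure-Python sketch of the assumed form, with hypothetical parameter names `lam` and `eps`, is:

```python
import math


def scale_invariant_log_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss over paired depth values.

    Assumed form (not confirmed by the source):
        sqrt(mean(g^2) - lam * mean(g)^2),  g = log(pred) - log(target)
    Entries with a non-positive target are treated as invalid and skipped.
    """
    g = [
        math.log(p + eps) - math.log(t + eps)
        for p, t in zip(pred, target)
        if t > 0
    ]
    if not g:
        raise ValueError("no valid target depths")
    mean_g = sum(g) / len(g)
    mean_g2 = sum(x * x for x in g) / len(g)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_g2 - lam * mean_g ** 2, 0.0))
```

With `lam=1.0` the loss ignores a global scale factor between prediction and target; smaller values of `lam` also penalize part of the global scale error.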
diff --git a/configs/sd21_feature_extractor.yaml b/configs/sd21_feature_extractor.yaml
@@ -0,0 +1,63 @@
+seed: 42
+max_val_steps: 100
+val_freq: 100
+checkpoint_freq: 200
+checkpoint_dir: ./checkpoints
+lr: 1e-5
+max_steps: null
+
+grad_accum_steps: 1
+
+data:
+  _target_: src.dataloader.DataModule
+  dataset_dir: ./data
+  batch_size: 8
+  img_size: 768
+
+model:
+  _target_: src.sd_feature_extraction.StableFeatureAligner
+  sd_version: sd21
+  t_max: 999 # Max timestep used during training
+  num_t_stratification_bins: 3
+  train_unet: true
+  learn_timestep: true
+  use_text_condition: true
+
+  ae:
+    _target_: src.ae.AutoencoderKL
+    repo: stabilityai/stable-diffusion-2-1
+  mapping:
+    _target_: src.utils.MappingSpec
+    depth: 2
+    width: 256
+    d_ff: 768
+    dropout: 0.0
+  adapter_layer_class: src.sd_feature_extraction.FFNStack
+  adapter_layer_params:
+    depth: 3
+    ffn_expansion: 1
+    dim_cond: ${..mapping.width}
+  feature_extractor_cls: src.sd_feature_extraction.SD21UNetFeatureExtractor
+  feature_dims:
+    mid: 1280
+    us1: 1280
+    us2: 1280
+    us3: 1280
+    us4: 1280
+    us5: 1280
+    us6: 1280
+    us7: 640
+    us8: 640
+    us9: 640
+    us10: 320
+
+
+lr_scheduler:
+  name: constant_with_warmup
+  num_warmup_steps: 2000
+  num_training_steps: null
+  scheduler_specific_kwargs: {}
+
+hydra:
+  job:
+    chdir: false
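Both feature extractor configs set `t_max: 999` and `num_t_stratification_bins: 3`. One plausible reading is that training timesteps are drawn in a stratified way, one per equal-width bin of the range from 0 to `t_max`, so that every step covers low, medium, and high noise levels. That interpretation is an assumption; the code in `src.sd_feature_extraction` that consumes these keys is not shown in this section. A sketch under that assumption, with a hypothetical function name:

```python
import random


def sample_stratified_timesteps(t_max=999, num_bins=3, rng=None):
    """Draw one integer timestep from each of num_bins equal-width bins.

    Illustrative sketch only: the bins partition the integers 0..t_max,
    and one value is drawn uniformly from each bin.
    """
    rng = rng or random.Random()
    # Bin edges over the t_max + 1 possible timesteps, e.g. 0, 333, 667, 1000.
    edges = [round(i * (t_max + 1) / num_bins) for i in range(num_bins + 1)]
    return [rng.randrange(edges[i], edges[i + 1]) for i in range(num_bins)]


print(sample_stratified_timesteps(rng=random.Random(42)))
```

Compared with drawing timesteps independently and uniformly, stratification of this kind guarantees that no part of the noise range is left out of a given training step.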