Attention mask? #88

Open
pcuenca opened this issue Jun 17, 2023 · 1 comment

Comments

pcuenca (Member) commented Jun 17, 2023

Like in Stable Diffusion, no attention mask appears to be used for input tokens:

input_ids = self.tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=self.tokenizer.model_max_length,
).input_ids  # TODO: remove hardcode
input_ids = input_ids.to(self.device)
encoder_hidden_states = self.text_encoder(input_ids).last_hidden_state

But according to third-party analysis, this appears to have been a mistake all along. Do we have any insight into whether attention masks would help with prompt-image alignment?
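For reference, here is a minimal sketch of what a masked variant could look like, assuming a transformers CLIP-style text encoder that accepts an attention_mask argument (the standalone tokenizer/encoder and the checkpoint name are illustrative stand-ins for the pipeline's own self.tokenizer / self.text_encoder):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative checkpoint; the actual pipeline would use its own tokenizer/encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_with_mask(text, device="cpu"):
    # Request the attention mask alongside the input ids.
    batch = tokenizer(
        text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=tokenizer.model_max_length,
    )
    input_ids = batch.input_ids.to(device)
    attention_mask = batch.attention_mask.to(device)
    # Passing the mask lets the encoder's self-attention ignore PAD positions.
    with torch.no_grad():
        encoder_hidden_states = text_encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
    return encoder_hidden_states, attention_mask

Note that to fully match a masked training setup, the same mask would presumably also need to be forwarded to the model's cross-attention over the text states, not just to the text encoder.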


Birch-san commented Jul 7, 2023

These authors reckon it's better to train on unmasked text embeddings (even though that risks learning from PAD token embeddings):
huggingface/diffusers#1890 (comment)

As for inference: the user needs to be able to match whatever approach was used during training.

I thought Muse was a bit wackier though. It actually masks vision tokens:

https://github.com/lucidrains/muse-maskgit-pytorch/
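For context, a rough sketch of MaskGIT-style masking of vision tokens during training, assuming a cosine masking schedule and a dedicated mask_token_id (both are illustrative assumptions rather than the exact implementation in that repository):

import math
import torch

def mask_vision_tokens(token_ids, mask_token_id):
    batch, seq_len = token_ids.shape
    # Sample a masking ratio per example from a cosine schedule (assumption).
    t = torch.rand(batch, device=token_ids.device)
    mask_ratio = torch.cos(t * math.pi / 2)
    num_masked = (mask_ratio * seq_len).long().clamp(min=1)
    # Randomly choose which positions to mask in each example.
    scores = torch.rand(batch, seq_len, device=token_ids.device)
    ranks = scores.argsort(dim=-1).argsort(dim=-1)  # rank of each position
    mask = ranks < num_masked.unsqueeze(-1)
    masked_ids = torch.where(mask, torch.full_like(token_ids, mask_token_id), token_ids)
    return masked_ids, mask

The model is then trained to predict the original token ids at the masked positions, and at inference the masked grid is progressively filled in over several decoding steps.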
