-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implemented Coca architecture #2371
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great @VarunS1997 !! left a few comments. Also, please add a colab to verify the output.
One additional overhead work is needed. please add keras_cv/models/feature_extractor/coca to this file PS: we will fix this overhead soon, but in the mean time this is what we need to do to make sure the large GPU tests run. |
…each layer with expected sizing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Left a few comments.
import numpy as np | ||
from keras import Sequential | ||
from keras_cv.api_export import keras_cv_export | ||
from keras_nlp.layers import RotaryEmbedding, TransformerDecoder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we were doing a conditional import of keras_nlp
, so keras_cv
was installable without keras-nlp installed if you are using unrelated features. But that was when the only use was for one tokenizer.
keras-cv/keras_cv/models/feature_extractor/clip/clip_model.py
Lines 81 to 85 in 5faae37
if keras_nlp is None: | |
raise ValueError( | |
"ClipTokenizer requires keras-nlp. Please install " | |
"using pip `pip install -U keras-nlp && pip install -U keras`" | |
) |
We could reconsider if keras-cv
should hard depend on keras-nlp
if we want more stuff like this? No strong feelings. @divyashreepathihalli fyi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as we support more multi modal models we should depend on KerasNLP. If the tf-text install issue is resolved, we should add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sgtm! Though if we switch to a hard dependency here, we should probably add keras-nlp
as a dependency in setup.py
(which comes with a transitive dependency on tensorflow-text
and tensorflow
just fyi).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to include that in this PR? There's already some imports of Keras-NLP in other places of Keras CV.
If we make it a separate PR, it'll make it easier to rollback if we need to. Considering it's a new dependency, might be worth separating.
|
||
# [batch_size, sequence_length+1, text_dim] | ||
text_tokens = np.concatenate(texts, self.cls_token) | ||
mask = np.concatenate((np.ones_like(texts), np.zeros_like(self.cls_token))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't tell if this mask is the right shape or not. Usually you want something (batch_size, seq_length, seq_length)
(or seq lenth + 1 if that is the effective sequence length. What is text_dim
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text_dim is the dimensionality of the text embeddings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I left a few comments
Also, please do add a colab demo to show that the model is working as expected.
keras_cv/layers/attention_pooling.py
Outdated
super().build(input_shape) | ||
|
||
self.multi_head_attn.build(input_shape) | ||
self.layer_norm.build(input_shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the input shape for layer_norm correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it, let me know if it still doesn't look right!
self.image_patching = PatchingAndEmbedding( | ||
self.encoder_width, self.img_patch_size | ||
) | ||
self.image_encoder = Sequential( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sequential might not work, the model will not build properly.
) | ||
|
||
self.text_embedding = RotaryEmbedding() | ||
self.unimodal_text_decoder = Sequential( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, sequential might not work well. Please double check.
for _ in range(self.unimodal_decoder_depth) | ||
] | ||
) | ||
self.multimodal_text_decoder = Sequential( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about Sequential
num_patches = (images_shape[1] // self.img_patch_size) * ( | ||
images_shape[2] // self.img_patch_size | ||
) + 1 | ||
self.image_encoder.build((batch_size, self.encoder_width, num_patches)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could keep batch_size as None
example
self.image_encoder.build((None, self.encoder_width, num_patches))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding, is there a specific reason to do that over setting the batch_size?
contrastive loss from the ViT and the uni-modal Text Decoder is combined with a captioning loss from the multi-modal | ||
Decoder in order to produce the combined total loss. | ||
|
||
Basic Usage: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to be changed to Example:
since we follow only Example
or Examples:
as a standard format.
@VarunS1997 will you be completing this one? |
What does this PR do?
Implements the work done in the "CoCa": Contrastive Captioners are Image-Text Foundation Models" (https://arxiv.org/pdf/2205.01917.pdf).
This PR requires:
Before submitting
Pull Request section?
to it if that's the case.