Does part 1 of training (CLIP-based training) include image modality? #9
Comments
Dear @sakshamsingh1, yes, as you mentioned, this code is the old version. The contrastive loss uses both the text and the image branch:

```python
# CLIP-style symmetric contrastive loss over the audio-text and audio-image similarity matrices.
projection_audio_text = scale_constant1 * (audio_embedding @ text_embedding.T)
projection_audio_image = scale_constant2 * (audio_embedding @ image_embedding.T)

ce = torch.nn.CrossEntropyLoss()
text_contrastive_loss = ce(projection_audio_text, label) + ce(projection_audio_text.T, label)
image_contrastive_loss = ce(projection_audio_image, label) + ce(projection_audio_image.T, label)
loss = text_contrastive_loss + image_contrastive_loss
```

Thanks
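For context, here is a minimal, self-contained sketch of how the inputs to that loss could be set up. The batch size, the random placeholder embeddings, and the logit-scale values below are assumptions for illustration only, not the repository's actual training code:

```python
import torch

batch_size, embed_dim = 8, 512

# Placeholder embeddings standing in for the audio encoder and CLIP's text/image encoders
# (assumed shapes, for illustration only).
audio_embedding = torch.randn(batch_size, embed_dim)
image_embedding = torch.randn(batch_size, embed_dim)
text_embedding = torch.randn(batch_size, embed_dim)

# L2-normalize so the dot products behave like cosine similarities.
audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

# In a CLIP-style batch, the i-th audio matches the i-th text/image,
# so the target "class" for row i is simply i.
label = torch.arange(batch_size)

scale_constant1 = scale_constant2 = 100.0  # assumed logit scale, not the paper's value

projection_audio_text = scale_constant1 * (audio_embedding @ text_embedding.T)
projection_audio_image = scale_constant2 * (audio_embedding @ image_embedding.T)

ce = torch.nn.CrossEntropyLoss()
text_contrastive_loss = ce(projection_audio_text, label) + ce(projection_audio_text.T, label)
image_contrastive_loss = ce(projection_audio_image, label) + ce(projection_audio_image.T, label)
loss = text_contrastive_loss + image_contrastive_loss
```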
Thanks @lsh3163 for the quick response. I am very interested in your work and actively looking into it. Thanks
Yes, I plan to update this code later, but I am not sure when that will be. Thanks
Great, thanks for being so helpful!
Dear @sakshamsingh1
Thanks @lsh3163. That would be great!! In particular, I am interested in knowing how you pre-process the audio before feeding it into the audio encoder (to get audio embeddings).
Also looking forward to the pre-processing part!
@Allencheng97 @sakshamsingh1 This is the audio pre-processing before the audio encoder:

```python
import random

import cv2
import librosa
import numpy as np

n_mels = 128            # number of mel bins
time_length = 864       # fixed number of spectrogram frames
resize_resolution = 512

# Load the waveform (wav_name: path to the input .wav file) and compute
# a log-mel spectrogram scaled to roughly [0, 1].
y, sr = librosa.load(wav_name, sr=44100)
audio_inputs = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
audio_inputs = librosa.power_to_db(audio_inputs, ref=np.max) / 80.0 + 1

# Randomly crop long clips to time_length frames, zero-pad short ones.
h, w = audio_inputs.shape
if w >= time_length:
    j = random.randint(0, w - time_length)
    audio_inputs = audio_inputs[:, j:j + time_length]
else:
    zero = np.zeros((n_mels, time_length))
    zero[:, :w] = audio_inputs[:, :w]
    audio_inputs = zero

# Resize the spectrogram to the resolution expected by the audio encoder.
audio_inputs = cv2.resize(audio_inputs, (n_mels, resize_resolution))
```
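For reference, a small usage sketch of how the resulting `audio_inputs` array might then be handed to the audio encoder; the `(1, 1, H, W)` input layout is an assumption about the encoder, not taken from the repository:

```python
import torch

# audio_inputs: the (resize_resolution, n_mels) numpy array produced above.
audio_tensor = torch.from_numpy(audio_inputs).float()
audio_tensor = audio_tensor.unsqueeze(0).unsqueeze(0)  # -> (1, 1, H, W), assumed layout

with torch.no_grad():
    audio_embedding = audio_encoder(audio_tensor.to(device))  # used in the snippets above and below
```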
This is a zero-shot audio-classification evaluation code:

```python
with torch.no_grad():
    # Encode the class names with the (frozen) CLIP text encoder.
    text_tokens = torch.cat([clip.tokenize(text) for text in labels])
    text_embedding = clip_model.encode_text(text_tokens.to(device)).float()

    # Encode the audio and L2-normalize the embedding.
    audio_embedding = audio_encoder(audio_inputs)
    audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)

    # Similarity logits between the audio and every class prompt.
    proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)

    # Predict the class with the highest similarity.
    label_idx = torch.argmax(proj_per_audio, axis=1)
    pred_category = labels[label_idx]
```
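A hedged sketch of extending the snippet above into a full accuracy loop over an evaluation set; `eval_files`, `eval_labels`, and the `preprocess_audio` helper are hypothetical placeholders (any preprocessing equivalent to the steps earlier in this thread would do):

```python
correct = 0
for wav_name, true_label in zip(eval_files, eval_labels):
    audio_inputs = preprocess_audio(wav_name).to(device)  # hypothetical helper, see above
    with torch.no_grad():
        audio_embedding = audio_encoder(audio_inputs)
        audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
        proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)
        pred_category = labels[proj_per_audio.argmax(dim=-1).item()]
    correct += int(pred_category == true_label)

print(f"zero-shot accuracy: {correct / len(eval_files):.3f}")
```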
Hi,
Thanks for the great work!!
The paper states that during part-1 training (i.e. the CLIP-based Contrastive Latent Representation Learning step) you consider the image, text, and audio modalities, but the code only uses the audio and text modalities for this training part.
Is this an old version of the code, or did I misinterpret the training part in the paper?
Thanks