
Does part 1 of training (CLIP-based training) include the image modality? #9

Open
sakshamsingh1 opened this issue Sep 27, 2022 · 9 comments

Comments

@sakshamsingh1

Hi,
Thanks for the great work!!

The paper states that during part-1 training (i.e., the CLIP-based Contrastive Latent Representation Learning step) you use the image, text, and audio modalities, but the code only uses the audio and text modalities for this step.

Is this an older version of the code, or did I misinterpret the training procedure in the paper?
Thanks

@lsh3163
Collaborator

lsh3163 commented Sep 27, 2022

Dear @sakshamsingh1,
Thanks for your interest!

Yes, as you mentioned, this code is the old version.
To use both image and text for pre-training, you could download the whole raw videos (image, audio, text) with yt_dlp.
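For example, something along these lines with yt_dlp's Python API (a rough sketch; the download settings and output layout here are just placeholders, not the exact ones used for the paper):

from yt_dlp import YoutubeDL

# video_urls is a hypothetical list of the video URLs you want to collect
ydl_opts = {
    "format": "bestvideo+bestaudio/best",    # keep both the frames and the audio track
    "outtmpl": "raw_videos/%(id)s.%(ext)s",  # placeholder output path template
    "writesubtitles": True,                  # save captions for the text modality, if available
    "writeautomaticsub": True,
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(video_urls)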
After that, we use a script along the lines of the one below:

import torch

# audio/text/image embeddings: (batch, dim), L2-normalized; scale_constant1/2 are temperature scales
projection_audio_text = scale_constant1 * (audio_embedding @ text_embedding.T)
projection_audio_image = scale_constant2 * (audio_embedding @ image_embedding.T)

# matching audio/text/image triplets share a batch index, so the targets are the diagonal: 0..batch-1
label = torch.arange(audio_embedding.shape[0], device=audio_embedding.device)

ce = torch.nn.CrossEntropyLoss()
text_contrastive_loss = ce(projection_audio_text, label) + ce(projection_audio_text.T, label)
image_contrastive_loss = ce(projection_audio_image, label) + ce(projection_audio_image.T, label)
loss = text_contrastive_loss + image_contrastive_loss
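For reference, the three embeddings above can be produced roughly as follows (a sketch that assumes the OpenAI CLIP image/text encoders are used frozen, with the audio encoder as the trainable branch; frames, tokens, and audio_inputs are placeholders):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    image_embedding = clip_model.encode_image(frames).float()  # frames: preprocessed video frames
    text_embedding = clip_model.encode_text(tokens).float()    # tokens: clip.tokenize(captions)
audio_embedding = audio_encoder(audio_inputs)                   # trainable audio branch

# normalize so the matrix products above are cosine similarities
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)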

Thanks

@sakshamsingh1
Author

Thanks @lsh3163 for the quick response.
This makes sense.

I am very interested in your work and actively looking into it.
Do you plan to push the newer code?

Thanks

@lsh3163
Collaborator

lsh3163 commented Sep 27, 2022

Yes, I plan to update this code later, but I am not sure when that will be.
So, if you have any questions about the complete code, feel free to ask me.

Thanks

@sakshamsingh1
Author

Great, thanks for being so helpful.
I have some questions:

  1. Which is the latest pre-trained audio encoder?

    • resnet18 in the pre-trained folder: here
    • resnet18_57 provided in the README link: here
    • Or none of these (assuming this is older code).
  2. Can you provide the code for zero-shot audio classification on the ESC-50 and US-8K datasets?

Thanks

@lsh3163
Collaborator

lsh3163 commented Sep 30, 2022

Dear @sakshamsingh1
This is my answer.

  1. resnet18 is the newer one!
  2. Yes, I can provide it, but the code has changed as the project was extended, so the original needs some modification before I can share it. However, I'll attach the main code to this thread. Thanks. :)

@sakshamsingh1
Author

sakshamsingh1 commented Oct 4, 2022

Thanks @lsh3163. That would be great!!

In particular, I am interested in knowing how you pre-process the audio before feeding it into the audio encoder (to get the audio embeddings).

@Allencheng97

Also looking forward to the pre-processing part!

@lsh3163
Collaborator

lsh3163 commented Oct 17, 2022

@Allencheng97 @sakshamsingh1
Thanks for your interest. I think it would be good to refer to the code below.

import random

import cv2
import librosa
import numpy as np

n_mels = 128
time_length = 864
resize_resolution = 512

# wav_name: path to the input audio file
y, sr = librosa.load(wav_name, sr=44100)
audio_inputs = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, time)
audio_inputs = librosa.power_to_db(audio_inputs, ref=np.max) / 80.0 + 1   # map [-80, 0] dB to [0, 1]

h, w = audio_inputs.shape
if w >= time_length:
    # random temporal crop to a fixed length
    j = random.randint(0, w - time_length)
    audio_inputs = audio_inputs[:, j:j + time_length]
else:
    # zero-pad short clips, then resize
    zero = np.zeros((n_mels, time_length))
    zero[:, :w] = audio_inputs[:, :w]
    audio_inputs = zero
    audio_inputs = cv2.resize(audio_inputs, (n_mels, resize_resolution))
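From there, one way to feed the spectrogram into the ResNet-18 audio encoder is to turn it into a batch tensor (a sketch; whether the encoder expects a 1-channel or 3-channel input depends on the model definition):

import torch

# assumes a single-channel input of shape (batch, 1, H, W);
# repeat along the channel dimension if the encoder expects 3 channels
audio_tensor = torch.from_numpy(audio_inputs).float().unsqueeze(0).unsqueeze(0)
audio_embedding = audio_encoder(audio_tensor)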

@lsh3163
Collaborator

lsh3163 commented Oct 17, 2022

This is the zero-shot audio classification evaluation code.

import math
import clip
import torch

with torch.no_grad():
    text_tokens = torch.cat([clip.tokenize(text) for text in labels])
    text_embedding = clip_model.encode_text(text_tokens.to(device)).float()
    # normalize both embeddings so the dot product below is a cosine similarity
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    audio_embedding = audio_encoder(audio_inputs)
    audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
    # the constant scale does not change the argmax
    proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)
    label_idx = torch.argmax(proj_per_audio, dim=1)
    pred_category = [labels[i] for i in label_idx.tolist()]
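To turn this into a dataset-level accuracy on ESC-50 or US-8K, it can be wrapped in a simple loop, e.g. (a sketch with a hypothetical dataset iterable yielding (spectrogram tensor, class-name) pairs):

import clip
import torch

def zero_shot_accuracy(dataset, labels, audio_encoder, clip_model, device):
    # fraction of clips whose top-1 predicted class name matches the ground truth
    with torch.no_grad():
        text_tokens = torch.cat([clip.tokenize(text) for text in labels]).to(device)
        text_embedding = clip_model.encode_text(text_tokens).float()
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

        correct, total = 0, 0
        for audio_inputs, true_label in dataset:
            audio_embedding = audio_encoder(audio_inputs.to(device))
            audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
            proj = audio_embedding @ text_embedding.T
            pred = labels[int(proj.argmax(dim=-1))]
            correct += int(pred == true_label)
            total += 1
    return correct / max(total, 1)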
