
Does part 1 of training (CLIP-based training) include the image modality? #9

Open
sakshamsingh1 opened this issue Sep 27, 2022 · 9 comments

Comments

@sakshamsingh1

Hi,
Thanks for the great work!!

The paper states that during part-1 training (i.e., the CLIP-based Contrastive Latent Representation Learning step) you use the image, text, and audio modalities, but the code only uses the audio and text modalities for this step.

Is this an older version of the code, or did I misinterpret the training procedure in the paper?
Thanks

@lsh3163
Collaborator

lsh3163 commented Sep 27, 2022

Dear @sakshamsingh1,
Thanks for your interest!

Yes, as you mentioned, this code is the old version.
To use both image and text for pre-training, you could download the whole raw videos (image, audio, text) with yt_dlp.
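For example, something along these lines with yt_dlp's Python API (a rough sketch; the download settings and output layout here are just placeholders, not the exact ones used for the paper):

from yt_dlp import YoutubeDL

# video_urls is a hypothetical list of the video URLs you want to collect
ydl_opts = {
    "format": "bestvideo+bestaudio/best",    # keep both the frames and the audio track
    "outtmpl": "raw_videos/%(id)s.%(ext)s",  # placeholder output path template
    "writesubtitles": True,                  # save captions for the text modality, if available
    "writeautomaticsub": True,
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(video_urls)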
After that, we use a script along the lines of the one below:

import torch

# audio/text/image embeddings: (batch, dim), L2-normalized; scale_constant1/2 are temperature scales
projection_audio_text = scale_constant1 * (audio_embedding @ text_embedding.T)
projection_audio_image = scale_constant2 * (audio_embedding @ image_embedding.T)

# matching audio/text/image triplets share a batch index, so the targets are the diagonal: 0..batch-1
label = torch.arange(audio_embedding.shape[0], device=audio_embedding.device)

ce = torch.nn.CrossEntropyLoss()
text_contrastive_loss = ce(projection_audio_text, label) + ce(projection_audio_text.T, label)
image_contrastive_loss = ce(projection_audio_image, label) + ce(projection_audio_image.T, label)
loss = text_contrastive_loss + image_contrastive_loss
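For reference, the three embeddings above can be produced roughly as follows (a sketch that assumes the OpenAI CLIP image/text encoders are used frozen, with the audio encoder as the trainable branch; frames, tokens, and audio_inputs are placeholders):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    image_embedding = clip_model.encode_image(frames).float()  # frames: preprocessed video frames
    text_embedding = clip_model.encode_text(tokens).float()    # tokens: clip.tokenize(captions)
audio_embedding = audio_encoder(audio_inputs)                   # trainable audio branch

# normalize so the matrix products above are cosine similarities
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)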

Thanks

@sakshamsingh1
Author

Thanks @lsh3163 for the quick response.
This makes sense.

I am very interested in your work and actively looking into it.
Do you plan to push the newer code?

Thanks

@lsh3163
Collaborator

lsh3163 commented Sep 27, 2022

Yes, I plan to update this code later, but I am not sure when that will be.
So, if you have any questions about the complete code, feel free to ask me.

Thanks

@sakshamsingh1
Author

Great, thanks for being so helpful.
I have some questions:

  1. Which is the latest pre-trained audio encoder?

    • resnet18 in the pre-trained folder: here
    • resnet18_57 provided in the README link: here
    • Or none of these (assuming this is older code).
  2. Can you provide the code for zero-shot audio classification on the ESC-50 and US-8K datasets?

Thanks

@lsh3163
Collaborator

lsh3163 commented Sep 30, 2022

Dear @sakshamsingh1
This is my answer.

  1. resnet18 is the newer one!
  2. Yes, I can provide it, but the code has changed as the project was extended, so the original needs some modification before I can share it. However, I'll attach the main code to this thread. Thanks. :)

@sakshamsingh1
Author

sakshamsingh1 commented Oct 4, 2022

Thanks @lsh3163. That would be great!!

In particular, I am interested in knowing how you pre-process the audio before feeding it into the audio encoder (to get the audio embeddings).

@Allencheng97

Also looking forward to the pre-processing part!

@lsh3163
Collaborator

lsh3163 commented Oct 17, 2022

@Allencheng97 @sakshamsingh1
Thanks for your interest. I think it would be good to refer to the code below.

import random

import cv2
import librosa
import numpy as np

n_mels = 128
time_length = 864
resize_resolution = 512

# wav_name: path to the input audio file
y, sr = librosa.load(wav_name, sr=44100)
audio_inputs = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, time)
audio_inputs = librosa.power_to_db(audio_inputs, ref=np.max) / 80.0 + 1   # map [-80, 0] dB to [0, 1]

h, w = audio_inputs.shape
if w >= time_length:
    # random temporal crop to a fixed length
    j = random.randint(0, w - time_length)
    audio_inputs = audio_inputs[:, j:j + time_length]
else:
    # zero-pad short clips, then resize
    zero = np.zeros((n_mels, time_length))
    zero[:, :w] = audio_inputs[:, :w]
    audio_inputs = zero
    audio_inputs = cv2.resize(audio_inputs, (n_mels, resize_resolution))
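From there, one way to feed the spectrogram into the ResNet-18 audio encoder is to turn it into a batch tensor (a sketch; whether the encoder expects a 1-channel or 3-channel input depends on the model definition):

import torch

# assumes a single-channel input of shape (batch, 1, H, W);
# repeat along the channel dimension if the encoder expects 3 channels
audio_tensor = torch.from_numpy(audio_inputs).float().unsqueeze(0).unsqueeze(0)
audio_embedding = audio_encoder(audio_tensor)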

@lsh3163
Collaborator

lsh3163 commented Oct 17, 2022

This is the zero-shot audio classification evaluation code.

import math
import clip
import torch

with torch.no_grad():
    text_tokens = torch.cat([clip.tokenize(text) for text in labels])
    text_embedding = clip_model.encode_text(text_tokens.to(device)).float()
    # normalize both embeddings so the dot product below is a cosine similarity
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    audio_embedding = audio_encoder(audio_inputs)
    audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
    # the constant scale does not change the argmax
    proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)
    label_idx = torch.argmax(proj_per_audio, dim=1)
    pred_category = [labels[i] for i in label_idx.tolist()]
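To turn this into a dataset-level accuracy on ESC-50 or US-8K, it can be wrapped in a simple loop, e.g. (a sketch with a hypothetical dataset iterable yielding (spectrogram tensor, class-name) pairs):

import clip
import torch

def zero_shot_accuracy(dataset, labels, audio_encoder, clip_model, device):
    # fraction of clips whose top-1 predicted class name matches the ground truth
    with torch.no_grad():
        text_tokens = torch.cat([clip.tokenize(text) for text in labels]).to(device)
        text_embedding = clip_model.encode_text(text_tokens).float()
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

        correct, total = 0, 0
        for audio_inputs, true_label in dataset:
            audio_embedding = audio_encoder(audio_inputs.to(device))
            audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
            proj = audio_embedding @ text_embedding.T
            pred = labels[int(proj.argmax(dim=-1))]
            correct += int(pred == true_label)
            total += 1
    return correct / max(total, 1)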
