How to understand and use the audio embedding? #148

Open
arthur19312 opened this issue Apr 29, 2024 · 4 comments

Comments

@arthur19312

arthur19312 commented Apr 29, 2024

I'm new here. I ran the method get_audio_embedding_from_filelist with the model music_audioset_epoch_15_esc_90.14.pt and got audio embeddings like this:

[[-4.639852792024612427e-02, -9.935184381902217865e-03, ...]]

I roughly understand that it represents the features of the input audio, but I don't know how to use it.
Could someone tell me what this audio embedding, which I get as an array of floats, actually is? Is it compatible with other models? And how should I use it?

(PS: I'm really interested in this work, but I seem to lack some necessary background knowledge, so I would appreciate it if someone could recommend relevant materials to help me get into the field. Thank you so much ❤)

@cvillela

I have similar questions.

When extracting text and audio embeddings, I can easily perform cosine similarity to find closely related pairs, and retrieve audio from text inputs and vice-versa.
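
The cosine-similarity retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the repository's own code: the vectors are toy stand-ins invented for the example, and the helper names cosine_similarity and rank_by_similarity are hypothetical. In practice the audio vectors would come from get_audio_embedding_from_filelist, and the query vector from the corresponding text-embedding call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional stand-ins (assumption); real embeddings are much longer.
audio_embeddings = [
    [0.9, 0.1, 0.0],  # hypothetical clip 0
    [0.0, 1.0, 0.2],  # hypothetical clip 1
    [0.1, 0.0, 1.0],  # hypothetical clip 2
]
text_embedding = [0.1, 0.9, 0.1]  # hypothetical embedding of a text query

# Text-to-audio retrieval: the clip whose embedding points in the most
# similar direction to the text embedding is ranked first.
ranking = rank_by_similarity(text_embedding, audio_embeddings)
best_index, best_score = ranking[0]
print("best match: clip", best_index, "score", round(best_score, 3))
```

Swapping the roles (one audio embedding as the query, a list of text embeddings as candidates) gives audio-to-text retrieval with the same two functions.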

However, I would like to know whether there is a way to decode the embeddings into text. Decoding them into audio seems manageable using AudioLDM.

@satvik-dixit

@cvillela, is there a way to decode CLAP embeddings into audio using AudioLDM?

@arthur19312
Author

Once I made the analogy to CLIP, I understood how to use CLAP. My mind was stuck before ><. Thanks for your hints!
Now we know AudioLDM can turn text into audio. Is there a tool that works like CLIP Interrogator to turn audio into text?

@waldleitner

@arthur19312 The following CLAP implementation also supports a model for audio captioning (I have not tested it yet):

https://arxiv.org/abs/2309.05767
https://github.com/microsoft/CLAP
