Complete speaker identification (diarization) in audio files #21
To transcribe audio, I would recommend taking a look at OpenAI's Whisper plus a long-context LLM (gives you more customization options), or AssemblyAI (which can separate speaker 1, speaker 2, etc. without any customization).
We're talking to OpenAI already: https://openai.com/index/data-partnerships/ And we used Whisper (via the OpenAI partnership) to transcribe. So I guess my question is: how would you take a transcription we already have and identify the speakers in it? (The first run of our content through Whisper cost around $20k.)
Oh my... that's a significant spend; hope they gave you some credit. I can run Whisper on my server next time you need it.
Yeah, we had to get it done quickly, but thank you for the offer. We'll have to get smarter about this kind of thing in the future. I think identifying who says what is a huge deal, particularly if you can figure out who is who. One idea is a moot court chat interface, where you argue with the judge and they argue back, in their voice (deep voice?), saying things they'd be likely to say. Another use case is better transcripts, so people can read along in the browser (see Oyez.org). Allowing search by sentiment is another, and so forth. So I'm not sure exactly what we want diarization for (or at least it's not one obvious thing), but it feels like the next step in this process.
There are a few ways to do this:
Figure out people's actual names by listening to how they address each other and/or using the information we know from scraping, like the names of the judges on the panel.
Just call people speaker1, speaker2, etc.
I haven't looked into how to do this, but I gather there are a bunch of AI methods these days. Definitely something to research. If anybody wants to pick this up, I'd love to see a feature/quality/price/etc comparison across diarization methods.
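One common pattern, since we already have Whisper transcripts with segment timestamps, is to run a separate diarization pass (e.g. with a tool like pyannote.audio) and then align the two by timestamp overlap. Below is a minimal sketch of that alignment step only; the data shapes are assumptions (dicts with `start`/`end` in seconds, as Whisper's segment output roughly provides), and the sample segments and speaker turns are invented for illustration, not real diarization output.

```python
# Sketch: label existing transcript segments with speakers by overlapping
# them against diarization turns. Assumes both inputs carry start/end
# timestamps in seconds; real diarization output would come from a tool
# like pyannote.audio, which is not shown here.

def overlap(a_start, a_end, b_start, b_end):
    """Seconds of overlap between two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """For each transcript segment, pick the speaker whose diarization
    turns overlap it the most; 'unknown' if nothing overlaps."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Invented sample data for illustration.
segments = [
    {"start": 0.0, "end": 4.5, "text": "May it please the Court..."},
    {"start": 4.5, "end": 7.0, "text": "Counsel, before you begin..."},
]
turns = [
    {"start": 0.0, "end": 4.4, "speaker": "speaker1"},
    {"start": 4.4, "end": 8.0, "speaker": "speaker2"},
]

for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

This only gets us to the "speaker1, speaker2" level; mapping those anonymous labels to actual names (option 1 above) would be a second step, e.g. matching turns against the scraped panel data.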