
Complete speaker identification (diarization) in audio files #21

Open
mlissner opened this issue Jul 15, 2024 · 4 comments
Labels
courtlistener data enhancement Making an existing feature better

Comments

@mlissner
Member

There are a few ways to do this:

  1. Figure out people's actual names by listening to how they address each other and/or using the information we know from scraping, like the names of the judges on the panel.

  2. Just call people speaker1, speaker2, etc.

I haven't looked into how to do this, but I gather there are a bunch of AI methods these days. Definitely something to research. If anybody wants to pick this up, I'd love to see a feature/quality/price/etc comparison across diarization methods.

@legaltextai

legaltextai commented Jul 15, 2024

To transcribe audio, I would recommend taking a look at OpenAI's Whisper plus a long-context LLM (which gives you more customization options), or AssemblyAI (which can separate speaker 1, speaker 2, etc. without any customization).
If you don't want to run your audio files through transcription again, I can write a script to feed the contents of each text file into a fine-tuned or properly prompted model. IMHO, GPT-4o or Claude 3.5 Sonnet should be able to pick up where one speaker stops and another starts, which speaker is a judge or a party, and even sentiment.
This will cost $$, but at some point you may want to reach out to OpenAI, Anthropic, Cohere, and AssemblyAI, explain the public benefit of your project, and ask for credits.
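A rough sketch of the transcript-relabeling idea above: chunk an existing Whisper transcript and wrap each chunk in a diarization prompt for an LLM. The prompt wording, speaker labels, and chunk size are illustrative assumptions, not a tested recipe.

```python
# Sketch: build diarization prompts from an existing transcript.
# CHUNK_CHARS and the instruction text are assumptions for illustration.

CHUNK_CHARS = 8000  # rough budget to stay inside a model's context window

PROMPT_TEMPLATE = """You are labeling a court oral-argument transcript.
Rewrite the transcript below, prefixing each utterance with a speaker
label such as JUDGE_1, COUNSEL_1, COUNSEL_2. Keep the wording unchanged.

Transcript:
{chunk}
"""

def chunk_transcript(text: str, size: int = CHUNK_CHARS) -> list[str]:
    """Split a transcript into chunks, breaking on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > size:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def build_prompts(text: str) -> list[str]:
    """One LLM prompt per transcript chunk."""
    return [PROMPT_TEMPLATE.format(chunk=c) for c in chunk_transcript(text)]
```

Each prompt would then be sent to whichever model is chosen; overlap between chunks (so a speaker turn isn't cut mid-sentence) is an obvious refinement this sketch omits.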

@mlissner
Member Author

We're talking to OpenAI already: https://openai.com/index/data-partnerships/

And we used Whisper (via the OpenAI partnership) to transcribe. So I guess my question is how you'd take a transcription we have and have it identify speakers? (The first run of our content through Whisper cost around $20k.)

@legaltextai

Oh my... that's a significant spend; I hope they gave you some credit. I can run Whisper on my server next time you need it.
Basically, you either 1) fine-tune an open-source model yourself (or via third-party infra like Predibase), or 2) use one of the good commercial models (I like GPT-4o and Claude 3.5 Sonnet) with a prompt and examples. You can use some of the SCOTUS transcripts for training or for the prompt.
But as a first step, I would get clear on the use case and the typical queries you'd want this service to handle; then it's easier to fine-tune or prompt.
Also take a look at AssemblyAI; they specialize in transcription.
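Whichever model is used, its labeled output has to be parsed back into structured speaker turns. A minimal sketch, assuming the prompt enforces a `SPEAKER_LABEL: utterance` line format (that convention is my assumption, not anything the thread specifies):

```python
import re

# Sketch: parse model output of the form "SPEAKER_LABEL: utterance"
# into structured turns. Real model output would need more defensive
# handling than this.

TURN_RE = re.compile(r"^([A-Z][A-Z0-9_]*):\s*(.*)$")

def parse_turns(labeled_text: str) -> list[dict]:
    """Group labeled lines into turns; unlabeled lines continue the prior turn."""
    turns = []
    for line in labeled_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = TURN_RE.match(line)
        if m:
            turns.append({"speaker": m.group(1), "text": m.group(2)})
        elif turns:
            # Continuation line: fold into the previous speaker's turn.
            turns[-1]["text"] += " " + line
    return turns
```

The structured turns could then be stored alongside the transcript, which is what the read-along and search-by-speaker use cases discussed below would need.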

@mlissner
Member Author

Yeah, we had to get it done quickly, but thank you for the offer. We'll have to get smarter about this kind of thing in the future.

I think identifying who says what is a huge deal, particularly if you can figure out who is who. One idea is making a moot court chat interface, where you argue with the judge and they argue back, in their voice (deep voice?), saying things that they'd be likely to say.

Another use case is having better transcripts so people can read along in the browser (see Oyez.org). Allowing search by sentiment is another, and so forth.

So I guess I'm not sure exactly what we want diarization for; it isn't one obvious thing, but it feels like the next step in this process.

@mlissner mlissner transferred this issue from freelawproject/courtlistener Sep 25, 2024
@mlissner mlissner added the enhancement (Making an existing feature better) and courtlistener data labels Sep 25, 2024