
How to run the diarization pipeline on multiple file segments? #265

Open
ywangwxd opened this issue Dec 25, 2024 · 2 comments
@ywangwxd

I am trying to run the diarization pipeline on multiple file segments, which are continuous parts of a long audio file.

Here is what I am doing. To put it simply, ideally I just need to set the timestamp_shift correctly for each segment:

    import diart.sources as src
    from diart.inference import StreamingInference

    timestamp_shift = 0
    # pipeline is initialized once here, before the loop
    for fsplit in split_file_paths:
        args.source = fsplit
        padding = config.get_file_padding(args.source)
        audio_source = src.FileAudioSource(args.source, config.sample_rate, padding, block_size)
        # Shift this segment's timestamps by the total duration of all previous segments
        pipeline.set_timestamp_shift(-padding[0] + timestamp_shift)
        # timestamp_shift += (audio_source.duration - config.duration)
        timestamp_shift += audio_source.duration
        inference = StreamingInference(
            pipeline,
            source=audio_source,
            batch_size=config.batch_size,
            do_profile=True,
            show_progress=False,
        )

        # Attach observers for required side effects
        observers = []
        # observers = [pipeline.suggest_writer(audio_source.uri, args.output)]
        if not args.no_plot:
            observers.append(pipeline.suggest_display())

        inference.attach_observers(*observers)
        res = inference()

But the result is incorrect. Specifically, the result of each segment after the first one is always one duration (5 seconds in my config) ahead. Here is an example diarization result to explain the problem. To check whether the timestamps are correct, I did a dummy test: I used the same file segment in two consecutive runs of the pipeline. The duration of this file segment is 300 seconds. You can see that in the first segment the speech starts at 7.820 seconds, which is correct. In the second segment it is supposed to start at roughly 307.820 seconds, but the result starts at 312.820 seconds. The difference is exactly 5 seconds. I have stepped through the code with a debugger, and it looks like the last_end_time of audio_buffer at the end of each segment is always 5 seconds ahead, but I do not know how to fix it.

# diarization of first segment, total duration is 300 seconds
0.000--0.800 Music. 7.820--12.140 Hello and welcome to Close Up with The Hollywood Reporter. Actresses, I'm Matthew Bellany. 12.360--16.220 I'd like to welcome our guest today, Sarsha Ronan, Allison Janney. 16.620--20.360 Mary J. Blige Emma Stone Jennifer Lawrence 20.830--21.890 and Jessica Chastain. 22.270--26.490 Let's get started. Obviously the headlines in Hollywood

# diarization of second segment, total duration is 300 seconds
305.000--305.840 Music. 312.820--317.160 Hello and welcome to Close Up with The Hollywood Reporter. Actresses, I'm Matthew Bellany. 317.380--321.220 I'd like to welcome our guest today, Sarsha Ronan, Allison Janney. 321.620--325.360 Mary J. Blige Emma Stone Jennifer Lawrence 325.830--326.890 and Jessica Chastain. 327.270--331.490 Let's get started. Obviously the headlines

@juanmc2005
Owner

Hi @ywangwxd! What you're trying to do here is more complex than setting a timestamp shift.

If you run a different audio source and pipeline for each chunk, diart will assume that each file part is a different file, and it will attempt to give you all the results it can.

It would seem to me that you can frame the problem as a "conversion" between streams of audio chunks. In other words, you have a stream of non-overlapping 5s chunks and you want to feed that to a diart pipeline.
The class StreamingInference can help with that because it will attempt to reformat your stream to 5s windows with a 0.5s shift, but you should make sure that your 5s chunks are formatted as an AudioSource.

What I would do here is implement your own audio source. In this custom source, you would iterate over your file parts. Each part should be split into blocks of size step (as in FileAudioSource; check the block_size parameter), and those blocks emitted through the audio source stream with on_next(). This audio source, paired with StreamingInference, should then give you the expected result.
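
A minimal sketch of what that could look like (the class name and loading details here are illustrative, not a definitive implementation; it assumes diart's AudioSource base class creates the stream, and that AudioLoader from diart.audio loads a file as a (channels, samples) tensor):

    import numpy as np

    from diart.audio import AudioLoader  # assumption: diart's file loading helper
    from diart.sources import AudioSource

    class ConcatenatedFileAudioSource(AudioSource):
        """Streams consecutive file parts as one continuous audio source."""

        def __init__(self, file_paths, sample_rate, block_size):
            super().__init__(uri=file_paths[0], sample_rate=sample_rate)
            self.file_paths = file_paths
            self.block_size = block_size
            self.loader = AudioLoader(sample_rate, mono=True)
            self._closed = False

        def read(self):
            leftover = np.empty((1, 0), dtype=np.float32)
            for path in self.file_paths:
                if self._closed:
                    break
                waveform = self.loader.load(path).numpy()  # (channels, samples)
                # Carry samples over file boundaries so blocks stay contiguous
                waveform = np.concatenate([leftover, waveform], axis=1)
                num_blocks = waveform.shape[1] // self.block_size
                for i in range(num_blocks):
                    block = waveform[:, i * self.block_size : (i + 1) * self.block_size]
                    self.stream.on_next(block)
                leftover = waveform[:, num_blocks * self.block_size :]
            # Pad only the very last block of the whole stream, not of each part
            if not self._closed and leftover.shape[1] > 0:
                missing = self.block_size - leftover.shape[1]
                self.stream.on_next(np.pad(leftover, ((0, 0), (0, missing))))
            self.stream.on_completed()

        def close(self):
            self._closed = True

Since padding happens only once at the very end, the pipeline sees the parts as a single uninterrupted stream, and no per-segment timestamp shift is needed.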

@ywangwxd
Author

ywangwxd commented Jan 3, 2025

Thanks for your suggestion.

I have made it work with a workaround. I found the problem was the padding. To be more specific, the last chunk of each file segment is padded to the full window duration (5 s in my case). This has the effect of appending an artificial blank audio piece (less than 5 s) at the end of each segment. So the time shift for the next segment is not the actual duration of the previous segment, but

    math.ceil(audio_source.duration / config.duration) * config.duration
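
For example (numbers here are just for illustration): with config.duration = 5 and a 302.3-second part, the shift advances by math.ceil(302.3 / 5) * 5 = 305 seconds, and the extra 305 - 302.3 = 2.7 seconds of artificial audio must be subtracted later.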

With this modification, the diarization and ASR alignment are correct for each segment. But to remove the appended artificial audio, I need to do another "subtraction", this time on the final transcription results. Overall, I have done something like this:

    # Use the original time-shift mechanism to keep diarization and ASR alignment correct
    pipeline.set_timestamp_shift(-padding[0] + timestamp_shift)

    if isinstance(pipeline, SpeakerAwareTranscription):
        pipeline.set_timestamp_shift_backward(timestamp_shift_backward)

    # Advance by the padded length of this segment (a multiple of config.duration)
    timestamp_shift += math.ceil(audio_source.duration / config.duration) * config.duration
    # Subtract the duration of the appended artificial audio
    timestamp_shift_backward -= (math.ceil(audio_source.duration / config.duration) * config.duration - audio_source.duration)

The use of set_timestamp_shift_backward is quite simple: it just applies the backward shift to the timestamp values after all the original processing is finished.
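
For reference, roughly what that looks like (SpeakerAwareTranscription is my own pipeline class, so the names below are just illustrative):

    class SpeakerAwareTranscription:
        # ... original pipeline code ...

        def set_timestamp_shift_backward(self, shift: float):
            # Accumulated as a negative value in the loop above
            self.timestamp_shift_backward = shift

        def shift_back(self, start: float, end: float):
            # Applied once all the original processing (diarization + ASR
            # alignment) has produced the final timestamps for a segment
            return start + self.timestamp_shift_backward, end + self.timestamp_shift_backward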
