More natural line-wrapping when using --max_line_width #78

JonasCz · 2024-01-17T16:46:01Z

By default, Whisper produces subtitles (SRT/VTT) with often quite long lines. For some uses these can be too long for viewers to read comfortably (a common recommendation is that subtitle lines should be at most ~50 characters long). For example, testing with "The Expert":


1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and enhance intangible assets.

3
00:00:08,080 --> 00:00:13,660
In pursuit of these objectives, we've started a new project for which we require seven red lines.

If I want them shorter, I can use something like --max_line_count 2 --max_line_width 50, which does produce very consistent, short lines. However, the current line-wrapping implementation results in subtitles that are quite unnatural to read, because line and subtitle breaks do not fall on sentence or clause boundaries.


1
00:00:00,000 --> 00:00:05,800
Our company has a new strategic initiative to
increase market penetration, maximise brand

2
00:00:05,800 --> 00:00:11,700
loyalty and enhance intangible assets. In pursuit
of these objectives, we've started a new project

3
00:00:11,700 --> 00:00:16,480
for which we require seven red lines. I understand
your company can help us in this matter. Of

This PR changes this by wrapping lines in a more natural way: it splits them on periods or commas if possible, and otherwise on the longest gap around the middle of the too-long line. The result is text that reads more naturally while staying within the set --max_line_width constraint:

1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic
initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and
enhance intangible assets.

3
00:00:08,080 --> 00:00:12,060
In pursuit of these objectives,
we've started a new project for which

4
00:00:12,060 --> 00:00:13,660
we require seven red lines.
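
For readers who want to see the idea in code, here is a minimal sketch of the heuristic described above. It is not the code from this PR. The Word class, BREAK_CHARS, choose_split and wrap are hypothetical names, and reading "longest gap" as the longest pause between word timestamps is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str     # word text, including any trailing punctuation
    start: float  # start time in seconds
    end: float    # end time in seconds


# Hypothetical set of preferred break characters (periods and commas).
BREAK_CHARS = (".", ",")


def line_length(words: List[Word]) -> int:
    """Length of the words when joined with single spaces."""
    return len(" ".join(w.text for w in words))


def choose_split(words: List[Word], max_width: int) -> int:
    """Return i so that words[:i] becomes one line and words[i:] remains."""
    # Only consider split points whose first part fits within max_width.
    candidates = [
        i for i in range(1, len(words)) if line_length(words[:i]) <= max_width
    ]
    if not candidates:
        return 1  # a single over-long word: emit it on its own line

    # 1. Prefer the last fitting split that follows a period or comma.
    punct = [i for i in candidates if words[i - 1].text.endswith(BREAK_CHARS)]
    if punct:
        return punct[-1]

    # 2. Otherwise pick the longest pause among split points near the middle.
    mid = len(words) / 2
    window = [i for i in candidates if abs(i - mid) <= max(1, len(words) // 4)]
    pool = window or candidates
    return max(
        pool,
        key=lambda i: (words[i].start - words[i - 1].end, -abs(i - mid)),
    )


def wrap(words: List[Word], max_width: int) -> List[str]:
    """Split words into lines no longer than max_width where possible."""
    lines: List[str] = []
    rest = list(words)
    while rest and line_length(rest) > max_width:
        i = choose_split(rest, max_width)
        lines.append(" ".join(w.text for w in rest[:i]))
        rest = rest[i:]
    if rest:
        lines.append(" ".join(w.text for w in rest))
    return lines
```

Extending BREAK_CHARS to other punctuation would be a one-line change in this sketch. How the resulting lines are grouped into subtitles under --max_line_count is left out here.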

I've tested that:

Diarization output is the same
Works regardless of language
The JSON output is not changed

I'm not super familiar with Python, so this code is probably not the nicest. Any feedback is appreciated!

Purfview · 2024-02-12T14:57:48Z

Does this work with --highlight_words?

JonasCz · 2024-02-13T17:03:39Z

Yes, testing with --highlight_words True results in "karaoke style" underlined words as expected.

Purfview · 2024-02-13T17:32:07Z

Did you mean underlined and with "more natural line-wrapping"?

JonasCz · 2024-02-13T18:03:52Z

Yes, both together work, i.e. --word_timestamps True --highlight_words True --max_line_count 2 --max_line_width 50 gives underlines and natural line wraps as shown above.

Purfview · 2024-02-13T18:08:42Z

Thx, then maybe I'll borrow your PR for my repo to work with "highlight_words" as my implementation of "max_line_width/max_line_count" is not compatible with "highlight_words".

Lycoan · 2024-02-16T08:26:07Z

@JonasCz, nice extension! Does it detect sentence endings besides the period, like '?', '!' and even '-'?

Anyway, it seems that your fork fails to run when --max_line_width is not given, but --word_timestamps is set to True.
It can be checked by the following in the base folder of the repo:
whisper-ctranslate2 --model medium --language Catalan --output_format srt --word_timestamps True ./e2e-tests/gossos.mp3

It is also worth running the tests and modifying them if needed (right now they fail, unfortunately):
make run-tests
(the following packages need to be installed first: pip install torch pyannote.audio)
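
The cause of the failure is not identified in this thread. One plausible class of bug, assuming the option is left as None when --max_line_width is omitted, is wrapping code that compares a line length against None. The sketch below shows a guard for that case. wrap_if_requested is a hypothetical helper, and textwrap.wrap stands in for the PR's own wrapping logic.

```python
import textwrap
from typing import List, Optional


def wrap_if_requested(text: str, max_line_width: Optional[int]) -> List[str]:
    """Wrap text only when a width was given; otherwise return it unchanged."""
    if max_line_width is None:
        # No width requested: skip wrapping instead of comparing against None.
        return [text]
    # Stand-in for the real wrapping logic (assumption, not the PR's code).
    return textwrap.wrap(text, width=max_line_width)
```

With a guard like this, running with --word_timestamps True but without --max_line_width would take the first branch and leave the segment text untouched.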

more natural line-wrapping when using --max_line_width

043e686
