More natural line-wrapping when using --max_line_width #78

Open
wants to merge 1 commit into
base: main

@JonasCz JonasCz commented Jan 17, 2024

By default, Whisper produces subtitles (SRT/VTT) whose lines are often quite long. For some uses these can be too long for viewers to read comfortably (a common recommendation is that subtitle lines should be at most ~50 characters long). For example, testing with "The Expert":


1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and enhance intangible assets.

3
00:00:08,080 --> 00:00:13,660
In pursuit of these objectives, we've started a new project for which we require seven red lines.

If I want them shorter, I can use something like --max_line_count 2 --max_line_width 50, which does give very consistent, short lines. However, the current line-wrapping implementation produces subtitles that are quite unnatural to read, because line and subtitle breaks do not fall on sentence or sub-sentence boundaries:


1
00:00:00,000 --> 00:00:05,800
Our company has a new strategic initiative to
increase market penetration, maximise brand

2
00:00:05,800 --> 00:00:11,700
loyalty and enhance intangible assets. In pursuit
of these objectives, we've started a new project

3
00:00:11,700 --> 00:00:16,480
for which we require seven red lines. I understand
your company can help us in this matter. Of
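
For illustration only: the breaks above are what a greedy "fill each line up to the limit" wrap produces. The sketch below uses Python's standard textwrap module as a stand-in for that behaviour. It is an assumption about the general approach, not the actual whisper-ctranslate2 code, and it ignores word timings and the two-line grouping entirely.

```python
import textwrap

# Transcript text taken from the example above, joined into one string.
TEXT = (
    "Our company has a new strategic initiative to increase market "
    "penetration, maximise brand loyalty and enhance intangible assets. "
    "In pursuit of these objectives, we've started a new project for "
    "which we require seven red lines."
)

# Greedy wrapping: pack as many words as fit into 50 characters,
# regardless of where sentences or clauses end.
lines = textwrap.wrap(TEXT, width=50)

for line in lines:
    print(line)
```

Every line stays within 50 characters, but the breaks land mid-clause ("maximise brand" / "loyalty and ..."), which is the problem described here.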

This PR changes this by wrapping lines in a more natural way: it splits them on periods or commas if possible, and otherwise on the longest gap around the middle of the too-long line. The result is text that reads more naturally while staying within the --max_line_width constraint:

1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic
initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and
enhance intangible assets.

3
00:00:08,080 --> 00:00:12,060
In pursuit of these objectives,
we've started a new project for which

4
00:00:12,060 --> 00:00:13,660
we require seven red lines.
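
A rough sketch of the splitting rule described above, for readers who want to see the idea in code. This is not the code in this PR: the Word type, the function names, the "within two words of the middle" window, and the made-up word timings are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    # Hypothetical word record: text plus start/end time in seconds.
    text: str
    start: float
    end: float


def join(words: List[Word]) -> str:
    return " ".join(w.text for w in words)


def split_index(words: List[Word], max_width: int) -> int:
    """Pick i so that words[:i] becomes the next line."""
    # Only consider split points whose first part fits within max_width.
    candidates = [i for i in range(1, len(words))
                  if len(join(words[:i])) <= max_width]
    if not candidates:
        return 1  # a single over-long word: emit it on its own
    # 1. Prefer the last fitting split that follows a period or comma.
    punct = [i for i in candidates if words[i - 1].text.endswith((".", ","))]
    if punct:
        return punct[-1]
    # 2. Otherwise take the longest pause near the middle (assumed window: 2 words).
    mid = len(words) / 2
    near = [i for i in candidates if abs(i - mid) <= 2] or candidates
    return max(near, key=lambda i: words[i].start - words[i - 1].end)


def wrap_naturally(words: List[Word], max_width: int) -> List[str]:
    lines, rest = [], list(words)
    while len(rest) > 1 and len(join(rest)) > max_width:
        i = split_index(rest, max_width)
        lines.append(join(rest[:i]))
        rest = rest[i:]
    lines.append(join(rest))
    return lines


def make_words(text: str, pause_after: str, pause: float = 0.5) -> List[Word]:
    # Fabricated timings: 0.3 s per word, with one pause after `pause_after`.
    words, t = [], 0.0
    for token in text.split():
        words.append(Word(token, t, t + 0.3))
        t += 0.3
        if token == pause_after:
            t += pause
    return words


segment = make_words(
    "In pursuit of these objectives, we've started a new project "
    "for which we require seven red lines.",
    pause_after="which",
)
for line in wrap_naturally(segment, max_width=50):
    print(line)
```

With the invented pause after "which", this prints the same three lines as subtitles 3 and 4 above. Grouping lines into subtitles according to --max_line_count, and recomputing the timestamps for each group, is left out of the sketch.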

I've tested that:

  • Diarization output is the same
  • Works regardless of language
  • The JSON output is not changed

I'm not super familiar with Python, so this code is probably not the nicest. Any feedback is appreciated!

@Purfview

Does this work with --highlight_words?

@JonasCz

JonasCz commented Feb 13, 2024

Yes, testing with --highlight_words True results in "karaoke style" underlined words as expected.

@Purfview

Did you mean underlined and with "more natural line-wrapping"?

@JonasCz

JonasCz commented Feb 13, 2024

Yes, both work together, i.e. --word_timestamps True --highlight_words True --max_line_count 2 --max_line_width 50 gives underlines and natural line wraps as shown above.

@Purfview

Thx, then maybe I'll borrow your PR for my repo to work with "highlight_words" as my implementation of "max_line_width/max_line_count" is not compatible with "highlight_words".

@Lycoan

Lycoan commented Feb 16, 2024

@JonasCz, nice extension! Does it detect sentence endings besides period, like '?', '!' and even '-' ?

Anyway, it seems that your fork fails to run when --max_line_width is not given but --word_timestamps is set to True.
You can check this by running the following in the base folder of the repo:
whisper-ctranslate2 --model medium --language Catalan --output_format srt --word_timestamps True ./e2e-tests/gossos.mp3

It is also worth running the tests and modifying them if needed (unfortunately, they currently fail):
make run-tests
(the following packages need to be installed first: pip install torch pyannote.audio)
