v1.1.0 #408

ggerganov · 2023-01-15T12:00:56Z

ggerganov
Jan 15, 2023
Maintainer

Overview

The major change in this pre-release is the improved decoding implementation in whisper.cpp:

Support for average logprob and entropy based criteria for fallback
Support for temperature T > 0
Improved Greedy decoder via best_of parameter for T > 0
Add beam search decoding (a.k.a beam_size)

More information about the decoding changes can be found in #291.
Additionally, there are a few performance improvements for Apple Silicon, WASM and non-F16C CPUs.
Support for POWER9 architectures has been added.
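
The fallback criteria in the list above can be pictured roughly as follows. This is a minimal Python sketch, not the whisper.cpp implementation: `decode_fn` is a hypothetical decoder callback, and the threshold values and temperature schedule are assumptions chosen for illustration. The `best_of` and beam search options are not modelled here.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# (tokens, per-token log-probabilities) returned by a hypothetical decoder
Decoded = Tuple[List[int], List[float]]


def token_entropy(tokens: List[int]) -> float:
    """Shannon entropy (in nats) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def decode_with_fallback(
    decode_fn: Callable[[float], Decoded],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold: float = -1.0,  # assumed value
    entropy_threshold: float = 2.4,   # assumed value
) -> Tuple[float, List[int]]:
    """Retry at increasing temperature until both checks pass.

    If every temperature fails, the last attempt is returned.
    """
    used, tokens = temperatures[0], []
    for used in temperatures:
        tokens, logprobs = decode_fn(used)
        avg_logprob = sum(logprobs) / max(len(logprobs), 1)
        too_unlikely = avg_logprob < logprob_threshold
        # Low entropy means few distinct tokens, i.e. likely repetition
        too_repetitive = bool(tokens) and token_entropy(tokens) < entropy_threshold
        if not (too_unlikely or too_repetitive):
            break
    return used, tokens
```

With a decoder that loops at low temperature and recovers at 0.4, the function returns the first attempt that passes both checks.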

The reason that this is a pre-release and not an official release is that the new implementation has not been sufficiently tested yet and the existing bindings for other languages have not been updated to support the API changes. The official release v1.1.x will be created when there is enough feedback about the new decoding implementation and when the bindings have been updated. So make sure to send your feedback in the discussion created for this pre-release. For now, the v1.0.4 release should be considered more stable.

What's Changed

Core `ggml` / `whisper`

ggml : POWER9 support by @fitzsim in ggml : add f16 acceleration for POWER9 ppc64le #320, ggml : improve f16 acceleration for POWER9 ppc64le #349, Reorganize POWER9 SIMD code #369
ggml : simplify the SIMD code by @ggerganov in #324
ggml : add SSE3 and fp16 conversion lookup table by @abitofevrything in #368
ggml : utilise Accelerate's vDSP for some computations d51fc3e
ggml : speed-up softmax compute via Accelerate and loop unrolling d61d55c
ggml : do not start extra threads when using BLAS d347a59
whisper : do sample_to_timestamp calculation with 64-bit precision to avoid overflow by @boolemancer in #388
whisper : various code clean-up and improvements by @asmaloney in ggml: Make consts static #317, whisper: Fix mem leak on failure to load model #318, whisper: Use emplace_back in place of push_back #319, examples: small code cleanups #322, etc.
whisper : improve decoding by @ggerganov in #291
whisper : account for speed_up flag for short audio (see "Short voice be skipped in speed_up mode" #405)
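
For the sample_to_timestamp item above, here is a rough illustration of why a 32-bit intermediate value can overflow on long recordings. It assumes timestamps in 10 ms units computed as 100 * n_samples / sample_rate at 16 kHz. That formula and the helper names are assumptions for illustration, not taken from the source, and the 32-bit wraparound is simulated because Python integers do not overflow.

```python
SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def wrap_int32(x: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (x + 2**31) % 2**32 - 2**31


def sample_to_timestamp_32(n_samples: int) -> int:
    """Timestamp in 10 ms units with the product held in 32 bits."""
    product = wrap_int32(100 * n_samples)
    # Divide with truncation toward zero, as C integer division does
    quotient = abs(product) // SAMPLE_RATE
    return quotient if product >= 0 else -quotient


def sample_to_timestamp_64(n_samples: int) -> int:
    """Same computation with a wide intermediate product."""
    return (100 * n_samples) // SAMPLE_RATE


ten_minutes = 10 * 60 * SAMPLE_RATE      # product still fits in 32 bits
thirty_minutes = 30 * 60 * SAMPLE_RATE   # product exceeds 2**31 - 1
```

Under these assumptions the product crosses the signed 32-bit limit a little after 22 minutes of audio, so the 30-minute case wraps to a negative value while the 10-minute case is unaffected.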

C-style API

Add loader class to allow loading from buffer and others by @prsyahmi in #353
Add whisper_token_data::plog
Add whisper_init_from_file()
Add whisper_init_from_buffer()
Change whisper_init()
Remove whisper_sample_best()
Remove whisper_sample_timestamp()
Add whisper_n_audio_ctx()
Add whisper_get_logits()
Remove whisper_get_probs()
Change struct whisper_full_params

Bindings

Golang bindings by @djthorpe in Initial import of golang bindings #287, go bindings updated so they can be used in third party packages. #379, go bindings: Adding features to the go-whisper example, etc #384

Examples

whisper.android : remove android ABI constraint by @Digipom in #301
whisper.swiftui : SwiftUI example by @Digipom in #308
main : add -ocsv, aka --output-csv, for writing a CSV file containing millisecond timestamps by @NielsMayer in #340
command : refactor to split command list & general transcription modes by @asmaloney in #331
command : always-prompt mode by @dnhkng in Command: always test the prompt #383
stream : fix data race on bool + avoid division-by-zero a466c34
stream : fix a bug that inserted a lot of empty audio at the start a6dbd91
bench.wasm : print system info fafd789

New Contributors

@djthorpe made their first contribution in Initial import of golang bindings #287
@0xmohit made their first contribution in run go mod tidy before building examples #296
@asmaloney made their first contribution in {cmake} Add headers to target #298
@fitzsim made their first contribution in ggml : add f16 acceleration for POWER9 ppc64le #320
@NielsMayer made their first contribution in Similar to Whisper PR#228, this adds -ocsv, aka --output-csv, writing CSV file containing millisecond timestamps #340
@aviks made their first contribution in Add runtime destination install #345
@eltociear made their first contribution in models : fix typo in convert-h5-to-ggml.py #346
@abitofevrything made their first contribution in Add SSE3 and fp16 conversion lookup table #368
@Mike-Bell made their first contribution in Support AVX2 in windows better #381
@dnhkng made their first contribution in Command: always test the prompt #383
@prsyahmi made their first contribution in Add loader class to allow loading from buffer and others #353
@ianb made their first contribution in (README) Make first example and stream example easier to run #391

Full Changelog: v1.0.4...v1.1.0

Highlights

Sample SwiftUI application: examples/whisper.swiftui

This discussion was created from the release v1.1.0.

geimist · 2023-01-15T12:15:23Z

Thanks for your great work!
I have had several problematic files so far, including a 90-minute file that got stuck in a loop of a few words from minute 20 until the end. With the current version, the result is flawless.
Thank you very much.

0 replies

szeidner · 2023-01-15T19:01:10Z

This is awesome! In my testing so far, this new version has also not had any of the issues the previous version did of being stuck in a loop of one line or one word. One thing I've noticed is that there are dashes at the beginning of some of the lines of the output. I'm not sure if this is a change in the new version or just something odd with the model (I'm using tiny.en)

[00:47:07.920 --> 00:47:10.400]   Oh boy, kicking down chairs and knocking down tables.
[00:47:10.400 --> 00:47:12.080]   - In a restaurant.
[00:47:12.080 --> 00:47:12.920]   - I had a restaurant.
[00:47:12.920 --> 00:47:14.760]   - It's fine to the concord.
[00:47:14.760 --> 00:47:17.360]   - Just doing amazing parody of that that I am going to say.
[00:47:17.360 --> 00:47:19.360]   - I know, I have to see a lot that.
[00:47:19.360 --> 00:47:20.520]   - Yeah, but you can still send it to me.
[00:47:20.520 --> 00:47:21.840]   - But they are a little watch right now.
[00:47:21.840 --> 00:47:23.680]   - But it's so funny because each one of them
[00:47:23.680 --> 00:47:25.840]   does a funnier version than the other

1 reply

janngobble Jan 15, 2023

> This is awesome! In my testing so far, this new version has also not had any of the issues the previous version did of being stuck in a loop of one line or one word. One thing I've noticed is that there are dashes at the beginning of some of the lines of the output. I'm not sure if this is a change in the new version or just something odd with the model (I'm using tiny.en)
> [00:47:07.920 --> 00:47:10.400]   Oh boy, kicking down chairs and knocking down tables.
> [00:47:10.400 --> 00:47:12.080]   - In a restaurant.
> [00:47:12.080 --> 00:47:12.920]   - I had a restaurant.
> [00:47:12.920 --> 00:47:14.760]   - It's fine to the concord.
> [00:47:14.760 --> 00:47:17.360]   - Just doing amazing parody of that that I am going to say.
> [00:47:17.360 --> 00:47:19.360]   - I know, I have to see a lot that.
> [00:47:19.360 --> 00:47:20.520]   - Yeah, but you can still send it to me.
> [00:47:20.520 --> 00:47:21.840]   - But they are a little watch right now.
> [00:47:21.840 --> 00:47:23.680]   - But it's so funny because each one of them
> [00:47:23.680 --> 00:47:25.840]   does a funnier version than the other

Whisper sometimes puts dashes when the speaker changes. I can’t guarantee this is the answer but it’s what I’ve observed in certain files I’ve transcribed.

debasish-mihup · 2023-01-16T06:27:01Z

@ggerganov Great work! Could you add artifact uploads for platforms other than Windows? I cannot seem to find pre-built binaries for non-Windows platforms under Actions.

1 reply

ggerganov Jan 16, 2023
Maintainer Author

Currently, only Windows binaries are produced.
I personally think that Unix platforms offer a super easy way to build the code from source, even for non-technical people: just git clone + make. But if providing pre-compiled binaries is of interest to more people, we can try to improve the CI, although there might be issues with CPU compatibility flags, system library versions, all that jazz.

djthorpe · 2023-01-16T08:50:25Z

Thank you Georgi! I have a small request to make the versioning correct for golang. Can you tag with a "v" at the front when making semantic tags/releases (for example, "v1.0.1" rather than "1.0.1")? I don't really know why, but golang expects this prefix, and it will help with the go documentation and module support when making updates.

The source I found about this is here:
https://pkg.go.dev/golang.org/x/mod/semver

But it's unclear to me exactly why golang needs the "v".

3 replies

janngobble Jan 16, 2023

> Thank you Georgi! I have a small request to make the versioning correct for golang. Can you tag with a "v" at the front when making semantic tags/releases (for example, "v1.0.1" rather than "1.0.1")? I don't really know why, but golang expects this prefix, and it will help with the go documentation and module support when making updates.
>
> The source I found about this is here: https://pkg.go.dev/golang.org/x/mod/semver
>
> But it's unclear to me exactly why golang needs the "v".

This is just tagging. You can literally just use: go get example.com/[email protected] instead of go get example.com/[email protected].

If he changed the tag just 'cos someone on go's team decided using "v" when tagging would be best, it would be changing the entire GitHub tagging methodology.

If I understand https://go.dev/doc/modules/publishing correctly, then go doesn't expect it so much as someone on the dev team decided it would be nice.

For instance, when using macports, you do a port install [email protected]_0

The "v" you are asking for is simply part of the tag. It's the @ that denotes versioning. Or am I wrong?

Not trying to strike you down or anything, seriously wondering.

djthorpe Jan 16, 2023

OK thank you for your reply @janngobble - I spent a moment looking at this a bit more and to clarify, the golang spec says "Each version starts with the letter v, followed by a semantic version". So take this for what you will!
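
The rule quoted above ("the letter v, followed by a semantic version") can be checked mechanically. The following is a simplified sketch in Python rather than Go's own implementation: the function name and the regular expression are assumptions for illustration, and it only accepts the full MAJOR.MINOR.PATCH form with an optional pre-release suffix.

```python
import re

# Assumption: "v" + MAJOR.MINOR.PATCH, optionally followed by a
# pre-release suffix such as "-beta.1". Build metadata is not handled.
_GO_VERSION = re.compile(
    r"^v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(-[0-9A-Za-z.-]+)?$"
)


def looks_like_go_module_version(tag: str) -> bool:
    """Return True if the tag has the 'v' + semantic version shape."""
    return _GO_VERSION.match(tag) is not None
```

Under this check, "v1.0.1" passes and "1.0.1" does not, which matches the request made at the top of this thread.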

ggerganov Jan 16, 2023
Maintainer Author

I can see that both tags can work and neither is technically wrong.
But considering that we have already put effort into providing golang support, I am OK to switch to v1.1.0 just for that.
I am actually not sure why I decided to drop the "v" in the first place, given that I use it in my other projects.
I probably decided to experiment and try something "wild" this time 😆

NielsMayer · 2023-01-19T03:28:06Z

I'm running 1.1.0+ (the head of the repository as of today) and re-ran transcription on two videos that previously got stuck in a loop of repeating words and then produced no output for the majority of the transcript. The original Python implementation of Whisper did not have these problems. As of the 1.1.0 version, these videos transcribe successfully.

Original sources:

https://odysee.com/@Housatonic:0/517-ep-189-1:5 (4 hours long)
https://odysee.com/@Housatonic:0/519-ep-189-2%EF%BB%BF:2 (2'20" long)

I'm also noticing some interesting improvements over the original whisper, e.g. tagging sounds:

(orig source: https://rumble.com/v25zs54-why-can-we-still-not-talk-about-natural-immunity-060-stay-free-with-russell.html ... output format from my -ocsv extension on 'main'):

0, 10000, "[birds chirping]"
10000, 20020, "[birds chirping]"
20020, 30020, "[birds chirping]"
30020, 40040, "[birds chirping]"
40040, 50040, "[birds chirping]"
50040, 60060, "[birds chirping]"
60060, 62060, "[birds chirping]"
62060, 82060, "[music]"
...

(orig source https://www.youtube.com/watch?v=ZMUKa2kWtTk )

...
3833600, 3835600, "[ Applause ]"
3835600, 3837600, ">> Thank you."
3837600, 3839600, "[ Applause ]"
3839600, 3841600, "[ Applause ]"
3841600, 3843600, "[ Applause ]"
3843600, 3845600, "[ Applause ]"
3845600, 3847600, "[ Applause ]"
3847600, 3849600, "[ Applause ]"
3849600, 3851600, "[ Applause ]"
3851600, 3853600, "[ Applause ]"
3853600, 3858600, "[ Silence ]"
...

(orig source: https://rumble.com/v25eg37-title-more-on-brazils-radical-censorship-escalation-after-show-q-and-a-on-l.html )

0, 2580, "(upbeat music)"
...
1222380, 1224960, "(upbeat music)"
1224960, 1228320, "[MUSIC PLAYING]"
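
Rows like the ones above are easy to post-process. As a small sketch, the Python below parses them and merges back-to-back rows that carry the same text. It assumes the three columns are start time in ms, end time in ms, and text, with no header row, as in the excerpts; the function names are made up for this example.

```python
import csv
import io
from typing import List, Tuple

Row = Tuple[int, int, str]  # (start_ms, end_ms, text)


def parse_rows(csv_text: str) -> List[Row]:
    """Parse 'start, end, "text"' lines, skipping blank lines."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [(int(r[0]), int(r[1]), r[2]) for r in reader if r]


def merge_repeats(rows: List[Row]) -> List[Row]:
    """Collapse contiguous rows with identical text into one row."""
    merged: List[Row] = []
    for start, end, text in rows:
        if merged and merged[-1][2] == text and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, text)
        else:
            merged.append((start, end, text))
    return merged


sample = '''50040, 60060, "[birds chirping]"
60060, 62060, "[birds chirping]"
62060, 82060, "[music]"
'''
rows = merge_repeats(parse_rows(sample))
```

Here the two "[birds chirping]" rows become a single row spanning 50040 to 62060, followed by the "[music]" row.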

1 reply

ggerganov Jan 19, 2023
Maintainer Author

Thanks - very useful feedback.
It's a good thing you used the latest master branch, because yesterday I realized that the main example had accidentally disabled the temperature fallback. This was fixed in f583e2d, and the fallback is enabled now.
It's the main thing that helps with resolving repetitions and other failure cases.

Additionally, adding --beam-size can further improve results via beam search, but at the cost of significantly increased computation time.

I believe the tagging is intentionally disabled in the original whisper via some post-filtering of the output - i.e. removing stuff in (...), [...], etc.
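
To make the idea of post-filtering concrete, here is a minimal sketch that strips (...) and [...] spans from a line of text. It only illustrates the kind of filter described above; it is not taken from either Whisper implementation, and the function name is hypothetical.

```python
import re

# Matches one non-nested (...) or [...] span
_ANNOTATION = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def strip_annotations(text: str) -> str:
    """Remove bracketed sound tags and tidy up leftover whitespace."""
    without_tags = _ANNOTATION.sub("", text)
    return " ".join(without_tags.split())
```

A line such as "[ Applause ]" becomes empty, while ">> Thank you." passes through unchanged.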

janngobble · 2023-01-20T12:03:00Z

wrt @NielsMayer

> 0, 2580, "(upbeat music)"
> ...
> 1222380, 1224960, "(upbeat music)"
> 1224960, 1228320, "[MUSIC PLAYING]"

I love my latest podcast transcription (yes, it was a Christmas podcast, yes there was static after the theme and yes, there was dramatic music after that):🏆

[00:00:00.000 --> 00:00:07.000]   [Christmas music]
[00:00:07.000 --> 00:00:08.000]   [TV static]
[00:00:08.000 --> 00:00:16.000]   [dramatic music]

Oh, did you notice? It now gets inline conversational quoting correctly...and I am in NO WAY sure how it does this but examine the following:

[00:03:14.040 --> 00:03:18.720]   On that note, my kids have been watching it on Disney Plus now, and they've really taken to it,
[00:03:18.720 --> 00:03:22.440]   and watching it with them is the thing that finally made me realize,
[00:03:22.440 --> 00:03:25.440]   "Okay, if y'all like this, you're totally ready for The Simpsons."
[00:03:25.440 --> 00:03:26.520]   [Laughter]
[00:03:26.520 --> 00:03:27.880]   Yeah, basically.
[00:03:27.880 --> 00:03:33.760]   Do you remember how they even made fun of the dinosaurs on The Simpsons?
[00:03:33.760 --> 00:03:36.200]   Oh, I'm drawing a blank.
[00:03:36.200 --> 00:03:40.120]   I can't remember which episode it was, and I'm sure some of our friends listening to the episode
[00:03:40.120 --> 00:03:42.840]   are probably screaming like, "Oh my god, it's the Simpsons!"
[00:03:42.840 --> 00:03:46.440]   There was some, like the first act of some episode.

I'm a programmer and have NO IDEA how whisper.cpp gets this correct, but it does! (PS: before 1.1.0 it wasn't able to separate out when people were quoting something.)

This is so cool! Thanks @ggerganov!

0 replies

janngobble · 2023-01-22T23:54:10Z

I just want to say: I've processed over 100 podcasts so far using the 1.1.0 beta and NOT ONCE have I had the "repeating line" issue. (btw: it's using ggml-medium.en.bin)

Thanks so much - as that was the reason I've had to revert to using the whisper python implementation!

Thanks again, @ggerganov!

0 replies

geimist · 2023-01-24T10:08:48Z

I have been using version 1.1.1 since yesterday (previously 1.1.0 Beta) and unfortunately there are many repetitions again. I have now read that with this version a temperature fallback is set to -1.
Does this mean that I have to pass a certain parameter myself to avoid the problem with the repetitions? If so, which one?

6 replies

geimist Jan 24, 2023

78f1661 makes the difference. When I build without that change, there are no more annoying repeats and it works comparably to the v1.1.0 beta.

The quality was really significantly worse with v1.1.1 - comparable to the state before the implementation of "improved decoding".

ggerganov Jan 25, 2023
Maintainer Author

Ok, in this case I believe this is not actually a regression.
In some cases, the previous transcription can contribute to repetitions in the future (due to the auto-regressive nature of the model). Due to the bug before 78f1661, the previously transcribed text was ignored, which incidentally helped in your case. Now that it is fixed, it triggers the repetition.

Using the latest master you can now set --max-context 0 to main and it will behave as in v1.1.0. I.e. the previously transcribed text will not be passed as input for the future chunks. It's the same as transcribing every audio segment independently from what has happened in the past. This probably helps in your specific case, but in general should give worse results because the decoder loses context.
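
As a rough picture of what the context limit does, the sketch below selects which previously transcribed tokens would be passed along with the next chunk. The function name, the `model_limit` default of 224, and the treatment of negative values as "use the model limit" are assumptions for illustration, not the whisper.cpp code.

```python
from typing import List


def build_prompt(previous_tokens: List[int], max_context: int,
                 model_limit: int = 224) -> List[int]:
    """Pick the tail of the previous transcription to feed the next chunk.

    max_context < 0  -> fall back to model_limit (assumed default)
    max_context == 0 -> pass no previous text at all
    """
    limit = model_limit if max_context < 0 else min(max_context, model_limit)
    if limit == 0:
        return []  # every chunk is decoded independently of the past
    return previous_tokens[-limit:]
```

With `max_context` set to 0 the prompt is always empty, which corresponds to transcribing every audio segment independently as described above.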

But in any case, the repetition detection currently used in whisper.cpp is based on a simple entropy metric, which is probably not as good as the compression-based one used in the Python implementation. So it fails to detect the repetition and thus fails to perform the fallback "best-of" strategy.
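
The two kinds of metric can be compared with a small sketch. The entropy function below works on token counts, and the compression ratio uses zlib on the text bytes. The 2.4 thresholds mentioned in the note are assumptions, and neither function is copied from either implementation.

```python
import math
import zlib
from collections import Counter
from typing import List


def token_entropy(tokens: List[int]) -> float:
    """Entropy (in nats) of token frequencies; low values suggest repetition."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def compression_ratio(text: str) -> float:
    """Raw size over zlib-compressed size; high values suggest repetition."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


# A 12-token phrase repeated three times: every token is equally frequent,
# so the entropy stays at log(12), about 2.48
looping_tokens = list(range(12)) * 3
looping_text = "I had a restaurant. " * 20
```

One plausible way the entropy check can miss a loop: with an assumed threshold of 2.4, the repeated 12-token phrase above still scores about 2.48 and passes, whereas repeated text compresses very well and would exceed a compression-ratio threshold of 2.4.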

Anyway, a lot of guess work here - let's see if more reports come in.

geimist Jan 25, 2023

First of all, thank you.
Does --max-context 0 refer to a boolean value or is it worth trying with different values?
The repetitions just occurred in many talk recordings (not studio quality but clearly understandable), but unfortunately I cannot share them as an example. Maybe I can find another example which I can publish.

ggerganov Feb 4, 2023
Maintainer Author

It's an integer. 0 means to disable it altogether.
Strange, I expected it would resolve your issue.

o0101 Feb 14, 2023

Definitely seeing a lot of these "hallucination" repeats of a previous statement, seemingly induced by silence. It's enough of a problem that I'm considering ways to preprocess the input.
