Is it possible to get quality & alignment scores from NativeMessage? #135

Godnoken · 2023-06-21T19:00:36Z

According to the translate request, we can ask for quality scores and alignment tokens to be returned. However, the response struct only holds a string.

Are they simply not included in the example, or are they not implemented at all?

If it is the former - what does the complete struct look like?

Thank you!

jelmervdl · 2023-06-22T12:43:10Z

Hi! The quality scores aren't implemented in translateLocally.

The underlying module, bergamot-translator, implements them to a limited extent. A quality model has to be loaded during model loading, and scores are returned inline in the output HTML (and thus the feature only works if you translate HTML). Because the quality scores weren't that useful or true to reality, and because you have to translate HTML to get access to them, the feature wasn't added to translateLocally, and the code to load the quality model is missing.

Godnoken · 2023-06-22T13:18:44Z

Okay, thank you! Quality scores would have been neat, but I'm mostly interested in alignment tokens. I assume that they cannot be returned through native messaging either, then. I am guessing that alignments could be implemented, though, since they are already used in translateLocally, albeit not through native messaging.

The reason that I am asking is that I want to create a word alignment highlighter for my application as a tool for language learning. The one that you have created for translateLocally seems to be working really well.

jelmervdl · 2023-06-22T13:46:37Z

Yep, those aren't returned either at the moment.

The alignment scores are per sentencepiece token, or per slice of N bytes (where N depends on the vocabulary). There's no guarantee that the slice itself is valid unicode on its own, which makes it tricky to expose alignment info in an easily accessible way in the native message response: we can't give you an array of the slices as strings which you could just concatenate :(
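To illustrate why byte slices can't simply be handed over as strings, here is a minimal Rust sketch. It assumes tokens are described as `(start, end)` byte ranges into the UTF-8 text; the function names are hypothetical and not part of translateLocally or bergamot-translator.

```rust
/// Returns true if bytes `start..end` of `text` are valid UTF-8 on their own.
/// Out-of-range or reversed ranges count as invalid.
fn slice_is_valid_utf8(text: &str, start: usize, end: usize) -> bool {
    text.as_bytes()
        .get(start..end)
        .map_or(false, |bytes| std::str::from_utf8(bytes).is_ok())
}

/// Concatenates the raw bytes of several slices and decodes them once at the
/// end. Individual slices may be invalid UTF-8 even when their
/// concatenation is valid, so decoding has to happen after joining.
fn join_slices(text: &str, ranges: &[(usize, usize)]) -> Option<String> {
    let mut bytes = Vec::new();
    for &(start, end) in ranges {
        bytes.extend_from_slice(text.as_bytes().get(start..end)?);
    }
    String::from_utf8(bytes).ok()
}
```

With the text `"né"` (3 bytes, because `é` takes two), the range `0..2` cuts `é` in half and is not valid on its own, while joining `0..2` and `2..3` as bytes and decoding afterwards gives back the original string.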

On top of that, alignment is an M × N matrix per sentence, with a score for how well each of the M source tokens aligns with each of the N target tokens, so it is quite big. Then getting from tokens to UTF-8 character offsets also requires a bit of plumbing. It is slow enough that I actually run that code in a separate thread so it doesn't slow down moving your caret through the input box in translateLocally.
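As a rough sketch of how a highlighter might consume such a matrix, the Rust below picks the source token(s) that align with a given target token. The layout `scores[target][source]` and both function names are assumptions made for illustration; the actual layout of the data would have to be checked against whatever the native message ends up returning.

```rust
/// Finds the source token with the highest alignment score for one target
/// token. Assumed layout: `scores[target][source]`, one matrix per sentence.
/// Returns the source index and its score, or None if the row is missing
/// or empty. NaN scores are ignored.
fn best_source_for_target(scores: &[Vec<f32>], target: usize) -> Option<(usize, f32)> {
    let row = scores.get(target)?;
    row.iter()
        .copied()
        .enumerate()
        .filter(|(_, s)| !s.is_nan())
        .max_by(|a, b| a.1.total_cmp(&b.1))
}

/// Lists all source tokens whose score for `target` is at least `threshold`,
/// which is useful when one target token maps to several source tokens.
fn sources_above(scores: &[Vec<f32>], target: usize, threshold: f32) -> Vec<usize> {
    scores.get(target).map_or_else(Vec::new, |row| {
        row.iter()
            .enumerate()
            .filter(|&(_, &s)| s >= threshold)
            .map(|(i, _)| i)
            .collect()
    })
}
```

Looking up a single row on demand like this avoids processing the whole matrix for text the user never hovers over.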

The use cases we've had for native messaging so far have only needed alignment info for markup translation, and that's "solved" by doing the markup translation inside bergamot-translator itself 🤷

So I'm not entirely sure how much work translateLocally can do here for you, or how much we should just dump that info in the native message response and let you re-implement the byte/unicode offset conversion bits.

Do you have any ideas on how you'd want that data?

Godnoken · 2023-06-27T09:14:41Z

> So I'm not entirely sure how much work translateLocally can do here for you, or how much we should just dump that info in the native message response and let you re-implement the byte/unicode offset conversion bits.
>
> Do you have any ideas on how you'd want that data?

Honestly, no, I don't. I am way too inexperienced with sentencepiece tokens, bytes and so on. However, I'm confident I could re-implement the conversions in Rust if I get some time to play with it, provided I can receive the same data that you use through native messaging?

Thank you very much for your elaborate answers, I'm learning a lot just from that.

Also please excuse the late reply, had Swedish midsummer + a busy weekend.

jelmervdl · 2023-06-27T10:36:39Z

How big are your translation requests generally? Are they just one or a couple of sentences, or more?

I'm contemplating whether it is easier for you if we just pass on the byte offset info we have directly to you, and let you deal with the byte-offset -> unicode character offset conversion in Rust, or to do that using the code we use in translateLocally already and give you unicode character offsets instead.

The first would mean more work on your end, but you can do it on demand (like we do) so you don't have to compute highlight info for bits that the user never focusses on. The latter is just exposing what we already implemented. But it assumes that both Rust and Qt have a similar concept of what a unicode character is 😅
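For the first option, the byte-offset -> character-offset conversion could look roughly like the Rust sketch below. The function names are hypothetical. It also shows where the two notions of "character" diverge: a Rust `char` is a Unicode scalar value, whereas Qt's `QString` is indexed in UTF-16 code units, so characters outside the Basic Multilingual Plane (such as most emoji) count as one in Rust and two in Qt.

```rust
/// Converts a UTF-8 byte offset into an offset counted in Rust `char`s
/// (Unicode scalar values). Returns None if the offset is past the end of
/// `text` or falls in the middle of a multi-byte character.
fn byte_to_char_offset(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(text[..byte_offset].chars().count())
}

/// Same conversion, but counted in UTF-16 code units, which is how
/// QString positions are expressed.
fn byte_to_utf16_offset(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(text[..byte_offset].encode_utf16().count())
}
```

For the text `"hé 😅!"`, the `!` sits at byte offset 8, which is `char` offset 4 but UTF-16 offset 5. Both functions scan from the start of the string, so for long texts it may be worth caching offsets per sentence.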

Godnoken · 2023-06-27T12:11:04Z

> How big are your translation requests generally? Are they just one or a couple of sentences, or more?

It would depend a lot on the use case and the game (it's a translation/language learning tool for games!).
I would say that it would generally be anywhere from 1 to 150 words. No idea how big the token data would be for something like 150 words or how much slower the translation requests would get, but at least in my app it'd be an on/off feature at the user's disposal.

> I'm contemplating whether it is easier for you if we just pass on the byte offset info we have directly to you, and let you deal with the byte-offset -> unicode character offset conversion in Rust, or to do that using the code we use in translateLocally already and give you unicode character offsets instead.
>
> The first would mean more work on your end, but you can do it on demand (like we do) so you don't have to compute highlight info for bits that the user never focusses on. The latter is just exposing what we already implemented. But it assumes that both Rust and Qt have a similar concept of what a unicode character is 😅

I'd love to do less work, but I would very likely prefer to implement it myself at the end of the day, no matter how much work that entails. Speed is of the essence! 😄
And if Qt may handle unicode characters differently from other languages, then I'm sure you'd prefer the former option too.
