Is it possible to get quality & alignment scores from NativeMessage? #135

Open
Godnoken opened this issue Jun 21, 2023 · 6 comments

Comments

@Godnoken

According to the translate request, we can ask for quality scores and alignment tokens to be returned. However, the response struct only holds a string.

Is it simply not included in the example, or is it not implemented at all?

If it is the former - what does the complete struct look like?

Thank you!

@jelmervdl
Collaborator

Hi! The quality scores aren't implemented in translateLocally.

The underlying module, bergamot-translator, implements them to a limited extent. A quality model has to be loaded during model loading, and scores are returned inline in the output HTML (and thus the feature only works if you translate HTML). Because the quality scores weren't that useful or true to reality, and because you have to translate HTML to get access to them, the feature wasn't added to translateLocally, and the code to load the quality model is missing.

@Godnoken
Author

Okay, thank you! Quality scores would have been neat, but I'm mostly interested in alignment tokens. However, I assume that they cannot be returned through Native Message either. I am guessing that alignments could be implemented, though, since they are currently in use in translateLocally, albeit not through Native Messaging.

The reason that I am asking is that I want to create a word alignment highlighter for my application as a tool for language learning. The one that you have created for translateLocally seems to be working really well.

@jelmervdl
Collaborator

Yep, those aren't returned either at the moment.

The alignment scores are per sentencepiece token, or per slice of N bytes (where N depends on the vocabulary). There's no guarantee that a slice is valid Unicode on its own, which makes it tricky to expose alignment info in an easily accessible form in the native message response: we can't give you an array of the slices as strings which you could just concatenate :(
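
To illustrate why per-slice strings don't work, here is a minimal Rust sketch. The byte ranges are hypothetical token boundaries chosen for the example, not real bergamot-translator output, and `decode_slices` and `decode_joined` are illustrative names rather than part of any existing API. Both functions panic if a range lies outside the buffer.

```rust
use std::ops::Range;

/// Decode each byte range on its own.
/// `None` means that slice is not valid UTF-8 by itself.
/// The ranges stand in for hypothetical token boundaries.
fn decode_slices<'a>(bytes: &'a [u8], ranges: &[Range<usize>]) -> Vec<Option<&'a str>> {
    ranges
        .iter()
        .map(|r| std::str::from_utf8(&bytes[r.clone()]).ok())
        .collect()
}

/// Join the raw bytes of all ranges first, then decode once.
/// This succeeds whenever the joined bytes form valid UTF-8,
/// even if the individual slices do not.
fn decode_joined(bytes: &[u8], ranges: &[Range<usize>]) -> Option<String> {
    let mut joined = Vec::new();
    for r in ranges {
        joined.extend_from_slice(&bytes[r.clone()]);
    }
    String::from_utf8(joined).ok()
}
```

With the text `"é!"` (three bytes) split at `0..1` and `1..3`, neither slice decodes on its own, because the split falls inside the two-byte `é`, while joining the bytes and decoding once gives back the original string. A consumer would therefore need the raw bytes or byte offsets rather than an array of strings.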

On top of that, alignment is an M×N matrix per sentence (M source tokens by N target tokens), with a score for how well each source token aligns with each target token, so it is quite big. Then getting from tokens to UTF-8 character offsets also requires a bit of plumbing. It is slow enough that I actually run that code in a separate thread so it doesn't slow down moving your caret through the input box in translateLocally.
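
As a rough sketch of what consuming such a matrix could look like in Rust: the layout below (`scores[m][n]` for source token `m` and target token `n`) is an assumption made for illustration, not the format translateLocally or bergamot-translator actually uses, and both function names are hypothetical.

```rust
/// Assumed layout: `scores[m][n]` is how well source token `m`
/// aligns with target token `n`.
/// Returns the index of the best-aligned source token for `target`,
/// or `None` if no row has a column for that target token.
fn best_source_for_target(scores: &[Vec<f32>], target: usize) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .filter_map(|(m, row)| row.get(target).map(|&score| (m, score)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(m, _)| m)
}

/// Number of scores stored for one sentence with `m` source tokens
/// and `n` target tokens.
fn score_count(m: usize, n: usize) -> usize {
    m * n
}
```

A highlighter would call something like `best_source_for_target` for the token under the caret. The size grows with the product of the token counts: a sentence with 30 source tokens and 35 target tokens already carries 1050 scores.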

The use cases we've had for native messaging so far have only needed alignment info for markup translation, and that's "solved" by doing the markup translation inside bergamot-translator itself 🤷

So I'm not entirely sure how much work translateLocally can do here for you, or how much we should just dump that info in the native message response and let you re-implement the byte/unicode offset conversion bits.

Do you have any ideas on how you'd want that data?

@Godnoken
Author

> So I'm not entirely sure how much work translateLocally can do here for you, or how much we should just dump that info in the native message response and let you re-implement the byte/unicode offset conversion bits.
>
> Do you have any ideas on how you'd want that data?

Honestly, no, I don't. I am way too inexperienced with sentencepiece tokens, bytes and so on. However, I'm confident I could re-implement the conversions in Rust if I get some time to play with it, provided I can receive the same data through Native Message that you use.

Thank you very much for your elaborate answers, I'm learning a lot just from that.


Also please excuse the late reply, had Swedish midsummer + a busy weekend.

@jelmervdl
Collaborator

How big are your translation requests generally? Are they just one or a couple of sentences, or more?

I'm contemplating whether it is easier for you if we just pass on the byte offset info we have directly to you and let you deal with the byte-offset -> unicode character offset conversion in Rust, or if we do that conversion using the code we already use in translateLocally and give you unicode character offsets instead.

The first would mean more work on your end, but you can do it on demand (like we do) so you don't have to compute highlight info for bits that the user never focusses on. The latter is just exposing what we already implemented. But it assumes that both Rust and Qt have a similar concept of what a unicode character is 😅
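
For a sense of what the first option involves, here is a minimal Rust sketch of the byte-offset conversion. It is not the code translateLocally uses, and the function names are hypothetical. It also shows where the two "concepts of a character" can differ: Rust's `char` is a Unicode scalar value, while Qt's `QString` is indexed in UTF-16 code units. Whether translateLocally's offsets would be expressed in code units is an assumption here.

```rust
/// Round a byte offset down to the nearest UTF-8 character boundary,
/// clamping it to the length of the text first.
fn floor_to_boundary(text: &str, byte_offset: usize) -> usize {
    let mut b = byte_offset.min(text.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(b) {
        b -= 1;
    }
    b
}

/// Offset counted in Unicode scalar values (Rust `char`s).
fn byte_to_char_offset(text: &str, byte_offset: usize) -> usize {
    text[..floor_to_boundary(text, byte_offset)].chars().count()
}

/// Offset counted in UTF-16 code units.
fn byte_to_utf16_offset(text: &str, byte_offset: usize) -> usize {
    text[..floor_to_boundary(text, byte_offset)]
        .encode_utf16()
        .count()
}
```

For the text `"naïve 😀 ok"`, byte offset 11 (just after the emoji) maps to character offset 7 but UTF-16 offset 8, because the emoji takes two UTF-16 code units. Both functions walk the string from the start, so computing them only for the span the user is focused on keeps the cost small.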

@Godnoken
Author

> How big are your translation requests generally? Are they just one or a couple of sentences, or more?

It would depend a lot on the use case and the game (it's a translation/language learning tool for games!).
I would say that it would generally be anywhere from 1 to 150 words. No idea how big the token data would be for something like 150 words or how much slower the translation requests would get, but at least in my app it'd be an on/off feature at the user's disposal.

> I'm contemplating whether it is easier for you if we just pass on the byte offset info we have directly to you, and let you deal with the byte-offset -> unicode character offset conversion in Rust, or to do that using the code we use in TranslateLocally already and give you unicode character offsets instead.
>
> The first would mean more work on your end, but you can do it on demand (like we do) so you don't have to compute highlight info for bits that the user never focusses on. The latter is just exposing what we already implemented. But it assumes that both Rust and QT have a similar concept of what a unicode character is 😅

I'd love to do less work but I would very likely prefer to implement it myself at the end of the day no matter how much work that entails. Speed is of the essence! 😄
And if Qt may handle this differently from other languages, then I'm sure you'd prefer the former option too.
