Is it possible to get quality & alignment scores from NativeMessage? #135
Comments
Hi! The quality scores aren't implemented in translateLocally. The underlying module, bergamot-translator, implements them to a limited extent. A quality model has to be loaded during model loading, and scores are returned inline in the output HTML (and thus the feature only works if you translate HTML). Because the quality scores weren't that useful or true to reality, and because you have to translate HTML to get access to them, the feature wasn't added to translateLocally, and the code to load the quality model is missing.
Okay, thank you! Quality scores would have been neat, but I'm mostly interested in alignment tokens. However, I assume that they cannot be returned through Native Message either, then. But I am guessing that alignments could be implemented, since they are currently in use in translateLocally, albeit not through native messaging. The reason I am asking is that I want to create a word alignment highlighter for my application as a tool for language learning. The one that you have created for translateLocally seems to be working really well.
Yep, those aren't returned either at the moment. The alignment scores are per sentencepiece token, or per slice of N bytes (where N depends on the vocabulary). There's no guarantee that the slice itself is valid unicode on its own, which makes it tricky to make alignment info easily accessible in the native message response: we can't give you an array of the slices as strings which you could just concatenate :( On top of that, alignment is an M to N matrix per sentence, with a score for how well source token M aligns with target token N, so it is quite big. Then getting from tokens to utf8 character offsets also requires a bit of plumbing. It is slow enough that I actually run that code in a separate thread so it doesn't slow down moving your caret through the input box in translateLocally. The use cases we've had for native messaging so far have only needed alignment info for markup translation, and that's "solved" by doing the markup translation inside bergamot-translator itself 🤷 So I'm not entirely sure how much work translateLocally can do here for you, or how much we should just dump that info in the native message response and let you re-implement the byte/unicode offset conversion bits. Do you have any ideas on how you'd want that data?
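To make the two difficulties above concrete, here is a minimal Rust sketch. It is not translateLocally or bergamot-translator code: the function names, the use of byte ranges to describe tokens, and the matrix layout (one row per target token, one column per source token) are all assumptions made for illustration. The first function shows why a token's byte slice cannot always be handed out as a string; the second picks, for each target token, the source token with the highest alignment score.

```rust
use std::ops::Range;

/// Returns the token's text if its byte range is valid UTF-8 on its own.
/// Returns None when the range is out of bounds or cuts through a
/// multi-byte character (hypothetical helper, for illustration only).
fn token_text(source: &str, range: Range<usize>) -> Option<&str> {
    source
        .as_bytes()
        .get(range)
        .and_then(|bytes| std::str::from_utf8(bytes).ok())
}

/// For each target token (row), returns the index of the source token
/// (column) with the highest score. The row/column layout is an assumption.
fn best_source_per_target(alignment: &[Vec<f32>]) -> Vec<Option<usize>> {
    alignment
        .iter()
        .map(|row| {
            row.iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(index, _)| index)
        })
        .collect()
}
```

For example, in the string "né" the character "é" occupies two bytes, so `token_text("né", 0..2)` yields `None` because the range ends in the middle of that character, while `token_text("né", 1..3)` yields `Some("é")`.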
Honestly, no, I don't. I am way too inexperienced with sentencepiece tokens, bytes and so on. However, I'm confident I could re-implement the conversions in Rust if I get some time to play with it, provided that I could receive the same data through Native Message that you use. Would that be possible? Thank you very much for your elaborate answers, I'm learning a lot just from that. Also, please excuse the late reply; I had Swedish midsummer and a busy weekend.
How big are your translation requests generally? Are they just one or a couple of sentences, or more? I'm contemplating whether it is easier for you if we just pass on the byte offset info we have directly to you, and let you deal with the byte-offset -> unicode character offset conversion in Rust, or to do that using the code we use in translateLocally already and give you unicode character offsets instead. The first would mean more work on your end, but you can do it on demand (like we do) so you don't have to compute highlight info for bits that the user never focusses on. The latter is just exposing what we already implemented. But it assumes that both Rust and Qt have a similar concept of what a unicode character is 😅
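As a rough illustration of the first option, the byte-offset to character-offset conversion could look like the Rust sketch below. The function names are hypothetical and nothing here reflects an actual native message format. It also shows why the "similar concept" caveat matters: a Rust `char` is a Unicode scalar value, whereas Qt's `QString` is indexed in UTF-16 code units, so the two offsets differ for characters outside the Basic Multilingual Plane. Which unit translateLocally would expose is an assumption.

```rust
/// Converts a byte offset into `text` to an offset counted in Rust `char`s
/// (Unicode scalar values). Returns None if the offset is out of range or
/// falls inside a multi-byte character.
fn byte_to_char_offset(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(text[..byte_offset].chars().count())
}

/// Converts the same byte offset to an offset counted in UTF-16 code units,
/// the unit a QString is indexed by.
fn byte_to_utf16_offset(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(text[..byte_offset].encode_utf16().count())
}
```

In the string "é😀a" the letter "a" starts at byte offset 6. `byte_to_char_offset` reports 2 for that position, while `byte_to_utf16_offset` reports 3, because the emoji takes two UTF-16 code units. Both functions only scan the prefix up to the requested offset, so they can be called on demand for the token the user is currently focused on.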
It would be very dependent on the use case and the game (it's a translation/language learning tool for games!).
I'd love to do less work, but I would very likely prefer to implement it myself at the end of the day, no matter how much work that entails. Speed is of the essence! 😄
According to the translate request, we can ask for quality scores and alignment tokens to be returned. However, the response struct only holds a string.
Is it simply not included in the example or is it not at all implemented?
If it is the former - what does the complete struct look like?
Thank you!