Add cross-language translation data #23
Comments
Included in this issue is:
|
We should only use models that can translate both ways. The following are the language models from 🤗 Hugging Face that we can use to generate the translations (checked when implemented):
|
hey..👋🏻 |
Hey @abhijeet78880 👋 Yes I definitely expected that more information would be needed. This is a tough one, but as I said just needs some persistence 😊 Will write more later today! :) |
Ok, sooooo :) There's plenty to do here, with the first part of it being some research you/we could do. This comment is an ongoing list of all machine translation models available from 🤗 Hugging Face that we're using (I just edited it to add links to the models). As of now we only translate from English to the language that a user is typing in, but we want to expand this so that the user can translate from any of Scribe's supported languages to any other. This will be an option in the new menu we'll build in this iOS issue, with the designs for that being found here.
Once we have a translation model we make a file like this one that translates from English to German. In this file we load a JSON of words that we query in the source language (for now only English) from Wikidata, then we make a list of words from the JSON, set the
What would be great is if you/we could look and find the missing models that we need for the new translations. A lot of these can just be
At this point I think it's best if I check in with you and see how the above sounds. If you want to contribute in a simple way at first, going through and finding links for the models we'd need would be best. I could then do a more in-depth explanation of the translation file from before and show you how to set up new ones for each of the models we find. We'd then run them, and bam we'd have translation data that we'd then reference and thus give users the option to translate from Spanish to German and all the other options 😊
This has been a lot! Again, just let me know how it sounds and feel free to ask questions. Everything we write here will help make things easier going forward :) :) |
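For anyone following along, here is a minimal sketch of what a per-pair translation file along these lines might look like. It is not the project's actual script: the JSON layout (a list of objects with a `word` key), the flat source-to-target output mapping, and the helper names `load_words`, `build_translations` and `write_translations` are all assumptions made for illustration. `translate_fn` stands in for whichever model ends up being used.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def load_words(path: Path, key: str = "word") -> List[str]:
    """Read the queried JSON and return unique words in first-seen order.

    Assumes (hypothetically) a list of objects that each carry a `word` key.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    seen: Dict[str, None] = {}
    for record in records:
        word = record.get(key)
        if word:  # skip records without a usable word
            seen.setdefault(word, None)
    return list(seen)


def build_translations(
    words: Iterable[str], translate_fn: Callable[[str], str]
) -> Dict[str, str]:
    """Map each source word to whatever the supplied translator returns."""
    return {word: translate_fn(word) for word in words}


def write_translations(translations: Dict[str, str], path: Path) -> None:
    """Write the mapping as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(translations, f, ensure_ascii=False, indent=2)
```

Keeping the translator as a plain callable means a Helsinki-NLP model could be plugged in through the 🤗 Transformers translation pipeline (for example `pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")`, whose results carry a `translation_text` field) without touching the file handling.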
@abhijeet78880, FYI I also made #35 just now which might be a nice first issue for you 😊 That's checking if the word |
Note that I've updated the models comment with further models we could use from Helsinki-NLP. There do appear to be some holes in their translation model coverage, so for some pairs we'll need to look harder for other models. At this point we'd be ready to start copying over some translation files 😊 For this the script from any Helsinki-NLP translation file can be used and the model name just needs to be changed :) |
The following two models could help us plug some of the holes in the above translation coverage: |
Neither of the above models was what we were looking for, and after playing around with T5 a bit more and getting some very subpar German-Portuguese translations I was able to get some strong results from a dummy JSON dataset using m2m100_418M. I'd say that the small model is enough for our purposes as the single-word or short-phrase translation that we're doing isn't going to be improved by a larger model (or only marginally, as larger models would be taking advantage of contextual information that our short input strings would lack). m2m100_418M should be able to handle all the missing language pairs for Scribe keyboard languages. A general thought might be to create a data pipeline that would use it as the sole model and then just switch the input and output languages as well as the input data during the run. Another thing to factor in is that for now the outputs I was getting were capitalized, as I'm assuming the model is expecting and returning a sentence. This can be remedied by the metadata that comes from Wikidata though, as we'll be querying a base translation corpus that includes word type and would thus know if a word is a proper noun that needs to be capitalized (or all nouns in German), or just lowercase it. Will continue to fine-tune the current example and then present the results at the next Scribe Weekly 😊 |
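A rough sketch of the two ideas in this comment, under stated assumptions: one M2M100 model reused for every pair by switching the source and target language codes, and a casing fix driven by word type. The Hub ID `facebook/m2m100_418M` and the tokenizer/model calls follow the public model card. The helper names and the word-type labels (`"noun"`, `"proper_noun"`) are hypothetical, since the corpus schema isn't shown in this thread.

```python
from typing import Callable


def fix_casing(translation: str, word_type: str, target_lang: str) -> str:
    """Undo sentence-style capitalization using the word type from the corpus.

    The word-type labels are assumed for illustration only.
    """
    if not translation:
        return translation
    # Proper nouns stay capitalized everywhere; so do all nouns in German.
    keep_capital = word_type == "proper_noun" or (
        target_lang == "de" and word_type == "noun"
    )
    first = translation[0].upper() if keep_capital else translation[0].lower()
    return first + translation[1:]


def load_m2m100(
    model_name: str = "facebook/m2m100_418M",
) -> Callable[[str, str, str], str]:
    """Return translate(text, src, tgt) backed by a single M2M100 model."""
    # Imported lazily so fix_casing works without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    def translate(text: str, src: str, tgt: str) -> str:
        # Switching languages is just a matter of changing these two codes.
        tokenizer.src_lang = src
        encoded = tokenizer(text, return_tensors="pt")
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt)
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    return translate
```

Usage would look something like `translate = load_m2m100()` followed by `fix_casing(translate("Haus", "de", "pt"), "noun", "pt")`. Only the first character is lowercased, so a multi-word phrase keeps any capitals further in.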
Hi! Sorry for the delay -- I had to figure out a bunch of stuff on my end. Where should I start? |
No stress on a delay, @nyfz18! Sorry for mine as well :) Let me organize some stuff and I'll send along some pseudocode for how this would be written as I said I would 😊 Generally the steps would be:
Do you have any questions on the above, @nyfz18? Btw I messaged on Matrix to see if a checkin call would help for this 🙃 |
Okay, sounds good. I sort of understand, but a check in call might be more helpful! |
Hey there @nyfz18! 👋 You now have extract_transform/translate.py at your disposal that loads in the model, checks arguments if you'd like to pass them, and prints out the ISO codes at the end. Following the working code there's also some pseudocode that outlines the steps we discussed in the call 😊 Let me know if you have any questions/comments! |
@nyfz18, we'll be doing the conversion of the JSON data that's being produced here in the new issue #46. @lillian-mo will work on that one 🚀 |
Closing this issue as individual ones have been made for each language that can be worked on as a part of Google Summer of Code ☀️ Thanks all for the discussion here! Help on the individual issues would be welcome 😊 |
Terms
Languages
All languages
Description
This issue is the interim step to adding full translation support to Scribe apps. What it entails is finding accurate enough models on 🤗 Hugging Face to translate between all currently supported languages and English. The format_translations.py script for each language will then need to be edited to run each model over a basic corpus to generate seven different translations.json files per language.
From there, an option for which base language to translate from can be added to the Scribe-iOS menu, which will be developed in scribe-org/Scribe-iOS#16.
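To make the "seven files per language" count concrete, here is a small sketch that enumerates the translation jobs. The list of ISO codes is an assumption for illustration (eight languages, so each has seven partners) and is not taken from this issue; `translation_jobs` is a hypothetical helper.

```python
from itertools import permutations
from typing import Dict, List

# Assumed, illustrative set of eight language codes; not confirmed by this issue.
LANGUAGES = ["en", "fr", "de", "it", "pt", "ru", "es", "sv"]


def translation_jobs(languages: List[str]) -> Dict[str, List[str]]:
    """Map each language to the other languages it needs a translations file for."""
    jobs: Dict[str, List[str]] = {lang: [] for lang in languages}
    for src, tgt in permutations(languages, 2):
        jobs[src].append(tgt)
    return jobs
```

With eight languages this gives seven partners per language, so 56 ordered pairs overall.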