Train a language identifier model that works well on ingredient lists #349
Comments
What I'm going to try (following the discussion on Slack):
@raphael0202 If that's OK with you, could you please assign the issue to me?
Yes, that's a good plan to start 👍
Here is the number of texts for each language:
Hey, is it a requirement that the model needs to run locally (not from a server)?
I tried to use clustering to fix mislabeled data. I took the languages with at least 100 texts (37 languages), then took 100 texts for each language and used them as a training dataset (the plan was then to get predictions for the entire dataset). The texts were converted to embeddings using fasttext (the get_sentence_vector method), and the dimension was reduced from 256 to 66 with PCA to preserve 95% of the variance. Either clustering is not suitable for this task, or I am doing something wrong. Next I will try another text classification model, lingua (https://github.com/pemistahl/lingua-py), to compare the predictions and confidence of the two models. Then I'll take the data on which the models' predictions coincide and both are confident, and fine-tune one of them on that data.
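For reference, a minimal sketch of the embedding + PCA step described above (the fasttext model path, the sample texts, and the exact PCA settings are assumptions, not the original script):

```python
# Hedged sketch: embed ingredient texts with fasttext and reduce dimensionality
# with PCA so that 95% of the variance is preserved, as described above.
import fasttext
import numpy as np
from sklearn.decomposition import PCA

model = fasttext.load_model("lid.176.bin")  # model path is an assumption

texts = [
    "farine de blé, sucre, beurre, oeufs",
    "wheat flour, sugar, butter, eggs",
    "harina de trigo, azúcar, mantequilla, huevos",
]  # in practice: ~100 texts per language for the 37 languages

# fasttext sentence vectors; newlines must be stripped first
embeddings = np.array(
    [model.get_sentence_vector(t.replace("\n", " ")) for t in texts]
)

# Keep enough principal components to preserve 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(embeddings)
print(embeddings.shape, "->", reduced.shape)
```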
Here's a really nice article summarizing different approaches to language detection, from statistical to deep learning. It would be great to have a validation dataset to estimate the performance of any solution.
How I got the distribution of text languages:
(but after that there are still some extra fields left, e.g. …)
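The original snippet is not preserved in this thread; below is a rough sketch of how such a per-language count can be computed from the JSONL dump (the dump filename is the standard export name, and the key-matching logic is an assumption that matches the "extra fields" remark above):

```python
# Hedged reconstruction: count how many products have an ingredients_text_<lang>
# field in the OFF JSONL dump. Non-language suffixes (e.g. "with_allergens_fr")
# also match the prefix and still have to be filtered out afterwards.
import gzip
import json
from collections import Counter

counts = Counter()
with gzip.open("openfoodfacts-products.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        product = json.loads(line)
        for key, value in product.items():
            if key.startswith("ingredients_text_") and value:
                counts[key[len("ingredients_text_"):]] += 1

for suffix, n in counts.most_common(30):
    print(suffix, n)
```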
How many samples should it contain? Should I select an equal number of samples for each language, or just sample at random?
Roughly 30 labels per language to start with, I would say.
Would it be possible to share a link to the original dataset? I am curious to have a look at it as well.
I used the MongoDB dump; I described above how I retrieved the data from it. However, there might be an error in my script, because some languages have fewer texts than expected (e.g. I got 912 samples of Japanese texts, but on https://jp-en.openfoodfacts.org/ there are around 16,000). Please keep me posted if you're planning to work on this task, as I'm actively working on it. You can find me on the OFF Slack (Yulia Zhilyaeva).
If this can help, there's now a Parquet dump on Hugging Face, which is the JSONL processed and cleaned of irrelevant features: https://huggingface.co/datasets/openfoodfacts/product-database
I tried to retrieve the data from the Hugging Face dataset, but I still get ~900 samples of Japanese texts, and ~996,000 texts in total. Am I doing something wrong? Or is it because at the moment the HF dataset stores text only in the original language? My code here.
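The linked code is not preserved in this thread; below is a minimal sketch of such a count against the Hugging Face Parquet dump (the streaming usage and the "lang" column name are assumptions about the dataset layout and may differ from the actual schema):

```python
# Hedged sketch: stream the Hugging Face dump and count texts per language.
from collections import Counter
from datasets import load_dataset

splits = load_dataset("openfoodfacts/product-database", streaming=True)

counts = Counter()
for split in splits.values():
    for row in split:
        lang = row.get("lang")  # assumed column holding the product's language
        if lang:
            counts[lang] += 1

print(counts.most_common(20))
print("ja:", counts.get("ja", 0))
```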
The Parquet contains the same information as the JSONL file, so it's not surprising.
I see. I mean, I don't understand why there are 16,000 products on https://jp-en.openfoodfacts.org/ while I have only 900.
Oh, it seems that just not all of them have an ingredient list in Japanese.
I created a validation dataset from OFF texts, off_validation_dataset.csv (42 languages, 15-30 texts per language), and validated the FastText and lingua models on it. I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (this and this). For languages they don't support, I used Google Translate and ChatGPT for verification. (As a result, after correcting the labels, some languages have fewer than 30 texts.) Accuracy of the models: Should I compare their accuracy on short texts only, or should I try to retrain fasttext?
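For context, a minimal sketch of how such a comparison can be run over the validation CSV (the column names "text" and "lang" and the fasttext model path are assumptions):

```python
# Hedged sketch: predict the language of each validation text with fasttext and
# lingua, then compare accuracy against the manually corrected labels.
import csv
import fasttext
from lingua import LanguageDetectorBuilder

ft_model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model
lingua_detector = LanguageDetectorBuilder.from_all_languages().build()

def fasttext_lang(text: str) -> str:
    labels, _ = ft_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

def lingua_lang(text: str) -> str:
    language = lingua_detector.detect_language_of(text)
    return language.iso_code_639_1.name.lower() if language else "unk"

with open("off_validation_dataset.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

for name, predict in [("fasttext", fasttext_lang), ("lingua", lingua_lang)]:
    correct = sum(predict(r["text"]) == r["lang"] for r in rows)
    print(f"{name}: accuracy = {correct / len(rows):.3f}")
```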
Hello @korablique, thank you for the analysis! So if I understood correctly, the … And can you provide the metrics for each language? For reference, using duckdb, I computed the number of items for each language:
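The query itself is not preserved in this thread; below is a hedged reconstruction with duckdb over the Hugging Face Parquet files (the hf:// glob and the "lang" column are assumptions about the repository layout):

```python
# Hedged sketch: per-language product counts straight from the Parquet dump.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # needed for hf:// / https paths
query = """
    SELECT lang, COUNT(*) AS n
    FROM read_parquet('hf://datasets/openfoodfacts/product-database/**/*.parquet')
    GROUP BY lang
    ORDER BY n DESC
    LIMIT 20
"""
print(con.execute(query).fetchdf())
```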
I've just added a new method to the Python SDK to analyze the ingredients in a given language: … Using the … (edit: you need the latest version of the SDK for it to work).
Good job @korablique!
Yes
Serbian (sr), Bosnian (bs) and Croatian (hr) are very similar, so models confuse them. I talked to a friend from Serbia and he said that they are basically the same language with only tiny variations. Also, I treated the variants of Norwegian as one language. Sorry, I didn't think to filter only short texts from the beginning; I'll calculate the metrics again after I improve the dataset.
Those look like good results, congrats!
I would suggest also adding the F1-score as a metric!
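Following that suggestion, a short sketch of computing a macro F1-score and a per-language report with scikit-learn, assuming y_true / y_pred are lists of ISO codes produced as in the comparison above (the sample values are illustrative only):

```python
# Hedged sketch: macro F1 plus per-language precision/recall/F1.
from sklearn.metrics import classification_report, f1_score

y_true = ["fr", "fr", "de", "ja", "sr"]  # illustrative labels only
y_pred = ["fr", "en", "de", "ja", "hr"]

print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, zero_division=0))
```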
I recalculated the metrics on short texts only (no more than 10 words), with 30 texts per language.
Problem
We're currently using fasttext for language identification.
This is especially useful for detecting the language of an ingredient list extracted automatically by an ML model, or added by a contributor.
However, fasttext was trained on data that is quite different from ingredient lists (Wikipedia, Tatoeba and SETimes).
Sometimes the model fails on obvious cases, such as this one (a French ingredient list):
This behaviour is mostly present for short ingredient lists.
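The failing example itself is not reproduced here, but this kind of case can be checked directly against the off-the-shelf model (the sample text below is illustrative, not the list from the issue):

```python
# Hedged sketch: query the pretrained fasttext language-ID model on a short
# French ingredient list and inspect the top-3 predictions.
import fasttext

model = fasttext.load_model("lid.176.bin")
labels, probs = model.predict("farine de blé, sucre, beurre, sel", k=3)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), round(float(prob), 3))
```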
We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).
Requirements
Using fasttext is not a requirement: we can either train a new fasttext model, or train a model with PyTorch/TensorFlow and export it to ONNX format.
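As an illustration of the first option, a minimal sketch of training a supervised fasttext classifier on ingredient lists (file names and hyperparameters are assumptions, not a recommended configuration):

```python
# Hedged sketch: train a fasttext language identifier on ingredient lists.
# Training data is expected in fasttext's supervised format, one example per line:
#   __label__fr farine de blé, sucre, beurre, oeufs
import fasttext

model = fasttext.train_supervised(
    input="ingredients.train.txt",  # hypothetical training file
    epoch=25,
    lr=0.5,
    wordNgrams=2,
    minn=2,  # character n-grams help on short, noisy ingredient lists
    maxn=4,
)

print(model.test("ingredients.valid.txt"))  # (n_samples, precision@1, recall@1)
model.save_model("ingredient_lang_id.bin")
```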