Problem
Many data quality errors in ingredient lists come from spelling errors. These are mostly introduced during OCR, either because the image is blurry or because of limitations of the OCR model (we use Google Cloud Vision). As a result, we have:
1. ingredients with spelling mistakes
2. ingredients not separated by a comma (or another ingredient list separator), resulting in an "unknown ingredient" warning
3. incorrect line continuation (the way Google Cloud Vision joins words to make paragraphs): the ingredient list has unrelated words inside it.
I think (1) and (2) can be corrected using language models for spelling correction; (3) is trickier.
We implemented a spellcheck module in Robotoff using Elasticsearch, but it is not good enough to be used without human supervision. It is currently unused and will soon be removed from the codebase.
Proposed solution
Explore the use of large language models for ingredient spellcheck. We must ensure the model does not hallucinate new ingredients or modify ingredients that were already valid. ChatGPT (GPT-3.5) seems like a good starting point.
If it works correctly, we can try to generate a high-quality spellcheck dataset using ChatGPT (a dataset mapping the text to correct to its corrected version), and fine-tune an open source large language model that we can host on our servers to replicate this feature.
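The two constraints above (no new ingredients, no changes to already-valid ones) could be checked automatically before a model correction is accepted. Below is a minimal sketch, not something that exists in Robotoff: the function names, the `known_ingredients` set (which would have to be built from the ingredient taxonomy) and the 0.7 similarity threshold are all assumptions made for illustration.

```python
import difflib
import re


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; commas and other separators are dropped."""
    return re.findall(r"\w+", text.lower())


def is_safe_correction(
    original: str,
    corrected: str,
    known_ingredients: set[str],  # hypothetical: words already valid in the taxonomy
    min_similarity: float = 0.7,  # assumed threshold, to be tuned on real data
) -> bool:
    """Return True if the correction only touches unknown tokens and stays close to them."""
    orig_tokens = tokenize(original)
    corr_tokens = tokenize(corrected)
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        before = orig_tokens[i1:i2]
        after = corr_tokens[j1:j2]
        # Rule 1: a token that was already valid must not be changed or removed.
        if any(token in known_ingredients for token in before):
            return False
        # Rule 2: the new text must resemble what it replaces. A pure insertion
        # (empty `before`) scores 0 and is rejected as a possible hallucination.
        ratio = difflib.SequenceMatcher(a="".join(before), b="".join(after)).ratio()
        if ratio < min_similarity:
            return False
    return True
```

Because the tokenizer ignores punctuation, adding a missing comma (case 2) is never flagged. The check is deliberately conservative: it also rejects removal of unrelated words (case 3), which would need separate handling.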
Where to get the data?
The best way to get a list of products whose ingredient lists contain errors is to download the Open Food Facts JSONL dataset and look for products with ingredient quality warnings.
The data quality warnings tags are available in the data_quality_warnings_tags field. Relevant tags for spotting ingredient lists with errors:
en:ingredients-unknown-score-above-0
en:ingredients-50-percent-unknown
en:ingredients-60-percent-unknown
en:ingredients-70-percent-unknown
en:ingredients-80-percent-unknown
en:ingredients-90-percent-unknown
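As an illustration of the selection described above, here is a minimal sketch that streams the JSONL dataset and keeps products carrying at least one of these tags. The function names and the local file path are hypothetical. The `code` and `ingredients_text` fields used in the usage note are assumed to be present in the dump and should be checked against the actual data.

```python
import gzip
import json
from typing import Iterable, Iterator

# Tags listed above that signal a possibly faulty ingredient list.
WARNING_TAGS = {
    "en:ingredients-unknown-score-above-0",
    "en:ingredients-50-percent-unknown",
    "en:ingredients-60-percent-unknown",
    "en:ingredients-70-percent-unknown",
    "en:ingredients-80-percent-unknown",
    "en:ingredients-90-percent-unknown",
}


def filter_products(lines: Iterable[str]) -> Iterator[dict]:
    """Yield products (one JSON object per line) that carry a relevant warning tag."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        product = json.loads(line)
        tags = product.get("data_quality_warnings_tags") or []
        if WARNING_TAGS.intersection(tags):
            yield product


def iter_dataset(path: str) -> Iterator[dict]:
    """Stream a (possibly gzipped) JSONL dump without loading it into memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        yield from filter_products(f)
```

For example, `for product in iter_dataset("products.jsonl.gz"): print(product.get("code"), product.get("ingredients_text"))` would print candidate ingredient lists, where `products.jsonl.gz` stands for wherever the dump was saved locally.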
Additional resources
Wiki page about ingredient data quality.
You can test whether the corrected text is recognized by the Open Food Facts server using this link:
https://world.openfoodfacts.org/cgi/test_ingredients_analysis.pl?lc=it
Note that the lc parameter (lc=it in the link above; use lc=fr for French) gives the language of the ingredient list, which the server uses to parse it. Unknown ingredients do not necessarily indicate a spelling error: some ingredients are not recognized simply because they are missing from our ingredient taxonomy. Ingredient coverage depends on the language (good for English and French, poor for low-resource languages).