Use large language models (LLM) to perform ingredient list spellcheck #314

Open
Tracked by #289
raphael0202 opened this issue May 23, 2023 · 0 comments

Problem

Many data quality errors in ingredient lists stem from spelling errors. Most of these are introduced during the OCR process, either because the image is blurry or because of limitations of the OCR model (we use Google Cloud Vision). As a result, we have:

  1. ingredients with spelling mistakes
  2. ingredients not separated by a comma (or another ingredient list separator), resulting in an "unknown ingredient" warning
  3. incorrect line continuation (the way Google Cloud Vision joins words to make paragraphs): the ingredient list has unrelated words inside it.

I think (1) and (2) can be corrected using language models for spelling correction; (3) is trickier.
We implemented a spellcheck module in Robotoff using Elasticsearch, but it's not good enough to be used without human supervision: it's currently unused and will soon be removed from the codebase.

Proposed solution

Explore the use of large language models for performing ingredient spellcheck. We must ensure the model does not hallucinate new ingredients or modify ingredients that were already valid. ChatGPT (GPT-3.5) seems like a good starting point.
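
One possible way to guard against hallucinated ingredients is to check that every word in the corrected text stays close to some word in the original text. The snippet below is only a rough sketch of that idea using Python's standard `difflib` module; the function names and the 0.75 similarity cutoff are assumptions made for illustration, not something Robotoff implements. It works at the word level so that inserting a missing separator is not flagged.

```python
import difflib
import re


def extract_words(text: str) -> list[str]:
    """Return the lowercase alphabetic words of a text (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def suspicious_words(original: str, corrected: str, cutoff: float = 0.75) -> list[str]:
    """Return words of the corrected text with no close match in the original.

    A non-empty result suggests the model may have added or heavily
    rewritten an ingredient, so the correction should be reviewed.
    The cutoff value is an arbitrary assumption and would need tuning.
    """
    original_words = extract_words(original)
    flagged = []
    for word in extract_words(corrected):
        # get_close_matches returns the candidates whose similarity ratio
        # is at least `cutoff`; an empty list means no close match.
        if not difflib.get_close_matches(word, original_words, n=1, cutoff=cutoff):
            flagged.append(word)
    return flagged
```

For example, correcting "huille" to "huile" or adding a comma between "sucre" and "sel" would pass, while a correction that appends a new ingredient would be flagged. This check does not tell us whether a changed ingredient was already valid; that would require a lookup against the ingredient taxonomy.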

If it works correctly, we can try to generate a high-quality spellcheck dataset using ChatGPT (a dataset mapping each text to correct to its corrected version), and fine-tune an open source large language model that we can host on our servers to replicate this feature.
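
To make the dataset idea concrete, here is a minimal sketch of what one record could look like, serialized as JSONL. The field names (`original`, `corrected`, `lang`) are hypothetical and chosen only for illustration; no dataset schema has been decided yet.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class SpellcheckExample:
    """One training pair; all field names are illustrative assumptions."""

    original: str   # ingredient text as extracted by OCR
    corrected: str  # the same text after spelling correction
    lang: str       # language code of the ingredient list


def to_jsonl(examples: Iterable[SpellcheckExample]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(example), ensure_ascii=False) for example in examples
    )
```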

Where to get the data?

The best way to get a list of products whose ingredient list contains errors is to download the Open Food Facts JSONL dataset and look for products with ingredient quality warnings.
The data quality warning tags are available in the data_quality_warnings_tags field. Relevant tags for spotting ingredient lists with errors:

  • en:ingredients-unknown-score-above-0
  • en:ingredients-50-percent-unknown
  • en:ingredients-60-percent-unknown
  • en:ingredients-70-percent-unknown
  • en:ingredients-80-percent-unknown
  • en:ingredients-90-percent-unknown
  • ...
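
As a starting point, the filtering could look like the sketch below. It assumes that each line of the JSONL dataset is one product as a JSON object, and that the percentage tags all follow the `en:ingredients-N-percent-unknown` pattern seen in the list above; the function names and the default threshold are made up for this example.

```python
import json
import re
from typing import Iterable, Iterator, Optional

# Pattern inferred from the tags listed above (an assumption, not a spec).
UNKNOWN_PERCENT_TAG = re.compile(r"^en:ingredients-(\d+)-percent-unknown$")


def max_unknown_percent(product: dict) -> Optional[int]:
    """Return the highest N among 'N-percent-unknown' tags, or None if absent."""
    percents = []
    for tag in product.get("data_quality_warnings_tags") or []:
        match = UNKNOWN_PERCENT_TAG.match(tag)
        if match:
            percents.append(int(match.group(1)))
    return max(percents) if percents else None


def iter_products_with_unknown_ingredients(
    lines: Iterable[str], min_percent: int = 50
) -> Iterator[dict]:
    """Yield products whose warning tags report at least min_percent unknown."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        product = json.loads(line)
        percent = max_unknown_percent(product)
        if percent is not None and percent >= min_percent:
            yield product
```

Since the function accepts any iterable of lines, it can be fed a file object opened with `gzip.open(path, "rt", encoding="utf-8")` if the dump is compressed, so the whole dataset never has to be loaded in memory.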

Additional resources

Wiki page about ingredient data quality.

You can test if the corrected text is well-recognized by Open Food Facts server by using this link:
https://world.openfoodfacts.org/cgi/test_ingredients_analysis.pl?lc=it

Note that the lc parameter (lc=it in the link above) gives the language of the ingredient list, which is used to parse it. Unknown ingredients do not necessarily indicate a spelling error: some ingredients are not recognized simply because they are missing from our ingredient taxonomy. Ingredient coverage depends on the language (good for English and French, poor for low-resource languages).
