Extract nutritional data from scraped websites #375

Closed
mblennegard opened this issue Jun 24, 2024 · 9 comments

mblennegard (Contributor) commented Jun 24, 2024

Is your feature request related to a problem? Please describe.
Many websites today already include nutritional data as part of the recipe.
Instead of trying to calculate it from generic ingredients within Recipya, it would be better to extract this information directly from the recipe as-is.

Describe the solution you'd like
If nutritional information is part of the recipe, try to extract it.
If it is not available, fall back to the current way of calculating the nutritional data.

For websites requiring custom scrapers this will of course have to be handled per website, but since nutritional information is part of the LD+JSON schema, it should be possible to solve this automatically for a large number of websites by adding nutrition extraction to the LD+JSON part of the scraper.
Additionally, this would solve issues where the automatic nutritional calculation fails because the recipe is in a language other than English.

mblennegard added the enhancement (New feature or request) label Jun 24, 2024
reaper47 (Owner) commented Jun 24, 2024

The nutritional information is currently extracted from the LD+JSON when available: https://github.com/reaper47/recipya/blob/main/internal/models/schema-recipe.go#L36. If it is not available, then this function will execute in the background: https://github.com/reaper47/recipya/blob/main/internal/services/sqlite_service.go#L628.
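
In other words, the flow is roughly the following; the type and function names below are hypothetical stand-ins for illustration, not Recipya's actual API:

package scraper

// Hypothetical stand-in types, not Recipya's actual models.
type Nutrition struct{ Calories, Fat, Carbohydrates, Protein string }

type ScrapedRecipe struct {
	Ingredients     []string
	SchemaNutrition *Nutrition // non-nil when the LD+JSON block carries a NutritionInformation object
}

// nutritionFor prefers the nutrition embedded in the recipe's LD+JSON and only
// falls back to calculating it from the ingredient list, as described above.
func nutritionFor(r ScrapedRecipe) Nutrition {
	if r.SchemaNutrition != nil {
		return *r.SchemaNutrition
	}
	return calculateFromIngredients(r.Ingredients)
}

// calculateFromIngredients stands in for the existing background calculation.
func calculateFromIngredients(ingredients []string) Nutrition {
	return Nutrition{} // placeholder
}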

Which website did you fetch that calculated the nutrition instead of extracting it?

mblennegard (Contributor, Author) commented

@reaper47 I was a bit too quick. I noticed myself, while browsing the scraper.go file, that nutrition was already part of the scraper... but you were even quicker to respond here. 😋

I tried with the following recipe:
https://tyngre.se/recept/kladdkakor/kladdkaka-med-hasselnoetter-och-brynt-smoer

It seems to follow the LD+JSON schema for the nutrition as well, at least as far as I can tell.

{
    "@context": "http://schema.org",
    "@type": "Recipe",
    "name": "Kladdkaka med hasselnötter och brynt smör",
    "image": "https://cdn.sanity.io/images/fbgp6g6y/production/69bbb647d50e5d715f9e85dc38ed1f94d07b3401-3024x4032.jpg",
    "author": {
        "@type": "Person",
        "name": "Skippa Sockret"
    },
    "description": "Kladdkaka med hasselnötter och brynt smör",
    "totalTime": "25 min ",
    "keywords": "bakmix kladdkaka, fika, dessert",
    "recipeCategory": "Kladdkakor",
    "recipeIngredient": [
        "4.25 dl bakmix kladdkaka ",
        "2 dl valfri mjölk",
        "2 msk olja",
        "50 g smör ",
        "1 dl rostade hasselnötter (eller efter smak)"
    ],
    "recipeInstructions": [
        {
            "@type": "HowToStep",
            "text": "Sätt ugnen på 150 grader. "
        },
        {
            "@type": "HowToStep",
            "text": "Mät upp kladdkakemixen och blanda ihop med mjölk och olja med hjälp av en slickepott. "
        },
        {
            "@type": "HowToStep",
            "text": "Bryn smöret i en kastrull tills du får en nötig karaktär. "
        },
        {
            "@type": "HowToStep",
            "text": "Grovhacka hasselnötterna. "
        },
        {
            "@type": "HowToStep",
            "text": "Tillsätt nu de brynta smöret och hasselnötterna i smeten, blanda runt. "
        },
        {
            "@type": "HowToStep",
            "text": "Smöra eller olja en rund springform och täck med lite kokos eller ströbröd alt. använd ett bakplåtspapper. Häll i smeten och grädda i ugnen cirka 15 minuter. "
        },
        {
            "@type": "HowToStep",
            "text": "Ta ut och låt svalna, låt gärna kladdkakan stå i kylen ett par timmar för godast resultat. Servera sedan med en riktigt god vaniljglass eller en klick grädde. "
        }
    ],
    "nutrition": {
        "@type": "NutritionInformation",
        "servingSize": 8,
        "calories": 1109,
        "fatContent": 90,
        "carbohydrateContent": 77,
        "proteinContent": 42
    }
}

reaper47 (Owner) commented

Something is off because the nutrition is indeed there. I'll check it out.

reaper47 added the bug (Something isn't working) and go (Pull requests that update Go code) labels and removed the enhancement (New feature or request) label Jun 24, 2024
reaper47 added this to Recipya Jun 24, 2024
reaper47 added this to the v1.2.0 milestone Jun 24, 2024
reaper47 moved this to Backlog in Recipya Jun 24, 2024
mblennegard (Contributor, Author) commented Jun 25, 2024

@reaper47 I debugged this issue and the root cause is that this particular website stores only numeric values for the nutritional information, whereas UnmarshalJSON for the NutritionSchema expects string-only values, e.g.:

if val, ok := x["carbohydrateContent"].(string); ok {

As a result, the mapping of the nutrition fields is essentially skipped.

I tested this (rather crudely) for one of the properties with the below change, assuming that the nutrition function inside Recipya expects string values. This change then populated the property correctly in the final imported recipe.

if val, ok := x["carbohydrateContent"].(float64); ok {
	n.Carbohydrates = strconv.FormatFloat(val, 'f', -1, 64)
}

Perhaps the UnmarshalJSON function for the NutritionSchema could check whether the source data is a string, a float or an integer and convert the values accordingly, to accommodate different implementations of the LD+JSON schema?
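
A minimal sketch of that idea, reusing the field mapping from the snippets above (toString is a hypothetical helper and the package name is assumed, not existing Recipya code):

package models

import "strconv"

// toString normalises a raw LD+JSON nutrition value to the string the schema
// fields expect: strings pass through unchanged, and JSON numbers (which
// encoding/json decodes as float64) are formatted as strings.
func toString(v any) string {
	switch val := v.(type) {
	case string:
		return val
	case float64:
		return strconv.FormatFloat(val, 'f', -1, 64)
	default:
		return ""
	}
}

Inside UnmarshalJSON the per-field mapping would then become, for example, n.Carbohydrates = toString(x["carbohydrateContent"]).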

reaper47 (Owner) commented

Excellent, thank you for looking into it! That is exactly it. We shall add a test in https://github.com/reaper47/recipya/blob/main/internal%2Fmodels%2Fschema-recipe_test.go#L303 and modify the UnmarshalJSON function you linked to cover nutrition fields that use numerical values.
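
A test along these lines could cover the numeric case (the Calories field name and the JSON fixture are assumptions, and it relies on the test file's existing encoding/json and testing imports):

func TestNutritionSchema_UnmarshalNumbers(t *testing.T) {
	// Numeric nutrition values, as served by tyngre.se, instead of the usual strings.
	data := []byte(`{"@type":"NutritionInformation","calories":1109,"carbohydrateContent":77}`)

	var n NutritionSchema
	if err := json.Unmarshal(data, &n); err != nil {
		t.Fatal(err)
	}

	if n.Calories != "1109" || n.Carbohydrates != "77" {
		t.Errorf("numeric nutrition values were not converted to strings: %+v", n)
	}
}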

mblennegard (Contributor, Author) commented Jun 26, 2024

@reaper47 I have it handling both string and number values on my end now, but this raises an interesting point: when we only have numbers, we are also missing the unit, e.g. grams, milligrams etc.

As far as I know, nutritional information is always in metric, even on American recipe sites. Have you seen anything else during your investigations?

If it is indeed always metric, then we can add static units for each property, e.g. calories in kcal; fat, sugar and protein in grams; sodium in milligrams; etc., and use those whenever the nutritional information has numeric values (see the sketch below).

Edit: At least the recipe schema specifies metric units, so I think I can assume metric when setting static units for each property. Do you agree?
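
As a sketch of those static defaults (the variable name and the exact property list are assumptions; the units would only be applied when the schema gives a bare number):

// Hypothetical metric defaults used when an LD+JSON nutrition field is a bare number.
var defaultNutritionUnits = map[string]string{
	"calories":            "kcal",
	"carbohydrateContent": "g",
	"fatContent":          "g",
	"sugarContent":        "g",
	"proteinContent":      "g",
	"sodiumContent":       "mg",
}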

reaper47 (Owner) commented

Yes, nutrition is always in the metric system. I have yet to see a product in a grocery store in the US whose nutrition facts are not metric. We can safely assume the units you mentioned when they are not specified.

mblennegard (Contributor, Author) commented

Implemented in pull request #382

reaper47 (Owner) commented Jul 4, 2024

Pull request #382 has been merged! Closing this issue.

reaper47 closed this as completed Jul 4, 2024
github-project-automation bot moved this from Backlog to Done in Recipya Jul 4, 2024