Better auto-detection of multilanguage content #187

Open
kelson42 opened this issue May 3, 2023 · 1 comment

Comments

@kelson42
Contributor

kelson42 commented May 3, 2023

Currently the ZIM "Language" metadata can only be filled automatically with a single language. Zimit reads it from the welcome page and sets it, even if the other pages use other languages.

It would be better to check all the pages, gather the list of languages, and set the "Language" metadata accordingly at the end.
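
As a rough illustration of the idea (not how Zimit currently works), here is a minimal sketch that assumes each page declares its language through the `lang` attribute of its `<html>` tag. The names `HtmlLangParser`, `page_language`, `main_languages` and the `min_share` threshold are all hypothetical. The result holds primary language subtags such as `en`; converting them to the codes expected by the ZIM "Language" metadata (ISO 639-3, as far as I know) is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Record the `lang` attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            value = dict(attrs).get("lang")
            if value:
                # Keep only the primary subtag: "fr-CA" -> "fr"
                self.lang = value.strip().lower().split("-")[0]


def page_language(html):
    """Return the declared language of one HTML page, or None."""
    parser = HtmlLangParser()
    parser.feed(html)
    return parser.lang


def main_languages(pages, min_share=0.1):
    """Return languages declared by at least `min_share` of the pages
    that declare one, most frequent first.

    `min_share` is an assumed cut-off meant to drop languages that
    appear on only a handful of pages.
    """
    counts = Counter(lang for lang in map(page_language, pages) if lang)
    total = sum(counts.values())
    return [lang for lang, n in counts.most_common() if n / total >= min_share]
```

Only HTML entries are covered here; other content types such as PDF files would need a different detection method.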

Follows from the comments on #186

@kelson42 kelson42 changed the title Better handling of multilanguage content Better auto-detection of multilanguage content May 3, 2023
@rgaudin
Member

rgaudin commented May 3, 2023

I'm not sure about this. I think what you propose will decrease quality while we already have quality issues with zimit.

The goal of this metadata is to inform users about the main languages in use in the ZIM so they can filter it in or out. It's not a technical one like the Counter, which exhaustively lists all content types.

I'm afraid we'll often end up with several languages that are meaningless to the ZIM… and it would be time-consuming (parsing all HTML entries) while only reporting the languages of HTML entries, not those of PDF files, for instance.

It should be set manually because that's what works best. Even someone unfamiliar with the website can visit it and find out what the main languages are in under 30 seconds.

Right now we have a shortcut that uses the main page's language because that's the most frequent use case.

I propose we make the language param mandatory and add special handling for a homepage value, which would use the homepage's language. We could even set homepage as the default value in youzim.it's form.
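
To make the proposal concrete, here is a minimal sketch of what a mandatory language parameter with a special `homepage` value could look like. The flag name `--lang`, the sentinel `HOMEPAGE`, and the functions `build_parser` and `resolve_languages` are assumptions for illustration, not the actual zimit or warc2zim interface. It also accepts a comma-separated list of codes, as one possible way to specify several languages.

```python
import argparse

HOMEPAGE = "homepage"  # hypothetical sentinel value


def build_parser():
    """Build a parser where the language parameter is mandatory."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--lang",
        required=True,
        help="comma-separated language codes, or 'homepage' to use "
        "the language declared by the homepage",
    )
    return parser


def resolve_languages(lang_arg, homepage_lang):
    """Turn the raw parameter into a list of language codes.

    `homepage_lang` is whatever language was detected on the homepage
    (None if nothing was detected); it is only used for the sentinel.
    """
    if lang_arg.strip().lower() == HOMEPAGE:
        if not homepage_lang:
            raise ValueError("homepage does not declare a language")
        return [homepage_lang]
    langs = [code.strip() for code in lang_arg.split(",") if code.strip()]
    if not langs:
        raise ValueError("no language given")
    return langs
```

With this shape, `--lang homepage` keeps today's shortcut as an explicit choice, while `--lang eng,fra` sets the main languages by hand.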

Independently of this, warc2zim should allow specifying multiple languages, which it doesn't at the moment.
