Out of scope homepage redirect #138
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions. |
What are the practical consequences? Does this mean that those creating Zimfarm recipes or running Zimit will have to define the scope carefully? Are we getting scrapes that are too small (or too big) as a result of this change? Is this at all related to the appearance of ZIM files that are too small in |
No existing recipe would be affected: they were all passing the previous check, so none of them has a redirect to an out-of-scope URL. I imagine that recipes/requests with such a redirect would complete successfully within seconds and create a tiny ZIM, but we should test the scenario to be sure. |
I have difficulty judging the level of impact of this ticket/bug/problem. Can someone help me? |
I don't think I can be more clear than the explanation above. Lines 470 to 490 in c98e450
@rgaudin Sorry, my question was not specific enough. I mean the quantitative impact. Do we have a lot of scrapes impacted, or only a few each year? |
No idea and we can't really know: this information is just a warning in the logs. |
Clearly not a 2.0 issue from my PoV, I never saw this happening in real situations. |
Zimit 1.x, following #76, had a mechanism to ensure that, should the passed URL redirect to an out-of-scope domain, the process would halt early, as it would otherwise result in a barely usable ZIM (homepage not in ZIM).
With improvements to browsertrix-crawler, `--scope` has been removed in favor of a `--scopeType` that can be:

- `page`: Single URL
- `page-spa`: idem plus any fragment link to that URL
- `prefix` (default): any URL that shares same prefix up to the last `/`
- `host`: any URL that shares same prefix up to the first `/`
- `domain`: Any URL on same domain or on any subdomain^^ (matched against non-`www.` if it was present).
- `any`: Anything
- `custom` which uses `--include` and `--exclude` (regexp)

Note that except for `page` that is a single URL, others automatically include both `http` and `https` variants of matches.

There's no documentation but here's implementation
With this new, complex scope mechanism, we had to remove our feature that checked if the redirected-to homepage is out-of-scope as it would require us to duplicate that whole scope code in zimit. Instead, a warning is displayed if the homepage is a redirection.
Question: is that enough? Do we want a different behavior? Should we duplicate that whole scope matching logic to fail early should target homepage be out-of-scope?
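For discussion purposes, the two behaviors being compared (warn only, or fail early) could be sketched as below. This is not zimit code: `check_homepage_redirect`, `OutOfScopeHomepage` and the `same_host` predicate are hypothetical, the redirect target is assumed to be already known, and the scope predicate is pluggable precisely because duplicating the real scope logic is the open question.

```python
from typing import Callable
from urllib.parse import urlsplit


class OutOfScopeHomepage(Exception):
    """Raised when the homepage redirects outside of the crawl scope."""


def same_host(seed: str, target: str) -> bool:
    # Stand-in predicate: the real scope rules are richer than this.
    return urlsplit(seed).hostname == urlsplit(target).hostname


def check_homepage_redirect(
    seed: str,
    final_url: str,
    in_scope: Callable[[str, str], bool] = same_host,
    fail_early: bool = False,
) -> list:
    """Return warnings for a redirected homepage, or raise if fail_early is set."""
    warnings = []
    if final_url == seed:
        return warnings  # no redirect, nothing to report
    warnings.append(f"homepage {seed} redirects to {final_url}")
    if not in_scope(seed, final_url):
        if fail_early:
            raise OutOfScopeHomepage(f"{final_url} is out of scope for {seed}")
        warnings.append("redirect target is out of scope; ZIM may lack its homepage")
    return warnings
```

With `fail_early=False` this mirrors the warning-only behavior described above; `fail_early=True` mirrors the Zimit 1.x behavior, at the cost of needing a trustworthy `in_scope` predicate.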