Coding for tomorrow incomplete file #215

Open
RavanJAltaie opened this issue Sep 5, 2023 · 8 comments

@RavanJAltaie

The Coding for Tomorrow recipe ran successfully, but the file in the dev library is incomplete: none of the internal links are clickable.
https://farm.openzim.org/recipes/codingfortomorrow_de_all

https://dev.library.kiwix.org/viewer#codingfortomorrow_de_all_2023-08/A/coding-for-tomorrow.de/downloads/

Could you please check?

@rgaudin
Member

rgaudin commented Sep 5, 2023

Clicking on a download link, you get an error message:

Sorry, the url https://coding-for-tomorrow.de/wp-content/uploads/2020/11/Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf is not found on this server

As you can see, this URL doesn't share the same prefix as the URL of the recipe (https://coding-for-tomorrow.de/wp-content is not within https://coding-for-tomorrow.de/downloads/). You need to change the scope to allow scraping such URLs.
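
The prefix check described above can be sketched roughly as follows. This is a minimal Python illustration, assuming the scope prefix is simply the recipe URL truncated after the last slash of its path; `scope_prefix` and `in_prefix_scope` are hypothetical helper names, not the crawler's actual code.

```python
from urllib.parse import urlsplit


def scope_prefix(seed_url: str) -> str:
    """Return the seed URL truncated after the last "/" of its path (assumed rule)."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"  # treat a bare host as the site root
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_prefix_scope(seed_url: str, candidate_url: str) -> bool:
    """True when the candidate URL starts with the seed's prefix."""
    return candidate_url.startswith(scope_prefix(seed_url))


seed = "https://coding-for-tomorrow.de/downloads/"
pdf = (
    "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/"
    "Informationen-zum-neuen-Online-Angebot-von-Coding-For-Tomorrow.pdf"
)

print(scope_prefix(seed))          # https://coding-for-tomorrow.de/downloads/
print(in_prefix_scope(seed, pdf))  # False: the PDF lives under /wp-content/
```

Under that assumption, every link pointing outside /downloads/ is skipped, which would match the error message quoted above.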

@rgaudin rgaudin closed this as completed Sep 5, 2023
@rgaudin rgaudin added the recipe label Sep 5, 2023
@RavanJAltaie
Author

@rgaudin I tried changing the scope to Any, page, and prefix, and the resulting file is still the same.
Could you please let me know which value I should use for the scope parameter?

@rgaudin
Member

rgaudin commented Sep 26, 2023

That's exactly why some documentation is needed. All those scopes have different effects.

You haven't tested Any and that's good. I'd strongly advise against it as it would crawl anything. prefix is the default and page is somewhat similar.

I advise you to try host (it will grab anything under coding-for-tomorrow.de) and see how that goes. I think custom is often appropriate, but it requires specifying includes and excludes, which is very tedious.

There's no documentation on those scopes; the code is at https://github.com/webrecorder/browsertrix-crawler/blob/165a9787af8a7dce6b0acb5f91e6803ef525fd5b/util/seeds.js#L75
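
To make the difference between the scopes discussed here concrete, the following is a simplified Python model of prefix, host, and any as they are described in this comment. It is an assumption-laden sketch, not a port of the linked seeds.js: `in_scope` is a hypothetical function, page and custom are deliberately left out, and the YouTube URL is only a stand-in for an off-site link.

```python
from urllib.parse import urlsplit


def in_scope(seed_url: str, candidate_url: str, scope: str) -> bool:
    """Simplified scope model: decide whether a candidate URL would be crawled."""
    seed = urlsplit(seed_url)
    candidate = urlsplit(candidate_url)
    if scope == "any":
        # Everything is accepted, including other websites.
        return True
    if scope == "host":
        # Same scheme and host as the seed, whatever the path.
        return (candidate.scheme, candidate.netloc) == (seed.scheme, seed.netloc)
    if scope == "prefix":
        # Same "directory" as the seed URL (assumed rule).
        path = seed.path or "/"
        prefix = f"{seed.scheme}://{seed.netloc}{path[: path.rfind('/') + 1]}"
        return candidate_url.startswith(prefix)
    raise ValueError(f"scope not modelled in this sketch: {scope!r}")


seed = "https://coding-for-tomorrow.de/downloads/"
on_site = "https://coding-for-tomorrow.de/wp-content/uploads/2020/11/file.pdf"
off_site = "https://www.youtube.com/"

for scope in ("prefix", "host", "any"):
    print(scope, in_scope(seed, on_site, scope), in_scope(seed, off_site, scope))
# prefix False False
# host True False
# any True True
```

In this model, host is the narrowest of the three that still reaches files under /wp-content/, while any also follows links to other sites.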

@RavanJAltaie RavanJAltaie reopened this Dec 5, 2023
@RavanJAltaie
Author

I tried changing the scope. With host, the website was scraped, but without the needed projects.
Could you please check?
https://farm.openzim.org/recipes/codingfortomorrow_de_all

I disabled the recipe and marked the resulting file for deletion.

@benoit74
Collaborator

benoit74 commented Dec 5, 2023

Now that the configured URL is https://coding-for-tomorrow.de, what did you expect from changing the scope from the default (prefix) to host?

I don't get what you expected by making this change.
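
One way to see why this change may have little effect: under a simplified model where prefix means "same directory as the configured URL" and host means "same scheme and host", both scopes collapse to the same root once the configured URL is the site root. The Python sketch below illustrates this; `prefix_root` and `host_root` are hypothetical helpers and the rules are assumptions, not the crawler's implementation.

```python
from urllib.parse import urlsplit


def prefix_root(seed_url: str) -> str:
    """Prefix-style root: the seed URL truncated after the last "/" of its path."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"  # a bare host counts as the site root
    return f"{parts.scheme}://{parts.netloc}{path[: path.rfind('/') + 1]}"


def host_root(seed_url: str) -> str:
    """Host-style root: scheme and host only."""
    parts = urlsplit(seed_url)
    return f"{parts.scheme}://{parts.netloc}/"


new_seed = "https://coding-for-tomorrow.de"
old_seed = "https://coding-for-tomorrow.de/downloads/"

print(prefix_root(new_seed) == host_root(new_seed))  # True: same coverage
print(prefix_root(old_seed) == host_root(old_seed))  # False: prefix was narrower
```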

That being said, I analyzed the issue a bit:

All that being said, as you can see, a developer would need to put significant effort into improving the scraping of this website, and I'm not even sure it would succeed (at the very least, there is a significant chance that things like the YouTube videos will not be available).

@Popolechien what are your views on this? Do you think this is worth the effort?

@Popolechien
Contributor

It's in German, which is not a core target audience. We can drop it, I think.

@RavanJAltaie
Author

The related issue is marked as upstream:
openzim/zim-requests#460

@benoit74
Collaborator

Let's keep this issue open. I doubt we will make any progress in the coming months due to lack of resources, but the ZIM request is legitimate. I've identified a potential solution, and we should fix this at some point. It is neither impossible nor an immense effort, just not a priority for now.

@benoit74 benoit74 reopened this Feb 19, 2024
@benoit74 benoit74 added this to the later milestone Jun 18, 2024