wget2 does not save URLs to folders correctly (for example: site.com/folder1) #365
Comments
How would wget2 know whether it is a folder or a file? Sites often use extensionless page URLs, so the behavior you describe seems to make sense. If you add a slash at the end, it knows it is a folder and will save it as index.html. |
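The trailing-slash rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the comment, not wget2's actual code; the helper name `local_path` and its `default_page` parameter are hypothetical.

```python
from urllib.parse import urlsplit


def local_path(url: str, default_page: str = "index.html") -> str:
    """Hypothetical helper: map a URL to a path relative to the download root."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        # A trailing slash marks a directory, so the default page goes inside it.
        path += default_page
    # Without a trailing slash the last segment is kept as a plain file name.
    return path.lstrip("/")


print(local_path("https://site.com/folder1/"))
print(local_path("https://site.com/folder1"))
```

Under this rule the client never has to guess: only the slash in the URL decides whether `folder1` becomes a directory or a file.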
@AlexBO222 Can we close this? |
According to the HTML specification, it is a folder whether or not the URL ends with a slash. Site settings should not affect how web scanners work. A web scanner always treats it as a folder; on the server side the user may be given a file or a folder, and that does not affect the scanner. If you specify the option --default-page=index.html, wget2 saves a URL with a trailing slash correctly and creates index.html in the folder, but without the trailing slash it does not understand. This is how the server is set up, and you can't tell them what to do. They decided to do it without the slash, they have the right, it is a valid format, and all scanners automatically convert it to a URL with a slash when saving. wget2 doesn't understand this; it doesn't follow the HTML specification. |
P.S.. |
First, I am not on the dev team, so take these things with a grain of salt:
Correct, we don't control servers. While some web servers may change a request without a slash to the directory with a slash, most web servers today use script routing anyway. Here are some examples from top sites where pages are clearly shown without an extension
No, it does not. I assume you mean scrapers here, but wget2 actually follows wget's behavior, which seems very appropriate:
Can you link to the spec you are talking about? wget2 is a scraper as well, so I am not sure a spec would apply to the local filename it ends up using. More importantly, I think the statement is fundamentally wrong. Web browsers do not assume a URL without an extension is a folder. The obvious case in point is any relative url/src/image, etc. If the website is "https://example.com/site1" and it has an image of
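The point about relative URLs can be checked with Python's standard `urljoin`, which implements the usual relative-reference resolution. This is a small sketch; the image name `logo.png` is a hypothetical stand-in, since the example in the comment is cut off.

```python
from urllib.parse import urljoin

IMAGE = "logo.png"  # hypothetical relative reference, not taken from the thread

# Base URL without a trailing slash: "site1" is treated as a file name,
# so the relative reference replaces it.
without_slash = urljoin("https://example.com/site1", IMAGE)

# Base URL with a trailing slash: "site1/" is treated as a directory,
# so the relative reference is resolved inside it.
with_slash = urljoin("https://example.com/site1/", IMAGE)

print(without_slash)
print(with_slash)
```

The two results differ, which suggests that an extensionless final segment is not treated as a folder during resolution unless the slash is actually present.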
It seems like you may be using a translator, but even then I don't think you get that English spit out without some rude input. wget2 is developed as an open source project that (I would assume) you are not a major contributor to. Even if you were, though, it still wouldn't be a proper way to request features. It is open source, though, so you can always modify it yourself. |
OK, let's dive a bit deeper into this. It is a very old problem that has never really been solved. First, let me clarify a few things.
If you think about it, you see there is a conflict. There are several ways to solve that conflict, but every way has its pros and cons. How
|
Maybe I didn't find the right option, but wget2 doesn't download and save URLs to folders correctly.
For example:
site.com/folder1.
If you specify the -E option, the file is saved in the root as /folder1.html.
If you don't specify the -E option, the file is saved in the root as /folder1, without an extension.
The file should be saved as /folder1/index.html.