wget2 does not save URLs to folders correctly (for example: site.com/folder1) #365

Open
AlexBO222 opened this issue Dec 2, 2024 · 6 comments


@AlexBO222

Maybe I didn't find the right option, but the script doesn't download and save URLs to folders correctly.
For example:
site.com/folder1.

With the -E option, the file /folder1.html is saved in the root.
Without the -E option, the file /folder1 (no extension) is saved in the root.
The expected result is /folder1/index.html.

@mitchcapper
Contributor

How would wget2 know it is a folder vs a file? Sites can often use extensionless page URLs, so the mentioned behavior would seem to make sense. If you add a slash at the end it knows it is a folder and will save it as index.html.
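
To make the behavior described above concrete, here is a minimal sketch of a naming rule that looks only at the URL path. It is a simplified model, not wget2's actual code: the function name local_name and its parameters are hypothetical, and the adjust_extension branch is only a rough stand-in for -E that assumes the response is HTML.

```python
from urllib.parse import urlsplit


def local_name(url, adjust_extension=False, default_page="index.html"):
    """Hypothetical helper: guess a local file name from the URL path alone."""
    path = urlsplit(url).path.lstrip("/")
    if path == "" or path.endswith("/"):
        # Empty path or trailing slash: treat it as a directory
        # and store the content under the default page name.
        return path + default_page
    # No trailing slash: the client cannot tell file from folder,
    # so the last path segment is used as a file name.
    if adjust_extension and not path.lower().endswith((".html", ".htm")):
        # Rough stand-in for -E, assuming the response is HTML.
        path += ".html"
    return path


print(local_name("http://site.com/folder1"))
print(local_name("http://site.com/folder1", adjust_extension=True))
print(local_name("http://site.com/folder1/"))
```

Under these assumptions the three calls yield folder1, folder1.html and folder1/index.html, which matches what the original report observed.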

@rockdaboot
Owner

@AlexBO222 Can we close this?

@AlexBO222
Author

How would wget2 know it is a folder vs a file? Sites can often use extensionless page URLs, so the mentioned behavior would seem to make sense. If you add a slash at the end it knows it is a folder and will save it as index.html.

It is specified in the HTML specification: with or without a trailing slash, it is a folder. Site settings should not affect how web scanners work. The web scanner always treats it as a folder; on the server side the user can serve a file or a folder, and that does not affect the scanner.

If you specify the option --default-page=index.html, wget2 saves a URL with a trailing slash correctly and creates index.html in the folder, but without a slash at the end it does not understand. This is the way the server is set up, you can't tell them what to do, they decided to do it without the slash, they have the right, this is a valid format, it is automatically converted to a URL with a slash by all scanners when saving. And wget2 doesn't understand this, it doesn't follow the html specification.

@AlexBO222
Author

@AlexBO222 Can we close this?

P.S.
If wget2 doesn't understand the html specification, then yes, you can close the issue. But it is better to fix the program so that it works correctly and saves such sites, there are many, very many of them.

@mitchcapper
Contributor

mitchcapper commented Dec 15, 2024

First, I am not on the dev team so take these things with a grain of salt:

This is the way the server is set up, you can't tell them what to do, they decided to do it without the slash, they have the right, this is a valid format

Correct, we don't control servers. While some web servers may redirect a request without a slash to the directory with a slash, most web servers today are using script routing anyway. Here are some examples from top sites where pages are clearly shown without an extension

it is automatically converted to a URL with a slash by all scanners when saving

No, it does not. I assume you mean scrapers here, but wget2 actually follows wget's behavior, which seems very appropriate:

wget --content-on-error http://google.com/404 -E
--2024-12-15 08:37:19--  http://google.com/404
Resolving google.com (google.com)... 142.250.217.78, 2607:f8b0:400a:804::200e
Connecting to google.com (google.com)|142.250.217.78|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
Saving to: '404.html'

404.html                      100%[==============================================>]   1.53K  --.-KB/s    in 0s

2024-12-15 08:37:19 ERROR 404: Not Found.

root@b78aa4a4c841:/tmp/rt# wget --content-on-error http://google.com/404/ -E
--2024-12-15 08:37:21--  http://google.com/404/
Resolving google.com (google.com)... 142.250.217.78, 2607:f8b0:400a:804::200e
Connecting to google.com (google.com)|142.250.217.78|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
Saving to: 'index.html'

index.html                    100%[==============================================>]   1.53K  --.-KB/s    in 0s

2024-12-15 08:37:21 ERROR 404: Not Found.

And wget2 doesn't understand this, it doesn't follow the html specification

Can you link to which spec you are talking about? wget2 is a scraper as well so I am not sure a spec would apply for what local filename it ends up using.

More importantly, I think the statement is fundamentally wrong. Web browsers do not assume a URL without an extension is a folder. The very obvious case in point is any relative url/src/image etc. If the website is "https://example.com/site1" and it has an image of <img src='test.png' /> your browser is making a request to https://example.com/test.png, NOT https://example.com/site1/test.png. If this isn't ironclad enough I don't know what is.
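
The relative-URL point can be checked with a short sketch using Python's standard urllib.parse.urljoin, which applies the same relative reference resolution rules that browsers follow. The example.com URLs are the ones from the paragraph above; the variable names are only illustrative.

```python
from urllib.parse import urljoin

base_no_slash = "https://example.com/site1"
base_with_slash = "https://example.com/site1/"

# Without a trailing slash, "site1" is treated as the last segment
# and is replaced by the relative reference.
print(urljoin(base_no_slash, "test.png"))    # https://example.com/test.png

# With a trailing slash, "site1/" is kept as the containing path.
print(urljoin(base_with_slash, "test.png"))  # https://example.com/site1/test.png
```

So the trailing slash changes how every relative link on the page resolves, which is why a client cannot silently treat the two forms as the same resource.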

If wget2 doesn't understand the html specification, then yes, you can close the issue. But it is better to fix the program so that it works correctly and saves such sites, there are many, very many of them.

It seems like you may be using a translator, but even then I don't think you get that English spit out without some rude input.

wget2 is developed as an open source project that (I would assume) you are not a major contributor to. Even if you were, though, it still wouldn't be a proper way to request features. It is open source, though, so you can always modify it yourself.

@rockdaboot
Owner

OK, let's dive a bit deeper on this - it is a very old problem that has never really been solved.

First let me clarify a few things.

  1. There is no spec or standard on how to store data from a web server in a file system. This knowledge might be outdated - if you know otherwise please provide a link.
  2. A web server may generate contents from a database or dynamically and is not bound to the limitations of a file system.
  3. The client doesn't know if site.com/xyz refers to a folder or a file (as mentioned in 2, the server possibly doesn't use things like directories or files). There is nothing in the HTTP standard that allows the client to say that xyz refers to a directory.

If you think about it, you see there is a conflict. There are several ways to solve that conflict, but every way has its pros and cons.

How wget 1.x does it:

A first command wget -x --content-on-error site.com/xyz creates a file site.com/xyz.
A second command wget -x --content-on-error site.com/xyz/index.html removes site.com/xyz and creates site.com/xyz/index.html (file xyz deleted means loss of information).

This has been ranted about in the past, and so wget2 tries a different approach.

How wget 2.x does it:

A first command wget2 -x --content-on-error site.com/xyz creates a file site.com/xyz.
A second command wget2 -x --content-on-error site.com/xyz/index.html renames site.com/xyz to site.com/xyz.1 and creates site.com/xyz/index.html. No information loss.
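
The rename-instead-of-delete idea described for wget 2.x can be sketched roughly as follows. This is not wget2's implementation: the function name save_keeping_old is hypothetical, and picking the next free number when a ".1" file already exists is an assumption made for this example.

```python
import os


def save_keeping_old(root, rel_path, data):
    """Hypothetical helper: write data to root/rel_path; a plain file that
    blocks a needed directory is renamed to '<name>.N' instead of deleted."""
    os.makedirs(root, exist_ok=True)
    parts = rel_path.split("/")
    current = root
    for part in parts[:-1]:
        current = os.path.join(current, part)
        if os.path.isfile(current):
            # A file sits where a directory is needed: move it aside.
            # Assumption: use the first unused numeric suffix.
            n = 1
            while os.path.exists(f"{current}.{n}"):
                n += 1
            os.rename(current, f"{current}.{n}")
        os.makedirs(current, exist_ok=True)
    target = os.path.join(current, parts[-1])
    with open(target, "wb") as fh:
        fh.write(data)
    return target
```

Calling save_keeping_old(root, "site.com/xyz", ...) and then save_keeping_old(root, "site.com/xyz/index.html", ...) leaves both site.com/xyz.1 and site.com/xyz/index.html on disk, mirroring the two commands above.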

Of course there are other ways to solve the conflict.

Back to the original question

Why doesn't wget/wget2 know that site.com/folder1 refers to a folder?
See point 1 from above. There is no spec. The client has to guess.
And if the server just returns valid content for site.com/folder1, where should we save it?
Any genius idea or additional knowledge is welcome!

A new spec for directory contents over HTTP would just be awesome.
