Charset encoding decoding

This document describes how charset encoding/decoding and percent-encoding/decoding (escaping) should be handled. 'URL' is used here as a general term for URI/IRI/URL.

Encoding/Decoding

We need four different settings for character encoding:

The encoding of the filename(s) that we want to generate (e.g. utf-8). We have --local-encoding for this, and should have --filename-encoding in the future.
The encoding of the URL(s) given on the command line (e.g. gb2312). We have --local-encoding for this.
The encoding of the content of --input-file (e.g. iso-8859-15). We have --remote-encoding for this. Wget2 already has --input-encoding.
The encoding of the content of downloaded HTML (e.g. cp1252). We have --remote-encoding for this. In fact, this should only be a default for cases where we can't determine the encoding otherwise (normally we can).

These 4 encodings may all be needed for one single invocation of Wget. Any combination should be allowed. This is why we need 4 different command line options.
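
As an illustration only, the four settings could be held independently of each other as sketched below. The Python names (EncodingSettings and its fields) are hypothetical and are not Wget options; the encodings are the examples from the list above.

```python
from dataclasses import dataclass


@dataclass
class EncodingSettings:
    """Hypothetical container: one independent setting per encoding."""
    filename: str = "utf-8"      # encoding of the generated file names
    command_line: str = "utf-8"  # encoding of URLs given on the command line
    input_file: str = "utf-8"    # encoding of the content of --input-file
    html_default: str = "utf-8"  # fallback for downloaded HTML only


# All four may differ within a single invocation.
settings = EncodingSettings(
    filename="utf-8",
    command_line="gb2312",
    input_file="iso-8859-15",
    html_default="cp1252",
)

# The same byte means different things depending on which setting applies.
euro_from_input_file = b"\xa4".decode(settings.input_file)
euro_from_html = b"\x80".decode(settings.html_default)
```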

How to encode HTTP GET strings

Escaping/Unescaping

URLs may be partially %-encoded (escaped). We should only support single-escaped strings. URLs should first be parsed into their parts; the host part is then unescaped and converted to UTF-8 plus punycode (if needed), and the path is unescaped and converted to UTF-8. Open question: should query and fragment stay as they are, or be converted to UTF-8? That probably depends on the processing script on the server side.
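
The parsing and unescaping steps might look like the following minimal Python sketch. It is not Wget's implementation: normalize_url_parts and its url_encoding parameter are hypothetical names, Python's built-in "idna" codec stands in for the punycode conversion, and query and fragment are passed through untouched because their handling is left open here.

```python
from urllib.parse import unquote, urlsplit


def normalize_url_parts(raw_url: bytes, url_encoding: str = "utf-8"):
    """Split a URL, then unescape host and path exactly once."""
    # Decode the raw bytes first so that the URL can be split into its parts.
    parts = urlsplit(raw_url.decode(url_encoding))

    # Host: unescape, then convert non-ASCII labels to punycode ("xn--...").
    host_unicode = unquote(parts.hostname or "",
                           encoding=url_encoding, errors="strict")
    host_ascii = host_unicode.encode("idna").decode("ascii")

    # Path: unescape once (escaped bytes are assumed to be in url_encoding),
    # then express the result as UTF-8 bytes.
    path_unicode = unquote(parts.path, encoding=url_encoding, errors="strict")
    path_utf8 = path_unicode.encode("utf-8")

    # Query and fragment stay as they are in this sketch.
    return host_ascii, path_utf8, parts.query, parts.fragment
```

Because unquote() is applied only once, a double-escaped sequence such as %2541 yields %41 rather than A, which matches the single-escaping rule.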

Putting together the GET string

/ + escaped UTF-8 path + ? + escaped query + # + escaped fragment
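
A minimal Python sketch of this layout follows. The function name build_get_string is hypothetical, and the sketch assumes the path arrives as unescaped UTF-8 bytes while query and fragment arrive already escaped.

```python
from urllib.parse import quote


def build_get_string(path_utf8: bytes, query: str = "", fragment: str = "") -> str:
    """Assemble '/' + escaped path + '?' + query + '#' + fragment."""
    # quote() accepts bytes and percent-encodes every byte outside its safe
    # set; "/" is kept so that path separators survive.
    result = quote(path_utf8, safe="/")
    if not result.startswith("/"):
        result = "/" + result
    # Query and fragment are assumed to be escaped already, so they are
    # appended unchanged to avoid escaping them twice.
    if query:
        result += "?" + query
    if fragment:
        result += "#" + fragment
    return result
```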

Generating the filename

If the host is part of the filename/path, convert it to the filename encoding; if that is not possible, use punycode. Convert the remaining part of the filename into the filename encoding if possible. Percent-encode all special characters (not printable or not allowed by the file system).
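
The rules above can be sketched in Python as follows. Several details are assumptions made for illustration: the names FORBIDDEN, escape_bytes, encode_component and make_filename are hypothetical, the set of disallowed characters is only an example, and characters that cannot be represented in the filename encoding are written as percent-encoded UTF-8, a case the text above leaves open.

```python
# Example set of characters rejected by some file systems (assumption).
FORBIDDEN = set('\\:*?"<>|')


def escape_bytes(data: bytes) -> bytes:
    """Percent-encode every byte of data."""
    return "".join(f"%{b:02X}" for b in data).encode("ascii")


def encode_component(text: str, filename_encoding: str) -> bytes:
    """Convert one path component, escaping special characters."""
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(filename_encoding)
        except UnicodeEncodeError:
            # Not representable: fall back to escaped UTF-8 (assumption).
            out += escape_bytes(ch.encode("utf-8"))
            continue
        if ch in FORBIDDEN or not ch.isprintable():
            out += escape_bytes(encoded)
        else:
            out += encoded
    return bytes(out)


def make_filename(host: str, path: str, filename_encoding: str = "utf-8") -> bytes:
    """Build host/path in the filename encoding, punycode as host fallback."""
    try:
        host_bytes = host.encode(filename_encoding)
    except UnicodeEncodeError:
        host_bytes = host.encode("idna")  # punycode fallback
    segments = [encode_component(s, filename_encoding)
                for s in path.split("/") if s]
    return b"/".join([host_bytes] + segments)
```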

Document encoding

About encoding, see http://nikitathespider.com/articles/EncodingDivination.html
About GET encoding, see http://stackoverflow.com/questions/1549213/whats-the-correct-encoding-of-http-get-request-strings
RFC 3986 URI generic syntax
[W3Schools URL Encoding] http://www.w3schools.com/tags/ref_urlencode.asp
[W3Schools Charset] http://www.w3schools.com/tags/ref_charactersets.asp
[W3Schools HTML Entities] http://www.w3schools.com/html/html_entities.asp
