basenc: emit partial output on invalid input #6008

Open
BenWiederhake opened this issue Feb 24, 2024 · 2 comments

Comments

@BenWiederhake
Collaborator

BenWiederhake commented Feb 24, 2024

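This is a GNU compatibility bug: it counts as a bug only because GNU behaves differently, even though uutils' current behavior could be considered reasonable, too. On invalid input, GNU basenc writes whatever it managed to decode before reporting the error, whereas uutils writes only the error.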

Example:

$ echo -n 'ze16jx(mMba' | LC_ALL=C basenc --z85 -d # GNU
missing!basenc: invalid input
[$? = 1]
$ echo -n 'ze16jx(mMba' | cargo run basenc --z85 -d # uutils
basenc: error: invalid input
[$? = 1]
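
To make the difference concrete, here is a minimal Python sketch of a Z85 decoder that returns the bytes decoded so far when it hits invalid input. It is an illustration only, not the GNU or uutils implementation; `z85_decode_partial` is a hypothetical name, and the sketch assumes that a trailing incomplete 5-character group is what makes the input above invalid.

```python
# Z85 alphabet as defined by ZeroMQ RFC 32.
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"
)
Z85_INDEX = {ch: i for i, ch in enumerate(Z85_ALPHABET)}


def z85_decode_partial(text: str) -> tuple[bytes, bool]:
    """Return (decoded_prefix, ok); hypothetical helper, not a real API."""
    out = bytearray()
    for start in range(0, len(text), 5):
        group = text[start:start + 5]
        # An incomplete group or an unknown character ends decoding,
        # but everything decoded so far is still returned.
        if len(group) < 5 or any(ch not in Z85_INDEX for ch in group):
            return bytes(out), False
        value = 0
        for ch in group:
            value = value * 85 + Z85_INDEX[ch]
        if value > 0xFFFFFFFF:  # does not fit into 4 bytes
            return bytes(out), False
        out += value.to_bytes(4, "big")
    return bytes(out), True


print(z85_decode_partial("ze16jx(mMba"))  # (b'missing!', False)
```

The first ten characters form two complete groups and decode to `missing!`; the trailing `a` cannot form a group, so the sketch reports failure while keeping the decoded prefix, which mirrors the GNU output shown above.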

I would prefer to first land #6007 before starting work on this issue.

@BenWiederhake
Collaborator Author

Partial output on GNU basenc is cursed.

It looks like basenc --base64url -d usually emits partial output on invalid input:

$ echo -n 'aGVsbG8>' | LC_ALL=C basenc --base64url -d | hd
basenc: invalid input
00000000  68 65 6c 6c 6f                                    |hello|
00000005

However, it produces no output at all if the invalid character is one that is valid only in standard base64 (+ or /), not in base64url:

$ echo -n 'aGVsbG8+' | LC_ALL=C basenc --base64url -d | hd
basenc: invalid input
$ echo -n 'aGVsbG8/' | LC_ALL=C basenc --base64url -d | hd
basenc: invalid input
$

… unless it's late enough in the stream:

$ cat <(yes | tr $'\ny' a | head -c5599) <(echo -n '.') | LC_ALL=C basenc --base64url -d | wc
basenc: invalid input
      0       1    4199
$ cat <(yes | tr $'\ny' a | head -c5599) <(echo -n '+') | LC_ALL=C basenc --base64url -d | wc
basenc: invalid input
      0       0       0
$ cat <(yes | tr $'\ny' a | head -c5599) <(echo -n 'a') | LC_ALL=C basenc --base64url -d | wc
      0       1    4200
$ cat <(yes | tr $'\ny' a | head -c5599) <(echo -n 'a+') | LC_ALL=C basenc --base64url -d | wc
basenc: invalid input
      0       1    4200
$

This is cursed. Replicating this behavior would require hard-coding this look-ahead, which seems like a bad idea ("surprising behavior is a bug"), and the behavior is not documented in the help.

(Note that cat <(yes | tr $'\ny' a | head -c5599) <(echo -n 'X') simply means "5599 times the character a followed by X".)
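
For reference, the test inputs and the byte counts seen above can be reproduced with a little arithmetic. The Python sketch below is illustrative only: `make_input` and `decoded_len` are hypothetical helper names, and it assumes the decoder emits every whole byte it can form from the valid prefix (each base64 character carries 6 bits), which is consistent with the 4199 and 4200 byte counts above.

```python
def make_input(suffix: bytes, count: int = 5599) -> bytes:
    # Same data as the shell pipeline: `count` times "a", then the suffix.
    return b"a" * count + suffix


def decoded_len(valid_chars: int) -> int:
    # Whole bytes recoverable from a run of valid base64 characters.
    return valid_chars * 6 // 8


print(len(make_input(b".")))  # 5600 input characters in total
print(decoded_len(5599))      # 4199, matching the '.' case
print(decoded_len(5600))      # 4200, matching the 'a' and 'a+' cases
```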

@tertsdiepraam
Member

tertsdiepraam commented Mar 1, 2024

What I think is happening is that it loads a chunk of the input into a buffer and then, depending on the encoding, optionally runs a check on that buffer before converting it. Then it loads the next part into the buffer and repeats.

Doing at least the chunking is probably a good idea: processing the input chunk by chunk lets us handle larger inputs, because we don't need to store the entire input in memory.

The differences between the encodings might be a bug on GNU's side. We could try to bring this to their attention.
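
The hypothesis above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not GNU's actual code: the chunk size of 5600 characters is inferred from the experiments earlier in this thread, the pre-check for + and / is the guessed encoding-specific check, `decode_base64url_chunked` is a hypothetical name, and padding and whitespace handling are left out.

```python
import io
import string

# base64url alphabet (RFC 4648, section 5).
ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
).encode()
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

CHUNK_SIZE = 5600  # assumption, inferred from the observations in this thread


def decode_base64url_chunked(stream, chunk_size=CHUNK_SIZE):
    """Return (decoded_bytes, ok) under the hypothesised chunked model."""
    out = bytearray()
    acc = bits = 0  # bit accumulator, carried across chunks
    while chunk := stream.read(chunk_size):
        # Hypothesised pre-check: a base64-only character anywhere in the
        # buffer rejects the whole chunk before any of it is converted.
        if b"+" in chunk or b"/" in chunk:
            return bytes(out), False
        for ch in chunk:
            val = INDEX.get(ch)
            if val is None:
                # Any other invalid character stops decoding mid-chunk,
                # keeping the bytes produced so far.
                return bytes(out), False
            acc = (acc << 6) | val
            bits += 6
            if bits >= 8:
                bits -= 8
                out.append(acc >> bits)
                acc &= (1 << bits) - 1
    return bytes(out), True


print(decode_base64url_chunked(io.BytesIO(b"aGVsbG8>")))  # (b'hello', False)
print(decode_base64url_chunked(io.BytesIO(b"aGVsbG8+")))  # (b'', False)
```

With this model, 5599 times a followed by + yields no output because the + falls into the first chunk, while 5599 times a followed by a+ yields 4200 bytes because the + lands in a second chunk. Both match the GNU results reported above, which supports the buffer explanation but does not prove it.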
