Add unicode handling capability #106

lenianiva · 2023-08-17T06:00:20Z

Previously, if the client program output Unicode, that output was garbled when piped through NBReader. For more detail, see #105. Writing doesn't seem to have any trouble, as demonstrated by examples/cat.rs.

Right now the unicode encoder is always on. In pexpect, an encoding option is available to toggle between raw bytes and decoded unicode; I'm not sure where to put the equivalent option here.

lenianiva · 2023-08-17T06:31:49Z

The current behaviour is to withhold a half-completed unicode char from the output buffer until its remaining bytes arrive. If the client program reaches EOF while the unicode char is still incomplete, the char is dropped. One problem remains: if the client program outputs an invalid unicode sequence followed by valid unicode chars, the output buffer stalls and no more chars are piped into it.
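
To make the withholding behaviour concrete, here is a minimal sketch, not the code in this PR. `Utf8Accumulator` and `push` are hypothetical names. It holds back an incomplete trailing sequence exactly as described above, but unlike the PR it replaces a definitively invalid byte with U+FFFD and moves on, which is one possible way to avoid the stall.

```rust
/// Hypothetical helper (not part of rexpect): collects raw bytes and releases
/// text only once complete UTF-8 sequences are available.
#[derive(Default)]
pub struct Utf8Accumulator {
    pending: Vec<u8>,
}

impl Utf8Accumulator {
    /// Feed newly read bytes and return everything that can be decoded so far.
    pub fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    // Everything pending is valid: emit it all.
                    out.push_str(s);
                    self.pending.clear();
                    return out;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // The prefix up to `valid` is well-formed UTF-8.
                    out.push_str(&String::from_utf8_lossy(&self.pending[..valid]));
                    match e.error_len() {
                        // Incomplete trailing sequence: withhold it until more bytes arrive.
                        None => {
                            self.pending.drain(..valid);
                            return out;
                        }
                        // Invalid bytes: substitute U+FFFD and keep decoding the rest.
                        Some(n) => {
                            out.push('\u{FFFD}');
                            self.pending.drain(..valid + n);
                        }
                    }
                }
            }
        }
    }
}
```

Whether invalid bytes should be replaced, surfaced as an error, or left to stall is a design choice this sketch does not settle.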

matthiasbeyer · 2023-08-17T06:52:02Z

Thanks for your patches and also for your bug report! ❤️ Highly appreciated!

The CI failure should be fixed after #107 is merged, don't worry about that.

I am not sure what the way forward is yet. I like the first option you proposed in #105 most:

> Change the type of PipedChar(u8) to PipedChar(char): If the program sends over half of a unicode char and then stop it would hang the reader

But that might break some downstream users for non-UTF8 outputting programs, as I understand it!? @petreeftime maybe you have some ideas what's the best way to go here? Maybe make the whole thing configurable (a flag that can be passed to the library for each call)?

lenianiva · 2023-08-17T06:56:33Z

I don't think the first option is the best one. The solution in this PR is kind of a compromise between the first two options. This is because the underlying datatype piped by the client program is always u8 and not some char type, and any assembly of u8 into chars is only really needed when downstream needs to check the existence of certain chars to e.g. break line or search a regex.

TLDR: Unicode encoding necessarily runs the risk of stalling the buffer, and this happens either in the read thread or the write thread.

You're correct that this may break non-UTF8 outputting programs, but if the program restricts its output to the ASCII range there shouldn't be any compatibility issue. I left a flag in the code to represent some external option which toggles between unicode and non-unicode encoding.

Do you think it would be sensible to add it as a crate-level feature? Or embed the option in every instance of NBReader and hence spawn (this is the solution from pexpect)?

matthiasbeyer · 2023-08-17T07:01:24Z

If anything I'd have it as an option on the data type, not a crate-level feature or compiletime option, because of the simple fact that someone might have UTF-8 outputting programs and non-UTF-8 outputting programs in one project. In general I think compiletime features should never be used for "either-or" functionality 😆

So a flag on the type would be totally fine with me, with UTF-8 compatible reading as the default.

I'd like to see what @petreeftime thinks as well though.
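
A per-instance flag with UTF-8 as the default could look roughly like the sketch below. `Encoding`, `ReaderConfig`, and `decode` are illustrative names assumed here, not the types that ended up in this PR; the point is only that two readers with different encodings can coexist in one program, which a crate-level feature would not allow.

```rust
/// Hypothetical encoding flag carried by each reader instance.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Encoding {
    /// Treat every byte as its own character (byte value mapped to U+0000..U+00FF).
    Ascii,
    /// Decode the byte stream as UTF-8; this is the default in this sketch.
    #[default]
    Utf8,
}

/// Hypothetical per-reader configuration (an assumption, not rexpect's API).
#[derive(Debug, Clone, Copy, Default)]
pub struct ReaderConfig {
    pub encoding: Encoding,
}

impl ReaderConfig {
    /// Decode one complete chunk of bytes according to this reader's encoding.
    pub fn decode(&self, bytes: &[u8]) -> String {
        match self.encoding {
            Encoding::Ascii => bytes.iter().map(|&b| b as char).collect(),
            Encoding::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        }
    }
}
```

With this shape, `ReaderConfig::default()` decodes UTF-8, while `ReaderConfig { encoding: Encoding::Ascii }` keeps byte-per-character behaviour for programs that do not emit UTF-8.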

lenianiva · 2023-08-17T07:02:32Z

If you have any idea for where to add such a flag I can do it in this PR.

By the way why is Pull Request Checks failing? Is it because I didn't sign my commits?

matthiasbeyer · 2023-08-17T07:13:28Z

It's because you didn't sign off your commits (the -s flag in git-commit). You can fix this by doing something like git rebase $(git merge-base origin/master HEAD) -x "git commit --amend --no-edit -s" (there might be a more elegant way... 🤷)

matthiasbeyer · 2023-08-17T07:53:47Z

We merged #107 that should make your compile issues go away, please do rebase to latest master.

lenianiva · 2023-08-17T14:43:23Z

Rebased. I added a field to NBReader which contains the encoding. The encoding cannot be set via a setter, because by the time the setter is called the receiving thread may already have received one or two chars. The encoding therefore has to be known when NBReader::new is called. Whether this new argument to NBReader::new should be rippled through to everything else that calls NBReader::new is your call.

I don't think the problem of the client program stalling the buffer needs handling. If the client program outputs an invalid unicode char followed by valid unicode, then it is not outputting unicode anyway, so UTF8 mode is the wrong choice for it.
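
The reason a setter cannot work is easier to see in a sketch. The code below is an assumption-laden illustration, not NBReader itself: `SketchReader`, `Encoding`, and `read_all` are hypothetical names. The background thread starts decoding inside `new`, so the encoding has to be moved into the thread at that point. For brevity the sketch decodes each chunk independently and ignores characters split across chunk boundaries.

```rust
use std::io::Read;
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical encoding flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
}

/// Hypothetical stand-in for a non-blocking reader backed by a thread.
pub struct SketchReader {
    rx: Receiver<String>,
}

impl SketchReader {
    /// The reader thread starts here, so `encoding` must already be known.
    pub fn new<R: Read + Send + 'static>(mut source: R, encoding: Encoding) -> Self {
        let (tx, rx) = channel();
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            // Decoding begins immediately; a setter called later would race with this loop.
            while let Ok(n) = source.read(&mut buf) {
                if n == 0 {
                    break; // EOF
                }
                let text: String = match encoding {
                    Encoding::Ascii => buf[..n].iter().map(|&b| b as char).collect(),
                    Encoding::Utf8 => String::from_utf8_lossy(&buf[..n]).into_owned(),
                };
                if tx.send(text).is_err() {
                    break; // receiver dropped
                }
            }
        });
        SketchReader { rx }
    }

    /// Block until the source reaches EOF and return everything decoded.
    pub fn read_all(&self) -> String {
        self.rx.iter().collect()
    }
}
```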

matthiasbeyer

Overall this looks good.

I think having the encoding be an argument on new, but also providing two convenience helpers NBReader::ascii() and NBReader::utf8(), which call new with the respective argument would be nice and not that much of a hassle.

@petreeftime pinging you again, hope you can have a look as well.
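
The suggested helpers could be shaped roughly as follows. This is only a sketch of the API idea under assumed names (`SketchReader`, `Encoding`); the real NBReader::new takes different arguments and spawns a reader thread, which is left out here.

```rust
/// Hypothetical encoding flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
}

/// Hypothetical reader type standing in for NBReader.
pub struct SketchReader<R> {
    source: R,
    encoding: Encoding,
}

impl<R: std::io::Read> SketchReader<R> {
    /// General constructor: the encoding is an explicit argument.
    pub fn new(source: R, encoding: Encoding) -> Self {
        SketchReader { source, encoding }
    }

    /// Convenience helper: same as `new(source, Encoding::Ascii)`.
    pub fn ascii(source: R) -> Self {
        Self::new(source, Encoding::Ascii)
    }

    /// Convenience helper: same as `new(source, Encoding::Utf8)`.
    pub fn utf8(source: R) -> Self {
        Self::new(source, Encoding::Utf8)
    }

    /// The encoding this reader was constructed with.
    pub fn encoding(&self) -> Encoding {
        self.encoding
    }

    /// Give the underlying source back to the caller.
    pub fn into_inner(self) -> R {
        self.source
    }
}
```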

lenianiva · 2023-10-20T17:15:18Z

Have you decided that this is the way to go? I need this crate in production

lenianiva · 2024-03-01T05:20:00Z

I've merged master into this branch and propagated encoding into the new Options structure. By default it still uses ASCII encoding everywhere. However I'm getting this error:

---- session::tests::test_bash stdout ----
thread 'session::tests::test_bash' panicked at 'assertion failed: `(left == right)`
  left: `"/tmp\r\n"`,
 right: `"\u{1b}[?2004l\r\r\n/tmp\r\n\u{1b}[?2004h"`', src/session.rs:543:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

It seems like bash is generating some extra output? This also occurs on the master branch.

ccntrq · 2025-06-23T18:18:39Z

Hi all,

I’d love to see Unicode (UTF-8) support land in rexpect! To help move this PR forward, I’ve merged the latest changes from main into my own copy of this branch (ccntrq/rexpect-unicode/tree/unicode) (Didn't rebase to preserve the signed commits by @lenianiva). The test failures previously mentioned by @lenianiva are now resolved in main.

While updating, I also addressed a bug in try_read related to reading into the middle of a multi-byte UTF-8 character (see: ccntrq@2598847).

Notable changes in my updated branch:

ASCII remains the default encoding for now, to retain current behavior.
UTF-8 mode is available via spawn_with_options. At the moment, that’s the only way to opt into Unicode support.
I’d suggest we add convenience wrappers (e.g., spawn_utf8, spawn_command_utf8, etc.) so users can easily spawn sessions in UTF-8 mode without having to manually specify options.

What do you think about this API direction? Are there any further changes or cleanup needed before merging? I’m happy to help finish this up!

Thanks for all your work on rexpect.
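
For readers unfamiliar with the class of bug mentioned for try_read, the sketch below shows the general technique of never cutting a string in the middle of a multi-byte character. It is not the fix from the referenced commit; `take_prefix` is a hypothetical function, and the assumption is that the reader keeps its decoded data in a `String` and hands out at most a requested number of bytes.

```rust
/// Remove and return at most `max_bytes` bytes from the front of `buffer`,
/// backing off to the nearest char boundary so that no multi-byte UTF-8
/// character is split. Hypothetical helper, not rexpect's try_read.
pub fn take_prefix(buffer: &mut String, max_bytes: usize) -> String {
    let mut end = max_bytes.min(buffer.len());
    // Index 0 is always a char boundary, so this loop terminates.
    while !buffer.is_char_boundary(end) {
        end -= 1;
    }
    buffer.drain(..end).collect()
}
```

Slicing a `String` at a non-boundary index panics in Rust, so a boundary check of this kind is needed whenever byte counts and decoded text are mixed.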

epage · 2025-06-23T19:05:06Z

Hmm, Options is exhaustive, so any change will be breaking. Without default field values, we can't use .. syntax for filling in the rest. So either we make a one off breaking change or we have to switch to a builder API as well.

Implementation wise, I'd expect PipedChar::Char to be renamed to communicate its proper intent.

In general, I would prefer a clean commit history:

Isolate refactors to their own commits, like renaming PipedChar::Char
Rebase, rather than merge
Add tests before the change with them passing showing the current behavior. The change that introduces unicode support would change away from the default encoding and show how that fixed the problem
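
The builder alternative raised for Options might look like the following sketch. `SketchOptions` and its methods are hypothetical, and the two existing field names are assumed for illustration rather than copied from rexpect. Marking the struct `#[non_exhaustive]` stops downstream crates from constructing it with a struct literal, so later fields such as `encoding` can be added without a breaking change, and ASCII stays the default to retain current behaviour.

```rust
/// Hypothetical encoding flag; ASCII is the default to keep existing behaviour.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Encoding {
    #[default]
    Ascii,
    Utf8,
}

/// Hypothetical builder-style options struct (field names are assumptions).
#[derive(Debug, Clone, Default)]
#[non_exhaustive]
pub struct SketchOptions {
    pub timeout_ms: Option<u64>,
    pub strip_ansi_escape_codes: bool,
    pub encoding: Encoding,
}

impl SketchOptions {
    /// Start from the defaults; callers override only what they need.
    pub fn new() -> Self {
        Self::default()
    }

    /// Set the timeout in milliseconds (`None` means wait indefinitely).
    pub fn timeout_ms(mut self, timeout_ms: Option<u64>) -> Self {
        self.timeout_ms = timeout_ms;
        self
    }

    /// Choose whether ANSI escape codes are stripped from the output.
    pub fn strip_ansi_escape_codes(mut self, strip: bool) -> Self {
        self.strip_ansi_escape_codes = strip;
        self
    }

    /// Select how the child's output bytes are decoded.
    pub fn encoding(mut self, encoding: Encoding) -> Self {
        self.encoding = encoding;
        self
    }
}
```

A caller would then opt into UTF-8 with something like `SketchOptions::new().encoding(Encoding::Utf8)`.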

matthiasbeyer mentioned this pull request Aug 17, 2023

Update MSRV: 1.60.0 -> 1.63.0 #107

Merged

lenianiva force-pushed the unicode branch 2 times, most recently from fd91788 to f5a333d Compare August 17, 2023 14:40

matthiasbeyer requested changes Aug 18, 2023

lenianiva requested a review from matthiasbeyer August 19, 2023 04:01

lenianiva added 4 commits August 18, 2023 22:30

Add unicode capability (demonstration)

23b1a5b

Signed-off-by: Leni Aniva <[email protected]>

Add cat demonstration program

34bbaeb

Signed-off-by: Leni Aniva <[email protected]>

Add encoding field and enum

4293d9b

Signed-off-by: Leni Aniva <[email protected]>

derive Eq for Encoding

36f07d4

Signed-off-by: Leni Aniva <[email protected]>

lenianiva force-pushed the unicode branch from cd467cc to 36f07d4 Compare August 19, 2023 05:30

matthiasbeyer requested a review from petreeftime August 19, 2023 16:57

Merge branch 'master' into unicode

4d52d98

Make UTF8 the default encoding

efb3ba6
