Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send non-printable codepoints as U+FFFD #76

Merged
merged 1 commit into from
Sep 3, 2024
Merged

Conversation

adsr
Copy link
Contributor

@adsr adsr commented Jul 25, 2024

Presently if someone sets a cell to an invalid codepoint, termbox will write it out verbatim, but it's guaranteed to be garbage. Instead send U+FFFD. This effectively limits cell contents to valid Unicode.

Any reasons to not do this? Could we be even more strict and use iswprint?

EDIT: Updated PR to use iswprint

Note users can still send arbitrary data via tb_send.

termbox2.h Outdated
@@ -1535,6 +1535,7 @@ static int send_sgr(uint32_t fg, uint32_t bg, int fg_is_default,
static int send_cursor_if(int x, int y);
static int send_char(int x, int y, uint32_t ch);
static int send_cluster(int x, int y, uint32_t *ch, size_t nch);
static int is_valid_codepoint(uint32_t ch);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just yak shaving, but is_invalid_codepoint with inverted logic will be more suitable here. No ternary operator and noting its return :-).

I reached this conclusion when I read the function definition and encountered the ternary in the return. And it is entirely possible you did it intentionally, in which case, ignore this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word invalid is even in the title of the PR :-D.

@adsr adsr force-pushed the is_valid_codepoint branch from b25a418 to 7516dea Compare July 27, 2024 20:53
@adsr
Copy link
Contributor Author

adsr commented Jul 28, 2024

Thank you for the review @pepe.

I altered the PR to filter based on iswprint instead of is_valid_codepoint. Is this too strict? I think it's OK.

chars (hex range) description iswprint
00-08 control codes no
09-0d whitespace no
0e-1f control codes no
20-7e printable ASCII yes
7f delete no

Of course there are codepoints above the 1-byte range, but the same categories apply. I think it comes down to control codes and whitespaces.

  • Control codes: I can't think of any use cases for setting a cell to a control code. Am I missing anything? I'll re-emphasize that users can send arbitrary data via tb_send*. (Although sending many of these control codes with termbox still active will probably break rendering.)
  • Whitespaces: I could see users wanting to use newlines specifically in tb_print*. I have an open PR to support that here. IMO it's not worth supporting tab, form feed (\f), vertical tab (\v), or macOS- or Windows-style newlines. I debated replacing these with a space instead of U+FFFD. But really that's kind of a lie.

One more note about control codes, specifically NULL: I'm not sure why termbox v1 went out of its way to replace NULL with a space. I think it may be from when clearing the buffer was a simple zero-fill, but now we set everything to a space.

Let me know if you can think of any counter-examples. I will leave these PRs open for a while for visibility.

@pepe
Copy link

pepe commented Jul 29, 2024

Thank you. That is a much better choice, anyway. I am planning to rewrite a TUI of mine in the coming weeks, so I will definitely use this version to test it.

@adsr adsr changed the title Send invalid codepoints as U+FFFD Send non-printable codepoints as U+FFFD Jul 31, 2024
@adsr adsr merged commit cd42ba3 into master Sep 3, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants