Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#408] Fix escaping #423

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

[#408] Fix escaping #423

wants to merge 1 commit into from

Conversation

kukimik
Copy link

@kukimik kukimik commented May 24, 2023

This should close #408. The extra \ was introduced by show. Along the way some other escaping issues introduced by the show "shortcut" are fixed (like improper escaping of control characters like \DEL, \NUL). Haskell escaping is not TOML escaping (the latter is much simpler, e.g. there is no need for a \&).

Caveats:

  • The big problem here is that this bug wasn't caught by the test suite. Unfortunately I don't have time at the moment to look into the tests, and I can't really predict when I will find the time. Perhaps in a week, perhaps in a few months.

  • TOML allows unescaped Unicode characters, with only a few exceptions:

    Any Unicode character may be used except those that must be escaped: quotation mark, backslash, and the control characters (U+0000 to U+001F, U+007F).

    Therefore escaping everything outside ASCII is a waste of space of and hurts readability. However, I wanted to keep the fix backwards compatible. Moreover, Unicode is difficult, and it is not trivial to decide which characters "should" be escaped (probably the unprintable ones? but what else?). I think the decision what to escape should be left to the user of the library, via a new PrinterOption. I think I can implement this in a separate PR (together with fixing the tests) if I find the time. What is your opinion?

  • We always use the long form of escaping \UXXXXXXXX. However, the short form \uXXXX is also permitted, and I think it is preferable whenever it can be applied, because it saves a few bytes. I kept the long form to make things backward compatible. Introducing the short form should be trivial change in the code and I am tempted to make it, but I want to know your opinion.

@kukimik kukimik requested a review from vrom911 as a code owner May 24, 2023 21:10
@kukimik
Copy link
Author

kukimik commented May 24, 2023

Also, I didn't measure the performance. I'm not sure how Text concatenations are implemented, i.e. how many times the whole thing gets copied. Probably this can get meaningful when dealing with long strings. As a newbie I'm not really sure if it's better to use a Builder here, or with this amount of concatenation it is not worth it.

@kukimik
Copy link
Author

kukimik commented Oct 7, 2024

@tomjaguarpaw @vrom911 Does this have chances to get merged? It's a rather annoying bug. I can try to fix the tests and maybe do some simple benchmarking to check the performance impact, but I do not want to do invest my time if this has little to no chances of reaching hackage.

@tomjaguarpaw
Copy link
Contributor

The Kowainik project is largely in emergency maintenance only mode. I am an assistant maintainer, and I mostly just work to keep this project building with newer GHC and dependencies. Personally I don't think I know enough about TOML to review this fix, although if there's enough popular support for it I guess I could be persuaded to look deeper.

@kukimik
Copy link
Author

kukimik commented Oct 7, 2024

Thanks a lot for the explanation! I was somewhat aware of the current state of Kowainik, but I didn't know what your role is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encodes unicode characters with double backslash
2 participants