Execution character set on Windows - add execution_windows_acp and execution_windows_ocp? #45

Open
eternaleye opened this issue Jun 19, 2024 · 4 comments
Labels: bug (Something isn't working)

eternaleye commented Jun 19, 2024

On Windows, the question of "execution character set" (at least for narrow characters) is complicated by some additional factors:

  1. There are two execution character sets in play at runtime:
    • the OEM Code Page (CP_OEMCP in the Windows API, written CP_OCP in this issue), as is used for the console
    • the ANSI Code Page (CP_ACP), as is used for the GUI
  2. The behavior of the mbrto* and *tombr functions is, at least according to the documentation, inconsistent:
    • mbrtowc is documented as treating its input as being in the encoding of the "current locale" (and you can play games with the .ACP and .OCP locales, accordingly)
    • mbrtoc32, on the other hand, is documented as treating its input as UTF-8 unconditionally.
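
One way to check point (2) at runtime is to decode the same bytes with both functions and compare what comes back. This is a minimal sketch, assuming a C++17 compiler whose <cuchar> provides std::mbrtoc32; the names decode_comparison and compare_first_character are made up for illustration and are not part of ztd.text.

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string_view>

// Hypothetical helper type: the outcome of decoding one multibyte character
// with each of the two standard functions.
struct decode_comparison {
    std::size_t wc_result;   // return value of std::mbrtowc
    std::size_t c32_result;  // return value of std::mbrtoc32
    wchar_t wc;              // character written by std::mbrtowc (if any)
    char32_t c32;            // character written by std::mbrtoc32 (if any)
};

// Decodes the first multibyte character of `bytes` with both functions,
// under whatever LC_CTYPE locale is active when this is called.
inline decode_comparison compare_first_character(std::string_view bytes) {
    assert(!bytes.empty());
    decode_comparison out{};
    // Separate conversion states so the two calls cannot influence each other.
    std::mbstate_t wc_state{};
    std::mbstate_t c32_state{};
    out.wc_result = std::mbrtowc(&out.wc, bytes.data(), bytes.size(), &wc_state);
    out.c32_result = std::mbrtoc32(&out.c32, bytes.data(), bytes.size(), &c32_state);
    return out;
}
```

On Windows, one could call std::setlocale(LC_ALL, ".ACP") first and pass a byte such as "\xE9" (é in Windows-1252). If the documentation is accurate, mbrtowc would be expected to return 1, while mbrtoc32 would report an incomplete or invalid sequence, since a lone 0xE9 is not complete UTF-8. Whether that actually happens on a given runtime is what the comparison would show.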

As a result, I'm not sure that ztd.text currently has any way of handling the "execution character set" on Windows that provides the expected result under all circumstances.

I haven't verified at runtime that (2) actually presents itself, for two reasons. First, this documentation is for Visual Studio, while I'm using Embarcadero C++ Builder, whose standard library is sorely underdocumented and varies by version. Second, issue (1) is the more pressing one: the application I'm working with needs to interact with both code pages. We're currently in the midst of making it UTF-8 native, but it needs to retain the ability to interact with legacy files that were, due to oversights, written according to CP_ACP, and it also needs to emit data to the console in certain circumstances.

As a result, is there any chance that flavors of the execution character set could be added for CP_ACP and CP_OCP - or possibly for Windows CP_* values in general?

Contributor

ThePhD commented Jun 19, 2024

Seems like something where we make the encoding classes (windows_acp, windows_ocp) available on all architectures and, if we detect we're not on Windows, we just fall back to doing an execution encoding conversion.

I didn't think mbrtoc32 only worked with UTF-8 characters for the mb part. I'll have to actually write a real test to see how screwed up things really are.

Right proper kind of messed up the Windows situation is, though.

Contributor

ThePhD commented Jun 19, 2024

For the "can we have encodings for CP_* values in general?" question: we are adding those. Slowly. One encoding at a time. See the encoding objects in:

https://ztdtext.readthedocs.io/en/latest/encodings.html

ThePhD self-assigned this Jun 19, 2024
ThePhD added the bug (Something isn't working) label Jun 19, 2024

Author

eternaleye commented Jun 21, 2024

Mm, I was more imagining the possibility of a backend based on Windows' WideCharToMultiByte and similar, like how there's a cuchar backend. Since that API takes a CP_* value, if support for CP_ACP and CP_OCP is added via that, then it's no additional effort to allow initializing the backend with any CP_* value. On the other hand, maybe you're planning on using GetACP and GetOEMCP functions and mapping the result to the supported character sets in ztd.text? Or possibly use GetCPInfoExW and use the textual CodePageName field of the resulting CPINFOEXW struct (using the W version to not have recursive CP_ACP problems reading the name of the character set CP_ACP maps to...). But again, once any of those options work, there's literally no obstacle to just invoking that with an arbitrary CP_* value, as an accessible API.
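
To make the "any CP_* value" point concrete, here is a minimal sketch of a conversion helper parameterized on the code page. It is not ztd.text code: wide_to_code_page is a hypothetical name, the non-Windows branch simply reports failure, and passing 0 for the flags is an assumption that accepts the API's default substitution behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

#if defined(_WIN32)
#include <windows.h>
#include <climits>
#endif

// Hypothetical helper: converts UTF-16 text to bytes in the given Windows
// code page. Returns std::nullopt on failure, or when not built for Windows.
inline std::optional<std::string> wide_to_code_page(unsigned int code_page,
                                                    std::wstring_view input) {
    if (input.empty()) {
        return std::string{};  // nothing to convert on any platform
    }
#if defined(_WIN32)
    if (input.size() > static_cast<std::size_t>(INT_MAX)) {
        return std::nullopt;  // the API takes lengths as int
    }
    const int in_len = static_cast<int>(input.size());
    // First call with a null output buffer: ask how many bytes are needed.
    const int needed = ::WideCharToMultiByte(code_page, 0, input.data(), in_len,
                                             nullptr, 0, nullptr, nullptr);
    if (needed <= 0) {
        return std::nullopt;
    }
    std::string out(static_cast<std::size_t>(needed), '\0');
    // Second call: perform the conversion into the sized buffer.
    const int written = ::WideCharToMultiByte(code_page, 0, input.data(), in_len,
                                              out.data(), needed, nullptr, nullptr);
    if (written <= 0) {
        return std::nullopt;
    }
    assert(written <= needed);  // never more than the buffer size passed in
    out.resize(static_cast<std::size_t>(written));
    return out;
#else
    (void)code_page;
    return std::nullopt;  // no Windows code page API available here
#endif
}
```

On Windows, the same function would serve wide_to_code_page(CP_ACP, text), wide_to_code_page(CP_OEMCP, text), or the numeric results of GetACP() and GetOEMCP(). Note that with flags of 0, unrepresentable characters are silently replaced with the code page's default character, so a nullopt result here does not cover lossy conversions.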

EDIT: Also, to go back to an earlier statement:

> I didn't think mbrtoc32 only worked with UTF-8 characters for the mb part. I'll have to actually write a real test to see how screwed up things really are.

It's bloody well not supposed to, that's for sure. In the C++17 standard, 24.5.5 paragraph 1 delegates to the C standard, and notes "ISO C 7.28". If we pull up the C11 spec, 7.28.1.3 specifically is mbrtoc32. Unfortunately, the wording here simply refers to "multibyte characters", and does not explicitly constrain that this means the same thing for all such functions, so someone at Microsoft might have thought they had license to make the two interpret the same input differently... but if we look off to the side a little, at 6.4.4.4 paragraph 11 (which defines how to process wide character literals, L'…', u'…', and U'…'):

> A wide character constant prefixed by the letter L has type wchar_t, an integer type defined in the <stddef.h> header; a wide character constant prefixed by the letter u or U has type char16_t or char32_t, respectively, unsigned integer types defined in the <uchar.h> header. The value of a wide character constant containing a single multibyte character that maps to a single member of the extended execution character set is the wide character corresponding to that multibyte character, as defined by the mbtowc, mbrtoc16, or mbrtoc32 function as appropriate for its type, with an implementation-defined current locale. The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.

(emphasis mine)

This, then, clearly constrains the three functions to process their input alike, and Microsoft is at least documenting their implementation of mbrtoc32 as behaving in violation of the spec. Whether it's an implementation bug or a documentation bug, I don't know.

Contributor

ThePhD commented Aug 4, 2024

I should come back to this to state: I can't very well just use WideCharToMultiByte raw, because the API is woefully awful in terms of how much useful information it gives back to you. It also cannot be slotted directly into the one-at-a-time design of the codec (though one could, if they implement the one-at-a-time conversion using other means, use the internal and external extension points to then delegate to WideCharToMultiByte).

You can read about why it's unsuitable here: https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape#windows-api
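
As an illustration of what a one-at-a-time step has to report (status, bytes consumed, output produced), here is a rough sketch of one possible "other means" on the decode side. It assumes a single-byte or double-byte code page, where IsDBCSLeadByteEx is enough to find the character boundary; it would not work for UTF-8 or other variable-length code pages. step_status, step_result, and decode_one are hypothetical names, not ztd.text extension points.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical status codes for a single decoding step.
enum class step_status { ok, empty_input, incomplete, invalid, unsupported_platform };

// Hypothetical result type: what a one-at-a-time decoder needs to hand back.
struct step_result {
    step_status status;
    std::size_t bytes_consumed;  // 0 unless status == ok
    wchar_t units[2];            // UTF-16 code units produced
    int unit_count;              // number of valid entries in `units`
};

// Decodes exactly one character from the front of `input`.
// Assumption: `code_page` is an SBCS or DBCS code page.
inline step_result decode_one(unsigned int code_page, std::string_view input) {
    step_result r{step_status::empty_input, 0, {0, 0}, 0};
    if (input.empty()) {
        return r;
    }
#if defined(_WIN32)
    // A lead byte means the character occupies two bytes in a DBCS code page.
    const bool lead = ::IsDBCSLeadByteEx(code_page, static_cast<BYTE>(input[0])) != 0;
    const std::size_t length = lead ? 2 : 1;
    if (input.size() < length) {
        r.status = step_status::incomplete;
        return r;
    }
    // MB_ERR_INVALID_CHARS makes the call fail instead of substituting.
    const int produced = ::MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                                               input.data(), static_cast<int>(length),
                                               r.units, 2);
    if (produced <= 0) {
        r.status = step_status::invalid;
        return r;
    }
    assert(produced <= 2);  // the output buffer holds at most two code units
    r.status = step_status::ok;
    r.bytes_consumed = length;
    r.unit_count = produced;
    return r;
#else
    (void)code_page;
    r.status = step_status::unsupported_platform;
    return r;
#endif
}
```

The point of the sketch is the shape of the result: the raw Windows call returns only a count or zero, so the boundary detection and the distinction between incomplete and invalid input have to be supplied around it.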
