Execution character set on Windows - add `execution_windows_acp` and `execution_windows_ocp`? #45

eternaleye · 2024-06-19T00:47:48Z

On Windows, the question of "execution character set" (at least for narrow characters) is complicated by some additional factors:

1. There are two execution character sets in play at runtime:
   - the OEM Code Page (CP_OCP), as is used for the console
   - the ANSI Code Page (CP_ACP), as is used for the GUI
2. The behavior of the mbrto* and *rtomb functions is, at least according to the documentation, inconsistent:
   - mbrtowc is documented as treating its input as being in the "current locale" encoding (and you can play games with the .ACP and .OCP locales, accordingly)
   - mbrtoc32, on the other hand, is documented as treating its input as UTF-8 unconditionally.

As a result, I'm not sure that any of the ways ztd.text currently handles the "execution character set" on Windows provides the expected result under all circumstances:

- <cuchar>/<uchar.h> is affected by (2), as it uses mbrtoc32
- iconv is not available
- cuneicode only seems to have three approaches:
  1. ztdc_is_execution_encoding_utf8(), which is false (or at least ought to be) unless the system code page has been set to CP_UTF8/65001
  2. mbrtoc32, which falls under (2) above
  3. using reinterpret_cast to treat the input as UTF-8, which is certainly not correct.

I haven't verified at runtime that (2) actually presents itself. This is partly because the documentation in question is for Visual Studio while I'm using Embarcadero C++ Builder (whose standard library is sorely underdocumented and varies by version), and partly because issue (1) is the more pressing one. The application I'm working with needs to interact with both code pages: we're currently in the midst of making it UTF-8 native, but we need to retain the ability to read legacy files that, due to oversights, were written according to CP_ACP, and we also need to emit data to the console in certain circumstances.

As a result, is there any chance that flavors of the execution character set could be added for CP_ACP and CP_OCP - or possibly for Windows CP_* values in general?


ThePhD · 2024-06-19T10:32:59Z

Seems like something where we make the encoding classes (windows_acp, windows_ocp) available on all architectures and, if we detect we're not on Windows, we just fall back to doing an execution encoding conversion.

I didn't think mbrtoc32 only worked with UTF-8 characters for the mb part. I'll have to actually write a real test to see how screwed up things really are.
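A minimal sketch of what such a test could look like, using only the standard `std::mbrtowc` and `std::mbrtoc32`. The names `decode_probe`, `probe_first_character`, and `same_interpretation` are made up for illustration and are not part of ztd.text or cuneicode. It decodes the first multibyte character of the same input with both functions and records what each one reports, so the two can be compared under whatever locale is active.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string_view>

// Hypothetical helper type: what each function reported for the same bytes.
struct decode_probe {
    std::size_t wc_result;   // return value of std::mbrtowc
    std::size_t c32_result;  // return value of std::mbrtoc32
    char32_t from_wc;        // character produced by std::mbrtowc
    char32_t from_c32;       // character produced by std::mbrtoc32
};

// Decode the first multibyte character of `bytes` with both functions,
// each using its own fresh conversion state.
decode_probe probe_first_character(std::string_view bytes) {
    decode_probe probe{};
    std::mbstate_t wc_state{};
    std::mbstate_t c32_state{};
    wchar_t wc = 0;
    char32_t c32 = 0;
    probe.wc_result = std::mbrtowc(&wc, bytes.data(), bytes.size(), &wc_state);
    probe.c32_result = std::mbrtoc32(&c32, bytes.data(), bytes.size(), &c32_state);
    // wchar_t is 16 bits wide on Windows, so this comparison is only
    // meaningful for characters in the Basic Multilingual Plane.
    probe.from_wc = static_cast<char32_t>(wc);
    probe.from_c32 = c32;
    return probe;
}

// True when both functions consumed the same number of bytes (or reported
// the same error value) and produced the same character.
bool same_interpretation(const decode_probe& probe) {
    return probe.wc_result == probe.c32_result && probe.from_wc == probe.from_c32;
}
```

To exercise the case described in this issue, one would presumably call `std::setlocale(LC_ALL, ".ACP")` on Windows first and then probe a non-ASCII byte such as `"\xE9"`. If the documentation is accurate, `same_interpretation` would return false there; that outcome has not been verified here.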

Right proper kind of messed up the Windows situation is, though.

ThePhD · 2024-06-19T18:24:54Z

For the "can we have encodings for CP_* values in general?" question: we are adding those. Slowly. One encoding at a time. See the encoding objects in:

https://ztdtext.readthedocs.io/en/latest/encodings.html

eternaleye · 2024-06-21T08:39:01Z

Mm, I was more imagining the possibility of a backend based on Windows' WideCharToMultiByte and similar, like how there's a cuchar backend. Since that API takes a CP_* value, if support for CP_ACP and CP_OCP is added via that, then it's no additional effort to allow initializing the backend with any CP_* value. On the other hand, maybe you're planning on using GetACP and GetOEMCP functions and mapping the result to the supported character sets in ztd.text? Or possibly use GetCPInfoExW and use the textual CodePageName field of the resulting CPINFOEXW struct (using the W version to not have recursive CP_ACP problems reading the name of the character set CP_ACP maps to...). But again, once any of those options work, there's literally no obstacle to just invoking that with an arbitrary CP_* value, as an accessible API.
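As a rough sketch of the first option, assuming a bulk conversion is acceptable: the function below passes an arbitrary code page identifier straight through to `MultiByteToWideChar`. The name `decode_with_code_page` is hypothetical and is not a ztd.text API. On non-Windows targets this sketch simply reports failure rather than falling back to another conversion.

```cpp
#include <climits>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode `bytes` from the given Windows code page
// (for example CP_ACP or CP_OEMCP) into UTF-16.
// Returns std::nullopt on failure, and always on non-Windows targets.
std::optional<std::wstring> decode_with_code_page(unsigned code_page, std::string_view bytes) {
#ifdef _WIN32
    if (bytes.empty()) {
        return std::wstring();
    }
    if (bytes.size() > static_cast<std::size_t>(INT_MAX)) {
        return std::nullopt;  // the API takes the length as an int
    }
    const int length = static_cast<int>(bytes.size());
    // First call: ask how many UTF-16 code units are required.
    const int needed = MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                                           bytes.data(), length, nullptr, 0);
    if (needed <= 0) {
        return std::nullopt;
    }
    std::wstring output(static_cast<std::size_t>(needed), L'\0');
    // Second call: perform the conversion into the sized buffer.
    const int written = MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                                            bytes.data(), length, output.data(), needed);
    if (written != needed) {
        return std::nullopt;
    }
    return output;
#else
    (void)code_page;
    (void)bytes;
    return std::nullopt;  // no Windows code page API available here
#endif
}
```

On Windows, `decode_with_code_page(CP_ACP, text)` and `decode_with_code_page(CP_OEMCP, text)` would cover the two cases from this issue, and any other CP_* value goes through the same path. Note that a failure only says the conversion as a whole did not succeed; it does not say which byte was at fault.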

EDIT: Also, to go back to an earlier statement:

> I didn't think mbrtoc32 only worked with UTF-8 characters for the mb part. I'll have to actually write a real test to see how screwed up things really are.

It's bloody well not supposed to, that's for sure. In the C++17 standard, 24.5.5 paragraph 1 delegates to the C standard, and notes "ISO C 7.28". If we pull up the C11 spec, 7.28.1.3 specifically is mbrtoc32. Unfortunately, the wording there simply refers to "multibyte characters", and does not explicitly require that this mean the same thing for all such functions, so someone at Microsoft might have thought they had license to make the two interpret the same input differently. But if we look off to the side a little, at 6.4.4.4 paragraph 11 (which defines how to process wide character literals, L'…', u'…', and U'…'):

> A wide character constant prefixed by the letter L has type wchar_t, an integer type defined in the <stddef.h> header; a wide character constant prefixed by the letter u or U has type char16_t or char32_t, respectively, unsigned integer types defined in the <uchar.h> header. The value of a wide character constant containing a single multibyte character that maps to a single member of the extended execution character set is the wide character corresponding to that multibyte character, as defined by the mbtowc, mbrtoc16, or mbrtoc32 function as appropriate for its type, with an implementation-defined current locale. The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.

(emphasis mine)

This, then, clearly constrains the three functions to process their input alike, and Microsoft is at least documenting their implementation of mbrtoc32 as behaving in violation of the spec. Whether it's an implementation bug or a documentation bug, I don't know.

ThePhD · 2024-08-04T03:35:53Z

I should come back to this to state: I can't very well just use WideCharToMultiByte raw because the API is actually woefully awful in terms of how much useful information it gives back to you. It also cannot be slotted directly into the one-at-a-time design of the codec (though one could, if they implement the one-at-a-time operations using other means, use the internal and external extension points to then delegate to WideCharToMultiByte).
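To make "one-at-a-time" concrete, here is a minimal sketch of the kind of information such a decode step hands back: an error code, the number of bytes consumed, and the decoded code point. It is an illustration only and not ztd.text's actual interface. The names `decode_error`, `decode_one_result`, `high_half_table`, and `decode_one` are hypothetical, and the sketch assumes a single-byte code page whose lower half is ASCII and whose upper half is described by a caller-supplied table.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical error codes for a single decode step.
enum class decode_error { ok, insufficient_input, unmapped_byte };

// Hypothetical result of decoding exactly one character.
struct decode_one_result {
    decode_error error;     // what went wrong, if anything
    std::size_t consumed;   // how many input bytes were used
    char32_t code_point;    // decoded code point (0 on error)
};

// Mapping for bytes 0x80..0xFF of a single-byte code page.
// An entry of 0 marks a byte with no mapping (an assumption of this sketch).
using high_half_table = std::array<char32_t, 128>;

// Decode one character from the front of `input`, reporting precisely
// how far the decoder got and why it stopped.
decode_one_result decode_one(std::string_view input, const high_half_table& table) {
    if (input.empty()) {
        return {decode_error::insufficient_input, 0, 0};
    }
    const auto byte = static_cast<unsigned char>(input.front());
    if (byte < 0x80) {
        // Lower half is assumed to be plain ASCII.
        return {decode_error::ok, 1, static_cast<char32_t>(byte)};
    }
    const char32_t mapped = table[byte - 0x80u];
    if (mapped == 0) {
        return {decode_error::unmapped_byte, 0, 0};
    }
    return {decode_error::ok, 1, mapped};
}
```

A caller loops over the input, advancing by `consumed` each time and deciding what to do the moment an error appears. A bulk call that converts the whole buffer at once cannot offer that per-character view.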

You can read about why it's unsuitable here: https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape#windows-api

ThePhD self-assigned this Jun 19, 2024

ThePhD added the bug label Jun 19, 2024
