utf8tok - A non-allocating single-file C++17 library to split UTF-8 strings into grapheme clusters. Supports Unicode 11.0.
Unicode defines user-perceived characters as grapheme clusters, often consisting of multiple code points. utf8tok splits UTF-8 encoded strings into grapheme clusters, implementing a part of UAX #29.
Add utf8tok.h and graphemebreakproperties.inc to your project and include utf8tok.h wherever it is needed. In exactly one source file, define UTF8TOK_IMPLEMENTATION before including utf8tok.h. The implementation of all utf8tok-related functions is generated there.
utf8tok supplies several functions, but typically only std::optional<utf8tok::grapheme_cluster_view> utf8tok::next_grapheme_cluster(std::string_view &str_view, uint8_t* scratchBuffer, size_t scratchBufferSize) is required.
next_grapheme_cluster expects a string_view containing the UTF-8 encoded text to separate. On success the function returns a grapheme_cluster_view (which is another name for string_view) wrapped in a std::optional, and the separated cluster is also removed from the given string_view to simplify continued parsing. To let you control all allocations, you need to supply a scratch buffer. The contents of this buffer do not need to be preserved between calls to next_grapheme_cluster. If the buffer is too small to separate the next grapheme cluster, std::nullopt is returned. A buffer size of 50 bytes is sufficient for most grapheme clusters, but because emoji sequences, for example, can be extended considerably, you might need more in extreme cases.
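The calling pattern described above (a string_view that shrinks on every call, a caller-owned scratch buffer, std::nullopt on failure) can be sketched as follows. So that the sketch compiles without utf8tok.h, it uses a hypothetical stand-in named next_piece that has the same parameter shape as next_grapheme_cluster but only splits off single UTF-8 code points, not full grapheme clusters. The helper split_all is also made up for illustration. How the real function signals the end of input is not stated here, so the loop checks for an empty view itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in, NOT part of utf8tok. It mirrors the shape of
// utf8tok::next_grapheme_cluster but splits off one UTF-8 code point
// instead of one grapheme cluster.
std::optional<std::string_view> next_piece(std::string_view& str,
                                           std::uint8_t* scratch,
                                           std::size_t scratch_size) {
    if (str.empty() || scratch == nullptr) return std::nullopt;

    // Derive the sequence length from the lead byte.
    const auto lead = static_cast<unsigned char>(str[0]);
    std::size_t len = 1;
    if ((lead & 0xE0) == 0xC0) len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;
    if (len > str.size()) len = str.size();  // truncated input: take the rest

    // Mimic the documented "scratch buffer too small" failure.
    if (len > scratch_size) return std::nullopt;

    std::string_view piece = str.substr(0, len);
    str.remove_prefix(len);  // the consumed piece is removed from the input
    return piece;
}

// Hypothetical helper: collects every piece until the input is exhausted
// or the stand-in reports a failure.
std::vector<std::string> split_all(std::string_view text) {
    std::uint8_t scratch[50];  // caller-owned buffer, reused on every call
    std::vector<std::string> out;
    while (!text.empty()) {
        auto piece = next_piece(text, scratch, sizeof scratch);
        if (!piece) break;  // e.g. buffer too small: stop or retry with more
        out.emplace_back(*piece);
    }
    return out;
}
```

With the real library, the loop body would presumably call utf8tok::next_grapheme_cluster in place of next_piece. A caller that receives std::nullopt while input remains could retry with a larger scratch buffer, since the buffer contents do not have to survive between calls.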
The grapheme cluster break property data is stored in graphemebreakproperty.inc, which can be regenerated by compiling and running utf8tok_generator. The program expects paths to the Unicode consortium's grapheme break property file (found here) and the emoji data (found here).
To test conformance to UAX #29, the Unicode Consortium has published test cases here. These test case definitions can be converted to doctest test cases using the C# program found in tests/GraphemeTestGenerator.
Tests are run using the doctest library, licensed under MIT.