utf8tok

utf8tok - A non-allocating, single-file C++17 library to split UTF-8 strings into grapheme clusters. Supports Unicode 11.0.

Unicode defines user-perceived characters as grapheme clusters, which often consist of multiple code points. utf8tok splits UTF-8 encoded strings into grapheme clusters, implementing part of UAX #29.
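To illustrate why code points and user-perceived characters differ, here is a minimal sketch that counts the code points in a UTF-8 string. The helper count_code_points is hypothetical and not part of utf8tok; it assumes well-formed UTF-8 and performs no validation.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of utf8tok): counts Unicode code points in a
// UTF-8 string by counting every byte that is not a continuation byte.
// Continuation bytes have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8; no validation is performed.
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301 COMBINING ACUTE ACCENT) this yields 2 code points in 3 bytes, although a reader perceives a single character. Grouping those code points into one grapheme cluster is the job of a splitter such as utf8tok.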

Setup

Add utf8tok.h and graphemebreakproperty.inc to your project and include utf8tok.h where needed. Define UTF8TOK_IMPLEMENTATION in exactly one source file; the implementation of all utf8tok functions is generated there.

Usage

utf8tok supplies several functions, but typically only std::optional<utf8tok::grapheme_cluster_view> utf8tok::next_grapheme_cluster(std::string_view &str_view, uint8_t* scratchBuffer, size_t scratchBufferSize) is needed.

next_grapheme_cluster expects a string_view containing the UTF-8 encoded text to separate. On success, the function returns a grapheme_cluster_view (which is another name for string_view) and removes the cluster from the given string_view to simplify continued parsing.

To let you control all allocations, you need to supply a scratch buffer. The contents of this buffer do not need to be preserved between calls to next_grapheme_cluster. If the buffer is too small to separate the next grapheme cluster, std::nullopt is returned. A buffer size of 50 bytes is sufficient for most grapheme clusters, but since emoji sequences, for example, can be extended considerably, you might need more in extreme cases.
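The calling pattern described above can be sketched as a small driver loop. This is an illustration under stated assumptions, not code from the library: for_each_cluster is a hypothetical helper, and the tokenizer is passed in as a callable so that the sketch compiles without utf8tok.h. The callable is assumed to have the same shape as next_grapheme_cluster, that is, it takes the string_view by reference, a scratch buffer and its size, and returns an optional view.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not part of utf8tok): calls `visit` for every cluster
// that `next` splits off the front of `text`.
// `next` is assumed to behave like next_grapheme_cluster as described above:
// on success it returns the cluster and removes it from `text`; on failure
// (for example a scratch buffer that is too small) it returns std::nullopt.
// Returns true if the whole text was consumed, false otherwise.
template <typename NextCluster, typename Visitor>
bool for_each_cluster(std::string_view text, std::uint8_t* scratch,
                      std::size_t scratch_size, NextCluster next,
                      Visitor visit) {
    while (!text.empty()) {
        const std::size_t remaining_before = text.size();
        auto cluster = next(text, scratch, scratch_size);
        if (!cluster) {
            return false;  // caller may retry with a larger scratch buffer
        }
        if (text.size() == remaining_before) {
            return false;  // defensive: no progress, avoid looping forever
        }
        visit(*cluster);
    }
    return true;
}
```

With the library included, next would typically be a lambda that forwards its three arguments to utf8tok::next_grapheme_cluster, and scratch could be a stack array such as uint8_t scratch[64]. A false return value signals that splitting stopped early, which is the point at which a caller could retry with a larger buffer.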

Generation

The grapheme cluster break property data is stored in graphemebreakproperty.inc, which can be regenerated by compiling and running utf8tok_generator. The program expects paths to the Unicode Consortium's grapheme break property file (found here) and the emoji data (found here).

Tests

To test conformance to UAX #29, the Unicode Consortium has published test cases here. These test case definitions can be converted to doctest test cases using the C# program found in tests/GraphemeTestGenerator.

Tests are run using the doctest library, licensed under MIT.
