[Feature Request] Unicode support #24

choeger · 2014-10-15T20:15:38Z

At a glance, this whole library seems like a very well-thought-out piece of software (limited scope, defined solution). Unfortunately, it does not support Unicode right now, and Unicode should be the standard in this millennium. So here is my proposal: instead of using chars and strings exclusively, abstract the library over the concrete code-point and input representations. Then someone (me) could simply extend the library by providing suitable Unicode support. I understand that this kind of abstraction might cause some performance regressions, but it would enable a whole batch of new use cases.

c-cube · 2014-12-02T09:50:31Z

Could D. Bunzli's Uutf be used to iterate over unicode chars? That might also help to parametrize over the input stream (string, bigarray, stream of strings, etc.) for #20 ...
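
For illustration only, here is a minimal sketch of what "iterating over Unicode chars" in a UTF-8 string can look like. It deliberately uses the standard library decoder (`String.get_utf_8_uchar`, which assumes OCaml 4.14 or later) instead of Uutf so that it has no dependencies; `fold_utf_8` and `code_points` are hypothetical helper names, not part of this library or of Uutf.

```ocaml
(* Fold [f] over the code points of the UTF-8 string [s].
   Assumption: OCaml >= 4.14 for String.get_utf_8_uchar.
   Malformed bytes are reported as U+FFFD (Uchar.rep). *)
let fold_utf_8 f acc s =
  let rec go acc i =
    if i >= String.length s then acc
    else
      let d = String.get_utf_8_uchar s i in
      (* utf_decode_length is always >= 1, so the loop terminates. *)
      go (f acc (Uchar.utf_decode_uchar d)) (i + Uchar.utf_decode_length d)
  in
  go acc 0

(* Collect the code points of [s] as integers, in order. *)
let code_points s =
  List.rev (fold_utf_8 (fun acc u -> Uchar.to_int u :: acc) [] s)
```

A matcher that is generic over the input would consume such a fold (or an equivalent decoder over a bigarray or a stream) rather than indexing a `string` directly.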

vouillon · 2014-12-02T17:28:41Z

The main issue with making the implementation generic is that it is table-based. This works well when there are only 256 possible characters, but does not scale to the more than one million Unicode code points...

One thing that should work is to translate regular expressions defined in terms of Unicode code points into regular expressions defined in terms of bytes, and then match UTF-8 strings byte by byte.
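
A rough sketch of that translation, under stated assumptions: a range of code points is split into sub-ranges whose UTF-8 encodings have the same length and share aligned continuation bytes, so each sub-range becomes a plain sequence of byte ranges. This is the well-known range-splitting technique used by other byte-based regex engines, not code from this library; `utf8_bytes`, `utf8_sequences` and `match_at` are hypothetical names.

```ocaml
(* Encode one Unicode scalar value as its UTF-8 bytes. *)
let utf8_bytes cp =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_bytes: not a Unicode scalar value"
  else if cp < 0x80 then [ cp ]
  else if cp < 0x800 then [ 0xC0 lor (cp lsr 6); 0x80 lor (cp land 0x3F) ]
  else if cp < 0x10000 then
    [ 0xE0 lor (cp lsr 12);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]
  else
    [ 0xF0 lor (cp lsr 18);
      0x80 lor ((cp lsr 12) land 0x3F);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]

(* Translate the code point range [lo, hi] into an alternation of
   sequences of byte ranges. Surrogates are skipped. *)
let rec utf8_sequences lo hi =
  if lo > hi then []
  else if lo <= 0xDFFF && hi >= 0xD800 then
    (* Cut the surrogate block out of the range. *)
    utf8_sequences lo 0xD7FF @ utf8_sequences 0xE000 hi
  else
    (* Split where the encoded length changes (1, 2, 3 or 4 bytes). *)
    match List.find_opt (fun b -> lo <= b && b < hi) [ 0x7F; 0x7FF; 0xFFFF ] with
    | Some b -> utf8_sequences lo b @ utf8_sequences (b + 1) hi
    | None when hi < 0x80 -> [ [ (lo, hi) ] ]
    | None ->
      (* Split until every continuation byte covers a full or shared range. *)
      let rec align i =
        if i > 3 then
          [ List.map2 (fun a b -> (a, b)) (utf8_bytes lo) (utf8_bytes hi) ]
        else
          let m = (1 lsl (6 * i)) - 1 in
          if (lo land (lnot m)) <> (hi land (lnot m)) then
            if lo land m <> 0 then
              utf8_sequences lo (lo lor m) @ utf8_sequences ((lo lor m) + 1) hi
            else if hi land m <> m then
              utf8_sequences lo ((hi land (lnot m)) - 1)
              @ utf8_sequences (hi land (lnot m)) hi
            else align (i + 1)
          else align (i + 1)
      in
      align 1

(* Byte-by-byte match of one "character" at [pos]; returns bytes consumed. *)
let match_at seqs s pos =
  let rec fits i = function
    | [] -> Some (i - pos)
    | (a, b) :: rest ->
      if i < String.length s && a <= Char.code s.[i] && Char.code s.[i] <= b
      then fits (i + 1) rest
      else None
  in
  List.fold_left
    (fun acc seq -> match acc with Some _ -> acc | None -> fits pos seq)
    None seqs
```

For example, `utf8_sequences 0x80 0x7FF` yields the single sequence `[C2-DF][80-BF]`, so a 256-entry transition table is still enough: the automaton simply takes two steps for a two-byte character.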

zoggy · 2016-01-12T07:51:34Z

Any hope of having Unicode supported soon?

Drup · 2016-01-12T12:09:25Z

I don't think @nojb or anyone else is working on it right now, but it could change if someone was motivated. ;)

XVilka · 2018-02-05T06:07:50Z

Surprising that it still hasn't been implemented.

c-cube · 2018-02-05T14:32:26Z

Someone needs to do it, and it's hard™ 🙂

nojb · 2018-02-05T14:37:48Z

As far as I understand from the discussion in #48, the implementation there is viable and could be used as a basis for further work. I can rebase that PR against the current master, but unfortunately I am rather overloaded at the moment so cannot commit to doing the "further work" that may be necessary to get it integrated.

nojb mentioned this issue Dec 20, 2014

[RFC] Add Unicode support #48

Closed

bcc32 mentioned this issue Aug 13, 2021

Support of Chinese characters #196

Closed

mseri mentioned this issue Dec 8, 2021

Unicode regexp for NLP owlbarn/owl#599

Open

leque mentioned this issue Dec 29, 2021

Proposal: change regexp backend gfngfn/SATySFi#306

Open

toots mentioned this issue Mar 26, 2024

Incorrect handling of non-ascii characters in regex savonet/liquidsoap#3824

Open
