Support of Chinese characters #196

Closed
chengtie opened this issue Aug 13, 2021 · 2 comments

chengtie commented Aug 13, 2021

I have the following code to strip leading and trailing whitespace (spaces, tabs, newlines, etc.).

module Str = Re.Str

let strip (str:string) :string = 
  let str = Str.replace_first (Str.regexp "^[ \012\r\t\n]+") "" str in
  Str.replace_first (Str.regexp "[ \012\r\t\n]+$") "" str

I just realized that this code breaks Chinese characters. For instance, strip "程铁" returned "\231\168", which does not make sense.

Does Re support Chinese characters? If not, is there any workaround?

bcc32 (Contributor) commented Aug 13, 2021

How are you evaluating strip "程铁"? I copy-pasted the expression you wrote into utop just now:

# strip "程铁";;
- : string = "程铁"

Maybe this is something to do with the file encoding of your OCaml source code, if you are compiling from a file?

FWIW, there have been feature requests to support Unicode (#24), but it has not been implemented. In this case, however, I would not expect Re to mess up your string.

chengtie (Author) commented:

Indeed, strip works fine; it is the operations performed beforehand (e.g., String.sub) that cause the problem.

Anyway, I should not use OCaml's native byte-based operations such as String.sub and String.length to manipulate these strings. I will use third-party libraries.

Thank you.
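
For anyone landing here with the same symptom, here is a minimal sketch of why String.sub produces "\231\168" and one possible way around it without a third-party library. It assumes the string is UTF-8 encoded and that OCaml 4.14 or later is available (for String.get_utf_8_uchar and Uchar.utf_decode_length). The helpers utf8_byte_offset and utf8_sub are hypothetical names made up for this illustration; they are not part of Re or the standard library. Note that they count Unicode code points, not grapheme clusters, so combining sequences may still be split.

```ocaml
(* Assumes UTF-8 input and OCaml >= 4.14. *)

(* Hypothetical helper: byte offset at which the [n]-th code point of [s]
   starts, or [String.length s] if [s] has fewer than [n] code points. *)
let utf8_byte_offset (s : string) (n : int) : int =
  let len = String.length s in
  let rec go pos count =
    if count >= n || pos >= len then pos
    else
      (* Decode one code point; invalid bytes still advance by >= 1 byte. *)
      let d = String.get_utf_8_uchar s pos in
      go (pos + Uchar.utf_decode_length d) (count + 1)
  in
  go 0 0

(* Hypothetical helper: like String.sub, but [start] and [count] are
   measured in code points instead of bytes, and [count] is clamped. *)
let utf8_sub (s : string) (start : int) (count : int) : string =
  let b0 = utf8_byte_offset s start in
  let rest = String.sub s b0 (String.length s - b0) in
  String.sub rest 0 (utf8_byte_offset rest count)

let () =
  (* "程铁" written as UTF-8 bytes so the example does not depend on the
     encoding of the source file. *)
  let s = "\231\168\139\233\147\129" in
  (* String.length counts bytes: two characters, three bytes each. *)
  assert (String.length s = 6);
  (* A byte-based String.sub cuts the first character in half. *)
  assert (String.sub s 0 2 = "\231\168");
  (* The code-point-based version keeps the character whole. *)
  assert (utf8_sub s 0 1 = "\231\168\139")
```

The design choice here is to translate code-point positions into byte offsets first and then reuse String.sub, so the byte-level slicing always lands on a character boundary.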
