Japanese Characters cause the entire string to be detected as a URL #39

joeyfedor · 2023-03-03T20:02:54Z

If you run the detector on the text below, it treats the entire text as a single URL.

我进入你的主页很卡顿，也许是你的关注人数或者其他数据太多了，其他人主页没有这么卡顿。来自amethyst客户端

The characters 。 and ， are each a single character, and this library does not treat them as spaces.
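A minimal sketch of that observation, using only the Java standard library. The class and method names are made up for illustration and are not part of URL-Detector. It checks that 。 (U+3002) and ， (U+FF0C) are not whitespace as far as `java.lang.Character` is concerned, and that `java.net.IDN` accepts U+3002 as a domain label separator. Whether URL-Detector applies the same dot rule internally is an assumption here; this thread does not confirm it.

```java
import java.net.IDN;

// Hypothetical demo class; it does not use or reproduce URL-Detector itself.
public class FullWidthPunctuationDemo {
    static final char IDEOGRAPHIC_FULL_STOP = '\u3002'; // 。
    static final char FULLWIDTH_COMMA = '\uFF0C';       // ，

    // True if the JDK classifies the character as any kind of space.
    static boolean isJavaWhitespace(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Converts a host name to its ASCII form; java.net.IDN recognizes
    // U+002E, U+3002, U+FF0E and U+FF61 as label separators ("dots").
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        System.out.println(isJavaWhitespace(IDEOGRAPHIC_FULL_STOP)); // false
        System.out.println(isJavaWhitespace(FULLWIDTH_COMMA));       // false
        System.out.println(toAsciiHost("example\u3002com"));         // example.com
    }
}
```

If a detector follows the same convention, a sentence ending in 。 offers no space to split on and can look like it contains a dot between host labels, which would be consistent with the behavior reported above.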

mattn · 2023-05-01T15:59:23Z

linkedin/URL-Detector is not well suited to detecting URLs in content that may contain multi-byte strings. The following test case resembles ordinary Chinese/Japanese text.

URL-Detector/url-detector/src/test/java/com/linkedin/urls/detection/TestUriDetection.java

Line 214 in 368c4e4

runTest("\u9053 \u83dc\u3002\u3002\u3002\u3002", UrlDetectorOptions.Default);

mattn mentioned this issue May 1, 2023

[BUG] Japanese text is misidentified as URL. vitorpamplona/amethyst#390

Closed
