Use ICU4X to run parts of util.unicode.org #1004
Comments
Hmmm, maybe we could block that user, and throttle anyone with more than 1
query per 10 seconds?
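For illustration, the "1 query per 10 seconds" idea could be sketched as below. This is a minimal sketch, not what util.unicode.org actually runs: the class name `PerClientThrottle`, the choice of client key (IP address, user agent, or similar), and the in-memory dictionary are all assumptions made for the example.

```python
import time
from typing import Callable, Dict


class PerClientThrottle:
    """Allow at most one query per `min_interval` seconds for each client key.

    Hypothetical helper: the client key could be an IP address or a user
    agent string; the thread does not say which would be used.
    """

    def __init__(self, min_interval: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behaviour can be tested
        self._last_allowed: Dict[str, float] = {}

    def allow(self, client_key: str) -> bool:
        """Return True if this query may proceed, False if it is too soon."""
        now = self.clock()
        last = self._last_allowed.get(client_key)
        if last is not None and now - last < self.min_interval:
            # Too soon: the caller would typically answer with HTTP 429.
            return False
        # Only accepted queries move the window forward.
        self._last_allowed[client_key] = now
        return True
```

A real deployment would also need to evict old keys so the dictionary does not grow without bound, and would probably live in a servlet filter or in the proxy in front of the server rather than in application code.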
On Sun, Jan 26, 2025, 06:30 Robin Leroy wrote:
Here’s the current traffic from one specific (slightly odd) user agent:
image.png (view on web)
<https://github.com/user-attachments/assets/b70a6309-d506-45f9-b321-959704609b7c>
|
@sffc is this ticket a dup? or have we just talked about it without an issue? if node is nodejs maybe someone is scraping. |
This issue is to track migrating parts of util.unicode.org to ICU4X, which has been discussed in various forums, but I couldn't find a canonical issue. Investigation on other server cost mitigation or rate limiting techniques could be discussed elsewhere. That doesn't however invalidate the motivation for making popular parts of the site run client-side. |
Summarizing a discussion with @sffc in Zürich:

It would likely be useful to publish, somewhere on unicode.org, a limited ICU4X-based properties inspection and UnicodeSet query tool based on current published data, but this should be independent of the existing tools.

When it comes to « the whole UCD (including Unihan, Unikemet, etc.) and the kitchensink, for all past versions and for draft data » that the JSPs provide for UTC work, reimplementing that (in ICU4X or anywhere) would be a very difficult project. In addition, the benefits are not so clear; historical UCD data is measured in gigabytes, so sending it to the client to perform the query locally is not very practical.

In addition, for UTC work, we want the properties that are displayed by these tools to be based on the same implementation as the invariant tests and the data file generation. I have been slowly removing parts of the tools that were using properties from ICU rather than from implementations in the tools, see #502, #835, etc., as well as moving as much as possible to the modern (2011) properties implementation from the older (1996) one, see #488, etc. Adding another UCD parser and UnicodeSet parser in the mix would be unhelpful.

As for the rate limiting issues mentioned in the OP, they came from queries for niche properties that ICU4X probably shouldn’t support, and whose implementation in the unicodetools is ridiculously inefficient, see #1018. Eventually we should write some reasonable data structures in the unicodetools to properly support queries on multivalued properties with many different values. |
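To make the last point concrete, one possible shape for such a data structure is an inverted index from each property value to the code point ranges that carry it, so that a query for one value does not have to scan every code point. The sketch below is only an illustration in Python, not the unicodetools implementation or the approach discussed in #1018; the function name `build_inverted_index` and the toy values `"a"`, `"b"`, `"c"` are made up for the example and are not real UCD data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_inverted_index(
    values_by_code_point: Dict[int, Iterable[str]],
) -> Dict[str, List[Tuple[int, int]]]:
    """Map each property value to sorted, inclusive code point ranges.

    `values_by_code_point` models a multivalued property: each code point
    maps to zero or more values. (Hypothetical input shape.)
    """
    code_points_by_value: Dict[str, Set[int]] = defaultdict(set)
    for cp, values in values_by_code_point.items():
        for value in values:
            code_points_by_value[value].add(cp)

    index: Dict[str, List[Tuple[int, int]]] = {}
    for value, cps in code_points_by_value.items():
        ranges: List[Tuple[int, int]] = []
        for cp in sorted(cps):
            if ranges and ranges[-1][1] + 1 == cp:
                # Extend the current range by one code point.
                ranges[-1] = (ranges[-1][0], cp)
            else:
                # Start a new range.
                ranges.append((cp, cp))
        index[value] = ranges
    return index


# Toy data, not taken from the UCD.
toy_property = {
    0x10: ["a", "b"],
    0x11: ["a"],
    0x12: ["a", "c"],
    0x20: ["b"],
}
toy_index = build_inverted_index(toy_property)
```

With this layout, looking up `toy_index["a"]` is a single dictionary access whose cost depends on the size of the result rather than on the size of the code space; the index is built once per data version.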
Currently util.unicode.org runs on top of ICU4J. It works fine, but sometimes it is slow or hits rate limits that we've imposed to cap server costs, as it is doing as I write this message.
We should add ICU4X-backed tooling to parts of util.unicode.org via WebAssembly. This has the benefit of reducing latency (all calculations are client-side) and serving costs (the ICU4X wasm file can be cost-efficiently cached and served in a CDN).
The Unicode Tools are designed to run on the latest (even unreleased) version of the Unicode Standard, and so part of this project may involve improving some of the ICU4X tooling so that it can read raw UCD files. See unicode-org/icu4x#4602
CC @josh-hadley @eggrobin