
more efficient stream parsing #212

Closed
movy opened this issue Dec 1, 2023 · 9 comments



movy commented Dec 1, 2023

Hello,
I continuously stream massive amounts of market data from ClickHouse, and after profiling my app I discovered that most ticks are spent in stream parsing, i.e.:

    const rows = await this.#client.query({
        query: `
            SELECT
                *
            FROM ...
        `,
    })

    const stream = rows.stream()

    stream.on('data', async (rows) => {
        for (const row of rows) {
            const [
                event,
                time,
                ...
            ] = await row.json()
        }
    })

I briefly checked the module's source code, and it looks like the quite expensive JSON.parse() is used internally even when we call text().

My question would be: **is there a faster way to parse the incoming stream, maybe by using some raw output format?** Or am I reading streams wrong?


slvrtrn commented Dec 1, 2023

You could try CSV/TSV formats for selects and do your own row parsing after calling row.text.
row.text does not call JSON.parse; that is used by row.json() only: https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-node/src/result_set.ts#L77-L81
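
For illustration, a minimal sketch of that approach, assuming a configured client instance and a hypothetical trades table with event, time, and price columns (handleTick stands in for application logic):

    // Select in TSV and split each row manually instead of calling row.json()
    const resultSet = await client.query({
        query: 'SELECT event, time, price FROM trades',
        format: 'TabSeparated',
    })
    const stream = resultSet.stream()
    stream.on('data', (rows) => {
        for (const row of rows) {
            // row.text is the raw TSV line; no JSON.parse is involved
            const [event, time, price] = row.text.split('\t')
            // convert only the fields that actually need to be numbers
            handleTick(event, Number(time), Number(price))
        }
    })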

I am also considering adding RowBinary to supported formats, at least for selects.


movy commented Dec 5, 2023

Thanks for your advice. I wrote a simple datatype-aware parser to split TSV and convert "numbers" to numbers, and the speed improved at least 10x, amazing.
Just a thought: as the structure of the JSON is known from the column types, maybe something like simdjson could be used?
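
A rough sketch of what such a datatype-aware split might look like (the numeric column positions are invented for illustration):

    // Type-aware TSV row parser: convert only the columns known to be
    // numeric and keep everything else as strings.
    // The numeric column positions below are purely illustrative.
    const numericColumns = new Set([1, 2])

    function parseRow(line: string): Array<string | number> {
        return line
            .split('\t')
            .map((value, i) => (numericColumns.has(i) ? Number(value) : value))
    }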


slvrtrn commented Dec 5, 2023

@movy, I think a RowBinary implementation could yield even more significant gains. See https://github.com/luizperes/simdjson_nodejs#benchmarks - simdjson is only 1.5-2x faster than JSON.parse. It is probably worth adding as an optional switch; I will eventually look at what's possible here.

I am considering, however, adding RowBinary along with a third "get data" method on the Row object: something like array(), maybe? That would get all the values of a particular row in the correct order, utilizing the metadata about the columns, which we will also receive. The overhead should be pretty minimal.
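
Purely as an illustration of the idea (array() is a proposal here, not an existing client API), usage could look something like this:

    // Hypothetical usage of the proposed array() method; it does not
    // exist in the client at the time of this discussion.
    stream.on('data', (rows) => {
        for (const row of rows) {
            // values would arrive already decoded from RowBinary,
            // in column order, with no string parsing done in JS
            const [event, time, price] = row.array()
        }
    })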


movy commented Dec 6, 2023

Yes, array() would be great as long as the type conversion is not done in JS.
In my case, I receive an array of ~20 values via TabSeparated, and as most of them are expected to be numbers, I need to parse them. Parsing the whole array is even slower than JSON.parse(), so I selectively parse only the 3-5 values needed for a particular event, and the speed gains are very noticeable. It would be nice to avoid parsing altogether, both for the sake of code simplicity and for performance.
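
For example, a sketch of that selective approach (the event name and field positions are made up for illustration):

    // Split the TSV line once, then convert only the few fields this
    // particular event actually needs; positions are hypothetical.
    const fields = row.text.split('\t')
    if (fields[0] === 'trade') {
        const price = Number(fields[3])
        const amount = Number(fields[4])
        // ...the remaining ~15 fields stay as unparsed strings
    }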


slvrtrn commented Dec 6, 2023

@movy, I am curious to run some internal tests on this (JSON vs. Text + "manual" numbers parsing vs. RowBinary + row as an array). Can you please contact me in the community Slack and share the code snippets and, maybe, a sample dataset (or its definitions) for your scenarios?


chrisjh commented Jan 26, 2024

@slvrtrn @movy I'm running into a similar scenario where JSON.parse isn't efficient enough for the result set I'm working with. I've played around with the options described in this thread and measured some improvement (particularly using simdjson instead of JSON.parse), but I'm curious how your testing is going and whether there's an optimized strategy for these cases.


slvrtrn commented Mar 8, 2024

I wanted to post an update here, as this issue looks stale, but it is not.

I've been studying the RowBinary format for quite some time and writing an experimental implementation of it with select streaming support out of the box; the current progress is here: 1.0.0...row_binary

At the time of writing, that branch supports decoding a minimal set of useful data types:

  • Boolean
  • (U)Int8/16/32 as a number
  • Float32/64 as a number
  • (U)Int64/128/256 as a native JS BigInt
  • String
  • Date/Date32 as a JS Date
  • Nullable

Currently in progress, as I consider that the following types should be supported straight away (there are draft decoders/placeholders for some of them already): Decimal, Enum, UUID, FixedString, DateTime(64), Array, Map, Tuple.

NB: under the hood, Date is just a UInt16 (the number of days since the Unix epoch), and Date32 is just an Int32 (also days, but before/after the epoch). JS Date is chosen only as the default option for these types; there will be a mapping option to convert the "days" numeric value into any other object.

For example, consider the Date type. The mapper is just:

    <T>(daysSinceEpoch: number) => T

And, by default, if no custom mapper is provided, it is something like this:

    const defaultDateMapper = (daysSinceEpoch: number) => new Date(daysSinceEpoch * DayMillis)

Similarly, Date32, DateTime, DateTime64 (especially DateTime64(9), as it requires proper nanoseconds support), and Decimal will have optional mapper support from their decoded "raw" representation.
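
As an illustration, a custom mapper of the same shape could produce, say, an ISO date string instead of a JS Date (DayMillis here is assumed to be the number of milliseconds in a day):

    // Hypothetical custom Date mapper, same shape as defaultDateMapper:
    // produce an ISO date string instead of a JS Date object.
    const DayMillis = 24 * 60 * 60 * 1000
    const isoDateMapper = (daysSinceEpoch: number): string =>
        new Date(daysSinceEpoch * DayMillis).toISOString().slice(0, 10)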


@movy @chrisjh (and @cjk, I saw your reaction, too!), if you are still interested, I'd be curious to learn a few things:

  1. What are the table definitions of your datasets where selecting data in JSON format is too slow? I'd also appreciate it if you could share some of these with me (don't hesitate to ping me in the community Slack!)
  2. Directly related to 1 - what is a minimal set of supported data types you would like to see implemented for RowBinary in the client so that it could already be helpful in your scenario?
  3. Are you selecting or inserting the data into these tables?

slvrtrn added this to the 1.1.0 milestone Mar 11, 2024

slvrtrn commented Mar 15, 2024

Let's continue tracking this issue here: #216; please don't hesitate to comment. Cheers.

slvrtrn closed this as completed Mar 15, 2024
slvrtrn removed this from the 1.1.0 milestone Mar 15, 2024

slvrtrn commented Mar 15, 2024

@chrisjh, for your request, I created this issue to track: #246
