Arrow table from query result much larger than equivalent inserted table #1741

keller-mark · 2024-05-22T18:08:58Z

What happens?

Arrow tables returned by conn.query are much larger than expected because their columns are not dictionary-encoded. In the web browser, large (>1 million row) tables that can be inserted into the database successfully cannot subsequently be queried in their entirety, because the memory footprint of the returned Arrow table is too large.
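To give a rough sense of the size difference, here is a minimal sketch that estimates the buffer sizes of a string column stored as plain Utf8 versus dictionary-encoded with 32-bit indices. The function names, the sample labels, and the row count are made up for illustration, and validity bitmaps and padding are ignored, so treat the numbers as approximate.

```typescript
// Approximate size of a plain Arrow Utf8 column:
// UTF-8 value bytes plus one 4-byte offset per row (plus one extra offset).
function plainUtf8Bytes(values: string[]): number {
  const encoder = new TextEncoder();
  let dataBytes = 0;
  for (const v of values) {
    dataBytes += encoder.encode(v).length;
  }
  return dataBytes + 4 * (values.length + 1);
}

// Approximate size of a Dictionary<Utf8, Int32> column:
// each distinct string stored once, plus one 4-byte index per row.
function dictionaryUtf8Bytes(values: string[]): number {
  const distinct = Array.from(new Set(values));
  return plainUtf8Bytes(distinct) + 4 * values.length;
}

// Hypothetical low-cardinality column: 1,000,000 rows drawn from 3 labels.
const labels = ["T cell", "B cell", "Macrophage"];
const column = Array.from({ length: 1_000_000 }, (_, i) => labels[i % labels.length]);

console.log("plain bytes:", plainUtf8Bytes(column));
console.log("dictionary bytes:", dictionaryUtf8Bytes(column));
```

The saving depends on cardinality and string length: for a column where almost every value is distinct, the dictionary form is no smaller than the plain one.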

To Reproduce

This Observable notebook contains a minimal reproduction: https://observablehq.com/d/56ceaa780133858a

Browser/Environment:

Firefox Developer Edition 127.0b4

Device:

MacBook

DuckDB-Wasm Version:

1.24.0

DuckDB-Wasm Deployment:

Observable Notebook

Full Name:

Mark Keller

Affiliation:

Harvard Medical School


keller-mark · 2024-05-27T13:59:41Z

I think this comes down to these lines:

const buffer = await this._bindings.runQuery(this._conn, text);
const reader = arrow.RecordBatchReader.from<T>(buffer);

On the JS side, the arrow.RecordBatchReader could potentially be modified to return dictionary-encoded columns. However, if the buffer passed to it is not already dictionary-encoded, the memory issue may persist because the buffer itself is very large.

domoritz · 2024-05-27T14:48:53Z

Hmm, but if it's a buffer, doesn't that mean that JS here is just instantiating an Arrow record batch from IPC, and the IPC dictates the schema, so we can't just dictionary-encode after the fact?

keller-mark · 2024-05-27T14:53:08Z

Yes, that is what I was trying to convey with "the memory issue may persist due to the buffer being very large", i.e., that the fix might need to be on the C++ side.

keller-mark · 2024-05-27T14:55:53Z

That aside, on the JS side, couldn't the dictionary encoding be performed incrementally as the buffer is parsed? For example, you could keep a set of values that have already been "seen" and, whenever an unseen value is encountered, add a mapping for it to the dictionary.

EDIT

I suppose the above statement does not make sense if there is no "parsing" step (https://github.com/apache/arrow/blob/ff9921ffa89585be69ae85674bb365d03cb22ba4/js/src/ipc/reader.ts#L357); in that case the only option would be on the C++ side.
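For illustration only, the incremental "seen values" idea could look roughly like the sketch below. It is plain TypeScript operating on arrays of strings, not on an Arrow IPC buffer, and IncrementalDictionaryEncoder is a hypothetical name rather than anything in apache-arrow or DuckDB-Wasm. It also assumes the values are materialized chunk by chunk, which is the step that may not exist on the JS side.

```typescript
// Hypothetical helper: dictionary-encodes string values chunk by chunk,
// growing a shared dictionary whenever an unseen value appears.
class IncrementalDictionaryEncoder {
  // Maps each value seen so far to its position in the dictionary.
  private readonly indexByValue = new Map<string, number>();
  // Distinct values in first-seen order.
  readonly dictionary: string[] = [];

  // Encode one chunk of values into 32-bit dictionary indices.
  encodeChunk(values: string[]): Int32Array {
    const indices = new Int32Array(values.length);
    values.forEach((value, row) => {
      let index = this.indexByValue.get(value);
      if (index === undefined) {
        // Unseen value: append it to the dictionary and remember its index.
        index = this.dictionary.length;
        this.dictionary.push(value);
        this.indexByValue.set(value, index);
      }
      indices[row] = index;
    });
    return indices;
  }

  // Map indices back to their string values.
  decode(indices: Int32Array): string[] {
    return Array.from(indices, (i) => this.dictionary[i]);
  }
}

// Two chunks sharing one dictionary.
const encoder = new IncrementalDictionaryEncoder();
const firstChunk = encoder.encodeChunk(["a", "b", "a"]);
const secondChunk = encoder.encodeChunk(["b", "c"]);
console.log(firstChunk, secondChunk, encoder.dictionary);
```

Even with something like this, the un-encoded buffer would already have been allocated before any JS-side encoding could run, so peak memory would not necessarily improve.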

domoritz · 2024-05-27T15:55:42Z

Yep, that's what I'm thinking as well but I may be wrong.

keller-mark mentioned this issue Aug 15, 2024

Try DuckDB vitessce/vitessce#1828

