Any plan to support JSON or JSONB? #7845
Comments
DataFusion supports reading JSON in, and some basic things like field access work. However, it doesn't have the range of operators that Postgres does. This is similar to #5493. It would be great if someone wanted to start working on adding more native JSON support (starting by spec'ing it out and writing the appropriate tickets), but at the moment I don't know of anyone doing so. |
Indexed field access is only valid for a limited set of Arrow types; DataFusion/Arrow does not have a JSON/JSONB datatype, nor the related JSON functions. We could solve it in two ways: add a real JSON/JSONB type, or treat string/binary data as JSON in the JSON functions and operators.
The first one is more natural: the type is more precise and could seamlessly integrate with Parquet's JSON/BSON type. The second one could be faster to land, but there would be some type issues. |
A third way could be to parse JSON data into Arrow structs. One limitation of this approach is that it requires all the JSON records to have the same schema, which is often, but not always, the case. The more dynamic "JSON/BSON" type approach would allow varying schemas, but would not be as fast. |
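As a rough sketch of what that third approach looks like with today's APIs (file name, table name, and column names here are hypothetical), newline-delimited JSON can already be read into Arrow-typed tables, with nested objects becoming Struct columns:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register an NDJSON file; the schema is inferred and fixed up front,
    // so every record must fit the same structure.
    ctx.register_json("events", "events.ndjson", NdJsonReadOptions::default())
        .await?;

    // Nested objects become Arrow Struct columns, so indexed field access
    // works, but there are no general JSON/JSONB operators.
    let df = ctx
        .sql("SELECT payload['name'] AS name FROM events")
        .await?;
    df.show().await?;
    Ok(())
}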
I also think there is a solution that is part way between what @dojiong proposes in #7845 (comment): store BSON in a binary column. |
Yeah, JSON is schemaless; it's hard to fit it into a fixed Arrow Struct schema.
That could land quickly, without modifying Arrow's types. But there are some limitations:
If we use Binary for BSON, we should consider a convenient way to convert BSON data to other datatypes. |
A third option (or maybe you meant this by either 1 or 2): have JSON and BSON be extension types, so DataFusion could recognize them via field metadata. I think the Arrow community would be open to eventually having canonical extension types for these, given their prevalence in various formats, including Parquet.
I wonder if |
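To make the extension-type idea concrete, here is a minimal sketch (the column name and extension name are made up for illustration) of tagging a plain Utf8 column as JSON via Arrow's standard extension metadata key, which an engine could then inspect:

use std::collections::HashMap;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Store the JSON as ordinary Utf8, but mark it with the Arrow
    // extension-type metadata key so consumers can recognize it.
    let json_field = Field::new("attributes", DataType::Utf8, true).with_metadata(
        HashMap::from([("ARROW:extension:name".to_string(), "example.json".to_string())]),
    );
    let schema = Arc::new(Schema::new(vec![json_field]));

    // A query engine could check this metadata when deciding whether
    // JSON operators / functions apply to the column.
    let field = schema.field(0);
    let is_json = field
        .metadata()
        .get("ARROW:extension:name")
        .is_some_and(|name| name.as_str() == "example.json");
    println!("{} tagged as JSON: {}", field.name(), is_json);
}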
I didn't find a way to use extension types in DataFusion; do you mean adding extension type support to DataFusion? But DataFusion uses Arrow's DataType, and it may be harder to add a JSON type to Arrow.
That's one way to achieve it, then |
Maybe adding an ExtensionType to Arrow's DataType would be more natural:

trait ExtensionType {
fn inner_type(&self) -> &DataType;
// ....
}
enum DataType {
Int8,
Utf8,
// ....
Extension(Arc<dyn ExtensionType>)
}
impl DataType {
fn raw_type(&self) -> &DataType {
let mut ty = self;
while let Self::Extension(ext) = ty {
ty = ext.inner_type();
}
ty
}
} |
@dojiong Discussion of support for extension types can be found at apache/arrow-rs#4472 |
Related proposal for user defined types: #7923 |
We (Coralogix) built our own binary jsonb format (we call it jsona, for json arrow) that we are planning on open-sourcing in the next couple of months (hopefully in the Jan/Feb time frame; we need to fill in some details). It has some nifty features for vectorized processing. In broad strokes, we took the tape representation and split it across child arrays.
So you have three child arrays. If this is something others would be interested in, we would be happy to upstream it into arrow-rs proper as a native array type (not in the sense of adding it to the Arrow spec, but as something with "native" APIs in arrow-rs to avoid the ceremony around dealing with struct arrays) and add support in DataFusion. The benefits of this over just using JSONB in a regular binary array (and the reason we built it) are roughly: |
|
Definitely interested! |
Cross posting. There are some interesting ideas in #9103 (comment) |
I think there are two major themes here:

Theme 1: how to query such semi-structured data

DataFusion today supports the Arrow type system, which, while powerful, can not describe arbitrary JSON schemas. For example, you have to know up front what all the possible fields are for a StructArray. Systems like Postgres handle this by adding a new type and operators (see the Postgres docs), for example the containment and existence operators used below:

-- This array contains the primitive string value:
SELECT '["foo", "bar"]'::jsonb @> '"bar"'::jsonb;
-- String exists as array element:
SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar';
-- String exists as object key:
SELECT '{"foo": "bar"}'::jsonb ? 'foo';
-- Object values are not considered:
SELECT '{"foo": "bar"}'::jsonb ? 'bar'; -- yields false Theme 2: how to store such data efficientlyOnce we have some sort of idea how to query the data then there is the question about how to store / implement those queries efficiently. I think that is where #7845 (comment) from @thinkharderdev and ArroyoSystems/arroyo#477 from @mwylde are coming from Suggestions on paths forwardI personally think it may be tough to reach a consensus on storing JSON vs BSON vs JSONA as they seem to have different usecases and pros/cons (interop vs performance for example) However, I could easily see this supported as "extension types" (basically views over StringArray/BinaryArray). I think we could implement such support today using functions, albiet with a poor UX. Like why can't we do something like -- create jsonb constant from a json string, like `make_json`
select make_jsonb('{"foo": "bar"}');
-- create json string from jsonb
select json_from_jsonb(<binary json>);
-- test if data in binary_column contains the structure {foo: bar}
select jsonb_contains(binary_column, make_jsonb('{"foo": "bar"}'));
🤔 |
Our immediate concern (which motivated our json extension type and the changes in https://github.com/ArroyoSystems/arrow-rs/tree/49.0.0/json) is being able to support partial deserialization/serialization of JSON. The Arrow type system is necessarily limited compared to the schemas enabled by, for example, json-schema, and for a system like ours it's important to be able to handle real-world JSON with arbitrarily complex schemas. We do that by taking a json schema and attempting to convert it into Arrow as far as that's possible; when we encounter a subschema that isn't representable as Arrow, we use utf8 with the json extension type. Then, with those arrow-json changes, we're able to partially deserialize/serialize the arrow-schema fields while leaving the unsupported fields as string-encoded JSON. Similarly, for tables defined via SQL DDL, we support an analogous mechanism. We would be interested in more native JSON support, particularly if we could parse the JSON once and store it in some form that enables more efficient JSON functions (which seems to be the direction that @thinkharderdev is going in with the tape representation). |
How do people query such types? Do you have native operator support (like Postgres)? BTW, adding something new to the Arrow spec is possible, but it is quite a process; there is a slightly less intense path as well. However, I think it might be good to figure out how to support this initially and then consider trying to standardize. It isn't clear to me that there is a consensus on the actual storage / representation yet. It does seem there is a consensus on the need to be able to access / query types with an arbitrary structure. |
We support a few JSON functions (https://doc.arroyo.dev/sql/scalar-functions#json-functions) for querying JSON data, and for more complex needs users can write Rust UDFs with serde_json. In practice both approaches seem pretty widely used. |
From a user's standpoint, I've run into this now with DataFusion SQL. As a SQL user I am hurting mostly from the lack of a get_json_object() function to extract keys from within strings. For example, Delta Lake has no native JSON type, so my applications rely on String columns which are known a priori to contain JSON data, and then a common query pattern might be:

SELECT id, get_json_object(ctx_data, '$.name') as name FROM sometable WHERE ds = current_date()

Where the schema might be:
In the meantime I might just shove a UDF into place. |
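In case it helps anyone in the same situation, here is a rough sketch of what the body of such a stop-gap scalar function could look like (get_json_object here is a hypothetical name, not an existing DataFusion function; the path handling is simplified; serde_json is assumed as a dependency; and it would still need to be wrapped in a ScalarUDF and registered on the SessionContext):

use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, StringArray};
use datafusion::common::cast::as_string_array;
use datafusion::common::ScalarValue;
use datafusion::error::Result;
use datafusion::logical_expr::ColumnarValue;

/// Hypothetical stop-gap: extract a value from a JSON string column using a
/// simplified '$.a.b' path, returning Utf8 (null when the path is missing).
fn get_json_object(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    // First argument: the column of JSON strings.
    let json_values = match &args[0] {
        ColumnarValue::Array(array) => as_string_array(array)?,
        _ => panic!("first argument: expected a string array"),
    };
    // Second argument: the path literal, e.g. '$.name'.
    let path = match &args[1] {
        ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
        _ => panic!("second argument: expected a string literal"),
    };
    let keys: Vec<&str> = path.trim_start_matches("$.").split('.').collect();

    let array: StringArray = json_values
        .iter()
        .map(|json| {
            let mut value: serde_json::Value = serde_json::from_str(json?).ok()?;
            for key in &keys {
                value = value.get_mut(*key)?.take();
            }
            match value {
                serde_json::Value::Null => None,
                serde_json::Value::String(s) => Some(s),
                other => Some(other.to_string()),
            }
        })
        .collect();
    Ok(ColumnarValue::Array(Arc::new(array) as ArrayRef))
}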
I'd very much like it if there were jsonb or json support. I want to use the pg_analytics extension and they use datafusion. |
I think we are now pretty close to being able to support JSON / JSONB via scalar functions. The basic idea might be:
This is basically how array function support is now implemented. Here is the library of array functions: https://github.com/apache/arrow-datafusion/tree/main/datafusion/functions-array/src, and here is the rewrite. |
FYI this topic came up in our first meetup #8522 |
@alamb if you're interested in JSON parsing support, I might be interested in contributing. We (Pydantic) maintain a very fast Rust JSON parser (generally significantly faster than serde-json), jiter, which I think would be perfect for this case as it allows you to iterate over the JSON rather than materialize the data before querying it. (Jiter is pretty stable and included in pydantic-core, meaning it's the most downloaded non-std JSON parser for Python.) I did a brief experiment implementing a json_contains function with it:

fn json_contains(args: &[ColumnarValue]) -> Result<ColumnarValue> {
// first argument is the array of json values
let json_values = match &args[0] {
// ColumnarValue::Array(array) => as_generic_binary_array::<i32>(array),
ColumnarValue::Array(array) => as_string_array(array),
_ => panic!("first argument: unexpected argument type"),
};
// second argument is the key name
let key_name = match &args[1] {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
_ => panic!("second argument: unexpected argument type"),
};
let array = json_values
.iter()
.map(|json| {
if let Some(json) = json {
Some(jiter_json_contains(json.as_bytes(), &key_name))
} else {
None
}
})
.collect::<BooleanArray>();
Ok(ColumnarValue::from(Arc::new(array) as ArrayRef))
}
fn jiter_json_contains(json_data: &[u8], expected_key: &str) -> bool {
let mut jiter = Jiter::new(json_data, false);
let first_key = match jiter.next_object() {
Ok(Some(key)) => key,
_ => return false,
};
if first_key == expected_key {
return true;
}
// TODO we should skip over the next value, rather than fully parse it, needs a small change to jiter
if jiter.next_value().is_err() {
return false;
}
while let Ok(Some(key)) = jiter.next_key() {
if key == expected_key {
return true;
}
// TODO we should skip over the next value, rather than fully parse it, needs a small change to jiter
if jiter.next_value().is_err() {
return false;
}
}
false
}

Using that and querying a set of parquet files (28 GB), even without further optimisation, I got these results:

--------------------- count(*) ---------------------
-- duckdb:
SELECT count(*) FROM records_parquet_local;
-- 102275349 in 99ms
-- datafusion:
SELECT count(*) FROM records;
-- 102275349 in 58ms
--------------------- ilike span_name ---------------------
-- duckdb:
SELECT count(*) FROM records_parquet_local where span_name ilike 'converting traces to database%';
-- 3151751 in 1.805s
-- datafusion:
SELECT count(*) FROM records where span_name ilike 'converting traces to database%';
-- 3151751 in 1068ms
--------------------- ilike span_name, list attributes ---------------------
-- duckdb:
SELECT attributes FROM records_parquet_local where span_name ilike 'converting traces to database%';
-- 7.6s
-- datafusion:
SELECT attributes FROM records where span_name ilike 'converting traces to database%';
-- 13.9s
--------------------- json_contains attributes ---------------------
-- duckdb
SELECT count(*) FROM records_parquet_local where list_contains(json_keys(attributes), 'size');
-- 6165747 in 11.5s
-- datafusion
SELECT count(*) FROM records where json_contains(attributes, 'size');
-- 6165747 in 18.0s

The slowdown relative to duckdb seems to come mostly from reading the attributes column rather than from json_contains itself (compare the "list attributes" query above). Of course, if you want to use an alternative storage structure for a custom representation, that should be possible too. |
For what it’s worth I think having the ability to performantly parse JSON stored as a String or Binary is valuable in and of itself. You don’t always control how the data gets stored. If a binary format improves performance that’s a great option for when you can control the storage format. |
Tiny update to my example above: I realised there's a much better comparison query.

-- datafusion
SELECT count(*) FROM records where json_contains(attributes, 'size');
-- 6165747 in 18.0s
-- datafusion
SELECT count(*) FROM records where attributes like '%"size":%';
-- 6165747 in 15.4s
-- datafusion
SELECT count(*) FROM records where attributes ilike '%"size":%';
-- 6165747 in 22.6s

So even unoptimised, jiter-based json_contains is roughly competitive with a naive LIKE / ILIKE string match. |
Nice -- thank you for the offer and information @samuelcolvin and @adriangb.

High level proposal

I think it would initially be possible to implement JSON support using the existing DataFusion APIs (and implement it in a separate crate). I think @philippemnoel is also interested in such an endeavor as well. Using the existing extension APIs would both:
While I don't think I personally have the bandwidth to help implement the JSON functionality, I think the API design is critical to the success of DataFusion and I would be very interested in helping make it happen.

Specific proposal

So, in this case the solution might look like:
Then using the JSON functionality would look something like:

let ctx = SessionContext::new();
datafusion_json_functions::register_all(&ctx)?;
let results = ctx
    .sql("SELECT count(*) FROM records where json_contains(attributes, 'size')")
    .await?
    .collect()
    .await?;

Next Steps

Once we have that working, I think we could then discuss with the community whether we should incorporate this set of functions into the main DataFusion repo for maintenance (given the demand, I suspect this would be a popular idea).

New Repo

If anyone is interested: I created https://github.com/datafusion-contrib/datafuison-functions-json in the datafusion-contrib organization. I can add you as admin if you are interested. |
This is great news @alamb. Just to confirm, are you happy with using jiter? If so, I'm happy to help both with implementing the functions and with making some alterations in jiter to make things faster. Feel free to invite me, or I can just submit PRs. |
If you are going to write the code, I think you should decide how to do so :) If we get to the point where we want to propose moving the code from datafusion-contrib into the main DataFusion repo, we can revisit the choice then.
That is a great idea. I suspect that will make both DataFusion and jiter better
I invited you -- I don't think I will have time to work on / review PRs at the moment, but I suspect others on this ticket may. Please chime in if you do. |
Hi @alamb, we (GreptimeDB) are also planning to support JSON. We are participating in the OSPP (Open Source Promotion Plan), similar to Google Summer of Code, and are currently recruiting students to implement this feature. We aim to find an efficient and elegant solution for storing JSON data, and as part of this effort we also plan to evaluate various storage implementations of JSON formats in the OLAP scenario. Thanks to @thinkharderdev's work; JSONA greatly inspired me. I propose a JSONA variant which may work well in our scenario.

A naive proposal of a JSONA variant

For JSON:
The |
Thanks @WenyXu -- sounds very neat. FWIW, I think @samuelcolvin is also thinking about the representation in datafusion-contrib/datafusion-functions-json#2 |
https://github.com/datafusion-contrib/datafusion-functions-json now provides the following methods. I think we're nearly ready for a first release; see datafusion-contrib/datafusion-functions-json#5.
Cast expressions on json_get are rewritten to the equivalent typed functions, so

select * from foo where json_get(attributes, 'bar')::string='ham'

will be rewritten to:

select * from foo where json_get_str(attributes, 'bar')='ham'

Oh, and as per the micro-benchmarks on pydantic/jiter#84, the performance of these methods should now be on a par with duckdb. @alamb, you might want to take a look at the comparison against serde-json (which I believe arrow-json uses). |
@samuelcolvin this is awesome! I'm very excited to try it in Arroyo. |
I've done some further digging into fast querying of semi-structured data; in particular, I have a prototype library "batson" (binary alternative to (J)SON), pydantic/jiter#136. Batson is heavily inspired by Postgres's JSONB type and Snowflake's Variant type (see datafusion-contrib/datafusion-functions-variant#11), and the Apache initiative Open Variant. The main advantages of batson are that it maintains the order of items in objects and handles large ints. Performance:
(For reference, and for those wondering why datafusion-functions-json doesn't use serde-json: batson is 106x and 588x faster than serde-json in the same cases, respectively.)

So far so good. The problem is that, end to end, batson isn't actually that much faster than the existing JSON-string functions. |
The improvement would be even smaller in cases (like ours) where the data is actually being read from an object store. The main difference is that decompression takes up the lion's share of the time when reading parquet files locally. Running a query against the attributes column:

In memory:
From parquet:

(Happy to share the code I used, if it helps anyone.) We can see why that is using samply:
So making batson faster, or implementing it at all, isn't immediately that useful. Simple data shredding (as described in Open Variant here) won't help, I believe, since DataFusion will fetch and decompress the body column if it is included in the query, even if no rows in the body column are accessed.

Ways forward:
Any pointers from those who understand DataFusion better than me (👀 @alamb @andygrove @jayzhan211) on the best way forward would be much appreciated. |
What is the explain plan for this query? Maybe we could push down the filter. |
This seems like the right idea to pursue to me. The pushdown_filters option may help here
(we have it turned on at InfluxData), but it is not turned on by default as it isn't faster for all cases yet: #12524 |
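For anyone wanting to experiment, here is a minimal sketch of turning the parquet filter-pushdown options on programmatically (the table path is hypothetical, and a json_contains UDF such as the one from datafusion-functions-json is assumed to be registered; the corresponding config keys are datafusion.execution.parquet.pushdown_filters and reorder_filters):

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Evaluate pushed-down predicates while decoding parquet row groups
    // instead of materializing whole batches first (off by default).
    let mut config = SessionConfig::new();
    config.options_mut().execution.parquet.pushdown_filters = true;
    config.options_mut().execution.parquet.reorder_filters = true;

    let ctx = SessionContext::new_with_config(config);
    // The same options can be set via SQL:
    //   SET datafusion.execution.parquet.pushdown_filters = true;

    ctx.register_parquet("records", "records/", ParquetReadOptions::default())
        .await?;

    // Assumes a json_contains UDF has been registered on `ctx`.
    let df = ctx
        .sql("SELECT count(*) FROM records WHERE json_contains(attributes, 'size')")
        .await?;
    df.show().await?;
    Ok(())
}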
I've shared the code I'm using to run these tests: https://github.com/samuelcolvin/batson-perf

Unfortunately, setting pushdown_filters didn't resolve it. With it enabled, overall decompression still dominates. You can view this flamegraph properly here. Since these flamegraphs run right to left (you can see when functions were actually running by selecting them and viewing the timeline), the chronology seems to be:
My guess is that |
I think this falls into the category of "pushdown_filters still needs more optimization so it doesn't slow things down" -- it is a matter of engineering I think (probably preventing the same column from being decoded more than once, e.g. apache/arrow-rs#6454 (comment)). |
I think this is a general issue with low-selectivity filters pushed down to the parquet scan. How the row filtering works now is that the column will be decoded once to execute the filter and then decoded again to produce the output batches. If the decoding time is non-trivial (eg it requires zstd decompression of a lot of data) and the filter is not particularly selective then the redundant decoding can easily more than offset the cost of just materializing the whole column and filtering. When the row filtering was initially implemented we discussed keeping the decoded data cached but ended up deciding against it because it can potentially consume a lot of memory |
Yes I would like to revisit this assumption eventually but I need to study the code more to convince myself it is possible |
FWIW arrow-json doesn't make use of serde-json for reading or writing anymore - apache/arrow-rs#3479 apache/arrow-rs#5318 |
Is there any way to do per-batch rewrites, or otherwise modify how an operator gets applied? The "traditional" way of getting performance from JSON data in an analytical system is to dynamically create columns for keys (e.g. ClickHouse). For our use case that breaks down because the data is extremely variable, e.g. the same key can hold completely different types in different places. |
FWIW DataFusion has pretty strong assumptions that the schema matches (aka inserting extra columns would likely be challenging). One thing to potentially look into is building an "inverted index" externally, maybe 🤔 That is basically "state of the art" in text search / search engines 🤔 |
Is your feature request related to a problem or challenge?
DataFusion does not support a JSON/JSONB datatype. Is there a plan to support it in the future?
Describe the solution you'd like
support JSON/JSONB data
Describe alternatives you've considered
Treat string/binary data as json/jsonb in json functions or json operators.
Additional context
json functions in postgres: functions-json