diff --git a/docs/rfcs/2024-08-06-json-datatype.md b/docs/rfcs/2024-08-06-json-datatype.md
index e3c596a78dda..52e7f1310207 100644
--- a/docs/rfcs/2024-08-06-json-datatype.md
+++ b/docs/rfcs/2024-08-06-json-datatype.md
@@ -13,10 +13,21 @@ JSON is widely used across various scenarios. Direct support for writing and que
 
 # Details
 
-## User Interface
-The feature introduces a new data type, `JSON`, for the database. Similar to the common JSON type, data is written as JSON strings and can be queried using functions.
+## Storage and Query
 
-For example:
+The type system of GreptimeDB is based on the types of arrow/datafusion; each type has a corresponding physical type from arrow/datafusion. Thus, the JSON type is built on top of the `Binary` type, reusing the current `Value` and `Vector` implementations of that type. Inside the storage layer and query engine, the JSON type behaves the same as the Binary type.
+
+This raises 2 problems: the insertion interface and the query interface.
+
+## Insertion
+
+Users commonly write JSON data as strings. Thus we need to convert between string and binary data. There are 2 ways to do this:
+
+1. MySQL and PostgreSQL servers provide auto-conversion between string and JSON data. When a string is inserted into a JSON column, the server tries to parse the string as JSON and convert it to binary data of JSON type. Non-JSON strings are rejected.
+
+2. A function `parse_json` is provided to convert a string to JSON data. The function returns binary data of JSON type. If the string is not valid JSON, the function returns an error.
+
+For example, in MySQL client:
 ```SQL
 CREATE TABLE IF NOT EXISTS test (
     ts TIMESTAMP TIME INDEX,
@@ -34,70 +45,75 @@ INSERT INTO test VALUES(
     }'
 );
 
-SELECT json_get(b, 'name') FROM test;
-+---------------------+
-| b.name              |
-+---------------------+
-| jHl2oDDnPc1i2OzlP5Y |
-+---------------------+
-
-SELECT CAST(json_get_by_paths(b, 'attributes', 'event_attributes') AS DOUBLE) + 1 FROM test;
-+-------------------------------+
-| b.attributes.event_attributes |
-+-------------------------------+
-| 49.28667                      |
-+-------------------------------+
-
+INSERT INTO test VALUES(
+    0,
+    0,
+    parse_json('{
+        "name": "jHl2oDDnPc1i2OzlP5Y",
+        "timestamp": "2024-07-25T04:33:11.369386Z",
+        "attributes": { "event_attributes": 48.28667 }
+    }')
+);
 ```
+Both statements are valid.
 
-## Storage and Query
+For the former, the conversion is done by the server, while for the latter it is done by the query engine.
 
-Data of `JSON` type is stored as JSONB format in the database. For storage layer and query engine, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings.
+## Query Interface
 
-Insertions of `JSON` goes through following steps:
+Correspondingly, users prefer JSON data to be displayed as strings. Thus we need to convert binary data back to string data. There are also 2 ways to do this: auto-conversion on MySQL and PostgreSQL servers, and the function `json_to_string`.
 
-1. Client gets JSON strings and sends it to the frontend.
-2. Frontend encode JSON strings to binary data of JSONB format and sends it to the datanode.
-3. Datanode stores binary data in the database.
+For example, in MySQL client:
 
+```SQL
+SELECT b FROM test;
+SELECT json_to_string(b) FROM test;
 ```
-Insertion:
-                          Encode                Store
-         JSON Strings ┌────────────┐ JSONB ┌────────────┐ JSONB
- client ------------->│  Frontend  │------>│  Datanode  │------> Storage
-                      └────────────┘       └────────────┘
-```
+Both queries return the JSON string.
+
+Specifically, we mark binary data of JSON type with a message in the `metadata` of its `Field` in the arrow/datafusion schema. Frontend servers can then identify JSON binary data and convert it to string data when necessary. For functions with a JSON return type, however, the metadata method is not applicable. JSON functions should therefore specify their return type explicitly, such as `json_get_int` and `json_get_float`, which return `INT` and `FLOAT` respectively.
 
-The data of `JSON` type is represented by `Binary` data type in arrow. There are 2 types of JSON queries: get JSON elements through keys and compute over JSON elements.
+## Functions
+Similar to the common JSON type, data is written as JSON strings and can be queried with functions.
 
-For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract JSON elements through keys.
+For example:
+```SQL
+CREATE TABLE IF NOT EXISTS test (
+    ts TIMESTAMP TIME INDEX,
+    a INT,
+    b JSON
+);
 
-For the latter, users need to manually specify the data type of the JSON elements for computing. Users can use `CAST` to convert the JSON elements to the specified data type. Computation without explicit conversion will result in an error.
+INSERT INTO test VALUES(
+    0,
+    0,
+    '{
+        "name": "jHl2oDDnPc1i2OzlP5Y",
+        "timestamp": "2024-07-25T04:33:11.369386Z",
+        "attributes": { "event_attributes": 48.28667 }
+    }'
+);
 
-Queries of `JSON` goes through following steps:
+SELECT json_get_int(b, 'name') FROM test;
++---------------------+
+| b.name              |
++---------------------+
+| jHl2oDDnPc1i2OzlP5Y |
++---------------------+
 
-1. Client sends query to frontend, and frontend sends it to datafusion, which is the query engine of GreptimeDB.
-2. Datafusion performs query over binray data of JSONB format, and returns binary data to frontend.
-3. If no computation is needed, frontend directly decodes the binary data to JSON strings and return it to clients.
-4. If computation is needed, the binary data is decoded and converted to the specified data type to perform computation. There's no need for further decoding in the frontend.
+SELECT json_get_float(b, 'attributes.event_attributes') FROM test;
++--------------------------------+
+| b.attributes.event_attributes  |
++--------------------------------+
+| 48.28667                       |
++--------------------------------+
 
 ```
-Queries without computation, decoding in frontend:
-                          Decode                 Query
-         JSON Strings ┌────────────┐ JSONB ┌──────────────┐ JSONB
- client <-------------│  Frontend  │<------│  Datafusion  │<------ Storage
-                      └────────────┘       └──────────────┘
-
-Queries with computation, decoding in datafusion:
-                                                                           Query
-         Data of Specified Type ┌────────────┐ Data of Specified Type ┌──────────────┐ JSONB
- client <-----------------------│  Frontend  │<-----------------------│  Datafusion  │<------ Storage
-                                └────────────┘                        └──────────────┘
-```
+And more functions can be added in the future.
 
 # Drawbacks
 
-As a general purpose data type, JSONB may not be as efficient as specialized data types for specific scenarios.
+As a general purpose JSON data type, JSONB may not be as efficient as specialized data types for specific scenarios.
 
 # Alternatives