-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature](hive)support hive catalog read json table. #43469
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
@@ -332,6 +336,11 @@ class DataTypeSerDe { | |||
Arena& mem_pool, int row_num) const; | |||
virtual Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const; | |||
|
|||
virtual DataTypeSerDeSPtrs get_nested_serdes() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can support
get_key_serde()
get_value_serde()
for map
get_data_serde()
for array nullcolumn
get_serde(int)
for struct
not get_nested_serdes()
for any complex type
TeamCity be ut coverage result: |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
TeamCity be ut coverage result: |
for (auto* slot_desc : slot_descs) { | ||
if (_is_hive_table) { | ||
//don't like _fuzzy_parse,each line read in must modify name_map once. | ||
for (auto* v : slot_descs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will seriously impact the performance.
What if we set a session variable to control it? And by default, in most cases, there is no duplicate column name in json value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we not only need to deal with the same column names, but also the order of the table column names and the json data. I think I can simplify this double for loop by using a layer of for loop plus map.
if (slot_desc->is_nullable()) { | ||
vectorized::IColumn* data_column_ptr = column_ptr; | ||
DataTypeSerDeSPtr data_serde = serde; | ||
vectorized::DataTypeSerDe::FormatOptions _options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this _options
can be created only once at reuse it in each call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah , i will fix it.
*value, "Json value is null, but the column `{}` is not nullable.", | ||
slot_desc->col_name(), valid)); | ||
return Status::OK(); | ||
} else if (type_desc.type == TYPE_STRUCT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use switch ... case
for following logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need check _is_load
and type_desc
, It seems that there is no way to use switch
.
} else { | ||
return Status::InternalError<false>("It should not here."); | ||
// Maybe we can `switch (value->GetType()) case: kNumberType`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using switch (value->GetType())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if using switch (value->GetType())
to get the string behavior would improve performance, but I might give it a try.
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
c10850b
to
b478158
Compare
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by at least one committer and no changes requested. |
PR approved by anyone and no changes requested. |
@@ -1040,7 +1178,8 @@ Status NewJsonReader::_read_one_message(std::unique_ptr<uint8_t[]>* file_buf, si | |||
} | |||
// ---------SIMDJSON---------- | |||
// simdjson, replace none simdjson function if it is ready | |||
Status NewJsonReader::_simdjson_init_reader() { | |||
Status NewJsonReader::_simdjson_init_reader(bool is_load) { | |||
_is_load = is_load; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already set in Status NewJsonReader::init_reader
.
And not need to pass it from function parameter
@@ -292,6 +296,18 @@ class NewJsonReader : public GenericReader { | |||
std::unordered_map<std::string, std::string> _col_default_value_map; | |||
|
|||
int32_t skip_bitmap_col_idx {-1}; | |||
|
|||
bool _is_load = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment in code to describe this field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
7b9904d
to
5e9e0ad
Compare
run buildall |
TPC-H: Total hot run time: 39819 ms
|
TeamCity be ut coverage result: |
TPC-DS: Total hot run time: 196362 ms
|
ClickBench: Total hot run time: 32.03 s
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by at least one committer and no changes requested. |
Problem Summary: Support reading json format hive table like: ```mysql mysql> show create table basic_json_table; CREATE TABLE `basic_json_table`( `id` int, `name` string, `age` tinyint, `salary` float, `is_active` boolean, `join_date` date, `last_login` timestamp, `height` double, `profile` binary, `rating` decimal(10,2)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' ``` Behavior changed: To implement this feature, this pr modifies `new_json_reader`. Previously, `new_json_reader` could only insert data into columnString. In order to support inserting data into columns of other types, `DataTypeSerDe` is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables. Limitation of Use: 1. Currently, only query is supported, and writing is not supported. 2. Currently, only the `ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For some properties specified in `with serdeproperties`, Doris does not take effect. 3. Since Hive does not allow columns with the same name but different case when creating a table in Json format (including inside a Struct), we convert the field names in the Json data to lowercase when reading the Json data file, and then match according to the lowercase field names. For field names that are duplicated after being converted to lowercase in the data, the value of the last field is used (consistent with Hive behavior). example: ``` create table json_table( column int )ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; a.json: {"column":1,"COLumn",2,"COLUMN":3} {"column":10,"COLumn",20} {"column":100} in Hive : load a.json to table json_table in Doris query: --- 3 20 100 --- ``` Todo(in next pr): Merge `serde` and `json_reader` ,because they have logical conflicts. Hive catalog support read json format table.
Problem Summary: Support reading json format hive table like: ```mysql mysql> show create table basic_json_table; CREATE TABLE `basic_json_table`( `id` int, `name` string, `age` tinyint, `salary` float, `is_active` boolean, `join_date` date, `last_login` timestamp, `height` double, `profile` binary, `rating` decimal(10,2)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' ``` Behavior changed: To implement this feature, this pr modifies `new_json_reader`. Previously, `new_json_reader` could only insert data into columnString. In order to support inserting data into columns of other types, `DataTypeSerDe` is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables. Limitation of Use: 1. Currently, only query is supported, and writing is not supported. 2. Currently, only the `ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For some properties specified in `with serdeproperties`, Doris does not take effect. 3. Since Hive does not allow columns with the same name but different case when creating a table in Json format (including inside a Struct), we convert the field names in the Json data to lowercase when reading the Json data file, and then match according to the lowercase field names. For field names that are duplicated after being converted to lowercase in the data, the value of the last field is used (consistent with Hive behavior). example: ``` create table json_table( column int )ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; a.json: {"column":1,"COLumn",2,"COLUMN":3} {"column":10,"COLumn",20} {"column":100} in Hive : load a.json to table json_table in Doris query: --- 3 20 100 --- ``` Todo(in next pr): Merge `serde` and `json_reader` ,because they have logical conflicts. Hive catalog support read json format table.
fix coredump with new_json_reader if we read invalid data will core which source from #43469
fix coredump with new_json_reader if we read invalid data will core which source from #43469
What problem does this PR solve?
Problem Summary:
Support reading json format hive table like:
Behavior changed:
To implement this feature, this pr modifies
new_json_reader
. Previously,new_json_reader
could only insert data into columnString. In order to support inserting data into columns of other types,DataTypeSerDe
is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables.Limitation of Use:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
scenario is supported. For some properties specified inwith serdeproperties
, Doris does not take effect.example:
Todo(in next pr):
Merge
serde
andjson_reader
,because they have logical conflicts.Release note
Hive catalog support read json format table.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)