Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature](hive)support hive catalog read json table. #43469

Merged
merged 8 commits into from
Nov 28, 2024

Conversation

hubgeter
Copy link
Contributor

@hubgeter hubgeter commented Nov 7, 2024

What problem does this PR solve?

Problem Summary:
Support reading json format hive table like:

mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'

Behavior changed:
To implement this feature, this pr modifies new_json_reader. Previously, new_json_reader could only insert data into columnString. In order to support inserting data into columns of other types, DataTypeSerDe is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables.

Limitation of Use:

  1. Currently, only query is supported, and writing is not supported.
  2. Currently, only the ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; scenario is supported. For some properties specified in with serdeproperties, Doris does not take effect.
  3. Since Hive does not allow columns with the same name but different case when creating a table in Json format (including inside a Struct), we convert the field names in the Json data to lowercase when reading the Json data file, and then match according to the lowercase field names. For field names that are duplicated after being converted to lowercase in the data, the value of the last field is used (consistent with Hive behavior).
    example:
create table json_table(
    column int 
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---

Todo(in next pr):
Merge serde and json_reader ,because they have logical conflicts.

Release note

Hive catalog support read json format table.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter
Copy link
Contributor Author

hubgeter commented Nov 7, 2024

run buildall

Copy link
Contributor

github-actions bot commented Nov 7, 2024

clang-tidy review says "All clean, LGTM! 👍"

@@ -332,6 +336,11 @@ class DataTypeSerDe {
Arena& mem_pool, int row_num) const;
virtual Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const;

virtual DataTypeSerDeSPtrs get_nested_serdes() const {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can support
get_key_serde() get_value_serde() for map
get_data_serde() for array nullcolumn
get_serde(int) for struct
not get_nested_serdes() for any complex type

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.88% (9862/26034)
Line Coverage: 29.03% (82003/282472)
Region Coverage: 28.26% (42240/149458)
Branch Coverage: 24.83% (21419/86256)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4120ac989a900c6d6af8401251c242a6690de1df_4120ac989a900c6d6af8401251c242a6690de1df/report/index.html

@hubgeter
Copy link
Contributor Author

run buildall

@hubgeter hubgeter marked this pull request as ready for review November 10, 2024 10:19
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.91% (9868/26033)
Line Coverage: 29.07% (82263/282975)
Region Coverage: 28.23% (42347/150029)
Branch Coverage: 24.78% (21447/86550)
Coverage Report: http://coverage.selectdb-in.cc/coverage/14b5e53e374c7b4d59ae9dc746ff46a37d551ace_14b5e53e374c7b4d59ae9dc746ff46a37d551ace/report/index.html

for (auto* slot_desc : slot_descs) {
if (_is_hive_table) {
//don't like _fuzzy_parse,each line read in must modify name_map once.
for (auto* v : slot_descs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will seriously impact the performance.
What if we set a session variable to control it? And by default, in most cases, there is no duplicate column name in json value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we not only need to deal with the same column names, but also the order of the table column names and the json data. I think I can simplify this double for loop by using a layer of for loop plus map.

if (slot_desc->is_nullable()) {
vectorized::IColumn* data_column_ptr = column_ptr;
DataTypeSerDeSPtr data_serde = serde;
vectorized::DataTypeSerDe::FormatOptions _options;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this _options can be created only once at reuse it in each call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah , i will fix it.

*value, "Json value is null, but the column `{}` is not nullable.",
slot_desc->col_name(), valid));
return Status::OK();
} else if (type_desc.type == TYPE_STRUCT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use switch ... case for following logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need check _is_load and type_desc , It seems that there is no way to use switch.

} else {
return Status::InternalError<false>("It should not here.");
// Maybe we can `switch (value->GetType()) case: kNumberType`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using switch (value->GetType())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if using switch (value->GetType()) to get the string behavior would improve performance, but I might give it a try.

@hubgeter
Copy link
Contributor Author

run buildall

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter
Copy link
Contributor Author

run buildall

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

morningman
morningman previously approved these changes Nov 13, 2024
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 13, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

@@ -1040,7 +1178,8 @@ Status NewJsonReader::_read_one_message(std::unique_ptr<uint8_t[]>* file_buf, si
}
// ---------SIMDJSON----------
// simdjson, replace none simdjson function if it is ready
Status NewJsonReader::_simdjson_init_reader() {
Status NewJsonReader::_simdjson_init_reader(bool is_load) {
_is_load = is_load;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already set in Status NewJsonReader::init_reader.
And not need to pass it from function parameter

@@ -292,6 +296,18 @@ class NewJsonReader : public GenericReader {
std::unordered_map<std::string, std::string> _col_default_value_map;

int32_t skip_bitmap_col_idx {-1};

bool _is_load = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment in code to describe this field

eldenmoon
eldenmoon previously approved these changes Nov 13, 2024
Copy link
Member

@eldenmoon eldenmoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hubgeter hubgeter dismissed stale reviews from eldenmoon and morningman via 7da0fde November 13, 2024 18:04
@hubgeter
Copy link
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 13, 2024
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter hubgeter force-pushed the feature_hive_json_serde branch from 7b9904d to 5e9e0ad Compare November 26, 2024 16:48
@hubgeter
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 39819 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5e9e0ad6494ccf06a4bb615c7d5efb804f54bda6, data reload: false

------ Round 1 ----------------------------------
q1	17645	7548	7278	7278
q2	2053	181	167	167
q3	10649	1096	1191	1096
q4	10501	726	697	697
q5	7608	2716	2751	2716
q6	246	146	145	145
q7	1007	625	601	601
q8	9230	1900	1955	1900
q9	6545	6409	6448	6409
q10	7024	2305	2371	2305
q11	472	267	266	266
q12	428	212	213	212
q13	17779	3026	3017	3017
q14	243	218	217	217
q15	567	520	516	516
q16	672	601	583	583
q17	1018	607	537	537
q18	7288	6724	6653	6653
q19	1334	929	1053	929
q20	482	181	179	179
q21	4143	3228	3084	3084
q22	371	318	312	312
Total cold run time: 107305 ms
Total hot run time: 39819 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7293	7263	7221	7221
q2	331	227	231	227
q3	2913	2750	2923	2750
q4	2054	1814	1883	1814
q5	5733	5706	5674	5674
q6	221	138	136	136
q7	2251	1810	1773	1773
q8	3449	3562	3480	3480
q9	8849	8953	8932	8932
q10	3592	3564	3572	3564
q11	604	498	531	498
q12	811	614	612	612
q13	13184	3281	3159	3159
q14	321	288	275	275
q15	574	530	524	524
q16	679	634	640	634
q17	1879	1636	1606	1606
q18	8413	7726	7646	7646
q19	1684	1560	1574	1560
q20	2128	1885	1899	1885
q21	5700	5472	5354	5354
q22	662	605	563	563
Total cold run time: 73325 ms
Total hot run time: 59887 ms

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.34% (9977/26022)
Line Coverage: 29.43% (83526/283837)
Region Coverage: 28.57% (42983/150466)
Branch Coverage: 25.17% (21833/86748)
Coverage Report: http://coverage.selectdb-in.cc/coverage/5e9e0ad6494ccf06a4bb615c7d5efb804f54bda6_5e9e0ad6494ccf06a4bb615c7d5efb804f54bda6/report/index.html

@doris-robot
Copy link

TPC-DS: Total hot run time: 196362 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5e9e0ad6494ccf06a4bb615c7d5efb804f54bda6, data reload: false

query1	1285	932	923	923
query2	6237	2116	2017	2017
query3	10818	4064	4141	4064
query4	67176	28452	23494	23494
query5	4943	456	451	451
query6	407	189	179	179
query7	5500	303	297	297
query8	319	226	225	225
query9	8623	2736	2720	2720
query10	456	263	253	253
query11	16995	15149	15788	15149
query12	160	107	103	103
query13	1504	462	433	433
query14	10383	7196	7463	7196
query15	236	180	181	180
query16	7067	493	451	451
query17	1079	553	564	553
query18	1949	310	292	292
query19	215	148	144	144
query20	115	115	116	115
query21	205	103	111	103
query22	4859	4631	4560	4560
query23	34815	34305	33939	33939
query24	5528	2559	2502	2502
query25	491	385	389	385
query26	664	147	148	147
query27	1947	282	290	282
query28	4727	2536	2498	2498
query29	676	432	417	417
query30	216	158	147	147
query31	1012	814	850	814
query32	76	55	52	52
query33	469	287	292	287
query34	915	524	538	524
query35	849	724	745	724
query36	1081	1015	960	960
query37	121	79	89	79
query38	4588	4482	4415	4415
query39	1541	1476	1475	1475
query40	210	114	104	104
query41	50	47	47	47
query42	115	102	103	102
query43	554	506	494	494
query44	1151	815	818	815
query45	188	218	165	165
query46	1149	750	704	704
query47	2070	1869	1915	1869
query48	418	322	313	313
query49	750	385	418	385
query50	850	402	413	402
query51	7320	7190	7078	7078
query52	103	86	82	82
query53	257	206	176	176
query54	500	390	385	385
query55	77	80	78	78
query56	243	239	240	239
query57	1299	1161	1170	1161
query58	227	214	219	214
query59	3192	3012	3025	3012
query60	288	236	240	236
query61	111	106	103	103
query62	779	657	678	657
query63	213	181	194	181
query64	1375	652	633	633
query65	3291	3176	3202	3176
query66	703	306	317	306
query67	15985	15793	15582	15582
query68	3710	569	577	569
query69	421	256	244	244
query70	1214	1186	1128	1128
query71	368	245	252	245
query72	6374	4025	3988	3988
query73	780	363	363	363
query74	10210	9055	8999	8999
query75	3404	2697	2661	2661
query76	1798	1107	1243	1107
query77	502	308	266	266
query78	10505	9410	9429	9410
query79	2072	615	604	604
query80	1388	423	456	423
query81	533	235	234	234
query82	1273	119	121	119
query83	193	156	153	153
query84	281	72	72	72
query85	996	296	293	293
query86	413	301	297	297
query87	4769	4585	4611	4585
query88	3682	2232	2204	2204
query89	423	286	307	286
query90	1940	187	183	183
query91	133	100	103	100
query92	67	49	51	49
query93	2976	545	547	545
query94	835	296	291	291
query95	354	244	240	240
query96	629	278	282	278
query97	2864	2670	2735	2670
query98	218	201	201	201
query99	1624	1334	1307	1307
Total cold run time: 320664 ms
Total hot run time: 196362 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 32.03 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5e9e0ad6494ccf06a4bb615c7d5efb804f54bda6, data reload: false

query1	0.03	0.03	0.03
query2	0.06	0.04	0.03
query3	0.24	0.08	0.06
query4	1.63	0.10	0.11
query5	0.41	0.42	0.42
query6	1.13	0.67	0.66
query7	0.02	0.02	0.01
query8	0.04	0.03	0.03
query9	0.57	0.52	0.51
query10	0.55	0.56	0.57
query11	0.14	0.11	0.10
query12	0.14	0.11	0.12
query13	0.61	0.59	0.59
query14	2.74	2.92	2.75
query15	0.89	0.83	0.82
query16	0.38	0.38	0.37
query17	1.06	0.97	1.07
query18	0.23	0.22	0.20
query19	1.93	1.90	1.95
query20	0.01	0.01	0.01
query21	15.35	0.60	0.58
query22	3.17	2.23	1.72
query23	16.98	1.24	0.85
query24	3.14	0.81	0.52
query25	0.30	0.05	0.22
query26	0.38	0.14	0.13
query27	0.04	0.04	0.04
query28	11.46	1.10	1.08
query29	12.55	3.26	3.22
query30	0.24	0.06	0.06
query31	2.85	0.38	0.38
query32	3.29	0.46	0.49
query33	2.98	2.99	3.07
query34	16.97	4.46	4.46
query35	4.49	4.57	4.48
query36	0.66	0.49	0.48
query37	0.09	0.06	0.06
query38	0.04	0.03	0.03
query39	0.03	0.02	0.02
query40	0.15	0.12	0.12
query41	0.08	0.02	0.03
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 108.12 s
Total hot run time: 32.03 s

Copy link
Contributor

@kaka11chen kaka11chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 28, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@morningman morningman merged commit 358ed8c into apache:master Nov 28, 2024
25 of 27 checks passed
hubgeter added a commit to hubgeter/doris that referenced this pull request Dec 2, 2024
Problem Summary:
Support reading json format hive table like:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
```

Behavior changed:
To implement this feature, this pr modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into columnString.
In order to support inserting data into columns of other types,
`DataTypeSerDe` is introduced to insert data into columns. To maintain
compatibility with previous versions, changes to this pr are triggered
only when reading hive json tables.

Limitation of Use:
1. Currently, only query is supported, and writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For
some properties specified in `with serdeproperties`, Doris does not take
effect.
3. Since Hive does not allow columns with the same name but different
case when creating a table in Json format (including inside a Struct),
we convert the field names in the Json data to lowercase when reading
the Json data file, and then match according to the lowercase field
names. For field names that are duplicated after being converted to
lowercase in the data, the value of the last field is used (consistent
with Hive behavior).
example:
```
create table json_table(
    column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---
```

Todo(in next pr):
Merge `serde` and `json_reader` ,because they have logical conflicts.

Hive catalog support read json format table.
hubgeter added a commit to hubgeter/doris that referenced this pull request Dec 4, 2024
Problem Summary:
Support reading json format hive table like:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
```

Behavior changed:
To implement this feature, this pr modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into columnString.
In order to support inserting data into columns of other types,
`DataTypeSerDe` is introduced to insert data into columns. To maintain
compatibility with previous versions, changes to this pr are triggered
only when reading hive json tables.

Limitation of Use:
1. Currently, only query is supported, and writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For
some properties specified in `with serdeproperties`, Doris does not take
effect.
3. Since Hive does not allow columns with the same name but different
case when creating a table in Json format (including inside a Struct),
we convert the field names in the Json data to lowercase when reading
the Json data file, and then match according to the lowercase field
names. For field names that are duplicated after being converted to
lowercase in the data, the value of the last field is used (consistent
with Hive behavior).
example:
```
create table json_table(
    column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---
```

Todo(in next pr):
Merge `serde` and `json_reader` ,because they have logical conflicts.

Hive catalog support read json format table.
eldenmoon pushed a commit that referenced this pull request Jan 2, 2025
fix coredump with new_json_reader if we read invalid data will core
which source from #43469
github-actions bot pushed a commit that referenced this pull request Jan 2, 2025
fix coredump with new_json_reader if we read invalid data will core
which source from #43469
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.x-experimental dev/3.0.4-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants