[feature](hive)support hive catalog read json table. (apache#43469)
Problem Summary:
Support reading Hive tables stored in JSON format, for example:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
```

Behavior changed:
To implement this feature, this PR modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into `ColumnString`.
To support inserting data into columns of other types, the reader now
uses `DataTypeSerDe` to insert data into columns. To maintain
compatibility with previous versions, the changes in this PR take effect
only when reading Hive JSON tables.

Limitation of Use:
1. Currently, only queries are supported; writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. Some
properties specified in `with serdeproperties` have no effect in Doris.
3. Hive does not allow columns whose names differ only in case when
creating a table in JSON format (including inside a Struct), so Doris
converts the field names in the JSON data to lowercase when reading the
JSON data file and then matches columns by the lowercase names. If
several field names in the data become identical after lowercasing, the
value of the last such field is used (consistent with Hive behavior).
Example:
```
create table json_table(
    column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn":2,"COLUMN":3}
{"column":10,"COLumn":20}
{"column":100}
In Hive: load a.json into table json_table

Querying the table in Doris returns:
---
3
20
100
---
```
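
The matching rule in limitation 3 can be sketched as follows. This is a minimal illustration, not the code in `new_json_reader`: the helpers `to_lower_ascii` and `resolve_column` are hypothetical names, the JSON object is assumed to be already parsed into name/value pairs in file order, and values are restricted to `int` for brevity.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: lowercase an ASCII field name.
std::string to_lower_ascii(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Hypothetical helper: `fields` holds the fields of one JSON object in the
// order they appear in the file. Returns the value bound to `column_name`,
// or std::nullopt if no field matches after lowercasing.
std::optional<int> resolve_column(const std::vector<std::pair<std::string, int>>& fields,
                                  const std::string& column_name) {
    std::unordered_map<std::string, int> by_lower_name;
    for (const auto& [name, value] : fields) {
        // A later field with the same lowercase name overwrites an earlier one,
        // so the last duplicate wins.
        by_lower_name[to_lower_ascii(name)] = value;
    }
    auto it = by_lower_name.find(to_lower_ascii(column_name));
    if (it == by_lower_name.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

Applied to the three rows of `a.json`, `resolve_column(..., "column")` yields 3, 20 and 100, matching the query result shown in the example.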

Todo (in the next PR):
Merge `serde` and `json_reader`, because their logic conflicts.

The Hive catalog supports reading tables in JSON format.
hubgeter committed Dec 4, 2024
1 parent fff936b commit a21926b
Showing 22 changed files with 830 additions and 158 deletions.
2 changes: 2 additions & 0 deletions be/src/vec/data_types/serde/data_type_array_serde.h
@@ -101,6 +101,8 @@ class DataTypeArraySerDe : public DataTypeSerDe {
nested_serde->set_return_object_as_string(value);
}

virtual DataTypeSerDeSPtrs get_nested_serdes() const override { return {nested_serde}; }

private:
template <bool is_binary_format>
Status _write_column_to_mysql(const IColumn& column, MysqlRowBuffer<is_binary_format>& result,
4 changes: 4 additions & 0 deletions be/src/vec/data_types/serde/data_type_map_serde.h
@@ -95,6 +95,10 @@ class DataTypeMapSerDe : public DataTypeSerDe {
value_serde->set_return_object_as_string(value);
}

virtual DataTypeSerDeSPtrs get_nested_serdes() const override {
return {key_serde, value_serde};
}

private:
template <bool is_binary_format>
Status _write_column_to_mysql(const IColumn& column, MysqlRowBuffer<is_binary_format>& result,
2 changes: 2 additions & 0 deletions be/src/vec/data_types/serde/data_type_nullable_serde.h
@@ -99,6 +99,8 @@ class DataTypeNullableSerDe : public DataTypeSerDe {
int row_num) const override;
Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const override;

virtual DataTypeSerDeSPtrs get_nested_serdes() const override { return {nested_serde}; }

private:
template <bool is_binary_format>
Status _write_column_to_mysql(const IColumn& column, MysqlRowBuffer<is_binary_format>& result,
12 changes: 9 additions & 3 deletions be/src/vec/data_types/serde/data_type_serde.h
@@ -98,6 +98,10 @@ class IColumn;
class Arena;
class IDataType;

class DataTypeSerDe;
using DataTypeSerDeSPtr = std::shared_ptr<DataTypeSerDe>;
using DataTypeSerDeSPtrs = std::vector<DataTypeSerDeSPtr>;

// Deserialize means read from different file format or memory format,
// for example read from arrow, read from parquet.
// Serialize means write the column cell or the total column into another
@@ -332,6 +336,11 @@ class DataTypeSerDe {
Arena& mem_pool, int row_num) const;
virtual Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const;

virtual DataTypeSerDeSPtrs get_nested_serdes() const {
throw doris::Exception(ErrorCode::NOT_IMPLEMENTED_ERROR,
"Method get_nested_serdes is not supported for this serde");
}

protected:
bool _return_object_as_string = false;
// This parameter indicates what level the serde belongs to and is mainly used for complex types
@@ -374,9 +383,6 @@ inline void checkArrowStatus(const arrow::Status& status, const std::string& col
}
}

using DataTypeSerDeSPtr = std::shared_ptr<DataTypeSerDe>;
using DataTypeSerDeSPtrs = std::vector<DataTypeSerDeSPtr>;

DataTypeSerDeSPtrs create_data_type_serdes(
const std::vector<std::shared_ptr<const IDataType>>& types);
DataTypeSerDeSPtrs create_data_type_serdes(const std::vector<SlotDescriptor*>& slots);
2 changes: 2 additions & 0 deletions be/src/vec/data_types/serde/data_type_struct_serde.h
@@ -171,6 +171,8 @@ class DataTypeStructSerDe : public DataTypeSerDe {
}
}

virtual DataTypeSerDeSPtrs get_nested_serdes() const override { return elem_serdes_ptrs; }

private:
std::optional<size_t> try_get_position_by_name(const String& name) const;
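
The hunks above add a `get_nested_serdes()` accessor whose base implementation throws and which composite serdes override. The pattern can be sketched in isolation as below. This is a simplified stand-in, not Doris code: `SerDe`, `IntSerDe`, `ArraySerDe`, `MapSerDe`, `is_composite` and `count_leaf_serdes` are hypothetical names, and `std::logic_error` stands in for `doris::Exception`.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

class SerDe;
using SerDeSPtr = std::shared_ptr<SerDe>;
using SerDeSPtrs = std::vector<SerDeSPtr>;

// Simplified stand-in for a serde base class.
class SerDe {
public:
    virtual ~SerDe() = default;
    // Hypothetical flag so callers can avoid the throwing default below.
    virtual bool is_composite() const { return false; }
    // Scalar serdes have no children, so the default refuses the call.
    virtual SerDeSPtrs get_nested_serdes() const {
        throw std::logic_error("get_nested_serdes is not supported for this serde");
    }
};

class IntSerDe : public SerDe {};

class ArraySerDe : public SerDe {
public:
    explicit ArraySerDe(SerDeSPtr nested) : _nested(std::move(nested)) {}
    bool is_composite() const override { return true; }
    SerDeSPtrs get_nested_serdes() const override { return {_nested}; }

private:
    SerDeSPtr _nested;
};

class MapSerDe : public SerDe {
public:
    MapSerDe(SerDeSPtr key, SerDeSPtr value) : _key(std::move(key)), _value(std::move(value)) {}
    bool is_composite() const override { return true; }
    SerDeSPtrs get_nested_serdes() const override { return {_key, _value}; }

private:
    SerDeSPtr _key;
    SerDeSPtr _value;
};

// Walks the serde tree through get_nested_serdes() and counts scalar leaves.
std::size_t count_leaf_serdes(const SerDe& serde) {
    if (!serde.is_composite()) {
        return 1;
    }
    std::size_t leaves = 0;
    for (const auto& child : serde.get_nested_serdes()) {
        leaves += count_leaf_serdes(*child);
    }
    return leaves;
}
```

A caller that recurses into nested types, as a JSON reader for arrays, maps and structs presumably would, can descend through this accessor without knowing the concrete serde class.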
