-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Support some compress functions #47307
base: master
Are you sure you want to change the base?
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and remember to format your file
|
||
Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments, | ||
uint32_t result, size_t input_rows_count) const override { | ||
// LOG(INFO) << "Executing FunctionCompress with " << input_rows_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove these commented lines
col_data[idx] = '0', col_data[idx + 1] = 'x'; | ||
for (int i = 0; i < 4; i++) { | ||
unsigned char byte = (value >> (i * 8)) & 0xFF; | ||
col_data[idx + 2 + i * 2] = "0123456789ABCDEF"[byte >> 4]; // 高4位 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use Chinese
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and make magic values
|
||
auto st = compression_codec->compress(data, &compressed_str); | ||
|
||
if (!st.ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment about when will it fails
regression-test/suites/query_p0/sql_functions/string_functions/test_compress_uncompress.groovy
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont need modify this file anymore
std::string func_name = "compress"; | ||
InputTypeSet input_types = {TypeIndex::String}; | ||
|
||
// 压缩多个不同的字符串 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use Chinese comment
std::string uncompressed; | ||
Slice data; | ||
Slice uncompressed_slice; | ||
for (int row = 0; row < input_rows_count; row++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use size_t
, not int
illegal = 1; | ||
} else { | ||
if (data[0] != '0' || data[1] != 'x') { | ||
LOG(INFO) << "illegal: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont log info here
if (x >= 'A' && x <= 'F') return true; | ||
return false; | ||
}; | ||
auto trans = [](char x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use from_chars
and to_chars
to replace your user implemented lambdas
// Print the compressed string (after compression) | ||
// LOG(INFO) << "Compressed string at row " << row << ": " | ||
// << std::string(reinterpret_cast<const char*>(col_data.data())); | ||
col_offset[row] = col_offset[row - 1] + 10 + compressed_str.size() * 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this value for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first ten digits of the compress value are "0x" and eight digits long, followed by each digit split into two hexadecimal values
fa80a74
to
ca2b27e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the correcteness
Slice data; | ||
for (size_t row = 0; row < input_rows_count; row++) { | ||
null_map[row] = false; | ||
const auto& str = arg_column.get_data_at(row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to use virtual function here
\N \N | ||
|
||
-- !const_not_nullable -- | ||
0x05000000789C73C92FCA2C060005B00202 0x446F726973 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
carefully review your result!!!
Slice data; | ||
Slice uncompressed_slice; | ||
for (size_t row = 0; row < input_rows_count; row++) { | ||
auto check = [](char x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try to use std function firstly
const auto& str = arg_column.get_data_at(row); | ||
data = Slice(str.data, str.size); | ||
|
||
int illegal = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not bool?
unsigned char* src = compressed_str.data(); | ||
{ | ||
for (size_t i = 0; i < compressed_str.size(); i++) { | ||
col_data[idx] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so tricky here. try to improve code like it
Slice data; | ||
Slice uncompressed_slice; | ||
for (size_t row = 0; row < input_rows_count; row++) { | ||
std::function<bool(char)> check = [](char x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use isxdigit
?
be/src/util/block_compression.cpp
Outdated
@@ -854,8 +854,13 @@ class ZlibBlockCompression : public BlockCompressionCodec { | |||
Slice s(*output); | |||
|
|||
auto zres = ::compress((Bytef*)s.data, &s.size, (Bytef*)input.data, input.size); | |||
if (zres != Z_OK) { | |||
return Status::InvalidArgument("Fail to do ZLib compress, error={}", zError(zres)); | |||
if (zres == Z_MEM_ERROR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also change other same calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split them to another PR may be better
implements UnaryExpression, ExplicitlyCastableSignature, PropagateNullable { | ||
|
||
public static final List<FunctionSignature> SIGNATURES = ImmutableList.of( | ||
FunctionSignature.ret(StringType.INSTANCE).args(StringType.INSTANCE)); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
implements UnaryExpression, ExplicitlyCastableSignature, AlwaysNullable { | ||
|
||
public static final List<FunctionSignature> SIGNATURES = ImmutableList.of( | ||
FunctionSignature.ret(StringType.INSTANCE).args(StringType.INSTANCE)); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
||
unsigned int length = 0; | ||
for (size_t i = 2; i <= 9; i += 2) { | ||
unsigned char byte = (hex_ctoi.at(data[i]) << 4) + hex_ctoi.at(data[i + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove hex_ctoi
and just use from_chars
unsigned int length = 0; | ||
for (size_t i = 2; i <= 9; i += 2) { | ||
unsigned char byte = (hex_ctoi.at(data[i]) << 4) + hex_ctoi.at(data[i + 1]); | ||
length += (byte << (8 * (i / 2 - 1))); //Little Endian : 0x01000000 -> 1 |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
std::string uncompressed; | ||
Slice data; | ||
Slice uncompressed_slice; | ||
for (size_t row = 0; row < input_rows_count; row++) { |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
} | ||
idx += 10; | ||
|
||
col_data.resize(col_data.size() + 2 * compressed_str.size()); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
//Converts a hexadecimal readable string to a compressed byte stream | ||
std::string s(((int)data.size - 10) / 2, ' '); // byte stream data.size >= 10 | ||
for (size_t i = 10, j = 0; i < data.size; i += 2, j++) { | ||
s[j] = (hex_ctoi.at(data[i]) << 4) + hex_ctoi.at(data[i + 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
6ef40dd
to
b8b3bad
Compare
run buildall |
1 similar comment
run buildall |
TPC-H: Total hot run time: 32100 ms
|
TPC-DS: Total hot run time: 184954 ms
|
ClickBench: Total hot run time: 30.68 s
|
run buildall |
1 similar comment
run buildall |
TPC-H: Total hot run time: 32971 ms
|
TPC-DS: Total hot run time: 191631 ms
|
ClickBench: Total hot run time: 30.67 s
|
run buildall |
TPC-H: Total hot run time: 32328 ms
|
TPC-DS: Total hot run time: 185720 ms
|
ClickBench: Total hot run time: 30.3 s
|
run buildall |
PR approved by anyone and no changes requested. |
namespace doris::vectorized { | ||
|
||
class FunctionCompress : public IFunction { | ||
string hex_itoc = "0123456789ABCDEF"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HEX_ITOC for const data. need constexpr, better be std::array
} | ||
|
||
// first ten digits represent the length of the uncompressed string | ||
col_data.resize(col_data.size() + 10 + 2 * compressed_str.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so what the function do ? seems maybe the result bigger than before compress ? Mysql do the same thing ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, mysql does the same thing. What I do is stream the compressed bytes into a visible hexadecimal string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to change the compressed bytes into a visible hexadecimal string.
- the work maybe the result bigger than before compress
- nobody care about the content of compressed bytes, people only care the compress really compress the data and decompress can get the same result before compress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be not rery reasonable. There's reason for Mysql to behave like this.
- after compressing, the bytes in corresponding memory just a stream of bytes. so any case is possible. just interpret it as chars doesn’t keep consistency. consider a memory region of “a\b”. after printing it’s “” because ‘\b’ deletes ‘a’.
- for the compression ratio, it’s guaranteed by compression algorithm. it has a very large ratio. so even we print it as chars which would double it length, it doesn’t matter.
run buildall |
TPC-H: Total hot run time: 32303 ms
|
TPC-DS: Total hot run time: 190787 ms
|
ClickBench: Total hot run time: 30.96 s
|
TeamCity be ut coverage result: |
namespace doris::vectorized { | ||
|
||
class FunctionCompress : public IFunction { | ||
std::array<char, 16> hex_itoc = {'0', '1', '2', '3', '4', '5', '6', '7', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HEX_ITOC and constexpr and static
run buildall |
TPC-H: Total hot run time: 32141 ms
|
TPC-DS: Total hot run time: 190795 ms
|
ClickBench: Total hot run time: 30.65 s
|
TeamCity be ut coverage result: |
namespace doris::vectorized { | ||
|
||
class FunctionCompress : public IFunction { | ||
static constexpr std::array<char, 16> hex_itoc = {'0', '1', '2', '3', '4', '5', '6', '7', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please keep constexpr UPPER CASE
run buildall |
TPC-H: Total hot run time: 32399 ms
|
TPC-DS: Total hot run time: 191844 ms
|
ClickBench: Total hot run time: 31.38 s
|
TeamCity be ut coverage result: |
124faf8
to
42df82b
Compare
run buildall |
TeamCity be ut coverage result: |
What problem does this PR solve?
Added the compress and uncompressed functions similar to mysql
Issue Number: close #45530
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)