Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet): Add boolean rle decoder for Parquet #11282

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jkhaliqi
Copy link
Contributor

@jkhaliqi jkhaliqi commented Oct 16, 2024

RLE/BP is an Encoding for Boolean values for Parquet Version 2 files.
https://parquet.apache.org/docs/file-format/data-pages/encodings/
Fixes: #10943

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 16, 2024
Copy link

netlify bot commented Oct 16, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit d1d1dcd
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67916f98ae810c0008c65cc7

@Yuhta Yuhta changed the title added in rle encoder for boolean added in rle decoder for boolean Oct 17, 2024
@Yuhta Yuhta self-requested a review October 17, 2024 21:04
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 358855b to f638683 Compare October 29, 2024 19:22
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 3 times, most recently from f5735ba to 163bdb3 Compare October 30, 2024 18:03
@jkhaliqi jkhaliqi marked this pull request as ready for review October 30, 2024 18:17
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from a8f69b9 to c25886f Compare October 30, 2024 19:24
@minhancao minhancao force-pushed the rle_encoding branch 2 times, most recently from e8e82c4 to c760214 Compare November 1, 2024 20:45
@ethanyzhang
Copy link

@yingsu00 can you also take a look at this PR? Thank you!

@majetideepak majetideepak changed the title added in rle decoder for boolean feat: Add boolean rle decoder for Parquet Nov 12, 2024
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 99e8783 to 4b8412d Compare November 12, 2024 22:32
public:
using super = RleBpDecoder;
RleBooleanDecoder(const char* start, const char* end, int32_t& len)
: super::RleBpDecoder{start + 4, end, 1} {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 4 is used multiple times. Please make it a static const here and use an appropriate name for it.

You don't need super:: here. There is no ambiguity here.

"Received invalid length : " + std::to_string(len) +
" (corrupt data page?)");
}
// num_bytes will be the first 4 bytes that tell us the length of encoded
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do we need this comment once the 4 is not magic number anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right removing comment since it is not necessary, thank you!


template <bool hasNulls>
inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) {
if (hasNulls) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a constexpr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to constexpr, thank you!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.

if constexpr (hasNulls)

That means if the template argument is false this if expression is not generated.

numValues = bits::countNonNulls(nulls, current, current + numValues);
}

super::skip(numValues);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be RleBpDecoder::skip(numValues) to disambiguate the function from this->skip(numValues)?

int32_t current = visitor.start();

skip<hasNulls>(current, 0, nulls);
int32_t toSkip;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also initialize it to 0.

int32_t toSkip;
bool atEnd = false;
const bool allowNulls = hasNulls && visitor.allowNulls();
std::vector<uint64_t> outputBuffer(20);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the size of the vector 20?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry was using this output buffer for some other testing since it's not being used anymore will delete this line of code

++current;
if (toSkip) {
skip<hasNulls>(toSkip, current, nulls);
current += toSkip;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no problem if toSkip > 0 but we already advanced current by 1 on line 97? I suppose this someting about what visitor represents? This might need a comment to explain why this is ok.

Or maybe some comment on how the algorithm works when the read occurs.

}

const char* bufferStart_;
uint32_t num_bytes = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be named numBytes_.

int64_t readBitField() {
auto value =
dwio::common::safeLoadBits(
super::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) &
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment above about using super vs the base class name to disambiguate.


class RleBooleanDecoder : public RleBpDecoder {
public:
using super = RleBpDecoder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you've replaced super with the actual base class we don't need this anymore. I did not see that you defined super here. This explains why it was working before. But someone not familiar with Java would be confused so better to be explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you removed this line!

RleBooleanDecoder(const char* start, const char* end, int32_t& len)
: RleBpDecoder{start + kLengthOffset, end, 1} {
if (len < kLengthOffset) {
VELOX_FAIL(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets replace the std::to_string with fmt provided in the VELOX_FAIL like so:

VELOX_FAIL("Received invalid length : {} (corrupt data page?)", len);

for all occurrences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, updated!


template <bool hasNulls>
inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) {
if (hasNulls) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.

if constexpr (hasNulls)

That means if the template argument is false this if expression is not generated.

public:
using super = RleBpDecoder;
static constexpr int32_t kLengthOffset = 4;
RleBooleanDecoder(const char* start, const char* end, int32_t& len)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len is not modified here and we don't need a reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thank you!

RleBpDecoder::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) &
bitMask_;
bitOffset_ += bitWidth_;
RleBpDecoder::bufferStart_ += bitOffset_ >> 3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is not clear to me is why you modify the base class member here. You have your own bufferStart_ so why not processing and modifying this using the base class methods (which you have to some degree).
The base class member (with the same name) is initialized in the constructor. But because you declared a new member of the same name that is never used on line 118 you need to explicitly refer to the base class here when this member is inherited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch forgot to remove my own bufferStart_ which was being used for something else cleaned up the code with removing it, thank you!

@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 6bbaab0 to 27ad3f1 Compare November 23, 2024 00:09
Copy link
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good now.
Please update the commit and PR abstract (header line) to have feat(parquet):.

@majetideepak please take a look when you get a chance.

--remainingValues_;
}
// Will increment the current by one and if value of toSkip > 0
// We will count the number of non nulls for the bpdecoding and skip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the formatting.
The comment says counting the number of non-null values - but it might as well nulls because hasNulls is the template that could be called either way.
I notice that is code from other decoders as well. Basically, if it is determined something can be skipped after processing do it.
Perhaps the comment isn't necessary after all.

@jkhaliqi jkhaliqi changed the title feat: Add boolean rle decoder for Parquet feat(parquet): Add boolean rle decoder for Parquet Dec 9, 2024
Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high-level comments.

@@ -700,6 +700,16 @@ void PageReader::makeDecoder() {
break;
}
FMT_FALLTHROUGH;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for case Encoding::DELTA_BYTE_ARRAY: to go to default:.
This change breaks that.
Move case Encoding::RLE: before case Encoding::DELTA_BYTE_ARRAY:

pageData_, pageData_ + encodedDataSize_, encodedDataSize_);
break;
default:
VELOX_UNSUPPORTED("RLE decoder only supports boolean");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BOOLEAN

TEST_F(E2EFilterTest, booleanRle) {
options_.enableDictionary = false;
options_.encoding = facebook::velox::parquet::arrow::Encoding::RLE;
options_.parquetDataPageVersion = "V2";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this is set to V1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if this is set to V1 the test will still run and pass, since the encoding type will be facebook::velox::parquet::arrow::Encoding::RLE. But in the real case this file type would not be possible to create since RLE encoding for boolean columns is only a version 2 feature, and for version 1 they use the bit packing instead of RLE. So I guess the case where encoding is RLE and version is 1 never exists so we do not need to worry if that would ever be V1 with RLE encoding

@@ -108,7 +108,7 @@ struct WriterOptions : public dwio::common::WriterOptions {
/// Timestamp time zone for Parquet write through Arrow bridge.
std::optional<std::string> parquetWriteTimestampTimeZone;
bool writeInt96AsTimestamp = false;

std::optional<std::string> parquetDataPageVersion = std::nullopt;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this an enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Writer.h and Writer.cpp changes come from #11151. Should I just cherry-pick that PR for this as well?

Copy link
Collaborator

@czentgr czentgr Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is probably an argument to get this PR in. It does have the enum for the datapage version.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just copy the enum definition from that PR here. We don't need the default in that PR. Just V1, V2.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Christian it does make sense to get #11151 in first. I will review that.

@@ -1123,6 +1123,150 @@ TEST_F(ParquetTableScanTest, deltaByteArray) {
assertSelect({"a"}, "SELECT a from expected");
}

TEST_F(ParquetTableScanTest, rleBoolean) {
loadData(
getExampleFilePath("rleboolean.parquet"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was the file rleboolean.parquet generated? Do we need this file? Can't we use the writer?

Copy link
Contributor Author

@jkhaliqi jkhaliqi Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I created rleboolean.parquet file through presto java. This file needs to be in fbvelox/examples in order to run the test and can be copied into that folder since it is currently located in velox/velox/dwio/parquet/tests/examples/rleboolean.parquet. I was just following the same format of creating the file and running the tests as the rest of the tests in this file are doing

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Velox, we can use the arrow writer to test RLE. We can test presto Java in the Prestissimo E2E tests. Let's remove this file here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you removed the rleboolean.parquet file and using the writer instead for this test

@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from ecd8060 to 0784da6 Compare January 14, 2025 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add decoder for RLE parquet encoding
6 participants