feat(parquet): Add boolean rle decoder for Parquet #11282

jkhaliqi · 2024-10-16T21:28:59Z

RLE/BP (the run-length encoding / bit-packing hybrid) is an encoding for boolean values in Parquet version 2 files.
https://parquet.apache.org/docs/file-format/data-pages/encodings/
Fixes: #10943
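
For context, here is a minimal sketch of how a length-prefixed RLE/bit-packed hybrid stream with bit width 1 can be decoded, based on the Parquet encodings page linked above. It is not the decoder added in this PR: `decodeRleBooleans` and `kLengthPrefixBytes` are hypothetical names, and the sketch assumes a 4-byte little-endian length prefix followed by runs whose headers are ULEB128 varints.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not Velox code): decodes `numValues` booleans from a
// length-prefixed RLE/bit-packed hybrid buffer, assuming bit width 1.
inline std::vector<bool> decodeRleBooleans(
    const uint8_t* data, size_t size, size_t numValues) {
  constexpr size_t kLengthPrefixBytes = 4;
  if (size < kLengthPrefixBytes) {
    throw std::runtime_error("buffer too small for length prefix");
  }
  // 4-byte little-endian length of the encoded runs that follow.
  uint32_t encodedLength = 0;
  for (size_t i = 0; i < kLengthPrefixBytes; ++i) {
    encodedLength |= static_cast<uint32_t>(data[i]) << (8 * i);
  }
  if (encodedLength > size - kLengthPrefixBytes) {
    throw std::runtime_error("invalid length (corrupt data page?)");
  }
  const uint8_t* pos = data + kLengthPrefixBytes;
  const uint8_t* end = pos + encodedLength;

  std::vector<bool> out;
  out.reserve(numValues);
  while (out.size() < numValues) {
    // Run header: ULEB128 varint.
    uint64_t header = 0;
    int shift = 0;
    while (true) {
      if (pos == end || shift > 63) {
        throw std::runtime_error("truncated or invalid run header");
      }
      const uint8_t byte = *pos++;
      header |= static_cast<uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    if (header & 1) {
      // Bit-packed run: (header >> 1) groups of 8 values. At bit width 1,
      // each group is one byte, least significant bit first.
      const uint64_t groups = header >> 1;
      for (uint64_t g = 0; g < groups && out.size() < numValues; ++g) {
        if (pos == end) {
          throw std::runtime_error("truncated bit-packed run");
        }
        const uint8_t packed = *pos++;
        for (int bit = 0; bit < 8 && out.size() < numValues; ++bit) {
          out.push_back(((packed >> bit) & 1) != 0);
        }
      }
    } else {
      // RLE run: (header >> 1) repetitions of a value stored in one byte.
      const uint64_t runLength = header >> 1;
      if (pos == end) {
        throw std::runtime_error("truncated RLE run");
      }
      const bool value = (*pos++ & 1) != 0;
      for (uint64_t i = 0; i < runLength && out.size() < numValues; ++i) {
        out.push_back(value);
      }
    }
  }
  return out;
}
```

The sketch materializes a `std::vector<bool>` for readability; a production decoder would more likely write into a preallocated bitmap and support skipping.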

netlify · 2024-10-16T21:29:15Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`d1d1dcd`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/67916f98ae810c0008c65cc7

velox/dwio/parquet/reader/RleBooleanDecoder.h

ethanyzhang · 2024-11-12T16:16:00Z

@yingsu00 can you also take a look at this PR? Thank you!

czentgr · 2024-11-13T23:25:26Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+ public:
+  using super = RleBpDecoder;
+  RleBooleanDecoder(const char* start, const char* end, int32_t& len)
+      : super::RleBpDecoder{start + 4, end, 1} {


The magic number 4 is used multiple times. Please make it a static const here and use an appropriate name for it.

You don't need super:: here. There is no ambiguity here.
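
A minimal sketch of both suggestions, using stand-in types rather than the real Velox classes. `BaseDecoder` and `BooleanDecoder` are hypothetical; `kLengthOffset` is the constant name that appears in a later revision of this PR.

```cpp
#include <cstdint>

// Stand-in for the base decoder; only the pieces needed for the example.
struct BaseDecoder {
  BaseDecoder(const char* start, const char* end, uint8_t bitWidth)
      : bufferStart_(start), bufferEnd_(end), bitWidth_(bitWidth) {}

  const char* bufferStart_;
  const char* bufferEnd_;
  uint8_t bitWidth_;
};

struct BooleanDecoder : BaseDecoder {
  // Named constant instead of a bare 4: size in bytes of the length prefix
  // that precedes the encoded values.
  static constexpr int32_t kLengthOffset = 4;

  // The base class is named directly in the initializer list; no `super::`
  // qualification is needed because there is nothing to disambiguate.
  BooleanDecoder(const char* start, const char* end)
      : BaseDecoder{start + kLengthOffset, end, 1} {}
};
```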

czentgr · 2024-11-13T23:29:15Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+          "Received invalid length : " + std::to_string(len) +
+          " (corrupt data page?)");
+    }
+    // num_bytes will be the first 4 bytes that tell us the length of encoded


Nit: Do we need this comment once the 4 is no longer a magic number?

Right, I'm removing the comment since it is no longer necessary. Thank you!

czentgr · 2024-11-13T23:35:52Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+
+  template <bool hasNulls>
+  inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) {
+    if (hasNulls) {


This should be a constexpr.

Updated to constexpr, thank you!

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr:

if constexpr (hasNulls)

That way, if the template argument is false, the body of this if statement is not generated.
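
A small self-contained illustration of the difference, assuming C++17. `valuesToSkip` is a hypothetical function, and the bitmap convention (a set bit means non-null) is an assumption made for the example.

```cpp
#include <cstdint>

// Hypothetical example: returns how many encoded values correspond to the
// first `numRows` rows. With nulls, only non-null rows carry a value.
template <bool hasNulls>
int32_t valuesToSkip(int32_t numRows, const uint64_t* nulls) {
  if constexpr (hasNulls) {
    // This branch is only instantiated for valuesToSkip<true>.
    int32_t nonNulls = 0;
    for (int32_t i = 0; i < numRows; ++i) {
      nonNulls += static_cast<int32_t>((nulls[i / 64] >> (i % 64)) & 1);
    }
    return nonNulls;
  } else {
    // This branch is only instantiated for valuesToSkip<false>, so `nulls`
    // is never dereferenced and may be nullptr.
    return numRows;
  }
}
```

With a plain `if (hasNulls)`, both branches are compiled for every instantiation, and the optimizer has to remove the dead one.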

czentgr · 2024-11-13T23:43:24Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+      numValues = bits::countNonNulls(nulls, current, current + numValues);
+    }
+
+    super::skip(numValues);


Shouldn't this be RleBpDecoder::skip(numValues) to disambiguate the function from this->skip(numValues)?
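
A sketch of why the qualified call matters, using hypothetical stand-in types. A member named `skip` in the derived class hides every `skip` overload of the base class, so the base version has to be named explicitly.

```cpp
#include <cstdint>

// Stand-in base class with a one-argument skip.
struct BaseDecoder {
  void skip(int32_t numValues) {
    position += numValues;
  }
  int32_t position = 0;
};

struct BooleanDecoder : BaseDecoder {
  // Declaring `skip` here hides BaseDecoder::skip for unqualified lookup.
  // `nonNullCount` is a hypothetical stand-in for the null-bitmap count.
  template <bool hasNulls>
  void skip(int32_t numValues, int32_t nonNullCount) {
    if constexpr (hasNulls) {
      numValues = nonNullCount;
    }
    // Qualified with the base class name, so it is clear which skip runs.
    BaseDecoder::skip(numValues);
  }
};
```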

czentgr · 2024-11-13T23:44:50Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+    int32_t current = visitor.start();
+
+    skip<hasNulls>(current, 0, nulls);
+    int32_t toSkip;


Let's also initialize it to 0.

czentgr · 2024-11-13T23:47:51Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+    int32_t toSkip;
+    bool atEnd = false;
+    const bool allowNulls = hasNulls && visitor.allowNulls();
+    std::vector<uint64_t> outputBuffer(20);


Why is the size of the vector 20?

Sorry, I was using this output buffer for some other testing. Since it is not used anymore, I will delete this line.

czentgr · 2024-11-13T23:53:00Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+      ++current;
+      if (toSkip) {
+        skip<hasNulls>(toSkip, current, nulls);
+        current += toSkip;


Is there no problem if toSkip > 0, given that we already advanced current by 1 on line 97? I suppose this has something to do with what the visitor represents. This might need a comment explaining why this is OK.

Or maybe a comment on how the algorithm works when the read occurs.

czentgr · 2024-11-13T23:54:02Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+  }
+
+  const char* bufferStart_;
+  uint32_t num_bytes = 0;


This needs to be named numBytes_.

czentgr · 2024-11-13T23:55:05Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+  int64_t readBitField() {
+    auto value =
+        dwio::common::safeLoadBits(
+            super::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) &


Same comment as above about using super vs. the base class name to disambiguate.

czentgr · 2024-11-19T14:31:08Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+
+class RleBooleanDecoder : public RleBpDecoder {
+ public:
+  using super = RleBpDecoder;


Now that you've replaced super with the actual base class, we don't need this anymore. I had not seen that you defined super here, which explains why it was working before. But someone not familiar with Java would be confused, so it is better to be explicit.

Thank you, removed this line!

czentgr · 2024-11-19T14:35:04Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+  RleBooleanDecoder(const char* start, const char* end, int32_t& len)
+      : RleBpDecoder{start + kLengthOffset, end, 1} {
+    if (len < kLengthOffset) {
+      VELOX_FAIL(


Let's replace the std::to_string with the fmt-style formatting provided by VELOX_FAIL, like so:

VELOX_FAIL("Received invalid length : {} (corrupt data page?)", len);

Please do this for all occurrences.

Thank you, updated!

czentgr · 2024-11-19T14:40:26Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+ public:
+  using super = RleBpDecoder;
+  static constexpr int32_t kLengthOffset = 4;
+  RleBooleanDecoder(const char* start, const char* end, int32_t& len)


len is not modified here, so we don't need a reference.

Updated, thank you!

czentgr · 2024-11-19T14:48:26Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+            RleBpDecoder::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) &
+        bitMask_;
+    bitOffset_ += bitWidth_;
+    RleBpDecoder::bufferStart_ += bitOffset_ >> 3;


What is not clear to me is why you modify the base class member here. You have your own bufferStart_, so why not process and modify that one using the base class methods (which you do to some degree)?
The base class member (with the same name) is initialized in the constructor. But because you declared a new member of the same name on line 118, which is never used, you need to refer to the base class explicitly here to reach the inherited member.

Good catch. I forgot to remove my own bufferStart_, which was being used for something else. I cleaned up the code by removing it, thank you!
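
A sketch of the shadowing issue described above, with hypothetical stand-in types. A data member redeclared in the derived class hides the inherited one, so unqualified uses silently refer to the new, never-initialized member.

```cpp
// Stand-in base class that owns the buffer pointer.
struct BaseDecoder {
  explicit BaseDecoder(const char* start) : bufferStart_(start) {}
  const char* bufferStart_;
};

// Redeclares bufferStart_, which hides BaseDecoder::bufferStart_.
struct ShadowingDecoder : BaseDecoder {
  explicit ShadowingDecoder(const char* start) : BaseDecoder(start) {}

  // Refers to the derived member, which was never set from `start`.
  const char* current() const {
    return bufferStart_;
  }

  // Reaching the inherited member requires explicit qualification.
  const char* baseCurrent() const {
    return BaseDecoder::bufferStart_;
  }

  const char* bufferStart_ = nullptr;
};

// Without the duplicate member, the inherited one is used directly.
struct CleanDecoder : BaseDecoder {
  using BaseDecoder::BaseDecoder;

  const char* current() const {
    return bufferStart_;
  }
};
```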

czentgr

I think this looks good now.
Please update the commit and PR abstract (header line) to have feat(parquet):.

@majetideepak please take a look when you get a chance.

czentgr · 2024-12-09T20:52:36Z

velox/dwio/parquet/reader/RleBooleanDecoder.h

+        --remainingValues_;
+      }
+      // Will increment the current by one and if value of toSkip > 0
+      // We will count the number of non nulls for the bpdecoding and skip


Please check the formatting.
The comment says it counts the number of non-null values, but it might as well be nulls, because hasNulls is a template parameter that could be set either way.
I notice this code comes from other decoders as well. Basically, if it is determined after processing that something can be skipped, skip it.
Perhaps the comment isn't necessary after all.

majetideepak

Some high-level comments.

majetideepak · 2025-01-07T03:04:33Z

velox/dwio/parquet/reader/PageReader.cpp

@@ -700,6 +700,16 @@ void PageReader::makeDecoder() {
        break;
      }
      FMT_FALLTHROUGH;


This is needed for case Encoding::DELTA_BYTE_ARRAY: to go to default:.
This change breaks that.
Move case Encoding::RLE: before case Encoding::DELTA_BYTE_ARRAY:
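
A simplified sketch of the ordering issue, with hypothetical enums and string results standing in for the real decoders. It uses the standard `[[fallthrough]]` attribute where the Velox code uses FMT_FALLTHROUGH.

```cpp
#include <string>

// Hypothetical, reduced versions of the encoding and type enums.
enum class Encoding { PLAIN, DELTA_BYTE_ARRAY, RLE };
enum class Type { BOOLEAN, BYTE_ARRAY, INT32 };

inline std::string pickDecoder(Encoding encoding, Type type) {
  switch (encoding) {
    // RLE is placed before DELTA_BYTE_ARRAY so that it does not sit between
    // the fallthrough below and `default:`.
    case Encoding::RLE:
      if (type == Type::BOOLEAN) {
        return "rle-boolean";
      }
      return "unsupported";
    case Encoding::DELTA_BYTE_ARRAY:
      if (type == Type::BYTE_ARRAY) {
        return "delta-byte-array";
      }
      // Unsupported types must still reach `default:`.
      [[fallthrough]];
    default:
      return "unsupported";
  }
}
```

If the RLE case were inserted between the fallthrough and `default:`, an unsupported DELTA_BYTE_ARRAY type would fall into the RLE branch instead.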

majetideepak · 2025-01-07T03:04:57Z

velox/dwio/parquet/reader/PageReader.cpp

+              pageData_, pageData_ + encodedDataSize_, encodedDataSize_);
+          break;
+        default:
+          VELOX_UNSUPPORTED("RLE decoder only supports boolean");


majetideepak · 2025-01-07T03:13:16Z

velox/dwio/parquet/tests/reader/E2EFilterTest.cpp

+TEST_F(E2EFilterTest, booleanRle) {
+  options_.enableDictionary = false;
+  options_.encoding = facebook::velox::parquet::arrow::Encoding::RLE;
+  options_.parquetDataPageVersion = "V2";


What happens if this is set to V1?

If this is set to V1, the test will still run and pass, since the encoding type will still be facebook::velox::parquet::arrow::Encoding::RLE. In practice, though, such a file could not be created: RLE encoding for boolean columns is only a version 2 feature, and version 1 uses bit packing instead of RLE. Since the combination of RLE encoding and version 1 never exists, we do not need to worry about V1 with RLE encoding.

majetideepak · 2025-01-07T03:13:37Z

velox/dwio/parquet/writer/Writer.h

@@ -108,7 +108,7 @@ struct WriterOptions : public dwio::common::WriterOptions {
  /// Timestamp time zone for Parquet write through Arrow bridge.
  std::optional<std::string> parquetWriteTimestampTimeZone;
  bool writeInt96AsTimestamp = false;
-
+  std::optional<std::string> parquetDataPageVersion = std::nullopt;


Make this an enum.

The Writer.h and Writer.cpp changes come from #11151. Should I just cherry-pick that PR for this as well?

There is probably an argument for getting that PR (#11151) in first. It does have the enum for the data page version.

Let's just copy the enum definition from that PR here. We don't need the default in that PR. Just V1, V2.

I agree with Christian it does make sense to get #11151 in first. I will review that.
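
For illustration only, a sketch of what an enum-based option could look like. The names `DataPageVersion` and `parseDataPageVersion` are hypothetical and are not taken from #11151.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum with just the two values mentioned above.
enum class DataPageVersion { V1, V2 };

// Hypothetical helper for callers that still hold the version as a string.
// Returns std::nullopt for anything other than "V1" or "V2".
inline std::optional<DataPageVersion> parseDataPageVersion(
    std::string_view text) {
  if (text == "V1") {
    return DataPageVersion::V1;
  }
  if (text == "V2") {
    return DataPageVersion::V2;
  }
  return std::nullopt;
}
```

Compared with a free-form string option, an enum rejects misspelled values at compile time wherever the option is set in code.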

majetideepak · 2025-01-07T03:15:14Z

velox/dwio/parquet/tests/reader/ParquetTableScanTest.cpp

@@ -1123,6 +1123,150 @@ TEST_F(ParquetTableScanTest, deltaByteArray) {
  assertSelect({"a"}, "SELECT a from expected");
 }

+TEST_F(ParquetTableScanTest, rleBoolean) {
+  loadData(
+      getExampleFilePath("rleboolean.parquet"),


How was the file rleboolean.parquet generated? Do we need this file? Can't we use the writer?

I believe I created the rleboolean.parquet file through Presto Java. The file needs to be in fbvelox/examples for the test to run; it can be copied into that folder, since it is currently located at velox/velox/dwio/parquet/tests/examples/rleboolean.parquet. I was just following the same approach to creating the file and running the test that the rest of the tests in this file use.

For Velox, we can use the arrow writer to test RLE. We can test presto Java in the Prestissimo E2E tests. Let's remove this file here.

Thank you, I removed the rleboolean.parquet file and am using the writer instead for this test.

Co-authored-by: Minhan Cao <[email protected]>

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 16, 2024

Yuhta changed the title ~~added in rle encoder for boolean~~ added in rle decoder for boolean Oct 17, 2024

Yuhta self-requested a review October 17, 2024 21:04

jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 358855b to f638683 Compare October 29, 2024 19:22

jkhaliqi mentioned this pull request Oct 29, 2024

added in rle decoder for boolean #11374

Closed

minhancao force-pushed the rle_encoding branch from f638683 to a259672 Compare October 29, 2024 23:46

jkhaliqi force-pushed the rle_encoding branch 3 times, most recently from f5735ba to 163bdb3 Compare October 30, 2024 18:03

jkhaliqi marked this pull request as ready for review October 30, 2024 18:17

jkhaliqi requested a review from majetideepak as a code owner October 30, 2024 18:17

jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from a8f69b9 to c25886f Compare October 30, 2024 19:24

minhancao force-pushed the rle_encoding branch 2 times, most recently from e8e82c4 to c760214 Compare November 1, 2024 20:45

minhancao reviewed Nov 11, 2024


minhancao reviewed Nov 11, 2024


jkhaliqi force-pushed the rle_encoding branch from c760214 to 5b72579 Compare November 12, 2024 19:45

majetideepak changed the title ~~added in rle decoder for boolean~~ feat: Add boolean rle decoder for Parquet Nov 12, 2024

jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 99e8783 to 4b8412d Compare November 12, 2024 22:32

minhancao approved these changes Nov 12, 2024


czentgr reviewed Nov 13, 2024


jkhaliqi force-pushed the rle_encoding branch from 4b8412d to b850e02 Compare November 14, 2024 20:29

czentgr reviewed Nov 19, 2024


jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 6bbaab0 to 27ad3f1 Compare November 23, 2024 00:09

jkhaliqi force-pushed the rle_encoding branch from 27ad3f1 to 0eba7a4 Compare November 26, 2024 05:42

czentgr reviewed Dec 9, 2024


jkhaliqi force-pushed the rle_encoding branch from 0eba7a4 to 7fb23f6 Compare December 9, 2024 22:05

jkhaliqi changed the title ~~feat: Add boolean rle decoder for Parquet~~ feat(parquet): Add boolean rle decoder for Parquet Dec 9, 2024

jkhaliqi force-pushed the rle_encoding branch from 7fb23f6 to 2d0751e Compare December 9, 2024 22:54

majetideepak reviewed Jan 7, 2025


jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from ecd8060 to 0784da6 Compare January 14, 2025 01:22

jkhaliqi force-pushed the rle_encoding branch from 0784da6 to e9230c5 Compare January 22, 2025 22:20

feat(parquet): Added in rle decoder for boolean

d1d1dcd

Co-authored-by: Minhan Cao <[email protected]>

jkhaliqi force-pushed the rle_encoding branch from e9230c5 to d1d1dcd Compare January 22, 2025 22:22
