
Commit f2de2cd

etseidl, BugenZhao, XiangpengHao, Rachelint, and Jesse-Bakker authored
No longer write Parquet column metadata after column chunks *and* in the footer (#6117)
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`
  Signed-off-by: Bugen Zhao <[email protected]>
* fix example tests
  Signed-off-by: Bugen Zhao <[email protected]>
  ---------
  Signed-off-by: Bugen Zhao <[email protected]>
* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043)
* deprecate auto copy, ask explicit reference
* update comments
* make cargo doc happy
* Make display of interval types more pretty (#6006)
* improve dispaly for interval.
* update test in pretty, and fix display problem.
* tmp
* fix tests in arrow-cast.
* fix tests in pretty.
* fix style.
* Update snafu (#5930)
* Update Parquet thrift generated structures (#6045)
* update to latest thrift (as of 11 Jul 2024) from parquet-format
* pass None for optional size statistics
* escape HTML tags
* don't need to escape brackets in arrays
* Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933)
  This reverts commit 22e0b44.
* Revert "Update snafu (#5930)" (#6069)
  This reverts commit 756b1fb.
* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)
* Update pyo3 requirement from 0.21.1 to 0.22.1
  Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
  - [Release notes](https://github.com/pyo3/pyo3/releases)
  - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
  - [Commits](PyO3/pyo3@v0.21.1...v0.22.1)
  ---
  updated-dependencies:
  - dependency-name: pyo3
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
* refactor: remove deprecated `FromPyArrow::from_pyarrow`
  "GIL Refs" are being phased out.
* chore: update `pyo3` in integration tests
  ---------
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* remove repeated codes to make the codes more concise. (#6080)
* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)
* update to latest thrift (as of 11 Jul 2024) from parquet-format
* pass None for optional size statistics
* escape HTML tags
* don't need to escape brackets in arrays
* add support for unencoded_byte_array_data_bytes
* add comments
* change sig of ColumnMetrics::update_variable_length_bytes()
* rename ParquetOffsetIndex to OffsetSizeIndex
* rename some functions
* suggestion from review
  Co-authored-by: Andrew Lamb <[email protected]>
* add Default trait to ColumnMetrics as suggested in review
* rename OffsetSizeIndex to OffsetIndexMetaData
  ---------
  Co-authored-by: Andrew Lamb <[email protected]>
* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)
  Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
  - [Release notes](https://github.com/pyo3/pyo3/releases)
  - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
  - [Commits](PyO3/pyo3@v0.21.1...v0.22.2)
  ---
  updated-dependencies:
  - dependency-name: pyo3
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)
* deprecate read_page_locations
* add to_thrift() to OffsetIndexMetaData
* no longer write inline column metadata
* Update parquet/src/column/writer/mod.rs
  Co-authored-by: Ed Seidl <[email protected]>
* suggestion from review
  Co-authored-by: Andrew Lamb <[email protected]>
* add some more documentation
* remove write_metadata from PageWriter
  ---------
  Signed-off-by: Bugen Zhao <[email protected]>
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: Bugen Zhao <[email protected]>
  Co-authored-by: Xiangpeng Hao <[email protected]>
  Co-authored-by: kamille <[email protected]>
  Co-authored-by: Jesse <[email protected]>
  Co-authored-by: Andrew Lamb <[email protected]>
  Co-authored-by: Marco Neumann <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
1 parent ee6fb87 commit f2de2cd
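From a caller's perspective the write path is unchanged; the difference is only in what bytes land in the file, since the thrift `ColumnMetaData` is no longer serialized a second time after each column chunk, only in the footer. Below is a minimal sketch of the usual high-level write path, assuming the `parquet` crate with its `arrow` feature plus the `arrow-array` crate; the file name and column name are illustrative, not from this commit.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a tiny batch with a single Int32 column.
    let col = Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef;
    let batch = RecordBatch::try_from_iter([("col", col)])?;

    // Write it out; column chunk metadata now lands only in the file footer.
    let file = File::create("example.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```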

5 files changed: +19 -42 lines

parquet/src/arrow/arrow_writer/mod.rs

Lines changed: 1 addition & 6 deletions
@@ -43,7 +43,7 @@ use crate::column::writer::{
 };
 use crate::data_type::{ByteArray, FixedLenByteArray};
 use crate::errors::{ParquetError, Result};
-use crate::file::metadata::{ColumnChunkMetaData, KeyValue, RowGroupMetaData};
+use crate::file::metadata::{KeyValue, RowGroupMetaData};
 use crate::file::properties::{WriterProperties, WriterPropertiesPtr};
 use crate::file::reader::{ChunkReader, Length};
 use crate::file::writer::{SerializedFileWriter, SerializedRowGroupWriter};
@@ -489,11 +489,6 @@ impl PageWriter for ArrowPageWriter {
         Ok(spec)
     }

-    fn write_metadata(&mut self, _metadata: &ColumnChunkMetaData) -> Result<()> {
-        // Skip writing metadata as won't be copied anyway
-        Ok(())
-    }
-
     fn close(&mut self) -> Result<()> {
         Ok(())
     }

parquet/src/column/page.rs

Lines changed: 1 addition & 7 deletions
@@ -21,7 +21,7 @@ use bytes::Bytes;

 use crate::basic::{Encoding, PageType};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::ColumnChunkMetaData, statistics::Statistics};
+use crate::file::statistics::Statistics;
 use crate::format::PageHeader;

 /// Parquet Page definition.
@@ -350,12 +350,6 @@ pub trait PageWriter: Send {
     /// either data page or dictionary page.
     fn write_page(&mut self, page: CompressedPage) -> Result<PageWriteSpec>;

-    /// Writes column chunk metadata into the output stream/sink.
-    ///
-    /// This method is called once before page writer is closed, normally when writes are
-    /// finalised in column writer.
-    fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()>;
-
     /// Closes resources and flushes underlying sink.
     /// Page writer should not be used after this method is called.
     fn close(&mut self) -> Result<()>;
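With `write_metadata` gone, implementors of `PageWriter` only provide `write_page` and `close`. Here is a minimal sketch of a custom implementation under that assumption; the `DiscardingPageWriter` type and its trivial bodies are made up for illustration and are not part of this commit.

```rust
use parquet::column::page::{CompressedPage, PageWriteSpec, PageWriter};
use parquet::errors::Result;

/// Illustrative page writer that records page sizes but discards the bytes.
struct DiscardingPageWriter;

impl PageWriter for DiscardingPageWriter {
    fn write_page(&mut self, page: CompressedPage) -> Result<PageWriteSpec> {
        // A real implementation would serialize the page header and data to its sink here.
        let mut spec = PageWriteSpec::new();
        spec.page_type = page.page_type();
        spec.uncompressed_size = page.uncompressed_size();
        spec.compressed_size = page.compressed_size();
        Ok(spec)
    }

    fn close(&mut self) -> Result<()> {
        Ok(())
    }
}
```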

parquet/src/column/writer/mod.rs

Lines changed: 3 additions & 15 deletions
@@ -579,7 +579,7 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
             self.write_dictionary_page()?;
         }
         self.flush_data_pages()?;
-        let metadata = self.write_column_metadata()?;
+        let metadata = self.build_column_metadata()?;
         self.page_writer.close()?;

         let boundary_order = match (
@@ -1041,24 +1041,18 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
         Ok(())
     }

-    /// Assembles and writes column chunk metadata.
-    fn write_column_metadata(&mut self) -> Result<ColumnChunkMetaData> {
+    /// Assembles column chunk metadata.
+    fn build_column_metadata(&mut self) -> Result<ColumnChunkMetaData> {
         let total_compressed_size = self.column_metrics.total_compressed_size as i64;
         let total_uncompressed_size = self.column_metrics.total_uncompressed_size as i64;
         let num_values = self.column_metrics.total_num_values as i64;
         let dict_page_offset = self.column_metrics.dictionary_page_offset.map(|v| v as i64);
         // If data page offset is not set, then no pages have been written
         let data_page_offset = self.column_metrics.data_page_offset.unwrap_or(0) as i64;

-        let file_offset = match dict_page_offset {
-            Some(dict_offset) => dict_offset + total_compressed_size,
-            None => data_page_offset + total_compressed_size,
-        };
-
         let mut builder = ColumnChunkMetaData::builder(self.descr.clone())
             .set_compression(self.codec)
             .set_encodings(self.encodings.iter().cloned().collect())
-            .set_file_offset(file_offset)
             .set_total_compressed_size(total_compressed_size)
             .set_total_uncompressed_size(total_uncompressed_size)
             .set_num_values(num_values)
@@ -1138,8 +1132,6 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
         }

         let metadata = builder.build()?;
-        self.page_writer.write_metadata(&metadata)?;
-
         Ok(metadata)
     }

@@ -3589,10 +3581,6 @@ mod tests {
         Ok(res)
     }

-    fn write_metadata(&mut self, _metadata: &ColumnChunkMetaData) -> Result<()> {
-        Ok(())
-    }
-
     fn close(&mut self) -> Result<()> {
         Ok(())
     }
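For context, the removed lines derived `file_offset` as the first byte past the column chunk's compressed pages, which is also where `write_metadata` used to serialize the duplicate `ColumnMetaData`. A condensed sketch of that removed derivation, reusing the local names from the hunk above, shown for illustration only:

```rust
// Formerly: the chunk starts at the dictionary page if present, otherwise at
// the first data page, and `file_offset` pointed one byte past its pages.
let chunk_start = dict_page_offset.unwrap_or(data_page_offset);
let file_offset = chunk_start + total_compressed_size;
// `.set_file_offset(file_offset)` is no longer called; the field is left unset
// (modern writers record 0, per the docs added in parquet/src/file/metadata/mod.rs below).
```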

parquet/src/file/metadata/mod.rs

Lines changed: 14 additions & 2 deletions
@@ -682,7 +682,12 @@ impl ColumnChunkMetaData {
         self.file_path.as_deref()
     }

-    /// Byte offset in `file_path()`.
+    /// Byte offset of `ColumnMetaData` in `file_path()`.
+    ///
+    /// Note that the meaning of this field has been inconsistent between implementations
+    /// so its use has since been deprecated in the Parquet specification. Modern implementations
+    /// will set this to `0` to indicate that the `ColumnMetaData` is solely contained in the
+    /// `ColumnChunk` struct.
     pub fn file_offset(&self) -> i64 {
         self.file_offset
     }
@@ -1040,6 +1045,14 @@ impl ColumnChunkMetaDataBuilder {
     }

     /// Sets file offset in bytes.
+    ///
+    /// This field was meant to provide an alternate to storing `ColumnMetadata` directly in
+    /// the `ColumnChunkMetadata`. However, most Parquet readers assume the `ColumnMetadata`
+    /// is stored inline and ignore this field.
+    #[deprecated(
+        since = "53.0.0",
+        note = "The Parquet specification requires this field to be 0"
+    )]
     pub fn set_file_offset(mut self, value: i64) -> Self {
         self.0.file_offset = value;
         self
@@ -1453,7 +1466,6 @@ mod tests {
         let col_metadata = ColumnChunkMetaData::builder(column_descr.clone())
             .set_encodings(vec![Encoding::PLAIN, Encoding::RLE])
             .set_file_path("file_path".to_owned())
-            .set_file_offset(100)
             .set_num_values(1000)
             .set_compression(Compression::SNAPPY)
             .set_total_compressed_size(2000)
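A short sketch of what building `ColumnChunkMetaData` looks like after this change, mirroring the test above but via the public `parquet` API. The one-column schema and the size values are placeholders, and the final assertion assumes the builder's default file offset of 0, matching the deprecation note.

```rust
use std::sync::Arc;

use parquet::basic::{Compression, Encoding};
use parquet::errors::Result;
use parquet::file::metadata::ColumnChunkMetaData;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

fn main() -> Result<()> {
    // Minimal one-column schema so the builder has a column descriptor.
    let schema = Arc::new(parse_message_type("message m { required int32 c; }")?);
    let descr = SchemaDescriptor::new(schema);
    let column_descr = descr.column(0);

    // `set_file_offset` is deprecated and no longer part of the chain.
    let col_metadata = ColumnChunkMetaData::builder(column_descr)
        .set_encodings(vec![Encoding::PLAIN, Encoding::RLE])
        .set_num_values(1000)
        .set_compression(Compression::SNAPPY)
        .set_total_compressed_size(2000)
        .set_total_uncompressed_size(3000)
        .build()?;

    // Files written without the deprecated setter are expected to report 0.
    assert_eq!(col_metadata.file_offset(), 0);
    Ok(())
}
```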

parquet/src/file/writer.rs

Lines changed: 0 additions & 12 deletions
@@ -649,13 +649,10 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> {
             ));
         }

-        let file_offset = self.buf.bytes_written() as i64;
-
         let map_offset = |x| x - src_offset + write_offset as i64;
         let mut builder = ColumnChunkMetaData::builder(metadata.column_descr_ptr())
             .set_compression(metadata.compression())
             .set_encodings(metadata.encodings().clone())
-            .set_file_offset(file_offset)
             .set_total_compressed_size(metadata.compressed_size())
             .set_total_uncompressed_size(metadata.uncompressed_size())
             .set_num_values(metadata.num_values())
@@ -680,7 +677,6 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> {
             }
         }

-        SerializedPageWriter::new(self.buf).write_metadata(&metadata)?;
         let (_, on_close) = self.get_on_close();
         on_close(close)
     }
@@ -808,14 +804,6 @@ impl<'a, W: Write + Send> PageWriter for SerializedPageWriter<'a, W> {
         Ok(spec)
     }

-    fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()> {
-        let mut protocol = TCompactOutputProtocol::new(&mut self.sink);
-        metadata
-            .to_column_metadata_thrift()
-            .write_to_out_protocol(&mut protocol)?;
-        Ok(())
-    }
-
     fn close(&mut self) -> Result<()> {
         self.sink.flush()?;
         Ok(())
