IndexOutOfBoundsException when loading compressed IPC format #230
Reporter: Georeth Zhou

I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.

Call stack:

This bug can be reproduced with a simple dataframe created by pandas:

Pandas compresses the dataframe by default. If compression is turned off, Java can load the dataframe, so I suspect the bounds-checking code is buggy when loading compressed files. The same dataframe can be loaded by polars, pandas, and pyarrow, so it is unlikely to be a pandas bug.

Environment: Linux and Windows.
Apache Arrow Java versions: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)

Note: This issue was originally created as ARROW-18198. Please see the migration documentation for further details.

Comments
David Dali Susanibar Arce / @davisusanibar: There is a problem with the validity buffer: for 2049 rows, a buffer of 504 bytes is initially assigned, but a length of 512 bytes is requested at the end. This needs further review to determine what changes are needed.
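For context, the numbers above are consistent with the IPC format's buffer-compression layout, where each compressed buffer is prefixed with its uncompressed length as a 64-bit value (512 - 8 = 504); whether that prefix handling is the actual culprit here is only a guess. A minimal sketch of the arithmetic, using the real BitVectorHelper helper but an illustrative power-of-two rounding step:

import org.apache.arrow.vector.BitVectorHelper;

public class ValiditySizeSketch {
  public static void main(String[] args) {
    int rowCount = 2049;
    // One validity bit per row, rounded up to whole bytes: ceil(2049 / 8) = 257.
    int minBytes = BitVectorHelper.getValidityBufferSizeFromCount(rowCount);
    // Round up to the next power of two, as Arrow's default allocator does: 512.
    int allocated = Integer.highestOneBit(minBytes - 1) << 1;
    // The IPC buffer-compression layout prefixes each buffer with its
    // uncompressed length as a 64-bit value; 512 - 8 = 504, matching the
    // mismatch reported above (the connection to this bug is a guess).
    System.out.printf("min=%d allocated=%d minus-prefix=%d%n",
        minBytes, allocated, allocated - 8);
  }
}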
David Dali Susanibar Arce / @davisusanibar: Based on the current implementation, the default compression codec is no compression.
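For reference, a minimal sketch of where that default comes from: if no factory is supplied, ArrowFileReader falls back to NoCompressionCodec.Factory.INSTANCE, so compressed buffers are passed through as if uncompressed (the file path below is a placeholder):

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.compression.NoCompressionCodec;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class DefaultCodecSketch {
  public static void main(String[] args) throws IOException {
    try (RootAllocator allocator = new RootAllocator();
         FileInputStream in = new FileInputStream("lz4.arrow"); // placeholder path
         // Passing the factory explicitly is equivalent to what the
         // two-argument constructor does implicitly:
         ArrowFileReader reader = new ArrowFileReader(
             in.getChannel(), allocator, NoCompressionCodec.Factory.INSTANCE)) {
      // With this default, a compressed batch is read as if uncompressed,
      // which is what leads to the out-of-bounds reads reported here.
      reader.loadNextBatch();
    }
  }
}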
David Dali Susanibar Arce / @davisusanibar: Was the Vector module designed to support compression codecs (LZ4/Zstd)? I only see the abstract class AbstractCompressionCodec; doDecompress is implemented only in the Compression module, and using that from Vector would cause a cyclic dependency Vector <-> Compression.
Could you help us with a way to implement compression in the Vector module?
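For what it's worth, the usual way around such a cycle is dependency inversion: the Vector module defines only the codec interface, and the Compression module, which already depends on Vector, provides the LZ4/Zstd implementations that callers inject. A simplified sketch with illustrative names (the real types live in org.apache.arrow.vector.compression):

// In the vector module: only the abstraction lives here.
import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;

public interface CodecSketch {
  ArrowBuf compress(BufferAllocator allocator, ArrowBuf uncompressed);
  ArrowBuf decompress(BufferAllocator allocator, ArrowBuf compressed);

  interface Factory {
    // The vector module resolves codecs through this interface at runtime,
    // so it never needs a compile-time dependency on the compression module.
    CodecSketch createCodec(String codecName);
  }
}

// In the compression module (which depends on vector, not vice versa):
//   class Lz4Sketch implements CodecSketch { ... }
//   class FactorySketch implements CodecSketch.Factory { ... }
// Callers then inject the factory, e.g. into ArrowFileReader's constructor.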
David Li / @lidavidm: In any case, the first issue here is that Java should detect that the file is compressed and raise an error if it doesn't support the codec.
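A hedged sketch of what that detection could look like; CompressionUtil.CodecType is the real enum in arrow-vector, but the guard itself is illustrative rather than the fix that eventually landed:

import org.apache.arrow.vector.compression.CompressionUtil;

public class CodecGuardSketch {
  // Fail fast when a record batch declares a codec the reader cannot handle,
  // instead of misinterpreting compressed bytes and failing later with an
  // IndexOutOfBoundsException.
  static void checkCodecSupported(CompressionUtil.CodecType declared) {
    if (declared != CompressionUtil.CodecType.NO_COMPRESSION) {
      throw new IllegalArgumentException("Record batch is compressed with "
          + declared + " but no compression factory was configured; pass "
          + "CommonsCompressionFactory.INSTANCE to ArrowFileReader.");
    }
  }
}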
David Dali Susanibar Arce / @davisusanibar: Please consider this PR, which adds a cookbook recipe for reading compressed files:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

File file = new File("src/main/resources/compare/lz4.arrow");
try (
    BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    // new ArrowFileReader(fileInputStream.getChannel(), rootAllocator) would
    // default to no compression; pass CommonsCompressionFactory (from the
    // arrow-compression artifact) to read compressed files:
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(),
        rootAllocator, CommonsCompressionFactory.INSTANCE)
) {
  System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
  for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
    reader.loadRecordBatch(arrowBlock);
    VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
    System.out.println("Size: --> " + vectorSchemaRootRecover.getRowCount());
    System.out.print(vectorSchemaRootRecover.contentToTSVString());
  }
} catch (IOException e) {
  e.printStackTrace();
}
Georeth Zhou: It works now.