Intermittent ORC File corruptions while using spark-acid write #106

gowtamchandrahasa opened this issue Jan 19, 2021 · 3 comments
gowtamchandrahasa commented Jan 19, 2021

Hi, we use Spark to read from and write to Hive ACID tables, and we have been using spark-acid for both reads and writes. We recently started seeing the error below when reading some of the partitions.

Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 16384 needed = 6139113
	at com.qubole.shaded.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
	at com.qubole.shaded.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
	at com.qubole.shaded.orc.impl.InStream$CompressedStream.read(InStream.java:250)
	at java.io.InputStream.read(InputStream.java:101)
	at com.qubole.shaded.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.qubole.shaded.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.qubole.shaded.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at com.qubole.shaded.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
	at com.qubole.shaded.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
	at com.qubole.shaded.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
	at com.qubole.shaded.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
	at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at com.qubole.shaded.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
	at com.qubole.shaded.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:275)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:311)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:1102)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1064)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1232)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1267)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:282)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:67)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:83)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPairAcid.<init>(OrcRawRecordMerger.java:246)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1063)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:2091)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1989)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$$anon$1.liftedTree1$1(HiveAcidRDD.scala:243)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$$anon$1.<init>(HiveAcidRDD.scala:239)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD.compute(HiveAcidRDD.scala:226)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD.compute(HiveAcidRDD.scala:90)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
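
For context, both the reads and the writes go through the spark-acid data source. A rough sketch of how we use it (table names below are placeholders, following the spark-acid README):

// Read from a Hive ACID table via the spark-acid data source
val df = spark.read
  .format("HiveAcid")
  .options(Map("table" -> "default.acid_tbl")) // placeholder table name
  .load()

// Append the results back to another Hive ACID table
df.write
  .format("HiveAcid")
  .option("table", "default.acid_tbl_out") // placeholder table name
  .mode("append")
  .save()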

The error occurred when reading specific ORC files in a delta folder.
We also tried reading the partition with HWC and with beeline, and both failed with the same buffer-size error.
Suspecting that the files might be corrupted, we ran the following command on the anomalous file:
/usr/bin/hive --orcfiledump /home/sshuser/bucket_00257
It failed with the error below:
java.lang.IllegalArgumentException: Buffer size too small. size = 16384 needed = 3749831
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
	at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:250)
	at java.io.InputStream.read(InputStream.java:101)
	at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:270)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:299)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:1080)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1042)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1210)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1245)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:275)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:627)
	at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:199)
	at org.apache.orc.tools.PrintData.main(PrintData.java:241)
	at org.apache.orc.tools.FileDump.main(FileDump.java:127)
	at org.apache.orc.tools.FileDump.main(FileDump.java:142)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

The command worked fine on other files in the delta folder.
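
The same check can also be scripted over all the bucket files in a delta folder using the ORC reader API. A rough sketch (the path is a placeholder and orc-core is assumed to be on the classpath); a corrupted file fails with the same "Buffer size too small" error that orcfiledump reports:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
// Placeholder path; in practice, list every bucket file of the suspect delta folder.
val files = Seq("/home/sshuser/bucket_00257")
files.foreach { f =>
  try {
    val reader = OrcFile.createReader(new Path(f), OrcFile.readerOptions(conf))
    val rows   = reader.rows()
    val batch  = reader.getSchema.createRowBatch()
    while (rows.nextBatch(batch)) {} // force every stripe to be read and decompressed
    rows.close()
    println(s"OK: $f")
  } catch {
    case e: Exception => println(s"CORRUPT: $f -> ${e.getMessage}")
  }
}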

When the same job was rerun with the same arguments, it produced a delta folder that was perfectly fine.

We went back to the exact jobs that created these corrupted delta folders and found one thing in common: they all had retried stages, and the error below appeared in the retried stages' executor logs.

21/01/11 06:37:09 ERROR AcidUtils: Failed to create abfs://store@xxx.dfs.core.windows.net/xxx/list_page_event/1/partition_key=2021-01-11/delta_0014708_0014708/_orc_acid_version due to: Operation failed: "The specified path already exists.", 409, PUT, https://xxx.dfs.core.windows.net/store/xxx/list_page_event/1/partition_key%3D2021-01-11/delta_0014708_0014708/_orc_acid_version?resource=file&timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:4c544a41-d01f-0000-0ae4-e732f9000000 Time:2021-01-11T06:37:09.5413274Z"

This issue was never seen when we used HWC to write to Hive ACID from Spark.
Could you please help us with this error? Thanks!

gowtamchandrahasa (Author) commented:

@amoghmargoor, could you help here? Thanks!

vinaymeghraj commented:

Facing a similar exception where ORC files are being corrupted. Increasing hive.exec.orc.default.buffer.size doesn't help.
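
For reference, this is roughly how I raised it via the Spark session (the value is only illustrative, and the exact mechanism may differ in your setup); it made no difference to the read failures:

// Illustrative only: raise the ORC compression buffer size for subsequent writes.
spark.sql("SET hive.exec.orc.default.buffer.size=1048576")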

riteshporwal commented:

I'm facing the same issue.
