Intermittent ORC File corruptions while using spark-acid write #106

gowtamchandrahasa opened this issue Jan 19, 2021 · 3 comments
gowtamchandrahasa commented Jan 19, 2021

Hi, we use Spark to read from and write to Hive ACID tables, and we have been using spark-acid for both reads and writes. We recently started seeing the error below when reading some of the partitions.

Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 16384 needed = 6139113
	at com.qubole.shaded.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
	at com.qubole.shaded.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
	at com.qubole.shaded.orc.impl.InStream$CompressedStream.read(InStream.java:250)
	at java.io.InputStream.read(InputStream.java:101)
	at com.qubole.shaded.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.qubole.shaded.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.qubole.shaded.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at com.qubole.shaded.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
	at com.qubole.shaded.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
	at com.qubole.shaded.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
	at com.qubole.shaded.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
	at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at com.qubole.shaded.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
	at com.qubole.shaded.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:275)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:311)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:1102)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1064)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1232)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1267)
	at com.qubole.shaded.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:282)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:67)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:83)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPairAcid.<init>(OrcRawRecordMerger.java:246)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1063)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:2091)
	at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1989)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$$anon$1.liftedTree1$1(HiveAcidRDD.scala:243)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$$anon$1.<init>(HiveAcidRDD.scala:239)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD.compute(HiveAcidRDD.scala:226)
	at com.qubole.spark.hiveacid.rdd.HiveAcidRDD.compute(HiveAcidRDD.scala:90)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
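
For context, both the reads and the writes go through the spark-acid data source. A rough sketch of how we use it (table names below are placeholders, following the spark-acid README):

// Read from a Hive ACID table via the spark-acid data source
val df = spark.read
  .format("HiveAcid")
  .options(Map("table" -> "default.acid_tbl")) // placeholder table name
  .load()

// Append the results back to another Hive ACID table
df.write
  .format("HiveAcid")
  .option("table", "default.acid_tbl_out") // placeholder table name
  .mode("append")
  .save()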

The error occurred when reading specific ORC files in a delta folder.
We also tried reading the partition with HWC and with beeline, and both failed with the same buffer-size error.
Suspecting that the files might be corrupted, we ran the following command on the anomalous file:
/usr/bin/hive --orcfiledump /home/sshuser/bucket_00257
It failed with the error below:
java.lang.IllegalArgumentException: Buffer size too small. size = 16384 needed = 3749831
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
	at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:250)
	at java.io.InputStream.read(InputStream.java:101)
	at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:270)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:299)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:1080)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1042)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1210)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1245)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:275)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:627)
	at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:199)
	at org.apache.orc.tools.PrintData.main(PrintData.java:241)
	at org.apache.orc.tools.FileDump.main(FileDump.java:127)
	at org.apache.orc.tools.FileDump.main(FileDump.java:142)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

The command worked fine on other files in the delta folder.
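
The same check can also be scripted over all the bucket files in a delta folder using the ORC reader API. A rough sketch (the path is a placeholder and orc-core is assumed to be on the classpath); a corrupted file fails with the same "Buffer size too small" error that orcfiledump reports:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
// Placeholder path; in practice, list every bucket file of the suspect delta folder.
val files = Seq("/home/sshuser/bucket_00257")
files.foreach { f =>
  try {
    val reader = OrcFile.createReader(new Path(f), OrcFile.readerOptions(conf))
    val rows   = reader.rows()
    val batch  = reader.getSchema.createRowBatch()
    while (rows.nextBatch(batch)) {} // force every stripe to be read and decompressed
    rows.close()
    println(s"OK: $f")
  } catch {
    case e: Exception => println(s"CORRUPT: $f -> ${e.getMessage}")
  }
}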

When the same job was rerun with the same arguments, it produced a delta folder that was perfectly fine.

We went back to the exact jobs that created these corrupted delta folders and found one thing in common: they all had retried stages, and the error below appeared in the retried stages' executor logs.

21/01/11 06:37:09 ERROR AcidUtils: Failed to create abfs://store@xxx.dfs.core.windows.net/xxx/list_page_event/1/partition_key=2021-01-11/delta_0014708_0014708/_orc_acid_version due to: Operation failed: "The specified path already exists.", 409, PUT, https://xxx.dfs.core.windows.net/store/xxx/list_page_event/1/partition_key%3D2021-01-11/delta_0014708_0014708/_orc_acid_version?resource=file&timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:4c544a41-d01f-0000-0ae4-e732f9000000 Time:2021-01-11T06:37:09.5413274Z"

This issue was never seen when we used HWC to write to Hive ACID from Spark.
Could you please help us with this error? Thanks!

gowtamchandrahasa (Author) commented:

@amoghmargoor, could you help here? Thanks!

vinaymeghraj commented:

Facing a similar exception where ORC files are being corrupted. Increasing hive.exec.orc.default.buffer.size doesn't help.
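
For reference, this is roughly how I raised it via the Spark session (the value is only illustrative, and the exact mechanism may differ in your setup); it made no difference to the read failures:

// Illustrative only: raise the ORC compression buffer size for subsequent writes.
spark.sql("SET hive.exec.orc.default.buffer.size=1048576")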

riteshporwal commented:

I'm facing the same issue.
