Hi, we use Spark to read from and write to Hive ACID tables, and we have been using spark-acid for both reads and writes. We recently started seeing the error below when reading some of the partitions.
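For context, here is a minimal sketch of how we invoke spark-acid for reads and writes (database and table names are illustrative, and the exact options may differ slightly from our production jobs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-acid-read-write")
  .enableHiveSupport()
  .getOrCreate()

// Read an ACID table through the spark-acid "HiveAcid" data source
val df = spark.read
  .format("HiveAcid")
  .options(Map("table" -> "default.list_page_event"))
  .load()

// Write (append) to an ACID table through the same data source
df.write
  .format("HiveAcid")
  .option("table", "default.list_page_event_copy")
  .mode("append")
  .save()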
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 16384 needed = 6139113
at com.qubole.shaded.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
at com.qubole.shaded.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
at com.qubole.shaded.orc.impl.InStream$CompressedStream.read(InStream.java:250)
at java.io.InputStream.read(InputStream.java:101)
at com.qubole.shaded.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
at com.qubole.shaded.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
at com.qubole.shaded.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
at com.qubole.shaded.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
at com.qubole.shaded.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
at com.qubole.shaded.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
at com.qubole.shaded.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
at com.qubole.shaded.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at com.qubole.shaded.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
at com.qubole.shaded.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:275)
at com.qubole.shaded.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:311)
at com.qubole.shaded.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:1102)
at com.qubole.shaded.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1064)
at com.qubole.shaded.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1232)
at com.qubole.shaded.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1267)
at com.qubole.shaded.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:282)
at com.qubole.shaded.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:67)
at com.qubole.shaded.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:83)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPairAcid.<init>(OrcRawRecordMerger.java:246)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1063)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:2091)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1989)
at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$$anon$1.liftedTree1$1(HiveAcidRDD.scala:243)
at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$$anon$1.<init>(HiveAcidRDD.scala:239)
at com.qubole.spark.hiveacid.rdd.HiveAcidRDD.compute(HiveAcidRDD.scala:226)
at com.qubole.spark.hiveacid.rdd.HiveAcidRDD.compute(HiveAcidRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The failure occurred when trying to read specific ORC files in the delta folder.
We also tried reading the partition with HWC and Beeline, and both failed with the same buffer-size error.
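The HWC read attempt looked roughly like the sketch below (session setup simplified, database and partition values illustrative); it hit the same buffer-size error:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession
val hive = HiveWarehouseSession.session(spark).build()

// Reading the affected partition through HWC failed the same way as the spark-acid read
val rows = hive.executeQuery(
  "SELECT * FROM default.list_page_event WHERE partition_key = '2021-01-11'")
rows.show(10)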
Suspecting that the files might be corrupted, we ran the command below on the anomalous file.
/usr/bin/hive --orcfiledump /home/sshuser/bucket_00257

It failed with the error below.

java.lang.IllegalArgumentException: Buffer size too small. size = 16384 needed = 3749831
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:250)
at java.io.InputStream.read(InputStream.java:101)
at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:270)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:299)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:1080)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1042)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1210)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1245)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:275)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:627)
at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:199)
at org.apache.orc.tools.PrintData.main(PrintData.java:241)
at org.apache.orc.tools.FileDump.main(FileDump.java:127)
at org.apache.orc.tools.FileDump.main(FileDump.java:142)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
The command worked fine on other files in the delta folder.
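For reference, checking all bucket files in the suspect delta folder looks roughly like the loop below (the path is illustrative; the real storage account and container are redacted):

for f in $(hadoop fs -ls -C 'abfs://store@xxx.dfs.core.windows.net/xxx/list_page_event/1/partition_key=2021-01-11/delta_0014708_0014708' | grep 'bucket_'); do
  # orcfiledump fails (non-zero exit) on the corrupted files and succeeds on the healthy ones
  hive --orcfiledump "$f" > /dev/null 2>&1 || echo "dump failed: $f"
done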
When the same job was rerun with the same arguments, it produced a delta folder that read back without any issues.
We traced the corrupted delta folders back to the exact jobs that created them. The common factor was that those jobs all had retried stages, and the error below appeared in the executor logs of the retried stages.
21/01/11 06:37:09 ERROR AcidUtils: Failed to create abfs://[email protected]/xxx/list_page_event/1/partition_key=2021-01-11/delta_0014708_0014708/_orc_acid_version due to: Operation failed: "The specified path already exists.", 409, PUT, https://xxx.dfs.core.windows.net/store/xxx/list_page_event/1/partition_key%3D2021-01-11/delta_0014708_0014708/_orc_acid_version?resource=file&timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:4c544a41-d01f-0000-0ae4-e732f9000000 Time:2021-01-11T06:37:09.5413274Z"
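We found this by going through the executor logs of the retried stages; one way to pull them is along these lines (the application id is a placeholder):

yarn logs -applicationId <applicationId> | grep -B 2 -A 2 '_orc_acid_version'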
We never saw this issue when we were writing to Hive ACID tables from Spark via HWC.
Could you please help us with this error? Thanks!