NPE while reading Multiple Partitions #101

Open
adiu19 opened this issue Nov 23, 2020 · 6 comments

adiu19 commented Nov 23, 2020

Hi guys, we have integrated the spark-acid library into our production pipeline and recently started facing an issue while reading data from a large number of partitions. Below is the stack trace:

```
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:820)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:440)
    at com.qubole.spark.hiveacid.rdd.HiveAcidRDD$.getJobConf(HiveAcidRDD.scala:457)
    at com.qubole.spark.hiveacid.reader.hive.HiveAcidPartitionComputer$$anonfun$2.apply(HiveAcidPartitionComputer.scala:73)
    at com.qubole.spark.hiveacid.reader.hive.HiveAcidPartitionComputer$$anonfun$2.apply(HiveAcidPartitionComputer.scala:69)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

All our tables are Hive ACID tables; the partitions are nested two levels deep by date and are created dynamically. The read works perfectly fine if we execute it in smaller chunks. Has anyone else faced this issue?
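
For context, a minimal sketch of the kind of read involved, assuming the standard spark-acid datasource usage; the table name `mydb.events` and the `dt` partition column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: "mydb.events" and the "dt" partition column are hypothetical.
val spark = SparkSession.builder()
  .appName("spark-acid-read")
  .enableHiveSupport()
  .getOrCreate()

// Read a Hive ACID table through the HiveAcid datasource and scan a
// date range that spans many dynamically created partitions.
val df = spark.read
  .format("HiveAcid")
  .option("table", "mydb.events")
  .load()
  .filter("dt >= '2020-10-01' AND dt <= '2020-11-23'")

println(df.count())
```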

amoghmargoor self-assigned this Nov 27, 2020
amoghmargoor (Collaborator) commented Nov 27, 2020

Hi @adiu19, this seems like a bug in broadcasting the JobConf. Can you set the config spark.hiveAcid.parallel.partitioning.threshold to a very high value so that this code path is bypassed? It should be set higher than your number of partitions.
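
For readers hitting the same issue, a sketch of applying that workaround; the value 100000 is an arbitrary placeholder, pick anything above your partition count:

```scala
import org.apache.spark.sql.SparkSession

// Workaround sketch: raise the threshold above the number of partitions so
// the parallel partition-computation path (which broadcasts the JobConf and
// triggers the NPE above) is bypassed. 100000 is an arbitrary placeholder.
val spark = SparkSession.builder()
  .appName("spark-acid-read")
  .config("spark.hiveAcid.parallel.partitioning.threshold", "100000")
  .enableHiveSupport()
  .getOrCreate()
```

The same setting can be passed on the command line with `--conf spark.hiveAcid.parallel.partitioning.threshold=100000`.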

adiu19 (Author) commented Dec 3, 2020

@amoghmargoor thanks a lot, this worked. Are we planning to address this bug in an upcoming release?

amoghmargoor (Collaborator) commented Dec 5, 2020

@adiu19 yeah, I will take a look at it for the next release. Thanks for reporting. I may need some help reproducing it if I cannot reproduce it on our end.

maheshk114 (Contributor) commented

@amoghmargoor looks similar to the Kryo serialization issue.

amoghmargoor (Collaborator) commented

@maheshk114 in that case I believe it should have failed without the flag being disabled too. But anyway, that is a good point to consider. @adiu19 can you check whether you were using Kryo on your end?
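
A quick way to check, assuming a SparkSession named `spark` is in scope: Kryo is only used when spark.serializer points at KryoSerializer, since Spark's default is JavaSerializer:

```scala
// Prints org.apache.spark.serializer.KryoSerializer when Kryo is enabled;
// if the key is unset, Spark falls back to JavaSerializer.
val serializer = spark.sparkContext.getConf
  .get("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
println(s"spark.serializer = $serializer")
```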

adiu19 (Author) commented Dec 30, 2020

@amoghmargoor: we aren't using Kryo on our side.
