Hi, I'm running the cnv module with the following parameters:
deca-submit --master local[16] --conf spark.local.dir=/data/cnv/temp --conf spark.driver.maxResultSize=0 --conf spark.kryo.registrationRequired=true --executor-memory 32G --driver-memory 16G -- cnv -I $bam_list_dir/test_allrefs_females.list -l -o "CNVs_"$BATCH"_Females_withAllRefSamples.gff3" -L /data/cnv/reference_files/target_padded_exons_with_transcripts.bed
The machine has 40 cores and 64 GB of RAM available. This command fails when it runs concurrently with other tools that take up 32 cores and are also memory intensive; if no other tools are running, the DECA command above runs successfully to completion. These are the error messages from the failed attempt:
20/04/20 20:16:39 ERROR TaskSetManager: Task 47 in stage 47.0 failed 1 times; aborting job
20/04/20 20:16:41 INFO DAGScheduler: ShuffleMapStage 47 (mapPartitions at Coverage.scala:173) failed in 6957.414 s due to Job aborted due to stage failure: Task 47 in stage 47.0 failed 1 times, most recent failure: Lost task 47.0 in stage 47.0 (TID 3017, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 299196 ms
org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 47.0 failed 1 times, most recent failure: Lost task 47.0 in stage 47.0 (TID 3017, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 299196 ms
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 47.0 failed 1 times, most recent failure: Lost task 47.0 in stage 47.0 (TID 3017, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 299196 ms
As you can see in the command, we are using 16 threads, --executor-memory 32G and --driver-memory 16G. To make sure the command runs successfully even when other tools are running, which of these parameters would you recommend decreasing from their current settings? Could you also briefly describe the difference between executor and driver memory? It looks like the executor memory is per process; does that mean per thread?
Thanks very much.
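For reference, here is a rough sketch of the resource arithmetic behind the question, using only the figures quoted above. It assumes the worst case where executor and driver memory add up; whether they actually do in local mode is exactly what I am unsure about, so the memory figure is an upper bound and not a measurement. All variable names are made up for illustration.

```python
# Figures taken from the post above.
TOTAL_CORES = 40
TOTAL_RAM_GB = 64
OTHER_TOOLS_CORES = 32     # cores used by the tools running concurrently

DECA_THREADS = 16          # --master local[16]
EXECUTOR_MEMORY_GB = 32    # --executor-memory 32G
DRIVER_MEMORY_GB = 16      # --driver-memory 16G

# Cores left for DECA while the other tools are running.
free_cores = TOTAL_CORES - OTHER_TOOLS_CORES

# Threads requested beyond the cores that are actually free.
oversubscribed_threads = max(0, DECA_THREADS - free_cores)

# ASSUMPTION: worst case, the two memory settings are additive.
worst_case_request_gb = EXECUTOR_MEMORY_GB + DRIVER_MEMORY_GB

# RAM left for every other tool under that worst-case assumption.
ram_left_for_others_gb = TOTAL_RAM_GB - worst_case_request_gb

print(f"free cores: {free_cores}, threads over budget: {oversubscribed_threads}")
print(f"worst-case memory request: {worst_case_request_gb} GB, "
      f"left for other tools: {ram_left_for_others_gb} GB")
```

So with the other tools running, local[16] asks for twice as many threads as there are free cores, and under the additive assumption only 16 GB would remain for the memory-intensive tools.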