Spark-submit task fails on deploy mode cluster #21799
Replies: 2 comments 6 replies
-
I have no idea if that's it (I have no idea about spark submit) but you have a typo: |
Beta Was this translation helpful? Give feedback.
-
I hope this help someone like me :D In case of me, I'm using
Set
Go to Airflow > connection > edit: spark-default
Now I can see log on dag's log as below.(AIRFLOW__LOGGING__LOGGING_LEVEL: 'DEBUG') [2024-03-25, 15:03:36 KST] {spark_submit.py:559} DEBUG - polling status of spark driver with id driver-20240325060331-0000
[2024-03-25, 15:03:36 KST] {spark_submit.py:359} INFO - ['/usr/bin/curl', '--max-time', '30', 'http://spark-master:6066/v1/submissions/status/driver-20240325060331-0000']
[2024-03-25, 15:03:36 KST] {spark_submit.py:382} DEBUG - Poll driver status cmd: ['/usr/bin/curl', '--max-time', '30', 'http://spark-master:6066/v1/submissions/status/driver-20240325060331-0000']
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: % Total % Received % Xferd Average Speed Time Time Time Current
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: Dload Upload Total Spent Left Speed
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log:
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: 100 272 100 272 0 0 45333 0 --:--:-- --:--:-- --:--:-- 45333
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: {
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "action" : "SubmissionStatusResponse",
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "driverState" : "FAILED",
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "serverSparkVersion" : "3.4.1",
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "submissionId" : "driver-20240325060331-0000",
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "success" : true,
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "workerHostPort" : "172.28.0.2:17141",
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: "workerId" : "worker-20240325060303-172.28.0.2-17141"
[2024-03-25, 15:03:36 KST] {spark_submit.py:513} DEBUG - spark driver status log: } And please double check on your dag file; spark_conf={
"spark.standalone.submit.waitAppCompletion":"true",
...
}
# we can set interval second via "status_poll_interval"
spark_submit = SparkSubmitOperator(
....
conf=spark_conf,
status_poll_interval=5,
) |
Beta Was this translation helpful? Give feedback.
-
Airflow v2.2.4 running on Linux (same error with Airflow v2.2.3)
Spark v3.1.2 running on Linux
Hi,
I'm trying to run a spark-submit task on a spark cluster. My Spark connection is configured with extra {"deploy-mode": "cluster"}.
My task is executed successfully on Spark, but when it finishes, I receive the following error:
[2022-02-24, 08:33:11 ] {spark_submit.py:488} INFO - 22/02/24 11:33:11 INFO ClientEndpoint: Driver successfully submitted as driver-20220224113311-0599
[2022-02-24, 08:33:16 ] {spark_submit.py:488} INFO - 22/02/24 11:33:16 INFO ClientEndpoint: State of driver-20220224113311-0599 is FINISHED
[2022-02-24, 08:33:16 ] {spark_submit.py:488} INFO - 22/02/24 11:33:16 INFO ClientEndpoint: State of driver driver-20220224113311-0599 is FINISHED, exiting spark-submit JVM.
[2022-02-24, 08:33:16 ] {spark_submit.py:488} INFO - 22/02/24 11:33:16 INFO ShutdownHookManager: Shutdown hook called
[2022-02-24, 08:33:16 ] {spark_submit.py:488} INFO - 22/02/24 11:33:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-08e72231-8b3d-4842-9b86-09e85226aa64
[2022-02-24, 08:33:52 ] {taskinstance.py:1718} ERROR - Task failed with exception
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1334, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1460, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1516, in _execute_task
result = execute_callable(context=context)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
self._hook.submit(self._application)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 436, in submit
self._start_driver_status_tracking()
File "/root/anaconda3/lib/python3.9/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 575, in _start_driver_status_tracking
raise AirflowException(
airflow.exceptions.AirflowException: Failed to poll for the driver status 10 times: returncode = 1
[2022-02-24, 08:33:52 ] {taskinstance.py:1272} INFO - Marking task as FAILED. dag_id=teste_spark_submit_airflow, task_id=spark_processing, execution_date=20220224T143307, start_date=20220224T143308, end_date=20220224T143352
[2022-02-24, 08:33:52 ] {standard_task_runner.py:89} ERROR - Failed to execute job 9 for task spark_processing
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
args.func(args, dag=self.dag)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 298, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
_run_raw_task(args, ti)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 180, in _run_raw_task
ti._run_raw_task(
File "/root/anaconda3/lib/python3.9/site-packages/airflow/utils/session.py", line 70, in wrapper
return func(*args, session=session, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1334, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1460, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1516, in _execute_task
result = execute_callable(context=context)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
self._hook.submit(self._application)
File "/root/anaconda3/lib/python3.9/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 436, in submit
self._start_driver_status_tracking()
File "/root/anaconda3/lib/python3.9/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 575, in _start_driver_status_tracking
raise AirflowException(
airflow.exceptions.AirflowException: Failed to poll for the driver status 10 times: returncode = 1
[2022-02-24, 08:33:53 ] {local_task_job.py:154} INFO - Task exited with return code 1
[2022-02-24, 08:33:53 ] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
When I remove the extra "deploy-mode: cluster", this error doesn't occur.
I saw that this in spark_submit.py code:
def _resolve_should_track_driver_status(self) -> bool: """ Determines whether or not this hook should poll the spark driver status through subsequent spark-submit status requests after the initial spark-submit request :return: if the driver status should be tracked """ return 'spark://' in self._connection['master'] and self._connection['deploy_mode'] == 'cluster'
How can I solve this? I really need to run on cluster mode.
Thank you
Beta Was this translation helpful? Give feedback.
All reactions