Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Dataproc (Python models) do not retry polling on 5xx errors #1271

Open
2 tasks done
OSalama opened this issue Jul 4, 2024 · 2 comments · May be fixed by #1275
Open
2 tasks done

[Bug] Dataproc (Python models) do not retry polling on 5xx errors #1271

OSalama opened this issue Jul 4, 2024 · 2 comments · May be fixed by #1275
Labels
enhancement New feature or request python_models

Comments

@OSalama
Copy link

OSalama commented Jul 4, 2024

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

After dbt-bigquery submits a dataproc batch job, it enters a polling mode, waiting for a response to indicate the job completed (dbt/adapters/bigquery/dataproc/batch.py#29). This polling process is not retrying transient errors, so the dbt run ends up failing with an error, while the actual dataproc job runs to completion successfully.

Expected Behavior

We should be performing the dataproc batch polling with a retry strategy to retry transient errors.

The BatchControllerClient.get_batch() function takes a retry parameter:

retry (google.api_core.retry.Retry): Designation of what errors, if any,should be retried.`

so we should pass one in!

Steps To Reproduce

This relies on GCP API throwing a 5xx error so it's not very reproducible, though we are hitting the error at least once a day in our runs.

  1. Use dbt-bigquery 1.8.1
  2. Configure Dataproc serverless in profiles.yml:
  target: dev
  outputs:
    dev:
      type: bigquery
      location: EU
      job_retries: 5
      dataproc_batch:
        environment_config:
          execution_config:
            service_account: [email protected]
            subnetwork_uri: projects/data-dev/regions/europe-west1/subnetworks/dataproc-dbt-subnet
  1. Submit many long-running Python models
  2. Wait until one of them fails.

Relevant log output

11:05:53.104591 [info ] [Thread-1  ]: BigQuery adapter: Submitting batch job with id: d9cb4840-a1f1-49f3-bd8b-2af2d8dbcbfd
11:06:40.287936 [error] [Thread-1  ]: Unhandled error while executing target/run/core/models/data_tables/DT_segments_variants.py
503 502:Bad Gateway
11:06:40.298010 [debug] [Thread-1  ]: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "502:Bad Gateway"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-07-04T11:06:40.285827131+00:00", grpc_status:14, grpc_message:"502:Bad Gateway"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dbt/task/base.py", line 368, in safe_run
    result = self.compile_and_execute(manifest, ctx)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/base.py", line 314, in compile_and_execute
    result = self.run(ctx.node, manifest)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/base.py", line 415, in run
    return self.execute(compiled_node, manifest)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/run.py", line 298, in execute
    result = MacroGenerator(
  File "/usr/local/lib/python3.9/site-packages/dbt/clients/jinja.py", line 84, in __call__
    return self.call_macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt_common/clients/jinja.py", line 298, in call_macro
    return macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 768, in __call__
    return self._invoke(arguments, autoescape)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 782, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 89, in macro
  File "/usr/local/lib/python3.9/site-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 303, in call
    return __obj(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt/clients/jinja.py", line 84, in __call__
    return self.call_macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt_common/clients/jinja.py", line 298, in call_macro
    return macro(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 768, in __call__
    return self._invoke(arguments, autoescape)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 782, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 55, in macro
  File "/usr/local/lib/python3.9/site-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/jinja2/runtime.py", line 303, in call
    return __obj(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dbt/context/providers.py", line 1444, in submit_python_job
    return self.adapter.submit_python_job(parsed_model, compiled_code)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/base/impl.py", line 175, in execution_with_log
    response = code_execution_function(*args)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/base/impl.py", line 1536, in submit_python_job
    submission_result = job_helper.submit(compiled_code)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/python_submissions.py", line 75, in submit
    return self._submit_dataproc_job()
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/python_submissions.py", line 144, in _submit_dataproc_job
    return poll_batch_job(
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/bigquery/dataproc/batch.py", line 38, in poll_batch_job
    response = job_client.get_batch(  # type: ignore
  File "/usr/local/lib/python3.9/site-packages/google/cloud/dataproc_v1/services/batch_controller/client.py", line 950, in get_batch
    response = rpc(
  File "/usr/local/lib/python3.9/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.ServiceUnavailable: 503 502:Bad Gateway

Environment

- OS: docker pull python:3.9.8-slim-buster (Debian)
- Python: Python 3.9.8
- dbt-core: 1.8.3
- dbt-bigquery: 1.8.1

Additional Context

Feels similar to #682, but I believe the Dataproc element means we're dealing with a separate code path that doesn't benefit from RETRYABLE_ERRORS. The stacktrace also doesn't mention dbt/adapters/bigquery/connections.py.

@OSalama OSalama added bug Something isn't working triage labels Jul 4, 2024
@OSalama OSalama linked a pull request Jul 8, 2024 that will close this issue
4 tasks
@OSalama
Copy link
Author

OSalama commented Jul 29, 2024

I've been running my proposed fix (#1275) in production for 2 weeks and have not seen any further occurrences of the issue.

@amychen1776 amychen1776 added python Pull requests that update Python code enhancement New feature or request python_models and removed triage bug Something isn't working python Pull requests that update Python code labels Aug 27, 2024
@jmesterh
Copy link

jmesterh commented Sep 24, 2024

Just ran across this today as well (Python model / GCP / Dataproc)

Unhandled error while executing /tmp/tmppi1e0ffu/run/common/models/modelname.py
503 Socket closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python_models
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants