
[ADAP-727] [Bug] Redshift adapter does not retry on connection timeouts #556

Closed
tlento opened this issue Jul 26, 2023 · 6 comments · Fixed by #601
Labels: enhancement (New feature or request)


tlento commented Jul 26, 2023

Is this a new bug in dbt-redshift?

  • I believe this is a new bug in dbt-redshift
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When running a large number of concurrent queries through the dbt-redshift adapter (e.g., by running the MetricFlow test suite using a dbt-redshift client), we observe intermittent failures with this message in the logs:

E           redshift_connector.error.InterfaceError: ('communication error', TimeoutError(60, 'Operation timed out'))

We tried changing the retry count to a larger number, and this made no apparent difference. According to the stack trace, the retry is not happening, and on further inspection we discovered that the InterfaceError is not considered a retryable exception:

https://github.com/dbt-labs/dbt-redshift/blob/main/dbt/adapters/redshift/connections.py#L335-L337
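For reference, the retryable set at those lines covers only three redshift_connector exception classes. The snippet below is reconstructed from the retryable_exceptions value visible in the trace further down, not copied from source, so the exact code may differ slightly:

    # Reconstructed from the trace below; InterfaceError is notably absent.
    from redshift_connector.error import DatabaseError, DataError, OperationalError

    retryable_exceptions = [OperationalError, DatabaseError, DataError]
    # Because InterfaceError is not listed, the ('communication error',
    # TimeoutError(...)) raised as redshift_connector.error.InterfaceError falls
    # through to the generic except block and is re-raised as
    # FailedToConnectError without any retry attempt.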

Stack trace excerpt:

metricflow/cli/dbt_connectors/adapter_backed_client.py:125: in query
    result = self._adapter.execute(sql=statement, auto_begin=True, fetch=True)
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/base/impl.py:290: in execute
    return self.connections.execute(sql=sql, auto_begin=auto_begin, fetch=fetch, limit=limit)
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/redshift/connections.py:357: in execute
    _, cursor = self.add_query(sql, auto_begin)
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/redshift/connections.py:382: in add_query
    connection, cursor = super().add_query(
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/sql/connections.py:57: in add_query
    self.begin()
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/sql/connections.py:172: in begin
    self.add_begin_query()
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/sql/connections.py:155: in add_begin_query
    return self.add_query("BEGIN", auto_begin=False)
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/redshift/connections.py:382: in add_query
    connection, cursor = super().add_query(
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/sql/connections.py:79: in add_query
    cursor = connection.handle.cursor()
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/contracts/connection.py:94: in handle
    self._handle.resolve(self)
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/contracts/connection.py:118: in resolve
    return self.opener(connection)
../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/redshift/connections.py:340: in open
    return cls.retry_connection(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'dbt.adapters.redshift.connections.RedshiftConnectionManager'>
connection = Connection(type='redshift', name='MetricFlow_request_mf_rid__9o0a7sct', state=<ConnectionState.CLOSED: 'closed'>, tran...lse, connect_timeout=None, role=None, sslmode=<UserSSLMode.prefer: 'prefer'>, retries=1, region=None, autocommit=True))
connect = <function RedshiftConnectMethodFactory.get_connect_method.<locals>.connect at 0x280207430>, logger = AdapterLogger(name='Redshift')
retryable_exceptions = [<class 'redshift_connector.error.OperationalError'>, <class 'redshift_connector.error.DatabaseError'>, <class 'redshift_connector.error.DataError'>], retry_limit = 1
retry_timeout = <function RedshiftConnectionManager.open.<locals>.exponential_backoff at 0x28026cdc0>, _attempts = 0

    @classmethod
    def retry_connection(
        cls,
        connection: Connection,
        connect: Callable[[], AdapterHandle],
        logger: AdapterLogger,
        retryable_exceptions: Iterable[Type[Exception]],
        retry_limit: int = 1,
        retry_timeout: Union[Callable[[int], SleepTime], SleepTime] = 1,
        _attempts: int = 0,
    ) -> Connection:
        """Given a Connection, set its handle by calling connect.

        The calls to connect will be retried up to retry_limit times to deal with transient
        connection errors. By default, one retry will be attempted if retryable_exceptions is set.

        :param Connection connection: An instance of a Connection that needs a handle to be set,
            usually when attempting to open it.
        :param connect: A callable that returns the appropriate connection handle for a
            given adapter. This callable will be retried retry_limit times if a subclass of any
            Exception in retryable_exceptions is raised by connect.
        :type connect: Callable[[], AdapterHandle]
        :param AdapterLogger logger: A logger to emit messages on retry attempts or errors. When
            handling expected errors, we call debug, and call warning on unexpected errors or when
            all retry attempts have been exhausted.
        :param retryable_exceptions: An iterable of exception classes that if raised by
            connect should trigger a retry.
        :type retryable_exceptions: Iterable[Type[Exception]]
        :param int retry_limit: How many times to retry the call to connect. If this limit
            is exceeded before a successful call, a FailedToConnectError will be raised.
            Must be non-negative.
        :param retry_timeout: Time to wait between attempts to connect. Can also take a
            Callable that takes the number of attempts so far, beginning at 0, and returns an int
            or float to be passed to time.sleep.
        :type retry_timeout: Union[Callable[[int], SleepTime], SleepTime] = 1
        :param int _attempts: Parameter used to keep track of the number of attempts in calling the
            connect function across recursive calls. Passed as an argument to retry_timeout if it
            is a Callable. This parameter should not be set by the initial caller.
        :raises dbt.exceptions.FailedToConnectError: Upon exhausting all retry attempts without
            successfully acquiring a handle.
        :return: The given connection with its appropriate state and handle attributes set
            depending on whether we successfully acquired a handle or not.
        """
        timeout = retry_timeout(_attempts) if callable(retry_timeout) else retry_timeout
        if timeout < 0:
            raise dbt.exceptions.FailedToConnectError(
                "retry_timeout cannot be negative or return a negative time."
            )

        if retry_limit < 0 or retry_limit > sys.getrecursionlimit():
            # This guard is not perfect; others may add to the recursion limit (e.g. built-ins).
            connection.handle = None
            connection.state = ConnectionState.FAIL
            raise dbt.exceptions.FailedToConnectError("retry_limit cannot be negative")

        try:
            connection.handle = connect()
            connection.state = ConnectionState.OPEN
            return connection

        except tuple(retryable_exceptions) as e:
            if retry_limit <= 0:
                connection.handle = None
                connection.state = ConnectionState.FAIL
                raise dbt.exceptions.FailedToConnectError(str(e))

            logger.debug(
                f"Got a retryable error when attempting to open a {cls.TYPE} connection.\n"
                f"{retry_limit} attempts remaining. Retrying in {timeout} seconds.\n"
                f"Error:\n{e}"
            )

            sleep(timeout)
            return cls.retry_connection(
                connection=connection,
                connect=connect,
                logger=logger,
                retry_limit=retry_limit - 1,
                retry_timeout=retry_timeout,
                retryable_exceptions=retryable_exceptions,
                _attempts=_attempts + 1,
            )

        except Exception as e:
            connection.handle = None
            connection.state = ConnectionState.FAIL
>           raise dbt.exceptions.FailedToConnectError(str(e))
E           dbt.exceptions.FailedToConnectError: Database Error
E             ('communication error', TimeoutError(60, 'Operation timed out'))

../../.venvs/metricflow/zJM50P7N/redshift-env/lib/python3.9/site-packages/dbt/adapters/base/connections.py:271: FailedToConnectError

We can see at the end that the error is thrown out of the broader except block, rather than the retry handling.

Expected Behavior

I expected these errors to be retried, but I understand why that isn't happening.

I would also expect redshift_connector not to hide this inside an InterfaceError; that's just odd, but it is what it is.

Steps To Reproduce

I think the only way to do this is to run lots of concurrent queries on an undersized cluster until you trigger intermittent failures of this type. I've got a PR up that might trigger this in the MetricFlow CI as well; I'll see if it happens there. If it's only happening on my local machine, maybe I need to do whatever that weird Redshift override thing is.

Relevant log output

No response

Environment

- OS: Mac OS X Ventura 13.4.1 (c)
- Python: 3.9.latest
- dbt-core: 1.6.0rc1
- dbt-redshift: 1.6.0rc1

Additional Context

I am not using dbt proper, just the adapter. See dbt-labs/metricflow#678

tlento added the bug and triage labels on Jul 26, 2023
github-actions bot changed the title "[Bug] Redshift adapter does not retry on connection timeouts" to "[ADAP-727] [Bug] Redshift adapter does not retry on connection timeouts" on Jul 26, 2023

tlento commented Jul 26, 2023

Update - I have not yet been able to repro this in GitHub's CI runners, so it's possible this error is due to some local machine configuration. Hopefully somebody better versed in the Redshift connector error state reporting can figure this out, because I'm stumped.

Also worth noting, this does not happen if I use the sqlalchemy-redshift library with redshift_connector specified.

graciegoheen assigned dataders and Fleid and unassigned dataders on Jul 27, 2023

Fleid commented Jul 31, 2023

Hey @tlento,

I don't know how I feel about the premise; it feels like an edge case. You are doing high-concurrency work on an adapter that's not designed for that. I'm saying that because I won't qualify this issue as a bug, but rather as an enhancement.

Now to unlock you:

  • We could add InterfaceError to the list of retryable error types (a rough sketch of what that could look like follows this list). But I'm not sure what the side effects would be; do you know what other error types we would start retrying? That sounds like too blunt an option to me at this point.
  • I'm going to tag in our AWS friends here, they may have some answers from the library standpoint. Hopefully we will hear from them soon.
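For illustration only, the first option might look like the sketch below, assuming the list is simply extended where it is built in connections.py; this is a sketch of the idea, not the change that was eventually merged:

    # Hypothetical sketch: extend the retryable set with InterfaceError.
    # Side effects are unclear, as noted above; shown only to make the option concrete.
    from redshift_connector.error import (
        DatabaseError,
        DataError,
        InterfaceError,
        OperationalError,
    )

    retryable_exceptions = [
        OperationalError,
        DatabaseError,
        DataError,
        InterfaceError,  # would cover ('communication error', TimeoutError(...))
    ]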

Fleid added the enhancement label and removed the bug and triage labels on Jul 31, 2023

tlento commented Jul 31, 2023

Thanks @Fleid. I tagged it as a bug in case this was happening for all timeouts, but if it's just this scenario I agree it's not really an adapter bug that needs fixing; it's more of a nice-to-have enhancement around retry behavior.

I didn't bother putting up a PR to add InterfaceError because I have no idea what else that would open up to retries.

We are not currently blocked; our CI test suite seems to be fine, and that's at the edge of what we might expect an end user doing metrics development to dispatch. So this is more of a developer nuisance than anything else. Hopefully the AWS folks have some insights to share!

jiezhen-chen commented

Since the PR on the redshift_connector side is merged, shall we go ahead and close out this issue? @Fleid

dataders commented

Hey @jiezhen-chen, has the change been released to PyPI? If so, that's awesome! The next step is to open a PR to increment the version pinned in setup.py:

dbt-redshift/setup.py

Lines 88 to 90 in e95685f

# dbt-redshift depends deeply on this package. it does not follow SemVer, therefore there have been breaking changes in previous patch releases
# Pin to the patch or minor version, and bump in each new minor version of dbt-redshift.
"redshift-connector==2.0.913",

gajanand-jm commented

Facing the same error when using dbt debug for a Redshift connection via GitHub Actions.

[Screenshot of the error attached, dated 2023-09-30]
