Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provider reconnection topic #1472

Open
aepfli opened this issue Dec 13, 2024 · 7 comments
Open

Provider reconnection topic #1472

aepfli opened this issue Dec 13, 2024 · 7 comments

Comments

@aepfli
Copy link
Member

aepfli commented Dec 13, 2024

Currently, all our providers implement their back-off for the stream connection. But in a buggy way, it only recreates the stream; GRPC handles the reconnection independently. see https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md

This might lead to false assumptions about our configuration and paint the wrong picture regarding what people can configure and what to expect.

languages

GRPC usually already offers a battle-tested reconnection logic. However, the configuration option is not the same for everyone.

For example, sometimes, we can set 3 out of the five defined options, but multiplier and jitter seem to be something we can't correctly set in all of them.

Java

Java has an exponential backoff, which is not exposed for modification like in other languages see grpc/grpc-java#9353

private long initialBackoffNanos = TimeUnit.SECONDS.toNanos(1);
  private long maxBackoffNanos = TimeUnit.MINUTES.toNanos(2);
  private double multiplier = 1.6;
  private double jitter = .2;

https://github.com/grpc/grpc-java/blob/3b39a83621626c844b16e64ec6389511903ee075/core/src/main/java/io/grpc/internal/ExponentialBackoffPolicy.java#L40-L43

findings

  • notifyWhenStateChanged(...) registers a one-off callback and you need to re-register again

C++, Python, Ruby, Objective-C, PHP, C#

Python uses the C++ library in the background and, therefore, offers a min and max backoff, and it will be a random value in that range. Values are mapped; see grpc/grpc#25540 (comment). Seems like every language is in sync, but the docs would not suggest this behavior. I found the comment, but I am still looking to find the proper documentation.

GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS i.e. "grpc.initial_reconnect_backoff_ms" corresponds to INITIAL_BACKOFF from the algorithm.
The other configurable args are -
GRPC_ARG_MIN_RECONNECT_BACKOFF_MS "grpc.min_reconnect_backoff_ms" which corresponds to MIN_CONNECT_TIMEOUT
GRPC_ARG_MAX_RECONNECT_BACKOFF_MS "grpc.max_reconnect_backoff_ms" which corresponds to MAX_BACKOFF

from the grpc c++ docs (maybe i am misintrepreting this docs):

#define GRPC_ARG_MAX_RECONNECT_BACKOFF_MS   "grpc.max_reconnect_backoff_ms"
The maximum time between subsequent connection attempts, in ms. 

#define GRPC_ARG_MIN_RECONNECT_BACKOFF_MS   "grpc.min_reconnect_backoff_ms"
The minimum time between subsequent connection attempts, in ms.

#define GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS   "grpc.initial_reconnect_backoff_ms"
The time between the first and second connection attempts, in ms. 

subchannel defaults
https://github.com/grpc/grpc/blob/782814e28006f132880c0aebbc3d0833d515fa61/src/core/client_channel/subchannel.cc#L75-L79

javaScript

deprecated implementation wrapping C++ client

utilizes C++ in the background, the same as python

grpc-js

Js implementation, here is a list of config properties: https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/README.md#supported-channel-options. It only supports two out of the three in Python used, based on docs

grpc.initial_reconnect_backoff_ms
grpc.max_reconnect_backoff_ms

const INITIAL_BACKOFF_MS = 1000;
const BACKOFF_MULTIPLIER = 1.6;
const MAX_BACKOFF_MS = 120000;
const BACKOFF_JITTER = 0.2;

https://github.com/grpc/grpc-node/blob/263c478c9a0216e1c864850248ce8efffd7c2da5/packages/grpc-js/src/backoff-timeout.ts#L18C1-L21C28

Summary

Ultimately, all of them offer a reliable backoff mechanism but not a unified one. They're somewhat unified but different to configure, if even possible. Hence, whether we want to offer a custom backoff is questionable, especially as it is currently not behaving as it should.

language initial (grpc.initial_reconnect_backoff_ms) max (grpc.max_reconnect_backoff_ms) timeout (grpc.min_reconnect_backoff_ms) jitter multiplier
C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated) 0.2 1.6
js 0.2 1.6
java 0.2 1.6

Implementation-wise, it would be good to delegate reconnection to GRPC as it reduces our complexity and code to maintain. On the other hand, people might rely on it (this ticket is a bug, and we need to fix it).

If we delegate reconnection to GRPC, we should try to configure it, if possible (see table). Furthermore, we should utilize connection listeners (java: notifyWhenStateChanged, python: subscribe ...) to handle our connection state and Wait-For-Ready for our streams to wait for a connection.

Generally, we should define a general approach to handling the connection topic, e.g., reconnect, error state, and best practice, so we sync all our implementations where possible and especially highlight provider implementations that differ from the recommendation.

sidenote

I created this ticket because there have already been many discussions regarding this. Documenting this is quite essential; otherwise, we will talk about this again in a year and start from scratch.

@beeme1mr
Copy link
Member

It's a bummer that it's inconsistently implemented in gRPC, but I agree that we should try and leverage as much of their reconnection logic as possible.

@aepfli
Copy link
Member Author

aepfli commented Dec 15, 2024

update: it seems like although misleading by the naming, python follows the connection strategy define here https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md. but the values are mapped strangely:

grpc/grpc#25540 (comment):

GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS i.e. "grpc.initial_reconnect_backoff_ms" corresponds to INITIAL_BACKOFF from the algorithm.
The other configurable args are -
GRPC_ARG_MIN_RECONNECT_BACKOFF_MS "grpc.min_reconnect_backoff_ms" which corresponds to MIN_CONNECT_TIMEOUT
GRPC_ARG_MAX_RECONNECT_BACKOFF_MS "grpc.max_reconnect_backoff_ms" which corresponds to MAX_BACKOFF

which is cool, but that leaves us still with the problem, that in java, it is hard to change the values, i will dig into this, and will also provide js examples

@aepfli
Copy link
Member Author

aepfli commented Dec 15, 2024

a note worthy link for the defaults of subchannel timeouts grpc/grpc#23861 (comment)

@aepfli
Copy link
Member Author

aepfli commented Dec 15, 2024

So contrary to my first comment, they all implement the same thing, based on the documentation, but the settings mapping is odd. Anyways our initial problem, of race condition, due to custom implementation on top still exists.

Sorry for all the confusion i am causing - but as you see, this is all collected through different github issues, rather than documentaion

@aepfli
Copy link
Member Author

aepfli commented Dec 15, 2024

So we can't modify the jitter and the multiplier, but it seems like we can modify some of the other properties, depending on the language

language initial max timeout jitter multiplier
C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated) 0.2 1.6
js 0.2 1.6
java 0.2 1.6

Maybe i misread something, and somebody should cross-check

To summarize:

  • We should remove our custom backoff mechanism for connection and add those properties to the connection.
  • We might want to implement retryConfig for RPC mode
  • We want to configure the GRPC connection backoff, where we can. (seems like everywhere except java)
  • We want to document the behavior of our providers.

The good part is that after this weekend's exploration. We can keep our configuration and definition, as the defaults are the same as those used by GRPC. We can remove our own backoff and we can use Wait-For-Ready for our streams.

@toddbaert
Copy link
Member

So we can't modify the jitter and the multiplier, but it seems like we can modify some of the other properties, depending on the language

language initial max timeout jitter multiplier
C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated) ✅ ✅ ✅ 0.2 1.6
js ✅ ✅ ❌ 0.2 1.6
java ❌ ❌ ❌ 0.2 1.6
Maybe i misread something, and somebody should cross-check

To summarize:

  • We should remove our custom backoff mechanism for connection and add those properties to the connection.
  • We might want to implement retryConfig for RPC mode
  • We want to configure the GRPC connection backoff, where we can. (seems like everywhere except java)
  • We want to document the behavior of our providers.

The good part is that after this weekend's exploration. We can keep our configuration and definition, as the defaults are the same as those used by GRPC. We can remove our own backoff and we can use Wait-For-Ready for our streams.

@aepfli I read the docs and did checked some of the inline ones as well and I can confirm your conclusions.

I agree with @beeme1mr that I think the best path forward is to leverage the underlying reconnect mechanism and sort of "piggyback" on that to re-establish the stream by listening to the connection state. This will mean that our "stream-re-establishment" algo will match gRPC's re-connection , and will be configurable everywhere gRPC's re-connection params are.

toddbaert pushed a commit that referenced this issue Dec 19, 2024
…#1478)

updating specification based on findings in
#1472

---------

Signed-off-by: Simon Schrottner <[email protected]>
Co-authored-by: chrfwow <[email protected]>
@aepfli
Copy link
Member Author

aepfli commented Dec 29, 2024

i think we can close this issue now, as we also documented this in the spec. wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants