-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamic client endpoints #676
base: master
Are you sure you want to change the base?
Dynamic client endpoints #676
Conversation
/// set field to a tombstone specified in the field description. | ||
#[derive(Clone, Debug, Default, derive_builder::Builder)] | ||
#[non_exhaustive] | ||
pub struct ClientOptionsUpdate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the entire ClientOptions
struct be reused instead? Not every field has to be optional. You can require the callers provide the same data they did for original client options if they don't want to change it. The lang side doesn't need help here.
(would also be neat if some of the runtime options like headers was set/updated via this common mechanism and didn't require reconnect, but that doesn't have to be now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this probably makes decent sense too. I imagine the lang-side API will typically involve the user copying some existing options and changing them, in which case this all works out fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the only fields that I can change dynamically, because they are associated with the endpoint, and I think it makes it more clear what can be changed with a separate struct. Also, fields like target_url
are not optional in ClientOptions but need to be optional in update, or remember the original value set, which is a pain with concurrent updates...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with swapping out the whole options behind a mutex which I mentioned in another comment, then the other fields should be changeable too I think, which could work out nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things like retry_config
are used by the Retry wrapper that I cannot change. client_name
or client_version
or identity
may be used in metrics, we probably don't want to change them, and I cannot see a use case to change them dynamically... We could fail the update if they don't match.
I still think that is easier in the API to do deltas with only the changeable attributes, but internally we could use ClientOptions instead of a changeset, if we are ok with a mutex on options. We would still need some state for the updates that does not belong in options, such as the channel, the version id,... but we could probably lock them independently from options...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't they be changed? I agree if they can't be changed it makes sense to accept something that only allows changing what can be. But, I'm not sure I follow why we couldn't change those. They are accessed each time a request is made - seems like changing them would affect subsequent requests and work properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for retry_config we cache it inside:
impl RetryClient {
/// Use the provided retry config with the provided client
pub fn new(client: SG, retry_config: RetryConfig) -> Self {
Self {
client,
retry_config: Arc::new(retry_config),
}
}
}
we could change that if needed, but not sure if it is that important to dynamically change retry_config...
The others we can, but the question is whether we should, if it is used in metrics, or server side, it may be confusing, and
I don't see much value of changing them dynamically...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have to actually expose those to end users, but I take your point. Especially since the RetryClient
is a level higher. I'm fine with having just the diff struct.
/// If the update fails, the client does not rollback to the previous configuration, and future | ||
/// connections will fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is rough. Why can't we only apply the update after the new endpoint is connected and works? Can't you do an atomic swap of the endpoint where there is no downtime and it only contains a successfully connected endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I would call this try_update_options
and only swap if the connection succeeds. Also I find leaving the old options to be scary since now things aren't in sync.
It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using the ClientOptions
directly rather than ClientOptionsUpdate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking into the atomic swap reusing the id for the endpoint, but the problem is that it s anything but atomic.
When you reuse the id the endpoint gets in a "preparing" queue, the old endpoint still active, when the new one gets connected the swap happens. The problem is that this is all asynchronous, and if there is a problem with the new endpoint there is no visible error, and the old one continuous forever. It is also impossible to know when this swap will happen, or validate that it happened without sending a carefully crafted request. The endpoints are also created in lazy mode, so they need to be part of channel to validate them.
I think the root problem is that this mechanism was designed for load balancing, not dynamic changes, and if you are adding more (similar) endpoints with different targets, you can be more relaxed about when exactly changes happened.
For this reason, I'm forcing the client to fail when there are problems, otherwise they get silently ignored...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that this is all asynchronous
Yeah I see that with the balance channel and makes sense. But can we wait to even send the insert until we have connected the endpoint? And can we send the remove after the insert? So logic would be:
endpoint = connect_new()
channel_tx.send(insert(endpoint))
channel_tx.send(remove(old_endpoint_seq))
Also curious, if you send an insert then a remove, is it guaranteed that the next call will be on the inserted? And in the case of what's there now w/ remove-then-insert, is it possible for multi-threaded use that there is a time where no endpoint is there?
I think the root problem is that this mechanism was designed for load balancing, not dynamic changes
I wonder if we can have our own "mutex_channel" or something, or if that's too much (EDIT: I see ::new
is crate-private)
The endpoints are also created in lazy mode, so they need to be part of channel to validate them.
This is unfortunate but if we can't eagerly validate, ok. I now wonder if we should go higher than channel. What I mean is if we can swap our own channel for a completely new one instead of updating the channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that you can easily swap the channel without creating a new client, and the problem of creating new clients is that there is a lot of shared state associated to them, like eager workflow start workers, or metrics,... that need to be moved across atomically.
The semantics of the control channel are really weak. Remove happens immediately, but inserts are queued in a "preparing" queue, so there is no guarantee that after sending insert ok you are getting the new options.
To avoid the gap you can reuse the id in insert, and then you don't need a remove, but there is no guarantee when, or if, the swap will happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that it is "lazy". It will create the connection with the first request, and we want to make sure that this connection is not created until we finish the upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should attempt a get-system-info call on the client before swapping it. IMO we want to make sure the connection is created and is a known config to be successful even way before we swap it out. We don't want to swap out lazy or a when given a bad client, the worker will just silently fail for a minute (our default retry iirc) then fatal out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before the swap we can check the connection, but that doesn't help for lazy. Internally Tonic defers the connection until it gets the first request. And even if we do a get-system-info, there is a race with someone else doing the first call after dropping the lock, if they grabbed it before the update started, and getting the error...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before the swap we can check the connection, but that doesn't help for lazy
This is why I'm suggesting a higher level client swap. Don't swap things at the Tonic level, swap the connected Temporal client.
there is a race with someone else doing the first call after dropping the lock, if they grabbed it before the update started, and getting the error
I think we can accept/document this race. If you're really concerned, we can use a separate lock for the entire client update process to prevent concurrent updates and put a short timeout on the get-system-info call, but of course we should never hold a lock during regular client operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably do this:
pub struct ConfiguredClient<C> {
client_and_options: ArcSwap<ClientAndOptions>,
headers: Arc<RwLock<HashMap<String, String>>>,
/// Capabilities as read from the `get_system_info` RPC call made on client connection
capabilities: Option<get_system_info_response::Capabilities>,
workers: Arc<SlotManager>,
}
pub struct ClientAndOptions<C> {
client: C,
options: ClientOptions
}
If ArcSwap doesn't work out for whatever reason (it might be hard to use load()
references in all cases) then you can do the Arc<RwLock<Arc<ClientAndOptions>>>
approach
@@ -87,6 +91,9 @@ static LONG_POLL_METHOD_NAMES: [&str; 3] = [ | |||
const LONG_POLL_TIMEOUT: Duration = Duration::from_secs(70); | |||
const OTHER_CALL_TIMEOUT: Duration = Duration::from_secs(30); | |||
|
|||
/// Buffer size for the channel that listens to change events | |||
const DEFAULT_BUFFER_SIZE: usize = 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this can probably be a lot smaller (though I doubt it matters a whole lot either way).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied the constant used internally by tonic, frankly, no idea about the right size... But, yes, being a control channel, that looks large...
/// set field to a tombstone specified in the field description. | ||
#[derive(Clone, Debug, Default, derive_builder::Builder)] | ||
#[non_exhaustive] | ||
pub struct ClientOptionsUpdate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this probably makes decent sense too. I imagine the lang-side API will typically involve the user copying some existing options and changing them, in which case this all works out fine.
|
||
/// Metadata to enable the dynamic configuration of a client. | ||
#[derive(Clone, Debug)] | ||
pub struct DynamicUpdateInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't need to be pub
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They may need to access the changeset to get the current options when there is no mutable ref to the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not accessible as it stands anyway. It's stored as a private field and there's no accessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We need to decide whether to get rid of changeset and update ClientOptions directly (with mutex), or not, and then either add the accessor or make everything private.
/// If the update fails, the client does not rollback to the previous configuration, and future | ||
/// connections will fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I would call this try_update_options
and only swap if the connection succeeds. Also I find leaving the old options to be scary since now things aren't in sync.
It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using the ClientOptions
directly rather than ClientOptionsUpdate
This is what I tried first, but as I explained above, it is all asynchronous, with lazy endpoints, and there is no error callback, so it is difficult to make the update only when it works... If we can live with occasional inconsistencies, we could lock them independently... That would improve on what we have when we have no mutable ref...
|
What was changed
Clients can now dynamically update certain options fields such as the target url, the TLS configuration, the origin, or the keep alive settings, without creating a new client.
Why?
This will allow, e.g., updating a TLS certificate without restarting workers, or losing state associated with the client, such as, eager workflow start tracking, or metrics...
Checklist
[Feature Request] Dynamic client for workers #477
Integration tests and unit tests