Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marking of 'suppress_redundant' and 'heartbeat_interval' for deprecation #114

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

earies
Copy link

@earies earies commented Oct 22, 2021

As has been discussed prior OOB, I would like to initiate this PR to kick off
RFC regarding deprecation and possible future removal of the
suppress_redundant and heartbeat_interval fields in favor of alternate
options to various problem statements of determining liveliness.

NOTE: This PR is incomplete within this repository as well as the potential
removal from the openconfig-telemetry YANG schema that can be determined after
consensus for deprecation/removal.

Related problem statements:

  • To determine liveliness of a long running RPC (especially important on low
    volume subscriptions)
  • To satisfy middleboxes that hold state to not expire sessions (NAT, Stateful
    FW, etc..)
  • To request a data "refresh" (There are already alternate solutions here)
  • Feel free to add more

Should the instruction of a 'keepalive' exist at the RPC/message layer?

If so, then does it make sense to duplicate work in the RPC design everytime
these characteristics are shared?

There are various other services related to OpenConfig/P4 that share similar
characteristics but do not implement RPC level instructions for keepalives

Is it best served by the protocol stacks - gRPC/HTTP2 infra (keepalive/PING)?

If so, are all cases and language bindings accounted for where this would
suffice?

@hellt
Copy link
Contributor

hellt commented Nov 14, 2023

Hi @dplore @robshakir and others,

I wanted to revive this topic with a focus on the heartbeat_interval. It seems some operators are picking up on this spec capability with the goal of making sure ON_CHANGE subscriptions work nicely with TSDBs that are known to require a steady flow of metrics (otherwise the metrics go stale and flushed).

@earies I wonder when you refer to

To request a data "refresh" (There are already alternate solutions here)

what solutions you had in mind?

@dplore
Copy link
Member

dplore commented Nov 15, 2023

I think Ebben is suggesting a mechanism such as gRPC keepalive would be more efficient than using heartbeat_interval to maintain liveness of a gRPC session. I would tend to agree that a gRPC keepalive seems like an efficient solution to the liveness issue and has precedent across many other protocols with long lived sessions that may be have long silent periods.

As far as deprecating existing gNMI specification, we should understand is what dependencies or use cases exist regarding suppress_redundant and heartbeat_interval. It would be good to hear from the community if these are commonly implemented and in use features.

Ebben, regarding a "data refresh" use case (presumably one use case for heartbest_interval), you mention alternatives exist. Did you have one in mind?

@earies
Copy link
Author

earies commented Jun 15, 2024

Apologies I must have missed the activity on this PR but want to loop back on the topic. While the PR contents are somewhat stale, it's still probably the best spot to continue discussion for now...

Regarding liveliness - yes I was referring to gRPC keepalives for that problem statement (essentially at a lower layer in the stack)

Regarding heartbeat_interval alternatives for data refresh, the options we normally see in use are

  1. A separate subscription to the same subtrees as the initial ON_CHANGE but SAMPLE w/ ONCE mode triggered at some interval by a client
  2. Same as above but SAMPLE w/ STREAM + interval (mimicking a side-channel heartbeat interval)

Are both sufficient? Possibly - would like to hear opinions as it's obvious this brings an additional set of subscriptions into the mix...

Otherwise we keep heartbeat_interval around and the implementation can internally treat it like a combination of the above (internal sample/stream on an existing ON_CHANGE subscription)

I'll leave suppress_redundant out of the conversation for now as I feel that one needs additional discussion

@hellt
Copy link
Contributor

hellt commented Jun 15, 2024

I think there are some big operators that have plans to use hearbeat_interval as they see a separate subscription with SAMPLE/ONCE too cumbersome to maintain.

@earies
Copy link
Author

earies commented Jun 15, 2024

I think there are some big operators that have plans to use heartbeat_interval as they see a separate subscription with SAMPLE/ONCE too cumbersome to maintain.

Well I'd still currently put this in the bucket of "not well defined" even if the desire and perceived use-case is there the more I think about this.

An ON_CHANGE subscription states "updated values SHOULD only be transmitted when their value changes". The "SHOULD" terminology here opens this up for something like the interleaved behavior of heartbeat_interval

However let's consider the timestamp in the notification and per the current definition of sending updates:
https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-specification.md#352-sending-telemetry-updates

As a client, there are 2 choices to marshal a timestamp to the ultimate TSDB record. Time of reception or the timestamp embedded in the payload w/ the latter generally being the recommended truthful of the two (the true source). In normal ON_CHANGE scenarios this will indicate the timestamp closest to the source of the generation but now what does this mean for interleaved operations like refreshing a dataset?

  1. If the heartbeat_interval just sends the original timestamp (now in the past), the data refresh issue may very well not be solved. It really depends on the attributes of "expiry". A record overwrite to an entry shows some activity but the contents are identical to the previous which may be at some point in history that falls out of bounds
  2. If the heartbeat_interval sends a timestamp of the time of generation of the records even if they have not changed then it can be perceived there is a change (the client just received a bunch of ON_CHANGE notifications out of nowhere at the requested interval). It would then be very important that a dumb client that has no previous knowledge of the prior value NOT treat this as a "change" but rather the signal of a changed value is on the TSDB record downstream. This record is now being updated with new timestamps to not expire records and to trace back when the ultimate change event occurred means walking back in history.

There is also the consideration of race conditions from the implementation should actual change events occur during the same window in which the heartbeat_interval is walking and publishing notifications and the proper ordering/timestamping involved. A separate SAMPLE subscription should emulate no. 2 above today but I think all of this needs to be hashed out before any plans to "use" in operations (at least as of today).

@robshakir
Copy link
Contributor

I don't have a strong thought on whether this should be removed, other than if there are folks that are using it -- let's please make sure that the specification is robust to the deployed cases.

  1. If the heartbeat_interval just sends the original timestamp (now in the past), the data refresh issue may very well not be solved. It really depends on the attributes of "expiry". A record overwrite to an entry shows some activity but the contents are identical to the previous which may be at some point in history that falls out of bounds

In the case that someone designs a caching collector that avoids memory exhaustion by timing out records over age X, then I think there are cases where heartbeat_interval does solve this problem. It's not really an overwrite, since the client/cache can store both the expiry time of the value received (local to the cache), and the timestamp recorded by the target. In such cases, reception time indicates whether the information is stale -- allowing mark-and-sweep removal of data. In the scenario that one doesn't do this, then alternative designs are required to deal with targets that might create many new cache entries via different paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants