AzureKVP Telemetry Library for Hyper-V Integration #133
Conversation
Thanks for all the hard work over such a long time.
I think it might be a good idea to squash the commits into a single one, as it is almost impossible to track the whole history of removing, renaming, and restructuring.
Looks like a good start!
We'll need to do some thinking about how to handle errors inside the on_close span handler: there are a lot of fallible calls, and the interface we need to implement doesn't let us return errors. I decided not to dive into that just yet, since I need to think about it a bit and play with some approaches before suggesting things, and it's lunch time.
…mentation across spans and diagnostic events.
…ne handle_event with on_event, combine handle_span with on_close, adjust the KVP writer logic to rely on a tokio task to send the encoded KVP over a channel, and other minor adjustments in response to PR comments.
- Updated tracing initialization to align with OpenTelemetry API (0.27.0) and SDK (0.26.0) requirements, moving from the older 0.18.0 version.
- Adjusted the tracing configuration in tracing.rs to reflect changes in layer setup and API separation in the new versions, specifically adapting to the SDK and API split in OpenTelemetry.
- Centralized layer creation in tracing.rs to maintain all tracing configuration in a single module, reducing tracing setup in main.rs for cleaner organization.
- Updated Cargo.toml dependencies to reflect new versions and restructured OpenTelemetry and tracing crates.
…ith some small whitespace markdown changes.
libazureinit/src/kvp.rs
Outdated
@@ -0,0 +1,649 @@
// Copyright (c) Microsoft Corporation.
We should move both this file and the tracing.rs file this PR adds to the azure-init package instead of libazureinit for the moment, for two reasons.
First, I'm not sure we want to expose the layer publicly in the libazureinit API. If it's a generally useful tracing layer for KVP, it probably makes sense for it to be its own crate, but I think we should give it a little time to bake before we do something like that.
Secondly, libraries should not be configuring tracing subscribers. Libraries can use tracing, and then it's up to the consuming application to configure how those logs are formatted, filtered, and output.
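To illustrate that split, a minimal sketch (assuming the tracing and tracing-subscriber crates with the env-filter feature; the function names and env var are hypothetical, not the PR's code): the library only emits events, and the binary decides how they are handled.

use tracing_subscriber::{fmt, prelude::*, EnvFilter, Registry};

// In libazureinit: emit events only; no subscriber or layer setup.
pub fn provision_user(name: &str) {
    tracing::info!(user = name, "provisioning user");
}

// In the azure-init binary: the application owns formatting, filtering, and output.
pub fn init_logging() {
    Registry::default()
        .with(fmt::layer().with_writer(std::io::stderr))
        .with(EnvFilter::from_env("AZURE_INIT_LOG"))
        .init();
}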
libazureinit/src/kvp.rs
Outdated
_field: &tracing::field::Field,
value: &dyn std_fmt::Debug,
) {
    write!(self.string, "{:?}", value).unwrap();
Two things here:
Unwrapping
This is an interesting case; according to the documentation, writing to a string is considered infallible.
There's one case where it very much can fail: we need to allocate memory to hold the string, and there's no guarantee there's memory available. However, lots of Rust APIs don't expose this (Vec::with_capacity(), for example, could fail to allocate memory, but it doesn't return a Result).
So, this might be a case where it's okay to panic on an error, but I'd recommend using expect to note why it's okay:
write!(self.string, "{:?}", value).expect("Writing to a string should never fail");
If we're really out of memory, we're likely to be killed by the OOM killer anyway.
Formatting
This function will get called once for each field in an event. For example, say there's an event like:
tracing::info!(request_id="some-uuid-thingy", count=3, "Some message here");
This will be an event with Level=Info and three fields. We'll write the debug representation of each into our string, and it'll end up looking something like Some message heresome-uuid-thingy3, depending on the order the fields are processed. You could use something like the example in the tracing documentation:
write!(self.string, "{}={:?}; ", field.name(), value).expect("Writing to a string should never fail");
which will give you something like message=Some message here; request_id="some-uuid-thingy"; count=3;.
We could also emit something like JSON if we want to make it easy to parse by field on the other side, although that adds overhead and we're space-limited here.
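For context, a visitor along these lines (a sketch; the struct and field names are assumptions, not the PR's exact code) would look roughly like this:

use std::fmt::Write as _;
use tracing::field::{Field, Visit};

// Hypothetical visitor that accumulates an event's fields into one string.
struct StringVisitor {
    string: String,
}

impl Visit for StringVisitor {
    fn record_debug(&mut self, field: &Field, value: &dyn std::fmt::Debug) {
        // Including the field name keeps adjacent fields distinguishable,
        // e.g. message=Some message here; request_id="some-uuid-thingy"; count=3;
        write!(self.string, "{}={:?}; ", field.name(), value)
            .expect("Writing to a string should never fail");
    }
}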
I've updated the code to use expect() with an explanation to handle the unlikely failure case when writing to a string. I've also adjusted the formatting to include field names in the output, so it now looks like name=value, which aligns with the tracing documentation example. Thank you for the great suggestion!
JSON formatting would definitely make parsing easier, but I think it’s best to wait until we see how much space a fully-formed tracer setup takes. It may be worth it, or we may find that the name=value format is the best we can do with the space constraints. I'll make a note of this suggestion to address in the future, but for now, I'd lean towards just implementing the formatting change you recommended.
libazureinit/src/kvp.rs
Outdated
let mut key_buf = vec![0u8; HV_KVP_EXCHANGE_MAX_KEY_SIZE];
key_buf[..key_len].copy_from_slice(&key_bytes[..key_len]);

if value_bytes.len() <= HV_KVP_AZURE_MAX_VALUE_SIZE {
You may find this if/else easier to write with the help of chunks, which will iterate over your slice, yielding chunks of the requested size (or, with this flavor of the method, a smaller final chunk when the slice isn't long enough). That way you don't have to special-case things.
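A rough sketch of that shape (the constant values and the flat key-plus-value record layout here are assumptions for illustration, not the PR's exact encoding):

const HV_KVP_EXCHANGE_MAX_KEY_SIZE: usize = 512; // assumed value
const HV_KVP_AZURE_MAX_VALUE_SIZE: usize = 1024; // assumed value

// Each chunk of the value becomes one fixed-size, zero-padded record; `chunks`
// hands back a short final chunk automatically, so no if/else is needed.
fn encode_kvp_item(key: &str, value: &str) -> Vec<Vec<u8>> {
    let key_bytes = key.as_bytes();
    value
        .as_bytes()
        .chunks(HV_KVP_AZURE_MAX_VALUE_SIZE)
        .map(|chunk| {
            let mut record =
                vec![0u8; HV_KVP_EXCHANGE_MAX_KEY_SIZE + HV_KVP_AZURE_MAX_VALUE_SIZE];
            let key_len = key_bytes.len().min(HV_KVP_EXCHANGE_MAX_KEY_SIZE);
            record[..key_len].copy_from_slice(&key_bytes[..key_len]);
            record[HV_KVP_EXCHANGE_MAX_KEY_SIZE..HV_KVP_EXCHANGE_MAX_KEY_SIZE + chunk.len()]
                .copy_from_slice(chunk);
            record
        })
        .collect()
}

Note that chunks over an empty value yields no chunks at all, so an empty value would need its own record if that case matters.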
…lues with constants in tests
…e-init
- Renamed tracing.rs to logging.rs to avoid conflicts with the external tracing dependency.
- Moved kvp.rs and logging.rs into the main azure-init crate as modules in main.rs.
- Refactored encode_kvp_item to use chunks for cleaner and more efficient logic.
- Integrated handle_kvp_operation into the EmitKVPLayer struct.
- Improved tracing mechanisms and adjusted string formatting for better readability and parsing.
…readability purposes.
I got a little side-tracked and haven't reviewed src/kvp.rs itself just yet, but figured I'd stop here for today.
Looking back over my comments on the logging setup, I realize I was pretty handwavy. The summarized version with clearly stated opinions would be:
- Go with layer-level filtering instead of a global filter
- Go with a (separate) EnvFilter for the KVP layer since runtime configurability is very helpful when it's your only debug option. Default to WARN or above.
- Adjust the stderr filter to default to INFO and apply it to the OTel layer as well.
Sorry I didn't make it all the way through it, I'll finish up tomorrow 😃
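A sketch of what those three bullets could look like in code, assuming tracing-subscriber with the env-filter and registry features; the AZURE_INIT_KVP_LOG variable name and the setup_layers shape are assumptions, not settled API:

use tracing::level_filters::LevelFilter;
use tracing_subscriber::{fmt, prelude::*, EnvFilter, Layer, Registry};

fn setup_layers<K>(kvp_layer: K)
where
    K: Layer<Registry> + Send + Sync + 'static,
{
    // stderr (and, in the real setup, the OTel layer) defaults to INFO,
    // overridable via AZURE_INIT_LOG.
    let stderr_layer = fmt::layer().with_writer(std::io::stderr).with_filter(
        EnvFilter::builder()
            .with_env_var("AZURE_INIT_LOG")
            .with_default_directive(LevelFilter::INFO.into())
            .from_env_lossy(),
    );

    // The KVP layer gets its own filter defaulting to WARN, so it stays
    // runtime-configurable independently of the console output.
    let kvp_layer = kvp_layer.with_filter(
        EnvFilter::builder()
            .with_env_var("AZURE_INIT_KVP_LOG")
            .with_default_directive(LevelFilter::WARN.into())
            .from_env_lossy(),
    );

    Registry::default().with(stderr_layer).with(kvp_layer).init();
}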
src/logging.rs
Outdated
pub fn setup_layers(
    tracer: sdktrace::Tracer,
) -> Result<(), Box<dyn std::error::Error>> {
    let otel_layer = OpenTelemetryLayer::new(tracer);
You might want to consider applying a filter to this layer, or alternatively moving the current filter that's applied to stderr_layer to a global filter like this:
let subscriber = Registry::default()
    .with(stderr_layer)
    .with(otel_layer)
    .with(emit_kvp_layer)
    .with(EnvFilter::from_env("AZURE_INIT_LOG"));
This works because EnvFilter implements both the Filter trait and the Layer trait. It's important to note that a global filter will stop the other layers from even seeing the event, so their filters can only filter among the events allowed through the global filter. This can be good (if you have a lot of events and layers, filtering can become expensive!) but it can also be confusing if you apply a global filter of, say, WARN, and then try to get INFO-level events out of a particular layer with a configurable filter.
Global filters are nice for performance-sensitive applications. I once hit an issue where filtering was consuming something like 10% of the CPU cycles because of a trace-level event inside a loop that executed a lot, and every time a half-dozen layers had to say "oh, I don't care about this event". There are even features in tracing that let you compile out events for particularly performance-sensitive use cases.
Since we don't have a ton of events or layers, we can comfortably afford per-layer filtering, so I lean in that direction (even if the OpenTelemetry layer and the stderr layer use the same Filter), so that we can adjust the KVP filter separately.
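To make the two placements concrete, a sketch (either function would be called once at startup; the env var name follows the example above):

use tracing_subscriber::{fmt, prelude::*, EnvFilter, Registry};

// EnvFilter registered as a global Layer: every other layer only sees events
// that pass AZURE_INIT_LOG.
fn global_filter() {
    Registry::default()
        .with(fmt::layer())
        .with(EnvFilter::from_env("AZURE_INIT_LOG"))
        .init();
}

// EnvFilter attached as a per-layer Filter: it gates only this one layer, and
// other layers still see every event.
fn per_layer_filter() {
    Registry::default()
        .with(fmt::layer().with_filter(EnvFilter::from_env("AZURE_INIT_LOG")))
        .init();
}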
Finished up my review on the KVP bits and I think there are just a handful of little tweaks needed there.
…ing, adjust key_buf logic to avoid indexing, refine tracing and function calls, and streamline start/end time declarations as per PR feedback.
@@ -0,0 +1,69 @@
# Tracing Logic Overview
On reading this, it is not clear how to disable writing to KVP. That is a strong requirement per the Azure Privacy Review: we need to document what is collected and provide the customer with a means to disable it.
This document probably focuses on the technical aspects of tracing/KVP. I would add another section on the main page about telemetry and describe what we're collecting.
See this section
https://learn.microsoft.com/en-us/azure/virtual-machines/linux/using-cloud-init#telemetry
#152
Will address this great PR comment in this issue!
…e padded and not just the key.
I played around with the filters a bit and got pretty close to the set of logs you're marking with target: "kvp" (I think). I would prefer to avoid changing the target if we can help it.
I also merged #151, so CI should be happy once you merge that in (or rebase).
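For what it's worth, EnvFilter directives can already select by target (which defaults to the module path), so something along these lines might cover it without a custom target; the exact directive string is hypothetical:

use tracing_subscriber::EnvFilter;

// Keep WARN and above everywhere, but allow INFO and above from the
// azure_init crate's own modules, with no explicit target: "kvp" needed.
fn example_filter() -> EnvFilter {
    EnvFilter::new("warn,azure_init=info")
}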
Thanks for all your hard work here! ✨
The azurekvp library provides a method for collecting and encoding tracing telemetry, designed specifically for integration with Hyper-V. This library captures span data through OpenTelemetry and Rust-based tracing, formats it into Hyper-V Key-Value Pair (KVP) format, and writes it directly to the Hyper-V KVP storage at /var/lib/hyperv/.kvp_pool_1.
Key changes in this PR:
Fixes #27
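As a companion to the description above, a minimal sketch (not the PR's implementation) of appending one encoded record to the pool file it mentions:

use std::fs::OpenOptions;
use std::io::Write;

// Append a single pre-encoded KVP record to the Hyper-V pool file.
fn append_kvp_record(record: &[u8]) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/var/lib/hyperv/.kvp_pool_1")?;
    file.write_all(record)?;
    file.flush()
}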