
AzureKVP Telemetry Library for Hyper-V Integration #133


Merged: 19 commits merged into main from probertson-kvp-test on Feb 3, 2025

Conversation

@peytonr18 (Contributor):

The azurekvp library provides a method for collecting and encoding tracing telemetry, designed specifically for integration with Hyper-V. This library captures span data through OpenTelemetry and Rust-based tracing, formats it into Hyper-V Key-Value Pair (KVP) format, and writes it directly to the Hyper-V KVP storage at /var/lib/hyperv/.kvp_pool_1.

Key changes in this PR:

  • Telemetry Capture: Integrates with the existing tracing infrastructure to capture and transmit telemetry data in the correct format.
  • KVP Encoding: Encodes span data into the Hyper-V-specific KVP format for storage and retrieval.
  • Unit Tests: Includes unit tests for KVP formatting and verifies the correctness of the encoded telemetry.
  • Documentation: Updates the README to include details on the new azurekvp library and the updated tracing process.

Fixes #27
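
For context, a minimal sketch of the fixed-size record layout the KVP pool file uses (the constant values follow the kernel's hv_kvp exchange format and are assumptions for illustration, not code from this PR): each record is a zero-padded key buffer followed by a zero-padded value buffer, so records can be read back at fixed offsets.

    // Sketch of a fixed-size Hyper-V KVP record (illustrative; constants
    // assume the kernel's hv_kvp exchange format).
    const HV_KVP_EXCHANGE_MAX_KEY_SIZE: usize = 512;
    const HV_KVP_EXCHANGE_MAX_VALUE_SIZE: usize = 2048;

    /// Pack a key/value pair into one record: both parts are copied into
    /// zero-filled buffers of fixed size, then concatenated.
    fn encode_kvp_record(key: &str, value: &str) -> Vec<u8> {
        let mut record =
            vec![0u8; HV_KVP_EXCHANGE_MAX_KEY_SIZE + HV_KVP_EXCHANGE_MAX_VALUE_SIZE];
        let key_len = key.len().min(HV_KVP_EXCHANGE_MAX_KEY_SIZE);
        let value_len = value.len().min(HV_KVP_EXCHANGE_MAX_VALUE_SIZE);
        record[..key_len].copy_from_slice(&key.as_bytes()[..key_len]);
        record[HV_KVP_EXCHANGE_MAX_KEY_SIZE..HV_KVP_EXCHANGE_MAX_KEY_SIZE + value_len]
            .copy_from_slice(&value.as_bytes()[..value_len]);
        record
    }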

@dongsupark (Collaborator) left a comment:

Thanks for your hard work over such a long time.

I think it might be a good idea to squash commits into a single one, as it is almost impossible to track the whole history of removing, renaming, or restructuring.

@dongsupark added the feature (New feature or request) label on Sep 23, 2024
@jeremycline (Member) left a comment:

Looks like a good start!

We'll need to do some thinking about how to handle errors inside the on_close span handler: there are a lot of fallible calls, and the interface we need to implement doesn't let us return errors. I decided not to dive into that just yet since I need to think about it a bit and play with some approaches before suggesting things, and it's lunch time.
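
For a concrete picture of that constraint, here's a minimal sketch (the struct and the fallible helper are hypothetical stand-ins, not the approach this PR settled on): the Layer trait's span hooks return (), so errors have to be handled in place.

    use tracing::{span::Id, Subscriber};
    use tracing_subscriber::layer::{Context, Layer};

    struct EmitKVPLayer; // stand-in for the PR's layer

    impl EmitKVPLayer {
        // Hypothetical fallible helper doing the actual KVP write.
        fn write_span_to_kvp(&self, _id: &Id) -> std::io::Result<()> {
            Ok(())
        }
    }

    impl<S: Subscriber> Layer<S> for EmitKVPLayer {
        // on_close returns (), so an I/O error can only be handled
        // locally, e.g. logged and dropped.
        fn on_close(&self, id: Id, _ctx: Context<'_, S>) {
            if let Err(e) = self.write_span_to_kvp(&id) {
                eprintln!("failed to emit KVP telemetry: {e}");
            }
        }
    }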

…mentation across spans and diagnostic events.
@peytonr18 force-pushed the probertson-kvp-test branch from c6dc74c to 2b1ded8 on October 8, 2024 at 20:42
…ne handle_event with on_event, combine handle_span with on_close, adjust kvp writer logic to rely on a tokio task to send the encoded KVP over a channel, and other minor adjustments in reference to PR comments.
@peytonr18 force-pushed the probertson-kvp-test branch 2 times, most recently from 5824809 to 42ca26a on November 12, 2024 at 23:08
- Updated tracing initialization to align with OpenTelemetry API (0.27.0) and SDK (0.26.0) requirements, moving from the older 0.18.0 version.
- Adjusted the tracing configuration in tracing.rs to reflect changes in layer setup and API separation in the new versions, specifically adapting to the SDK and API split in OpenTelemetry.
- Centralized layer creation in tracing.rs to maintain all tracing configuration in a single module, reducing tracing setup in main.rs for cleaner organization.
- Updated Cargo.toml dependencies to reflect new versions and restructured OpenTelemetry and tracing crates.
@@ -0,0 +1,649 @@
// Copyright (c) Microsoft Corporation.
A reviewer (Member) left a comment:

We should move both this file and the tracing.rs file this PR adds to the azure-init package instead of libazureinit for the moment, for two reasons.

First, I'm not sure we want to expose the layer publicly in the libazureinit API. If it's a generally useful tracing layer for KVP it probably makes sense to be its own crate, but I think we should give it a little bit of time to bake before we do something like that.

Secondly, libraries should not be configuring tracing subscribers. Libraries can use tracing and then it's up to the consuming application how to configure the way those logs are formatted/filtered/output.

_field: &tracing::field::Field,
value: &dyn std_fmt::Debug,
) {
write!(self.string, "{:?}", value).unwrap();
A reviewer (Member) left a comment:

Two things here:

Unwrapping

This is an interesting case; according to the documentation, writing to a string is considered infallible.

There's one case where it very much can fail: we need to allocate memory to hold the string and there's no guarantee there's memory available. However, lots of Rust APIs don't expose this (Vec::with_capacity(), for example, could fail to allocate memory, but it doesn't return a Result).

So, this might be a case where it's okay to panic on an error, but I'd recommend using expect to note why it's okay:

write!(self.string, "{:?}", value).expect("Writing to a string should never fail");

If we're really out of memory, we're likely to be killed by the OOM killer anyway.

Formatting

This function will get called repeatedly for each field in an event. For example, say there's an event like:

tracing::info!(request_id="some-uuid-thingy", count=3, "Some message here");

This will be an event with Level=Info and three fields. We'll write the debug representation of each into our string and it'll end up looking something like Some message heresome-uuid-thingy3 depending on the order fields are processed. You could use something like the example in the tracing documentation:

write!(self.string, "{}={:?}; ", field.name(), value).expect("Writing to a string should never fail");

which will give you something like message=Some message here; request_id="some-uuid-thingy"; count=3;.

We could also emit something like JSON if we want to make it easy to parse by field on the other side, although that adds overhead and we're space-limited here.
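
Putting that together, a visitor along these lines (a sketch assuming a visitor struct holding a String field named string, as in the code above) produces the name=value output:

    use std::fmt::Write;

    #[derive(Default)]
    struct StringVisitor {
        string: String, // accumulates "name=value; " pairs for one event
    }

    impl tracing::field::Visit for StringVisitor {
        fn record_debug(
            &mut self,
            field: &tracing::field::Field,
            value: &dyn std::fmt::Debug,
        ) {
            // Called once per field; including the field name keeps the
            // fields from running together in the output.
            write!(self.string, "{}={:?}; ", field.name(), value)
                .expect("Writing to a string should never fail");
        }
    }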

@peytonr18 (Contributor, Author) replied on Dec 6, 2024:

I've updated the code to use expect() with an explanation to handle the unlikely failure case when writing to a string. I've also adjusted the formatting to include field names in the output, so it now looks like name=value, which aligns with the tracing documentation example. Thank you for the great suggestion!

JSON formatting would definitely make parsing easier, but I think it’s best to wait until we see how much space a fully-formed tracer setup takes. It may be worth it, or we may find that the name=value format is the best we can do with the space constraints. I'll make a note of this suggestion to address in the future, but for now, I'd lean towards just implementing the formatting change you recommended.

let mut key_buf = vec![0u8; HV_KVP_EXCHANGE_MAX_KEY_SIZE];
key_buf[..key_len].copy_from_slice(&key_bytes[..key_len]);

if value_bytes.len() <= HV_KVP_AZURE_MAX_VALUE_SIZE {
A reviewer (Member) left a comment:

You may find this if/else easier to write with the help of chunks which will iterate over your slice, yielding chunks of the requested size (or, in the case of this flavor of the method, smaller when the slice isn't long enough). That way you don't have to special-case things.
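
A sketch of that suggestion (emit_record and the key scheme are hypothetical; the constant is the one already used above):

    // chunks() yields full-size chunks plus one shorter final chunk, so
    // the short-value case and the multi-record split need no if/else.
    for (i, chunk) in value_bytes.chunks(HV_KVP_AZURE_MAX_VALUE_SIZE).enumerate() {
        let mut value_buf = vec![0u8; HV_KVP_AZURE_MAX_VALUE_SIZE];
        value_buf[..chunk.len()].copy_from_slice(chunk);
        emit_record(i, &key_buf, &value_buf); // hypothetical helper
    }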

…e-init

- Renamed tracing.rs to logging.rs to avoid conflicts with external tracing dependency.
- Moved kvp.rs and logging.rs into the main azure-init crate as modules in main.rs.
- Refactored encode_kvp_item to use chunks for cleaner, more efficient logic.
- Integrated handle_kvp_operation into the EmitKVPLayer struct.
- Improved tracing mechanisms and adjusted string formatting for better readability and parsing.
@peytonr18 force-pushed the probertson-kvp-test branch 4 times, most recently from dd6540e to 1155e6d on December 18, 2024 at 05:15
@jeremycline (Member) left a comment:

I got a little side-tracked and haven't reviewed src/kvp.rs itself just yet, but figured I'd stop here for today.

Looking back over my comments on the logging setup I realize I was pretty handwavy. The summarized version with clearly stated opinions would be:

  • Go with layer-level filtering instead of a global filter
  • Go with a (separate) EnvFilter for the KVP layer since runtime configurability is very helpful when it's your only debug option. Default to WARN or above.
  • Adjust the stderr filter to default to INFO and apply it to the OTel layer as well.

Sorry I didn't make it all the way through it, I'll finish up tomorrow 😃

src/logging.rs (outdated):
pub fn setup_layers(
tracer: sdktrace::Tracer,
) -> Result<(), Box<dyn std::error::Error>> {
let otel_layer = OpenTelemetryLayer::new(tracer);
A reviewer (Member) left a comment:

You might want to consider applying a filter to this layer, or alternatively moving the current filter that's applied to stderr_layer to a global filter like this:

    let subscriber = Registry::default()
        .with(stderr_layer)
        .with(otel_layer)
        .with(emit_kvp_layer)
        .with(EnvFilter::from_env("AZURE_INIT_LOG"))

This works because EnvFilter implements both the Filter trait and the Layer trait. It's important to note that a global filter will stop the other layers from even seeing the event, so their filters can only filter from the events allowed through the global filter. This can be good (if you have a lot of events and layers, filtering can become expensive!) but it can also be confusing if you apply a global filter of, say, WARN, and then try to get INFO-level events out of a particular layer with a configurable filter.

Global filters are nice for performance-sensitive applications. I once hit an issue where filtering was consuming something like 10% of the CPU cycles because of a trace-level event inside a loop that executed a lot and every time a half-dozen layers had to say "oh I don't care about this event". There are even features on tracing to allow you to compile out events for particularly performance-sensitive use-cases.

Since we don't have a ton of events or layers we can comfortably afford to have per-layer filtering, so I lean that direction (even if the OpenTelemetry layer and the stderr layer use the same Filter) so we can adjust the KVP filter separately.
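
A sketch of that per-layer setup (the env-var name and the layer variables are assumptions): each layer carries its own filter via Layer::with_filter, so the KVP filter can be tuned at runtime independently of the others.

    use tracing_subscriber::{layer::SubscriberExt, EnvFilter, Layer, Registry};

    // stderr and OTel default to INFO; the KVP layer gets its own
    // runtime-configurable filter defaulting to WARN.
    // "AZURE_INIT_KVP_LOG" is a hypothetical variable name.
    let kvp_filter = EnvFilter::builder()
        .with_env_var("AZURE_INIT_KVP_LOG")
        .with_default_directive(tracing::level_filters::LevelFilter::WARN.into())
        .from_env_lossy();

    let subscriber = Registry::default()
        .with(stderr_layer.with_filter(EnvFilter::new("info")))
        .with(otel_layer.with_filter(EnvFilter::new("info")))
        .with(emit_kvp_layer.with_filter(kvp_filter));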

@jeremycline (Member) left a comment:

Finished up my review on the KVP bits and I think there are just a handful of little tweaks needed there.

@peytonr18 force-pushed the probertson-kvp-test branch 2 times, most recently from 1d4fd87 to 60264a3 on December 26, 2024 at 21:11
…ing, adjust key_buf logic to avoid indexing, refine tracing and function calls, and streamline start/end time declarations as per PR feedback.
@@ -0,0 +1,69 @@
# Tracing Logic Overview
A reviewer (Contributor) left a comment:

Reading this doc, it's not clear how to disable writing to KVP. That is a strong requirement per Azure Privacy Review: we need to document what is collected and provide the customer with a means to disable it.

This document focuses on the technical aspects of tracing/KVP. I would add another section on the main page about telemetry describing what we're collecting; see this section:
https://learn.microsoft.com/en-us/azure/virtual-machines/linux/using-cloud-init#telemetry

@peytonr18 (Contributor, Author) replied:

Will address this great PR comment in issue #152!

@jeremycline (Member) left a comment:

I played around with the filters a bit and got pretty close to the set of logs you're marking with target: "kvp" (I think). I would prefer to avoid changing the target if we can help it.

I also merged #151 so the CI should be happy once you merge that in (or rebase)

@jeremycline (Member) left a comment:

Thanks for all your hard work here! ✨

@peytonr18 merged commit 795ba29 into main on Feb 3, 2025
5 checks passed
@peytonr18 deleted the probertson-kvp-test branch on February 3, 2025 at 18:29
Labels: feature (New feature or request)
Successfully merging this pull request may close these issues: Report KVP telemetry to assist with troubleshooting
5 participants