
AzureKVP Telemetry Library for Hyper-V Integration #133


Merged: 19 commits merged into main from probertson-kvp-test on Feb 3, 2025

Conversation

@peytonr18 (Contributor):

The azurekvp library provides a method for collecting and encoding tracing telemetry, designed specifically for integration with Hyper-V. This library captures span data through OpenTelemetry and Rust-based tracing, formats it into Hyper-V Key-Value Pair (KVP) format, and writes it directly to the Hyper-V KVP storage at /var/lib/hyperv/.kvp_pool_1.

Key changes in this PR:

  • Telemetry Capture: Integrates with the existing tracing infrastructure to capture and transmit telemetry data in the correct format.
  • KVP Encoding: Encodes span data into the Hyper-V-specific KVP format for storage and retrieval.
  • Unit Tests: Includes unit tests for KVP formatting and verifies the correctness of the encoded telemetry.
  • Documentation: Updates the README to include details on the new azurekvp library and the updated tracing process.

Fixes #27
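
For context, a minimal sketch of the fixed-size record layout the KVP pool file uses (the constant values follow the kernel's hv_kvp exchange format and are assumptions for illustration, not code from this PR): each record is a zero-padded key buffer followed by a zero-padded value buffer, so records can be read back at fixed offsets.

    // Sketch of a fixed-size Hyper-V KVP record (illustrative; constants
    // assume the kernel's hv_kvp exchange format).
    const HV_KVP_EXCHANGE_MAX_KEY_SIZE: usize = 512;
    const HV_KVP_EXCHANGE_MAX_VALUE_SIZE: usize = 2048;

    /// Pack a key/value pair into one record: both parts are copied into
    /// zero-filled buffers of fixed size, then concatenated.
    fn encode_kvp_record(key: &str, value: &str) -> Vec<u8> {
        let mut record =
            vec![0u8; HV_KVP_EXCHANGE_MAX_KEY_SIZE + HV_KVP_EXCHANGE_MAX_VALUE_SIZE];
        let key_len = key.len().min(HV_KVP_EXCHANGE_MAX_KEY_SIZE);
        let value_len = value.len().min(HV_KVP_EXCHANGE_MAX_VALUE_SIZE);
        record[..key_len].copy_from_slice(&key.as_bytes()[..key_len]);
        record[HV_KVP_EXCHANGE_MAX_KEY_SIZE..HV_KVP_EXCHANGE_MAX_KEY_SIZE + value_len]
            .copy_from_slice(&value.as_bytes()[..value_len]);
        record
    }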

@dongsupark (Collaborator) left a comment:

Thanks for your hard work over such a long time.

I think it might be a good idea to squash commits into a single one, as it is almost impossible to track the whole history of removing, renaming, or restructuring.

@dongsupark added the feature (New feature or request) label on Sep 23, 2024
@jeremycline (Member) left a comment:

Looks like a good start!

We'll need to do some thinking about how to handle errors inside the on_close span handler: there are a lot of fallible calls, and the interface we need to implement doesn't let us return errors. I decided not to dive into that just yet since I need to think about it a bit and play with some approaches before suggesting things, and it's lunch time.
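
For a concrete picture of that constraint, here's a minimal sketch (the struct and the fallible helper are hypothetical stand-ins, not the approach this PR settled on): the Layer trait's span hooks return (), so errors have to be handled in place.

    use tracing::{span::Id, Subscriber};
    use tracing_subscriber::layer::{Context, Layer};

    struct EmitKVPLayer; // stand-in for the PR's layer

    impl EmitKVPLayer {
        // Hypothetical fallible helper doing the actual KVP write.
        fn write_span_to_kvp(&self, _id: &Id) -> std::io::Result<()> {
            Ok(())
        }
    }

    impl<S: Subscriber> Layer<S> for EmitKVPLayer {
        // on_close returns (), so an I/O error can only be handled
        // locally, e.g. logged and dropped.
        fn on_close(&self, id: Id, _ctx: Context<'_, S>) {
            if let Err(e) = self.write_span_to_kvp(&id) {
                eprintln!("failed to emit KVP telemetry: {e}");
            }
        }
    }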

…mentation across spans and diagnostic events.
@peytonr18 force-pushed the probertson-kvp-test branch from c6dc74c to 2b1ded8 on October 8, 2024 at 20:42
…ne handle_event with on_event, combine handle_span with on_close, adjust kvp writer logic to rely on a tokio task to send the encoded KVP over a channel, and other minor adjustments in reference to PR comments.
@peytonr18 force-pushed the probertson-kvp-test branch 2 times, most recently from 5824809 to 42ca26a on November 12, 2024 at 23:08
- Updated tracing initialization to align with OpenTelemetry API (0.27.0) and SDK (0.26.0) requirements, moving from the older 0.18.0 version.
- Adjusted the tracing configuration in tracing.rs to reflect changes in layer setup and API separation in the new versions, specifically adapting to the SDK and API split in OpenTelemetry.
- Centralized layer creation in tracing.rs to maintain all tracing configuration in a single module, reducing tracing setup in main.rs for cleaner organization.
- Updated Cargo.toml dependencies to reflect new versions and restructured OpenTelemetry and tracing crates.
@@ -0,0 +1,649 @@
// Copyright (c) Microsoft Corporation.
A reviewer (Member) left a comment:

We should move both this file and the tracing.rs file this PR adds to the azure-init package instead of libazureinit for the moment, for two reasons.

First, I'm not sure we want to expose the layer publicly in the libazureinit API. If it's a generally useful tracing layer for KVP it probably makes sense to be its own crate, but I think we should give it a little bit of time to bake before we do something like that.

Secondly, libraries should not be configuring tracing subscribers. Libraries can use tracing and then it's up to the consuming application how to configure the way those logs are formatted/filtered/output.

_field: &tracing::field::Field,
value: &dyn std_fmt::Debug,
) {
write!(self.string, "{:?}", value).unwrap();
A reviewer (Member) left a comment:

Two things here:

Unwrapping

This is an interesting case; according to the documentation, writing to a string is considered infallible.

There's one case where it very much can fail: we need to allocate memory to hold the string and there's no guarantee there's memory available. However, lots of Rust APIs don't expose this (Vec::with_capacity(), for example, could fail to allocate memory, but it doesn't return a Result).

So, this might be a case where it's okay to panic on an error, but I'd recommend using expect to note why it's okay:

write!(self.string, "{:?}", value).expect("Writing to a string should never fail");

If we're really out of memory, we're likely to be killed by the OOM killer anyway.

Formatting

This function will get called repeatedly for each field in an event. For example, say there's an event like:

tracing::info!(request_id="some-uuid-thingy", count=3, "Some message here");

This will be an event with Level=Info and three fields. We'll write the debug representation of each into our string and it'll end up looking something like Some message heresome-uuid-thingy3 depending on the order fields are processed. You could use something like the example in the tracing documentation:

write!(self.string, "{}={:?}; ", field.name(), value).expect("Writing to a string should never fail");

which will give you something like message=Some message here; request_id="some-uuid-thingy"; count=3;.

We could also emit something like JSON if we want to make it easy to parse by field on the other side, although that adds overhead and we're space-limited here.
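
Putting that together, a visitor along these lines (a sketch assuming a visitor struct holding a String field named string, as in the code above) produces the name=value output:

    use std::fmt::Write;

    #[derive(Default)]
    struct StringVisitor {
        string: String, // accumulates "name=value; " pairs for one event
    }

    impl tracing::field::Visit for StringVisitor {
        fn record_debug(
            &mut self,
            field: &tracing::field::Field,
            value: &dyn std::fmt::Debug,
        ) {
            // Called once per field; including the field name keeps the
            // fields from running together in the output.
            write!(self.string, "{}={:?}; ", field.name(), value)
                .expect("Writing to a string should never fail");
        }
    }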

@peytonr18 (Contributor, Author) replied on Dec 6, 2024:

I've updated the code to use expect() with an explanation to handle the unlikely failure case when writing to a string. I've also adjusted the formatting to include field names in the output, so it now looks like name=value, which aligns with the tracing documentation example. Thank you for the great suggestion!

JSON formatting would definitely make parsing easier, but I think it’s best to wait until we see how much space a fully-formed tracer setup takes. It may be worth it, or we may find that the name=value format is the best we can do with the space constraints. I'll make a note of this suggestion to address in the future, but for now, I'd lean towards just implementing the formatting change you recommended.

let mut key_buf = vec![0u8; HV_KVP_EXCHANGE_MAX_KEY_SIZE];
key_buf[..key_len].copy_from_slice(&key_bytes[..key_len]);

if value_bytes.len() <= HV_KVP_AZURE_MAX_VALUE_SIZE {
A reviewer (Member) left a comment:

You may find this if/else easier to write with the help of chunks which will iterate over your slice, yielding chunks of the requested size (or, in the case of this flavor of the method, smaller when the slice isn't long enough). That way you don't have to special-case things.
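
A sketch of that suggestion (emit_record and the key scheme are hypothetical; the constant is the one already used above):

    // chunks() yields full-size chunks plus one shorter final chunk, so
    // the short-value case and the multi-record split need no if/else.
    for (i, chunk) in value_bytes.chunks(HV_KVP_AZURE_MAX_VALUE_SIZE).enumerate() {
        let mut value_buf = vec![0u8; HV_KVP_AZURE_MAX_VALUE_SIZE];
        value_buf[..chunk.len()].copy_from_slice(chunk);
        emit_record(i, &key_buf, &value_buf); // hypothetical helper
    }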

…e-init

- Renamed tracing.rs to logging.rs to avoid conflicts with external tracing dependency.
- Moved kvp.rs and logging.rs into the main azure-init crate as modules in main.rs.
- Refactored encode_kvp_item to use chunks for cleaner, more efficient logic.
- Integrated handle_kvp_operation into the EmitKVPLayer struct.
- Improved tracing mechanisms and adjusted string formatting for better readability and parsing.
@peytonr18 force-pushed the probertson-kvp-test branch 4 times, most recently from dd6540e to 1155e6d on December 18, 2024 at 05:15
@jeremycline (Member) left a comment:

I got a little side-tracked and haven't reviewed src/kvp.rs itself just yet, but figured I'd stop here for today.

Looking back over my comments on the logging setup I realize I was pretty handwavy. The summarized version with clearly stated opinions would be:

  • Go with layer-level filtering instead of a global filter
  • Go with a (separate) EnvFilter for the KVP layer since runtime configurability is very helpful when it's your only debug option. Default to WARN or above.
  • Adjust the stderr filter to default to INFO and apply it to the OTel layer as well.

Sorry I didn't make it all the way through it, I'll finish up tomorrow 😃

src/logging.rs (outdated):
pub fn setup_layers(
tracer: sdktrace::Tracer,
) -> Result<(), Box<dyn std::error::Error>> {
let otel_layer = OpenTelemetryLayer::new(tracer);
A reviewer (Member) left a comment:

You might want to consider applying a filter to this layer, or alternatively moving the current filter that's applied to stderr_layer to a global filter like this:

    let subscriber = Registry::default()
        .with(stderr_layer)
        .with(otel_layer)
        .with(emit_kvp_layer)
        .with(EnvFilter::from_env("AZURE_INIT_LOG"))

This works because EnvFilter implements both the Filter trait and the Layer trait. It's important to note that a global filter will stop the other layers from even seeing the event, so their filters can only filter from the events allowed through the global filter. This can be good (if you have a lot of events and layers, filtering can become expensive!) but it can also be confusing if you apply a global filter of, say, WARN, and then try to get INFO-level events out of a particular layer with a configurable filter.

Global filters are nice for performance-sensitive applications. I once hit an issue where filtering was consuming something like 10% of the CPU cycles because of a trace-level event inside a loop that executed a lot and every time a half-dozen layers had to say "oh I don't care about this event". There are even features on tracing to allow you to compile out events for particularly performance-sensitive use-cases.

Since we don't have a ton of events or layers we can comfortably afford to have per-layer filtering, so I lean that direction (even if the OpenTelemetry layer and the stderr layer use the same Filter) so we can adjust the KVP filter separately.
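
A sketch of that per-layer setup (the env-var name and the layer variables are assumptions): each layer carries its own filter via Layer::with_filter, so the KVP filter can be tuned at runtime independently of the others.

    use tracing_subscriber::{layer::SubscriberExt, EnvFilter, Layer, Registry};

    // stderr and OTel default to INFO; the KVP layer gets its own
    // runtime-configurable filter defaulting to WARN.
    // "AZURE_INIT_KVP_LOG" is a hypothetical variable name.
    let kvp_filter = EnvFilter::builder()
        .with_env_var("AZURE_INIT_KVP_LOG")
        .with_default_directive(tracing::level_filters::LevelFilter::WARN.into())
        .from_env_lossy();

    let subscriber = Registry::default()
        .with(stderr_layer.with_filter(EnvFilter::new("info")))
        .with(otel_layer.with_filter(EnvFilter::new("info")))
        .with(emit_kvp_layer.with_filter(kvp_filter));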

@jeremycline (Member) left a comment:

Finished up my review on the KVP bits and I think there are just a handful of little tweaks needed there.

@peytonr18 force-pushed the probertson-kvp-test branch 2 times, most recently from 1d4fd87 to 60264a3 on December 26, 2024 at 21:11
…ing, adjust key_buf logic to avoid indexing, refine tracing and function calls, and streamline start/end time declarations as per PR feedback.
@@ -0,0 +1,69 @@
# Tracing Logic Overview
A reviewer (Contributor) left a comment:

Reading this doc, it's not clear how to disable writing to KVP. That is a strong requirement per Azure Privacy Review: we need to document what is collected and provide the customer with a means to disable it.

This document focuses on the technical aspects of tracing/KVP. I would add another section on the main page about telemetry describing what we're collecting; see this section:
https://learn.microsoft.com/en-us/azure/virtual-machines/linux/using-cloud-init#telemetry

@peytonr18 (Contributor, Author) replied:

Will address this great PR comment in issue #152!

@jeremycline (Member) left a comment:

I played around with the filters a bit and got pretty close to the set of logs you're marking with target: "kvp" (I think). I would prefer to avoid changing the target if we can help it.

I also merged #151 so the CI should be happy once you merge that in (or rebase)

@jeremycline (Member) left a comment:

Thanks for all your hard work here! ✨

@peytonr18 merged commit 795ba29 into main on Feb 3, 2025
5 checks passed
@peytonr18 deleted the probertson-kvp-test branch on February 3, 2025 at 18:29
Labels: feature (New feature or request)
Successfully merging this pull request may close these issues: Report KVP telemetry to assist with troubleshooting
5 participants