
wip - mana: save/restore for keepalive support #1033

Draft: justus-camp-microsoft wants to merge 29 commits into main

Conversation

justus-camp-microsoft (Contributor) commented on Mar 14, 2025

Opening this early to get ahead of feedback while I work on save/restore for mana queues. I've merged #945 into this branch in the interest of writing a unit test, but it hasn't actually been implemented yet.

Some open questions:

  • How can I write a unit test for this scenario? I've attempted to use the changes in #945 ("user_driver: moving emulated.rs file to its own crate and updating EmulatedDevice to accept a generic allocator") to write a test, but I've been unable to get it to work. The vmm_test I've added gets past test_eq, so I'm relatively confident the GDMA driver is being restored at least somewhat properly.
  • The enablement code just follows nvme_keepalive's lead - is there any interest in changing how enablement works?
  • Should we add error paths that fall back to completely re-initializing the device when restore fails? (See the sketch after this list.)
  • There's a bunch of info! logging I was using for testing - I'll remove it, or move it to debug level if there's interest in keeping it around.
  • The leak flag is currently set to true in the private pool to get past the allocations that leak because the mana queues aren't restored yet - I'll turn it back to false as soon as restoration is working properly for those as well.
  • Should the Debug derive be removed from the save/restore state?
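On the fallback question, one possible shape is below. This is a minimal sketch only: it is generic over the device and saved-state types rather than tied to the real driver code in this PR, and the helper name is hypothetical.

```rust
use std::future::Future;

/// Hypothetical helper: try the keepalive restore path first, and fall back to
/// a full cold initialization if restore fails for any reason.
async fn restore_or_reinit<T, S, RFut, IFut>(
    saved: Option<S>,
    restore: impl FnOnce(S) -> RFut,
    init: impl FnOnce() -> IFut,
) -> anyhow::Result<T>
where
    RFut: Future<Output = anyhow::Result<T>>,
    IFut: Future<Output = anyhow::Result<T>>,
{
    if let Some(state) = saved {
        match restore(state).await {
            Ok(device) => return Ok(device),
            Err(err) => {
                // Restore failed; log it and fall through to a full
                // re-initialization so the device is still usable.
                tracing::warn!(error = %err, "restore failed, re-initializing device");
            }
        }
    }
    init().await
}
```

The caller would pass the existing restore and init paths as closures; whether silently falling back is acceptable (versus surfacing the restore error) is part of the open question above.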

Is there interest in staging these changes as a series of smaller reviews, rather than merging one large change?

    dma_mode: GuestDmaMode,
    dma_client: Arc<dyn DmaClient>,
) -> anyhow::Result<(
    Self,
    Vec<HclNetworkVFManagerEndpointInfo>,
    RuntimeSavedState,
)> {
    tracing::info!("creating mana device. mana_state: {:?}", mana_state);

    let mana_state = mana_state.as_ref().map(|mana_state| mana_state[0].clone());
justus-camp-microsoft (author) commented:
This is certainly wrong - need to revisit what kind of relationship this is (does this need to be a vec at all?)
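One way to make the single-entry assumption explicit in the meantime (a sketch only, generic over the saved-state type rather than using the real types in this change):

```rust
/// Collapse an optional list of per-device saved states into at most one entry,
/// failing loudly instead of silently indexing element 0.
fn single_saved_state<T>(states: Option<Vec<T>>) -> anyhow::Result<Option<T>> {
    match states {
        None => Ok(None),
        Some(s) if s.len() <= 1 => Ok(s.into_iter().next()),
        Some(s) => anyhow::bail!(
            "expected at most one saved MANA device state, found {}",
            s.len()
        ),
    }
}
```

If the vec turns out to be genuinely per-VF, the restore path would instead need to key each entry to its device rather than collapsing to one.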

Comment on deleted lines 164 to 217:
impl<T: DeviceBacking> Drop for GdmaDriver<T> {
    fn drop(&mut self) {
        if self.hwc_failure {
            return;
        }
        let data = self
            .bar0
            .mem
            .read_u32(self.bar0.map.vf_gdma_sriov_shared_reg_start as usize + 28);
        if data == u32::MAX {
            tracing::error!("Device no longer present");
            return;
        }

        let hdr = SmcProtoHdr::new()
            .with_msg_type(SmcMessageType::SMC_MSG_TYPE_DESTROY_HWC.0)
            .with_msg_version(SMC_MSG_TYPE_DESTROY_HWC_VERSION);

        let hdr = u32::from_le_bytes(hdr.as_bytes().try_into().expect("known size"));
        self.bar0.mem.write_u32(
            self.bar0.map.vf_gdma_sriov_shared_reg_start as usize + 28,
            hdr,
        );
        // Wait for the device to respond.
        let max_wait_time =
            std::time::Instant::now() + Duration::from_millis(HWC_POLL_TIMEOUT_IN_MS);
        let header = loop {
            let data = self
                .bar0
                .mem
                .read_u32(self.bar0.map.vf_gdma_sriov_shared_reg_start as usize + 28);
            if data == u32::MAX {
                tracing::error!("Device no longer present");
                return;
            }
            let header = SmcProtoHdr::from(data);
            if !header.owner_is_pf() {
                break header;
            }
            if std::time::Instant::now() > max_wait_time {
                tracing::error!("MANA request timed out. SMC_MSG_TYPE_DESTROY_HWC");
                return;
            }
            std::hint::spin_loop();
        };

        if !header.is_response() {
            tracing::error!("expected response");
        }
        if header.status() != 0 {
            tracing::error!("DESTROY_HWC failed: {}", header.status());
        }
    }
}
justus-camp-microsoft (author) commented:
This Drop impl needs to be re-added and run conditionally based on whether we're servicing or not - what's the best way to do that?
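One possible shape, as a sketch only: gate the DESTROY_HWC handshake on a flag that the save path sets just before the driver is dropped for servicing. The struct, field, and method names below are hypothetical placeholders, not the existing driver types.

```rust
/// Hypothetical stand-in for the driver state that owns the hardware channel.
struct HwcTeardownGuard {
    hwc_failure: bool,
    /// Set by the (hypothetical) save-for-servicing path right after the saved
    /// state is produced; left false for a normal shutdown.
    saved_for_keepalive: bool,
}

impl HwcTeardownGuard {
    /// Called by the servicing save path so the subsequent drop keeps the
    /// hardware channel alive for keepalive restore.
    fn mark_saved_for_keepalive(&mut self) {
        self.saved_for_keepalive = true;
    }
}

impl Drop for HwcTeardownGuard {
    fn drop(&mut self) {
        if self.hwc_failure || self.saved_for_keepalive {
            // Either the channel is already gone, or we want it to survive
            // servicing; in both cases skip the DESTROY_HWC message.
            return;
        }
        // ... issue SMC_MSG_TYPE_DESTROY_HWC as in the removed impl above ...
    }
}
```

A normal shutdown leaves the flag false, so the existing DESTROY_HWC behavior is preserved outside of servicing.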
