
mana: keepalive feature #1136


Draft · wants to merge 56 commits into main

Conversation

@justus-camp-microsoft (Contributor) commented Apr 3, 2025

Opening this as a place to discuss changes; I may end up splitting it across multiple PRs to stage more thorough reviews.

@justus-camp-microsoft (Contributor, Author) left a comment

I've "pre-reviewed" some things I specifically want to talk about, but also wanted to add some overall questions here as well:

  • I split out a module mana_save_restore to make managing the saved state easier, but there's still some random saved state scattered in other modules. What's the right way to structure the modules?
  • The queue save/restore flow was particularly difficult to get working, and what I have now is relatively convoluted: the Mana queues are saved as part of the VMBus saved state and copied back into the ManaSavedState during the restore flow (a rough sketch of the shapes involved follows this list). I'm not sure this is the right approach, but I can go into more detail about why I ended up doing it this way.
  • I'm not sure I'm saving/restoring enough information to feel like this is a sufficient implementation. Of particular note is the restart_queues method in netvsp/src/lib.rs.
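
Roughly, the shapes involved in that hand-off look like the sketch below. Everything other than the pci_id and queues fields mentioned above is a placeholder for illustration, not the actual definitions in mana_save_restore.

// Illustrative only. `pci_id` and `queues` are the fields discussed above; the
// per-queue state type and its fields are hypothetical stand-ins.
pub struct ManaSavedState {
    /// Identifies which MANA device this state belongs to.
    pub pci_id: String,
    /// Per-queue state. Saved as part of the VMBus/netvsp saved state and copied
    /// back into this struct during the restore flow.
    pub queues: Vec<SavedQueueState>,
}

/// Hypothetical per-queue state; the real contents would be whatever
/// restart_queues in netvsp needs to rebuild a queue.
pub struct SavedQueueState {
    pub queue_id: u32,
    pub mem_base_pfn: u64,
    pub mem_len: usize,
}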

Comment on lines 522 to 536
if let Some(state) = servicing_unit_state.as_ref() {
if let Some(s) = state.iter().find(|s| s.name.contains("net:")) {
if let Ok(netvsp_state) = s.state.parse::<netvsp::saved_state::SavedState>() {
if let Some(init_state) = servicing_init_state.as_mut() {
if let Some(mana_state) = init_state.mana_state.as_mut() {
if !mana_state.is_empty() {
if let Some(queues) = netvsp_state.saved_queues {
mana_state[0].queues = queues;
}
}
}
}
}
}
}
@justus-camp-microsoft (Contributor, Author):

I could be thinking about this wrong, but I'm not sure how to link the saved state unit to the correct ManaSavedState unit. I have the pci_id in ManaSavedState, but I have a GUID in the saved state unit that doesn't seem to be directly linked to the pci_id.
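
As a readability aside, the nested lookup above could be flattened with let-else into a helper. This is only a sketch with assumed parameter types; the body mirrors the nested if-lets with early returns and has the same behavior.

// Sketch only: the unit-state and init-state parameter types are assumptions.
fn copy_saved_queues_into_mana_state(
    servicing_unit_state: Option<&Vec<SavedStateUnit>>,     // hypothetical unit type
    servicing_init_state: Option<&mut ServicingInitState>,  // hypothetical init-state type
) {
    let Some(state) = servicing_unit_state else { return };
    let Some(unit) = state.iter().find(|s| s.name.contains("net:")) else { return };
    let Ok(netvsp_state) = unit.state.parse::<netvsp::saved_state::SavedState>() else { return };
    let Some(queues) = netvsp_state.saved_queues else { return };
    let Some(init_state) = servicing_init_state else { return };
    let Some(mana_state) = init_state.mana_state.as_mut() else { return };
    let Some(first) = mana_state.first_mut() else { return };
    first.queues = queues;
}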

Comment on lines 4248 to 4372
c_state
.endpoint
.get_queues(queue_config, rss.as_ref(), &mut queues)
.await
.map_err(WorkerError::Endpoint)?;
if let Some(saved_queues) = &c_state.saved_queues {
c_state
.endpoint
.restore_queues(queue_config, saved_queues.clone(), &mut queues)
.await
.map_err(WorkerError::Endpoint)?;
} else {
c_state
.endpoint
.get_queues(queue_config, rss.as_ref(), &mut queues)
.await
.map_err(WorkerError::Endpoint)?;
}
@justus-camp-microsoft (Contributor, Author):

GitHub doesn't give a good way to discuss code outside of the diff, but there's a lot of code above that isn't directly saved/restored. Which elements of this flow need to be added to the saved state, and which can simply be re-initialized in the restore flow?
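
For context, this is the rough shape of restore_queues as inferred from the call above. It is an inference from the call site, not the actual trait definition; the config, saved-state, queue, and error types are all assumptions.

// Inferred sketch: restore_queues appears to mirror get_queues, but takes previously
// saved per-queue state instead of an RSS configuration.
trait EndpointRestoreSketch {
    async fn restore_queues(
        &mut self,
        config: Vec<QueueConfig<'_>>,        // assumed: same config type get_queues takes
        saved_state: Vec<SavedQueueState>,   // hypothetical saved per-queue state
        queues: &mut Vec<Box<dyn Queue>>,    // assumed: same output container as get_queues
    ) -> anyhow::Result<()>;                 // error type is an assumption
}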

Comment on lines 868 to 907
fn get_dma_buffer(
&self,
len: usize,
base_pfn: u64,
) -> anyhow::Result<user_driver::memory::MemoryBlock> {
tracing::info!("looking for slot: {:x}", base_pfn);
tracing::info!("slot state: {:?}", self.inner.state.lock().slots);

let size_pages = NonZeroU64::new(len as u64 / PAGE_SIZE)
.context("allocation of size 0 not supported")?
.get();

let mut inner = self.inner.state.lock();
let inner = &mut *inner;
let slot = inner.slots.iter().find(|slot| {
if let SlotState::Leaked {
device_id: _,
tag: _,
} = &slot.state
{
slot.base_pfn == base_pfn
} else {
false
}
});

if slot.is_none() {
anyhow::bail!("allocation doesn't exist");
}

let handle = PagePoolHandle {
inner: self.inner.clone(),
base_pfn,
size_pages,
mapping_offset: slot.unwrap().mapping_offset,
};

tracing::info!("successfully got memory block");
handle.into_memory_block()
}
@justus-camp-microsoft (Contributor, Author):

This API exists because the queues aren't re-initialized until after the DmaClient has already validated that memory was restored and has marked the MemoryBlocks as leaked. There's certainly a better solution here.
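
For context on how this is meant to be used, here is a hypothetical restore-path call site; pool and the saved_queue fields are illustrative names, not the real ones.

// Hypothetical restore path: the allocation was marked Leaked when the DmaClient
// validated the restored memory, so the queue rebuild looks it up again by its
// saved base PFN and length instead of allocating fresh pages.
let mem: user_driver::memory::MemoryBlock =
    pool.get_dma_buffer(saved_queue.mem_len, saved_queue.mem_base_pfn)?;
// The restored queue is then constructed over `mem` rather than over a new allocation.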

.run()
.await?;

validate_mana_nic(&agent).await?;
@justus-camp-microsoft (Contributor, Author):

Right now, this validate_mana_nic function only validates that we see the right MAC address and IP, but doesn't actually validate any traffic. What's the right way to go about doing that?
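
One option, sketched with a hypothetical helper: push real traffic through the NIC after servicing and fail the test if it doesn't go through. run_in_guest stands in for whatever the petri agent actually exposes for running guest commands, and the address is a placeholder; none of these names are confirmed.

// Sketch only: `run_in_guest` is a hypothetical helper, not the real petri/agent API,
// and 10.0.0.1 is a placeholder for whatever endpoint the MANA test network exposes.
run_in_guest(&agent, "ping -c 4 -I eth0 10.0.0.1").await?;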

Comment on lines +82 to +95
/// Test servicing an OpenHCL VM from the current version to itself
/// with mana keepalive support.
#[openvmm_test(openhcl_linux_direct_x64 [LATEST_LINUX_DIRECT_TEST_X64])]
async fn openhcl_servicing_mana_keepalive(
config: PetriVmConfigOpenVmm,
(igvm_file,): (ResolvedArtifact<impl petri_artifacts_common::tags::IsOpenhclIgvm>,),
) -> Result<(), anyhow::Error> {
openhcl_servicing_core(
config,
"OPENHCL_ENABLE_VTL2_GPA_POOL=512 OPENHCL_MANA_KEEP_ALIVE=1",
igvm_file,
OpenHclServicingFlags {
enable_nvme_keepalive: false,
enable_mana_keepalive: true,
@justus-camp-microsoft (Contributor, Author):

This test is basically a duplicate; it should probably either be removed, or the other test should be moved here along with the validate_mana_nic code.
