mana: keepalive feature #1136
base: main
Conversation
I've "pre-reviewed" some things I specifically want to talk about, but also wanted to add some overall questions here as well:
- I split out a module, `mana_save_restore`, to make managing the saved state easier, but there's still some random saved state scattered in other modules. What's the right way to structure the modules?
- The queue save/restore flow was particularly difficult to get working. What I have now is relatively convoluted: the Mana queues are saved as part of the VMBus saved state and put back into the `ManaSavedState` during the restore flow (a rough sketch of the shape I mean follows below). I'm not sure this is the right way to do this but can go into more detail about why I ended up doing it this way.
- I'm not sure I'm saving/restoring enough information to feel like this is a sufficient implementation. Of particular note is the `restart_queues` method in `netvsp/src/lib.rs`.
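
To make that concrete, here is a rough sketch of the per-device saved state shape I'm describing. The field names are approximate, the real definitions live in `mana_save_restore`, and `QueueSavedState` is a placeholder name:

```rust
/// Rough sketch only -- the real definitions live in the mana_save_restore
/// module and may differ.
pub struct ManaSavedState {
    /// PCI id of the MANA device this state belongs to (used to match the
    /// restored state back to the right device).
    pub pci_id: String,
    /// Per-queue state, captured via the VMBus/netvsp saved state on save
    /// and copied back in during restore.
    pub queues: Vec<QueueSavedState>,
}

/// Placeholder for whatever per-queue data actually needs to survive
/// servicing; illustrative only.
pub struct QueueSavedState {
    // fields omitted
}
```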
openhcl/underhill_core/src/worker.rs
```rust
if let Some(state) = servicing_unit_state.as_ref() {
    if let Some(s) = state.iter().find(|s| s.name.contains("net:")) {
        if let Ok(netvsp_state) = s.state.parse::<netvsp::saved_state::SavedState>() {
            if let Some(init_state) = servicing_init_state.as_mut() {
                if let Some(mana_state) = init_state.mana_state.as_mut() {
                    if !mana_state.is_empty() {
                        if let Some(queues) = netvsp_state.saved_queues {
                            mana_state[0].queues = queues;
                        }
                    }
                }
            }
        }
    }
}
```
I could be thinking about this wrong, but I'm not sure how to link the saved state unit to the correct `ManaSavedState` unit. I have the `pci_id` in `ManaSavedState`, but I have a GUID in the saved state unit that doesn't seem to be directly linked to the `pci_id`.
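
For illustration, this is roughly the lookup I'd like to be able to do here (purely a sketch; the GUID-to-`pci_id` mapping is exactly the piece that doesn't exist today):

```rust
// Sketch only: match a restored saved-state unit back to its ManaSavedState
// by pci_id. Today the unit only carries a GUID, so this isn't directly
// possible.
fn find_mana_state<'a>(
    mana_state: &'a mut [ManaSavedState],
    pci_id: &str,
) -> Option<&'a mut ManaSavedState> {
    mana_state.iter_mut().find(|s| s.pci_id == pci_id)
}
```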
vm/devices/net/netvsp/src/lib.rs
```diff
-c_state
-    .endpoint
-    .get_queues(queue_config, rss.as_ref(), &mut queues)
-    .await
-    .map_err(WorkerError::Endpoint)?;
+if let Some(saved_queues) = &c_state.saved_queues {
+    c_state
+        .endpoint
+        .restore_queues(queue_config, saved_queues.clone(), &mut queues)
+        .await
+        .map_err(WorkerError::Endpoint)?;
+} else {
+    c_state
+        .endpoint
+        .get_queues(queue_config, rss.as_ref(), &mut queues)
+        .await
+        .map_err(WorkerError::Endpoint)?;
+}
```
GitHub doesn't give a good way to discuss code outside of the diff, but there's a lot of code here above that isn't directly saved/restored. What elements of this flow need to be added to the saved state vs. can be re-initialized in the restore flow?
vm/page_pool_alloc/src/lib.rs
```rust
fn get_dma_buffer(
    &self,
    len: usize,
    base_pfn: u64,
) -> anyhow::Result<user_driver::memory::MemoryBlock> {
    tracing::info!("looking for slot: {:x}", base_pfn);
    tracing::info!("slot state: {:?}", self.inner.state.lock().slots);

    let size_pages = NonZeroU64::new(len as u64 / PAGE_SIZE)
        .context("allocation of size 0 not supported")?
        .get();

    let mut inner = self.inner.state.lock();
    let inner = &mut *inner;
    let slot = inner.slots.iter().find(|slot| {
        if let SlotState::Leaked {
            device_id: _,
            tag: _,
        } = &slot.state
        {
            slot.base_pfn == base_pfn
        } else {
            false
        }
    });

    if slot.is_none() {
        anyhow::bail!("allocation doesn't exist");
    }

    let handle = PagePoolHandle {
        inner: self.inner.clone(),
        base_pfn,
        size_pages,
        mapping_offset: slot.unwrap().mapping_offset,
    };

    tracing::info!("successfully got memory block");
    handle.into_memory_block()
}
```
This API is the result of the queues not being re-initialized until after the `DmaClient` has already validated that memory has been restored and has marked the `MemoryBlock`s as leaked. There's certainly a better solution here.
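
For reference, the restore path ends up using it roughly like this (simplified sketch; `pool`, `saved_len`, and `saved_base_pfn` are illustrative names for the page-pool object and the values recorded in the queue saved state):

```rust
// Simplified sketch: reclaim a region that the page pool has marked as
// Leaked, using the length and base PFN recorded at save time.
let mem = pool.get_dma_buffer(saved_len, saved_base_pfn)?;
```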
```rust
    .run()
    .await?;

validate_mana_nic(&agent).await?;
```
Right now, this `validate_mana_nic` function only validates that we see the right MAC address and IP, but doesn't actually validate any traffic. What's the right way to go about doing that?
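
One possibility, purely as a sketch, would be to push some traffic through the NIC after servicing and check that it made it. `run_guest_command` and the target address below are hypothetical placeholders, since I'm not sure what the agent actually exposes for this:

```rust
// Hypothetical sketch: send a few pings from inside the guest over the MANA
// NIC and fail if no replies come back. run_guest_command is a placeholder
// for whatever command-execution facility the agent provides, and the
// address is a stand-in for the test network's gateway.
let output = run_guest_command(&agent, "ping -c 4 10.0.0.1").await?;
anyhow::ensure!(
    output.contains(" 0% packet loss"),
    "no traffic passed over the MANA NIC after servicing"
);
```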
```rust
/// Test servicing an OpenHCL VM from the current version to itself
/// with mana keepalive support.
#[openvmm_test(openhcl_linux_direct_x64 [LATEST_LINUX_DIRECT_TEST_X64])]
async fn openhcl_servicing_mana_keepalive(
    config: PetriVmConfigOpenVmm,
    (igvm_file,): (ResolvedArtifact<impl petri_artifacts_common::tags::IsOpenhclIgvm>,),
) -> Result<(), anyhow::Error> {
    openhcl_servicing_core(
        config,
        "OPENHCL_ENABLE_VTL2_GPA_POOL=512 OPENHCL_MANA_KEEP_ALIVE=1",
        igvm_file,
        OpenHclServicingFlags {
            enable_nvme_keepalive: false,
            enable_mana_keepalive: true,
```
This test is basically a duplicate and probably should be removed, or the other one should be moved here with the `validate_mana_nic` code.
Opening this as a place to discuss changes; it may end up being split across multiple PRs to stage more thorough reviews.