DMA hint calculation in OpenHCL bootshim and fallback mem allocator for NVMe #1190

Open · wants to merge 18 commits into main

Conversation

@yupavlen-ms (Contributor):

To get some early feedback before creating a proper PR.

There was a comment suggesting that we should not create a big lookup table and should always use heuristics instead. The problem with that: the heuristics can change significantly depending on which device types are assigned to this specific VM (MFND, ASAP, MANA), so if we really want heuristics, some host changes are needed:

  1. Host (worker process) must check if at least one MFND device is assigned to VTL2 and then append "nvme" flag.
  2. WP must check if at least one ASAP device is assigned to VTL2 and then append "asap" flag.
  3. WP must check if at least one MANA device is assigned to VTL2 and then append "mana" flag.

Currently, in AH2025+, the "nvme" flag is always present and the distinction is made by a non-zero dma-pages property. We may need to rework that so the "nvme" flag is present only when MFND devices are detected.
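For illustration, a minimal sketch of the decision the bootshim could make from the host-provided flags and dma-pages property. All names here (`should_use_dma_heuristic`, `host_flags`, `host_dma_pages`) are hypothetical, not the PR's actual identifiers:

```rust
// Hypothetical sketch: apply the bootshim's own DMA-hint heuristic only when
// the host did not supply a page count and the "nvme" flag is present.
fn should_use_dma_heuristic(host_flags: &[&str], host_dma_pages: u64) -> bool {
    // A non-zero host-provided value always wins over the heuristic.
    host_dma_pages == 0 && host_flags.contains(&"nvme")
}
```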

@yupavlen-ms (Contributor Author):

@chris-oo @justus-camp-microsoft please read the PR description.

dma_hint_4k = f.dma_hint_mb as u64 * 1048576 / PAGE_SIZE_4K;
break;
} else {
// Prepare for possible extrapolation.
Member:

Should we just instead round to the next bucket, aka be pessimistic?

Contributor Author:

Possible, but it will also reserve more than needed. If VTL2 Linux is okay with that (it could be), then we can simplify it. Some thorough testing is needed, e.g. creating all Underhill VM sizes on TiP.
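The "round up to the next bucket" alternative from this thread could look like the following sketch. The bucket values are illustrative, not the PR's real table:

```rust
// Hypothetical (vp_count, dma_hint_mb) buckets, sorted by vp_count.
const BUCKETS: &[(u64, u64)] = &[(2, 2), (4, 4), (8, 6), (16, 10)];

/// Returns the DMA hint (in MiB) for `vp_count`, rounding up to the next
/// bucket when there is no exact match (pessimistic: may reserve more than
/// strictly needed). `None` means the VM is larger than any known bucket.
fn dma_hint_pessimistic_mb(vp_count: u64) -> Option<u64> {
    BUCKETS
        .iter()
        .find(|(vps, _)| *vps >= vp_count)
        .map(|(_, mb)| *mb)
}
```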

},
];

/// Returns calculated DMA hint value, in 4k pages.
Member:

I'd probably want some more rationale on what we should do if we don't match one of these lookup tables exactly.
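One possible policy when the VP count falls between table entries is linear interpolation between the two nearest buckets. A hypothetical sketch, not the PR's implementation (the table values are made up; only `PAGE_SIZE_4K` and the MiB-to-pages conversion mirror the diff above):

```rust
const PAGE_SIZE_4K: u64 = 4096;

/// Interpolates a DMA hint (in 4k pages) for `vp_count` from a sorted
/// `(vp_count, dma_hint_mb)` table. Returns `None` when `vp_count` is
/// outside the table's range (extrapolation would need a separate policy).
fn interpolate_hint_4k(table: &[(u64, u64)], vp_count: u64) -> Option<u64> {
    let below = table.iter().rev().find(|(v, _)| *v <= vp_count)?;
    let above = table.iter().find(|(v, _)| *v >= vp_count)?;
    let mb = if above.0 == below.0 {
        below.1 // Exact match.
    } else {
        // Linear interpolation between the surrounding buckets.
        below.1 + (above.1 - below.1) * (vp_count - below.0) / (above.0 - below.0)
    };
    Some(mb * 1_048_576 / PAGE_SIZE_4K)
}
```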

@@ -463,7 +467,17 @@ impl PartitionInfo {
crate::cmdline::parse_boot_command_line(storage.cmdline.as_str())
.enable_vtl2_gpa_pool;

max(dt_page_count.unwrap_or(0), cmdline_page_count.unwrap_or(0))
let hostval = max(dt_page_count.unwrap_or(0), cmdline_page_count.unwrap_or(0));
if hostval == 0 &&
Member:

I wonder if we should also make some indication to usermode that we calculated dma memory ourselves based on heuristics, instead of the host providing it. And make that available via inspection and log it?

Contributor Author:

That would be nice. It sounds like you mean another vmbus protocol chunk, plus some logging later?

Member:

i don't know if we need to do anything other than report something more in the device tree we pass to the linux kernel (which i think we already do today?)

Contributor Author:

We do it today, I haven't found a case where we need to change anything there yet.

Member:

Ah, I mean that we should report some indication (that we can log) so we know that we didn't get a hint from the host and calculated it ourselves. I don't think we have that part today.
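As a sketch of that indication, the source of the DMA size could be recorded alongside the value so usermode can log and inspect it. The types and names below are hypothetical, not the PR's actual code:

```rust
/// Where the VTL2 GPA pool size came from (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DmaHintSource {
    Host,
    Heuristic,
}

/// Picks the host-provided page count when present; otherwise falls back to
/// the bootshim heuristic, tagging the result so it can be reported later.
fn resolve_dma_pages(host_pages: u64, heuristic_pages: u64) -> (u64, DmaHintSource) {
    if host_pages != 0 {
        (host_pages, DmaHintSource::Host)
    } else {
        (heuristic_pages, DmaHintSource::Heuristic)
    }
}
```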

@yupavlen-ms yupavlen-ms changed the title [DRAFT] [Sneak Peek] DMA hint calculation in OpenHCL bootshim [DRAFT] DMA hint calculation in OpenHCL bootshim Apr 18, 2025
@yupavlen-ms yupavlen-ms marked this pull request as ready for review April 24, 2025 18:37
@yupavlen-ms yupavlen-ms requested review from a team as code owners April 24, 2025 18:37
@yupavlen-ms yupavlen-ms changed the title [DRAFT] DMA hint calculation in OpenHCL bootshim DMA hint calculation in OpenHCL bootshim Apr 24, 2025
@yupavlen-ms yupavlen-ms changed the title DMA hint calculation in OpenHCL bootshim DMA hint calculation in OpenHCL bootshim and fallback mem allocator for NVMe Apr 24, 2025
let hostval = max(dt_page_count.unwrap_or(0), cmdline_page_count.unwrap_or(0));
if hostval == 0
&& parsed.nvme_keepalive
&& params.isolation_type == IsolationType::None
Contributor Author:

Based on other discussion, this may need to be revisited.

}

/// Not supported for this allocator.
fn fallback_alloc_size(&self) -> u64 {
Member:

While this value is useful, I'm wondering if is_persistent is enough here? If you query at save time "are all the allocations tied to this client persistent?" and the answer is no, then that's good enough right?

How does this PR handle if a given allocation has persistent and non-persistent allocations in a client? Do we expect clients to do full teardown before we save/restore dma_manager state? Or are we expecting dma_manager to know that the "allocations" tied to a client that did fallback are actually free ranges?

Contributor Author:

The idea of is_persistent() is to report the allocator's default behavior; it is informational only. To detect that fallback was used, fallback_alloc_size() must return non-zero, which triggers a runtime disable of persistent state save.

So it sounds like you want a function that just tells us whether fallback happened?
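A sketch of that simpler boolean query: derive "did fallback happen" from the existing byte count, so callers that only care about persistence get a yes/no answer. The trait and type names are illustrative, not the PR's API:

```rust
// Hypothetical trait shape combining the two queries discussed above.
trait DmaClientStats {
    /// Bytes allocated through the non-persistent fallback allocator.
    fn fallback_alloc_size(&self) -> u64;

    /// True when every allocation made by this client can be saved/restored,
    /// i.e. no fallback allocation ever happened.
    fn is_fully_persistent(&self) -> bool {
        self.fallback_alloc_size() == 0
    }
}

// Minimal stand-in implementation for demonstration.
struct StatsExample {
    fallback: u64,
}

impl DmaClientStats for StatsExample {
    fn fallback_alloc_size(&self) -> u64 {
        self.fallback
    }
}
```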

@@ -161,6 +162,8 @@ enum NvmeWorkerRequest {
CreateIssuer(Rpc<u32, ()>),
/// Save worker state.
Save(Rpc<(), anyhow::Result<NvmeDriverWorkerSavedState>>),
/// Query how much memory was allocated with fallback allocator.
QueryAllocatorStats(Rpc<(), DmaClientAllocStats>),
Member:

If we keep a clone of the client in nvme_manager, do we even need this? Since we have an Arc<dyn DmaClient> couldn't we just ask directly without needing to rpc to the driver?

Contributor Author:

We keep the dma manager in the nvme manager, but we don't know which clients were spawned for the individual devices.
I was looking into keeping everything in the nvme manager, but I don't see an easy connection between the manager and get_namespace / get_driver, where we allocate a new client.

@@ -341,6 +414,45 @@ impl NvmeManagerWorker {
.map_err(|source| InnerError::Namespace { nsid, source })
}

/// Copy of the code from get_driver.
fn get_dma_client(&self, pci_id: String) -> Result<Arc<dyn DmaClient>, InnerError> {
Member:

prefer fn names without get. In this case, it's actually a new operation right?

Contributor Author:

ok

Context: I would like to use this function in both places where it is needed, get_driver and restore. But get_driver uses &mut self, and I cannot call it from there.

Open for suggestions.
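One common workaround for the &mut self conflict described above is to make the helper an associated function that takes only the fields it needs, so it can run while self.devices is mutably borrowed through the entry API. A hypothetical sketch (Worker, dma_client_for, and the String stand-in for a DMA client are all illustrative, not the PR's types):

```rust
use std::collections::HashMap;

struct Worker {
    devices: HashMap<String, String>,
    dma_mode: bool,
}

impl Worker {
    // Associated fn: takes no `&self`, so it cannot conflict with the
    // mutable borrow of `self.devices` held by the entry below.
    fn dma_client_for(dma_mode: bool, pci_id: &str) -> String {
        format!("client_{}_{}", pci_id, dma_mode)
    }

    fn get_driver(&mut self, pci_id: &str) -> &String {
        // Copy the needed field out first, then borrow `devices` mutably.
        let dma_mode = self.dma_mode;
        self.devices
            .entry(pci_id.to_owned())
            .or_insert_with(|| Self::dma_client_for(dma_mode, pci_id))
    }
}
```

Because `dma_client_for` never touches `self`, `restore` could call it through `Self::` as well.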

@@ -295,18 +346,40 @@ impl NvmeManagerWorker {
let driver = match self.devices.entry(pci_id.to_owned()) {
hash_map::Entry::Occupied(entry) => entry.into_mut(),
hash_map::Entry::Vacant(entry) => {
let device_name = format!("nvme_{}", pci_id);
let lower_vtl_policy = LowerVtlPermissionPolicy::Any;
Member:

why is this block copy pasted if we have the new_dma_client fn below?

Contributor Author:

This block:

let driver = match self.devices.entry(pci_id.to_owned()) {

prevents me from calling it (a mutable vs. immutable borrow error). If there is a solution, please advise.

{
Ok(s) => {
if s.stats.fallback_alloc > 0 {
tracing::warn!(
Member:

I think we could log this within dma_manager itself? why do it here?

Contributor Author:

Ok, let me check.

The idea is that we need to know the final amount of fallback memory allocations at the moment of servicing. We could log every fallback allocation instead, but I think the final number is more useful.
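A sketch of that "final number at servicing time" idea: sum the per-client stats once, just before save. The struct reuses the DmaClientAllocStats name and fallback_alloc field from the diff above, but its exact shape here is assumed:

```rust
// Assumed shape of the per-client stats returned by QueryAllocatorStats.
struct DmaClientAllocStats {
    fallback_alloc: u64,
}

/// Total bytes allocated via the fallback allocator across all clients;
/// a single non-zero result is what would be logged at servicing time.
fn total_fallback(stats: &[DmaClientAllocStats]) -> u64 {
    stats.iter().map(|s| s.fallback_alloc).sum()
}
```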

@chris-oo (Member):

We discussed splitting the dma calculation and fallback code into different PRs, because they are logically distinct. Are you doing that?
