DMA hint calculation in OpenHCL bootshim and fallback mem allocator for NVMe #1190
base: main
Conversation
@chris-oo @justus-camp-microsoft please read the PR description.
    dma_hint_4k = f.dma_hint_mb as u64 * 1048576 / PAGE_SIZE_4K;
    break;
} else {
    // Prepare for possible extrapolation.
Should we instead just round up to the next bucket, i.e. be pessimistic?
Possible, but that will also reserve more than needed. If VTL2 Linux is okay with that (it could be), then we can simplify it. Some thorough testing is needed, e.g. creating all Underhill VM sizes on TiP.
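For context, a minimal sketch of the two options being discussed, with hypothetical entry and field names (the actual lookup table and extrapolation in this PR differ):

```rust
const PAGE_SIZE_4K: u64 = 4096;

/// Hypothetical lookup entry: VM shape -> suggested DMA hint in MiB.
struct DmaLookupEntry {
    vp_count: u32,
    memory_mb: u64,
    dma_hint_mb: u64,
}

/// Option A: exact match, otherwise extrapolate from the closest entry
/// with the same VP count (one plausible reading of the PR's
/// "prepare for possible extrapolation" branch).
fn dma_hint_extrapolate(table: &[DmaLookupEntry], vps: u32, mem_mb: u64) -> u64 {
    if let Some(f) = table.iter().find(|f| f.vp_count == vps && f.memory_mb == mem_mb) {
        return f.dma_hint_mb * 1048576 / PAGE_SIZE_4K;
    }
    table
        .iter()
        .filter(|f| f.vp_count == vps)
        .min_by_key(|f| f.memory_mb.abs_diff(mem_mb))
        .map(|f| f.dma_hint_mb * mem_mb / f.memory_mb * 1048576 / PAGE_SIZE_4K)
        .unwrap_or(0)
}

/// Option B: pessimistic, round up to the next bucket. Simpler, but reserves
/// more than needed for shapes that fall between buckets.
fn dma_hint_round_up(table: &[DmaLookupEntry], vps: u32, mem_mb: u64) -> u64 {
    table
        .iter()
        .filter(|f| f.vp_count == vps && f.memory_mb >= mem_mb)
        .min_by_key(|f| f.memory_mb)
        .map(|f| f.dma_hint_mb * 1048576 / PAGE_SIZE_4K)
        .unwrap_or(0)
}
```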
},
];

/// Returns calculated DMA hint value, in 4k pages.
I'd probably want some more rationale on what we should do if we don't match one of these lookup table entries exactly.
@@ -463,7 +467,17 @@ impl PartitionInfo {
    crate::cmdline::parse_boot_command_line(storage.cmdline.as_str())
        .enable_vtl2_gpa_pool;

    max(dt_page_count.unwrap_or(0), cmdline_page_count.unwrap_or(0))
    let hostval = max(dt_page_count.unwrap_or(0), cmdline_page_count.unwrap_or(0));
    if hostval == 0 &&
I wonder if we should also make some indication to usermode that we calculated the DMA memory ourselves based on heuristics, instead of the host providing it. And make that available via inspection and log it?
That would be nice. It sounds like you mean another vmbus protocol chunk, besides some logging later?
I don't know if we need to do anything other than report something more in the device tree we pass to the Linux kernel (which I think we already do today?).
We do that today; I haven't found a case where we need to change anything there yet.
Ah, I mean that we should report some indication (that we can log) so we know that we didn't get a hint from the host and calculated it ourselves. I don't think we have that part today.
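A hedged sketch of what that indication could look like (the enum and function here are hypothetical, not part of this PR):

```rust
/// Hypothetical marker for where the VTL2 GPA pool size came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DmaHintSource {
    /// The host supplied the size (device tree or command line).
    Host,
    /// The bootshim calculated it from VM shape heuristics.
    Heuristic,
}

/// Pick the pool size and remember its origin so usermode can inspect/log it.
fn choose_pool_size(hostval: u64, heuristic_4k: u64) -> (u64, DmaHintSource) {
    if hostval != 0 {
        (hostval, DmaHintSource::Host)
    } else {
        (heuristic_4k, DmaHintSource::Heuristic)
    }
}
```

The chosen source could then be carried alongside the pool size (e.g. in the device tree already handed to the VTL2 kernel) and surfaced via inspection and a log line in usermode.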
7a45be2 to 7fb5071
let hostval = max(dt_page_count.unwrap_or(0), cmdline_page_count.unwrap_or(0));
if hostval == 0
    && parsed.nvme_keepalive
    && params.isolation_type == IsolationType::None
Based on other discussion, this may need to be revisited.
}

/// Not supported for this allocator.
fn fallback_alloc_size(&self) -> u64 {
While this value is useful, I'm wondering if is_persistent is enough here? If you query at save time "are all the allocations tied to this client persistent?" and the answer is no, then that's good enough, right?
How does this PR handle a client that has both persistent and non-persistent allocations? Do we expect clients to do a full teardown before we save/restore dma_manager state? Or are we expecting dma_manager to know that the "allocations" tied to a client that did fallback are actually free ranges?
The idea of is_persistent() is to return the default behavior for the allocator; it is more of an informational value. To distinguish whether fallback was used, the function fallback_alloc_size must return non-zero; this triggers a runtime disable of the persistent state save.
So it sounds like you want a function which just tells that fallback did happen?
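A minimal sketch of how those two queries could fit together at save time, written against a stand-in trait (the real DmaClient trait and the save path in this PR differ):

```rust
use std::sync::Arc;

/// Stand-in trait for illustration only.
trait DmaClientLike: Send + Sync {
    /// Default behavior of the allocator backing this client.
    fn is_persistent(&self) -> bool;
    /// Bytes handed out by the fallback (non-persistent) allocator.
    fn fallback_alloc_size(&self) -> u64;
}

/// Keepalive state can only be saved if every client is persistent by
/// default *and* never had to fall back to non-persistent memory.
fn can_save_persistent_state(clients: &[Arc<dyn DmaClientLike>]) -> bool {
    clients
        .iter()
        .all(|c| c.is_persistent() && c.fallback_alloc_size() == 0)
}
```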
@@ -161,6 +162,8 @@ enum NvmeWorkerRequest {
    CreateIssuer(Rpc<u32, ()>),
    /// Save worker state.
    Save(Rpc<(), anyhow::Result<NvmeDriverWorkerSavedState>>),
    /// Query how much memory was allocated with fallback allocator.
    QueryAllocatorStats(Rpc<(), DmaClientAllocStats>),
If we keep a clone of the client in nvme_manager, do we even need this? Since we have an Arc<dyn DmaClient>, couldn't we just ask it directly without needing to RPC to the driver?
We keep the dma manager in the nvme manager, but we don't know which clients were spawned for the individual devices.
I was looking into keeping everything in the nvme manager, but I don't see an easy connection between the manager and get_namespace / get_driver, where we allocate a new client.
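For comparison, a sketch of the alternative being suggested, assuming the manager kept the per-device client handles in a map (the field and trait names here are hypothetical):

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the real DmaClient trait, for illustration only.
trait DmaClientStats: Send + Sync {
    /// Bytes served by the fallback (non-persistent) allocator.
    fn fallback_alloc_size(&self) -> u64;
}

struct NvmeManagerWorkerSketch {
    /// Hypothetical map: pci_id -> the DMA client created for that device.
    dma_clients: HashMap<String, Arc<dyn DmaClientStats>>,
}

impl NvmeManagerWorkerSketch {
    /// Sum fallback usage without an RPC round trip to each driver worker.
    fn total_fallback_bytes(&self) -> u64 {
        self.dma_clients.values().map(|c| c.fallback_alloc_size()).sum()
    }
}
```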
@@ -341,6 +414,45 @@ impl NvmeManagerWorker {
        .map_err(|source| InnerError::Namespace { nsid, source })
}

/// Copy of the code from get_driver.
fn get_dma_client(&self, pci_id: String) -> Result<Arc<dyn DmaClient>, InnerError> {
Prefer fn names without get. In this case, it's actually a new operation, right?
Ok. Context: I would like to use this function in both places where it is needed: get_driver and restore. But get_driver uses &mut self and I cannot call it from there. Open to suggestions.
@@ -295,18 +346,40 @@ impl NvmeManagerWorker {
    let driver = match self.devices.entry(pci_id.to_owned()) {
        hash_map::Entry::Occupied(entry) => entry.into_mut(),
        hash_map::Entry::Vacant(entry) => {
            let device_name = format!("nvme_{}", pci_id);
            let lower_vtl_policy = LowerVtlPermissionPolicy::Any;
Why is this block copy-pasted if we have the new_dma_client fn below?
This block:
let driver = match self.devices.entry(pci_id.to_owned()) {
prevents me from calling it (error about mutable vs. immutable reference). If there is a solution, please advise.
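One common way around that borrow error, sketched here with stand-in types (the real worker fields differ), is to create the client via an associated fn that borrows only the fields it needs, rather than a &self method, so it can run while self.devices is mutably borrowed:

```rust
use std::collections::{hash_map, HashMap};

// Stand-in types; the real NvmeManagerWorker fields differ.
struct Spawner;
struct Client;
struct Worker {
    spawner: Spawner,
    devices: HashMap<String, Client>,
}

impl Worker {
    // Associated fn that borrows only the spawner, not all of `self`.
    fn new_client(_spawner: &Spawner, _pci_id: &str) -> Client {
        Client
    }

    fn get_driver(&mut self, pci_id: &str) -> &mut Client {
        match self.devices.entry(pci_id.to_owned()) {
            hash_map::Entry::Occupied(e) => e.into_mut(),
            hash_map::Entry::Vacant(e) => {
                // `self.spawner` and `self.devices` are disjoint fields, so
                // borrowing one immutably while the other is mutably borrowed
                // is accepted by the borrow checker.
                let client = Self::new_client(&self.spawner, pci_id);
                e.insert(client)
            }
        }
    }
}
```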
{
    Ok(s) => {
        if s.stats.fallback_alloc > 0 {
            tracing::warn!(
I think we could log this within dma_manager itself? Why do it here?
Ok, let me check.
The idea is that we need to know the final amount of fallback memory allocations at the moment of servicing. Or we could log every fallback allocation, maybe, but I think the final number is more useful.
We discussed splitting the dma calculation and fallback code into different PRs, because they are logically distinct. Are you doing that?
To get some early feedback before creating a proper PR.
There is a comment that we should not be creating a big lookup table and should always use heuristics instead. There is a problem with that: the heuristics might change significantly depending on the device types assigned to this specific VM (MFND, ASAP, MANA), so if we really want heuristics then some host changes are needed:
Currently, in AH2025+, the "nvme" flag is always present and the distinction is made by a non-zero dma-pages property. We may need to rework that to make the "nvme" flag optional, present only when MFND devices are detected.