variable: return native Go pointers to bpf map memory #1607

Open · ti-mo wants to merge 2 commits into main from tb/atomic-memory

Conversation

@ti-mo ti-mo (Collaborator) commented Nov 7, 2024

Pending the proposal in golang/go#70224, here's the current API proposal for returning Go pointers to 'off-heap' bpf map memory:

func MemoryPointer[T any](mm *Memory, off uint64) (*T, error)
func VariablePointer[T any](v *Variable) (*T, error)
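
For illustration, a hedged usage sketch of the proposed API (the spec loading and all names here are hypothetical; with the edit below, CollectionOptions.UnsafeVariableExperiment would also need to be enabled):

obj := struct {
	Counter *Variable `ebpf:"counter"`
}{}
if err := spec.LoadAndAssign(&obj, nil); err != nil {
	// handle error
}

// A Go pointer straight into the bpf map memory backing the variable.
ptr, err := VariablePointer[uint64](obj.Counter)
if err != nil {
	// handle error
}
*ptr = 42 // written directly to the map, no Set() marshaling round trip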

We need a bit of help from the Go runtime to manage the lifecycle of the underlying memory mapping. Currently, this mmaps over a piece of Go heap to allow the GC to track pointers into the bpf map memory, obviating the need for tying the mmap's lifetime to an ebpf.Memory. Instead, any Go pointers into the mmap will keep it alive, removing the risk of use-after-free.

Clearly, this involves some dark arts, so we're not comfortable pushing this on users since Collection.Variables is populated by default. Any change in the runtime or allocator may break this, and the fallout means Go programs segfaulting without any hint as to why, which is not acceptable.

--- Edit ---

Since the runtime.AddManualMemoryCleanup proposal may take a while to implement, land and ship in a Go version we target, I've revived the patch set and put the heap-mapping behaviour behind CollectionOptions.UnsafeVariableExperiment. MemoryPointer remains unexported for now, since Map.Memory() is typically called directly by the user and we'd need to store the feature flag in *Map, which doesn't feel ideal.

Alternatively, we could go for an env feature flag instead of CollectionOptions (EBPFUNSAFEVARIABLES=true?), which would allow us to export MemoryPointer and transparently switch between implementations.

@ti-mo ti-mo mentioned this pull request Feb 18, 2025
@ti-mo ti-mo changed the title from "[WIP] variable: return native Go pointers to underlying bpf map values" to "variable: return native Go pointers to underlying bpf map values" Feb 25, 2025
@ti-mo ti-mo changed the title from "variable: return native Go pointers to underlying bpf map values" to "variable: return native Go pointers to bpf map memory" Feb 27, 2025
@ti-mo ti-mo force-pushed the tb/atomic-memory branch 3 times, most recently from 2d43b45 to 821aa80 on February 27, 2025 15:13
@ti-mo ti-mo marked this pull request as ready for review February 27, 2025 15:48
@ti-mo ti-mo requested a review from a team as a code owner February 27, 2025 15:48
Comment on lines +264 to +269
func VariablePointer[T any](v *Variable) (*T, error) {
if err := checkVariable[T](v); err != nil {
return nil, fmt.Errorf("variable pointer %s: %w", v.name, err)
}
return memoryPointer[T](v.mm, v.offset)
}

@mejedi mejedi (Contributor) commented Mar 4, 2025:

It doesn't look like it is possible to support slices within the same API, does it?

@ti-mo (Collaborator, Author) replied:

Not in the current proposal, no. I think supporting slices is a hazard, and we should probably document that clearly. (append() will reallocate and copy, unsharing the backing array)

Of course the caller can always slice the array, but we should encourage passing around the array pointer instead.
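
A hedged sketch of that hazard, assuming an array-typed variable v:

arr, _ := VariablePointer[[4]uint32](v) // *[4]uint32 backed by bpf map memory
s := arr[:]      // slicing the array still aliases the map memory
s = append(s, 5) // reallocates on the Go heap; s silently stops aliasing the map
s[0] = 1         // updates the private copy, not the bpf map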

@lmb lmb (Collaborator) left a comment:

Can either of you refresh my mind again what the killer feature for this is? Seems like @mejedi is interested in using this for slices?

}

var t T
size := binary.Size(t)

Collaborator:

Why is this binary.Size and not unsafe.Sizeof?

Collaborator:

This also doesn't take unsafe.Alignof into account.

@ti-mo (Collaborator, Author) replied:

> Why is this binary.Size and not unsafe.Sizeof?

unsafe.Sizeof() tolerates variable-sized types like untyped ints and also returns the size of the slice descriptor if the argument is a slice. binary.Size() rejects both. I think it's safer to reject variable-sized for now to avoid potential surprises.
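
For reference, a small illustration of the difference (not code from this change; sizes as on 64-bit):

var s []uint32
fmt.Println(unsafe.Sizeof(s))    // 24: the size of the slice header, not the data
fmt.Println(binary.Size(s))      // 0 for a nil slice: only the encoded elements count
fmt.Println(binary.Size(int(0))) // -1: plain int has no fixed size and is rejected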

> This also doesn't take unsafe.Alignof into account.

Don't the next few lines take care of ensuring aligned access?

Collaborator:

> I think it's safer to reject variable-sized for now to avoid potential surprises.

Unfortunately binary.Size ignores trailing padding, which means that it is not safe to use that function here.

struct foo {
  uint64_t a;
  uint32_t b;
  // implicit 4 byte padding
}

I agree with you that these types shouldn't be allowed, but I think it might be better to do that in the caller of this function. Then we have a separation of concerns:

  • this function ensures that we only transmute memory that is "safe" per the memory model, respecting full size and alignment.
  • the caller can further restrict this to sized types, maybe ones that embed structs.HostLayout, etc.

> Don't the next few lines take care of ensuring aligned access?

The alignment of a type can be smaller than its size, so I think that the check below is incorrect in the sense that it rejects correctly aligned structs. For example:

struct foo {
  uint64_t a, b;
}

This must be aligned to 8 bytes, not 16.
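
A hedged Go rendering of both examples (sizes and alignments as on 64-bit platforms):

type foo struct { // struct { uint64_t a; uint32_t b; }
	A uint64
	B uint32
	// 4 bytes of implicit trailing padding
}
// unsafe.Sizeof(foo{})  == 16 (includes the trailing padding)
// binary.Size(foo{})    == 12 (ignores it)
// unsafe.Alignof(foo{}) == 8

type bar struct { // struct { uint64_t a, b; }
	A, B uint64
}
// unsafe.Sizeof(bar{})  == 16
// unsafe.Alignof(bar{}) == 8 (alignment can be smaller than size)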

@ti-mo (Collaborator, Author) replied:

> Unfortunately binary.Size ignores trailing padding, which means that it is not safe to use that function here.

I see. I guess that's the reasoning behind unsafeBackingMemory as well; there you're using reflect.Type.Size(), which is analogous to unsafe.Sizeof(). Then, you compare it to binary.Size() to detect padding.

Thanks for the feedback! I'll see what I can come up with. Looks like we also need better tests. 🙂
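
For illustration, a minimal sketch of that padding check (hypothetical helper, not code from this PR):

// hasPadding reports whether T's in-memory size differs from its encoded size,
// i.e. whether it contains padding. binary.Size returns -1 for types it cannot
// encode, which also trips the check. Requires Go 1.22+ for reflect.TypeFor.
func hasPadding[T any]() bool {
	var t T
	return binary.Size(t) != int(reflect.TypeFor[T]().Size())
}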

// heap allocation, even individual struct fields or arbitrary offsets within
// a slice. In this case, finalizers set on struct fields or slice offsets
// will only run when the whole struct or backing array are collected. The
// accepted runtime.AddCleanup proposal makes this behaviour more explicit and

Collaborator:

Does all of this still work when switching to AddCleanup?

@ti-mo (Collaborator, Author) replied:

Need to check, but afaik AddCleanup is SetFinalizer modulo object resurrection.
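
For reference, the shape of the accepted runtime.AddCleanup API (Go 1.24); mapping and mappingFD are hypothetical here:

c := runtime.AddCleanup(mapping, func(fd int) {
	// Runs after mapping becomes unreachable. Unlike a finalizer, the cleanup
	// only receives fd and cannot resurrect mapping itself.
	unix.Close(fd)
}, mappingFD)
_ = c // c.Stop() would cancel the cleanup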

}

var t T
size := binary.Size(t)

Collaborator:

Would it be clearer to check that unsafe.Sizeof() == binary.Size()?

@ti-mo (Collaborator, Author) replied:

I think that would make things more complex, not less. The reader would have to be aware of the intricacies and the differences between both, and we'd still need the comparison to v.size, so it wouldn't eliminate any code.

Collaborator:

I think there are lots of assumptions that aren't explored yet:

  • GC behaviour on mappings which need to be aligned and therefore don't alias the original allocation
  • What happens when the heap mapping is dropped while the map's mmap still exists
  • What happens when a user sets a finalizer on type-cast memory
  • Test that things are actually freed at the right point in time
  • Test that a pointer at an offset to the original allocation keeps the whole thing alive

@ti-mo (Collaborator, Author) replied:

> • GC behaviour on mappings which need to be aligned and therefore don't alias the original allocation
> • What happens when a user sets a finalizer on type-cast memory

https://go.dev/play/p/FKVnnU85C8b

> • Test that a pointer at an offset to the original allocation keeps the whole thing alive

Could this question be nuanced? This seems to be inherent to how GC works.

> • What happens when the heap mapping is dropped while the map's mmap still exists

Not sure how this can happen without a moving GC and while there are still references to a span.

> • Test that things are actually freed at the right point in time

Let me come up with something.

@ti-mo ti-mo force-pushed the tb/atomic-memory branch from 821aa80 to fe8e5b1 on March 13, 2025 11:36

@ti-mo ti-mo (Collaborator, Author) commented Mar 13, 2025:

> Can either of you refresh my mind again what the killer feature for this is? Seems like @mejedi is interested in using this for slices?

I think the most pressing use case is atomics, which are currently not supported at all. Array variables currently work through Get() and Set(), but there's always a marshaling cost. Also, simply having the convenience of manipulating BPF variables as if they were Go variables is nice.
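
A hedged sketch of the atomics use case (obj.Counter naming a uint64 BPF global is hypothetical):

ctr, err := VariablePointer[atomic.Uint64](obj.Counter)
if err != nil {
	// handle error
}
ctr.Add(1)              // atomic read-modify-write, visible to the running bpf program
fmt.Println(ctr.Load()) // atomic load, no Get() unmarshaling round trip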

@ti-mo ti-mo requested a review from lmb March 13, 2025 11:43

// Bump the value by 1 using a bpf program.
want := uint32(1338)
a32, err := VariablePointer[atomic.Uint32](obj.Atomic)

Collaborator:

I should've said: it also only contains unexported fields. Passing the type to binary.Read or Write will cause an error. Is that the correct choice to make here?

func unmap(size int) func(*byte) {
return func(a *byte) {
// Create another mapping at the same address to undo the original mapping.
// This will cause the kernel to repair the slab since we're using the same

Collaborator:

What would happen if the kernel didn't "repair the slab" for some reason, and kept a separate mapping instead? Nothing in the man pages suggests that this is required behaviour.

@ti-mo (Collaborator, Author) replied:

> What would happen if the kernel didn't "repair the slab" for some reason, and kept a separate mapping instead?

I just documented this as an observation; you're right that this isn't documented behaviour, it's just how Linux behaves. There would be no behavioural change if the kernel doesn't repair the slab (basically, if the kernel no longer defragments contiguous page table entries/vmas). munmap takes addr and length, so if the runtime decides to release the mspan, it would munmap the whole range it thought it owned, including the intermediate mappings created by our hack (https://pubs.opengroup.org/onlinepubs/009696799/functions/munmap.html):

> The munmap() function shall remove any mappings for those entire pages containing any part of the address space of the process starting at addr and continuing for len bytes.

// align the requested allocation to a page boundary. This is needed for
// MAP_FIXED and makes sure we don't mmap over some other allocation on the Go
// heap.
size = internal.Align(size+os.Getpagesize(), os.Getpagesize())
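
For context, a sketch of the rounding this performs (hypothetical helper, not the library code):

func align(n, alignment int) int {
	// round n up to the next multiple of alignment
	return (n + alignment - 1) / alignment * alignment
}

// With 4 KiB pages, align(5000+4096, 4096) == 12288: three pages for a
// 5000-byte variable, which is why large page sizes inflate the overhead
// discussed below.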

Collaborator:

Something we discussed on the previous PR I think: this will be expensive on platforms that have large page sizes like arm64. Not sure there is anything we can do about it, but might make sense to call it out in the docs.

@ti-mo (Collaborator, Author) replied:

Hmm, okay. Looks like 64k pages need to be explicitly selected while building the kernel, and Ubuntu still defaults to 4k pages for now. This is all moot when runtime.AddForeignCleanup lands, so I'm not convinced we should burden the API docs with this detail.
