Expose the MemPoolInfo of matching memory pools to typed endpoints #2092
@gpalmer-latai it seems you want to use this for GPU computation. I have no experience with GPUs and would need to ask e.g. @elfenpiff, but I'm wondering whether your goals could also be achieved by extending the publisher and subscriber, e.g. by adding hooks like

I think alternative 1 will be quite a lot of work. Funnily enough, this is exactly the approach we are taking with the Rust implementation: there is exactly one memory pool per service. Having something like that in iceoryx would also be possible, but it is definitely nothing that can be achieved in the short term.

Regarding option 2: with the introspection you might be able to find a memory pool for a specific service, but for that you would also need to know under which user the process with the publisher/subscriber was started, and you won't be able to acquire the base pointer. What exactly do you hope to achieve with the information from the introspection?

In the short term something like your proof of concept should be possible, maybe with a more constrained API, but that is more of a detail. My preferred solution would be a refactoring of the posh layer similar to the Rust implementation, where we experimented with these ideas and have good experience.
To give you a bit more context on my use case: I'm working on a platform with shared graphics memory, which means the CPU and GPU can access the same memory. But this doesn't just work out of the box - you first have to call an NVIDIA API, specifically cudaHostRegister. This API takes a simple void*, a size, and some flags, and what it effectively does is "pin" the memory - page-locking it. I'm fuzzy on the details beyond that, but I believe this has implications for the TLB and associated memory hardware, and the reason I want to restrict the range of memory I call this on as much as possible is, quoting from the documentation linked above:
Furthermore, I cannot simply call this API every time I receive a message sample because it is a high-latency operation, so ideally I want to make all the appropriate calls once, early on, when initializing the system.

Anyway, my plan is to get a short-term solution working for my needs, and I would be very happy if we could work out a reasonable way to do this upstream here. But I also agree with you that the proper long-term solution is what you have already described in the Rust implementation. I'd be interested in having a lengthier discussion sometime about the work involved in bringing that to iceoryx (or adopting the Rust implementation). I might be able to avoid needing it now, but I suspect there will be more use cases/requirements I stumble upon where I will be really sad not to have this level of control.
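For concreteness, the one-time registration I have in mind would look roughly like this - `pinMemoryPool`, `poolBase`, and `poolSize` are made-up names for whatever iceoryx would expose; `cudaHostRegister` and its flags are the real CUDA API:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Pin one memory pool once at startup so the GPU can access it directly.
// The expensive CUDA call runs exactly once per pool at initialization,
// not once per received sample.
bool pinMemoryPool(void* poolBase, std::size_t poolSize)
{
    const cudaError_t status = cudaHostRegister(poolBase, poolSize, cudaHostRegisterDefault);
    if (status != cudaSuccess)
    {
        std::fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(status));
        return false;
    }
    return true;
}
```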
There should be a way forward to bring this upstream. I have to take counsel with my pillow and think a little more about the best short-term approach, especially what implications it has for the untyped API. Would you contribute the patch once we have agreed on the API?

Once the announcement for the Rust implementation is out we can have a deep dive, either as part of a developer meetup or a separate meeting. I guess @elfenpiff is also eager to participate.
Yeah, I should have some latitude to work on the patch.
Hahaha, great pun :)
I had a few thoughts on this topic and currently I favor continuing with the idea we had when
I'd need the ability to map memory pools to be able to process messages as both input and output to GPU-accelerated algorithms. So yes, the subscriber too.

With the PortConfigInfo you could specify a different memory type for a port, but then how does this hook in to facilitate different hardware? I presume iceoryx isn't going to pull in NVIDIA dependencies, so you'd need something like the ability for the user to register behavior globally for different memory types.
I have to ask the author of that code what the exact idea was behind
I like the idea of a compile-time switch, possibly coupled with some Bazel select statements. Then just have a compile-time path that either calls the CUDA API or returns an error when a specific memory type is specified in the port config.
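A rough sketch of that path, assuming a hypothetical `ICEORYX_CUDA_SUPPORT` define (e.g. set via a Bazel select) and a made-up `MemoryType` enum:

```cpp
#include <cstddef>
#include <cstdint>
#ifdef ICEORYX_CUDA_SUPPORT
#include <cuda_runtime.h>
#endif

// Hypothetical memory type tag carried in the port config.
enum class MemoryType : std::uint32_t
{
    DEFAULT,
    CUDA_PINNED_HOST_MEMORY
};

// When CUDA support is compiled in, the pinned-memory path calls the CUDA
// API; otherwise requesting it is a runtime error for that port.
bool setupPortMemory(MemoryType memoryType, void* base, std::size_t size)
{
    if (memoryType == MemoryType::CUDA_PINNED_HOST_MEMORY)
    {
#ifdef ICEORYX_CUDA_SUPPORT
        return cudaHostRegister(base, size, cudaHostRegisterDefault) == cudaSuccess;
#else
        return false; // pinned host memory requested but CUDA support was not compiled in
#endif
    }
    return true; // DEFAULT memory needs no special setup
}
```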
One thing to note, though, is that it appears the port config / memory_info will apply to the entire memory segment, and it is currently not possible to configure multiple segments differing only by what type of memory they support. We'd probably want to extend the RouDi config so that a publisher with a port config specifying "pinned host memory" would be told to use the corresponding segment.
That's basically the idea. The end user would still not have easy access to the mempool and therefore also could not easily tamper with it, but it gives other projects the option to extend the framework.

Yes, mid to long term the config should be extended. How to do it exactly might be something to discuss in a meetup, maybe together with the discussion on how to bring the improvements of the Rust-based implementation to the C++ implementation.
Actually, I mean in the short/medium term. I'm really trying to avoid pinning the entire shared memory segment used for all communications just because one publisher wants to use GPU acceleration when publishing messages to one specific mempool. If the effect of specifying

Edit: What I linked is the code path for RouDi setting up the shared memory in the first place. I meant to find the code path where the clients receive the info from RouDi.

Second edit: The publisher (for example) gets this information here. It then has access through the
What might be interesting to consider as well is adding an entry to both MePooConfig and MemPoolInfo. This would allow configuration at the memory pool level, which fits my needs. Then we can see about adding some sort of "post_init" logic that is called after

And this is something that could be added to the RouDi config TOML as well at some point, though it is also immediately available through manual RouDi configuration.
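To make that concrete, a sketch of what a mempool-level entry could look like - the first two fields are paraphrased from iceoryx's MePooConfig (not the exact upstream code), and `m_memoryType` is the hypothetical addition:

```cpp
#include <cstdint>

// Hypothetical per-mempool configuration entry: a memory type sits next to
// the chunk size and count, so individual pools can opt in to special setup.
struct MemPoolConfigEntry
{
    std::uint32_t m_size{0};       // chunk size of this pool
    std::uint32_t m_chunkCount{0}; // number of chunks in this pool
    std::uint32_t m_memoryType{0}; // new: e.g. DEFAULT or CUDA_PINNED_HOST_MEMORY
};
```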
There is still a bit of awkwardness to all of these approaches in that, although a segment or a memory pool might be configured as supporting "pinned host memory", it is ultimately up to the publishers or subscribers to decide whether they want to use that feature. You have to consider that there are cases where the publisher is a normal CPU publisher or the subscriber is a normal CPU subscriber. In these processes you wouldn't really want to call

So I guess what you would do to handle this is to combine the
Note this code path is when you are retrieving a chunk, which would happen much later than where you want to call
Maybe there is a misunderstanding. What I meant was to only register the mempool the publisher and subscriber are using, not the whole segment. As long as the memory is shared between CPU and GPU this should be sufficient, and RouDi does not need to do any special handling. But I'm no GPU expert, so this might be a bit naive. With that assumption, I thought extending the config would not be required in the short term but only in the mid to long term.

We already have a concept of a shared memory provider, although it is only used and tested on CPU memory. This could be used to extend the memory types. The idea was to have this on the segment level, not on the mempool level. It has been quite a while since we discussed this topic, though, and some of our assumptions might not hold anymore.
Yes, the publisher and subscriber would opt in by specifying the desired memory type. I think that should be more or less straightforward. What might be more complicated is a mixed memory mode where some subscribers work on CPU memory and others work on GPU memory. To solve this in an efficient way we might need a whiteboard session.
Oh, and @elfenpiff already had some ideas on how to solve this efficiently about 4 years ago, but there was never time to implement them. Maybe the time has come now ... or at least in the not so distant future 😄
Okay, I think I follow now. I guess all one needs is to specify something in the

Do we simply not support those initially with this feature, since we do not know in advance which memory pools they will request?

Related to the above: I've noticed that we appear to always do a linear search for the matching memory pool every time we get a chunk. Would it make sense to optimize this out for the typed endpoints in conjunction with the memory logic that is applied only to the corresponding memory pools? (Also, in general we could consider doing a binary search.)
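Assuming the pools are (or are kept) sorted by ascending chunk size, the lookup could be a `std::lower_bound`; a sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Find the first pool whose chunks are large enough for the requested size.
// With the pool sizes sorted ascending, this is a binary search instead of
// the current linear scan.
std::int32_t findMatchingPoolIndex(const std::vector<std::uint32_t>& sortedChunkSizes,
                                   std::uint32_t requiredSize)
{
    const auto it = std::lower_bound(sortedChunkSizes.begin(), sortedChunkSizes.end(), requiredSize);
    if (it == sortedChunkSizes.end())
    {
        return -1; // no pool has chunks large enough
    }
    return static_cast<std::int32_t>(it - sortedChunkSizes.begin());
}
```

A typed endpoint could additionally resolve this index once at construction and cache it, since its payload size never changes.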
Note that there is an approach where you defer until

It's a bit problematic, though, because that first publish/subscribe for a particular message size will incur a lot of latency. You could mitigate this by "priming the pump" - calling
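Roughly like this, sketched against the untyped publisher API (exact signatures may differ between iceoryx versions):

```cpp
#include "iceoryx_posh/popo/untyped_publisher.hpp"
#include <cstdint>
#include <vector>

// "Priming the pump": loan and immediately give back one chunk per expected
// payload size, so any per-mempool setup cost (such as pinning) is paid at
// startup rather than on the first real publish.
void primePublisher(iox::popo::UntypedPublisher& publisher,
                    const std::vector<uint32_t>& expectedPayloadSizes)
{
    for (const auto payloadSize : expectedPayloadSizes)
    {
        publisher.loan(payloadSize).and_then([&](auto userPayload) {
            publisher.release(userPayload); // return the chunk without publishing
        });
    }
}
```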
For the untyped publisher and subscriber we have two options (three after I read your latest response)
Oh, the typed API would then just call
That sounds great to me. So let me see if I can summarize:
Here are some remarks
I think you are right about the
That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now, so that users implementing their own logic downstream do so with a value from that range.
Exactly. I've thought about something similar to
... aaargh, too much Qt coding 😆 ... we can just define an enum to distinguish between user-defined values and the ones defined by iceoryx
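Something like this, with all names made up:

```cpp
#include <cstdint>

// Values below USER_BASE belong to iceoryx itself; everything from USER_BASE
// upwards is reserved for downstream projects.
enum class MemoryType : std::uint32_t
{
    DEFAULT = 0,
    USER_BASE = 0x80000000U // first value available for user-defined memory types
};

// A downstream project could then define its own type like this:
constexpr MemoryType CUDA_PINNED_HOST_MEMORY =
    static_cast<MemoryType>(static_cast<std::uint32_t>(MemoryType::USER_BASE) + 1U);
```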
So is the idea that the builder is passed as an argument, like

When you talk about tying the builder to the runtime, I'm guessing you mean that instead of
You'd have something like
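Maybe something in this direction? Every name below is invented to illustrate the shape, not actual iceoryx API:

```cpp
// Hypothetical runtime-tied builder: the runtime hands out a builder, and an
// invalid configuration surfaces as an error value from create() instead of a
// terminating constructor. None of these names exist in iceoryx today.
auto result = runtime.publisherBuilder<RadarObject>({"Radar", "FrontLeft", "Object"})
                  .memoryType(MemoryType::CUDA_PINNED_HOST_MEMORY)
                  .historyCapacity(1U)
                  .create(); // something like expected<Publisher<RadarObject>, PublisherError>

if (result.has_error())
{
    // e.g. CUDA support not compiled in, or no matching memory segment
}
```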
Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem. |
Exactly. We don't use exceptions, and when something goes horribly wrong in the ctor we currently terminate. It would be better to use the factory method pattern or the builder pattern and return an
Interesting. How deep are you thinking of going with this sort of refactor? It looks like the

So we could, for example, also use the builder pattern for the port users. We can also refactor the
In order to do this we will have to expose the index somehow - maybe instead of a

This function can then be used both as a way to query
Sounds good. Since this is not a user-facing API, we are free to do anything :)
On second thought, and after further study of the current architecture, I think it might actually make sense for the
So basically, to avoid errors in construction we use the builder type for the chunk sender as well, or alternatively we add a

For simplicity it probably suffices to make the hint that the

The CUDA calls themselves will be cached by the global registry, so even if we end up back at the same hint it will become a noop.
A question about the pointer repository, though: is this class thread-safe? I don't see any sort of synchronization around registering a new segment. I assume this is because there is no expectation that it would ever be called from different threads? It's not practically a problem for my particular use case, but it would be problematic to introduce a code path that can be called any time a publisher or subscriber takes or receives a message and is not thread-safe.
The whole runtime could be refactored if there is time. The runtime belongs to a few remaining old constructs which have their roots in an R&D department; at that time the focus was more on trying out new things than on creating sophisticated APIs. In the meantime we have also gotten new tools like the

Ideally we could try to model the API in a way that makes it simpler to have the same API for the C++ bindings on the Rust-based next-gen implementation.
There is a related issue #1701. We need to check this.
I need to think more about this. Unfortunately I'm traveling from the end of the week until next week and won't have much time. Please ping me again if I forget to respond.
That's okay, I'm traveling as well until next Wednesday (it's a holiday weekend in the U.S.). Funnily enough, I am flying to Oahu tomorrow.
So I've thought about it, and it seems possible but very difficult to create a situation where one calls e.g.

For this CUDA registry we can possibly get around the problem by registering with a combination of the memory pool id and the segment id.

As for the duplicate-registration race condition: if it occurs, CUDA will set an error code along the lines of "this overlaps with some memory that is already registered". We could choose to ignore that one... though it hides other logic errors, like actually accidentally overlapping memory pool info.
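A sketch of that registry, keyed by (segment id, memory pool id) and treating the duplicate-registration error as benign - everything except the CUDA calls is an invented name:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <set>
#include <utility>

// Global registry caching cudaHostRegister calls per (segment id, mempool id).
// Re-pinning a pool we have already seen becomes a noop, and CUDA's "already
// registered" error is swallowed as benign.
class CudaPinningRegistry
{
  public:
    bool ensurePinned(std::uint64_t segmentId, std::uint64_t memPoolId, void* base, std::size_t size)
    {
        std::lock_guard<std::mutex> lock(m_mutex); // registration may race between endpoints
        const auto key = std::make_pair(segmentId, memPoolId);
        if (m_pinned.count(key) != 0)
        {
            return true; // cached: pinning this pool again is a noop
        }
        const cudaError_t status = cudaHostRegister(base, size, cudaHostRegisterDefault);
        if (status == cudaSuccess || status == cudaErrorHostMemoryAlreadyRegistered)
        {
            m_pinned.insert(key);
            return true;
        }
        return false;
    }

  private:
    std::mutex m_mutex;
    std::set<std::pair<std::uint64_t, std::uint64_t>> m_pinned;
};
```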
It's been a while since I've been there. Unfortunately it's a bit far away from my place. Have fun :)
@MatthiasKillat in case you are interested in the CUDA stuff :)
So I've started prototyping this new solution and have encountered an obstacle I don't think we've thought through yet. For subscribers, it appears the subscriber does not know anything about memory pools or chunks until a message has been delivered - or, more precisely, pointer information to a shared chunk (through which it can figure out where the

Do you have any thoughts on an API that would allow a subscriber to figure out which memory pool messages will come from before it starts receiving messages? I'd like to avoid making calls in the happy path to
One thing that comes to mind that might be somewhat odd, but would at least be relatively "safe", would be adding a
Ah... but this approach runs into one more problem: because of access controls we don't actually know which segment a message is going to come from, and therefore cannot acquire the MemoryManager information the way the publisher does.
What about this alternative approach: we leverage the fact that the

My thought is: what if we extend this

Then, when creating a subscriber, the runtime can simply do an additional step and use either the template type's size info, or some size hints in the untyped API, to find all the memory pools the subscriber may read from and preemptively pin them. In addition to this, we would still check when dequeuing a message in the subscriber that the owning memory pool has been pinned, but this check would likely be a noop unless the message we received is of a different size than we expected.
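In sketch form, with invented segment/pool views standing in for whatever the runtime would expose, and reusing the CudaPinningRegistry from the earlier sketch:

```cpp
#include <cstdint>
#include <vector>

// Invented views over the segment/mempool layout; in reality this information
// would come from the runtime. Pools are assumed sorted by ascending chunk size.
struct PoolView
{
    std::uint64_t id{0};
    void* basePtr{nullptr};
    std::uint64_t chunkSize{0};
    std::uint64_t chunkCount{0};
};

struct SegmentView
{
    std::uint64_t id{0};
    std::vector<PoolView> pools;
};

// Placeholder, not the real iceoryx chunk header size.
constexpr std::uint64_t CHUNK_HEADER_OVERHEAD{64U};

// At subscriber creation: pin the pool in each data segment that a typed
// subscriber of T could receive from.
template <typename T>
void pinPoolsForSubscriberOf(const std::vector<SegmentView>& segments, CudaPinningRegistry& registry)
{
    const std::uint64_t requiredChunkSize = sizeof(T) + CHUNK_HEADER_OVERHEAD;
    for (const auto& segment : segments)
    {
        for (const auto& pool : segment.pools)
        {
            if (pool.chunkSize >= requiredChunkSize)
            {
                registry.ensurePinned(segment.id, pool.id, pool.basePtr, pool.chunkSize * pool.chunkCount);
                break; // the first large-enough pool is the one that will serve this size
            }
        }
    }
}
```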
I've been busy with other stuff the last few days. I will have a look at your ideas tomorrow.
Another idea along this line - if I expose the ability for the
I'm currently making myself familiar with the code again. Regarding your last idea: if the number of pinned MemPools needs to be minimal, this would not work, since we do not know which MemPool a publisher or subscriber will use. We would basically need to pin the whole shared memory segment. Maybe I misunderstood your idea.
Ah no, you are right, and I got ahead of myself here. We of course do not want to go around pinning every single memory pool that might "support it" in every single process, because each process probably only really cares about sending/receiving messages on one or a few memory pools.

So I'll backtrack to the idea that, when creating a subscriber, the runtime goes through each data segment (if there are multiple) and pins the memory pools that match
And in order to support untyped subscribers, we would still need a code path to check, upon receipt of a message, whether it came from a new memory pool which has not yet been registered, in which case we simply eat the cost of pinning the memory at that time.
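That fallback would be small, again using the invented names from the sketches above:

```cpp
// Fallback for untyped subscribers: on receipt of a chunk, make sure its
// owning pool is pinned. Thanks to the registry cache this is a cheap lookup
// on the happy path and only pays the cudaHostRegister cost the first time a
// pool is seen.
void onChunkReceived(std::uint64_t segmentId, const PoolView& owningPool, CudaPinningRegistry& registry)
{
    registry.ensurePinned(segmentId, owningPool.id, owningPool.basePtr,
                          owningPool.chunkSize * owningPool.chunkCount);
}
```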
Brief feature description
I would like the ability to somehow query the info of a particular memory pool that will be used for a comms endpoint's messages.
Detailed information
A full proof of concept is linked below:
master...gpalmer-latai:iceoryx:gpalmer/expose_mempool
The basic gist is that I've added a `m_basePtr` to the `MemPoolInfo` struct to inform consumers about where the actual memory pool is located. Then I've exposed some APIs to be able to lift that information all the way out to a typed publisher.
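For reference, the struct change roughly looks like this (the existing fields are paraphrased from iceoryx and may not match upstream exactly; `m_basePtr` is the addition):

```cpp
#include <cstdint>

// Rough shape of the proof-of-concept change: MemPoolInfo gains a base
// pointer so consumers can learn where the pool's memory actually lives.
struct MemPoolInfo
{
    std::uint32_t m_usedChunks{0};
    std::uint32_t m_minFreeChunks{0};
    std::uint32_t m_numChunks{0};
    std::uint32_t m_chunkSize{0};
    void* m_basePtr{nullptr}; // new: base address of the memory pool
};
```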
Alternatives Considered

Instead of lifting out the memory pool information through the layers of port info, it might be preferable if we could