
Roudi Fatal error: Trying to convert a pointer to an index which is not aligned to the array #2380

Open
jmyvalour opened this issue Dec 2, 2024 · 13 comments

Comments

@jmyvalour

Required information

Operating system:
Ubuntu 24.04 LTS

Compiler version:
Clang 19

Eclipse iceoryx version:
v2.90.0 commit b9ab7ee

Observed result or behaviour:
Hello,

We start RouDi as an external application and then start a few components. After a while of stopping and restarting components, the RouDi application exits with an error and the following trace:

2024-12-02 22:53:55.122 [Fatal]: Trying to convert a pointer to an index which is not aligned to the array! Base address: 0x7367884e795f; item size: 31256; pointer address: 0x7367884ef378
2024-12-02 22:53:55.122 [Fatal]: iceoryx_posh/source/mepoo/mem_pool.cpp:122 [PANIC] Invalid access

It occurs in the freeChunk function, called from iox::roudi::ProcessIntrospection<iox::popo::PublisherPortUser>::send().

This does not occur when we are not restarting the components. From what I understand, the RouDi app is not part of the sending/receiving path; it only manages the shared memory and provides the information the subscribers/publishers need to operate. (I was checking whether one of our components could send a bad message that triggers this error. Is it even possible for a user application to trigger this panic in the RouDi main app?)

Looking at the backtrace below, it looks like the ProcessIntrospection publisher is trying to publish a message and fails to do so.

Any hint or help would be greatly appreciated; I am getting confused here.

Thank you for your help,
Best regards,

Conditions where it occurred / Performed steps:
Start and Stop publisher / server & client / subscriber a few times

Additional helpful information

Backtrace where this is hit in the RouDi app:

#1 0x000055555563bc9e in void iox::er::panic<char const (&) [15]>(iox::er::SourceLocation const&, char const (&) [15]) ()
#2 0x000055555563b4c6 in void iox::er::forwardPanic<char const (&) [15]>(iox::er::SourceLocation const&, char const (&) [15]) ()
#3 0x000055555563b16f in iox::mepoo::MemPool::pointerToIndex(void const*, unsigned long, void const*) ()
#4 0x000055555563b1c1 in iox::mepoo::MemPool::freeChunk(void const*) ()
#5 0x000055555563c6e1 in iox::mepoo::SharedChunk::freeChunk() ()
#6 0x000055555563f42c in iox::popo::ChunkSender<iox::popo::ChunkSenderData<32u, iox::popo::ChunkDistributorData<iox::DefaultChunkDistributorConfig, iox::popo::ThreadSafePolicy, iox::popo::ChunkQueuePusher<iox::popo::ChunkQueueData<iox::DefaultChunkQueueConfig, iox::popo::ThreadSafePolicy> > > > >::send(iox::mepoo::ChunkHeader*) ()
#7 0x000055555563ece0 in iox::popo::PublisherPortUser::sendChunk(iox::mepoo::ChunkHeader*) ()
#8 0x00005555555e9efc in iox::roudi::ProcessIntrospection<iox::popo::PublisherPortUser>::send() ()
#9 0x00005555555ea38e in void iox::storable_function<128ul, void ()>::invoke<iox::storable_function<128ul, void ()>::invoke<iox::roudi::ProcessIntrospection<iox::popo::PublisherPortUser>, void>(iox::roudi::ProcessIntrospection<iox::popo::PublisherPortUser>&, void (iox::roudi::ProcessIntrospection<iox::popo::PublisherPortUser>::*)())::{lambda()#1}>(void*) ()
#10 0x00005555555eabe2 in iox::concurrent::detail::PeriodicTask<iox::storable_function<128ul, void ()> >::run() ()
#11 0x00005555555eae2b in void* std::__1::__thread_proxy[abi:ne190103]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (iox::concurrent::detail::PeriodicTask<iox::storable_function<128ul, void ()> >::*)() noexcept, iox::concurrent::detail::PeriodicTask<iox::storable_function<128ul, void ()> >*> >(void*) ()
#12 0x00007ffff7a9ca94 in start_thread (arg=) at ./nptl/pthread_create.c:447
#13 0x00007ffff7b29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

@elfenpiff
Contributor

@jmyvalour are you using iceoryx standalone or in combination with cyclone dds?

@jmyvalour
Author

@elfenpiff I am using iceoryx in standalone mode

@jmyvalour
Author

Here are the build constants used:

2024-12-03 09:05:46.777 [Trace]: Iceoryx contants is:
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_PUBLISHERS = 4096
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_SUBSCRIBERS = 8192
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_SUBSCRIBERS_PER_PUBLISHER = 64
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY = 32
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY = 256
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_CLIENTS_PER_SERVER = 512
2024-12-03 09:05:46.777 [Trace]: IOX_MAX_NUMBER_OF_NOTIFIERS = 512
2024-12-03 09:05:46.777 [Trace]: RouDi config is:
2024-12-03 09:05:46.777 [Trace]: Domain ID = 0
2024-12-03 09:05:46.777 [Trace]: Unique RouDi ID = 0
2024-12-03 09:05:46.777 [Trace]: Monitoring Mode = MonitoringMode::ON
2024-12-03 09:05:46.777 [Trace]: Shares Address Space With Applications = false
2024-12-03 09:05:46.777 [Trace]: Process Termination Delay = 0s 0ns
2024-12-03 09:05:46.777 [Trace]: Process Kill Delay = 45s 0ns
2024-12-03 09:05:46.777 [Trace]: Compatibility Check Level = CompatibilityCheckLevel::PATCH
2024-12-03 09:05:46.777 [Trace]: Introspection Chunk Count = 10
2024-12-03 09:05:46.777 [Trace]: Discovery Chunk Count = 10

@elfenpiff
Contributor

@jmyvalour The ProcessIntrospection has a hard-coded maximum number of processes, which is 300. Are more than 300 processes active in your system?

@jmyvalour
Author

jmyvalour commented Dec 3, 2024

@elfenpiff thank you for your reply

We have 26 processes per environment, with two environments running at the same time on the same server, so a total of 52 processes.
The issue usually occurs when I stop/start one of the environments.

@elfenpiff
Contributor

@jmyvalour, it absolutely makes sense that this happens when a process stops/starts, since it is the process introspection, which detects and publishes those events, that fails.

It seems like the mempool tries to access offset 31257 despite the item size being only 31256 - like an off-by-one error?
(I came up with that number via 0x7367884ef378 - 0x7367884e795f from your error message.)

@elfenpiff
Contributor

@jmyvalour something really weird is happening here which does not make sense at all. It seems like the pointer inside the internal sample of the process introspection was somehow corrupted.

The weird part is that this seems to happen only on your side and has never occurred anywhere else. I think a memory corruption in the sample would have been detected quickly, since it is a central construct in iceoryx.

I think I need more details to grasp what is going on here.

  • Did your system run for a long time or can you reproduce the bug quickly?
  • Could you write a minimalistic example so that I can reproduce the bug?
  • Is there anything out-of-the-ordinary in your setup? I am asking because I am a bit lost as to where to start looking.

@jmyvalour
Author

Thank you a lot for your answer, time, and clarification; that is really helpful for understanding the underlying problem here.
This is indeed a really weird one. I am trying to reproduce it in a minimal environment too, and will let you know if I manage, and try to give you as much input as I can.

Regarding your comment: "It seems like the mempool tries to access element 31257 despite there being only 31256 elements in there - like an off-by-one error?"

You are correct. I think the publisher is trying to free the chunk at address 0x7367884ef378 (with memory pool base address 0x7367884e795f) and a chunk size of 31256, so the check (0x7367884ef378 - 0x7367884e795f) % 31256 should give 0 (index 1), but 0x7367884ef378 - 0x7367884e795f is 31257, so we are one byte past the address of the mem_pool[1] element (which should be at 0x7367884ef377).

For some so-far-unknown reason, the ProcessIntrospection chunk pointer is not at the correct place...

I don't see anything out of the ordinary; our setup was running fine. If we let one environment run, it is fine for the day, but as soon as we start/restart the second environment, the error triggers. Sometimes we also get:
Locking of an inter-process mutex failed! This indicates that the application holding the lock was terminated or the resources were cleaned up by RouDi due to an unresponsive application.

probably because we are killing the process?

Could it be that one of our components corrupts the internal mempool memory somehow and triggers this error in the RouDi main app? I don't see how that could be possible, but we never know. I will keep you informed of the progress; might it be worth updating to the latest release and giving it another try?

@jmyvalour
Author

I have a hard time reproducing it outside the environment with a minimal example. However, I rebuilt iceoryx with the tests enabled and got some interesting failures using ./posh/test/posh_moduletests --gtest_filter="ChunkSender*"; all other tests pass.

It is even more interesting because we recently changed MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY from 16 to 32 after we got a fatal error on reaching this limit, and I wonder whether the problem could be related.

I always get this warning during the tests:
[Warn ]: Mempool [m_chunkSize = 176, numberOfChunks = 20, used_chunks = 20 ] has no more space left
2024-12-03 16:12:41.805 [Error]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 8The following mempools are available: MemPool [ ChunkSize = 176, ChunkPayloadSize = 128, ChunkCount = 20 ] MemPool [ ChunkSize = 304, ChunkPayloadSize = 256, ChunkCount = 20 ]
2024-12-03 16:12:41.805 [Error]:iceoryx/src/1e4418d5cd-9826141654/iceoryx_posh/source/mepoo/memory_manager.cpp:197 [MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS (code = 84)] in module [iceoryx_posh (id = 2)]

Adding the test reports:

Note: Google Test filter = ChunkSender*
[==========] Running 28 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 28 tests from ChunkSender_test
[ RUN ] ChunkSender_test.allocate_OneChunkWithoutUserHeaderAndSmallUserPayloadAlignmentResultsInSmallChunk
[ OK ] ChunkSender_test.allocate_OneChunkWithoutUserHeaderAndSmallUserPayloadAlignmentResultsInSmallChunk (6 ms)
[ RUN ] ChunkSender_test.allocate_OneChunkWithoutUserHeaderAndLargeUserPayloadAlignmentResultsInLargeChunk
[ OK ] ChunkSender_test.allocate_OneChunkWithoutUserHeaderAndLargeUserPayloadAlignmentResultsInLargeChunk (0 ms)
[ RUN ] ChunkSender_test.allocate_OneChunkWithLargeUserHeaderResultsInLargeChunk
[ OK ] ChunkSender_test.allocate_OneChunkWithLargeUserHeaderResultsInLargeChunk (0 ms)
[ RUN ] ChunkSender_test.allocate_ChunkHasOriginIdSet
[ OK ] ChunkSender_test.allocate_ChunkHasOriginIdSet (0 ms)
[ RUN ] ChunkSender_test.allocate_MultipleChunks
[ OK ] ChunkSender_test.allocate_MultipleChunks (0 ms)
[ RUN ] ChunkSender_test.allocate_Overflow
iceoryx_posh/test/moduletests/test_popo_chunk_sender.cpp:217: Failure
Value of: m_memoryManager.getMemPoolInfo(0).m_usedChunks
Expected: is equal to 32
Actual: 20 (of type unsigned int)

Log start

2024-12-03 16:12:41.795 [Warn ]: Mempool [m_chunkSize = 176, numberOfChunks = 20, used_chunks = 20 ] has no more space left
2024-12-03 16:12:41.795 [Error]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 8The following mempools are available: MemPool [ ChunkSize = 176, ChunkPayloadSize = 128, ChunkCount = 20 ] MemPool [ ChunkSize = 304, ChunkPayloadSize = 256, ChunkCount = 20 ]

Log end

iceoryx_posh/test/moduletests/test_popo_chunk_sender.cpp:226: Failure
Value of: maybeChunkHeader.error()
Expected: is equal to AllocationError::TOO_MANY_CHUNKS_ALLOCATED_IN_PARALLEL
Actual: AllocationError::RUNNING_OUT_OF_CHUNKS (of type iox::popo::AllocationError)

Log start

2024-12-03 16:12:41.795 [Warn ]: Mempool [m_chunkSize = 176, numberOfChunks = 20, used_chunks = 20 ] has no more space left
2024-12-03 16:12:41.795 [Error]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 8The following mempools are available: MemPool [ ChunkSize = 176, ChunkPayloadSize = 128, ChunkCount = 20 ] MemPool [ ChunkSize = 304, ChunkPayloadSize = 256, ChunkCount = 20 ]
_OF_CHUNKS (code = 84)] in module [iceoryx_posh (id = 2)]

Log end

iceoryx_posh/test/moduletests/test_popo_chunk_sender.cpp:228: Failure
Value of: m_memoryManager.getMemPoolInfo(0).m_usedChunks
Expected: is equal to 32
Actual: 20 (of type unsigned int)

Log start

2024-12-03 16:12:41.795 [Warn ]: Mempool [m_chunkSize = 176, numberOfChunks = 20, used_chunks = 20 ] has no more space left
2024-12-03 16:12:41.795 [Error]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 8The following mempools are available: MemPool [ ChunkSize = 176, ChunkPayloadSize = 128, ChunkCount = 20 ] MemPool [ ChunkSize = 304, ChunkPayloadSize = 256, ChunkCount = 20 ]

Log end

[ FAILED ] ChunkSender_test.allocate_Overflow (10 ms)
[ RUN ] ChunkSender_test.freeChunk
/iceoryx_posh/test/moduletests/test_popo_chunk_sender.cpp:251: Failure
Value of: m_memoryManager.getMemPoolInfo(0).m_usedChunks
Expected: is equal to 32
Actual: 20 (of type unsigned int)

/iceoryx_posh/test/moduletests/test_popo_chunk_sender.cpp:776: Failure
Value of: (HISTORY_CAPACITY + iox::MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY) <= NUM_CHUNKS_IN_POOL
Actual: false
Expected: true
/iceoryx_posh/test/moduletests/test_popo_chunk_sender.cpp:797: Failure
Value of: maybeChunkHeader.has_error()
Actual: true
Expected: false

@jmyvalour jmyvalour reopened this Dec 3, 2024
@jmyvalour
Author

jmyvalour commented Dec 3, 2024

@elfenpiff,

I reverted the build to MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=16 and all tests are now passing. Is there a hardcoded limit somewhere preventing the tests from passing with MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=32? Could that be related to our problem here?

We can try to reproduce with the reverted value and see if this happens again. The thing is that we changed these values because it was hard for our system to run without them, so I can't be sure that is going to work; we will try to find the maximum value for which the tests pass and give it a try.

Regardless of the problem, when we reach this MAX constant, does it mean we are doing too many memory loans (try allocate) before calling the actual publish of the chunk (which would free the chunk, from my understanding)? In other words, to mitigate this constraint, should we try to send as many chunks as possible before creating new ones?

@elfenpiff
Contributor

@jmyvalour yes there is, and this is on our technical debt list. The constants in iceoryx_posh/include/iceoryx_posh/iceoryx_posh_types.hpp must all be compatible with each other, and this must be tested. So I would assume you have found a combination that is not compatible.

The idea behind the refactoring is that the dependencies between the constants are described mathematically, so that you can never create an invalid configuration. Currently, we are focusing on iceoryx2 and do not have the capacity for such a refactoring.
You are welcome to take a look at this, try to solve your issue, and create a pull request - I will help you where I can.

Btw, we offer commercial support and could fix it via a contract if you like ([email protected]) - this is how we finance the open source work.

@jmyvalour
Author

Hello @elfenpiff, and thank you for your answer.

For now I won't have the time to focus on making the configuration values constexpr and checking that they are mathematically consistent, but I see the point, and that would be a great upgrade (at least it would feel safer when reaching some limits and updating the config). We reverted to MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=16 for now, and that fixed the reported problem.

I will try to find the time to revisit this issue at a later stage and report back. Maybe we could close this issue and open a new one for "static" configuration checking when a PR is created?

Thank you for your support,

@elBoberido
Member

We have 26 processes per environment, with two environments running at the same time on the same server, so a total of 52 processes.
The issue usually occurs when I stop/start one of the environments.

In the meantime, there is the option to run multiple RouDi in parallel. See also this example:
https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_examples/experimental/node/iox_cpp_node_publisher.cpp

The fully flexible solution, which can be controlled with an IOX_DOMAIN_ID environment variable, currently supports only pub-sub and the WaitSet. For request-response and the Listener, some additional work would be required to finish this new API.

There is also the option to build iceoryx with a different resource prefix, which has a similar effect of running multiple RouDi instances in parallel. With this approach you have to compile iceoryx with a different IOX_DEFAULT_RESOURCE_PREFIX CMake parameter for each environment, but all API calls work.
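For illustration, the two options could look roughly like this (a config sketch; paths, binary names, and prefix values are made up, while IOX_DOMAIN_ID and IOX_DEFAULT_RESOURCE_PREFIX are the knobs mentioned above):

```shell
# Option 1: one iceoryx build, separate domains per environment
# (currently pub-sub and WaitSet only, per the comment above)
IOX_DOMAIN_ID=1 ./env_a/my_app &
IOX_DOMAIN_ID=2 ./env_b/my_app &

# Option 2: separate iceoryx builds with distinct resource prefixes
# (all API calls work, but each environment needs its own binaries)
cmake -S iceoryx/iceoryx_meta -B build_env_a -DIOX_DEFAULT_RESOURCE_PREFIX="env_a"
cmake -S iceoryx/iceoryx_meta -B build_env_b -DIOX_DEFAULT_RESOURCE_PREFIX="env_b"
```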
