Infinity loop on select() sleep with VMA_SPEC=latency #1098

Open
kc-eos opened this issue Dec 16, 2024 · 8 comments
@kc-eos

kc-eos commented Dec 16, 2024

Hi, I encountered an issue when integrating libvma with the Reuters library.
I am running with the latency profile, but the user thread appears to busy-spin and gets stuck inside the following call stack:

#0  0x00007fab9c3eb81f in select () from /lib64/libc.so.6
#1  0x00007faba1ea0ceb in select_call::wait_os(bool) () from /lib64/libvma.so
#2  0x00007faba1e9db17 in io_mux_call::polling_loops() () from /lib64/libvma.so
#3  0x00007faba1e9f565 in io_mux_call::call() () from /lib64/libvma.so
#4  0x00007faba1f08048 in select_helper(int, fd_set*, fd_set*, fd_set*, timeval*, __sigset_t const*) () from /lib64/libvma.so
...

Steps to reproduce

After some investigation, I wrote a simple program to narrow down the usage and was able to reproduce the issue:

#include <iostream>
#include <sys/socket.h>
#include <unistd.h>
#include <sys/select.h>
#include <ctime>

void printTime() {
    std::time_t now = std::time(nullptr);
    std::cout << std::ctime(&now);  // Print current time
}

int main() {
    // Create a stream socket
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        std::cerr << "Error opening socket" << std::endl;
        return 1;
    }

    struct timeval tv;
    tv.tv_sec = 1;  // 1 second
    tv.tv_usec = 0;

    printTime();
    select(0, nullptr, nullptr, nullptr, &tv); // sleep by calling select(0..);
    printTime();

    // Close the socket
    close(sockfd);
    return 0;
}

Environment

  1. RedHat Enterprise Linux 8.7
  2. g++ 8.5.0-15
  3. libvma 9.8.60

Test result

$ g++ select_sleep.cpp -o test
$ ./test ## OK
$ LD_PRELOAD=libvma.so.9.8.60 ./test ## OK
$ LD_PRELOAD=libvma.so.9.8.60 VMA_SPEC=latency ./test ## Failed, thread gets stuck

Workaround

After some more trial and error, it seems that this issue can be worked around by disabling VMA_SELECT_POLL_OS_FORCE, i.e.

LD_PRELOAD=libvma.so.9.8.60 VMA_SPEC=latency VMA_SELECT_POLL_OS_FORCE=0 ./test ## OK

However, doing this also resets the other two parameters, VMA_SELECT_SKIP_OS and VMA_SELECT_POLL_OS_RATIO, back to their default values. Hence my questions:

  1. Is this by-design behavior in libvma, or is it simply a bug?
  2. As a workaround for the time being, what are the recommended settings for VMA_SELECT_SKIP_OS and VMA_SELECT_POLL_OS_RATIO?
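
Independently of those settings, where the calling code can be changed (which may not be possible for a third-party library), one option might be to stop using select() as a sleep primitive. A minimal sketch, assuming libvma does not intercept nanosleep() (an assumption, not confirmed in this thread); `sleep_without_select` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cerrno>
#include <time.h>

// Hypothetical helper (not part of libvma): sleeps for sec seconds plus
// nsec nanoseconds using nanosleep() instead of select(0, ...).
// ASSUMPTION: nanosleep() is not intercepted by libvma.
// Returns 0 once the full interval has elapsed, -1 on any error other
// than an interrupting signal.
int sleep_without_select(long sec, long nsec) {
    assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);

    struct timespec req;
    req.tv_sec = sec;
    req.tv_nsec = nsec;

    struct timespec rem;
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR) {
            return -1;  // genuine failure, e.g. EINVAL
        }
        req = rem;  // interrupted by a signal: sleep for the remaining time
    }
    return 0;
}
```

In the test program above, the select() line would become `sleep_without_select(1, 0);`. This only sidesteps the symptom in code one controls; it does not answer which VMA_SELECT_* values are appropriate.
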
@igor-ivanov
Collaborator

Hello @kc-eos I think that this behavior is described at https://github.com/Mellanox/libvma/blob/master/README#L629-L641

@kc-eos
Author

kc-eos commented Dec 19, 2024

Hi @igor-ivanov , thanks for the reply.

I also noticed that description, but I still can't believe that libvma should block the user thread forever under any circumstances.

On the other hand, when looking at other libvma documentation, it mentions that:

  1. At: https://github.com/Mellanox/libvma/wiki/Architecture

If the data is routed to/from an supported network adapter, the VMA library intercepts the call and does the bypass work. If the data is passing to/from an unsupported network adapter, the VMA library passes the call to the usual kernel libraries responsible for handling network traffic.

  2. At: https://github.com/Mellanox/libvma/blob/master/README#L616-L627

The duration in micro-seconds (usec) in which to poll the hardware on Rx path before
going to sleep (pending an interrupt blocking on OS select(), poll() or epoll_wait().
The max polling duration will be limited by the timeout the user is using when
calling select(), poll() or epoll_wait().

Based on the above descriptions, I think one could expect the call select(0, nullptr, nullptr, nullptr, &tv); to return after 1 second, as defined by struct timeval tv.
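
That expectation can be checked mechanically rather than by comparing printed timestamps. A minimal sketch, assuming the default SIGALRM disposition is left in place so that the alarm terminates a process that never returns from select(); `select_sleep_elapsed_ms` is a hypothetical helper name and is not part of libvma:

```cpp
#include <cassert>
#include <chrono>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical helper (not part of libvma or the original report):
// sleeps via select() with no file descriptors and returns the elapsed
// time in milliseconds, measured on a monotonic clock.
// A watchdog alarm is armed first. ASSUMPTION: SIGALRM keeps its default
// action, so a select() that never returns gets the process terminated
// instead of leaving it spinning.
long select_sleep_elapsed_ms(long sec, long usec, unsigned watchdog_sec) {
    assert(sec >= 0 && usec >= 0 && usec < 1000000L);

    struct timeval tv;
    tv.tv_sec = sec;
    tv.tv_usec = usec;

    alarm(watchdog_sec);  // arm the watchdog
    const auto start = std::chrono::steady_clock::now();
    select(0, nullptr, nullptr, nullptr, &tv);  // same call as in the report
    const auto end = std::chrono::steady_clock::now();
    alarm(0);  // disarm the watchdog

    return static_cast<long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());
}
```

For the case in this report, `select_sleep_elapsed_ms(1, 0, 5)` would be expected to return roughly 1000. Under the failing configuration, the process would presumably be killed by SIGALRM after about 5 seconds instead.
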

Could you please discuss this with the team again to check whether this is really expected behavior, and consider fixing it?

One more thing: FYI, I tested the same program with an older version of libvma - 8.1.4, and it works fine without blocking the user thread!!

@igor-ivanov
Collaborator

@galnoam probably some degradation is reported.

@kc-eos
Author

kc-eos commented Dec 24, 2024

Hi @igor-ivanov & @galnoam , Merry Christmas!!

May I know if there is any update on this thread?

@galnoam
Collaborator

galnoam commented Dec 24, 2024

@AlexanderGrissik, check the reported issue?
Thanks.

@kc-eos
Author

kc-eos commented Jan 15, 2025

Hi @AlexanderGrissik & @galnoam & @igor-ivanov,

I apologize for bringing this up again, but I wanted to follow up on this issue that was reported a month ago.
We would greatly appreciate any update on whether this has been confirmed as a defect and when a fix will be delivered.

Thanks in advance!

@tomerdbz
Collaborator

Hi @kc-eos ,

Thank you for reaching out and following up on the issue you reported last month. We sincerely apologize for the delay in our response.

We want to inform you that we have identified and confirmed the defect you experienced.

The issue was related to an infinite loop occurring when there are no offloaded sockets, specifically when VMA_SELECT_POLL_OS_FORCE is enabled.

We developed a fix that addresses this problem by ensuring the system correctly polls OS sockets in the absence of offloaded sockets, regardless of the VMA_SELECT_POLL_OS_FORCE configuration. (#1104)

The fix is currently in a pending pull request and is undergoing our standard review and testing procedures to ensure its effectiveness and stability.

We expect to include this fix in our upcoming release.

We appreciate your patience and understanding while we work to resolve this issue. Please accept our apologies for any inconvenience this may have caused. If you have any further questions or need additional assistance, please don't hesitate to contact us!

P.S.

It's worth noting that this issue did not occur in libvma version 8.1.4 within the latency spec because that version did not enable VMA_SELECT_POLL_OS_FORCE, even though the documentation stated that it did.

@kc-eos
Author

kc-eos commented Jan 20, 2025

Hi @tomerdbz,

Thank you very much for the update and the fix!
We are looking forward to the upcoming release, and will definitely try it out when it is ready!!
