[BUG] SRTO_ENFORCEDENCRYPTION set to false despite default configuration documented as true #999
Comments
Hi @toots, thanks for reporting. The condition …
This is something extremely weird. There's no possible trackable call from … As for …, I don't know why you have these results in gdb, but there's something completely messed up there.
Hi all, thanks for reviewing! The … I'm not sure why … Another interesting fact is that most instances are usually unstuck -- at least for a short while -- after releasing the gdb hold.

I will try my best to reproduce this one (#787 is still a minor issue for us). It's a bit tricky, as the implementation is buried in a big application and also because of general busyness, but I'll try. Thanks for the hard work, guys!
Releasing the gdb hold is different from sending a SIGINT (which stops the application). But I confirm that releasing gdb (probably a SIGCONT?) does unstick the application in our case.
Thinking about this more, there's another interesting point: the code where this gets called raises a discrepancy in encryption configuration between the client and server socket, but we never set encryption on either end. This all sounds like some kind of memory corruption.

Furthermore, this seems to happen on quick disconnection/reconnection on our end, and I believe that this can lead to the listener socket being recreated as well. Given that the accepted socket can inherit its properties from the listener socket, could it be possible that there is a race condition between the cleanup of a listener socket and the instantiation of an accepted socket?
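For illustration only, the reconnection pattern described above (quick teardown and immediate recreation of the listener on the same port) looks roughly like the sketch below. The port number, loop count, and omitted streaming logic are placeholders, not the actual application code discussed in this thread, and error handling is left out.

```cpp
// Hypothetical sketch of the quick listener teardown/recreation cycle.
#include <srt/srt.h>
#include <netinet/in.h>
#include <cstring>

static SRTSOCKET make_listener(int port)
{
    SRTSOCKET ls = srt_create_socket();

    sockaddr_in sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    sa.sin_addr.s_addr = htonl(INADDR_ANY);

    srt_bind(ls, reinterpret_cast<sockaddr*>(&sa), sizeof sa);
    srt_listen(ls, 5);
    return ls;
}

int main()
{
    srt_startup();

    for (int i = 0; i < 1000; ++i)
    {
        SRTSOCKET ls = make_listener(9000);

        // ... accept a client, stream briefly, the client drops ...

        // The listener is closed and immediately recreated on the same port;
        // this is the window in which a race between listener cleanup and
        // acceptance of an incoming connection is suspected.
        srt_close(ls);
    }

    srt_cleanup();
    return 0;
}
```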
If you are talking about a cleanup in the listener socket, which normally happens when it's being closed, something like that could possibly happen, but this would still be a valid listening socket, still existing, but no longer capable of accepting new connections, just for a short time to finish all remaining operations. There's no cleanup that would set all fields to zero or anything like that.

BTW, note that using any condition-waiting function inside the log handler is an extremely bad idea. This way you can influence some time dependencies within the whole library and change its behavior for the worse. If you can't handle the request for a log immediately, the log should either be stored in some queue, from which it will then be slowly picked up, or dropped otherwise.
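To illustrate that last point, here is a minimal sketch of such a queue-based handler: the callback only enqueues the message and returns, and a separate application thread drains the queue. The callback signature is intended to match srt_setloghandler from the SRT C API; the queue, the consumer thread, and the formatting are illustrative assumptions, not SRT code.

```cpp
// Minimal sketch of a non-blocking log handler: the callback only enqueues
// the message (holding the mutex just long enough to push) and returns;
// a separate application thread drains the queue at its own pace.
#include <srt/srt.h>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

static std::mutex g_log_mtx;
static std::condition_variable g_log_cv;
static std::queue<std::string> g_log_queue;

// Called by SRT; must never wait on application-side conditions.
static void log_handler(void* /*opaque*/, int level, const char* file,
                        int line, const char* area, const char* message)
{
    (void)level; (void)file; (void)line;
    {
        std::lock_guard<std::mutex> lk(g_log_mtx);
        g_log_queue.push(std::string(area) + ": " + message);
    }
    g_log_cv.notify_one();
}

// Consumer thread: picks up queued messages and writes them out.
static void drain_logs()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lk(g_log_mtx);
        g_log_cv.wait(lk, [] { return !g_log_queue.empty(); });
        std::string msg = std::move(g_log_queue.front());
        g_log_queue.pop();
        lk.unlock();
        std::fprintf(stderr, "%s\n", msg.c_str());
    }
}

// Usage (assumed wiring):
//   srt_setloghandler(nullptr, &log_handler);
//   std::thread(drain_logs).detach();
```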
I am finally getting back to this! After looking at the code, here's the part that looks suspicious for a race condition:

```cpp
// Change towards original UDT:
// Leave all the closing activities for garbageCollect to happen,
// however remove the listener from the RcvQueue IMMEDIATELY.
// Even though garbageCollect would eventually remove the listener
// as well, there would be some time interval between now and the
// moment when it's done, and during this time the application will
// be unable to bind to this port that the about-to-delete listener
// is currently occupying (due to blocked slot in the RcvQueue).
HLOGC(mglog.Debug, log << s->m_pUDT->CONID() << " CLOSING (removing listener immediately)");
{
    CGuard cg(s->m_pUDT->m_ConnectionLock);
    s->m_pUDT->m_bListening = false;
    s->m_pUDT->m_pRcvQueue->removeListener(s->m_pUDT);
}
```

It seems to me that there's a risk of a race condition here if there is a listener connection already happening before removing the listening socket from the listening loop. I haven't been able to reproduce the exact issue but, adding some …
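To make the suspected interleaving easier to picture, here is a generic, self-contained illustration (not SRT code; the Listener, Options, close_listener, and accept_from names are invented): a close path clears the listening flag under a lock, and an accept path copies options that an accepted socket would inherit. If the accept path takes the same lock and rechecks the flag, it cannot observe a half-torn-down listener.

```cpp
// Generic illustration of the suspected interleaving (not actual SRT code).
#include <mutex>

struct Options { bool enforced_encryption = true; };

struct Listener {
    std::mutex connection_lock;   // plays the role of m_ConnectionLock
    bool listening = true;        // plays the role of m_bListening
    Options opts;                 // options an accepted socket would inherit
};

// Close path, analogous to the snippet above: flag + deregistration under lock.
void close_listener(Listener& ls)
{
    std::lock_guard<std::mutex> lk(ls.connection_lock);
    ls.listening = false;
    // ... remove from the receive queue; full cleanup happens later ...
}

// Accept path. Copying the inherited options under the same lock, and
// rechecking the flag, rules out reading a listener that is mid-teardown.
bool accept_from(Listener& ls, Options& out)
{
    std::lock_guard<std::mutex> lk(ls.connection_lock);
    if (!ls.listening)
        return false;             // listener already closing: reject
    out = ls.opts;                // safe: teardown cannot interleave here
    return true;
}
```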
Ok, I've checked the things at the given location and it looks like there's a potential deadlock in the following form. The above-mentioned function puts a lock on two mutexes at a time: …

As I have tried to trace the theoretical use of it, locks should either happen in this order (the call to …) or … I have, however, something that can help track it down, but you'd have to merge one branch that I have prepared just a moment ago. Can you help?
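As general background on this kind of deadlock (two mutexes acquired in different orders by different threads), the usual remedies are a fixed global lock order or acquiring both mutexes atomically. A minimal C++17 sketch, independent of SRT's actual CGuard mechanics:

```cpp
#include <mutex>

std::mutex connection_lock;
std::mutex accept_lock;

// Deadlock-prone pattern: thread A locks connection_lock then accept_lock,
// while thread B locks accept_lock then connection_lock.

// Remedy 1: agree on one global order and always follow it.
void locked_in_fixed_order()
{
    std::lock_guard<std::mutex> a(connection_lock); // always first
    std::lock_guard<std::mutex> b(accept_lock);     // always second
    // ... critical section touching both structures ...
}

// Remedy 2: acquire both at once; std::scoped_lock applies a deadlock-avoidance
// algorithm, so the ordering at different call sites no longer matters.
void locked_atomically()
{
    std::scoped_lock both(connection_lock, accept_lock);
    // ... critical section ...
}
```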
Ok, in short: we are currently unable to reproduce it, but you can help us. If you want, please merge the branch from the #1055 PR into your branch, then compile with … You should be able to see in the logs which thread most recently acquired the locks that are blocking the thread, and which locks these are.
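For readers who cannot pull that branch, the idea behind the lock-logging instrumentation can be approximated on the application side with a wrapper that records which thread currently holds a mutex. This is a conceptual sketch only; LoggedMutex is an invented name, and this is not how the instrumented SRT build is implemented.

```cpp
// Conceptual sketch: a mutex wrapper that remembers its current owner thread,
// so a debugger or a log line can show who holds the lock during a hang.
#include <cstdio>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

class LoggedMutex {
public:
    void lock(const char* where)
    {
        m_mtx.lock();
        m_owner = thread_id_string(std::this_thread::get_id());
        std::fprintf(stderr, "[lock] %s acquired by thread %s\n", where, m_owner.c_str());
    }

    void unlock(const char* where)
    {
        std::fprintf(stderr, "[lock] %s released by thread %s\n", where, m_owner.c_str());
        m_owner.clear();
        m_mtx.unlock();
    }

    // Last recorded owner; informational only, not itself synchronized.
    std::string owner() const { return m_owner; }

private:
    static std::string thread_id_string(std::thread::id id)
    {
        std::ostringstream os;
        os << id;
        return os.str();
    }

    std::mutex m_mtx;
    std::string m_owner;
};
```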
@toots, when you have time, please reopen it and retry according to what was written above. The thread-logging feature had to be put aside for now; we might reinstate it at some point, but it won't be fast. The feature is still in this branch, so you might try to use it together with your changes; just note that it has now largely diverged from the latest master.
Thank you, and thanks for the time spent looking at this issue. I'll let you know if I have more info or can reproduce. So far, our setup is stable and we're happy with it.
Hi!
Describe the bug
We were investigating a connection freeze in our SRT setup and came across the following gdb stacktrace:

…

This exact same trace was repeated across multiple instances.
Upon investigating, this appeared to be a live loop. Further, the code around frame 4 looks like this:

…
Intrigued, we then checked the value of m_bOPT_StrictEncryption:

…

This is despite never touching this value, and despite the documentation at https://github.com/Haivision/srt/blob/master/docs/API.md:

…
It looks like the value is set to true as expected, except possibly at line 336:

…

I would have pushed a PR instead of an issue, but I'm not sure what to do with this constructor case.
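As a defensive measure while the default is in question, the option can also be set explicitly rather than relied upon. A minimal sketch using the SRT C API (the include path, error handling, and the read-back check are assumptions about a typical build, not code from this report):

```cpp
// Sketch: explicitly enforce the documented default instead of relying on it.
#include <srt/srt.h>
#include <cstdio>

bool enforce_encryption(SRTSOCKET sock)
{
    const int yes = 1;  // SRTO_ENFORCEDENCRYPTION is boolean; documented default is true
    if (srt_setsockopt(sock, 0, SRTO_ENFORCEDENCRYPTION, &yes, sizeof yes) == SRT_ERROR)
    {
        std::fprintf(stderr, "srt_setsockopt failed: %s\n", srt_getlasterror_str());
        return false;
    }

    // Read the option back to confirm what the socket actually holds.
    int val = 0;
    int len = sizeof val;
    if (srt_getsockopt(sock, 0, SRTO_ENFORCEDENCRYPTION, &val, &len) == SRT_ERROR)
        return false;
    return val != 0;
}
```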
As a side note, there's another inconsistent spot in the code, in srtcore/core.h at line 566:

…

Desktop (please provide the following information):
SRT version: 1.4.0