NoMethodError: undefined method `[]' for nil:NilClass #51
I don't know how this could happen.

```ruby
Faktory.configure_worker do |config|
  config.faktory = { debug: true }
end
```
We're currently digging into an unrelated issue (not sure yet if it's something on our end or not) where we were digging into the …
@jmirmina Great find, I was completely blind to that line. This will occur if we call …

I wonder if this issue can come about if we use a Faktory::Client in a thread-unsafe way; maybe it's possible for one thread to request …
Hello! I've been working with @jmirmina, debugging this heavily on our side internally, and with regard to what @mperham said about Faktory::Client being used in thread-unsafe ways, I highly suspect this to be the case. A bit of background about our setup: we run our Rails app inside a standard container that boots up Puma, and as part of that setup we instantiate both … as a producer (think: the launcher class in this gem) in a separate thread in an after-initializer; we pass wid and pid to Faktory. However, we also push jobs to Faktory from this same Rails container using the standard workflow, where …

The other thing I've witnessed while sharing a connection pool is that sometimes there is data in the …
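A minimal sketch of the checkout-per-command pattern this kind of setup calls for, assuming the plain connection_pool gem, a no-argument `Faktory::Client.new` that picks up the default/ENV connection settings, and the standard Faktory job-payload keys; it illustrates the thread-safety concern above and is not a description of FWR's own pooling internals:

```ruby
require "securerandom"
require "connection_pool"
require "faktory"

# Each command checks a dedicated client out of the pool, so no two threads
# ever interleave reads and writes on the same socket.
FAKTORY_POOL = ConnectionPool.new(size: 5, timeout: 3) { Faktory::Client.new }

def push_job(jobtype, *args)
  FAKTORY_POOL.with do |client|
    client.push(
      "jid"     => SecureRandom.hex(12),
      "jobtype" => jobtype,
      "args"    => args,
      "queue"   => "default"
    )
  end
end
```

The point is simply that a client connection is only ever used by one thread at a time for the full request/response round trip, whether the caller is the producer code in Puma or a job that enqueues other jobs.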
I'll read through the FWR code and make sure it is using connection_pool as expected. If you can give me any more clues or reproduce the issue, that would help a lot too.
What do you think about this?

```diff
+++ b/lib/faktory/client.rb
@@ -295,9 +295,12 @@ module Faktory
       begin
         yield
-      rescue Errno::EPIPE, Errno::ECONNRESET
+      rescue SystemCallError, SocketError, TimeoutError
         if retryable
           retryable = false
+
+          @sock.close rescue nil
+          @sock = nil
           open(@timeout)
           retry
         else
```

If the underlying socket raises an exception that is not caught, it could have unread data that will be read on the next use. This change: …
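For readability, here is the same retry-once pattern written out as a standalone sketch. The `@sock`, `@timeout`, and `open` names are taken from the diff above and are not verified against the gem's current source; `Timeout::Error` is used instead of the diff's `TimeoutError`, since the top-level alias is deprecated in newer Rubies.

```ruby
require "socket"   # defines SocketError
require "timeout"  # defines Timeout::Error

# Sketch of the patched transaction wrapper: on any socket-level failure,
# throw the connection away before reconnecting, so a half-read response
# can never be handed to a later, unrelated command.
def transaction
  retryable = true
  begin
    yield
  rescue SystemCallError, SocketError, Timeout::Error
    raise unless retryable
    retryable = false
    @sock.close rescue nil   # discard the socket and whatever is buffered on it
    @sock = nil
    open(@timeout)           # reconnect and re-handshake
    retry
  end
end
```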
I think that this should cover the cases I outlined above with things potentially being left over in the buffer; these changes would be perfect. However, I did notice something weird today. Here is a stack trace of us getting something unexpected from the buffer when doing a push:

…

Now, what was exceptionally weird about this was that, as far as I can tell from the faktory-server Go code, there is nothing in the …
That error would happen if the client tries to heartbeat for a worker which has been reaped. It's another case where the connection should be closed so the client opens a new socket and re-authenticates. Sounds like your worker process got into a bad state somehow and FWR's client connection logic isn't yet good enough to gracefully recover from every edge case. I've committed that change; feel free to run master and see if it improves things.
We've been running into this recently. I'm not sure what's causing it, but it potentially seems to happen more when we deploy a new version. We're running in ECS, so I don't really understand why that would matter...
@arjun810 Please post a full backtrace when you get one.
Sorry for the terrible formatting... copying out of our error tracker.
A little bit more information for you... It seems like after our deploy, one of the workers started up unhappy. Here are the logs from that worker's startup. It eventually does successfully process a job, but then randomly and intermittently gets the same OK failure pasted above:

```
7 TID-2dak ERROR: Error fetching job: unexpected token at 'OK'
```
It sounds like you've got something really dodgy in your network messing up connections; I'm not sure how that happens. Faktory::Timeout is a bad sign. Did you by any chance take down Faktory and then start it 4-5 seconds later without first shutting down the workers?

Ruby is really flexible, so for these things I usually monkeypatch methods to get more runtime data associated with the crash, e.g.:

```ruby
class Faktory::Client
  def ok(retval = true)
    resp = result
    return retval if resp == "OK"
    p [:crash, resp] if !resp.respond_to?(:[]) # added
    resp[0].to_sym
  end
end
```

Now deploy and search the logs for "crash". Keep iterating on that debugging until you get more clarity.
My mistake, since we know …
We think the issue is that our Faktory server was under-resourced. We gave it more CPU/RAM and haven't seen the issue anymore. It's a little odd because it never got to 100% CPU or memory. In any case, it'd be nice if the client were more robust here; not sure if you can reset the connection so the worker isn't in a permanently bad state. It does seem like there is a client issue, because 3/4 workers were totally fine, and the one that started out unhappy stayed pretty unhappy.
Hmm, we are still experiencing this. I did notice that Faktory::Client already has a … (faktory_worker_ruby/lib/faktory/client.rb, line 346 in fdf7582).
Seems like this works: `Faktory.faktory = { debug: true }`, so maybe no need to monkey patch!
Hm, we haven't tried the above yet because it seems like it'd produce a ton of logging data during normal operation. But we're considering monkey patching the client to set …

One thing I'm noticing now, looking at the …: it seems like the only time Faktory returns nil is if there are no jobs to fetch (either no jobs on the queues, or the worker is supposed to be "quiet"). So it kinda feels like messages are getting mixed up and we're getting responses for different requests when this happens.

I will note that, at least for the incident that occurred today, the worker process started out with 4 Faktory::TimeoutErrors when trying to fetch work. Then it got a job successfully and appeared to complete it, but things started breaking down when it tried to ACK the job or enqueue new jobs.

Selected logs: …
Will try to get more info if we can and see what we can find... this does seem pretty odd. One thing I was confused about previously was that it seems like the read_timeout method might lead to data getting misaligned if it times out trying to read an expected number of bytes (and those bytes later show up), but it looks like commands are wrapped in a transaction that closes down the socket and opens a new one if there are TimeoutErrors, so it should be fine? Presumably, if a client got a partial message and then timed out, the rest of the message would not get sent when it opens a new socket. Though I'm not sure what happens on line 331: if it fails to close the socket, does it still try to open a new socket? Not sure what the …
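To connect the reasoning above back to the error in the title, here is a purely illustrative snippet, a simplified stand-in for the `ok` method quoted earlier in this thread rather than the gem's actual code path: if a nil reply (the "no job available" FETCH response) is consumed by a command that expected "OK" or an error string, you get exactly the reported exception.

```ruby
# Simplified stand-in for the result check shown earlier in the thread.
def check_ok(resp, retval = true)
  return retval if resp == "OK"
  resp[0].to_sym   # blows up when resp is nil
end

check_ok("OK")     # => true, the normal PUSH/ACK case
begin
  check_ok(nil)    # a reply that was meant for an empty FETCH
rescue NoMethodError => e
  puts e.message   # e.g. "undefined method `[]' for nil:NilClass"
end
```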
There appears to be an edge case in the Ruby client which corrupts the state of the connection. No one has been able to get me a reproduction, so I haven't been able to track it down and fix it, but it usually happens after some exception is raised.
If I had an infinite amount of time, I'd write a stress tester for the protocol in order to stress test all of the different Faktory clients with various scenarios: things like timeouts, errors, malformed responses, etc.
Hmm, thanks, that's helpful to know! Is it always the same exception that we experienced? Do you happen to know of any other issue reports like this off-hand? I wonder if there's a way to detect the bad state and "reset" the connection somehow... I guess the other operational solution would be to just detect the bad state, remove the bad worker from the pool, and spawn another one, hmm...

Edit: d'oh, I re-read all the comments on this ticket and realized this is the issue where other people ran into the same problem, and the code I was looking at to disconnect the socket was added in response to this issue. FWIW, we are seeing this in Ruby Faktory workers, and some of our Jobs also push other Jobs, so it's similar to the scenario others had mentioned where consumers and producers are in the same process.

Also... while we haven't been able to figure out how to reproduce it, it does seem to keep recurring for us every few days, and it usually seems to happen with a deploy. It seems like a worker will boot up in a bad way and get stuck in that bad state, but if it starts out fine, it usually seems to stay fine.
One small update: we enabled debug logging for the Faktory protocol and saw this sequence of events when the error came up:

…
This seems to show that a PUSH got a -1 as a response (possibly intended as a response to a FETCH). However, it's kinda hard to know what this was a response to, because there's no thread ID or other identifying info in these debug logs. It would be nice to use …

I'm not sure if the above actually tells us anything new, because we kind of already knew that that was what was happening, given that the error was that something was nil where it wasn't expected to be a possible response.

I also noticed in the above sequence that some FETCHes don't seem to have a response immediately after them. I also noticed this in another sequence of logs, where there's a mismatch between the number of FETCHes and responses. (Random tiny sample, but I hope it illustrates the situation.) I don't think this is necessarily a problem, unless the server is responding multiple times for the same FETCH and some of those responses go to the wrong client.
Will continue to investigate... it seems like one of our apps reproduces the issue more readily than the other, so that might help. Also, the app is in pre-production, so there are long quiet periods (which explains the stretch of -1 responses to fetches). For reference, here's the monkey patch I used:

```ruby
class Faktory::Client
  private

  def debug(line)
    Faktory::Logging.logger.info(line)
  end
end
```

This resulted in slightly more comprehensible logs (in that we can see which response is a response to which request).
^ Hah, was just about to send in a patch for that, thanks!
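A possible refinement of that monkey patch, sketched here only as an illustration (it reuses the `Faktory::Logging.logger` call from the snippet above and assumes nothing else about the gem): tag each debug line with the current thread, so a response can be correlated with the command that triggered it even when several threads share the log.

```ruby
class Faktory::Client
  private

  # Include a short thread tag so interleaved commands and responses from
  # different threads can be told apart in the logs.
  def debug(line)
    tid = Thread.current.object_id.to_s(36)
    Faktory::Logging.logger.info("tid-#{tid} #{line}")
  end
end
```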
I think I found something relevant. So far, in some of the cases where we see this error state of commands getting responses meant for other commands, I have noticed an SSL error that happens before it starts getting these other errors:

…

It seems that when calling read_nonblock we try to disable exceptions, but there are still some exceptions that can come up, at least when OpenSSL is involved. If an exception is raised here, it seems like whatever partial reads happened get left on the buffer. I wonder if it would help to add this SSLError to the …

Submitted #83 in case that is a sensible thing to do. At the very least, it feels like this is an error that needs to be handled somehow for Faktory to recover successfully. Thanks for your consideration!
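A minimal sketch of the failure mode being described, under the assumption that the client reads with `exception: false`; the method name and buffer handling here are illustrative, not FWR's actual internals:

```ruby
require "openssl"

# Even with `exception: false`, OpenSSL::SSL::SSLSocket#read_nonblock can
# still raise OpenSSL::SSL::SSLError. If that happens mid-response, any bytes
# already consumed are effectively lost and the connection is misaligned.
def read_chunk(sock)
  chunk = sock.read_nonblock(4096, exception: false)
  return nil if chunk == :wait_readable || chunk == :wait_writable  # nothing usable yet
  chunk   # a String, or nil at EOF
rescue OpenSSL::SSL::SSLError
  # Discard the socket so the next command starts on a clean connection
  # instead of reading leftovers that belonged to this interrupted read.
  sock.close rescue nil
  raise
end
```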
Hmm... the above might still be a little off the mark. It seems like something must be calling the OpenSSL shutdown function for this to come up, and it seems like one of the times that FWR tries to close the socket is in this error-handling case... so maybe …

Also... I'm looking at `@threads` in manager.rb. I wonder if it's possible for a bad thread to keep fetching jobs if it gets into this state? I think, though, that if the transaction method handles an exception, it should close the Connection and the associated bad socket, so maybe the change I submitted will help with that.
Unfortunately, most of the logs from the earlier incidents we had have passed out of our log retention policy, so it's hard for me to determine if it continued spawning threads, but I do remember that in the admin UI we would see one processor with way more jobs than the configured concurrency, so that might be related... We did have one such incident on 2/6, and I am seeing a bunch of different TIDs in the logs with messages like …

We'll be keeping an eye on this and will report back based on what we learn... unfortunately, with these Heisenbugs it's hard to know if the issue is fixed or not when we don't know how to reproduce it reliably 😕.
`main` has a fix to gracefully handle and reopen sockets which raise OpenSSL errors. See #83. If someone wants to try it out and see if it improves your experience, I can package it up and release v2.0.1.
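For anyone wanting to try that fix before the release, a hedged Gemfile sketch; the repository URL is an assumption based on the usual contribsys location, and the branch name is taken from the comment above, so verify both for your setup:

```ruby
# Gemfile
gem "faktory_worker_ruby",
    git: "https://github.com/contribsys/faktory_worker_ruby",
    branch: "main"
```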
That would be awesome, thanks! I'm running it from git in one of our staging environments, so we'll see if anything comes up. We're just waiting for now... but the release would be handy for another app that sees more real traffic and might be more likely to reproduce the issue.