-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid parsing nil response when connection timeout occured #235
Conversation
avoid parse nil response exception
response = Fetch.parse_response(response) | ||
state = %{state | correlation_id: state.correlation_id + 1} | ||
last_offset = response |> hd |> Map.get(:partitions) |> hd |> Map.get(:last_offset) | ||
if last_offset != nil && fetch_request.auto_commit do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function body is nested too deep (max depth is 2, was 3).
{response, state} | ||
else | ||
{response, state} | ||
response = NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is too long (max is 80, was 93).
_ -> | ||
response = Fetch.parse_response(response) | ||
state = %{state | correlation_id: state.correlation_id + 1} | ||
last_offset = response |> hd |> Map.get(:partitions) |> hd |> Map.get(:last_offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is too long (max is 80, was 95).
|> NetworkClient.send_sync_request(fetch_data, config_sync_timeout()) | ||
|> Fetch.parse_response | ||
{response, %{state | correlation_id: state.correlation_id + 1}} | ||
response = NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is too long (max is 80, was 93).
Ebert has finished reviewing this Pull Request and has found:
You can see more details about this review at https://ebertapp.io/github/kafkaex/kafka_ex/pulls/235. |
Thanks for the PR! some comments: Ideally we'd handle this in a way that involves reconnecting, or at least returning an appropriate error. A test would be the best way to ensure this is handled correctly (either with a reconnect or appropriate failure message). As-is, this would return We're trying to not introduce new credo issues, so the Ebert line-too-long complaints should be addressed before merge. Thanks again for the PR, if you are stuck on a test case, I might try to create one, as this issue is affecting the apps I'm currently working on intermittently. Cheers! |
Testing this will be somewhat difficult, ideal fix will require reworking the |
I'm a little bit concerned here about swallowing the failed response. If the Do we know for sure what conditions causes the response to be nil? Do we know that we won't continually get nil until something crashes elsewhere? I have been considering trying to replace the network client with something using @fishcakez Connection behavior, though that would be a bit of an undertaking. |
This has been a problem forever, and we have been rejecting PRs like this without fixing the underlying issue, we are also causing folks apps to crash unduly, we should give the user a choice and not make a decision for them. The ideal fix IMO is to make |
I'm open to solving the underlying problem in various ways, but I'm relatively strongly against swallowing an error condition unless we know for a fact that that is not going to just cause a crash downstream somewhere. It's possible that this will not solve the crash at all but will instead cause a user's application to crash in an even more obtuse way. I'm very much in favor of fixing the underlying problem. It affects me directly. I literally sometimes have to wake up in the middle of the night to deal with its impact on our production systems. It would be really great if someone could track down exactly why we get a nil back from the network client, under what conditions that happens, and what kind of resolution it has (does it repeat, does the next request return OK, etc). I think @joshuawscott has some ideas around that. |
I am positive the kafka_ex/lib/kafka_ex/network_client.ex Line 47 in 7cc8a00
if you check your logs, you should find the entry here in it. |
reason == |
@joshuawscott sounds like someone just volunteered :), I agree whole-heartedly with the intent |
I agree with @bjhaid and @dantswain about reworking on the NetworkClient API. When I have the connection problem in our applications, every KafkaEx API where When the timeout or other connection failure comes, I think that NetworkClient should be able to handle it, either NetworkClient does the retry by itself or NetworkClient propagates the connection error to users(at least let us have the chance to capture it and deal with it), instead of crashing :kafka_ex. The connection timeout normally will last several seconds or minutes, so Supervisor's restart strategy couldn't help any more in this case. |
any update here? also seeing this issue during |
@joshuawscott Yes, it works for me! Great job! |
I have done a quick fix for the issue #233.