
Avoid parsing nil response when connection timeout occurred #235

Closed

Conversation

@tzzzoz

@tzzzoz tzzzoz commented Sep 26, 2017

I have made a quick fix for issue #233.
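In outline, the fix guards the parse step so that a timed-out connection no longer crashes on parsing nil. A minimal, hypothetical sketch of the idea (FetchGuardSketch and maybe_parse/3 are illustrative names, not part of this diff):

```elixir
defmodule FetchGuardSketch do
  # Hypothetical helper illustrating the guard: only parse when the broker
  # actually returned bytes; a connection timeout currently surfaces as nil.
  def maybe_parse(nil, state, _parser), do: {:error, :timeout, state}

  def maybe_parse(raw_response, state, parser) do
    parsed = parser.(raw_response)
    {parsed, %{state | correlation_id: state.correlation_id + 1}}
  end
end
```

In the consumer server loop the parser argument would be something like &Fetch.parse_response/1.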

response = Fetch.parse_response(response)
state = %{state | correlation_id: state.correlation_id + 1}
last_offset = response |> hd |> Map.get(:partitions) |> hd |> Map.get(:last_offset)
if last_offset != nil && fetch_request.auto_commit do

Function body is nested too deep (max depth is 2, was 3).

{response, state}
else
{response, state}
response = NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout())

Line is too long (max is 80, was 93).

_ ->
response = Fetch.parse_response(response)
state = %{state | correlation_id: state.correlation_id + 1}
last_offset = response |> hd |> Map.get(:partitions) |> hd |> Map.get(:last_offset)

Line is too long (max is 80, was 95).

|> NetworkClient.send_sync_request(fetch_data, config_sync_timeout())
|> Fetch.parse_response
{response, %{state | correlation_id: state.correlation_id + 1}}
response = NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout())

Line is too long (max is 80, was 93).
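For reference, one way to bring the flagged pipelines under 80 columns is to break them across lines (formatting sketch only, reusing the variables from the excerpts above):

```elixir
last_offset =
  response
  |> hd()
  |> Map.get(:partitions)
  |> hd()
  |> Map.get(:last_offset)

response =
  NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout())
```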

@sourcelevel-bot

Ebert has finished reviewing this Pull Request and has found:

  • 2 possible new issues (including those that may have been commented here).

You can see more details about this review at https://ebertapp.io/github/kafkaex/kafka_ex/pulls/235.

@joshuawscott
Member

Thanks for the PR! Some comments:

Ideally we'd handle this in a way that involves reconnecting, or at least returning an appropriate error.

A test would be the best way to ensure this is handled correctly (either with a reconnect or an appropriate failure message). As-is, this would return {nil, State.t} from that function, which really just pushes the nil up to somewhere else, and given the limited dialyzer coverage we have, I'm not certain it will behave correctly (correct me if I'm wrong here).

We're trying not to introduce new credo issues, so the Ebert line-too-long complaints should be addressed before merge.

Thanks again for the PR. If you are stuck on a test case, I might try to create one, as this issue is intermittently affecting the apps I'm currently working on.

Cheers!
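For illustration, the kind of test being asked for might look like this, written against the hypothetical FetchGuardSketch helper sketched in the PR description above (not the kafka_ex test suite):

```elixir
defmodule FetchTimeoutTest do
  use ExUnit.Case, async: true

  test "a nil network response becomes an error tuple, not a MatchError" do
    state = %{correlation_id: 0}
    parser = fn _raw -> flunk("parser should not run on a nil response") end

    assert {:error, :timeout, ^state} =
             FetchGuardSketch.maybe_parse(nil, state, parser)
  end
end
```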

@bjhaid
Member

bjhaid commented Sep 26, 2017

Testing this will be somewhat difficult; the ideal fix will require reworking the NetworkClient API. I have been on the fence about this kind of fix, but given that the problem has been around for a while and we have not implemented a cleaner fix, I think this is fine. This PR should also document that fetch can return nil instead of the data structure it claims to return today, so the user should be prepared to handle that. The fix should also be extended to the stream function since it has similar behavior, and 👍 on the Ebert comment.
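The documentation change suggested here could look roughly like the following; the module name, spec shape, and return types are assumptions for illustration, not the actual KafkaEx source:

```elixir
defmodule FetchDocSketch do
  @doc """
  Fetches messages from the given topic and partition.

  Returns the parsed fetch response, or `nil` if the broker connection
  timed out before a response was received, so callers should be prepared
  to handle `nil`.
  """
  @spec fetch(binary, non_neg_integer, Keyword.t()) :: [map] | nil
  def fetch(_topic, _partition, _opts \\ []), do: nil
end
```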

@dantswain
Collaborator

I'm a little bit concerned here about swallowing the failed response. If the nil is due to some error state in the system, then I would rather see us raise a more meaningful error. It would still crash, but it would be less obtuse than a match error.

Do we know for sure what conditions cause the response to be nil? Do we know that we won't continually get nil until something crashes elsewhere?

I have been considering trying to replace the network client with something using @fishcakez Connection behavior, though that would be a bit of an undertaking.

@bjhaid
Member

bjhaid commented Sep 26, 2017

This has been a problem forever, and we have been rejecting PRs like this without fixing the underlying issue; we are also causing folks' apps to crash unduly. We should give the user a choice and not make the decision for them. The ideal fix IMO is to make NetworkClient return {:ok, binary} | {:error, reason} and propagate that correctly to the caller; then, for an API like fetch, return the error to the user and let them handle it correctly, and create a fetch! equivalent; for stream, log it in debug mode and ignore such failures. We should also document this behavior. However, if we don't have the bandwidth to adequately fix this, we should stop crashing apps for folks and making people come back time and time again with this fix, or asking questions about why their apps are crashing.
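A minimal sketch of that shape, with all module and function names hypothetical (this is not the current KafkaEx API): the network layer always returns a tagged tuple, fetch propagates the error to the caller, and fetch! raises.

```elixir
defmodule NetworkClientSketch do
  # The transport call always yields {:ok, binary} | {:error, reason},
  # never a bare nil.
  def send_sync_request(transport_fun, data, timeout) do
    case transport_fun.(data, timeout) do
      bin when is_binary(bin) -> {:ok, bin}
      nil -> {:error, :timeout}
      {:error, reason} -> {:error, reason}
    end
  end
end

defmodule FetchAPISketch do
  # fetch/1 propagates the error; fetch!/1 raises, mirroring the
  # fetch / fetch! split suggested above.
  def fetch(request_fun) when is_function(request_fun, 0) do
    case request_fun.() do
      {:ok, raw} -> {:ok, parse(raw)}
      {:error, reason} -> {:error, reason}
    end
  end

  def fetch!(request_fun) do
    case fetch(request_fun) do
      {:ok, parsed} -> parsed
      {:error, reason} -> raise "fetch failed: #{inspect(reason)}"
    end
  end

  # Stand-in for the real protocol parsing.
  defp parse(raw), do: raw
end
```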

@dantswain
Collaborator

I'm open to solving the underlying problem in various ways, but I'm relatively strongly against swallowing an error condition unless we know for a fact that that is not going to just cause a crash downstream somewhere. It's possible that this will not solve the crash at all but will instead cause a user's application to crash in an even more obtuse way.

I'm very much in favor of fixing the underlying problem. It affects me directly. I literally sometimes have to wake up in the middle of the night to deal with its impact on our production systems.

It would be really great if someone could track down exactly why we get a nil back from the network client, under what conditions that happens, and what kind of resolution it has (does it repeat, does the next request return OK, etc). I think @joshuawscott has some ideas around that.

@bjhaid
Member

bjhaid commented Sep 26, 2017

I am positive the nil is coming from here:

If you check your logs, you should find the corresponding entry.
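Roughly the pattern being pointed at, as a sketch rather than the literal kafka_ex source: the receive failure is logged and the function falls through to nil, which is what later reaches Fetch.parse_response/1.

```elixir
defmodule NetworkClientNilSketch do
  require Logger

  def send_sync_request(socket, data, timeout) do
    :ok = :gen_tcp.send(socket, data)

    case :gen_tcp.recv(socket, 0, timeout) do
      {:ok, response} ->
        response

      {:error, reason} ->
        # This is the log line that shows up as
        # "Receiving data from broker ... failed with :timeout".
        Logger.error("Receiving data from broker failed with #{inspect(reason)}")
        nil
    end
  end
end
```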

@joshuawscott
Member

reason == :timeout in my case (and in #233).
I think returning the value of {:error, reason} would be much better, and it doesn't seem like that would require much change at all.
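Concretely, that would amount to returning the tagged error from the failure branch instead of nil; again a hypothetical sketch, not the actual diff:

```elixir
defmodule NetworkClientErrorSketch do
  require Logger

  # Same shape as the sketch above, but the failure branch hands back
  # {:error, reason} instead of nil.
  def send_sync_request(socket, data, timeout) do
    :ok = :gen_tcp.send(socket, data)

    case :gen_tcp.recv(socket, 0, timeout) do
      {:ok, response} ->
        {:ok, response}

      {:error, reason} ->
        Logger.error("Receiving data from broker failed with #{inspect(reason)}")
        {:error, reason}
    end
  end
end
```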

@bjhaid
Member

bjhaid commented Sep 26, 2017

@joshuawscott sounds like someone just volunteered :). I agree wholeheartedly with the intent.

@tzzzoz
Author

tzzzoz commented Sep 27, 2017

I agree with @bjhaid and @dantswain about reworking the NetworkClient API. When we have connection problems in our applications, every KafkaEx API that goes through NetworkClient.send_sync_request/3 is affected.
Besides the issue I mentioned in #233, I also got the errors below during the connection timeout:
[error] Could not connect to broker "xxx.xxx.xxx.xxx":19093 because of error {:error, :econnrefused}
[error] Unable to fetch metadata from any brokers. Timeout is 10000.
[error] Receiving data from broker "xxx.xxx.xxx.xxx":19093 failed with :timeout

When a timeout or other connection failure occurs, I think NetworkClient should be able to handle it: either NetworkClient retries by itself, or it propagates the connection error to users (at least giving us the chance to capture it and deal with it), instead of crashing :kafka_ex. A connection timeout normally lasts several seconds or minutes, so the Supervisor's restart strategy can't help in this case.
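Once the error is propagated, caller-side handling could look something like this simple retry-with-backoff sketch (all names hypothetical):

```elixir
defmodule FetchRetrySketch do
  # Retry a fetch-like call a few times with exponential backoff when the
  # connection times out; after that, return whatever the last call gave us.
  def fetch_with_retry(fetch_fun, retries \\ 3, backoff_ms \\ 1_000)

  def fetch_with_retry(fetch_fun, 0, _backoff_ms), do: fetch_fun.()

  def fetch_with_retry(fetch_fun, retries, backoff_ms) do
    case fetch_fun.() do
      {:error, :timeout} ->
        Process.sleep(backoff_ms)
        fetch_with_retry(fetch_fun, retries - 1, backoff_ms * 2)

      other ->
        other
    end
  end
end
```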

@beedub

beedub commented Dec 10, 2017

Any update here? Also seeing this issue during :timeouts.

@joshuawscott
Member

@tzzzoz @beedub does #272 fix this for you?

@joshuawscott
Member

bump @tzzzoz @beedub

@tzzzoz
Author

tzzzoz commented Mar 23, 2018

@joshuawscott Yes, it works for me! Great job!

@tzzzoz tzzzoz closed this Mar 23, 2018
@tzzzoz tzzzoz deleted the fix_fetch_nil_response_exception branch March 23, 2018 10:04