
Avoid parsing nil response when connection timeout occurred #235

Closed

Conversation

@tzzzoz

@tzzzoz tzzzoz commented Sep 26, 2017

I have made a quick fix for issue #233.
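In outline, the fix guards the parse step so that a timed-out connection no longer crashes on parsing nil. A minimal, hypothetical sketch of the idea (FetchGuardSketch and maybe_parse/3 are illustrative names, not part of this diff):

```elixir
defmodule FetchGuardSketch do
  # Hypothetical helper illustrating the guard: only parse when the broker
  # actually returned bytes; a connection timeout currently surfaces as nil.
  def maybe_parse(nil, state, _parser), do: {:error, :timeout, state}

  def maybe_parse(raw_response, state, parser) do
    parsed = parser.(raw_response)
    {parsed, %{state | correlation_id: state.correlation_id + 1}}
  end
end
```

In the consumer server loop the parser argument would be something like &Fetch.parse_response/1.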

response = Fetch.parse_response(response)
state = %{state | correlation_id: state.correlation_id + 1}
last_offset = response |> hd |> Map.get(:partitions) |> hd |> Map.get(:last_offset)
if last_offset != nil && fetch_request.auto_commit do

Function body is nested too deep (max depth is 2, was 3).

{response, state}
else
{response, state}
response = NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout())

Line is too long (max is 80, was 93).

_ ->
response = Fetch.parse_response(response)
state = %{state | correlation_id: state.correlation_id + 1}
last_offset = response |> hd |> Map.get(:partitions) |> hd |> Map.get(:last_offset)

Line is too long (max is 80, was 95).

|> NetworkClient.send_sync_request(fetch_data, config_sync_timeout())
|> Fetch.parse_response
{response, %{state | correlation_id: state.correlation_id + 1}}
response = NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout())

Line is too long (max is 80, was 93).
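For reference, one way to bring the flagged pipelines under 80 columns is to break them across lines (formatting sketch only, reusing the variables from the excerpts above):

```elixir
last_offset =
  response
  |> hd()
  |> Map.get(:partitions)
  |> hd()
  |> Map.get(:last_offset)

response =
  NetworkClient.send_sync_request(broker, fetch_data, config_sync_timeout())
```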

@sourcelevel-bot

Ebert has finished reviewing this Pull Request and has found:

  • 2 possible new issues (including those that may have been commented here).

You can see more details about this review at https://ebertapp.io/github/kafkaex/kafka_ex/pulls/235.

@joshuawscott
Member

Thanks for the PR! Some comments:

Ideally we'd handle this in a way that involves reconnecting, or at least returning an appropriate error.

A test would be the best way to ensure this is handled correctly (either with a reconnect or an appropriate failure message). As-is, this would return {nil, State.t} from that function, which really just pushes the nil up to somewhere else, and given the limited dialyzer coverage we have, I'm not certain it will behave correctly (correct me if I'm wrong here).

We're trying not to introduce new credo issues, so the Ebert line-too-long complaints should be addressed before merge.

Thanks again for the PR. If you are stuck on a test case, I might try to create one, as this issue is intermittently affecting the apps I'm currently working on.

Cheers!
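For illustration, the kind of test being asked for might look like this, written against the hypothetical FetchGuardSketch helper sketched in the PR description above (not the kafka_ex test suite):

```elixir
defmodule FetchTimeoutTest do
  use ExUnit.Case, async: true

  test "a nil network response becomes an error tuple, not a MatchError" do
    state = %{correlation_id: 0}
    parser = fn _raw -> flunk("parser should not run on a nil response") end

    assert {:error, :timeout, ^state} =
             FetchGuardSketch.maybe_parse(nil, state, parser)
  end
end
```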

@bjhaid
Member

bjhaid commented Sep 26, 2017

Testing this will be somewhat difficult; the ideal fix will require reworking the NetworkClient API. I have been on the fence about this kind of fix, but given that the problem has been around for a while and we have not implemented a cleaner fix, I think this is fine. This PR should also document that fetch can return nil instead of the data structure it claims to return today, so the user should be prepared to handle that. The fix should also be extended to the stream function since it has similar behavior, and 👍 on the Ebert comment.
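The documentation change suggested here could look roughly like the following; the module name, spec shape, and return types are assumptions for illustration, not the actual KafkaEx source:

```elixir
defmodule FetchDocSketch do
  @doc """
  Fetches messages from the given topic and partition.

  Returns the parsed fetch response, or `nil` if the broker connection
  timed out before a response was received, so callers should be prepared
  to handle `nil`.
  """
  @spec fetch(binary, non_neg_integer, Keyword.t()) :: [map] | nil
  def fetch(_topic, _partition, _opts \\ []), do: nil
end
```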

@dantswain
Collaborator

I'm a little bit concerned here about swallowing the failed response. If the nil is due to some error state in the system, then I would rather see us raise a more meaningful error. It would still crash, but it would be less obtuse than a match error.

Do we know for sure what conditions cause the response to be nil? Do we know that we won't continually get nil until something crashes elsewhere?

I have been considering trying to replace the network client with something using @fishcakez Connection behavior, though that would be a bit of an undertaking.

@bjhaid
Member

bjhaid commented Sep 26, 2017

This has been a problem forever, and we have been rejecting PRs like this without fixing the underlying issue; we are also causing folks' apps to crash unduly. We should give the user a choice and not make the decision for them. The ideal fix IMO is to make NetworkClient return {:ok, binary} | {:error, reason} and propagate that correctly to the caller; then, for an API like fetch, return the error to the user and let them handle it correctly, and create a fetch! equivalent; for stream, log it in debug mode and ignore such failures. We should also document this behavior. However, if we don't have the bandwidth to adequately fix this, we should stop crashing apps for folks and making people come back time and time again with this fix, or asking questions about why their apps are crashing.
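A minimal sketch of that shape, with all module and function names hypothetical (this is not the current KafkaEx API): the network layer always returns a tagged tuple, fetch propagates the error to the caller, and fetch! raises.

```elixir
defmodule NetworkClientSketch do
  # The transport call always yields {:ok, binary} | {:error, reason},
  # never a bare nil.
  def send_sync_request(transport_fun, data, timeout) do
    case transport_fun.(data, timeout) do
      bin when is_binary(bin) -> {:ok, bin}
      nil -> {:error, :timeout}
      {:error, reason} -> {:error, reason}
    end
  end
end

defmodule FetchAPISketch do
  # fetch/1 propagates the error; fetch!/1 raises, mirroring the
  # fetch / fetch! split suggested above.
  def fetch(request_fun) when is_function(request_fun, 0) do
    case request_fun.() do
      {:ok, raw} -> {:ok, parse(raw)}
      {:error, reason} -> {:error, reason}
    end
  end

  def fetch!(request_fun) do
    case fetch(request_fun) do
      {:ok, parsed} -> parsed
      {:error, reason} -> raise "fetch failed: #{inspect(reason)}"
    end
  end

  # Stand-in for the real protocol parsing.
  defp parse(raw), do: raw
end
```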

@dantswain
Collaborator

I'm open to solving the underlying problem in various ways, but I'm relatively strongly against swallowing an error condition unless we know for a fact that that is not going to just cause a crash downstream somewhere. It's possible that this will not solve the crash at all but will instead cause a user's application to crash in an even more obtuse way.

I'm very much in favor of fixing the underlying problem. It affects me directly. I literally sometimes have to wake up in the middle of the night to deal with its impact on our production systems.

It would be really great if someone could track down exactly why we get a nil back from the network client, under what conditions that happens, and what kind of resolution it has (does it repeat, does the next request return OK, etc). I think @joshuawscott has some ideas around that.

@bjhaid
Member

bjhaid commented Sep 26, 2017

I am positive the nil is coming from here:

If you check your logs, you should find the corresponding entry.
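Roughly the pattern being pointed at, as a sketch rather than the literal kafka_ex source: the receive failure is logged and the function falls through to nil, which is what later reaches Fetch.parse_response/1.

```elixir
defmodule NetworkClientNilSketch do
  require Logger

  def send_sync_request(socket, data, timeout) do
    :ok = :gen_tcp.send(socket, data)

    case :gen_tcp.recv(socket, 0, timeout) do
      {:ok, response} ->
        response

      {:error, reason} ->
        # This is the log line that shows up as
        # "Receiving data from broker ... failed with :timeout".
        Logger.error("Receiving data from broker failed with #{inspect(reason)}")
        nil
    end
  end
end
```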

@joshuawscott
Member

reason == :timeout in my case (and in #233).
I think returning the value of {:error, reason} would be much better, and it doesn't seem like that would require much change at all.
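Concretely, that would amount to returning the tagged error from the failure branch instead of nil; again a hypothetical sketch, not the actual diff:

```elixir
defmodule NetworkClientErrorSketch do
  require Logger

  # Same shape as the sketch above, but the failure branch hands back
  # {:error, reason} instead of nil.
  def send_sync_request(socket, data, timeout) do
    :ok = :gen_tcp.send(socket, data)

    case :gen_tcp.recv(socket, 0, timeout) do
      {:ok, response} ->
        {:ok, response}

      {:error, reason} ->
        Logger.error("Receiving data from broker failed with #{inspect(reason)}")
        {:error, reason}
    end
  end
end
```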

@bjhaid
Member

bjhaid commented Sep 26, 2017

@joshuawscott sounds like someone just volunteered :). I agree wholeheartedly with the intent.

@tzzzoz
Author

tzzzoz commented Sep 27, 2017

I agree with @bjhaid and @dantswain about reworking the NetworkClient API. When we have connection problems in our applications, every KafkaEx API that goes through NetworkClient.send_sync_request/3 is affected.
Besides the issue I mentioned in #233, I also got the errors below during the connection timeout:
[error] Could not connect to broker "xxx.xxx.xxx.xxx":19093 because of error {:error, :econnrefused}
[error] Unable to fetch metadata from any brokers. Timeout is 10000.
[error] Receiving data from broker "xxx.xxx.xxx.xxx":19093 failed with :timeout

When a timeout or other connection failure occurs, I think NetworkClient should be able to handle it: either NetworkClient retries by itself, or it propagates the connection error to users (at least giving us the chance to capture it and deal with it), instead of crashing :kafka_ex. A connection timeout normally lasts several seconds or minutes, so the Supervisor's restart strategy can't help in this case.
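Once the error is propagated, caller-side handling could look something like this simple retry-with-backoff sketch (all names hypothetical):

```elixir
defmodule FetchRetrySketch do
  # Retry a fetch-like call a few times with exponential backoff when the
  # connection times out; after that, return whatever the last call gave us.
  def fetch_with_retry(fetch_fun, retries \\ 3, backoff_ms \\ 1_000)

  def fetch_with_retry(fetch_fun, 0, _backoff_ms), do: fetch_fun.()

  def fetch_with_retry(fetch_fun, retries, backoff_ms) do
    case fetch_fun.() do
      {:error, :timeout} ->
        Process.sleep(backoff_ms)
        fetch_with_retry(fetch_fun, retries - 1, backoff_ms * 2)

      other ->
        other
    end
  end
end
```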

@beedub

beedub commented Dec 10, 2017

Any update here? Also seeing this issue during :timeouts.

@joshuawscott
Member

@tzzzoz @beedub does #272 fix this for you?

@joshuawscott
Member

bump @tzzzoz @beedub

@tzzzoz
Author

tzzzoz commented Mar 23, 2018

@joshuawscott Yes, it works for me! Great job!

@tzzzoz tzzzoz closed this Mar 23, 2018
@tzzzoz tzzzoz deleted the fix_fetch_nil_response_exception branch March 23, 2018 10:04