
Bad bridge HTTP responses when Kafka cluster is not running/not reachable #488

Open
ppatierno opened this issue Oct 29, 2020 · 5 comments

@ppatierno
Member

It turned out that the HTTP bridge behaves quite badly when the Kafka cluster is not running or not reachable by the bridge itself.
Since the addition of the admin client endpoint, the bridge tries to establish a connection with the Kafka cluster, which of course fails if the cluster is not running/not reachable.
In this scenario, if an HTTP client sends requests to the bridge, the following responses are returned:

  • on healthy and ready endpoints, 200 OK
  • on create, subscribe, get records, 200 OK
  • on send messages, 404 NOT FOUND

Of course, these responses are wrong. The bridge is not working properly due to the lack of a connection to the Kafka cluster.
I would say that if this happens, the HTTP server accepting requests should either not start at all, so it is not reachable by HTTP clients, or at least return proper error codes, and above all an error on the healthy and ready endpoints.
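As a sketch of the mapping being argued for here (hypothetical, not the bridge's actual code), assuming we can tell whether Kafka is reachable:

```python
# Hypothetical sketch of the status codes the bridge arguably should
# return, keyed on whether Kafka is reachable. Endpoint names follow
# the issue text; none of this is the bridge's real implementation.
def expected_status(endpoint: str, kafka_reachable: bool) -> int:
    if endpoint == "/healthy":
        return 200  # the bridge process itself is alive either way
    if kafka_reachable:
        return 200
    # /ready and all data-plane endpoints (produce, subscribe, poll, ...)
    return 503  # Service Unavailable, instead of today's 200/404 mix
```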

@ppatierno ppatierno added the bug label Oct 29, 2020
@scholzj
Member

scholzj commented Oct 30, 2020

I do not think being healthy and having the HTTP port opened is necessarily wrong. I would just fix the HTTP return codes to return some corresponding errors.

@strimzi strimzi deleted a comment from quickbooks2018 Jan 4, 2021
@tombentley
Member

I don't see it as a problem that the healthy and ready endpoints return 200 OK. If Kafka is down that's not a problem with the bridge. Restarting the bridge won't help.

I do think we need to look carefully at what the actual API methods are doing in this situation. My gut feeling is that the vertx client could be hiding certain errors. E.g. what does poll do? Anywhere where the API is returning 200 is suspect because it means the client cannot know that there's a problem.

@ppatierno
Member Author

From my point of view, it really depends on what users expect from the healthy and ready endpoints of the bridge.
As you said, reporting 200 OK would just mean that the bridge is fine ... healthy, and ready ... but it's kind of useless because Kafka is not running or there are some connection issues. Is that information really helpful?
I see the bridge as a door for Kafka (via HTTP) so it should reflect the Kafka "status".
I could agree that healthy is 200 OK, but I would say that ready should be 503 Service Unavailable, as in: "I am ok, I am healthy, but I am not ready to take your requests, because what's behind me, Kafka, is not working, so it's useless for you to start sending requests for producing/consuming messages".
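One crude way to back such a readiness check is a plain TCP connect to the bootstrap address (a sketch under the assumption that connectivity is a good-enough proxy for "Kafka is up"; the bridge itself would use the admin client, and a TCP connect says nothing about leaders, replication, or authorization):

```python
import socket


def kafka_reachable(bootstrap: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the bootstrap server succeeds.

    Purely a connectivity probe; a hypothetical stand-in for the admin
    client's metadata request, not the bridge's real check.
    """
    host, _, port = bootstrap.partition(":")
    try:
        with socket.create_connection((host, int(port or "9092")),
                                      timeout=timeout):
            return True
    except OSError:
        return False
```

A `/ready` handler could then return 503 whenever this probe fails.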

@tombentley
Member

So it sounds like you agree for /healthy.

I kinda agree with you about /ready, except that it rests on the assumption that we can know when Kafka is "up". /ready is consumed by Kube when deciding how the Service should be balanced, right? But because the bridge is stateful (for consumers, at least) you can't really have a Service selecting >1 bridge pod. So /ready returning non-200 does not directly tell the client of the service anything that it couldn't learn by actually talking to the bridge and getting an error code (and a more specific error message). The difference is that with your definition of /ready:

  • it takes longer for the clients to be able to use the service once Kafka is "up" (since Kube doesn't poll the endpoint continually)
  • you have to define "up"

But "up" is not well defined at all. You're having to add some extra code to poll metadata just to decide whether the bridge is ready, and actually there are loads of ways you can get that metadata but still not be able to serve the client (e.g. it wants to produce to a partition which lacks a leader, there are not enough replicas, or the client is not authorized for the topic). So this definition of ready doesn't seem to achieve very much for the user, except hiding what could be a more useful status message.

So I think /ready is a distraction; the behaviour of poll etc. in these conditions is what we should really care about and have tests for.

@ppatierno
Member Author

Just making the point that the bridge is supposed to run outside of Kubernetes as well, where there is no concept of probes and service selection anymore.
That would be a reason why /ready could be meaningful for the client, to know whether the bridge is actually ready to take requests and forward them to Kafka.
On one side I would agree that it's actually hard to know if Kafka is up and running, but on the other side returning 200 OK from ready doesn't make much sense. In the end, it becomes a kind of useless endpoint that overlaps with the healthy one (which could be used by the Kubernetes probe anyway).
