Failover when zookeeper instance is down #3
This is more of a general question about failover when using Solarium Cloud to connect to a few Zookeeper instances.
I have configured 3 Zookeeper servers with 3 shards and 3 replicas; everything is working and I am able to connect to them.
My Solarium Cloud config is the following:
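A minimal sketch of its shape, assuming a client that takes the Zookeeper ensemble as a comma-separated zkhosts option (the class and host names here are placeholders):

```php
<?php
require 'vendor/autoload.php';

// Sketch only: the client class name is a placeholder and the hostnames
// are illustrative; the relevant part is handing the client all three
// Zookeeper nodes via the comma-separated "zkhosts" option.
$config = [
    'zkhosts' => 'zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181',
];
$client = new \Solarium\Cloud\Client($config);
```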
I am able to use Solarium Cloud and connect fine, perform queries, etc.
But after finally finishing configuring everything and connecting, I decided to test what would happen if one of my instances were to actually go down - the reason for using SolrCloud in the first place.
I have tried the following scenarios: stopping the server altogether, stopping Zookeeper on the server, and stopping Solr on the server while leaving Zookeeper running.
And I came to the following conclusions:
1. If the server itself is completely down and Solarium Cloud happens to choose that host to direct the query to, then I get an "operation timeout" exception - I assume because port 2181 is not reachable at all, so the timeout limit kicks in.
2. If I stop Zookeeper on the server and Solarium Cloud sends the request to that host, then I get "connection loss" - I assume because port 2181 is reachable but the service is not running, so the connection is never established.
3. If Zookeeper is running but I stop Solr on the server, then everything works fine - if Solarium Cloud sends a request to that host, Zookeeper figures out that Solr is down and directs the query to another instance that is up, so everything works in this case.
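One way to see the difference between the first two cases from the command line is Zookeeper's ruok four-letter command (hostname illustrative):

```
# Healthy node:                    prints "imok"
# Host up but Zookeeper stopped:   fails fast (connection refused)
# Host completely down:            hangs until the 5 second timeout
echo ruok | nc -w 5 zk1.example.com 2181
```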
My question is whether it's actually possible to get it to fail over correctly to the live instances in the first two scenarios? Or am I approaching this the wrong way? Or is it meant to behave that way and I should handle failover at a different layer?
Would a correct/possible solution be to put a load balancer in front of the 3 Zookeeper instances, with a health check on port 2181, so that if one of the Zookeepers is not answering, no requests are directed to it?
In that case my "zkhosts" would be "load_balancer_host:2181".
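As a sketch of that idea, assuming HAProxy as the balancer and Zookeeper's four-letter ruok command as the health probe (a healthy node answers imok; hostnames are illustrative):

```
# Clients point "zkhosts" at this single endpoint: load_balancer_host:2181
listen zookeeper
    bind *:2181
    mode tcp
    balance roundrobin
    # Keep a node in rotation only while it answers Zookeeper's "ruok"
    # four-letter command with "imok".
    option tcp-check
    tcp-check send ruok
    tcp-check expect string imok
    server zk1 zk1.example.com:2181 check
    server zk2 zk2.example.com:2181 check
    server zk3 zk3.example.com:2181 check
```

(On recent Zookeeper releases the four-letter commands must be whitelisted, e.g. 4lw.commands.whitelist=ruok, before the probe will get an answer.)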
I'm not quite sure whether this question is suitable for this issue queue, or whether I should post it on Stack Overflow instead?
Thanks!

Comments

Hi! So how it should work is as follows: your solution of having a load balancer in front of Zookeeper is what most people do. Usually it's good enough to just have a health check as you describe. Here's an example using HAproxy as a load balancer: https://community.hortonworks.com/articles/139439/load-balance-zookeeper-using-haproxy.html I'm very busy at the moment with lots of different projects, so it might take a while before I get to making a software load balancer for ZK. But I'll keep you updated if there's any progress. I wouldn't expect it before the end of this month, just so you know.

Thanks for the quick reply! Good to know that 1 & 2 work as expected, since the load balancer is not implemented yet, as you say. I will have a look at the link you posted and will probably end up load balancing the ZK hosts myself. No worries if you don't have time, and thanks for keeping me updated. (I noticed that the Solarium package itself has a LoadBalancer plugin, so that could be a starting point :) )
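For reference, Solarium's loadbalancer plugin distributes queries over Solr endpoints rather than ZK hosts, but it already has weighted endpoints and failover with retries, so it shows the shape such a component could take. A rough sketch against plain Solarium (endpoint details are illustrative, and client construction differs between Solarium versions):

```php
<?php
require 'vendor/autoload.php';

// Rough sketch of plain Solarium's loadbalancer plugin; endpoint details
// are illustrative and construction differs between Solarium versions.
$config = [
    'endpoint' => [
        'solr1' => ['host' => 'solr1.example.com', 'port' => 8983, 'path' => '/solr', 'core' => 'mycore'],
        'solr2' => ['host' => 'solr2.example.com', 'port' => 8983, 'path' => '/solr', 'core' => 'mycore'],
    ],
];
$client = new Solarium\Client($config);

$loadbalancer = $client->getPlugin('loadbalancer');
$loadbalancer->addEndpoint('solr1', 100); // endpoint key, weight
$loadbalancer->addEndpoint('solr2', 100);
$loadbalancer->setFailoverEnabled(true);  // try another endpoint on failure
$loadbalancer->setFailoverMaxRetries(2);

// Queries issued through the client are now spread across the endpoints.
$result = $client->select($client->createSelect());
```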