Failover when zookeeper instance is down #3

Open
dumityty opened this issue Jan 11, 2018 · 2 comments

@dumityty

This is more of a general question about failover when using solarium cloud to connect to a few Zookeeper instances.

I have configured 3 Zookeeper servers with 3 shards and 3 replicas; everything is working OK and I am able to connect to them.

My solarium cloud config is the following:

[
  'zkhosts' => 'HOST1:2181,HOST2:2181,HOST3:2181',
  'defaultcollection' => 'COLLECTION_NAME',
]

I am able to use solarium cloud and connect ok, perform queries, etc.
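For context, my usage looks roughly like this (a simplified sketch - I'm writing the client class name from memory, so check the solarium-cloud README for the exact one):

// Sketch of my usage; the class name below is from memory, not verified.
$client = new \Solarium\Cloud\Client([
    'zkhosts'           => 'HOST1:2181,HOST2:2181,HOST3:2181',
    'defaultcollection' => 'COLLECTION_NAME',
]);

// Standard Solarium-style select query.
$query = $client->createSelect();
$query->setQuery('*:*');
$result = $client->select($query);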

But after I finally finished configuring everything and connecting, I decided to test what would happen if one of my instances actually went down - which is the reason for using SolrCloud in the first place.

I have tried the following scenarios: stopping the server altogether, stopping Zookeeper on the server, and stopping Solr on the server while leaving Zookeeper running.

I came to the following conclusions:

  1. If the server itself is completely down and solarium cloud happens to choose that host to direct the query to, then I get an "operation timeout" exception - I assume because port 2181 is not reachable at all, so the timeout limit kicks in.

  2. If I stop Zookeeper on the server and solarium cloud sends the request to that host, then I get "connection loss" - I assume because port 2181 is reachable but the service is not running, so the connection is never established?

  3. If Zookeeper is running but I stop Solr on the server, then everything works fine - if solarium cloud sends a request to that host, Zookeeper figures out that Solr is down and directs the query to another instance which is up.

My question is whether it's actually possible to get it to fail over correctly to the live instances in the first two scenarios. Or am I approaching this the wrong way? Or is it meant to behave that way, and should I handle failover at a different layer?
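In case it helps frame the question, this is roughly the kind of application-level fallback I could do by hand today (just a sketch - the client class name and the generic exception handling are assumptions, not the library's confirmed API):

// Try each Zookeeper host in turn; on "operation timeout" or
// "connection loss" fall through to the next one.
$hosts = ['HOST1:2181', 'HOST2:2181', 'HOST3:2181'];

foreach ($hosts as $i => $host) {
    try {
        $client = new \Solarium\Cloud\Client([ // assumed class name
            'zkhosts'           => $host,
            'defaultcollection' => 'COLLECTION_NAME',
        ]);
        $result = $client->select($client->createSelect()->setQuery('*:*'));
        break; // success, stop trying further hosts
    } catch (\Exception $e) {
        if ($i === count($hosts) - 1) {
            throw $e; // every host failed
        }
        // otherwise: try the next host in the list
    }
}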

Would a correct/possible solution be to put a load balancer in front of the 3 Zookeeper instances, with a health check on port 2181, so that if one of the Zookeepers is not answering, no requests are directed to it?
In that case my "zkhosts" would just be "load_balancer_host:2181".

Not quite sure whether this question is suitable for this issue queue, or whether I should post it on Stack Overflow instead?

Thanks!

@jsteggink
Contributor

Hi,
Thanks for your feedback!

So how it should work is as follows:

  1. If you have multiple Zookeepers, it should round-robin across the servers. If one fails, it should be taken out of the list; when it is live again, it should be added back. This is something that I haven't built yet (see the sketch after this list).
  2. I don't really get what you mean by this one. You stopped the Zookeeper service? Okay, so that means nothing is listening on port 2181. That is effectively the same as turning off the server. Or do you mean something else?
  3. I'm glad this works.
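To make point 1 concrete, the idea would be something like the following (just a sketch of the intended behaviour, not existing library code):

// Sketch only: rotate over ZK hosts, skipping ones marked down.
class ZkHostRotation
{
    private $hosts;
    private $down = array();
    private $cursor = 0;

    public function __construct(array $hosts)
    {
        $this->hosts = $hosts;
    }

    // Next live host, round-robin; throws when every host is down.
    public function next()
    {
        $total = count($this->hosts);
        for ($i = 0; $i < $total; $i++) {
            $host = $this->hosts[$this->cursor];
            $this->cursor = ($this->cursor + 1) % $total;
            if (!isset($this->down[$host])) {
                return $host;
            }
        }
        throw new \RuntimeException('No live Zookeeper hosts');
    }

    // Take a failed host out of the rotation; a periodic health
    // check would clear the flag once the host is live again.
    public function markDown($host)
    {
        $this->down[$host] = time();
    }
}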

Your solution of having a load balancer in front of Zookeeper is what most people do. Usually it's good enough to just have a health check as you describe. Here's an example using HAProxy as a load balancer: https://community.hortonworks.com/articles/139439/load-balance-zookeeper-using-haproxy.html
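The gist of that approach, as a minimal HAProxy sketch (hostnames are placeholders, and the ruok/imok probe is one common way to health-check Zookeeper, so adjust to your setup):

# Round-robin TCP proxying over the ZK ensemble; a node that does
# not answer "imok" to the "ruok" probe is dropped from rotation.
listen zookeeper
    bind *:2181
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send ruok
    tcp-check expect string imok
    server zk1 HOST1:2181 check
    server zk2 HOST2:2181 check
    server zk3 HOST3:2181 check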

I'm very busy at the moment with lots of different projects, so it might take a while before I get to making a software load balancer for ZK. But I'll keep you updated if there's any progress. Just so you know, I wouldn't expect it before the end of this month.

@dumityty
Author

Thanks for the quick reply!
It was more a question to make sure I understand what is happening and that what I am seeing is correct and expected.
Yes, you are right, 1 and 2 are technically the same (in 1 the server itself is down; in 2 the Zookeeper service is stopped, which amounts to the same thing) - I was just surprised to see two different error responses.

Good to know that 1 & 2 behave as expected, given that the load balancer is not implemented yet as you say. I will have a look at the link you posted and will probably end up load balancing the ZK hosts myself.

No worries if you don't have time, and thanks for keeping me updated. (I noticed that the Solarium package itself has a LoadBalancer plugin, so that could be a starting point :) )
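For anyone landing here later, that plugin is used roughly like this (a Solarium 3.x-style sketch; endpoint names and hosts are placeholders - note that it balances Solr HTTP endpoints directly rather than going through Zookeeper):

// Balance across Solr endpoints directly (bypasses Zookeeper).
$client = new Solarium\Client([
    'endpoint' => [
        'solr1' => ['host' => 'SOLR_HOST1', 'port' => 8983, 'path' => '/solr/'],
        'solr2' => ['host' => 'SOLR_HOST2', 'port' => 8983, 'path' => '/solr/'],
    ],
]);

$loadbalancer = $client->getPlugin('loadbalancer');
$loadbalancer->addEndpoint('solr1', 1);  // endpoint key, weight
$loadbalancer->addEndpoint('solr2', 1);
$loadbalancer->setFailoverEnabled(true); // retry another endpoint on failure
$loadbalancer->setFailoverMaxRetries(2);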
