Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster.populate(n).start() does not wait for nodes to start #215

Open
kbr- opened this issue Jan 31, 2020 · 1 comment
Open

cluster.populate(n).start() does not wait for nodes to start #215

kbr- opened this issue Jan 31, 2020 · 1 comment
Assignees

Comments

@kbr-
Copy link

kbr- commented Jan 31, 2020

and in effect some dtests are failing.

e.g. with the following setup: scylla-ccm d1e62ba (next and master at the moment of writing this issue) + scylla-dtest https://github.com/scylladb/scylla-dtest/commit/cd83fd8f8956befd1af256b91dee6456093413c0 (next and master at the moment of writing this issue)

running:

nosetests -v -s concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.snapshot_test

gives:

2020-01-31 15:08:22.142366 going to run tests sequentially
2020-01-31 15:08:22.142422 using the SingleClusterIdAllocator
snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges) ... 2020-01-31 15:08:22.143590 cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
2020-01-31 15:08:22.143644 Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
2020-01-31 15:08:22.189404 snapshot_test()
ERROR
2020-01-31 15:08:28.350926 Test failed with exception: <class 'cassandra.cluster.NoHostAvailable'>: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
2020-01-31 15:08:28.354925 stopping ccm cluster test at: /home/kbraun/.dtest/dtest-2k9Sq9

======================================================================
ERROR: snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kbraun/dev/scylla-dtest/concurrent_schema_changes_test.py", line 499, in snapshot_test
    session = self.cql_connection(node1)
  File "/home/kbraun/dev/scylla-dtest/dtest.py", line 704, in cql_connection
    protocol_version, port=port, ssl_opts=ssl_opts, **kwargs)
  File "/home/kbraun/dev/scylla-dtest/dtest.py", line 742, in _create_session
    session = cluster.connect()
  File "cassandra/cluster.py", line 1335, in cassandra.cluster.Cluster.connect
    with self._lock:
  File "cassandra/cluster.py", line 1371, in cassandra.cluster.Cluster.connect
    raise
  File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
    self.control_connection.connect()
  File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
    self._set_new_connection(self._reconnect_internal())
  File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
-------------------- >> begin captured logging << --------------------
dtest: DEBUG:  - going to run tests sequentially
dtest: DEBUG:  - using the SingleClusterIdAllocator
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - snapshot_test()
cassandra.cluster: WARNING: Cluster.__init__ called with contact_points specified, but load-balancing policies are not specified in some ExecutionProfiles. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['127.0.0.1'], EPs without explicit LBPs = ('EXEC_P
ROFILE_DEFAULT',))
cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.1:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 2928, in cassandra.cluster.ControlConnection._reconnect_internal
    return self._try_connect(host)
  File "cassandra/cluster.py", line 2950, in cassandra.cluster.ControlConnection._try_connect
    connection = self._cluster.connection_factory(host.endpoint, is_control_connection=True)
  File "cassandra/cluster.py", line 1292, in cassandra.cluster.Cluster.connection_factory
    return self.connection_class.factory(endpoint, self.connect_timeout, *args, **kwargs)
  File "cassandra/connection.py", line 471, in cassandra.connection.Connection.factory
    conn = cls(endpoint, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
    self._connect_socket()
  File "cassandra/connection.py", line 514, in cassandra.connection.Connection._connect_socket
    raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror or sockerr))
error: [Errno 111] Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused
cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
    self.control_connection.connect()
  File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
    self._set_new_connection(self._reconnect_internal())
  File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 7.293s

FAILED (errors=1)

Here's how the test starts:

    def snapshot_test(self):
        debug("snapshot_test()")
        cluster = self.cluster
        cluster.set_configuration_options(values={'experimental': True})
        cluster.populate(2).start()
        node1, node2 = cluster.nodelist()
        wait(2)
        session = self.cql_connection(node1)

By reading the code and debugging, I arrived at the conclusion that the reason is 83bdc2d, specifically these lines:

-        if no_wait and not verbose:
-            # waiting 2 seconds to check for early errors and for the
-            # pid to be set
-            time.sleep(2)
-        else:
-            for node, p, mark in started:
-                start_message = "Starting listening for CQL clients"
-                try:
-                    # updated code, scylla starts CQL only by default
-                    # process should not be checked for scylla as the
-                    # process is a boot script (that ends after boot)
-                    node.watch_log_for(start_message, timeout=600,
-                                       verbose=verbose, from_mark=mark)
-                except RuntimeError:
-                    raise Exception("Not able to find start "
-                                    "message '%s' in Node '%s'" %
-                                    (start_message, node.name))

the else branch was causing the wait in older ccm versions (which caused start() to take ~25 seconds longer to finish than it currently does). I guess this is now responsible for the wait:

        if wait_for_binary_proto:
            for node, _, mark in started:
                node.watch_log_for("Starting listening for CQL clients",
                                   verbose=verbose, from_mark=mark)

but

  1. waiting for binary proto should probably be different/something more than just waiting for a certain message appearing in the node's logs (see wait_for_binary_interface in ccmlib/node.py)
  2. lots of dtests do cluster.populate(n).start() (so wait_for_binary_proto=False) and probably assume that the cluster is ready after this:
[kbraun@localhost scylla-dtest]$ grep -r "populate.*start()" | wc -l
185
@kbr-
Copy link
Author

kbr- commented Jan 31, 2020

I don't completely understand what's happening though, because e.g. snapshot_test, which fails on my laptop 100% of the time when trying to run dtests using branches scylla-ccm:master and scylla-dtest:master, sometimes passes on BYO:
https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/909/consoleFull

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants