When nodetool command timeout, kill scylla-jmx with sigquit #303

fruch · 2021-02-04T12:26:50Z

We've had a few reports across the board that nodetool commands are getting stuck.

@elcallio suggested that we try to catch these cases and collect enough information when they happen:
scylladb/scylladb#7991 (comment)

bhalevy · 2021-02-07T12:02:47Z

Refs scylladb/scylladb#7244

bhalevy · 2021-02-07T12:03:12Z

@fruch who's working on this?

fruch · 2021-02-07T12:06:43Z

> @fruch who's working on this?

Currently no one... are there more incidents of it getting stuck?

bhalevy · 2021-02-07T12:18:11Z

Yes, this frequently happens in dtest, see scylladb/scylladb#7244
We need to get to this ASAP.

fruch · 2021-02-07T12:20:30Z

I'll take a closer look. Does it mainly happen in the secondary index tests, or in more than one test file / class?

bhalevy · 2021-02-07T13:27:26Z

> I'll take a closer look. Does it mainly happen in the secondary index tests, or in more than one test file / class?

Yes, mostly.
Some recent examples:
https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split005___test_insert_data_after_recreating_cf/
https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split004___test_insert_data_after_recreating_ks/
https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/dtest-release-parallel/68/testReport/limits_test/TestLimits/Run_Dtest_Parallel_Cloud_Machines___FullDtest___split060___max_column_value_size_test/

bhalevy · 2021-02-07T14:01:06Z

Also https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/718/testReport/cfid_test/TestCFID/cfid_test/

fruch · 2021-02-08T08:10:16Z

@elcallio it seems that other sources point to using SIGQUIT (and not SIGINT): https://access.redhat.com/solutions/18178

I wasn't able to reproduce the issue, but I want to make sure I'm using the correct signal (and capturing the correct logs).

When a nodetool command times out, we try to send `SIGQUIT` to get thread dump information into the scylla-jmx stdout. Close: #303 Ref: scylladb/scylladb#7991 (comment)

The `SIGQUIT` is followed too soon by a forceful kill, i.e. we don't give the process enough time to print the output to stdout. Close: #303 Ref: scylladb/scylladb#7991 (comment)

bhalevy · 2021-02-17T09:26:53Z

Unfortunately, we still don't see the stacktrace :-(
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/dtest.log

2021-02-16 13:43:55,626 23284   ccm                            ERROR    | node2: nodetool timeout, going to kill scylla-jmx with SIGQUIT

2021-02-16 13:44:00,633 23284   dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]

But https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613483043675_secondary_indexes_test.TestSecondaryIndexes.test_insert_data_after_recreating_cf/node2_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-4i__py64/test/node2/conf/scylla.yaml
Connecting to http://127.0.60.2:10000
Starting the JMX server
JMX is enabled to receive remote connections on port: 7260

bhalevy · 2021-02-17T09:30:07Z

Interestingly, but likely unrelated, shortly after, in a following test, there's this:

2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613483955639_snapshot_test.TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables/node1_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-f_t8avbo/test/node1/conf/scylla.yaml
Error: JMX connector server communication error: service:jmx:rmi://76836c581099:7177
sun.management.AgentConfigurationError: java.rmi.server.ExportException: Port already in use: 7177; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)
	at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:800)
	at sun.management.jmxremote.ConnectorBootstrap.startRemoteConnectorServer(ConnectorBootstrap.java:468)
	at sun.management.Agent.startAgent(Agent.java:262)
	at sun.management.Agent.startAgent(Agent.java:452)
Caused by: java.rmi.server.ExportException: Port already in use: 7177; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)
	at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:346)
	at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254)
	at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
	at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
	at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:236)
	at sun.management.jmxremote.ConnectorBootstrap$PermanentExporter.exportObject(ConnectorBootstrap.java:199)
	at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:146)
	at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:122)
	at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
	at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:796)
	... 3 more
Caused by: java.net.BindException: Address already in use (Bind failed)
	at java.net.PlainSocketImpl.socketBind(Native Method)
	at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
	at java.net.ServerSocket.bind(ServerSocket.java:375)
	at java.net.ServerSocket.<init>(ServerSocket.java:237)
	at java.net.ServerSocket.<init>(ServerSocket.java:128)
	at sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
	at sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
	at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
	at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335)

Note that a different port is in use.
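
A quick way to check for this condition before starting a node is to try binding the port yourself. The sketch below is a generic check, not part of ccm; `port_in_use` is a hypothetical helper name.

```python
import errno
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding a TCP socket to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other bind failure; do not hide it
    return False
```

For the failure above this would be called as `port_in_use("127.0.77.1", 7177)`. A `True` result only says that something is bound to the port, not which process holds it.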

bhalevy · 2021-02-17T09:41:52Z

It smells like scylla-jmx might not exit cleanly and is holding on to the api port:

bhalevy@dt tmp$ egrep 'Unable to connect to Scylla API server|TimeoutExpired' dtest-745.log
2021-02-16 10:35:32,995 30817   dtest                          DEBUG    | nodetool_additional_test.py:TestNodetool.scrub_with_one_node_expect_data_loss_test - Test failed with errors: [(<nodetool_additional_test.TestNodetool testMethod=scrub_with_one_node_expect_data_loss_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.55.1 -p 7155 scrub ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab86230>))]
2021-02-16 13:44:00,633 23284   dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]
2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]
2021-02-16 13:59:12,626 20989   dtest                          DEBUG    | snapshot_test.py:TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables - Test failed with errors: [(<snapshot_test.TestSchemaFileInSnapshot testMethod=test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab97910>))]
2021-02-16 14:13:00,291 1890    dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_2_1_x.migrate_sstable_with_compact_storage_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_2_1_x testMethod=migrate_sstable_with_compact_storage_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ac84910>))]
2021-02-16 14:24:57,385 32096   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_mc_prepared.migrate_sstable_without_compression_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_mc_prepared testMethod=migrate_sstable_without_compression_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab8ca50>))]
2021-02-16 14:28:56,148 18175   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_md_prepared.migrate_sstable_with_range_tombstone_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_md_prepared testMethod=migrate_sstable_with_range_tombstone_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116abac730>))]
2021-02-16 14:34:15,017 10842   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_x_prepared.migrate_sstable_with_wrong_partitioner_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_x_prepared testMethod=migrate_sstable_with_wrong_partitioner_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab4b050>))]
2021-02-16 15:11:51,915 14222   dtest                          DEBUG    | update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_kill_new_node_while_bootstrapping_test - Test failed with errors: [(<update_cluster_layout_tests.TestUpdateClusterLayout testMethod=simple_kill_new_node_while_bootstrapping_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 status' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a973f00>))]
2021-02-16 15:37:48,009 9598    dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_detector_with_ttl_on_row - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_large_cell_detector_with_ttl_on_row>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab1e050>))]
2021-02-16 15:52:13,261 22285   dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_multiple_rows_with_large_cells_detector - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_multiple_rows_with_large_cells_detector>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f1170357640>))]
2021-02-16 15:58:03,477 6169    dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_SizeTieredCompactionStrategy.test_large_cell_in_materialized_view - Test failed with errors: [(<wide_rows_test.TestWideRows_with_SizeTieredCompactionStrategy testMethod=test_large_cell_in_materialized_view>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9686e0>))]
2021-02-16 16:39:23,616 9355    dtest                          DEBUG    | materialized_views_test.py:TestMaterializedViews.base_replica_repair_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=base_replica_repair_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 repair ks t' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f11686f9cd0>))]
2021-02-16 16:54:31,954 23123   dtest                          DEBUG    | materialized_views_test.py:TestMaterializedViews.drop_mv_during_base_table_writes_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=drop_mv_during_base_table_writes_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f117035c230>))]
2021-02-16 18:15:09,388 26671   dtest                          DEBUG    | bypass_cache_test.py:TestBypassCache.test_alter_table_caching_enable - Test failed with errors: [(<bypass_cache_test.TestBypassCache testMethod=test_alter_table_caching_enable>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9b6b90>))]
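
To see which endpoints the failures cluster on, the same lines can be tallied by host and port. This is a rough sketch, assuming the `-h <host> -p <port>` shape seen in the log above; `failures_by_endpoint` is a hypothetical helper, not an existing dtest utility.

```python
import re
from collections import Counter

# Matches "-h <host> -p <port>" whether the command is logged as a shell
# string or as a Python argv list ('-h', '<host>', '-p', '<port>').
_ENDPOINT_RE = re.compile(r"-h\W+([\d.]+)\W+-p\W+(\d+)")

_MARKERS = ("Unable to connect to Scylla API server", "TimeoutExpired")


def failures_by_endpoint(lines):
    """Count nodetool failures per (host, port) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in _MARKERS):
            continue
        match = _ENDPOINT_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Run over the listing above, most of the failures land on two endpoints, 127.0.72.1:7172 and 127.0.77.1:7177, which fits a leftover process holding on to a port across tests.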

fruch · 2021-02-17T09:52:29Z

@bhalevy, oh great, so it even made things worse?

bhalevy · 2021-02-17T10:01:40Z

> @bhalevy, oh great, so it even made things worse?

I'm not sure, it may have exposed an existing issue.
Looking at scylla_node.do_stop

scylla-ccm/ccmlib/scylla_node.py

Lines 692 to 726 in bc4dced

    
    def do_stop(self, gently=True):
        """
        Stop the node.
          - gently: Let Scylla and Scylla JMX clean up and shut down properly.
            Otherwise do a 'kill -9' which shuts down faster.
        """
        if not self.is_running():
            return False
        self._update_jmx_pid(wait=False)
        if self.scylla_manager and self.scylla_manager.is_agent_available:
            self._update_scylla_agent_pid()
        for proc in [self._process_jmx, self._process_scylla, self._process_agent]:
            if proc:
                if gently:
                    try:
                        proc.terminate()
                    except OSError:
                        pass
                else:
                    try:
                        proc.kill()
                    except OSError:
                        pass
        else:
            signal_mapping = {True: signal.SIGTERM, False: signal.SIGKILL}
            for pid in [self.jmx_pid, self.pid, self.agent_pid]:
                if pid:
                    try:
                        os.kill(pid, signal_mapping[gently])
                    except OSError:
                        pass
        return True

It looks like we're not stopping scylla_jmx if the node is not considered running.
I think that, to be on the safe side, we shouldn't do this optimization and should instead attempt to stop/kill all processes regardless of node.is_running(), certainly with gently=False.
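
As a rough illustration of that suggestion (not the actual ccm change), a stop routine can signal every known pid without first asking whether the node is running. The function `stop_all` and its `pids` argument are hypothetical stand-ins for the jmx, scylla and agent pids that `do_stop` tracks.

```python
import os
import signal


def stop_all(pids, gently=True):
    """Signal every known pid, with no early return when the node is down.

    `pids` stands in for the jmx, scylla and agent pids; entries may be None.
    Returns the list of pids that were actually signalled.
    """
    sig = signal.SIGTERM if gently else signal.SIGKILL
    signalled = []
    for pid in pids:
        if not pid:
            continue
        try:
            os.kill(pid, sig)
        except OSError:
            continue  # process already gone (or not ours); nothing to stop
        signalled.append(pid)
    return signalled
```

With this shape, a dead scylla process no longer prevents scylla-jmx from being stopped: the dead pid is skipped and the remaining ones still get the signal.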

fruch · 2021-02-17T10:09:47Z

@bhalevy so 5 seconds wasn't enough for it to actually produce the thread dump...

fruch · 2021-02-17T10:10:50Z

> @bhalevy so 5 seconds wasn't enough for it to actually produce the thread dump...

Is this happening only on runs on monster, i.e. when the whole suite is run in parallel?

bhalevy · 2021-02-17T10:28:01Z

> Is this happening only on runs on monster, i.e. when the whole suite is run in parallel?

The timeouts happen also with no parallelism.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release-parallel/33/artifact/logs-release.005/dtest.log
that's with `NOSE_PROCESSES=1`

bhalevy · 2021-02-17T10:29:35Z

The port-in-use issue could be unrelated.
If the scylla process dies for any reason we will not stop scylla_jmx.

bhalevy · 2021-02-17T11:29:45Z

> @bhalevy so 5 seconds wasn't enough for it to actually produce the thread dump...

@fruch I doubt that it's a matter of time.
Dumping the stack trace should be instantaneous.
Maybe it's hung so hard it doesn't even respond to the signal.
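
One way to test that theory on Linux is to look at the signal masks in `/proc/<pid>/status` while the process is stuck. The sketch below only parses that file's text; `parse_signal_state` is a hypothetical helper, and it assumes the usual `State`, `SigBlk`, `SigIgn` and `SigCgt` fields, where bit `n - 1` of each hex mask corresponds to signal `n`.

```python
def parse_signal_state(status_text, signum=3):
    """Report process state and how `signum` (default 3, SIGQUIT) is treated.

    `status_text` is the content of /proc/<pid>/status (Linux only).
    """
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    bit = 1 << (signum - 1)

    def has(name):
        # Missing fields are treated as an empty mask.
        return bool(int(fields.get(name, "0"), 16) & bit)

    return {
        "state": fields.get("State", "?"),
        "blocked": has("SigBlk"),
        "ignored": has("SigIgn"),
        "caught": has("SigCgt"),
    }
```

If `caught` is false, no `SIGQUIT` handler is installed and no dump should be expected. Note that `SigBlk` describes only the thread whose status file is read, so it says little about the other threads of a JVM.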

bhalevy · 2021-02-17T12:20:54Z

Cc @penberg

fruch added the enhancement label Feb 4, 2021

fruch changed the title ~~When nodetool command timeout, kill scylla-jmx with sigint~~ When nodetool command timeout, kill scylla-jmx with sigquit Feb 8, 2021

fruch mentioned this issue Feb 8, 2021

fix(#303): collect more info when nodetool raises TimeoutExpired #304

Merged

fruch closed this as completed in #304 Feb 9, 2021

fruch mentioned this issue Feb 14, 2021

fix(#304): wait a bit more when killing scylla-jmx on timeout #307

Merged

bhalevy reopened this Feb 17, 2021

bhalevy mentioned this issue Feb 17, 2021

scylla_node do_stop does not stop scylla_jmx if the scylla node is not running #309

Closed
