Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cassandra-stress] creation code of the schema isn't safe #356

Open
1 of 2 tasks
fruch opened this issue Nov 28, 2023 · 9 comments
Open
1 of 2 tasks

[cassandra-stress] creation code of the schema isn't safe #356

fruch opened this issue Nov 28, 2023 · 9 comments

Comments

@fruch
Copy link
Contributor

fruch commented Nov 28, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

when multiple machines are using the same command we can end up in situation some of the command would fail like this:

java.lang.RuntimeException: Encountered exception creating schema
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:100)
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:69)
	at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:228)
	at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
	at org.apache.cassandra.stress.Stress.run(Stress.java:143)
	at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'keyspace1' does not exist
	at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:49)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
	at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:190)
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:86)
	... 5 more
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'keyspace1' does not exist
	at com.datastax.driver.core.Responses$Error.asException(Responses.java:181)
	at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:215)
	at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:229)
	at com.datastax.driver.core.RequestHandler.access$2600(RequestHandler.java:63)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:1011)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:814)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1289)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1207)
	at com.datastax.shaded.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at com.datastax.shaded.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at com.datastax.shaded.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
	at com.datastax.shaded.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at com.datastax.shaded.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1304)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:921)
	at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:135)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)

since LOCAL_ONE is used on those functions, we might get into cases the 2nd command is run with a node that didn't yet got the update about the schema change.

All those should be using QURUM, or at least same Consistency Level as the user requested in the command

    /**
     * Create Keyspace with Standard and Super/Counter column families
     */
    public void createKeySpacesNative(StressSettings settings)
    {

        JavaDriverClient client  = settings.getJavaDriverClient(false);

        try
        {
            //Keyspace
            client.execute(createKeyspaceStatementCQL3(), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            //Add standard1
            client.execute(createStandard1StatementCQL3(settings), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            System.out.println(String.format("Created keyspaces. Sleeping %ss for propagation.", settings.node.nodes.size()));
            Thread.sleep(settings.node.nodes.size() * 1000L); // seconds
        }
        catch (AlreadyExistsException e)
        {
            //Ok.
        }
        catch (Exception e)
        {
            throw new RuntimeException("Encountered exception creating schema", e);
        }
    }

Impact

cause of it, kind of randomly we have jobs that are failing on of the stress commands

How frequently does it reproduce?

Happens once in a while, not easily reproduced

Installation details

Kernel Version: 5.15.0-1050-aws
Scylla version (or git commit hash): 2023.3.0~dev-20231123.45a20d3f1b34 with build-id d8a9c10e69a4a724c9eb2ac44ce1ded995fbea6e

Cluster size: 3 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • perf-latency-nemesis-ubuntu-db-node-a1a475f2-3 (3.238.92.222 | 10.12.2.190) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-a1a475f2-2 (44.199.195.232 | 10.12.0.27) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-a1a475f2-1 (3.208.71.200 | 10.12.1.169) (shards: 7)

OS / Image: ami-096202c204a4676eb (aws: undefined_region)

Test: scylla-enterprise-perf-regression-latency-650gb-with-nemesis
Test id: a1a475f2-e20f-49af-86c5-1f2655194683
Test name: scylla-enterprise/scylla-enterprise-perf-regression-latency-650gb-with-nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a1a475f2-e20f-49af-86c5-1f2655194683
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a1a475f2-e20f-49af-86c5-1f2655194683

Logs:

Jenkins job URL
Argus

@tgrabiec
Copy link
Contributor

Is this with consistent_cluster_management: true? If not, concurrent schema changes are not supported by scylla and this is a known issue since forver.

@tgrabiec
Copy link
Contributor

"Consistency Level" is ignored by DDL statements, the driver waits for schema agreement.

@tgrabiec
Copy link
Contributor

Another problem is that the stress command is not prepared to handle concurrent schema setup properly:

    try
        {
            //Keyspace
            client.execute(createKeyspaceStatementCQL3(), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
[1] ^
            client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            //Add standard1
            client.execute(createStandard1StatementCQL3(settings), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
[2] ^
            System.out.println(String.format("Created keyspaces. Sleeping %ss for propagation.", settings.node.nodes.size()));
            Thread.sleep(settings.node.nodes.size() * 1000L); // seconds
        }
        catch (AlreadyExistsException e)
        {
            //Ok.
        }

It catches AlreadyExistsException and assumes it's OK, but if [1] throws it, it doesn't guarantee that [2] was already executed.

@fruch
Copy link
Contributor Author

fruch commented Nov 29, 2023

Is this with consistent_cluster_management: true? If not, concurrent schema changes are not supported by scylla and this is a known issue since forver.

it's enable:

consistent_cluster_management: true

@tgrabiec
Copy link
Contributor

What's the scylla-tools version?

@fruch
Copy link
Contributor Author

fruch commented Nov 29, 2023

What's the scylla-tools version?

Version: 2022.2.dev-0.20220330.eef4cbb20a51

@mykaul
Copy link
Contributor

mykaul commented Nov 29, 2023

What's the scylla-tools version?

Version: 2022.2.dev-0.20220330.eef4cbb20a51

Probably should sync it to the scylla-tools we release (say, in latest 2022.2.x). We don't change it that often though (we just did, due to 3rd party dependency, but we also IIRC bumped up the driver version)

@roydahan
Copy link
Collaborator

I think that it happens here:

client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

roydahan added a commit that referenced this issue Nov 30, 2023
Stop using "Use keyspace" since it can fail when using multiple
stress commands in parallel due to a race condition.

fixes: #356
@fruch
Copy link
Contributor Author

fruch commented Feb 11, 2024

this boils down to this core issue: scylladb/scylladb#16909

and the fix for it in: scylladb/scylladb#16969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants