
roachprod/failure-injection: add initial framework for failure injection library #140548

Open · wants to merge 4 commits into master from fi-lib

Conversation

@DarrylWong (Contributor) commented Feb 5, 2025

This PR adds the initial framework for the failure injection library within roachprod. It introduces the failures package, which defines the FailureMode interface. A FailureMode describes a failure that can be injected into a roachprod cluster, along with how to revert it. It also adds the first supported failure: iptables network partitions.

See individual commits for details.

Release note: none
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-46439
Informs: #138970
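
For orientation, here is a rough sketch of the FailureMode interface described above. Only Cleanup's signature appears verbatim in this PR (see the diff excerpts below); Setup, Inject, and Restore are mentioned in the discussion, but their signatures here are assumed to mirror Cleanup's and may not match the actual code.

// FailureMode describes a failure that can be injected into a roachprod
// cluster and how to revert it. Signatures other than Cleanup are assumed.
type FailureMode interface {
	// Setup installs any dependencies needed to inject the failure.
	Setup(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Inject applies the failure to the cluster, e.g. iptables rules.
	Inject(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Restore reverts the injected failure.
	Restore(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Cleanup uninstalls any dependencies that were installed by Setup.
	Cleanup(ctx context.Context, l *logger.Logger, args FailureArgs) error
}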


I have a WIP branch here with rough implementations of the CLI, roachtest refactoring, and disk stall failures if you're curious how that works. I wanted to keep this PR small though to keep things reviewable + get feedback before I get too deep.

@cockroach-teamcity (Member): This change is Reviewable

@DarrylWong force-pushed the fi-lib branch 4 times, most recently from 70dc215 to 50ca456 on February 7, 2025 16:18
@DarrylWong marked this pull request as ready for review February 7, 2025 17:50
@DarrylWong requested a review from a team as a code owner February 7, 2025 17:50
@DarrylWong requested review from herkolategan and srosenberg and removed request for a team February 7, 2025 17:50
// TODO(darryl): In the future, roachtests should interact with the failure injection library
// through helper functions in roachtestutil so they don't have to interface with roachprod
// directly.
failure, err := fr.GetFailure(c.MakeNodes(), t.failureName, l, c.IsSecure())
Member:

Nit: GetFailureMode and failureMode to disambiguate? Seeing both failure and err in the same scope can be confusing.

Contributor Author:

Done.

}

// Helper function that uses nmap to check if a connection between two nodes is blocked.
func checkPortBlocked(
@srosenberg (Member) commented Feb 13, 2025:

This could be lifted to a (lib) helper if it also takes the port arg.

Contributor Author:

Done

// Run a light workload in the background so we have some traffic in the database.
c.Run(ctx, option.WithNodes(c.WorkloadNode()), "./cockroach workload init tpcc --warehouses=100 {pgurl:1}")
t.Go(func(goCtx context.Context, l *logger.Logger) error {
return c.RunE(goCtx, option.WithNodes(c.WorkloadNode()), "./cockroach workload run tpcc --tolerate-errors --warehouses=100 {pgurl:1-3}")
Member:

With --tolerate-errors, it's essentially equivalent to assert true. Is that intended?

Contributor Author:

Yeah, my line of thinking here is that while it's possible our failure injection does cause a legitimate cockroach issue, it's out of the scope of this test to catch it. That is, hopefully any actual failures will be caught by much more complex tests that do more than just spin up a TPCC workload. This way the test is more a signal of "does the failure injection library work?" and less muddied by "does cockroach work under chaos?"

@shailendra-patel (Contributor) left a comment:

Nice work @DarrylWong. I have left some minor comments.

For future enhancements from an observability perspective, we should consider adding Grafana annotations and Datadog events as part of the Inject and Restore failure methods when running failure injection on a roachtest/DRT cluster. This would help with tracking and monitoring of failure events. It's definitely not required as part of this PR, but it should be given a thought.

}

type IPTablesPartitionNode struct {
c *install.SyncedCluster
Contributor:

nit: Can we rename IPTablesPartitionNode to something like IPTablesPartitionFailure? This would enhance code readability and intuitively indicate that it implements the FailureMode interface.

Contributor Author:

Done.


// Cleanup uninstalls any dependencies that were installed by Setup.
Cleanup(ctx context.Context, l *logger.Logger, args FailureArgs) error
}
Contributor:

Should we consider including ValidateFailure, ValidateRestore, and WaitForFailureToPropagate as part of the FailureMode interface? This would spare the user of a FailureMode from writing these validation checks themselves, and ensure that anyone implementing a new failure mode is required to provide them.
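
One possible shape for this suggestion, reusing the signature style of Cleanup above; the method names come from this comment and the later discussion, but the signatures are illustrative only.

type FailureMode interface {
	// ... Setup, Inject, Restore, Cleanup as before ...

	// ValidateFailure verifies that the failure is actually in effect.
	ValidateFailure(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// ValidateRestore verifies that the failure has been fully reverted.
	ValidateRestore(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// WaitForFailureToPropagate blocks until the failure has taken full effect.
	WaitForFailureToPropagate(ctx context.Context, l *logger.Logger, args FailureArgs) error
}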

return err
}
if !blocked {
return fmt.Errorf("expected connections from node 1 to node 3 to be blocked")
Contributor:

nit: Throughout, errors.New/errors.Errorf can be used instead of fmt.Errorf, since we are not using any placeholders here.

Contributor Author:

Done.
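
For reference, a minimal illustration of the substitution the nit above describes, assuming the cockroachdb errors package (which provides both New and Errorf); the helper name and arguments here are made up for the example.

package failures

import "github.com/cockroachdb/errors"

// expectBlocked shows the suggested substitution: errors.Errorf once
// placeholders are actually needed, errors.New for a constant message.
func expectBlocked(blocked bool, src, dst int) error {
	if blocked {
		return nil
	}
	// With placeholders, errors.Errorf; for a constant message,
	// errors.New("...") would replace fmt.Errorf the same way.
	return errors.Errorf("expected connections from node %d to node %d to be blocked", src, dst)
}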

// runWithDetails is a wrapper function for SyncedCluster.RunWithDetails.
func (f *IPTablesPartitionNode) runWithDetails(
ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) (install.RunResultDetails, error) {
Contributor:

Can we refactor run and runDetails to be utility functions instead of receiver methods? This would allow other failure modes to utilize them in the future.

Contributor:

Not sure I agree with utilities, as they're a nightmare to mock if we come to that.

But maybe have an embedded generic struct that would hold the SyncedCluster and support these functions?

type GenericFailure struct {
  c *install.SyncedCluster
}
func (f *IPTablesPartitionNode) run(
	ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) error {
	...
}

type IPTablesPartitionNode struct {
	GenericFailure
}

This way, you wouldn't have to rewrite them for all the failure types, but you could also override them for specific failure types if necessary.

Contributor Author:

Done, thanks for the suggestion!
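
To make the embedding suggestion concrete, here is a self-contained sketch of how method promotion would work; a stand-in runner replaces install.SyncedCluster so the example compiles on its own, and all names besides GenericFailure are illustrative.

package main

import (
	"context"
	"fmt"
)

// GenericFailure carries shared state and helpers every failure mode needs.
// In the real library this would wrap *install.SyncedCluster.
type GenericFailure struct {
	runCmd func(ctx context.Context, cmd string) error // stand-in for the cluster's run method
}

// run is written once here and promoted to every failure type that embeds GenericFailure.
func (f *GenericFailure) run(ctx context.Context, cmd string) error {
	return f.runCmd(ctx, cmd)
}

// IPTablesPartitionFailure gets run "for free" via embedding, and could still
// override it if a specific failure mode needs different behavior.
type IPTablesPartitionFailure struct {
	GenericFailure
}

func main() {
	f := IPTablesPartitionFailure{GenericFailure{
		runCmd: func(ctx context.Context, cmd string) error {
			fmt.Println("would run:", cmd)
			return nil
		},
	}}
	_ = f.run(context.Background(), "sudo iptables -L") // promoted method
}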

Type: failures.Bidirectional,
}},
},
waitForFailureToPropagate: 5 * time.Second,
Collaborator:

Alternatively, we could have a retry loop (possibly with exponential backoff) that polls for the failure, with a maximum time after which the failure is considered not to have occurred; i.e., if the failure has not taken effect within 30 seconds, fail the test.
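
A minimal sketch of that retry idea using only the standard library; checkFailureInEffect is a hypothetical probe (e.g. the nmap port check) and the backoff bounds are arbitrary.

package failures

import (
	"context"
	"fmt"
	"time"
)

// waitForFailure polls a probe with exponential backoff and gives up with an
// error if the failure has not taken effect within maxWait.
func waitForFailure(
	ctx context.Context,
	checkFailureInEffect func(context.Context) (bool, error),
	maxWait time.Duration,
) error {
	deadline := time.Now().Add(maxWait)
	backoff := time.Second
	for time.Now().Before(deadline) {
		inEffect, err := checkFailureInEffect(ctx)
		if err != nil {
			return err
		}
		if inEffect {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > 10*time.Second {
			backoff = 10 * time.Second
		}
	}
	return fmt.Errorf("failure did not take effect within %s", maxWait)
}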

@golgeek (Contributor) left a comment:

Pretty cool stuff!

I'm a bit bothered by the fact that the test writer has to specify a period to wait for after the failure has been applied. This seems like it will involve a lot of guesswork based on trial and error.

Maybe the framework could monitor the state of the cluster and send an event on a channel when something happens to the cluster state (e.g. a node drop)?
This way the test could just call f.WaitForFailureEffects(ctx.WithTimeout()).
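
One way the channel idea could look, sketched with standard-library pieces only; clusterEvent, monitorCluster, and the probe function are hypothetical names, not part of this PR.

package failures

import (
	"context"
	"time"
)

// clusterEvent is a hypothetical notification that something happened to the
// cluster state (e.g. a node dropped).
type clusterEvent struct{ description string }

// monitorCluster polls a probe once a second and signals on the returned
// channel the first time it observes a change, then stops.
func monitorCluster(
	ctx context.Context, probe func(context.Context) (changed bool, desc string),
) <-chan clusterEvent {
	events := make(chan clusterEvent, 1)
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if changed, desc := probe(ctx); changed {
					events <- clusterEvent{description: desc}
					return
				}
			}
		}
	}()
	return events
}

// WaitForFailureEffects blocks until the monitor reports an event or the
// test-supplied context times out.
func WaitForFailureEffects(ctx context.Context, events <-chan clusterEvent) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-events:
		return nil
	}
}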


// Drop all outgoing traffic to the ip address.
asymmetricOutputPartitionCmd = `
sudo iptables %[1]s OUTPUT -s {ip:%[2]d} -p tcp --dport {pgport:%[2]d} -j DROP;
Contributor:

I think you meant -d instead of -s on both these rules.

Contributor Author:

Oops! The whole point of the smoke test is to catch silly mistakes like this, except I didn't write one for outgoing traffic 🙃. Nice catch! Fixed and added a smoke test.
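
For clarity, the corrected rule as the comment describes it: for the OUTPUT chain the remote node is the destination, so the match is -d (plus --dport) rather than -s. Shown in the same format as the constant above; the final wording in the PR may differ.

// Drop all outgoing traffic to the ip address.
asymmetricOutputPartitionCmd = `
sudo iptables %[1]s OUTPUT -d {ip:%[2]d} -p tcp --dport {pgport:%[2]d} -j DROP;
`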

func (f *IPTablesPartitionNode) PacketsDropped(
ctx context.Context, l *logger.Logger, node install.Nodes,
) (int, error) {
res, err := f.runWithDetails(ctx, l, node, "sudo iptables -L -v -n")
Contributor:

Maybe add -x to get exact numbers; otherwise, with a lot of dropped packets, the output will be rounded to the nearest K/M/G (e.g. 1104K instead of 1104123).

Member:

We should ideally grab only the network interface that's being exercised. Otherwise, there is a potential to introduce noise.

Contributor Author:

Both suggestions SGTM, but I actually just removed the function entirely for now. That was originally added in the network.go tests as a check to make sure the rules were actually applied. Once the test switches to the failure injection library, it shouldn't have to own that validation anymore. Will add it back in if needed.

This helper will be used in the new failure injection
library.
// CheckPortBlocked returns true if a connection from a node to a port on another node
// is blocked. Requires nmap to be installed.
func CheckPortBlocked(
ctx context.Context, l *logger.Logger, c cluster.Cluster, fromNode, toNode option.NodeListOption, port string,
Member:

Nit: typically port is a (positive) int; the weaker type could lead to some typos or API abuse.

Contributor Author:

Hmm, I intentionally left it as a string so you could take advantage of roachprod expanders, e.g. {pgport:1}. Not that there's anything wrong with calling

ports, err := c.SQLPorts()
srcPort := ports[0]
...

but seemed like extra boilerplate to me. Don't feel strongly about it though, happy to switch to an int if we prefer.
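
For example, a hypothetical call (not code from this PR) showing what the string signature buys: the expander is resolved by roachprod, so the caller never has to look up the port itself.

// {pgport:3} is expanded to node 3's SQL port when the command is run.
blocked, err := CheckPortBlocked(ctx, l, c, c.Node(1), c.Node(3), "{pgport:3}")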

failureName: failures.IPTablesNetworkPartitionName,
args: failures.NetworkPartitionArgs{
Partitions: []failures.NetworkPartition{{
Source: install.Nodes{1},
Member:

Granted it's a "smoke test", I wonder if it would be strengthened by randomizing a pair of nodes instead of hardcoding them.

Contributor Author:

Done.

@srosenberg (Member) left a comment:

Nice work! Great comments from the team, thanks! I don't see anything fundamentally unsound with the first iteration. There are a number of suggestions/improvements, which could be addressed in subsequent PRs, unless other folks feel strongly about something that should be done here.

@DarrylWong force-pushed the fi-lib branch 3 times, most recently from 5631dad to b7499cf on February 19, 2025 21:21
@DarrylWong force-pushed the fi-lib branch 2 times, most recently from b9544e5 to 8721ab3 on February 19, 2025 21:35
@DarrylWong (author) commented:

There were several similar suggestions about how to wait for failure propagation (thanks!), so I'll respond to all of them here. I've moved WaitForFailureToPropagate and WaitForFailureToRestore to FailureMode interface methods. It makes total sense that these are things the failure injection testing framework might want to do as well.

I like the idea of monitoring the state of the cluster and sending an event on a channel instead of just sleeping. I'm not sure, without more investigation, what this would actually entail though. e.g. if we start a disk stall, how do we know from the state of the cluster that the disk stall is in full effect? We could check QPS or disk I/O and return when it appears to have stabilized, but what if there are external workloads running in parallel? How do we do it in a stable enough way that doesn't take 5-10+ minutes and effectively become a validation test? Perhaps iptables is not the most interesting first failure for this exercise, since the rules take effect immediately.

All that isn't to say I don't think we should do it, but rather I'm not sure how to do it without more experimenting and I'm open to ideas :).

> Alternatively, we could have a retry loop

Good idea. As mentioned, iptables rules take effect immediately so I didn't use it here, but I will steal this idea for the other failures.

> we should consider adding Grafana annotations and Datadog

Good idea. I already added them at the roachtest level on my WIP branch since they were helpful in debugging, but this is a good reminder that I should probably just add them at the failure injection library level. Will do in a follow-up!

This commit adds the framework for the failure injection
library, as well as the first supported failure: iptables
network partitions.

This failure can be used on roachprod clusters to create
bidirectional and asymmetric network partitions between
node(s).
This registry will allow for future usage of the failure
injection library through the CLI and the failure injection
planner/controller.
This adds an integration test for the failure injection library.
The test spins up a cluster and randomly selects a failure to
inject. It then validates that the failure was correctly injected.
Afterwards, it reverts the failure and validates that the failure
was correctly cleaned up.
@golgeek (Contributor) left a comment:

🚀
