
roachprod/failure-injection: add initial framework for failure injection library #140548

Open · wants to merge 4 commits into master from fi-lib

Conversation

@DarrylWong (Contributor) commented Feb 5, 2025

This PR adds the initial framework for the failure injection library within roachprod. It introduces the failures package, which defines the FailureMode interface. A FailureMode describes a failure that can be injected into a roachprod cluster, along with how to revert it. It also adds the first supported failure: iptables network partitions.

See individual commits for details.

Release note: none
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-46439
Informs: #138970
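
For orientation, here is a rough sketch of the FailureMode interface described above. Only Cleanup's signature appears verbatim in this PR (see the diff excerpts below); Setup, Inject, and Restore are mentioned in the discussion, but their signatures here are assumed to mirror Cleanup's and may not match the actual code.

// FailureMode describes a failure that can be injected into a roachprod
// cluster and how to revert it. Signatures other than Cleanup are assumed.
type FailureMode interface {
	// Setup installs any dependencies needed to inject the failure.
	Setup(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Inject applies the failure to the cluster, e.g. iptables rules.
	Inject(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Restore reverts the injected failure.
	Restore(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Cleanup uninstalls any dependencies that were installed by Setup.
	Cleanup(ctx context.Context, l *logger.Logger, args FailureArgs) error
}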


I have a WIP branch here with rough implementations of the CLI, roachtest refactoring, and disk stall failures if you're curious how that works. I wanted to keep this PR small though to keep things reviewable + get feedback before I get too deep.

@cockroach-teamcity (Member): This change is Reviewable

@DarrylWong force-pushed the fi-lib branch 4 times, most recently from 70dc215 to 50ca456 on February 7, 2025 16:18
@DarrylWong marked this pull request as ready for review February 7, 2025 17:50
@DarrylWong requested a review from a team as a code owner February 7, 2025 17:50
@DarrylWong requested review from herkolategan and srosenberg and removed request for a team February 7, 2025 17:50
// TODO(darryl): In the future, roachtests should interact with the failure injection library
// through helper functions in roachtestutil so they don't have to interface with roachprod
// directly.
failure, err := fr.GetFailure(c.MakeNodes(), t.failureName, l, c.IsSecure())
Member:

Nit: GetFailureMode and failureMode to disambiguate? Seeing both failure and err in the same scope can be confusing.

Contributor Author:

Done.

}

// Helper function that uses nmap to check if a connection between two nodes is blocked.
func checkPortBlocked(
@srosenberg (Member) commented Feb 13, 2025:

This could be lifted to a (lib) helper if it also takes the port arg.

Contributor Author:

Done

// Run a light workload in the background so we have some traffic in the database.
c.Run(ctx, option.WithNodes(c.WorkloadNode()), "./cockroach workload init tpcc --warehouses=100 {pgurl:1}")
t.Go(func(goCtx context.Context, l *logger.Logger) error {
return c.RunE(goCtx, option.WithNodes(c.WorkloadNode()), "./cockroach workload run tpcc --tolerate-errors --warehouses=100 {pgurl:1-3}")
Member:

With --tolerate-errors, it's essentially equivalent to assert true. Is that intended?

Contributor Author:

Yeah, my line of thinking here is that while it's possible our failure injection does cause a legitimate cockroach issue, it's out of the scope of this test to catch it. That is, hopefully any actual failures will be caught by much more complex tests that do more than just spin up a TPCC workload. This way the test is more a signal of "does the failure injection library work?" and less muddied by "does cockroach work under chaos?"

@shailendra-patel (Contributor) left a comment:

Nice work @DarrylWong. I have left some minor comments.

For future enhancements from an observability perspective, we should consider adding Grafana annotations and Datadog events as part of the Inject and Restore failure methods when running failure injection on a roachtest/DRT cluster. This would help with tracking and monitoring of failure events. It's definitely not required as part of this PR, but it should be given a thought.

}

type IPTablesPartitionNode struct {
c *install.SyncedCluster
Contributor:

nit: Can we rename IPTablesPartitionNode to something like IPTablesPartitionFailure? This would enhance code readability and intuitively indicate that it implements the FailureMode interface.

Contributor Author:

Done.


// Cleanup uninstalls any dependencies that were installed by Setup.
Cleanup(ctx context.Context, l *logger.Logger, args FailureArgs) error
}
Contributor:

Should we consider including ValidateFailure, ValidateRestore, and WaitForFailureToPropagate as part of the FailureMode interface? This would spare the user of a FailureMode from writing these validation checks themselves, and ensure that anyone implementing a new failure mode is required to provide them.
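
One possible shape for this suggestion, reusing the signature style of Cleanup above; the method names come from this comment and the later discussion, but the signatures are illustrative only.

type FailureMode interface {
	// ... Setup, Inject, Restore, Cleanup as before ...

	// ValidateFailure verifies that the failure is actually in effect.
	ValidateFailure(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// ValidateRestore verifies that the failure has been fully reverted.
	ValidateRestore(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// WaitForFailureToPropagate blocks until the failure has taken full effect.
	WaitForFailureToPropagate(ctx context.Context, l *logger.Logger, args FailureArgs) error
}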

return err
}
if !blocked {
return fmt.Errorf("expected connections from node 1 to node 3 to be blocked")
Contributor:

nit: Throughout, errors.New/errors.Errorf can be used instead of fmt.Errorf, since we are not using any placeholders here.

Contributor Author:

Done.
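
For reference, a minimal illustration of the substitution the nit above describes, assuming the cockroachdb errors package (which provides both New and Errorf); the helper name and arguments here are made up for the example.

package failures

import "github.com/cockroachdb/errors"

// expectBlocked shows the suggested substitution: errors.Errorf once
// placeholders are actually needed, errors.New for a constant message.
func expectBlocked(blocked bool, src, dst int) error {
	if blocked {
		return nil
	}
	// With placeholders, errors.Errorf; for a constant message,
	// errors.New("...") would replace fmt.Errorf the same way.
	return errors.Errorf("expected connections from node %d to node %d to be blocked", src, dst)
}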

// runWithDetails is a wrapper function for SyncedCluster.RunWithDetails.
func (f *IPTablesPartitionNode) runWithDetails(
ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) (install.RunResultDetails, error) {
Contributor:

Can we refactor run and runDetails to be utility functions instead of receiver methods? This would allow other failure modes to utilize them in the future.

Contributor:

Not sure I agree with utilities, as they're a nightmare to mock if we come to that.

But maybe have an embedded generic struct that would hold the SyncedCluster and support these functions?

type GenericFailure struct {
  c *install.SyncedCluster
}
func (f *IPTablesPartitionNode) run(
	ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) error {
	...
}

type IPTablesPartitionNode struct {
	GenericFailure
}

This way, you wouldn't have to rewrite them for all the failure types, but you could also override them for specific failure types if necessary.

Contributor Author:

Done, thanks for the suggestion!
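
To make the embedding suggestion concrete, here is a self-contained sketch of how method promotion would work; a stand-in runner replaces install.SyncedCluster so the example compiles on its own, and all names besides GenericFailure are illustrative.

package main

import (
	"context"
	"fmt"
)

// GenericFailure carries shared state and helpers every failure mode needs.
// In the real library this would wrap *install.SyncedCluster.
type GenericFailure struct {
	runCmd func(ctx context.Context, cmd string) error // stand-in for the cluster's run method
}

// run is written once here and promoted to every failure type that embeds GenericFailure.
func (f *GenericFailure) run(ctx context.Context, cmd string) error {
	return f.runCmd(ctx, cmd)
}

// IPTablesPartitionFailure gets run "for free" via embedding, and could still
// override it if a specific failure mode needs different behavior.
type IPTablesPartitionFailure struct {
	GenericFailure
}

func main() {
	f := IPTablesPartitionFailure{GenericFailure{
		runCmd: func(ctx context.Context, cmd string) error {
			fmt.Println("would run:", cmd)
			return nil
		},
	}}
	_ = f.run(context.Background(), "sudo iptables -L") // promoted method
}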

Type: failures.Bidirectional,
}},
},
waitForFailureToPropagate: 5 * time.Second,
Collaborator:

Alternatively, we could have a retry loop (possibly with exponential backoff) that polls for the failure, with a maximum time after which the failure is considered not to have occurred; i.e., if the failure has not taken effect within 30 seconds, fail the test.
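
A minimal sketch of that retry idea using only the standard library; checkFailureInEffect is a hypothetical probe (e.g. the nmap port check) and the backoff bounds are arbitrary.

package failures

import (
	"context"
	"fmt"
	"time"
)

// waitForFailure polls a probe with exponential backoff and gives up with an
// error if the failure has not taken effect within maxWait.
func waitForFailure(
	ctx context.Context,
	checkFailureInEffect func(context.Context) (bool, error),
	maxWait time.Duration,
) error {
	deadline := time.Now().Add(maxWait)
	backoff := time.Second
	for time.Now().Before(deadline) {
		inEffect, err := checkFailureInEffect(ctx)
		if err != nil {
			return err
		}
		if inEffect {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > 10*time.Second {
			backoff = 10 * time.Second
		}
	}
	return fmt.Errorf("failure did not take effect within %s", maxWait)
}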

@golgeek (Contributor) left a comment:

Pretty cool stuff!

I'm a bit bothered by the fact that the test writer has to specify a period to wait for after the failure has been applied. This seems like it will involve a lot of guesswork based on trial and error.

Maybe the framework could monitor the state of the cluster and send an event on a channel when something happens to the cluster state (e.g. a node drop)?
This way the test could just call f.WaitForFailureEffects(ctx.WithTimeout()).
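
One way the channel idea could look, sketched with standard-library pieces only; clusterEvent, monitorCluster, and the probe function are hypothetical names, not part of this PR.

package failures

import (
	"context"
	"time"
)

// clusterEvent is a hypothetical notification that something happened to the
// cluster state (e.g. a node dropped).
type clusterEvent struct{ description string }

// monitorCluster polls a probe once a second and signals on the returned
// channel the first time it observes a change, then stops.
func monitorCluster(
	ctx context.Context, probe func(context.Context) (changed bool, desc string),
) <-chan clusterEvent {
	events := make(chan clusterEvent, 1)
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if changed, desc := probe(ctx); changed {
					events <- clusterEvent{description: desc}
					return
				}
			}
		}
	}()
	return events
}

// WaitForFailureEffects blocks until the monitor reports an event or the
// test-supplied context times out.
func WaitForFailureEffects(ctx context.Context, events <-chan clusterEvent) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-events:
		return nil
	}
}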


// Drop all outgoing traffic to the ip address.
asymmetricOutputPartitionCmd = `
sudo iptables %[1]s OUTPUT -s {ip:%[2]d} -p tcp --dport {pgport:%[2]d} -j DROP;
Contributor:

I think you meant -d instead of -s on both these rules.

Contributor Author:

Oops! The whole point of the smoke test is to catch silly mistakes like this, except I didn't write one for outgoing traffic 🙃. Nice catch! Fixed and added a smoke test.
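
For clarity, the corrected rule as the comment describes it: for the OUTPUT chain the remote node is the destination, so the match is -d (plus --dport) rather than -s. Shown in the same format as the constant above; the final wording in the PR may differ.

// Drop all outgoing traffic to the ip address.
asymmetricOutputPartitionCmd = `
sudo iptables %[1]s OUTPUT -d {ip:%[2]d} -p tcp --dport {pgport:%[2]d} -j DROP;
`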

func (f *IPTablesPartitionNode) PacketsDropped(
ctx context.Context, l *logger.Logger, node install.Nodes,
) (int, error) {
res, err := f.runWithDetails(ctx, l, node, "sudo iptables -L -v -n")
Contributor:

Maybe add -x to get exact numbers; otherwise, with a lot of dropped packets, the output will be rounded to the nearest K/M/G (e.g. 1104K instead of 1104123).

Member:

We should ideally grab only the network interface that's being exercised. Otherwise, there is a potential to introduce noise.

Contributor Author:

Both suggestions SGTM, but I actually just removed the function entirely for now. That was originally added in the network.go tests as a check to make sure the rules were actually applied. Once the test switches to the failure injection library, it shouldn't have to own that validation anymore. Will add it back in if needed.

This helper will be used in the new failure injection
library.
// CheckPortBlocked returns true if a connection from a node to a port on another node
// is blocked. Requires nmap to be installed.
func CheckPortBlocked(
ctx context.Context, l *logger.Logger, c cluster.Cluster, fromNode, toNode option.NodeListOption, port string,
Member:

Nit: typically port is a (positive) int; the weaker type could lead to some typos or API abuse.

Contributor Author:

Hmm, I intentionally left it as a string so you could take advantage of roachprod expanders, e.g. {pgport:1}. Not that there's anything wrong with calling

ports, err := c.SQLPorts()
srcPort := ports[0]
...

but seemed like extra boilerplate to me. Don't feel strongly about it though, happy to switch to an int if we prefer.
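
For example, a hypothetical call (not code from this PR) showing what the string signature buys: the expander is resolved by roachprod, so the caller never has to look up the port itself.

// {pgport:3} is expanded to node 3's SQL port when the command is run.
blocked, err := CheckPortBlocked(ctx, l, c, c.Node(1), c.Node(3), "{pgport:3}")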

failureName: failures.IPTablesNetworkPartitionName,
args: failures.NetworkPartitionArgs{
Partitions: []failures.NetworkPartition{{
Source: install.Nodes{1},
Member:

Granted it's a "smoke test", I wonder if it would be strengthened by randomizing a pair of nodes instead of hardcoding them.

Contributor Author:

Done.

@srosenberg (Member) left a comment:

Nice work! Great comments from the team, thanks! I don't see anything fundamentally unsound with the first iteration. There are a number of suggestions/improvements, which could be addressed in subsequent PRs, unless other folks feel strongly about something that should be done here.

@DarrylWong force-pushed the fi-lib branch 3 times, most recently from 5631dad to b7499cf on February 19, 2025 21:21
@DarrylWong force-pushed the fi-lib branch 2 times, most recently from b9544e5 to 8721ab3 on February 19, 2025 21:35
@DarrylWong (author) commented:

There were several similar suggestions about how to wait for failure propagation (thanks!), so I'll respond to all of them here. I've moved WaitForFailureToPropagate and WaitForFailureToRestore to FailureMode interface methods. It makes total sense that these are things the failure injection testing framework might want to do as well.

I like the idea of monitoring the state of the cluster and sending an event on a channel instead of just sleeping. I'm not sure, without more investigation, what this would actually entail though. e.g. if we start a disk stall, how do we know from the state of the cluster that the disk stall is in full effect? We could check QPS or disk I/O and return when it appears to have stabilized, but what if there are external workloads running in parallel? How do we do it in a stable enough way that doesn't take 5-10+ minutes and effectively become a validation test? Perhaps iptables is not the most interesting first failure for this exercise, since the rules take effect immediately.

All that isn't to say I don't think we should do it, but rather I'm not sure how to do it without more experimenting and I'm open to ideas :).

> Alternatively, we could have a retry loop

Good idea. As mentioned, iptables rules take effect immediately so I didn't use it here, but I will steal this idea for the other failures.

> we should consider adding Grafana annotations and Datadog

Good idea. I already added them at the roachtest level on my WIP branch since they were helpful in debugging, but this is a good reminder that I should probably just add them at the failure injection library level. Will do in a follow-up!

This commit adds the framework for the failure injection
library, as well as the first supported failure: iptables
network partitions.

This failure can be used on roachprod clusters to create
bidirectional and asymmetric network partitions between
node(s).
This registry will allow for future usage of the failure
injection library through the CLI and the failure injection
planner/controller.
This adds an integration test for the failure injection library.
The test spins up a cluster and randomly selects a failure to
inject. It then validates that the failure was correctly injected.
Afterwards, it reverts the failure and validates that the failure
was correctly cleaned up.
@golgeek (Contributor) left a comment:

🚀
