roachprod/failure-injection: add initial framework for failure injection library #140548
Conversation
// TODO(darryl): In the future, roachtests should interact with the failure injection library
// through helper functions in roachtestutil so they don't have to interface with roachprod
// directly.
failure, err := fr.GetFailure(c.MakeNodes(), t.failureName, l, c.IsSecure())
Nit: `GetFailureMode` and `failureMode` to disambiguate? Seeing both `failure` and `err` in the same scope can be confusing.
Done.
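(For reference, the renamed call would look roughly like the following, assuming the argument list stays the same:

failureMode, err := fr.GetFailureMode(c.MakeNodes(), t.failureName, l, c.IsSecure())
)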
// Helper function that uses nmap to check if connections between two nodes are blocked.
func checkPortBlocked(
This could be lifted to a (lib) helper if it also takes the port arg.
Done
// Run a light workload in the background so we have some traffic in the database.
c.Run(ctx, option.WithNodes(c.WorkloadNode()), "./cockroach workload init tpcc --warehouses=100 {pgurl:1}")
t.Go(func(goCtx context.Context, l *logger.Logger) error {
	return c.RunE(goCtx, option.WithNodes(c.WorkloadNode()), "./cockroach workload run tpcc --tolerate-errors --warehouses=100 {pgurl:1-3}")
With `--tolerate-errors`, it's essentially equivalent to `assert true`. Is that intended?
Yeah, my line of thinking here is that while it's possible our failure injection here does cause a legitimate cockroach issue, it's out of the scope of this test to catch it. That is, hopefully any actual failures here will be caught by much more complex tests that do more than just spin up a TPCC workload. This way, this test is more of a signal of "does the failure injection library work?" and less muddied by "does cockroach work under chaos?"
Nice work @DarrylWong. I have left some minor comments.
For future enhancements from an observability perspective, we should consider adding Grafana annotations and Datadog events as part of the Inject and Restore failure methods when running failure injection as part of the roachtest/DRT cluster. This will help with better tracking and monitoring of failure events. This is definitely not required as part of this PR, but it should be given some thought.
type IPTablesPartitionNode struct {
	c *install.SyncedCluster
nit: Can we rename `IPTablesPartitionNode` to something like `IPTablesPartitionFailure`? This would enhance code readability and intuitively indicate that it implements the FailureMode interface.
Done.
// Cleanup uninstalls any dependencies that were installed by Setup.
Cleanup(ctx context.Context, l *logger.Logger, args FailureArgs) error
}
Should we consider including `ValidateFailure`, `ValidateRestore`, and `WaitForFailureToPropagate` as part of the `FailureMode` interface? This would relieve the user of the `FailureMode` from having to implement these validation functions and ensure that anyone implementing a new failure mode is required to write them.
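Concretely, a sketch of what the extended interface could look like (the Cleanup signature is taken from this diff; the other signatures and the proposed additions are assumptions, not the actual code):

type FailureMode interface {
	// Setup installs any dependencies needed to inject the failure.
	Setup(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Inject applies the failure to the cluster.
	Inject(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Restore reverts the injected failure.
	Restore(ctx context.Context, l *logger.Logger, args FailureArgs) error
	// Cleanup uninstalls any dependencies that were installed by Setup.
	Cleanup(ctx context.Context, l *logger.Logger, args FailureArgs) error

	// Proposed additions (hypothetical signatures):
	WaitForFailureToPropagate(ctx context.Context, l *logger.Logger, args FailureArgs) error
	ValidateFailure(ctx context.Context, l *logger.Logger, args FailureArgs) error
	ValidateRestore(ctx context.Context, l *logger.Logger, args FailureArgs) error
}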
return err
}
if !blocked {
	return fmt.Errorf("expected connections from node 1 to node 3 to be blocked")
nit: Throughout, `errors.New`/`errors.Errorf` can be used instead of `fmt.Errorf` here, as we are not using any placeholders.
Done.
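(i.e. the check above becomes something like the following, using errors from github.com/cockroachdb/errors:

if !blocked {
	return errors.New("expected connections from node 1 to node 3 to be blocked")
}
)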
// runWithDetails is a wrapper function for SyncedCluster.RunWithDetails.
func (f *IPTablesPartitionNode) runWithDetails(
	ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) (install.RunResultDetails, error) {
Can we refactor `run` and `runWithDetails` to be utility functions instead of receiver methods? This would allow other failure modes to utilize them in the future.
Not sure I agree with utilities as it's a nightmare to mock if we come to that.
But maybe have an embedded generic struct that would support the SyncedCluster and these functions?

type GenericFailure struct {
	c *install.SyncedCluster
}

func (f *GenericFailure) run(
	ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) error {
	...
}

type IPTablesPartitionNode struct {
	GenericFailure
}

This way, you wouldn't have to rewrite them for all the failure types, but you could also override them for specific failure types if necessary.
Done, thanks for the suggestion!
Type: failures.Bidirectional,
}},
},
waitForFailureToPropagate: 5 * time.Second,
Alternatively, we could have a retry loop (possibly with exponential backoff) that polls for the failure, with a maximum time after which it counts as the failure not occurring; i.e., if the failure has not taken effect within 30 seconds, fail the test.
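A rough sketch of that polling loop (illustrative only; validateFailure is a hypothetical callback that reports whether the failure is visible on the cluster):

// Poll until the failure takes effect, backing off exponentially; if it has
// not taken effect within 30 seconds, treat it as the failure not occurring.
deadline := time.Now().Add(30 * time.Second)
backoff := time.Second
for {
	ok, err := validateFailure(ctx)
	if err != nil {
		return err
	}
	if ok {
		break
	}
	if time.Now().After(deadline) {
		return errors.New("failure did not take effect within 30s")
	}
	time.Sleep(backoff)
	backoff *= 2
}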
Pretty cool stuff!
I'm a bit bothered by the fact that the test writer has to specify a period to wait for after the failure has been applied. This seems like it will be a lot of guesswork based on trial and error.
Maybe the framework could monitor the state of the cluster and send an event on a channel when something happens to the cluster state (e.g. node drop)?
This way the test could just `f.WaitForFailureEffects(ctx.WithTimeout())`.
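Roughly, such a wait could be a select on a framework-owned event channel (entirely hypothetical; no ClusterEvents API exists in this PR):

// Hypothetical: the framework watches cluster state (node drops, connection
// resets, etc.) and signals on a channel when the failure's effects are seen.
waitCtx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()
select {
case ev := <-f.ClusterEvents():
	l.Printf("observed failure effect: %v", ev)
case <-waitCtx.Done():
	return errors.New("timed out waiting for failure effects")
}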
// Drop all outgoing traffic to the ip address.
asymmetricOutputPartitionCmd = `
sudo iptables %[1]s OUTPUT -s {ip:%[2]d} -p tcp --dport {pgport:%[2]d} -j DROP;
I think you meant `-d` instead of `-s` on both these rules.
Oops! Whole point of the smoke test is to catch silly mistakes like this except I didn't write one for outgoing traffic 🙃. Nice catch! Fixed and added a smoke test.
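For reference, the corrected rule would look roughly like this (a sketch; the actual fix in the PR may differ):

// Drop all outgoing traffic to the ip address: match the destination (-d) of
// outgoing packets rather than the source.
asymmetricOutputPartitionCmd = `
sudo iptables %[1]s OUTPUT -d {ip:%[2]d} -p tcp --dport {pgport:%[2]d} -j DROP;
`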
func (f *IPTablesPartitionNode) PacketsDropped(
	ctx context.Context, l *logger.Logger, node install.Nodes,
) (int, error) {
	res, err := f.runWithDetails(ctx, l, node, "sudo iptables -L -v -n")
Maybe add `-x` to get exact numbers; else in case of a lot of dropped packets, the output will be rounded to the nearest K/M/G (e.g. `1104K` instead of `1104123`).
We should ideally grab only the network interface that's being exercised. Otherwise, there is a potential to introduce noise.
Both suggestions SGTM, but I actually just removed the function entirely for now. That was originally added in the `network.go` tests as a check to make sure the rules were actually applied. Once the test switches to the failure injection library, it shouldn't have to own that validation anymore. Will add it back in if needed.
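If the helper does come back, a rough sketch of the exact-counter parsing (the output format assumptions here are the editor's, not the PR's):

// With -x, the first column of each rule line in `iptables -L INPUT -v -n -x`
// is the exact packet count; sum the counts of the DROP rules.
func parseDroppedPackets(output string) (int, error) {
	total := 0
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) > 2 && fields[2] == "DROP" {
			pkts, err := strconv.Atoi(fields[0])
			if err != nil {
				return 0, err
			}
			total += pkts
		}
	}
	return total, nil
}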
This helper will be used in the new failure injection library.
// CheckPortBlocked returns true if a connection from a node to a port on another node
// cannot be established. Requires nmap to be installed.
func CheckPortBlocked(
	ctx context.Context, l *logger.Logger, c cluster.Cluster, fromNode, toNode option.NodeListOption, port string,
Nit: typically `port` is a (positive) `int`; the weaker type could lead to some typos or API abuse.
Hmm, I intentionally left it as a string so you could take advantage of roachprod expanders, e.g. `{pgport:1}`. Not that there's anything wrong with calling

ports, err := c.SQLPorts()
srcPort := ports[0]
...

but it seemed like extra boilerplate to me. Don't feel strongly about it though, happy to switch to an int if we prefer.
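For context, a rough sketch of how an nmap-based check can use the expander (the command and output parsing are illustrative, not the actual implementation):

// Hypothetical: from fromNode, probe toNode's port with nmap's grepable
// output and treat anything other than "open" as blocked. An error from RunE
// here is interpreted as "port not open", i.e. blocked.
cmd := fmt.Sprintf("nmap -oG - -p %[1]s {ip:%[2]d} | grep -q '%[1]s/open'", port, toNode[0])
err := c.RunE(ctx, option.WithNodes(fromNode), cmd)
blocked := err != nil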
failureName: failures.IPTablesNetworkPartitionName,
args: failures.NetworkPartitionArgs{
	Partitions: []failures.NetworkPartition{{
		Source: install.Nodes{1},
Granted it's a "smoke test", I wonder if randomizing a pair of nodes instead of hardcoding them would strengthen it.
Done.
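For example, a sketch of picking a random pair (hypothetical; field names other than Source and Type are guesses):

// Choose two distinct nodes at random instead of hardcoding nodes 1 and 3.
// n is the number of CockroachDB nodes in the cluster.
src := rand.Intn(n) + 1
dst := rand.Intn(n) + 1
for dst == src {
	dst = rand.Intn(n) + 1
}
partition := failures.NetworkPartition{
	Source:      install.Nodes{install.Node(src)},
	Destination: install.Nodes{install.Node(dst)},
	Type:        failures.Bidirectional,
}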
Nice work! Great comments from the team, thanks! I don't see anything fundamentally unsound with the first iteration. There are a number of suggestions/improvements, which could be addressed in subsequent PRs, unless other folks feel strongly about something that should be done here.
Lots of somewhat similar suggestions about how to wait for failure propagation (thanks!), so I'll just respond generally to all here. I've moved `WaitForFailureToPropagate` into the `FailureMode` interface. I like the idea of having the framework detect when the failure has actually taken effect rather than the test writer guessing a wait period. All that isn't to say I don't think we should do it, but rather that I'm not sure how to do it without more experimenting, and I'm open to ideas :).
Good idea. Like mentioned, see my general response above.
Good idea, I already added them in at the roachtest level on my WIP branch since it was helpful in debugging, but this is a good reminder that I should probably just add them in at the failure injection library level. Will do in a followup!
This commit adds the framework for the failure injection library, as well as the first supported failure: iptables network partitions. This failure can be used on roachprod clusters to create bidirectional and asymmetric network partitions between node(s).
This registry will allow for future usage of the failure injection library through the CLI and the failure injection planner/controller.
This adds an integration test for the failure injection library. The test spins up a cluster and randomly selects a failure to inject. It then validates that the failure was correctly injected. Afterwards, it reverts the failure and validates that the failure was correctly cleaned up.
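For readers skimming the thread, the test flow described above looks roughly like this (method names follow the discussion; exact signatures are assumptions, not the actual code):

// Illustrative only: fetch the failure mode from the registry, inject it,
// validate, then revert and validate the cleanup.
failureMode, err := fr.GetFailureMode(c.MakeNodes(), failures.IPTablesNetworkPartitionName, l, c.IsSecure())
if err != nil {
	t.Fatal(err)
}
args := failures.NetworkPartitionArgs{ /* ... */ }
if err := failureMode.Setup(ctx, l, args); err != nil {
	t.Fatal(err)
}
if err := failureMode.Inject(ctx, l, args); err != nil {
	t.Fatal(err)
}
// ... validate the failure took effect ...
if err := failureMode.Restore(ctx, l, args); err != nil {
	t.Fatal(err)
}
// ... validate the failure was cleaned up ...
if err := failureMode.Cleanup(ctx, l, args); err != nil {
	t.Fatal(err)
}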
🚀
This PR adds the initial framework for the failure injection library within roachprod. The `failures` package can now be used, which adds the `FailureMode` interface. A `FailureMode` describes a failure that can be injected into a roachprod cluster along with how to revert the failure. Additionally, it also adds the first supported failure: `iptables` network partitions.

See individual commits for details.
Release note: none
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-46439
Informs: #138970
I have a WIP branch here with rough implementations of the CLI, roachtest refactoring, and disk stall failures if you're curious how that works. I wanted to keep this PR small though to keep things reviewable + get feedback before I get too deep.