Implement sleep failpoint status #47

pchan · 2023-05-18T18:58:09Z

This PR is based out of @ramil600 PR #37 and is intended to address etcd-io/etcd#14729.

The current goal is to update Ramil's code so that it runs successfully. Further updates will incorporate the changes requested in #37 (comment).

cc: @ramil600 @ahrtr @serathius @lavacat

pchan · 2023-05-25T07:26:30Z

Incorporated the following changes:

Use runtime.Status instead of StatusCount
Status returns int instead of string, and the tests have been updated to match. In http, I convert the int to a string and then to bytes because that was simple; I can drop this and convert to bytes directly.
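
For reference, the two conversions mentioned above could look roughly like this. This is a minimal sketch using only the standard library; the helper names countViaString and countDirect are made up for illustration and are not part of the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// countViaString goes through an intermediate string, which is the
// simple approach described in the comment above.
func countViaString(count int) []byte {
	return []byte(strconv.Itoa(count))
}

// countDirect appends the decimal digits straight into a byte slice
// and skips the intermediate string.
func countDirect(count int) []byte {
	return strconv.AppendInt(nil, int64(count), 10)
}

func main() {
	fmt.Printf("%s %s\n", countViaString(11), countDirect(11))
}
```

Both helpers produce the same bytes; the direct form only avoids one temporary string.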

There was one comment about using subtests. I will investigate it.

serathius · 2023-05-25T08:36:12Z

runtime/http.go

+			fp := key[:len(key)-len("/count")]
+			_, count, err := Status(fp)
+			if err != nil {
+				http.Error(w, "failed to GET: "+err.Error(), http.StatusNotFound)


ErrNoExist is not the only error that Status can return. Please handle errors properly so that 400-class responses are kept separate from 500-class ones.

I didn't understand this comment, especially the part about separating 400 and 500. For error handling, I have changed the code to return a specific error where one is known and to fall back to a generic "catch-all" 500 error otherwise (link)
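
One way to read the request is that a missing failpoint is a client-side problem (404), while anything unexpected is a server-side problem (500). The sketch below illustrates that split under stated assumptions: errNoExist and lookupCount are hypothetical stand-ins for the runtime's error value and the Status call, and the query-parameter routing is a simplification of the path handling in the diff.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errNoExist is a hypothetical stand-in for the "failpoint not found" error.
var errNoExist = errors.New("failpoint: failpoint does not exist")

// lookupCount is a hypothetical stand-in for the Status call in the diff.
func lookupCount(fp string) (int, error) {
	switch fp {
	case "ExampleString":
		return 11, nil
	case "broken":
		return 0, errors.New("unexpected internal failure")
	default:
		return 0, errNoExist
	}
}

// statusCodeFor maps a known "not found" error to 404 and every other
// error to a catch-all 500.
func statusCodeFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errNoExist):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

// countHandler writes the count, or an error with the mapped status code.
func countHandler(w http.ResponseWriter, r *http.Request) {
	count, err := lookupCount(r.URL.Query().Get("fp"))
	if err != nil {
		http.Error(w, "failed to GET: "+err.Error(), statusCodeFor(err))
		return
	}
	fmt.Fprintf(w, "%d", count)
}

func main() {
	for _, fp := range []string{"ExampleString", "missing", "broken"} {
		rec := httptest.NewRecorder()
		req := httptest.NewRequest(http.MethodGet, "/count?fp="+fp, nil)
		countHandler(rec, req)
		fmt.Println(fp, rec.Code)
	}
}
```

Keeping the mapping in one small function makes it easy to add further specific errors later without touching the handler.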

serathius · 2023-05-25T08:38:56Z

runtime/termscounter_test.go

+			fp:       "ExampleString",
+			desc:     `10*sleep(10)->1*return("abc")`,
+			runafter: 12,
+			want:     11,


Why 11? Maybe it was answered on previous PR, but please leave at least a comment explaining.

I have updated the latest code with a comment. Please review.

serathius · 2023-05-25T08:40:45Z

runtime/termscounter_test.go

+
+			for i := 0; i < tc.runbefore; i++ {
+				exampleFunc()
+				time.Sleep(10 * time.Millisecond)


I am not sure why this was included by the original author. I have removed it.

serathius · 2023-05-25T08:42:17Z

runtime/termscounter_test.go

+		fp        string
+		desc      string
+		newdesc   string
+		runbefore int


Suggested change

runbefore int

runOldTerm int

serathius · 2023-05-25T08:43:34Z

runtime/termscounter_test.go

+	testcases := []struct {
+		name      string
+		fp        string
+		desc      string


Suggested change

desc string

failpointTerm string

serathius · 2023-05-25T08:46:23Z

runtime/termscounter_test.go

+		{
+			name:      "Inbetween Enabling Failpoint",
+			fp:        "ExampleString",
+			desc:      `10*sleep(10)->1*return("abc")`,


One interesting follow-up would be to add a panic term, so we know how the counter behaves if the user of gofail recovers from the panic.

I added a panic term; gofail does not recover from it, so the test fails. I haven't included that change in the update.

--- FAIL: TestTermsCounter/Inbetween_Enabling_Failpoint2 (0.02s)
panic: failpoint panic: ExampleString [recovered]
panic: failpoint panic: ExampleString

goroutine 24 [running]:
testing.tRunner.func1.2({0x12a7000, 0xc0000a1440})
/usr/local/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1399 +0x39f
panic({0x12a7000, 0xc0000a1440})
/usr/local/go/src/runtime/panic.go:884 +0x212
go.etcd.io/gofail/runtime.actPanic(0xc0000a2d40)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:326 +0xe7
go.etcd.io/gofail/runtime.(*term).do(...)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:294
go.etcd.io/gofail/runtime.(*terms).eval(0xc0000a6640)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:108 +0x109
go.etcd.io/gofail/runtime.(*Failpoint).Acquire(0x0?)
/Users/prasadc/dev/repo/gofail/p1/runtime/failpoint.go:38 +0xa5
go.etcd.io/gofail/runtime_test.exampleFunc()
/Users/prasadc/dev/repo/gofail/p1/runtime/termscounter_test.go:145 +0x25
go.etcd.io/gofail/runtime_test.TestTermsCounter.func1(0xc000132340)
/Users/prasadc/dev/repo/gofail/p1/runtime/termscounter_test.go:78 +0x106
testing.tRunner(0xc000132340, 0xc0000a1350)
/usr/local/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL go.etcd.io/gofail/runtime 0.497s

gofail doesn't recover panics, but the client using the library might wrap the gofail code in recovery.
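
The scenario in question can be sketched as follows. This is only a model: executions and triggerPanicTerm are hypothetical names, and the sketch assumes the counter is incremented before the panic action runs, which is exactly the behavior a real test would need to confirm.

```go
package main

import "fmt"

// executions is a hypothetical counter standing in for the failpoint's
// execution count.
var executions int

// triggerPanicTerm models a failpoint whose active term is a panic.
// Assumption: the counter is incremented before the action runs.
func triggerPanicTerm(name string) {
	executions++
	panic("failpoint panic: " + name)
}

// callWithRecover mimics client code that wraps the failpoint call site
// in a recover, and reports whether a panic was recovered.
func callWithRecover(name string) (recovered bool) {
	defer func() {
		if r := recover(); r != nil {
			recovered = true
		}
	}()
	triggerPanicTerm(name)
	return false
}

func main() {
	ok := callWithRecover("ExampleString")
	fmt.Println(ok, executions)
}
```

Under that assumption the recovered panic still counts as one execution; if the increment happened after the action instead, the counter would stay unchanged.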

pchan · 2023-05-29T05:10:22Z

@serathius Thanks for your input. I have incorporated the review comments. Please review.

ahrtr · 2023-05-29T09:47:41Z

Please squash & signoff the commits, I will take a look later.

pchan · 2023-05-30T05:03:42Z

Please squash & signoff the commits, I will take a look later.

Done.

lavacat

LGTM, but please also update README.md. There is also design.md, but I didn't find an appropriate section there; maybe Step 3.

lavacat · 2023-05-30T06:53:56Z

runtime/termscounter_test.go

+
+}
+
+func exampleFunc() string {


nit: can you add some comments about this test setup. My understanding is that exampleFunc mimics code generated by gofail and __fp_ExampleString mimics fail.go file. This way you don't need to actually run gofail.

I have added a comment for this function and the variable, which together mimic the actions of the code package.

Why do we need exampleFunc in this test file?

It simulates the triggering of the failpoint. We can achieve the same thing with a combination of "envTerms" and "Acquire()", as seen in failpoint_test.go. I can remove it and use that instead if that is the preferred way.

serathius · 2023-05-30T08:04:06Z

runtime/termscounter_test.go

+			// This refers to the number of term executions and is
+			// independent of caller execution frequency. E.g., Run
+			// failpoint actions only 11 times even if we have a greater
+			// number of callsite executions (12 in this case)


I don't think this explains why run failpoint actions != callsite executions

From my understanding, the reason is to qualify the number of times the failpoint actions happen. So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict. Since the original author is not active I am not sure if this is indeed the reason. Maybe @ahrtr can comment.

the reason is to qualify the number of times the failpoint actions happen

Yes, we want to count the number of times the failpoint is executed.

So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict

Don't understand

Please double check the reason and document it as a comment. We shouldn't merge code that we don't understand.

My understanding is that 10*sleep(10)->1*return("abc") is called a chain (split on ->). Gofail will try to execute the terms in the chain until one succeeds. Terms have mods: count, percentage, list. In our example, the first term sleep(10) will be executed 10 times (count mod) and then the second term return("abc") will be allowed to execute 1 time.
That's why we get 11 executions. The docs aren't great around this behavior, and I'm not sure chains are useful.
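
The arithmetic above can be checked with a small model. This is a simplified sketch of the described behavior, not gofail's implementation; the term and chain types and the runChain helper are invented for illustration.

```go
package main

import "fmt"

// term is a simplified model of one link in a chain: an action name
// plus a count mod saying how many times it may still run.
type term struct {
	action    string
	remaining int
}

// chain models terms joined by "->".
type chain struct {
	terms   []term
	counter int // number of times any term actually ran
}

// eval runs the first term that still has budget and returns its action.
// It returns "" when every term is exhausted, and then nothing is counted.
func (c *chain) eval() string {
	for i := range c.terms {
		if c.terms[i].remaining > 0 {
			c.terms[i].remaining--
			c.counter++
			return c.terms[i].action
		}
	}
	return ""
}

// runChain hits the call site n times against the chain
// 10*sleep(10)->1*return("abc") and returns the execution count.
func runChain(n int) int {
	c := &chain{terms: []term{{"sleep(10)", 10}, {`return("abc")`, 1}}}
	for i := 0; i < n; i++ {
		c.eval()
	}
	return c.counter
}

func main() {
	// 12 call-site hits yield 11 term executions: 10 sleeps plus 1 return.
	fmt.Println(runChain(12))
}
```

The twelfth hit finds both terms exhausted, so it reaches the call site without executing any action and the count stays at 11.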

So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict

Don't understand

This is implemented by mods (code). The terms used in the example (10*sleep(10)->1*return("abc")) have mods in them. You can also specify a floating-point probability, and the term actions will then be executed for only that percentage of callsite executions. If you remove the mods, i.e., have no int or float in the terms, then the number of term executions will equal the number of callsite executions. This example tests mods, so the number of callsite executions != the number of failpoint actions.

Note: Don't worry about explaining this to me. Just improve the comment to explain the test case as accurately as possible.

I have updated the comment. Please check.

pchan · 2023-05-30T11:16:40Z

I am investigating failures caused by the cpu flag in go test. Will update with the fix.

runtime/runtime.go

ahrtr · 2023-05-30T11:26:44Z

runtime/termscounter_test.go

+	"go.etcd.io/gofail/runtime"
+	"testing"


Suggested change

"go.etcd.io/gofail/runtime"

"testing"

"testing"

"go.etcd.io/gofail/runtime"

ahrtr · 2023-05-31T13:26:28Z

runtime/termscounter_test.go

+// This variable mimics the code generated by gofail code package.
+// This works in tandem with exampleFunc function.
+var __fp_ExampleString *runtime.Failpoint //nolint:stylecheck
+
+// check if failpoint is initialized as gofail
+// tests can clear global variables of runtime packages
+func initFP() {
+	if __fp_ExampleString == nil { //nolint:stylecheck
+		__fp_ExampleString = runtime.NewFailpoint("ExampleString") //nolint:stylecheck
+	}
+}


Why do we need these?

Two generic comments:

As @lavacat mentioned, the design doc (and probably readme) need to be updated.

I will update this in the next checkin.

Please also consider setting up the e2e test (Add e2e test #40) to verify the functionality end to end instead of just with unit tests; of course, this can be in a separate PR.

Sure, I will look at creating an e2e setup in a separate PR.

ahrtr · 2023-05-31T13:27:32Z

runtime/termscounter_test.go

+
+}
+
+func exampleFunc() string {


Why do we need exampleFunc in this test file?

ahrtr · 2023-05-31T13:37:58Z

runtime/terms.go

@@ -41,6 +41,8 @@ type terms struct {

 	// mu protects the state of the terms chain
 	mu sync.Mutex
+	// counts executions
+	counter int


I think it should be a field/attribute of Failpoint, and we need to clearly define what counter means exactly. There are two options:

The count of the failpoint execution, no matter the terms are evaluated or not.

The count of the failpoint execution, including only the cases where the terms are really evaluated. For example, 40.0%return(true) means a 40% probability of returning true. If the term is not selected (the call falls into the other 60%), then we don't increment the counter.

Your current implementation should be the second option.
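
The difference between the two options can be sketched with a toy model. The counters type and simulate helper are hypothetical, and a fixed sequence of outcomes replaces random sampling so that the result is reproducible.

```go
package main

import "fmt"

// counters tracks both candidate definitions side by side.
type counters struct {
	hits      int // option 1: every time the call site is reached
	evaluated int // option 2: only when the term actually fires
}

// hit records one call-site execution; fires says whether the
// probabilistic term (for example 40.0%return(true)) was selected.
func (c *counters) hit(fires bool) {
	c.hits++
	if fires {
		c.evaluated++
	}
}

// simulate replays a fixed sequence of outcomes instead of drawing
// random numbers.
func simulate(outcomes []bool) counters {
	var c counters
	for _, fires := range outcomes {
		c.hit(fires)
	}
	return c
}

func main() {
	// In this hand-picked run, 2 of 5 hits fire, matching a 40% term.
	c := simulate([]bool{true, false, false, true, false})
	fmt.Println(c.hits, c.evaluated)
}
```

With option 1 the status would report 5 here, with option 2 it would report 2, so the documentation needs to say which one the endpoint returns.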

I have updated the definition of counter. Regarding making it a field of Failpoint, I have the following concerns:

There are a number of similar fields that are defined in terms rather than in Failpoint (e.g., fpath, desc). Should we standardize all of them rather than moving just one field into Failpoint?

This will involve changes to the parsing code, and terms.go would need to obtain the Failpoint lock and reference.

Should these be addressed as a refactor in a separate PR?

ahrtr · 2023-05-31T13:53:56Z

Two generic comments:

As @lavacat mentioned, the design doc (and probably readme) need to be updated.
Please also consider setting up the e2e test (Add e2e test #40) to verify the functionality end to end instead of just with unit tests; of course, this can be in a separate PR.

Signed-off-by: Ramil Mirhasanov <[email protected]>
Signed-off-by: Prasad Chandrasekaran <[email protected]>
Co-authored-by: Marek Siarkowicz <[email protected]>
Co-authored-by: Benjamin Wang <[email protected]>

ahrtr · 2023-06-09T11:51:09Z

@pchan

I suggest removing termscounter_test.go#L10-L25 and termscounter_test.go#L147-L158, and moving the two test cases into terms_test.go; there is no need to add the new test file termscounter_test.go. Let's only test the terms, so as to keep the test as clean as possible.
runtime.Status is simple enough that we can ignore it in this PR (I mean there is no need to add a test for it in this PR). I suggest setting up an e2e test to test the http endpoints, covering all functionality. Of course, in a separate PR.

serathius · 2023-10-07T08:52:34Z

ping @pchan
will you have time to finish the PR?

pchan · 2023-10-17T04:13:45Z

ping @pchan will you have time to finish the PR?

I dropped this, will get it done by the end of this month.

serathius · 2023-10-17T08:15:55Z

That's great, thanks. Can you take a look into one issue I found in etcd-io/etcd#16776?

The status endpoint works for failpoints that were set up using the HTTP endpoint. However, it doesn't seem to work if they were set up using environment variables.

serathius · 2023-11-03T09:39:55Z

@pchan any progress?

@ZhouJianMS would you be interested in looking into this?

ZhouJianMS · 2023-11-09T09:59:16Z

That's great, thanks. Can you take a look into one issue I found in etcd-io/etcd#16776?

The status endpoint works for failpoints that were set up using the HTTP endpoint. However, it doesn't seem to work if they were set up using environment variables.

This issue is due to "beforeApplyOneConfChange" not being hit. I have tested with "raftBeforeSave" as an environment variable failpoint and it works.

ZhouJianMS · 2023-11-09T10:04:19Z

Continuing ramil's and pchan's work in #58. Keeping @ramil600 and @pchan as the authors of the original commit.

serathius · 2024-04-01T16:20:45Z

This has become even more important for etcd-io/etcd#17680

pchan mentioned this pull request May 18, 2023

Implement sleep failpoint status/counter for linearizability test cases #37

Closed