Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sleep failpoint status #47

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pchan
Copy link

@pchan pchan commented May 18, 2023

This PR is based out of @ramil600 PR #37 and is intended to address etcd-io/etcd#14729.

The current status is to update Ramil's code to run it successfully. Further updates will require will incorporate changes requested in #37 (comment)

cc: @ramil600 @ahrtr @serathius @lavacat

@pchan
Copy link
Author

pchan commented May 25, 2023

Incorporated the following changes:

  1. Use runtime.Status instead of StatusCount
  2. Status returns int instead of string and tests have been updated to do the change. In http for conversion I am converting it to string to byte as it was simple, I can take this off and do a byte conversion directly.

There was one comment about using subtests. I will investigate it.

runtime/http.go Outdated Show resolved Hide resolved
runtime/http.go Outdated
fp := key[:len(key)-len("/count")]
_, count, err := Status(fp)
if err != nil {
http.Error(w, "failed to GET: "+err.Error(), http.StatusNotFound)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrNoExist is not the only error that Status can return. Please properly handle errors to avoid separate 400 from 500.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand comment especially the part of separate 400 and 500. For error handling I have changed code to output a specific error and fallback to a "catch-all" 500 generic error (link)

fp: "ExampleString",
desc: `10*sleep(10)->1*return("abc")`,
runafter: 12,
want: 11,
Copy link
Member

@serathius serathius May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 11? Maybe it was answered on previous PR, but please leave at least a comment explaining.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated with latest code with a comment. Please review.


for i := 0; i < tc.runbefore; i++ {
exampleFunc()
time.Sleep(10 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why sleep?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why this was included by the original author. I have removed it.

fp string
desc string
newdesc string
runbefore int
Copy link
Member

@serathius serathius May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
runbefore int
runOldTerm int

testcases := []struct {
name string
fp string
desc string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
desc string
failpointTerm string

{
name: "Inbetween Enabling Failpoint",
fp: "ExampleString",
desc: `10*sleep(10)->1*return("abc")`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One interesting followup would be to add a panic term, so we know the behavior of counter if user of gofail recovers the panic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a panic, gofail does not recover from it. It fails the test. I haven't included the change in the update.

--- FAIL: TestTermsCounter/Inbetween_Enabling_Failpoint2 (0.02s)
panic: failpoint panic: ExampleString [recovered]
panic: failpoint panic: ExampleString

goroutine 24 [running]:
testing.tRunner.func1.2({0x12a7000, 0xc0000a1440})
/usr/local/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1399 +0x39f
panic({0x12a7000, 0xc0000a1440})
/usr/local/go/src/runtime/panic.go:884 +0x212
go.etcd.io/gofail/runtime.actPanic(0xc0000a2d40)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:326 +0xe7
go.etcd.io/gofail/runtime.(*term).do(...)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:294
go.etcd.io/gofail/runtime.(*terms).eval(0xc0000a6640)
/Users/prasadc/dev/repo/gofail/p1/runtime/terms.go:108 +0x109
go.etcd.io/gofail/runtime.(*Failpoint).Acquire(0x0?)
/Users/prasadc/dev/repo/gofail/p1/runtime/failpoint.go:38 +0xa5
go.etcd.io/gofail/runtime_test.exampleFunc()
/Users/prasadc/dev/repo/gofail/p1/runtime/termscounter_test.go:145 +0x25
go.etcd.io/gofail/runtime_test.TestTermsCounter.func1(0xc000132340)
/Users/prasadc/dev/repo/gofail/p1/runtime/termscounter_test.go:78 +0x106
testing.tRunner(0xc000132340, 0xc0000a1350)
/usr/local/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL go.etcd.io/gofail/runtime 0.497s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gofail doesn't recover panics, but the client using the library might wrap the gofail code in recovery.

@pchan
Copy link
Author

pchan commented May 29, 2023

@serathius Thanks for your input. I have incorporated the review comments. Please review.

@ahrtr ahrtr self-requested a review May 29, 2023 09:47
@ahrtr
Copy link
Member

ahrtr commented May 29, 2023

Please squash & signoff the commits, I will take a look later.

@pchan
Copy link
Author

pchan commented May 30, 2023

Please squash & signoff the commits, I will take a look later.

Done.

Copy link

@lavacat lavacat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please also update README.md. There is also design.md, but I didn't find appropriate section there, maybe Step 3.


}

func exampleFunc() string {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add some comments about this test setup. My understanding is that exampleFunc mimics code generated by gofail and __fp_ExampleString mimics fail.go file. This way you don't need to actually run gofail.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a comment for this function and the variable that together mimics the actions of code package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need exampleFunc in this test file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It simulates the triggering of the failpoint. We can achieve the same thing by a combination of "envTerms" and "Acquire()" as seen in failpoint_test.go. I can remove and replace if that is the preferred way.

Comment on lines 32 to 35
// This refers to the number of term executions and is
// independent of caller execution frequency. E.g., Run
// failpoint actions only 11 times even if we have a greater
// number of callsite executions (12 in this case)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this explains why run failpoint actions != callsite executions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the reason is to qualify the number of times the failpoint actions happen. So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict. Since the original author is not active I am not sure if this is indeed the reason. Maybe @ahrtr can comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason is to qualify the number of times the failpoint actions happen

Yes, we want to count number of times failpoint i executed.

So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict

Don't understand

Please double check the reason and document it as a comment. We shouldn't merge code that we don't understand.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that 10*sleep(10)->1*return("abc") is called a chain (split on ->). Gofail will try to execute terms in chain until one succeeds. Terms have mods: count, percentage, list. In our example first term sleep(10) will be executed 10 times (count mod) and then second term return("abc") will be allowed to execute 1 time.
That's why we get 11 executions. Docs aren't great around this behavior and I'm not sure chains are useful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the callsite is hit X number of times but you want to restrict term executions to Y (typically Y < X), then you either use a hard count or a probability to restrict

Don't understand

This is implemented by mods(code). The terms used in the example (10*sleep(10)->1*return("abc")) have mods in them. You can specify a floating point probabilityand the term actions will be executed only % of callsite executions. If you remove mods i.e., not have int or float in the terms, then the number of term executions will equal callsite executions. This example tests mods and so the number of callsite executions != failpoint actions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Don't worry about explaining this to me. Just improve the comment to explain the test case as accurate as possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comment. Please check.

@pchan
Copy link
Author

pchan commented May 30, 2023

I am investigating failures because of cpu flag in go test. Will update with the fix.

runtime/runtime.go Outdated Show resolved Hide resolved
Comment on lines 4 to 5
"go.etcd.io/gofail/runtime"
"testing"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"go.etcd.io/gofail/runtime"
"testing"
"testing"
"go.etcd.io/gofail/runtime"

Comment on lines 8 to 25
// This variable mimics the code generated by gofail code package.
// This works in tandem with exampleFunc function.
var __fp_ExampleString *runtime.Failpoint //nolint:stylecheck

// check if failpoint is initialized as gofail
// tests can clear global variables of runtime packages
func initFP() {
if __fp_ExampleString == nil { //nolint:stylecheck
__fp_ExampleString = runtime.NewFailpoint("ExampleString") //nolint:stylecheck
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need these?

Copy link
Author

@pchan pchan Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two generic comments:

  • As @lavacat mentioned, the design doc (and probably readme) need to be updated.

I will update this in the next checkin.

  • Please also consider to setup the e2e test (Add e2e test #40), to verify the functionality end to end instead of just unit test; but of course, can be in a separate PR.

Sure, I will look at creating a e2e setup in a separate PR


}

func exampleFunc() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need exampleFunc in this test file?

@@ -41,6 +41,8 @@ type terms struct {

// mu protects the state of the terms chain
mu sync.Mutex
// counts executions
counter int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be a field/attribute of Failpoint, and we need to clearly define what does counter exactly mean. There are two options:

  • The count of the failpoint execution, no matter the terms are evaluated or not.
  • The count of the failpoint execution, only include the case that the terms are really evaluated. For example, 40.0%return(true) means 40% possibility to return true. If it returns false (falls into the other 60% possibility), then we don't increment the counter.

Your current implementation should be the second option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the definition of counter. Regarding making it a field of Failpoint, I have the following concerns

  1. There are number of similar fields that are defined in terms rather than in failpoint (e.g., fpath, desc), should we standardize all, rather than moving just one field into Failpoint ?
  2. This will involve changes to parsing code, get Failpoint lock and reference in terms.go.

Should these be addressed as a refactor in a separate PR ?

@ahrtr
Copy link
Member

ahrtr commented May 31, 2023

Two generic comments:

  • As @lavacat mentioned, the design doc (and probably readme) need to be updated.
  • Please also consider to setup the e2e test (Add e2e test #40), to verify the functionality end to end instead of just unit test; but of course, can be in a separate PR.

Signed-off-by: Ramil Mirhasanov <[email protected]>
Signed-off-by: Prasad Chandrasekaran <[email protected]>
Co-authored-by: Marek Siarkowicz <[email protected]>
Co-authored-by: Benjamin Wang <[email protected]>
@ahrtr
Copy link
Member

ahrtr commented Jun 9, 2023

@pchan

  1. Suggest to remove termscounter_test.go#L10-L25 and termscounter_test.go#L147-L158; and move the two test cases into terms_test.go, no need to add the new test file termcount_test.go. Let's only test the terms, so as to keep the test as clean as possible.
  2. The runtime.status is simple enough, we can ignore them in this PR (I mean no need to add test for them in this PR). I suggest to setup e2e test to test the http endpoints, covering all functionalities. Of course, in a separate PR.

@serathius
Copy link
Member

serathius commented Oct 7, 2023

ping @pchan
will you have time to finish the PR?

@pchan
Copy link
Author

pchan commented Oct 17, 2023

ping @pchan will you have time to finish the PR?

I dropped this, will get it done by the end of this month.

@serathius
Copy link
Member

That's great, thanks. Can you take a look into one issue I found in etcd-io/etcd#16776?

Status endpoint works for failpoints that were setup using HTTP endpoint. However, they don't seem if they were setup up using environment variables.

@serathius
Copy link
Member

@pchan any progress?

@ZhouJianMS would you be interested in looking into this?

@ZhouJianMS
Copy link
Contributor

That's great, thanks. Can you take a look into one issue I found in etcd-io/etcd#16776?

Status endpoint works for failpoints that were setup using HTTP endpoint. However, they don't seem if they were setup up using environment variables.

This issue is due to "beforeApplyOneConfChange" not hit. I have tested with "raftBeforeSave" as an environment variable failpoint and it works.

@ZhouJianMS
Copy link
Contributor

Continue ramil and pchan 's work in #58 . Keep @ramil600 and @pchan as the author of the original commit.

@serathius
Copy link
Member

This again became even more important for etcd-io/etcd#17680

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants