Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bed 4262 az panic recovery #73

Merged
merged 26 commits into from
Mar 21, 2024
Merged

Bed 4262 az panic recovery #73

merged 26 commits into from
Mar 21, 2024

Conversation

benwaples
Copy link
Contributor

@benwaples benwaples commented Mar 19, 2024

BP-367

Adds panic recovery to help with debugging panics that are causing a crash without logging anything. This will recover from a panic, log it, and keep the app from fully crashing.

Implementation

The big gotcha here is that recover() will not catch panics that occur in a separate goroutine... and azurehound leverages goroutines significantly. To handle this, we used channels to transmit those errors between each goroutine.

The main panicChan is opened in start.go and passed down to each subsequent func that starts a new goroutine. When a panic occurs, panicRecovery will catch it and send it to panicChan, then handleBubbledPanic will log the error (as well as stack) and stop the context.

Implementation exception

Each ...CmdImpl function creates its own panicChan.

Testing

I tested this by dropping panics at multiple different levels of goroutines and running collections.

notes:

panic recovery wasn't added to the client package... yet. I think this is a good starting point and we can refactor the client package to account for panicChan in another PR.

cmd/start.go Outdated
@@ -149,8 +150,11 @@ func start(ctx context.Context) {

start := time.Now()

ctx, stop := context.WithCancel(ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im thinking I should move this child context up so that the getAvailableJobs and startJob are not using the parent context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that this won't make a difference because both of those requests will have been completed by the time we get to the goroutine where we spawn a child context.

@juggernot325 juggernot325 self-requested a review March 20, 2024 18:14
@juggernot325 juggernot325 added blocked This pull request cannot be completed and should not be merged work in progress This pull request is a work in progress and should not be merged and removed blocked This pull request cannot be completed and should not be merged labels Mar 20, 2024
@juggernot325
Copy link
Contributor

@benwaples and I discussed the possibility of moving the channel for panics to a global variable instead of passing it into each function separately. He's going to play around with that to see if it's an option to simplify these changes a bit.

@juggernot325 juggernot325 removed the work in progress This pull request is a work in progress and should not be merged label Mar 20, 2024
Copy link
Contributor

@juggernot325 juggernot325 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you for the refactor.

@urangel urangel merged commit 6889fa8 into main Mar 21, 2024
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants