Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try fixing random hangs in CI due to races #2615

Closed
wants to merge 2 commits into from

Conversation

YJDoc2
Copy link
Collaborator

@YJDoc2 YJDoc2 commented Jan 4, 2024

Ref : #2144 , specifically #2144 (comment)

As per mentioned in the comment, I have added a different impl for the "main" function, that will leak the closure, and allow OS to collect memory at very end instead. For the sake of safety, I have added that function via cfg(test) , so the release code still gets the "proper" implementation. As mentioned in the comment, the races should every only be present in tests due to multi-threaded nature of test running, and actual real-world should not have this hang-up issue.

One downside is that because of this the code that is tested v/s the one that is executed is slightly different, and I'm not entirely fine with it ; but given that our CI is showing way more time-out flakes due to these hangs, I'm willing to try this solution.

I ran the test CI ~ 5 times on my repo, and tests did not hang any time : https://github.com/YJDoc2/youki/actions?query=branch%3Atests%2Ffix-for-2144++

However, one run did fail on one test - https://github.com/YJDoc2/youki/actions/runs/7407712843/job/20154439112 ;but as it still didn't hang, and the test itself might have flaked, so not considering it in particular.

@YJDoc2 YJDoc2 force-pushed the tests/fix-for-2144 branch from a8fd37e to 27e94a2 Compare January 4, 2024 10:05
@codecov-commenter
Copy link

codecov-commenter commented Jan 4, 2024

Codecov Report

Merging #2615 (4e99152) into main (b60889d) will increase coverage by 0.01%.
Report is 9 commits behind head on main.
The diff coverage is 66.66%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2615      +/-   ##
==========================================
+ Coverage   65.88%   65.89%   +0.01%     
==========================================
  Files         133      133              
  Lines       16819    16834      +15     
==========================================
+ Hits        11081    11093      +12     
- Misses       5738     5741       +3     



Signed-off-by: Yashodhan Joshi <[email protected]>
@YJDoc2 YJDoc2 force-pushed the tests/fix-for-2144 branch from 27e94a2 to 4e99152 Compare January 4, 2024 10:09
@YJDoc2 YJDoc2 requested a review from a team January 4, 2024 10:16
Comment on lines +214 to +224
#[cfg(not(test))]
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
unsafe { Box::from_raw(data as *mut CloneCb)() }
}
#[cfg(test)]
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
let mut func = unsafe { Box::from_raw(data as *mut CloneCb) };
let ret = func();
Box::into_raw(func);
ret
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[cfg(not(test))]
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
unsafe { Box::from_raw(data as *mut CloneCb)() }
}
#[cfg(test)]
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
let mut func = unsafe { Box::from_raw(data as *mut CloneCb) };
let ret = func();
Box::into_raw(func);
ret
}
extern "C" fn main_impl(data: *mut libc::c_void) -> libc::c_int {
unsafe { Box::from_raw(data as *mut CloneCb)() }
}
#[cfg(not(test))]
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
main_impl(data)
}
#[cfg(test)]
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
let mut func = main_impl(data);
let ret = func();
Box::into_raw(func);
ret
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also use #[cfg(any(target_env = "musl"))]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey,

  1. We can extract the main logic in a main_impl function, but it will need to leak the box or return the box which then the caller will have to leak it. If we do it as you have suggested, i.e. calling the boxed function and returning the result, we still do the free call when main_impl returns and run into same potential race condition.
  2. As per the last comment on that issues, runwasi was running into same hanging-up issue when running with libc. As far as I understand, the root cause if doing malloc/free after the fork, which is supposed to be undefined behavior. As this only affect in tests, I had added cfg test, and kept rest of the targets to the normal implementation. As per yihuaf 's comment, which I have linked in the desc, the hand issue should only be observed in tests, not in actual usage, so I feel only modifying the test target with cfg is better than modifying the whole musl compilation.

wdyt?

@jprendes
Copy link
Contributor

jprendes commented Jan 8, 2024

Does the hang still happen after #2425?
Would you have links to hanging runs?

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Jan 8, 2024

Does the hang still happen after #2425?
Would you have links to hanging runs?

Hey, yes it still occasionally hangs, and in fact also hangs on glib :

eg Runs :

You can check https://github.com/containers/youki/actions/workflows/main.yml?query=is%3Afailure where runtime is >= 20 mins.

I think the issue is that even though that PR fixes some syscall issues, the box free syscall as investigated by yihuaf and mentioned in the comment in PR desc still exist.

@jprendes
Copy link
Contributor

jprendes commented Jan 8, 2024

Thanks!

One downside is that because of this the code that is tested v/s the one that is executed is slightly different, and I'm not entirely fine with it

Why can't we use the test behaviour in release code as well?

I would be happy with leaving the cleanup to the OS in release mode as well.

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Jan 9, 2024

Why can't we use the test behaviour in release code as well?
I would be happy with leaving the cleanup to the OS in release mode as well.

It is mostly my preference to not leak things unless needed. There are no code related or technical reason why both versions cannot be same, at worse we are leaking a single box there, so I think <100 bytes of memory. I just separated the code, because as per yihuaf 's investigation the issue was only occurring in test, and should not be present in normal usage. I am fine with using the same behavior in both cases, so if that feels better, let me know, I'll update accordingly.

@jprendes
Copy link
Contributor

jprendes commented Jan 9, 2024

Another option could be setting --test-threads=1 at the expense of slightly increased test runtime.

cargo test -- --test-threads=1

Although I'm not 100% sure that this guarantees that the process will have only one thread.

@utam0k
Copy link
Member

utam0k commented Jan 9, 2024

Another option could be setting --test-threads=1 at the expense of slightly increased test runtime.

cargo test -- --test-threads=1

Although I'm not 100% sure that this guarantees that the process will have only one thread.

Thanks for your suggestion but I don't prefer this workaround as I think it hides the problem too much and affects other unit tests.

@utam0k
Copy link
Member

utam0k commented Jan 9, 2024

This is caused by multiple threads, so it never happens in production because Youki itself always expects one thread.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Co-authored-by: Toru Komatsu <[email protected]>
@utam0k
Copy link
Member

utam0k commented Jan 11, 2024

@YJDoc2 Can I ask you if this failure is related to this PR?
https://github.com/containers/youki/actions/runs/7484741181/job/20372045989?pr=2615

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Jan 12, 2024

Hmm it appears that even after the changes here, some tests can hang

test process::fork::test::test_container_err_fork has been running for over 60 seconds
test process::fork::test::test_container_fork has been running for over 60 seconds
test seccomp::tests::test_basic has been running for over 60 seconds
test seccomp::tests::test_moby has been running for over 60 seconds

Which is why the one CI has failed. Not sure what can be done further right now 😓

@utam0k
Copy link
Member

utam0k commented Jan 12, 2024

hmm... @yihuaf May I ask you to look into this PR and CI?

@utam0k
Copy link
Member

utam0k commented Jan 13, 2024

Another option could be setting --test-threads=1 at the expense of slightly increased test runtime.

cargo test -- --test-threads=1

Although I'm not 100% sure that this guarantees that the process will have only one thread.

Good option.

@YJDoc2
How about separating the test into two, not using clone and using clone by cfg feature flags?

@utam0k utam0k self-requested a review January 18, 2024 12:36
@lengrongfu lengrongfu mentioned this pull request Jan 19, 2024
@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Jan 21, 2024

How about separating the test into two, not using clone and using clone by cfg feature flags?

Hey @utam0k , I'm not sure what do you mean by this. Do you mean to split tests into two parts, one which use the clone path (such as fork, etc) and others which don't? Maybe then we can run the non-fork ones in parallel, and run the fork ones in serial. However, I feel there would still be a chance that because the way tests are executed, the runner would still have multiple threads (total), even if we run with --test-threads=1 . Still, it might be worth a try. Can you confirm if this is what you meant? Thanks!

@utam0k
Copy link
Member

utam0k commented Jan 23, 2024

@YJDoc2 Yes, That's what I want to say. I believe features could separate unit tests into two parts, using clone(2) and not using it.

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Feb 16, 2024

Closing this as done in #2685

@YJDoc2 YJDoc2 closed this Feb 16, 2024
@YJDoc2 YJDoc2 deleted the tests/fix-for-2144 branch February 16, 2024 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants