Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to fix GCP autoscaler threading issue #49440

Merged
merged 1 commit into from
Dec 26, 2024

Conversation

hartikainen
Copy link
Contributor

@hartikainen hartikainen commented Dec 25, 2024

This is an attempt to fix #46451. I believe that the issue is caused by incorrect handling of google-api-python-client-library, which is thread-unsafe as per the documentation.

The above-mentioned documentation also describes how the thread-unsafety can be worked around by instantiating separate httplib2.Http() object for each thread. Here, I've just slapped the request build pattern onto GCP service creation code. I have not yet fully verified whether this works or not as debugging this is slow and quite a pain in the ass. What I have verified though is that the SSL errors that I mentioned in #46451 (comment) and which consistently occurred in /tmp/ray/session_latest/logs/monitor.out are now gone.

I'm not sure when I have time to get back to this but I'll leave this here in case someone else wants to take a look.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@MortalHappiness
Copy link
Member

MortalHappiness commented Dec 26, 2024

Thank you for your contribution! Looks good. Would you mind converting it to a formal PR by signing off all commits and fixing the lint CI error?

I will use your PR to run some experiments to see if your changes applied to 2.37.0 can resolve the issue. I’ll post the results here afterward.

Update: Experiment PR is here. #49442

@MortalHappiness MortalHappiness added the go add ONLY when ready to merge, run all tests label Dec 26, 2024
@hartikainen hartikainen force-pushed the fix-46451 branch 2 times, most recently from b9d9c8f to cd26d3d Compare December 26, 2024 12:12
@hartikainen
Copy link
Contributor Author

hartikainen commented Dec 26, 2024

I signed-off and formatted (with scripts/format.sh) the commit. I can convert this to a non-draft PR but I want to emphasize that I have not tested this beyond my own use case. Clearly the most immediate issue is fixed but I don't know if this has some side effects. Are there tests that would verify that other things don't break e.g. due to build_request?

@hartikainen hartikainen marked this pull request as ready for review December 26, 2024 12:17
@hartikainen hartikainen requested review from hongchaodeng and a team as code owners December 26, 2024 12:17
@MortalHappiness
Copy link
Member

I signed-off and formatted (with scripts/format.sh) the commit. I can convert this to a non-draft PR but I want to emphasize that I have not tested this beyond my own use case. Clearly the most immediate issue is fix but I don't know if this has some side effects. Are there tests that would verify that other things don't break e.g. due to build_request?

I think it should be fine. First, the two files you modified are only related to GCP, so components other than GCP won't be affected. Secondly, the functions you changed are all related to creating resources. As I mentioned in #49440 (comment), I manually built your changes into a wheel and tested deploying it on GCP. From my observations, the compute engines were created correctly. As for AuthorizedHttp, since I was able to create compute engines, authentication should also be working fine.

cc @rynewang Do you have any other concerns about this?

@rynewang
Copy link
Contributor

Looks good!

@rynewang rynewang merged commit 6064bdd into ray-project:master Dec 26, 2024
5 checks passed
@rynewang
Copy link
Contributor

Merged. Thanks for your contribution!

@hartikainen hartikainen deleted the fix-46451 branch December 26, 2024 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Setup Commands not running with Ray Cluster on GCP
3 participants