Skip to content

Refactor backend classes and remove deprecated methods in PyTorch backends #141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

TaekyungHeo
Copy link
Collaborator

@TaekyungHeo TaekyungHeo commented Jul 31, 2024

Summary

  • Renamed backendFunctions to BaseBackend and refactored associated modules for improved abstraction.
  • Removed getBusBW from BaseBackend.
  • Removed initialize_tcpstore and UCC plugin support from PyTorchDistBackend.
  • Reorganized module structure by relocating files to a backend directory.

Test Plan

$ comm_replay --trace-type et --trace-path /home/sanshang/021_debug/000_code/param/trace/traces_megatronlm_gpt_43B_32ranks_pytnightly0703/execution_trace

image

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 31, 2024
@TaekyungHeo TaekyungHeo force-pushed the refactor-pytorch-backend branch from d521ade to 8a1899e Compare July 31, 2024 15:51
@TaekyungHeo TaekyungHeo changed the title Refactor pytorch backend Refactor Backend Classes and Remove Deprecated Methods in PyTorch Backends Jul 31, 2024
@TaekyungHeo TaekyungHeo changed the title Refactor Backend Classes and Remove Deprecated Methods in PyTorch Backends Refactor backend classes and remove deprecated methods in PyTorch backends Jul 31, 2024
@TaekyungHeo TaekyungHeo reopened this Jul 31, 2024
@shengfukevin shengfukevin self-requested a review August 1, 2024 16:37
@facebook-github-bot
Copy link
Contributor

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

if backend == "ucc":
# try OSS/setup.py
# Import Fairring
if backend == "fairring":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like "fairring" is the supported backend. Shall we add "fairring" into supportedC10dBackends in base_backend.py? and remove "ucc"?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Will try to update. Thanks!
@Sergei-Lebedev

@shengfukevin
Copy link
Contributor

Thanks for the PR. Looks good to me. Have one minor inline comment, please check.

@shengfukevin
Copy link
Contributor

@TaekyungHeo, I saw quite a few BLACK lint error, does our github project enable BLACK already?

@TaekyungHeo
Copy link
Collaborator Author

TaekyungHeo commented Aug 1, 2024

@shengfukevin, this repository has a linter, as you can see at

- name: Run Black
. We ensured that our PRs pass the black linter as described in the GitHub pipeline. Do you see any differences between these rules and Meta’s internal rules? If so, what would be the best solution? I suggest updating the PARAM black rules to align with Meta’s black rules.

@shengfukevin
Copy link
Contributor

Looks like Meta internal lint combined black and usort. I ran usort on this PR, I do see some format got changed, but still not exactly the same as the version from Meta "arc lint".

@TaekyungHeo
Copy link
Collaborator Author

@shengfukevin Do you have any suggestions for NVIDIA to avoid linter errors in advance?

@shengfukevin
Copy link
Contributor

@shengfukevin Do you have any suggestions for NVIDIA to avoid linter errors in advance?

I am asking PyTorch team and see how they make user OSS linter aligned with internal linter.

@shengfukevin shengfukevin self-assigned this Aug 1, 2024
@shengfukevin
Copy link
Contributor

@TaekyungHeo @GSSBMW, I fixed the meta lint issue after I imported this PR to meta. However, I can not export the changed code to this PR since I am not the owner. So I created a new PR #149 which include the lint fix. I also changed supportedC10dBackends in base_backend.py to add fairring and remove ucc. Please review #149.

@TaekyungHeo
Copy link
Collaborator Author

Closing this PR as @shengfukevin has created a new one to fix lint errors: #149

@TaekyungHeo TaekyungHeo closed this Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants