Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and Optimize trace_link.py for Enhanced Readability and Performance #90

Closed
wants to merge 17 commits into from

Conversation

TaekyungHeo
Copy link
Contributor

Summary

This pull request introduces significant improvements to trace_link.py. The changes aim to enhance the readability, maintainability, and performance of the codebase, addressing several issues identified in the previous implementation.

Key highlights of this pull request include:

  • Refactoring of trace_link.py to improve code readability and maintainability. The refactoring effort focused on restructuring the code for better understanding and future development, accompanied by comprehensive comments and logging messages to facilitate easier debugging and maintenance.
  • Removal of buggy or incomplete features, such as the multi-iteration support. This feature, while valuable, was found to introduce numerous bugs and complications.
  • Introduction of multi-threading optimizations for processing large traces. Recognizing the performance bottlenecks associated with handling extensive execution traces, this update implements multi-threading techniques to accelerate the processing of large datasets, significantly improving the tool's efficiency.

@facebook-github-bot
Copy link
Contributor

Hi @TaekyungHeo!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@TaekyungHeo TaekyungHeo changed the title Refactor Refactor and Optimize trace_link.py for Enhanced Readability and Performance Feb 6, 2024
@TaekyungHeo TaekyungHeo force-pushed the refactor branch 2 times, most recently from 8596483 to 6c6524c Compare February 7, 2024 19:54
train/compute/python/tools/trace_link.py Show resolved Hide resolved
train/compute/python/tools/trace_link.py Outdated Show resolved Hide resolved
train/compute/python/tools/trace_link.py Outdated Show resolved Hide resolved
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 22, 2024
@facebook-github-bot
Copy link
Contributor

@briancoutinho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@briancoutinho
Copy link
Contributor

briancoutinho commented Mar 8, 2024

@TaekyungHeo I'm not sure why but import failed :( One request can you try to rebase on master
git checkout main; git pull
git checkout your branch; git rebase main;
git push --force

@facebook-github-bot
Copy link
Contributor

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

@TaekyungHeo
Copy link
Contributor Author

Updated, @briancoutinho .

@facebook-github-bot
Copy link
Contributor

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

The `handle_kineto_segmentation` function is intended to support kineto traces
cross multiple iterations by splitting a trace into several segments according
to the provided annotations. Unfortunately, this function is not operating as
expected, leading to errors. It is advisable to remove it.
The multi-iteration support feature for PyTorch execution traces is designed to
facilitate the handling of traces over multiple iterations. Unfortunately, this
feature is not functioning as expected and is leading to errors. It is advisable
to remove it.
This commit introduces support for inter-thread dependencies within the Chakra
framework. By examining Kineto traces via chrome://tracing, one can observe
multiple CPU threads and their implicit dependencies. This update explicitly
encodes these dependencies in the output trace, enabling accurate handling by
subsequent tools.
This commit adds stream ID encoding to GPU operators. This ensures that all
operators within the same stream are executed in the correct order, supporting
intra-stream dependencies.
Introduced exclusive duration calculation for Kineto operators in the TraceLinker
class.  This update differentiates between inclusive and exclusive durations,
providing a clearer distinction in the profiling data. Exclusive durations are
now calculated to identify the actual time spent in individual operations,
excluding overlaps with child operators.
@facebook-github-bot
Copy link
Contributor

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

@TaekyungHeo
Copy link
Contributor Author

TaekyungHeo commented Apr 5, 2024

Move trace_link.py to mlcommons/chakra based on the discussion between Meta and NVIDIA. (#100)

@TaekyungHeo TaekyungHeo closed this Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants