[WIP] Training hang detection based on XPU Timer metric. #1288

Open: wants to merge 7 commits into master

BalaBalaYi (Collaborator) commented Oct 11, 2024

What changes were proposed in this pull request?

  1. Implement CheckTrainingHangOperator based on the XPU Timer metric.
  2. Integrate the context from JobManager into DiagnosisDataManager.
  3. Use a size-limited deque instead of a list in DiagnosisDataManager to avoid unbounded data growth.
  4. Add a lock around 'diagnosis data' operations.
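
For readers unfamiliar with the pattern in items 3 and 4, here is a minimal sketch of a size-limited, lock-protected store. It is not the code in this PR: the class name `BoundedDiagnosisStore`, its methods, and the `max_items` parameter are hypothetical and chosen only for illustration.

```python
import threading
from collections import deque


class BoundedDiagnosisStore:
    """Hypothetical stand-in for per-type storage; not the real DiagnosisDataManager."""

    def __init__(self, max_items=3):
        self._max_items = max_items
        self._data = {}  # data type -> deque of records
        self._lock = threading.Lock()

    def store(self, data_type, record):
        # The lock serializes writers so concurrent reports cannot interleave.
        with self._lock:
            bucket = self._data.setdefault(
                data_type, deque(maxlen=self._max_items)
            )
            bucket.append(record)  # the oldest record is dropped once full

    def get(self, data_type):
        # Return a copy so callers never iterate over a deque being mutated.
        with self._lock:
            return list(self._data.get(data_type, ()))


store = BoundedDiagnosisStore(max_items=3)
for i in range(5):
    store.store("xpu_timer_metric", i)
print(store.get("xpu_timer_metric"))  # [2, 3, 4]
```

With `deque(maxlen=...)`, memory per data type stays constant no matter how many records arrive, which is the property item 3 is after.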

Why are the changes needed?

This is a proof of concept (POC) for training hang detection.
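
As a rough illustration of what a metric-based hang check could look like, the sketch below flags a hang when every worker has reported a hang in each of the most recent samples and that run of samples spans at least a minimum duration. The XPU Timer metric format is not shown in this PR, so `MetricSample`, its fields, `is_training_hung`, and `min_duration` are all assumptions, not the logic of CheckTrainingHangOperator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetricSample:
    # Hypothetical record; the real XPU Timer metric format may differ.
    timestamp: float
    worker_hang: Dict[int, bool]  # worker rank -> hang flag


def is_training_hung(samples: List[MetricSample], min_duration: float) -> bool:
    """True if the trailing run of all-workers-hung samples spans min_duration seconds."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp)
    latest = ordered[-1].timestamp
    run_start = None
    # Walk backwards from the newest sample until a healthy one is found.
    for sample in reversed(ordered):
        if sample.worker_hang and all(sample.worker_hang.values()):
            run_start = sample.timestamp
        else:
            break
    return run_start is not None and latest - run_start >= min_duration


hung = [MetricSample(t, {0: True, 1: True}) for t in (0.0, 60.0, 120.0)]
print(is_training_hung(hung, min_duration=100.0))  # True
```

Requiring a sustained run rather than a single sample is one way to avoid reacting to a transient stall; the right window length would depend on the job.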

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests (UT).

# Conflicts:
#	dlrover/python/master/node/dist_job_manager.py
#	dlrover/python/tests/test_diagnosis_agent.py
#	dlrover/python/tests/test_inference_chain.py
codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 97.64706% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.49%. Comparing base (0ef290a) to head (5f36a45).

| Files with missing lines | Patch % | Lines |
| .../inferenceoperator/check_training_hang_operator.py | 92.72% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1288      +/-   ##
==========================================
+ Coverage   80.34%   80.49%   +0.14%     
==========================================
  Files         222      222              
  Lines       20481    20622     +141     
==========================================
+ Hits        16456    16599     +143     
+ Misses       4025     4023       -2     


# Conflicts:
#	dlrover/python/diagnosis/common/constants.py
#	dlrover/python/diagnosis/common/diagnosis_action.py
#	dlrover/python/elastic_agent/torch/training.py
#	dlrover/python/master/node/job_manager.py
#	dlrover/python/master/servicer.py
#	dlrover/python/tests/test_diagnosis.py
#	dlrover/python/tests/test_diagnosis_agent.py
#	dlrover/python/tests/test_job_manager.py
#	dlrover/python/util/time_util.py
Labels
enhancement New feature or request
2 participants