【WIP】add pod diagnosis feature #1219
base: master
1. Monitor pods periodically.
2. Diagnose long-pending pods based on the monitoring data.
Signed-off-by: xiluo <[email protected]>
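The second step can be illustrated with a minimal sketch. This is not the code in this PR: `PodSnapshot`, `find_long_pending`, and the `timeout_secs` threshold are hypothetical names chosen for illustration, and the sketch assumes each monitoring sample records a pod's name, phase, and creation time.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodSnapshot:
    """Hypothetical record of one pod as seen by a monitoring pass."""

    name: str
    phase: str  # Kubernetes pod phase, e.g. "Pending" or "Running"
    created_at: float  # creation time in seconds since the epoch


def find_long_pending(
    pods: List[PodSnapshot],
    timeout_secs: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return names of pods that have been Pending longer than timeout_secs."""
    current = time.time() if now is None else now
    return [
        pod.name
        for pod in pods
        if pod.phase == "Pending" and current - pod.created_at > timeout_secs
    ]
```

Passing `now` explicitly keeps the check deterministic in unit tests; a periodic monitor would call it with the latest sample and the default clock.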
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1219      +/-   ##
==========================================
- Coverage   80.41%   80.31%   -0.11%
==========================================
  Files         217      219       +2
  Lines       19463    19567     +104
==========================================
+ Hits        15652    15715      +63
- Misses       3811     3852      +41
View full report in Codecov by Sentry.
@@ -78,6 +82,22 @@ def get_timestamp(self) -> int:
    def get_type(self) -> str:
        return DiagnosisDataType.CHIPMETRICES


class K8sPodData(DiagnosisData):
    def __init__(self, timestamp: int, pods: List[V1Pod]):
Better not to involve 'V1Pod' in the common package (otherwise users have to add the 'kubernetes' dependency to their environment).
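One possible way to address this, sketched under assumptions: convert each pod into a plain record at the Kubernetes boundary so the common package never imports 'kubernetes'. `PodInfo`, `to_pod_info`, and `K8sPodDataSketch` are hypothetical names, not part of this PR, and the real class would still extend `DiagnosisData`. The converter relies only on the `metadata.name` and `status.phase` attributes, which `V1Pod` objects expose.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, List


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical dependency-free view of a pod."""

    name: str
    phase: str


def to_pod_info(pod: Any) -> PodInfo:
    """Convert a V1Pod-like object; call this where 'kubernetes' is imported."""
    return PodInfo(name=pod.metadata.name, phase=pod.status.phase)


class K8sPodDataSketch:
    """Stand-in for K8sPodData that stores PodInfo instead of V1Pod."""

    def __init__(self, timestamp: int, pods: List[PodInfo]):
        self._timestamp = timestamp
        self._pods = list(pods)

    def get_timestamp(self) -> int:
        return self._timestamp

    @property
    def pods(self) -> List[PodInfo]:
        return self._pods


# Usage with a fake object shaped like a V1Pod (no 'kubernetes' import needed).
fake_pod = SimpleNamespace(
    metadata=SimpleNamespace(name="worker-0"),
    status=SimpleNamespace(phase="Pending"),
)
data = K8sPodDataSketch(timestamp=1, pods=[to_pod_info(fake_pod)])
```

With this shape, only the module that talks to the Kubernetes API needs the 'kubernetes' dependency; the diagnosis data can be constructed and tested without it.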
from dlrover.python.scheduler.job import JobArgs


class PodMonitor(object):
This implementation duplicates _monitor_nodes in dist_job_manager.py.
What changes were proposed in this pull request?
Why are the changes needed?
To automatically recover the job when pods stay pending for a long time.
Does this PR introduce any user-facing change?
No.
How was this patch tested?