Currently, DLRover uses the official Kubernetes Python client to interact with the Kubernetes API Server. This part of the implementation is important because it manages the lifecycle of training workers. However, the Python client has inherent limitations and lags behind the Go (and Java) clients (e.g., it lacks an informer implementation), which occasionally leads to unexpected issues in certain scenarios. Therefore, we intend to do one of the following:
[Option 1] Replace the current Python client with the Go client (this requires CFFI).
[Option 2] Enhance the Kubernetes client implementation in Python.
Requirement
The enhancement/replacement should ensure compatibility with all existing Kubernetes-related calls while also adapting the usage in dlrover/python/scheduler/kubernetes.py.
Ensure compatibility with all related features (no regression).
Reimplement the watch mechanism in dlrover/python/master/watcher/k8s_watcher.py using an informer.
The replacement scheme requires evaluation of:
Limitations on different system platforms.
Issues with cross-language object transfer.
Potential performance overhead.
Additional costs for building and deployment.
The enhancement scheme requires evaluation of:
Feasibility and complexity of implementation.
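To make the informer requirement more concrete, here is a minimal sketch of the list-then-watch pattern with a local cache, as it might look under Option 2. It is not DLRover code and not part of the Kubernetes Python client: `MiniInformer`, `key_of`, `list_func`, and `watch_func` are hypothetical names, objects are plain dicts rather than `V1Pod` instances, and events are assumed to have the `{"type": ..., "object": ...}` shape that the client's watch stream yields. A real informer would also reconcile deletions on re-list and resume from the last seen resource version, which this sketch leaves out.

```python
import threading


def key_of(obj):
    """Build a 'namespace/name' cache key from a dict-shaped object."""
    meta = obj["metadata"]
    return f'{meta["namespace"]}/{meta["name"]}'


class MiniInformer:
    """Informer-like helper (hypothetical): list once, then apply watch events
    to a thread-safe local cache and dispatch add/update/delete handlers."""

    def __init__(self, list_func, watch_func,
                 on_add=None, on_update=None, on_delete=None):
        # list_func() -> (items, resource_version)
        # watch_func(resource_version) -> iterable of {"type", "object"} events
        self._list_func = list_func
        self._watch_func = watch_func
        self._on_add = on_add or (lambda obj: None)
        self._on_update = on_update or (lambda old, new: None)
        self._on_delete = on_delete or (lambda obj: None)
        self._lock = threading.Lock()
        self._cache = {}

    def get(self, key):
        """Read from the local cache instead of calling the API server."""
        with self._lock:
            return self._cache.get(key)

    def keys(self):
        with self._lock:
            return sorted(self._cache)

    def run_once(self):
        """One list+watch cycle; callers would loop this when the watch ends."""
        items, resource_version = self._list_func()
        for obj in items:
            self._apply("ADDED", obj)
        for event in self._watch_func(resource_version):
            self._apply(event["type"], event["object"])

    def _apply(self, event_type, obj):
        key = key_of(obj)
        with self._lock:
            old = self._cache.get(key)
            if event_type == "DELETED":
                self._cache.pop(key, None)
            else:
                self._cache[key] = obj
        # Handlers run outside the lock so they may read the cache safely.
        if event_type == "DELETED":
            if old is not None:
                self._on_delete(old)
        elif old is None:
            self._on_add(obj)
        else:
            self._on_update(old, obj)
```

With the real client, `list_func` and `watch_func` would wrap the pod list call and the watch stream; keeping them injectable also lets the cache logic be unit-tested without a cluster.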
Hello @BalaBalaYi,
Please assign this issue to me.
I will be enhancing the Python Kubernetes client in dlrover/python/scheduler/kubernetes.py to address #1291 by adding an informer-like watch mechanism, improved error handling with retries, and resource caching to reduce API calls. This approach will optimize performance, ensure compatibility, and avoid moving to a Go client.
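As a rough illustration of the "error handling with retries" part of this proposal, a generic retry wrapper with exponential backoff and jitter could look like the sketch below. `call_with_retries` and its parameters are hypothetical names chosen for this example, not existing DLRover or Kubernetes client APIs; with the real client one would likely pass the client's `ApiException` as `retry_on` and retry only on transient status codes.

```python
import random
import time


def call_with_retries(func, retries=3, base_delay=0.5, max_delay=8.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error, wait with exponential backoff
    plus jitter and try again, up to `retries` extra attempts."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= retries:
                # Retries exhausted: surface the last error to the caller.
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries from many workers hitting the API.
            sleep(delay + random.uniform(0, delay / 2))
            attempt += 1
```

The `sleep` argument is injectable only so the backoff schedule can be tested without actually waiting.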