-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Valay/refactor on eddie's PR #3
base: outerbounds-multicloud
Are you sure you want to change the base?
Conversation
- Unfinished - Ran black
- no notification system in place. - re-orged stuff into a module
- Failure of workers triggers failure of control. - Failure of control results in failure of workers. - Added more comments. - control tasks have heartbeat thread that write for a long enough period of time.
….deepspeed` - ensures separation of concern and also helps when debugging user code failures as there are distinct .
self.environment = environment | ||
self.flow_datastore = flow_datastore | ||
|
||
self.is_batch = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is is_batch
and is_k8s
used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not. Will remove.
|
||
for deco in decos: | ||
if deco.name in ["resources", "kubernetes", "batch"]: | ||
compute_deco_attrs = compute_resource_attributes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this function needed here? we are planning on removing this from metaflow OSS soon.
DEEPSPEED_SUFFIX = "mf.deepspeed_datastore" | ||
|
||
|
||
HOSTFILE_IP_KEY = "hostfile_ips" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are available in constants.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot. Can move.
TaskFailedException, | ||
) | ||
|
||
DEEPSPEED_SUFFIX = "mf.deepspeed_datastore" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit - just mf.deepspeed
perhaps
class DeepspeedExecutor: | ||
""" | ||
Instances of the DeepspeedExecutor class are responsible for launching the Deepspeed command. There is one per Metaflow @step annotated with @deepspeed. | ||
DeepspeedExecutor consumes a list of hosts aka nodes aka k8s pods aka metaflow task containers, and information about how many processes to run on each. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be batch jobs too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The executor is called within the user code. It represents current.deepspeed
- made the hello example work out of the box. - refactor on fixes.
No description provided.