Add a NCAR casper system example #68
To ensure that the vanilla OpenMPI launcher is used to launch the database, the database cluster launcher script needs to be wrapped within a bash script. This also comments out much of the code within the launching script so that the database will live for the duration of the allocation.
Great stuff @jedwards4b! As discussed, I think we will want to move this code to a separate repository of SmartSim/NCAR examples and then add a README to the tutorials section that links to that repo and the others that will be created soon.
Minor comments and questions. Also some questions for @mellis13 and @al-rigazzi
tutorials/casper/Makefile
Outdated
@@ -0,0 +1,17 @@
REDIS_HOME = $(HOME)/sandboxes/SmartRedis
`SMARTREDIS_HOME` might disentangle the two references, SmartRedis and Redis (the database).
I've refactored this to use the installed SmartRedis module.
```bash
module purge
module load gnu/9.1.0 ncarcompilers openmpi netcdf ncarenv cmake
```
might want to put a note in that this does NOT work with gcc 10 (for future reference)
It does work with gnu 10 and gnu 11. I've removed the version number.
SmartRedis possibly will, but the SmartSim backends will not.
see RedisAI/redis-inference-optimization#777
Confirmed by RedisAI devs as well.
```bash
module load gnu/9.1.0 ncarcompilers openmpi netcdf ncarenv cmake
```

I also needed a newer version of gmake, it's in /glade/work/jedwards/make-4.3/bin/make
`conda install make` will also provide a new version of make for the user that will satisfy this requirement.
``pip install smartsim``

``smart --device gpu``
Please note that this could also be `smart --device cpu` in the case where the user would like to run the database (specifically ML) on CPU nodes.
My understanding is that `--device gpu` also builds a CPU version; this seems to be confirmed by testing.
tutorials/casper/utils.py
Outdated
return open(filearg, mode)

_hack=object()
Will you explain the usage here?
cleaned up
    return db

def monitor_client_jobs(rsvname):
    jobs_done=False
Might want to put a first sleep here instead in the case where jobs may not have arrived yet given the NCAR scheduler delay?
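The suggestion above could look roughly like the following. This is a minimal sketch, not the code in `utils.py`: the `jobs_finished` callable is a hypothetical stand-in for whatever scheduler query the real `monitor_client_jobs` performs, and the delay values are illustrative assumptions.

```python
import time


def monitor_client_jobs(rsvname, jobs_finished, initial_delay=30.0,
                        poll_interval=10.0, sleep=time.sleep):
    """Block until the client jobs in reservation `rsvname` are done.

    `jobs_finished` is a hypothetical callable (an assumption, not part of
    the PR) that returns True once no client jobs remain in the reservation.
    Returns the number of polls that found jobs still running.
    """
    # Sleep before the first check so that freshly submitted jobs have time
    # to show up in the scheduler; otherwise an empty queue could be
    # mistaken for "all jobs finished".
    sleep(initial_delay)

    polls = 0
    while not jobs_finished(rsvname):
        polls += 1
        sleep(poll_interval)
    return polls
```

Passing `sleep` as a parameter is only there so the loop can be exercised without real waiting; in the script itself the default `time.sleep` would be used.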
# pass in objects to make dirs for
exp.generate(db, overwrite=True)

# start the database on interactive allocation
"inside the batch allocation"
# shutdown the database because we don't need it anymore
exp.stop(db)
# delete the job reservation
run_cmd("pbs_rdel {}".format(rsvname))
@al-rigazzi @mellis13 We used to have launcher functions for PBS reservations, but since many systems didn't support reservations we took them out. Do you think we should reintroduce them?
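If such helpers were reintroduced, one might look like the sketch below. The names `delete_reservation` and `runner` are hypothetical and not part of SmartSim, and `run_cmd` here is only an assumed stand-in for the helper in `utils.py`, whose real implementation is not shown in this thread. `pbs_rdel` is the PBS command already used in the diff.

```python
import shlex
import subprocess


def run_cmd(cmd, runner=subprocess.run):
    """Run a command string without a shell and return its stdout.

    Assumed stand-in for the `run_cmd` helper in utils.py.
    """
    result = runner(shlex.split(cmd), check=True,
                    capture_output=True, text=True)
    return result.stdout


def delete_reservation(rsvname, runner=subprocess.run):
    """Delete a PBS reservation by name using `pbs_rdel` (hypothetical helper)."""
    # Quote the name so an unexpected character cannot split the command.
    return run_cmd("pbs_rdel {}".format(shlex.quote(rsvname)), runner=runner)
```

The `runner` argument exists only so the helper can be tried out on a machine without PBS; with `check=True`, a failing `pbs_rdel` raises `subprocess.CalledProcessError` instead of passing silently.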
1. resv_job.template
2. launch_database_cluster.template
3. launch_client.template
Possibly add a note here that this is not the expected way for users to launch models/infra through SmartSim, and that this is a great example of how to launch the ML infra (database) and the simulation separately.
launch.py is the primary launch script
```
usage: launch.py [-h] [--db-nodes DB_NODES] [--ngpus-per-node NGPUS_PER_NODE]
```
I think we may want to consider some SmartSim functions similar to this where the database is just launched with no job management and interactivity. It just launches itself and then it's all up to the user.
@al-rigazzi what do you think about this?
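For reference, the part of the `launch.py` usage line visible in the diff corresponds to an argparse setup along these lines. This is a sketch only: the integer types, default values, and help strings are assumptions, and the real script may accept further options that are cut off in the quoted usage.

```python
import argparse


def build_parser():
    """Build a parser for the two options visible in the quoted usage line."""
    parser = argparse.ArgumentParser(prog="launch.py")
    # Types and defaults below are illustrative assumptions.
    parser.add_argument("--db-nodes", type=int, default=1,
                        help="number of nodes for the database (assumed meaning)")
    parser.add_argument("--ngpus-per-node", type=int, default=0,
                        help="GPUs to request per node (assumed meaning)")
    return parser
```

argparse maps `--db-nodes` to the attribute `db_nodes`, so `build_parser().parse_args(["--db-nodes", "3"]).db_nodes` evaluates to `3`.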
@jedwards4b I am merging this as we are creating the Zoo of examples and will move this to the correct location, but I want you to get the credit for the commit.
@Spartee sounds good, thanks
Newer versions of PBS include a feature, create_resv_from_job, that allows a user to: