Add a NCAR casper system example #68

Merged: 15 commits, Oct 15, 2021

jedwards4b
Contributor

Newer versions of PBS include a feature, create_resv_from_job, that allows a user to:

  • Launch a job that creates a reservation for multiple nodes - possibly a combination of cpu and gpu nodes.
  • Launch within that reservation the SmartSim database and one or more client jobs.
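
The two steps above hinge on a job script that asks PBS to turn the job into a reservation. A minimal sketch of generating such a script follows, assuming the feature is requested with a `#PBS -W create_resv_from_job=true` directive (worth checking against the local `qsub` man page). The helper name `build_resv_job_script`, the `casper` queue name, and the per-node core and GPU counts are illustrative assumptions, not taken from this PR.

```python
def build_resv_job_script(cpu_nodes, gpu_nodes, walltime="01:00:00",
                          ncpus=36, ngpus=1, queue="casper"):
    """Return the text of a PBS job script that requests a mix of CPU and
    GPU nodes and asks PBS to create a reservation from the job.

    All defaults (queue name, cores and GPUs per node) are assumptions.
    """
    chunks = []
    if cpu_nodes > 0:
        chunks.append(f"{cpu_nodes}:ncpus={ncpus}")
    if gpu_nodes > 0:
        chunks.append(f"{gpu_nodes}:ncpus={ncpus}:ngpus={ngpus}")
    if not chunks:
        raise ValueError("request at least one CPU or GPU node")

    lines = [
        "#!/bin/bash",
        f"#PBS -q {queue}",
        f"#PBS -l walltime={walltime}",
        # Chunks of different node types are joined with '+'.
        "#PBS -l select=" + "+".join(chunks),
        # Assumed spelling of the directive that enables the feature.
        "#PBS -W create_resv_from_job=true",
        "",
        "# Placeholder: launch the database and client jobs from here.",
    ]
    return "\n".join(lines) + "\n"


print(build_resv_job_script(cpu_nodes=2, gpu_nodes=1))
```

Generating the script from a function keeps the node mix in one place, which is roughly what a template such as resv_job.template would be filled in with.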

ashao and others added 14 commits June 24, 2021 15:21
To ensure that the vanilla OpenMPI launcher is used to launch the
database, the database cluster launcher script needs to be wrapped
within a bash script. This also comments out much of the code
within the launching script so that the database will live for the
duration of the allocation.
Contributor

@Spartee Spartee left a comment

Great stuff @jedwards4b! As discussed, I think we will want to move this code to a separate repository of SmartSim/NCAR examples and then add a README to the tutorials section that links to that repo and the others that will be created soon.

Minor comments and questions. Also some questions for @mellis13 and @al-rigazzi

@@ -0,0 +1,17 @@
REDIS_HOME = $(HOME)/sandboxes/SmartRedis
Contributor

SMARTREDIS_HOME might disentangle the two references: SmartRedis and Redis (the database).

Contributor Author

I've refactored this to use the installed SmartRedis module.

```bash
module purge
module load gnu/9.1.0 ncarcompilers openmpi netcdf ncarenv cmake
```
Contributor

Might want to add a note that this does NOT work with gcc 10 (for future reference).

Contributor Author

It does work with gnu 10 and gnu 11. I've removed the version number.

Contributor

SmartRedis possibly does, but the SmartSim backends will not.

see RedisAI/redis-inference-optimization#777

Confirmed by RedisAI devs as well.

```bash
module load gnu/9.1.0 ncarcompilers openmpi netcdf ncarenv cmake
```

I also needed a newer version of gmake; it's in /glade/work/jedwards/make-4.3/bin/make
Contributor

conda install make will also give the user a newer version of make that satisfies this requirement.


``pip install smartsim``

``smart --device gpu``
Contributor

Please note that this could also be smart --device cpu when the user would like to run the database (specifically ML) on CPU nodes.

Contributor Author

My understanding is that --device gpu also builds a CPU version; this seems to be confirmed by testing.


return open(filearg, mode)

_hack=object()
Contributor

Will you explain the usage here?

Contributor Author

cleaned up

    return db

def monitor_client_jobs(rsvname):
    jobs_done=False
Contributor

Might want to put an initial sleep here, in case the jobs have not arrived yet given the NCAR scheduler delay?
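
One way to read this suggestion is a polling loop that waits once before its first check. A rough sketch, assuming a hypothetical `list_jobs` callable that returns the IDs of jobs still queued or running in the reservation (for example by parsing `qstat` output); the parameter names and delay values are illustrative and not taken from the PR:

```python
import time


def monitor_client_jobs(rsvname, list_jobs, initial_delay=30.0,
                        poll_interval=10.0, sleep=time.sleep):
    """Block until no jobs remain in the reservation `rsvname`.

    `list_jobs` is a hypothetical callable: list_jobs(rsvname) -> list of
    job IDs still present. Returns the number of polls performed.
    """
    # Give the scheduler time to register newly submitted client jobs
    # before the first check, so an empty queue is not mistaken for "done".
    sleep(initial_delay)

    polls = 0
    while True:
        polls += 1
        if not list_jobs(rsvname):
            return polls
        sleep(poll_interval)
```

A fixed delay is only a heuristic: if the scheduler takes longer than `initial_delay` to show the jobs, the loop still returns early, so a stricter variant could require that at least one job has been seen before treating an empty list as completion.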

# pass in objects to make dirs for
exp.generate(db, overwrite=True)

# start the database on interactive allocation
Contributor

"inside the batch allocation"

# shutdown the database because we don't need it anymore
exp.stop(db)
# delete the job reservation
run_cmd("pbs_rdel {}".format(rsvname))
Contributor

@al-rigazzi @mellis13 we used to have launcher functions for PBS reservations, but since many systems didn't support them we took them out. Do you think we should reintroduce them?
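
As a point of reference for that discussion, the teardown step in the snippet above could be wrapped so the reservation is deleted even if the experiment fails partway. This is only a sketch: `pbs_reservation` is a hypothetical helper, not a SmartSim API, and the `run_cmd` shown here is an assumed stand-in for the script's own helper of the same name.

```python
import shlex
import subprocess
from contextlib import contextmanager


def run_cmd(cmd):
    """Run a command line and return its stdout; raises on a non-zero exit.

    Assumed stand-in for the launch script's own run_cmd helper.
    """
    result = subprocess.run(shlex.split(cmd), check=True,
                            capture_output=True, text=True)
    return result.stdout


@contextmanager
def pbs_reservation(rsvname, runner=run_cmd):
    """Yield the reservation name and always delete the reservation on exit."""
    try:
        yield rsvname
    finally:
        # pbs_rdel is the PBS command the original script uses for cleanup.
        runner("pbs_rdel {}".format(rsvname))
```

With this shape, the database start, client monitoring, and `exp.stop(db)` would sit inside a `with pbs_reservation(rsvname):` block, and the `runner` argument lets the cleanup be exercised without a PBS installation.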

1. resv_job.template
2. launch_database_cluster.template
3. launch_client.template

Contributor

Possibly add a note here that this is not the expected way for users to launch models/infra through SmartSim, and that this is a great example of how to launch the ML infra (database) and the simulation separately.


launch.py is the primary launch script
```
usage: launch.py [-h] [--db-nodes DB_NODES] [--ngpus-per-node NGPUS_PER_NODE]
```
Contributor

I think we may want to consider some SmartSim functions similar to this where the database is just launched with no job management and interactivity. It just launches itself and then it's all up to the user.

@al-rigazzi what do you think about this?
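
For context, the two options visible in the launch.py usage line quoted in this thread could be declared with `argparse` roughly as follows. The usage line is truncated in the diff, so any further options are not reproduced here, and the types, defaults, and help strings are assumptions rather than the script's actual values.

```python
import argparse


def build_parser():
    """Build a parser covering only the two options visible in the usage line."""
    parser = argparse.ArgumentParser(
        prog="launch.py",
        description="Launch a database and client jobs inside a PBS reservation.",
    )
    # Types and defaults below are illustrative assumptions.
    parser.add_argument("--db-nodes", type=int, default=1,
                        help="number of nodes used for the database")
    parser.add_argument("--ngpus-per-node", type=int, default=0,
                        help="number of GPUs requested on each node")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.db_nodes, args.ngpus_per_node)
```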

Contributor

@Spartee Spartee left a comment

@jedwards4b I am merging this as we are creating the Zoo of examples and will move this to the correct location, but I want you to get the credit for the commit.

@Spartee Spartee merged commit 852633b into CrayLabs:develop Oct 15, 2021
@jedwards4b
Contributor Author

@Spartee sounds good, thanks

3 participants