Skip to content

Commit

Permalink
Add guide explaining how to use Pangeo Docker Images with Jupyter Not…
Browse files Browse the repository at this point in the history
…ebook on HPC Systems using Singularity (#430)

* Add Singularity + GPU guide

Co-authored-by: Ryan Abernathey <[email protected]>

* Update Sing+GPU.md

Co-authored-by: Ryan Abernathey <[email protected]>

---------

Co-authored-by: Scott Henderson <[email protected]>
Co-authored-by: Ryan Abernathey <[email protected]>
Co-authored-by: Scott Henderson <[email protected]>
  • Loading branch information
4 people authored Apr 3, 2024
1 parent cd97f6a commit 6582f8a
Show file tree
Hide file tree
Showing 2 changed files with 130 additions and 9 deletions.
21 changes: 12 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,16 +34,9 @@ graph TD;
click ml-notebook "https://hub.docker.com/r/pangeo/ml-notebook" "Open this in a new tab" _blank
```

### Using the image with Singularity on HPC systems

### Other notes

* Since 2020.10.16, [mamba](https://github.com/mamba-org/mamba) is installed into the base-image and conda-lock environment and is used by default to solve for a compatible environment (see #146)
* For a simple list of packages for a given image, you can use a link like this: https://github.com/pangeo-data/pangeo-docker-images/blob/2020.10.08/pangeo-notebook/packages.txt
* To compare changes between two images, you can use a link like this: https://github.com/pangeo-data/pangeo-docker-images/compare/2020.10.03..2020.10.08
* Our `ml-notebook` image now contains JAX and TensorFlow with XLA enabled. Due to licensing issues, conda-forge does not have `ptxas`, but `ptxas` is needed for XLA to work correctly. Should you like to use JAX and/or TensorFlow with XLA optimization, please install `ptxas` on your own, for example, by `conda install -c nvidia cuda-nvcc`. At the time of writing (October 2022), JAX throws a compilation error if the `ptxas` version is higher than the driver version. There does not exist an easy solution for K80 GPUs, but in the case of T4 GPUs, you should install `conda install -c nvidia cuda-nvcc==11.6.*` to be safe. Alternatively for any GPU, you could set an environment variable to resolve the error caused by JAX: `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"`. The aforementioned error will be removed (and likely turned into a warning) in a future version of JAX. See https://github.com/google/jax/issues/12776#issuecomment-1276649134
* There used to be a `pangeo/forge` image, built for use with [pangeo-forge](https://pangeo-forge.org/). It is
no longer actively maintained or used, but you can still use the [historical tags](https://quay.io/repository/pangeo/forge?tab=tags)
if you wish.
If you want to use this image on an HPC system (including a GPU system), we recommend using Singularity. Please see the [Singularity guide](Sing+GPU.md).


### Dask-gateway compatibility
Expand All @@ -55,3 +48,13 @@ The primary use of these Docker images is running on Pangeo Cloud deployments wi
| 0.9 | 2020.11.06 |
| 0.8 | 2020.07.28 |
| 0.7 | 2020.04.22 |

### Other notes

* Since 2020.10.16, [mamba](https://github.com/mamba-org/mamba) is installed into the base-image and conda-lock environment and is used by default to solve for a compatible environment (see #146)
* For a simple list of packages for a given image, you can use a link like this: https://github.com/pangeo-data/pangeo-docker-images/blob/2020.10.08/pangeo-notebook/packages.txt
* To compare changes between two images, you can use a link like this: https://github.com/pangeo-data/pangeo-docker-images/compare/2020.10.03..2020.10.08
* Our `ml-notebook` image now contains JAX and TensorFlow with XLA enabled. Due to licensing issues, conda-forge does not have `ptxas`, but `ptxas` is needed for XLA to work correctly. Should you like to use JAX and/or TensorFlow with XLA optimization, please install `ptxas` on your own, for example, by `conda install -c nvidia cuda-nvcc`. At the time of writing (October 2022), JAX throws a compilation error if the `ptxas` version is higher than the driver version. There does not exist an easy solution for K80 GPUs, but in the case of T4 GPUs, you should install `conda install -c nvidia cuda-nvcc==11.6.*` to be safe. Alternatively for any GPU, you could set an environment variable to resolve the error caused by JAX: `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"`. The aforementioned error will be removed (and likely turned into a warning) in a future version of JAX. See https://github.com/google/jax/issues/12776#issuecomment-1276649134
* There used to be a `pangeo/forge` image, built for use with [pangeo-forge](https://pangeo-forge.org/). It is
no longer actively maintained or used, but you can still use the [historical tags](https://quay.io/repository/pangeo/forge?tab=tags)
if you wish.
118 changes: 118 additions & 0 deletions Sing+GPU.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
# Jupyter Notebook + Singularity + GPU support on the HPC System


Singularity brings containers into traditional HPC use cases and centers. (FYI: It has been moved into the Linux Foundation and renamed Apptainer).

We first need download and use [one of the images created by Pangeo](https://github.com/pangeo-data/pangeo-docker-images). They are all hosted on [Dockerhub](https://hub.docker.com/u/pangeo)

## Downloading the image

After ssh-ing into your HPC system, load Singularity:

```
module load singularity
```

Pull the desired image (for example the ml-notebook which uses TensorFlow and GPUs) under our name of choice (in our case `tensorflow.sif`):

```
singularity pull tensorflow.sif docker://pangeo/ml-notebook
```

**Note I:** Depending on the size of the image, this could take some time and some warnings may appear. It may be a good idea to do some other work in the meantime.

**Note II:** If we were to choose a different image, just change what follows after `docker://` for the name of the image appearing on [Dockerhub](https://hub.docker.com/u/pangeo).

After being patient, the file `tensorflow.sif` should be available in the home folder.

## Running the Batch Job

To request resources and have a Jupyter Notebook running on a computing node it is necessary to have a batch script under, for example, the name `batch_tflw_v100s.sh`.

To create it run `vi batch_tflw_v100s.sh` and paste the below command text.

<span style="color:red">**Important:**</span> Make sure all paths are the relevant to your given case.

```
#!/bin/sh
#
#SBATCH --account=abernathey # The account name for the job.
#SBATCH --job-name=jupyter # The job name.
#SBATCH --gres=gpu:1 # Request 1 gpu (Up to 2 gpus per GPU node)
#SBATCH --partition=ocp_gpu
#SBATCH --constraint=v100s
#SBATCH -c 32 # The number of cpu cores to use.
#SBATCH --time=0-04:00 # The time the job will take to run in D-HH:MM
#SBATCH --output=/home/$USER/jupyter.log # Important to retrieve the port where the notebook is running, if not included a slurm file with the job-id will be outputted.
module load singularity
cat /etc/hosts
singularity exec --nv --cleanenv --bind /home/$USER:/run/user tensorflow.sif jupyter notebook --notebook-dir=/home/$USER --no-browser --ip=0.0.0.0
```

To exit the Vi editor, make sure to be in command mode by pressing `ESC`and then `:wq` to write and quit.

In this case, V100s GPUs are allocated if available. The [Ginsburg official guide](https://confluence.columbia.edu/confluence/display/rcs/Ginsburg+-+Job+Examples#GinsburgJobExamples-GPU(CUDAC/C++)) describes how to manage resource requests.

To enter the queue, run:

```
sbatch batch_tflw_v100s.sh
```

Once the job is running, the `jupyter.log` file should show which node you are using and in which port it is referenced. To obtain this, run:

```
cat jupyter.log
```

A line like `[I 15:14:51.868 NotebookApp] http://g051:8888/` should appear. In this case, `g051` is the node name and `8888` is the port.

## Running the Jupyter Notebook

### Forwarding the Port

In your local computer's terminal, forward the port by running (change `[email protected]` for your account),

```
ssh -N -L lochalhost:8080:g051:8888 [email protected]
```

This forwards your port `8888` from the HPC system to your port `8080` on your machine.

Then, in a web browser you should be able to access the Jupyter Notebook by writing:

![](https://i.imgur.com/ezXUVEv.png)

#### Sanity Checks

- The requested GPU should support your TensorFlow notebook:

![](https://i.imgur.com/g9tzOiQ.png)

- The python kernel should be the `Python 3 (ipykernel)`

![](https://i.imgur.com/CwTHtZk.png)

- Which points to the following path:

![](https://i.imgur.com/Lz3N88g.png)

### Connecting to remote host (VSCode)

If you are using VSCode, it is possible to connect to the allocated node directly without forwarding the port.

Make sure Remote-SSH, Python and Jupyter extensions are installed on VSCode.

1. Connect to your HPC system with Remote-SSH in VSCode.

2. Open a Jupyter Notebook.

3. Select the kernel on the top right under the gear wheel and then `Connect to a Jupyter Server`.

4. Introduce the URL (`http://g051:8888/`) and select the `Python 3 (ipykernel)`

5. Check for GPUs

![](https://i.imgur.com/XJp5IZd.png)

0 comments on commit 6582f8a

Please sign in to comment.