This repository contains all relevant information for performing parallel and distributed data analysis on the Poseidon cluster at the Woods Hole Oceanographic Institution (WHOI). It specifically covers Dask, xarray, and requesting resources on Poseidon and your local machine.
The workshop will be held on October 3rd, 2024 and will be co-taught by Katy Abbott and Anthony Meza.
The workshop schedule can be found here: https://docs.google.com/document/d/1vHl_ZYNNWhaYK6h4Y93NFcOhk0zmLGbmzrVpsnM5Zi8/edit?usp=sharing
Our slides can be found here: https://docs.google.com/presentation/d/18fEL94cLxcA-prOxSrREmxWVYxIGuB8MrO2ymTiV2Sc/edit?usp=sharing
This workshop assumes knowledge of some basic programming concepts including variable declaration, boolean operators, loops, lists, dictionaries, conditionals and functions. This workshop will use Python. If you need to brush up on any of these concepts in Python, the WHOI Python Carpentries workshop website is a good place to start.
The goal of this workshop is to provide attendees with an introduction to plotting and processing geophysical data stored in tabular (e.g., CSV) or hierarchical (e.g., NetCDF) formats. In particular, we will cover the following (a short example follows the list):
- Using Dask for parallel computing
- Leveraging xarray for handling multi-dimensional arrays
- Requesting resources on the Poseidon cluster
- Setting up your local machine for compatibility with Poseidon
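To give a flavor of the first two topics, here is a minimal sketch of xarray and Dask working together. The file name, variable name, and chunk size are hypothetical placeholders, not workshop materials:

```python
import xarray as xr

# Hypothetical example: the file and variable names are placeholders.
# Passing `chunks` makes xarray load the data lazily as Dask arrays,
# so computations are split into parallel tasks.
ds = xr.open_dataset("ocean_temperature.nc", chunks={"time": 12})

# This builds a task graph rather than computing immediately...
monthly_mean = ds["temperature"].groupby("time.month").mean()

# ...and nothing runs until the result is explicitly requested.
result = monthly_mean.compute()
```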
Python is not the only language that provides distributed computing tools. However, in our experience, Python has the most mature and accessible tools for big-data exploration and visualization in the climate sciences. Other languages such as MATLAB, Julia, and R offer similar tools for processing geospatial data, but none are as complete as Python's. Many scientific analysis codes are now written exclusively in Python, so we hope that understanding the basic functions that make up these codes will be worthwhile!
We have already created a script for you which installs micromamba
and the packages necessary to participate in the workshop.
To download the script, log in to Poseidon from Terminal (if using Mac/Linux) or PowerShell (Windows):

```bash
ssh -XY username@poseidon.whoi.edu
```
Once logged in, confirm you are in your home directory:

```bash
cd ~
```
Next, download the setup script:

```bash
wget https://raw.githubusercontent.com/anthony-meza/WHOI-PO-HPC/refs/heads/official_pilot_workshop/poseidon_setup.sh
```
Finally, to start the installation, run:

```bash
sh poseidon_setup.sh
```
To check whether your installation succeeded, run:

```bash
source ~/.bash_profile
mamba
```
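As an extra sanity check, you can confirm that the core workshop packages import cleanly. The package list here is an assumption based on the workshop topics; the actual environment created by poseidon_setup.sh may differ:

```python
# Sanity check: confirm the core workshop packages are importable.
# (Assumes poseidon_setup.sh installed dask and xarray -- an assumption
# based on the workshop topics, not on the script's contents.)
import dask
import xarray as xr

print("dask:", dask.__version__)
print("xarray:", xr.__version__)
```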
Running a Jupyter notebook on Poseidon requires you to create an SSH tunnel from Poseidon to your personal computer. Instructions depend on whether you're using Mac/Linux or Windows. Below are instructions for both.
To make sure you're able to start a Jupyter notebook from Poseidon and then access the remote server through SSH tunneling, follow these steps.

For Mac/Linux:

- ssh into Poseidon, copy the script `run_jupyter_on_poseidon.sh` to your home directory, change the placeholder email address in the script to your email, and run `sbatch run_jupyter_on_poseidon.sh` on the command line.
- Check the output from this script, which is piped to `log-jupyter-{jobid}.log`. You can check the job ID by running `mj` (short for "my job") to see which of your jobs are in the queue. Any errors will also be sent to this log.
- Copy the line with a format like `ssh -N -f -L remote-port:remote-server:remote-port username@poseidon.whoi.edu`, which shows the port the server is running on and the node it is using on Poseidon. Paste it into a new terminal window on your local machine and run it. (See this screenshot for more details.)
- Locate the URL in `log-jupyter-{jobid}.log` that begins with `http://127.0.0.1:remote-port...`. Copy this URL, paste it into a browser, and your notebook should pop up!
For Windows:

- Download and install PuTTY.
- ssh into Poseidon, copy the script `run_jupyter_on_poseidon.sh` to your home directory, change the placeholder email address in the script to your email, and run `sbatch run_jupyter_on_poseidon.sh` on the command line.
- Check the output from this script, which is piped to `log-jupyter-{jobid}.log`. You can check the job ID by running `mj` (short for "my job") to see which of your jobs are in the queue. Any errors will also be sent to this log.
- Use the information from the log output to create an SSH tunnel with PuTTY. (See this screenshot for more details.)
- Start the tunnel by clicking "Open" in PuTTY.
- Locate the URL in `log-jupyter-{jobid}.log` that begins with `http://127.0.0.1:remote-port...`. Copy this URL, paste it into a browser, and your notebook should pop up!
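Once the notebook is open, you may want a Dask cluster running on your allocated compute node. Below is a minimal sketch, assuming the dask.distributed package is installed in your environment; the worker count and memory limit are placeholders that should match whatever resources your SLURM job requested:

```python
from dask.distributed import Client, LocalCluster

# Start a Dask cluster confined to the compute node running this notebook.
# n_workers and memory_limit are placeholders -- match them to the
# resources your SLURM job actually requested.
cluster = LocalCluster(n_workers=4, memory_limit="4GB")
client = Client(cluster)

# Displaying the client in a notebook cell shows the dashboard link
# and a summary of the workers.
client
```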