[Do not merge] Document new structure (#21)
* Black

* Provide default for credentials and fix times

* Fix badge

* First part of the docs done

* Ignore data directories

* Fix var handling and output paths

* Document container usage

* Add sections for TODO's
willirath authored Mar 2, 2021
1 parent 4e44c5e commit a385bec
Showing 4 changed files with 136 additions and 277 deletions.
2 changes: 2 additions & 0 deletions .gitignore
```diff
@@ -1,2 +1,4 @@
 *.nc
 *.csv
+zarr
+nc
```
276 changes: 60 additions & 216 deletions README.md
# CMEMS automated data retrieval

[![build-and-push-images](https://github.com/geomar-od/rasmus-cmems-downloads/actions/workflows/build_and_push_images.yaml/badge.svg)](https://github.com/geomar-od/rasmus-cmems-downloads/actions/workflows/build_and_push_images.yaml)
[![quay.io/willirath/rasmus-cmems-downloads](https://img.shields.io/badge/quay.io-build-blue)](https://quay.io/repository/willirath/rasmus-cmems-downloads)

## Overview

Currently, the automated data downloading includes two steps:

1. data download and extraction: [`motupydownload`](motupydownload/)

2. format conversion from netCDF to [Zarr](https://zarr.readthedocs.io/en/stable/): [`netcdf2zarr`](netcdf2zarr/)

In the future, there will be more conversion steps (e.g., from Zarr to Parquet) and/or an upload step to the systems of a collaborator.

## Description

For real-time downloading from the [Copernicus Ocean website](https://resources.marine.copernicus.eu/?option=com_csw&task=results), the data are selected according to user-given parameters:

- spatial domain
- depth
- time span
- variables

To extract the data we use the [`motuclient`](https://github.com/clstoulouse/motu-client-python/).

## File and directory naming

### netCDF

The data are downloaded as netCDF files into correspondingly named directories (below a base directory that can be chosen via a command-line argument):
- for the [physics analysis dataset](https://resources.marine.copernicus.eu/?option=com_csw&view=details&product_id=GLOBAL_ANALYSIS_FORECAST_PHY_001_024): `global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh/nc/`
- for the [wave analysis dataset](https://resources.marine.copernicus.eu/?option=com_csw&view=details&product_id=GLOBAL_ANALYSIS_FORECAST_WAV_001_027): `global-analysis-forecast-wav-001-027/nc/`

For each day, a separate `.nc` file is created, named after the product, the selected variable, and the start and end time stamps, e.g., `global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh_uo_2021-01-23_2021-01-24.nc` or `global-analysis-forecast-wav-001-027_VPED_2021-01-23_2021-01-24.nc`.

### Zarr

We'll combine all timesteps into one Zarr store per variable.
For the [physics analysis dataset](https://resources.marine.copernicus.eu/?option=com_csw&view=details&product_id=GLOBAL_ANALYSIS_FORECAST_PHY_001_024), there would be, e.g., `global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh/zarr/global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh_uo_2021-01-01_2021-02-01.zarr/`.

### General

`{product-id}/{format}/{product-id}_{variable}_{start-time}_{end-time}.{extension}`
with
- `product-id` being, e.g., `global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh` or `global-analysis-forecast-wav-001-027`
- `format` being `nc`, `zarr`, etc.
- `variable` being `uo`, `vo`, etc.
- `start-time` being interpreted as the left-inclusive boundary of the time interval covered by the data file / data store, and `end-time` as the right-exclusive boundary
- `extension` being `nc`, `zarr/`, etc.
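
As an illustration (a hypothetical sketch, not a script from this repository), the daily zonal-velocity file from the example above would be addressed like this:
```shell
# Hypothetical illustration of the naming convention; none of these shell
# variables are part of the repository's scripts.
product_id="global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh"
variable="uo"
start_time="2021-01-23"  # left-inclusive boundary
end_time="2021-01-24"    # right-exclusive boundary
echo "${product_id}/nc/${product_id}_${variable}_${start_time}_${end_time}.nc"
```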

Note that it is, however, recommended to use Docker ([see below](#usage-with-docker)).

## Usage (with Docker)

### Building or pulling the container images

The necessary container images can be built locally:
```shell
docker build -t rasmus-cmems-downloads:motupydownload-latest motupydownload/
docker build -t rasmus-cmems-downloads:netcdf2zarr-latest netcdf2zarr/
```
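To check that both images are available locally (an optional step, not part of the repository's docs):
```shell
# Both images should be listed with their *-latest tags.
docker image ls rasmus-cmems-downloads
```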
Or pre-built images can be pulled and tagged:
```shell
docker pull quay.io/willirath/rasmus-cmems-downloads:motupydownload-latest
docker pull quay.io/willirath/rasmus-cmems-downloads:netcdf2zarr-latest

docker tag \
    quay.io/willirath/rasmus-cmems-downloads:motupydownload-latest \
    rasmus-cmems-downloads:motupydownload-latest
docker tag \
    quay.io/willirath/rasmus-cmems-downloads:netcdf2zarr-latest \
    rasmus-cmems-downloads:netcdf2zarr-latest
```
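Equivalently, as a loop (a convenience sketch, not part of the repository):
```shell
# Pull and re-tag both images in one go.
for image in motupydownload netcdf2zarr; do
    docker pull "quay.io/willirath/rasmus-cmems-downloads:${image}-latest"
    docker tag \
        "quay.io/willirath/rasmus-cmems-downloads:${image}-latest" \
        "rasmus-cmems-downloads:${image}-latest"
done
```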

### Running the containers

For a help message, just run:
```shell
docker run rasmus-cmems-downloads:motupydownload-latest --help
```
and
```shell
docker run rasmus-cmems-downloads:netcdf2zarr-latest --help
```
For actually downloading data and for authenticating with the Copernicus service providing the data, read on below.

### Data download example

We'll read credentials from environment variables:
```shell
export MOTU_USER="XXXXXXXXXXXXXXXX"
export MOTU_PASSWORD="XXXXXXXXXXXXXXXXX"
```
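To keep the plain-text values out of your shell history, the two `export` lines could also live in a private file that is sourced instead (a sketch; the file name is hypothetical):
```shell
# Load CMEMS credentials from a private file containing the two export
# lines above (hypothetical path).
source ~/.cmems_credentials
```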

To run the container for downloading 10 days of the physics forecast product into `./data/`, do:
```shell
docker run -v $PWD:/work --rm \
    -e MOTU_USER -e MOTU_PASSWORD \
    rasmus-cmems-downloads:motupydownload-latest \
    --service_id GLOBAL_ANALYSIS_FORECAST_PHY_001_024-TDS \
    --product_id global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh \
    --var uo --var vo --basedir /work/data
```
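If the download succeeds, daily netCDF files following the naming convention above should show up (a quick check under that assumption):
```shell
# Lists files like {product-id}_{variable}_{start-time}_{end-time}.nc
ls data/global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh/nc/
```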

### Zarr conversion example

To convert the data that was just downloaded to `./data/`, run:
```shell
docker run -v $PWD:/work --rm \
    rasmus-cmems-downloads:netcdf2zarr-latest \
    --product_id global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh \
    --var uo --var vo --basedir /work/data
```
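Afterwards, there should be one Zarr store (a directory ending in `.zarr/`) per variable, again following the naming convention above:
```shell
# Lists stores like {product-id}_{variable}_{start-time}_{end-time}.zarr
ls -d data/global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh/zarr/*.zarr
```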

## Usage (with Singularity)

TBD
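
Until this section is written, here is a minimal sketch based on an earlier revision of this README, with the image names updated (untested):
```shell
# Pull the Docker image as a Singularity image file (creates a *.sif in $PWD) ...
singularity pull --disable-cache --dir $PWD \
    docker://quay.io/willirath/rasmus-cmems-downloads:motupydownload-latest
# ... and show the downloader's help message.
singularity run rasmus-cmems-downloads_motupydownload-latest.sif --help
```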

## Usage (with local Python installation)

TBD
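
Until this section is written, a sketch based on the dependency list from an earlier revision of this README (`motuclient`, `xarray`, `netCDF4`, `pandas`), with `zarr` assumed for the conversion step:
```shell
python -m pip install motuclient xarray netCDF4 pandas zarr
```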