From 141d1797305f0413319fd4e317d3c9c86bc0002e Mon Sep 17 00:00:00 2001
From: Kylli Ek
Date: Thu, 21 Sep 2023 13:59:20 +0300
Subject: [PATCH] Kylli slides (#14)

* first ver of data section

* add exercise files
---
 index.rst                                     | 20 ++++----
 materials/data/moving_data.md                 | 25 +++++++---
 materials/supercomputer/exercise_gdal.md      |  3 ++
 materials/supercomputer/exercise_python.md    |  3 ++
 materials/supercomputer/exercise_r.md         |  3 ++
 materials/supercomputer/puhti_data_storage.md | 47 +++++++++++++++++++
 .../supercomputer/spatial_data_at_csc.md      | 23 +++++++++
 7 files changed, 107 insertions(+), 17 deletions(-)
 create mode 100644 materials/supercomputer/exercise_gdal.md
 create mode 100644 materials/supercomputer/exercise_python.md
 create mode 100644 materials/supercomputer/exercise_r.md
 create mode 100644 materials/supercomputer/puhti_data_storage.md
 create mode 100644 materials/supercomputer/spatial_data_at_csc.md

diff --git a/index.rst b/index.rst
index 54286763..fb5e5201 100644
--- a/index.rst
+++ b/index.rst
@@ -24,26 +24,24 @@ THIS MATERIAL IS WORK IN PROGRESS, do not trust anything ! ;)
    :maxdepth: 2
    :caption: Supercomputer:
 
-   materials/supercomputer/webinterface.md
+   materials/supercomputer/webinterface.md
+   materials/supercomputer/puhti_data_storage.md
+   materials/supercomputer/spatial_data_at_csc.md
    materials/supercomputer/supercomputing.md
-   materials/supercomputer/scripts.md
    materials/supercomputer/htc.md
+   materials/supercomputer/exercise_gdal.md
+   materials/supercomputer/exercise_python.md
+   materials/supercomputer/exercise_r.md
    materials/supercomputer/own_software.md
+   materials/supercomputer/scripts.md
 
 .. toctree::
    :maxdepth: 3
    :caption: Data:
 
-   materials/data/puhti_data_storage.md
    materials/data/moving_data.md
-   materials/data/allas.md
-   materials/data/spatial_data_at_csc.md
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Exercises:
-
-   exercises/README.md
+   materials/data/allas.md
+   materials/data/stac.md
 
 .. toctree::
    :maxdepth: 2
diff --git a/materials/data/moving_data.md b/materials/data/moving_data.md
index 5ccbcf72..a66eb7cd 100644
--- a/materials/data/moving_data.md
+++ b/materials/data/moving_data.md
@@ -15,9 +15,9 @@
 ## Command line tools
 - For any amount of data, practically required if data size > 1 Tb.
 
-### scp
+### [`scp`](https://docs.csc.fi/data/moving/scp/)
 
-- [`scp`](https://docs.csc.fi/data/moving/scp/)
+- The most common Linux tool for moving files
 - `scp` works even in Windows Powershell
 
 ```
@@ -49,12 +49,25 @@ rsync --info=progress2 -a /path/to/data_directory cscusername@puhti.csc.fi:/scra
 
 
 
-## Moving data as part of workflow
+## From external data services to supercomputer
 
-> TODO: Example of moving data into local scratch eg from Allas
+- When downloading from external services try to download directly to CSC, not via your local computer
+- Check what APIs/tools the service supports:
+  - ftp, rsync ?
+  - wget/curl if HTTP URLs are available
+  - OGC APIs, STAC ?
+### [wget](https://docs.csc.fi/data/moving/wget/)
+- 
+
+```
+# One file:
+wget http://wwwd3.ymparisto.fi/d3/gis_data/spesific/syvyyskayra.zip
+# One folder:
+wget -r -nc ftp://ftp.aineistot.metsaan.fi/Metsamaski/Maakunta/ --cut-dirs=2
+```
 
 
-## From internet to supercomputer
-``` wget URL```
+## Moving data as part of workflow
+> TODO: Example of moving data into local scratch eg from Allas
 
 
diff --git a/materials/supercomputer/exercise_gdal.md b/materials/supercomputer/exercise_gdal.md
new file mode 100644
index 00000000..591f0e7b
--- /dev/null
+++ b/materials/supercomputer/exercise_gdal.md
@@ -0,0 +1,3 @@
+# Exercise: GDAL
+
+[GDAL exercise materials in Geocomputing Github](https://github.com/csc-training/geocomputing/tree/master/gdal)
\ No newline at end of file
diff --git a/materials/supercomputer/exercise_python.md b/materials/supercomputer/exercise_python.md
new file mode 100644
index 00000000..6bd54cdb
--- /dev/null
+++ b/materials/supercomputer/exercise_python.md
@@ -0,0 +1,3 @@
+# Exercise: Python
+
+[Python exercise materials in Geocomputing Github](https://github.com/csc-training/geocomputing/tree/master/python/puhti)
\ No newline at end of file
diff --git a/materials/supercomputer/exercise_r.md b/materials/supercomputer/exercise_r.md
new file mode 100644
index 00000000..3c69330b
--- /dev/null
+++ b/materials/supercomputer/exercise_r.md
@@ -0,0 +1,3 @@
+# Exercise: R
+
+[R exercise materials in Geocomputing Github](https://github.com/csc-training/geocomputing/tree/master/R/puhti)
\ No newline at end of file
diff --git a/materials/supercomputer/puhti_data_storage.md b/materials/supercomputer/puhti_data_storage.md
new file mode 100644
index 00000000..2573f064
--- /dev/null
+++ b/materials/supercomputer/puhti_data_storage.md
@@ -0,0 +1,47 @@
+# Disk areas and Allas {.title}
+In this section, you will learn how to manage different disk areas in the HPC environment at CSC
+
+# Main disk areas in Puhti
+
+![](../../images/disk-systems.svg){width=90%}
+
+| Name |Owner |Path |Cleaning |Capacity|Number of files| Use |
+|------------|--------|--------------------|---------------------|--------------|----------------|----------------|
+|**[home](https://docs.csc.fi/computing/disk/#home-directory)** |Personal|`/users/` |No |10 GiB |100 000 files | personal settings and files |
+|**[projappl](https://docs.csc.fi/computing/disk/#projappl-directory)**|Project |`/projappl/`|No |50 GiB |100 000 files | installation files |
+|**[scratch](https://docs.csc.fi/computing/disk/#scratch-directory)** |Project |`/scratch/` |180 days |1 TiB |1 000 000 files | main working area |
+
+- [LUMI disks](https://docs.lumi-supercomputer.eu/storage/)
+
+# Additional temporary fast local disk areas
+
+- [Login node local tmp](https://docs.csc.fi/computing/disk/#login-nodes) - `$TMPDIR`, compiling, temporary,
+  - Each of the login nodes has 2900 GiB of fast local storage
+  - The local storage is meant for temporary storage and is cleaned frequently
+
+- [NVMe](https://docs.csc.fi/computing/running/creating-job-scripts-puhti/#local-storage) - `$LOCAL_SCRATCH` in batch jobs,
+  - Interactive batch job, IO- and GPU-nodes have
+  - You must copy data in and out during your batch job. NVMe is accessible only during your job allocation.
+  - If your job reads or writes a lot of small files, using this can give 10x performance boost
+
+
+## Displaying current status of disk areas
+
+-> displays all scratch and projappl directories you have access to.
+- use `csc-workspaces` command to display available projects and quotas
+
+
+![](../../images/disk_status.png){width=50%}
+
+# Some best practice tips
+
+- Take **backups** of important files. Data on Puhti disks is not backed up.
+- Supercomputer disks do not work well with **too many small files** (see the file limits above)
+  - Plan your analysis in a way that too many files are not needed.
+  - Keep the small files in one zip-file, unzip it only on local fast disks during the analysis.
+  - Don't create a lot of files in one folder
+- [Best practice performance tips for using Lustre](https://docs.csc.fi/computing/lustre/#best-practices)
+- **Databases**:
+  - Only file databases (SQLite, GeoPackage) can be kept in supercomputer disks,
+  - For PostgreSQL (but not PostGIS) use CSC [Database-as-service](https://docs.csc.fi/cloud/dbaas/)
+  - For any other database set up virtual machine in cPouta
\ No newline at end of file
diff --git a/materials/supercomputer/spatial_data_at_csc.md b/materials/supercomputer/spatial_data_at_csc.md
new file mode 100644
index 00000000..58a7fd95
--- /dev/null
+++ b/materials/supercomputer/spatial_data_at_csc.md
@@ -0,0 +1,23 @@
+# [Geospatial data available on CSC supercomputer Puhti](https://docs.csc.fi/data/datasets/spatial-data-in-csc-computing-env/)
+
+* Large commonly used geospatial datasets with open license
+* Removes transfer bottleneck
+* Located at: `/appl/data/geo/`
+* All Puhti users have read access
+
+* ~13 TB of datasets available:
+* Paituli data, with virtual mosaics for raster data
+* SYKE open datasets
+* LUKE Multi-source national forest inventory
+* Forest center canopy height etc
+
+
+# [Paituli STAC](https://paituli.csc.fi/stac.html)
+
+- Easy search and download of data
+- Example scripts for Python and R
+- [~100 datasets](https://radiantearth.github.io/stac-browser/#/external/paituli.csc.fi/geoserver/ogc/stac/v1)
+  - Paituli raster datasets
+  - FMI tuulituhohaukka datasets
+  - GeoCubes datasets
+  - Sentinel2 2A images in Allas by