
Flows for moving back and forth between CFS and HPSS #59

Open
dylanmcreynolds opened this issue Feb 6, 2025 · 1 comment
@dylanmcreynolds
Contributor

Now that we have a working implementation of the sfapi in this repo, we can start looking at other things we can do with it at NERSC.

We are using CFS for all storage, but much of our data should be moved to tape. That system is called HPSS. Software at NERSC called hsi is the way to move data between the two file systems.

While NERSC has a Globus DTN for HPSS, it doesn't work for our use case because we depend on the alsdev collab account as a kind of service account for our jobs, and there is no concept of collab accounts in HPSS. Any data put there would end up with permissions tied to our personal accounts.

So, now that we have sfapi from prefect, we can start using HPSS.

Some use cases:

New file created (see the sketch after these lists):

  • Data is transferred from beamline to CFS (globus)
  • Data is transferred from CFS to HPSS (hsi via sfapi)
  • Information about location on both CFS and HPSS is stored in scicat for the dataset
  • Prune job is set up for the future (6 months? 1-2 years?) to delete from CFS. (Globus?)

Prune job:

  • Data is deleted from CFS
  • Scicat is updated

Recovery job:

  • Data is moved from HPSS to CFS
  • Scicat is updated

Assigning David to start thinking about this.

@davramov
Contributor

davramov commented Feb 8, 2025

I'm taking the time to think about this before jumping into the code since there are a few big pieces.

HPSS
Some considerations from the HPSS documentation:

  • Users should aim for file sizes between 100 GB and 2 TB. Avoid small files and huge files.
  • The command-line tools hsi and htar allow you to move files in and out of HPSS and address the need for grouping. The commands tar and split can help break up large files. htar is used to put bundles of files into HPSS.
  • Execute hsi commands from any Perlmutter login node or a Data Transfer Node by typing either hsi <command> (single command) or hsi alone for an interactive session.

Considering the first point, maybe we should avoid moving individual scans to HPSS, and instead wait until a beamtime/experiment is completed, then bundle and send the data using htar. I am not sure about a simple way to keep track of when to copy projects to tape. We could consider scheduling something to run every week.

Either way, I can add a new class HPSSTransferController() in orchestration/transfer_controller.py that implements the CFS -> HPSS -> CFS copy actions using hsi and htar in a Slurm script scheduled with SFAPI.

EDIT: It looks like it is possible to transfer directly from HPSS to a Globus endpoint. Upon further reading, it is not recommended to go this route for a few reasons. Instead we can use the xfer QOS for submitting hsi and htar commands.
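
As a sketch of what that could look like: the class and method names are illustrative, the base-class import is an assumption about the existing ABC in transfer_controller.py, and the sfapi_client calls follow my reading of that package's interface.

```python
from sfapi_client import Client
from sfapi_client.compute import Machine

from orchestration.transfer_controller import TransferController  # assumed ABC name


class HPSSTransferController(TransferController):
    """Copy data from CFS to HPSS by submitting an htar job via the SF API."""

    def copy_to_hpss(self, cfs_dir: str, hpss_tarfile: str) -> None:
        # htar bundles a directory into a single archive on HPSS, which keeps
        # member counts down and archive sizes closer to the recommended range.
        script = f"""#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=cfs_to_hpss
htar -cvf {hpss_tarfile} {cfs_dir}
"""
        with Client(...) as client:  # credentials elided; see create_sfapi_client below
            perlmutter = client.compute(Machine.perlmutter)
            job = perlmutter.submit_job(script)
            job.complete()  # block until the Slurm job finishes
```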

SFAPI
Right now, create_sfapi_client is defined as a method in NERSCTomographyHPCController(), but maybe it should move to orchestration/nersc.py or to a new script orchestration/sfapi.py, since we can potentially use it for many jobs on NERSC.

EDIT: Thinking about how to refactor create_sfapi_client: in the current implementation we read the SFAPI ID/KEY directly from files, with the paths stored in .env variables. However, considering how frequently NERSC wants us to reauthorize the key/secret, and the need to store those on the production server ... we should consider a better strategy. We could store these values in Prefect Secret Blocks, but that would still require a manual step of copying the values into Prefect.
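
One possible shape for that, assuming we go the Secret-block route (the block names here are made up):

```python
from prefect.blocks.system import Secret
from sfapi_client import Client


def create_sfapi_client() -> Client:
    """Build an SF API client from credentials stored in Prefect Secret blocks."""
    client_id = Secret.load("sfapi-client-id").get()
    secret_key = Secret.load("sfapi-secret-key").get()
    # The exact Client constructor arguments depend on how the key material is
    # stored (PEM vs. JWK); this assumes the (client_id, secret) form.
    return Client(client_id, secret_key)
```

The blocks would still need to be repopulated by hand whenever NERSC expires the key, which is the manual step mentioned above.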

Pruning
It may also be worthwhile to refactor our pruning code into the same ABC pattern in a new file orchestration/prune_controller.py. This would make our pruning flows easier to read, more scalable, and more maintainable.
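
For example, something along these lines (names illustrative):

```python
from abc import ABC, abstractmethod


class PruneController(ABC):
    """Base class for deleting expired data from a particular storage system."""

    @abstractmethod
    def prune(self, path: str) -> None:
        """Delete the data at `path` from this storage system."""


class CFSPruneController(PruneController):
    def prune(self, path: str) -> None:
        # Placeholder: would remove the path from CFS (e.g. via a Globus delete
        # task or an SF API job) and leave the SciCat update to the calling flow.
        ...
```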

SciCat
With SciCat, we should consider refactoring that code to be generalizable across beamline implementations. A few general method ideas:

  • scicat_controller.create_new_raw_dataset(data, ... ): first time the data goes into scicat, returns the dataset_id
  • scicat_controller.create_derived_dataset(derived_data, raw_dataset_id): create a derived dataset (i.e. reconstruction) and link it to the raw dataset. returns the derived_dataset_id
  • scicat_controller.copy_dataset(dataset_id, new_location): update scicat with a new entry for where to find the data. Can interface with either a raw_id or derived_id.
  • scicat_controller.pruned_dataset(dataset_id, pruned_location): indicate in scicat that the data at pruned_location has been deleted. Can interface with either a raw_id or derived_id.

Then, for each beamline, we can define specific SciCat implementations with the types of metadata and derived datasets we want to track.
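
A sketch of that interface as an ABC; the signatures just mirror the bullets above, and beamline-specific subclasses would fill in the metadata details:

```python
from abc import ABC, abstractmethod
from typing import Any


class ScicatController(ABC):
    """Beamline-agnostic interface for recording dataset lifecycle events in SciCat."""

    @abstractmethod
    def create_new_raw_dataset(self, data: dict[str, Any]) -> str:
        """Register a raw dataset the first time it enters SciCat; returns dataset_id."""

    @abstractmethod
    def create_derived_dataset(self, derived_data: dict[str, Any], raw_dataset_id: str) -> str:
        """Register a derived dataset (e.g. a reconstruction) linked to its raw dataset."""

    @abstractmethod
    def copy_dataset(self, dataset_id: str, new_location: str) -> None:
        """Record a new location (CFS or HPSS) for an existing raw or derived dataset."""

    @abstractmethod
    def pruned_dataset(self, dataset_id: str, pruned_location: str) -> None:
        """Mark that the dataset has been deleted from the given location."""
```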

davramov added a commit to davramov/splash_flows_globus that referenced this issue Feb 11, 2025
…quires thorough testing. Includes a new transfer controller CFSToHPSSTransferController() with logic for handling single files vs directories using HPSS best practices. Moves create_sfapi_client() to the same level as transfer_controller.py, such that it can be easily accessed by multiple components. Includes new documentation in MkDocs for HPSS. Added an HPSS endpoint to config.yml. Updates orchestration/_tests/test_sfapi_flow.py to reflect the new location of create_sfapi_client().