
Flows for moving back and forth between CFS and HPSS #59

Open
dylanmcreynolds opened this issue Feb 6, 2025 · 1 comment
@dylanmcreynolds
Contributor

Now that we have a working implementation of the sfapi in this repo, we can start looking at other things we can do with it at NERSC.

We are using CFS for all storage, but much of our data should be moved to tape. That system is called HPSS. Software at NERSC called hsi is the way to move data between the two file systems.

While NERSC has a Globus DTN for HPSS, it doesn't work for our use case because we depend on the alsdev collab account as a kind of service account for our jobs, and there is no concept of collab accounts in HPSS. Any data put there would end up with permissions tied to our personal accounts.

So, now that we have sfapi from prefect, we can start using HPSS.

Some use cases:

New file created (see the sketch after these lists):

  • Data is transferred from beamline to CFS (globus)
  • Data is transferred from CFS to HPSS (hsi via sfapi)
  • Information about location on both CFS and HPSS is stored in scicat for the dataset
  • Prune job is set up for the future (6 months? 1-2 years?) to delete from CFS. (Globus?)

Prune job:

  • Data is deleted from CFS
  • Scicat is updated

Recovery job:

  • Data is moved from HPSS to CFS
  • Scicat is updated

Assigning David to start thinking about this.

@davramov
Contributor

davramov commented Feb 8, 2025

I'm taking the time to think about this before jumping into the code since there are a few big pieces.

HPSS
Some considerations from the HPSS documentation:

  • Users should aim for file sizes between 100 GB and 2 TB. Avoid small files and huge files.
  • The command-line tools hsi and htar allow you to move files in and out of HPSS and address the need for grouping. The commands tar and split can help break up large files. htar is used to put bundles of files into HPSS.
  • Execute hsi commands from any Perlmutter login node or a Data Transfer Node by typing either hsi <command> (single command) or hsi alone for an interactive session.

Considering the first point, maybe we should avoid moving individual scans to HPSS, and instead wait until a beamtime/experiment is completed, then bundle and send the data using htar. I am not sure about a simple way to keep track of when to copy projects to tape. We could consider scheduling something to run every week.

Either way, I can add a new class HPSSTransferController() in orchestration/transfer_controller.py that implements the CFS -> HPSS -> CFS copy actions using hsi and htar in a Slurm script scheduled with SFAPI.

EDIT: It looks like it is possible to transfer directly from HPSS to a Globus endpoint. Upon further reading, it is not recommended to go this route for a few reasons. Instead we can use the xfer QOS for submitting hsi and htar commands.
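
As a sketch of what that could look like: the class and method names are illustrative, the base-class import is an assumption about the existing ABC in transfer_controller.py, and the sfapi_client calls follow my reading of that package's interface.

```python
from sfapi_client import Client
from sfapi_client.compute import Machine

from orchestration.transfer_controller import TransferController  # assumed ABC name


class HPSSTransferController(TransferController):
    """Copy data from CFS to HPSS by submitting an htar job via the SF API."""

    def copy_to_hpss(self, cfs_dir: str, hpss_tarfile: str) -> None:
        # htar bundles a directory into a single archive on HPSS, which keeps
        # member counts down and archive sizes closer to the recommended range.
        script = f"""#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=cfs_to_hpss
htar -cvf {hpss_tarfile} {cfs_dir}
"""
        with Client(...) as client:  # credentials elided; see create_sfapi_client below
            perlmutter = client.compute(Machine.perlmutter)
            job = perlmutter.submit_job(script)
            job.complete()  # block until the Slurm job finishes
```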

SFAPI
Right now, create_sfapi_client is defined as a method in NERSCTomographyHPCController(), but maybe it should move to orchestration/nersc.py or to a new script orchestration/sfapi.py, since we can potentially use it for many jobs on NERSC.

EDIT: Thinking about how to refactor create_sfapi_client: in the current implementation we read the SFAPI ID/KEY directly from files, with the paths stored in .env variables. However, considering how frequently NERSC wants us to reauthorize the key/secret, and the need to store those on the production server ... we should consider a better strategy. We could store these values in Prefect Secret Blocks, but that would still require a manual step of copying the values into Prefect.
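
One possible shape for that, assuming we go the Secret-block route (the block names here are made up):

```python
from prefect.blocks.system import Secret
from sfapi_client import Client


def create_sfapi_client() -> Client:
    """Build an SF API client from credentials stored in Prefect Secret blocks."""
    client_id = Secret.load("sfapi-client-id").get()
    secret_key = Secret.load("sfapi-secret-key").get()
    # The exact Client constructor arguments depend on how the key material is
    # stored (PEM vs. JWK); this assumes the (client_id, secret) form.
    return Client(client_id, secret_key)
```

The blocks would still need to be repopulated by hand whenever NERSC expires the key, which is the manual step mentioned above.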

Pruning
It may also be worthwhile to refactor our pruning code into the same ABC pattern in a new file orchestration/prune_controller.py. This would make our pruning flows easier to read, more scalable, and more maintainable.
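
For example, something along these lines (names illustrative):

```python
from abc import ABC, abstractmethod


class PruneController(ABC):
    """Base class for deleting expired data from a particular storage system."""

    @abstractmethod
    def prune(self, path: str) -> None:
        """Delete the data at `path` from this storage system."""


class CFSPruneController(PruneController):
    def prune(self, path: str) -> None:
        # Placeholder: would remove the path from CFS (e.g. via a Globus delete
        # task or an SF API job) and leave the SciCat update to the calling flow.
        ...
```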

SciCat
With SciCat, we should consider refactoring that code to be generalizable across beamline implementations. A few general method ideas:

  • scicat_controller.create_new_raw_dataset(data, ... ): first time the data goes into scicat, returns the dataset_id
  • scicat_controller.create_derived_dataset(derived_data, raw_dataset_id): create a derived dataset (i.e. reconstruction) and link it to the raw dataset. returns the derived_dataset_id
  • scicat_controller.copy_dataset(dataset_id, new_location): update scicat with a new entry for where to find the data. Can interface with either a raw_id or derived_id.
  • scicat_controller.pruned_dataset(dataset_id, pruned_location): indicate in scicat that the data at pruned_location has been deleted. Can interface with either a raw_id or derived_id.

Then, for each beamline, we can define specific SciCat implementations with the types of metadata and derived datasets we want to track.
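
A sketch of that interface as an ABC; the signatures just mirror the bullets above, and beamline-specific subclasses would fill in the metadata details:

```python
from abc import ABC, abstractmethod
from typing import Any


class ScicatController(ABC):
    """Beamline-agnostic interface for recording dataset lifecycle events in SciCat."""

    @abstractmethod
    def create_new_raw_dataset(self, data: dict[str, Any]) -> str:
        """Register a raw dataset the first time it enters SciCat; returns dataset_id."""

    @abstractmethod
    def create_derived_dataset(self, derived_data: dict[str, Any], raw_dataset_id: str) -> str:
        """Register a derived dataset (e.g. a reconstruction) linked to its raw dataset."""

    @abstractmethod
    def copy_dataset(self, dataset_id: str, new_location: str) -> None:
        """Record a new location (CFS or HPSS) for an existing raw or derived dataset."""

    @abstractmethod
    def pruned_dataset(self, dataset_id: str, pruned_location: str) -> None:
        """Mark that the dataset has been deleted from the given location."""
```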

davramov added a commit to davramov/splash_flows_globus that referenced this issue Feb 11, 2025
…quires thorough testing. Includes a new transfer controller CFSToHPSSTransferController() with logic for handling single files vs directories using HPSS best practices. Moves create_sfapi_client() to the same level as transfer_controller.py, such that it can be easily accessed by multiple components. Includes new documentation in MkDocs for HPSS. Added an HPSS endpoint to config.yml. Updates orchestration/_tests/test_sfapi_flow.py to reflect the new location of create_sfapi_client().