From 66d3de79da130be0913d47f54727aba3ff2214e7 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Tue, 6 Aug 2024 08:48:07 -0500 Subject: [PATCH 1/4] Initial commit for data retention policy discussion --- doc/design/data-retention-policy.md | 57 +++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 doc/design/data-retention-policy.md diff --git a/doc/design/data-retention-policy.md b/doc/design/data-retention-policy.md new file mode 100644 index 0000000..5e93450 --- /dev/null +++ b/doc/design/data-retention-policy.md @@ -0,0 +1,57 @@ +# Data Retention Policy + +Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data +currently stored are no longer used. Data migration is where the cost becomes extreme. + +## Persistent Data locations + +Each user has access to 2 locations: `/home/{user}` and `/shared/` + +(Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub +username. + +## Known cache file cleanup + +We should be able to safely remove the following: + - `/home/{user}/.cache` + - `nwb_cache` + - Yarn Cache + - `__pycache__` + - pip cache + + +## Determining Last Access + +EFS does not store metadata for the last access of the data. (Though they must track somehow to move +to `Infrequent Access`) + +Alternatives: + - use the [jupyterhub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users) check when user last used/logged in to hub. + - dandiarchive login information + +## Automated Data Audit + +At some interval (30 days with no login?): + - find files larger than 1GB and mtime > 30 (?) days -- get total size and count + - find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them + +Notify user if: + - total du exceeds some threshold (e.g. 100G) + - total outdated caches size exceeds some threshold (e.g. 
1G) + - prior notification was sent more than a week ago + +Notification information: + - large file list + - summarized data retention policy + - Notice number + - request to cleanup + +### Non-response cleanup + +If a user has not logged in for 60 days (30 days initial + 30 days following audit), send a warning: +`In 10 days the following files will be cleaned up` + +If the user has not logged in for 60 days (30 initial + 30 after audit + 10 warning): +`The following files were removed` + +Reset timer. From 74fba570a83dbf88e6c60ff79f660cb479280e41 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Mon, 12 Aug 2024 09:54:38 -0500 Subject: [PATCH 2/4] Update doc/design/data-retention-policy.md Remove unnecessary (and unclosed paren --- doc/design/data-retention-policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/data-retention-policy.md b/doc/design/data-retention-policy.md index 5e93450..165a1ee 100644 --- a/doc/design/data-retention-policy.md +++ b/doc/design/data-retention-policy.md @@ -7,7 +7,7 @@ currently stored are no longer used. Data migration is where the cost becomes ex Each user has access to 2 locations: `/home/{user}` and `/shared/` -(Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub +Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub username. 
## Known cache file cleanup From cd4aee0c4e6777466b6eab0532e0780dac9b6c53 Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Mon, 19 Aug 2024 12:59:40 -0500 Subject: [PATCH 3/4] Apply suggestions from code review --- doc/design/data-retention-policy.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/design/data-retention-policy.md b/doc/design/data-retention-policy.md index 165a1ee..3c00373 100644 --- a/doc/design/data-retention-policy.md +++ b/doc/design/data-retention-policy.md @@ -32,7 +32,7 @@ Alternatives: ## Automated Data Audit At some interval (30 days with no login?): - - find files larger than 1GB and mtime > 30 (?) days -- get total size and count + - find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count - find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them Notify user if: @@ -51,7 +51,7 @@ Notification information: If a user has not logged in for 60 days (30 days initial + 30 days following audit), send a warning: `In 10 days the following files will be cleaned up` -If the user has not logged in for 60 days (30 initial + 30 after audit + 10 warning): +If the user has not logged in for 70 days (30 initial + 30 after audit + 10 warning): `The following files were removed` Reset timer. From badd7ea892d942e0439648b616b1f26dbbc3ff5c Mon Sep 17 00:00:00 2001 From: Austin Macdonald Date: Tue, 20 Aug 2024 08:50:17 -0500 Subject: [PATCH 4/4] Apply suggestions from code review Co-authored-by: Yaroslav Halchenko --- doc/design/data-retention-policy.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/design/data-retention-policy.md b/doc/design/data-retention-policy.md index 3c00373..0b0a08a 100644 --- a/doc/design/data-retention-policy.md +++ b/doc/design/data-retention-policy.md @@ -5,7 +5,7 @@ currently stored are no longer used. 
Data migration is where the cost becomes ex ## Persistent Data locations -Each user has access to 2 locations: `/home/{user}` and `/shared/` +Each user has access to 2 locations: `/home/{user}` and `/shared/`. Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub username. @@ -22,8 +22,7 @@ We should be able to safely remove the following: ## Determining Last Access -EFS does not store metadata for the last access of the data. (Though they must track somehow to move -to `Infrequent Access`) +EFS does not store metadata for the last access of the data. (Though they must track somehow to move to `Infrequent Access`) Alternatives: - use the [jupyterhub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users) check when user last used/logged in to hub. @@ -32,10 +31,12 @@ Alternatives: ## Automated Data Audit At some interval (30 days with no login?): + - find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count - find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count - find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them Notify user if: + - any of the above listed thresholds were reached - total du exceeds some threshold (e.g. 100G) - total outdated caches size exceeds some threshold (e.g. 1G) - prior notification was sent more than a week ago
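
---

The per-user audit sketched in the patches above (find large stale files, plus locate known cache directories such as `__pycache__`, `nwb_cache`, and `.cache`) could be prototyped roughly as follows. This is a minimal illustrative sketch, not part of the patch series: the function name `audit_home`, the cache-directory set, and the default thresholds are placeholders mirroring the draft's tentative numbers (1 GB, 30 days), all of which the patches themselves mark as open questions.

```python
import os
import time

# Cache directory names the draft policy considers safe to remove.
CACHE_DIR_NAMES = {"__pycache__", "nwb_cache", ".cache"}

def audit_home(root, size_threshold=1 * 1024**3, age_days=30, now=None):
    """Scan `root` for large, stale files and known cache directories.

    Returns (large_files, large_total_bytes, cache_dirs), where
    large_files is a list of (path, size_in_bytes) pairs for files that
    exceed `size_threshold` and were last modified more than `age_days`
    days before `now`.
    """
    now = time.time() if now is None else now
    cutoff = now - age_days * 86400
    large_files, large_total, cache_dirs = [], 0, []
    for dirpath, dirnames, filenames in os.walk(root):
        # Record cache directories and prune them from the walk so their
        # contents are reported as a unit rather than file by file.
        for d in list(dirnames):
            if d in CACHE_DIR_NAMES:
                cache_dirs.append(os.path.join(dirpath, d))
                dirnames.remove(d)
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_size >= size_threshold and st.st_mtime < cutoff:
                large_files.append((path, st.st_size))
                large_total += st.st_size
    return large_files, large_total, cache_dirs
```

The summary this returns (count, total size, cache locations) maps directly onto the notification contents listed in the draft; note that `st_mtime` is only a proxy for "last access", which is exactly the limitation the "Determining Last Access" section discusses.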