Add data retention policy #188

asmacdo · 2024-08-06T13:50:35Z

Here's a sketch of a possible data retention policy. Let's iron out what we want here prior to implementation.

Fixes: #182

From Yarik's initial thoughts: #177 (comment)

doc/design/data-retention-policy.md

Remove unnecessary (and unclosed paren

doc/design/data-retention-policy.md

yarikoptic

I think it is a good starting point. Once it is implemented and deployed, we will see how it could be improved.

doc/design/data-retention-policy.md

yarikoptic · 2024-08-19T18:33:10Z

doc/design/data-retention-policy.md

+ - `nwb_cache` 
+ - Yarn Cache
+ - `__pycache__`
+ - pip cache


If the user is still active, I think it would be useful to notify long-running users once any of those folders reaches some threshold (e.g. 50MB), asking them to clean it up.

Hi @asmacdo, should we add a separate point here about monitoring and reporting the quotas of cache directories for active users?

doc/design/data-retention-policy.md

yarikoptic · 2024-08-19T18:37:37Z

doc/design/data-retention-policy.md

+   - large file list
+   - summarized data retention policy
+   - Notice number
+   - request to cleanup


Meanwhile, it might be worth creating a simple data record schema to store those records as well, so that tools could reuse them to assemble higher-level stats, etc.
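To make the "simple data record schema" idea concrete, here is a minimal sketch of what one stored notification record could look like. The class name `AuditRecord` and all of its field names are hypothetical; the fields only mirror the notification items quoted above (large file list, notice number, request to clean up), and JSON is just one assumed storage format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditRecord:
    """One audit/notification record for a single user (hypothetical schema)."""

    user: str                # assumed: GitHub username used for the home directory
    audited_at: str          # ISO 8601 timestamp of the audit run
    notice_number: int       # how many notices this user has received so far
    total_bytes: int         # total home directory disk usage at audit time
    large_files: list[str] = field(default_factory=list)  # paths flagged by the audit
    cleanup_requested: bool = False  # whether the notice asked the user to clean up

    def to_json(self) -> str:
        # Sorted keys keep the serialized records stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "AuditRecord":
        return cls(**json.loads(text))
```

Records serialized this way could be appended to a per-user log and later read back by other tools to assemble higher-level stats.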

doc/design/data-retention-policy.md

Co-authored-by: Yaroslav Halchenko <[email protected]>

doc/design/data-retention-policy.md

kabilar · 2024-09-17T21:51:18Z

doc/design/data-retention-policy.md

+Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data
+currently stored are no longer used. Data migration is where the cost becomes extreme.


Since S3 buckets can now be mounted on EC2 instances (reference: August 2023 blog post) and S3 costs ~10x less than EFS, perhaps as part of this data retention work we should also look into what it would take to move to S3 storage (and discuss any features that would not be available after this migration)?

kabilar · 2024-09-17T22:12:13Z

doc/design/data-retention-policy.md

+ - dandiarchive login information
+
+## Automated Data Audit
+


Suggested change

At an interval of 7 days:

- Calculate home directory disk usage

kabilar · 2024-09-17T22:12:27Z

doc/design/data-retention-policy.md

+
+## Automated Data Audit
+
+At some interval (30 days with no login?):


Suggested change

At some interval (30 days with no login?):

At an interval of 30 days with no login to JupyterHub:

kabilar · 2024-09-17T22:15:42Z

doc/design/data-retention-policy.md

+Dandihub data storage on AWS EFS is expensive, and we suppose that significant portions of the data
+currently stored are no longer used. Data migration is where the cost becomes extreme.
+
+## Persistent Data locations


Suggested change

## Persistent Data locations

## Persistent Data Locations

kabilar · 2024-09-17T22:15:54Z

doc/design/data-retention-policy.md

+Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub
+username.
+
+## Known cache file cleanup 


Suggested change

## Known cache file cleanup

## Known Cache File Cleanup

kabilar · 2024-09-17T22:34:57Z

doc/design/data-retention-policy.md

+Each user has access to 2 locations: `/home/{user}` and `/shared/`.
+
+Within jupyterhub `/home/{user}` the user always sees `/home/jovyan`, but is stored in EFS as their GitHub
+username.


We were previously considering providing a /scratch directory for each user that is automatically cleaned up after 30 days. In addition to the policy for the /home/jovyan directory, do we also want to implement a /scratch directory with a 30-day cleanup policy?

kabilar · 2024-09-18T02:17:21Z

doc/design/data-retention-policy.md

+   - find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count
+   - find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count
+   - find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them


Suggested change

- find files larger than 100 (?) GB and mtime > 10 (?) days -- get total size and count

- find files larger than 1 (?) GB and mtime > 30 (?) days -- get total size and count

- find _pycache_ and nwb-cache folders and pip cache and mtime > 30? days -- total sizes and list of them

- find files larger than 100 GB and mtime > 10 days -- get total size and count

- find files larger than 1 GB and mtime > 30 days -- get total size and count

- find _pycache_ and nwb-cache folders and pip cache and mtime > 30 days -- total sizes and list of them
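As an illustration of the "larger than N GB and mtime > M days -- get total size and count" checks, here is a minimal Python sketch. The function name `summarize_large_old_files` and its parameters are hypothetical, and "mtime > M days" is interpreted here as "last modified more than M days ago"; the actual audit could equally be done with `find` or another tool.

```python
import os
import stat
import time


def summarize_large_old_files(root, min_size_bytes, min_age_days, now=None):
    """Return (total_bytes, count) of regular files under `root` that are
    larger than `min_size_bytes` and last modified more than `min_age_days` ago.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    total, count = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, and other special files
            if st.st_size > min_size_bytes and st.st_mtime < cutoff:
                total += st.st_size
                count += 1
    return total, count
```

For example, `summarize_large_old_files("/home/jovyan", 1024**3, 30)` would cover the "1 GB and 30 days" rule under these assumptions.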

kabilar · 2024-09-18T02:20:11Z

doc/design/data-retention-policy.md

+
+Notify user if:
+   - any of the above listed thresholds were reached
+   - total du exceeds some threshold (e.g. 100G)


Suggested change

- total du exceeds some threshold (e.g. 100G)

- total home directory disk usage exceeds 1 TB

I suggested a quota of 1 TB for home directories as many datasets are getting to be quite large. This would provide temporary, high-capacity storage, but hopefully users won't get anywhere near this threshold. This would cost $300/user/month for standard EFS, and $23/user/month if we move to Standard S3.

If we implement a scratch directory, then perhaps the home directory can have a much smaller quota.
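The per-user cost figures in this comment can be reproduced with a short calculation. The per-GB monthly rates below are assumptions back-derived from the quoted $300 and $23 figures (treating 1 TB as 1000 GB); actual AWS pricing depends on region, storage class, and date.

```python
# Assumed rates, inferred from the figures quoted in the comment above.
EFS_STANDARD_USD_PER_GB_MONTH = 0.30
S3_STANDARD_USD_PER_GB_MONTH = 0.023

QUOTA_GB = 1000  # 1 TB home directory quota, counted as 1000 GB


def monthly_cost_usd(quota_gb, usd_per_gb_month):
    """Monthly storage cost for a fully used quota at a flat per-GB rate."""
    return quota_gb * usd_per_gb_month


efs_cost = monthly_cost_usd(QUOTA_GB, EFS_STANDARD_USD_PER_GB_MONTH)
s3_cost = monthly_cost_usd(QUOTA_GB, S3_STANDARD_USD_PER_GB_MONTH)
print(f"EFS: ${efs_cost:.2f}/user/month, S3: ${s3_cost:.2f}/user/month")
```

These are worst-case numbers for a user who fills the whole quota; typical usage would presumably cost far less.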

kabilar · 2024-09-18T02:22:04Z

doc/design/data-retention-policy.md

+Notify user if:
+   - any of the above listed thresholds were reached
+   - total du exceeds some threshold (e.g. 100G)
+   - total outdated caches size exceeds some threshold (e.g. 1G)


Suggested change

- total outdated caches size exceeds some threshold (e.g. 1G)

- total outdated caches size exceeds 1 GB

kabilar · 2024-09-18T02:23:47Z

doc/design/data-retention-policy.md

+   - prior notification was sent more than a week ago
+
+Notification information:
+   - large file list


Suggested change

- large file list

- summarized audit data (total size and count for each of the above thresholds)

- large file list
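The notification conditions discussed in these threads could be combined roughly as in the sketch below. How the conditions are meant to combine is an assumption here: a user is notified when any size threshold is exceeded, but not more often than once a week. The function name, parameter names, and constants are hypothetical; the 1 TB and 1 GB values are taken from the suggested changes above.

```python
from datetime import datetime, timedelta

HOME_USAGE_LIMIT_BYTES = 10**12      # 1 TB total home directory usage (suggested value)
CACHE_LIMIT_BYTES = 10**9            # 1 GB of outdated caches (suggested value)
RENOTIFY_AFTER = timedelta(days=7)   # assumed: at most one notice per week


def should_notify(total_bytes, outdated_cache_bytes, large_file_count,
                  last_notified, now):
    """Decide whether to send a notice, given audit results for one user.

    `last_notified` is the datetime of the previous notice, or None if the
    user has never been notified.
    """
    over_threshold = (
        total_bytes > HOME_USAGE_LIMIT_BYTES
        or outdated_cache_bytes > CACHE_LIMIT_BYTES
        or large_file_count > 0  # any file matched the large-and-old rules
    )
    recently_notified = (
        last_notified is not None and now - last_notified <= RENOTIFY_AFTER
    )
    return over_threshold and not recently_notified
```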

kabilar

Thank you, @asmacdo. This is great. A few suggestions are listed above.

Initial commit for data retention policy discussion

66d3de7

asmacdo requested a review from yarikoptic August 9, 2024 15:46

yarikoptic reviewed Aug 9, 2024

doc/design/data-retention-policy.md

asmacdo commented Aug 12, 2024

doc/design/data-retention-policy.md Outdated

Update doc/design/data-retention-policy.md

74fba57

Remove unnecessary (and unclosed paren

asmacdo commented Aug 19, 2024

doc/design/data-retention-policy.md Outdated

asmacdo commented Aug 19, 2024

doc/design/data-retention-policy.md Outdated

Apply suggestions from code review

cd4aee0

yarikoptic requested changes Aug 19, 2024

asmacdo commented Aug 20, 2024

doc/design/data-retention-policy.md Outdated

Apply suggestions from code review

badd7ea

Co-authored-by: Yaroslav Halchenko <[email protected]>

kabilar reviewed Sep 17, 2024

doc/design/data-retention-policy.md

kabilar reviewed Sep 17, 2024

dandi locked and limited conversation to collaborators Sep 17, 2024

kabilar changed the title ~~Initial commit for data retention policy discussion~~ Add data retention policy Sep 17, 2024

kabilar reviewed Sep 17, 2024

kabilar reviewed Sep 18, 2024

kabilar requested changes Sep 18, 2024
