For estimating the cost for data upload to AWS S3 #480

Open
manoaman opened this issue Jun 21, 2021 · 6 comments
Labels
question What is going on??? :thinking emoji:

Comments

@manoaman

Hi there,

AWS S3 charges for uploading data to its storage. If I'm not mistaken, CloudVolume supports direct upload to S3. When estimating the cost, should I calculate based on the number of files being uploaded, or on the number of POST/PUT requests made to S3?

Precomputed volumes are typically stored on AWS S3, Google Storage, or locally. CloudVolume can read and write to these object storage providers given a service account token with appropriate permissions. However, these volumes can be stored on any service, including an ordinary webserver or local filesystem, that supports key-value access.
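
For example, I'm imagining a direct write along these lines (the bucket, path, and sizes below are made up purely for illustration; if I understand correctly, each chunk ends up as one object in the bucket):

```python
import numpy as np
from cloudvolume import CloudVolume

# Hypothetical layer; bucket name, path, and sizes are made up for illustration.
info = CloudVolume.create_new_info(
    num_channels=1,
    layer_type='image',
    data_type='uint8',
    encoding='raw',
    resolution=[4, 4, 40],       # nm per voxel
    voxel_offset=[0, 0, 0],
    chunk_size=[128, 128, 64],   # one chunk = one object in the bucket
    volume_size=[1024, 1024, 128],
)

vol = CloudVolume('s3://my-bucket/my-dataset/image', info=info)  # needs AWS credentials
vol.commit_info()  # uploads the 'info' JSON (one small PUT)

# Writing this block uploads (1024/128) * (1024/128) * (128/64) = 128 chunk files.
vol[:, :, :] = np.zeros((1024, 1024, 128), dtype=np.uint8)
```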

This is what AWS provides for pricing estimation:
https://aws.amazon.com/s3/pricing/

Thank you,
-m

@william-silversmith added the question label Jun 22, 2021
@william-silversmith
Contributor

Hi m,

I believe it is the number of PUT requests (which should be ~1:1 with the number of files, except for failed uploads? Not sure if they charge for error codes). Remember to also account for storage cost, number of writes, and any inter-region egress and egress-to-internet charges later on. Usually these cloud providers have free ingress. No warranty on this advice though! It's always possible I've missed something.
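
If it helps, here's the kind of back-of-the-envelope sketch I'd do. The per-request and storage prices below are placeholders (roughly S3 Standard rates, but not authoritative); check the pricing page before trusting the output.

```python
# Rough upload-cost estimator. Prices are assumed placeholders, not current rates:
# see https://aws.amazon.com/s3/pricing/ for the real numbers.

def estimate_upload_cost(num_files, total_gb,
                         put_price_per_1k=0.005,            # assumed $ per 1,000 PUT requests
                         storage_price_per_gb_month=0.023,  # assumed $ per GB-month
                         months_stored=1):
    put_cost = num_files / 1000.0 * put_price_per_1k        # ~1 PUT per file uploaded
    storage_cost = total_gb * storage_price_per_gb_month * months_stored
    # Ingress is typically free; egress (downloads, cross-region copies) is billed separately.
    return put_cost + storage_cost

# e.g. 14.9 million chunk files totalling ~15,600 GB (1 byte/voxel, uncompressed):
print(estimate_upload_cost(14.9e6, 15_600))  # 74.5 (PUTs) + 358.8 (storage) = 433.3
```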

@william-silversmith
Contributor

Here are some example calculations for running a process on Google Cloud. The AWS calculations are pretty similar: https://github.com/seung-lab/kimimaro/wiki/The-Economics:-Skeletons-for-the-People/c2d4e28645e96d3e963f7338a46d15dc3890c553

@manoaman
Author

Thank you for the input, @william-silversmith!! Wow, the data really does need careful processing before production-ready chunks can be uploaded. If the scenario is to compute on a local machine (not writing directly to the cloud), should I be looking at the following costs for the upload-ready files? Let's say the cloud provider offers a CLI to upload these datasets.

Assuming image chunks stored as 128x128x64:

15.6 TVx / (128x128x64 voxels/file) = 14.9 million files
14.9e6 files * ($4 per ten million files) = $5.95

Assume segmentation labels are fractured into about 1.5 billion fragments after chunking:
2.3B PUT requests * $5 per million PUTs = $11,500
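
(A quick Python check of that arithmetic, using the per-request prices from the article rather than current rates:)

```python
# Quick check of the numbers quoted above. The per-file/per-PUT prices come
# from the linked article, not necessarily the current S3 rate card.

voxels = 15.6e12                    # 15.6 TVx after downsampling
voxels_per_file = 128 * 128 * 64    # one chunk per file
files = voxels / voxels_per_file
print(files / 1e6)                  # ~14.9 (million files)
print(files * 4 / 10e6)             # at $4 per ten million files -> ~$5.95

puts = 2.3e9                        # skeleton fragment writes
print(puts * 5 / 1e6)               # at $5 per million PUTs -> $11,500.0
```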

neurodata.io seems to use s3 cloud storage. Do you know if they use CloudVolume to upload?
https://neurodata.io/project/ocp/

@william-silversmith
Contributor

neurodata.io seems to use s3 cloud storage. Do you know if they use CloudVolume to upload?

To my knowledge they've been using S3, though at one point they were considering Azure. I believe they use CV to upload, but I can't be sure. They recommend using CloudVolume to download on their site.

If the scenario is to compute on a local machine (not writing directly to the cloud), should I be looking at the following costs for the upload-ready files? Let's say the cloud provider offers a CLI to upload these datasets.

You should check whether those are the current S3 prices yourself, but the first line refers to reading the entire segmentation and the second line refers to writing all the skeleton fragments. If you're just reading and writing images using a reasonable chunk size, you should be okay. What kind of job are you running, and what size is it (approximately, if need be)? That would help me give better insight.

@manoaman
Author

Hi @william-silversmith ,

Yes, I definitely need to revisit the current S3 prices. To better understand the illustrated example, what was the original size of the raw data? Had it been downsampled?

1 Petavoxel = 200,000 x 200,000 x 25,000 voxels at 4x4x40 nm resolution

@william-silversmith
Contributor

In the example given, the data was downsampled to mip 3 (hence 15.6 TVx). If you're concerned about the generation of meshes/skeletons, there's some good news. Skeleton generation has gotten a lot better since that (old) article, thanks to the sharded format. Meshes are still under development.

Here's the updated article: https://github.com/seung-lab/kimimaro/wiki/The-Economics:-Skeletons-for-the-People
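
To make the size arithmetic concrete, here's a small sketch assuming the usual 2x2x1 downsampling per mip (X and Y halved at each mip, Z left alone):

```python
# How 1 petavoxel becomes ~15.6 TVx at mip 3, assuming 2x2x1 downsampling per mip.

petavoxel = 200_000 * 200_000 * 25_000   # 1e15 voxels at 4x4x40 nm
mip = 3
shrink = (2 ** mip) ** 2                 # 8x in X times 8x in Y = 64x fewer voxels
print(petavoxel / shrink / 1e12)         # ~15.6 (teravoxels)
```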
