Zero (data) copy conversion of existing data in object stores #2697

vicaya · 2022-09-08T06:26:14Z

First, congrats on the 1.0 GA release!

According to the FAQ section in the docs, accessing existing data in an object store is not yet supported.

Do you have a plan to support this feature?
Would it be easier if the existing data is read-only?

Here is a compelling reason to do so: many AI/ML workloads want read-only access to public (or private) datasets that already live in (cloud) object stores, often with an average object size smaller than the jfs block size. Having to copy them into juicefs is sub-optimal. A major draw of jfs is the separate metadata store, which makes metadata operations such as listing all the files in a tree efficient for multiple data-parallel clients. Prefix listing a large object store tree takes several minutes per client. We worked around this problem by explicitly generating a shared static manifest file that can be fetched in a second or two. An ideal usage example:

juicefs format --storage <type> ... \
               --import <similar-to-sync-uri> \
               --import <another-import-uri> \
               <metadata-engine-uri> \
               <volume-name>

The format command would scan the read-only import URIs (checking for prefix conflicts) and create fs metadata for the imported data without copying any data. Afterwards, if you ls /mnt/jfs/<volume-name>, you should see one directory per import URI (named after the prefix of the imported URI) instead of an empty directory. ls -R should be fast because these directories are read-only and no rescan of the original object stores is needed.
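To make the "check prefix conflicts" step concrete, here is a minimal Python sketch. It is not JuiceFS code. It assumes that a conflict means two import URIs in the same store and bucket where one prefix equals or contains the other; the function names and the example URIs are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def normalized_prefix(uri: str) -> Tuple[str, str, str]:
    """Split an import URI into (scheme, bucket, prefix ending in '/')."""
    parsed = urlparse(uri)
    prefix = parsed.path.strip("/")
    # An empty prefix means the whole bucket is imported.
    return parsed.scheme, parsed.netloc, (prefix + "/" if prefix else "")


def find_prefix_conflicts(uris: List[str]) -> List[Tuple[str, str]]:
    """Return pairs of URIs where one prefix equals or contains the other."""
    conflicts = []
    for i, first in enumerate(uris):
        for second in uris[i + 1:]:
            scheme_a, bucket_a, prefix_a = normalized_prefix(first)
            scheme_b, bucket_b, prefix_b = normalized_prefix(second)
            if (scheme_a, bucket_a) != (scheme_b, bucket_b):
                continue  # different stores or buckets cannot overlap
            # The trailing '/' keeps 'data/image/' from matching 'data/imagenet/'.
            if prefix_a.startswith(prefix_b) or prefix_b.startswith(prefix_a):
                conflicts.append((first, second))
    return conflicts


if __name__ == "__main__":
    # Hypothetical import URIs, for illustration only.
    imports = [
        "s3://bucket/datasets/imagenet",
        "s3://bucket/datasets/imagenet/train",
        "s3://bucket/datasets/coco",
    ]
    for first, second in find_prefix_conflicts(imports):
        print("conflict:", first, "<->", second)
```

A check like this would run before any metadata is created, so that overlapping imports are rejected up front rather than producing two directories that alias the same objects.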

davies (Maintainer) · 2022-09-13T02:34:58Z

We are looking into this, but are not sure how best to do it:

Scan the objects once and build the metadata into Redis. The list of objects (and their attributes) may then get out of sync, and it will be expensive to keep them updated. JuiceFS Cloud has this feature; can you try that out?
Scan the objects when they are accessed for the first time and build an in-memory cache in the client. Then all the clients need to scan them separately. A TTL could be specified to tell when these cached objects expire.
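The second option (scan on first access, cache in the client, expire by TTL) can be sketched as follows. This is a minimal illustration in Python, not how JuiceFS implements anything: `ListingCache`, `list_objects`, and the `list_prefix` callback that stands in for a real object store scan are all hypothetical names.

```python
import time
from typing import Callable, Dict, List, Tuple


class ListingCache:
    """Per-client cache of object listings, keyed by prefix, with a TTL."""

    def __init__(
        self,
        list_prefix: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._list_prefix = list_prefix  # hypothetical: scans the object store
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def list_objects(self, prefix: str) -> List[str]:
        """Return the cached listing, rescanning if it is missing or expired."""
        now = self._clock()
        cached = self._entries.get(prefix)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        keys = self._list_prefix(prefix)  # the expensive scan, once per TTL
        self._entries[prefix] = (now, keys)
        return keys


if __name__ == "__main__":
    def fake_scan(prefix: str) -> List[str]:
        # Stand-in for a slow prefix listing against a real object store.
        print("scanning", prefix)
        return [prefix + "part-0", prefix + "part-1"]

    cache = ListingCache(fake_scan, ttl_seconds=60)
    cache.list_objects("datasets/")  # triggers a scan
    cache.list_objects("datasets/")  # served from the in-memory cache
```

Because the cache lives in each client, every client still pays for its own scan, which is the drawback noted above. The TTL only bounds how stale a client's view can get.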

Initially, we can provide read-only support, but then users will ask for read-write support, which is the thing we decided to avoid in the beginning.

4 replies

vicaya Sep 22, 2022
Author

Just got a chance to check out the juicefs cloud service. It indeed already has the read-only zero-copy import feature, exactly as I had imagined 😁. It'd be great if the same feature were available in the community version.

OTOH, if juicefs cloud supports the Cloudflare R2 object store (https://www.cloudflare.com/products/r2/), I'd be happy to use the juicefs cloud service as well. Zero egress fee is a major incentive!

davies Sep 22, 2022
Maintainer

Cloudflare does not provide virtual machines, so you have to access R2 from somewhere else. You can create a file system in an AWS region (close to you) and use a Cloudflare R2 bucket instead:

./juicefs auth XXX --bucket https://endpoint_of_R2_bucket --token XXX --access-key XXX --secret-key XXX

vicaya Sep 22, 2022
Author

Are you saying that I can use juicefs import in the cloud service to work with R2 like this? If so, that's good enough for me 😁

davies Sep 22, 2022
Maintainer

Yes, please give it a try.
