[fs] basic sync tool #13834

Closed
wants to merge 1 commit into from

Commits on Feb 5, 2024

  1. [fs] basic sync tool

    CHANGELOG: Introduce `hailctl fs sync` which robustly transfers one or more files between Amazon S3, Azure Blob Storage, and Google Cloud Storage.
    
    There are really two distinct conceptual changes remaining here. Given my waning time available, I
    am not going to split them into two pull requests. The changes are:
    
    1. `basename` always agrees with the [`basename` UNIX
    utility](https://en.wikipedia.org/wiki/Basename). In particular, the basename of the folder
    `/foo/bar/baz/` is *not* `''`; it is `'baz'`. The only folders or objects whose basename is `''` are
    objects whose name literally ends in a slash, e.g. an *object* named `gs://foo/bar/baz/`.
    
    2. `hailctl fs sync`, a robust copying tool with a user-friendly CLI.
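
As an illustration of item 1, here is a minimal sketch of the UNIX-utility convention next to Python's `posixpath.basename`. The helper name `unix_basename` is hypothetical and this is not the implementation in this change. It also does not model the exception for objects whose name literally ends in a slash, since that requires knowing that the entry is an object rather than a folder.

```python
import posixpath


def unix_basename(path: str) -> str:
    """Return the last path component, ignoring trailing slashes (hypothetical helper)."""
    stripped = path.rstrip('/')
    if not stripped:
        # The path was empty or consisted only of slashes.
        return '/' if path else ''
    return stripped.rsplit('/', 1)[-1]


# posixpath.basename returns '' when the path ends in a slash.
print(repr(posixpath.basename('/foo/bar/baz/')))
# The UNIX-utility convention yields the folder's own name instead.
print(repr(unix_basename('/foo/bar/baz/')))
```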
    
    `hailctl fs sync` comprises two pieces: `plan.py` and `sync.py`. The latter, `sync.py`, is simple: it
    delegates to our existing copy infrastructure. That copy infrastructure has been lightly modified to
    support this use case. The former, `plan.py`, is a concurrent file system `diff`.
    
    `plan.py` generates and `sync.py` consumes a "plan folder" containing these files:
    
    1. `matches` files whose names and sizes match. Two columns: source URL, destination URL.
    
    2. `differs` files or folders whose names match but either differ in size or differ in type. Four
       columns: source URL, destination URL, source state, destination state. Each state is either
       `file`, `dir`, or a size. If either state is a size, both states are sizes.
    
    3. `srconly` files only present in the source. One column: source URL.
    
    4. `dstonly` files only present in the destination. One column: destination URL.
    
    5. `plan` a proposed set of object-to-object copies. Two columns: source URL, destination URL.
    
    6. `summary` a one-line file containing the total number of copies in the plan and the total number
       of bytes which would be copied.
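
The first four categories above can be sketched as a comparison of two listings. This is a simplified, single-threaded illustration, not the code in `plan.py`: the function `diff_listings`, the listing encoding (a relative name mapped to a size, or to `None` for a folder), the `dir` label for folders, and the choice to skip names that are folders on both sides are all assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Assumption for this sketch: a listing maps a relative name to its size in
# bytes, or to None when the entry is a folder.
Listing = Dict[str, Optional[int]]


def kind(size: Optional[int]) -> str:
    """Label an entry by type: 'dir' for a folder, 'file' otherwise."""
    return 'dir' if size is None else 'file'


def diff_listings(
    src_base: str, dst_base: str, src: Listing, dst: Listing
) -> Dict[str, List[Tuple[str, ...]]]:
    """Classify every name found on either side into one of four categories."""
    out: Dict[str, List[Tuple[str, ...]]] = {
        'matches': [], 'differs': [], 'srconly': [], 'dstonly': []
    }
    for name in sorted(src.keys() | dst.keys()):
        src_url = f'{src_base}/{name}'
        dst_url = f'{dst_base}/{name}'
        if name not in dst:
            out['srconly'].append((src_url,))
        elif name not in src:
            out['dstonly'].append((dst_url,))
        else:
            s, d = src[name], dst[name]
            if s is None and d is None:
                # Folders on both sides: skipped here (a choice of this sketch).
                continue
            if s is not None and d is not None:
                if s == d:
                    out['matches'].append((src_url, dst_url))
                else:
                    # Both are files, so both states are reported as sizes.
                    out['differs'].append((src_url, dst_url, str(s), str(d)))
            else:
                # One side is a file and the other a folder: report the types.
                out['differs'].append((src_url, dst_url, kind(s), kind(d)))
    return out


example = diff_listings(
    'gs://gcs-bucket/a',
    's3://s3-bucket/b',
    {'a.txt': 10, 'b.txt': 20, 'c': None, 'd.txt': 5},
    {'a.txt': 10, 'b.txt': 25, 'c': 7, 'e.txt': 1},
)
for category, rows in example.items():
    print(category, rows)
```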
    
    As described in the CLI documentation, the intended use of these commands is:
    
    ```
    hailctl fs sync --make-plan plan1 --copy gs://gcs-bucket/a s3://s3-bucket/b
    hailctl fs sync --use-plan plan1
    ```
    
    The first command generates a plan folder and the second command executes the plan. Separating this
    process into two commands allows the user to verify exactly what will be copied, including the exact
    destination URLs. Moreover, if `hailctl fs sync --use-plan` fails, the user can re-run `hailctl fs
    sync --make-plan` to generate a new plan which will avoid copying already successfully copied files.
    Likewise, after a successful run, the user can re-run `hailctl fs sync --make-plan` to verify that
    every file was indeed successfully copied.
    
    Testing. This change has a few sync-specific tests but largely reuses the tests for `hailtop.aiotools.copy`.
    
    Future Work. Propagating a consistent kind of hash across all clouds and using that for detecting
    differences is a better solution than the file-size-based difference used here. If all the clouds
    always provided the same type of hash value, this would be trivial to add. Alas, at the time of
    writing, S3 and Google both support CRC32C for every blob (though, in S3, you must explicitly request
    it at object creation time), but *Azure Blob Storage does not*. ABS only supports MD5 sums, which
    Google does not support for multi-part uploads.
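
To illustrate why a content hash would be stronger than a size comparison, the following sketch computes CRC32C in pure Python and compares two byte strings of equal length. It is illustrative only and not part of this change: a real tool would presumably read the checksum each cloud reports rather than recompute it, and the names `crc32c` and `_TABLE` are this sketch's own.

```python
def _make_table() -> list:
    """Build the byte-wise lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Compute the CRC32C checksum of data (hypothetical helper for this sketch)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


src_blob = b'hello world'
dst_blob = b'hello w0rld'

# A size-based comparison treats these two blobs as matching.
print(len(src_blob) == len(dst_blob))
# A checksum comparison tells them apart.
print(crc32c(src_blob) == crc32c(dst_blob))
```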
    Dan King committed Feb 5, 2024
    945d6d4