[fs] basic sync tool #14248

Open
wants to merge 66 commits into
base: main

Commits on Feb 27, 2024

  1. [fs] basic sync tool

    CHANGELOG: Introduce `hailctl fs sync` which robustly transfers one or more files between Amazon S3, Azure Blob Storage, and Google Cloud Storage.
    
    There are really two distinct conceptual changes remaining here. Given my waning time available, I
    am not going to split them into two pull requests. The changes are:
    
    1. `basename` always agrees with the [`basename` UNIX
    utility](https://en.wikipedia.org/wiki/Basename). In particular, the folder `/foo/bar/baz/`'s
    basename is *not* `''`; it is `'baz'`. The only folders or objects whose basename is `''` are objects
    whose name literally ends in a slash, e.g. an *object* named `gs://foo/bar/baz/`.
    
    2. `hailctl fs sync`, a robust copying tool with a user-friendly CLI.
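
The `basename` rule in item 1 can be sketched as follows. This is a minimal illustration, not the code in this pull request: the function name `url_basename` and the `is_dir` flag are hypothetical, introduced only to show how a folder and an object with a trailing slash are treated differently.

```python
def url_basename(url: str, is_dir: bool) -> str:
    """Return the last path component of `url`.

    Hypothetical helper: `is_dir` says whether `url` names a folder.
    Folders ignore trailing slashes, as the UNIX `basename` utility does.
    Objects keep them, so an object whose name ends in '/' has basename ''.
    """
    if is_dir:
        url = url.rstrip('/')
    return url.rsplit('/', 1)[-1]


print(url_basename('/foo/bar/baz/', is_dir=True))       # baz
print(url_basename('gs://foo/bar/baz/', is_dir=False))  # (empty string)
```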
    
    `hailctl fs sync` comprises two pieces: `plan.py` and `sync.py`. The latter, `sync.py`, is simple: it
    delegates to our existing copy infrastructure. That copy infrastructure has been lightly modified to
    support this use case. The former, `plan.py`, is a concurrent file system `diff`.
    
    `plan.py` generates and `sync.py` consumes a "plan folder" containing these files:
    
    1. `matches` files whose names and sizes match. Two columns: source URL, destination URL.
    
    2. `differs` files or folders whose names match but either differ in size or differ in type. Four
       columns: source URL, destination URL, source state, destination state. The states are either:
       `file`, `dir`, or a size. If either state is a size, both states are sizes.
    
    3. `srconly` files only present in the source. One column: source URL.
    
    4. `dstonly` files only present in the destination. One column: destination URL.
    
    5. `plan` a proposed set of object-to-object copies. Two columns: source URL, destination URL.
    
    6. `summary` a one-line file containing the total number of copies in plan and the total number of
       bytes which would be copied.
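
The classification behind the first four files can be sketched roughly as below. This is an assumption-laden illustration rather than the contents of `plan.py`: the `Listing` type, the `classify` function, and the use of relative names instead of full URLs are all hypothetical, and the sketch is sequential where the real tool is concurrent.

```python
from typing import Dict, Optional

# Hypothetical listing type: relative name -> size in bytes, or None for a folder.
Listing = Dict[str, Optional[int]]


def state(size: Optional[int], other: Optional[int]) -> str:
    # If both sides are files, report sizes; otherwise report the entry type.
    if size is not None and other is not None:
        return str(size)
    return 'dir' if size is None else 'file'


def classify(src: Listing, dst: Listing) -> Dict[str, list]:
    out: Dict[str, list] = {'matches': [], 'differs': [], 'srconly': [], 'dstonly': []}
    for name in sorted(src.keys() | dst.keys()):
        if name not in dst:
            out['srconly'].append(name)
        elif name not in src:
            out['dstonly'].append(name)
        else:
            s, d = src[name], dst[name]
            if s is None and d is None:
                continue  # both folders: a real tool would compare their contents
            if s == d:
                out['matches'].append(name)
            else:
                out['differs'].append((name, state(s, d), state(d, s)))
    return out


src = {'a': 1, 'b': 2, 'c': None, 'd': 5}
dst = {'a': 1, 'b': 3, 'c': 7, 'e': 9}
print(classify(src, dst))
```

Note that a `differs` row carries either two sizes or two type names, which mirrors the rule that if either state is a size, both are.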
    
    As described in the CLI documentation, the intended use of these commands is:
    
    ```
    hailctl fs sync --make-plan plan1 --copy-to gs://gcs-bucket/a s3://s3-bucket/b
    hailctl fs sync --use-plan plan1
    ```
    
    The first command generates a plan folder and the second command executes the plan. Separating this
    process into two commands allows the user to verify exactly what will be copied, including the exact
    destination URLs. If `hailctl fs sync --use-plan` fails, the user can re-run `hailctl fs
    sync --make-plan` to generate a new plan which avoids copying files that were already copied
    successfully. After a successful run, the user can also re-run `hailctl fs sync --make-plan` to
    verify that every file was indeed copied.
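
One way to script the verification step is to count the entries left in a freshly generated plan folder. The sketch below assumes the `plan` file holds one copy per non-empty line and that a fully synced pair yields an empty `plan` file; neither detail is confirmed by this description, and `remaining_copies` is a hypothetical helper.

```python
from pathlib import Path


def remaining_copies(plan_dir: str) -> int:
    """Count the non-empty lines of the `plan` file in `plan_dir`.

    Assumption: each non-empty line is one pending object-to-object copy.
    """
    plan_file = Path(plan_dir) / 'plan'
    with plan_file.open() as f:
        return sum(1 for line in f if line.strip())
```

A result of zero after re-running `hailctl fs sync --make-plan` would then suggest there is nothing left to copy.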
    
    Testing. This change has a few sync-specific tests but largely reuses the tests for `hailtop.aiotools.copy`.
    
    Future Work. Propagating a consistent kind of hash across all clouds and using that for detecting
    differences is a better solution than the file-size based difference used here. If all the clouds
    always provided the same type of hash value, this would be trivial to add. Alas, at time of writing,
    S3 and Google both support CRC32C for every blob (though, in S3, you must explicitly request it at
    object creation time), but *Azure Blob Storage does not*. ABS only supports MD5 sums which Google
    does not support for multi-part uploads.
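
To illustrate why a hash would be a stronger signal than size, the sketch below compares two payloads both ways. It uses MD5 only because it ships in the Python standard library (CRC32C does not); the function names are hypothetical and this is not how the tool compares files.

```python
import hashlib


def same_by_size(a: bytes, b: bytes) -> bool:
    # The size-based check described above: equal lengths count as a match.
    return len(a) == len(b)


def same_by_md5(a: bytes, b: bytes) -> bool:
    # A content-based check: equal digests count as a match.
    return hashlib.md5(a).digest() == hashlib.md5(b).digest()


old, new = b'chr1\t100\tA', b'chr1\t100\tG'
print(same_by_size(old, new))  # True: same length, so a size check sees no difference
print(same_by_md5(old, new))   # False: the contents differ
```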
    Dan King committed Feb 27, 2024
    8621664
  2. more docs

    Dan King committed Feb 27, 2024
    86b5a4b
  3. fix bad help string

    Dan King committed Feb 27, 2024
    8fbc0f2
  4. use recursive=True for rapid listing

    Dan King committed Feb 27, 2024
    861ca13
  5. fix

    Dan King committed Feb 27, 2024
    7660033
  6. no prints

    Dan King committed Feb 27, 2024
    f956f0a
  7. allow isdir without trailing slash

    Dan King committed Feb 27, 2024
    5b0c8a9
  8. simplify sync.py dramatically

    Dan King committed Feb 27, 2024
    46b1b34
  9. update listeners before return

    Dan King committed Feb 27, 2024
    55c83e0
  10. use uvloop

    Dan King committed Feb 27, 2024
    eac543c
  11. uvloopx

    Dan King committed Feb 27, 2024
    2bc9d1d
  12. import fix

    Dan King committed Feb 27, 2024
    9588f91
  13. maybe get InsertObjectStream right

    Dan King committed Feb 27, 2024
    b905f30
  14. await _cleanup_future too

    Dan King committed Feb 27, 2024
    c4162cb
  15. use async with instead of async with await

    Dan King committed Feb 27, 2024
    c8388b1
  16. smaller part size maybe helps?

    Dan King committed Feb 27, 2024
    527488f
  17. prints

    Dan King committed Feb 27, 2024
    c8891ec
  18. 0e6ae95
  19. files are 1

    Dan King committed Feb 27, 2024
    04448ed
  20. prints

    Dan King committed Feb 27, 2024
    22ef264
  21. debugging

    Dan King committed Feb 27, 2024
    9a43627
  22. debugging

    Dan King committed Feb 27, 2024
    f5a7af2
  23. fix

    Dan King committed Feb 27, 2024
    e6fffd7
  24. revert prints

    Dan King committed Feb 27, 2024
    470884d
  25. fewre prints

    Dan King committed Feb 27, 2024
    01e0759
  26. debug

    Dan King committed Feb 27, 2024
    52fd7b6
  27. wtf

    Dan King committed Feb 27, 2024
    e5ce4d6
  28. debug

    Dan King committed Feb 27, 2024
    279dcfa
  29. debug

    Dan King committed Feb 27, 2024
    d1ef63c
  30. debug

    Dan King committed Feb 27, 2024
    41d9a9a
  31. debug

    Dan King committed Feb 27, 2024
    927ad35
  32. debug

    Dan King committed Feb 27, 2024
    8e0d44c
  33. debug

    Dan King committed Feb 27, 2024
    cab8b8f
  34. debug

    Dan King committed Feb 27, 2024
    4a848b4
  35. debug

    Dan King committed Feb 27, 2024
    f89a9e5
  36. debug

    Dan King committed Feb 27, 2024
    188fff7
  37. debug

    Dan King committed Feb 27, 2024
    9ef4b32
  38. debug

    Dan King committed Feb 27, 2024
    7086648
  39. debug

    Dan King committed Feb 27, 2024
    2c9d79d
  40. debug

    Dan King committed Feb 27, 2024
    6268348
  41. debug

    Dan King committed Feb 27, 2024
    edbf8cd
  42. debug

    Dan King committed Feb 27, 2024
    c9b07de
  43. debug

    Dan King committed Feb 27, 2024
    7aa4a5a
  44. debug

    Dan King committed Feb 27, 2024
    2a886aa
  45. debug

    Dan King committed Feb 27, 2024
    7fc7003
  46. async with await

    Dan King committed Feb 27, 2024
    e564b5d
  47. pyright fixes

    Dan King committed Feb 27, 2024
    40f96c5
  48. remove uvloopx changes

    Dan King committed Feb 27, 2024
    4c69ec6
  49. Revert "also front_end.py"

    This reverts commit 618e960.
    Dan King committed Feb 27, 2024
    e3425e1
  50. remove cruft

    Dan King committed Feb 27, 2024
    eeb861f
  51. remove debug cruft

    Dan King committed Feb 27, 2024
    ba77330
  52. fix Self improt

    Dan King committed Feb 27, 2024
    44c9364

Commits on Feb 28, 2024

  1. fix bad imports

    Dan King committed Feb 28, 2024
    f6eacea
  2. f502b8f
  3. 2bd85ec
  4. also front_end.py

    Dan King committed Feb 28, 2024
    4c05471
  5. revert unnecsesary changes to copy and copier

    Dan King committed Feb 28, 2024
    4bba249

Commits on Feb 29, 2024

  1. 7ae6107

Commits on Jun 25, 2024

  1. 38727c9
  2. fix

    chrisvittal committed Jun 25, 2024
    9c25bc0

Commits on Aug 5, 2024

  1. 1a33173

Commits on Aug 7, 2024

  1. lint fixes

    chrisvittal committed Aug 7, 2024
    44ef794

Commits on Aug 8, 2024

  1. test fixes

    chrisvittal committed Aug 8, 2024
    7087fc0

Commits on Aug 12, 2024

  1. fix?

    chrisvittal committed Aug 12, 2024
    5a32d23

Commits on Aug 13, 2024

  1. lint

    chrisvittal committed Aug 13, 2024
    c66533f

Commits on Sep 13, 2024

  1. a447e22