Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce new transaction model based on ephemeral branches #258

Merged
merged 21 commits into from
Feb 2, 2024

Conversation

nicholasjng
Copy link
Collaborator

@nicholasjng nicholasjng commented Jan 25, 2024

This is a breaking change, as it requires users to give arguments to the return object of LakeFSFileSystem.transaction.

Changes the LakeFSTransaction class to become a repository+branch-scoped ephemeral branch.

File uploads, removals, and commits happen on that ephemeral branch. After transaction exit, the created branch is optionally merged back into the base branch.

The base branch (from which we branch off in the transaction) must exist before start of the transaction.

Some transaction APIs (merge, revert, tag) can also work on branches other than the transaction's ephemeral branch - this behavior might not be well-defined, and is up for debate on removal.

Similarly, on puts and deletes, one currently has to specify the transaction branch directly via the LakeFSTransaction.branch property. This can be made nicer by allowing resource paths relative to the transaction's repo and branch in special functions like e.g. LakeFSTransaction.put() (deferring to fs.put()), LakeFSTransaction.rm() (deferring to fs.rm()), and so on.

This pull request turns the LakeFSTransaction into a callable, which needs to be called with a repository name at minimum to ensure transaction execution.

@nicholasjng nicholasjng self-assigned this Jan 25, 2024
@nicholasjng nicholasjng changed the title testest Introduce new transaction model based on ephemeral branches Jan 25, 2024
Copy link

codecov bot commented Jan 26, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (cc7d289) 94.08% compared to head (f6b0cca) 94.24%.

Files Patch % Lines
src/lakefs_spec/transaction.py 98.63% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
+ Coverage   94.08%   94.24%   +0.15%     
==========================================
  Files           5        5              
  Lines         406      382      -24     
  Branches       72       70       -2     
==========================================
- Hits          382      360      -22     
+ Misses         15       13       -2     
  Partials        9        9              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nicholasjng nicholasjng force-pushed the transaction-canary branch 2 times, most recently from 3033d40 to fcbe093 Compare January 26, 2024 10:39
@nicholasjng
Copy link
Collaborator Author

Stateful transaction support requested in fsspec/filesystem_spec#1517.

@nicholasjng nicholasjng force-pushed the transaction-canary branch 2 times, most recently from 54b261f to 9c26a65 Compare January 26, 2024 11:49
The first building block for transactions on ephemeral branches.
It now takes the repo and branch arguments from the transaction.
The new ephemeral branch transaction is entirely eager, meaning the
placeholder is not needed anymore.
Works the same as in the conftest.py temporary branch fixture.

Since the branch creation is deferred until context entry, it is
currently not possible to start a transaction using the fsspec file
system's `start_transaction` API.

Ephemeral branch is optionally merged back  and deleted again on __exit__.
Since the repository and ephemeral branch are now saved on the
transaction, they are no longer needed on the APIs. Commits through
`LakeFSTransaction.commit()` are automatically created on the ephemeral
branch.

The biggest change is for rev_parse, which does not need the complicated
commit fetching anymore, and can just take the ref expression directly.
Removes wrapt from the lockfiles, and bumps some Jupyter dependencies
that were reported to have security vulnerabilities.
This has the unfortunate side effect that the typing does not match that
of the `AbstractFileSystem` base class anymore.
Signified by the absence of the exc_info tuple in __exit__.

Also, set `base_branch` as a `lakefs.Branch`, since it needs to be the
left side of the diff in `__exit__` - apparently, that is not symmetric.
Requires repository, base branch everywhere.

Right now, we solve all file puts by interpolating the string in-place
with `tx.branch` instead of `temp_branch`. This is relatively ugly, and
could be better with transaction APIs that change state on the branch
(puts, rms).
However, those need a check that the input path can be
relative, since we just by default assume the user wants to put on the
ephemeral branch.
Otherwise we are prone to the same error as in the transaction exit.

This also has the side effect that we have to canonicalize `into` into
a lakefs.Branch, as it has to be the LHS of the `diff()` call.

Also fix expectations for the transaction merge test.
Repo and base branch are now entered into the transaction constructor.

Interpolated strings for rpaths are now filled with `tx.branch.id` instead
of the respective static values.
This loses the named branch in most cases, since the ephemeral branch
model means one is already created. This points to that it might be
useful to be able to give a name to the ephemeral branch instead of
rolling with the autogenerated one.
Previously, the new branch was a result of implicit branch creation in
`fs.put_file()` _within_ the context.

Now, since we do not allow the base branch to be created in situ, which
would be very confusing, we simply create it explicitly before the
transaction start.

We do this for both NEW_BRANCH (not sure what it is used for in the end)
and TRAINING_BRANCH.
Saves a couple of lines (and coverage misses) in the transaction module.

Also rename the tag name positional to `name` in `LakeFSTransaction.tag()`.
This seems like the more natural thing to do. Adjusts the tests to
expect the new behavior.

Also, since the merge logic only optionally creates a new commit on the
destination branch, we can unconditionally return the destination HEAD in
`LakeFSTransaction.merge()`.
1) by excluding `raise NotImplementedError` from coverage, and 2) by
initializing the transaction's repo by a string.

Also adds a test for the remaining partial covers by explicitly checking
if-branches for the non-default arguments of `automerge` and `delete`.
@AdrianoKF
Copy link
Contributor

AdrianoKF commented Jan 30, 2024

Another idea is to give default arguments to the file system constructor, but these are stickier / more opaque than initializer arguments to the transaction itself.
Passing arguments to the transaction has to be solved on the fsspec level, however, if there is interest at all by the maintainers.

In case there is no upstream solution, there still is the third option, which I explored in my prototype: turning LakeFSTransaction into a callable object, which allows passing (optional) configuration when using it as a context manager, but still being compatible with the way that fsspec instantiates the transaction class.

This way, users can debug transaction failures by looking at the left
over ephemeral branch.

Also move up the `fs._intrans` toggle to before branch deletion, since
that needs to be put cleanly on __exit__. Set `fs._intrans` to `True`
only after successful branch creation, since otherwise we end up with
the wrong state.
Makes the transaction backwards compatible with fsspec.

It is still undefined behavior to use the transaction without specifying
a repository at minimum.
Requires a rewrite of the transaction doc due to the changes.
This way, we do not get hard-to-interpret errors on uses of `None`
objects (which is the default for `LakeFSFTransaction.repository`).
We want a simple ValueError saying that the repo does not exist, not an
endless stack trace on lakeFS API exceptions.
@nicholasjng nicholasjng merged commit 1586b02 into main Feb 2, 2024
7 checks passed
@nicholasjng nicholasjng deleted the transaction-canary branch February 2, 2024 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Objects are uploaded on transaction failure Allow usage of ephemeral branches for transactions
3 participants