Repository API proposal #486
Comments
As someone who doesn't spend much time writing JS I'm intrigued by the more protocol-ey things here. Here are a few thoughts that occur to me
Yes. In the simplest case, this could use essentially the current sync protocol, with each message tagged with the docId it refers to. In the common case where most docs are unchanged since the last sync, this would involve the peers exchanging heads hashes for each document. For large collections of docs, sending a hash per doc could be a bit inefficient; if we want to optimise that case, we could aggregate heads hashes of different documents in a Merkle tree. Specifically, I think a Merkle search tree would work well here.
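A minimal sketch of the simple case described above: each message is tagged with the docId it refers to, and the peers compare per-document heads to decide which documents need a sync round at all. The names `TaggedSyncMessage`, `sameHeads`, and `docsNeedingSync` are hypothetical and not part of Automerge, and the payload is only a placeholder for an opaque per-document sync message.

```typescript
// Hypothetical types for illustration; none of these are Automerge APIs.
type DocId = string;
type Heads = string[]; // hashes of the most recent changes of one document

// A per-document sync message, tagged with the document it refers to.
interface TaggedSyncMessage {
  docId: DocId;
  payload: Uint8Array; // opaque bytes of the per-document sync message
}

// Compare two sets of heads, ignoring order.
function sameHeads(a: Heads, b: Heads): boolean {
  if (a.length !== b.length) return false;
  const sortedA = [...a].sort();
  const sortedB = [...b].sort();
  return sortedA.every((hash, i) => hash === sortedB[i]);
}

// Return the ids of documents whose heads differ, or that only one peer has.
function docsNeedingSync(
  local: Map<DocId, Heads>,
  remote: Map<DocId, Heads>,
): DocId[] {
  const ids = new Set<DocId>([...local.keys(), ...remote.keys()]);
  const result: DocId[] = [];
  ids.forEach((id) => {
    const l = local.get(id);
    const r = remote.get(id);
    if (l === undefined || r === undefined || !sameHeads(l, r)) result.push(id);
  });
  return result.sort();
}

const localHeads = new Map<DocId, Heads>([
  ["doc-a", ["h1"]],
  ["doc-b", ["h2", "h3"]],
  ["doc-c", ["h4"]],
]);
const remoteHeads = new Map<DocId, Heads>([
  ["doc-a", ["h1"]],
  ["doc-b", ["h3", "h2"]], // same heads, different order
  ["doc-c", ["h5"]],       // diverged
  ["doc-d", ["h6"]],       // unknown locally
]);

// Only the documents that differ need a sync message.
const pendingDocs = docsNeedingSync(localHeads, remoteHeads);
const outgoing: TaggedSyncMessage[] = pendingDocs.map((docId) => ({
  docId,
  payload: new Uint8Array(0), // placeholder payload
}));
```

This still exchanges one heads entry per document, which is the cost a Merkle tree over the heads hashes would be meant to reduce.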
Maybe… I think it's nice that if you currently have an Automerge file containing a bunch of changes, you can just load it and it will load correctly even if there are duplicates and changes out of causal order (so you can concatenate files without having to think about it too hard), so I think it still makes sense to have causal ordering facilities at the document level. But we could say that the queueing of changes that are not yet causally ready could happen at the repository level.
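One way to picture queueing at the repository level is a small buffer that holds changes until their dependencies have been applied. This is only a sketch under assumptions: `PendingChange` and `CausalQueue` are hypothetical names, and a change is reduced to a hash plus the hashes it depends on.

```typescript
// Hypothetical change shape: a hash plus the hashes it causally depends on.
interface PendingChange {
  hash: string;
  deps: string[];
}

class CausalQueue {
  private applied = new Set<string>();
  private waiting: PendingChange[] = [];

  // Accept a change in any order; returns the changes that became ready,
  // in an order that respects their dependencies.
  receive(change: PendingChange): PendingChange[] {
    // Duplicates are ignored, so concatenated inputs are harmless.
    const known =
      this.applied.has(change.hash) ||
      this.waiting.some((c) => c.hash === change.hash);
    if (known) return [];
    this.waiting.push(change);
    return this.drain();
  }

  // Repeatedly release any waiting change whose dependencies are all applied.
  private drain(): PendingChange[] {
    const ready: PendingChange[] = [];
    let progressed = true;
    while (progressed) {
      progressed = false;
      for (let i = 0; i < this.waiting.length; i++) {
        const candidate = this.waiting[i];
        if (candidate.deps.every((d) => this.applied.has(d))) {
          this.waiting.splice(i, 1);
          this.applied.add(candidate.hash);
          ready.push(candidate);
          progressed = true;
          break; // restart the scan, since other changes may now be ready
        }
      }
    }
    return ready;
  }

  get queuedCount(): number {
    return this.waiting.length;
  }
}

const causalQueue = new CausalQueue();
const afterC = causalQueue.receive({ hash: "c", deps: ["b"] }); // waits for b
const afterB = causalQueue.receive({ hash: "b", deps: ["a"] }); // waits for a
const afterA = causalQueue.receive({ hash: "a", deps: [] });    // releases a, b, c
const afterDuplicate = causalQueue.receive({ hash: "a", deps: [] }); // ignored
```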
That would be nice; we'd want to do it in some way that doesn't assume one particular scheme, but gives apps the freedom to extend the protocol with whatever scheme they need. Maybe this could be done with a kind of "middleware" API that can add arbitrary additional data to each protocol message, similarly to what some web frameworks do?
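As a rough sketch of the "middleware" idea, the pipeline below lets an app attach extra data to outgoing messages and inspect it on incoming ones. Everything here is an assumption made for illustration: `ProtocolMessage`, `Middleware`, `MessagePipeline`, and the toy shared-token scheme are hypothetical, and the token check is not meant as a real security mechanism.

```typescript
// Hypothetical protocol message: a sync payload plus app-defined extras.
interface ProtocolMessage {
  docId: string;
  payload: Uint8Array;
  extras: Record<string, unknown>;
}

// A middleware may annotate outgoing messages and inspect incoming ones.
interface Middleware {
  outgoing?(msg: ProtocolMessage): ProtocolMessage;
  incoming?(msg: ProtocolMessage): ProtocolMessage | null; // null rejects
}

class MessagePipeline {
  private middlewares: Middleware[] = [];

  use(middleware: Middleware): this {
    this.middlewares.push(middleware);
    return this;
  }

  // Run every outgoing hook in registration order.
  send(msg: ProtocolMessage): ProtocolMessage {
    return this.middlewares.reduce(
      (m, mw) => (mw.outgoing ? mw.outgoing(m) : m),
      msg,
    );
  }

  // Run every incoming hook; stop as soon as one rejects the message.
  receive(msg: ProtocolMessage): ProtocolMessage | null {
    let current: ProtocolMessage | null = msg;
    for (const mw of this.middlewares) {
      if (current === null) break;
      if (mw.incoming) current = mw.incoming(current);
    }
    return current;
  }
}

// Toy app-specific scheme: attach a shared token and check it on receipt.
const sharedToken = (token: string): Middleware => ({
  outgoing: (msg) => ({ ...msg, extras: { ...msg.extras, token } }),
  incoming: (msg) => (msg.extras["token"] === token ? msg : null),
});

const senderPipeline = new MessagePipeline().use(sharedToken("s3cret"));
const receiverPipeline = new MessagePipeline().use(sharedToken("s3cret"));
const strangerPipeline = new MessagePipeline().use(sharedToken("other"));

const wireMessage = senderPipeline.send({
  docId: "doc-1",
  payload: new Uint8Array([1, 2, 3]),
  extras: {},
});
const accepted = receiverPipeline.receive(wireMessage); // passes the check
const rejected = strangerPipeline.receive(wireMessage); // null
```

The core protocol never interprets `extras`, which is what keeps it independent of any one scheme.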
Not currently, but that seems like a reasonable extension to consider. |
Other ideas that have come up:
Something that I've been working on in my own project is migrations. That has been historically challenging and I wonder what opportunities thinking about this abstraction can provide us with to address this issue. Some of the most challenging aspects are not storage/sync related, but it's possible that migrations that cross document boundaries might want to be at the level of "repository". |
You're right, @scotttrinh, we're certainly going to have to deal with the problem at some point but I think it's quite important to maintain some good separation of concerns and not just have this class take on either all the features or all the scope of the various pieces we're missing. Our next research project is going to be Cambria-adjacent, I think, but we haven't nailed down the scope yet so I'm not sure where it will lead. I believe the first step is to get a simple multi-document class implemented which can connect to a single storage engine and to a single network. That should give us a good starting point to expand from. I want to be a bit cautious about pre-designing the class too much because my experience using the storage adapters in the dat ecosystem was that they were closely tied to a particular kind of storage engine and poorly suited to others. I plan to put together something rudimentary over the next few days (time allowing) for some initial feedback. We'll need it for our next project anyway. |
How would that play out with RDBMS that generate their IDs? Sure, it should be possible to add an extra column and index it, but that sounds a bit wasteful as unique IDs are guaranteed. |
In systems where you have a single authoritative DB server it might make sense to let that server generate the IDs, but in general, in a decentralised system it needs to be possible for clients to generate their own IDs without depending on a particular server for ID assignment. |
Different systems will definitely want to assign names differently -- for example, IPFS has content addresses and hypercores use a signing public key. In the sketch @HerbCaudill and I put together over the weekend we let you provide an ID-generator as an argument to the Repo API, but it may just be better to have users provide IDs at document creation time. (I think that's something we'll want to feel out.) |
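The two options mentioned in the comment above (an ID generator passed to the Repo, or IDs supplied at document creation time) can sit side by side. The sketch below is hypothetical: `MiniRepo`, its `generateId` option, and the example IDs are made up for illustration and are not the API from the weekend sketch.

```typescript
// Hypothetical sketch; `MiniRepo` and its options are not a real Repo API.
type IdGenerator = () => string;

interface MiniRepoOptions {
  generateId?: IdGenerator; // option 1: the repo is given an ID generator
}

class MiniRepo<T> {
  private docs = new Map<string, T>();
  private generateId: IdGenerator;
  private counter = 0;

  constructor(options: MiniRepoOptions = {}) {
    // Fallback for this sketch only; real IDs would need to be globally unique.
    this.generateId = options.generateId ?? (() => `local-${++this.counter}`);
  }

  // Option 2: the caller supplies the ID at creation time.
  create(initial: T, id?: string): string {
    const docId = id ?? this.generateId();
    if (this.docs.has(docId)) {
      throw new Error(`document ${docId} already exists`);
    }
    this.docs.set(docId, initial);
    return docId;
  }

  has(docId: string): boolean {
    return this.docs.has(docId);
  }
}

let keyCounter = 0;
const noteRepo = new MiniRepo<{ title: string }>({
  generateId: () => `key-${++keyCounter}`, // stand-in for e.g. a public key
});
const generatedId = noteRepo.create({ title: "a" });
const explicitId = noteRepo.create({ title: "b" }, "content-hash-example");
```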
Heya @ept! We met at HYTRADBOI and you pointed me towards this issue about the repository API. I work at https://github.com/athensresearch/athens, and formerly I was at https://roamresearch.com/. Athens works as an optimistically updated shared document, whose operations are partially CRDT-like. We talked about how Athens might use CRDTs instead of its custom data structures. I've been thinking a lot about this since HYTRADBOI, and especially this repository proposal. A lot of what's described here comprises the core complexity in our product and it would be great to move that out. But it also strikes me as interesting that the document sync in Athens is in fact a partial implementation of this repository API, albeit not over CRDTs.
Athens synchronizes append-only event logs. This is somewhat straightforward because of the append-only nature of them - you basically always want to go to the tip. Events are forwarded to clients to bring them up to date. But when clients start, they get a snapshot / materialized view of the document so they don't have to load all the events. CRDTs themselves function differently, and offer a richer change model than an event log. But for the purpose of storing, loading, and syncing changes for a single CRDT instance, I think it is fundamentally the same as an append-only event log:
In fact, I think that is what the matrix-crdt provider for Yjs does. The core difference being that matrix-crdt is an event log first, and only a CRDT second, meaning that all changes go to the event log and only then to the CRDT clients. I think this isn't great for communicating between CRDTs instances, since CRDTs provide a richer sync model than event logs. But it seems good for a repository model, where the goal is to store and restore a given document, and then leave cross-node sync for the document to do. |
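To make the event-log view above concrete, here is a small sketch of an append-only log with a snapshot, where a starting client loads the snapshot and replays only the events after it. `SnapshottedLog` and its methods are hypothetical names, and the string-concatenation reducer is just a stand-in for applying real events.

```typescript
// Hypothetical append-only log with snapshots; not an Athens or Automerge API.
interface Snapshot<S> {
  upTo: number; // number of events folded into `state`
  state: S;
}

class SnapshottedLog<S, E> {
  private entries: E[] = [];
  private latest: Snapshot<S>;
  private step: (state: S, event: E) => S;

  constructor(initial: S, step: (state: S, event: E) => S) {
    this.latest = { upTo: 0, state: initial };
    this.step = step;
  }

  // Events are only ever appended, never edited.
  append(event: E): void {
    this.entries.push(event);
  }

  // Fold all events so far into a new snapshot (a materialized view).
  takeSnapshot(): Snapshot<S> {
    this.latest = { upTo: this.entries.length, state: this.load().state };
    return this.latest;
  }

  // What a starting client needs: the snapshot plus the events after it.
  load(): { state: S; replayed: number } {
    const tail = this.entries.slice(this.latest.upTo);
    const state = tail.reduce<S>((s, e) => this.step(s, e), this.latest.state);
    return { state, replayed: tail.length };
  }
}

const textLog = new SnapshottedLog<string, string>("", (s, e) => s + e);
textLog.append("a");
textLog.append("b");
textLog.takeSnapshot(); // snapshot now covers the first two events
textLog.append("c");
const loadedText = textLog.load(); // replays only the event after the snapshot
```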
@filipesilva Yes, Automerge changes essentially form an event log. However, there are some important optimisations, because we allow every single keystroke to be a separate event (for real-time collaboration), which means that you can accumulate hundreds of thousands of events over the history of a single document. Storing each event individually would mean the history quickly grows into the megabytes. Automerge puts a lot of effort into compressing that history so that it can be stored and transmitted over the network efficiently. When you do It would be possible to use an append-only storage and networking model, but it wouldn't be able to take advantage of this compression. With this repository API we're trying to set things up so that apps can easily take advantage of Automerge's compression. |
Okay, just a few notes from my ongoing work on this. The first big change from how this is proposed is the relationship between the network and the repository and the repository and the sync engine. In my initial prototype these were all coupled together as described above. Unfortunately, when assessing how to integrate such an object into existing applications (or proposed ones we discussed) it became clear that having the network and synchronizer embedded entirely inside the Repo makes it tricky to extend the system or make varying decisions about how these systems should interact. My new prototype decouples these systems. The Repo is now a relatively small object that allows listeners to be notified when documents are created or loaded and returns handles to the documents it tracks so that interested parties can track their changes. In addition to the Repo, there are also Networking, Synchronization, and Storage subsystems that can interact with the repo in different ways depending on how you want to put your application together. (I will include one or two packagings of these ideas to make them easy to consume.) Finding these APIs and picking a comfortable idiomatic JS style is an ongoing process and I'm not entirely happy with where I'm at right now, but if you're interested in following along I'm occasionally pushing my work-in-progress implementation here. Feedback is welcome but I have a pretty clear vision of what needs doing at the moment. |
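The decoupled shape described in the comment above could look roughly like the sketch below: a small repo that only notifies listeners and hands out document handles, with storage and sync attached from the outside. All names (`SmallRepo`, `DocHandle`, `attachStorage`, `attachSyncLog`) are hypothetical and are not taken from the work-in-progress implementation.

```typescript
// Hypothetical sketch of the decoupled design; not the prototype's API.
type Listener<T> = (value: T) => void;

class DocHandle<T> {
  readonly docId: string;
  private value: T;
  private listeners: Listener<T>[] = [];

  constructor(docId: string, value: T) {
    this.docId = docId;
    this.value = value;
  }

  get(): T {
    return this.value;
  }

  onChange(listener: Listener<T>): void {
    this.listeners.push(listener);
  }

  update(fn: (current: T) => T): void {
    this.value = fn(this.value);
    this.listeners.forEach((l) => l(this.value));
  }
}

class SmallRepo<T> {
  private handles = new Map<string, DocHandle<T>>();
  private docListeners: Listener<DocHandle<T>>[] = [];
  private nextId = 1;

  // Subsystems subscribe here instead of being embedded in the repo.
  onDocument(listener: Listener<DocHandle<T>>): void {
    this.docListeners.push(listener);
  }

  create(initial: T): DocHandle<T> {
    const handle = new DocHandle<T>(`doc-${this.nextId++}`, initial);
    this.handles.set(handle.docId, handle);
    this.docListeners.forEach((l) => l(handle));
    return handle;
  }

  find(docId: string): DocHandle<T> | undefined {
    return this.handles.get(docId);
  }
}

// A storage subsystem outside the repo: persists every document it hears about.
function attachStorage<T>(repo: SmallRepo<T>, disk: Map<string, T>): void {
  repo.onDocument((handle) => {
    disk.set(handle.docId, handle.get());
    handle.onChange((value) => disk.set(handle.docId, value));
  });
}

// A stand-in for a network/sync subsystem: records which documents changed.
function attachSyncLog<T>(repo: SmallRepo<T>, outbox: string[]): void {
  repo.onDocument((handle) => handle.onChange(() => outbox.push(handle.docId)));
}

const repoDisk = new Map<string, { count: number }>();
const repoOutbox: string[] = [];
const smallRepo = new SmallRepo<{ count: number }>();
attachStorage(smallRepo, repoDisk);
attachSyncLog(smallRepo, repoOutbox);

const counterHandle = smallRepo.create({ count: 0 });
counterHandle.update((doc) => ({ count: doc.count + 1 }));
```

Because the repo knows nothing about storage or networking, either subsystem can be swapped or left out without touching it.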
This is an interesting proposal. We’ve been working on an application that uses automerge as one of the core data structures, so I thought some perspective from our experience might be helpful. We implemented some of the same functionality for our app. For example, we abuse Redux as a repository and it manages applying changes to in-memory automerge objects, somewhat like the repository.change() proposal. I don't yet have specific recommendations, but I wanted to bring up a few more details that I think may help inform this proposal. Even if some of these details can be abstracted away by the API, it may be useful for backend implementers to take note. To start with a concrete scenario: suppose the frontend boots up by asking the backend “show me 50 of my most recent items”. How does that query, or the results of that query, interact with this repository API? For context, our application is a record management tool. Think of it as sitting between a spreadsheet and a complicated case management system. (I’ll try to remember to link to it here when we do our soft release in a few weeks). We want to support realtime collaboration as well as offline use, but the more common usage is expected to be online, in a browser, with the server helping to provide query/search functionality as well as access controls like most webapps (the server is like a big trusted client, so all those features are designed to work locally offline as well, though that’s not fully implemented yet). Here are some thoughts I have given our experience.
I can go into more detail about data structures and whatnot that we settled on, but this is already long enough. Hopefully the above summary is helpful. |
Thanks for the comments, Rob. I think I agree with most of this though I want to be cautious not to set the expectation that we'll Solve All The Things with this one patch. A lot of this perspective is pretty high-level, and I'd be curious to hear in a more holistic sense what your current biggest experienced pain points are. |
Hello, what is the status of this? I saw this repository https://github.com/pvh/automerge-repo and it's fairly active. How usable is it and would it stay as a separate package to automerge or is the idea to merge it in as a part of automerge? |
The repo will be a separate repo but will likely move to the Automerge
namespace. Initial adoption is welcome, but expect bugs and breaking
changes as the system settles.
At this point I suspect what's there is significantly easier to work with
than implementing things yourself. The documentation is pretty thin, but
the automerge-repo-react-demo should be pretty easy to follow.
No npm packages yet but soon.
Feel free to hit me up on Slack for a chat if you have questions. I'm
going to be on Central US time (GMT-5) for a week or so.
P
|
At the moment Automerge only provides an API for an in-memory data structure, and leaves all I/O (persistence on disk and network communication) as an "exercise for the reader". The sync protocol attempts to provide an API for network sync between two nodes (without assuming any particular transport protocol), but experience has shown that users find the sync protocol difficult to understand, and easy to misuse (e.g. knowing when to reset the sync state is quite subtle; sync with more than one peer is also error-prone).
I would like to propose a new API concept for Automerge, which we might call "repository" (or "database"?). It should have the following properties:
Some thoughts on what the repository API might look like:
- Create a repository object and register the storage library you want to use. When any sort of network link is established, you also register it with the repository; when it disconnects, it automatically unregisters itself from the repository. The repository object is typically a singleton that exists for the lifetime of the app process.
- Instead of calling Automerge.init(), you would call repository.create(). The new document would automatically be given a unique docId.
- Instead of calling Automerge.load() and passing in a byte array, you would call await repository.load(docId), which loads the document with the given docId from the registered storage library.
- Instead of calling Automerge.change(doc, callback) to make a change, call repository.change(docId, callback), which automatically writes the new change to persistent storage and sends it via any network links that are registered on the repository. The callback can be identical to the current Automerge API.
- To access a document, call repository.get(docId).
- Instead of calling Automerge.applyChanges() to update the document, the repository automatically receives incoming changes via its registered network links. The application should register an observer, e.g. using repository.observeChanges(callback), to re-render the UI whenever a document changes.
- The getChanges/applyChanges API would remain for potential advanced use cases that are not satisfied by the repository API, but the expectation would be that most app developers use the repository API.

Need to do some further thinking on what the APIs for storage and networking interfaces should look like.
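To give a feel for how the calls listed above might fit together, here is a rough sketch. Only the method names `create`, `load`, `change`, `get`, and `observeChanges` come from the proposal; the `StorageAdapter` and `NetworkLink` interfaces, `registerLink`, the plain-object documents, and the JSON encoding are assumptions made for illustration. For simplicity the sketch writes and sends the whole document on every change, whereas the proposal talks about writing and sending the new change.

```typescript
// Sketch only: adapter shapes and JSON documents are assumptions.
type DocId = string;
type Doc = Record<string, unknown>;

interface StorageAdapter {
  save(docId: DocId, bytes: string): void;
  load(docId: DocId): Promise<string | undefined>;
}

interface NetworkLink {
  send(docId: DocId, bytes: string): void;
}

class Repository {
  private storage: StorageAdapter;
  private docs = new Map<DocId, Doc>();
  private links = new Set<NetworkLink>();
  private observers: ((docId: DocId, doc: Doc) => void)[] = [];
  private nextId = 1;

  constructor(storage: StorageAdapter) {
    this.storage = storage;
  }

  // Returns a function the link can call to unregister itself on disconnect.
  registerLink(link: NetworkLink): () => void {
    this.links.add(link);
    return () => {
      this.links.delete(link);
    };
  }

  create(): DocId {
    const docId = `doc-${this.nextId++}`; // placeholder; not globally unique
    this.docs.set(docId, {});
    return docId;
  }

  async load(docId: DocId): Promise<Doc | undefined> {
    const bytes = await this.storage.load(docId);
    if (bytes === undefined) return undefined;
    const doc = JSON.parse(bytes) as Doc;
    this.docs.set(docId, doc);
    return doc;
  }

  get(docId: DocId): Doc | undefined {
    return this.docs.get(docId);
  }

  // Apply the callback to a copy, then persist, send, and notify observers.
  change(docId: DocId, callback: (draft: Doc) => void): void {
    const current = this.docs.get(docId);
    if (current === undefined) throw new Error(`unknown document ${docId}`);
    const draft: Doc = JSON.parse(JSON.stringify(current));
    callback(draft);
    this.docs.set(docId, draft);
    const bytes = JSON.stringify(draft);
    this.storage.save(docId, bytes);
    this.links.forEach((link) => link.send(docId, bytes));
    this.observers.forEach((observe) => observe(docId, draft));
  }

  observeChanges(callback: (docId: DocId, doc: Doc) => void): void {
    this.observers.push(callback);
  }
}

// Usage with an in-memory storage adapter and a fake network link.
const memoryBytes = new Map<DocId, string>();
const memoryStorage: StorageAdapter = {
  save: (id, bytes) => {
    memoryBytes.set(id, bytes);
  },
  load: async (id) => memoryBytes.get(id),
};
const sentDocIds: DocId[] = [];
const observedDocIds: DocId[] = [];

const repository = new Repository(memoryStorage);
const disconnect = repository.registerLink({
  send: (id) => {
    sentDocIds.push(id);
  },
});
repository.observeChanges((id) => {
  observedDocIds.push(id);
});

const newDocId = repository.create();
repository.change(newDocId, (doc) => {
  doc["title"] = "Hello";
});
disconnect(); // later changes are stored and observed, but no longer sent
repository.change(newDocId, (doc) => {
  doc["done"] = true;
});
```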
One inspiration for this work is @localfirst/state, but I envisage the proposed repository API having a deeper integration between storage and sync protocol than is possible with the current Automerge API, in order to make sync of large document sets as efficient as possible.
Feedback on this high-level outline very welcome. If it seems broadly sensible, we can start designing the APIs in more detail.