-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathconclusion.tex
37 lines (20 loc) · 6.94 KB
/
conclusion.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
\cleardoublepage
\section{Future Work}
\paragraph{Safe concurrent editing.}
A distributed system expects connection loss and simultaneous conflicting edits. Especially when distributing state not just within the data center, but towards clients which thereby become (limited) database peers and are free to perform queries and writes at any time, the question arises of how such distribution can be achieved safely over (very) unreliable networks with partitions lasting for e.g. days. While theorems such as CAP and PACELC fundamentally stake out the playing field, common current data layers preempt the designer's choices and appear to make austere and conservative decisions in favor of consistency requiring coordination where not strictly needed, thus unnecessarily limiting availability.
\gls{CRDT} and \gls{OT} are concepts which appear to provide various composable consistency primitives for robust replication \cite{weilbach2015replikativ, weilbach2016decoupling}, generally trading space efficiency for the ability to perform arbitrary \emph{merges} via an idempotent, associative, and commutative function. The result is that partitioned nodes can safely to continue to perform changes while guaranteeing that the resulting states can be reconciled later. A variety of commonly used data types have been implemented in a conflict-free variant. Since some of them may grow extremely quickly compared to their unsafe twins, garbage collection and epochal compaction of such data types remain areas of active research.
In a system like the one presented in this work, it should at least be possible to define a schema that specifies a distribution tradeoff for each attribute as dictated by the \gls{CAP} theorem. These constraints must propagate to the clients and elicit the data layer to restrict the possible operations on the data item in question given the current network conditions \cite{emerick2014api}.
Going forward, the same schema could also select one of many built-in conflict resolution strategies (built on CRDTs or similar) specific to the domain requirements of each attribute. This is the approach taken by the Braid \cite{braid19} protocol, an in-progress draft of a proposed \gls{IETF} standard to add history, editing, and subscription semantics to HTTP resources. It aims to standardize the representation and synchronization of arbitrary web application state. Braid can allow simultaneous editing of the same resource by different clients and servers and can guarantee a consistent resulting state via selectable \emph{merge types} implementing various CRDTs and OTs.
Relaxing the design goal of full auditability of every change via bitemporal transactions, the described system could provide a more useful and vastly more efficient notion of history concerning simultaneous editing of full-text values (e.g. medical notes) when represented as a mutably updatable CRDT value within a fact.
Lastly, the entire immutable log described in this work could be constructed as a conflict-free data type. Clients could converge values of the database with no coordination, because then only the latest commits would need to be transmitted between clients, forgoing the need of a server. However, the publication/subscription mechanism would have to be redesigned completely, and a decentralized immutable database implemented as a CRDT would require tackling clock synchronization if the resulting system were to have a notion of bitemporality.
\cleardoublepage
\paragraph{(Temporal) logic constraints.} An ideal programming environment would let the programmer "use logic to express what is true, use logic to check whether something is true, [and] use logic to find out what is true \cite{sicp}". Research by \cite{alvaro2010dedalus,alvaro2011consistency} explores temporal extensions to Datalog and a domain-specific language for describing time-dependent behavior of distributed systems.
\paragraph{Full stack laziness.} A fully lazy distributed data structure would allow transparent access and local caching of all facts for which the client passes access rules set up by the server. Such a design would also allow transparent querying of past facts, possibly aided by hints from the programmer as to where (on client or server) the query should be executed.
\paragraph{Incremental maintenance.} Efficient execution of Datalog programs installed as "live" queries entails incremental updating of the result set as the source data changes.
Research in the direction of incremental view maintenance in such systems includes timely dataflow \cite{murray2013naiad}, differential dataflow \cite{mcsherry2013differential}, and 3DF providing an implementation of reactive Datalog for Datomic \cite{gobel2019optimising}.
\paragraph{Data as code.} As the presented system's flexibility allows storing, versioning, replicating and querying arbitrary data, including functions, the question arises of how an entire application can be constructed with all its code existing as facts inside the database, replicating to the clients -- thus closing the circle back to MUMPS-like systems.
\cleardoublepage
\section{Conclusion}
Easy things are easy, hard things are hard. This thesis set out to redesign and implement the entire data layer using a combination of non-mainstream ideas. From data representation in EAV facts, to the database being an immutable value to be passed around and queried locally, to only accreting facts via assertion and retraction, requiring no schema yet keeping ACID guarantees, to having functions stored as values which derive new facts or act as constraints on the database value, to a replication mechanism where clients subscribe to the live-updating result set of a query, to pulling out simple values directly from the database value by reaching into its index structures, or asking complex questions with a pattern matching query language.
The resulting implementation in less than 400 lines is impressively tight yet appears to map nicely to the initial design, mostly thanks to the incredibly expressive Clojure language and its built-in immutable data structures. While all of the mentioned features are implemented to a degree that is just enough to experiment with, almost all of the more complicated aspects were simply left out of the design: There is no expectation of performance, efficiency, scale, security, safety, testing, or any implied suitability for usage in the real world or with more than just a handful of sample facts. The proof-of-concept fixates on the easy parts of that utopian data layer design which were almost trivial to implement, and barely covers any truly complex minutiae. In particular, the implementation of the query language turned out to be harder than anticipated, despite cutting out almost all but the most basic pattern matching features of a real Datalog.
Yet the formidable degree to which the presented ideas appear to mesh together, supported by considerable amount of related work, gives a sound impression of the general direction of this and similar designs for better data layers.