-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathrelated_work.tex
72 lines (45 loc) · 13.9 KB
/
related_work.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
\cleardoublepage
\section{Related Work}\label{sec:related_work}
\paragraph{\gls{RAD}.} Scaling developer productivity as the team size goes $n \to 1$ on data-intensive near real time collaborative line-of-business applications goes back to the \gls{RAD} movement of the 1990s sparked by tools such as the Delphi suite \cite{mackay2000reconfiguring}, PowerBuilder \cite{zubeck1997implementing}, Lotus Notes \cite{zubeck1997implementing}, and FileMaker
\cite{chen2010developing}.
\paragraph{MUMPS.} On the aspects of near real time state synchronization combined with RAD tooling, a notable mention is the \gls{MUMPS} \cite{bowie1979methods} dating back to the year 1966. An extremely terse programming language combined with a runtime system featuring built-in persistence mechanisms. Its "global database" is a subscripted array which replicates and persists \emph{code and data} across machines in near real time. Major hospitals and financial institutions continue to run their daily operational workloads on various evolutions of MUMPS systems today \cite{aller2018evolution}.
\paragraph{Firebase et al.} Firebase and Parse are hosted services which brought real time state synchronization to web and mobile applications starting in 2011. Firebase is a proprietary managed service operated by Google. Its data model is based on a global mutable structure in \gls{JSON} and allows only rather simplistic filtering and navigational queries \cite{wingerath2019real}. Parse was a similar real time Backend-as-a-Service based on MongoDB and Redis which was active from 2011 to 2017. An open source client-side database which allows subscribing to the result set of a query is RxDB for JavaScript replicating PouchDB, CouchDB, or GraphQL endpoints \cite{wingerath2019real}. The data model of these services is generally oriented around mutable documents in free-form nested \gls{JSON}, although there are means to validate records via a schema.
\paragraph{Meteor.} This thesis is an evolution based on the author's experience running a \gls{SaaS} business built on the Meteor \cite{schmidt2014live} framework and the \gls{DDP} \cite{ddpspec} which were released in 2012. Meteor allows clients to transparently subscribe to the results of predefined MongoDB queries updating live over arbitrary durations. Synchronization is achieved by the server watching the operations log of the database (called "oplog tailing") \cite{wingerath2019real}, keeping track of the entire set of published documents and subscribed queries for each connected client, and computing delta changes to send back to the client. See table \ref{tbl:dbcomparison} for a comparison of Meteor with other data management approaches.
Clients cache the received documents in a local instance of MongoDB implemented in JavaScript (called Minimongo), enabling local queries with no latency via an API that mostly mirrors the server-side MongoDB interface. Combined with a functional-reactive view library such as React to bind the management of a view's lifecycle with its data subscriptions, this setup gives the impression of data "just being here" on the client with almost no programmatic overhead.
\cleardoublepage
Reads are by default served from the client's local copy of the data, giving an AP read path that may result in the reading of stale data but remains performant and available. The application developer can choose to apply writes either optimistically to the local database which Meteor later replicates to the server (that may issue corrections) thus yielding AP characteristics with last-write-wins pseudo resolution semantics (previous writes are lost because there is no concept of time); or let the server process the transaction via a Remote Procedure Call (RPC) and wait for a quorum of the database servers before confirming and replicating to the clients, thus yielding CP semantics.
The major problem of Meteor is its absolute dependence on MongoDB and the ensuing lack of a relational data model, i.e., no joins and requiring to denormalize data, leading to incidental complexity of the application code and the $n+1$ query problem. Its Minimongo implementation lacks indices and does not support all of MongoDB's query capabilities. Meteor's publication/subscription mechanism does not scale well to many clients or publication queries returning a large number of documents. This is because the server keeps a working copy of all documents present in each connected client's local database in memory and creates a diff for each client on each database update. As a result, Meteor's publication/subscription mechanism is stateful and requires sticky sessions, meaning that an upstream load balancer must take care to always forward a client's requests to the same backend server.
Preliminary load testing and tracing of performance problems during typical workloads encountered in the author's medical information system concluded that performance of Meteor's publication/subscription system starts to degrade noticeably with more than 300 clients connected to one server process on moderate dedicated server hardware, or when a client subscribes to a single publication containing more than around 100,000 documents. Clients subscribing to such large collections has intermittently led to server processes permanently hanging at 100\% processor usage until forcibly restarted, an issue which has been known for at least two years\footnote{Meteor issue \#9796 on GitHub: \url{https://github.com/meteor/meteor/issues/9796}}. Despite these shortfalls, Meteor is production-grade software with extraordinary developer ergonomics and served as the backbone of the authors's business replicating terabytes of data extremely reliably since first put into production five years ago.
While this work originally aimed to design a publication/subscription mechanism with developer usability goals similar to Meteor's, its focus shifted towards an exploration of a wholly different data layer based on a model implementation of an immutable bitemporal EAV database, with only minimal yet general query subscription capabilities remaining.
\begin{table}[]
\caption{Comparison of Meteor with other database and stream processing systems \cite{wingerathcase}}
\begin{tabular}{r|c|c|c|c|}
\cline{2-5}
& \textbf{DBMS} & \textbf{\begin{tabular}[c]{@{}c@{}}Real-time\\ database\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Data stream\\ management\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Stream\\ processing\end{tabular}} \\ \hline
\multicolumn{1}{|r|}{\textbf{data}} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}persistent\\ collections\end{tabular}} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}persistent \& ephemeral\\ streams\end{tabular}} \\ \hline
\multicolumn{1}{|r|}{\textbf{processing}} & one-time & \begin{tabular}[c]{@{}c@{}}one-time \&\\ continuous\end{tabular} & \multicolumn{2}{c|}{continuous} \\ \hline
\multicolumn{1}{|r|}{\textbf{access}} & random & \begin{tabular}[c]{@{}c@{}}random \&\\ sequential\end{tabular} & \multicolumn{2}{c|}{sequential} \\ \hline
\multicolumn{1}{|r|}{\textbf{streams}} & \multicolumn{3}{c|}{structured} & \begin{tabular}[c]{@{}c@{}}structured \&\\ unstructured\end{tabular} \\ \hline
\multicolumn{1}{l|}{} & \begin{tabular}[c]{@{}c@{}}PostgreSQL,\\ MongoDB\end{tabular} & \begin{tabular}[c]{@{}c@{}}Firebase,\\ Meteor,\\ RethinkDB,\\ Parse\end{tabular} & \begin{tabular}[c]{@{}c@{}}Influx,\\ PipelineDB,\\ SQLStream\end{tabular} & \begin{tabular}[c]{@{}c@{}}Storm,\\ Samza,\\ Flink,\\ Spark\end{tabular} \\ \cline{2-5}
\end{tabular}
\label{tbl:dbcomparison}
\end{table}
\paragraph{Datomic.} Most of the design tenets for the project described in section~\ref{sec:design} originate from the database Datomic \cite{hickey2012dbvalue} which was created by the author of the Clojure programming language in 2012. Its data model is a monotemporal (transaction time \gls{tx} only) and immutable log of assertions and retractions of EAV facts, with first-class transactions being reified as regular queryable facts themselves. Datomic requires a minimal schema to be specified upfront.
Instead of thinking of the database as a single \emph{place} to send queries to and receive tuples from, Datomic separates its components as follows to achieve horizontal read scalability, turning the classical model of a database "inside out" \cite{kleppmann2017designing}:
\begin{itemize}
\item There is a \emph{generic} stateful distributed storage layer, which may be backed by any pluggable implementation (SQL, Redis, Riak, Kafka, files, memory...). The storage layer keeps all data in the form of tuples and various indices. \cite{hickey2019datomic}.
\item Application servers (called "peers") operate on a working set of tuples in memory. They perform queries themselves \emph{locally on their own copy of the data.} Tuples are cached locally as needed and are lazily loaded from the storage layer over the (local) network, where disk locality is considered irrelevant \cite{ananthanarayanan2011disk}. As put by the author: "perception does not require coordination \cite{hickey2012values}", yet consistency and caching are trivial because tuples are immutable.
\item All write transactions -- represented as data structures -- go through a \emph{single} instance of the \emph{transactor} component, which is the only part of the system that may write to the storage backend. Because it is one single threaded instance, it can impose a global order on all incoming transactions. Conversely, the transactor is the bottleneck for write throughput, as well as a single point of failure for which a second instance should be operated in a hot failover configuration. Peers can place first-class functions into the database, and call them as part of a transaction to enforce arbitrary data invariants \cite{datomicdocs}.
\end{itemize}
\paragraph{Crux.} Rooted in the same core principles as Datomic but having taken some different architectural choices, Crux is an efficient \emph{bitemporal} schemaless unbundled document database. For a detailed comparison with Datomic, see \cite{juxtcrux}.
\cleardoublepage
\paragraph{Datascript et al.} Similarly inspired by Datomic, DataScript \cite{prokopov15datascript} is an immutable schemaless in-memory EAV store. It lacks notion of time and history because it is designed to manage frequently changing client-side state of an \gls{SPA} within the browser for the lifetime of a session where memory is limited.
Several attempts exist, including this thesis, to bridge the gap between a Datomic-like database on the server and a DataScript-like instance on the client, which may be used simultaneously for client-only view state as well as for caching tuples from the server \cite{small16datscript}; and for closing the gap between that database and the view layer \cite{parker15posh,krivosheev19reposh}. Datahike \cite{datahike} is a port of DataScript back to the server, where it adds durable persistence. It aims to be a single-node replacement for Datomic suitable for use in small projects.
Mozilla Mentat \cite{mozillamentat} was an attempt at a performant embeddable implementation in Rust of a combination of ideas from Datomic, DataScript, and SQLite.
Eva \cite{eva} is an open source monotemporal EAV database very similar to and compatible with the \gls{API} of Datomic, though the project appears to be abandoned in a feature-complete "alpha" quality stage.
Ittyon \cite{ittyon} is a Clojure library to manage distributed state in games based on ideas from Entity-Component Systems (ECS) combined with the EAV+$t_x$ data model. The testbed described in section~\ref{sec:implementation} is based on its client/server implementation.
This thesis takes ideas related to increased expressivity from each of these projects and presents a greatly simplified implementation which ignores all aspects of performance, efficiency, or interoperability.
\paragraph{Hyperfiddle, LogicBlox, and Eve.} A notable proprietary commercial RAD self-serve database system with an integrated programming language based on Datalog, LogicBlox \cite{aref2015design} similarly aims to reduce incidental complexity of developing line-of-business applications with a focus on probabilistic and predictive analytics of large business data sets. Its developers have contributed a novel efficient algorithm for incremental view maintenance for Datalog systems \cite{veldhuizen2012leapfrog}.
The ongoing Hyperfiddle project \cite{getz18hyperfiddle} aims to be a \gls{DSL} to fully describe an application using only plain expressions of \gls{edn}. It is built on Datomic to interactively and quickly create a \gls{UI} for database applications,
Ambitious but ultimately abandoned, the Eve project \cite{eve} was an attempt to reinvent programming for "humans first" through a combination of development environment, database, and a novel relational and reactive programming language where e.g. the concept of identifier scope is abandoned, and the order of statements has no semantics.
If the data layer presented in this thesis would be evolved towards a full framework for declaratively turning business requirements into functioning applications, following in the spirit of Meteor, the resulting product could look like a combination of Hyperfiddle, LogicBlox and Eve.