Commit 4a39c2a — v1
albertzak committed May 23, 2020
1 parent ef39407 commit 4a39c2a
Showing 13 changed files with 170 additions and 122 deletions.
Binary file added Zak20-v1.pdf
1 change: 1 addition & 0 deletions glossary.tex
@@ -35,3 +35,4 @@
\newacronym{txe}{txe}{Transaction entity}
\newacronym{UUID}{UUID}{Universally Unique Identifier}
\newacronym{ECS}{ECS}{Entity-Component Systems}
\newacronym{RPC}{RPC}{Remote Procedure Call}
33 changes: 33 additions & 0 deletions lit.bib
@@ -722,3 +722,36 @@ @inproceedings{wiebusch2015decoupling
year={2015},
organization={IEEE}
}

@article{mccarthy1960recursive,
title={Recursive functions of symbolic expressions and their computation by machine, Part I},
author={McCarthy, John},
journal={Communications of the ACM},
volume={3},
number={4},
pages={184--195},
year={1960},
publisher={ACM New York, NY, USA}
}

@article{stead1983chartless,
title={A chartless record—Is it adequate?},
author={Stead, William W and Hammond, William E and Straube, Mark J},
journal={Journal of medical systems},
volume={7},
number={2},
pages={103--109},
year={1983},
publisher={Springer}
}

@article{decker2000semantic,
title={The semantic web: The roles of XML and RDF},
author={Decker, Stefan and Melnik, Sergey and Van Harmelen, Frank and Fensel, Dieter and Klein, Michel and Broekstra, Jeen and Erdmann, Michael and Horrocks, Ian},
journal={IEEE Internet computing},
volume={4},
number={5},
pages={63--73},
year={2000},
publisher={IEEE}
}
5 changes: 5 additions & 0 deletions sections/abstract.tex
@@ -2,3 +2,8 @@

\section*{Abstract}

Growing customer demands lead to increased incidental complexity of data-intensive distributed applications. Relational Database Management Systems (RDBMS), especially those implementing the Structured Query Language (SQL), possess performant but destructive-by-default write semantics. Modeling domains such as healthcare in classic RDBMS leads to an explosion in the number of columns and tables. Their structure has to be known in advance, discouraging an explorative development process. Requirements of real-time collaboration, auditability of changes, and evolution of the schema push the limits of these established paradigms.

The core of the data layer presented in this thesis is a simple relational model based on facts in the form of Entity-Attribute-Value (EAV) triples. A central append-only immutable log accretes these facts via assertions and retractions. Transactions guarantee atomicity, consistency, isolation, and durability (ACID) and are themselves first-class queryable entities carrying arbitrary meta facts, realizing strict bitemporal auditability of all changes by keeping two timestamps: transaction time $t_x$ and valid time $t_v$. Changes are replicated to clients which subscribe to the result of a query. Multiple incrementally maintained indices (EAVT, AEVT, AVET, VAET) grant efficient direct access to tuples, making the database behave like a combined graph, column, and document store. The database itself is an immutable value which can be passed within the program and queried locally.

This work demonstrates the feasibility of implementing various desirable features in less than 400 lines of Clojure: bitemporality, audit logging, transactions, server-client reactivity, consistency criteria, derived facts, and a simple relational query language based on Datalog.
6 changes: 6 additions & 0 deletions sections/conclusion.tex
@@ -23,3 +23,9 @@ \section{Future Work}

\cleardoublepage
\section{Conclusion}

Easy things are easy, hard things are hard. This thesis set out to redesign and implement the entire data layer using a combination of non-mainstream ideas: from representing data as EAV facts, to the database being an immutable value to be passed around and queried locally, to accreting facts only via assertion and retraction yet keeping ACID guarantees, to storing functions as values inside the database which derive new facts or act as constraints, to a replication mechanism where clients subscribe to the live-updating result set of a query, to pulling simple values directly out of the database by reaching into its index structures, or asking complex questions with a pattern matching query language.

The resulting implementation in less than 400 lines is impressively tight yet appears to map nicely to the initial design, mostly thanks to the incredibly expressive Clojure language and its built-in immutable data structures. While all of the mentioned features are implemented to a degree that is just enough to experiment with, almost all of the more complicated aspects were simply left out of the design: there is no expectation of performance, efficiency, scale, security, safety, testing, or any implied suitability for usage in the real world or with more than a handful of sample facts. The proof-of-concept fixates on the easy parts of that utopian data layer design, which were almost trivial to implement, and barely covers any truly complex minutiae. In particular, the implementation of the query language turned out to be harder than anticipated, despite cutting out almost all but the most basic pattern matching features of a real Datalog.

Yet the formidable degree to which the presented ideas appear to mesh together, supported by a considerable body of related work, gives a sound impression of the general direction of this and similar designs for better data layers.
47 changes: 19 additions & 28 deletions sections/design.tex
@@ -1,5 +1,6 @@
\section{Design}\label{sec:design}

The contribution of this work is divided into two main sections, design and implementation. The subsequent parts of this section first present various design problems of commonly employed data layer technologies. Deriving from their limitations, the next part paints a blissful picture of what an ideal data layer would look like (subsection~\ref{sec:goals}), while the following demarcates the scope of the contribution (subsection~\ref{sec:nongoals}). Finally, the conceptual model (subsection~\ref{sec:conceptual_model}) and the query language (subsection~\ref{sec:query_language}) are presented.

\input{sections/design_problems}
\input{sections/design_goals}
@@ -19,7 +20,7 @@ \subsection{Conceptual model}\label{sec:conceptual_model}

\paragraph{Indexing.}

EAV systems commonly keep a number of sorted indices (see table \ref{tbl:indices}) to allow the data to be retrieved from multiple ``angles'' or directions, depending on the needs of the query. Index structures are named after the \emph{nesting order} in which the elements of the facts are arranged. Not all database systems maintain the same indices. In this case, the system keeps four indices covering the following common use cases:
\begin{itemize}
\item EAVT, the canonical order, which \emph{maps} an entity to its attributes like a document,
\item AEVT, for finding entities which \emph{have} a certain attribute set
@@ -31,16 +32,16 @@ \subsection{Conceptual model}\label{sec:conceptual_model}
\newcolumntype{s}{>{\hsize=.5\hsize}X}

\begin{table}[]
\caption{Impact of the index sort order on the area of application}
\begin{tabularx}{\textwidth}{|l|s|s|X|}
\hline
\textbf{index} & \textbf{name} & \textbf{feels like} & \textbf{good for} \\ \hline
EAVT & ``entity-oriented'' & document store & accessing various attributes of a known entity \\ \hline
AEVT & ``attribute-entity-oriented'' & column store & accessing the same attribute of various entities \\ \hline
AVET & ``attribute-value-oriented'' & filtering a column store & finding entities by the value of a specific attribute \\ \hline
VAET & ``value-oriented'' & searching everything & searching over all values, regardless of attribute \\ \hline
\end{tabularx}
\label{tbl:indices}
\end{table}

For example, the name of a known patient can be pulled out using only the \lisp{get-in} function of the Clojure core library on the \lisp{:eavt} index:
@@ -62,29 +63,29 @@ \subsection{Conceptual model}\label{sec:conceptual_model}
\end{center}
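Such a lookup can be sketched as follows; the entity id and the exact nesting of the index are assumptions for illustration, not fixed by the design:

\begin{lstlisting}[caption={Sketch of a direct lookup in the \lisp{:eavt} index (hypothetical entity id)},morekeywords={get-in}]
;; assuming the :eavt index behaves like a nested map
;; of entity -> attribute -> value
(get-in db [:eavt :patient/123 :name])
\end{lstlisting}

Because the index is an ordinary immutable data structure, no query engine is involved in such point lookups.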


\cleardoublepage
\subsection{Query language}\label{sec:query_language}

The query language of the system is a greatly simplified language modeled after the pattern matching relational query language used in Datomic, which is in turn a Lisp variant of the Datalog \cite{abiteboul1988datalog} language expressed using the syntactic forms of Clojure's \gls{edn}.

The choice of language is arbitrary -- any relational language would suffice -- and the core of the database does not depend on any query language capabilities. Modeling the language after the one used in Datomic was chosen not only because the edn notation has become a de-facto standard for other EAV databases like Crux, EVA, and Datascript, but also because the shape of each query clause maps naturally to the representation of a fact in canonical EAV order.

See listing~\ref{lst:example_query} for a query consisting of four query clauses (the \lisp{:where} part) performing an implicit join, and a final projection (\lisp{:find}) to extract the values bound to the \emph{\gls{lvar}} symbols \lisp{?name} and \lisp{?location}. For example, the query clause \lisp{[?p :name ?name]} applied to the fact \lisp{[:person/123 :name "Hye-mi"]} would result in \emph{binding} the lvar \lisp{?p} to the value \lisp{:person/123}, and the lvar \lisp{?name} to the value \lisp{"Hye-mi"}. Other clauses are bound likewise. Note that multiple occurrences of the same lvar prompt \emph{unification} with the same value, creating an implicit \emph{join}. The order of the query clauses has no semantic meaning.

Performing a query entails applying the \lisp{q} function to a database value and a query. Clients can thus decide whether to leverage the query language by loading a library, or to access the data directly via the index structures.

\begin{lstlisting}[label={lst:example_query},caption={Who from Ulsan is working for whom?}]
'[:find [?name ?company]
:where [[?p :works-for ?e]
[?e :name ?company]
[?p :name ?name]
  [?p :location "Ulsan"]]]
\end{lstlisting}
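Invoking the query function on a database value and this query could then look as follows; this is a sketch, and the shape of the result set is an assumption:

\begin{lstlisting}[caption={Sketch of applying the query function},morekeywords={q}]
;; q takes the database as an ordinary value plus the query,
;; and returns the tuples bound by the :find projection
(q db '[:find [?name ?company]
        :where [[?p :works-for ?e]
                [?e :name ?company]
                [?p :name ?name]
                [?p :location "Ulsan"]]])
\end{lstlisting}

Since \lisp{db} is just a value, the same call works identically on the server, on a client replica, or on a temporally filtered copy.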

\paragraph{Temporal and bitemporal queries.}
As stated in section~\ref{sec:nongoals}, the (bi-)temporal aspects of the described system are secondary -- they are to be used for infrequent auditing purposes. Consequently, the design of the indexing and query mechanisms can be greatly simplified by forgoing bitemporal indexing strategies such as \cite{nascimento1995ivtt}.

As the query function simply takes a database as a \emph{value}, a \emph{filtering function} can be applied to the database beforehand. The \lisp{keep} function in listing~\ref{lst:queryfilter} returns a structurally shared and lazy copy of the database filtered by arbitrary bounds of the relevant timestamps $t_x$ and $t_v$.

\begin{lstlisting}[label={lst:queryfilter},caption=Applying a temporal filter before querying,morekeywords={keep,q,<,>}]
(q (keep
@@ -95,21 +96,11 @@ \subsection{Query language}
\end{lstlisting}


\paragraph{Per-entity history.} A common use case in auditing is to retrieve the \emph{history} of all attributes related to a given entity over time. The \lisp{history} function takes a database value (optionally composed with a filtering function as described above) and an entity value, and returns an ordered slice of the log with transactions relevant to the requested entity. Note that it does not make sense to create a new database value from a history log, because that would again result in only the latest values being present in the index.
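A call retrieving such an audit trail might look like this; the entity id is hypothetical:

\begin{lstlisting}[caption={Sketch of retrieving a per-entity audit trail},morekeywords={history}]
;; ordered slice of the log containing every transaction
;; that asserted or retracted facts about this entity
(history db :patient/123)
\end{lstlisting}

Composing \lisp{history} with the temporal filter shown in listing~\ref{lst:queryfilter} restricts the trail to a window of $t_x$ or $t_v$.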

\paragraph{Publication and subscription.}

One of the goals states that clients should be able to declaratively subscribe to the \emph{live result set} of a query. Both the results and the query itself may change over the duration of a client's session, and each change triggers an immediate re-render of the UI. Conceptually, clients \emph{install} their \emph{subscription queries} on the server, and the infrastructure re-runs a subscription query whenever the underlying data changes and notifies the client of the changed results. The design does not prescribe whether or not to replicate past (i.e. superseded or retracted) facts; this greatly simplifies the proof-of-concept implementation by deferring concerns such as diffing, authorization, and the decision of what exactly to replicate to the developer customizing this data layer to their use case.
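Installing a subscription could be sketched as follows; \lisp{subscribe!} and \lisp{rerender!} are hypothetical names chosen for this illustration, not part of the presented implementation:

\begin{lstlisting}[caption={Sketch of installing a subscription query},morekeywords={subscribe!,rerender!,fn}]
;; install the query on the server; the callback fires
;; with the fresh result set on every relevant change
(subscribe! conn
  '[:find [?name]
    :where [[?p :name ?name]]]
  (fn [result-set]
    (rerender! result-set)))
\end{lstlisting}

The client thus never polls; the server pushes new result sets as a consequence of log writes, in the style popularized by Meteor.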

\paragraph{Security.} While extreme dynamism may be warranted in a high-trust environment, a real-world application may interact with malicious parties and thus needs a means to restrict queries on the server side. Clients would need to authenticate themselves, and the server would authorize publication based on access rules. There is no simple way to statically analyze queries submitted by a client for safety properties, but the server can control which facts are allowed to be replicated to each client. A publication might, for example, choose not to replicate facts with specific attributes, or transform facts to censor parts of the value.
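On the server, such field-level access control could be as simple as a predicate over facts before replication; the attribute names and the \lisp{publishable?} helper are illustrative assumptions:

\begin{lstlisting}[caption={Sketch of field-level filtering in a publication},morekeywords={defn,filter,not,contains?}]
;; never replicate facts carrying sensitive attributes
(defn publishable? [[e a v]]
  (not (contains? #{:ssn :diagnosis} a)))

(filter publishable? facts-to-replicate)
\end{lstlisting}

More elaborate publications could dispatch on the authenticated client, or rewrite the value component of a fact instead of dropping it.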
