sections/design.tex

\section{Design}\label{sec:design}

The contribution of this work is divided into two main sections, design and implementation. The subsequent parts of this section first present various design problems of commonly employed data layer technologies. Deriving from their limitations, the next part paints a blissful picture of what an ideal data layer would look like (subsection~\ref{sec:goals}), while the following demarcates the scope of the contribution (subsection~\ref{sec:nongoals}). Finally, the conceptual model (subsection~\ref{sec:conceptual_model}) and the query language (subsection~\ref{sec:query_language}) are presented.

\input{sections/design_problems}
\input{sections/design_goals}


\subsection{Conceptual model}\label{sec:conceptual_model}

The data model of the presented system is extremely simple. There is no requirement to design a schema or to differentiate between entities and relationships. Yet, \emph{all} structured data can be represented in the system as long as data is fully normalized (6NF).

\begin{itemize}
  \item The basic unit of information is a \emph{fact}, a triple \lisp{[e a v]} containing values representing \emph{entity}, \emph{attribute}, and \emph{value}.

  \item Over time, facts are \emph{asserted} and \emph{retracted}, accreted as part of a \emph{transaction}.

  \item The database and all changes to its state over time are fully described by the transaction log of assertions and retractions of facts.
\end{itemize}

\paragraph{Indexing.}

EAV systems commonly keep a number of sorted indices (see table \ref{tbl:indices}) to allow the data to be retrieved from multiple "angles" or directions, depending on the need of the query. Index structures are named after the \emph{nesting order} in which the elements of the facts are arranged. Not all database systems maintain the same indices. In this case, the system keeps four indices covering the following common use cases:
\begin{itemize}
  \item EAVT, the canonical order, which \emph{maps} an entity to its attributes like a document,
  \item AEVT, for finding entities which \emph{have} a certain attribute set
  \item AVET, for \emph{filtering} entities by a known attribute set to a known value,
  \item VAET, for \emph{searching} over all attributes of all entities by a known value.
\end{itemize}


\newcolumntype{s}{>{\hsize=.5\hsize}X}

\begin{table}[]
  \caption{Impact of the index sort order on the area of application}
  \begin{tabularx}{\textwidth}{|l|s|s|X|}
  \hline
  \textbf{index} & \textbf{name}               & \textbf{feels like}      & \textbf{good for}                                     \\ \hline
  EAVT          & "entity-oriented"           & document store           & accessing various attributes of a known entity        \\ \hline
  AEVT          & "attribute-entity-oriented" & column store             & accessing the same attribute of various entities      \\ \hline
  AVET          & "attribute-value-oriented"  & filtering a column store & finding entities by the value of a specific attribute \\ \hline
  VAET          & "value-oriented"            & searching everything     & searching over all values, regardless of attribute    \\ \hline
  \end{tabularx}
  \label{tbl:indices}
\end{table}

For example, here is a simple example to pull out the name of a known patient, using only the \lisp{get-in} function of the Clojure core library on the \lisp{:eavt} index:

\begin{center}
  \lisp{(get-in db [:eavt :patient/91 :name])}
\end{center}

One can also leave out the attribute, and get back a map (as a conceptual \emph{document}) containing all known attributes related to that entity.

\begin{center}
  \lisp{(get-in db [:eavt :patient/91])}
\end{center}

Performing a search by name over all patients is similarly trivial using the \lisp{:avet} index, with the result

\begin{center}
  \lisp{(get-in db [:avet :name "Hye-mi"])}
\end{center}


\cleardoublepage
\subsection{Query language}\label{sec:query_language}

The query language of the system is a greatly simplified language modeled after the pattern matching relational query language used in Datomic, which is in turn a Lisp variant of the Datalog \cite{abiteboul1988datalog} language expressed in of Clojure's \gls{edn}.

The choice of language is arbitrary -- any relational language would suffice -- and the core of the database does not depend on any query language capabilities Modeling the language after the one used in Datomic was chosen because not only has the edn notation become a de-facto standard for other EAV databases like Crux, EVA, and Datascript, but because the shape of each query clause maps naturally to the representation of a fact in canonical EAV order.

See listing~\ref{lst:example_query} for an query consisting of four query clauses (the \lisp{:where} part) performing an implicit join, and a final projection (\lisp{:find}) to extract the values bound to the \emph{\gls{lvar}} symbols \lisp{?name} and \lisp{?location}. For example, the query clause \lisp{[?p :name ?name]} applied to the fact \lisp{[:person/123 :name "Hye-mi"]} would result in \emph{binding} the lvar \lisp{?p} to the value \lisp{:person/123}, and the lvar \lisp{?name} to the value \lisp{"Hye-mi"}. Other clauses are bound likewise. Note that multiple occurrences of the same lvar prompt \emph{unification} with the same value, creating an implicit \emph{join}. The order of the query clauses has no semantic meaning.

Performing a query entails applying the \lisp{q} function to a database value and a query. Clients can thus decide whether to leverage the query language via loading a library, or just access the data via the index structures directly.

\begin{lstlisting}[label={lst:example_query},caption="Who from Ulsan is working for whom?"]
'[:find [?name ?company]
  :where [[?p :works-for ?e]
          [?e :name ?company]
          [?p :name ?name]
          [?p :location "Ulsan"]]]
\end{lstlisting}

\paragraph{Temporal and bitemporal queries.}
As stated in section~\ref{sec:nongoals}, the (bi-)temporal aspects of the described system are secondary -- they are to be used for infrequent auditing purposes. Consequently, the design of the indexing and query mechanisms can be greatly simplified be forgoing bitemporal indexing strategies such as \cite{nascimento1995ivtt}.

As the query function simply takes a database as a \emph{value}, a \emph{filtering function} can be applied to the database beforehand. The \lisp{keep} function in listing~\ref{lst:queryfilter} returns a structurally shared and lazy copy of the database filtered by arbitrary bounds of the relevant timestamps $t_x$ and $t_v$.

\begin{lstlisting}[label={lst:queryfilter},caption=Applying a temporal filter before querying,morekeywords={keep,q,<,>}]
  (q (keep
       ($\lambda$ [$t_x$ $t_v$]
         (and (> $t_v$ 300) (< $t_v$ 500)
              (< $t_x$ 700)))
       db) query)
  \end{lstlisting}


\paragraph{Per-entity history.} A common use case in auditing is to retrieve the \emph{history} of all attributes related to a given entity over time. The \lisp{history} function takes a database value (optionally composed with a filtering function as described above) and an entity value, and returns an ordered slice of the log with transactions relevant to the requested entity. Note that it does not make sense to create a new database value from a history log, because that would just result in only the latest values being present in the index yet again.


\paragraph{Publication and subscription}

One of the goals states that clients should be able to declaratively subscribe to the \emph{live result set} of a query. The results and the query itself will change over the duration of a client's session. Each change triggers an immediate re-render of the UI. Conceptually, clients \emph{install} their \emph{subscription queries} on the server, and the infrastructure will re-run the subscription query whenever the underlying data changes and notify the client of the changed results. The design does not prescribe whether or not to replicate past (i.e., superseded or retracted) facts, thus greatly simplifying the proof-of-concept implementation by deferring concerns such as diffing, authorization, and the decision of what exactly to replicate to the clients to the developer customizing this data layer to their use case.

\paragraph{Security.} While extreme dynamism may be warranted in a high-trust environment, a real-world application may interact with some malicious entities and thus needs a means to restrict queries on the server side. In a real-world application, clients would need to authenticate themselves and the server would authorize publication based on access rules. Yet, there is no simple way to statically analyze queries submitted by the client for safety properties, but the server can control which facts are allowed to be replicated to a client. A publication might, for example, choose to not replicate facts with specific attributes, or transform facts to censor parts of the value.