Commit

feat(report): submitted version
kory33 committed Jun 1, 2023
1 parent a439c35 commit 37eebf8
Showing 1 changed file with 26 additions and 15 deletions.
41 changes: 26 additions & 15 deletions report/report.tex
@@ -75,18 +75,30 @@
\newcommand{\LRealz}[2]{\ensuremath{\normalfont{\textrm{LRealz}}_{ #1 }\!\left( #2 \right)}}
\newcommand{\SubgoalRule}[1]{\ensuremath{\normalfont{\textrm{SubgoalRule}} \left( #1 \right)}}

-\title{Rewriting Conjunctive Queries \\ Under Guarded TGDs}
-\author{Ryosuke Kondo}
-\date{May 2023}

\begin{document}
-\maketitle
+\begin{titlepage}
+\centering
+\par\vspace{1cm}
+{\textsc{University of Oxford} \par}
+\vspace{1cm}
+{\Large \textsc{Part C Project Report}\par}
+\vspace{1.5cm}
+{\huge\bfseries Rewriting Conjunctive Queries \\ Under Guarded TGDs\par}
+\vspace{2cm}
+{\Large\itshape Ryosuke Kondo \par}
+\vfill
+Supervised by\par
+Prof. Michael Benedikt
+
+\vfill
+
+{\large Trinity Term 2023\par}
+\end{titlepage}

\newpage
\thispagestyle{empty}
\
\newpage
\pagenumbering{roman}

\chapter*{\centering Abstract}

@@ -99,7 +111,6 @@ \chapter*{\centering Abstract}
\newpage

\newpage
\pagenumbering{arabic}
\chapter{Introduction}

\section{Background}
@@ -122,7 +133,7 @@ \section{Background}

A class of logical formulas known as Tuple Generating Dependencies (TGDs), which are of the form $\forall \vec{x}. \beta \rightarrow \exists \vec{y}.\ \eta$, could describe data integrity rules. In the example above, the two rules would be written as $\forall X.\ U(X) \rightarrow \exists Z.\ R(X, Z)$ and $\forall X, Y.\ R(X, Y) \rightarrow S(X, X)$.

-Unfortunately, query answering under general TGDs is undecidable \cite{beeri_vardi_1981}. A line of work, including \cite{cali_gottlob_kifer_2013}, identified \emph{Guarded TGDs} (GTGDs) as a subclass of TGDs that leaves query-answering decidable yet much more expressive than description logics used for ontological reasoning.
+Unfortunately, query answering under general TGDs is undecidable \cite{beeri_vardi_1981}. A line of work, including \cite{cali_gottlob_kifer_2013}, identified \emph{Guarded TGDs} (GTGDs) as a syntactically restricted subclass of TGDs that leaves query-answering decidable yet much more expressive than description logics used for ontological reasoning.

The first result that opened up the possibility towards practical query answering is shown in \cite{barany_benedikt_cate_2013}, which states that we can compute a \emph{Datalog rewriting} of (frontier) guarded TGDs and a conjunctive query. Roughly speaking, a Datalog rewriting is a set of existential-free TGDs that gives the same answer as the original query. It is well-known that a fixed Datalog program can be run in a polynomial time on a database.

@@ -134,7 +145,7 @@ \section{Contribution of This Work}

The primary theoretical contribution of this work is the development of a concrete Datalog rewriting algorithm. On our way, we introduce a variant of chase which we call \emph{shortcutting chase tree} and develop some theory concerning query satisfaction within the chase structure. We then apply the theory to derive a Datalog rewriting, demonstrating room for further optimisations.

-In addition to the theoretical work, we provide the first implementation of the GTGD rewriting algorithm in Java, incorporating some optimisations we will have discussed.
+In addition to the theoretical work, we provide the first \href{https://github.com/kory33/guarded-queries}{implementation} of the GTGD rewriting algorithm in Java, incorporating some optimisations we will have discussed.

\section{Outline of This Report}

@@ -1140,7 +1151,7 @@ \subsection{Glueing Subgoals}
Each $\mathrm{SglGlueingRule}_\mathrm{BVars}$ is ``sound" (in a sense as in \Cref{proposition:subgoal-captures-subquery-fulfilment}, by identifying subgoals with subquery fulfilments and the goal atom with query fulfilment), and also collectively complete (i.e. we can derive all answers to $Q$ as goal facts if we can use all glueing rules) by \Cref{corollary:base-connected-query-decomposition}.
\end{remark}

-\subsection{Putting Pieces Together}
+\subsection{Putting The Pieces Together}

Finally, we combine components from \Cref{section:naive-subquery-entailment-enumeration}, \Cref{subsection:subquery-entailment-to-datalog-rule} and \Cref{subsection:glueing-subgoals}.

Expand Down Expand Up @@ -1446,15 +1457,15 @@ \chapter{Conclusions and Further Discussion}
\section{Limitations and Future Work}
\label{section:limitations-and-future-work}

-% extension to wider class of TGDs?
+The current implementation lacks optimisations for trimming down the space of subquery entailment problem instances. Instead, it always explores the whole space, whose size is doubly exponential in the maximum arity of the input signature and exponential in the number of constants and predicates (\Cref{remark:rewriting-complexity}), making it impractical to rewrite large inputs such as real-world ontologies. Even though this matches with the theoretical lower bound of query answering procedure (which is \textsc{2exptime} for arbitrary arity and \textsc{exptime} for bounded arity \cite{cali_gottlob_kifer_2013}), we may be able to overcome this issue in some cases by analysing the structure of input rules. For instance, if a rule constant $c$ only appears in heads and not in the query, it is redundant to consider local instances containing facts with $c$ since no rule requires such facts.

-The current implementation lacks optimisations for trimming down the space of subquery entailment problem instances. Instead, it always explores the whole space, whose size is doubly exponential in the maximum arity of the input signature and exponential in the number of constants and predicates (\Cref{remark:rewriting-complexity}), making it impractical to rewrite large inputs such as real-world ontologies. We may be able to overcome this issue by analysing the structure of input rules. For instance, if a rule constant $c$ only appears in heads and not in the query, it is redundant to consider local instances containing facts with $c$ since no rule requires such facts.
+Arguably, the most crucial optimisation is handling instance subsumption, as discussed in \Cref{remark:naive-seenumeration-inefficiencies}: If a single atom $R(1, 2)$ suffices to entail a subquery, \emph{all} local instances containing a fact with $R$ no longer need to be tested for entailment, reducing the search space by a factor of $16 = 2^4$. We leave for future work the method for efficiently controlling the search space.

-Another crucial optimisation concerns instance subsumption, as discussed in \Cref{remark:naive-seenumeration-inefficiencies}: If a single atom $R(1, 2)$ suffices to entail a subquery, \emph{all} local instances containing a fact with $R$ no longer need to be tested for entailment, reducing the search space by a factor of $16 = 2^4$. We leave for future work the method for efficiently controlling the search space.
+Another performance consideration is, as remarked in \Cref{section:correctness-tests}, that we can rewrite some queries into atomic queries by adding a few guarded rules. Our system does not perform such preprocessing, nor does it reduce subquery entailment problems to atomic queries, even when induced subqueries are acyclic. Investigating the effectiveness of such input transformation is left for future work.

-Moreover, as remarked in \Cref{section:correctness-tests}, we can rewrite some queries into atomic queries by adding a few guarded rules. However, our system does not perform such preprocessing, nor does it reduce subquery entailment problems to atomic queries, even when induced subqueries are acyclic. Investigating the effectiveness of such input transformation is left for future work.
+Moreover, our prototypical system spends most of its CPU time in Datalog-saturating and chasing the local instances. We use an inefficient join algorithm without indexes and the Naive Evaluation to saturate instances for simplicity. One might want to incorporate more sophisticated join and saturation algorithms and compare their performances.

-Finally, our prototypical system spends most CPU time in Datalog-saturating and chasing the local instances. We use an inefficient join algorithm without indexes and the Naive Evaluation to saturate instances for simplicity. One might want to incorporate more sophisticated join and saturation algorithms and compare their performances.
+Finally, as mentioned in the cited paper, the result in \cite{barany_benedikt_cate_2013} concerning rewritability extends to a slightly wider class of TGDs known as frontier-guarded TGDs, where only frontier-variables have to be guarded in the body. Therefore, it is of theoretical interest if we could extend our approach to this class of TGDs.

\printbibliography
