% dodaj opcję [licencjacka] dla pracy licencjackiej % dodaj opcję [en] dla wersji angielskiej (mogą być obie: [licencjacka,en]) \documentclass[en]{pracamgr} \usepackage{definitions} % Dane magistranta: \autor{Marcin Chrzanowski}{370754} \title{MSO Query Answering on Trees} \titlepl{Odpowiadanie na zapytania MSO na drzewach} \kierunek{Computer Science} % Praca wykonana pod kierunkiem: % (podać tytuł/stopień imię i nazwisko opiekuna % Instytut % ew. Wydział ew. Uczelnia (jeżeli nie MIM UW)) \opiekun{dr hab. Szymon Toruńczyk\\ Institute of Informatics\\ } % miesiąc i~rok: \date{August 2021} %Podać dziedzinę wg klasyfikacji Socrates-Erasmus: \dziedzina{ 11.3 Informatyka\\ } %Klasyfikacja tematyczna według ACM \klasyfikacja{% TODO D. Software\\ D.127. Blabalgorithms\\ D.127.6. Numerical blabalysis} \keywords{MSO, query answering, tree automata, RMQ} \begin{document} \maketitle %tu idzie streszczenie na strone poczatkowa \begin{abstract} We define relabel regular queries on trees, which, via the known equivalence between tree automata and MSO formulae on trees, happens to be a generalization of the MSO query answering problem on trees. We show these queries can be performed in linear time with respect to query size (constant time in the case of MSO formulae with only first-order free variables) after preprocessing the input tree in linear time. Along the way, we show an algorithm for handling queries of the form ``Does the infix of a tree branch between nodes $x$ and $y$ belong to the regular language $L$'' (for a previously fixed regular language $L$) in the same time complexities. Our approach is much simpler in presentation than a previously known solution due to Colcombet. \end{abstract} \tableofcontents \chapter{Introduction} % \addcontentsline{toc}{chapter}{Introduction} \queryproblem[% an MSO formula $\varphi(\vec{X})$ over trees with $k$ free second-order variables. ]{% MSO Query Answering on Trees }{% a tree $T$. }{% given a $k$-tuple of subsets of $T$'s vertices $\vec{W} \in \mathcal{P}(V(T))^{k}$, is $\vec{W}$ a satisfying assignment to $\vec{X}$? In other words, does $T \models \varphi(\vec{W})$? } \queryproblem[% a deterministic bottom-up tree automaton $A$ over ranked alphabet $\Sigma$. ]{% Relabel Regular Queries on Trees }{% a tree $T$ labeled with $\Sigma$. }{% given $m$ relabelings $v_1 \mapsto a_1, \ldots, v_m \mapsto a_m$, where $v_i$ are vertices of $T$ and $a_i$ are elements of $\Sigma$, what state does $A$ arrive at in the root of $T'$, where $T'$ is $T$ with each $v_i$'s label modified to the corresponding $a_i$. } \chapter{Preliminaries}\label{r:pojecia} % \begin{quote} % Blaba, który jest blaba, nie jest prawdziwym blaba. % \raggedleft\slshape tłum. z~chińskiego Robert Blarbarucki % \end{quote} \section{Definitions} \subsection{Trees} We work with finite trees whose vertices are labeled with letters from a finite alphabet. More formally, given a finite alphabet $\Sigma$, for each $a \in \Sigma$, $a$ is a tree, and if $t_1, \ldots, t_k$ are trees, then $a(t_1, \ldots, t_k)$ is also a tree. \subsection{Tree automata} c \subsection{Monadic Second Order (MSO) Logic} % def of MSO From a logical point of view, the trees we work with can be seen as structures over a signature with a single binary relation $E$ and unary relations $U_a$ for each $a \in \Sigma$. $E(v, w)$ represents a (directed from parent to child) edge from $v$ to $w$. $U_a(v)$ signifies that $v$'s label is $a$. For convenience, we will also make use of the binary relation $\leq$, with $v \leq w$ signifying that $v$ is an ancestor of $w$ (with every vertex being an ancestor of itself; $<$ can be used to signify a strict ancestor). Note that in the case of MSO on trees, $\leq$ could be defined using just the edge relation $E$. \subsection{Query answering problems} d \section{Known algorithms we will use} \subsection{Least Common Ancestor} \queryproblem{% Least Common Ancestor (LCA) Queries }{% a tree $T$. }{% given vertices $x$ and $y$, find the vertex $z$ that's an ancestor of both $x$ and $y$, and is their lowest (i.e. furthest from the root) common ancestor. } \subsection{Range Minimum Query} f \chapter{Branch Infix Regular Queries}\label{r:branchinfix} \queryproblem[% regular language $L$ over alphabet $\Sigma$. ]{Branch Infix Regular Queries}{% a $\Sigma$-labeled tree $T$. }{% given a vertex $x$ and its descendant $y$, does the word given by labels on the path from $x$ to $y$ belong to $L$? } \section{Word infix regular queries} \queryproblem[% regular language $L$ over alphabet $\Sigma$. ]{Word Infix Regular Queries}{% a word $w \in \Sigma^*$. }{% given indices $i$ and $j$ with $1 \leq i < j \leq |w|$, does the infix $w[i, j]$ belong to $L$? } There is a simple and well-known data structure that, given a fixed automaton $A$, preprocesses an input word $w$ in time $O(n)$, and afterwards is able to answer in time $O(1)$ questions of the form ``Is the infix $w[i, j]$ accepted by $A$?''. We present the full construction as we will be generalizing its internals for the tree case. Take $A$ that is deterministic. We begin by replacing each letter of $w$ with the set of states of $A$, $Q$. Each letter $a$ of $w$ defines an injective function on $Q$, and we draw these functions as directed edges between successive copies of $Q$. For example, if $A$ in state $q$, reading $a$, would move to state $q'$, then, for any copy of $Q$ corresponding to an instance of $a$ in the original word, there will be an edge from $q$ there to $q'$ in the successive copy of $Q$. Now we will color the vertices of the graph we just constructed with the colors $1, 2, \ldots, |Q|$ in such a way that \begin{enumerate} \item every copy of $Q$ has one vertex of each of the $|Q|$ colors; \item when a vertex of color $i$ has an edge to a vertex of color $j$ in a successive copy of $Q$, then $i \geq j$. \end{enumerate} The second point basically means that we will be trying to draw single-color paths, but when paths need to join, it is the higher-colored path that gets cut off. The construction is as follows: \begin{enumerate} \item Color an arbitrary vertex of the first copy of $Q$ with the color $1$. \item Follow the deterministic edges to the end of the word, coloring all vertices along this path with $1$. \item Now color another uncolored vertex of the first copy of $Q$ with the color $2$. \item Try following edges as far as possible, coloring all vertices with $2$. \item If your run into an already colored vertex, pick an arbitrary uncolored vertex in this copy of $Q$ to color with $2$ and continue from here. \item Repeat steps 3.-5. for each successive color up to $|Q|$. \end{enumerate} Additionally, in each vertex, we store the index of the next copy of $Q$ in which the path of this vertex's color is broken by a lower color. \section{Generalizing to trees} We will now generalize the above construction from regular infix queries on a word, to ones on a tree. The exact problem we're solving is \emph{Given}: a tree $T$ with labels from alphabet $\Sigma$, a deterministic finite automaton $A$. \emph{Queries}: given a vertex $x$ and its descendant $y$, is the word given by labels on the path from $x$ to $y$ accepted by $A$? We begin with similar path coloring as in the word case, i.e. we replace each vertex of the tree with a copy of the states of $A$, $Q$. Each labeled node defines \section{Highest Black Descendant on Path} We can now reduce the problem to the following: \queryproblem{Highest Marked Descendant on Path Queries}{% a tree $T$ with set $M \subseteq V(T)$ of marked vertices. }{% given a vertex $x$, its descendant $y$, find the node $z \in M$ that is the highest marked node on the path between $x$ and $y$, if such $z$ exists. } We will build an index structure, constructible in linear time, that allows us to handle such queries in constant time. The structure is heavily inspired by Bender and Farach-Colton's presentation of an algorithm originally described by Berkman and Vishkin for computing LCA queries. First, let's define an auxillary problem we will use, that of Range Minimum Queries (RMQ): \queryproblem{Range Minimum Queries}{% an array $A$ of integers. }{% given indices $i$ and $j$, return the index of the smallest element in the subarray $A[i, j]$. } Fact: RMQ queries can be answered in constant time after linear preprocessing of the input array. We turn to describing the index structure for our tree problem. First, we create the array $POST$ of length $n$, which is the post-order of the nodes. Next, we label each node of the tree with its pre-order number. We create the array $PRE$ with the corresponding pre-order labels of the nodes in $POST$, i.e. if $POST[i] = v$, then $PRE[i]$ is $v$'s pre-order number. Finally, for each node of the tree, we record its index in $POST$ in the array $INDEX$. Observation: given a node $x$ and its descendant $y$, looking at the range $PRE[INDEX[y], INDEX[x] - 1]$, all the values in this range are descendants of $x$, and the values smaller than $PRE[INDEX[y]]$ correspond to ancestors of $y$. In particular, the minimum of that range corresponds to the highest ancestor of $y$ that's a descendant of $x$. In our problem, we only care about ancestors of $y$ that are colored black. So we perform one final modification of our data structure: for all non-black vertices $v$, we change $PRE[INDEX[v]]$ to $\infty$ (which can be represented by $n+1$, an integer greater than any node's pre-order label). With $PRE$ modified like this, we observe that now the minimum value of $PRE[INDEX[y], INDEX[x] - 1]$ corresponds exactly to the answer of our queries -- the highest black node between $x$ and $y$. We preprocess $PRE$ for RMQ queries in linear time. Now when given a query $x$, $y$, we: \begin{enumerate} \item Lookup $i := INDEX[y]$ and $j := INDEX[x] - 1$. \item Perform an RMQ query on $PRE[i, j]$, giving us index $k$ of the minimal value in that range. \item Lookup the corresponding vertex as $z := POST[k]$. This is the answer to our query. \end{enumerate} \chapter{Relabel Regular Queries on Trees} h \chapter{Conclusions} \end{document} %%% Local Variables: %%% mode: latex %%% TeX-master: t %%% coding: latin-2 %%% End: