mgr.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837

% dodaj opcję [licencjacka] dla pracy licencjackiej
% dodaj opcję [en] dla wersji angielskiej (mogą być obie: [licencjacka,en])
\documentclass[en]{pracamgr}

\usepackage{definitions}
\usepackage[backend=biber]{biblatex}
\addbibresource{mgr.bib}

\usepackage{amsmath}
\usepackage{amsfonts}

% Dane magistranta:
\autor{Marcin Chrzanowski}{370754}

\title{MSO Query Answering on Trees}
\titlepl{Odpowiadanie na zapytania MSO na drzewach}
\kierunek{Computer Science}

% Praca wykonana pod kierunkiem:
% (podać tytuł/stopień imię i nazwisko opiekuna
% Instytut
% ew. Wydział ew. Uczelnia (jeżeli nie MIM UW))
\opiekun{dr hab. Szymon Toruńczyk\\
  Institute of Informatics\\
  }

% miesiąc i~rok:
\date{August 2021}

%Podać dziedzinę wg klasyfikacji Socrates-Erasmus:
\dziedzina{
11.3 Informatyka\\
}

%Klasyfikacja tematyczna według ACM
\klasyfikacja{% TODO
  D. Software\\
  D.127. Blabalgorithms\\
  D.127.6. Numerical blabalysis}

\keywords{MSO, query answering, tree automata, RMQ}

\begin{document}
\maketitle

\begin{abstract}
    We define relabel regular queries on trees, which, via the known equivalence
    between tree automata and MSO formulae on trees, happens to be a
    generalization of the MSO query answering problem on trees. We show these
    queries can be performed in time $O(m \log(m))$ with respect to query size
    (constant time in the case of MSO formulae with only first-order free
    variables) after preprocessing the input tree in linear time. Along the way,
    we show an algorithm for handling queries of the form ``Does the infix of a
    tree branch between nodes $x$ and $y$ belong to the regular language $L$''
    (for a previously fixed regular language $L$) in constant time after linear
    preprocessing.  Our approach is much simpler in presentation than a
    previously known solution due to \textcite{colcombet}.
\end{abstract}

\tableofcontents

\chapter{Introduction}

The model checking problem is a classic problem in theoretical computer science
in which we are tasked with answering whether a given structure $S$ satisfies a
sentence $\varphi$ of some logic. Several other interesting computational
problems arise when we generalize to formulae with free variables.

As an analogy to databases, formulae with free variables are often referred to
as \definedterm{queries} in the literature. We consider the \definedterm{query
answering} problem: given a query $\varphi(X_1, ..., X_k)$ and a structure $S$,
we want to answer questions of the form ``does assigning the set of elements
$W_1$ to $X_1$, ..., $W_k$ to $X_k$, satisfy $\varphi$?''.  A naive solution
would be to transform the formula $\varphi$ into a sentence by adding $k$ unary
relations $U_1, \ldots, U_k$ and labeling elements of $W_i$ with $U_i$. Now we
can answer an individual question in the same time as model checking.

Already model checking is known to be computationally difficult for various
interesting logics, but the situation can improve when restricting the class of
structures considered. In this work we will focus on the case of MSO logic over
trees. While MSO model checking is NP-hard in general, \textcite{courcelle1990}
showed that deciding if an MSO sentence holds in a structure of bounded
treewidth can be done in time linear in the size of the structure. This, using
the naive method described above, gives us an algorithm for MSO query answering
over trees that answers each question in time $O(n)$.

In our setting we suppose that we want to answer multiple such questions and
thus allow a preprocessing step over the formula and tree. In this case, the
above naive algorithm could be improved with Amarilli et al.'s
\cite{amarilli2019} data structure that allows updates of the underlying tree.
Let $m$ be the size of a given question, i.e. $|W_1 \cup ... \cup W_k|$.
Amarilli's result allows us to preprocess the tree in linear time, label it with
a choice of $W_1, \ldots, W_k$ in time $O(m \log n)$, then answer model checking
in constant time, thus greatly improving over the naive algorithm for questions
that talk about a small number of vertices compared to the size of the whole
tree.

\section{Main result}

We will improve on the above method even further showing an algorithm that,
after linear preprocessing of the tree we can answer questions in time
linearithmic with respect to the question size. In particular, we eliminate any
dependence on tree size during question answering.

More explicitly, we will show how to solve Problem
\ref{mso-query-answering-problem}, with a preprocessing algorithm working in
time $O(n)$ (where $n$ is the size of the input tree), and then answer questions
in time $O(m \log m)$ (where $m$ is the size of the question):

\queryproblem[%
    an MSO formula $\varphi(\vec{X})$ over trees with $k$ free
    second-order variables.
]{%
    \label{mso-query-answering-problem} MSO Query Answering on Trees
}{%
    a tree $T$.
}{%
    given a $k$-tuple of subsets of $T$'s vertices $\vec{W}
    \in \mathcal{P}(V(T))^{k}$, is $\vec{W}$ a satisfying assignment to
    $\vec{X}$? In other words, does $T \models \varphi(\vec{W})$?
}

Note that though we speak of MSO formulae with second-order variables only,
first-order variables are supported by simply restricting to questions in which
sets assigned to single-order variables are singletons. Also note, that for
formulae with first-order free variables only, question answering happens in
constant time.

Through a series of reductions, we will show that MSO query answering reduces to
the following problem about tree automata, which we solve in Chapter
\ref{mso-query-answering}:

\queryproblem[%
    a deterministic bottom-up tree automaton $A$ over ranked alphabet $\Sigma$.
]{%
    \label{relabel-regular-queries} Relabel Regular Queries on Trees
}{%
    a tree $T$ labeled with $\Sigma$.
}{%
    given $m$ relabelings $v_1 \mapsto a_1, \ldots, v_m \mapsto a_m$, where
    $v_i$ are vertices of $T$ and $a_i$ are elements of $\Sigma$, what state
    does $A$ arrive  at in the root of $T'$, where $T'$ is $T$ with each $v_i$'s
    label modified to the corresponding $a_i$.
}

\section{Related Work}

A different generalization of model checking to the case of formulae with free
variables is \definedterm{query enumeration}. In query enumeration, as in query
evaluation, we are interested in outputing all the valuations satisfying a
formula. However, instead of considering the global time of the algorithm to
output every single result, we are interested in enumerating the valuations one
by one with as small a delay between individual answers as possible. In other
words, we split the entire algorithm into a preprocessing algorithm which builds
an indexing structure, and an enumeration algorithm that, given the indexing
structure, outputs the next answer and prepares the indexing structure for the
next step.

The complexity class \constantdelaylin\ refers to those enumeration
problems in which the preprocessing step takes linear time and the delay between
each successive answer (i.e. the time complexity of the second algorithm) is
constant. \lineardelaylin\ is a generalization of \constantdelaylin\ where the
time complexity of the second algorithm is linear in the size of the answer
being output. \textcite{bagan2006} introduced this second notion and showed that
enumerating MSO queries is \lineardelaylin\ on trees. This in particular means
that enumerating MSO queries for formulae whose free variables are all first
order is in \constantdelaylin, and \textcite{kazana2013} revisit this result,
proving it using deterministic factorization forests due to \textcite{colcombet}
rather than Bagan's intricate indexing structure.

Colcombet's factorization could likely be used to solve the problem of branch
infix regular queries, which we introduce as a subproblem of MSO query answering
in Chapter \ref{branchinfix}. However, Colcombet's result depends on fairly deep
and complex applications of semigroup theory. We present a much more
straightforward algorithmic approach.

\section{Organization}

The rest of our work is organized in the following way:

\begin{itemize}
    \item In Chapter \ref{preliminaries} we introduce definitions and
        known algorithms we will use to arrive at our main result.
        \begin{itemize}
            \item In particular, Section \ref{query-answering-problems} serves
                as a short review of some fundamental algorithms for solving
                query answering problems related to trees and regular languages.
        \end{itemize}
    \item In Chapter \ref{branchinfix} we solve an important subproblem, which
        is a generalization of a well known algorithm about words to the tree
        case.
    \item In Chapter \ref{mso-query-answering} we arrive at our main result by
        reducing Problem \ref{mso-query-answering-problem} to Problem
        \ref{relabel-regular-queries} and solving it.
        \begin{itemize}
            \item Of note, in Section \ref{computing-closure} we show how to
                compute the LCA closure of a set of $m$ tree vertices in time
                $O(m \log m)$.
        \end{itemize}
\end{itemize}

\chapter{Preliminaries}\label{preliminaries}

\section{Definitions}
\subsection{Trees}

We work with finite trees whose vertices are labeled with letters from a finite
alphabet. More formally, given a finite alphabet $\Sigma$, for each $a \in
\Sigma$, $a$ is a tree, and if $t_1, \ldots, t_k$ are trees, then
$a(t_1, \ldots, t_k)$ is also a tree.

We use the standard notions of root, child, sibling, ancestor, descendant, etc.

Binary trees are trees where each node has either no children (the node is a
leaf), or exactly two children (which, based on their order, can be called the
left and the right child).

\subsubsection{Tree traversals}

The \definedterm{post-order} of $V(T)$ is an ordering of $T$'s vertices produced
by the following recursive procedure:

\begin{enumerate}
    \item First traverse the root's subtrees.
    \item Visit the root.
\end{enumerate}

Similarly, the \definedterm{pre-order} of $V(T)$ is produced by the following
recursive procedure:

\begin{enumerate}
    \item First visit the root.
    \item Traverse the root's subtrees.
\end{enumerate}

Finally, the following procedure produces the \definedterm{in-order} of $V(T)$:

\begin{enumerate}
    \item First traverse the left subtree.
    \item Visit the root.
    \item Traverse the right subtree.
\end{enumerate}

\subsection{Tree automata}

Consider the case of binary trees labeled with $\Sigma$. A
\definedterm{deterministic, bottom-up tree automaton} (further called just a
\definedterm{tree automaton}) consists of

\begin{itemize}
    \item A finite set of \definedterm{states} $Q$.
    \item A set of \definedterm{accepting states} $F \subseteq Q$.
    \item A bottom-up \definedterm{transition function} $\delta : Q \times
        \Sigma \times Q \to Q$.
    \item An \definedterm{initializatoin function} $\iota : \Sigma \to Q$.
\end{itemize}

A \definedterm{run} of tree automaton $A$ over tree $T$ is a relabeling of $T$
with elements of $Q$ such that

\begin{itemize}
    \item Each leaf with label $a \in \Sigma$ is relabeled with $\iota(a)$.
    \item If an inner node $v$ has label $a \in \Sigma$, its left child got
        relabeled to $p \in Q$, and its right child got relabeled to $q \in Q$,
        then $v$ gets relabeled with $\delta(p, a, q)$.
\end{itemize}

A run is \definedterm{accepting} if $T$'s root gets relabeled to an accepting
state, that is a state $q \in F$. The set of all trees accepted by an automaton
$A$ is called the \definedterm{language recognized by $A$}, notated $L(A)$. We
call the class of all languages recognized by tree automata \definedterm{regular
tree languages}, analogously to regular languages recognized by finite state
automata.

We note that the expressive power of deterministic bottom-up tree automata is
the same as that of nondeterministic (either bottom-up or top-down) tree
automata.

\subsection{Monadic Second Order (MSO) Logic}
% def of MSO

From a logic point of view, the trees we work with can be seen as structures
over a signature with a single binary relation $E$ and unary relations $U_a$ for
each $a \in \Sigma$. $E(v, w)$ represents a (directed from parent to child) edge
from $v$ to $w$. $U_a(v)$ signifies that $v$'s label is $a$.

In the case of binary trees, the relation $E$ is replaced with two binary
relations $E_l$ and $E_r$, which represent an edge from parent to left child,
and from parent to right child, respectively.

For convenience, we will also make use of the binary relation $\leq$, with $v
\leq w$ signifying that $v$ is an ancestor of $w$ (with every vertex being an
ancestor of itself; $<$ can be used to signify a strict ancestor). Note that in
the case of MSO on trees, $\leq$ can be defined using just the edge relation
$E$.

We make use of a fundamental theorem tying MSO logic on trees and tree automata:

\begin{theorem}\label{mso-to-automaton}
    For every MSO formula $\varphi$ over trees, there exists a tree automaton
    $A$ such that for every tree $T$, $T \models \varphi$ if, and only if $T \in
    L(A)$.
\end{theorem}

We note that the converse of this theorem is also true (i.e. that for every tree
automaton, there is a corresponding MSO formula), however we will use only the
MSO to automata direction in this work.

\section{Query answering problems}\label{query-answering-problems}

Consider a computational problem whose inputs are of the form $(S, q) \in
\mathcal{S} \times \mathcal{Q}$ for some set of \definedterm{structures}
$\mathcal{S}$ and some set of \definedterm{queries} $\mathcal{Q}$. This induces
a \definedterm{query answering problem} which is divided into two phases:

\begin{description}
    \item[preprocessing] An input structure $S \in \mathcal{S}$ is given and a
        \definedterm{preprocessing algorithm} outputs a data structure
        $S'$.
    \item[queries] $S'$ is used to handle queries $q_1, \ldots$ from
        $\mathcal{Q}$.
\end{description}

We are interested in the time complexities of both phases. We use the following
notation for algorithms that have both a preprocessing and query phase: If
it takes $f(n)$ time to complete the preprocessing step for an input of size
$n$, and $g(n, m)$ time to then handle a query of size $m$, we say the algorithm
has time complexity \qptime{$f(n)$}{$g(n, m)$}.

We turn to a discussion of several query problems with known solutions, both to
serve as examples and because we will be using them in our algorithm.

\subsection{Lowest Common Ancestor}

\queryproblem{%
    Lowest Common Ancestor (LCA) Queries
}{%
    a tree $T$.
}{%
    given vertices $x$ and $y$, find the vertex $z$ that's an ancestor of both
    $x$ and $y$, and is their lowest (i.e. furthest from the root) common
    ancestor.
}

\textcite{tarjan1984} were the first to show an optimal \qpoptimal{}
algorithm for LCA. \textcite{schieber1988} used a similar approach but
simplified the indexing structure, keeping the same time complexities.
\textcite{berkman1993} showed a completely new approach to the problem, which
relies on answering range minimum queries (see below) about an array of the
tree's vertices arranged in a specific order. \textcite{bender2000} offer a
simpler presentation of the algorithm in \cite{berkman1993} and note the
equivalence between the LCA and RMQ problems.

\subsection{Range Minimum Query (RMQ)}

\queryproblem{Range Minimum Queries}{%
    an array $A$ of integers.
}{%
    given indices $i$ and $j$, return the index of the smallest
    element in the subarray $A[i, j]$.
}

\textcite{bender2000} show an \qpoptimal{} algorithm for the RMQ problem.

Their method is as follows. First they show that a special case of the RMQ
problem, $\pm1$ RMQ, can be solved in \qpoptimal. This restriction of the problem
is enough to handle LCA queries. Then, for the general RMQ case, a Cartesian
tree\footnote{A Cartesian tree of an array is a binary tree with the array's
minimum element in its root, the root's children being Cartesian trees of the
left and right subarrays around the minimal element. It can be constructed in
linear time.} of the array is built, and LCA queries on this tree correspond to
range minimum queries on the array.

\subsection{Word infix regular queries}\label{wordinfix}

We now turn to a problem about regular languages. Note that the regular language
is fixed, i.e. we treat the size of its representation (e.g. the size of an
automaton representing it) as a parameter. Below we take a deterministic
automaton for convenience, but if we instead start with a nondeterministic
automaton we ``don't care'' about the exponential cost of determinizing it.

\queryproblem[%
    regular language $L$ over alphabet $\Sigma$, given by DFA $A$.
]{Word Infix Regular Queries}{%
    a word $w \in \Sigma^*$.
}{%
    given indices $i$ and $j$ with $1 \leq i < j \leq |w|$, does the infix $w[i,
    j]$ belong to $L$?
}

This problem has a very elegant \qpoptimal{} solution, as presented by
\textcite{bojanczyk}. We present the full construction as we will be
generalizing its internals for the tree case in Chapter \ref{branchinfix}.

\textbf{preprocessing} We begin by replacing each letter of $w$ with a copy of
the set of states of $A$, $Q$.  More precisely, we will be working with a graph
whose vertex set is $\{1, \ldots, |w|\} \times Q$. We will refer to a vertex in
the $i$th copy of $Q$ labeled with state $q$ as $i.q$.

Since $A$ is deterministic, each letter $a \in \Sigma$ can be seen as a function
$a : Q \to Q$, and we draw these functions as directed edges between successive
copies of $Q$. For example, suppose the $i$th letter of $w$ is $a$. If $A$ in
state $q$, reading $a$, would move to state $q'$, then there will be an edge
from $i.q$ to $(i+1).q'$.

Note that by determinism, each vertex has exactly one outgoing edge (except for
vertices in the last copy of $Q$ which have no outgoing edges).

Now we will color the vertices of the graph we just constructed with the colors
$1, 2, \ldots, |Q|$ in such a way that

\begin{enumerate}
    \item every copy of $Q$ has one vertex of each of the $|Q|$ colors;
    \item when $i.q$ has color $c$ and there is an edge to $(i+1).q'$, which is
        colored with color $c'$, then $c \geq c'$.
\end{enumerate}

The second point basically means that we will be trying to draw single-color
paths, but when paths need to join, it is the higher-colored path that gets cut
off (think of the colors as priorities, with lower numbers representing more
important priorities).

The construction is as follows:

\begin{enumerate}
    \item Color an arbitrary vertex of the first copy of $Q$ with the color $1$.
    \item Follow the deterministic edges to the end of the word, coloring all
        vertices along this path with $1$.
    \item Now color another uncolored vertex of the first copy of $Q$ with the
        color $2$.
    \item Try following edges as far as possible, coloring all vertices with
        $2$.
    \item If your run into an already colored vertex, pick an arbitrary
        uncolored vertex in this copy of $Q$ to color with $2$ and continue from
        here.
    \item Repeat steps 3.-5. for each successive color up to $|Q|$.
\end{enumerate}

Additionally, for each vertex $i.q$, we store the index of the next copy of $Q$ in
which the path of this vertex's color is broken by a lower color. Put this
information in table $\breaktable$. For example, if $i.q$ is colored with color
$c$, and when following the deterministic path from $i.q$, all encountered
vertices are colored $c$ until $j.q'$ which is colored $c'$, then we set
$\breaktable[i.q] = j - 1$.

We can compute $\breaktable$ in linear time with a single backwards pass through
the colored graph.

\textbf{answering queries} Consider a query ``does $w[i, j] \in L$''. We claim that using our colored graph
and the table $\breaktable$ we can, in constant time, conclude in what state the
automaton $A$ will end up in on the $j$th position of $w$ if it had started in
its initial state on the $i$th position.

First, look at vertex $i.q_0$, which is the vertex of $A$'s initial state in the
$i$th copy of $Q$ and note its color $c$. Now we want to answer the following
question: if we follow the edges of the graph until the $j$th copy of $Q$, what
color will we end in? First, look at $k := \breaktable[i.q_0]$. If $k \geq j$,
then we know that in the $j$th copy of $Q$, the path we're interested in still
has color $c$. Find the unique vertex $j.q$ in this copy of $Q$ that's colored
with $c$. $q$ is the state we will end up in, so if it is an accepting state
answer YES, if not, answer NO.

If $k < j$, then jump to the $k$th copy of $Q$ and consider the vertex $k.q$,
the unique vertex here colored $c$. From $k.q$, follow its single outgoing edge
to $(k+1).q'$, which will be colored with color $c' < c$. Continue as we did
before, by looking at $k' := \breaktable[(k+1).q']$, comparing it to $j$, and either
halting if $k' \geq j$, or jumping again otherwise. Because with each jump we
move to a color strictly smaller than before, the number of jumps is bounded by
the number of colors, $|Q|$. Thus the query is answered in time constant with
respect to $|w|$.

\chapter{Branch Infix Regular Queries}\label{branchinfix}

Before solving our main problem, that of MSO queries on trees, we generalize
word infix regular queries (section \ref{wordinfix}) to trees. This will be a
vital step in the MSO queries algorithm, but is an interesting result on its
own.

The query problem we will solve is:

\queryproblem[%
    regular language $L$ over alphabet $\Sigma$.
]{Branch Infix Regular Queries}{%
    a $\Sigma$-labeled tree $T$.
}{%
    given a vertex $x$ and its descendant $y$, does the word given by labels on
    the path from $x$ to $y$ belong to $L$?
}

We begin with a similar construction as in the word case, i.e. we replace each
vertex of the tree with a copy of the states of $A$. Our new graph has vertex
set $V(T) \times Q$, and we can refer to a specific vertex as $x.q$ for $x \in
V(T)$, $q \in Q$.

Again, each letter $a \in \Sigma$ defines a function $a : Q \to Q$, and these
functions induce edges in our ``fattened'' tree: consider a vertex $x$ of $T$,
labeled with letter $a$ and having children $y$ and $z$. If $A$ in state $q$,
reading letter $a$ goes to $q'$, then there will be edges from $x.q$ to $y.q'$
and $z.q'$.

We also color all vertices with $1, \ldots, |Q|$ in this graph, analogously to
how we did so in the word case: begin by coloring an arbitrary vertex in the
root with $1$, then follow edges downwards, coloring all visited vertices with
$1$. Then begin the same process with the next color, restarting with a
different vertex in a given copy of $Q$ if we run into an already colored
vertex.

This process will again lead to a coloring with the desired properties that

\begin{enumerate}
    \item every copy of $Q$ has one vertex of each of the $|Q|$ colors;
    \item when a vertex of color $c$ has an edge to a vertex of color $c'$ in a
        child copy of $Q$, then $c \geq c'$.
\end{enumerate}

Now let's consider how we could answer a query ``is the branch infix from $x$
down to $y$ in $L$?''. Here we can't proceed exactly as in the word case. We
don't have a $\breaktable$ table since a vertex in an internal node can have
arbitrarily many points below it where its color is broken by a lower one.
Somehow we need to be able to find such a break point that is ``in the direction
of $y$ from $x$''.

In the next section we formulate this question formally and solve it as a
subproblem.

\section{Highest Marked Descendant on Path}

We will show that the branch infix problem reduces to the following:

\queryproblem{Highest Marked Descendant on Path Queries}{%
    a tree $T$ with set $M \subseteq V(T)$ of marked vertices.
}{%
    given a vertex $x$, its descendant $y$, find the node $z \in M$ that
    is the highest marked node on the path between $x$ and $y$, if such $z$
    exists.
}

We will build an index structure, constructible in linear time, that allows us
to handle such queries in constant time. The structure is heavily inspired by
\textcite{bender2000}, where a simple algorithm for computing LCA queries is
presented.

First, we create the array $\posttable$ of length $n$, which is the post-order of
the nodes.

Next, we label each node of the tree with its pre-order number. We create the
array $\prefixtable$ with the corresponding pre-order labels of the nodes in
$\posttable$, i.e.  if $\posttable[i] = v$, then $\prefixtable[i]$ is $v$'s
pre-order number.

Finally, for each node of the tree, we record its index in $\posttable$ in the
array $\indextable$.

\begin{observation}
    Given a node $x$ and its descendant $y$, looking at the range
    $\prefixtable[\indextable[y], \indextable[x] - 1]$:
    \begin{enumerate}
        \item All the values in this range correspond to descendants of $x$.
            \label{obs-descendants}
        \item The values smaller than $\prefixtable[\indextable[y]]$ correspond
            to ancestors of $y$. \label{obs-ancestors}
        \item In particular, the minimum of this range corresponds to the
            highest ancestor of $y$ that is also a descendant of $x$.
            \label{obs-conclusion}
    \end{enumerate}
\end{observation}

The pre-order labels in $\prefixtable$ are ordered according to a post-order. In
a post-order, the root of a subtree is visited directly after all its
descendants. Since $y$ is a descendant of $x$, all values between its position
in $\prefixtable$ and $x$'s position there will also be descendants of $x$,
giving item \ref{obs-descendants}.

The values in $\prefixtable$ are pre-order numbers. In a pre-order, for a given
node $z$, the only nodes that will have lower pre-order numbers are $z$'s
ancestors, and the children of these ancestors that are to the left of the child
towards $z$. By starting our range at index $\indextable[y]$, we've skipped over
all of the second types of vertices with lower pre-order numbers, leaving only
$y$'s ancestors as nodes in the range with a lower pre-order number. This gives
us item \ref{obs-ancestors}.

Items \ref{obs-descendants} and \ref{obs-ancestors} together give us item
\ref{obs-conclusion}.

In our problem, we only care about ancestors of $y$ that are marked. So
we perform one final modification of our data structure: for all non-marked
vertices $v$, we change $\prefixtable[\indextable[v]]$ to $\infty$ (which can be
represented by $n+1$, an integer greater than any node's pre-order label). With
$\prefixtable$ modified like this, we observe that now the minimum value of
$\prefixtable[\indextable[y], \indextable[x] - 1]$ corresponds exactly to the
answer of our queries -- the highest marked node between $x$ and $y$.

We preprocess $\prefixtable$ for RMQ queries in linear time.

Now when given a query $x$, $y$, we:

\begin{enumerate}
    \item Lookup $i := \indextable[y]$ and $j := \indextable[x] - 1$.
    \item Perform an RMQ query on $\prefixtable[i, j]$, giving us index $k$ of
        the minimal value in that range. \label{hmd-algo-lookup}
    \item Lookup the corresponding vertex as $z := \posttable[k]$. This is the
        answer to our query.
\end{enumerate}

We can detect the siutation of there not being a marked node between $x$ and $y$
during step \ref{hmd-algo-lookup} by checking if $\prefixtable[k]$ is smaller
than the pre-order label of $y$. If it isn't, then all the nodes on the path
from $x$ to $y$ were unmarked (had higher values in the modified $\prefixtable$
since values at indices corresponding to unmarked nodes are set to $\infty$),
and $k$ will correspond to a marked node that's not on the path from $x$ to $y$
(or some unmarked vertex if there were no marked vertices at all in the range we
check).

\section{Solving branch infix regular queries}

With the above problem solved, we are ready to finish our algorithm for the
branch infix regular query problem. Recall that we have replaced the input tree
$T$'s vertices with copies of $Q$, then connected and colored them in the same
way as in the word case. We additionally remember in each vertex a unique
identifier of the connected component of vertices of the same color that it is
in (call this data $\componenttable[v]$).

For each color $c$, we will preprocess the above graph for queries that answer
the question ``if we start in a vertex $v$ colored with $c$, which is the first
copy of $Q$ on the path towards the copy of $Q$ that contains $y$?'' by creating
a copy of $T$ in which we mark the vertices where an edge from a $c$ colored
vertex points to a vertex with a lower color. This tree we preprocess for
highest marked descendant on path queries.

Now query answering can proceed similar to the word case. Given a query ``does
the word on vertices from $x$ down to $y$ belong to $L$?'' look at the vertex
$x.q_0$ and consider its color $c$. In the copy of $Q$ corresponding to $y$
consider the vertex $y.q$, the unique vertex here of color $c$. If it is in the
same connected component as $x.q_0$ (which we check by comparing
$\componenttable[x.q_0]$ with $\componenttable[y.q]$), we can immediately answer
whether or not $A$ will accept the word on this path based on whether this state
is accepting. If that is the case then we have not changed connected components,
thus there is a single-color path from $x.q_0$ down to $y.q$ and $q$ is indeed
the state $A$ would have ended in after running on the word given by the labels
from $x$ down to $y$.

Otherwise, we need to jump down to a lower copy of $Q$. We can find the first
such copy where the color on our path changes from $c$ to something lower by
issuing a highest marked descendant on path query on the marked tree we created
for color $c$. The answer to this query exactly corresponds to the point where
the path of color $c$ from our initial state merges into a lower color $c'$ on
the path towards $y$. From here we continue as at the beginning, checking
$\componenttable$ to see if we can answer immediately, or taking another jump
down. Handling each jump takes constant time, and the number of jumps is bounded
by $|Q|$ since we can only jump to a lower color. Thus we can answer branch
infix regular queries in constant time.

\chapter{MSO Query Answering on Trees}\label{mso-query-answering}

We now have all the pieces necessary to present our \qptime{$O(n)$}{$O(m \log
m)$} algorithm for MSO query answering on trees. We fix an MSO formula
$\varphi(\vec{X})$ and take a tree $T$ of size $n$. We proceed with several
reductions so that we end up with needing to solve the relabel regular queries
problem, which we solve in the remainder of this chapter.

First of all, if $T$ is not binary, we can use a standard encoding to encode it
inside of a binary tree $T'$, modifying $\varphi$ to $\varphi'$, such that $T
\models \varphi(\vec{W})$ if and only if $T' \models \varphi'(\vec{W})$.

Now, to use Theorem \ref{mso-to-automaton}, we turn the formula into an MSO
sentence without free variables by adding a unary relation $U_X$ for each
variable $X \in \vec{X}$ and instead of considering the formula $\varphi'$, we
now consider the sentence

\begin{equation*}
    \varphi'' = \exists_{X_1} \ldots \exists_{X_k}
        (\forall{}_x x \in X_i \iff U_{X_i}(x)) \land
        \varphi'(X_1, \ldots, X_k).
\end{equation*}

Selecting a model in which the vertices in set $W$ are exactly those colored
with unary relation $U_X$ is equivalent to a valuation of $\vec{X}$ in which the
variable $X$ is set to $W$.

Now by Theorem \ref{mso-to-automaton}, there is a tree automaton $A$ for binary
trees over the alphabet $\{ 0, 1 \}^{\vec{X}}$ which accepts a tree if and only
if its labeling corresponds to a model satisfying $\varphi''$. Here a label
$a_1 \ldots a_k$ (where $k = |\vec{X}|$ and each $a_i \in \{ 0, 1 \}$) should be
interpreted as a bit vector signifying which unary relations $U_X$ a vertex is
assigned to.

By combining the above reductions, we can solve MSO queries on trees by solving
relabel regular queries on trees. Indeed, fix a formula $\varphi(\vec{X})$ and take
a tree $T$. Derive automaton $A$ and tree $T'$ as above. Label each vertex of
$T'$ with $0\ldots0$ (signifying that none of its vertices have yet been
assigned to any of the variables in $\vec{X}$). Preprocess $A$ and $T'$ for
relabel regular queries. Now a relabeling of $T'$'s vertices corresponds to
selecting a specific valuation $\vec{W}$ of $\vec{X}$, thus answering whether or
not $A$ accepts the relabeled tree is equivalent to answering whether or not $T
\models \varphi(\vec{W})$.

\section{Relabel Regular Queries on Trees}

Fix a tree automaton $A$ and take a $\Sigma$-labeled binary tree $T$.

We'll preprocess the tree in such a way that when given a relabel query, we'll
be able to partition the tree into linearly (with respect to the query's size)
many parts, then, in a bottom-up fashion, compute the state in the root of each
part in the relabeled tree. The computation for each part will take constant
time.

\subsection{LCA partition}

Consider a set of tree vertices $W$. We'll say that $W$ is \definedterm{LCA
closed} if for every pair of vertices $v, w \in W$, it is the case that
$\mathlca(v, w) \in W$ (note that one of $v$ or $w$ might be their LCA). We also
require that the tree's root is in $W$ (just to ensure the partition we define
next covers the whole tree).

The \definedterm{subtree of $v$} is $v$ along with all its descendants.
We also consider subtrees with holes: the \definedterm{subtree of $v$ with hole
$w$} is the subtree of $v$ minus the subtree of $w$ (for $w$ a descendant of
$v$).

For an LCA closed set $W$ in tree $T$, we define the \definedterm{LCA partition
with respect to $W$} to be a partition of $T$'s vertices into subtrees, subtrees
with holes, and individual vertices created according to the following rules:

\begin{enumerate}
    \item If $v \in W$ is a maximal in $W$ with respect to the ancestor
        relation $<$ (i.e. there are no other elements of $W$ in the subtree of
        $v$), then the subtree of $v$ is a part of the partition.
    \item If $v \in W$ has descendants that are in $W$ only in its left subtree,
        let $w$ be the highest such descendant (there is a unique highest
        descendant because $W$ is LCA closed). The subtree of $v$ with hole $w$
        is a part of the partition.
    \item Symmetrically if $v \in W$ only has descendants that are in $W$ in its
        right subtree.
    \item If $v \in W$ has descendants that are in $W$ in both its subtrees, let
        $v_l$ and $v_r$ be $v$'s left and right child, respectively. Treat $v_l$
        and $v_r$ as if they were in $W$ for the purposes of defining the
        partition. Put $v$ into a singleton part on its own.
\end{enumerate}

Note that in the last rule, both $v_l$ and $v_r$ will either themselves be
elements of $W$ or will have descendants in $W$ in only one of their subtrees.
In the first case, we will have already handled them. Otherwise, we will handle
them with a rule different than the last one, thus only one part will be added
to the partition for each of them. In particular, each element of $W$ will
contribute only a constant number of parts to the partition (one part if handled
with one of the first three rules, three parts if handled with the last one).

To see that indeed, for example $v_l$, won't need to be handled with rule 4, if
it wasn't in $W$ but had descendants from $W$ in both its subtrees, then $W$
wouldn't have been LCA closed.

Thus the partition covers the entire tree with $O(|W|)$ parts.

If $W$ is not LCA closed, we will call the LCA partition with respect to $W$
simply the partition with respect to $W$'s LCA closure.

\subsection{Computing the LCA closure}\label{computing-closure}

The LCA closure of a set of $m$ vertices $W$ can be computed in time $O(m \log
m)$ after linear preprocessing of the tree $T$. In the preprocessing step, in
addition to preprocessing for LCA queries, we will assign each vertex $v$ its
in-order number, $\infixtable[v]$.

Now to compute the closure of $W$, we first sort the vertices in $W$ with
respect to their in-order numbers, so we end up with a list $v_1, \ldots, v_m$
of vertices such that $\infixtable[v_1] < \ldots < \infixtable[v_m]$.

\begin{lemma}
    if $\infixtable[u] < \infixtable[v] < \infixtable[w]$, then $\mathlca(u, w)$
    is equal to either $\mathlca(u, v)$ or $\mathlca(v, w)$.
\end{lemma}

This may be proved by a simple case analysis.

Now, once $W$ has been sorted by in-order numbers, we can complete our
computation in time $O(m)$. It is enough to walk through the list and issue LCA
queries about successive pairs. By the lemma, the resulting set will indeed
contain LCAs for each pair in the set. Note also that for each successive pair
we will have added at most one new vertex to the closure, thus the size of the
closure of $W$ is linear in $m$.

\section{Computing root states of parts}

Let's now discuss how we will use LCA paritions to solve the relabel query
problem. Our approach is the following: given a query $v_1 \mapsto a_1, \ldots,
v_m \mapsto a_m$, let $W = \{ v_1, \ldots, v_m \}$. Consider the partition of
$T$ with respect to $W$. We will show how to compute $A$'s state in the root of
each part of the partition, after the tree had been relabeled. Per our
definition of the partition, we have three types of parts to consider:

\begin{enumerate}
    \item A subtree rooted at vertex $v$.
    \item A singleton vertex $v$.
    \item A subtree rooted at vertex $v$ with hole $w$.
\end{enumerate}

The first case is trivial to compute: no descendants of $v$ had been relabeled,
so the states of the children of $v$ are the same as in $A$'s run on the
original tree $T$. We look them up in the precomputed run, then apply $A$'s
transition to those states and $v$'s new label.

We will compute the states of part roots in a bottom-up fashion, so the second
case is also simple: once we've computed the states in the children of $v$, we
again simply apply $A$'s transition function to those states and $v$'s new
label.

The third is the interesting case, that of a subtree with a hole. Consider the
path from the hole $w$ up to the root $v$. What we claim is that the question
``if $A$'s state in $w$ is $q$, is the state in $v$ going to be $q'$?'' can be
answered with a branch infix regular query.

This is the case because a finite automaton can compute $A$'s states along the
path from $w$ to $v$ when given access to each of the node's label, state of its
children in the run of $A$ on the original labeling of $T$, and the information
whether this node is a left or right child of its parent. Thus the language
$L_{q, q'}$ over alphabet $Q \times \Sigma \times Q \times \{ left, right \}$ of
proper semgents of a run of $A$ that start in state $q$ and end in state $q'$ is
a regular language. Our algorithm for branch infix regular queries dealt with
top-down queries, but regular languages are reversible. So we preprocess $T$ for
branch infix regular queries for each language $L_{q, q'}^R$.

Now to compute $v$'s state after the relabeling of $T$, given we've already
computed $w$'s new state $q$, take $v'$ -- $v$'s child in the direction of $w$,
find $q' \in Q$ such that the path from $v'$ to $w$ belongs to $L_{q, q'}^R$
(using branch infix regular queries; since $A$ is deterministic, there will be
exactly one such $q'$), then from $q'$, the state of $v$'s other child, and
$v$'s new label, use $A$'s transition function to compute $v$'s new state.

Since $T$'s root is the root of one of the parts of our LCA partition, in the
end we will know whether or not $A$ accepts $T$ after relabeling. Handling each
part takes constant time, and there are $O(m)$ parts to handle. We did require
$O(m \log m)$ time to sort the query's vertices when computing their LCA
closure, thus query handling takes time $O(m \log m)$.

\printbibliography

\end{document}