\section{The Case Marking Regime}
%
%
%
\subsection{Group Marking}
%
%
%
Now we shall move to case marking regimes. Again, we assume that
a finite signature $\auf F, \Omega\zu$ is given. Now, let
$\mu := \mbox{\it max\/} \{\Omega(f) : f \in F\}$. For each
$i < \mu$ we assume to have a case marker $\mbox{\tt c}_i$.
(Often, we shall also write {\tt 0} in place of $\mbox{\tt c}_0$,
{\tt 1} in place of $\mbox{\tt c}_1$, and so on.)
Of course, we shall assume that all case markers are distinct
and that $\mbox{\tt c}_i \not\in F$ for all $i < \mu$.
The case markers can be thought of as unary function symbols.
Therefore, we may think of the language as a language over
an enriched signature, which we denote by $\Omega^{\gamma}$.
Define the {\it right peripheral group marking language\/}
as follows. The map $^r$ is defined by
%%
\begin{enumerate}
\item $t^r := t$, if $t \in F$ and $\Omega(t) = 0$.
\item $(f(t_0, \ldots, t_{\Omega(f)-1}))^r :=
t_0^r \mbox{\tt c}_0 t_1^r \mbox{\tt c}_1
\ldots t_{\Omega(f)-1}^r \mbox{\tt c}_{\Omega(f)-1} f$,
$\Omega(f) > 0$.
\end{enumerate}
%%
Now, let $\vec{x} = t^r$, for some $t$. We understand the
notions of constituent, head and argument as defined in the
previous section. Now let $\vec{y}$ result from $\vec{x}$
by permuting within a constituent the arguments among each
other. Then $\vec{y}$ is said to {\it rpg--represent\/}
$t$. In particular, $\vec{x}$ rpg--represents $t$.
For example, the arithmetical term $t := \mbox{\tt -} %
(\mbox{\tt +}(\mbox{\tt a}, \mbox{\tt b}),\mbox{\tt +} %
(\mbox{\tt x}, \mbox{\tt y}))$ is translated by $^r$ into
%%
$$\mbox{\tt ac}_0\mbox{\tt bc}_1\mbox{\tt +c}_0%
\mbox{\tt xc}_0 \mbox{\tt yc}_1 \mbox{\tt +c}_1\mbox{\tt -}$$
%%
But also the following string rpg--represents $t$:
%%
$$\mbox{\tt xc}_0\mbox{\tt yc}_1 \mbox{\tt +c}_1%
\mbox{\tt bc}_1\mbox{\tt ac}_0 \mbox{\tt +c}_0\mbox{\tt -}$$
%%
\begin{defn}
Let $\Omega$ be a signature. Then $R_{\Omega}$ denotes the
set of all strings rpg--representing some $\Omega$-term.
\end{defn}
%%
\begin{thm}
$R_{\Omega}$ is context--free. It is transparent and therefore
deterministic.
\end{thm}
%%
The following is a grammar for that language:
%%
$$\begin{array}{l@{\quad \pf \quad}l}
\mbox{\tt S} & \mbox{\tt F}_0 \\
\mbox{\tt S} & \mbox{\tt Sc}_0 \mbox{\tt F}_1 \\
\mbox{\tt S} & \mbox{\tt Sc}_0\mbox{\tt Sc}_1 \mbox{\tt F}_2\\
\mbox{\tt S} & \mbox{\tt Sc}_1\mbox{\tt Sc}_0 \mbox{\tt F}_2\\
\mbox{\tt S} & \mbox{\tt Sc}_0\mbox{\tt Sc}_1 \mbox{\tt Sc}_2
\mbox{\tt F}_3 \\
\mbox{\tt S} & \mbox{\tt Sc}_0\mbox{\tt Sc}_2 \mbox{\tt Sc}_1
\mbox{\tt F}_3 \\
\mbox{\tt S} & \mbox{\tt Sc}_1\mbox{\tt Sc}_0 \mbox{\tt Sc}_2
\mbox{\tt F}_3 \\
\mbox{\tt S} & \mbox{\tt Sc}_1\mbox{\tt Sc}_2 \mbox{\tt Sc}_0
\mbox{\tt F}_3 \\
\mbox{\tt S} & \mbox{\tt Sc}_2\mbox{\tt Sc}_0 \mbox{\tt Sc}_1
\mbox{\tt F}_3 \\
\mbox{\tt S} & \mbox{\tt Sc}_2\mbox{\tt Sc}_1 \mbox{\tt Sc}_0
\mbox{\tt F}_3 \\
\multicolumn{2}{c}{\ldots}
\end{array}$$
%%
(The rules for expanding $\mbox{\tt F}_n$ are as before
in $G_{\Omega}$.) Call that grammar $H_{\Omega}$.
%%
\begin{thm}
\label{subrpg}
Let $\Omega$ be a signature. Then $H_{\Omega}$ is a
transparent context--free grammar generating $R_{\Omega}$.
Consequently, $R_{\Omega}$ is deterministic.
\end{thm}
%%
\proofbeg
Let $h$ be the following string--homomorphism.
$h(\mbox{\tt c}_i) := \varepsilon$, $h(f) = f$ for
all $f \in F$. It is easy to see that if $\vec{x}$
rpg--represents some term, then $h(\vec{x})$ also
represents some term (though not necessarily the same one).
Furthermore, given a parse tree of $\vec{x}$, every
constituent occurrence of $\vec{y}$ is mapped onto
a constituent occurrence of $h(\vec{y})$.
This is seen by showing that if $h$ is applied to a
derivation in $H_{\Omega}$, it yields a $G_{\Omega}$
derivation. This shows the claim on the basis of
Theorem~\ref{poltrans}.
\proofend
%%
We note the following corollary.
%%
\begin{cor}
Let $\vec{x}$ rpg--represent $s$ and $t$. Then $s = t$.
\end{cor}
%%
One can play different variations on that theme, allowing
some more freedom or placing some restrictions. However,
the following variant yields an ambiguous language. Suppose
that we allow the head to be placed anywhere between
its arguments. Suppose that in this case it takes its
case marker along. This type of regime is a group marking
regime, where only the head is marked, but is free in
its relative position. Then unary function symbols may
be either before their arguments or following them.
The following string is then ambiguous: {\tt f0a0g0h}. (Recall
that we write {\tt 0} in place of $\mbox{\tt c}_0$.)
It may represent either $\mbox{\tt h}(\mbox{\tt f}(\mbox{\tt %
g}(\mbox{\tt a})))$ or $\mbox{\tt h}(\mbox{\tt g}(\mbox{\tt %
f}(\mbox{\tt a})))$.
On the other hand, if we assume that the head is free
in its position but the case marker remains right
peripheral, then we once again get a transparent language.
In this language, the case markers mark the right edge
of the constituent. One can show, namely, that there are
no accidental minimal constituents, and the same arguments
go through once again. Let us call the language $\mbox{\it
RC}_{\Omega}$. It too is context--free. Here is a grammar
generating it:
%%
$$\begin{array}{l@{\quad \pf \quad}l}
\mbox{\tt S} & \mbox{\tt F}_0 \\
\mbox{\tt S} & \mbox{\tt Sc}_0 \mbox{\tt F}_1 \\
\mbox{\tt S} & \mbox{\tt F}_1 \mbox{\tt Sc}_0 \\
\mbox{\tt S} & \mbox{\tt Sc}_0\mbox{\tt Sc}_1 \mbox{\tt F}_2 \\
\mbox{\tt S} & \mbox{\tt Sc}_0\mbox{\tt F}_2 \mbox{\tt Sc}_1 \\
\mbox{\tt S} & \mbox{\tt F}_1 \mbox{\tt Sc}_0\mbox{\tt Sc}_1 \\
\mbox{\tt S} & \mbox{\tt Sc}_1\mbox{\tt Sc}_0 \mbox{\tt F}_2 \\
\mbox{\tt S} & \mbox{\tt Sc}_1\mbox{\tt F}_2\mbox{\tt Sc}_0 \\
\mbox{\tt S} & \mbox{\tt F}_2\mbox{\tt Sc}_1\mbox{\tt Sc}_0 \\
\multicolumn{2}{c}{\ldots}
\end{array}$$
%%
\begin{thm}
$\mbox{\it RC}_{\Omega}$ is a transparent, deterministic
context--free language.
\end{thm}
%
%
%
The same applies to the mirror image of this language, where
case markers are consistently left peripheral.
All these case marking languages (with the exception of the
language where the head is case marked) have the following
characteristics.
%%
\begin{enumerate}
\item
The case marker is consistently right (left) peripheral.
\item
All semantic constituents are continuous.
\end{enumerate}
%
%
%
To define the notion of a semantic constituent, we shall
go back to the definition of a string representing some
term. If $\vec{x}$ represents $t$, we find a correspondence
between subterm occurrences in $t$ and subparts of $\vec{x}$.
We shall not spell out the details here. Suffice it to say
that in this way each subterm occurrence corresponds to a
unique substring of $\vec{x}$. The definition can be made
independent of the actual grammar generating the language.
This is why we call this a semantic constituent. The idea
is to replace the subterm by a fixed symbol, say {\tt Z},
and to observe in what ways we must change $\vec{x}$ for it
to represent the new term.
The two conditions do constitute a positional regime. The
second condition restricts the word order in such a way that
whatever is a constituent must be a substring. The first
condition is a positional regime as soon as the case markers
are words rather than morphemes.
%%
\footnote{In formal language theory this distinction is hardly
made. We shall therefore introduce the following terminology.
There is a blank symbol $\square$, and given a string $\vec{x}$,
each maximal substring not containing $\square$ is called a
{\it word\/} of $\vec{x}$. Words may be composed from smaller
units by simple concatenation, while a word is a string with
a right peripheral boundary marker (see Section~\ref{sec6}).
Regimes in our sense are defined over words. We may therefore
call them {\it syntactic regimes\/} to distinguish them from
{\it morphological regimes}. In any language we know of,
there is a morphological regime. We will turn to this issue
in Section~\ref{sec6}.}
%%
In any case, however, there is positional regime. It can be observed
that many languages do have such a regime. NPs tends to be
continuous in virtually all languages we know of, and deviations
from this do occur but are severely restricted. For example,
Kayardild has a case marking regime that allows discontinuous
constituents without creating ambiguities. Yet,
\cite{evans:kayardild}, page 249 states that `NP-splitting
obeys precise rules and has a clear semantic rationale. It
always a single modifier being split off; split NPs always
straddle a verb.'
%
%
%
\subsection{Word Marking}
%
%
%
Word marking brings us to the last type of string representations,
which will define a language that needs no positional regime.
Let $^w$ differ from $^r$ in that the case markers are not
distributed at the end of each phrase, but at each individual
word. Then what we get is the ideal model for a group marking
language. Unfortunately, this language is not definable by
means of a string homomorphism (this can be shown: otherwise
the language would be semilinear which it isn't). So, we must
choose a different approach. We shall first define a {\it set\/}
representing a term and on the basis of that set we define the
strings. The definitions are based on the notion of a {\it tree
domain}. A {\it tree domain\/} is a subset $T$ of $\BN^{\ast}$
such that the following holds.
%%
\begin{enumerate}
\item $\varepsilon \in T$.
\item If $\sigma\sigma' \in T$ then $\sigma \in T$.
\item If $\sigma i \in T$ and $j < i$ then also
$\sigma j \in T$.
\end{enumerate}
%%
Tree domains are useful because they allow us to write down
a tree using a set of sequences of natural numbers. We can
define the relations $<$ (less than) and $\sqsubset$ (to the
left of) are definable as follows: $\tau < \tau'$ iff
$\tau'$ is a proper prefix of $\tau$, and $\tau \sqsubset
\tau'$ iff there exist $\sigma$, $i,j$ such that
$\tau < \sigma{^{\smallfrown}i}$ and $\tau < \sigma{^{\smallfrown}j}$
and $i < j$. (Figure~\ref{treedomain} shows the tree domain
$\{\varepsilon, 0, 00, 01, 1, 2, 20, 21, 22\}$.)
%%
\begin{figure}
\caption{A Tree Domain}
\label{treedomain}
\begin{center}
\begin{picture}(14,10)
\put(1,1){\makebox(0,0){$00$}}
\put(1,1.5){\line(1,1){3}}
\put(4,1){\makebox(0,0){$01$}}
\put(4,1.5){\line(0,1){3}}
\put(4,5){\makebox(0,0){$0$}}
\put(4,5.5){\line(1,1){3}}
\put(7,9){\makebox(0,0){$\varepsilon$}}
\put(7,8.5){\line(0,-1){3}}
\put(7,5){\makebox(0,0){$1$}}
\put(7,8.5){\line(1,-1){3}}
\put(10,5){\makebox(0,0){$2$}}
\put(10,4.5){\line(-1,-1){3}}
\put(7,1){\makebox(0,0){$20$}}
\put(10,4.5){\line(0,-1){3}}
\put(10,1){\makebox(0,0){$21$}}
\put(10,4.5){\line(1,-1){3}}
\put(13,1){\makebox(0,0){$22$}}
\end{picture}
\end{center}
\end{figure}
%%
Notice that for every subset of $\BN^{\ast}$ it is decidable
whether or not it is a tree domain, and that two tree domains
are identical iff they define isomorphic trees. Now, the problem
in parsing a sentence is that we are given a string of symbols
but we have to guess the syntactic structure. As we have just
seen, the syntactic structure can also be given by means of a
tree domain.
The tree domain forms the tree on which we hang the term
symbols. An $\Omega$--{\it labelled tree domain\/} is
a pair $\auf T, \ell\zu$ such that $T$ is a tree domain
and $\ell : T \pf F$ a function such that the number of
daughters of $\sigma \in T$ is exactly $\Omega(\ell(x))$.
With a term we can associate a canonical labelled tree domain
$t^{\delta}$ as follows. If $t = g$, $\Omega(g)= 0$, we associate
with $t$ the tree domain $\{\varepsilon\}$, where $\varepsilon$
gets the label $g$. If $t_i^{\delta} = \auf T_i, \ell_i\zu$,
are given for each $i < \Omega(f)$, we form a new tree domain
$\auf S, \ell\zu$ as follows:
%%
$$\begin{array}{l@{\quad := \quad}l}
S & \{\varepsilon\} \cup \bigcup_{i < \Omega(f)}
\{i^{\smallfrown}\sigma : \sigma
\in T_i\} \\
\ell(\sigma) & \left\{\begin{array}{ll}
f & \mbox{ if }\sigma = \varepsilon \\
\ell_j(\tau) & \mbox{ if }\sigma = j^{\smallfrown}\tau
\end{array}\right.
\end{array}$$
%%
This is the tree domain associated with $f(t_0, \ldots,t_{\Omega(f)-1})$.
It is easy to see that this representation is unique.
Define the transpose $\sigma^T$ to be the transpose of $\sigma$
(ie the string written in reverse). For example, $\mbox{\tt 213}^T =
\mbox{\tt 312}$.
%%
\begin{defn}
A {\bmi bag over} $\Omega$ is a set of the form
$\Delta(\GT) := \{f{^{\smallfrown}\sigma} : \ell(\sigma^T) = f\}$,
where $\GT$ is an $\Omega$--labelled tree domain.
A {\bmi partial bag} (over $\Omega$) is a subset
of a bag.
\end{defn}
%%
Let us give an example. The terms
$\mbox{\tt +}(\mbox{\tt -}(\mbox{\tt x},\mbox{\tt y}), \mbox{\tt z})$
and
$\mbox{\tt -}(\mbox{\tt x},\mbox{\tt +}(\mbox{\tt y}, \mbox{\tt z}))$
correspond to the sets
%%
$$\{\mbox{\tt +}, \mbox{\tt z2}, \mbox{\tt -1}, \mbox{\tt x11},
\mbox{\tt y21}\}, \qquad
\{\mbox{\tt -}, \mbox{\tt x1}, \mbox{\tt +2}, \mbox{\tt y12},
\mbox{\tt z22}\}$$
%%
We have remarked above that a tree domain uniquely encodes an
ordered tree. This means that a bag uniquely encodes an ordered
labelled tree. It follows different bags correspond to different
ordered trees. This shows that we have unique readability for bags,
despite the fact that the bag is a set and not a sequence.
Finally, let $\Delta$ be a bag, and let $\Delta = \{\delta_i : i
\leq n\}$ be an enumeration of its members. Then the string
%%
$$\delta_1{^{\smallfrown}\delta_2{^{\smallfrown}}}\ldots
{^{\smallfrown}\delta_{n-1}{^{\smallfrown}\delta_n}}$$
%%
is said to be a $\Delta$--{\bf string}.
%%
\begin{defn}
Let $\Omega$ be a signature. The {\bmi ideal
case marking language} over this signature is the set of
all $\Delta$--strings where $\Delta$ is a bag over
$\Omega$. It is denoted by $\mbox{\it ICM}_{\Omega}$.
The {\bmi weak ideal case marking language} over $\Omega$,
$\mbox{\it WICM}_{\Omega}$ is the set of strings associated
with subsets of bags over $\Omega$.
\end{defn}
%%
We can also define the bags by a generating system over sets
of sequences. A {\it unit\/} is a sequence $f{^{\smallfrown}\sigma}$,
where $f \in F$ and $\sigma$ is a finite sequence of natural numbers.
(We may actually assume that no number of $\sigma$ is
larger than the maximum of the $\Omega(f)$, $f \in F$.)
The set of units is denoted by $U$. Now we shall define
sets of units, which correspond to terms. To do that, we
shall use an auxiliary symbol {\tt X}. This symbol will
allow us to define incomplete derivations. We shall define
$S$, the set of terms, as follows. For a set $M$ let
$M^+$ be the result of replacing an occurrence of
$\mbox{\tt X}{^{\smallfrown}{\sigma}}$ by
$\{f{^{\smallfrown}\sigma}\}
\cup \{\mbox{\tt X}{^{\smallfrown}i{^{\smallfrown}\sigma}}
: 1 \leq i \leq n\}$.
Let $S$ be the smallest set containing $\{\mbox{\tt X}\}$ and
which is closed under the transition from $M$ to $M^+$. We
call $S$ the set of {\it partial terms}. A {\it bag\/} is a set
in $S$ which does not contain any occurrence of {\tt X}.
The first bag is generated as follows.
%%
$$\begin{array}{lll}
\{\mbox{\tt X}\}, & \{\mbox{\tt +}, \mbox{\tt X1}, \mbox{\tt X2}\},
& \{\mbox{\tt +}, \mbox{\tt -1}, \mbox{\tt X11},
\mbox{\tt X12}, \mbox{\tt X2}\}, \\
\{\mbox{\tt +}, \mbox{\tt -1}, \mbox{\tt x11},
\mbox{\tt X21}, \mbox{\tt X2}\},
& \{\mbox{\tt +}, \mbox{\tt -1}, \mbox{\tt x11},
\mbox{\tt y21}, \mbox{\tt X2}\}, &
\{\mbox{\tt +}, \mbox{\tt -1}, \mbox{\tt x11},
\mbox{\tt y21}, \mbox{\tt z2}\}
\end{array}$$
%%