We study the succinctness of the complement and intersection of regular expressions. In particular, we show that when constructing a regular expression defining the complement of a given regular expression, a double exponential size increase cannot be avoided. Similarly, when constructing a regular expression defining the intersection of a fixed and an arbitrary number of regular expressions, an exponential and double exponential size increase, respectively, cannot be avoided in the worst case. All mentioned lower bounds improve the existing ones by one exponential and are tight in the sense that the target expression can be constructed in the corresponding time class, i.e., exponential or double exponential time. As a by-product, we generalize a theorem by Ehrenfeucht and Zeiger, stating that there is a class of DFAs which are exponentially more succinct than regular expressions, to a fixed four-letter alphabet. When the given regular expressions are one-unambiguous, as for instance required by the XML Schema specification, the complement can be computed in polynomial time, whereas the bounds concerning intersection continue to hold. For the subclass of single-occurrence regular expressions, we prove a tight exponential lower bound for intersection.
Deep Dive into Succinctness of the Complement and Intersection of Regular Expressions.
The two central questions addressed in this paper are the following. Given regular expressions $r, r_1, \ldots, r_k$ over an alphabet $\Sigma$, (1) what is the complexity of constructing a regular expression $r_\neg$ defining $\Sigma^* \setminus L(r)$, that is, the complement of $r$? (2) what is the complexity of constructing a regular expression $r_\cap$ defining $L(r_1) \cap \cdots \cap L(r_k)$? In both cases, the naive algorithm takes time double exponential in the size of the input. Indeed, for the complement, transform $r$ to an NFA and determinize it (first exponential step), then complement it and translate back to a regular expression (second exponential step). For the intersection there is a similar algorithm through a translation to NFAs, taking the cross product, and a retranslation to a regular expression. Note that both algorithms not only take double exponential time but also result in a regular expression of double exponential size. In this paper, we exhibit classes of regular expressions for which this double exponential size increase cannot be avoided. Furthermore, when the number $k$ of regular expressions is fixed, $r_\cap$ can be constructed in exponential time and we prove a matching lower bound for the size increase. In addition, we consider the fragments of one-unambiguous and single-occurrence regular expressions relevant to XML schema languages [2,3,13,23]. Our main results are summarized in Table 1.
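The first exponential step of the naive complement algorithm can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the dictionary encoding of the NFA, and the example language ("the n-th symbol from the end is a") are assumptions chosen for illustration, and the final translation back to a regular expression is omitted.

```python
from itertools import chain

def determinize(alphabet, delta, start, accepting):
    """Subset construction for an NFA without epsilon moves.
    delta maps (state, symbol) to a set of successor states."""
    start_set = frozenset([start])
    dfa_delta = {}
    states = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(chain.from_iterable(
                delta.get((q, symbol), ()) for q in current))
            dfa_delta[(current, symbol)] = target
            if target not in states:
                states.add(target)
                todo.append(target)
    dfa_accepting = {s for s in states if s & accepting}
    return states, dfa_delta, start_set, dfa_accepting

def complement(dfa):
    """The DFA built above is complete, so flipping acceptance complements it."""
    states, dfa_delta, start, dfa_accepting = dfa
    return states, dfa_delta, start, states - dfa_accepting

def accepts(dfa, word):
    _, dfa_delta, state, dfa_accepting = dfa
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in dfa_accepting

def nth_from_end_nfa(n):
    """Hypothetical example NFA with n + 1 states over {a, b}:
    accepts words whose n-th symbol from the end is 'a'."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for q in range(1, n):
        delta[(q, "a")] = {q + 1}
        delta[(q, "b")] = {q + 1}
    return "ab", delta, 0, {n}

dfa = determinize(*nth_from_end_nfa(3))
co_dfa = complement(dfa)
print(len(dfa[0]))             # number of reachable DFA states
print(accepts(co_dfa, "abb"))  # is "abb" in the complement?
```

For this example family the determinized automaton has 2^n reachable states from an NFA with n + 1 states, which is the kind of blow-up the first step can incur; the second exponential step comes from converting the resulting DFA back into a regular expression.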
The main technical part of the paper is centered around the generalization of a result by Ehrenfeucht and Zeiger [8]. They exhibit a class of languages $(Z_n)_{n \in \mathbb{N}}$, each of which can be accepted by a DFA of size $O(n^2)$ but cannot be defined by a regular expression of size smaller than $2^{n-1}$. The most direct way to define $Z_n$ is by the DFA that accepts it: the DFA is a graph consisting of $n$ states, labeled $0$ to $n-1$, which are fully connected, and the edge between states $i$ and $j$ carries the label $a_{i,j}$. It accepts all paths in the graph, that is, all strings of the form $a_{i_0,i_1} a_{i_1,i_2} \cdots a_{i_k,i_{k+1}}$. Note that the alphabet over which $Z_n$ is defined grows quadratically with $n$. We generalize their result to a four-letter alphabet. In particular, we define $K_n$ as the binary encoding of $Z_n$ using a suitable encoding for $a_{i,j}$ and prove that every regular expression defining $K_n$ must have size at least $2^n$. As integers are encoded in binary, the complement and intersection of regular expressions can now be used to separately encode $K_{2^n}$ (and slight variations thereof), leading to the desired results. In [9] the same generalization as obtained here is attributed to Waizenegger [35]. Unfortunately, we believe that proof to be incorrect, as we discuss in the full version of this paper.
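To make the definition of the Ehrenfeucht-Zeiger languages concrete, here is a small Python sketch of a membership test for Z_n. It is an illustration under stated assumptions, not the paper's construction: a symbol a_{i,j} is modeled as the pair (i, j), self-loops a_{i,i} are included, any state may start a path, and the empty string is accepted. The function names are hypothetical.

```python
def zn_alphabet(n):
    """All symbols a_{i,j}, modeled as pairs (i, j); assumes self-loops exist."""
    return [(i, j) for i in range(n) for j in range(n)]

def in_zn(n, word):
    """Check that word (a list of (i, j) pairs) spells out a path in the
    fully connected graph on states 0..n-1: each symbol must start in the
    state where the previous symbol ended."""
    state = None  # assumption: a path may start in any state
    for (i, j) in word:
        if not (0 <= i < n and 0 <= j < n):
            return False
        if state is not None and i != state:
            return False
        state = j
    return True

print(len(zn_alphabet(4)))                     # alphabet size for n = 4
print(in_zn(3, [(0, 1), (1, 2), (2, 0)]))      # consecutive symbols chain
print(in_zn(3, [(0, 1), (2, 0)]))              # second symbol starts in the wrong state
```

The check only needs to remember the current state, which is why a small DFA suffices, while the alphabet of n^2 pairs shows the quadratic growth that the binary encoding K_n is designed to remove.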
Although the succinctness of various automata models has been investigated in depth [14], and more recently that of logics over (unary alphabet) strings [15], the succinctness of regular expressions has hardly been addressed. For the complement of a regular expression, an exponential lower bound is given by Ellul et al. [9]. For the intersection of an arbitrary number of regular expressions, Petersen gave an exponential lower bound [28], while Ellul et al. [9] mention a quadratic lower bound for the intersection of two regular expressions. In fact, in [9], it is explicitly asked what the maximum achievable blow-up is for the complement of one and the intersection of two regular expressions (Open Problems 4 and 5). Although we do not answer these questions in the most precise way, our lower bounds improve the existing ones by one exponential and are tight in the sense that the target expression can be constructed in the time class matching the space complexity of the lower bounds.
Succinctness of complement and intersection relates to the succinctness of semi-extended (RE(∩)) and extended regular expressions (RE(∩,¬)). These are regular expressions augmented with intersection and with both complement and intersection operators, respectively. Their membership problem has been extensively studied [18,20,26,28,30]. Furthermore, non-emptiness and equivalence of RE(∩,¬) are non-elementary [33]. For RE(∩), inequivalence is EXPSPACE-complete [10,16,29], and non-emptiness is PSPACE-complete [10,16] even when restricted to the intersection of a (non-constant) number of regular expressions [19]. Several of these papers hint at the succinctness of the intersection operator and provide dedicated techniques for dealing with the new operator directly rather than through a translation to ordinary regular expressions [20,28]. Our results present a double exponential lower bound for translating RE(∩) to RE and therefore further justify the development of specialized techniques.
A final motivation for this research stems from its application in the emerging area of XML-theory [21,27,31,34]. From a formal language viewpoint, XML documents can be seen as labeled unranked trees and collections of these documents are defined by schemas. A schema can take various forms, but the
…(Full text truncated)…