Static Program Analysis for String Manipulation Languages

In this section, we discuss how to design an abstract domain for string manipulation that also handles the other primitive types, i.e., a domain able to combine the abstractions of different primitive types. In particular, since operations on strings combine strings with other values (e.g., integers), an abstract domain for string analysis in the presence of dynamic typing must include all the possible primitive values, i.e., the whole $`\cval=\cint\cup\cbool\cup\cstr\cup\{\cnan\}`$. The idea is to consider an abstract domain for each type of primitive value and to combine these abstract domains into a single abstract domain for $`\cval`$. Consider, for each set of primitive values $`\mathbb{D}`$, an abstract domain $`\mathbb{D}^\sharp`$ (we denote the domain $`\mathbb{D}^\sharp`$ without bottom as $`\mathbb{D}^\sharp_{\not\bot}`$), equipped with an abstraction $`\alpha_{\mathbb{D}}: \mathbb{D} \rightarrow \mathbb{D}^\sharp`$ and a concretization $`\gamma_{\mathbb{D}}:\mathbb{D}^\sharp \rightarrow \mathbb{D}`$ forming a Galois insertion.

Coalesced sum. One way to merge domains is the coalesced sum. The resulting domain contains all the non-bottom elements of the component domains, together with a new top and a new bottom, lying above and below all of these elements, respectively. In our case, if we consider the abstract domains $`\cint^\sharp`$, $`\cstr^\sharp`$ and $`\cbool^\sharp`$, the coalesced sum is the abstraction of $`\wp(\cval)`$ depicted in Fig. 1.

Coalesced sum abstract domain for $\mujs$

This is the simplest choice, but unfortunately it is not suitable for dynamic languages, in particular for dealing with dynamic typing and implicit type conversion. The problem is that the type of a variable is determined at run-time and may change during execution. For example, consider the following $`\mujs`$ fragment: $`\mathtt{if\ (y < 5)\ \{x = ``42";\}\ else\ \{x = true;\}}`$. The value of the variable $`\code{y}`$ is statically unknown; hence, in order to guarantee soundness, we must take into account both branches, meaning that $`\code{x}`$ may be either a string or a boolean value after the statement. On the coalesced sum domain, the analysis would lose all precision w.r.t. the collecting semantics by returning $`\alpha_{\scriptsize{\cstr}}(``\code{42}") \sqcup \alpha_{\scriptsize{\cbool}}(\code{true}) = \top`$.

Cartesian product. In order to capture union types without losing too much precision, we need to complete the above domain so that it can observe collections of values of different types. To define this combination, we rely on the cartesian product. Hence, the complete abstract domain w.r.t. dynamic typing and implicit type conversion is $`\aint \times \abool \times \astr \times \wp(\{\anan\})`$, an abstraction of $`\wp(\cval)`$. In this combined abstract domain, the value of $`\code{x}`$ after the execution of the $`\code{if}`$ statement is precisely $`(\bot, \alpha_{\scriptsize{\cbool}}(\code{true}), \alpha_{\scriptsize{\cstr}}(``\code{42}"), \bot)`$, which is now an element of the domain: the value of $`\code{x}`$ can be $`\alpha_{\scriptsize{\cbool}}(\code{true})`$ or $`\alpha_{\scriptsize{\cstr}}(``\code{42}")`$, but definitely not an abstract integer or $`\cnan`$.

In the following, we consider the abstract domain $`\aval`$ for string analysis obtained as the cartesian product of the following abstractions: $`\aint = \interval`$ (the well-known abstract domain of intervals), $`\astr = \fa`$, $`\abool = \wp(\{\code{true}, \code{false}\})`$.
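As a minimal sketch of why the cartesian product keeps the information that the coalesced sum loses, the following JavaScript models abstract values with deliberately simplified components: intervals for integers, sets for booleans, and finite sets of strings standing in for automata. The names `bottom`, `fromValue`, `joinInterval` and `join` are hypothetical and are not taken from the analyzer described in this paper.

```javascript
// Sketch of the product domain Int# x Bool# x Str# x P({NaN}).
// Assumption: finite sets of strings stand in for the automata domain.

// Bottom element: every component is empty.
function bottom() {
  return { int: null, bool: new Set(), str: new Set(), nan: false };
}

// Abstraction of a single concrete primitive value.
function fromValue(v) {
  const a = bottom();
  if (typeof v === "number" && Number.isNaN(v)) a.nan = true;
  else if (typeof v === "number") a.int = [v, v];
  else if (typeof v === "boolean") a.bool.add(v);
  else if (typeof v === "string") a.str.add(v);
  return a;
}

// Interval join: smallest interval containing both (null is bottom).
function joinInterval(x, y) {
  if (x === null) return y;
  if (y === null) return x;
  return [Math.min(x[0], y[0]), Math.max(x[1], y[1])];
}

// Component-wise join: each type is tracked independently, so joining
// a string with a boolean does not collapse to top.
function join(a, b) {
  return {
    int: joinInterval(a.int, b.int),
    bool: new Set([...a.bool, ...b.bool]),
    str: new Set([...a.str, ...b.str]),
    nan: a.nan || b.nan,
  };
}

// Abstract value of x after: if (y < 5) {x = "42";} else {x = true;}
const x = join(fromValue("42"), fromValue(true));
```

Here `x` has an empty integer component and no `NaN`, while its boolean and string components hold `true` and `"42"` respectively, mirroring the tuple given above.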

Dynamic languages, such as JavaScript or Python, have seen a significant increase in usage across a very wide range of fields and applications. Common features of dynamic languages are dynamic typing (typing occurs during program execution, at run-time) and implicit type conversion, which lighten the development phase and avoid blocking program execution in the presence of unexpected or unpredictable situations. Moreover, one important aspect of dynamic languages is the way strings may be used. In JavaScript, for example, strings can be used either to access object properties or be transformed into executable code by the global function $`\code{eval}`$. In this way, dynamic languages provide multiple string features that simplify writing programs while allowing statically unpredictable executions, which may make programs harder to understand. For this reason, string obfuscation (e.g., string splitting) is becoming one of the most common obfuscation techniques in JavaScript malware, making code hard to analyze statically. Consider, for example, the JavaScript program fragment in Fig. 2, where strings are manipulated, de-obfuscated, combined together into the variable $`\code{d}`$ and finally transformed into executable code by the call $`\code{eval(d)}`$. The resulting command, in Internet Explorer, opens a shell which may execute malicious commands. The command is not hard-coded in the fragment but is built at run-time, and the initial values of $`\code{i}`$, $`\code{j}`$ and $`\code{k}`$ are unknown, as is the number of iterations of the loops in the fragment. These observations suggest that, in order to statically understand statements that are dynamically generated and executed, it may be extremely useful to statically analyze the string value of $`\code{d}`$. Unfortunately, existing static analyzers for dynamic languages may fail to precisely analyze strings in dynamic contexts. For instance, in the example, existing static analyzers lose all information about the input value of $`\code{eval}`$.
Namely, the problem of analyzing dynamic languages, even if tackled by sophisticated tools such as the cited ones, still lacks formal approaches for handling the more dynamic features of string manipulation, such as dynamic typing, implicit type conversion and dynamic code generation.

Contributions. In this paper, we focus on the characterization of an abstract interpretation-based formal framework for handling dynamic typing and implicit type conversion, by defining an abstract semantics able to capture these dynamic features. Even if we do not tackle the problem of analyzing dynamically generated code (meaning that we do not analyze its behavior), we strongly believe that such a semantics is a necessary step towards a sufficiently precise analysis of dynamically generated code, since it is able to reason about a class of string manipulation programs (as far as string values are concerned) that state-of-the-art static analyzers would fail to analyze precisely. Indeed, the domain we propose allows us to collect (and potentially approximate) the set of all the string values that a variable may receive during computation (at each program point). In order to analyze what an $`\code{eval}`$ statement may execute, we need to (over-)approximate the set of precise values that its parameter may have. Hence, we propose an approach aiming at defining a collecting semantics for strings. With this task in mind, we first discuss how to combine abstract domains of primitive types (strings, integers and booleans) in order to capture dynamic typing. Once we have such an abstract domain, we define on it an abstract semantics for a toy language, augmented with implicit type conversion, dynamic typing and some interesting string operations, whose concrete semantics is inspired by the JavaScript one. In particular, for each of these operations we provide the algorithm computing its abstract semantics and we discuss its soundness and completeness.

vd, ac, la = "";
v = "wZsZ"; m = "AYcYtYiYvYeYXY";
tt = "AObyaSZjectB";
l = "WYSYcYrYiYpYtY.YSYhYeYlYlY"; 

while (i+=2 < v.length) vd = vd + v.charAt(i);

while (j+=2 < m.length) ac = ac + m.charAt(j);

ac += tt.substring(tt.indexOf("O"), 3);
ac += tt.substring(tt.indexOf("j"), 11);

while (k+=2 < l.length) 
    la = la + l.charAt(k);

d = vd + "=new " + ac + "(" + la + ")";
eval(d);
A potentially malicious obfuscated JavaScript program.

Paper structure. In Sect. 2 we recall relevant notions on finite state automata, the core language we adopt for this paper, and the finite state automata domain, highlighting some important operations and theoretical results. In Sect. 3 we present and discuss two ways of combining abstract domains (for primitive types) suitable for dynamic languages. Then, in Sect. 4 we present the novel abstract semantics for string manipulation programs. Finally, in Sect. 5 we compare this paper with related work and we conclude.

Basic notations and concepts

String notation.

We denote by $`\Sigma`$ a finite alphabet of symbols, its Kleene-closure by $`\Sigma^*`$ and a string element by $`\sigma \in \Sigma^*`$. If $`\sigma = \sigma_0\sigma_1\cdots\sigma_n`$, the length of $`\sigma`$ is $`|\sigma|=n+1`$ and the element in the $`i`$-th position is $`\sigma_i`$. Given two strings $`\sigma, \sigma' \in \Sigma^*`$, $`\sigma\sigma'`$ is their concatenation. A language is a set of strings, i.e., $`\lin \in \wp(\Sigma^*)`$. We use the following notations: $`\Sigma^i\defi\sset{\sigma\in\Sigma^*}{|\sigma|=i}`$ and $`\Sigma^{< i}\defi\bigcup_{j< i}\Sigma^j`$. Given $`\sigma\in\Sigma^*`$ and $`i,j\in\mathbb{N}`$ ($`i\leq j\leq |\sigma|`$), the substring between $`i`$ and $`j`$ of $`\sigma`$ is the string $`\sigma_i\cdots\sigma_{j-1}`$, and we denote it by $`\substringf{\sigma}{i}{j}`$. Let $`\mathbb{Z}`$ be the set of integers. We denote by $`\Sigma^*_{\scriptsize{\cint}} \defi \{+,-, \epsilon\}\cdot\{0,1, \dots, 9\}^+`$ the set of numeric strings, i.e., strings corresponding to integers. $`\mathcal{I}: \Sigma^*_{\scriptsize{\cint}} \rightarrow \mathbb{Z}`$ maps numeric strings to the corresponding integers. Dually, we define the function $`\mathcal{S}: \mathbb{Z} \rightarrow \Sigma^*_{\scriptsize{\cint}}`$ that maps each integer to its numeric string representation (e.g., $`\mathcal{S}(1) = ``\code{1}" \neq ``\code{+1}"`$ and $`\mathcal{S}(-5) = ``\code{-5}"`$).

Regular languages and finite state automata. We use standard automata notation. A finite state automaton (FA) is a tuple $`\aut = (Q, q_0, \Sigma, \delta, F)`$ where $`Q`$ is a finite set of states, $`q_0 \in Q`$ is the initial state, $`\Sigma`$ is a finite alphabet, $`\delta \subseteq Q \times \Sigma \times Q`$ is the transition relation and $`F \subseteq Q`$ is the set of final states. In particular, if $`\delta:Q\times\Sigma\rightarrow Q`$ is a function then $`\aut`$ is called a deterministic FA (DFA).1 The class of languages recognized by FAs is the class of regular languages. We denote the set of all DFAs as $`\DFA`$. Given an automaton $`\aut`$, we denote the language accepted by $`\aut`$ as $`\lang(\aut)`$. A language $`\lin`$ is regular iff there exists an FA $`\aut`$ such that $`\lin = \lang(\aut)`$. From the Myhill-Nerode theorem, for each regular language there exists a unique (up to isomorphism) minimum automaton, i.e., the DFA with the minimum number of states recognizing the language. Given a regular language $`\lin`$, we denote by $`\minimize(\lin)`$ the minimum DFA $`\aut`$ s.t. $`\lin=\lang(\aut)`$.

The programming language. We consider $`\mujs`$ (Fig. 3), a core language that contains representative string operations taken from the set of methods offered by the JavaScript built-in class $`\code{String}`$. Other JavaScript string operations can be modeled by composition of the given string operations or as particular cases of them. Primitive values are $`\cval=\cstr\cup\cint\cup\cbool\cup\{\cnan\}`$ with $`\cstr\defi\Sigma^*`$ (strings on the alphabet $`\Sigma`$), $`\cbool\defi\{\true,\false\}`$ and $`\cnan`$ a special value denoting not-a-number.

Implicit type conversion. In order to capture the semantics of the language $`\mujs`$, inspired by the JavaScript semantics, we need to deal with implicit type conversion. For each primitive type, we define an auxiliary function converting primitive values to values of that type (Fig. 4). Note that all the functions behave like the identity when applied to values not needing conversion, e.g., $`\code{toInt}`$ on integers. Then, $`\code{toStr}: \cval \rightarrow \cstr`$ maps any input value to its string representation; $`\code{toInt}: \cval \rightarrow \cint \cup \{\cnan\}`$ returns the integer corresponding to a value, when possible: for $`\true`$ and $`\false`$ it returns respectively $`1`$ and $`0`$, for strings in $`\Sigma^*_{\scriptsize{\cint}}`$ it returns the corresponding integer, while non-numeric strings are converted to $`\cnan`$. For instance, $`\code{toInt}(``42") = 42`$, $`\code{toInt}(``42hello") = \cnan`$. Finally, $`\code{toBool}: \cval \rightarrow \cbool`$ returns $`\false`$ when the input is $`0`$, the empty string $`\epsilon`$ or $`\cnan`$, and $`\true`$ for all the other non-boolean primitive values. For example, implicit type conversion is applied when the guards of $`\code{if}`$ and $`\code{while}`$ statements do not evaluate to booleans: in this case the guard is implicitly converted to a boolean by $`\code{toBool}`$.

$\mujs$ syntax

Semantics. Program states are partial maps from identifiers to primitive values, i.e., $`\states: \id \rightarrow \cval`$. The concrete big-step semantics $`\sem{\cdot} : \stmt \times \states \rightarrow \states`$ is standard, and it includes dynamic typing and implicit type conversion. Also the expression semantics, $`\sem{\cdot}:\expr\times\states\rightarrow\cval`$, is standard; we only provide the formal and precise semantics of the $`\mujs`$ string operations. Let $`\sigma, \sigma'\in\cstr`$ and $`i,j\in\cint`$. Values which are not strings or numbers, respectively, are converted by the implicit type conversion primitives, and negative index values are treated as zero.

substring:
It extracts substrings from strings, i.e., all the characters between two indexes. The semantics is the function $`\mbox{\sc Ss}: \cstr \times \cint \times \cint \rightarrow \cstr`$ defined as:

\substringl{\sigma}{i}{j} \defi
        \begin{cases}
            \substringl{\sigma}{j}{i}       &  j < i \\
            \substringf{\sigma}{i}{\mathsf{min}(j,|\sigma|)} &  \mbox{otherwise}
        \end{cases}

charAt:
It returns the character at a specified index. The semantics is the function $`\mbox{\sc Ca}: \cstr \times \cint \rightarrow \cstr`$ defined as follows:

\charats{\sigma}{i} \defi \begin{cases}
    \sigma_i & 0 \leq i < |\sigma| \\
    \epsilon & \mbox{otherwise}
    \end{cases}

indexOf:
It returns the position of the first occurrence of a given substring. The semantics is the function $`\mbox{\sc Io}:\cstr \times \cstr \rightarrow \cint`$ defined as follows:

\indexofs{\sigma}{\sigma'}\defi \begin{cases}
        \mathsf{min}\sset{i}{\sigma_i\dots\sigma_j = \sigma'} & \exists i,j.\:\sigma_i\dots\sigma_j = \sigma' \\
        -1 & \mbox{otherwise}
        \end{cases}

length:
It returns the length of a string $`\sigma \in \cstr`$. Its semantics is the function $`\mbox{\sc Le}:\cstr\rightarrow\cint`$ defined as $`\lengths{\sigma} \defi |\sigma|`$.

concat:
String concatenation is handled by the $`\mujs`$ plus operator ($`\code{+}`$). The concrete semantics relies on the concatenation operator reported in Sect. 2, i.e., $`\concs{\sigma}{\sigma'} = \sigma\sigma'`$.
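The concrete string operations above can be sketched in JavaScript as follows. This is only an illustration written from the definitions in this section: the function names `substr`, `Ss`, `Ca`, `Io` and `Le` are chosen here to echo the notation, and clipping the initial index to the string length in `Ss` is an assumption for indexes that the definition of the plain substring leaves unspecified.

```javascript
// Sketch of the concrete muJS string operations, written out explicitly
// instead of delegating to the JavaScript built-ins.

// Plain substring s_i ... s_{j-1}, assuming 0 <= i <= j <= |s|.
function substr(s, i, j) {
  let out = "";
  for (let p = i; p < j; p++) out += s[p];
  return out;
}

// Ss: negative indexes are treated as zero, swapped indexes are reordered,
// and indexes beyond the string are clipped to its length (assumption for i).
function Ss(s, i, j) {
  i = Math.max(i, 0);
  j = Math.max(j, 0);
  if (j < i) return Ss(s, j, i);
  return substr(s, Math.min(i, s.length), Math.min(j, s.length));
}

// Ca: the character at index i, or the empty string when out of range.
function Ca(s, i) {
  return 0 <= i && i < s.length ? s[i] : "";
}

// Io: least index at which t occurs in s, or -1 when it does not occur.
function Io(s, t) {
  for (let i = 0; i + t.length <= s.length; i++) {
    if (substr(s, i, i + t.length) === t) return i;
  }
  return -1;
}

// Le: the length of the string.
function Le(s) {
  return s.length;
}

// The two substring calls of the obfuscated fragment in Fig. 2.
const tt = "AObyaSZjectB";
const first = Ss(tt, Io(tt, "O"), 3);   // "Ob"
const second = Ss(tt, Io(tt, "j"), 11); // "ject"
```

Concatenating `first` and `second` yields `"Object"`, which is how the fragment in Fig. 2 rebuilds part of the class name at run-time.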

$\ctostr{v} = \begin{cases} v & v \in \cstr \\ ``\code{NaN}" & v = \cnan \\ ``\true" & v = \true \\ ``\false" & v = \false \\ \mathcal{S}(v) & v \in \cint \end{cases}$

$\ctoint{v} = \begin{cases} v & v \in \cint \\ 1 & v = \true \\ 0 & v = \false \vee v = \cnan \\ \mathcal{I}(v) & v \in \cstr \wedge v \in \Sigma^*_{\scriptsize{\cint}} \\ \cnan & v \in \cstr \wedge v \not\in \Sigma^*_{\scriptsize{\cint}} \end{cases}$

$\ctobool{v} = \begin{cases} v & v \in \cbool \\ \true & v \in \cint \smallsetminus \{0\} \vee v \in \cstr \smallsetminus \{\epsilon\} \\ \false & v = 0 \vee v = \epsilon \vee v = \cnan \end{cases}$

$\mujs$ implicit type conversion functions.
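A minimal JavaScript sketch of the three conversion functions, following the case analysis of Fig. 4 line by line. It assumes that $`\mujs`$ integers are modelled by JavaScript integers and $`\cnan`$ by JavaScript's `NaN`; the regular expression `NUMERIC` is an illustrative encoding of the numeric strings $`\Sigma^*_{\scriptsize{\cint}}`$.

```javascript
// Sketch of toStr, toInt and toBool as defined in Fig. 4.
// Assumption: inputs are strings, booleans, integers or NaN only.

const NUMERIC = /^[+-]?[0-9]+$/; // optional sign followed by digits

function toStr(v) {
  if (typeof v === "string") return v;
  if (typeof v === "boolean") return v ? "true" : "false";
  if (Number.isNaN(v)) return "NaN";
  return String(v); // S(v): e.g. 1 -> "1", -5 -> "-5"
}

function toInt(v) {
  if (typeof v === "boolean") return v ? 1 : 0;
  if (typeof v === "string") return NUMERIC.test(v) ? parseInt(v, 10) : NaN;
  if (Number.isNaN(v)) return 0; // NaN is mapped to 0, as in Fig. 4
  return v; // integers are left unchanged
}

function toBool(v) {
  if (typeof v === "boolean") return v;
  if (typeof v === "string") return v !== "";
  return !(v === 0 || Number.isNaN(v));
}
```

For instance, `toInt("42")` is `42` while `toInt("42hello")` is `NaN`, matching the examples given in the text.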

The finite state automata domain for strings

In this section, we describe the automata abstract domain for strings, namely the domain of regular languages over $`\Sigma^*`$. In particular, our aim is to characterize automata as a domain for abstracting the computation of program semantics in the abstract interpretation framework. The idea is to approximate sets of strings as regular languages represented by the minimum DFAs recognizing them. In general, several DFAs recognize the same regular language, hence the domain of automata is the quotient $`\fa`$ w.r.t. the equivalence relation induced by language equality: $`\forall \aut_1,\aut_2\in\DFA.\:\aut_1 \equiv \aut_2 \Leftrightarrow \lang(\aut_1) = \lang(\aut_2)`$. Hence, any equivalence class $`[\aut]_{\equiv}`$ is composed of the automata that recognize the same regular language. We abuse notation by representing equivalence classes in the domain $`\fa`$ w.r.t. $`\equiv`$ by one of their automata (usually the minimum), i.e., when we write $`\aut\in\fa`$ we mean $`[\aut]_{\equiv}`$. The partial order $`\leqfa`$ induced by language inclusion is $`\forall \aut_1, \aut_2 \in \fa \ . \ \aut_1 \leqfa \aut_2 \Leftrightarrow \lang(\aut_1) \subseteq \lang(\aut_2)`$, which is well defined since automata in the same $`\equiv`$-equivalence class recognize the same language.

The least upper bound (lub) $`\lubfa: \fa \times \fa \rightarrow \fa`$ on the domain $`\fa`$, corresponds to the standard union between automata: $`\forall \aut_1, \aut_2 \in \fa.\:\aut_1 \lubfa \aut_2 \defi \minimize(\lang(\aut_1) \cup \lang(\aut_2))`$. It is the minimum automaton recognizing the union of the languages $`\lang(\aut_1)`$ and $`\lang(\aut_2)`$. This is a well-defined notion since regular languages are closed under union. The greatest lower bound $`\glbfa : \fa \times \fa \rightarrow \fa`$ corresponds to automata intersection, since regular languages are closed under finite intersection: $`\forall \aut_1, \aut_2 \in \fa.\: \aut_1 \glbfa \aut_2 \defi \minimize(\lang(\aut_1) \cap \lang(\aut_2)).`$
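The order, lub and glb just described can be sketched with the classical product construction. The encoding below (complete DFAs over a shared alphabet, stored as `{ start, finals, delta }`) and the names `product`, `lub`, `glb`, `leq` and `accepts` are assumptions made for this illustration; in particular the sketch omits the final minimization step, so it returns some automaton of the equivalence class rather than the minimum one.

```javascript
// Sketch of the lattice operations on complete DFAs over a shared alphabet.
// Encoding (assumption): delta[q][c] is the successor of state q on symbol c.

// Product automaton restricted to reachable pairs; `accept` decides
// finality of a pair from the finality of its two components.
function product(a, b, alphabet, accept) {
  const ids = new Map(); // "p,q" -> id of the product state
  const delta = [];
  const finals = new Set();
  const work = [[a.start, b.start]];
  ids.set(a.start + "," + b.start, 0);
  while (work.length > 0) {
    const [p, q] = work.pop();
    const id = ids.get(p + "," + q);
    delta[id] = {};
    if (accept(a.finals.has(p), b.finals.has(q))) finals.add(id);
    for (const c of alphabet) {
      const p2 = a.delta[p][c];
      const q2 = b.delta[q][c];
      const key = p2 + "," + q2;
      if (!ids.has(key)) {
        ids.set(key, ids.size);
        work.push([p2, q2]);
      }
      delta[id][c] = ids.get(key);
    }
  }
  return { start: 0, finals, delta };
}

// lub recognizes the union, glb the intersection (minimization omitted).
const lub = (a, b, alphabet) => product(a, b, alphabet, (x, y) => x || y);
const glb = (a, b, alphabet) => product(a, b, alphabet, (x, y) => x && y);

// a <= b iff no reachable pair accepts in a but not in b.
function leq(a, b, alphabet) {
  return product(a, b, alphabet, (x, y) => x && !y).finals.size === 0;
}

// Membership test used to inspect the results.
function accepts(a, s) {
  let q = a.start;
  for (const c of s) q = a.delta[q][c];
  return a.finals.has(q);
}

const alphabet = ["a", "b"];
// A1 recognizes {epsilon, a}; state 2 is a sink.
const A1 = { start: 0, finals: new Set([0, 1]),
  delta: [{ a: 1, b: 2 }, { a: 2, b: 2 }, { a: 2, b: 2 }] };
// A2 recognizes {a, aa}; state 3 is a sink.
const A2 = { start: 0, finals: new Set([1, 2]),
  delta: [{ a: 1, b: 3 }, { a: 2, b: 3 }, { a: 3, b: 3 }, { a: 3, b: 3 }] };
const union = lub(A1, A2, alphabet);
const inter = glb(A1, A2, alphabet);
```

In this example `union` recognizes the three strings of the two languages together, `inter` recognizes only `a`, and `leq(A1, A2, alphabet)` is false because the empty string belongs to the first language only.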

$`\latticefa`$ is a sub-lattice but not a complete meet-sub-semilattice of $`\wp(\Sigma^*)`$.

In other words, there exists no Galois connection between $`\fa`$ and $`\wp(\Sigma^*)`$, i.e., there may exist no minimal automaton abstracting a language.2 However, this is not a concern, since the relation between concrete semantics and abstract semantics can be weakened while still ensuring soundness. A well-known example is the convex polyhedra domain.

Widening. The domain $`\fa`$ is infinite and does not satisfy the ascending chain condition (ACC), i.e., it contains infinite ascending chains. For instance, the languages $`\{\sset{a^j b^j}{0\leq j\leq i}\}_{i\geq 0}\subseteq\wp(\Sigma^*)`$ form an infinite ascending chain, and so do the corresponding minimal automata on $`\fa`$. This implies that fix-point computations on $`\fa`$ may fail to converge. Most of the proposed abstract domains for strings trivially satisfy ACC by being finite, but they may lose precision during the abstract computation. When ACC does not hold, the domain must be equipped with a widening operator approximating the lub in order to force convergence (by necessarily losing precision) for any increasing chain. As far as automata are concerned, existing widenings are defined in terms of a state equivalence relation merging states that recognize the same language, up to a fixed length $`n`$ (set as a parameter for tuning the widening precision). We denote this parametric widening with $`\nabla_n`$, $`n \in \mathbb{N}`$.

Consider the following $`\mujs`$ fragment

str = ""; while (x++ < 100) { str += "a"; }

Since the value of the variable $`\code{x}`$ is unknown, the number of iterations of the $`\code{while}`$-loop is also unknown. In these cases, in order to guarantee soundness and termination, we apply the widening operator. In Fig. 5 we report the abstract value of the variable $`\code{str}`$ at the beginning of the second iteration of the loop, while Fig. 6 reports its abstract value at the end of the second iteration. Before starting a new iteration, in the example, we apply $`\nabla_1`$ to the two automata, namely we merge all the states having the same outgoing character. The minimization of the obtained automaton is reported in Fig. 7. The next iteration reaches the fix-point, guaranteeing soundness and termination.

(a) $\aut_1$ s.t. $\lang(\aut_1) = \{\epsilon, a\}$ (b) $\aut_2$ s.t. $\lang(\aut_2) = \{a, aa\}$ (c) $\aut_1 \nabla_1 \aut_2$
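The widening step of this example can be sketched as a quotient construction. The sketch below is one possible reading, not the definition used by the cited widenings: it assumes that two states are merged when the same words of length at most $`n`$ label paths leaving them (for $`n = 1`$, the same outgoing characters), and it returns a nondeterministic automaton without determinizing or minimizing it. The names `wordsUpTo`, `widen` and `acceptsNfa` are hypothetical.

```javascript
// Sketch of a parametric widening by state merging.
// Input (assumption): a partial DFA { start, finals, delta } for the lub of
// the two operands, where delta[q] maps a symbol to the successor state.

// Words of length <= n that label a path starting in state q.
function wordsUpTo(aut, q, n) {
  const words = new Set([""]);
  if (n === 0) return words;
  for (const [c, q2] of Object.entries(aut.delta[q])) {
    for (const w of wordsUpTo(aut, q2, n - 1)) words.add(c + w);
  }
  return words;
}

// Quotient automaton: states with the same bounded behaviour are merged.
// The result is an NFA whose transitions map a symbol to a set of classes.
function widen(aut, n) {
  const classOf = [];
  const keys = new Map();
  aut.delta.forEach((_, q) => {
    const key = JSON.stringify([...wordsUpTo(aut, q, n)].sort());
    if (!keys.has(key)) keys.set(key, keys.size);
    classOf[q] = keys.get(key);
  });
  const delta = Array.from({ length: keys.size }, () => ({}));
  const finals = new Set();
  aut.delta.forEach((edges, q) => {
    if (aut.finals.has(q)) finals.add(classOf[q]);
    for (const [c, q2] of Object.entries(edges)) {
      const src = delta[classOf[q]];
      if (!src[c]) src[c] = new Set();
      src[c].add(classOf[q2]);
    }
  });
  return { start: classOf[aut.start], finals, delta };
}

// Standard subset simulation to test membership in the NFA.
function acceptsNfa(nfa, s) {
  let current = new Set([nfa.start]);
  for (const c of s) {
    const next = new Set();
    for (const q of current) {
      for (const q2 of nfa.delta[q][c] || []) next.add(q2);
    }
    current = next;
  }
  return [...current].some((q) => nfa.finals.has(q));
}

// lub of {epsilon, a} and {a, aa}: a chain of three final states.
const joined = { start: 0, finals: new Set([0, 1, 2]), delta: [{ a: 1 }, { a: 2 }, {}] };
const widened = widen(joined, 1);
```

On this input the first two states are merged, which introduces a self-loop on `a`; the widened automaton therefore accepts every string made only of `a` characters, a sound over-approximation of the strings built by the loop.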

The abstract interpreter for the abstract semantics defined so far has been tested by means of the implementation of an automata library.3 This library includes the implementation of all the algorithms concerning the finite state automata domain and provides well-known operations on automata, such as suffix and right quotient, and abstract domain-related operations, such as $`\lubfa`$, $`\glbfa`$, and a parametric widening for tuning precision and forcing convergence. The library can be easily plugged into existing static analyzers. The bottleneck of our library is the determinization operation, which has exponential complexity (we rely on determinization in the minimization algorithm, in order to keep the automata arising during the abstract computations minimal and deterministic). It is worth noting that, as stated by the theorem on $`\latticefa`$, $`\wp(\Sigma^*)`$ (the concrete string domain) and $`\fa`$ (the abstract string domain) do not form a Galois connection but, nevertheless, this is not a concern. We have shown, for the core language we adopted, that the abstract semantics we have defined for string operations guarantees soundness; hence, if the abstract interpreter starts from regular initial conditions (i.e., constraints expressible as finite state automata) it will always compute regular invariants. Indeed, it is sound to start from the $`\top`$ initial condition that, in our string abstract domain, is expressible by $`\minimize(\Sigma^*)`$, which is regular.

Example: Obfuscated malware. Consider the fragment reported in Fig. 2 in the introduction. By computing the abstract semantics of this code, we obtain that the abstract value of $`\code{d}`$, at the $`\code{eval}`$ call, is the automaton $`\aut_{\scriptsize{d}}`$ in Fig. 9. The cycles are caused by the widening application in the computation.

$\aut_{\scriptsize{d}}$, the abstract value of $\code{d}$ before the $\code{eval}$ call of the program in Fig. 2

From this automaton we are able to retrieve some important and non-trivial information. For example, we are able to answer the following question: May $`\aut_{\scriptsize{d}}`$ contain a string corresponding to an assignment to an ActiveXObject? We can answer by checking the predicate $`\aut_{\scriptsize{d}} \sqcap \minimize(\id \cdot \{new \ ActiveXObject(\} \cdot \Sigma^* \cdot \{)\}) \neq \varnothing`$, i.e., by checking whether $`\aut_{\scriptsize{d}}`$ recognizes strings that are concatenations of any identifier with the string $`new \ ActiveXObject(`$, followed by any possible string and a closing parenthesis. In the example, the predicate returns $`\true`$. Another interesting question could be: May $`\aut_{\scriptsize{d}}`$ contain the string $`\code{eval}`$? We can answer by checking whether $`\aut_{\scriptsize{d}} \sqcap \minimize(\{eval\}) \neq \varnothing`$, which is false and guarantees that no explicit call to $`\code{eval}`$ can occur.
We observe that such an analysis may lose precision during fix-point computations due to the widening application, which causes the cycles in the automaton in Fig. 9. Nevertheless, it is worth noting that this result is obtained without any precision improvement on fix-point computations, such as loop unrolling or widening with thresholds. We think these techniques will drastically decrease the false positives of the proposed string analysis, but we will address this topic in future work.

In this section, we define the abstract semantics of $`\code{substring}`$, i.e., we define the operator $`\mathsf{SS}^\sharp : \fa \times \interval \times \interval \rightarrow \fa`$ which, starting from an automaton, an interval $`[i,j]`$ of initial indexes and an interval $`[l,k]`$ of final indexes for substrings, computes the automaton recognizing the set of all substrings of the strings in the input automaton language between the indexes in the two intervals. Since the abstract semantics has to take into account the swaps occurring when the initial index is greater than the final one, several cases arise in handling (potentially unbounded) intervals. Tab. 1 reports the abstract semantics of $`\mathsf{SS}^\sharp`$ when $`i,j \leq l`$ (hence $`i \leq k`$). The semantics is defined by recursion with four base cases, for which we describe the algorithmic characterization (the other cases are recursive calls splitting and rewriting the input intervals in order to match or to get closer to the base cases). Consider $`\aut\in\fa`$, $`i,l\in\nint\cup\{-\infty\}`$, $`j,k\in\nint\cup\{+\infty\}`$ (for the sake of readability we denote by $`\sqcup`$ the automata lub $`\lubfa`$, and by $`\sqcap`$ the glb $`\glbfa`$). The base cases are the following.

If $`i,j,l,k \in\mathbb{Z}`$ (first row, first column of Tab. 1) we have to compute the language of all the substrings between an initial index in $`[i,j]`$ and a final index in $`[l,k]`$, namely $`\substringl{\lang(\aut)}{[i,j]}{[l,k]}`$. Here we abuse notation by denoting with $`\mbox{\sc Ss}`$ also the additive lift to languages and to sets of indexes, $`\mbox{\sc Ss}:\wp(\Sigma^*) \times \wp(\mathbb{Z}) \times \wp(\mathbb{Z}) \rightarrow \wp(\Sigma^*)`$, defined as $`\substringl{\lin}{I}{J} = \sset{\substringl{\sigma}{i}{j}}{\sigma\in\lin, i \in I, j \in J}`$. For example, let $`\lin = \{a\}^* \cup \{hello, bc\}`$; the set of its substrings from 1 to 3 is $`\substringl{\lin}{[1,1]}{[3,3]} = \{\epsilon, a, aa, el, c\}`$. The automaton accepting this language is computed by the operator

\substringa{\aut}{[i,j]}{[l,k]}\defi\bigsqcup_{a\in[i,j],b\in[l,k]}\Big(\big(\arightquotient{\aisuff{\aut}{a}}{\aisuff{\aut}{b}}\sqcap \minimize(\Sigma^{b-a})\big)\sqcup \big(\aisuff{\aut}{a} \sqcap \minimize(\Sigma^{<b-a})\big)\Big)

When both intervals correspond to $`[-\infty, +\infty]`$, the result is the automaton of all possible factors of $`\aut`$ (last row, last column), i.e., $`\afactors{\aut}`$;

If the interval $`[i,j]`$ of initial indexes is finite and the interval of final indexes is unbounded, i.e., $`[l, +\infty]`$ (first row, third column), we have to compute the automaton recognizing the following language

\substringlright{\lang(\aut)}{[i,j]}{l}\defi\bigcup_{a\in[i,j]}\sset{\substringl{\sigma}{a}{k}}{\sigma \in\lang(\aut),\ k \geq l}

i.e., all the strings between a finite interval of initial indexes and an unbounded final index. The automaton accepting this language is computed by

\begin{equation*}
\substringaright{\aut}{[i,j]}{l}\defi\bigsqcup_{a\in[i,j]}\arightquotient{\aisuff{\aut}{a}}{\asuff{\aisuff{\aut}{l}}}
\end{equation*}

The abstract semantics returns the least upper bound of all the automata of substrings from $`a`$ in $`[i,j]`$ to an unbounded index greater than or equal to $`l`$;

When both intervals are unbounded ($`[i, +\infty]`$ and $`[l, +\infty]`$, third row, third column of Tab. 1), we split the language to accept. In particular, we compute the substrings between $`[i, l]`$ and $`[l, +\infty]`$ (this is handled by the previous case), and the automaton recognizing the language of all substrings with both initial and final index greater than or equal to $`l`$, i.e., the language $`\substringlr{\lang(\aut)}{l}`$ $`\defi \sset{\substringl{\sigma}{a}{b}}{\sigma \in \lang(\aut),\ a,b \geq l}`$. This latter set is computed by the algorithm $`\substringalr{\aut}{l}\defi\afactors{\aisuff{\aut}{l}}`$.
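To make the languages targeted by the base cases concrete, the following sketch computes the lifted substring operation on finite languages only, using plain sets of strings instead of automata. It relies on the JavaScript built-in `substring`, which treats negative indexes as zero, swaps inverted indexes and clips to the string length. The names `liftSs` and `liftSsRight` are hypothetical, and the sketch illustrates the specification of the first and third base cases, not the automata algorithms.

```javascript
// Sketch of the lifted substring semantics on finite languages.
// Assumption: languages are finite Sets of strings, intervals are [lo, hi].

// Finite intervals: all substrings with initial index in [i, j]
// and final index in [l, k].
function liftSs(lang, [i, j], [l, k]) {
  const out = new Set();
  for (const s of lang) {
    for (let a = i; a <= j; a++) {
      for (let b = l; b <= k; b++) out.add(s.substring(a, b));
    }
  }
  return out;
}

// Unbounded final index [l, +inf]: final indexes past the longest string
// are clipped, so bounding them by the maximal length loses nothing.
function liftSsRight(lang, [i, j], l) {
  const longest = Math.max(0, ...[...lang].map((s) => s.length));
  return liftSs(lang, [i, j], [l, Math.max(l, longest)]);
}

// Finite sample of {a}* together with hello and bc, substrings from 1 to 3.
const sample = new Set(["", "a", "aa", "aaa", "hello", "bc"]);
const bounded = liftSs(sample, [1, 1], [3, 3]);

// Substrings of {lang, hello} from index 1 to any index from 3 onwards.
const unbounded = liftSsRight(new Set(["lang", "hello"]), [1, 1], 3);
```

On these inputs `bounded` contains the empty string, `a`, `aa`, `el` and `c`, and `unbounded` contains `an`, `ang`, `el`, `ell` and `ello`, in agreement with the examples reported in this section.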

Here we show the table only for the case $`i,j\leq l \ (\mbox{and thus } i \leq k)`$. The few remaining cases are not reported for space limitations; they are handled analogously to those in Tab. 1. In Fig. 12 we report an example obtained by applying the rules in the table.

| $`i,j\leq l\ (i\leq k)`$ | $`l,k \in \mathbb{Z}`$ | $`l=-\infty,\ k \in \mathbb{Z}`$ | $`l \in \mathbb{Z},\ k=+\infty`$ | $`l=-\infty,\ k=+\infty`$ |
|---|---|---|---|---|
| $`i,j \in \mathbb{Z}`$ | $`\substringa{\aut}{[i,j]}{[l,k]}`$ | $`\asubstring{\aut}{[i,j]}{[0,k]}`$ | $`\substringaright{\aut}{[i,j]}{l}`$ | $`\asubstring{\aut}{[i,j]}{[0,+\infty]}`$ |
| $`i=-\infty,\ j \in \mathbb{Z}`$ | $`\asubstring{\aut}{[0,j]}{[l,k]}`$ | $`\asubstring{\aut}{[0,j]}{[0,k]}`$ | $`\asubstring{\aut}{[0,j]}{[l,+\infty]}`$ | $`\asubstring{\aut}{[0,j]}{[0,+\infty]}`$ |
| $`i \in \mathbb{Z},\ j=+\infty`$ | $`\:\sqcup\:\asubstring{\aut}{[i,k]}{[l,k]}`$ | $`\asubstring{\aut}{[i,+\infty]}{[0,k]}`$ | $`\substringaright{\aut}{[i,l]}{l}\sqcup\substringalr{\aut}{l}`$ | $`\asubstring{\aut}{[i,+\infty]}{[0,+\infty]}`$ |
| $`i=-\infty,\ j=+\infty`$ | $`\asubstring{\aut}{[0,+\infty]}{[l,k]}`$ | $`\asubstring{\aut}{[0,+\infty]}{[0,k]}`$ | $`\asubstring{\aut}{[0,+\infty]}{[l,+\infty]}`$ | $`\afactors{\aut}`$ |

Definition of $`\mathsf{SS}^\sharp`$ when $`i,j\leq l \ (\mbox{and thus } i \leq k)`$

For each $`\aut \in \fa, I, J \in \interval.\:\asubstring{\aut}{I}{J}`$ performs at most three recursive calls, before reaching a base case.

$`\mathsf{SS}^\sharp`$ is sound and complete: $`\forall \aut \in \fa, I, J \in \interval.\:\substringl{\lang(\aut)}{I}{J} = \lang(\asubstring{\aut}{I}{J})`$.

(a) $\aut$, $\lang(\aut) = \{lang, hello\}$. (b) $\aut' = \asubstring{\aut}{[1,1]}{[3,+\infty]}$, $\lang(\aut') = \{an, ang, el, ell, ello\}$.

  1. We consider as DFAs also those FAs that are not complete, namely those in which some pair $`(q,a)`$ ($`q\in Q`$, $`a\in\Sigma`$) has no outgoing transition. They can be easily transformed into a complete DFA by adding a sink state receiving all the missing transitions. ↩︎

  2. Note that some works have studied automatic procedures to compute, given an input language $`L`$, a regular cover of $`L`$ (i.e., an automaton whose language contains $`L`$). In particular, regular covers have been studied that guarantee that the automaton obtained is the best w.r.t. a minimal relation (but not minimum). ↩︎

  3. Available at www.github.com/SPY-Lab/fsa and the $`\mujs`$ static analyzer at www.github.com/SPY-Lab/mu-js ↩︎