Non-redundant random generation algorithms for weighted context-free languages
We address the non-redundant random generation of $k$ words of length $n$ in a context-free language. Additionally, we want to avoid a predefined set of words. We study a rejection-based approach, whose worst-case time complexity is shown to grow exponentially with $k$ for some specifications and in the limit case of a coupon collector. We propose two algorithms respectively based on the recursive method and on an unranking approach. We show how careful implementations of these algorithms allow for a non-redundant generation of $k$ words of length $n$ in $\mathcal{O}(k\cdot n\cdot \log{n})$ arithmetic operations, after a precomputation of $\Theta(n)$ numbers. The overall complexity is therefore dominated by the generation of $k$ words, and the non-redundancy comes at a negligible cost.
💡 Research Summary
The paper tackles the problem of generating k distinct random words of a fixed length n from a weighted context‑free language (WCFL), while also avoiding a predefined set of forbidden words. Traditional approaches rely on rejection sampling: a word is drawn according to the weighted distribution and discarded if it has already been generated or belongs to the forbidden set. The authors first demonstrate that this naïve method can have an expected running time that grows exponentially with k for certain weight configurations, and that in the limit case it reduces to the classic coupon‑collector problem: when k approaches the number of words of length n, the expected number of draws grows as Θ(k·log k). This makes rejection sampling impractical for large k or highly skewed weight distributions.
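The naïve rejection strategy described above can be sketched as follows. This is a minimal illustration, not the paper's code: `sample_word` stands in for any weighted sampler over the language, and the final lines exercise the coupon‑collector limit case, where the sampler is uniform over exactly k words.

```python
import random

def rejection_sample_distinct(sample_word, k, forbidden=frozenset(), max_draws=10**6):
    """Naive rejection sampling: keep drawing until k distinct, non-forbidden
    words have been collected. Returns the words and the number of draws.
    `sample_word` is a hypothetical placeholder for a weighted sampler."""
    seen, draws = set(), 0
    while len(seen) < k:
        if draws >= max_draws:
            raise RuntimeError("too many rejections")
        draws += 1
        w = sample_word()
        if w not in seen and w not in forbidden:
            seen.add(w)
    return seen, draws

# Coupon-collector limit: the language has exactly k words, drawn uniformly,
# so the expected number of draws is about k * H_k = Theta(k log k).
words = [f"w{i}" for i in range(100)]
collected, draws = rejection_sample_distinct(lambda: random.choice(words), k=100)
assert len(collected) == 100 and draws >= 100
```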
To overcome these limitations, the authors propose two constructive algorithms that guarantee non‑redundant generation with provably low complexity.
- **Recursive Method**
  The algorithm begins with a preprocessing phase that computes, for every non‑terminal X and every length i (0 ≤ i ≤ n), the total weight W_X(i) of all derivation trees rooted at X that yield a word of length i. This dynamic‑programming step runs in time linear in n for a fixed grammar and stores its results in cumulative‑weight tables. During generation, the algorithm expands the start symbol top‑down, selecting a production (and, where applicable, a split point) for the current non‑terminal with probability proportional to the weights read from W_X(i). After a complete word is built, it is inserted into a hash set, and a duplicate triggers an immediate retry. Because each selection is performed by binary search over a cumulative array, the cost per choice is O(log n), leading to an overall per‑word cost of O(n·log n) arithmetic operations. Forbidden words are handled either by subtracting their weight from the tables or by rejecting them on the fly, without affecting the asymptotic bound.
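  The recursive method can be illustrated on a toy grammar. The sketch below uses the unweighted Dyck grammar S → (S)S | ε as a stand‑in for a general WCFL (weights would multiply into the table entries in the same way); it is an assumption‑laden illustration, not the paper's implementation. The table `W` plays the role of W_S(i), and the split point of each rule application is chosen by binary search over cumulative weights, which is where the O(log n) per‑choice cost comes from.

```python
import bisect
import random

def weight_table(n):
    """W[i] = number of Dyck words of length i (total weight in the weighted case)."""
    W = [0] * (n + 1)
    W[0] = 1
    for i in range(2, n + 1, 2):
        W[i] = sum(W[j] * W[i - 2 - j] for j in range(0, i - 1, 2))
    return W

def generate(i, W, rng):
    """Draw a word of length i with probability proportional to its weight,
    using the recursive method: choose the split point j of ( S_j ) S_{i-2-j}
    by binary search over cumulative weights, then recurse."""
    if i == 0:
        return ""
    cum, total = [], 0
    for j in range(0, i - 1, 2):
        total += W[j] * W[i - 2 - j]
        cum.append(total)
    r = rng.randrange(total)           # a real-valued draw in the weighted case
    idx = bisect.bisect_right(cum, r)  # binary search: O(log n) per choice
    j = 2 * idx
    return "(" + generate(j, W, rng) + ")" + generate(i - 2 - j, W, rng)

rng = random.Random(0)
n = 10
W = weight_table(n)
word = generate(n, W, rng)
assert len(word) == n
```

Non‑redundancy would then be obtained on top of this sampler, e.g. by recording generated words in a hash set as described above.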
- **Unranking Method**
  This approach maps every possible derivation tree of length n to a unique rank in the interval [0, N), where N is the total number (or, in the weighted case, total weight) of words of length n. Drawing k distinct ranks and unranking each of them then yields k distinct words directly, and the ranks of forbidden words can simply be excluded.
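  A minimal unranking sketch, again on the illustrative Dyck grammar rather than a general WCFL: `unrank` inverts the rank bijection by peeling off blocks of ranks per split point, and non‑redundancy falls out of drawing k distinct ranks up front.

```python
import random

def weight_table(n):
    """W[i] = number of Dyck words of length i."""
    W = [0] * (n + 1)
    W[0] = 1
    for i in range(2, n + 1, 2):
        W[i] = sum(W[j] * W[i - 2 - j] for j in range(0, i - 1, 2))
    return W

def unrank(i, r, W):
    """Return the unique word of length i whose rank is r (0 <= r < W[i]),
    using the first-return factorization ( A ) B of a Dyck word."""
    if i == 0:
        return ""
    for j in range(0, i - 1, 2):
        block = W[j] * W[i - 2 - j]
        if r < block:
            inner, rest = divmod(r, W[i - 2 - j])
            return "(" + unrank(j, inner, W) + ")" + unrank(i - 2 - j, rest, W)
        r -= block
    raise ValueError("rank out of range")

n, k = 10, 5
W = weight_table(n)
ranks = random.sample(range(W[n]), k)       # k distinct ranks ...
distinct_words = {unrank(n, r, W) for r in ranks}  # ... give k distinct words
assert len(distinct_words) == k
```

Because ranking is a bijection, distinct ranks are guaranteed to decode to distinct words, so no retry loop is needed.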