Agnostic Language Identification and Generation
Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we drop the realizability assumption entirely and impose no restrictions on the distribution of the input data. We propose objectives for studying both language identification and generation in this more general “agnostic” setup. For both problems, we obtain new characterizations and nearly tight rates.
💡 Research Summary
The paper revisits two fundamental problems in formal language learning—language identification (recovering the exact language that generated a set of positive examples) and language generation (producing a new valid example from the same language). Prior work has largely assumed a realizability condition: the data are drawn i.i.d. (or adversarially) from a distribution whose support coincides exactly with some language in a known collection C = {L₁, L₂, …}. Under this assumption, online identification is tractable for certain language families, while generation is tractable for any countable collection.
The authors drop the realizability assumption entirely and study an agnostic setting where the data come from an arbitrary distribution D over the universe U, with no guarantee that supp(D) belongs to C. To evaluate algorithms in this more general regime they introduce two error measures:
- IdErr(A, D, C, n) – the expected excess probability that a string drawn from D lies outside the language output by algorithm A, relative to the minimal possible error among all languages in C. Formally, IdErr(A, D, C, n) = E_{S∼Dⁿ, r}[ Pr_{x∼D}[x ∉ A(S, r)] ] − min_{L∈C} Pr_{x∼D}[x ∉ L], where r denotes the algorithm's internal randomness.
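To make the IdErr definition concrete, here is a small toy sketch (not from the paper): the universe, the distribution D, the collection C, and the empirical-risk-style identifier below are all invented for illustration. It estimates the excess identification error of a baseline algorithm by Monte-Carlo simulation.

```python
import random

# Toy universe and an arbitrary distribution D over it.
# D need not be supported on any language in C (the agnostic setting).
U = ["a", "b", "c", "d"]
D = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# A hypothetical candidate collection C of languages (sets of strings).
C = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]

def miss_prob(L):
    """Pr_{x~D}[x not in L]: probability a fresh draw falls outside L."""
    return sum(p for x, p in D.items() if x not in L)

def best_error():
    """The agnostic benchmark: min_{L in C} Pr_{x~D}[x not in L]."""
    return min(miss_prob(L) for L in C)

def erm_identifier(sample):
    """A natural baseline A: pick the language in C missing the fewest sample points."""
    return min(C, key=lambda L: sum(1 for x in sample if x not in L))

def estimate_id_err(n, trials=2000, seed=0):
    """Monte-Carlo estimate of IdErr(A, D, C, n):
    E_{S~D^n}[ miss_prob(A(S)) ] - min_{L in C} miss_prob(L)."""
    rng = random.Random(seed)
    xs, ps = list(D.keys()), list(D.values())
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(xs, weights=ps, k=n)
        total += miss_prob(erm_identifier(sample))
    return total / trials - best_error()
```

In this toy instance the benchmark error is 0.1 (achieved by {a, b, c}), and the baseline's excess error shrinks as n grows, since larger samples are more likely to contain a "c" and steer it toward the better language.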