Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction
Large language models achieve strong performance on many tasks, but their training makes it hard to see which properties of the input support efficient linguistic rule learning. We ask how three cognitively inspired principles of input design support sample-efficient linguistic rule induction: analogical structure, contrastive distractors, and minimal contextual cues. We also ask how models trained under these principles compare to LLMs on the same controlled tasks. We implement these principles in structured sentence-completion tasks that test English verb alternations. Lightweight models trained on hundreds to one thousand such examples learn the alternation rules with high F1 on these tasks. Ablation studies show that analogical organisation is the main driver of sample efficiency, while contrastive distractors and minimal context contribute further gains. We also evaluate zero- and few-shot LLMs on the same tasks. In this controlled setting, the lightweight models reach higher F1 with far fewer task-specific examples. We treat this contrast as a comparison between learning regimes rather than a general verdict on LLMs. Our results show that careful input organisation supports sample-efficient learning of linguistic rules and reveal distinct learning signatures for trained lightweight models and prompted LLMs.
💡 Research Summary
This paper investigates how the organization of training inputs can dramatically improve sample‑efficient learning of linguistic rules, focusing on English verb alternations. Drawing on three cognitively inspired principles—Analogical Structure, Minimal Contextual Cues, and Contrastive Distractors—the authors construct a set of structured sentence‑completion puzzles based on Blackbird Language Matrices. Each puzzle presents a "Context", consisting of two parallel paradigms that encode an analogical mapping (e.g., Man : Dice :: Explorer : Mat) together with subtle semantic scaffolding cues (e.g., "did it", "was on the floor"), alongside an "Answer Set" containing one correct continuation and six systematically designed distractors. The distractors are organized into a hierarchical taxonomy that isolates specific error types (role violations, structural violations, and paradigm violations), allowing precise measurement of which components of the rule a model has acquired.
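The puzzle structure described above can be sketched as a small data model. This is a hypothetical illustration, not the paper's actual schema: the class names, label strings, and the example sentences are assumptions chosen to mirror the Context/Answer Set description.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sentence: str
    # One of: "correct", "role_violation", "structural_violation", "paradigm_violation"
    label: str

@dataclass
class Puzzle:
    context: list      # two parallel paradigms, each a list of sentences
    answer_set: list   # one correct Candidate plus six labelled distractors

    def error_profile(self, chosen: Candidate) -> str:
        """Map a model's choice to the rule component it reflects,
        which is what the distractor taxonomy makes measurable."""
        return chosen.label

# Illustrative instance built from the Man : Dice :: Explorer : Mat example;
# only two of the seven answer-set members are shown.
puzzle = Puzzle(
    context=[
        ["The man rolled the dice.", "The dice rolled."],        # paradigm 1
        ["The explorer rolled the mat.", "The mat rolled."],     # paradigm 2 (analogical mapping)
    ],
    answer_set=[
        Candidate("The mat was on the floor.", "correct"),
        Candidate("The floor rolled the mat.", "role_violation"),
    ],
)
```

Because each distractor carries its violation type, scoring a model's selection immediately reveals which part of the alternation rule was missed.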
Data generation combines expert‑crafted seed sentences with automated template instantiation, yielding thousands of examples across two main configurations: Type I (same verb across paradigms) and Type II (different verbs across paradigms). A cross‑phenomenon validation uses bake‑class verbs (unspecified‑object alternations) to test generalizability beyond roll‑verbs. The final dataset, BLM‑ROLL‑BAKEE, is publicly released.
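The Type I / Type II distinction can be sketched as a template-instantiation step. The templates, lexicon, and frame layout here are illustrative assumptions in the spirit of the paper's generation pipeline, not its actual code.

```python
def build_paradigms(verbs, puzzle_type):
    """Return the two parallel paradigms for one puzzle.

    Type I reuses the same verb in both paradigms; Type II uses a
    different verb from the same alternation class in each paradigm.
    """
    v1 = verbs[0]
    v2 = verbs[0] if puzzle_type == "I" else verbs[1]
    return [
        # transitive (causative) frame, then intransitive (inchoative) frame
        [f"The man {v1} the dice.", f"The dice {v1}."],
        [f"The explorer {v2} the mat.", f"The mat {v2}."],
    ]

type_i = build_paradigms(["rolled", "bounced"], "I")
type_ii = build_paradigms(["rolled", "bounced"], "II")
```

Swapping in a bake-class lexicon (unspecified-object alternations) instead of roll-verbs would yield the cross-phenomenon validation set in the same way.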
Experiments compare lightweight models (BERT‑base embeddings plus either a CNN or a feed‑forward network, ≈0.5 M parameters) against several large language models (GPT‑3, DeepSeek‑R1, LLaMA‑3, Qwen‑3, etc.) in zero‑shot and few‑shot prompting regimes. Lightweight models are trained on as few as 100–1 000 structured examples; even with only 100 examples they achieve F1 ≈ 0.95, surpassing the best zero‑shot LLM (F1 ≈ 0.87). An extensive ablation study manipulates the input organization: (i) “Base” (full analogical organization), (ii) “Shuffled” (same content, random order), (iii) “No Analogy” (first paradigm removed), (iv) “No Soft Cue” (contextual hints removed), and (v) “Transposed” (paradigm positions swapped). Results show that removing analogical structure causes the largest performance drop (F1 ≈ 0.78), while eliminating minimal cues or contrastive distractors yields smaller but still measurable declines (≈ 0.02–0.04). This demonstrates that analogical organization is the primary driver of sample efficiency, with the other two principles providing complementary gains.
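The input-organization ablations above can be sketched as transforms over a puzzle context. This is a minimal reconstruction under stated assumptions; "No Soft Cue", which edits sentence text rather than sentence order, is omitted here.

```python
import random

def ablate(context, variant, seed=0):
    """Apply one input-organization ablation to a puzzle context
    (a list of two paradigms, each a list of sentences)."""
    if variant == "Base":
        return [list(p) for p in context]           # full analogical organization
    if variant == "Shuffled":
        flat = [s for paradigm in context for s in paradigm]
        random.Random(seed).shuffle(flat)           # same content, random order
        return [flat]
    if variant == "No Analogy":
        return [list(p) for p in context[1:]]       # drop the first paradigm
    if variant == "Transposed":
        return [list(p) for p in reversed(context)] # swap paradigm positions
    raise ValueError(f"unknown variant: {variant}")
```

Training the same lightweight model on each variant's output, with everything else held fixed, isolates the contribution of each organizational principle.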
LLM evaluation follows a puzzle‑solving prompt that first asks the model to generate a provisional answer, then to compare it against the answer set, and finally to output the selected sentence along with the hypotheses considered. Both chain‑of‑thought (CoT) and non‑CoT variants are tested. Even with five‑shot prompting, the strongest LLMs reach at most F1 ≈ 0.90, still below the lightweight models trained on far fewer task‑specific examples. The authors argue that this reflects a fundamental difference in learning regimes: parameter‑updating models can internalize the structured analogical patterns efficiently, whereas prompting‑only LLMs rely on pre‑trained knowledge and cannot exploit the same organized signal without weight updates.
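The two-step prompting protocol can be assembled as follows. The exact wording is a hypothetical reconstruction of the generate-then-compare procedure described above, not the paper's prompt.

```python
def build_prompt(context, answer_set, cot=True):
    """Build a puzzle-solving prompt: first elicit a provisional answer,
    then a selection from the answer set plus the hypotheses considered."""
    lines = ["Context:"]
    for paradigm in context:
        lines.extend(paradigm)
    lines.append("")
    lines.append("Step 1: Write the sentence you expect to continue the pattern.")
    lines.append("Step 2: Compare it against the candidates below, then output the")
    lines.append("selected sentence together with the hypotheses you considered.")
    if cot:
        # Chain-of-thought variant adds an explicit reasoning instruction.
        lines.append("Reason step by step before giving your final selection.")
    lines.append("Candidates:")
    lines.extend(f"{i}. {s}" for i, s in enumerate(answer_set, 1))
    return "\n".join(lines)

prompt = build_prompt(
    [["The man rolled the dice.", "The dice rolled."]],
    ["The mat rolled.", "The mat was on the floor."],
    cot=False,
)
```

Few-shot prompting would simply prepend solved puzzles in the same format before the target puzzle.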
The paper makes three contributions: (1) a concrete methodology for embedding analogical reasoning, minimal semantic scaffolding, and contrastive learning directly into the training data; (2) empirical evidence that such organization enables lightweight models to learn complex syntactic‑semantic mappings with dramatically fewer examples; (3) a comparative analysis showing distinct learning signatures between fine‑tuned lightweight models and prompted LLMs, emphasizing that high performance of LLMs does not automatically imply data‑efficient rule acquisition.
Limitations include the focus on English verb alternations and a fixed set of distractor types; future work could extend the framework to other linguistic phenomena, languages, and more diverse error taxonomies, as well as explore integration with parameter‑efficient fine‑tuning methods (e.g., adapters, LoRA). Overall, the study highlights that careful input design—mirroring human analogical reasoning—can unlock sample‑efficient linguistic learning, offering a complementary path to scaling model size for robust language understanding.