Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning
This paper introduces the Primender sequence, a novel integer sequence defined by a hybrid rule that combines classical primality with modular digit-based conditions. Specifically, a number n is included in the sequence if it is prime or if some suffix of its decimal representation, of any length, is prime. In other words, the sequence contains exactly those numbers that are prime or have at least one prime suffix. The resulting sequence exhibits a deterministic yet non-trivial structure, blending number-theoretic properties with symbolic patterning. We propose the Primender sequence as a benchmark for evaluating the symbolic reasoning capabilities of Large Language Models (LLMs). The study is motivated by the need for interpretable, rule-based testbeds that can assess an LLM’s ability to infer hidden rules, validate mathematical hypotheses, and generalize symbolic logic at scale. A key hypothesis explored is: whenever a number in the Primender sequence is exactly one more than the largest prime less than or equal to it, the difference between it and the previous number in the sequence is also 1. We design a structured prompt and evaluation framework to test this hypothesis across multiple state-of-the-art LLMs, including ChatGPT, Copilot, DeepSeek, Gemini, Grok, and LLaMA. The models are tasked with identifying the underlying rule, validating the hypothesis, and generating the next 100,000 terms of the sequence. Comparative metrics such as rule inference accuracy, hypothesis evaluation, sequence validity, and symbolic explanation quality are used to assess model performance. This work contributes a novel mathematical construct and a reproducible methodology for benchmarking LLMs in symbolic reasoning, hypothesis testing, and scalable pattern generalization, bridging the domains of number theory, artificial intelligence, and software engineering.
💡 Research Summary
The paper introduces the “Primender sequence,” an integer sequence defined by a hybrid rule: a positive integer n belongs to the sequence if it is either a prime number or if any suffix of its decimal representation (of any length) is a prime. In other words, the sequence contains all primes and all numbers that end with a prime suffix, such as 113 (suffix “13” is prime) or 207 (suffix “7” is prime). This definition blends a classical number‑theoretic property (primality) with a digit‑pattern property, producing a deterministic yet non‑trivial distribution that is denser than the pure prime sequence but still irregular.
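The membership rule as summarized above can be sketched in a few lines; this is a minimal stdlib-only illustration, not the paper's reference implementation. Note that since the whole number is itself a full-length suffix, the primality clause is subsumed by the suffix check:

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for small illustrations."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def in_primender(n: int) -> bool:
    """True iff some decimal suffix of n (including n itself) is prime."""
    s = str(n)
    return any(is_prime(int(s[i:])) for i in range(len(s)))

# The examples from the text: 113 (suffix "13") and 207 (suffix "7") qualify;
# 100 has no prime suffix and does not.
print(in_primender(113), in_primender(207), in_primender(100))  # True True False
```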
The authors propose the Primender sequence as a benchmark for evaluating large language models (LLMs) on three core symbolic‑reasoning tasks: (1) rule inference – can the model discover the underlying definition from a few examples; (2) hypothesis testing – can the model evaluate a conjecture about the sequence; and (3) large‑scale pattern generalization – can the model generate a long continuation (100 000 terms) that obeys the rule. The central conjecture examined is: “Whenever a term n in the Primender sequence equals the largest prime p ≤ n plus one (i.e., n = p + 1), the difference between n and the preceding term in the sequence is also 1.” Put differently, the authors claim that a ‘prime‑plus‑one’ occurrence forces a consecutive step of size one in the sequence.
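The conjecture is mechanically checkable over an initial range. The sketch below (helper functions repeated so the block runs standalone; the bound of 10,000 is an arbitrary choice) scans consecutive terms and collects any violation:

```python
def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def in_seq(n):
    """Primender membership: some decimal suffix of n is prime."""
    s = str(n)
    return any(is_prime(int(s[i:])) for i in range(len(s)))

terms = [n for n in range(2, 10_000) if in_seq(n)]

violations = []
for i in range(1, len(terms)):
    n = terms[i]
    # Premise n = p + 1 for the largest prime p <= n means exactly:
    # n - 1 is prime and n itself is not.
    if is_prime(n - 1) and not is_prime(n):
        if terms[i] - terms[i - 1] != 1:
            violations.append(n)
print(violations)  # [] -- no violations in this range
```

The empty result is expected: when the premise holds, n − 1 is prime and hence itself a term, so the step of size 1 is forced by the definition.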
To test this, the authors design a structured prompt pipeline. First, they give each LLM a short list of initial terms (e.g., 2, 3, 5, 7, 11, 13, 23, 31, 41, 113) together with the textual rule description. The model is asked to articulate the rule in its own words, identify any edge cases, and then answer a series of questions about the conjecture, providing logical justification and counter‑examples if applicable. Finally, the model is tasked with generating the next 100 000 terms. The generated list is automatically compared against a reference implementation of the Primender sequence; metrics recorded include rule‑inference accuracy, hypothesis‑evaluation correctness, sequence‑matching rate, and qualitative scores for explanatory clarity (human‑rated).
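The automated comparison step could be realized as a simple positional match rate against a reference generator; `reference_terms`, `match_rate`, and the toy "model output" below are illustrative stand-ins, not the paper's actual harness:

```python
def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def in_seq(n):
    s = str(n)
    return any(is_prime(int(s[i:])) for i in range(len(s)))

def reference_terms(count):
    """First `count` terms of the sequence, from the reference rule."""
    out, n = [], 1
    while len(out) < count:
        n += 1
        if in_seq(n):
            out.append(n)
    return out

def match_rate(model_terms, ref_terms):
    """Fraction of reference positions where the model's term agrees."""
    hits = sum(1 for a, b in zip(model_terms, ref_terms) if a == b)
    return hits / len(ref_terms)

ref = reference_terms(20)
model = ref[:15] + [0] * 5  # toy "model output": correct prefix, wrong tail
print(match_rate(model, ref))  # 0.75
```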
Six state‑of‑the‑art LLMs are evaluated: OpenAI’s ChatGPT, GitHub Copilot, DeepSeek, Google Gemini, Grok, and Meta’s LLaMA. Results show a clear stratification. ChatGPT and Gemini achieve the highest rule‑inference scores (≈98 % correct articulation) and correctly delimit the conjecture to “prime‑plus‑one intervals,” yielding a sequence‑matching accuracy of about 92 % for the 100 k‑term task. Copilot and Grok miss the suffix condition in several cases and treat the conjecture as universally true, resulting in lower matching rates (~68 %). DeepSeek and LLaMA fall in the middle, with decent rule identification but weaker explanatory depth.
The authors draw several conclusions. First, LLMs can infer hybrid numeric‑string rules when provided with clear examples, but they often overlook edge cases unless explicitly prompted. Second, hypothesis testing benefits from a two‑step approach: (a) ask the model to generate supporting examples, (b) ask it to search for counter‑examples, which improves logical rigor. Third, generating 100 k terms exceeds current token limits and is computationally expensive; a more realistic evaluation would sample a subset (e.g., 1 000 random positions) and verify correctness statistically.
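The sampled-verification idea in the third point might look like the following sketch; the function name `sampled_check`, the sample size, and the fixed seed are assumptions made for illustration:

```python
import random

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def in_seq(n):
    s = str(n)
    return any(is_prime(int(s[i:])) for i in range(len(s)))

def sampled_check(model_terms, ref_terms, k=1000, seed=0):
    """Estimate correctness by verifying k randomly sampled positions
    instead of the full list."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(ref_terms)), min(k, len(ref_terms)))
    hits = sum(model_terms[i] == ref_terms[i] for i in positions)
    return hits / len(positions)

# Reference list of the first 200 terms.
ref, n = [], 1
while len(ref) < 200:
    n += 1
    if in_seq(n):
        ref.append(n)

print(sampled_check(ref, ref, k=50))  # a perfect "model" scores 1.0
```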
Critically, the paper’s novelty claim is modest. The Primender sequence is closely related to existing OEIS entries that combine primality with digit‑suffix conditions, and the conjecture examined, while true as stated, holds only trivially: if a term n equals p + 1 for the largest prime p ≤ n, then n − 1 = p is prime and therefore itself a term, so the step of size 1 is forced by construction. Numbers such as 24 (24 = 23 + 1, but 24 is not in the sequence, since neither 24 nor its suffix 4 is prime) fail the premise rather than refute the implication. The experimental methodology also lacks control for model size and training data differences, making it difficult to attribute performance differences solely to reasoning ability. Moreover, the requirement for a model to output 100 k terms is impractical given token limits; most models truncate or hallucinate after a few thousand tokens, which inflates the error rate.
Future work suggested includes: (1) formal analysis of the Primender sequence’s density and distribution, possibly deriving asymptotic estimates for the frequency of prime‑suffix numbers; (2) reformulating the conjecture in a more general form (e.g., n = p + k) and proving or disproving it analytically; (3) integrating meta‑learning logs to trace how LLMs internalize the rule during inference; and (4) designing a staged generation‑verification pipeline that respects token budgets while still testing large‑scale pattern extrapolation.
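As a first empirical step toward item (1), densities up to successive powers of ten are easy to tabulate; the bounds below are arbitrary, and the asymptotics themselves are left to the proposed formal analysis:

```python
def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def in_seq(n):
    """Primender membership: some decimal suffix of n is prime."""
    s = str(n)
    return any(is_prime(int(s[i:])) for i in range(len(s)))

# Empirical density of Primender numbers up to 10, 100, 1000, 10000.
for k in range(1, 5):
    limit = 10 ** k
    count = sum(in_seq(n) for n in range(2, limit + 1))
    print(f"up to {limit}: {count} terms, density {count / limit:.3f}")
```

Already at small bounds the sequence is visibly denser than the primes (4 of the first 10 integers qualify, and exactly half of the first 100), consistent with the summary's remark that it is denser than the pure prime sequence.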
In summary, the paper contributes a hybrid integer sequence and a proof‑of‑concept benchmark for symbolic reasoning in LLMs. While the idea of blending number theory with digit‑pattern rules is intellectually appealing, the sequence’s originality, the conjecture’s universality, and the evaluation framework require further refinement. With the proposed methodological improvements, the Primender sequence could become a useful, reproducible testbed for assessing the logical and mathematical capabilities of next‑generation language models.