The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models’ abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark’s effectiveness. All models struggle with following sequences of instructions, pointing to a notable lack of robustness in today’s language models.
💡 Research Summary
The paper introduces the Sequential Instruction Following (SIFo) benchmark, a novel evaluation suite designed to measure large language models’ (LLMs) ability to execute multiple, inter‑dependent instructions in a coherent sequence. Existing instruction‑following benchmarks largely focus on single‑step commands or on multi‑step tasks where the steps are loosely coupled, making it difficult to assess whether a model truly maintains the logical chain of operations. SIFo addresses three core challenges: limited coherence between instructions, positional bias (the order of instructions influencing performance), and the lack of objectively verifiable tasks. By constructing instruction sequences where each step’s success depends on the output of the previous step, the benchmark allows evaluation by checking only the final instruction, thereby eliminating the need for exhaustive intermediate verification and reducing evaluation bias.
SIFo comprises four objectively verifiable tasks, each probing a different facet of sequential instruction following:
- Text Modification (TM) – Models must perform lexical operations (insertion, replacement, deletion) on a short Wikipedia‑derived context. Instructions are generated by randomly selecting named entities or the most frequent token, ensuring that each operation builds on the previous modifications.
- Question Answering (QA) – A two‑stage process where the model first extracts an answer from a context, then revises the context by replacing the extracted knowledge, and finally answers a follow‑up question based on the revised context. This tests knowledge extraction, context updating, and re‑use of revised information.
- Mathematics (Math) – A chain of arithmetic problems where each subsequent calculation relies on the result of the previous one (e.g., “Harry slept 9 h; James slept 2/3 of that; how many more hours did Harry sleep?”). This evaluates step‑wise numerical reasoning and the ability to carry forward intermediate results.
- Security Rules (Sec) – A password‑protected command sequence that requires the model to respect access control rules when adding or modifying items in a fruit list. Incorrect passwords must be ignored, testing compliance with external procedural constraints.
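The dependency structure of the Math task can be made concrete with a minimal sketch. The numbers follow the sleeping example above; the step layout is illustrative only, not the benchmark's actual data format:

```python
# Each step consumes the previous step's result, so an error at any
# point propagates to the final answer -- the property SIFo exploits
# to verify an entire chain by checking only the last output.

def run_chain():
    harry = 9              # Step 1: Harry slept 9 hours
    james = harry * 2 / 3  # Step 2: James slept 2/3 of that -> 6.0 hours
    diff = harry - james   # Step 3: how many more hours Harry slept
    return diff

print(run_chain())  # 3.0
```

Because step 3 depends on step 2, which depends on step 1, a correct final answer implies every intermediate result was correct.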
The authors evaluate a broad spectrum of state‑of‑the‑art LLMs, including open‑source models (Mistral, Llama 2, Llama 3, DeepSeek, Qwen 2) and closed‑source systems (Claude‑3, GPT‑4). Results show a clear scaling trend: larger, newer models achieve higher overall accuracy across all tasks. However, performance consistently degrades in later steps of the instruction chain, even for the most powerful models. This indicates a systemic weakness in maintaining long‑range instruction dependencies.
A preliminary experiment on positional bias, using a parallel‑constraint text‑generation dataset (six constraints per sample, permuted across 6! orders), reveals that model accuracy varies significantly with the position of a given constraint. The bias patterns differ across models and constraint types, confirming that positional effects observed in other domains (e.g., multi‑document QA, retrieval) also manifest in instruction‑following scenarios.
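Permuting six constraints across every ordering, as this probe does, yields 720 variants per sample. A sketch of the setup, with placeholder constraint labels standing in for the dataset's actual constraints:

```python
from itertools import permutations

# Hypothetical constraint labels; the real dataset's constraints differ.
constraints = ["c1", "c2", "c3", "c4", "c5", "c6"]

# All 6! = 720 distinct instruction orders for one sample.
orders = list(permutations(constraints))
print(len(orders))  # 720

# Accuracy can then be grouped by the position a given constraint
# occupies, revealing whether placement alone shifts performance.
```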
Key insights from the study include:
- Sequential coherence matters: By tying each instruction to the previous output, SIFo forces models to preserve context across steps, exposing weaknesses hidden in loosely coupled benchmarks.
- Scaling helps but does not solve the problem: Larger models perform better overall, yet all models struggle with later‑stage instructions, suggesting that architectural or training improvements (e.g., better memory mechanisms, chain‑of‑thought prompting) are needed.
- Positional bias persists: Even when instructions are meant to be order‑independent, the placement of constraints influences outcomes, highlighting the need for evaluation protocols that neutralize such bias.
- Objective verification is feasible: By designing tasks where the final instruction’s correctness unambiguously reflects the entire chain’s success, the benchmark eliminates reliance on costly human judgments or secondary evaluator LLMs.
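The final-state verification idea can be sketched with a toy version of the Security Rules task. The password and commands below are invented for illustration; the benchmark's actual rules and data differ:

```python
# A command takes effect only if it carries the correct password;
# verification then inspects only the final list, which is correct
# iff every step in the chain was handled properly.

PASSWORD = "hunter2"  # hypothetical password for this sketch

def apply_commands(commands):
    fruits = []
    for password, action, item in commands:
        if password != PASSWORD:
            continue  # wrong password: the command must be ignored
        if action == "add":
            fruits.append(item)
        elif action == "remove" and item in fruits:
            fruits.remove(item)
    return fruits

final = apply_commands([
    ("hunter2", "add", "apple"),
    ("letmein", "add", "banana"),   # rejected: wrong password
    ("hunter2", "add", "cherry"),
    ("hunter2", "remove", "apple"),
])
print(final)  # ['cherry']
```

Comparing a model's final list against this reference state scores the whole sequence with a single equality check, with no human judge or evaluator LLM involved.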
The authors release the benchmark code and data (https://github.com/shin-ee-chen/SIFo) to facilitate reproducibility and future extensions. They propose future work on more complex logical flows, multimodal instruction chains, and integration with reinforcement learning from human feedback to improve long‑range instruction fidelity. Overall, SIFo provides a rigorous, scalable, and reproducible framework for assessing a critical yet under‑explored dimension of LLM capability: robust, sequential instruction following.