ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Reading time: 46 minutes
...

📝 Original Info

  • Title: ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
  • ArXiv ID: 2511.18450
  • Date: 2025-11-23
  • Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu

📝 Abstract

Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models (MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances, each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.

📄 Full Content

Spatial reasoning is a core component of artificial intelligence [1,2], with wide applications in robotics [3], autonomous driving [4], and geographic information systems [5]. Although multimodal large language models (MLLMs) demonstrate outstanding performance in various vision and language tasks [6,7], they face challenges in imagining spatial transformations and grasping spatial relationships in image and text spaces. Evaluating their spatial reasoning ability has become an important task.

Multi-step reasoning and constraints are critical yet underexplored areas in spatial intelligence. Current spatial reasoning benchmarks typically focus on understanding static images or simple scenes [8]. Some studies are dedicated to comparing and reasoning about spatial relationships between image pairs, but lack attention to continuous spatial transformations [9,10]. Some studies propose multi-step spatial reasoning but do not involve interaction with the environment and lack constraints found in real-world tasks [11]. These limitations indicate a current need for a new benchmark to more comprehensively evaluate the capabilities of MLLMs in complex spatial reasoning scenarios.

Origami art offers an ideal platform for evaluating complex spatial reasoning abilities [12]. Origami involves a sequence of ordered folding operations, where each step depends on the result of the previous one, embodying the essence of multi-step reasoning. Furthermore, the origami process is governed by explicit geometric constraints, such as the requirements that folds occur along straight lines and that the paper cannot be torn or separated; all origami operations are defined by strict mathematical constraints (Kawasaki's Theorem, the Huzita-Hatori axioms, etc.) [13,14]. The transformation from a two-dimensional crease pattern (CP diagram) through multiple folding steps to a three-dimensional folded shape requires strong spatial imagination and reasoning abilities. To bridge the gap left by existing benchmarks, this paper introduces the ORIGAMISPACE dataset and benchmark. The dataset contains 350 meticulously collected origami data instances, each including a CP diagram, its corresponding compiled flat pattern, illustrations of the complete folding process, and the final folded shape. The diversity and complexity of the data cover various origami types. We improve the existing origami compiler, enabling it to output detailed flattened diagrams that include crease locations and stacking relationships, support interactive simulation with MLLMs, and provide more comprehensive error feedback. Based on this dataset, we design four challenging evaluation tasks: pattern prediction, spatial relationship prediction, multi-step spatial reasoning, and end-to-end CP code generation, comprising 1,500 multiple-choice questions and 120 code generation questions. For the code generation task, we meticulously design a comprehensive evaluation strategy to measure the quality of the generated CP code across multiple dimensions.

The core advantages of ORIGAMISPACE lie in its authenticity (derived from real origami designs), multi-step reasoning characteristics (reflecting the inherent process of origami), and rigorous mathematical constraints (precisely verifiable through origami theorems). We evaluate the performance of various MLLMs on ORIGAMISPACE, and introduce environmental learning and reinforcement learning methods for the code generation task, which opens up new perspectives and effective avenues for assessing and enhancing the spatial reasoning abilities of MLLMs.

The main contributions of this paper include:

• We introduce ORIGAMISPACE, a dataset containing 350 high-quality origami data instances, and optimize the existing origami compiler, enabling it to provide more comprehensive feedback.

• We design four challenging tasks centered around spatial reasoning, comprising 1,500 multiple-choice questions and 120 CP code generation questions, making this the first benchmark to evaluate the multi-step spatial reasoning ability of MLLMs under mathematical constraints.

• We conduct a comprehensive evaluation of existing MLLMs, develop a complete interactive environment for the end-to-end CP code generation task, and explore environmental learning and reinforcement learning methods through this environment.

Folded Shape Image Different from the strictly compiled flat pattern, the folded shape image provides a direct, intuitive visualization of the final origami shape. It is typically a photograph or 3D rendering.

Folding Process The folding process refers to the multi-step sequence of transforming the original paper into the final shape. This folding process is gathered from various origami tutorials and cannot be represented in a standardized format, existing only as natural images.

We manually check and verify all data to ensure that 1) all CP diagrams can be compiled into the compiled flat pattern and correspond to the folded shape image; 2) the names of all origami data correspond to the folded shape image, with no potential for confusion (such as indistinguishable birds); and 3) all folding processes are feasible. In addition to this part of the data, we also collect 471 groups of data without intermediate folding processes for the subsequent training of the model.

The current origami compiler computes the final state achievable by a CP diagram under all mathematical constraints, thereby producing the compiled flat pattern. We have optimized this process: 1) During compilation, we mark each crease, allowing us to locate the position of every crease in the compiled image. 2) We further compute the paper stacking order information, clarifying the top-bottom relationship of different paper regions in the compiled flat pattern. 3) We construct an interface for direct interaction between MLLMs and the compiler, enabling the model to call this system more conveniently to complete origami simulations. 4) We improve the error feedback system of the compiler. Specifically, it returns four types of errors:

CP Code Syntax Error (CSE) Validates the existence, format, and validity of cross-references among core data structures in the CP code (such as vertex coordinates vertices_coords, edge-vertex relationships edges_vertices, and face-vertex relationships faces_vertices). It also checks whether crease types (e.g., 'B', 'M', 'V', 'F', 'U') are predefined characters, and verifies whether Euler's formula for planar graphs is satisfied: V − E + F = 2, where V, E, and F represent the number of vertices, edges, and faces, respectively.

Geometrically Impossible Fold (GIF) Refers to cases where the CP code geometrically violates fundamental origami principles, making the fold physically unrealizable. For example, violating local flat-foldability conditions at a vertex (such as Maekawa's theorem, |M − V| = 2, or Kawasaki's theorem, Σ_i α_i = 2π), or when the specified crease angle combinations would require the paper to be stretched or torn.

Paper Self-Intersection/Penetration (PSI) Occurs when logically incompatible situations are found while deducing the relative positions and layering order of different paper sections after folding. This may manifest as a cycle in the calculated paper layering relationships (e.g., layer A is above layer B, layer B is above layer C, and layer C is, in turn, above layer A), or in a 2D unfolded representation, different paper regions are assigned to overlapping positions that would cause physical penetration.

The fourth error type covers ambiguous folding results: it occurs when a given CP code, due to its inherent under-constrained nature (e.g., allowing multiple valid mountain-valley assignments for creases, or lacking critical information such as crease types or angles), can be compliantly folded into multiple different stable geometric structures, or prevents the compiler from uniquely determining the layering order when processing complex overlapping paper regions.
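To make the PSI check above concrete, the following is a minimal sketch that detects a cycle in a hypothetical pairwise above/below layering relation via depth-first search; the function name and data layout are illustrative, not the compiler's actual interface.

```python
# Hypothetical sketch: detect a cycle in pairwise "layer A is above layer B"
# relations, which would signal a Paper Self-Intersection/Penetration (PSI) error.
from collections import defaultdict

def has_layering_cycle(above_pairs):
    """above_pairs: iterable of (upper, lower) layer ids."""
    graph = defaultdict(list)
    for upper, lower in above_pairs:
        graph[upper].append(lower)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:          # back edge -> inconsistent stacking
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(graph))

# Example: A above B, B above C, C above A -> cyclic, physically impossible stacking.
print(has_layering_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # True
```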

In ORIGAMISPACE, the distribution of different types of origami is relatively even. To ensure data diversity, we choose origami models covering different levels of complexity and types of folds, such as animals, plants, geometric shapes, etc. The average number of folding steps for origami models is 8.2, but this varies greatly across models, ranging from a minimum of 3 steps to a maximum of 25 steps. Appendix A presents a more detailed data analysis, including the themes and names of all origami data and the distribution of folding-step counts.

Based on ORIGAMISPACE, we propose four tasks to evaluate the spatial reasoning capabilities of MLLMs comprehensively.

This task evaluates the model’s ability to understand the folding process from the CP diagram and imagine the final 3D shape. For this task, the input is the CP diagram, and MLLMs are required to predict the resulting folded shape image based on it. To enable better quantitative evaluation, we structure this task as a multiple-choice question. The correct option is the name of the target shape.

For the incorrect options, three origami enthusiasts design three options for each diagram, adhering to criteria that require them to be easily distinguishable from the correct option; not be variations of the same concept (e.g., if the correct option is a cat, incorrect options are not lions, leopards, etc.); and be close to potential folded states based on the CP diagram (e.g., removing a few key creases makes a boat’s CP diagram similar to a hat). We create 350 questions for this task. See Appendix B.1 for the specific annotation rules.

This task evaluates the model’s ability to understand the dynamic origami process and the logical relationships between steps. The input for the task is a set of images that collectively show several key steps of a complete origami process. However, the order of these images is randomly shuffled. MLLMs need to infer the correct chronological order in which these steps occur, based on their understanding of the geometric state changes in the images. To better quantify the model’s performance, we structure this task as a multiple-choice question. The correct option is the sequence of steps that represents the unique correct folding process (for example, “1-2-3-4”). For the incorrect options, we generate multiple logically incorrect sequences of steps (for example, “1-2-4-3”, “4-1-2-3”, etc.). These incorrect sequences may contain partially correct local orders but contain errors in the overall flow, in order to test the model’s grasp of the complete, coherent process. We design 250 such questions, and the average number of steps per question is 7.5.

This task evaluates the model’s ability to predict spatial relationships and geometric properties after the folding process is complete. For this task, the input is the CP diagram. The model is required to predict specific spatial relationships between parts of the origami model after it is fully folded. The task comprises three types of multiple-choice questions designed to test this ability: 1) Spatial Pose Localization: Determining the specific 3D position of a point from the original paper in the final model, including its pose within a reference frame (e.g., on a table, facing upwards). 2) Layering Relationship Analysis: Determining the paper stacking order after folding, requiring analysis of covering relationships during the folding process and identifying how many paper layers form a specific region (e.g., the thickest region). 3) Geometric Change Analysis: Predicting how specific geometric features (such as angles, distances, areas, etc.) change from the flat CP diagram to the final folded state. For example, predicting the relative angle or spatial distance between two original line segments after folding. The correct answers for all three question types are obtained using our optimized compiler. Incorrect options are then manually designed. We design 900 multiple-choice questions (300 for each type) for this task. See Appendix B.2 for specific annotation rules.

This task requires the MLLM to generate the corresponding CP code based on a compiled flat pattern and an image of the folded shape. Ideally, this CP code should compile into a folded pattern identical to the target shape. To comprehensively evaluate the quality of the generated results, we design a multidimensional evaluation framework.

The CP code generated by the model is first compiled using our origami compiler (see Section 3.2 for details). If the compilation fails, the compiler returns one or more error types. If the compilation succeeds, meaning the CP code is syntactically valid, geometrically foldable, free of self-intersections, and produces a definite folded state, the system compares the compilation result with the reference result across the following four dimensions:

  1. Topological Similarity This dimension focuses on the graph structure of the CP. It compares vertex counts, edge connectivity (e.g., similarity of degree distribution, number of connected components), face relationships (e.g., number of faces, distribution of face sizes), and the distribution similarity of crease types ("M", "V", "B", etc.).

  2. Geometric Similarity (GS) This dimension focuses on the spatial characteristics of the compiled model. It evaluates point position similarity by calculating the bidirectional Hausdorff distance d_H between the normalized 3D point sets of the generated and reference compiled models (score = exp(−k · d_H), where k is a sensitivity coefficient, e.g., 5). It assesses angular similarity by comparing the distribution of dihedral angles at the creases, and evaluates size and proportion similarity by comparing the aspect ratios of the overall bounding boxes of the models.
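A minimal sketch of this point-position score, assuming the score takes the form exp(−k · d_H) over centroid-centered, unit-scaled point sets (helper names are illustrative, not the paper's implementation):

```python
import numpy as np

def hausdorff_bidirectional(P, Q):
    """Bidirectional Hausdorff distance between two (N, 3) point sets."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def point_position_score(P_gen, P_ref, k=5.0):
    """Illustrative score = exp(-k * d_H) on normalized point sets."""
    def normalize(P):
        P = np.asarray(P, dtype=float)
        P = P - P.mean(axis=0)                  # centroid at the origin
        scale = np.linalg.norm(P, axis=1).max()
        return P / scale if scale > 0 else P    # scale to a unit sphere
    d_h = hausdorff_bidirectional(normalize(P_gen), normalize(P_ref))
    return float(np.exp(-k * d_h))
```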

  3. Constraint Satisfaction (CS) This dimension evaluates whether the successfully compiled CP code, beyond the basic foldability ensured by the compiler, further adheres to the physical and mathematical constraints of origami. This includes comparing the presence and matching degree of critical constraint types (Taco-Taco, Taco-Tortilla, transitivity constraints) and checking for satisfaction of fundamental theorems of local flat-foldability, such as Maekawa's theorem (the difference between the number of mountain creases M and valley creases V around a vertex is |M − V| = 2) and Kawasaki's theorem (the sector angles α_i around a vertex satisfy Σ_i α_i = 2π, with the alternating sum equal to 0).

  4. Final Folded State This dimension directly compares the final 3D model shape compiled from the generated CP with the reference compiled 3D model. It primarily evaluates overall shape similarity by calculating the Hausdorff distance of the point sets, and where possible (if the model provides layering information), compares the layering relationships between facets, including paper stacking order information that may be obtained during the compilation process.

Total Score: The final total score S_total is a weighted average of the scores s_dim from each evaluation dimension: S_total = Σ_dim w_dim · s_dim. By default, each of the four dimensions accounts for 25% of the weight (w_dim = 0.25), so Σ_dim w_dim = 1. This score ranges from 0 to 1 (S_total ∈ [0, 1]), reflecting the overall quality of the generated CP code. For more details on the evaluation process, please refer to Appendix D.

5 Experiments

We evaluate multiple representative MLLMs. For open-source models, we evaluate MiniCPM-o 2.6 [25], NVILA-15B [26], LLaVA-1.5-7B [27], VideoLLaMA3 [28], Qwen2.5-VL-[7B/32B/72B] [29], DeepSeek-VL2 [30], and InternVL2.5-78B [31]. For proprietary models, we evaluate Claude-3.5-Sonnet [32], GPT-4o [33], and Gemini 2.5-[Flash/Pro] [34]. For all these models, we adopt the original models and their official instruction formats.

We recruit two categories of people to complete the first three tasks. The first category consists of five laypersons recruited via a crowdsourcing platform, and the second category comprises three experts with extensive origami experience. Specific details of the human evaluation are provided in Appendix B.3. For the CP code generation task, we adopt the following settings:

In-context learning In this setting, we provide the model with detailed task instructions and a set of CP code examples. The instructions introduce the meaning of each part of the CP code and all the constraints that must be followed. MLLMs need to generate the complete CP code in one go based on these instructions and examples.

Environmental learning In this setting, MLLMs no longer attempt to generate the complete CP code in one go, but instead engage in iterative interaction with the compiler. Specifically, the MLLM first plans and then generates CP code; the compiler returns its compilation result, the model reasons over this feedback, and then chooses to add or delete creases, iterating in this manner. We set the upper limit of interaction rounds to 10.
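A schematic of this interaction loop, as an illustrative sketch rather than the released implementation; mllm_propose_edit and compile_cp are hypothetical stand-ins for the model call and the compiler interface:

```python
MAX_ROUNDS = 10  # upper limit of interaction rounds used in the paper

def environmental_learning_loop(task_inputs, mllm_propose_edit, compile_cp):
    """Illustrative sketch of the iterative MLLM-compiler interaction."""
    cp_code, feedback_history = None, []
    for _ in range(MAX_ROUNDS):
        # The model plans, then adds/deletes creases based on prior compiler feedback.
        cp_code = mllm_propose_edit(task_inputs, cp_code, feedback_history)
        feedback = compile_cp(cp_code)   # error types on failure, compiled result on success
        feedback_history.append(feedback)
    return cp_code, feedback_history
```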

Reinforcement learning Through the constructed compilation environment, we explore a reinforcement learning approach, utilizing the 471 sets of data mentioned in Section 3 for training. We report the overall compilation pass rate (CPR); when compilation is successful, the similarity along the four dimensions (Section 4.4) and the total score are calculated. This score is scaled to [0, 100] for ease of presentation.

The reward consists of three parts. (1) Intermediate reward: After modifying the code, if compilation is successful, a reward is given based on the quality improvement of the current partial CP code (S_partial − S_partial_prev, where S_partial is a quickly evaluated partial quality score), plus a small basic compilation-success reward; if compilation fails, a fixed negative penalty is given. (2) Step penalty: A small negative reward is received for each action taken, to encourage efficiency. (3) Final reward: After the interaction ends, the result of the evaluation function defined in Section 4.4 serves as the main reward. We adopt TRICO [35], a PPO-based [36], more efficient multi-turn reinforcement learning algorithm for MLLMs, for training Qwen2.5-VL-32B. Specific training settings and parameters can be found in Appendix E.
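For concreteness, a minimal sketch of this reward scheme; the magnitudes of the bonus and penalty terms below are illustrative assumptions, not the values used in the paper:

```python
def step_reward(compiled_ok, s_partial, s_partial_prev,
                success_bonus=0.05, fail_penalty=-0.1, step_penalty=-0.01):
    """Illustrative per-step reward following the three-part description above."""
    if compiled_ok:
        # Reward the quality improvement of the partial CP, plus a small success bonus.
        reward = (s_partial - s_partial_prev) + success_bonus
    else:
        reward = fail_penalty           # fixed negative penalty on compilation failure
    return reward + step_penalty        # small per-action cost to encourage efficiency

def final_reward(total_score):
    """Final reward: the Section 4.4 evaluation score (scaled to [0, 100])."""
    return total_score
```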

Tasks 1 to 3 primarily focus on spatial analysis and prediction. The results shown in Table 1 are the average of three runs for different MLLMs, from which we observe that: 1) For MLLMs, ORIGAMISPACE is a challenging task; the performance of poor-performing models is close to random guessing (25%), and even for the best-performing models, there is a significant gap compared to human performance, especially in multi-step spatial reasoning. 2) Despite the different task types, the relative performance ranking of various models largely remains consistent, with Gemini 2.5-pro and GPT-4o demonstrating the best spatial reasoning ability. 3) Human experts perform well on all tasks, demonstrating the task’s upper bound. 4) MLLMs perform worst on the Spatial Relationship Prediction task, especially the sub-tasks involving Geometric Change, indicating significant difficulty for models in understanding fine-grained, internal spatial structures.

Table 2 presents the results of different methods and models on Task 4. We observe the following: 1) Impact of learning settings: The results clearly indicate the significant impact of learning settings on performance. In-context learning shows relatively limited performance. Environmental learning brings significant performance improvements, demonstrating that through iterative interaction with the compiler, planning, and trial-and-error based on feedback, models can overcome the limitations of one-shot generation. Reinforcement learning shows potential, as the trained Qwen2.5-VL-32B surpassed the performance of a 72B model. 2) There are significant performance differences among different models, with top-tier closed-source models exhibiting the best spatial reasoning capabilities.

Figure 3: The impact of interaction rounds on the compilation pass rate and total score of different models.

Mathematical constraints present a primary challenge in generating valid CP codes for the ORIGAMISPACE task. Table 2 indicates that failing to satisfy constraints is the main bottleneck for compilation failures; even when provided with detailed instructions, models struggle to strictly adhere to these complex rules, leading to persistently high compilation failure rates. Interactive processes with the environment enhance models' ability to follow constraints, demonstrating that models can learn and internalize rules from feedback. Compared to environmental learning, reinforcement learning also shows improvement in constraint satisfaction, proving the effectiveness of specific reward mechanisms. However, even with interactive learning, precisely satisfying all mathematical constraints remains a significant challenge for top-tier models (such as GPT-4o and Gemini 2.5-pro, whose constraint satisfaction score is only 56.99% under the environmental learning setting). This reveals MLLMs' deficiencies in deep multi-step geometric and layering reasoning and highlights the value of the fine-grained feedback and constraint satisfaction evaluation introduced in this study.

Figure 3 illustrates the impact of interaction rounds on model performance across different dimensions under the environmental learning setting. We observe that as the number of interaction rounds increases, model performance improves in various aspects, particularly the compilation pass rate. However, performance tends to saturate after 8-10 rounds, indicating that interaction primarily helps overcome initial learning obstacles but struggles to break through the model’s inherent bottlenecks. Weaker models, limited by their understanding capabilities, reach their upper limit in fewer rounds.

The reinforcement learning-trained Qwen2.5-VL-32B also follows a similar trend, but due to policy optimization, it may reach its performance ceiling in fewer rounds.

In this paper, we introduce ORIGAMISPACE, a novel benchmark specifically designed to address the underexplored areas of multi-step spatial reasoning and constraint adherence in Multimodal Large Language Models (MLLMs). Leveraging the inherent complexities of origami, ORIGAMISPACE provides 350 meticulously curated data instances and an enhanced compilation program to facilitate in-depth evaluation. The benchmark features four challenging tasks, including pattern prediction, spatial relationship prediction, multi-step spatial reasoning, and end-to-end code generation, making it the first to assess MLLMs’ multi-step spatial reasoning under rigorous mathematical constraints. Our comprehensive evaluation of existing MLLMs and exploration of reinforcement learning methods for code generation highlight the utility of ORIGAMISPACE in not only assessing current capabilities but also in paving new ways to enhance the spatial intelligence of MLLMs.

ORIGAMISPACE comprises a total of 350 data entries, covering various types of origami. We categorize these based on the required number of folding steps into Easy (3-9 steps), Medium (10-19 steps), and Hard (20-30 steps). Tables 3, 4, and 5 respectively display all the data for these three difficulty levels, including the origami design name, its category, and the number of folding steps required. All our data are public data or authorized by the original websites and data sources, with no potential infringement risks.

The primary goal of this annotation task is to create challenging yet fair incorrect options for multiple-choice questions (MCQs). For each given Crease Pattern (CP) diagram and its known correct folded 3D shape, annotators are required to design three distinct incorrect shape options. These options, along with the correct one, will form an MCQ designed to evaluate a model's ability to predict the 3D shape from the CP. The following rules must be strictly adhered to when designing these incorrect options:

Each incorrect option must be easily and clearly distinguishable visually from the correct folded shape. The purpose is to prevent ambiguity where an incorrect option might be confused with the correct one due to only subtle visual differences.

Guideline:

• The overall silhouette, major components, and general form of the incorrect option should be significantly different from those of the correct option.

• Avoid creating incorrect options that are merely slight modifications, re-orientations, or minor proportional changes of the correct shape.

Example:

• If the correct shape is an origami crane:

-An incorrect option that is another bird in a very similar pose (e.g., a crane with wings slightly more elevated versus wings fully spread, if the overall form remains highly similar) might be unsuitable if it's not clearly visually distinct at a glance.

-A suitable incorrect option would be an origami box, an origami boat, or an origami star, as these are visually very different from a crane.

Incorrect options should not be variations of the same concept or fall within the same narrow semantic category as the correct option. They should represent fundamentally different objects or ideas. This rule ensures the task tests the prediction of the specific shape, not fine-grained classification within a single conceptual group.

• If the correct option is a specific type of animal, incorrect options should not be other animals that are closely related (e.g., from the same family) or share very similar overarching characteristics.

• Strive for incorrect options that belong to different conceptual categories than the correct option (e.g., animal vs. inanimate object vs. geometric form).

Example:

• If the correct shape is an origami cat:

-Incorrect options such as Lion, Tiger, or Leopard are unsuitable because they are all felines and thus variations of the same core concept ("large cat" or "wild cat" as opposed to "domestic cat").

-Suitable incorrect options could be an origami airplane, an origami hat, or an origami fish (assuming the 'fish' is a distinctly different concept from 'cat' within the context of common origami figures).

While incorrect, the alternative shapes should be plausible outcomes that could potentially be folded from a Crease Pattern that bears some relationship to the given CP diagram. This means an incorrect option might be a shape that could result from misinterpreting some creases, omitting a few key folds, or simplifying the original pattern. The objective is to create distractors that are not arbitrary but reflect potential, albeit erroneous, folding paths from a CP similar to the one provided.

• Consider what alternative, simpler, or related shapes might emerge if certain folds in the CP are ignored, if mountain and valley folds are confused, or if a common base derived from the CP is completed into a different known figure.

• The incorrect option's implied CP should not be drastically more complex or entirely unrelated to the structural elements suggested by the given CP. It should ideally represent a shape that an intermediate folder might erroneously produce when attempting the correct model or a related one.

Example:

• Given a CP diagram for a relatively simple origami boat:

-A suitable incorrect option could be an origami hat (e.g., a traditional paper hat like a "samurai helmet" or a simple party hat). Many simple hats share foundational folds or bases (like the water bomb base or a preliminary fold variation) with simple boats, or their CPs can be derived by altering or omitting a few creases from a boat's CP.

-An unsuitable incorrect option might be a highly complex origami insect or a multi-piece modular origami ball if the provided CP is for a simple, single-sheet boat. The CP for such complex figures would likely be vastly different and far more intricate, making them implausible alternatives based on the given simple CP.

Summary for Annotators Creating Incorrect Options: For each CP diagram and its corresponding correct folded shape, you are to design three unique incorrect shape options. Before finalizing these options, please verify each one against the following three criteria:

  1. Visual Distinguishability: Is the incorrect option clearly visually different from the correct shape?

  2. Conceptual Distinctness: Is the incorrect option conceptually different from the correct shape, avoiding mere variations of the same theme?

  3. Crease Pattern Plausibility: Is the incorrect option a shape that could plausibly (even if incorrectly) be derived from the provided CP or a closely related CP (e.g., through simplification or common error)?

Adherence to these rules is crucial for creating high-quality and effective multiple-choice questions for the Pattern Prediction evaluation task.

This section outlines the rules for annotators tasked with designing incorrect options for the Spatial Relationship Prediction task. For each Crease Pattern (CP) diagram, questions are posed about the spatial properties of the final folded origami model. While correct answers are generated by an optimized compiler, annotators must manually create three plausible yet incorrect options for each question to form a multiple-choice question (MCQ). The aim is to generate distractors that effectively test a model’s nuanced understanding of 3D spatial relationships post-folding.

The task comprises three types of questions. Below are specific guidelines for designing incorrect options for each type:

This question type requires predicting the specific 3D position and/or pose (orientation) of a designated point (or feature) from the original flat paper once the model is fully folded. The pose might be described relative to a global reference frame (e.g., on a table, with a specific part facing upwards).

Guidelines for Designing Incorrect Options:

• Plausible Positional Errors:

-Offer coordinates that are slightly offset from the correct 3D position (e.g., incorrect by a small delta in one or more axes, located in an adjacent quadrant, or on a wrong but nearby surface).

-Suggest a position that would be correct if a key fold were made inaccurately (e.g., a mountain fold treated as a valley, an incorrect fold angle, or slight misalignment of layers).

-Propose the final position of a different, perhaps nearby or symmetrically opposite, salient point from the original CP.

• Plausible Pose Errors (if orientation is part of the question):

-Provide options with the correct 3D position but an incorrect orientation (e.g., correct (x, y, z) coordinates, but the point/surface faces downwards instead of upwards, or is rotated 90° incorrectly).

-Offer an orientation that is a common simplification (e.g., aligned perfectly with a major axis when it’s actually slightly tilted).

• Symmetry-based Errors: For CPs/models exhibiting symmetry, an incorrect option could be the symmetrical counterpart of the correct position or pose.

• Reference Frame Confusion: Offer a position or pose that is correct relative to a local part of the origami model but incorrect within the specified global reference frame, or vice-versa.

Example: Suppose a specific vertex ‘P’ on the CP is queried for its final 3D coordinates (x, y, z) and the direction its local paper surface is facing (e.g., ‘upwards’), relative to a table it sits on. The correct answer (from compiler) is (10, 5, 3), local surface facing ‘upwards’.

• Suitable Incorrect Options could be:

-(10, 5, 0), local surface facing 'upwards' (Incorrect Z-coordinate, perhaps implying it's on the table surface when it's elevated).

-(10, 5, 3), local surface facing 'downwards' (Correct position, but incorrect orientation).

-(-10, 5, 3), local surface facing 'upwards' (A symmetrical position if the model has YZ-plane symmetry and the origin is centered).

-The final coordinates and pose of an adjacent vertex 'Q' from the CP.

• Unsuitable Incorrect Options: Random coordinates or orientations with no plausible relation to the model’s scale, structure, or folding process.

This question type focuses on the internal structure of the folded model, specifically the stacking order of paper layers or the number of layers at a particular region (e.g., identifying the thickest region or counting layers at a specific point).

Guidelines for Designing Incorrect Options:

• For Number of Layers Questions:

-Offer layer counts that are slightly off from the correct number (e.g., correct count ±1 or ±2 layers).

-Propose the layer count of an adjacent or visually similar region in the folded model.

-Suggest a count that might result from overlooking some hidden internal layers or, conversely, double-counting some visible folded edges as separate layers.

-If the question asks to identify the “thickest region” from a set of options, incorrect options should be other regions that are also thick, but not maximally so, or regions that appear thick but are not.

• For Stacking Order Questions:

-Provide plausible but incorrect permutations of the layer sequence. For example, if the correct top-to-bottom order of layers (referenced by their original CP surface labels like S1, S2, S3) is S1-S3-S2, an incorrect option could be S1-S2-S3 or S2-S1-S3.

-Suggest an order that would result if a specific flap were tucked differently during folding (e.g., a flap going over another flap instead of under it).

-Offer an incomplete order (e.g., missing one or more layers from the sequence in that region) or an order that incorrectly includes layers not present in that specific stack.

Example: Question: “How many layers of paper form the central part of the crane’s body?” Correct answer (from compiler): 8 layers.

• Suitable Incorrect Options could be:

-6 layers (Plausible underestimation, perhaps missing some internal folds).

-7 layers (Close, but incorrect).

-10 layers (Plausible overestimation, perhaps counting edges).

-4 layers (Number of layers in the crane’s wing, a different region).

Question: “Consider a point X on the wing of a folded paper airplane. Starting from the externally visible top surface at X, what is the order of the original paper surfaces (labeled S1, S2, S3, S4 on the CP) one would pass through if drilling perpendicularly downwards through all layers at X?” Correct answer (from compiler): S1, S4, S2.

• Suitable Incorrect Options could be:

-S1, S2, S4 (A common misremembered or simplified stacking).

-S4, S1, S2 (Incorrect starting layer or internal order).

-S1, S4 (Incomplete, missing the bottom layer S2).

This question type involves predicting how specific geometric features (e.g., angles between lines, distances between points, areas of surfaces) change from their state in the flat CP diagram to their state in the final 3D folded model.

Guidelines for Designing Incorrect Options:

• Value from Original CP: A very common and effective incorrect option is to offer the original geometric value as it was on the flat CP diagram (e.g., if an angle is 90° on the CP but becomes 45° in 3D, then 90° is a strong distractor). This tests whether the model understands that geometric properties transform during folding.

• Plausible Estimations or Miscalculations:

-For angles: Provide common angles (e.g., 30°, 45°, 60°, 90°, 180°) that might appear correct upon a cursory visual inspection of the folded form, or angles that result from assuming a simplified 3D configuration (e.g., assuming perpendicularity or parallelism where it doesn't exactly exist).

-For distances: Offer distances measured along the paper surface instead of the true Euclidean distance through 3D space (or vice-versa, depending on the question's phrasing). Suggest distances that might result from slight errors in visualizing the 3D form, such as ignoring foreshortening or using dimensions from a 2D projection.

-For areas: Propose areas that don't account for overlaps of paper in the folded state, or the area of a 2D projection rather than the true 3D surface area (if the latter is specified). An area that results from a miscalculation of how a shape transforms (e.g., halving an area when it should be less or more).

• Qualitative Change Errors: If the question is about the nature of change (e.g., "Does distance X increase, decrease, or stay the same?"), incorrect options could be the opposite type of change, or "stays the same" when there is indeed a significant change.

• Values from Unrelated or Different Parts: Offer a geometric value (angle, distance, area) that is correct for a different feature or part of the folded model, or for a different but related origami model.

Example: Question: "Two line segments L1 and L2 are parallel on the CP diagram and are 5 cm apart. In the final folded model, these segments become two adjacent edges of a wing. What is the approximate angle between the segments L1 and L2 in the folded state?" Correct answer (from compiler): 60°.

• Suitable Incorrect Options could be:

-0° (Implying they remain parallel, i.e., no change from the CP state regarding their relative orientation).

-90° (A common angle in man-made objects and some origami steps, could be a plausible guess).

-45° (Another common angle, plausible visual estimate).

Question: "A defined square region on the CP has an area of 16 cm². After folding, this region forms part of a curved surface. What is the approximate surface area of this region in the 3D model?" Correct answer (from compiler): 16 cm² (assuming no stretching/shrinking of paper, the intrinsic surface area remains the same, though its projected area might change).

• Suitable Incorrect Options could be:

-8 cm² (Perhaps confusing with a projected area that is halved).

-12 cm² (A value less than the original, implying shrinkage or significant overlap not intrinsic to the region itself).

-20 cm² (A value more than the original, implausible without stretching).

* (Note: If the question was about projected area, then 16 cm² could be an incorrect option if the projection foreshortens it.)

General Summary for Annotators Designing Incorrect Options: For each question across these three types, remember the following overarching principles when designing your three incorrect options:

By following these guidelines, you will help create high-quality multiple-choice questions that rigorously and fairly evaluate a model's capabilities in spatial relationship prediction for origami.

For the manual evaluation of the first three tasks, we recruited evaluators from two different categories.

The first category included five non-professionals recruited through a crowdsourcing platform; the second category comprised three experts with extensive experience in the field of origami. Participants in these evaluations were compensated according to the prevailing local minimum hourly wage standard.

The following is a detailed supplementary explanation of the origami compiler error feedback system, including more specific error types, possible error messages, relevant parameters, and their underlying principles.

This type of error occurs in the initial phase when the compiler parses the Crease Pattern (CP) code provided by the user, if the code does not conform to predefined syntax rules.

• faulty_token_or_command: (Optional) Indicates the specific instruction or token that caused the error.

• Formal Language and Grammar: CP code is treated as a formal language with precisely defined lexical and syntax rules.

• Parsing Stages:

  1. Lexical Analysis: Code text is broken into "tokens."
  2. Syntax Analysis: The token sequence is checked against grammar rules, often building an Abstract Syntax Tree (AST).

• Error Detection: Errors are reported if tokens or their sequence violate rules, preventing further compilation.

This error indicates that some defined folding operations are physically or geometrically unfeasible.

• Example Error Codes:

-E_GEOM_TOO_MANY_LAYERS: "Folding near (x, y) would result in N paper layers, exceeding the limit of M layers."

• suggested_disambiguation: (Optional) "Suggestion: Add layer order constraint (e.g., LAYER_ABOVE) or specify crease direction."

• Non-uniqueness of Solution Space: A CP may correspond to multiple valid configurations.

• Local vs. Global Information: Local constraints may be met, but global form can vary.

• Symmetry: Symmetric CPs or operations can lead to multiple equivalent results.

• Branching Points in Configuration Space: Folding path may have bifurcations.

• Implicit vs. Explicit Instructions: Unstated conventions can lead to ambiguity for the compiler.

• Solver Behavior: Solvers for underdetermined systems might not find a unique solution.

This section introduces the complete evaluation process for the generated Crease Pattern (CP) code. The final score is a weighted average of the scores from the different dimensions. Each of the four main dimensions is assigned an equal weight:

• Topological Similarity: w_topological = 0.25

• Geometric Similarity: w_geometric = 0.25

• Foldability Constraint Satisfaction: w_foldability = 0.25

• Final Folded State: w_fold_state = 0.25

The total score S_total is calculated as:

S_total = Σ_dim w_dim · s_dim

Since Σ_dim w_dim = 1 with these weights, this simplifies to:

S_total = 0.25 · (s_topological + s_geometric + s_foldability + s_fold_state)

where s_dim is the score for a particular dimension.
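A small sketch of this weighted-average computation; the dimension keys below are illustrative labels, not identifiers from the released code:

```python
def total_score(scores, weights=None):
    """S_total = sum over dimensions of w_dim * s_dim (defaults to equal 0.25 weights)."""
    dims = ("topological", "geometric", "foldability", "fold_state")
    weights = weights or {d: 0.25 for d in dims}
    return sum(weights[d] * scores[d] for d in dims)

# Example: with equal weights this reduces to the mean of the four dimension scores.
print(total_score({"topological": 0.8, "geometric": 0.6,
                   "foldability": 1.0, "fold_state": 0.4}))  # ~0.7
```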

This initial step ensures the generated CP data (cp_data) is well-formed and meets basic criteria for a valid crease pattern.

• Presence of Basic Elements: Checks if "vertices_coords", "edges_vertices", and "faces_vertices" keys exist in the input.

• Vertex Coordinates: Each vertex in vertices_coords must be a list of two numerical coordinates (e.g., [x, y]).

• Edge Definitions: Each edge in edges_vertices must be a list of two integer vertex indices (e.g., [v1, v2]). These indices must be valid and within the bounds of the vertex list.

• Crease Assignments (Optional): If “edges_assignment” is present, each assignment must be one of the valid types: “B” (Boundary), “M” (Mountain), “V” (Valley), “F” (Flat), “U” (Unassigned).

• Face Definitions: Each face in faces_vertices must be a list of at least three integer vertex indices. These indices must be valid.

• Euler Characteristic: For a planar graph, the Euler characteristic must satisfy V − E + F = 2, where V is the number of vertices, E is the number of edges, and F is the number of faces.

• Flat-Folder Validation (Optional): If the Flat-Folder compute module is available, its validate_cp_structure(cp_data) API is called to check if the CP can be compiled into a valid origami model. If not, the CP is considered invalid.

If any of these checks fail, the function returns {“valid”: False, “reason”: “error message”}. Otherwise, it returns {“valid”: True}.
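A minimal sketch of these structural checks over a FOLD-style dictionary; it mirrors the list above rather than the actual validate_cp_structure implementation:

```python
VALID_ASSIGNMENTS = {"B", "M", "V", "F", "U"}

def validate_cp_structure(cp_data):
    """Illustrative structural checks mirroring the list above (not the real API)."""
    for key in ("vertices_coords", "edges_vertices", "faces_vertices"):
        if key not in cp_data:
            return {"valid": False, "reason": f"missing key: {key}"}
    for v in cp_data["vertices_coords"]:
        if len(v) != 2:
            return {"valid": False, "reason": f"bad vertex: {v}"}
    V = len(cp_data["vertices_coords"])
    for edge in cp_data["edges_vertices"]:
        if len(edge) != 2 or not all(0 <= i < V for i in edge):
            return {"valid": False, "reason": f"bad edge: {edge}"}
    for face in cp_data["faces_vertices"]:
        if len(face) < 3 or not all(0 <= i < V for i in face):
            return {"valid": False, "reason": f"bad face: {face}"}
    for a in cp_data.get("edges_assignment", []):
        if a not in VALID_ASSIGNMENTS:
            return {"valid": False, "reason": f"unknown crease type: {a}"}
    # Planar-graph Euler formula as stated in the text: V - E + F = 2.
    E, F = len(cp_data["edges_vertices"]), len(cp_data["faces_vertices"])
    if V - E + F != 2:
        return {"valid": False, "reason": "Euler characteristic violated"}
    return {"valid": True}
```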

This dimension assesses the similarity of the graph-theoretical structure of the generated CP (gen_cp) and the reference CP (ref_cp). It combines scores from four sub-metrics, after extracting basic topological information using extract_topology(cp_data), which retrieves vertices, edges, edge assignments, and faces.

The overall topological similarity score S_topological is a weighted average of these sub-metric scores, defined within the calculate_topological_similarity method.

Compares the number of vertices (V_gen, V_ref).

• The score is calculated using an exponential decay function of the vertex-count difference; the code implements this as exp(−0.5 · d), where d measures the difference between V_gen and V_ref.

Compares the edge structures based on degree distribution and connected components.

• Adjacency List Construction (build_adjacency_list): Adjacency lists are built for both CPs from their edge-vertex relationships.

• Degree Distribution Similarity:

-calculate_degree_distribution: Computes the distribution of vertex degrees (number of edges connected to each vertex).

-calculate_wasserstein_distance: A simplified Wasserstein distance (d_W) is calculated between the degree distributions of the generated and reference CPs. The score for degree similarity is s_degree = 1 − d_W.

• Connected Components Similarity:

-count_connected_components: The number of connected components (C_gen, C_ref) is determined for each CP graph using Depth First Search (DFS).

• The final edge connectivity score s_edge is a weighted average: s_edge = 0.7 · s_degree + 0.3 · s_conn.
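A sketch of the edge-connectivity comparison under the stated simplification; the actual calculate_degree_distribution and calculate_wasserstein_distance helpers may differ, and the connected-components score s_conn is assumed to be computed elsewhere and passed in:

```python
from collections import defaultdict

def degree_distribution(num_vertices, edges):
    """Normalized histogram of vertex degrees."""
    deg = [0] * num_vertices
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    hist = defaultdict(float)
    for d in deg:
        hist[d] += 1.0 / max(num_vertices, 1)
    return hist

def wasserstein_1d(p, q):
    """Simplified 1-D Wasserstein distance between two normalized histograms."""
    support = sorted(set(p) | set(q))
    cdf_p = cdf_q = dist = 0.0
    for i, x in enumerate(support[:-1]):
        cdf_p += p.get(x, 0.0)
        cdf_q += q.get(x, 0.0)
        dist += abs(cdf_p - cdf_q) * (support[i + 1] - x)
    return dist

def edge_connectivity_score(s_degree, s_conn):
    """s_edge = 0.7 * s_degree + 0.3 * s_conn, as defined above."""
    return 0.7 * s_degree + 0.3 * s_conn
```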

Compares properties of the faces in the two CPs.

• Face Count Similarity (s_f_count): Compares the number of faces, F_gen and F_ref.

• Average Vertices per Face Similarity (s_f_avg_v): Let avgV_gen and avgV_ref be the average number of vertices per face.

• Face Size Distribution Similarity (s_f_dist): The distribution of face sizes (number of vertices per face) is computed for both CPs. A simplified Wasserstein distance (d_W) is calculated between these distributions using calculate_wasserstein_distance. The score is s_f_dist = 1 − d_W, analogous to the degree-distribution score.

• The final face relations score s_face is a weighted average of these three sub-scores.

Compares the distribution of crease types (M, V, B) if “edges_assignment” is available.

• If either CP lacks edge assignments, a low score of 0.2 is returned.

• Crease Type Counts (count_crease_types): Counts the occurrences of Mountain ('M'), Valley ('V'), Boundary ('B'), Flat ('F'), and Unassigned ('U') creases.

• Proportion Similarity: For Mountain, Valley, and Boundary creases, the similarity of their proportions in the generated (gen) and reference (ref) CPs is calculated, where a proportion is the count of that type divided by the total number of assigned edges for that CP.

• Length Penalty (p_L): A penalty is applied if the total number L of assigned edges differs between the two CPs.

• The final crease assignment score s_crease is a weighted average of the proportion scores, multiplied by the length penalty.
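An illustrative sketch of this scoring; the exact proportion-similarity and length-penalty formulas are not given above, so the specific forms used here (1 minus the absolute proportion difference, and a count ratio) are assumptions:

```python
def crease_assignment_score(gen_counts, ref_counts):
    """Illustrative sketch; proportion and penalty formulas are assumptions."""
    def proportions(counts):
        total = sum(counts.get(t, 0) for t in ("M", "V", "B", "F", "U"))
        return {t: counts.get(t, 0) / total for t in ("M", "V", "B")} if total else {}
    p_gen, p_ref = proportions(gen_counts), proportions(ref_counts)
    if not p_gen or not p_ref:
        return 0.2  # low score when assignments are missing, as in the text
    # Assumed proportion similarity: 1 - |difference|, averaged over M, V, B.
    prop_scores = [1.0 - abs(p_gen[t] - p_ref[t]) for t in ("M", "V", "B")]
    # Assumed length penalty: ratio of total assigned-edge counts.
    L_gen, L_ref = sum(gen_counts.values()), sum(ref_counts.values())
    penalty = min(L_gen, L_ref) / max(L_gen, L_ref)
    return (sum(prop_scores) / 3.0) * penalty
```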

This dimension evaluates the similarity of the spatial characteristics of the compiled/folded models. It requires compiling the CPs into 3D models.

• CP Compilation (compile_cp_to_model):

-If the Flat-Folder compute.compute_folded_state(cp_data) API is available, it's used to get the folded model data (typically including 3D vertex coordinates "P" and crease edges "SP").

-If Flat-Folder is unavailable, a simplified_folding method is used, which essentially returns the original 2D vertex coordinates as "P" and edges as "SP". This is a significant simplification.

• If either CP fails to compile (or provide simplified data), a low score of 0.2 is returned by calculate_geometric_similarity.

The overall geometric similarity score S_geometric is a weighted average of the following sub-metric scores, defined within calculate_geometric_similarity.

Compares the 3D point clouds of the folded models.

• Coordinate Normalization (normalize_coordinates): Vertex coordinates (from “P”) of both models are normalized. If points are 2D, a Z-coordinate of 0 is added. Points are then translated so their centroid is at the origin and scaled so the maximum distance from the origin to any point is 1 (i.e., normalized to a unit sphere).

• Hausdorff Distance: The bidirectional Hausdorff distance d_H is calculated between the normalized point sets of the generated (P_gen) and reference (P_ref) models, where d(a, b) is the Euclidean distance; this is achieved by calling calculate_hausdorff_distance twice.

• The point position similarity score s_point is calculated using an exponential decay function, s_point = exp(−k · d_H).

Compares the distribution of dihedral angles along creases in the folded models.

• Crease Edge Extraction (extract_crease_edges): Crease edges are extracted from the folded model data (typically from “SP”).

-Note: In the provided eval.py, if Flat-Folder is unavailable, this function returns a list of random angles as a placeholder. A proper implementation would calculate actual dihedral angles between faces sharing a crease.

-create_histogram: Histograms of dihedral angles are created for both models. Angles are typically in [0°, 180°], binned into 18 bins (10 degrees per bin).

-calculate_cosine_similarity: The cosine similarity between the two angle histogram vectors is calculated. This value serves as the angle similarity score s_angle.

• If creases cannot be extracted or angles cannot be calculated for either model, a default score of 0.5 is returned by compare_angles.
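A sketch of the histogram-and-cosine comparison described above, assuming dihedral angles are given in degrees; the function name is illustrative:

```python
import numpy as np

def angle_similarity(angles_gen, angles_ref, n_bins=18):
    """Cosine similarity of dihedral-angle histograms (10-degree bins over [0, 180])."""
    h_gen, _ = np.histogram(angles_gen, bins=n_bins, range=(0.0, 180.0))
    h_ref, _ = np.histogram(angles_ref, bins=n_bins, range=(0.0, 180.0))
    if h_gen.sum() == 0 or h_ref.sum() == 0:
        return 0.5  # default when angles are unavailable, as described above
    num = float(np.dot(h_gen, h_ref))
    den = float(np.linalg.norm(h_gen) * np.linalg.norm(h_ref))
    return num / den if den > 0 else 0.5
```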

Compares the overall dimensions and aspect ratios of the folded models’ bounding boxes.

• Bounding Box Calculation (calculate_bounding_box): The axis-aligned bounding box (min/max coordinates along X, Y, Z) is computed for the point clouds of both models. 2D points are padded with Z = 0.

• Proportion Calculation: The dimensions (length, width, height) of the bounding boxes are calculated. These dimensions are sorted in descending order and then normalized by dividing by the largest dimension (e.g., [1, L2/L1, L3/L1]).

• Similarity Score: The cosine similarity between the normalized proportion vectors of the two models is calculated using calculate_cosine_similarity. This value is the size and proportions similarity score s_size.

• If either point set is empty, a default score of 0.5 is returned by compare_size_and_proportions.
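A sketch of the bounding-box proportion comparison; 2D inputs are padded with z = 0 as described, and the helper name is illustrative:

```python
import numpy as np

def size_proportion_score(P_gen, P_ref):
    """Cosine similarity of sorted, normalized bounding-box dimensions."""
    def proportions(P):
        P = np.asarray(P, dtype=float)
        if P.shape[1] == 2:                       # pad 2D points with z = 0
            P = np.hstack([P, np.zeros((len(P), 1))])
        dims = P.max(axis=0) - P.min(axis=0)      # bounding-box edge lengths
        dims = np.sort(dims)[::-1]                # descending order
        return dims / dims[0] if dims[0] > 0 else dims
    if len(P_gen) == 0 or len(P_ref) == 0:
        return 0.5                                # default for empty point sets
    a, b = proportions(P_gen), proportions(P_ref)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```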

This dimension assesses whether the generated CP adheres to known origami foldability constraints, beyond basic geometric foldability.

• Basic Foldability Check (Optional):

-If Flat-Folder's compute.check_foldability(cp_data) API is available, it's used to check if both CPs are foldable.

-If the reference CP is foldable but the generated CP is not, calculate_foldability_similarity returns a score of 0.2.

The overall foldability score S_foldability is a weighted average of the following sub-metric scores, defined within calculate_foldability_similarity.

If an exception occurs during calculation, calculate_foldability_similarity returns a score of 0.3.

This involves extracting and comparing critical origami constraints.

• Constraint Extraction (extract_constraints):

-This method aims to extract Taco-Taco (TT), Taco-Tortilla (TTo), and Transitivity (Trans) constraints by calling helper methods like extract_taco_taco_constraints.

-Note: In the provided eval.py, if Flat-Folder's constraints module is unavailable, the extraction methods are simplified and return empty lists. A full implementation would identify these constraints from the CP geometry and crease assignments.

• Constraint Set Comparison (compare_taco_taco_constraints, compare_taco_tortilla_constraints, compare_transitivity_constraints): For each constraint type (TT, TTo, Trans):

-If both CPs have no such constraints, similarity is 1.0.

-If one has constraints and the other doesn’t, similarity is 0.3.

-Otherwise:

  • Constraint Overlap (s overlap ): Calculated using Jaccard similarity on the sets of constraints (constraints are stringified for comparison via calculate_constraint_overlap).

  • Count Similarity (s_count): Based on N, the number of constraints of that type in each CP.

  • The score for that constraint type (e.g., s_TT) is 0.7 · s_overlap + 0.3 · s_count.
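A sketch of the per-constraint-type comparison; the Jaccard overlap, the empty/mismatch defaults, and the 0.7/0.3 weighting follow the text, while the count-similarity formula (a count ratio) is an assumption:

```python
def constraint_type_score(gen_constraints, ref_constraints):
    """Illustrative comparison for one constraint type (Taco-Taco, etc.)."""
    if not gen_constraints and not ref_constraints:
        return 1.0
    if bool(gen_constraints) != bool(ref_constraints):
        return 0.3
    gen_set = {str(c) for c in gen_constraints}   # stringified for set comparison
    ref_set = {str(c) for c in ref_constraints}
    s_overlap = len(gen_set & ref_set) / len(gen_set | ref_set)   # Jaccard similarity
    # Assumed count similarity: ratio of constraint counts (exact formula not given).
    n_gen, n_ref = len(gen_constraints), len(ref_constraints)
    s_count = min(n_gen, n_ref) / max(n_gen, n_ref)
    return 0.7 * s_overlap + 0.3 * s_count
```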

Checks for adherence to local flat-folding theorems around vertices.

• Kawasaki’s Theorem Check (check_kawasaki_theorem):

-States that for a flat-foldable vertex, the sum of alternating angles around the vertex is 180°; equivalently, Σ_i (−1)^i α_i = 0 (while the full set of sector angles α_i around an interior vertex sums to 2π).

-Note: The mock implementation in eval.py always returns True. A full implementation would iterate over internal vertices and check the angles.

• Maekawa's Theorem Check (check_maekawa_theorem):

-States that for a flat-foldable vertex, the number of mountain creases (M) and valley creases (V) must differ by two: |M − V| = 2.

-Note: The mock implementation in eval.py always returns True. A full implementation would check crease assignments around internal vertices.
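Full implementations of these two checks might look like the following sketch (an illustrative stand-in for the eval.py mocks; angles are in radians and creases/angles are listed in cyclic order around an interior vertex):

```python
import math

def maekawa_ok(assignments_at_vertex):
    """Maekawa: |#mountain - #valley| == 2 at an interior, flat-foldable vertex."""
    m = assignments_at_vertex.count("M")
    v = assignments_at_vertex.count("V")
    return abs(m - v) == 2

def kawasaki_ok(sector_angles, tol=1e-6):
    """Kawasaki: alternating sector angles sum to zero, i.e. sum_i (-1)^i * alpha_i == 0,
    while all sector angles around an interior vertex sum to 2*pi."""
    alt_sum = sum(a if i % 2 == 0 else -a for i, a in enumerate(sector_angles))
    return math.isclose(alt_sum, 0.0, abs_tol=tol) and \
           math.isclose(sum(sector_angles), 2 * math.pi, abs_tol=tol)
```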

Overall Shape Similarity

• Calculated using the bidirectional Hausdorff distance d_H between the (normalized) point clouds of the generated and reference folded models, identical to the method in compare_point_positions.

• The shape similarity score s_shape is derived from d_H via the same exponential decay as s_point.

Layering Similarity (compare_layers)

Compares the stacking order of faces/layers in the folded state.

• This relies on layering information being present in the compiled model, typically under a key like "CF" (face assignments or configuration).

• Note: The compare_layers function in the provided eval.py is a simplified placeholder and returns a default score of 0.5. A full implementation would require a detailed comparison of the layer graph or face ordering.

• The score is s_layer.

For the reinforcement learning method, we adopt TRICO [35], a PPO-based [36], more efficient multi-turn reinforcement learning algorithm for MLLMs, for training Qwen2.5-VL-32B. Specifically, we trained for 10.2 hours on 16 H100 GPUs, with the following hyperparameter settings: γ_turn = 0.95, γ_token = 1.0, KL penalty = 0.001, Actor LR = 1 × 10⁻⁶, and Critic LR = 1 × 10⁻⁵.

While the ORIGAMISPACE benchmark and dataset offer a novel approach to evaluating multi-step spatial reasoning in MLLMs, we acknowledge certain limitations that provide avenues for future work. Firstly, although our dataset comprises 350 meticulously collected origami instances, the overall scale is relatively modest compared to some large-scale benchmarks in other vision and language domains. Future efforts could focus on expanding the dataset size and further diversifying the range of origami types and complexities included, potentially through semi-automated generation techniques, to ensure even broader coverage and statistical power. Secondly, while origami provides an excellent structured environment with clear mathematical constraints, the direct transferability of MLLM performance and the specific reasoning mechanisms learned on ORIGAMISPACE to other, less constrained or visually distinct spatial reasoning tasks (e.g., understanding dynamic real-world scenes or interpreting abstract diagrams from different fields) warrants further investigation. Exploring this generalization gap could be a valuable direction for future research. Finally, our current set of evaluation tasks, though designed to be challenging, focuses on specific facets of spatial reasoning highlighted by origami. There may be other subtle aspects of spatial intelligence or different interaction modalities with the origami compilation process that could be explored in future iterations to provide an even more holistic assessment of MLLM capabilities.

Answer: [Yes] Justification: We provide complete data, evaluation code, and model training code, which can be accessed via GitHub in the public version.

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA] Justification: Our work primarily explores the performance of MLLMs (Multimodal Large Language Models) in origami scenarios and has no potential societal impact.

Guidelines:

• The answer NA means that there is no societal impact of the work performed.

• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [Yes]

Justification: As discussed in Appendix A, all our data are public data or authorized by the original websites and data sources, with no potential infringement risks.

Guidelines:

• The answer NA means that the paper poses no such risks.

• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We have complied with all licensing and usage terms and acknowledged the data owners.

Guidelines:

• The answer NA means that the paper does not use existing assets.

• The authors should cite the original paper that produced the code package or dataset.

• The authors should state which version of the asset is used and, if possible, include a URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.

• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to the asset's creators.

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: The paper introduces new assets, and all data and code are publicly available. Details about training, license, limitations, etc., are documented in compliance with submission guidelines.

Guidelines:

• The answer NA means that the paper does not release new assets.

• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [Yes]

Justification: All details of manual annotation, including annotation instructions and compensation descriptions, are provided in Appendix B.

Guidelines:

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [Yes]

Justification: The corresponding content is described in Appendix B.

Guidelines:

• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

[Table caption] Results of different MLLMs and methods on the code generation task. Compilation indicates whether compilation succeeded, i.e., the probability that none of the four compilation errors occurred.

