Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation
Authors: Rena Suzuki, Masato Kikuchi, Tadachika Ozono
Rena Suzuki, Nagoya Institute of Technology, Nagoya, Aichi, Japan (suzure@ozlab.org); Masato Kikuchi, Nagoya Institute of Technology, Nagoya, Aichi, Japan (kikuchi@nitech.ac.jp); Tadachika Ozono, Nagoya Institute of Technology, Nagoya, Aichi, Japan (ozono@nitech.ac.jp)

©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract—While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process—particularly applying visual effects to ground spoken content to slide objects—remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose "Text-S2SG," a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.

Index Terms—Grounding, Slide understanding, Slide-based video generation, Large language model (LLM)

I.
INTRODUCTION

Slide-based instructional videos, widely utilized in education and research, can be enhanced with visual effects such as pointing to deepen viewer comprehension; however, the significant effort required for their creation remains a challenge. To effectively facilitate audience understanding, visual effects—such as pointing and highlighting—must be synchronized with the narration and with the corresponding figures, tables, and text on the slides. Nevertheless, the manual process of adding these visual effects is exceedingly time-consuming and labor-intensive. Furthermore, the quality of the final video is highly dependent on the editor's skills and interpretation, posing issues in terms of both efficiency and reproducibility.

The objective of this research is to develop a system that automatically generates slide-based videos augmented with visual effects. The system takes slides and a corresponding script as input, automatically adding visual effects based on the narration. In this context, a script is the text that is narrated for the slides. While existing research has addressed the automatic generation of slides and scripts, few studies have focused on the automatic application of the visual effects themselves—that is, determining "what" slide content to emphasize and "when" to do so based on the narration.

In this study, we deconstruct this automation process into two sequential tasks. The first is a "grounding" task, which identifies the slide objects referenced by the script. The second, termed the "Attention Control Problem," determines the specific visual effects to apply based on the grounding results. This paper focuses on the former task, grounding, as it serves as the foundation for the entire process. The accuracy of this grounding is fundamental to creating effective visual effects, and since the latter task is entirely dependent on the outcome of the former, it must be addressed first.
This grounding task is challenging as it requires semantic interpretation; despite recent advancements in vision-language models (VLMs), the unique structure of slides prevents their straightforward application. This paper analyzes and formulates this fundamental task as Script-to-Slide Grounding (S2SG). Automatically applying visual effects based on a script necessitates an integrated understanding of textual and visual information, which is a highly complex problem. Therefore, we propose a phased approach that begins with the relatively simple task of grounding between textual elements, using these results as a foothold to tackle the more complex problem involving visual information. In this paper, we introduce a method called Text-S2SG, which is limited to text-only slides, and demonstrate through evaluation experiments that it can achieve high-performance grounding (F1-score: 0.924).

II. RELATED WORK

The manual creation of slide-based videos is not only exceedingly time-consuming and labor-intensive but also poses challenges in efficiency and reproducibility, as the quality is highly dependent on the editor's skills. Automating the application of visual effects based on a script is a formidable task, requiring an integrated understanding of textual and visual information. We, therefore, adopt a phased approach: we first address the simpler task of grounding between textual elements, intending to use these results as a foundation for tackling the more complex problem of visual information. To situate this approach, this section reviews existing research from three perspectives—support for creating slide-based videos, slide content analysis, and the grounding problem—to clarify the positioning of our study.

A. Support for Creating Slide-based Videos

Existing approaches aiming to automate the creation of slide-based videos can be broadly categorized into "automated content generation" and "interactive editing support."
In research on automated content generation, the focus has been on the automatic creation of slides and scripts. Xu et al. [1], in the context of English vocabulary learning, generated scripts from existing slide materials using a specialized large language model (LLM) and created lecture videos by vocalizing these scripts as narration. However, their work is limited to narration generation and does not extend to the application of visual effects. Ge et al. [2] proposed a method for automatically generating slides solely from natural language instructions. Specifically, they designed a custom library that encapsulates basic slide creation operations such as image retrieval and title insertion. Using this library, an LLM generates a program to manipulate the python-pptx library, thereby creating slides with layouts and images. However, this research is also focused on generating static slides and does not address the generation or application of visual effects.

In research on interactive editing support, LLM-based assistance has been proposed to reduce the workload of video editing. Wang et al. [3] introduced "LAVE," an agent system that understands and processes editing instructions in natural language to automatically plan and execute tasks such as clip selection, trimming, and sequence editing. However, the application of advanced visual effects is identified as a future challenge. "ExpressEdit" by Tilekbay et al. [4] allows for the application of effects like shape overlays and zooming through a combination of natural language and sketches, but this is an interactive support system that relies on user instructions and does not achieve full automation of the editing process.

Creating slide-based videos augmented with visual effects to promote comprehension requires more than just the automation of material generation or editing operations.
It demands an advanced integrated process that dynamically corresponds the narration with visual elements on the slides and designs and applies visual effects in a way that aids audience understanding. Gritz et al. [5] reported that an increase in the amount of text within a slide leads to an increase in comprehension-related actions such as pausing. This suggests that visually indicating the scope of slide objects corresponding to the narration can significantly contribute to comprehension. However, at present, no established approach exists to fully automate the editing process of determining "what visual effect" to apply, "where," and "when." Consequently, the creation of such videos remains dependent on manual labor, which not only entails a high workload but also suffers from reproducibility issues, as the final quality and style are dependent on the creator's subjectivity and skills. Therefore, establishing a mechanism that autonomously generates consistent visual effects from slides and a script is a critical challenge for the future of assistive technologies for creating slide-based videos.

Fig. 1. Overview of Script-to-Slide Grounding (S2SG)

B. Slide Content Analysis

Accurately extracting meaning from slides is extremely difficult due to their unique data structure. A presentation file (e.g., .pptx) is essentially an archive containing multiple XML files and other data. The content of each slide is represented by a tree structure called a Document Object Model (DOM) tree, as shown in Fig. 1. In this structure, individual slide objects such as text and shapes become nodes, and the entire slide is serialized as an XML document. This DOM tree presents two problems for slide analysis: a lack of semantic structure and issues with object granularity.

The first problem, the lack of semantic structure, arises because the DOM tree merely reflects the slide's creation history or internal data structure, not the semantic structure that humans perceive. For example, as shown in Fig. 1, a human might interpret the layout as indicating that node v_2 should be read before node v_1, but on the DOM tree, they could be recorded in the order v_1, v_2 depending on the editing sequence. As such, the order of appearance in the XML does not necessarily match the visual layout, so a naive approach that simply maps script sentences to objects sequentially cannot correctly reflect the presenter's intended structure.

The second problem is object granularity. A semantic unit may be represented as a single object, rendering the granularity too coarse. For instance, a bulleted list with indented items is often represented as a single text object. This poses a problem when processing the correspondence with a script, as it is difficult to accurately point to just one item in the list even if the script refers to it specifically.

Regarding the first challenge, Hayama et al. [6] proposed a method that uses the visual layout information of a slide to identify the functional attributes of objects and extract a hierarchical structure.
However, they report that analysis becomes difficult when the relationships between objects are defined by their text content. Furthermore, accurately understanding the meaning of non-text objects such as figures and tables requires considering the context and the entire slide's content. The second challenge can be addressed by hierarchizing the text objects within a slide using indentation information, as seen in the research by Inoue et al. [7] and Sano et al. [8]. This approach aims to reflect the creator's intended layout and semantic divisions by analyzing the hierarchical structure of each text object.

Meanwhile, VLMs are a technology applicable to slide content interpretation. VLMs are foundation models capable of simultaneously understanding and associating images and text. By inputting a slide as an image and a script as text, there is a potential to directly analyze the correspondence between them. However, significant challenges remain with this approach. Lee et al. [9] have shown that it is difficult for AI models to accurately understand the complex diagrams and, particularly, the mathematical formulas found in lecture slides, for which recognition accuracy is known to be low. As a result, they report that the grounding of speech to slide objects in lecture videos has not yet reached human-level performance. In other words, understanding the overall meaning of a slide remains a difficult task for VLMs.

C. Slide Analysis as a Grounding Problem

The challenge addressed in this research can be framed as a type of grounding problem, which involves connecting linguistic information to specific objects. We define the task of identifying which set of objects on a slide each sentence of a script refers to as S2SG.
Grounding is an active area of research, with studies such as "Textual Grounding" [10], which extracts text regions within a document based on a query, and "Visual Grounding" [11], which associates specific regions in an image with text. However, directly applying these existing methods to S2SG is extremely difficult due to two unique characteristics of the slide medium.

The first is the semantic and pragmatic gap. Scripts frequently use abstract and deictic expressions (pragmatics), such as "As this graph shows..." In contrast, the corresponding object on the slide is a more concrete entity (semantics), like the entire graph or a trend within it. Bridging this gap in abstraction requires advanced reasoning capabilities that go beyond simple text matching or object detection.

The second is context dependency in a structured document. A slide is a structured document where objects like titles, bullet points, figures, and tables are intentionally arranged. The meaning of an object is not determined by the object in isolation but is strongly dependent on the overall layout structure of the slide and the flow of the presentation (context). Therefore, existing grounding methods that treat each object as independent cannot capture this context.

From the analysis above, S2SG can be considered a novel problem that cannot be solved by a mere extension of existing research. To address this complex problem, this paper proposes a phased approach that first uses an LLM, which excels at contextual understanding, to perform the relatively easier task of grounding between textual elements (script and on-slide text). This strategy aims to effectively simplify the overall S2SG problem by significantly reducing the search space for subsequent VLM processing, thereby lowering the computational load and the risk of misinterpretation.

III.
SCRIPT-TO-SLIDE GROUNDING

This section describes S2SG, a task that involves estimating the correspondence between a slide and a script. First, we outline the problem setting and its associated challenges. Then, we present the formal definition of S2SG and the method for representing the correspondence between slide objects and each sentence in the script (hereafter, a "script sentence").

A. Problem Setting

We aim to build a system that automatically generates slide-based videos augmented with visual effects. Given a set of presentation slides and the corresponding script, an LLM estimates which slide objects should be visually emphasized based on the content of script sentences. The system must then estimate and associate the relevant slide objects for each script sentence and output the visual effect information to be applied during that sentence's playback. This task can be divided into two sub-tasks: S2SG, which involves associating the script with slide objects, and the "Attention Control Problem," which involves determining the visual effects for guiding the viewer's gaze. In this paper, we focus on S2SG.

In addition to the issues discussed in Section II, several challenges arise. First, there is a mismatch in information content and representation: slides are often more concise than scripts, and scripts may include abstract or deictic expressions (e.g., "this") and non-verbal cues (e.g., pointing). Second, the interpretation of graphical objects is context-dependent; for example, even if a yellow triangle represents a banana, it is difficult to infer this unless the script explicitly states it. Accurately understanding non-textual objects typically requires reasoning over the context and the slide's overall structure.
Finally, despite recent advances in image generation, precise highlighting over text-heavy slides remains difficult to control; prompt design is costly, and processing time grows with the number of slides. For these reasons, S2SG requires accurately estimating correspondence by leveraging semantic and structural relationships between script sentences and slide objects.

B. Formal Definition

The objective of the S2SG task is to automatically estimate the correspondence between the script sentences in a script and the objects on a slide. In this study, we assume that the correspondence in S2SG is confined within a single slide. For each slide, given a set of script sentences S = {s_i}_{i=1}^{n} and a set of slide objects V = {v_j}_{j=1}^{m}, we estimate a subset of slide objects V_i = {v | s_i → v, v ∈ V} that corresponds to each script sentence s_i ∈ S. Under these premises, the grounding function g is defined as follows:

g : s_i → 2^V. (1)

Here, 2^V represents the power set of the slide object set V. Let g(s_i) be the set of slide objects estimated to correspond to the script sentence s_i; it can be empty if s_i is unrelated to any slide objects (e.g., a digression).

Fig. 2. Hierarchical decomposition of text objects

You are a presentation support assistant. For each sentence in the following presentation script, please ground it to the object it describes from the list of slide objects. Note that not all objects will necessarily be described. Your output must be in JSON format.
For each sentence, return a list of shape IDs for the corresponding objects.

Fig. 3. Instruction prompt used for Text-S2SG

Alternatively, the correspondence can be represented as a real-valued matrix M, formulated as:

M ∈ [0, 1]^{n×m}. (2)

Here, m_ij ∈ M indicates the likelihood that script sentence s_i corresponds to slide object v_j, with values near 1 indicating stronger correspondence. Therefore, S2SG can be formulated as the problem of estimating this correspondence matrix M.

IV. PROPOSED METHOD: TEXT-S2SG

This section proposes a Text-S2SG algorithm tailored for slides containing only text objects and describes a prototype system that incorporates this algorithm.

A. Text-S2SG Algorithm

In this paper, we propose an LLM-based method to automate S2SG under a simplified setting, where slides contain only text objects, in order to address the challenges mentioned in Section III-A and to ensure experimental feasibility and clarity of evaluation. Each script sentence—delimited by a period (.) or a newline—is treated as a unit. Slide text objects are split by newlines into sentence-level elements. The method consists of two stages: preprocessing the slide information and performing the grounding using an LLM. As illustrated in Fig. 2, each text object v_i extracted from the slide's XML is split by newlines into {v_i1, ..., v_in}. We then build a hierarchy by inferring parent-child relations from indent levels, thereby preserving the logical structure of the original text object v_i. The processed slide information and the script are provided as input to the LLM, along with a pre-designed Text-S2SG instruction prompt (Fig. 3).
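The indent-based decomposition described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the node layout {shape_ID, Content, children} follows the paper, while measuring indentation by counting leading spaces and deriving sub-IDs by suffixing the line index are our simplifying assumptions.

```python
# Sketch of the Sec. IV-A preprocessing: split a text object by
# newlines and nest the resulting lines by indent level.
def build_hierarchy(text, shape_id):
    """Split `text` by newlines and nest lines by indentation."""
    root = {"shape_ID": shape_id, "Content": None, "children": []}
    stack = [(-1, root)]  # (indent, node) path from root to last node
    for i, raw in enumerate(text.splitlines()):
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        node = {"shape_ID": f"{shape_id}-{i + 1}",
                "Content": raw.strip(), "children": []}
        # pop until the stack top is strictly shallower: that is the parent
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root

tree = build_hierarchy("Overview\n  Motivation\n  Method\nResults", "v1")
# "Overview" and "Results" are siblings; the indented lines nest under "Overview".
```

A stack of (indent, node) pairs suffices here because, as in the paper, parent-child relations are determined solely by indentation differences from the preceding sentences.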
For each script sentence, the model outputs the corresponding shape_ID(s), a unique ID for a slide object. To improve grounding accuracy, rules regarding context, group structures, and correspondence to multiple objects are also provided to the LLM. We believe that providing information on a per-page basis allows for a complete and concise presentation of the slide content.

The grounding in this method can also be represented as a binary decision on the relationship between each script sentence and slide object. In the proposed method, M can be treated as a binary matrix, where the entries are determined based on the LLM's output. For example, when g(s_1) = {v_2, v_4} and g(s_2) = {v_3}, the correspondence matrix M would be represented as follows:

M = [ 0 1 0 1 ]
    [ 0 0 1 0 ]. (3)

With this design, we achieve the automation of the S2SG task for slides containing only text objects, while considering the logical structure of the slide layout and the information gap with the script's content.

Fig. 4. System architecture: (A) Data Processing Module, (B) LLM Module, (C) Video Generation Module

B. Prototype System

This prototype system takes slides with speaker notes as input and automatically generates slide-based videos augmented with visual effects. The system is limited to slide pages containing only text objects, and the speaker notes for each slide are used as the script. As shown in Fig. 4, the system consists of three components: (A) a Data Processing Module, (B) an LLM Module, and (C) a Video Generation Module.

(A) The Data Processing Module converts the slide presentation into a format usable by subsequent modules and simultaneously renders each slide as an image for video generation.
First, it uses python-pptx to extract strings from each text object and assigns a unique identifier, shape_ID, to each. If a text object contains multiple sentences, it is split by newlines, and each resulting line is treated as an independent sentence element, as described in Section IV-A. Then, a hierarchical structure is constructed by analyzing the indent level of each sentence element and determining parent-child relationships based on the indentation differences from the preceding sentences. This hierarchy represents the logical parent-child relationships at the sentence level while preserving the structure of the original text objects (ContentList). Each text object has {group_shape_ID, ContentList} attributes, and each sentence element has {shape_ID, Content, children} attributes.

Furthermore, it generates "searchable slide data" for the video rendering process. This data lists all slide objects on a page in a flat structure, with each object assigned {shape_ID, Content, position} attributes. Here, position indicates the coordinate range on the slide image. The coordinates are assigned as follows: when converting the slide to an image via PDF, PyMuPDF is used to extract text and coordinate information. These results are then matched with the text information extracted by python-pptx.

You are a presentation video generation assistant. For each sentence in the following presentation script, please ground it to the object it describes from the list of slide objects. Note that not all objects are necessarily described. The overall goal is to create a short video clip for each script sentence and then concatenate them to produce the final video. Based on the grounding results, assign appropriate visual effect commands for each script sentence. You may assign multiple commands to a single sentence.

Fig. 5. Instruction prompt used for the LLM Module (B)
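The "searchable slide data" construction can be sketched as below. We assume the text lines and their bounding boxes have already been extracted from the rendered PDF (as PyMuPDF's text extraction would provide); the inputs here are hand-made stand-ins, and the exact-string matching rule and all names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of module (A)'s "searchable slide data": match each shape's
# text (as python-pptx would yield it) to a coordinate box (as PyMuPDF
# would yield from the PDF), giving each shape_ID a position range.
def build_searchable(shapes, pdf_lines):
    """shapes: {shape_ID: text}; pdf_lines: list of (text, (x1, y1, x2, y2))."""
    records = []
    for sid, content in shapes.items():
        pos = next((bbox for text, bbox in pdf_lines
                    if text.strip() == content.strip()), None)
        records.append({"shape_ID": sid, "Content": content, "position": pos})
    return records

data = build_searchable(
    {"s1": "Results", "s2": "F1-score: 0.924"},
    [("Results", (40, 30, 180, 55)), ("F1-score: 0.924", (40, 80, 300, 105))],
)
```

In a real pipeline the matching would need to tolerate whitespace and hyphenation differences between the two extractors; exact comparison is only a baseline.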
(B) The LLM Module takes the preprocessed slide data, the script, and a specific instruction prompt for this module (Fig. 5) as input. It then outputs a list of corresponding shape_IDs and visual effect commands for each script sentence. This output includes a simple implementation of the "Attention Control Problem," allowing for the automatic application of visual effects based on the narration's content. The LLM is called on a per-slide basis (e.g., for a 3-slide presentation, slides 1, 2, and 3 are processed independently).

Each command output by the LLM has {type, start_time, duration, param} attributes. The candidates for type are pointer, frame, and avatar, which are represented as POINT, RECTANGLE, and AVATAR, respectively, in the command. The pointer helps in guiding the viewer's gaze by dynamically pointing to the object being discussed, while the frame highlights a specific area to clearly indicate the object of attention. By providing these command candidates to the LLM, the system can automatically decide which object to emphasize, how, and when, according to the content of the narration. The meaning of each field is as follows: start_time and duration represent temporal control within the short video clip, and param stores effect-specific parameters. For example, for a pointer, param = {start_pos, end_pos}; for a frame, param = position. The parameters start_pos, end_pos, and position store the shape_ID of the corresponding object.

(C) The Video Generation Module generates a series of short video clips with visual effects rendered based on the LLM output. These clips are then sequentially combined to create the final video. Specifically, it uses the shape_ID contained in each command to reference the coordinate information from the "searchable slide data" generated in step (A).
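The command record and the shape_ID-to-coordinate lookup can be sketched as follows. The field names follow the paper; the dataclass and the helper function are our illustration, not the system's actual code.

```python
# Sketch of the visual-effect command (module B) and the coordinate
# lookup performed by the video module (C) before rendering.
from dataclasses import dataclass

@dataclass
class Command:
    type: str          # "POINT", "RECTANGLE", or "AVATAR"
    start_time: float  # seconds into the sentence's clip
    duration: float    # seconds the effect stays visible
    param: dict        # effect-specific parameters, storing shape_IDs

def resolve_rectangle(cmd, searchable):
    """Turn a RECTANGLE command's shape_ID into drawable coordinates."""
    index = {rec["shape_ID"]: rec["position"] for rec in searchable}
    return index[cmd.param["position"]]

searchable = [{"shape_ID": "s7", "Content": "Method", "position": (40, 30, 180, 55)}]
cmd = Command("RECTANGLE", start_time=0.5, duration=2.0, param={"position": "s7"})
box = resolve_rectangle(cmd, searchable)  # (40, 30, 180, 55)
```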
For a frame command, for example, it draws a rectangle enclosing the coordinate range from (x_1, y_1) to (x_2, y_2) for duration seconds, starting at start_time. Furthermore, the user can configure the display position and visibility of the slide and avatar. When the avatar is set to be hidden, avatar-related commands are excluded from the prompt to ensure they do not appear in the LLM's output.

V. EVALUATION EXPERIMENTS

We evaluate the usefulness of Text-S2SG by examining how the amount and type of slide information provided to the LLM affects grounding accuracy. From text-only slides, one can extract not only visible text but also coordinates, font sizes, and other attributes. However, excessive input detail might hinder performance. We test this hypothesis using the Gemini 2.5 Flash model (temperature = 0).

A. Experimental Method

In this experiment, we collected academic presentation slides from four individuals and constructed a test set of 19 text-only pages (from 6 presentations) and 94 script sentences. The text-only pages contained at least one textual element and could include decorative lines or arrows indicating the reading order, as well as headers, footers, or slide numbers, but excluded figures, tables, images, or other non-textual content. The slides averaged 8.9 objects (SD 3.5) and the scripts 4.9 sentences (SD 2.1), with no correlation between the two; sentences averaged 46.4 words (SD 21.6).

First, we considered two types of slide information: hierarchical (indent-based parent/child relations) and stylistic (font size, coordinates, and role such as title/body). We compared four data formats based on the presence/absence of these two types. Next, we describe the evaluation methods. For the evaluation, we provided one page of slide information and a list of script sentences as a single input to the LLM, which then output a list of shape_IDs corresponding to each script sentence.
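Since the prompt requests JSON output, the raw response must be parsed back into one list of shape_IDs per script sentence. The exact JSON layout is not specified in the paper, so the shape used below is an assumed example for illustration only.

```python
# Sketch: parse an LLM response in an assumed JSON layout into
# per-sentence shape_ID lists (empty lists are allowed, since a
# sentence may ground to no object).
import json

raw = '''{"groundings": [
  {"sentence": 1, "shape_IDs": ["v2", "v4"]},
  {"sentence": 2, "shape_IDs": ["v3"]},
  {"sentence": 3, "shape_IDs": []}
]}'''

def parse_llm_output(payload):
    """Return one list of shape_IDs per script sentence, in sentence order."""
    entries = json.loads(payload)["groundings"]
    return [e["shape_IDs"] for e in sorted(entries, key=lambda e: e["sentence"])]

lists = parse_llm_output(raw)  # [['v2', 'v4'], ['v3'], []]
```

In practice the parser would also need to validate that every returned ID exists on the current slide, since LLM output is not guaranteed to be well-formed.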
The instruction prompt used is shown in Fig. 3, along with the grounding rules mentioned in Section IV-A. After all outputs for every slide were generated, we compared the estimated results with the ground truth data and calculated the number of correct items for each script sentence. A correct item was defined as a matching list element between the ground truth list and the estimated list. For example, if the ground truth was [1,2,3] and the estimate was [1,2,4], the number of correct items would be 2. Based on this number, we calculated the F1-score for the entire set of test cases for each data format, using a micro-averaged approach (i.e., computed over the entire dataset). The ground truth data was created subjectively by one of the authors.

B. Experimental Results

Table I shows the experimental results. Across the four data formats (presence/absence of hierarchical and stylistic metadata), we observed no substantial differences; the average F1-score was 0.924. Most mismatches involved erroneously selecting titles or parent/child items of a correct object; excluding titles from the targets can mitigate these cases. Even when visual emphasis is applied based on such predictions, the discrepancies are unlikely to hinder audience comprehension. This also suggests that, at least for text objects, the grounding process can be performed sufficiently well as long as the specific displayed content is provided.

TABLE I
F1-SCORES FOR EACH DATA FORMAT

                   Hierarchical info
Stylistic info   Present   Absent   Average
Present           0.923     0.930    0.926
Absent            0.928     0.916    0.922
Average           0.925     0.923    0.924

VI. DISCUSSION

The S2SG proposed in this study is significant as it is the first to explicitly define the video editing work, which has hitherto been performed implicitly, as a scientifically verifiable task.
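The correct-item counting and micro-averaged F1 described in Section V-A can be sketched as follows. This is an illustrative reimplementation, not the authors' evaluation script: per-sentence intersection counts are pooled over the whole test set before computing precision, recall, and F1.

```python
# Sketch of the Sec. V-A metric: count shared list elements per
# sentence, then micro-average (pool counts over the entire dataset).
def micro_f1(truths, preds):
    """truths, preds: parallel lists of shape_ID lists, one per sentence."""
    correct = sum(len(set(t) & set(p)) for t, p in zip(truths, preds))
    total_pred = sum(len(p) for p in preds)
    total_true = sum(len(t) for t in truths)
    precision = correct / total_pred if total_pred else 0.0
    recall = correct / total_true if total_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The paper's example: truth [1, 2, 3] vs estimate [1, 2, 4] -> 2 correct items,
# so precision = recall = 2/3 and F1 = 2/3.
score = micro_f1([[1, 2, 3]], [[1, 2, 4]])
```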
The high performance (F1-score: 0.924) on text-only S2SG achieved in our evaluation experiments demonstrates that the novel challenge of S2SG can be solved by an LLM. This is a crucial achievement that paves the way toward the ultimate goal of automatically applying visual effects.

A particularly interesting finding from the experimental results is that the presence or absence of "stylistic information," such as coordinates and font sizes, did not affect accuracy. Randomly shuffling the script sentences also had no effect on performance. This strongly suggests that the LLM makes its judgments based on the content (semantics) of the text rather than its superficial appearance. This discovery offers a promising outlook for future system design, indicating that high performance can be expected from a simple, text-centric approach without necessarily handling complex visual features.

Furthermore, our method's content-based (rather than order-based) grounding suggests that it is, in principle, capable of handling non-linear discourse structures, such as when a presenter moves back and forth between topics. Additionally, the observation that mismatches occurred in semantically related areas like titles and parent-child objects indicates that the method already captures semantic proximity at a high level, implying that further accuracy improvements can be expected with the addition of simple rules. These points illustrate the high potential for the advancement of this approach.

This paper has only addressed text objects, which is merely the first step in the broader S2SG problem. However, this success is significant because it validates the "phased approach" we proposed: first, achieve high-performance grounding between texts, and then use these results as a foothold to tackle the more complex problem of grounding non-text objects (e.g., by leveraging VLMs). Presenting this feasible development roadmap is one of the major contributions of this research.
The high-performance grounding established in this study is merely the first step toward video generation. The next challenge is to solve the more advanced "Attention Control Problem": determining which objects to emphasize, at what timing, and with which visual effects, based on these grounding results. Findings from cognitive science indicate that inappropriate visual effects can increase learners' cognitive load and hinder their understanding [12]. Therefore, selecting which groundings should receive effects and scheduling when to display them become new research challenges that must be solved in an integrated manner to maximize learner comprehension. The results of this study provide the indispensable foundational technology for tackling this complex problem, and we are confident that it will lead to a future where anyone can easily produce high-quality educational and research content.

VII. CONCLUSION

In this study, we aimed to automate the creation of slide-based videos with visual effects and formulated S2SG as a foundational task. We developed Text-S2SG for text-only slides and demonstrated its effectiveness, achieving an average F1-score of 0.924 across multiple data formats without notable performance differences. Qualitative observations further indicated that most errors involved semantically related objects, which are unlikely to hinder viewer comprehension. These findings validate a phased approach that first focuses on text grounding. Future work will extend the method to slides containing non-textual elements and address the "Attention Control Problem" to determine appropriate visual effects.

ACKNOWLEDGMENT

This work was supported in part by JSPS KAKENHI Grant Numbers JP24K03052 and JP25K21351.

REFERENCES

[1] T. Xu et al., "From Recorded to AI-Generated Instructional Videos: A Comparison of Learning Performance and Experience," British Journal of Educational Technology, vol. 56, no. 4, pp. 1–25, 2024.
[2] J. Ge et al., "AutoPresent: Designing Structured Visuals from Scratch," in Proc. of CVPR'25, 2025, pp. 2902–2911.
[3] B. Wang et al., "LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing," in Proc. of IUI'24, 2024, pp. 699–714.
[4] B. Tilekbay et al., "ExpressEdit: Video Editing with Natural Language and Sketching," in Companion Proc. of IUI'24, 2024, pp. 515–536.
[5] W. Gritz et al., "From Formulas to Figures: How Visual Elements Impact User Interactions in Educational Videos," in Proc. of AIED'25, vol. 15879, 2025, pp. 453–467.
[6] T. Hayama et al., "Structure Extraction from Presentation Slide Information," in Proc. of PRICAI'08, 2008, pp. 678–688.
[7] S. Inoue et al., "Learning Support System using Question Generation to Encourage Critical Thinking in Online Lectures," in Proc. of IMCOM'23, 2023, pp. 1–8.
[8] I. Sano et al., "Slide Analysis Method for Editing Lecture Materials based on Hierarchical Structures of Subject Terminologies," in Proc. of ISM'24, 2024, pp. 279–284.
[9] D. W. Lee et al., "Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos," in Proc. of ICCV'23, 2023, pp. 20030–20041.
[10] M. Cheng et al., "Textual Grounding for Open-Vocabulary Visual Information Extraction in Layout-Diversified Documents," in Proc. of ECCV'24, 2024, pp. 474–491.
[11] H. Zhang et al., "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models," in Proc. of ECCV'24, 2024, pp. 19–35.
[12] J. Sweller, "Cognitive Load During Problem Solving: Effects on Learning," Cognitive Science, vol. 12, no. 2, pp. 257–285, 1988.