Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems

Anastasia Mavridou, KBR Inc., NASA Ames
Divya Gopinath, KBR Inc., NASA Ames
Corina S. Păsăreanu, KBR Inc., NASA Ames

Abstract

The integration of AI components, particularly Deep Neural Networks (DNNs), into safety-critical systems such as aerospace and autonomous vehicles presents fundamental challenges for assurance. The opacity of AI systems, combined with the semantic gap between high-level requirements and low-level network representations, creates barriers to traditional verification approaches. These AI-specific challenges are amplified by longstanding issues in Requirements Engineering, including ambiguity in natural language specifications and scalability bottlenecks in formalization. We propose an approach that leverages AI itself to address these challenges through two complementary components. REACT (Requirements Engineering with AI for Consistency and Testing) employs Large Language Models (LLMs) to bridge the gap between informal natural language requirements and formal specifications, enabling early verification and validation. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) utilizes Vision Language Models (VLMs) to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, these components provide a comprehensive pipeline from informal requirements to validated implementations.

1 Introduction

The rise of AI-enabled systems has created a critical challenge.
As AI components, such as Deep Neural Networks (DNNs), become integrated into aerospace, autonomous vehicles, and other safety-critical domains, their opacity and complexity create fundamental barriers to assurance. Unlike traditional systems whose behavior can often be tested or formally verified, AI systems exhibit emergent behaviors that resist conventional verification and validation approaches. This opacity is compounded by a semantic gap: requirements are typically expressed in high-level natural language descriptions (e.g., English text), while DNNs process low-level representations (e.g., raw pixels). This mismatch between specification abstraction and implementation creates barriers to standard software engineering practices such as testing, debugging, runtime monitoring, and verification.

These new AI-specific challenges compound the difficulties already inherent to traditional Requirements Engineering (RE) practices. Requirements, typically expressed in natural language by practitioners, are prone to ambiguity, incompleteness and inconsistency [14], issues that intensify when specifying requirements for complex, heterogeneous systems integrating AI with conventional components. Moreover, requirements for learning-enabled components must extend beyond traditional specifications to capture uncertainty, confidence thresholds, and safety boundaries for emergent behaviors that may arise during operation.

Figure 1: Integrated Framework with REACT and SemaLens.

Below, we elaborate on the key challenges that this work seeks to address.

Need for Early Error Detection in Complex, Heterogeneous Systems: As systems grow in complexity and heterogeneity, especially with the inclusion of AI, the need for detecting errors and inconsistencies early in the design process becomes critical. If left unaddressed, these issues can propagate into implementation, leading to costly failures and late-stage redesigns.
This necessity is especially acute in safety-critical contexts, where requirement errors discovered during operation have resulted in catastrophic failures [1].

Ambiguity and Imprecision in Requirements: Effective early detection fundamentally depends on the quality of requirements. Requirements must be clear, unambiguous and precise. Yet, in practice, most requirements are written as natural language statements (e.g., English text), which are inherently ambiguous and prone to misinterpretation [14]. This creates a major hurdle for designers and developers, who seek a single, verifiable source of truth. Furthermore, to accommodate AI, specifications must evolve to include quantitative descriptions of performance boundaries, uncertainty handling, and confidence levels.

Scalability Bottlenecks in Requirements Engineering: The process of translating these informal, potentially ambiguous natural language statements into precise, verifiable specifications is a challenging, time-consuming, and often error-prone task. The translation step demands significant effort from engineers with expertise in formal specification languages, creating a major scalability bottleneck.

Inadequate Requirements-Based Testing: Testing and simulation remain crucial for the reliability and safety of critical systems, where failures can have severe consequences. Although various techniques have been developed to create test suites, requirements-based testing for AI systems, particularly those that use neural networks, remains largely unexplored.

Semantic Mismatch Between High-Level, Natural Language Requirements and Low-Level Network Representations: Human-specified requirements are written at a high level, typically in natural language; however, the inputs and internal processing logic of DNNs have a low-level representation.
For instance, a requirement for an autonomous vehicle could be "Always detect pedestrians"; however, the perception module sees a series of raw pixels and applies an uninterpretable logic to process them. This disconnect inhibits traceability of DNN behavior to requirements, posing a challenge for verification.

Safety-Assurance Bottleneck: DNNs are notoriously hard to analyze, explain and debug due to their opaque and complex nature. It is specifically challenging to measure test coverage of the behavior of perception models with respect to high-level semantic features of interest. For instance, a perception module in an autonomous vehicle would need to be sufficiently tested against different environment conditions (weather, time of day, obstacles and so on) and unexpected vagaries.

In this work, we propose an approach that fights AI with AI. In particular, we harness the linguistic and visual reasoning capabilities of modern foundation models to ensure the safety and reliability of AI-enabled systems. We introduce two complementary components that strategically deploy Large Language Models (LLMs) and Vision-Language Models (VLMs) to bridge the gaps between informal requirements, formal specifications, and DNN implementations (Figure 1). REACT (Requirements Engineering with AI for Consistency & Testing) uses LLMs to bridge the gap between informal natural language requirements and formal specifications, enabling automated consistency checking, test case generation aligned with specification semantics, and early verification and validation (V&V) with full traceability from requirements to test artifacts. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) employs VLMs as an analytical "lens" to reason about, test, and monitor DNN-based perception systems in terms of human-understandable concepts, closing the semantic gap between high-level requirements and low-level network representations (i.e., input images and videos).
Figure 1 shows our overall approach. We explore in detail the workflow in Section 2 and discuss benefits in Section 3.

Together, these components offer a comprehensive approach to requirements engineering and verification and validation for AI-enabled safety-critical systems. Both leverage AI to achieve scalability while maintaining the rigor necessary for high-assurance systems and compliance with industry standards such as DO-178C. The integration of these components creates an end-to-end pipeline from informal requirements to validated, tested implementations, enabling early error detection, reduced manual effort, and ultimately, safer autonomous systems.

2 Proposed Solution

Our proposed approach has two components, REACT and SemaLens, designed to be complementary and to form a rigorous toolchain for the assurance of complex, heterogeneous safety-critical systems. Figure 2 presents a demonstration of a sample workflow through the framework. Let us consider a requirement that ensures path completion of an autonomous vehicle; for this example we consider an experimental rover developed at NASA¹. The initial requirement text written in plain English is the following:

[REQ-LIV-002]: "Once the rover is navigating a designated path, it shall continue to move and successfully complete the segment by reaching the traffic cone, i.e., the rover must demonstrate non-blocking behavior toward this goal."

The requirement states that whenever the rover enters and maintains a state defined as being on the path, it is guaranteed that it will, at some future point, successfully detect and arrive at a target, e.g., a traffic cone. The rover is prohibited from entering an infinite loop, deadlock state, or any other condition that prevents progress towards the cone while navigating on a path. Next, we describe in detail the modules of each component.
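To make the intended semantics concrete: a plausible LTLf rendering of this requirement, written as a response-style property over the predicate names on_path and cone_encounter that appear in the monitoring example of Section 2.2, is the following (the exact formula produced by REACT Formalize may differ):

```latex
\mathsf{G}\,\big(\mathit{on\_path} \rightarrow \mathsf{F}\,\mathit{cone\_encounter}\big)
```

That is, globally over the finite trace, whenever the rover is on the path, an encounter with the cone must eventually follow.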
2.1 REACT

The REACT component is an AI-based Requirements Assistant that aims to help users systematically translate ambiguous natural language requirements into precise, verifiable formal specifications. By combining the generative capabilities of LLMs with the rigor of formal methods, REACT aims to support early consistency analysis, conflict detection, and automated test case generation directly from requirements. Next, we describe its integrated modules.

The REACT Author module assists users in authoring precise requirements at scale by combining the linguistic strengths of LLMs with the rigor of formal semantics. It takes as input unrestricted natural language (plain English) text and generates structured natural language (Restricted English (RE)) [2, 5, 6, 8], which is a human-readable language with constrained grammar that guarantees unambiguous interpretation.

For example, in Figure 2 (step 1), the natural language requirement [REQ-LIV-002] is fed to REACT Author, which uses LLMs to generate multiple candidate RE requirement translations. Rather than producing a single RE output, the module often produces a list of RE candidates. This is a deliberate choice that stems from the inherent ambiguity of the English requirement. Since such a requirement can often be interpreted in multiple ways, the LLM is tasked with explicitly enumerating all these potential interpretations, allowing users to validate and select the intended meaning. This approach preserves readability while embedding semantic precision directly into the requirements authoring process.

The REACT Validate module ensures the correctness of generated requirements by helping users select candidates that match the intended semantics. Since LLMs often produce multiple RE candidates from a single plain English requirement, each with subtle but meaningful semantic differences, the module uses formal validation to automatically distinguish these variations.
Figure 2: Example workflow from a natural language requirement to monitoring.

Rather than exposing users to complex formal logic, semantic differences are presented in engineer-friendly formats (e.g., execution traces or concrete scenarios). By simply accepting or rejecting a semantic difference, users can efficiently prune incorrect candidates. For example, as shown in Figure 2, in the end, the pruning process yields a single RE requirement that corresponds to the user's intended semantics. The validation process is highly targeted, focusing only on key semantic distinctions to minimize the manual effort required. This human-in-the-loop validation is critical. It ensures that requirements are vetted against the user's actual intent rather than relying on AI interpretation, which may misunderstand domain-specific nuances.

The REACT Formalize module translates validated RE requirements into formal specifications through seamless integration with requirement formalization tools such as FRET [4]. The module generates formal representations in formal logics such as Linear Temporal Logic for finite traces (LTLf) [3], as shown in Figure 2. To handle the complexity and uncertainty inherent in autonomous systems, this module supports translation to formal logics that accommodate requirement types specific to AI components, including Vision-Language Model (VLM)-based perception systems. This capability is essential for formalizing requirements that capture uncertainty, specify confidence criteria and bound emergence.

The REACT Analyze module supports robust early verification and validation (V&V) by performing automated formal analysis across the requirement set.

¹ https://ntrs.nasa.gov/api/citations/20250004071/downloads/RRAV_2025AmesParternshipDays.mp4
This capability systematically detects inconsistencies and conflicts at design time, before implementation begins, reducing the need for costly downstream rework and preventing defects from propagating into code.

Finally, the REACT Generate Test Cases module leverages the formalized requirements to automatically produce candidate test cases with coverage guarantees, which aims at addressing the requirements-based testing and coverage objectives of DO-178². This module accelerates testing workflows and ensures comprehensive, requirement-driven validation. It guarantees complete traceability by creating an explicit link between every formal requirement and its corresponding test cases. These test cases can then be given as input to SemaLens, to generate videos that check the semantic robustness of perception models.

² Note that DO-178 was not designed for the assurance challenges presented by Learning-Enabled Components like DNNs. However, the reality in aerospace (and other safety-critical domains) is that autonomous platforms often consist of mixed AI and traditional components. Therefore, DO-178 remains the mandatory baseline requirement for all non-AI, safety-critical software embedded in the system.

2.2 SemaLens

The SemaLens component of our framework aims to leverage emerging multi-modal foundation models such as vision-language models to analyze, test, debug, explain, and monitor DNNs used in perception for autonomous systems in terms of human-understandable concepts. Vision-Language Models (VLMs) [16], such as CLIP [12], are powerful models trained on massive amounts of image and textual data and can thus serve as a rich repository of human-understandable concepts for diverse images.

SemaLens Monitor: This module enables spatial and temporal reasoning over sequences of images (and videos).
It uses VLMs to extract concepts and spatial relationships from individual images and uses temporal logic to capture temporal relationships between images in a sequence, thereby building a monitor for automatic analysis of videos and image sequences.

The monitor could be used offline to find "unusual" sequences of (unlabeled) images/videos; e.g., parse through a collection of past accident videos and select sequences corresponding to risk scenarios. It can also be deployed online to check image sequences/videos at run-time for conformance with requirements and flag deviations, which is particularly useful in safety-critical settings. Although VLMs still struggle to extract complex spatial relationships from images, previous work, e.g., [15], has shown promising results.

SemaLens Monitor finds a natural integration with REACT. Consider the example in Figure 2. REACT Formalize synthesizes the property in LTLf from the given requirement in English. The formula is parsed and converted to a Deterministic Finite Automaton (DFA) (note that the symbols '|', '&', '~' denote logical or, and, and negation, respectively). To evaluate the automaton on a sequence of images, the predicates on_path and cone_encounter need to be evaluated on each image. This is done by feeding the image through the CLIP model (ViT-B/16) and computing the similarity [13] between the respective image embedding and the embeddings of textual captions corresponding to the predicates. In the example, a predicate is evaluated to True if the respective similarity is greater than a threshold (0.4 in our case).

² (cont.) Addressing the complexity of the integrated AI components requires leveraging emerging standards and guidelines specifically designed for assurance under uncertainty. For aerospace, a key example is the guidance emerging from the SAE G-34 working group, which is focused on defining certification and assurance methodologies for AI in Aviation to complement DO-178.
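The predicate-evaluation-plus-automaton loop can be sketched as follows. This is a minimal, self-contained illustration, not the actual implementation: in the real pipeline the per-image similarity scores come from CLIP (ViT-B/16) embeddings, whereas here they are hard-coded stand-ins so the automaton logic can be shown on its own, and the two-state automaton hand-encodes the pending "eventually reach the cone" obligation rather than being compiled from LTLf.

```python
# Sketch of the SemaLens Monitor step for a property of the form
# "globally, on_path implies eventually cone_encounter".
# Assumption: similarity scores would come from CLIP (ViT-B/16); the
# numbers below are hypothetical stand-ins.

THRESHOLD = 0.4  # similarity threshold used in the paper's example

def evaluate_predicates(similarities):
    """Threshold CLIP-style similarity scores into boolean predicates."""
    return {name: score > THRESHOLD for name, score in similarities.items()}

def step(state, preds):
    """Two-state automaton: "OK" (accepting) means no pending obligation;
    "PENDING" means the cone must still be encountered."""
    if preds["cone_encounter"]:
        return "OK"
    if preds["on_path"]:
        return "PENDING"
    return state

def monitor(similarity_sequence):
    """Return per-prefix verdicts for a sequence of per-image scores."""
    state, verdicts = "OK", []
    for sims in similarity_sequence:
        state = step(state, evaluate_predicates(sims))
        verdicts.append(state == "OK")
    return verdicts

# Hypothetical scores for three frames: the rover is on the path in all
# frames; only the third frame's cone similarity clears the threshold.
frames = [
    {"on_path": 0.55, "cone_encounter": 0.12},
    {"on_path": 0.58, "cone_encounter": 0.20},
    {"on_path": 0.51, "cone_encounter": 0.47},
]
print(monitor(frames))  # [False, False, True]
```

Under these stub scores the verdict flips to True on the third frame, mirroring the behavior described for the Figure 2 example.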
Figure 2 shows a sequence of images with respective predicate evaluations which satisfy the property; the monitor returns True from the third image onward.

SemaLens Img Generate: This module aims to use text-conditional diffusion models to generate semantically diverse test images (and videos) that conform with requirements written in natural language. It generates test images/videos constrained by input preconditions. While previous work has explored text-conditional diffusion models for requirements-based test image generation, see e.g., [11], we plan to build on that work by further adding semantic perturbations conditioned on prompts. The resulting test suites can serve to check the semantic robustness of perception models. The module can be integrated with REACT Generate Test Cases to take test sequences that satisfy temporal specifications and use them to generate videos; the test sequences ensure requirement coverage, while the diffusion model helps cover semantic features.

SemaLens Test: This module aims to use VLMs to define novel coverage metrics over sets of (unlabeled) images in terms of semantic features (that can, for instance, appear in the operational design domain – ODD – of the autonomous system). An image is said to cover a feature if the similarity score between the image embedding (as computed by the VLM) and the textual embedding of the feature is higher than a user-specified threshold. Statistical measures can be used to quantify how well a feature is covered by the set of images. This capability enables both black-box and white-box testing; in black-box mode we analyze an (unlabeled) data set to compute coverage of relevant features and identify gaps; in white-box mode we first map the embedding space of a perception component to the embedding space of a VLM and compute coverage through the lens of the VLM.
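The black-box coverage metric can be sketched as follows. This is an illustrative sketch only: in practice each score would be the cosine similarity between a VLM image embedding and the textual embedding of the feature, while here a precomputed score table and hypothetical ODD feature names stand in for those model calls.

```python
# Sketch of SemaLens Test coverage in black-box mode: a feature is
# "covered" by an image when the (stubbed) VLM similarity score exceeds
# a user-specified threshold; per-feature coverage is the fraction of
# images covering it.

THRESHOLD = 0.4  # user-specified similarity threshold

def feature_coverage(scores, features):
    """scores: one dict per image, mapping feature name -> similarity."""
    coverage = {}
    for feat in features:
        covered = sum(1 for img in scores if img.get(feat, 0.0) > THRESHOLD)
        coverage[feat] = covered / len(scores) if scores else 0.0
    return coverage

# Hypothetical ODD features and scores for a small unlabeled image set.
features = ["night", "rain", "pedestrian"]
scores = [
    {"night": 0.62, "rain": 0.10, "pedestrian": 0.45},
    {"night": 0.55, "rain": 0.12, "pedestrian": 0.08},
    {"night": 0.05, "rain": 0.09, "pedestrian": 0.50},
]
print(feature_coverage(scores, features))
```

On this toy data the "rain" feature is never covered, which is exactly the kind of test-set gap the module is meant to surface.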
SemaLens AED (Analyze, Explain, and Debug): This module uses a VLM as a lens to reason about the logic and behavior of a separate vision model. The crux of the approach is to build a map that aligns the embeddings of a vision model with the internal representations of the CLIP model [7, 9, 10]. The embedding space of CLIP thereby acts as a proxy to analyze, explain, and debug the DNN model's behavior in terms of user-defined concepts (without requiring manual annotations).

For instance, if a model classifies an image as a truck, it can be examined whether it is doing so for the right reasons, by checking if relevant concepts such as metallic and rectangular are also detected by the model. Such concepts can act as semantically meaningful explanations for the behavior of the perception module. Furthermore, if an image is misclassified by the model, the semantic mapping can be used to localize whether the bug lies in the vision encoder, leading to wrong concepts being extracted, or in the logic of the head. The behavior of the model with respect to different concepts can be analyzed statistically to obtain semantic heatmaps, which enable the identification of non-robust and brittle features. The heatmaps could also be deployed at runtime to flag adversaries and unsafe inputs, when their semantic profile deviates from expected patterns. Both SemaLens AED and SemaLens Test can be integrated with REACT to obtain the vocabulary of high-level human-understandable concepts of interest.

3 Benefits

We summarize below the benefits enabled by our framework.

3.1 REACT

Rigorous yet Accessible Approach: REACT combines the precision of formal analysis with the usability of plain English, offering a practical pathway for users to improve requirement quality without demanding expertise in formal specification. It provides execution traces, interpretable differences, and rationale for detected issues – not opaque outputs.
Early Verification and Validation to Prevent Costly Failures and Redesigns: Designed to enable early Verification and Validation (V&V) directly from English requirements at the earliest stages of design. It catches ambiguities and inconsistencies before they propagate downstream, reducing costly late-stage fixes.

Reduced Manual Validation Effort: Our validation approach focuses on key semantic distinctions, which reduces manual effort.

Scalability to Complex Projects through AI: Leverages LLMs to process and reason over large sets of requirements, making it suitable for time-sensitive, high-assurance systems.

Standards Compliance: Facilitates alignment with industry standards, such as DO-178C.

3.2 SemaLens

Better Reliability: Runtime analysis detects and mitigates AI errors by checking conformance to requirements in real time, thereby improving system reliability.

Safer AI Decisions: Explains what the AI "sees" and why it makes certain decisions, contributing to safer autonomy.

Reduced Manual Effort in Debugging: Interprets and debugs without requiring costly, time-consuming human annotations.

AI-Enabled Spatial-Temporal Reasoning: Enables complex reasoning over multiple modalities (image/text).

Generation of Diverse Test Inputs: Enables testing in unusual scenarios (beyond simulation) and for data-sparse environments to ensure robustness and safety of autonomous systems with DNN perception.

Black-Box and White-Box Coverage with Respect to High-Level Features: Proposes novel coverage metrics for the image domain without expensive human annotations.

4 Conclusion

This research idea paper describes emerging ideas on providing safety assurance for autonomous systems by exploring novel methods to leverage AI in various life cycle stages.
We proposed a workflow that incorporates two synergistic components, REACT and SemaLens, which offer a comprehensive approach harnessing the multi-modal capabilities of foundation models to enable requirements engineering, verification, validation and monitoring for AI-enabled safety-critical systems. By fighting AI with AI, our approach strives to address challenges that currently prevent certification of learning-enabled components in safety-critical applications.

References

[1] Arden Albee, Steven Battel, Richard Brace, Garry Burdick, John Casani, Jeffrey Lavell, Charles Leising, Duncan MacPherson, Peter Burr, and Duane Dipprey. Report on the loss of the Mars Polar Lander and Deep Space 2 missions. 2000.

[2] Marco Autili, Lars Grunske, Markus Lumpe, Patrizio Pelliccione, and Antony Tang. Aligning qualitative, real-time, and probabilistic property specification patterns using a structured English grammar. IEEE Transactions on Software Engineering, 41(7):620–638, 2015.

[3] Giuseppe De Giacomo and Moshe Y. Vardi. Linear temporal logic and linear dynamic logic on finite traces. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 854–860. AAAI Press, 2013.

[4] Dimitra Giannakopoulou, Anastasia Mavridou, Julian Rhein, Thomas Pressburger, Johann Schumann, and Nija Shi. Formal requirements elicitation with FRET. In International Working Conference on Requirements Engineering: Foundation for Software Quality (REFSQ-2020), number ARC-E-DAA-TN77785, 2020.

[5] Dimitra Giannakopoulou, Thomas Pressburger, Anastasia Mavridou, and Johann Schumann. Automated formalization of structured natural language requirements. Information and Software Technology, 137:106590, 2021.

[6] Lars Grunske. Specification patterns for probabilistic quality properties. In Proceedings of the 30th International Conference on Software Engineering, ICSE '08, pages 31–40, New York, NY, USA, 2008. Association for Computing Machinery.

[7] Boyue Caroline Hu, Divya Gopinath, Corina S. Păsăreanu, Nina Narodytska, Ravi Mangal, and Susmit Jha. Debugging and runtime analysis of neural networks with VLMs (a case study). In 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN), pages 161–172, 2025.

[8] Sascha Konrad and Betty H. C. Cheng. Automated analysis of natural language properties for UML models. In Jean-Michel Bruel, editor, Satellite Events at the MoDELS 2005 Conference, pages 48–57, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[9] Ravi Mangal, Nina Narodytska, Divya Gopinath, Boyue Caroline Hu, Anirban Roy, Susmit Jha, and Corina S. Păsăreanu. Concept-based analysis of neural networks via vision-language models. In AI Verification – First International Symposium, SAIV 2024, Montreal, QC, Canada, July 22–23, 2024, Proceedings, volume 14846 of Lecture Notes in Computer Science, pages 49–77. Springer, 2024.

[10] Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Soheil Feizi. Text-to-concept (and back) via cross-model alignment, 2023.

[11] Nusrat Jahan Mozumder, Felipe Toledo, Swaroopa Dola, and Matthew B. Dwyer. RBT4DNN: Requirements-based testing of neural networks, 2025.

[12] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.

[13] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[14] Kristin Yvonne Rozier. Specification: The biggest bottleneck in formal methods and autonomy. In Working Conference on Verified Software: Theories, Tools, and Experiments, pages 8–26. Springer, 2016.

[15] Felipe Toledo, Sebastian Elbaum, Divya Gopinath, Ramneet Kaur, Ravi Mangal, Corina S. Păsăreanu, Anirban Roy, and Susmit Jha. Monitoring safety properties for autonomous driving systems with vision-language models. In 2025 IEEE Engineering Reliable Autonomous Systems (ERAS), pages 1–8, 2025.

[16] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey, 2024.