Quality Model for Machine Learning Components

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Despite increased adoption and advances in machine learning (ML), studies show that many ML prototypes never reach production and that testing remains largely limited to model properties, such as predictive performance, without considering requirements derived from the system the model will be part of, such as throughput, resource consumption, or robustness. This limited view of testing leads to failures in model integration, deployment, and operations. In traditional software development, quality models such as ISO 25010 provide a widely used, structured framework to assess software quality, define quality requirements, and establish a common language for communication with stakeholders. A newer standard, ISO 25059, defines a more specific quality model for AI systems; however, it combines system attributes with ML component attributes, which is unhelpful for a model developer because many system attributes cannot be assessed at the component level. In this paper, we present a quality model for ML components that serves as a guide for requirements elicitation and negotiation and provides a common vocabulary for ML component developers and system stakeholders to agree on system-derived requirements and focus their testing efforts accordingly. The quality model was validated through a survey in which participants agreed on its relevance and value, and it has been integrated into an open-source tool for ML component testing and evaluation, demonstrating its practical application.


💡 Research Summary

The paper addresses a critical gap in the current practice of testing machine‑learning (ML) components: most testing efforts focus solely on model performance metrics such as accuracy, while ignoring system‑derived non‑functional requirements like throughput, resource consumption, latency, robustness, and security. This narrow view often leads to integration, deployment, and operational failures, especially when ML components are developed by teams separate from the product owners or are outsourced. Traditional software quality frameworks such as ISO 25010 and the newer ISO 25059 for AI systems provide comprehensive quality models, but they conflate system‑level attributes with component‑level attributes, making them impractical for ML component developers who cannot assess many system‑level qualities in isolation.

To fill this gap, the authors construct a dedicated quality model for ML components. Their methodology starts with the ISO 25010, ISO 25029, ISO 25019, and ISO 25059 standards, complemented by two recent studies—one industrial (Chouliaras et al.) and one academic (Habibullah et al.)—that discuss non‑functional requirements for ML. From these sources they extract 163 raw quality attributes (QAs). Using card‑sorting, deduplication, and annotation activities, they group the raw QAs into 80 clusters, then label each cluster as Type 1 (directly testable at the component level), Type 2 (system‑derived requirement that can be expressed as a testable component property), or Type 3 (purely system‑level). After iterative discussion, 44 clusters are marked as Type 1 (or both Type 1 and Type 2), 65 as Type 2, and 10 as Type 3. The authors then craft concise, ML‑focused definitions for the 44 Type 1/2 clusters, eventually converging on 35 distinct QAs.
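The cluster-labeling step above can be sketched as a small data-filtering exercise. The sketch below is illustrative only: the cluster names and type assignments are hypothetical examples, not the paper's actual list; it simply shows how Type 1/2/3 labels let one select the clusters that can yield component-level tests.

```python
from dataclasses import dataclass
from enum import Enum

class QAType(Enum):
    COMPONENT_TESTABLE = 1   # Type 1: directly testable at the component level
    SYSTEM_DERIVED = 2       # Type 2: system requirement expressible as a component property
    SYSTEM_ONLY = 3          # Type 3: assessable only at the system level

@dataclass
class QACluster:
    name: str
    types: set  # a cluster may carry more than one type label

# Hypothetical clusters for illustration (not the paper's data).
clusters = [
    QACluster("inference latency", {QAType.COMPONENT_TESTABLE, QAType.SYSTEM_DERIVED}),
    QACluster("robustness to input noise", {QAType.COMPONENT_TESTABLE}),
    QACluster("end-to-end availability", {QAType.SYSTEM_ONLY}),
]

# Keep only clusters that can yield component-level tests (Type 1 or Type 2).
testable = [
    c.name for c in clusters
    if c.types & {QAType.COMPONENT_TESTABLE, QAType.SYSTEM_DERIVED}
]
print(testable)  # ['inference latency', 'robustness to input noise']
```

Purely system-level (Type 3) clusters drop out of the component quality model, mirroring the authors' selection of the Type 1/2 clusters for definition.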

A final categorization step groups these 35 QAs into nine high‑level categories via another round of card‑sorting. After further refinement, the “Trustworthiness” category is removed as it proved to be a composite of multiple attributes, leaving seven categories and 30 final quality attributes. The categories cover performance & efficiency, reliability & stability, data & I/O, security & privacy, ethics & fairness, maintainability & operability, and integration & interfacing. Each attribute is accompanied by a clear definition and suggested measurable indicators, enabling automated test generation.

The model’s relevance is validated through an invitation‑only online survey conducted in August–September 2025. Twenty‑two practitioners from a government technology transition center, a research institute, and a large industrial organization responded. Demographic analysis shows a wide range of roles, experience levels, and familiarity with quality attributes—over half of respondents reported little or only reading‑level knowledge of QAs, underscoring the need for a shared vocabulary. Survey responses reveal that most practitioners currently test only performance‑related properties; non‑functional aspects such as latency, memory usage, and robustness are rarely addressed, especially for large language models (LLMs), where additional concerns like bias, consistency, and privacy emerge. Participants rated the 30 proposed attributes as highly important, with “data quality”, “time behavior”, and “security & privacy” identified as both critical and challenging to test.

To demonstrate practical applicability, the authors integrate the quality model into MLTE (ML Test and Evaluation), an open‑source framework for ML component testing. MLTE provides a test catalog organized according to the quality model, ensuring that at least one concrete test example exists for each attribute. The tool assists developers in eliciting system‑derived requirements, mapping them to component‑level tests, and embedding these tests into CI/CD pipelines, thereby bridging the gap between system architects and ML model developers.
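A component-level test of a system-derived requirement might look like the following. This is a minimal, tool-agnostic sketch, not MLTE's actual API: the `predict` callable, the sample inputs, and the 50 ms latency budget are all hypothetical, standing in for values a system architect would supply.

```python
import statistics
import time

def check_latency(predict, sample_inputs, p95_budget_ms=50.0):
    """Measure per-call latency of `predict` and compare the 95th
    percentile against a system-derived budget (hypothetical value)."""
    timings_ms = []
    for x in sample_inputs:
        start = time.perf_counter()
        predict(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    p95 = statistics.quantiles(timings_ms, n=20)[-1]  # ~95th percentile
    return p95 <= p95_budget_ms, p95

# Usage with a stand-in "model"; in CI/CD this would wrap the real
# component's inference call and fail the pipeline on a budget breach.
ok, p95 = check_latency(lambda x: x * 2, list(range(100)))
```

Expressing the requirement as a pass/fail check against an explicit threshold is what lets such tests run unattended in a CI/CD pipeline, as described above.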

In conclusion, the paper delivers a rigorously derived, empirically validated quality model tailored to ML components, together with tooling that operationalizes the model. By distinguishing between component‑testable attributes and system‑derived constraints, the model enables developers to articulate, measure, and verify non‑functional requirements early in the development lifecycle, reducing costly integration failures. Suggested future work includes extending the model to domain‑specific ML types (e.g., reinforcement learning, time‑series forecasting), incorporating dynamic monitoring for evolving models, and conducting large‑scale industrial case studies to refine the model further and promote its adoption as a de facto standard in MLOps practice.
