Using Semantic Similarity for Input Topic Identification in Crawling-based Web Application Testing

February 23, 2026


📝 Abstract

To automatically test web applications, crawling-based techniques are usually adopted to mine the behavior models, explore the state spaces or detect the violated invariants of the applications. However, in existing crawlers, rules for identifying the topics of input text fields, such as login ids, passwords, emails, dates and phone numbers, have to be manually configured. Moreover, the rules for one application are very often not suitable for another. In addition, when several rules conflict and match an input text field to more than one topic, it can be difficult to determine which rule suggests a better match. This paper presents a natural-language approach that automatically identifies the topics of input fields encountered during crawling by measuring their semantic similarity to the input fields in a labeled corpus. In our evaluation with 100 real-world forms, the proposed approach demonstrated comparable performance to the rule-based one. Our experiments also show that the accuracy of the rule-based approach can be improved by up to 19% when integrated with our approach.


📄 Content

Using Semantic Similarity for Input Topic Identification in Crawling-based Web Application Testing

Jun-Wei Lin and Farn Wang
Graduate Institute of Electrical Engineering, National Taiwan University, Taipei, Taiwan
{d01921014, farn}@ntu.edu.tw

CCS Concepts
• Software and its engineering → Software testing and debugging

Keywords
Input topic identification; web application testing; semantic similarity

INTRODUCTION

Web applications nowadays play important roles in our financial, social and other daily activities. Testing modern web applications is challenging because their behaviors are determined by the interactions among programs written in different languages and running concurrently in the front-end and the back-end. To avoid dealing with these complex interactions separately, test engineers treat the application as a black box and abstract the DOMs (Document Object Models) presented to the end user in the browser as states, modeling the behavior of the application as a state transition diagram on which model-based testing can be conducted. Since manual state exploration is often labor-intensive and incomplete, crawling-based techniques [9, 10, 13, 14, 15, 24, 25, 27, 29] have been introduced to systematically and automatically explore the state spaces of web applications.

Although such techniques automate the testing of complicated web applications to a great extent, they are limited in generating valid input values. It is crucial for a crawler to provide valid input values to the application under test (AUT) because many web applications require specific values in their input fields before the pages and functions behind the current forms can be accessed. To achieve proper coverage of the state space of the application, a user of existing crawlers needs to manually configure the rules for identifying the input topics in advance so that appropriate input values can be fed at run time. For example, Figure 1 illustrates an input field requesting a first name, a value of the topic first_name. To identify the topic of the input field, the values of its attributes such as id and name have to be compared with a preset feature string, “firstName”, and an appropriate value can then be determined by the identified topic.

Because a web page may request input values for many different topics, such as email, URL and password, this manual configuration has to be repeated for each of them. Moreover, the rules for one application are likely not suitable for another, since the naming conventions for input fields in different web applications are diverse. Finally, it can be difficult to determine the topic of an encountered input field when it matches multiple rules for different topics. These drawbacks of the rule-based approach to input field topic identification have greatly limited the broad application of existing crawling-based techniques.
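The rule-based matching described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the configuration format of any particular crawler: the `RULES` table, its feature strings, and the helper names `normalize` and `match_topics` are all hypothetical. Only the idea of comparing the `id` and `name` attributes against a preset feature string such as “firstName” comes from the text.

```python
import re

# Hypothetical rule table: topic -> preset feature strings.
# These entries are illustrative assumptions, not rules from the paper.
RULES = {
    "first_name": ["firstname", "fname"],
    "last_name": ["lastname", "lname"],
    "email": ["email", "mail"],
    "login_id": ["login", "userid", "username"],
}


def normalize(value):
    """Lowercase and drop non-alphanumerics so 'first_Name' becomes 'firstname'."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def match_topics(attributes):
    """Return every topic whose feature string occurs in the id or name attribute."""
    values = [normalize(attributes.get(key, "")) for key in ("id", "name")]
    matched = []
    for topic, features in RULES.items():
        if any(feature in value for feature in features for value in values):
            matched.append(topic)
    return matched


# The field from Figure 1 matches exactly one rule.
print(match_topics({"id": "firstName", "name": "firstName"}))

# A different naming convention matches no rule at all.
print(match_topics({"id": "given_name"}))

# An ambiguous name matches rules for two different topics.
print(match_topics({"id": "loginEmail"}))
```

The second and third calls illustrate the two drawbacks named above: rules written for one naming convention miss fields from another application, and a field that matches several rules leaves the crawler with no principled way to pick one topic.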

To address the issues of the rule-based approach for input topic identification in web application testing, several observations suggest the possibility of using natural-language techniques. First, in markup languages like HTML and XML, the words used to describe the attributes of input fields, such as id, name, type, and maxlength, are extremely limited. As a result, unlike a traditional natural-language task such as sentiment analysis, which needs a large corpus, we could build a representative corpus of moderate size for the inference. Second, computer programs identify the topics of the input fields by looking at their DOM attributes, but humans know what to fill in by reading the corresponding labels or descriptions

Figure 1. An example input field asking for a first name, a value belonging to the topic of first_name. (Panels: In Browser; The DOM Element; The Extracted Feature Vector.)
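
The two observations above (a small attribute vocabulary, and labels that carry the meaning a human reads) suggest comparing an input field with labeled examples rather than with hand-written rules. The sketch below illustrates that idea only. It uses bag-of-words cosine similarity as a simple stand-in, because this excerpt does not specify the similarity measure the authors use, and the `CORPUS` entries and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Attributes named in the text; which ones the authors actually use is an assumption.
ATTRIBUTES = ("id", "name", "type", "maxlength")


def tokenize(text):
    """Split on camelCase boundaries and non-alphanumerics, then lowercase."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [token.lower() for token in re.split(r"[^A-Za-z0-9]+", spaced) if token]


def feature_vector(element, label=""):
    """Bag of words built from the DOM attributes plus the visible label text."""
    words = []
    for key in ATTRIBUTES:
        words += tokenize(element.get(key, ""))
    words += tokenize(label)
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical labeled corpus of (topic, feature vector) pairs.
CORPUS = [
    ("first_name", feature_vector(
        {"id": "fname", "name": "first_name", "type": "text"}, "First name")),
    ("email", feature_vector(
        {"id": "email", "name": "email", "type": "text"}, "Email address")),
    ("password", feature_vector(
        {"id": "pwd", "name": "password", "type": "password"}, "Password")),
]


def identify_topic(element, label=""):
    """Return the topic of the most similar labeled example."""
    query = feature_vector(element, label)
    return max(CORPUS, key=lambda entry: cosine(query, entry[1]))[0]


print(identify_topic(
    {"id": "firstName", "name": "firstName", "type": "text"}, "Your first name"))
```

Because every candidate receives a similarity score, a field that would match several rules can still be resolved by taking the highest-scoring topic. A purely lexical measure like this one cannot relate synonyms such as "given name" and "first name", which is the gap a genuinely semantic similarity measure is meant to close.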

