A Rule-Based Short Query Intent Identification System

Using SMS (Short Message Service), cell phones can be used to query for information about various topics. In an SMS-based search system, one of the key problems is identifying the domain (broad topic) associated with the user query, so that a more comprehensive search can be carried out by a domain-specific search engine. In this paper we use a rule-based approach to identify the domain, implemented in a system called the Short Query Intent Identification System (SQIIS). We construct two different rule bases using different strategies to suit query intent identification, and we evaluate the two rule bases experimentally.

💡 Research Summary

The paper addresses a fundamental challenge in SMS‑based information retrieval: determining the broad topic (domain) that a user’s short query belongs to, so that a domain‑specific search engine can be invoked. While many prior works rely on statistical language models or supervised machine learning classifiers, these approaches typically require large labeled corpora and substantial computational resources—conditions that are often unrealistic for mobile environments where bandwidth, latency, and power are constrained. In contrast, the authors propose a purely rule‑based solution called the Short Query Intent Identification System (SQIIS), arguing that a carefully crafted set of expert‑defined rules can achieve comparable or even superior performance without the need for extensive training data.

SQIIS consists of three processing stages. First, a lightweight preprocessing pipeline tokenizes the incoming SMS, removes stop‑words, performs stemming, and filters out non‑textual artifacts such as emojis or special characters. The result is a compact set of lexical tokens that capture the essential content of the query. Second, the system applies a rule‑matching engine. Two distinct rule bases are constructed for experimental comparison:

Rule Base 1 (RB‑1) – Keyword‑Domain Mapping: For each domain (e.g., weather, traffic, finance, health, entertainment), a static list of representative keywords is compiled by domain experts. When a query arrives, the engine computes the intersection between the query tokens and each keyword list, and the size of that intersection serves as the domain score. This approach is straightforward but suffers from ambiguity when a token appears in multiple domains (e.g., “rain” could belong to weather or traffic).
Rule Base 2 (RB‑2) – Pattern‑Priority Hybrid: In addition to keyword lists, RB‑2 incorporates regular‑expression patterns that capture typical phrasal structures (“What’s the weather like…”, “When does the bus arrive?”) and a domain‑specific priority table. If a query matches multiple domains, the priority table resolves conflicts by favoring the domain whose pattern is deemed more indicative of user intent. This hybrid design reduces over‑matching and better handles syntactic variations.
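
The preprocessing stage and the two rule bases described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, keyword lists, regular expressions, priority order, and the suffix-stripping stemmer are all hypothetical stand-ins, since the summary does not list the actual rules. The sketch also assumes that RB‑2 tries patterns first, falls back to keyword counts, and breaks ties with the priority table.

```python
import re

# All lists below are hypothetical examples, not the paper's actual rules.
STOP_WORDS = {"the", "a", "an", "is", "it", "will", "what", "whats",
              "when", "does", "like", "in", "to", "for", "of"}

def crude_stem(token):
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def normalize(sms):
    """Lowercase, drop apostrophes, and blank out emojis and special characters."""
    text = sms.lower().replace("\u2019", "'").replace("'", "")
    return re.sub(r"[^a-z0-9\s]", " ", text)

def preprocess(sms):
    """Tokenize, remove stop-words, and stem."""
    return [crude_stem(t) for t in normalize(sms).split() if t not in STOP_WORDS]

# RB-1: one keyword list per domain, stemmed the same way as query tokens.
_RAW_KEYWORDS = {
    "weather": {"weather", "rain", "temperature", "forecast"},
    "traffic": {"traffic", "bus", "delay", "rain", "road"},
}
KEYWORDS = {d: {crude_stem(k) for k in kws} for d, kws in _RAW_KEYWORDS.items()}

def rb1_scores(tokens):
    """Domain score = number of distinct query tokens found in the keyword list."""
    token_set = set(tokens)
    return {d: len(token_set & kws) for d, kws in KEYWORDS.items()}

# RB-2: phrasal patterns plus a priority table (earlier entry = higher priority).
PATTERNS = {
    "weather": [re.compile(r"\bwhat(s| is) the weather\b")],
    "traffic": [re.compile(r"\bwhen does the (bus|train)\b")],
}
PRIORITY = ["weather", "traffic"]

def rb2_domain(sms):
    """Prefer pattern matches; otherwise use keyword counts; break ties by priority."""
    text = normalize(sms)
    candidates = [d for d, pats in PATTERNS.items()
                  if any(p.search(text) for p in pats)]
    if not candidates:
        scores = rb1_scores(preprocess(sms))
        best = max(scores.values())
        if best == 0:
            return None  # no rule fired
        candidates = [d for d, s in scores.items() if s == best]
    return min(candidates, key=PRIORITY.index)
```

With these toy rules, the query "Will it rain today?" scores one keyword hit in both weather and traffic under RB‑1, which is the kind of ambiguity described above; in this sketch RB‑2 resolves the tie through the priority table.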

The final stage aggregates the scores from the selected rule base, applies a weighted decision function, and outputs the most probable domain. The chosen domain then directs the query to a specialized backend search engine optimized for that topic.
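
One way to picture the aggregation and routing step is the sketch below. The summary does not specify the weights or the form of the decision function, so the linear combination, the default weights `w_keyword` and `w_pattern`, the alphabetical tie-break, and the backend names are all assumptions made for illustration.

```python
# Hypothetical backend names; the summary does not name the actual engines.
BACKENDS = {
    "weather": "weather-search",
    "traffic": "traffic-search",
}

def decide(keyword_scores, pattern_scores, w_keyword=1.0, w_pattern=2.0):
    """Combine per-domain scores linearly and return (best_domain, combined).

    The weights are assumed values; a pattern hit is counted as stronger
    evidence than a single keyword hit.
    """
    domains = set(keyword_scores) | set(pattern_scores)
    combined = {
        d: w_keyword * keyword_scores.get(d, 0) + w_pattern * pattern_scores.get(d, 0)
        for d in domains
    }
    if not combined:
        return None, combined
    # Iterating in sorted order makes ties resolve alphabetically and deterministically.
    best = max(sorted(combined), key=combined.get)
    return (best if combined[best] > 0 else None), combined

def route(domain):
    """Map the chosen domain to a backend, with a generic fallback."""
    return BACKENDS.get(domain, "general-search")
```

For example, `decide({"weather": 1, "traffic": 1}, {"traffic": 1})` selects traffic, because the pattern hit outweighs the tied keyword counts, and `route("traffic")` then returns the hypothetical `traffic-search` backend.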

To evaluate the two rule bases, the authors assembled a test set of 1,200 real SMS queries collected from a university campus network. Each query was manually annotated with one of five domains: weather, traffic, finance, health, or entertainment. Experiments measured standard classification metrics (accuracy, precision, recall, F1). RB‑1 achieved an overall accuracy of 78 %, with notable confusion between weather and traffic due to overlapping terms such as “rain” or “delay”. RB‑2, leveraging patterns and priorities, raised accuracy to 91 % and improved recall for the most ambiguous domains to over 94 %. McNemar's test, which compares two classifiers evaluated on the same queries, indicated that the performance gain of RB‑2 over RB‑1 is statistically significant (p < 0.01).
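
For readers unfamiliar with McNemar's test, the sketch below shows the exact two-sided version. It depends only on the discordant pairs: `b`, the number of queries the first classifier got right and the second got wrong, and `c`, the reverse. The summary does not report these counts, so any values passed in are hypothetical, and whether the authors used the exact or the chi-square variant is not stated.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant counts.

    b: items classifier A got right and classifier B got wrong
    c: items classifier A got wrong and classifier B got right
    Under the null hypothesis, each discordant item is equally likely
    to fall in b or c, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=4, c=19)
print(f"p = {p_value:.4f}")
```

Queries that both rule bases classify correctly, or both misclassify, do not affect the result, which is why the test suits paired comparisons on a single test set.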

The results demonstrate that a rule‑based approach can be highly effective for short, noisy queries typical of SMS. The authors highlight several practical advantages: (i) negligible training time, (ii) deterministic behavior that simplifies debugging, and (iii) low computational overhead suitable for on‑device or edge deployment. However, they also acknowledge limitations. Rule construction is labor‑intensive and requires deep domain expertise; as the number of domains grows, maintaining consistency and avoiding rule conflicts becomes increasingly challenging. Moreover, regular‑expression patterns are brittle against misspellings, emerging slang, or novel abbreviations common in mobile texting.

In the discussion, the authors propose several avenues for future work. A promising direction is to combine rule‑based inference with statistical or neural models in a hybrid architecture, allowing the system to fall back on learned representations when rules are inconclusive. Another suggestion is to implement an adaptive rule‑update mechanism that incorporates user feedback or click‑through data to automatically refine keyword lists and pattern priorities. Finally, extending the framework to support multilingual queries and cross‑cultural idioms would broaden its applicability to global mobile markets.

In conclusion, the paper makes a compelling case that, for SMS‑style short queries, a well‑engineered rule‑based system like SQIIS can deliver high‑quality domain identification with minimal resource demands. The comparative study of two rule‑base design strategies provides valuable insights into how pattern richness and priority handling influence performance, and it lays a solid foundation for subsequent research that seeks to blend deterministic rules with data‑driven learning in resource‑constrained environments.