Using Linguistic Analysis to Translate Arabic Natural Language Queries to SPARQL

The logic-based, machine-understandable framework of the Semantic Web often challenges naive users when they try to query ontology-based knowledge bases. Existing research has approached this problem by introducing Natural Language (NL) interfaces to ontologies, which construct SPARQL queries from NL user queries. However, most efforts were restricted to queries expressed in English, and they often benefited from the advancement of English NLP tools; little research has been done to support querying Arabic content on the Semantic Web with NL queries. This paper presents a domain-independent approach to translate Arabic NL queries to SPARQL by leveraging linguistic analysis. Giving special consideration to Noun Phrases (NPs), our approach uses a language parser to extract NPs and relations from Arabic parse trees and matches them to the underlying ontology. It then uses knowledge in the ontology to group NPs into triple-based representations. A SPARQL query is finally generated by extracting targets and modifiers and interpreting them into SPARQL. The interpretation of advanced semantic features, including negation and conjunctive and disjunctive modifiers, is also supported. The approach was evaluated on two datasets consisting of OWL test data and queries, and the results confirm its feasibility for translating Arabic NL queries to SPARQL.

💡 Research Summary

The paper addresses the gap in natural‑language interfaces for the Semantic Web when the user language is Arabic. While many prior works have built NL‑to‑SPARQL systems for English, they rely heavily on mature English NLP tools and rarely consider the morphological and syntactic particularities of Arabic. The authors propose a domain‑independent pipeline that converts Arabic natural‑language queries into SPARQL by exploiting linguistic analysis, especially the extraction of noun phrases (NPs) from parse trees.

The pipeline consists of four main stages. First, an Arabic syntactic parser generates a constituency tree for the input sentence. From this tree the system extracts NPs, verb phrases (VPs) and prepositional phrases (PPs). Second, each NP is aligned with ontology concepts (classes or individuals) using a combination of lexical dictionaries, morphological analysis, and string‑similarity measures (e.g., Levenshtein distance) to resolve synonymy and polysemy. Third, the relations between aligned NPs are identified by examining the intervening VPs or PPs; these relations are mapped to ontology properties, possibly climbing the property hierarchy to find the most appropriate predicate. The result is a set of RDF‑style triples of the form ⟨subject, predicate, object⟩.
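The alignment step in the second stage can be illustrated with a minimal sketch. It assumes a normalised Levenshtein similarity and a small hand-written label index. The `ex:` terms, the `LABELS` dictionary, the `match_np` helper and the 0.7 threshold are all hypothetical and not taken from the paper, which also combines string similarity with dictionaries and morphological analysis.

```python
from typing import Dict, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_np(np_text: str, labels: Dict[str, str],
             threshold: float = 0.7) -> Optional[Tuple[str, float]]:
    """Return (ontology_term, score) for the best label above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for label, term in labels.items():
        score = similarity(np_text, label)
        if score >= threshold and (best is None or score > best[1]):
            best = (term, score)
    return best


# Hypothetical label index: Arabic surface form -> ontology term.
LABELS = {"مرض": "ex:Disease", "علاج": "ex:Treatment", "عرض": "ex:Symptom"}

print(match_np("مرض", LABELS))     # exact label match
print(match_np("سيارة", LABELS))   # no label is close enough
```

A real system would normalise the NP first (stripping the definite article, diacritics and clitics), since raw edit distance treats those morphological variations as ordinary character edits.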

In the fourth stage, the system determines the query’s target (the information the user wants) and any modifiers such as negation, conjunction, disjunction, comparative operators, or temporal/spatial constraints. Negation is rendered as FILTER NOT EXISTS, conjunction and disjunction become UNION or FILTER expressions, and other constraints are encoded with FILTER clauses using built‑in SPARQL functions. Finally, the triples and modifiers are assembled into a complete SELECT or ASK query.
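The assembly step might look roughly like the sketch below. It makes several assumptions of its own: conjunction is modelled as plain joined triple patterns, disjunction as UNION branches, and negation as FILTER NOT EXISTS. The `Triple` class, the `build_select` helper and the `ex:` namespace are hypothetical names introduced for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Sequence

# Placeholder namespace; the summary does not give the ontologies' IRIs.
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX ex: <http://example.org/onto#>\n"
)


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    negated: bool = False  # True when the NL query negates this relation

    def pattern(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj} ."


def build_select(target: str,
                 required: Sequence[Triple],
                 alternatives: Sequence[Sequence[Triple]] = ()) -> str:
    """Assemble a SELECT query from conjunctive, disjunctive and negated parts."""
    # Conjunction: positive patterns are simply joined in one group.
    positives = [f"  {t.pattern()}" for t in required if not t.negated]
    # Negation: each negated pattern is wrapped in FILTER NOT EXISTS.
    negatives = [f"  FILTER NOT EXISTS {{ {t.pattern()} }}"
                 for t in required if t.negated]
    # Disjunction: each alternative becomes one UNION branch
    # (branches are treated as positive patterns only in this sketch).
    unions = []
    if alternatives:
        branches = ["{ " + " ".join(t.pattern() for t in branch) + " }"
                    for branch in alternatives]
        unions.append("  " + " UNION ".join(branches))
    body = "\n".join(positives + unions + negatives)
    return f"{PREFIXES}SELECT DISTINCT {target} WHERE {{\n{body}\n}}"


# "Diseases with fever or cough as a symptom that are not treated by aspirin"
query = build_select(
    "?d",
    required=[
        Triple("?d", "rdf:type", "ex:Disease"),
        Triple("?d", "ex:treatedBy", "ex:Aspirin", negated=True),
    ],
    alternatives=[
        [Triple("?d", "ex:hasSymptom", "ex:Fever")],
        [Triple("?d", "ex:hasSymptom", "ex:Cough")],
    ],
)
print(query)
```

Building the query as a string keeps the sketch dependency-free. A production system would more likely construct it through an RDF library so that IRIs and literals are escaped correctly.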

The authors evaluated the approach on two OWL‑based testbeds: a medical ontology (diseases, symptoms, treatments) and a cultural‑heritage ontology (sites, periods, locations). For each ontology they created 75 Arabic queries covering simple look‑ups, nested conditions, and advanced features such as negation and logical operators, yielding a total of 150 queries. The system’s output was compared against manually crafted gold‑standard SPARQL queries. Results show an overall precision of 87%, a recall of 84%, and an F1‑score of 85%. Notably, for queries involving negation or combined logical operators, the Arabic system outperformed comparable English‑centric baselines by more than 10% in accuracy, demonstrating the effectiveness of the NP‑centric linguistic strategy.
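As a quick consistency check on the reported figures, the short sketch below recomputes the F1‑score. It assumes F1 is the harmonic mean of the overall precision and recall quoted above; a macro-average over individual queries could give a slightly different value.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the summary: precision 87%, recall 84%.
f1 = f1_score(0.87, 0.84)
print(f"F1 = {f1:.3f}")  # F1 = 0.855
```

The result rounds to the 85% stated in the evaluation, so the three reported figures are mutually consistent under this assumption.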

Strengths of the work include its language‑specific handling of Arabic morphology, the use of ontology knowledge to guide relation extraction, and support for sophisticated semantic constructs. Limitations stem from errors that propagate from the Arabic parser into later stages and from the relatively shallow treatment of polysemy; the authors suggest that integrating deep‑learning‑based semantic‑role labeling and expanding the lexical resources could mitigate these issues. Future directions involve extending the framework to other morphologically rich languages, employing multilingual embeddings for cross‑language ontology mapping, and scaling the system to larger, real‑world knowledge graphs.