Application of distances between terms for flat and hierarchical data
In machine learning, distance-based algorithms and other approaches use information represented as propositional data. However, this kind of representation can be quite restrictive and, in many cases, more complex structures are required to represent data in a more natural way. Terms are the basis of representation in functional and logic programming. Distances between terms are a useful tool not only to compare terms, but also to determine the search space in many of these applications. This dissertation applies distances between terms, exploiting the features of each distance and the possibility of comparing anything from propositional data types to hierarchical representations. The term distances are applied through the k-NN (k-nearest neighbors) classification algorithm, using XML as a common representation language. To represent these data in an XML structure and take advantage of the benefits of term distances, some transformations are necessary. These transformations convert flat data into hierarchical data represented in XML, using techniques based on intuitive associations between the names and values of variables, as well as associations based on attribute similarity. Several experiments with the term distances of Nienhuys-Cheng and Estruch et al. were performed. For originally propositional data, these distances are compared to the Euclidean distance. In all cases, the experiments used the distance-weighted k-nearest neighbor algorithm with several exponents for the attraction function (weighted distance). The results show that, in some cases, term distances can significantly improve on approaches applied to flat representations.
💡 Research Summary
This dissertation investigates the use of term‑based distances for classification tasks on both flat (propositional) and inherently hierarchical data, employing XML as a common representation format. Traditional distance‑based machine‑learning methods, such as k‑nearest‑neighbors (k‑NN), typically operate on tabular data where each instance is a vector of numeric or categorical attributes. While convenient, this representation often fails to capture richer relational or hierarchical information that may be present in many real‑world domains.
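As a minimal illustration of the distance-weighted k-NN scheme described above, the sketch below lets each neighbor vote with weight 1/d^p, where p is the attraction-function exponent. The data, the distance function, and the parameter values are hypothetical placeholders, not the dissertation's experimental setup:

```python
from collections import defaultdict

def weighted_knn_predict(train, query, distance, k=3, p=2):
    """Distance-weighted k-NN: each of the k nearest neighbors votes
    with weight 1 / d(query, x)**p (attraction exponent p)."""
    # Sort training pairs (instance, label) by distance to the query.
    neighbors = sorted(train, key=lambda pair: distance(query, pair[0]))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = distance(query, x)
        if d == 0:                  # an exact match dominates the vote
            return label
        votes[label] += 1.0 / d ** p
    return max(votes, key=votes.get)

# Toy usage with Euclidean distance on 2-D points (illustrative only).
def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
print(weighted_knn_predict(train, (0.2, 0.1), euclidean, k=3, p=2))  # → a
```

With term distances, only the `distance` argument changes; the voting scheme itself is representation-agnostic.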
The author proposes a two‑step pipeline. First, flat data are transformed into XML trees that encode hierarchical relationships. Two transformation strategies are explored: (1) a name‑value hierarchy that groups attributes sharing the same variable name under a common parent node, and (2) an attribute‑similarity hierarchy that clusters attributes based on statistical similarity (e.g., chi‑square tests, correlation) and creates a parent node for each cluster. Both approaches preserve the original attribute values as leaf nodes, maintain XML depth ordering, and generate a well‑formed tree suitable for term‑distance computation.
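The name-value grouping strategy can be sketched as follows. This is an illustrative transformation only, assuming attributes share a common name prefix separated by an underscore; the record and grouping rule are hypothetical, not the dissertation's actual procedure:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def flat_to_xml(record):
    """Group attributes that share a name prefix (before '_') under a
    common parent node; original values become leaf text nodes."""
    root = ET.Element("instance")
    groups = defaultdict(list)
    for name, value in record.items():
        prefix = name.split("_")[0]
        groups[prefix].append((name, value))
    for prefix, items in groups.items():
        parent = ET.SubElement(root, prefix)
        for name, value in items:
            leaf = ET.SubElement(parent, name)
            leaf.text = str(value)
    return root

# Hypothetical flat record: 'petal_*' attributes land under one <petal>
# node, 'sepal_*' under one <sepal> node.
record = {"petal_length": 1.4, "petal_width": 0.2, "sepal_length": 5.1}
print(ET.tostring(flat_to_xml(record), encoding="unicode"))
```

The attribute-similarity variant would replace the prefix rule with a clustering step, but the tree-building code stays the same: one parent node per cluster, original values as leaves.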
Second, the transformed XML instances are compared using two term‑distance functions. The Nienhuys‑Cheng (NC) distance recursively matches the nodes of two trees, assigning a unit cost to mismatches and aggregating the costs of matched sub‑trees. The Estruch et al. (ET) distance extends NC by incorporating context‑sensitive weighting (depth, sibling count) and a repetition penalty that discourages multiple matches of identical sub‑structures. Both distances return a normalized score in the interval [0, 1], where 0 indicates identical trees.
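A minimal sketch of the NC recursion, assuming terms are encoded as `(symbol, children)` tuples and following the usual definition (mismatched root symbols or arities cost 1; otherwise the children's distances are averaged and halved, which keeps values in [0, 1]). The ET extensions (contextual weighting, repetition penalty) are omitted here:

```python
def nc_distance(s, t):
    """Nienhuys-Cheng distance on terms given as (symbol, [children]).
    Differing roots (or arities) cost 1; otherwise the distance is the
    mean of the children's distances divided by 2, so it stays in [0, 1]."""
    f, ss = s
    g, ts = t
    if f != g or len(ss) != len(ts):
        return 1.0
    if not ss:                      # identical constants
        return 0.0
    n = len(ss)
    return sum(nc_distance(a, b) for a, b in zip(ss, ts)) / (2 * n)

# Toy terms: f(a, b) vs f(a, c) differ in one of two arguments.
f_ab = ("f", [("a", []), ("b", [])])
f_ac = ("f", [("a", []), ("c", [])])
print(nc_distance(f_ab, f_ac))  # 1/(2*2) * (0 + 1) = 0.25
```

The halving at each level means deeper disagreements contribute less to the total, which is what makes the distance sensitive to where in the tree two terms differ.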