A Novel String Distance Function based on Most Frequent K Characters
This study introduces a novel similarity metric that speeds up comparison operations and is also suitable for distance-based operations among strings. Simple measures, such as string length, are fast to compute but do not represent a string well. Conversely, methods such as keeping a histogram over all characters in the string are slower, but represent string characteristics well in some domains, such as natural language. We propose a new metric that is easy to calculate and satisfactory for string comparison. The method is built on a hash function that takes a string of any size and outputs the most frequent K characters with their frequencies. These outputs can be compared directly, and our experiments show that the success rate is quite satisfactory for text-mining operations.
💡 Research Summary
The paper introduces a novel string distance function called “Most Frequent K Characters” (MF‑K) designed to accelerate similarity calculations while preserving enough semantic information for practical use. Traditional edit‑distance measures such as Levenshtein provide optimal alignment but incur O(n·m) time, making them unsuitable for massive corpora. Histogram‑based approaches (e.g., cosine similarity on full character frequency vectors) run in O(|Σ|) time but require storing the entire frequency distribution, which can be costly in memory and preprocessing.
MF‑K bridges this gap by extracting only the top K most frequent characters from each string together with their counts. The extraction process consists of a single linear scan to build a frequency map, followed by a selection step (heap, quick‑select, or partial sort) to obtain the K highest‑frequency entries, and finally a deterministic ordering (e.g., alphabetical) to produce a fixed‑size feature vector or hash. Because K is a small constant (the authors report satisfactory results with K = 5–10 for English text), the overall computational complexity is essentially O(n) for a string of length n, and the memory footprint is O(K).
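The extraction step described above can be sketched as follows. This is an illustrative implementation, not the paper's reference code; the function name and the alphabetical tie-breaking rule are assumptions consistent with the "deterministic ordering" the summary mentions.

```python
from collections import Counter


def most_frequent_k(s: str, k: int = 5) -> dict:
    """Return the k most frequent characters of s with their counts.

    One linear scan builds the frequency map, then a partial sort
    selects the k highest-frequency entries. Ties are broken
    alphabetically so the output is deterministic.
    """
    counts = Counter(s)  # O(n) scan over the string
    # Sort by descending frequency, then by character for determinism.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)
```

For example, `most_frequent_k("mississippi", 3)` yields `{'i': 4, 's': 4, 'p': 2}`. For large alphabets, `heapq.nlargest` could replace the full sort to keep the selection step closer to O(|Σ| log K).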
The distance between two strings can be defined in several ways. The authors experiment with (a) a Jaccard‑style similarity that measures the overlap of the two K‑tuples, (b) an L1 norm of the frequency differences, and (c) an L2 norm. Empirically, the L1‑based distance offers a good trade‑off: it captures magnitude differences while remaining simple to compute.
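The three variants can be sketched over two top-K frequency maps as below. The exact formulas are assumptions inferred from the summary (Jaccard overlap of the K-tuples, L1 and L2 norms of frequency differences), not the paper's precise definitions.

```python
def mfk_distance(a: dict, b: dict, metric: str = "l1") -> float:
    """Distance between two top-K frequency maps (illustrative sketch).

    "jaccard": 1 minus the overlap of the two character sets.
    "l1"/"l2": norms over the per-character frequency differences,
    treating characters absent from a map as having count 0.
    """
    chars = set(a) | set(b)
    if metric == "jaccard":
        overlap = len(set(a) & set(b))
        return 1.0 - overlap / len(chars)
    diffs = [abs(a.get(c, 0) - b.get(c, 0)) for c in chars]
    if metric == "l1":
        return float(sum(diffs))
    if metric == "l2":
        return sum(d * d for d in diffs) ** 0.5
    raise ValueError(f"unknown metric: {metric}")
```

With `a = {'i': 4, 's': 4, 'p': 2}` and `b = {'i': 4, 's': 4, 'm': 1}`, the L1 distance is 3.0 (only `p` and `m` differ), while the Jaccard distance is 0.5 (two of four characters shared).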
Evaluation is performed on three heterogeneous datasets: (1) a large English news article corpus (≈2 million sentences), (2) a Twitter stream containing short, noisy messages with emojis and hashtags, and (3) DNA sequences composed of the four nucleotides A, C, G, T. For each dataset the authors compare MF‑K against full‑histogram cosine similarity and Levenshtein distance in two downstream tasks—text classification and clustering. Results show that MF‑K achieves classification F1 scores within 0.5 % of the full histogram baseline and 2–3 % higher than Levenshtein, while reducing average pairwise computation time by 35 %–65 %. Increasing K improves accuracy modestly but linearly raises runtime, confirming the expected speed‑accuracy trade‑off.
The paper also discusses limitations. When K is set too low, important discriminative characters may be omitted, leading to degraded similarity judgments. Very short strings (≤ 10 characters) suffer from unstable top‑K selection, making MF‑K less reliable than full‑histogram methods. Moreover, non‑standard characters (emojis, special symbols) can skew frequency distributions, suggesting a need for preprocessing or extended character sets.
To mitigate these issues, the authors propose (i) multi‑K strategies that combine several values of K to capture both coarse and fine‑grained features, and (ii) adaptive preprocessing pipelines that normalize or group rare symbols before frequency extraction. Future work is outlined as automatic K‑selection via meta‑learning, integration of multiple hash functions for richer representations, and extensive testing on highly irregular text domains.
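A minimal sketch of the adaptive preprocessing idea, under the assumption that "normalizing or grouping rare symbols" means folding case, stripping accents, and mapping all non-alphanumeric characters to a single placeholder so that emojis and punctuation cannot skew the top-K counts. The placeholder choice (`#`) is hypothetical.

```python
import unicodedata


def normalize(s: str) -> str:
    """Group rare symbols before frequency extraction (illustrative).

    Lowercase, decompose and strip accents, then map every character
    that is neither alphanumeric nor whitespace to one placeholder.
    """
    s = unicodedata.normalize("NFKD", s).lower()
    s = "".join(c for c in s if not unicodedata.combining(c))
    return "".join(c if c.isalnum() or c.isspace() else "#" for c in s)
```

For instance, `normalize("Café 🎉!")` returns `"cafe ##"`: the accented letter folds to plain `e`, while the emoji and punctuation collapse into the placeholder class.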
In conclusion, MF‑K offers a pragmatic solution for large‑scale, real‑time string similarity tasks. Its linear‑time extraction, constant‑size representation, and tunable K parameter make it well‑suited for applications such as streaming log analysis, spam detection, large‑scale document clustering, and any scenario where rapid approximate distance calculations are more valuable than exact edit distances. The contribution lies not only in the specific distance formulation but also in demonstrating that a carefully chosen, compact statistical summary of a string can deliver near‑baseline accuracy with substantially lower computational overhead.