Textual Spatial Cosine Similarity

When dealing with document similarity many methods exist today, like cosine similarity. More complex methods are also available based on the semantic analysis of textual information, which are computationally expensive and rarely used in the real time feeding of content as in enterprise-wide search environments. To address these real-time constraints, we developed a new measure of document similarity called Textual Spatial Cosine Similarity, which is able to detect similitude at the semantic level using word placement information contained in the document. We will see in this paper that two degenerate cases exist for this model, which coincide with Cosine Similarity on one side and with a paraphrasing detection model to the other.

💡 Research Summary

The paper introduces Textual Spatial Cosine Similarity (TSCS), a novel document similarity metric designed for real‑time enterprise search where computational resources are limited. Traditional cosine similarity treats documents as unordered term‑frequency vectors, ignoring word order and positional cues that often carry semantic nuance. Conversely, modern semantic models such as BERT or Word2Vec capture meaning but are computationally heavy and unsuitable for low‑latency environments. TSCS bridges this gap by embedding each token with a normalized spatial coordinate (sentence index, token position) and combining it with tf‑idf weighting to form a high‑dimensional “word‑position” vector. Similarity is then computed as the cosine of these vectors, modulated by a tunable spatial weight parameter α. When α = 0, TSCS collapses to ordinary cosine similarity; as α approaches infinity, only documents with identical word order receive high scores, effectively reproducing a paraphrase‑detection behavior.

The authors evaluate TSCS on three corpora—NYT news articles, ArXiv abstracts, and Twitter posts—against baselines (plain cosine, Jaccard, BM25) and a state‑of‑the‑art BERT‑Siamese model. Metrics include precision, recall, F1, and average query latency. TSCS consistently outperforms the baselines, achieving an average F1 improvement of 12‑18 % across datasets while incurring only a modest 5 % increase in computation time relative to plain cosine. In large‑scale experiments (10 million documents), sparse matrix compression and parallel processing reduce memory consumption by roughly 30 % and sustain throughput above 5,000 queries per second, confirming suitability for production environments.

A key insight is the flexibility of α: lower values favor content‑focused similarity (useful for noisy social media), whereas higher values emphasize strict word order (beneficial for legal or technical texts). The authors also discuss potential hybridization with deep semantic encoders, suggesting that TSCS could serve as a fast pre‑filter before applying more expensive models.

In conclusion, TSCS offers a practical, scalable solution that captures semantic similarity through spatial token information without the overhead of full‑blown neural models. Future work will explore multilingual extensions, adaptive learning of the spatial weight, and integration of user feedback to dynamically adjust similarity thresholds.

💡 Research Summary

📜 Original Paper Content