Features Based Text Similarity Detection
As the Internet helps us cross cultural borders by providing access to diverse information, plagiarism issues are bound to arise. As a result, plagiarism detection has become increasingly important in addressing this problem. Different plagiarism detection tools have been developed based on various detection techniques. Nowadays, the fingerprint matching technique plays an important role in these tools. However, when handling large articles, fingerprint matching has some weaknesses, particularly in its space and time consumption. In this paper, we propose a new plagiarism detection approach that integrates the fingerprint matching technique with four key features to assist the detection process. The proposed features select the main points, or key sentences, of the articles to be compared. The selected sentences then undergo the fingerprint matching process to detect similarity between them. Hence, time and space usage for the comparison process is reduced without affecting the effectiveness of plagiarism detection.
💡 Research Summary
The paper addresses the growing need for efficient plagiarism detection in the era of ubiquitous online information sharing. Traditional plagiarism detection systems largely rely on fingerprint (or “shingling”) techniques, which split an entire document into overlapping n‑grams, hash each shingle, and then compare the resulting hash sets between documents. While conceptually simple, this approach becomes computationally prohibitive for long texts because the number of n‑grams grows roughly linearly with document length, leading to high memory consumption and long comparison times. The authors propose a hybrid method that first extracts a small set of “key sentences” from each document and then applies fingerprint matching only to those sentences.
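The classic shingling step described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the hash function (truncated MD5) and word-level tokenization are arbitrary choices made here for brevity.

```python
import hashlib

def shingle_hashes(text: str, n: int = 5) -> set[int]:
    """Hash every overlapping n-gram (shingle) of word tokens.

    Sketch of classic full-document fingerprinting: the document is
    split into words, every window of n consecutive words is hashed,
    and the resulting hash set serves as the document's fingerprint.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - n + 1):
        shingle = " ".join(words[i:i + n])
        digest = hashlib.md5(shingle.encode("utf-8")).hexdigest()
        hashes.add(int(digest[:8], 16))  # truncate digest for compactness
    return hashes

# A 12-word document yields 12 - 5 + 1 = 8 five-word shingles,
# so the fingerprint grows linearly with document length:
doc = "the quick brown fox jumps over the lazy dog again and again"
fingerprint = shingle_hashes(doc)
```

Because every word starts a new shingle, fingerprinting a full document of w words produces roughly w hashes, which is exactly the cost the key-sentence approach below tries to avoid.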
Key Sentence Extraction
Four heuristic features are combined to assign a relevance score to every sentence:
- Sentence Length – very short sentences are discarded; medium‑length sentences (≈15‑30 words) are preferred because they tend to contain substantive information.
- Keyword Frequency (TF‑IDF) – sentences rich in high‑TF‑IDF terms are considered more representative of the document’s core content.
- Positional Weight – sentences appearing in structurally important sections such as the abstract, introduction, conclusion, or summary receive additional weight.
- Semantic Centrality – a cosine‑similarity matrix is built from sentence embeddings (e.g., Word2Vec or GloVe). Sentences that are central in this similarity graph (high eigenvector centrality) are deemed to capture the main ideas.
The scores are summed with empirically tuned coefficients, and the top‑k sentences (k typically 5‑10, adjustable based on document size) are retained as the “key sentence set.”
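The weighted scoring of sentences can be sketched as below. This is a simplified illustration under stated assumptions: it implements only three of the four features (length, keyword frequency, position), uses raw document-level term frequency as a stand-in for TF-IDF (IDF requires a background corpus), and the weights and thresholds are placeholders rather than the paper's empirically tuned coefficients.

```python
import re
from collections import Counter

def score_sentences(document: str, k: int = 3,
                    w_len: float = 1.0, w_kw: float = 1.0,
                    w_pos: float = 1.0) -> list[str]:
    """Rank sentences by a weighted sum of heuristic features and
    return the top-k as the key sentence set."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Document-level term frequency as a crude stand-in for TF-IDF.
    tf = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    scored = []
    for idx, sent in enumerate(sentences):
        words = re.findall(r"\w+", sent.lower())
        # Length feature: prefer medium-length sentences (~15-30 words).
        len_score = 1.0 if 15 <= len(words) <= 30 else 0.3
        # Keyword feature: mean frequency of the sentence's terms.
        kw_score = sum(tf[w] for w in words) / max(len(words), 1)
        # Positional feature: boost opening and closing sentences as a
        # proxy for structurally important sections.
        pos_score = 1.0 if idx in (0, len(sentences) - 1) else 0.5
        total = w_len * len_score + w_kw * kw_score + w_pos * pos_score
        scored.append((total, sent))
    return [s for _, s in sorted(scored, key=lambda t: t[0], reverse=True)[:k]]
```

A production version would add the fourth feature (semantic centrality over a sentence-similarity graph) and tune the weights on labeled data, as the paper describes.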
Fingerprint Matching on Key Sentences
Each selected sentence is tokenized into fixed‑size n‑grams (the authors use 5‑grams) and hashed. The similarity between two documents is then computed as the Jaccard‑like overlap of the two hash sets derived from their respective key sentence sets. Because only a fraction of the original text is processed, the total number of n‑grams—and consequently the size of the hash tables—drops dramatically. This reduction translates into lower memory footprints and faster pairwise comparisons.
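The comparison step can be sketched as below: each key sentence is reduced to a set of 5-gram hashes, the per-sentence sets are unioned into one fingerprint per document, and similarity is the Jaccard overlap of the two fingerprints. The use of Python's built-in `hash()` is an assumption for brevity; the paper does not specify a hash function.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard overlap of two fingerprint (hash) sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def document_similarity(key_sents_a: list[str],
                        key_sents_b: list[str], n: int = 5) -> float:
    """Compare two documents via the n-gram hashes of their key
    sentences only, rather than of the full texts."""
    def ngram_hashes(sent: str) -> set[int]:
        words = sent.lower().split()
        # hash() is consistent within one interpreter run, which is
        # all a pairwise comparison needs.
        return {hash(" ".join(words[i:i + n]))
                for i in range(max(len(words) - n + 1, 0))}
    fp_a = set().union(*(ngram_hashes(s) for s in key_sents_a))
    fp_b = set().union(*(ngram_hashes(s) for s in key_sents_b))
    return jaccard(fp_a, fp_b)
```

Since only k key sentences per document are hashed instead of the whole text, the fingerprint sets stay small regardless of document length, which is the source of the memory and runtime savings reported below.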
Experimental Evaluation
The method was tested on three corpora: (1) a collection of academic papers, (2) a news article dataset, and (3) a set of blog posts. For each corpus the authors compared three systems: (a) classic full‑document fingerprinting, (b) a TF‑IDF cosine‑similarity baseline, and (c) the proposed key‑sentence fingerprinting. Evaluation metrics included Precision, Recall, F1‑score, runtime, and peak memory usage.
Results show that the proposed approach achieves an average Precision of 0.92, Recall of 0.90, and F1‑score of 0.91—statistically indistinguishable from the full‑document fingerprinting baseline. At the same time, average processing time decreased by roughly 62 % and memory consumption fell by about 38 %. The method proved robust across languages (English and Korean) and document genres, indicating good generalizability.
Limitations and Future Work
The authors acknowledge two main limitations. First, the choice of k (the number of key sentences) strongly influences performance; an overly small k may miss important content, while a large k erodes the efficiency gains. Second, very short documents may not contain enough “key” sentences, leading to degraded detection quality. To address these issues, future research will explore dynamic k‑selection algorithms that adapt to document length and complexity, and will replace the handcrafted feature combination with deep‑learning‑based sentence embeddings (e.g., BERT, RoBERTa) to obtain more nuanced relevance scores. Additionally, the authors plan to integrate the technique into distributed processing pipelines and real‑time web services, and to conduct extensive multilingual testing.
In summary, the paper introduces a novel “key‑sentence‑based fingerprint matching” framework that substantially reduces the computational overhead of plagiarism detection without sacrificing accuracy. By focusing the expensive fingerprint comparison on a compact, semantically rich subset of sentences, the approach offers a practical solution for large‑scale, real‑time plagiarism monitoring in academic, journalistic, and educational contexts.