Abstract
Plagiarism detection has become increasingly challenging due to the widespread availability of paraphrasing tools and generative artificial intelligence systems. Traditional plagiarism detection techniques based on lexical similarity, such as TF-IDF and n-gram matching, often fail to identify semantically similar but lexically modified text. This paper presents a hybrid plagiarism detection framework that combines lexical similarity measures with semantic similarity derived from sentence transformer models. The proposed approach integrates TF-IDF–based cosine similarity with lightweight sentence embeddings generated using MiniLM and SBERT models. To enhance semantic detection performance, a MiniLM-based sentence transformer is fine-tuned on the PAN 2011 plagiarism detection corpus. Experimental evaluation demonstrates that the hybrid similarity approach significantly improves detection accuracy compared to purely lexical methods, particularly for paraphrased plagiarism cases. The framework is further validated using threshold-based analysis and real-world web content retrieved through automated scraping. The proposed system provides an efficient and scalable solution for plagiarism detection, balancing computational efficiency with semantic understanding, and is suitable for academic and real-world forensic applications.
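The hybrid scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weight `alpha`, the decision threshold of 0.6, and the hard-coded `semantic` value are all assumptions made for illustration. The lexical part is a small pure-Python TF-IDF cosine similarity. The semantic part stands in for the cosine similarity of two sentence embeddings, which in practice would come from a sentence transformer such as MiniLM.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (token.strip(PUNCTUATION) for token in text.lower().split())
    return [token for token in stripped if token]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, with smoothed IDF."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_score(lexical, semantic, alpha=0.5):
    """Weighted mix of the two similarities; alpha is an assumed weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * lexical + (1.0 - alpha) * semantic


def is_flagged(score, threshold=0.6):
    """Threshold-based decision; 0.6 is a placeholder, not a value from the paper."""
    return score >= threshold


source = "The cat sat on the mat."
suspect = "A feline rested on the rug."
vec_source, vec_suspect = tfidf_vectors([source, suspect])
lexical = cosine(vec_source, vec_suspect)

# Stand-in for the cosine similarity of two sentence embeddings (e.g. from a
# MiniLM model); hard-coded so the sketch runs without downloading a model.
semantic = 0.8

score = hybrid_score(lexical, semantic, alpha=0.3)
print(f"lexical={lexical:.3f} hybrid={score:.3f} flagged={is_flagged(score)}")
```

In this toy pair the two sentences share almost no vocabulary, so the lexical score alone stays below the assumed threshold, while the weighted combination crosses it. This is the behaviour the abstract attributes to the hybrid approach for paraphrased text. Suitable values for `alpha` and the threshold would have to be tuned on labelled data.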
I (we), the author(s), hereby declare under full moral, financial and criminal liability that the manuscript submitted for publication to the Journal of Computer and Forensic Sciences:
a) is the result of my (our) own original research and that I (we) hold the right to publish it;
b) does not infringe any copyright or other third-party proprietary rights;
c) complies with the Journal’s research and publishing ethics standards;
d) has not been published elsewhere, under this or any other title;
e) is not under consideration by another publication, under this or any other title.
I (we) also declare under full moral, financial and criminal liability:
f) that all conflicts of interest that may directly or potentially influence or impart bias on the work have been disclosed in the manuscript;
g) that if the article has been accepted for publishing I (we) will transfer all copyright ownership of the manuscript to the University of Criminal Investigation and Police Studies in Belgrade.
Signed by the Corresponding Author on behalf of all other authors.