Shodhmapak: A Comprehensive Plagiarism Detection Tool for Hindi and Punjabi Texts

Jitesh Pubreja, Vishal Goyal, Rajeev Puri

Abstract

Plagiarism detection in Indian regional languages such as Hindi and Punjabi is difficult because of script variation, morphological complexity, and the lack of standardized tools. Existing plagiarism detection tools such as Urkund and Turnitin are oriented toward English and are inadequate for detecting web-based content, paraphrased content, and content in differing formats and encodings in Indian languages. To fill this gap, this work introduces Shodhmapak, a plagiarism detection tool built specifically for Hindi and Punjabi. The tool applies Natural Language Processing (NLP) techniques, including stemming, lemmatization, synonym substitution, and semantic similarity analysis, to identify both exact and paraphrased plagiarism. A Unicode-based content-handling unit processes documents in varied formats smoothly, and optimized document and indexing mechanisms improve system performance. In addition, real-time web crawling and an interface to the Google Search API support the identification of web-based and paraphrased content. Extensive comparative case studies of Urkund and Shodhmapak show that Shodhmapak performs better at identifying web-based and paraphrased content. The system handles non-Unicode files, detects synonym-based alterations, and processes each document in 15 seconds, making it well suited to plagiarism detection in Hindi and Punjabi content.
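To make the idea of synonym substitution followed by similarity scoring concrete, the following is a minimal sketch in Python. It is not Shodhmapak's implementation: the synonym table, the tokenizer, the function names, and the choice of Jaccard similarity over word bigrams are all assumptions made for illustration, and stemming, lemmatization, and semantic similarity are omitted entirely.

```python
import re
import unicodedata


def nfc(s: str) -> str:
    """Normalize to Unicode NFC so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", s)


# Hypothetical toy synonym table (an assumption, not the tool's lexicon):
# each variant maps to one canonical form, here words meaning "school".
SYNONYMS = {nfc(k): nfc(v) for k, v in {
    "विद्यालय": "स्कूल",
    "पाठशाला": "स्कूल",
}.items()}

# Tokens are runs of characters that are not whitespace, common punctuation,
# or the danda / double danda (U+0964, U+0965) used in Hindi and Punjabi.
TOKEN_RE = re.compile(r"[^\s\u0964\u0965.,;:!?()\"'-]+")


def tokenize(text: str, use_synonyms: bool = True) -> list[str]:
    """Split text into words, optionally replacing synonyms by a canonical form."""
    tokens = TOKEN_RE.findall(nfc(text))
    if use_synonyms:
        tokens = [SYNONYMS.get(t, t) for t in tokens]
    return tokens


def shingles(tokens: list[str], n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def similarity(a: str, b: str, n: int = 2, use_synonyms: bool = True) -> float:
    """Jaccard similarity of the word n-gram sets of two texts."""
    sa = shingles(tokenize(a, use_synonyms), n)
    sb = shingles(tokenize(b, use_synonyms), n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


source = "बच्चे विद्यालय जा रहे हैं।"
suspect = "बच्चे पाठशाला जा रहे हैं।"  # same sentence with one synonym swapped
print(similarity(source, suspect))
print(similarity(source, suspect, use_synonyms=False))
```

On this pair, the score is 1.0 with synonym normalization and about 0.33 without it, which illustrates why a synonym-aware step helps expose paraphrase by word substitution. A real system would need a far larger lexicon and morphological handling for both Devanagari and Gurmukhi text.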
