Enhancing Urdu Sentiment Analysis: A Morphological Rules-Based Approach for Compound Word Tokenization

Saqib Khushhal, Abdul Majid

Abstract

Sentiment analysis (SA) is an active area of research concerned with identifying people's opinions, attitudes, and emotions toward subjects such as products, issues, or individuals. Urdu sentiment analysis is becoming increasingly important because people prefer to express their thoughts and feelings in their native language. However, sentiment analyzers that perform well for widely studied languages such as English are often ineffective for Urdu because of differences in script, morphology, and grammar. A significant challenge in analyzing Urdu text is word segmentation: unlike languages in which spaces separate words, Urdu has no explicit word boundaries. In Urdu, a compound word can consist of a string of characters that together represent a single word or meaning. Traditionally, bigram or trigram techniques are used to identify these compound words during tokenization. This study proposes a morphological rules-based approach to identifying compound terms in Urdu text during tokenization. We use these compound terms alongside conventional methods for sentiment analysis of Urdu text documents. We also take into account negation and intensifiers that occur with compound words when classifying statements as positive, negative, or neutral. We conduct a comprehensive evaluation on a suitably sized dataset to compare the proposed method with traditional techniques. The results indicate that the proposed method classifies Urdu text as positive, negative, or neutral with improved accuracy.
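
The general idea of the pipeline described above can be illustrated with a minimal sketch. It is not the authors' rule set: the prefix and head lists, the lexicon, the weights, and the function names `tokenize` and `classify` are all hypothetical, and the words are romanized Urdu for readability. The sketch assumes a single toy rule (a known prefix morpheme followed by a known head forms one compound token) and a simple lexicon score adjusted by a preceding intensifier and a following negation.

```python
# Toy sketch only: every rule, word list, and weight below is a hypothetical
# assumption for illustration, not the rule set used in the study.

COMPOUND_PREFIXES = {"khush", "bad"}   # assumed prefix morphemes
COMPOUND_HEADS = {"qismat"}            # assumed heads that combine with them
LEXICON = {"khush_qismat": 1.0, "bad_qismat": -1.0, "khush": 1.0}
NEGATIONS = {"nahin"}                  # assumed to follow the sentiment word
INTENSIFIERS = {"bohat": 2.0}          # assumed to precede the sentiment word


def tokenize(text):
    """Split on whitespace, then merge prefix + head pairs into one token."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        is_compound = (
            words[i] in COMPOUND_PREFIXES
            and i + 1 < len(words)
            and words[i + 1] in COMPOUND_HEADS
        )
        if is_compound:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def classify(text):
    """Label text as positive, negative, or neutral from a lexicon score."""
    tokens = tokenize(text)
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        # A preceding intensifier scales the sentiment value.
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            value *= INTENSIFIERS[tokens[i - 1]]
        # A following negation flips the polarity.
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATIONS:
            value = -value
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(tokenize("woh khush qismat hai"))
print(classify("woh bohat khush qismat hai"))
print(classify("woh khush qismat nahin hai"))
```

Treating the compound as a single token matters here because the compound "bad qismat" carries a polarity that its parts do not carry in this toy lexicon when scored separately.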