Building Foundational Language Models for Marathi: Challenges and Opportunities

Gajanan Dadarao Bansod

Abstract

This study examines the technical, linguistic, sociocultural, and ethical challenges involved in building foundational (large-scale, pre-trained) language models for Marathi and outlines practical opportunities for researchers, developers, and policymakers. We (1) map existing Marathi resources and open corpora, (2) identify major algorithmic and data-collection bottlenecks, (3) propose a reproducible methodology for creating and evaluating Marathi foundational models, and (4) present expected outcomes, evaluation strategies, risk-mitigation measures, and recommendations for sustainable ecosystem development. The study combines a literature survey of recent Marathi corpora and Indic-language initiatives with an applied research design focused on dataset curation, model pretraining, fine-tuning, and release practices. Key contributions include a prioritized dataset construction roadmap, multilingual transfer strategies, intrinsic and extrinsic evaluation benchmarks, and governance recommendations for inclusive, safe, and usable Marathi LLMs.
