The Lingo Research Group at the Indian Institute of Technology Gandhinagar (IITGN) has developed a Hindi artificial intelligence (AI) language model called “Ganga-1B,” described as a “breakthrough in language modeling.” Named after the longest river that flows through India, Ganga-1B is the first pre-trained Hindi model developed by an academic research institute.
“This effort aims to improve performance in understanding and generating text in Indian languages, with the first milestone being the release of the Ganga-1B model trained on an extensive monolingual Hindi dataset,” said Prof. Mayank Singh, Head of the Lingo Research Group, IITGN and Assistant Professor of Computer Science and Engineering.
The Ganga-1B model was trained on publicly available Hindi-language datasets, including news articles, web documents, books, government publications, educational materials, and quality-filtered social media conversations.
“Project Unity aims to develop pocket-sized, open source Large Language Models (LLMs) for Indian languages, built and trained from scratch on Indian data. This effort will empower the Indian open source community to build LLMs and chatbots that can be trained and deployed in resource-limited scenarios,” Professor Mayank Singh told The Indian Express.
Ganga-1B, which was downloaded by more than 600 people within 48 hours of its announcement, took about a year and a half to build using open source data from various websites.

The research team is also working on models for other languages, including Gujarati, Urdu, Tamil, Telugu and Marathi, and is researching the use of AI for e-governance in regional languages, as well as an LLM for education to support students and teachers in schools.
The training dataset is further curated by native Indian speakers to ensure high quality.
© Indian Express Ltd.
First uploaded: 07 Sep 2024 05:29 IST
