
The full English-to-OVP translation process. Credit: arXiv (2024). DOI: 10.48550/arxiv.2405.08997
Recent computer science PhD graduate Jared Coleman and his supervisor Bhaskar Krishnamachari bonded over a shared love of both human and computer languages.
Krishnamachari grew up in India and spoke Tamil, Hindi and English, and began studying French and Chinese in college, while Coleman, who grew up in an English-speaking country, fell in love with Spanish in high school and picked up Portuguese in college from his now-wife and friends.
During the pandemic, Coleman began taking online classes in Owens Valley Paiute, a little-known language. Coleman is a member of the Owens Valley Big Pine Paiute Tribe; his father, David, grew up on the tribe's reservation in Big Pine, California, and Paiute is the language of his ancestors.
ChatGPT and other large language models (LLMs) deliver human-level performance on many natural language tasks in English, a language spoken by one-fifth of the world's population, and in other widely used languages. However, Paiute is considered a “no-resource language,” meaning there are no publicly available Paiute sentences translated into English to train machine learning models.
In their new paper, “LLM-assisted rule-based machine translation for low/no-resource languages,” published on the preprint server arXiv, Coleman and Krishnamachari propose an approach to machine translation called LLM-RBMT (LLM-assisted rule-based machine translation) to aid in the learning of low-resource languages. Co-authors of the paper include Khalil Iskarous, associate professor of linguistics at USC Dornsife, and independent researcher Ruben Rosales.
Their approach pairs a more “old-fashioned” rule-based translation tool with a more advanced LLM. In the researchers' method, the LLM does not translate into or from Owens Valley Paiute. Instead, the LLM helps guide the rule-based translator, which relies on grammar and lexical rules to translate between languages.
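The division of labor described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `simplify_with_llm` is a hypothetical stand-in for the LLM step (hard-coded here rather than calling a model), the `SimpleSentence` structure and the subject-object-verb ordering rule are assumptions made for illustration, and the target-language tokens are invented, not Owens Valley Paiute vocabulary.

```python
from dataclasses import dataclass


@dataclass
class SimpleSentence:
    """A simplified sentence structure (an assumed format, for illustration)."""
    subject: str
    verb: str
    obj: str


def simplify_with_llm(text: str) -> SimpleSentence:
    """Hypothetical stand-in for the LLM step.

    A real system would ask a language model to reduce the input to a
    simple structure. Here the answer is hard-coded so the sketch runs
    without any model or network access.
    """
    canned = {
        "The big dog quickly ate all of the food.":
            SimpleSentence(subject="dog", verb="eat", obj="food"),
    }
    return canned[text]


# Invented target tokens; NOT real Owens Valley Paiute words.
TOY_LEXICON = {"dog": "tok_dog", "eat": "tok_eat", "food": "tok_food"}


def translate_with_rules(sentence: SimpleSentence) -> str:
    """Rule-based step: apply a word-order rule, then look up each word.

    The subject-object-verb order is an assumed rule for this sketch.
    """
    ordered = [sentence.subject, sentence.obj, sentence.verb]
    return " ".join(TOY_LEXICON[word] for word in ordered)


def translate(text: str) -> str:
    """The LLM only simplifies; the rules produce the translation."""
    return translate_with_rules(simplify_with_llm(text))


print(translate("The big dog quickly ate all of the food."))
# tok_dog tok_food tok_eat
```

The point of the split is that the LLM never has to know the target language: it only restructures English, and every target-language word comes from explicit rules and a lexicon.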
“In essence, LLMs act as sophisticated intermediaries, leveraging advanced language understanding to ensure that rules-based systems produce accurate translations,” Coleman says.
The translation tool simplifies complex sentences and uses placeholders (in this case, English words) for unknown words, a process that loses some meaning but still produces an understandable and grammatically correct translation.
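The placeholder idea can be illustrated with a short sketch. It assumes a toy lexicon of invented tokens (not real Paiute words) and an arbitrary convention of wrapping untranslated English words in square brackets; neither is taken from the paper.

```python
# Invented target tokens; NOT real Owens Valley Paiute words.
TOY_LEXICON = {"dog": "tok_dog", "see": "tok_see"}


def translate_word(word: str) -> str:
    """Return the lexicon entry, or the English word itself as a placeholder.

    The square brackets marking a placeholder are an assumed convention.
    """
    return TOY_LEXICON.get(word, f"[{word}]")


def translate_words(words: list[str]) -> str:
    """Translate word by word, keeping unknown words as English placeholders."""
    return " ".join(translate_word(word) for word in words)


print(translate_words(["dog", "see", "airplane"]))
# tok_dog tok_see [airplane]
```

The output keeps the sentence structure intact even though "airplane" has no lexicon entry, so a learner can still read it.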
Coleman says the method mirrors the way language learners naturally speak, mixing known and unknown words, making it a practical tool for the real world.
“The tool is smart enough that, if you give it a few hints, it can do most of the translation on its own,” Krishnamachari added.
Personal satisfaction
Coleman has also built and maintains a suite of digital tools made possible by this research, including online dictionaries, writing tools and translation systems, as part of a language revitalization effort he calls “Kubishi,” which means “brain” in Paiute.
Overall, the paper, to be presented at NAACL’s AmericasNLP workshop, finds that the superior general-purpose language skills of LLMs make them a promising tool to help revive endangered languages.
Coleman, meanwhile, credits tribal members, past and present, with paving the way. “Many people from my tribe have been working for many years on various language revitalization efforts: classes, dictionaries, recordings,” Coleman says. “So while I'm excited about this research, I see it as just one piece of a much larger puzzle.”
In fact, the paper suggests many directions for future research, such as adding more complex sentence structures to test the limits of the methodology outlined in the paper. But it's also a personal and academic achievement for Coleman, who will be joining the ranks of assistant professors of computer science at Loyola Marymount University this fall.
“My father didn't grow up speaking the language. Like in a lot of families, he was forbidden to speak the language in boarding school, so the language was forbidden and forced out of use,” Coleman said.
“Because my great-grandparents worked with linguists to document and record their language, I'm able to hear my great-grandfather's voice and words. And now it's very personally satisfying to listen to him and know what he was saying.”
For more information:
Jared Coleman et al., “LLM-assisted rule-based machine translation for low/no-resource languages,” arXiv (2024). DOI: 10.48550/arxiv.2405.08997
Provided by University of Southern California
Citation: Students develop AI tool to revitalize endangered indigenous languages (June 21, 2024), retrieved June 21, 2024 from https://techxplore.com/news/2024-06-student-ai-tool-revitalize-endangered.html
