What does it mean to build local AI?
After OpenAI released ChatGPT in November 2022, the foundations of artificial intelligence's large language models appeared to be "WEIRD": Western, educated, industrialized, rich, and democratic. It seemed a given that if a large language model spoke a particular language and reflected a particular worldview, both would be Western. OpenAI itself admitted that ChatGPT was skewed toward Western perspectives and the English language.
But even before OpenAI's US competitors, Google and Anthropic, released their own large language models the following year, developers in Southeast Asia recognized the need for AI tools that could speak to their region in its many languages.
Moreover, in a region where memories of distant civilizations often collide with modern postcolonial histories, language is deeply political. Even seemingly monolingual countries feature striking diversity. Cambodians speak almost 30 languages; Thais, about 70; and the Vietnamese, more than 100. This is a region where communities code-switch seamlessly between languages, and where oral traditions are sometimes more common than text as a means of capturing the deep cultural and historical nuances encoded in language.
Unsurprisingly, those looking to build truly local AI models in a region with so many underrepresented languages face many obstacles, from a lack of high-quality, well-annotated data to limited access to the computing power needed to build and train models from scratch. In some cases, the challenge is even more fundamental, reflecting a dearth of native speakers or the absence of a standardized orthography.
Given these constraints, many of the region's AI developers have settled for fine-tuning established models built by foreign incumbents. This involves taking a pretrained model that has been fed vast amounts of data and training it further on a smaller dataset for a particular skill or task. Between 2020 and 2023, Southeast Asian models such as PhoBERT (Vietnamese), IndoBERT (Indonesian), and Typhoon (Thai) were derived from much larger models such as Google's BERT, Meta's RoBERTa and Llama, and France's Mistral. Even the early versions of SeaLLM, a set of models optimized for regional languages and released by Alibaba's DAMO Academy, were built on architectures from Meta, Mistral, and Google.
In 2024, however, Alibaba Cloud's Qwen disrupted this Western dominance, offering Southeast Asia more options. A Carnegie Endowment for International Peace survey found that five of the 21 regional models launched that year were built on Qwen.
Still, just as Southeast Asian developers once had to account for potential Western biases in the foundation models available to them, they must now bear in mind the ideologically filtered perspectives embedded in pretrained Chinese models. Ironically, efforts to localize AI and secure greater agency for Southeast Asian communities could deepen developers' reliance on much larger players, at least in the early stages.
Nevertheless, Southeast Asian developers are beginning to tackle this issue, too. Several models, including SEA-LION (covering the region's 11 official languages), PhoGPT (Vietnamese), and MaLLaM (Malay), have been pretrained from scratch on large, general datasets in their respective languages. This critical step in the machine-learning process allows these models to be further fine-tuned for specific tasks.
Although SEA-LION still relies on Google's architecture for pretraining, its use of regional-language datasets has facilitated the development of homegrown models such as Sahabat-AI, which communicates in Indonesian, Sundanese, Javanese, Balinese, and Bataknese. Sahabat-AI proudly describes itself as "a testament to Indonesia's commitment to AI sovereignty."
Representing a native perspective, however, also requires a strong foundation of local knowledge. Without an understanding of the politics of language, traditional sensemaking, and historical dynamics, no model can faithfully present Southeast Asian perspectives and values.
For example, time and space, widely understood in modern contexts as linear, divisible, and measurable for the purpose of maximizing productivity, are perceived differently in many Indigenous communities. Balinese historical works that defy conventional chronological patterns may be dismissed in the West as myth and legend, but they continue to shape how these communities understand the world.
Historians of the region note that applying Western lenses to local texts heightens the risk of misinterpreting Indigenous perspectives. In the eighteenth and nineteenth centuries, colonial administrators in Indonesia frequently read their own understandings into translated copies of the Javanese chronicles. As a result, many biased British and European observations of Southeast Asians came to be treated as valid historical accounts, and the ethnic classifications and stereotypes recorded in official documents were internalized. If AI is trained on this data, such biases could become even more entrenched.
Data is not knowledge. Because language is inherently social and political, reflecting the relational experiences of those who use it, agency in the age of AI must go beyond the technical sufficiency of a model that communicates in local languages. Developers must consciously filter out legacy biases, question assumptions about identity, and rediscover the Indigenous knowledge repositories embedded in language. Without such understanding in the first place, technology cannot faithfully project culture.
-
Elina Noor is a senior fellow in the Asia Program at the Carnegie Endowment for International Peace. ©Project Syndicate
Disclaimer: The views expressed by writers in this section are their own and do not necessarily reflect Arab News' point of view.