Tuesday, November 26, 2024

Singapore builds a ChatGPT-like model to better represent Southeast Asian languages and cultures



“Do we want to force everyone in Southeast Asia to adapt to the machine, or do we want to make the technology more accessible so that people in the region can get the most out of it even if they don’t speak English?” he said.

“We are not trying to compete with the big LLMs. We are trying to complement them so that they can better represent us,” said Leslie Teo, senior director of AI products at AI Singapore.


There are more than 7,000 languages spoken around the world. However, LLMs such as OpenAI’s GPT-4 and Meta’s Llama 2, which are used to build AI systems such as chatbots and other tools, have been developed for and trained primarily on English.

Governments and technology companies are trying to bridge this gap: India is creating datasets in local languages, the United Arab Emirates has built an LLM that powers generative AI tools in Arabic, and China, Japan and Vietnam are developing AI models in their own languages.

Nourianti Jali, an assistant professor at Oklahoma State University’s School of Communication, said these models can help local residents participate more equitably in a global AI economy largely dominated by big tech companies.

“We also need regional LLMs because they support technology independence,” she said. “Less reliance on Western LLMs could provide better privacy for local residents and better align with national and regional interests.”

“Verification and filtering required”

Researchers say multilingual language models, which are trained on text from several languages at once, can infer semantic and grammatical connections between high-resource languages, which have more data, and low-resource languages.

These models can be used in a wide variety of applications, from translation to customer-service chatbots to content moderation on social media platforms, which have struggled to identify hate speech in low-resource languages such as Burmese and Amharic.

About 13% of SEA-LION’s training data comes from Southeast Asian languages, more than any other major LLM, Teo said, while more than 9% comes from Chinese text and about 63% from English.

Because multilingual language models are often trained on translated text and other low-quality data that may contain errors, AI Singapore is paying close attention to the data used to train SEA-LION, Teo said from his office at the National University of Singapore.

Gone are the days of pristine data – now much of what is on the internet is material produced by LLMs

Leslie Teo, AI Singapore

“The days of pristine data are over. Much of what is on the internet now is generated by LLMs, so it needs to be verified and filtered,” he said.

“We can’t be perfect, but we can’t eliminate everything that seems bad either,” he added.

As more governments provide data and companies test SEA-LION, Teo said, its smaller size means it can be deployed more quickly and is cheaper to fine-tune and adopt.

Because the majority of Indonesian e-commerce company Tokopedia’s customer interactions take place in Indonesian, “a model with that local fluency strengthens our ability to connect with customers and improve their experience,” said Paul Kondylis, associate vice president of data science at Tokopedia.

Data bias

As more countries and regions create their own LLMs, digital and human rights experts worry that the models will only reproduce the dominant views expressed online, which can be particularly problematic in countries with authoritarian or conservative governments, strict media censorship, or the lack of a strong civil society.

For example, social media platforms in China are censoring references to the Tiananmen Square massacre and criticism of the government, while several countries in Southeast Asia have enacted laws to restrict content that authorities deem misleading.

“Training a model based on such data risks perpetuating a biased, incomplete, or even misleading narrative,” Jali said.

“These models may fail to surface important sociopolitical issues such as human rights violations, corruption, and legitimate criticism of political power,” she said.

Former Indonesian President Suharto, photographed in 2004. Compared with Western language models, SEA-LION placed more emphasis on his achievements than on his human rights record. Photo: Associated Press

For example, in response to a question about former Indonesian President Suharto, Llama 2 and GPT-4 mentioned his troubled human rights record, while SEA-LION’s response focused primarily on his accomplishments.

If a model is trained only on articles favorable to the government, it is “likely to adopt a worldview in which the government is entirely positive and leaves opposing views behind,” said Aliya Bhatia, a policy analyst at the Center for Democracy and Technology, a US nonprofit organization.

“Regional LLMs may better reflect the linguistic and cultural nuances of local language speakers, but they may also be less informed about the world in general,” she added.

“There is a real risk that government-sponsored models will instill a historically revisionist perspective and undermine democratic values.”


However, according to AI Singapore, the alternative of relying entirely on Western LLMs, with their “disproportionately large influence” from wealthy, liberal Western democracies, means perpetuating a range of biases related to cultural values, political beliefs and social norms.

“These LLMs have a very specific West Coast American bias, and they’re very woke. They don’t represent us,” Teo said.

“We’re not saying our perspective is the only perspective. We’re just trying to recalibrate it.”


