The mystery of artificial intelligence's black box has faded a little

Machine Learning


One of the more strange and disturbing things about today's major artificial intelligence (AI) systems is that no one (not even the people building the AI) knows how they work.


That's because the large language models that power ChatGPT and other popular chatbots aren't programmed line by line by human engineers the way traditional computer programs are.

Instead, these systems essentially teach themselves by ingesting large amounts of data, identifying patterns and relationships in language, and then using that knowledge to predict the next word in a sequence.
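To make the idea of next-word prediction concrete, here is a minimal, purely illustrative sketch: a toy model that counts which word tends to follow each word in a tiny sample text and predicts the most frequent one. Real large language models use neural networks trained on vast corpora rather than word counts; the corpus and function names below are made up for illustration.

```python
# Toy illustration of next-word prediction: count which word tends to
# follow each word in a small corpus, then predict the most likely one.
# The corpus and function names are illustrative examples only.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept on the sofa".split()

# Count bigram frequencies: how often each word follows another.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))   # -> 'cat' (seen twice after 'the')
print(predict_next("sat"))   # -> 'on'
```

A real model replaces these raw counts with billions of learned parameters, but the basic objective, guessing the next word from what came before, is the same.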

Building AI systems this way makes it hard to reverse-engineer them or to pinpoint a specific bug in the code and fix it. If a user types "Which American city has the best food?" and a chatbot answers "Tokyo," there is no real way to understand why the model made that mistake, or why the next person who asks might receive a different answer.

And when large language models misbehave or go off the rails, no one can explain exactly why. The opacity of large language models is a key reason some researchers worry that powerful AI systems could eventually pose a threat to humanity.

After all, if we don't understand what's going on inside these models, how do we know if they could be used to develop new biological weapons, spread political propaganda, or write malicious computer code for cyber attacks? And if powerful AI systems start to disobey or deceive us, how can we stop them if we don't understand what causes their behavior in the first place?

But this week, a team of researchers at the AI company Anthropic announced what they called a major advance, one they hope will help them better understand how AI language models actually work and prevent them from becoming harmful. The team summarized its findings in a blog post titled "Mapping the Mind of a Large Language Model."

The research team examined one of Anthropic's AI models, Claude 3 Sonnet, a version of the company's Claude 3 language model, and used a technique called "dictionary learning" to uncover patterns in how combinations of neurons, the mathematical units inside an AI model, are activated when Claude is prompted to talk about particular topics. The team identified roughly 10 million of these patterns, which they call "features." They found, for example, that one feature was active whenever Claude was asked to talk about San Francisco. Other features were active whenever topics such as immunology or specific scientific terms, like the chemical element lithium, were mentioned. And some features were tied to more abstract concepts, such as deception and gender bias.
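Anthropic's work applies dictionary learning to activations recorded inside Claude itself; the sketch below only illustrates the general technique on synthetic "activation" vectors, using scikit-learn's off-the-shelf dictionary learner. The array sizes, parameter values, and variable names are assumptions chosen for the example.

```python
# Sketch of dictionary learning: decompose dense "activation" vectors into
# sparse combinations of learned directions ("features"). The data here is
# synthetic; real work would use activations recorded from a language model.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Pretend each row is one recorded activation vector (64 dimensions here).
n_samples, n_dims, n_true_features = 2000, 64, 16
true_directions = rng.normal(size=(n_true_features, n_dims))
sparse_codes = rng.random((n_samples, n_true_features)) * (
    rng.random((n_samples, n_true_features)) < 0.1
)
activations = sparse_codes @ true_directions + 0.01 * rng.normal(size=(n_samples, n_dims))

# Learn a dictionary of candidate feature directions.
learner = MiniBatchDictionaryLearning(n_components=32, alpha=0.5, batch_size=64, random_state=0)
codes = learner.fit_transform(activations)   # sparse coefficients per sample
features = learner.components_               # learned directions, one per row

print(features.shape)                                              # (32, 64)
print("avg. active features per sample:", (np.abs(codes) > 1e-6).sum(axis=1).mean())
```

The point of the decomposition is interpretability: each learned direction can then be inspected to see which prompts make it light up, the way the San Francisco feature lights up in Claude.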

The researchers also found that manually turning certain features on or off can change how the AI system behaves. For example, they found that by amplifying the features associated with the concept of flattery, they could get Claude to respond with flowery, exaggerated praise, even when the flattery was inappropriate.
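A stylized way to picture "turning a feature on or off" is to nudge an activation vector along a learned feature direction before the model continues its computation. The toy vectors, the "flattery" direction, and the steering scale below are all made up for illustration; the actual intervention happens inside Claude's own layers.

```python
# Stylized sketch of feature "steering": push an activation vector along a
# learned feature direction. Everything here is made-up toy data; the real
# intervention happens inside the model's layers.
import numpy as np

rng = np.random.default_rng(1)

activation = rng.normal(size=64)            # one internal activation vector
flattery_direction = rng.normal(size=64)    # hypothetical "flattery" feature
flattery_direction /= np.linalg.norm(flattery_direction)

def steer(vec, direction, scale):
    """Return vec pushed `scale` units along a unit-length feature direction."""
    return vec + scale * direction

boosted = steer(activation, flattery_direction, scale=10.0)      # feature amplified
suppressed = steer(activation, flattery_direction, scale=-10.0)  # feature suppressed

# The projection onto the feature direction shifts accordingly.
print(activation @ flattery_direction, boosted @ flattery_direction, suppressed @ flattery_direction)
```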

Chris Olah, who leads Anthropic's interpretability research team, said these findings could give AI companies more effective control over their models.

"We are discovering features that may shed light on concerns about bias, safety risks and autonomy," he said. "I'm really excited that we might be able to take these controversial issues that people argue about and turn them into things we can actually have more productive discussions about."

Other researchers have found similar phenomena in language models, but Anthropic's team is among the first to apply these techniques to a large model. Jacob Andreas, an associate professor of computer science at MIT who reviewed the study, characterized it as a hopeful sign that large-scale interpretability may be possible.

Cracking the code

> The "black box" problem can be defined as the inability to fully understand an AI's decision-making process.

> Large language models like ChatGPT are not programmed line by line by human engineers.

> So when they misbehave or go off the rails, no one can explain exactly why.

> Researchers looked inside Claude 3 Sonnet, one of Anthropic's AI models.

> They used a technique known as "dictionary learning" to uncover patterns of how combinations of neurons became activated when Claude was prompted to talk about specific topics.

> For example, one feature was activated every time Claude was asked to talk about San Francisco.

> Researchers also found that manually turning certain features on or off can change the way the AI system behaves.

> Researchers believe these findings could give AI companies more effective control over their models.



