ChatGPT, Claude, and other large language models have by now absorbed so much human knowledge that they are far more than simple answer generators. They can also express abstract concepts such as a particular tone, personality, bias, or mood. However, it is not obvious exactly how these models come to represent abstract concepts from the knowledge they contain.
Now, a team from MIT and the University of California, San Diego has developed a way to test whether a large language model (LLM) contains hidden biases, personalities, moods, or other abstract concepts. Their method can home in on the connections within a model that encode a concept of interest. The method can also manipulate, or “steer,” these connections to strengthen or weaken that concept in any answer the model is prompted to give.
The team showed that their method can quickly root out and steer more than 500 general concepts in some of the largest LLMs in use today. For example, researchers might home in on a model's representation of a persona, such as “social influencer” or “conspiracy theorist,” or of a stance, such as “fear of marriage” or “fan of Boston.” These representations can then be adjusted to enhance or minimize the concept in any answer the model generates.
In the case of the “conspiracy theorist” concept, the team identified a representation of the concept within one of the largest vision-language models currently available. When they enhanced that representation and then prompted the model to explain the origins of the famous “Blue Marble” image of Earth taken from Apollo 17, the model produced an answer with the tone and perspective of a conspiracy theorist.
The team acknowledges that extracting certain concepts carries risks, which they also illustrate (and caution against). Overall, however, they see the new approach as a way to uncover hidden concepts and potential vulnerabilities in LLMs, which could then be turned up or down to make a model safer or to improve its performance.
“What you can really say about LLMs is that while they contain these concepts, not all of them are actively exposed,” says Adityanarayanan “Adit” Radhakrishnan, assistant professor of mathematics at MIT. “With our method, there is a way to extract these different concepts and activate them in ways that prompting alone cannot.”
The team published their findings today in a study appearing in the journal Science. The study's co-authors are Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of the University of California, San Diego, and Enric Boix-Adserà of the University of Pennsylvania.
A fish in a black box
As use of OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and other artificial intelligence assistants has exploded, scientists are racing to understand how these models represent certain abstract concepts such as “hallucination” and “deception.” In the context of an LLM, a hallucination is a response that is false or contains misleading information, which the model has “hallucinated,” or erroneously constructed as fact.
To find out whether a concept such as “hallucination” is encoded in an LLM, scientists have often taken an “unsupervised learning” approach. This is a type of machine learning in which algorithms trawl broadly through unlabeled representations to find patterns that might relate to the concept. To Radhakrishnan, however, such an approach can be too broad and computationally expensive.
“It’s like going fishing with a big net trying to catch one type of fish. You catch a lot of fish and you have to look through them to find the right one,” he says. “Instead, we will tailor the bait to the right type of fish.”
He and his colleagues had previously laid the groundwork for a more targeted approach with a type of predictive modeling algorithm known as a recursive feature machine (RFM). An RFM is designed to directly identify features or patterns in data by leveraging a mathematical mechanism that neural networks (a broad category of AI models that includes LLMs) implicitly use to learn features.
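To give a rough sense of what "learning features" can mean here, the following is a minimal sketch of an RFM-style loop, not the team's implementation. It assumes the general recipe of alternating between fitting a kernel predictor and re-estimating a feature matrix `M` from the average outer product of the predictor's gradients. The Gaussian kernel, the bandwidth `s`, the ridge term `reg`, the trace rescaling, and the toy data are all illustrative assumptions; the published method may differ in its kernel and other details.

```python
import numpy as np

def gauss_kernel(X, Z, M, s):
    """Gaussian kernel on the M-weighted distance (x - z)^T M (x - z)."""
    XM = X @ M
    d2 = (XM * X).sum(1)[:, None] - 2 * XM @ Z.T + ((Z @ M) * Z).sum(1)[None, :]
    return np.exp(-np.maximum(d2, 0.0) / (2 * s**2))

def rfm_sketch(X, y, iters=3, s=3.0, reg=1e-3):
    """Alternate kernel ridge regression with a gradient-based update of M."""
    n, d = X.shape
    M = np.eye(d)                        # start with every input direction weighted equally
    for _ in range(iters):
        K = gauss_kernel(X, X, M, s)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # fit f(x) = sum_i alpha_i K(x, x_i)
        W = K * alpha[None, :]           # W[j, i] = alpha_i * K(x_j, x_i)
        diffs = X * W.sum(1)[:, None] - W @ X             # sum_i W[j, i] * (x_j - x_i)
        G = -(diffs @ M) / s**2          # row j is the gradient of f at x_j
        M = G.T @ G / n                  # average gradient outer product
        M *= d / np.trace(M)             # rescale (assumption) so the bandwidth stays comparable
    return M

# Toy data (hypothetical): the label depends only on the first input coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0]
M = rfm_sketch(X, y)
```

In this toy setting, the diagonal of `M` indicates which input directions the fitted predictor is most sensitive to, which is the sense in which the matrix captures "features" of the task.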
Because the algorithm had proved to be an effective and efficient approach for capturing features in general, the team wondered whether it could be used to root out representations of concepts in LLMs, which are by far the most widely used type of neural network and perhaps the least well understood.
“We wanted to apply feature learning algorithms to LLMs to discover the representation of concepts in these large, complex models in a targeted way,” says Radhakrishnan.
Converging on a concept
The team’s new approach identifies a concept of interest within an LLM and “steers,” or guides, the model’s response based on that concept. The researchers looked for 512 concepts grouped into five classes: fears (such as a fear of marriage); experts (social influencer, medievalist); moods (proud, mildly amused); location preferences (Boston, Kuala Lumpur); and personas (Ada Lovelace, Neil deGrasse Tyson).
The researchers then searched for representations of each concept in several of today’s large language and vision models. They did so by training RFMs to recognize numerical patterns within an LLM that could represent a particular concept of interest.
A standard large language model is, broadly speaking, a neural network that takes a natural language prompt, such as “Why is the sky blue?”, and divides it into individual words, each of which is encoded mathematically as a list, or vector, of numbers. The model passes these vectors through a series of computational layers. At each layer, it produces matrices of many numbers that are used to identify the words most likely to appear in a response to the original prompt. Eventually, the layers converge on a set of numbers that is decoded back into text, in the form of a natural language response.
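The pipeline described above can be sketched in a few lines. This is a deliberately tiny toy, not a real LLM: the six-word vocabulary, the random weights, and the simple residual layers are all assumptions made for illustration, and real models use subword tokens and attention layers rather than whole words and plain matrix multiplications. What it does show is the shape of the computation: words become vectors, vectors pass through layers, and the final numbers are turned into probabilities over possible next words.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["why", "is", "the", "sky", "blue", "?"]    # toy vocabulary (assumption)
d_model, n_layers = 8, 3

embed = rng.normal(size=(len(vocab), d_model))      # one vector per word
layers = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
          for _ in range(n_layers)]
unembed = rng.normal(size=(d_model, len(vocab)))    # maps vectors back to word scores

def forward(prompt):
    """Return the hidden states at every layer and next-word probabilities."""
    words = prompt.lower().replace("?", " ?").split()
    ids = [vocab.index(w) for w in words]           # raises ValueError on unknown words
    h = embed[ids]                                  # shape: (n_words, d_model)
    hidden_states = [h]
    for W in layers:
        h = h + np.tanh(h @ W)                      # each layer updates every vector
        hidden_states.append(h)
    logits = h[-1] @ unembed                        # scores from the last position
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax over the vocabulary
    return hidden_states, probs

states, probs = forward("Why is the sky blue?")
```

The list `states` holds the per-layer matrices of numbers; these intermediate representations are the kind of internal values that a concept-finding method would inspect.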
The team’s approach trains RFMs to recognize numerical patterns within an LLM that could be associated with a specific concept. For example, to see whether an LLM contains any representation of a “conspiracy theorist,” the researchers first train the algorithm to identify patterns among the LLM’s representations of 100 prompts that are clearly related to conspiracies and 100 other prompts that are not. In this way, the algorithm learns the patterns associated with the conspiracy theorist concept. The researchers can then mathematically turn the activity of that concept up or down by perturbing the LLM’s representations with the identified patterns.
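The two steps in this paragraph, finding a pattern that separates concept prompts from other prompts and then perturbing representations along that pattern, can be illustrated with a simplified stand-in. The sketch below does not use an RFM: it swaps in a plain difference of means between the two groups as the "pattern," and it uses synthetic vectors in place of real LLM representations. The dimensions, the planted direction `true_dir`, and the steering strength `alpha` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                              # size of a hidden vector (assumption)

# Synthetic stand-ins for LLM representations of 100 concept prompts and
# 100 other prompts; the concept is planted along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
concept_acts = rng.normal(size=(100, d)) + 2.0 * true_dir
neutral_acts = rng.normal(size=(100, d))

def concept_direction(pos, neg):
    """Difference-of-means direction (a simple stand-in for a learned pattern)."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h, v, alpha):
    """Perturb a representation along the concept direction.

    Positive alpha strengthens the concept; negative alpha weakens it.
    """
    return h + alpha * v

v = concept_direction(concept_acts, neutral_acts)
h = rng.normal(size=d)                              # representation of some new prompt
boosted = steer(h, v, alpha=4.0)
suppressed = steer(h, v, alpha=-4.0)
```

In a real model, the perturbed vector would be fed back into the remaining layers so that the generated text shifts toward or away from the concept; here the effect is visible only as a change in how strongly each vector projects onto `v`.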
The method can be applied to find and manipulate any general concept in an LLM. Among many examples, the researchers identified the relevant representations and manipulated an LLM to give answers in the tone and perspective of a “conspiracy theorist.” They also identified and enhanced the concept of “anti-refusal,” and showed that a model that would normally be programmed to refuse certain prompts instead answered them, for instance by giving instructions on how to rob a bank.
Radhakrishnan says the approach can be used to quickly search for and minimize vulnerabilities in LLMs. It can also be used to enhance certain traits, personalities, moods, or preferences, such as emphasizing the concept of “simplicity” or “reasoning” in any response an LLM generates. The team has made the method’s underlying code publicly available.
“LLMs clearly have many of these abstract concepts stored within them, in some representation,” Radhakrishnan says. “Once you understand these representations well enough, there are ways to build highly specialized LLMs that are safe to use and can be very effective for certain tasks.”
This research was supported in part by the National Science Foundation, the Simons Foundation, the TILOS Institute, and the U.S. Office of Naval Research.
