Adding noise to large models can be an alternative to fine-tuning with GRPO/PPO


Simply adding Gaussian noise to a model's weights can match or exceed the performance of standard fine-tuning algorithms such as GRPO and PPO.

A new paper from MIT tackles the fine-tuning headache that everyone struggles with.

Countless people work day and night to turn pre-trained models into experts at specific tasks, and many are losing their hair over it.

But now, an advisor-student pair at MIT says in a new paper:

Without any complex fine-tuning, just by randomly perturbing the parameters and ensembling the results, a model can match the performance of dedicated tuning methods such as GRPO/PPO.

Before this paper was published, the general consensus was that expert models had to be trained.

Whether it’s gradient descent or reinforcement learning, you need to optimize parameters step by step.

However, this paper argues that expert models already exist. They are just hidden in weight space. The neighborhood of a pre-trained model actually looks like this:

Expert models are clustered together like a shrub (the “Neural Thickets” phenomenon described in the paper).

That is, by slightly perturbing parameters near pre-trained weights, it is possible to “encounter” an expert in a new task.

Based on this, the authors propose a very simple method, RandOpt:

Simply add Gaussian noise to the weights of a large language model (a single-step operation with no iterations, no learning rate, and no gradients) and ensemble the resulting models. This achieves performance equal to or better than standard GRPO/PPO on mathematical reasoning, programming, writing, and chemistry tasks.

Additionally, the authors found that the larger the model, the better this works.
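As a minimal sketch of what such a single-step perturbation could look like, the snippet below adds independent Gaussian noise to a toy weight vector. The function name `perturb`, the plain-list representation of the weights, and the value of `sigma` are assumptions made for illustration, not details taken from the paper.

```python
import random

def perturb(weights, sigma, seed):
    """Return a copy of `weights` with i.i.d. Gaussian noise N(0, sigma^2) added.

    Hypothetical helper: a real model would apply this to every parameter tensor.
    """
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, sigma) for w in weights]

# Toy stand-in for pre-trained weights (illustrative values only).
base = [0.5, -1.2, 0.3, 2.0]

# One perturbed copy per seed: no gradients, no learning rate, no iterations.
candidates = [perturb(base, sigma=0.01, seed=s) for s in range(8)]
```

Because each candidate is fully determined by its seed and `sigma`, it can be regenerated on demand instead of being stored.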

A “neural thicket” is hidden around the pre-trained model

Simply put, this paper presents a counterintuitive conclusion.

A large number of “expert models” already exist in the vicinity of the pre-trained model.

In the weight space, models capable of solving different tasks are not randomly distributed, but “grow” densely around pre-trained weights.

Therefore, in theory, a complex training process is not always necessary. With a few random tries in this neighborhood, you may land on a model that is an expert at the task.

When many people hear this, their reaction may be: “Oh, isn’t this just guesswork and trial and error?”

Yes, it really is just guessing.

For a long time, random guessing has been considered an unreliable machine learning algorithm. For example, the probability of randomly guessing ChatGPT’s parameter vector is almost zero.

However, the paper found that the situation was different for pre-trained models.

Because perturbations that improve task performance are densely packed around the pre-trained weights, random guessing is enough to find effective improvements.

In the paper, the authors applied 1000 random weight perturbations to pre-trained Qwen2.5 models (0.5B to 32B) and mapped them onto a two-dimensional plane by random projection.
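One common way to do such a random projection is to take the dot product of each perturbation with two random Gaussian directions. The sketch below shows that idea on toy data; the helper `random_projection_2d`, the 16-dimensional perturbations, and the noise scale are assumptions, and the paper's exact plotting procedure may differ.

```python
import random

def random_projection_2d(deltas, seed=0):
    """Project each perturbation vector onto two random Gaussian directions."""
    rng = random.Random(seed)
    dim = len(deltas[0])
    axis_x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    axis_y = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return [(dot(d, axis_x), dot(d, axis_y)) for d in deltas]

# 1000 toy perturbations in a 16-dimensional "weight space" (illustrative only).
rng = random.Random(1)
deltas = [[rng.gauss(0.0, 0.01) for _ in range(16)] for _ in range(1000)]

# Each perturbation becomes one (x, y) point that can be colored by accuracy.
points = random_projection_2d(deltas)
```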

The results show that the larger the model, the denser the “high accuracy regions” around it. After the perturbation, the performance of the small model mostly degrades (blue region), but there are “experts” here and there around the large model that perform better (red region).

In other words, the larger the model, the more obvious and effective this perturbation effect becomes.

Furthermore, it should be noted that these random perturbations produce “specialized experts” rather than “all-round players.”

Experiments show that a single random perturbation cannot improve the model’s performance on all tasks. For example, one perturbation can make the model more accurate at math but worse at writing code. Another could make the model good at solving chemistry problems but bad at writing stories.

Similarly, the larger the model, the more obvious this specialization becomes.

Through a very simple experiment, the paper also offers a preliminary explanation of why a pre-trained model “hides a large number of experts”.

The authors chose a simple, easy-to-understand autoregressive model for 1D signals and trained it to predict the next value of a time series.

Three situations occurred:

No pre-training: No matter how many perturbations are tried, none of them improves performance, and random guessing is pointless.

Single-task pre-training: The model performs very well only on the task it was pre-trained for, and no other high-quality variants appear around its parameters.

Multi-task mixed pre-training: The region around the model parameters is suddenly full of perturbations that improve performance. Small changes unlock specialized abilities for predicting particular types of signals, reproducing the dense “neural thicket” state.

The paper therefore reaches its core conclusion: the key to the emergence of the “Neural Thickets” phenomenon lies in the massive multi-task pre-training of large models.

In other words, when the foundation is strong enough, randomly perturbing its surroundings easily turns up “experts.”

The RandOpt algorithm this inspired

These findings led the paper’s authors to propose a new algorithm, RandOpt.

RandOpt’s operating mechanism can be divided into two simple steps: randomly finding experts + team voting.

“Randomly finding experts” works as described above: randomly perturbing the parameters of a pre-trained model N times yields N new versions of the model.

By simply testing these models with a small amount of validation data, you can find the K models with the best performance.
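The selection step can be sketched roughly as follows, with a toy scoring function standing in for accuracy on the small validation set. The names `select_top_k` and `toy_score`, the two-dimensional weights, and all numeric values are hypothetical and chosen only to make the example runnable.

```python
import heapq
import random

def select_top_k(base, score_fn, n, k, sigma, seed=0):
    """Sample n Gaussian perturbations of `base` and keep the k best by score."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n):
        candidate = [w + rng.gauss(0.0, sigma) for w in base]
        scored.append((score_fn(candidate), candidate))
    # Highest validation score first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

def toy_score(weights):
    """Stand-in for validation accuracy: closer to a made-up optimum is better."""
    target = [1.0, -1.0]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

# N = 50 perturbed models, keep the K = 5 with the best validation score.
top = select_top_k(base=[0.9, -0.8], score_fn=toy_score, n=50, k=5, sigma=0.1)
```

In a real setting `score_fn` would run the perturbed model on held-out examples, which is where almost all of the cost lies.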

Once we have these K models, the next step is the actual inference stage.

Each of these K “experts” is asked to answer a question, and the final result is determined by a “majority vote” principle.
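A majority vote over the K answers can be written in a few lines. In this sketch the experts are plain Python callables returning a string, which is an assumption for illustration; real experts would be perturbed copies of the language model, and answers would need to be normalized before counting.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def ensemble_answer(experts, question):
    """Ask every expert the same question and return the majority answer."""
    return majority_vote([expert(question) for expert in experts])

# Three hypothetical experts, one of which disagrees.
experts = [lambda q: "42", lambda q: "41", lambda q: "42"]
answer = ensemble_answer(experts, "What is 6 * 7?")
```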

There are two things to note about the whole process.

First, when choosing the perturbation scale sigma (the noise intensity), RandOpt tries several noise levels (for example small, medium, and large perturbations) so that different kinds of experts can be found.

Second, the N models can be evaluated in parallel across multiple GPUs, which makes the process very fast.
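Both points can be illustrated together: if each candidate is described only by a (sigma, seed) pair, any worker can rebuild and score it independently. The sketch below uses a thread pool and a toy score as stand-ins for GPU workers and validation accuracy; the sigma values, `BASE`, `TARGET`, and `evaluate` are all assumptions for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE = [0.0, 0.0, 0.0]        # toy stand-in for pre-trained weights
TARGET = [0.1, -0.1, 0.05]    # made-up "task optimum", purely illustrative
SIGMAS = [0.01, 0.05, 0.2]    # small, medium, large noise (made-up values)

def evaluate(job):
    """Rebuild one perturbed model from (sigma, seed) and score it."""
    sigma, seed = job
    rng = random.Random(seed)
    candidate = [w + rng.gauss(0.0, sigma) for w in BASE]
    score = -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))
    return sigma, seed, score

# Four candidates per noise level, each with its own seed.
jobs = []
for sigma in SIGMAS:
    for _ in range(4):
        jobs.append((sigma, len(jobs)))

# Jobs share no state, so they can be scored concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, jobs))

best = max(results, key=lambda r: r[2])
```

Passing only (sigma, seed) between workers keeps communication small, since no weights need to be copied around.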

Of course, the paper tested this new algorithm on different models.

Preliminary results show that for pure language models on tasks such as mathematics, programming, story writing, and chemistry, RandOpt’s accuracy is comparable to, and in some cases better than, that of current mainstream tuning methods (PPO/GRPO/ES).

For vision-language models, the improvement from RandOpt was even more pronounced, with accuracy rising from 56.6% to 69.0%.

Beyond language and vision-language models, the paper also observes a similar “neural thicket” phenomenon in image diffusion models.

Certain regions in parameter space tend to produce images with particular tones and visual styles.

The paper’s authors also note that RandOpt performs better in the following situations:

The more random perturbations you sample, the stronger the “experts” you select will be.

The larger the model, the more effective RandOpt is.

About the authors

Finally, let me introduce the two authors of this study.

Yulu Gan holds a master’s degree in engineering from Peking University and is currently a Ph.D. student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

Previously, he interned at Microsoft, and his primary research interests include multimodal large language models, reasoning, multi-agent systems, and AI for science.

The other author, Phillip Isola, is his advisor and currently an associate professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology.

Phillip Isola joined OpenAI as a technical staff member in 2017 after completing his postdoctoral work at the University of California, Berkeley.

But less than a year later, he went to Google for a year as a visiting researcher.

He then returned to his alma mater, the Massachusetts Institute of Technology, where he had done his graduate studies, and has taught there ever since.

Phillip Isola’s main research interests are the foundations of AI and computer vision. He contributed to classic work such as pix2pix and the LPIPS perceptual loss, and his papers have been cited more than 100,000 times on Google Scholar.

With this study, the advisor-student pair seems to be telling us once again:

It’s time to rethink pre-trained models. A pre-trained model is not just a “usable model” but a “collection of many experts”.

As long as pre-training is sufficient, complex fine-tuning may not be needed afterward to make the model perform well on a particular task. Random perturbation plus team voting, as in RandOpt, can save both time and compute.


