Uncovering challenges in language model performance: Studies on saturation and representation degeneration

Machine Learning


https://arxiv.org/abs/2404.07647

Language models (LMs) face a self-supervised learning challenge known as representation degeneration. LMs such as BERT and GPT-2 are neural networks that process token sequences into context representations, and degeneration shows up as small angular variation between these representations, small scale, and a few dominant outlier dimensions. A language modeling head (usually a linear layer with parameter matrix W) then maps each context representation to a probability distribution over the next token. The current trend is to scale up generative pretraining in the style of GPT-2, despite concerns about energy and hardware limitations. Yet an evaluation of the Pythia model suite reveals that when small models are trained on a broad corpus, performance saturates in the late stages of pre-training.
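To make the role of the head concrete, here is a minimal sketch of a linear language modeling head in PyTorch. The sizes, variable names, and the random W are illustrative stand-ins, not the actual Pythia or GPT-2 modules:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a linear language modeling head (sizes and weights are
# illustrative stand-ins, not an actual pre-trained model's parameters).
vocab_size, hidden_dim = 50_000, 512

# W maps a context representation (hidden state) to vocabulary logits.
W = torch.randn(vocab_size, hidden_dim) * 0.02

# A single context representation produced by the transformer body.
hidden_state = torch.randn(hidden_dim)

# Next-token distribution: softmax over the linear projection.
logits = W @ hidden_state                      # shape (vocab_size,)
next_token_probs = F.softmax(logits, dim=-1)   # sums to 1

print(next_token_probs.shape, next_token_probs.sum())
```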

A Pythia model trained on 300B tokens from The Pile shows performance degradation in its smaller variants during the late training stages on the LAMBADA dataset. Scaling laws predict that training compact models on vast corpora is inefficient, yet recent efforts have sought to reduce inference costs by training smaller language models on extensive datasets, and it is this tension that motivates the study. The softmax bottleneck highlights the limitations of models with an insufficient hidden dimension. Representation degeneration in pre-trained models leads to low-entropy singular value distributions and harms language modeling. Some studies use singular value decomposition (SVD) to analyze the performance limitations of linear classifiers and to link scaling laws to data dimensionality.
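The SVD-based diagnostic mentioned above can be sketched in a few lines: compute the singular value spectrum of a head matrix and measure its entropy, since degeneration manifests as a low-entropy (concentrated) spectrum. W here is a random stand-in, not weights from a pre-trained model:

```python
import math
import torch

# Hedged sketch of the SVD-based analysis: inspect the singular value
# spectrum of a head matrix W. W is a random stand-in for illustration.
vocab_size, hidden_dim = 5_000, 512
W = torch.randn(vocab_size, hidden_dim)

s = torch.linalg.svdvals(W)   # singular values, in descending order
p = s / s.sum()               # normalize the spectrum into a distribution

# Representation degeneration shows up as a low-entropy (concentrated)
# singular value distribution; a uniform spectrum attains the maximum.
spectral_entropy = -(p * p.log()).sum().item()
print(f"spectral entropy: {spectral_entropy:.3f} / max {math.log(hidden_dim):.3f}")
```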

Researchers from Inria Paris and Sorbonne Université have conducted a thorough study of the correlation between saturation and representation degeneration in language models, focusing on the language modeling head of small-scale models. They demonstrate that a linear language modeling head can become a performance bottleneck for architectures with small hidden dimensions. The bottleneck arises from the mismatch between the hidden dimension of the smaller model and the high rank of the target contextual probability distribution, which degrades performance through the softmax bottleneck phenomenon.
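A toy illustration of this argument, on synthetic data rather than the paper's experimental protocol: replace a head W by its best rank-d approximation and watch the cross-entropy rise as d shrinks below the effective rank of the target:

```python
import torch
import torch.nn.functional as F

# Toy softmax-bottleneck illustration (synthetic tensors, not the paper's
# setup): truncate the head W to rank d and measure the loss increase.
torch.manual_seed(0)
vocab_size, hidden_dim, n_contexts = 2_000, 256, 512

W = torch.randn(vocab_size, hidden_dim)    # stand-in LM head
H = torch.randn(n_contexts, hidden_dim)    # stand-in context representations
targets = (H @ W.T).argmax(dim=-1)         # tokens the full-rank head prefers

def loss_at_rank(d: int) -> float:
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_d = U[:, :d] @ torch.diag(S[:d]) @ Vh[:d]  # best rank-d approximation
    return F.cross_entropy(H @ W_d.T, targets).item()

for d in (256, 64, 16, 4):
    print(f"rank {d:3d}: cross-entropy = {loss_at_rank(d):.3f}")
```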

The researchers investigated the performance saturation of Pythia models of various sizes and observed saturation in models of up to 410 million parameters. Loss saturation manifests as an increase in in-domain loss at advanced training stages. A scaling law fitted to the data points of models beyond 410 million parameters yields optimal parameters A = 119.09 and α = 0.246. The final checkpoints of the saturating models fall on average about 8% short of this extrapolation, and even the best checkpoints fall about 4% short, a gap partly attributable to incomplete learning rate cooldown.
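The fit itself is a standard power-law regression. Below is a minimal sketch assuming a law of the form L(N) = A·N^(−α); the functional form, the parameter counts, and the synthetic losses are assumptions made for illustration, and only the reported values A = 119.09 and α = 0.246 come from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hedged sketch of a scaling-law fit. The form L(N) = A * N**(-alpha) and
# the synthetic (size, loss) points are assumptions for illustration only.
def power_law(N, A, alpha):
    return A * N ** (-alpha)

rng = np.random.default_rng(0)
N = np.array([410e6, 1.0e9, 1.4e9, 2.8e9, 6.9e9, 12e9])  # parameter counts
L = power_law(N, 119.09, 0.246) * (1 + 0.01 * rng.standard_normal(N.size))

(A_fit, alpha_fit), _ = curve_fit(power_law, N, L, p0=(100.0, 0.25))
print(f"A = {A_fit:.2f}, alpha = {alpha_fit:.3f}")

# A saturating checkpoint whose loss sits ~8% above the extrapolated value:
predicted = power_law(160e6, A_fit, alpha_fit)
observed = 1.08 * predicted
print(f"gap vs. extrapolation: {(observed - predicted) / predicted:.0%}")
```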

The main contributions of this study are:

  1. Characterizing the performance saturation of small-scale language models through scaling-law evaluation and extrapolation.
  2. Identifying a joint degeneration of representations in smaller models, in particular rank saturation in the LM prediction head.
  3. Empirically verifying that the target contextual probability distribution is high-rank, and that replacing high-rank linear heads with low-rank ones substantially hurts performance.
  4. Theoretically quantifying the performance limitations induced by low-rank LM heads.

Anisotropy is a common form of representation degeneration in small-scale language models: the angular variation between representations shrinks, and the effect is observed across layers. Measured via the average cosine similarity between hidden states, anisotropy turns out to be widespread. In the Pythia suite, a correlation is observed between anisotropy and performance saturation. The singular value distribution of the language modeling head likewise exhibits spectral saturation patterns that emerge concurrently with performance saturation. A theoretical analysis then establishes a formal relationship between the dimensionality of the contextual distribution and the performance bottleneck induced by a low-rank head.
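The anisotropy measure described above is easy to reproduce on synthetic data. This sketch computes the average pairwise cosine similarity over a batch of vectors; `reps` stands in for hidden states drawn from one layer of a model, and the two test batches are artificial:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the anisotropy measure: average pairwise cosine
# similarity over a (n, d) batch of representations (synthetic here).
def average_cosine_similarity(reps: torch.Tensor) -> float:
    unit = F.normalize(reps, dim=-1)   # project onto the unit sphere
    sims = unit @ unit.T               # all pairwise cosine similarities
    n = reps.size(0)
    off_diag = sims.sum() - n          # drop the n self-similarities (all 1)
    return (off_diag / (n * (n - 1))).item()

torch.manual_seed(0)
isotropic = torch.randn(256, 512)      # near-zero mean cosine similarity
anisotropic = isotropic + 3.0          # shared offset -> narrow cone
print(f"isotropic:   {average_cosine_similarity(isotropic):.3f}")
print(f"anisotropic: {average_cosine_similarity(anisotropic):.3f}")
```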

In conclusion, this study traces performance saturation in small-scale language models to the difficulty of mapping low-dimensional output representations to a high-rank contextual probability distribution through a linear language modeling head. The paper establishes a theoretical link between this performance gap and the spectral properties of the contextual probability distribution, and empirical results confirm that the rank of this target is indeed relatively high. Experiments show that performance degrades significantly when the hidden dimension of the LM head drops below 1000. The analysis correlates saturation with last-layer anisotropy and spectral saturation in the LM heads of small models, improving our understanding of how the softmax bottleneck affects language modeling.


Check out the paper. All credit for this research goes to the researchers of this project.


Asjad is an intern consultant at Marktechpost. He is pursuing a degree in mechanical engineering from the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast and is constantly researching the applications of machine learning in healthcare.

