Four Over Six: NVFP4 quantization improves accuracy with adaptive block scaling

The growing size of modern artificial intelligence models is driving demand for efficient numeric formats, and low-precision formats like NVFP4 offer significant speed and memory benefits. However, training and running models accurately in this format has proven difficult, often leading to instability and poor performance. Jack Cook, Junxian Guo, and Guangxuan Xiao from the Massachusetts Institute of Technology, along with Yujun Lin and Song Han from MIT and NVIDIA, have announced an improvement to the NVFP4 quantization algorithm called Four Over Six. Their method evaluates multiple scaling options for each block of data so that values are represented more accurately, which improves performance and prevents training divergence, with significant benefits when pretraining large language models. The advance is expected to make both the training and deployment of increasingly complex artificial intelligence systems more efficient.

Mitigating 4-bit LLM performance loss

Researchers are tackling the computational demands of large language models (LLMs) through quantization, a technique that reduces numerical precision. Reducing precision to a 4-bit format like NVFP4 provides significant speed and memory benefits, but often degrades model performance. This research focuses on minimizing that loss and maintaining accuracy. The approach relies on a technique called 2D block scaling, which splits the model's weight matrix into blocks and assigns each block its own scale factor, preserving the matrix structure during training.

This approach increases training stability and prevents significant performance degradation. To further refine the representation of these weight blocks, the team introduced a technique called Four Over Six (4/6), which chooses the scale factor for each block more carefully. Experiments with Llama 3 and Qwen3 models ranging from 1 billion to 70 billion parameters show that 4/6 consistently improves performance as measured by perplexity on the WikiText-2 dataset. The choice of block size (1×16 or 16×16) also affects performance and should be considered carefully during implementation. Combining 2D block scaling with 4/6 reduces the performance loss associated with 4-bit quantization, making LLMs more efficient and accessible.
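To make the difference between 1×16 and 16×16 blocks concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `block_scales`, the toy 32×32 matrix, and the choice to map each block's largest magnitude to 6 (the largest FP4 magnitude) are assumptions made for illustration, and the reduced-precision storage of the scale factors in real NVFP4 is left out.

```python
def block_scales(matrix, block_rows, block_cols, fp4_max=6.0):
    """Return one scale per (block_rows x block_cols) tile: tile absmax / fp4_max."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_of_scales = []
        for c0 in range(0, n_cols, block_cols):
            # Largest magnitude inside this tile decides the tile's scale.
            tile_absmax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            row_of_scales.append(tile_absmax / fp4_max)
        scales.append(row_of_scales)
    return scales


def transpose(matrix):
    """Plain list-of-lists transpose."""
    return [list(col) for col in zip(*matrix)]


# Toy 32x32 weight matrix with deterministic values (illustration only).
weights = [[float((r * 31 + c * 17) % 23 - 11) for c in range(32)] for r in range(32)]

scales_1d = block_scales(weights, 1, 16)   # one scale per 1x16 row segment
scales_2d = block_scales(weights, 16, 16)  # one scale per 16x16 tile
print(len(scales_1d), len(scales_1d[0]), len(scales_2d), len(scales_2d[0]))

# With square tiles, the scales of the transposed matrix are the transposed scales.
print(block_scales(transpose(weights), 16, 16) == transpose(scales_2d))
```

In this sketch, the 16×16 tiling gives the same scale to a weight whether the matrix is used as stored or transposed, which is one way to read "preserving the matrix structure"; 1×16 blocks do not have that property in general.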

New quantization reduces training instability

Researchers have developed a new quantization technique, Four Over Six (4/6), to address the challenges of training large language models in low-precision numerical formats like NVFP4. While formats such as NVFP4 offer speed and memory advantages, they require all matrix multiplication operands to be quantized, which often leads to training instability and poor performance. The research team determined that standard NVFP4 quantization concentrates its error near the maximum values in a block of data, so values close to the block maximum are represented poorly. To address this, the work introduces a modification to the NVFP4 algorithm that evaluates two candidate scale factors for each block of values, scaling some blocks to a maximum value of 4 and others to a maximum value of 6.

This improves the representation of near-maximal values, prevents divergence in some training cases, and brings the training loss much closer to the loss achieved with BF16 precision. Designed for efficient implementation on NVIDIA Blackwell GPUs, 4/6 improves downstream accuracy and can be easily integrated into existing post-training quantization techniques. Experiments cover transformer and hybrid model architectures, and the reported results include scaling blocks to a maximum of 4 yielding a mean squared error of 0, compared to 4.33 for standard NVFP4 quantization. The result is a lightweight method that improves numerical accuracy while keeping the speed benefits of a natively quantized model.

Selective scaling improves low-precision accuracy

Researchers developed a technique called Four Over Six (4/6) to improve the accuracy of low-precision numerical formats, especially NVFP4. NVFP4 is increasingly used to accelerate computation because of its speed and memory efficiency. Current NVFP4 quantization methods scale all blocks of data uniformly, which can degrade performance because precision is lost for values near each block's maximum. The team found that scaling some blocks to a smaller range while keeping the larger range for others significantly improves the representation of these important values. Experiments showed that values close to the block maximum are the main cause of performance degradation during quantization, because the standard FP4 format often represents them inaccurately.
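The effect on near-maximum values can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not code from the paper: it uses the eight magnitudes an FP4 (E2M1) element can represent, a simple nearest-value rounding rule (ties go to the smaller magnitude, which may differ from hardware rounding), and hypothetical helper names `round_to_fp4` and `relative_error`. For a value at 80% of the block maximum, mapping the maximum to 6 places the value at 4.8, which rounds to 4, while mapping the maximum to 4 places it at 3.2, which rounds to 3.

```python
# Magnitudes representable by an FP4 (E2M1) element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude, keeping the sign.

    Simplification: ties resolve to the smaller magnitude, and anything
    above 6 saturates to 6.
    """
    sign = -1.0 if x < 0 else 1.0
    nearest = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return sign * nearest


def relative_error(fraction_of_max, target_max):
    """Rounding error, as a fraction of the block maximum, when the block
    maximum is mapped to target_max (6 for standard scaling, 4 for the alternative)."""
    scaled = fraction_of_max * target_max
    return abs(round_to_fp4(scaled) - scaled) / target_max


# Compare both scalings for values at different fractions of the block maximum.
for fraction in (0.70, 0.80, 0.90, 1.00):
    err_6 = relative_error(fraction, 6.0)
    err_4 = relative_error(fraction, 4.0)
    print(f"{fraction:.2f} of max: error with max=6 {err_6:.4f}, with max=4 {err_4:.4f}")
```

Neither scaling wins for every value in this sketch, which is consistent with the article's point that the choice has to be made block by block.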

By adaptively scaling some blocks to a maximum of 6 and others to a maximum of 4, the team was able to better preserve accuracy for these near-maximum values. The team tested 4/6 with the Llama and Qwen language models on the WikiText-2 and C4 datasets and found that simply scaling all blocks to 4 degraded performance compared to standard NVFP4 quantization. However, choosing between 4 and 6 for each block based on the mean squared quantization error improved performance across a variety of models and datasets. Selecting the scale by mean squared error consistently yielded better results than uniform scaling, demonstrating the effectiveness of the adaptive approach.
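The selection rule described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: the names `quantize_block` and `four_over_six` are hypothetical, rounding is simple nearest-value rounding on the FP4 (E2M1) grid, and the reduced-precision storage of the scale factor in real NVFP4 is ignored.

```python
# Magnitudes representable by an FP4 (E2M1) element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x):
    """Nearest FP4 magnitude with the sign of x (ties go to the smaller magnitude)."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(FP4_GRID, key=lambda g: abs(g - abs(x)))


def quantize_block(block, target_max):
    """Scale the block so its absmax maps to target_max, round to FP4, dequantize."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return [0.0 for _ in block]  # an all-zero block needs no scaling
    scale = absmax / target_max
    return [round_to_fp4(v / scale) * scale for v in block]


def mse(original, reconstructed):
    """Mean squared error between a block and its dequantized version."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)


def four_over_six(block):
    """Try both candidate maxima and keep the one with the lower MSE.

    Assumption: ties keep the standard choice of 6.
    """
    candidates = {6.0: quantize_block(block, 6.0), 4.0: quantize_block(block, 4.0)}
    best = min(candidates, key=lambda m: mse(block, candidates[m]))
    return best, candidates[best]


# Two toy 16-value blocks: one favours a maximum of 4, the other a maximum of 6.
block_a = [10.0, 8.0] + [0.0] * 14
block_b = [10.0, 7.0] + [0.0] * 14
print(four_over_six(block_a)[0], four_over_six(block_b)[0])
```

Because the sketch always compares against the standard scaling to 6, the selected candidate can never have a higher mean squared error than the baseline for that block.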

The team reported word perplexity values of 35.09 and 20.48, and 66.32 and 37, on the WikiText-2 dataset. Evaluating multiple scale factors for each block of values during quantization improves accuracy, especially for values near the maximum, which are prone to error in low-precision formats. Results show that Four Over Six reduces divergence during pretraining across a variety of model architectures and sizes, bringing training loss closer to that achieved with higher-precision formats. The method also integrates with existing post-training quantization techniques and consistently improves performance across a variety of tasks. This work paves the way for more efficient training and deployment of large models in low-precision numerical formats, offering a promising path toward lower computational cost and broader access to artificial intelligence.


