Can large language models understand context?

Machine Learning


Understanding context is key to understanding human language, an ability that large language models (LLMs) increasingly demonstrate to an impressive extent. However, although the evaluation of LLMs spans many areas of natural language processing, limited attention has been paid to probing their ability to understand contextual features. In this paper, we present a context-understanding benchmark by adapting existing datasets for the evaluation of generative models. The benchmark consists of four distinct tasks and nine datasets, all featuring prompts designed to assess a model's ability to understand context. First, we evaluate the performance of pretrained LLMs under the in-context learning setting. Experimental results show that pretrained dense models struggle to understand more nuanced contextual features compared to state-of-the-art fine-tuned models. Second, as LLM compression is gaining importance in both research and real-world applications, we evaluate the context understanding of quantized models under the same in-context learning setting. We find that 3-bit post-training quantization degrades performance on our benchmark to varying degrees. We conduct an extensive analysis of these scenarios to substantiate our experimental results.
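
To make the in-context learning setup concrete, here is a minimal sketch of few-shot prompt-based evaluation on a toy context-understanding task (coreference-style questions). The prompt format, the example items, and the `generate` stub are illustrative assumptions, not the paper's actual benchmark prompts or datasets.

```python
# Few-shot in-context evaluation sketch: build a prompt from labeled
# demonstrations, query a model, and score completions by exact match.

FEW_SHOT_EXAMPLES = [
    ("Anna gave Maria her keys because she was leaving. Who was leaving?", "Anna"),
    ("The cup fell off the table and it broke. What broke?", "the cup"),
]

def build_prompt(question: str) -> str:
    """Prepend labeled demonstrations so the model can infer the task in context."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"

def evaluate(model_generate, dataset: list) -> float:
    """Exact-match accuracy; `model_generate` is any prompt -> completion callable."""
    correct = 0
    for question, gold in dataset:
        prediction = model_generate(build_prompt(question)).strip().lower()
        correct += prediction == gold.lower()
    return correct / len(dataset)

if __name__ == "__main__":
    # Stub "model" that always answers "Anna" -- stands in for a real LLM call.
    stub = lambda prompt: "Anna"
    print(evaluate(stub, [("Kim thanked Lee because he helped. Who helped?", "Lee")]))
```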

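For the compression experiments, the following is a minimal sketch of the idea behind low-bit post-training quantization, using simple per-group round-to-nearest quantize/dequantize on a weight tensor. Real PTQ methods (e.g., GPTQ) add error compensation on top of this; the group size and min/max scaling here are assumptions for illustration, not the paper's actual settings.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 3, group_size: int = 64) -> torch.Tensor:
    """Simulate round-to-nearest PTQ (quantize then dequantize) with
    per-group asymmetric min/max scaling; assumes numel % group_size == 0."""
    levels = 2 ** bits - 1                     # 3 bits -> 8 codes (0..7)
    flat = w.reshape(-1, group_size)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / levels
    q = torch.round((flat - lo) / scale).clamp(0, levels)
    return (q * scale + lo).reshape(w.shape)

# Example: quantize a toy linear layer's weights and measure the error introduced.
layer = torch.nn.Linear(256, 256)
w_q = quantize_rtn(layer.weight.data)
print("mean abs error:", (layer.weight.data - w_q).abs().mean().item())
```

At 3 bits only eight distinct values are available per group, which is why performance degradation of the kind the benchmark measures becomes visible at this precision.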

