Performance gains in modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, since only a fraction is relevant to any given prompt, and it is impractical on edge devices, where inference-time memory and compute are limited. We address this shortcoming with a memory-augmented architecture and a pre-training strategy aligned with existing hardware paradigms. We introduce a small language model that accesses a large hierarchical parametric memory bank encoding world knowledge: small, context-relevant blocks of memory are fetched and added to the model during both pre-training and inference. Our pre-training learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor that captures common knowledge and general reasoning abilities. In trillion-token-scale experiments, this yields significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank matches the performance of a regular model with more than twice as many parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers and scale them beyond 21B parameters. We find that our proposed hierarchical feedforward memory works robustly across transformer architectures, whether added during pre-training or post-hoc.
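To make the fetch-and-add mechanism concrete, here is a minimal numpy sketch of the general idea: a bank of small feedforward memory blocks, each paired with a key vector; for a given hidden state, the top-k blocks by key similarity are fetched and their feedforward output is added residually. All names, sizes, and the scoring rule are illustrative assumptions, not the paper's actual method or hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: base hidden width, memory width, bank size, blocks fetched.
d_model, d_mem, n_blocks, top_k = 64, 32, 16, 2

# Assumed memory bank: each block is a small ReLU feedforward (W_in, W_out)
# plus a key vector used to match it against the current context.
block_keys = rng.standard_normal((n_blocks, d_model))
W_in = rng.standard_normal((n_blocks, d_model, d_mem)) * 0.02
W_out = rng.standard_normal((n_blocks, d_mem, d_model)) * 0.02

def fetch_and_apply(h):
    """Fetch the top-k context-relevant memory blocks for hidden state h
    (shape: [d_model]) and add their feedforward output residually."""
    scores = block_keys @ h                  # relevance score per block
    idx = np.argsort(scores)[-top_k:]        # indices of the top-k blocks
    update = np.zeros_like(h)
    for i in idx:
        # Small feedforward memory block: ReLU(h W_in) W_out.
        update += np.maximum(h @ W_in[i], 0.0) @ W_out[i]
    return h + update                        # residual add into the model

h = rng.standard_normal(d_model)
out = fetch_and_apply(h)
```

Only `top_k` blocks participate per input, so the active parameter count stays small (here 2 of 16 blocks) even when the bank itself is large, which is the property that makes a big memory bank viable on memory-constrained devices.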
