Performance gains in modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, since only a fraction is relevant to any given prompt, and it is impractical on edge devices, where inference-time memory and compute are limited. We address this shortcoming with a memory-augmented architecture and a pre-training strategy aligned with existing hardware paradigms. We introduce a small language model that accesses a large hierarchical parametric memory bank encoding world knowledge: small, context-relevant blocks of memory are fetched and added to the model during both pre-training and inference. Our pre-training learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor that captures common knowledge and general reasoning abilities. In trillion-token-scale experiments, this yields significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank matches the performance of a regular model with more than twice as many parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers and scale them beyond 21B parameters. We find that our proposed hierarchical feedforward memory works robustly across transformer architectures, whether added during pre-training or post-hoc.
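To make the fetch-and-add mechanism concrete, here is a minimal numpy sketch of the general idea: a bank of small feedforward memory blocks, each paired with a key vector; for a given hidden state, the top-k blocks by key similarity are fetched and their feedforward output is added residually. All names, sizes, and the scoring rule are illustrative assumptions, not the paper's actual method or hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: base hidden width, memory width, bank size, blocks fetched.
d_model, d_mem, n_blocks, top_k = 64, 32, 16, 2

# Assumed memory bank: each block is a small ReLU feedforward (W_in, W_out)
# plus a key vector used to match it against the current context.
block_keys = rng.standard_normal((n_blocks, d_model))
W_in = rng.standard_normal((n_blocks, d_model, d_mem)) * 0.02
W_out = rng.standard_normal((n_blocks, d_mem, d_model)) * 0.02

def fetch_and_apply(h):
    """Fetch the top-k context-relevant memory blocks for hidden state h
    (shape: [d_model]) and add their feedforward output residually."""
    scores = block_keys @ h                  # relevance score per block
    idx = np.argsort(scores)[-top_k:]        # indices of the top-k blocks
    update = np.zeros_like(h)
    for i in idx:
        # Small feedforward memory block: ReLU(h W_in) W_out.
        update += np.maximum(h @ W_in[i], 0.0) @ W_out[i]
    return h + update                        # residual add into the model

h = rng.standard_normal(d_model)
out = fetch_and_apply(h)
```

Only `top_k` blocks participate per input, so the active parameter count stays small (here 2 of 16 blocks) even when the bank itself is large, which is the property that makes a big memory bank viable on memory-constrained devices.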
