Panmnnesia Powers Recommendation Models with CXL – Blocks and Files

South Korean startup Panmnesia says there is a way to make its recommendation models run five times faster by feeding data from an external memory pool to the GPU via CXL caching instead of hosting CPU-controlled memory transfers. I’m here.

Its TrainingCXL technology was developed by computing researchers at the Korea Advanced Institute of Science and Technology (KAIST) in Daejeon. Pannesia, which means “remember everything”, has started to commercialize.

Panmnesia’s CEO is KAIST Associate Professor Myung-Soo Jeong, who said in an interview with Seoul Finance: High Bandwidth Memory (HBM) does not solve this problem. [virtual] A place is essential to increase storage capacity. ”

DirectCXL was developed by Panmnesia. DirectCXL is a hardware and software way of implementing his CXL memory pool and switching technology specifically for recommendation engines, but it is also useful for other large-scale machine learning applications.

Panmnesia memory expander module and chassis with switch and expander module — *Panmnesia memory expander module (top) and chassis with switch and expander module (bottom)*

A recommendation engine model is a machine learning process used by Facebook, Amazon, YouTube, and e-commerce sites to suggest products to users based on what they are currently buying or renting. Computer eXpress Link (CXL) is a protocol for interconnecting devices with memory across PCIe gen 5 or gen 6 bus-based systems, allowing physically separate but coherent memory regions to share data. Technology. Panmnesia uses the CXL caching protocol cxl.cache to accelerate data access in distributed CXL memory pools and do some processing in the CXL memory controller to reduce model execution time.

The background to this technology is explained in two complex academic papers from IEEE. In “Fail Tolerant Training with Persistent Memory Decomposition via CXL”, for external memory he uses Optane PMEM. On the other hand, the “memory pool by CXL” behind the paywall uses DRAM.

To understand what’s going on (how Pannesia achieved a 5x+ speedup), let’s take a look at the basics of the hardware scheme and its CXL components and some recommendation engine technologies. need to understand the concept of

Hardware and CXL

Panmnesia’s technology includes the CXL v2 switch and CXL memory controller, providing access to shared external CXL memory for both the CPU and GPU, each with their own local memory capacity. This shared coherent memory pool, called the Host Physical Address (HPE), is larger than the GPU and CPU’s individual memory, so instead of fetching data for large recommendation models piecemeal from many resources, can be stored in memory. slow SSD storage. The controller also has PIM (Processing-in-Memory) capabilities and an additional near-data processor, oddly called a DPU. These speed recommendation engines preprocess certain types of data to reduce size and processing complexity before sending it to the GPU.

The recommendation engine model relies on data items called embedding vectors. A vector is a sequence of numbers of arbitrary dimension (tens, hundreds, or more) that describes a complex data item such as a word, phrase, paragraph, or object. image or video. Vector databases are used in AI and ML applications such as semantic search, chatbots, cybersecurity threat detection, product search, and recommendations.

A machine learning model takes an input vector describing an item, looks for similar items in a database, and performs parallel and iterative computations to arrive at a decision. As a rule of thumb, the larger the database and the more comprehensive the vector, the more likely follow-up is recommended. bone While checking out the first movie, not the movie good will hunting.

Training a recommendation engine can take days, so a 5x speedup is desirable. Embedded vectors are loaded into his DRAM in the CXL controller. Then, using the CXL.cache protocol, the model is run and the intermediate values are fed back to the memory of the CXL controller in the same way and are made available to the GPU. Using the CXL.cache protocol saves time by eliminating the need for the host CPU to move data from one his HPA memory area to another his HPA memory area.

Preprocessing the vectors embedded with the controller PIM and DPU elements to reduce their size before sending them to the GPU for the next training cycle means of course doing the work in parallel with the running GPU. means you can do it. in parallel. This saves time and eliminates the need to purchase additional GPUs to increase GPU memory.

As companies adopt ChatGPT-style chatbots to operate on their own private datasets, training will be required and this represents a large emerging market for Pannesia.The TrainingCXL video explains the technology of PMEM incarnation I’m here.

Panmnesia’s founders have patented some of their technology, with more patents pending. Overall, with Pannesia’s academic founders setting up the company and setting up the business processes and technical documentation, the technology looks worth a shot. Check out the website – lots of content.

Source link