Three AI engines enter the bar in a row… • The Register

Machine Learning


Developers who want a deeper understanding of machine learning inference on local hardware can launch the new llama engine.

Software developer Leonardo Russo has released llama3pure, which includes three standalone inference engines. There is a pure C implementation for desktop, a pure JavaScript implementation for Node.js, and a pure JavaScript version for web browsers that does not require WebAssembly.

“All versions are compatible with Llama and Gemma architectures,” Russo explained to The Register by email. “The goal is to provide independent, dependency-free alternatives in both C and JavaScript that can read GGUF files and handle prompts.”

GGUF stands for GPT-Generated Unified Format. It is a common file format for distributing machine learning models.
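To give a sense of what reading a GGUF file involves, here is a minimal Python sketch that parses only the fixed-size header at the start of a file. It is not taken from llama3pure. It assumes a little-endian file at GGUF version 2 or later, where the tensor and metadata counts are 64-bit, and the function name `read_gguf_header` is made up for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes little-endian, version >= 2)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF header")
    # "<" = little-endian with no padding; 4s = magic, I = uint32, Q = uint64.
    magic, version, tensor_count, kv_count = struct.unpack(
        "<4sIQQ", data[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    if version < 2:
        raise ValueError("version 1 used 32-bit counts; not handled in this sketch")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage with a synthetic header rather than a real model file.
example = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(example))
```

A real reader would go on to decode the metadata key/value pairs and the tensor descriptors that follow the header, which is where most of the parsing work lies.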

Llama3pure is not intended as a replacement for llama.cpp, which is a widely used inference engine for running local models and is significantly faster at responding to prompts. Llama3pure is an educational tool.

“We see llama3pure as a more flexible alternative to llama.cpp, especially in terms of architectural transparency and broad hardware compatibility,” Russo explained. “While llama.cpp is the standard for high-performance optimization, involving a complex ecosystem of dependencies and build configurations, llama3pure takes a different approach.”

Russo believes developers can benefit from having an inference engine in a single human-readable file that makes the logic for file parsing and token generation explicit.

“The main goal of this project is to provide an inference engine contained within a single file of pure code,” he said. “By removing external dependencies and abstraction layers, developers can see the entire execution flow from GGUF parsing to final token without having to jump between files and libraries. It’s built for people who need to understand exactly what their hardware is doing.”
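The "final token" end of that flow can be illustrated with a short generation loop. The Python sketch below is a toy, not llama3pure's code: `next_logits` is a hypothetical stand-in for a full forward pass through the model, and it picks the highest-scoring token at each step (greedy decoding). The article does not say which sampling strategy llama3pure uses.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_tokens: Sequence[int],
    max_new_tokens: int,
    eos_token: int,
) -> List[int]:
    """Generate tokens one at a time, always taking the highest-scoring one."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)  # stand-in for one forward pass
        # Argmax over the vocabulary: index of the largest logit.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == eos_token:
            break  # the model signalled end of sequence
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]


def toy_logits(tokens: Sequence[int]) -> List[float]:
    """Toy 'model': favours (last token + 1) within a 5-token vocabulary."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_generate(toy_logits, [0], max_new_tokens=10, eos_token=4))
# prints [1, 2, 3]
```

In a real engine the forward pass dominates the cost, and the loop around it stays about this small.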

Russo believes it can also be useful if a developer is running legacy software or hardware, where client-side WebAssembly is not an option, and a decoupled tool without the possibility of future dependency conflicts is desired.

He said the C and Node.js engines have been tested with Llama models of up to 8 billion parameters and Gemma models of up to 4 billion parameters. The main limiting factor is the physical RAM required to host the model weights.

The RAM required to run a machine learning model on local hardware is approximately 1 GB per billion parameters if the model is quantized at 8 bits. Doubling or halving the precision doubles or halves the memory required. Models are commonly distributed at 16-bit precision, so a model with 1 billion parameters typically requires 2 GB.
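That rule of thumb is simple arithmetic: parameters multiplied by bits per weight, divided by eight to get bytes. A rough Python sketch follows. The function name `weight_memory_gb` is illustrative, the result covers weights only, and it ignores activations, caches and runtime overhead.

```python
def weight_memory_gb(parameters: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    total_bytes = parameters * bits_per_weight / 8
    return total_bytes / 1e9


for params, label in ((1e9, "1B"), (8e9, "8B")):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{label} parameters at {bits}-bit: about {gb:.1f} GB")
```

By this estimate an 8-billion-parameter model needs about 16 GB at 16 bits, 8 GB at 8 bits and 4 GB at 4 bits.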

According to Russo, the weight calculation for GGUF is different.

“The GGUF weights are loaded directly into RAM, which means that the RAM usage typically matches the overall file size,” he explained. “You can reduce the size of the context window by passing a specific parameter (context_size), a feature supported by most inference engines, including the three I designed. Reducing the size of the context window is a common ‘trick’ to save RAM when running a model locally, but it also means that the AI won’t ‘remember’ as much as it was originally designed to.”
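The reason a smaller context window saves RAM is that transformer engines usually keep a key/value cache whose size grows linearly with the number of positions. The Python sketch below estimates that cache using the standard formula. The layer and head counts are those of the published Llama 3 8B configuration. The 4-byte (float32) cache entries are an assumption, since the article does not say how llama3pure stores its cache, and `kv_cache_bytes` is an illustrative name.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_size: int,
    bytes_per_value: int = 4,  # assumption: float32 cache entries
) -> int:
    """Estimate the size of a transformer key/value cache in bytes."""
    # Two tensors (keys and values) per layer, with one head_dim-sized
    # vector per KV head for every position in the context window.
    return 2 * n_layers * n_kv_heads * head_dim * context_size * bytes_per_value


# Llama 3 8B shape: 32 layers, 8 KV heads, 128 dimensions per head.
for context_size in (8192, 2048):
    size = kv_cache_bytes(32, 8, 128, context_size)
    print(f"context_size={context_size}: {size / 1024**3:.2f} GiB of cache")
```

Under these assumptions, cutting the context from 8,192 to 2,048 tokens shrinks the cache from 2 GiB to 0.5 GiB, on top of the RAM taken by the GGUF file itself.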

He also mentioned that llama3pure is currently focused on single-turn inference. He plans to implement chat history state management at a later date.

Russo says that in his day-to-day work, he uses Gemma 3 with the C-based inference engine as a personal assistant, ensuring sensitive data is kept private and offline.

“For a coding assistant, I recommend the Gemma 3 27B,” he said. “Local models have historically raised concerns about latency, but running optimized versions on modern hardware provides an experience very close to cloud-based models like Claude, without having to pay for such services.”

While Russo expects common use cases for AI assistance to continue to rely on cloud-hosted models, he expects developers and enterprises to increasingly focus on local AI. Developer machines with 32 GB or 48 GB of RAM may lack the context window available in cloud-hosted models, but provide security and privacy without relying on a service provider.

When asked how he feels as a developer about the transition to AI, Russo said he expects developers to eventually transition to AI supervisors.

“AI models provide answers with a high degree of confidence, even if they are wrong, so a human expert must always be in the loop to verify the output,” he said. “Rather than becoming obsolete, technical knowledge will become increasingly important in auditing work generated by AI.

“Titles may change, but senior developers will always be needed to maintain these systems, creating a workflow that is significantly faster than human-only development. For junior and mid-level developers, AI provides an opportunity to learn faster than previous generations. Managed correctly, AI can facilitate major leaps in the industry’s intellectual evolution.” ®


