Researchers show that changing AI calculations may reduce strain on hardware

Advanced AI models tend to require large amounts of memory and occupy a lot of storage space. One way to reduce this footprint is through a process called quantization, which changes how weights in a model are represented and stored. However, quantization also has drawbacks.

Andrés Mac Allister, CEO and founder of SEMQ Group, believes there are other ways to make machine learning more efficient and less resource-intensive. Instead of compressing model weights (particularly embeddings), he argues that we can separate the semantics (meaning) from how that meaning is represented.

In models that involve embeddings (mappings of tokens into vectors), weights are numbers that determine how strongly one piece of information is related to another. Taken together, they reflect learned behavior.

These parameters are typically expressed in full precision (FP32), which requires 4 bytes per parameter. A 7B-parameter model in FP32 requires approximately 28 GB of disk space and memory.

To save space, the model may be quantized to FP16/BF16, which requires 2 bytes per parameter. The resulting model will require approximately 14 GB of disk space and memory. There are also smaller quantization options such as FP8, INT8/Q8, Q6, Q5, Q4, Q3, and Q2, each of which reduces storage and memory footprint while also reducing precision. The lower the precision, the worse the results.
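The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough illustration only: the function name and table are hypothetical, the sub-byte entry assumes exactly 4 bits per weight, and real quantized files usually carry extra scale and metadata overhead.

```python
# Approximate bytes per parameter for a few common weight formats.
# Assumption: "Q4" is treated as exactly 4 bits (0.5 bytes) per weight;
# real quantized formats add per-block scales, so this is a lower bound.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "FP8/INT8": 1.0,
    "Q4": 0.5,
}


def footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params = 7e9  # the 7B-parameter model used as the example in the text
    for fmt, size in BYTES_PER_PARAM.items():
        print(f"{fmt:>10}: {footprint_gb(params, size):.1f} GB")
```

The estimate covers weights only; activations, KV cache, and runtime buffers add to the memory actually needed.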

SEMQ stands for Symbolic Embedding Multi-Quantization. As explained in a paper published earlier this year, SEMQ “replaces raw vectors with a fixed-dimensional symbolic structure that preserves relational properties such as relative similarity ordering and neighborhood structure while separating representation from metrics, indexing, and execution semantics.”

Essentially, Mac Allister devised a way to build a semantic abstraction layer that separates the meaning captured in the embeddings (vectors representing the data) from the way the data is represented.

The underlying idea is that semantic relationships depend primarily on the relative orientation of the embedding vectors, so preserving the absolute magnitude of those vectors becomes less important. That means less data needs to be stored.
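A small Python sketch illustrates why orientation can matter more than magnitude. It uses cosine similarity, a common measure for comparing embeddings; the article does not say which metric SEMQ itself relies on, so treat the choice of metric and the helper names as assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; it ignores their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def scale(v, factor):
    """Change a vector's magnitude without changing its direction."""
    return [factor * x for x in v]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 0.5, 1.0]
    # Stretching `a` by 10x changes its magnitude but not its orientation,
    # so the two similarity values agree up to floating-point rounding.
    print(cosine_similarity(a, b))
    print(cosine_similarity(scale(a, 10.0), b))
```

Under this metric, a representation that discards vector length while keeping direction would leave similarity scores unchanged.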

The potential impact on enterprises running AI workloads depends on the portion of infrastructure costs that are attributable to semantic state.

“Embeddings are typically represented as long vectors of floating point numbers,” Mac Allister explained in an email to The Register. “In traditional embedding systems, semantic state is typically stored as a sequence of high-precision numerical coordinates, which jointly encode both the magnitude and orientation of the embedding space.

“Our original question was: Could a significant portion of useful semantic information be expressed in terms of the structural relationships between components, how they move relative to each other, what areas they occupy, and what directional configurations they form across space?”

To this end, SEMQ aims to represent relative geometries rather than enumerations of independent floating-point magnitudes.

“This is important because semantic systems typically do more than just store each raw number individually; they also take into account relationships, similarities, neighborhoods, continuity, search behavior, and changes over time. The result is a portable representation of semantic state that can be reproduced, audited, compared, and transferred between processes,” said Mac Allister.
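One generic way to test whether a compact representation keeps the neighborhoods and similarity relationships described above is to compare nearest-neighbor rankings before and after compression. The Python sketch below does that with a crude grid-snapping transform as a stand-in; it is not SEMQ's encoding, which the article does not detail, and every function name here is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; raises ZeroDivisionError on a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighbor_ranking(query, candidates):
    """Return candidate indices ordered from most to least similar."""
    return sorted(
        range(len(candidates)),
        key=lambda i: cosine_similarity(query, candidates[i]),
        reverse=True,
    )


def coarse(v, step=0.5):
    """Hypothetical lossy stand-in: snap each coordinate to a coarse grid."""
    return [round(x / step) * step for x in v]


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of the top-k neighbors shared by two rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


if __name__ == "__main__":
    query = [1.0, 0.1]
    candidates = [[0.9, 0.2], [0.1, 1.0], [-1.0, 0.0], [0.6, 0.7]]

    original = neighbor_ranking(query, candidates)
    compressed = neighbor_ranking(coarse(query), [coarse(c) for c in candidates])

    print("original ranking:  ", original)
    print("compressed ranking:", compressed)
    print("top-2 overlap:     ", top_k_overlap(original, compressed, 2))
```

In this toy case the ranking survives even though every coordinate has been rounded heavily; on real embeddings the overlap would have to be measured rather than assumed.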

According to Mac Allister, initial validation testing has yielded positive results. That testing focused on converting embedding-based semantic state to a deterministic .semq representation, restoring it, and evaluating the stability of search and classification operations.

“For example, in one benchmark using MTEB’s Banking77 dataset and the all-MiniLM-L6-v2 embedding model, the FP32 baseline achieved 92.26 percent accuracy. SEMQ achieved 92.27 percent, effectively matching the FP32 baseline to within 0.03 percentage points.”

In the same benchmark, SEMQ significantly outperformed 4-bit quantization, which recorded an accuracy of 56.05 percent, roughly 36 percentage points lower than FP32.

“These are not claims that traditional quantization is universally ineffective, but they do show that in this particular semantic classification setting there is a big difference between preserving relevant semantic structure and simply reducing numerical precision,” Mac Allister said.

SEMQ can be applied at the point of data ingestion. Organizations can run the SDK on the vectors that embedding models generate from documents and encode that data as .semq artifacts. They can also load, query, compare, restore, and validate that encoding at query time.

“This means teams can adopt SEMQ without replacing LLMs, embedding models, vector databases, or agent frameworks,” said Mac Allister. “It can run in parallel with your existing stack, initially as a sidecar layer, and then become the representation used for select retrieval or memory workloads.”

He said potential use cases include making embeddings and memory state portable between systems; reproducing semantic state across different executions and machines; auditing model changes; reducing dependence on stateful pipelines that are opaque or difficult to reproduce; and diffing semantic state.

He added that SEMQ can be extended to cognitive states at runtime.

“In our research, .semq files are used to snapshot and restore the KV cache state of transformers across process boundaries,” he said. “This isn’t even a pre-training workflow; it’s a runtime state workflow for pausing, transferring, and resuming an active model session.”

Mac Allister isn’t ready to talk about specific customers yet. He said his company is working through a founding design partnership program with organizations considering applications in enterprise AI, search, agent memory and auditable AI workflows. This includes some AI infrastructure hyperscalers and some companies operating in the AI application layer.

“We cannot reveal the names of all organizations yet because we have signed NDAs with them,” he said. “What I can say is that we are seeing interest from teams working with AI systems where reproducibility, state, reduced infrastructure overhead, and the ability to inspect semantic behavior are operationally critical. So this is a big issue for large enterprises.” ®
