How to run AI models at scale on edge devices

It’s possible, but edge device vendors have to do the work of optimizing their models. A hybrid approach that combines cloud and edge processing can also extend the applicability of LLMs.

When we think of artificial intelligence (AI), most of us picture Hollywood’s ferocious androids or, more realistically, a system that can answer questions about life, the universe, and everything (the answer is, of course, 42). The latter is already real, if imperfect. But most people don’t realize that they use AI when taking pictures or playing games on their smartphones. In those cases users are not intentionally or directly interacting with AI; instead, edge AI is hidden within the app, improving performance and functionality.

But thanks to the explosive revolution sparked by generative AI, large language models (LLMs), and ChatGPT, the time has come when people want to interact directly with AI applications on their mobile devices, in their cars, or in the doctor’s office. Let’s call this explicit AI. AI apps like Apple Siri and Google Assistant mostly run in the cloud today. However, running directly on edge devices can provide many benefits, provided those devices have the capacity and performance to run the job. Let’s see what is possible on the edge today.


AI on the edge

As cloud vendors begin to reckon with the eye-watering costs of generative AI, major enterprises are looking to shift more of the workload to the edge. Data center GPUs offer great performance, but can cost upwards of $30,000 each. Inflection AI, a startup founded by a DeepMind co-founder, has raised $1.3 billion from industry heavyweights to build a cloud supercomputer powered by 22,000 NVIDIA H100 GPUs, which will cost hundreds of millions of dollars to procure.

To reduce costs and increase access to LLM capabilities, Microsoft introduced Microsoft 365 Copilot. It uses AI hardware both in the cloud and locally where possible to assist users across the Windows OS. In another example of pursuing the benefits of on-device AI, Google has launched the Gecko version of its PaLM 2 model. It is lightweight enough to run on mobile devices and fast enough, even offline, for responsive interactive applications on the device. Meta also released its LLaMA generative AI model, which comes in a version with only 7B parameters that is suited to edge devices.

Beyond significant cost savings, running AI on devices closer to the data source helps these providers’ customers realize other benefits, such as lower latency, better privacy, and improved accessibility across devices.

Delivering performant AI solutions on edge devices requires addressing several challenges. First and foremost are the computational and memory constraints of edge devices, which have far less compute and memory than cloud servers. This is the biggest hurdle to running large-scale AI apps on the edge, and it means that AI models must be optimized for small devices.
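To make the memory constraint concrete, here is a rough back-of-the-envelope sketch in Python. It estimates only the storage needed for model weights at a few common numeric precisions, using the 7B-parameter model size mentioned earlier. The function name and the table of bytes per parameter are illustrative assumptions, and the estimate ignores activations, caches, and runtime overhead, so real requirements will be higher.

```python
# Rough weight-only memory estimate. It ignores activations, caches,
# and runtime overhead, so treat the numbers as a lower bound.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    params = 7e9  # a 7B-parameter model
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_footprint_gb(params, precision):.1f} GB")
```

At 32-bit floating point the weights alone come to about 28 GB, well beyond the memory of most phones, while 4-bit weights come to about 3.5 GB.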

Heterogeneity also becomes an obstacle. Edge devices come in different shapes and sizes and have different capabilities and limitations. This makes it difficult for application developers to deliver AI solutions that can run on many devices. A robust AI stack supported across a wide range of devices is key.

Finally, security and privacy must be maintained. Edge devices are often connected to the Internet, making them vulnerable to cyberattacks. You can minimize this risk by implementing security measures to protect your data and devices from unauthorized access.

How do we get there?

Optimizing and quantizing large language models (LLMs) is critical to making generative AI practical on edge devices. An LLM is a type of AI model that can be used for a variety of natural language processing (NLP) tasks, such as machine translation. However, training and running LLMs can be computationally expensive. Several techniques can be used to optimize and quantize LLMs for edge devices.

One technique is called “knowledge distillation” or “domain reduction”. This involves training a smaller model to mimic the behavior of a larger model, often on a smaller data set. Another technique to reduce model size and improve performance is “quantization”, which reduces the numerical precision of model weights and activations without significantly affecting model accuracy. This can be difficult: for example, you don’t want to end up with a highly efficient 8-bit integer model that no longer gives accurate answers. Researchers report significant reductions in model size and improvements in performance with 8-bit integers, while achieving accuracy within 0.5 to 1.0 percent of that achieved with 32-bit floating point. Looking ahead, 4-bit formats look equally promising, and many Hugging Face models are already available with 4-bit quantization.
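As a minimal sketch of the idea behind quantization, the Python below maps floating-point weights to 8-bit integers with a single shared scale factor and then measures how much is lost when converting back. This is the simple symmetric, per-tensor scheme chosen for illustration; it is an assumption, not the method used by any particular vendor or library, and the function names and random test weights are hypothetical.

```python
import random


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in weights; a real model would supply its own tensors.
    weights = [rng.gauss(0.0, 0.02) for _ in range(1000)]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

With this scheme the rounding error for any single weight is at most half the scale factor. Whether that error is acceptable for a given model still has to be checked against task accuracy, which is the difficulty described above.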

Hybrid approach

In some cases, it may be necessary to use a hybrid solution that performs some processing locally and some in the cloud. A hybrid approach is a good option for applications that require high accuracy or need to process more data than edge devices can accommodate. Local processing is augmented with cloud computing services, and the application must know what to run, when, and where in order to provide a seamless experience for users. A hybrid AI approach can be applied to virtually any generative AI application and device segment, including phones, laptops, XR headsets, vehicles, and IoT.
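The decision of where to run a request can be sketched as a small routing policy. The Python below is only an illustration under stated assumptions: the `Request` fields, the `choose_target` function, and the 2,048-token device limit are all hypothetical, and a real application would tune the rules to its own devices and privacy requirements.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int           # size of the input
    needs_high_accuracy: bool    # task where the larger cloud model matters
    contains_private_data: bool  # data that should not leave the device


def choose_target(req: Request, online: bool, device_max_tokens: int = 2048) -> str:
    """Return "device" or "cloud". Every rule here is an illustrative assumption."""
    if not online:
        return "device"  # no connectivity: the device is the only option
    if req.contains_private_data:
        return "device"  # keep sensitive data local
    if req.needs_high_accuracy or req.prompt_tokens > device_max_tokens:
        return "cloud"   # beyond what the small local model can handle well
    return "device"      # default to local for low latency and low cost


if __name__ == "__main__":
    print(choose_target(Request(300, False, False), online=True))
    print(choose_target(Request(5000, False, False), online=True))
    print(choose_target(Request(5000, True, False), online=False))
```

Defaulting to the device and escalating to the cloud only when needed reflects the cost and latency arguments made earlier; other orderings of the rules are equally plausible.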

Hybrid AI also allows the device and the cloud to run the model simultaneously. The device runs a “light” version of the model for low latency, while the cloud runs the “full” model over multiple tokens in parallel and corrects the device’s response as needed.
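One way to picture this arrangement is a draft-and-verify loop, similar in spirit to what is often called speculative decoding. The article does not name a specific algorithm, so the Python below is a toy sketch under that assumption: `draft_next` stands in for the light on-device model, `full_next` for the full cloud model, and both are hypothetical functions that return the next token for a given context.

```python
def hybrid_generate(draft_next, full_next, prompt, max_new_tokens, k=4):
    """Generate tokens by letting a light model draft and a full model verify.

    draft_next and full_next are hypothetical callables: context -> next token.
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Device: cheaply propose up to k tokens, one after another.
        n = min(k, target_len - len(tokens))
        proposal = []
        for _ in range(n):
            proposal.append(draft_next(tokens + proposal))
        # Cloud: check every proposed position. A real system would do this
        # in one batched pass; here it is a plain loop for clarity.
        for i, tok in enumerate(proposal):
            expected = full_next(tokens + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then substitute the correction.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # the whole draft was accepted
    return tokens


if __name__ == "__main__":
    # Toy models over digit tokens: the full model counts upward, and the
    # light model makes a mistake whenever the context length is a multiple of 3.
    def full_next(ctx):
        return (ctx[-1] + 1) % 10

    def draft_next(ctx):
        return (ctx[-1] + 1) % 10 if len(ctx) % 3 else 0

    print(hybrid_generate(draft_next, full_next, [0], max_new_tokens=8))
```

In this sketch the final output always matches what the full model would have produced on its own, and the light model only affects how many tokens are accepted per round trip.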

Conclusion

As the world races to deploy large language models, many people are shocked to learn that an LLM query can cost up to ten times more than a traditional search. Many would say that LLMs’ explosive momentum will die if these costs cannot be contained and reduced. Edge AI shows promise here: AI-enabled edge devices can improve the user experience with high quality and low latency, while taking a significant share of the required processing off the cloud.

Edge AI is therefore a promising technology with the potential to revolutionize a wide range of applications. The challenges of using edge AI are being solved by advances in model optimization, quantization, and hybrid solutions. As AI technology continues to develop, we expect to see even more innovative and breakthrough edge AI applications in the coming years.

To learn more about Edge AI, visit our website for a more complete analysis.

Generative AI Running on Edge and Hybrid Infrastructure – Cambrian AI Research


Disclosure: This article expresses the opinions of the author and should not be taken as buying or investing advice from that author.

My company, Cambrian-AI Research, is fortunate to have many semiconductor companies as clients, including BrainChip, Cadence, Cerebras Systems, Esperanto, IBM, Intel, NVIDIA, Qualcomm, Graphcore, SiMa.ai, Synopsys, Tenstorrent, and Ventana Microsystems. We have no investment positions in the companies mentioned in this article. For more information, please visit our company’s website https://cambrian-AI.com.

I love learning and sharing the amazing hardware and services that are being built to enable the next big thing in technology: artificial intelligence.
