5 Open LLM Inference Platforms for Next-Generation AI Applications



Open large language models are becoming increasingly performant and are now viable alternatives to commercial LLMs such as GPT-4 and Gemini. Given the cost of AI accelerator hardware, many developers turn to APIs to access state-of-the-art language models instead of running them on their own machines.

Cloud platforms like Azure OpenAI, Amazon Bedrock, and Google Cloud Vertex AI are obvious choices, but there are also purpose-built platforms that are faster and cheaper than the hyperscalers.

Here we present five generative AI inference platforms that serve open LLMs such as Llama 3, Mistral, and Gemma; some of them also support foundation models for vision.

1. Groq

Groq is an AI infrastructure company that claims to build the world's fastest AI inference technology. The company's flagship product is the Language Processing Unit (LPU) inference engine, a hardware and software platform that aims to deliver superior computational speed, quality, and energy efficiency for AI applications. Developers praise Groq for its speed and performance.

The GroqCloud service is powered by an extensive network of LPUs and lets users run popular open source LLMs such as Meta AI's Llama 3 70B at speeds Groq claims are up to 18x faster than those of other providers. The API is accessible through Groq's Python client SDK or the OpenAI client SDK, and it integrates easily with LangChain and LlamaIndex for building advanced LLM applications and chatbots.
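
Because the API is OpenAI-compatible, a request can be made with nothing but the Python standard library. The sketch below is illustrative: the base URL and the llama3-70b-8192 model id reflect Groq's documentation at the time of writing, so check the Groq console for current names.

```python
# Minimal sketch of a GroqCloud chat completion over its OpenAI-compatible
# REST endpoint, using only the standard library.
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask_groq(prompt: str, model: str = "llama3-70b-8192") -> str:
    """POST the payload to GroqCloud and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        GROQ_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Live call (requires GROQ_API_KEY and network access):
# print(ask_groq("Explain what an LPU is in one sentence."))
```

Swapping in the official Groq or OpenAI client SDK replaces the `urllib` plumbing but keeps the same payload shape.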

Groq's cloud service charges per token processed, with prices ranging from $0.06 to $0.27 per million tokens depending on the model. A free tier makes it easy to get started.

2. Perplexity Labs

Perplexity is quickly becoming an alternative to Google and Bing. The company's flagship product is an AI-powered search engine, but it also offers an inference engine through Perplexity Labs.

In October 2023, Perplexity Labs introduced pplx-api, an API designed to provide fast and efficient access to open source LLMs. Currently in public beta, pplx-api is available to users with Perplexity Pro subscriptions, giving a broad user base the chance to test it and provide feedback while Perplexity Labs continues to enhance the tool.

The API supports popular LLMs including Mistral 7B, Llama 2 13B, Code Llama 34B, and Llama 2 70B. It is designed to be cost-effective for both deployment and inference, with Perplexity Labs reporting significant cost savings. Users can integrate the API with their existing applications through an OpenAI client-compatible interface, which is convenient for developers already familiar with the OpenAI ecosystem. For a quick overview, see my tutorial on the Perplexity API.

The platform also offers two in-house models, Llama 3 Sonar Small 32k Online and Llama 3 Sonar Large 32k Online, which are based on the FreshLLMs paper. These Llama 3-based models can return citations, but that functionality is currently in closed beta.

Perplexity Labs offers a flexible pricing model for its API. A pay-as-you-go plan charges users based on the number of tokens processed, with no up-front commitments. A Pro plan, priced at $20 per month or $200 per year, includes a $5 monthly credit for API usage, unlimited file uploads, and dedicated support.

Rates range from $0.20 to $1.00 per million tokens depending on the size of the model. In addition to the token fee, online models are charged a flat fee of $5 per 1,000 requests.
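
To see how the per-token rate and the flat request fee combine, here is a small, illustrative cost estimator. The rates are the published figures quoted above; verify current pricing before relying on it.

```python
# Back-of-the-envelope cost model for pplx-api: a per-token charge plus,
# for online (search-grounded) models, a flat $5 per 1,000 requests.

def pplx_monthly_cost(tokens: int, requests: int,
                      rate_per_million: float, online: bool = False) -> float:
    """Estimate monthly API cost in dollars."""
    cost = tokens / 1_000_000 * rate_per_million
    if online:
        cost += requests / 1_000 * 5.0  # flat request fee for online models
    return round(cost, 2)

# Example: 10M tokens across 20,000 requests to an online model at $1.00/M.
print(pplx_monthly_cost(10_000_000, 20_000, 1.00, online=True))  # → 110.0
```

Note how the request fee dominates for chatty, short-prompt workloads against the online models.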

3. Fireworks AI

Fireworks AI is a generative AI platform that enables developers to leverage state-of-the-art open source models for their applications. It offers a wide range of language models, including FireLLaVA-13B (a visual language model), FireFunction V1 (for function invocation), Mixtral MoE 8x7B and 8x22B (models for following instructions), and Meta's Llama 3 70B model.

In addition to language models, Fireworks AI also supports image generation models such as Stable Diffusion 3 and Stable Diffusion XL. These models are accessible through Fireworks AI's serverless APIs, which the company says offer industry-leading performance and throughput.

The platform has a competitive pricing model. It offers a pay-as-you-go pricing structure based on the number of tokens processed. For example, the Gemma 7B model is $0.20 per million tokens, and the Mixtral 8x7B model is $0.50 per million tokens. Fireworks AI also offers on-demand deployment, allowing users to rent GPU instances (A100 or H100) by the hour. The API is compatible with OpenAI, allowing easy integration with LangChain and LlamaIndex.
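
A quick way to choose between serverless tokens and on-demand GPUs is a break-even calculation. The sketch below uses the $0.50 per million token Mixtral 8x7B rate quoted above; the hourly GPU price is a placeholder assumption to be filled in from Fireworks AI's current pricing page.

```python
# Rough break-even between Fireworks AI's serverless per-token pricing and
# renting a GPU on-demand by the hour. The hourly rate here is hypothetical.

def breakeven_tokens_per_hour(rate_per_million: float, gpu_per_hour: float) -> int:
    """Tokens per hour at which serverless cost equals the hourly GPU rental."""
    return int(gpu_per_hour / rate_per_million * 1_000_000)

# Example: at $0.50/M tokens vs an assumed $3/hour GPU, on-demand wins once
# sustained throughput exceeds ~6M tokens/hour.
print(breakeven_tokens_per_hour(0.50, 3.0))  # → 6000000
```

Below that sustained throughput, pay-as-you-go serverless is the cheaper option.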

Fireworks AI targets developers, companies, and enterprises at different price points: the Developer tier offers a rate limit of 600 requests per minute and up to 100 deployed models, while the Business and Enterprise tiers offer custom rate limits, team collaboration features, and dedicated support.

4. Cloudflare

Cloudflare Workers AI is an inference platform that enables developers to run machine learning models on Cloudflare's global network with just a few lines of code. It provides a serverless, scalable solution for GPU-accelerated AI inference, allowing developers to leverage pre-trained models for tasks such as text generation, image recognition, and speech recognition without managing infrastructure or GPUs.

Cloudflare Workers AI provides a curated set of popular open source models covering a wide range of AI tasks. Notable supported models include llama-3-8b-instruct, mistral-8x7b-32k-instruct, gemma-7b-instruct, and vision models such as vit-base-patch16-224 and segformer-b5-finetuned-ade-512-pt.

Cloudflare Workers AI provides versatile integration points for incorporating AI capabilities into existing applications or creating new ones. Developers can run AI models within their applications using Cloudflare's serverless execution environment, Workers, and Pages Functions. For those looking to integrate with their current stack, a REST API is available, allowing inference requests from any programming language or framework. The API supports tasks such as text generation, image classification, and speech recognition, and developers can pair their AI applications with Cloudflare's Vectorize (a vector database) and AI Gateway (a control plane for managing AI models and services).
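
As a sketch, an inference request against the REST API looks like the following. The account-scoped URL scheme and the @cf/meta/llama-3-8b-instruct model id follow Cloudflare's documentation at the time of writing; the account id and API token come from your dashboard.

```python
# Sketch of an inference request to the Workers AI REST API.
import json
import os
import urllib.request

def run_url(account_id: str, model: str) -> str:
    """Build the account-scoped Workers AI run endpoint."""
    return (f"https://api.cloudflare.com/client/v4/accounts/"
            f"{account_id}/ai/run/{model}")

def run_model(account_id: str, token: str, model: str, prompt: str) -> dict:
    """POST a prompt to Workers AI and return the parsed JSON response."""
    req = urllib.request.Request(
        run_url(account_id, model),
        data=json.dumps(
            {"messages": [{"role": "user", "content": prompt}]}
        ).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Live call (requires CF_ACCOUNT_ID, CF_API_TOKEN, and network access):
# print(run_model(os.environ["CF_ACCOUNT_ID"], os.environ["CF_API_TOKEN"],
#                 "@cf/meta/llama-3-8b-instruct", "Hello!"))
```

Inside a Worker itself, the `env.AI.run()` binding replaces this HTTP plumbing entirely.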

Cloudflare Workers AI uses a pay-as-you-go pricing model based on the number of "neurons" processed. Because the platform serves many model types beyond LLMs, neurons act as a common, token-like unit that aggregates usage across different models. Every account has a free tier of 10,000 neurons per day; beyond this, Cloudflare charges $0.011 per 1,000 additional neurons. Effective pricing varies with model size: for example, Llama 3 70B costs $0.59 per million input tokens and $0.79 per million output tokens, while Gemma 7B costs $0.07 per million tokens for both input and output.
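
The neuron math can be made concrete with a tiny calculator, using the free daily allocation and per-1,000-neuron rate quoted above:

```python
# Worked example of neuron billing: 10,000 free neurons per day,
# then $0.011 per 1,000 additional neurons.

def daily_neuron_cost(neurons_used: int, free_per_day: int = 10_000,
                      rate_per_1000: float = 0.011) -> float:
    """Dollar cost for one day's usage after the free allocation."""
    billable = max(0, neurons_used - free_per_day)
    return round(billable / 1_000 * rate_per_1000, 4)

print(daily_neuron_cost(8_000))    # → 0.0 (within the free tier)
print(daily_neuron_cost(110_000))  # → 1.1 (100,000 billable neurons)
```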

5. NVIDIA NIM

The Nvidia NIM API provides access to a wide range of large pre-trained language and other AI models, optimized and accelerated by Nvidia's software stack. Through the Nvidia API catalog, developers can explore and try more than 40 models from providers such as Nvidia, Meta, Microsoft, and Hugging Face. These include powerful text generation models such as Meta's Llama 3 70B, Mistral AI's Mixtral 8x22B, and Nvidia's own Nemotron 3 8B, as well as vision models such as Stable Diffusion and Kosmos 2.

The NIM API makes it easy for developers to integrate these state-of-the-art AI models into their applications with just a few lines of code. Models are hosted on Nvidia's infrastructure and exposed through standardized OpenAI-compatible APIs, enabling seamless integration. Developers can prototype and test their applications for free using the hosted APIs, and when they are production-ready, they can deploy the models on-premises or in the cloud using the recently released Nvidia NIM containers.
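
A minimal sketch of calling the hosted endpoint follows. The base URL and the meta/llama3-70b-instruct model id come from Nvidia's API catalog at the time of writing; a self-hosted NIM container exposes the same OpenAI-compatible route, so only the base URL changes.

```python
# Sketch of an OpenAI-style chat completion against a NIM endpoint,
# hosted or self-hosted.
import json
import os
import urllib.request

def nim_payload(model: str, prompt: str) -> dict:
    """OpenAI-style chat body accepted by NIM endpoints."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def nim_request(base_url: str, api_key: str, model: str, prompt: str) -> dict:
    """POST a chat completion to a NIM endpoint and return parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(nim_payload(model, prompt)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Live call against the hosted catalog (requires NVIDIA_API_KEY):
# out = nim_request("https://integrate.api.nvidia.com/v1",
#                   os.environ["NVIDIA_API_KEY"],
#                   "meta/llama3-70b-instruct",
#                   "Summarize what a NIM container is.")
# print(out["choices"][0]["message"]["content"])
```

Pointing `base_url` at a NIM container running on your own host keeps the rest of the application code unchanged.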

Nvidia offers both free and paid tiers for the NIM API. The free tier includes 1,000 credits to start with, and the paid pricing is based on the number of tokens processed and the size of the model, starting at $0.07 per million tokens for smaller models like the Gemma 7B, up to $0.79 per million output tokens for larger models like the Llama 3 70B.

The above list is a subset of inference platforms that provide language models as a service. In future posts, we will discuss self-hosted model servers and inference engines that can run on Kubernetes. Stay tuned!
