Want to embed LLMs with custom prompts in your app? Absolutely! Here's how to get started

Hands on Large Language Models (LLMs) are commonly associated with chatbots such as ChatGPT, Copilot, and Gemini, but they're not limited to Q&A-style interactions: LLMs are increasingly being integrated into everything from IDEs to office productivity suites.

Beyond content generation, these models, with the right training, prompts, and guardrails, can be used to gauge the sentiment of text, identify the topics in documents, clean up data sources, and more. In fact, thanks to extensible inference engines such as Llama.cpp and vLLM, it's not that difficult to incorporate LLMs into your application code to add this kind of language-based analysis. These engines handle loading and parsing a model and running inference against it.

In this hands-on, we'll take a look at Mistral.rs, a relatively new LLM engine written in Rust and aimed at intermediate developers and above.

The open-source code supports a growing number of popular models, including those from the startup Mistral that the project is named after. Additionally, Mistral.rs can be integrated into projects using a Python, Rust, or OpenAI-compatible API, making it relatively easy to insert into new or existing projects.

But before we get into how to get Mistral.rs up and running, and the different ways you can use it to incorporate generative AI models into your code, we need to cover the hardware and software requirements.

Hardware and Software Support

With the right flags, Mistral.rs can run on Nvidia CUDA, Apple Metal, and even directly on the CPU, although there will be a significant performance hit if you choose the CPU option. At the time of writing, the platform does not yet support AMD or Intel GPUs.

This guide shows you how to deploy Mistral.rs on an Ubuntu 22.04 system. The engine supports macOS, but for simplicity, this guide uses Linux.

We recommend a GPU with a minimum of 8GB vRAM, or at least 16GB system memory when running on a CPU. Results may vary depending on the model.

Nvidia users should ensure they have the latest proprietary drivers and CUDA binaries installed before proceeding; setup details can be found here.

Get dependencies

Installing Mistral.rs is pretty easy but depends a bit on your specific use case. Before we begin, let's get our dependencies in order:

According to the Mistral.rs README, the only packages needed are libssl-dev and pkg-config. However, we discovered that we needed a few extra packages to complete the installation. If you're like us and are running Ubuntu 22.04, you can install them by running the following command:

sudo apt install curl wget python3 python3-pip git build-essential libssl-dev pkg-config

Once these are in place, you can run the Rustup script to install and activate Rust.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Yes, this pipes a script straight from the internet into your shell; if you'd prefer to inspect the script before running it, the code for it is here.

By default, Mistral.rs retrieves models from Hugging Face, and many of those files require you to be logged in before they can be downloaded, so you'll need to install huggingface_hub by running the following commands:

pip install --upgrade huggingface_hub
huggingface-cli login

You'll be prompted to enter your Hugging Face access token, which you can create by visiting huggingface.co/settings/tokens.
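
If you'd rather handle authentication from a script than the command line, the huggingface_hub package also exposes a login() helper. Here's a minimal sketch, which assumes you've stashed your access token in an environment variable called HF_TOKEN (the variable name is our choice, not a requirement):

import os

from huggingface_hub import login

# Read the token from the environment rather than hard-coding it in the script
login(token=os.environ["HF_TOKEN"])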

Installing Mistral.rs

Once the dependencies are installed, we can move on to deploying Mistral.rs itself. First, clone the latest release of Mistral.rs from GitHub and change into the new directory.

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

Here, things get a little more complicated depending on your system configuration and the type of accelerator you're using. In this guide, we cover CPU-based (slow) and CUDA-based (fast) inference with Mistral.rs.

For CPU-based inference, simply run:

cargo build --release

Meanwhile, if you have an Nvidia-based system, you need to do the following:

cargo build --release --features cuda

This may take a few minutes to complete, so have a cup of tea or coffee while you wait. Once the executable has finished compiling, copy it to your working directory.

cp ./target/release/mistralrs-server ./mistralrs_server

Testing Mistral.rs

With Mistral.rs installed, you can run a test model such as Mistral-7B-Instruct in interactive mode to check that it actually works. If you have a GPU with around 20GB or more of vRAM, just run the following command:

./mistralrs_server -i plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

However, your GPU may not have the memory required to run your model at the 16-bit precision it was designed for, which requires 2 GB of memory for every billion parameters, plus additional space for the key-value cache. And even if you have enough system memory to deploy it to a CPU, you can expect to see a significant decrease in performance as memory bandwidth quickly becomes a bottleneck.
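
To put rough numbers on that, here's a quick back-of-the-envelope calculation (our own approximation; it ignores the key-value cache and runtime overhead) for a 7-billion-parameter model at 16-bit precision and at the 4-bit precision we'll quantize to in a moment:

# Approximate memory footprint of the weights alone, ignoring KV cache and overhead
params_billion = 7      # Mistral-7B
bytes_fp16 = 2          # 16-bit weights: 2 bytes per parameter
bytes_q4 = 0.5          # 4-bit weights: roughly half a byte per parameter

print(f"FP16: ~{params_billion * bytes_fp16} GB")    # ~14 GB
print(f"Q4:   ~{params_billion * bytes_q4} GB")      # ~3.5 GB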

Instead, we can use quantization to shrink the model to a more manageable size. In Mistral.rs, there are two ways to do this. The first is on-the-fly quantization, which downloads the full-sized model and quantizes it to the desired precision; in this case, we quantize the model from 16-bit down to 4-bit. To do this, add the --isq Q4_0 flag to the previous command:

./mistralrs_server -i --isq Q4_0 plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

Note: If Mistral.rs crashes before completing, your system may have run out of memory and you may need to add a swapfile (we added a 24GB one) to finish the process. You can temporarily create and enable a swapfile by running the following commands (don't forget to remove the file when you're done with it, as it won't be re-enabled after a reboot):

sudo fallocate -l 24G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Once the model is quantized, you'll be presented with a chat-style interface that allows you to query the model. You'll also notice that the model now uses significantly less memory (around 5.9 GB in our tests) and performs much better.

However, if you don't want to quantize your model on the fly, Mistral.rs also supports pre-quantized GGUF and GGML files, such as those from Tom “TheBloke” Jobbins on Hugging Face.

The process is nearly the same, except this time you need to specify that you're running a GGUF model and set the repo ID and filename of the LLM you want – in this case, a 4-bit quantized version of TheBloke's Mistral-7B-Instruct.

./mistralrs_server -i gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Putting your LLM to work

While interactive mode is great for playing with chatbots in the terminal, it's not much use for building AI-enabled apps. Instead, you can integrate Mistral.rs into your code using its Rust or Python APIs, or via an OpenAI API-compatible HTTP server.

First, let's look at the HTTP server, which is the easiest way to get started. In this example, we'll use the same 4-bit quantized Mistral-7B model as before. Instead of starting Mistral.rs in interactive mode with the -i flag, we pass the -p flag to specify the port the server should listen on.

./mistralrs_server -p 8342 gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Once the server is running, there are several ways to access it programmatically. The first is to use curl to pass in the prompt we want to give the model. Here, we ask: "In machine learning, what is a transformer?"

curl http://localhost:8342/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "Mistral-7B-Instruct-v0.2-GGUF",
"prompt": "In machine learning, what is a transformer?"
}'

After a few seconds, the model will neatly output a block of JSON-formatted text.
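
If you'd rather make the same request from Python without pulling in the OpenAI client just yet, here's a minimal sketch using the requests library (assuming the server is still listening on port 8342, as above):

import requests

resp = requests.post(
    "http://localhost:8342/v1/completions",
    headers={"Authorization": "Bearer EMPTY"},
    json={
        "model": "Mistral-7B-Instruct-v0.2-GGUF",
        "prompt": "In machine learning, what is a transformer?",
    },
)
resp.raise_for_status()

# The generated text sits in choices[0].text, per the OpenAI-style response format
print(resp.json()["choices"][0]["text"])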

You can also interact with the server using the OpenAI Python library, though you'll probably need to install it with pip first:

pip install openai

Then you can call the Mistral.rs server using a short script like the one below, written for the completion task.

import openai

query = "In machine learning, what is a transformer?" # The prompt we want to pass to the LLM

client = openai.OpenAI(
    base_url="http://localhost:8342/v1",  # The address of your Mistral.rs server
    api_key="EMPTY",
)

completion = client.completions.create(
    model="",
    prompt=query,
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)

print(completion.choices[0].text)
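
The same server also lends itself to the custom-prompt use cases mentioned at the top of this piece, such as gauging sentiment. Here's a sketch that swaps the completion call for the OpenAI-style chat completions API (assuming your build of the OpenAI-compatible server exposes that route) and uses a system prompt of our own wording to pin the model to a one-word answer:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8342/v1",  # Same Mistral.rs server as before
    api_key="EMPTY",
)

# Example input, made up for this demo
review = "The checkout process was painless and delivery arrived a day early."

completion = client.chat.completions.create(
    model="",
    messages=[
        # The system prompt constrains the model to a narrow classification task
        {"role": "system", "content": "You are a sentiment classifier. Reply with exactly one word: positive, negative, or neutral."},
        {"role": "user", "content": review},
    ],
    max_tokens=4,
    temperature=0,
)

print(completion.choices[0].message.content)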

For more examples showing how to work with the HTTP server, see the Mistral.rs GitHub repository.

Integrating Mistral.rs Deeply into Your Project

Although convenient, the HTTP server is not the only way to integrate Mistral.rs into your project – you can achieve similar results using the Rust or Python APIs.

Here's a basic example from the Mistral.rs repository. It shows how to use the project as a Rust crate (Rust's term for a library or package) to pass queries to Mistral-7B-Instruct and generate responses. Note: we found the original example code required some tweaking to run.

use std::sync::Arc;
use std::convert::TryInto;
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs,
    MistralRsBuilder, ModelDType, NormalRequest, Request, RequestMessage, Response, SamplingParams,
    SchedulerMethod, TokenSource,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    // Select a Mistral model
    // We do not use any files from HF servers here, and instead load the
    // chat template from the specified file, and the tokenizer and model from a
    // local GGUF file at the path `.`
    let loader = GGUFLoaderBuilder::new(
        GGUFSpecificConfig { repeat_last_n: 64 },
        Some("mistral.json".to_string()),
        None,
        ".".to_string(),
        "mistral-7b-instruct-v0.2.Q4_K_M.gguf".to_string(),
    )
    .build();
    // Load, into a Pipeline
    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::CacheToken,
        &ModelDType::Auto,
        &Device::cuda_if_available(0)?,
        false,
        DeviceMapMetadata::dummy(),
        None,
    )?;
    // Create the MistralRs, which is a runner
    Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: "In machine learning, what is a transformer ".to_string(),
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    mistralrs.get_sender().blocking_send(request)?;

    let response = rx.blocking_recv().unwrap();
    match response {
        Response::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}

If you want to try this for yourself, first move out of the mistral.rs directory, create a folder for your new Rust project, and change into it. (cargo new is the usual way to scaffold a Rust project, but we'll set things up manually this time so you can see each step.)

cd ..
mkdir test_app
cd test_app

Once there, copy the mistral.json chat template from ../mistral.rs/chat_templates/ into the folder, and download the mistral-7b-instruct-v0.2.Q4_K_M.gguf model file from Hugging Face.

Next, create a Cargo.toml file containing the dependencies needed to build the app. This file tells the Rust toolchain the details of your project. Inside this .toml file, paste the following:

[package]
name = "test_app"
version = "0.1.0"
edition = "2018"

[dependencies]
tokio = "1"
anyhow = "1"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", tag="v0.1.18", features = ["cuda"] }

[[bin]]
name = "main"
path = "test_app.rs"

Note: If you're not using GPU acceleration, drop the , features = ["cuda"] part from the mistralrs dependency line.

Finally, save the contents of the demo app shown above as test_app.rs.

With all four files (test_app.rs, Cargo.toml, mistral-7b-instruct-v0.2.Q4_K_M.gguf, and mistral.json) in the same folder, you can check that everything works by running the following command:

cargo run

After about a minute, the answer to your query will appear on the screen.

Obviously, this is a very rudimentary example, but it shows how you can use Mistral.rs to integrate an LLM into a Rust app by including the crate and using its library interfaces.

If you're interested in using Mistral.rs in your Python or Rust projects, we highly recommend checking out the documentation for more information and examples.

We'll have more stories about using LLMs in the future, so let us know in the comments what you think we should explore next.®


Editor's note: Nvidia provided The Register with an RTX A6000 Ada Generation graphics card in support of this article and similar articles. Nvidia had no role in the content of this article.


