5 Tips for Building Optimized Hugging Face Transformers Pipelines

Image by Editor | ChatGPT

# Introduction

Hugging Face has become the standard for many AI developers and data scientists because it significantly lowers the barrier to working with advanced AI. Instead of building AI models from scratch, developers can access a wide range of pretrained models without hassle. Users can also fine-tune these models on custom datasets and deploy them quickly.

One of the Hugging Face framework's API wrappers is Transformers Pipelines, a set of packages that bundles a pretrained model, its tokenizer, pre-processing and post-processing, and related components to make an AI use case work. These pipelines abstract complex code and provide a simple, seamless API.

However, working with Transformers Pipelines can get messy and may not yield an optimal pipeline. So we will explore five ways to optimize your Transformers pipelines.

Let's get into it.

# 1. Batch inference requests

In many cases, when using Transformers Pipelines, you do not take full advantage of the graphics processing unit (GPU). Batching multiple inputs can greatly increase GPU utilization and improve inference efficiency.

Instead of processing one sample at a time, you can pass a list of inputs and set the pipeline's batch_size parameter so that the model handles several inputs in one forward pass. Here is an example:

from transformers import pipeline

pipe = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device_map="auto"
)

texts = [
    "Great product and fast delivery!",
    "The UI is confusing and slow.",
    "Support resolved my issue quickly.",
    "Not worth the price."
]

results = pipe(texts, batch_size=16, truncation=True, padding=True)
for r in results:
    print(r)

Batching requests allows for higher throughput with minimal latency impact.
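
When inputs arrive as a stream rather than a ready-made list, one option is to group them into fixed-size chunks yourself before each call. The sketch below is illustrative only: `chunked`, `run_in_batches`, and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in for a real pipeline call so the example runs without downloading a model.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_in_batches(
    predict: Callable[[List[T]], List[R]],
    items: Iterable[T],
    batch_size: int,
) -> List[R]:
    """Call `predict` once per chunk and return all results in input order."""
    results: List[R] = []
    for chunk in chunked(items, batch_size):
        results.extend(predict(chunk))
    return results


# Stand-in for a pipeline call (hypothetical): records the size of each batch.
call_sizes: List[int] = []


def fake_predict(batch: List[str]) -> List[int]:
    call_sizes.append(len(batch))
    return [len(text) for text in batch]


lengths = run_in_batches(fake_predict, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
print(lengths)     # [1, 2, 3, 4, 5]
print(call_sizes)  # [2, 2, 1]
```

Five inputs with a batch size of 2 result in three calls instead of five. With a real pipeline, `predict` could be something like `lambda batch: pipe(batch, batch_size=len(batch))`.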

# 2. Use lower precision and quantization

Many pretrained models fail at inference because the development or production environment does not have enough memory. Lower numerical precision reduces memory usage and speeds up inference without sacrificing much accuracy.

For example, here is how to load a model in half precision for GPU inference with Transformers:

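import torch
from transformers import AutoModelForSequenceClassification

# Example checkpoint (the same one used in the batching example above)
model_id = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.float16  # load the weights in half precision
)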

Similarly, quantization techniques can compress model weights without significantly degrading performance.

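# Requires bitsandbytes for 8-bit quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# model_id: the Hub ID or local path of a causal language model of your choice
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # Passing load_in_8bit=True directly is deprecated in favor of a config object
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)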

Using reduced precision and quantization in production typically speeds up pipelines and reduces memory usage without significantly affecting model accuracy.
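
To get a rough sense of the savings, the arithmetic can be sketched as parameters times bytes per parameter. The 7-billion-parameter figure is a hypothetical example, and the estimate covers the weights only; activations, the key/value cache, and framework overhead come on top.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
# float32: 28.0 GB
# float16: 14.0 GB
# int8: 7.0 GB
```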

# 3. Choose an efficient model architecture

Many applications do not need the largest model to solve their task. Choosing a lighter transformer architecture, such as a distilled model, provides better latency and throughput with an acceptable accuracy trade-off.

Compact or distilled versions such as DistilBERT retain most of the accuracy of the original model with far fewer parameters, which results in faster inference.

Choose a model whose architecture is optimized for inference and that meets the accuracy requirements of your task.
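
One way to ground that choice is to measure latency for each candidate on the same inputs. The following is a minimal timing sketch, not a full benchmark: `median_latency_ms`, `small_model`, and `large_model` are hypothetical names, and the two "models" are stand-ins that simply sleep so the example runs without any downloads.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def median_latency_ms(
    predict: Callable[[Sequence[str]], List[int]],
    inputs: Sequence[str],
    warmup: int = 2,
    runs: int = 10,
) -> float:
    """Return the median wall-clock time of `predict(inputs)` in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(inputs)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(inputs)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-ins for two candidate pipelines (hypothetical): they only sleep.
def small_model(batch: Sequence[str]) -> List[int]:
    time.sleep(0.002)
    return [0] * len(batch)


def large_model(batch: Sequence[str]) -> List[int]:
    time.sleep(0.02)
    return [0] * len(batch)


texts = ["Great product and fast delivery!", "Not worth the price."]
candidates: Dict[str, Callable] = {"small": small_model, "large": large_model}
results = {name: median_latency_ms(fn, texts) for name, fn in candidates.items()}

for name, latency in results.items():
    print(f"{name}: {latency:.1f} ms")
```

In practice, each stand-in would be replaced by a real pipeline, and latency would be weighed against accuracy on a held-out evaluation set.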

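# 4. Leverage caching

Many systems waste computation by repeating expensive work. Caching can significantly improve performance by reusing the results of costly computations. In text generation, for example, the key/value cache lets the model reuse attention states from earlier tokens instead of recomputing them at every step: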

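import torch

# Assumes `model` is a loaded causal language model and `inputs` is the
# tokenized prompt (for example, the output of its tokenizer).
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=False,
        use_cache=True  # reuse key/value states from earlier tokens
    )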

Efficient caching cuts computation time and lowers response latency in production systems.
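
The `use_cache=True` flag reuses work within a single generation. A related, application-level idea is to cache whole predictions so that repeated inputs never reach the model again. The sketch below uses the standard library's `functools.lru_cache`. `expensive_classify` is a hypothetical stand-in for a pipeline call, and the approach only makes sense when outputs are deterministic (for example, with `do_sample=False`).

```python
from functools import lru_cache

calls = 0  # counts how often the "model" is actually invoked


def expensive_classify(text: str) -> str:
    """Stand-in for a costly pipeline call (hypothetical keyword rule)."""
    global calls
    calls += 1
    return "POSITIVE" if "great" in text.lower() else "NEGATIVE"


@lru_cache(maxsize=1024)
def cached_classify(text: str) -> str:
    """Return the cached label for `text`, computing it only on a cache miss."""
    return expensive_classify(text)


for text in ["Great product!", "Not worth it.", "Great product!"]:
    print(cached_classify(text))

print(calls)  # 2: the repeated input was served from the cache
```

`lru_cache` requires hashable arguments, so it fits single strings well. Batched inputs would need a different keying scheme, such as a dictionary keyed by text.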

# 5. Use an accelerated runtime via Optimum (ONNX Runtime)

Many pipelines run PyTorch in a less-than-optimal mode that adds Python overhead and extra memory copies. Using Optimum with Open Neural Network Exchange (ONNX) Runtime converts the model to a static graph and fuses operations, so the runtime can use faster kernels on a central processing unit (CPU) or GPU with less overhead. The result is usually faster inference, especially on CPU or mixed hardware, without changing how you call the pipeline.

Install the required packages with:

pip install -U transformers optimum[onnxruntime] onnxruntime

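Next, convert the model with code like this:

from optimum.onnxruntime import ORTModelForSequenceClassification

# export=True converts the Transformers checkpoint to ONNX on load
# (it replaces the deprecated from_transformers=True argument)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id,
    export=True
)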

By converting the pipeline to ONNX Runtime via Optimum, you can keep your existing pipeline code while getting lower latency and more efficient inference.
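
To illustrate the "same pipeline call" point, here is a hedged sketch of how the ONNX-backed model could be plugged into a regular pipeline. `build_onnx_classifier` is a hypothetical helper name, and the sketch assumes the packages installed above plus a sequence-classification checkpoint. The imports sit inside the function so the snippet can be loaded even where Optimum is not installed.

```python
def build_onnx_classifier(model_id: str):
    """Return a text-classification pipeline backed by ONNX Runtime.

    Sketch only: requires transformers and optimum[onnxruntime], and
    downloads and converts the checkpoint the first time it is called.
    """
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    # export=True converts the checkpoint to ONNX when it is loaded.
    ort_model = ORTModelForSequenceClassification.from_pretrained(
        model_id, export=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The ONNX-backed model is passed to pipeline() like a regular model.
    return pipeline("text-classification", model=ort_model, tokenizer=tokenizer)


# Example usage (not run here because it needs network access):
# classifier = build_onnx_classifier("distilbert-base-uncased-finetuned-sst-2-english")
# print(classifier(["Great product and fast delivery!"], batch_size=16))
```

The calling code stays the same as in the batching example. Only the model object behind the pipeline changes.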

# Wrapping up

Transformers Pipelines is an API wrapper in the Hugging Face framework that simplifies AI application development by condensing complex code into simpler interfaces. In this article, we explored five tips for optimizing Hugging Face Transformers Pipelines, from batching inference requests to choosing an efficient model architecture, leveraging caching, and more.

I hope this helped!

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing. Cornellius writes on a variety of AI and machine learning topics.
