
Images by editor | chatgpt# introduction
Hugging my face It has become the norm for many AI developers and data scientists. This is because it significantly reduces the barriers to working with advanced AI. Instead of working with AI models from scratch, developers can access a wide range of assumptions without hassle. Users can also adapt these models with custom datasets and deploy them quickly.
One of the hugging face framework API wrappers is Transpipelinea set of packages consisting of pre-protected models, their token agents, pre-treatment and post-treatment, and related components to make AI use cases work. These pipelines abstract complex code and provide a simple, seamless API.
However, using a transpipeline can become messy and may not yield the best pipeline. So we explore five different ways that you can optimize your transformer pipeline.
Let's get into it.
# 1. Batch inference request
In many cases, when using transpipelines, you do not take full advantage of the graphics processing unit (GPU). Batching multiple inputs can greatly increase GPU utilization and increase inference efficiency.
Instead of processing one sample at a time, you can use a pipeline batch_size Parameters or pass a list of inputs so that the model handles several inputs in one forward pass. Here is an example code:
from transformers import pipeline
pipe = pipeline(
task="text-classification",
model="distilbert-base-uncased-finetuned-sst-2-english",
device_map="auto"
)
texts = [
"Great product and fast delivery!",
"The UI is confusing and slow.",
"Support resolved my issue quickly.",
"Not worth the price."
]
results = pipe(texts, batch_size=16, truncation=True, padding=True)
for r in results:
print(r)
Batching requests allows for higher throughput with minimal latency impact.
# 2. Uses low precision and quantization
Many assumption models fail in inference because there is not enough memory in the development and production environments. Lower numerical accuracy reduces memory usage and speeds up inference without sacrificing too much accuracy.
For example, here is how to use half-precision on a GPU in a transpipeline:
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
model_id,
torch_dtype=torch.float16
)
Similarly, quantization techniques can compress model weights without significantly degrading performance.
# Requires bitsandbytes for 8-bit quantization
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_8bit=True,
device_map="auto"
)
Using reduced accuracy and quantization in production typically speeds up pipelines and reduces memory usage without significantly affecting model accuracy.
# 3. Choose an efficient model architecture
Many applications do not need the largest model to solve tasks. Choosing a light transformer architecture, such as a distillation model, provides better latency and throughput with acceptable accuracy trade-offs.
Compact or distilled versions such as Distilbert retain most of the accuracy of the original model, but with much less parameters, it provides faster inference.
The architecture is optimized for inference and choose a model that meets the accuracy requirements of the task.
# 4. Use cash advances
Many systems waste calculations by repeating expensive tasks. Caches can significantly improve performance by reusing the results of costly calculations.
with torch.inference_mode():
output_ids = model.generate(
**inputs,
max_new_tokens=120,
do_sample=False,
use_cache=True
)
Efficient caching reduces calculation times, improves response times, and reduces production system delays.
# 5. Uses the acceleration runtime via optimal (ONNX runtime)
Many pipelines run in a Pytorch A less optimal mode that adds Python overhead and additional memory copies. use Best Open Neural Network Exchange (ONNX) Runtime – Via onnx runtime – Convert models into static graphs, fuse operations, and runtimes allow faster kernels on GPUs with less overhead on Central Processing Units (CPUs) or GPUs. The result is usually faster inference, especially on CPU or mixed hardware, without changing the way the pipeline is called.
The required packages:
pip install -U transformers optimum[onnxruntime] onnxruntime
Next, transform the model with code like this:
from optimum.onnxruntime import ORTModelForSequenceClassification
ort_model = ORTModelForSequenceClassification.from_pretrained(
model_id,
from_transformers=True
)
By converting Pipeline to the Onnx runtime via Optimum, you can get lower latency and more efficient inference while maintaining your existing pipeline code.
# I'll summarize
Transformers Pipelines is an API wrapper for a hug face framework that promotes AI application development by condensing complex code into simpler interfaces. In this article, we have explored five tips for optimizing efficient model architecture selection, caching, and more from batch inference requests.
I hope this helped!
Cornelius Judas Ujaya Data Science Assistant Manager and Data Writer. While working full-time at Allianz Indonesia, he loves to share data tips with Python via social and writing media. Cornellius writes about a variety of AI and machine learning topics.
