Optimizing AI Pipelines at Data Summit 2023

AI pipelines have plenty of room for improvement. These enhancements range from performance optimizations and faster model loading to asynchronous hot swapping, multi-cloud and multi-region Kubernetes clusters, faster spin-up times, and better autoscaling.

Erwann Millon, founding engineer at Krea.ai, led the Data Summit session “Examining State-of-the-Art Technologies for Data Management,” which looked at how to get the most out of AI pipelines through in-depth case studies of inference pipelines. He discussed best practices for serving hundreds of models on multi-cloud clusters.

The annual Data Summit Conference will be held May 10-11, 2023 in Boston, with a pre-conference workshop on May 9.

Millon explained that inference needs revolve around serving many models with high controllability, fast scaling, high availability, and low cost. Ultimately, inference is key to optimizing an AI pipeline.

Fast load times are important when serving thousands of models. With PyTorch and a few lines of code, users can achieve fast load times using the PyTorch meta device, a “fake” torch device that neither allocates memory nor initializes parameters.

For faster loading from disk when a model has a large number of parameters, Hugging Face safetensors can load models 2-10x faster than pickle, while supporting zero-copy loading directly to CUDA. Large models can also benefit from chunked state dictionaries, which are stored as separate files and loaded in parallel using multithreading for better performance.
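The chunked state dictionary idea can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the implementation shown in the session: the function names (save_chunked, load_chunk, load_chunked) are hypothetical, plain lists of floats stand in for tensors, and JSON stands in for a real tensor format such as safetensors.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_chunked(state_dict, directory, chunk_size=2):
    """Split a state dict into files holding up to `chunk_size` entries each."""
    items = list(state_dict.items())
    paths = []
    for start in range(0, len(items), chunk_size):
        path = Path(directory) / f"chunk_{start // chunk_size:04d}.json"
        path.write_text(json.dumps(dict(items[start:start + chunk_size])))
        paths.append(path)
    return paths


def load_chunk(path):
    """Read one chunk file back into a dict (stand-in for a tensor loader)."""
    return json.loads(Path(path).read_text())


def load_chunked(paths, max_workers=4):
    """Read all chunk files concurrently and merge them into one state dict."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(load_chunk, paths):
            merged.update(chunk)
    return merged


if __name__ == "__main__":
    # Hypothetical toy weights; real state dicts would hold tensors.
    weights = {f"layer{i}.weight": [float(i)] * 4 for i in range(6)}
    with tempfile.TemporaryDirectory() as tmp:
        chunk_paths = save_chunked(weights, tmp)
        restored = load_chunked(chunk_paths)
    print(len(chunk_paths), restored == weights)
```

Threads suit this step because reading files is I/O-bound, so the reads can overlap even though only one thread runs Python code at a time.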

Millon posed the question, “Can we intelligently load the model in the background during inference to achieve zero latency?”

Yes, he said. However, due to Python limitations, this can be somewhat difficult. Models can be loaded faster by using parent and child processes that communicate with each other. Importantly, he pointed out, this is not possible on a CPU. Alternatively, loading can be done in the child process directly onto the GPU to work around this limitation.
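The general pattern of loading in the background while inference continues can be illustrated as follows. This is a simplified sketch, not the approach described in the session: it uses a background thread instead of parent and child processes purely for brevity, and the HotSwapServer class, fake_loader function, and model names are all hypothetical.

```python
import threading
import time


class HotSwapServer:
    """Serves the current model while the next one loads in the background."""

    def __init__(self, loader, initial_name):
        self._loader = loader
        self._lock = threading.Lock()
        self._model = loader(initial_name)

    def predict(self, x):
        # Take a reference under the lock, then run inference outside it,
        # so a swap never blocks on a long-running request.
        with self._lock:
            model = self._model
        return model(x)

    def swap_async(self, name):
        """Start loading `name` in the background; returns the worker thread."""
        def work():
            new_model = self._loader(name)  # slow step, done without the lock
            with self._lock:
                self._model = new_model     # fast step, a reference swap

        thread = threading.Thread(target=work, daemon=True)
        thread.start()
        return thread


def fake_loader(name):
    """Hypothetical loader: sleeps to mimic disk I/O, returns a toy model."""
    time.sleep(0.05)
    scale = {"double": 2, "triple": 3}[name]
    return lambda x: x * scale


if __name__ == "__main__":
    server = HotSwapServer(fake_loader, "double")
    worker = server.swap_async("triple")
    print(server.predict(2))  # served without waiting for the new model
    worker.join()
    print(server.predict(2))  # served by the newly loaded model
```

The key design choice is that the slow load happens outside the lock, so requests only ever wait for the reference swap itself.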

These optimizations make it possible to serve hundreds of Stable Diffusion models with zero delay from a single A100. Even so, this efficiency alone is not sufficient to support inference effectively.

Storage can be optimized by sharing models across Krea.ai’s Kubernetes cluster through an NFS solution backed by a write-through system, balancing spin-up time, agility, and scalability.

Millon explained that while these steps are essential for optimizing inference, A100 GPUs are hard to come by. He suggested the following alternatives:

SkyPilot
Google Anthos
Scaling on a single instance (EC2/GCE)
NVIDIA Multi-Instance GPU
AWS
Alternative GPUs such as the A10G
Compensate for VRAM with quantization, model offloading, and model parallelism
Inference on TPUs
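
To make the quantization item above concrete, here is a small worked sketch of symmetric 8-bit quantization in plain Python. It is an illustration under assumptions rather than anything shown in the session: the function names are hypothetical, the weights are toy values, and real frameworks use more elaborate schemes. The idea is that each weight is stored as an integer in [-127, 127] plus one shared scale, which takes roughly a quarter of the memory of 32-bit floats.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    # Hypothetical toy weights.
    weights = [0.5, -1.0, 0.25, 0.0]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    worst_error = max(abs(a - b) for a, b in zip(weights, restored))
    print(quantized, scale, worst_error)
```

The rounding error per weight is bounded by about half the scale, which is the accuracy given up in exchange for the smaller memory footprint.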

Millon’s top recommendation is to profile everything. “If you profile everything in detail, this is the best way to identify the biggest performance hits,” he explained.
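
As a starting point for that kind of profiling, the sketch below uses Python's built-in cProfile and pstats modules to rank the functions in a toy pipeline by cumulative time. The pipeline itself (load_model, run_inference, pipeline) is hypothetical and only stands in for real loading and inference steps; GPU workloads would additionally need a GPU-aware profiler.

```python
import cProfile
import io
import pstats
import time


def load_model():
    """Hypothetical slow step: sleeps to mimic reading weights from disk."""
    time.sleep(0.05)
    return lambda x: x * 2


def run_inference(model, x):
    return model(x)


def pipeline():
    model = load_model()
    return [run_inference(model, i) for i in range(1000)]


def profile(func, top=5):
    """Run `func` under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile(pipeline)
    print(report)
```

In this toy case the report should show load_model dominating the cumulative time, which mirrors the point that measurement, rather than guesswork, reveals where the biggest costs are.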

He concluded by emphasizing the following points: