Kubernetes has been called a lot of things over the past decade: the cloud operating system, the universal control plane, the great abstraction layer. And it has earned its reputation. Kubernetes tamed container chaos, gave us a common language for infrastructure, and became the backbone of the cloud-native movement.
But today we are staring at a new frontier: AI-native applications. Training huge models across GPU clusters. Running distributed inference pipelines. Serving low-latency responses at the edge. Managing data pipelines that matter just as much as the compute itself. And suddenly, our trusty Kubernetes hammer looks ill-suited to AI's nails.
So let me raise the question: are you trying to fit a square Kubernetes peg into the round hole of your AI-native app?
Kubernetes: The winning control plane
Let's give Kubernetes its due. Born at Google and open-sourced in 2014, Kubernetes was designed to schedule and orchestrate stateless microservices. It abstracts away the messy details of where a container runs, how it scales, and how it connects to other services. Its extensibility, through custom resources, operators, controllers and more, let Kubernetes orchestrate not just workloads but the entire surrounding ecosystem.
That flexibility is why Kubernetes won. It is now the de facto standard for cloud-native platforms. If you're building microservices, you're almost certainly running them on Kubernetes, whether on-prem, in the cloud, or through a managed service.
However, AI-native workloads are not microservices. And that is where the tension begins.
Why doesn't AI fit neatly?
AI workloads stress Kubernetes in ways it was never designed for.
First, there is hardware scheduling. Kubernetes was built with CPU and memory as its primary resources. GPUs? TPUs? Other accelerators? They were bolted on later, awkwardly represented as extended resources. Scheduling GPU jobs efficiently is a completely different ball game.
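To see how bolted-on this feels in practice, here is a minimal sketch using the official Python kubernetes client. The image name and namespace are illustrative; the point is that the GPU appears only as the opaque extended-resource key nvidia.com/gpu, an integer the scheduler counts against node capacity, with no native notion of GPU memory, topology, or sharing.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-train", namespace="default"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/trainer:latest",  # illustrative
                # GPUs are "extended resources": an opaque counter the
                # scheduler matches against node capacity. No GPU memory,
                # no NVLink topology, no fractional sharing: just a number.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```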
Second, job types. Kubernetes thrives on stateless services and short-lived jobs. AI workloads are often long-running, stateful, and distributed across hundreds or thousands of nodes. Training an LLM is not the same as serving a web API.
Third, data gravity. AI workloads are more than just compute. They depend on massive datasets that must be shuffled, sharded, and streamed. Kubernetes does not natively manage that complexity.
Finally, latency sensitivity. Inference workloads are brutally sensitive to milliseconds. The very abstractions that make Kubernetes so powerful can introduce friction AI teams can't afford.
The workarounds
Of course, the industry is not standing still. Many projects are working on making Kubernetes more AI-friendly.
- Kubeflow has become the go-to framework for machine learning pipelines on Kubernetes.
- Ray and KubeRay bring distributed AI workloads onto Kubernetes clusters (see the sketch after this list).
- Volcano focuses on scheduling batch and high-performance computing jobs.
- Every cloud provider is building its own AI-on-Kubernetes offering, complete with custom operators and GPU schedulers.
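To make the Ray option concrete, here is a minimal sketch of fanning GPU tasks out with Ray's task API, assuming a cluster is already up (for example, one provisioned by KubeRay). The function body and shard count are placeholders.

```python
import ray

# Connect to an existing cluster (e.g. one launched by KubeRay);
# calling ray.init() with no address starts a local one for testing.
ray.init(address="auto")

@ray.remote(num_gpus=1)
def score_shard(shard_id: int) -> float:
    # Placeholder for real GPU work, such as scoring one data shard.
    # Ray, not the Kubernetes scheduler, decides which GPU each task
    # lands on inside the pods that KubeRay provisioned.
    return shard_id * 0.5

# Fan out one task per shard; Ray packs them onto available GPUs.
futures = [score_shard.remote(i) for i in range(8)]
print(ray.get(futures))
```

Note the division of labor: Kubernetes provisions the pods, while Ray does the AI-aware scheduling inside them.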
These solutions work. But too often they feel like add-ons: adapters bolted onto Kubernetes rather than designed-in features. It's like dropping a new transmission into a sedan and calling it a race car. It might get around the track, but was it really built for that?
What AI-native needs
So, what would a control plane designed for AI-native apps look like?
It starts with GPU- and accelerator-first scheduling: not an afterthought, but the core of the system.
It treats data pipelines as a first-class concern: not just pods and volumes, but high-throughput streaming, sharding, and caching.
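As a hedged sketch of what "first-class data" means, consider shard-aware streaming: instead of every pod mounting the full dataset as a volume, each worker streams only the slice it owns. The paths and round-robin policy below are hypothetical.

```python
from typing import Iterator

def shard_paths(paths: list[str], worker_rank: int, world_size: int) -> Iterator[str]:
    """Yield only this worker's slice of the dataset shards.

    In a volume-centric model every pod sees the whole dataset; a
    shard-aware loader hands each worker just the files it owns, to be
    streamed from object storage rather than mounted from disk.
    """
    # Simple round-robin assignment: worker r takes every world_size-th file.
    yield from paths[worker_rank::world_size]

# Hypothetical usage: worker 2 of 8 streams every 8th of 64 shard files.
dataset = [f"s3://bucket/train/part-{i:05d}" for i in range(64)]
for path in shard_paths(dataset, worker_rank=2, world_size=8):
    print(path)  # in a real pipeline: open, decode, and feed the trainer
```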
It manages distributed training jobs natively, understanding how to coordinate thousands of GPUs across elastic clusters.
It optimizes inference at scale, tuned for concurrency, latency, and model load rather than CPU utilization.
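As a thought experiment, here is what such a scaling decision could look like. The thresholds, metric names, and policy are all hypothetical; the contrast with CPU-based autoscaling is the point.

```python
from dataclasses import dataclass

@dataclass
class InferenceMetrics:
    inflight_requests: int   # concurrency per replica
    p99_latency_ms: float    # tail latency, not averages
    model_load_s: float      # cold-start cost of adding a replica

def desired_replicas(current: int, m: InferenceMetrics,
                     target_concurrency: int = 4,
                     latency_slo_ms: float = 200.0) -> int:
    """Hypothetical policy: scale on concurrency and tail latency, not CPU."""
    # Scale out when the latency SLO is at risk, but only if a new
    # replica can load the model fast enough to actually help.
    if m.p99_latency_ms > latency_slo_ms and m.model_load_s < 30.0:
        return current + 1
    # Scale in when replicas sit mostly idle.
    if m.inflight_requests < target_concurrency // 2 and current > 1:
        return current - 1
    return current

print(desired_replicas(3, InferenceMetrics(9, 340.0, 12.0)))  # -> 4
```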
And it bakes in policy and cost awareness, because AI cloud bills are already shocking. A true AI-native control plane puts guardrails on a runaway GPU job before the CFO comes knocking.
Can Kubernetes bend without breaking?
Some argue that Kubernetes can simply evolve. After all, it wasn't designed to run databases either, yet operators and CRDs made that possible. With enough extensions, the argument goes, Kubernetes could become an AI control plane too.
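This is the CRD argument in miniature: define a custom resource for AI jobs and let an operator reconcile it. Everything below, the ai.example.com group, the TrainingJob kind, and its spec fields, is hypothetical; only the client call is the real Kubernetes API.

```python
from kubernetes import client, config

config.load_kube_config()

# A hypothetical "TrainingJob" custom resource. The same CRD/operator
# pattern that let Kubernetes absorb databases could, in principle,
# absorb AI jobs: an operator would watch these objects and reconcile
# pods, GPUs, and checkpoints behind the scenes.
training_job = {
    "apiVersion": "ai.example.com/v1alpha1",
    "kind": "TrainingJob",
    "metadata": {"name": "llm-7b-run-42"},
    "spec": {"gpus": 64, "dataset": "s3://corp/train", "elastic": True},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="ai.example.com",
    version="v1alpha1",
    namespace="default",
    plural="trainingjobs",
    body=training_job,
)
```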
The rebuttal? Kubernetes is optimized for microservices. Retrofitting it for AI may always feel unnatural, more duct tape than design. AI-native workloads might be better served by purpose-built systems such as Ray, Mosaic, or the cloud vendors' own orchestrators.
My feeling? We're headed for a hybrid future. Kubernetes remains the control plane for enterprise infrastructure, where compliance, networking, and security policies are enforced, while AI-specific orchestrators, optimized for training and inference, sit alongside it. The challenge is integrating the two without creating even more complexity.
The platform engineering angle
For platform teams, this discussion is anything but academic. Their job is to hide all of this complexity from developers. Whether Kubernetes evolves to handle AI workloads or new orchestrators take hold, the key is giving developers a golden path so they don't have to care what's under the hood.
That means building internal developer platforms (IDPs) that handle GPUs, data, and AI pipelines seamlessly. Developers should be able to request a training job or an inference endpoint without needing to know whether Kubernetes, Ray, or something else is doing the heavy lifting.
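Concretely, the golden path might be nothing more than a tiny platform API that hides the orchestrator entirely. The class names, backends, and submit() signature below are a hypothetical sketch, not any real IDP product.

```python
from dataclasses import dataclass

@dataclass
class TrainingJobRequest:
    # What a developer cares about...
    model: str
    dataset: str
    gpus: int

class PlatformClient:
    """Hypothetical IDP facade: one request shape, pluggable backends."""

    def __init__(self, backend: str = "kubernetes"):
        # ...and what they shouldn't have to care about: whether this
        # routes to a Kubernetes Job, a Ray cluster, or a managed service.
        self.backend = backend

    def submit(self, req: TrainingJobRequest) -> str:
        print(f"[{self.backend}] scheduling {req.model} on {req.gpus} GPUs")
        return f"{self.backend}-job-001"  # opaque handle; backend stays hidden

# A developer's entire interaction with the platform:
job_id = PlatformClient().submit(
    TrainingJobRequest(model="llm-7b", dataset="s3://corp/train", gpus=8)
)
print("submitted:", job_id)
```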
In that sense, the platform engineering movement could be the bridge: by wrapping Kubernetes in the right abstractions, it may make it “good enough” for enterprise AI.
Shimmy's Take
I've been in this industry long enough to have seen this pattern before. When Kubernetes first came out, it wasn't well suited to stateful apps, databases, or service meshes. But the ecosystem adapted. Operators, CRDs, and sidecars bent Kubernetes in directions it was never designed for.
So can Kubernetes bend again for AI? Perhaps. But here's the difference: the AI wave is moving faster than anything we've seen before. We may not be able to afford to wait for incremental ecosystem fixes. The pace of model training, the demand for GPU clusters, the need for inference at the edge: all of it may outrun Kubernetes' ability to evolve.
My view: Kubernetes will have a role to play. It's too entrenched not to. But it may not be the perfect fit. Instead, I expect a new generation of AI-native control planes to rise. The challenge will be stitching them into the Kubernetes world we already live in.
Closing thoughts
Kubernetes has earned its crown as the universal control plane of cloud-native. But AI-native workloads are a different beast. No matter how many extensions you throw at the problem, they don't fit neatly into the Kubernetes model.
The future may not be about forcing the square Kubernetes peg into the round hole of AI-native apps. It's about knowing where Kubernetes still belongs in the AI era, and where entirely new abstractions are needed.
The truth is, Kubernetes doesn't have to do it all. It just has to do its job well. And if AI-native apps need something new, perhaps the most cloud-native thing we can do is embrace that evolution.