Discord detailed how they rebuilt their machine learning platform after reaching the limits of single GPU training. By standardizing on Ray and Kubernetes, introducing a one-command cluster CLI, and automating workflows through Dagster and KubeRay, the company turned distributed training into a daily operation. This change enabled daily retraining of our large-scale model and helped drive a 200% improvement in key ad ranking metrics.
As bespoke models grow in scale and frequency, companies like Uber, Pinterest, and Spotify are publishing similar engineering reports. Each describes the transition from manual, team-owned GPU setups to a shared, programmable platform that prioritizes consistency, observability, and faster iterations.
The move to Discord began when the team created a Ray cluster on their own by copying an open source sample and modifying the YAML files for each workload. This resulted in configuration drift, unclear ownership, and inconsistent GPU utilization. The platform team set out to make distributed ML predictable by standardizing the way clusters are created and managed.
The resulting platform is centered around Ray and Kubernetes, but defined by abstractions above it. Engineers request a cluster with high-level parameters through the CLI, and the tool uses vetted templates to generate the Kubernetes resources needed to run Ray. This eliminates the need for teams to understand low-level cluster configuration and ensures that scheduling, security, and resource policies are applied consistently.
The training workflow is integrated into Dagster and interacts with KubeRay to create and destroy Ray clusters as part of the pipeline. Jobs that previously required manual setup are now run through a single orchestration layer, and the system automates the lifecycle of your cluster.

Discord also built X-Ray, a UI that displays active clusters, job logs, and resource usage. These components have turned distributed training into a predictable workflow rather than a bespoke setup exercise.

According to the company, the platform enables daily training of large-scale models and makes it easier for engineers to adopt distributed technologies. The team was able to implement the new training framework without requiring direct involvement from the platform group. This improvement led to tangible product benefits, including increased ad relevance.
Other organizations have described similar transitions. Uber migrated parts of its Michelangelo platform to Ray on Kubernetes and reported increased throughput and GPU utilization. Pinterest built a Ray control plane to manage the cluster lifecycle and centralize logs and metrics, reducing the operational burden for ML engineers. Spotify has created Spotify-Ray, a GKE-based environment where users launch Ray clusters through the CLI or SDK to integrate experimentation and production workflows.
However, we also publish a warning account about the limitations of our internal ML platform. In a post earlier this year, CloudKitchens discussed how its first Kubernetes-based system became overly complex, with simple ML jobs taking over 10 minutes to start and Bazel-based workflows creating an ongoing maintenance burden for Python users.
Taken together, these statements suggest a move to shared ML platforms with clearer abstractions and more predictable workflows. Platforms can enable faster iterations and more reliable access to distributed computing, but they can also come with their own design and maintenance trade-offs.
