Accelerate distributed training with Snowflake ML

Imagine a large financial services company that needs a cluster of GPUs to train an underlying model to detect synthetic and account takeover fraud and to process large streams of customer event sequences. At this scale, machine learning training is no longer about the model (architecture selection, feature engineering), but is constrained by the infrastructure. That means adjusting per-node memory limits, adjusting dozens of cluster configurations (number of workers, executor memory), and fighting out-of-memory failures that waste hours of computing. At Snowflake, we believe training should be about the model, not the infrastructure. We encountered and solved these fundamental scale problems by building the ML Container Runtime. We designed custom distributed training APIs, efficient data connectors, and adjusted system defaults to eliminate infrastructure bottlenecks, allowing users to focus on their models and not their infrastructure. To validate the effectiveness of our engineering approach, we quantified its impact using the TPCx-AI benchmark.

Our results demonstrate up to 2.5x faster distributed PyTorch training (TPCx-AI use case 9) and up to 1.8x faster distributed XGBoost training (TPCx-AI use case 8) and up to 8x lower infrastructure cost per run in the tested environments compared to Databricks.

Inside: Engineering the benefits of Snowflake

The optimizations we made can be categorized into three areas. Optimizing data ingestion Efficiently pipeline data to training nodes. Performance optimization Using the distributed training API, memory architecture This sets up the platform for such complex AI/ML workloads.

Optimized data ingestion of unstructured data

While the core GPU compute for training models such as PyTorch DDP and ResNet-50 is the same no matter which platform they run on, Snowflake’s architectural advantage lies in its highly optimized data path for moving image bytes from cloud storage to the GPU.

Snowflake direct data path. Snowflake reads images directly over HTTP, completely bypassing the file system, greatly streamlining the process.

List of single trip files: The list of files is executed as a single SQL. LIST Run a command against a Snowflake stage and return all file paths and metadata in one round trip.

Zero copy, in-memory read: file open() Use Python for operations fsspec Methods that issue direct HTTP GETs to signed S3 URLs and load data directly into in-memory BytesIO buffer. This eliminates the need for intermediate steps such as the kernel VFS layer, FUSE daemon, and writing to local disk.

Built-in back pressure: Ray Data’s pull-based streaming executor dynamically adjusts the size of the operator queue, ensuring reads only operate as fast as the GPU can consume them, and provides built-in backpressure.

Three additional optimizations are automatically built into the data path at runtime, so no user configuration is required.

Validating authorized files: Snowflake’s reader handles SQL LIST The results are considered authoritative and redundant file existence checks that require a second round trip for each image are skipped, effectively halving the number of requests.

Parallel fetch: The reader is multi-threaded, so files can be fetched in parallel within each Ray read task, even though tasks typically download files in a single-threaded manner.

Permanent connection: Each rayworker remains long-lived. urllib3 Connection pooling with automatic reconnection avoids the overhead of creating and tearing down TCP/TLS connections for each file.

Optimizing the XGBoost Training Pipeline: Reducing the Training Matrix

Snowflake’s distributed XGBoost trainer is built on Ray primitives with additional ingestion and DMatrix build pass optimizations to significantly reduce the training matrix held in worker memory.

Automatic float32 downcast on ingest. Snowflake’s ingester casts float64 columns to float32 on Arrow batch boundaries before the trainer recognizes them. Each worker maintains a training matrix that is twice as small as it would be with raw float64 input, and the precision remains the same (XGBoost’s DMatrix is internally a float32).

Zero copy Arrow import. Snowflake reads training shards directly from Ray’s object store as Arrows and feeds them into DMatrix construction without any intermediate pandas conversion. This reduces peak memory and CPU time for data conversion.

QuantileDMatrix support. XG boost hist I only need the tree method bottled Feature values to grow the tree instead of the original float. standard DMatrix We still store the raw floating point values and use the sorted index to calculate the bin edges. QuantileDMatrix Derive bin edges via streaming quantile sketches and save only a compact binned representation (typically uint8 (with 256 bins) — Up to 4x less peak memory, especially for wide tables.

Elimination of idle task workers. Ray retains worker processes after task completion for possible reuse. During XGBoost training, the workers used to load data remain resident throughout the training period, retaining memory. Snowflake eliminates these idle workers after building the DMatrix, reclaiming memory without impacting model quality or training time.

In addition to the above optimizations, the trainer also exposes opt-in External memory mode (ExtMemQuantileDMatrix) It streams batches from Ray’s object store through a custom iterator and allows XGBoost to cache histograms to disk, allowing training on data sets that exceed the total worker RAM.

unified memory architecture

Snowflake’s architecture has significant advantages. Its runtime has no JVM, unlike Spark-based engines that reserve large amounts of memory for the JVM heap, which is inaccessible to native training code. This has two important advantages.

Maximize training efficiency with integrated memory. The entire memory of every node is available for training, including XGBoost’s native C++ code. On a 28 GB node, Snowflake provides approximately 18 GB more memory to the trainer than the standard JVM heap configuration.

Source link