Use multi-tier checkpointing for large-scale AI training jobs

Conceptually, a multi-tier checkpointing solution presents your ML training job with a single “magic” local filesystem volume that the job uses to save and restore checkpoints. It is “magic” because it delivers RAM disk-level read and write speeds while also providing the data durability associated with Cloud Storage.

When the feature is enabled, local volumes (node storage) are the only storage tier visible to ML jobs. Checkpoints written there are automatically replicated to one or more peer nodes within the cluster and periodically backed up to Cloud Storage.

When the job restarts, the checkpoint data specific to the portion of the training job now running on each node (that is, its node rank) automatically appears on the local volumes used by the ML job. In the background, the required data may be fetched from another node in the cluster or from Cloud Storage. ML jobs also transparently locate the latest fully written checkpoint, regardless of where it resides.
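From the job’s perspective, all of this is just reads and writes against a local path. The following is a minimal sketch of that pattern for a JAX job using Orbax’s CheckpointManager; the mount path and the toy training state are assumptions for illustration, and the replication and backup described above happen entirely outside the job.

```python
import numpy as np
import orbax.checkpoint as ocp

# Hypothetical mount path of the multi-tier local checkpoint volume.
CKPT_DIR = "/local-checkpoint-volume/ckpt"

options = ocp.CheckpointManagerOptions(save_interval_steps=100, max_to_keep=2)
mngr = ocp.CheckpointManager(CKPT_DIR, options=options)

state = {"step": 0, "params": np.zeros(4)}  # toy training state

# On (re)start, resume from the latest fully written checkpoint, if any;
# the replicator has already made the needed data appear on the local volume.
latest = mngr.latest_step()
if latest is not None:
    state = mngr.restore(latest, args=ocp.args.StandardRestore(state))

for step in range(int(state["step"]), 1000):
    state["params"] = state["params"] + 0.01  # stand-in for a real train step
    state["step"] = step + 1
    # Writes land on local storage at local-disk speed; peer replication and
    # Cloud Storage backup happen in the background.
    mngr.save(step, args=ocp.args.StandardSave(state))

mngr.wait_until_finished()
```

Because the volume behaves like an ordinary local filesystem, existing checkpointing libraries can simply be pointed at it.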

The component responsible for moving data between tiers is called the Replicator. It runs on every node as part of the CSI driver that mounts the local volumes.

Looking more closely, the Replicator performs the following important functions:

  • Centralized intelligence: Analyzes Cloud Storage backups and aggregated data across your cluster to determine the latest complete checkpoint from which to restore your job after a restart. It also detects when a checkpoint has been successfully saved by all nodes, determines when old data can be safely garbage collected, and strategically decides which checkpoints to back up to Cloud Storage (see the first sketch after this list).

  • Smart peer selection: The Replicator is aware of the network topology underlying both TPUs and GPUs and uses it to select replication peers for each node. It prioritizes “close” peers with high bandwidth and low latency; however, a close peer carries a higher risk of correlated failure (for example, one within the same TPU slice or GPU superblock). The Replicator therefore also chooses “distant” peers, which incur slightly more network overhead but whose failures are far more likely to be independent (for example, peers residing in a separate GPU superblock). In data-parallel scenarios, peers that hold identical data are preferred (see the peer-selection sketch after this list).

  • Automatic data deduplication: With data parallelism, multiple nodes run the same training pipeline and therefore produce identical checkpoint saves. Replicator peer selection pairs these nodes together, eliminating the need to actually copy data; instead, each node verifies the integrity of its peers’ data (see the integrity-check sketch after this list). No additional bandwidth is consumed, replication is effectively instantaneous, and local storage usage is significantly reduced. A copy of the full checkpoint is still maintained even if a peer is misconfigured.

  • Huge-model mode assuming data parallelism: Beyond optimization, this mode accommodates the largest models, where local node storage is insufficient to hold both the node’s own checkpoints and its peers’ data. In such cases, the ML job configures the Replicator to assume data parallelism, which significantly reduces local storage requirements. This also applies to scenarios where a dedicated node handles Cloud Storage backups rather than the node that holds the latest checkpoint itself.

  • Optimized Cloud Storage usage: By leveraging data deduplication, each piece of unique data is stored only once in Cloud Storage, optimizing storage space, bandwidth consumption, and associated costs.

  • Automatic garbage collection: The Replicator continuously monitors checkpoint storage across all nodes. Once it verifies that the latest checkpoint has been successfully saved to all locations, it automatically begins deleting older checkpoints, retaining any checkpoint that is still being backed up to Cloud Storage until that backup completes (this rule is folded into the first sketch below).
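As a rough illustration of the centralized-intelligence and garbage-collection rules above, the sketch below computes the newest checkpoint step that every node has fully saved, plus the older steps that are safe to delete. The data structures (per-node sets of saved steps, a set of steps whose Cloud Storage backup is still in progress) are hypothetical simplifications, not the Replicator’s real interfaces.

```python
from typing import Dict, List, Optional, Set

def latest_complete_step(saved_steps_by_node: Dict[str, Set[int]]) -> Optional[int]:
    """Newest step that every node has fully saved, or None if there is none."""
    common = set.intersection(*saved_steps_by_node.values())
    return max(common) if common else None

def gc_candidates(saved_steps_by_node: Dict[str, Set[int]],
                  backups_in_progress: Set[int]) -> List[int]:
    """Older checkpoints become deletable once a newer complete checkpoint
    exists, except those still being backed up to Cloud Storage."""
    latest = latest_complete_step(saved_steps_by_node)
    if latest is None:
        return []
    all_steps = set().union(*saved_steps_by_node.values())
    return sorted(s for s in all_steps
                  if s < latest and s not in backups_in_progress)

# node-b has not finished step 300, so 200 is the latest complete checkpoint;
# step 100 becomes deletable only once its Cloud Storage backup has completed.
reports = {"node-a": {100, 200, 300}, "node-b": {100, 200}}
print(latest_complete_step(reports))   # 200
print(gc_candidates(reports, {100}))   # []
print(gc_candidates(reports, set()))   # [100]
```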
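The peer-selection idea can be sketched in a similar spirit: pick one “close” peer in the same topology domain and one “distant” peer outside it, preferring peers that hold identical data-parallel state. The Node model below (a single superblock label and a shard index) is a made-up stand-in for the real TPU/GPU topology information the Replicator uses.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    superblock: str  # coarse topology domain; hypothetical stand-in for real topology
    shard: int       # model shard held by this node; equal shards => identical checkpoints

def pick_peers(me: Node, others: List[Node]) -> List[Node]:
    """Choose one 'close' peer (same superblock, cheap to reach) and one
    'distant' peer (different superblock, unlikely to fail together),
    preferring peers whose checkpoint data is identical to ours."""
    def best(candidates: List[Node]) -> Optional[Node]:
        if not candidates:
            return None
        identical = [n for n in candidates if n.shard == me.shard]
        return (identical or candidates)[0]

    close = best([n for n in others if n.superblock == me.superblock])
    distant = best([n for n in others if n.superblock != me.superblock])
    return [p for p in (close, distant) if p is not None]

# Example: n0 pairs with an identical-data peer nearby and one far away.
nodes = [Node("n1", "sb-A", 0), Node("n2", "sb-A", 1), Node("n3", "sb-B", 0)]
print([p.name for p in pick_peers(Node("n0", "sb-A", 0), nodes)])  # ['n1', 'n3']
```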
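Finally, when a peer is expected to hold byte-identical checkpoint data, “replication” can reduce to an integrity check. Here is a minimal sketch of that check, assuming a simple SHA-256 digest exchange rather than whatever scheme the Replicator actually uses:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Stream a checkpoint file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_peer_copy(local_ckpt: Path, peer_digest: str) -> bool:
    """With data parallelism the peer's checkpoint should be byte-identical,
    so matching digests count as a successful 'replication' with no copying."""
    return file_digest(local_ckpt) == peer_digest
```

Because only digests are exchanged in this case, no checkpoint bytes need to cross the network.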

Wide range of checkpointing solutions

Google Cloud offers a comprehensive portfolio of checkpointing solutions to meet a variety of AI training needs. Simpler options, such as writing directly to Cloud Storage or using Cloud Storage FUSE, serve small to medium-sized workloads very effectively, and parallel file systems such as Lustre provide high throughput for large clusters. Multi-tier checkpointing, by contrast, is purpose-built for the most demanding, highest-scale (more than 1,000 nodes) training jobs that require frequent saves and rapid recovery.

Multi-tier checkpointing is currently in preview, with a focus on JAX on Cloud TPUs and PyTorch on GPUs. To get started, follow the user guide. If you have any questions or feedback, please contact your account team.


