Deep Lake: A Lakehouse for Deep Learning: Tensor Storage Format


Authors:

(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA

(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA

(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA

(4) Fariz Rahman, Activeloop, Mountain View, CA, USA

(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA

(6) David Isayan, Activeloop, Mountain View, CA, USA

(7) Mark McQuade, Activeloop, Mountain View, CA, USA

(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA

(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA

(10) Ivo Stranic, Activeloop, Mountain View, CA, USA

(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA

3. Tensor Storage Format

The Deep Lake dataset follows a columnar storage architecture with tensors as columns, as shown in Figure 3. Each tensor is a collection of chunks, binary blobs that contain the data samples. An index map associated with each tensor maps a given sample index to the chunk that holds the sample and to the sample's position within that chunk.
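
To make the index map concrete, the following sketch (hypothetical names, not the Deep Lake implementation) resolves a global sample index to a chunk ID and an in-chunk offset using cumulative per-chunk sample counts:

```python
import bisect

# Hypothetical index map: chunk IDs and how many samples each chunk holds.
chunk_ids = ["chunk_000", "chunk_001", "chunk_002"]
samples_per_chunk = [40, 38, 41]          # chunks hold varying numbers of samples

cumulative = []
total = 0
for n in samples_per_chunk:
    total += n
    cumulative.append(total)              # exclusive end index covered by each chunk

def locate(sample_index: int) -> tuple[str, int]:
    """Return the chunk ID and the sample's offset within that chunk."""
    chunk_pos = bisect.bisect_right(cumulative, sample_index)
    start = cumulative[chunk_pos - 1] if chunk_pos > 0 else 0
    return chunk_ids[chunk_pos], sample_index - start

print(locate(0))    # ('chunk_000', 0)
print(locate(45))   # ('chunk_001', 5)
```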

3.1 Dataset

A sample in a dataset represents a single row indexed across parallel tensors. In contrast to document storage formats, sample elements are logically independent, which enables partial access to samples for running performant queries or streaming selected tensors over the network to GPU training instances. Multiple tensors can be grouped together. Groups implement syntactic nesting and define how tensors relate to each other; syntactic nesting avoids the complexity of a hierarchical memory layout. Changes to a dataset's schema are also tracked over time with version control, alongside changes to the dataset content.
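
As an illustration of parallel tensors, groups, and partial access, here is a minimal sketch using the open-source deeplake package; the dataset path and tensor names are placeholders, and exact API details may differ between releases.

```python
import numpy as np
import deeplake  # open-source package implementing the format described here

# Create an in-memory dataset with two parallel tensors; "annotations/boxes"
# nests the tensor under an "annotations" group purely by name.
ds = deeplake.empty("mem://demo")
ds.create_tensor("images", htype="image", sample_compression="jpeg")
ds.create_tensor("annotations/boxes", htype="bbox")

# One sample = one row across all tensors.
ds.images.append(np.zeros((480, 640, 3), dtype=np.uint8))
ds["annotations/boxes"].append(np.array([[10, 20, 100, 200]], dtype=np.float32))

# Partial access: fetch only the tensor a query needs, returned as a NumPy array.
boxes = ds["annotations/boxes"][0].numpy()
```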

3.2 Tensors

Tensors are typed and can be appended to or modified in-place. Default access to an index or a set of indices returns the data as NumPy arrays [55]. Instead of storing one-dimensional data as in Parquet [79] or series as in Arrow [13], tensors can accommodate n-dimensional data, where the first dimension typically corresponds to the index or batch dimension. Tensors can contain dynamically shaped arrays, also called ragged tensors, in contrast to statically chunked array formats such as Zarr [52].
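
A short sketch of the dynamically shaped (ragged) behavior, again assuming a deeplake-style API with placeholder names:

```python
import numpy as np
import deeplake

ds = deeplake.empty("mem://ragged-demo")
ds.create_tensor("masks", dtype="uint8")              # generic tensor, no fixed shape

ds.masks.append(np.ones((480, 640), dtype=np.uint8))  # samples may differ in shape
ds.masks.append(np.ones((512, 512), dtype=np.uint8))

print(ds.masks.shape)             # dynamic dimensions may be reported as None, e.g. (2, None, None)
print(ds.masks[1].numpy().shape)  # (512, 512)
```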

3.3 Types

Htype defines the expectations for the samples in a tensor, such as the data type (dtype, as in NumPy [55]), shape, number of dimensions, or compression. Typed tensors simplify interaction with deep learning frameworks and enable sanity checks and efficient memory layout. By inheriting from the generic tensor htype, types such as image, video, audio, bbox, and dicom can be constructed. For example, a tensor with the image htype expects samples with a dtype of uint8 and a shape of length 3 (i.e. width, height, number of channels). We further extend the concept of htypes to metatypes, which support storing sequences of images in a tensor (sequence[image]), referencing remotely stored images while preserving the normal behavior of an image tensor (link[image]), and even cross-format support.
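
The htype hierarchy can be exercised when declaring tensors. The sketch below assumes a deeplake-style API (names and arguments may vary by version) and declares an image tensor, a sequence[image] tensor, and a link[image] tensor:

```python
import deeplake

ds = deeplake.empty("mem://htype-demo")

# Plain image tensor: samples must be uint8 arrays with a shape of length 3.
ds.create_tensor("frames", htype="image", sample_compression="jpeg")

# Metatype: each sample is an ordered sequence of images (e.g. a short clip).
ds.create_tensor("clips", htype="sequence[image]", sample_compression="jpeg")

# Metatype: samples reference remotely stored images but behave like image samples.
ds.create_tensor("remote_images", htype="link[image]")

# A linked sample is appended by URL rather than raw bytes (URL is a placeholder;
# real use typically requires credential configuration for the remote store).
ds.remote_images.append(deeplake.link("s3://bucket/path/to/image.jpg"))
```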

Figure 3: Each sample (row) is stored in a set of column tensors with dynamically sized chunks.

3.4 Memory Layout

Deep Lake datasets contain a JSON-formatted provenance file and a folder for each tensor. Each tensor contains chunks, a chunk encoder, a tile encoder, and tensor metadata. Tensors can optionally be hidden; for example, hidden tensors can maintain downsampled versions of images or preserve shape information for fast queries.

At the storage level, tensors are stored as chunks. While statically (speculatively) shaped chunks avoid the need for chunk map tables, they impose significant user overhead when specifying tensors, limit the use of custom compression, underutilize storage for dynamically shaped tensors, and make post-processing inefficient. Deep Lake instead constructs chunks based on lower and upper bounds on chunk size so that each chunk holds a limited number of samples. The trade-off is a compressed index map per tensor that maps sample indices to chunk IDs, but in return chunk sizes stay in the optimal range for streaming while accommodating samples of mixed shapes. The approach taken in this paper can be seen as an optimized trade-off between file-system page maps and compute-defined, map-free array storage systems. In practice, a single chunk encoder can scale to billions of images while keeping roughly 150 MB of chunk-encoder metadata per 1 PB of tensor data; sharding the chunk encoder allows further scaling. A chunk contains header information such as byte ranges and sample shapes, followed by the sample data itself. If a sample exceeds the upper chunk-size bound (as with large aerial or microscopy images), it is tiled into chunks across the spatial dimensions. The only exception to tiling is video, which is handled by efficient frame-to-index mapping, decompression of only key frames, and range-based requests while streaming.
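
The following self-contained sketch (not the actual Deep Lake implementation) illustrates the bounded-chunking idea: samples are packed into a chunk until an upper byte bound would be exceeded, and a sample larger than the bound is split into multiple pieces, mirroring spatial tiling of huge images.

```python
# Hypothetical illustration of bounded chunking with tiling for oversized samples.
# The real format also uses a lower bound; only the upper bound is shown here.
MAX_CHUNK = 16 * 2**20   # upper bound on chunk size (16 MB, illustrative)

def pack_into_chunks(sample_sizes: list[int]) -> list[list[int]]:
    """Greedily pack sample byte sizes into chunks bounded by MAX_CHUNK."""
    chunks, current, current_bytes = [], [], 0
    for size in sample_sizes:
        if size > MAX_CHUNK:                  # oversized sample: tile it
            if current:
                chunks.append(current)
                current, current_bytes = [], 0
            while size > 0:
                piece = min(size, MAX_CHUNK)  # each tile occupies its own chunk
                chunks.append([piece])
                size -= piece
            continue
        if current_bytes + size > MAX_CHUNK:  # close the current chunk
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

# Six 3 MB samples followed by one 40 MB sample: the last sample is tiled.
print(pack_into_chunks([3 * 2**20] * 6 + [40 * 2**20]))
```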

3.5 Access Patterns

The tensor storage format is optimized for deep learning training and inference, covering both sequential and random access. Sequential access is used to run scan queries, transform tensors into other tensors, or run inference. Random-access use cases include multiple annotators writing labels to the same image, or a model storing predictions back alongside the dataset. When strict mode is disabled, indices beyond the current length of a tensor can be assigned, which accommodates sparse tensors. However, random assignment over time produces inefficiently stored chunks, so we implement an on-the-fly re-chunking algorithm to optimize the data layout. One of the primary access patterns in Deep Lake is shuffle-stream access for training machine learning models, which requires streaming chunks to the training process in random or custom order. This is achieved by issuing range-based requests to access sub-elements within chunks, running complex queries before training to determine the order, and maintaining a buffer cache for data that has been fetched but not yet used. This eliminates the need for a separate compute cluster to run the shuffling algorithm [50].
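
A highly simplified sketch of the shuffle-stream pattern: an order is fixed up front (here by random shuffling, but it could come from a query result), a byte range is requested from storage for each sample, and the raw bytes are handed to the training loop. All names and layouts here are hypothetical.

```python
import random

# Hypothetical layout: sample index -> (chunk key, start byte, end byte).
sample_ranges = {
    0: ("images/chunk_000", 0, 196_608),
    1: ("images/chunk_000", 196_608, 393_216),
    2: ("images/chunk_001", 0, 262_144),
}

def fetch_range(chunk_key: str, start: int, end: int) -> bytes:
    """Stand-in for an HTTP range request against object storage."""
    return b"\x00" * (end - start)   # placeholder payload

def shuffle_stream(indices):
    order = list(indices)
    random.shuffle(order)            # the order could equally come from a query
    for idx in order:
        key, start, end = sample_ranges[idx]
        yield idx, fetch_range(key, start, end)

for idx, raw in shuffle_stream(sample_ranges):
    pass  # decode `raw` and hand the sample to the training loop
```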

Each tensor has its own set of chunks, with a default chunk size of 8 MB. A single chunk holds data from multiple indices when the individual data points (image, label, annotation, etc.) are smaller than the chunk size; conversely, a data point larger than the chunk size is split across multiple chunks (tiling). Video data is the exception to this chunking logic.
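
As a rough worked example of this logic, with illustrative sizes only:

```python
CHUNK_SIZE = 8 * 2**20                 # default 8 MB chunk

small_sample = 256 * 256 * 3           # uncompressed 256x256 RGB image ≈ 192 KB
print(CHUNK_SIZE // small_sample)      # -> 42: dozens of such samples share one chunk

large_sample = 100 * 2**20             # a 100 MB aerial image
print(-(-large_sample // CHUNK_SIZE))  # -> 13: the sample is tiled across 13 chunks
```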

The Deep Lake format is optimized to maximize GPU processing throughput and matches the data-loading pipeline expected by deep learning frameworks: CPU prefetching, decompression or decoding, transformations, and transfer to GPU memory.
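
For instance, the open-source deeplake package exposes a PyTorch-style loader that wraps this pipeline; the sketch below uses an example public dataset path and assumed tensor names, and the exact call signature may differ between versions.

```python
import deeplake

# Example public dataset path (assumption); tensor names below are also assumed.
ds = deeplake.load("hub://activeloop/cifar100-train")

# Workers prefetch and decode chunks on the CPU while the training loop
# collates batches and moves them to the GPU.
loader = ds.pytorch(batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    images, labels = batch["images"], batch["labels"]
    # images/labels are tensors ready for .to("cuda") and a forward pass
    break
```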

3.6 Storage Providers

Deep Lake can connect to any storage provider, including object storage such as AWS S3 [1] and Google Cloud Storage (GCS) [3], POSIX-compatible file systems, or local in-memory storage. In addition, memory caches can be built by chaining storage providers together, for example a Least Recently Used (LRU) cache of local in-memory data in front of remote S3 storage.
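
A schematic sketch of provider chaining with hypothetical classes (not the actual Deep Lake storage interface): each provider exposes the same key/value interface, so an LRU cache can sit in front of a slower remote provider.

```python
from collections import OrderedDict

class MemoryProvider:
    """In-memory key/value store standing in for a storage backend."""
    def __init__(self):
        self._data = {}
    def get(self, key: str) -> bytes:
        return self._data[key]
    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

class LRUCacheProvider:
    """Wraps a slower provider (e.g. S3) with a bounded in-memory LRU cache."""
    def __init__(self, backing, capacity: int = 128):
        self.backing = backing
        self.capacity = capacity
        self.cache: OrderedDict[str, bytes] = OrderedDict()
    def get(self, key: str) -> bytes:
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        value = self.backing.get(key)        # cache miss: hit the backing store
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used entry
        return value
    def set(self, key: str, value: bytes) -> None:
        self.backing.set(key, value)
        self.cache[key] = value

remote = MemoryProvider()                    # stand-in for an S3-backed provider
store = LRUCacheProvider(remote, capacity=2)
remote.set("images/chunk_000", b"...")
print(store.get("images/chunk_000"))         # first read goes to the backing store
```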


