Deep Lake: A Lakehouse for Deep Learning: Discussions and Limitations



Authors:

(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA

(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA

(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA

(4) Fariz Rahman, Activeloop, Mountain View, CA, USA

(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA

(6) David Isayan, Activeloop, Mountain View, CA, USA

(7) Mark McQuaid, Activeloop, Mountain View, CA, USA

(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA

(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA

(10) Ivo Stranic, Activeloop, Mountain View, CA, USA

(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA

7. Discussion and Limitations

Primary use cases for Deep Lake include (a) training deep learning models, (b) data lineage and version control, (c) data query and analysis, and (d) data inspection and quality control. Taking NumPy [55] arrays as a basic building block, Deep Lake implements versioning, a streaming data loader, and a visualization engine from the ground up.

Figure 10: GPU utilization on a single 16xA100 GPU machine while training a 1B-parameter CLIP model [60]. The dataset is LAION-400M [68], streamed from the AWS US East to the GCP US Central data center. Each color represents the utilization of a single A100 GPU during training.

7.1 Format Design Space

The Tensor Storage Format (TSF) is a binary file format designed specifically for storing tensors: the multidimensional arrays of numbers used throughout machine learning and deep learning. The format is compact and enables fast storage and access of tensor data. One of its main advantages is that it supports a wide range of tensor data types, including dynamically shaped tensors.
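To make the dynamic-shape property concrete, here is a minimal sketch of one way such a layout can work. This is an illustrative toy, not the actual TSF encoding: samples of different shapes are packed back-to-back into a byte blob, and a small index of (offset, shape) pairs recovers each one without any per-sample padding.

```python
# Hypothetical sketch (not the real TSF layout): dynamically shaped
# samples packed contiguously, with a (offset, shape) index per sample.
import struct

def pack_samples(samples):
    """Pack a list of (shape, flat_values) samples into one byte blob.
    Returns (blob, index) where index[i] = (byte_offset, shape)."""
    blob = bytearray()
    index = []
    for shape, values in samples:
        index.append((len(blob), shape))
        for v in values:
            blob += struct.pack("<f", v)  # little-endian float32
    return bytes(blob), index

def read_sample(blob, index, i):
    """Recover sample i as (shape, flat_values)."""
    offset, shape = index[i]
    n = 1
    for d in shape:
        n *= d
    values = list(struct.unpack_from(f"<{n}f", blob, offset))
    return shape, values

# Two "images" with different shapes, stored without padding.
samples = [((2, 2), [1.0, 2.0, 3.0, 4.0]), ((1, 3), [5.0, 6.0, 7.0])]
blob, index = pack_samples(samples)
print(read_sample(blob, index, 1))  # → ((1, 3), [5.0, 6.0, 7.0])
```

A fixed-shape format would have to pad the 1x3 sample up to 2x2, wasting space; the index makes padding unnecessary at the cost of one small lookup table.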

In comparison, Parquet [79] and Arrow [13] are columnar file formats designed for storing and processing large analytical datasets. Unlike TSF, which was designed specifically for tensor data, Parquet and Arrow are optimized for efficient storage and querying of analytical workloads over tabular and time-series data. They are well suited for big data applications because their columnar storage and compression techniques minimize storage space and improve performance. For tensor data, however, TSF has some advantages over Parquet and Arrow: it supports tensor operations and efficient streaming into deep learning frameworks.
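A toy example of why columnar layouts compress analytical data so well: grouping a column's values contiguously makes repeated values adjacent, so run-length encoding (one of the encodings Parquet applies, sketched here in simplified form) collapses them.

```python
# Toy illustration of columnar compression: one column's values sit
# contiguously, so run-length encoding collapses repeated values.
rows = [("cat", 1), ("cat", 2), ("cat", 3), ("dog", 4)]

# A row layout interleaves columns; a columnar layout groups them.
labels = [r[0] for r in rows]  # ['cat', 'cat', 'cat', 'dog']

def run_length_encode(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(run_length_encode(labels))  # → [('cat', 3), ('dog', 1)]
```

For large, dynamically shaped tensors there are few such repeated values to exploit, which is part of why a tensor-native layout pays off instead.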

Compared to other tensor formats [18, 52, 23, 57], TSF is efficient for highly parallelizable workloads, as it does not require coordination between chunks. The main tradeoff of the Tensor Storage Format comes from supporting dynamically shaped arrays inside tensors without padding the memory footprint. For example, in computer vision it is very common to store multiple images of different shapes or videos of dynamic length. To support this flexibility, a small amount of overhead is introduced in the form of the chunk encoder mentioned above, but in practice we have not observed any impact on production workloads.
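The coordination-free property can be sketched as follows. This is an illustrative model, not Deep Lake's implementation: when chunks hold variable numbers of samples, a cumulative-count table lets any reader independently map a global sample index to a (chunk, local index) pair with a binary search, with no cross-chunk communication.

```python
# Hypothetical chunk-encoder sketch (names are illustrative): map a
# global sample index to (chunk_id, index_within_chunk) via a
# cumulative-count table, so readers never coordinate with each other.
import bisect

class ChunkEncoder:
    def __init__(self):
        self.cumulative = []  # cumulative[i] = total samples in chunks 0..i

    def add_chunk(self, num_samples):
        prev = self.cumulative[-1] if self.cumulative else 0
        self.cumulative.append(prev + num_samples)

    def locate(self, sample_index):
        """Return (chunk_id, index_within_chunk) for a global index."""
        chunk_id = bisect.bisect_right(self.cumulative, sample_index)
        prev = self.cumulative[chunk_id - 1] if chunk_id else 0
        return chunk_id, sample_index - prev

enc = ChunkEncoder()
for n in (100, 250, 80):   # three chunks of different sizes
    enc.add_chunk(n)
print(enc.locate(120))      # → (1, 20): sample 120 lives in chunk 1
```

The table itself is the "small amount of overhead" mentioned above: a few integers per chunk, in exchange for dynamic sample counts per chunk.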

7.2 Data Loader

Deep Lake achieves state-of-the-art results in local and remote settings, as seen in the large-image iterative processing benchmark in Figure 7. It is notably faster than FFCV [39], which claims to have reduced ImageNet model training costs to 98 cents per model trained. Additionally, Deep Lake achieves ingestion performance similar to WebDataset [19], while performing significantly better on large images. While Parquet is optimized for small cells and analytical workloads, Deep Lake is optimized for large-scale, dynamically shaped tensor data. Compared to other data lake solutions, its minimalist Python package design makes Deep Lake easy to integrate into large-scale distributed training and inference workloads.
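The core idea behind a streaming data loader can be shown in a few lines. This is a minimal single-threaded-producer sketch under simplifying assumptions, not Deep Lake's loader (which shards, decodes, and batches at much larger scale): a background thread prefetches samples into a bounded queue so network latency overlaps with the training loop's compute.

```python
# Minimal streaming-loader sketch (illustrative only): one background
# thread prefetches samples into a bounded queue ahead of the consumer.
import queue
import threading

def stream(fetch, indices, prefetch=8):
    """Yield fetch(i) for each index, prefetching in a background thread."""
    q = queue.Queue(maxsize=prefetch)  # bounds memory used by prefetching
    _done = object()                   # sentinel marking end of stream

    def producer():
        for i in indices:
            q.put(fetch(i))
        q.put(_done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _done:
        yield item

# Simulated remote fetch; in practice this would be an S3/GCS read.
fetched = list(stream(lambda i: i * i, range(5)))
print(fetched)  # → [0, 1, 4, 9, 16]
```

The bounded queue is the key design choice: it keeps the GPU fed during network stalls while capping how far ahead (and how much memory) the fetcher can run.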

7.3 Future Challenges

Deep Lake's current implementation leaves room for further improvement. First, the storage format does not support custom ordering for more efficient storage layouts, which is necessary for vector search and key-value indexes. Second, Deep Lake implements branch-based locking for concurrent access, similar to the Delta ACID transaction model [27], so that it can scale to high-performance parallel workloads. Third, the current TQL implementation supports only a subset of SQL operations (e.g., operations such as joins are not supported). Future work will focus on improving SQL completeness, extending it to more numeric operations, running federated queries on external data sources, and benchmarking against SQL engines.
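The branch-based concurrency idea can be sketched as follows. This is a hypothetical toy model, not Deep Lake's version-control implementation: each writer checks out its own branch, commits there without contending on `main`, and merges back, so concurrent writers never block one another.

```python
# Hypothetical sketch of branch-based concurrency (illustrative, not
# Deep Lake's API): writers commit on private branches and merge back,
# so they never contend on 'main' while writing.
class VersionedDataset:
    def __init__(self):
        self.branches = {"main": []}  # branch name -> ordered commits

    def checkout(self, branch, from_branch="main"):
        self.branches[branch] = list(self.branches[from_branch])

    def commit(self, branch, change):
        self.branches[branch].append(change)

    def merge(self, branch, into="main"):
        for change in self.branches[branch]:
            if change not in self.branches[into]:
                self.branches[into].append(change)

ds = VersionedDataset()
ds.checkout("writer-a")
ds.checkout("writer-b")
ds.commit("writer-a", "add images 0-999")   # concurrent, no lock on main
ds.commit("writer-b", "add labels 0-999")
ds.merge("writer-a")
ds.merge("writer-b")
print(ds.branches["main"])  # → ['add images 0-999', 'add labels 0-999']
```

A real system must additionally detect conflicting merges; here the tradeoff is that coordination cost is paid once at merge time instead of on every write.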


