author:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuaid, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.
5. Machine Learning Use Cases
This section describes the applications of Deep Lake.
A typical scenario for a deep learning application starts with:
(1) A raw set of files collected in an object storage bucket, which may include images in native formats such as JPEG or PNG, videos such as MP4, and other types of multimedia data.
(2) Associated metadata and labels stored in a relational database or, optionally, in the same bucket as the raw data, in a normalized tabular format such as CSV, JSON, or Parquet.
An empty Deep Lake dataset is created first, as shown in Figure 4. Then, empty tensors are defined to store both the raw data and the metadata; the number of tensors is arbitrary. In our basic example of an image classification task, we have two tensors, declared in the sketch after this list:
• Image tensor with htype image and sample compression JPEG;
• Label tensor with htype class_label and chunk compression LZ4.
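For concreteness, a minimal sketch of this setup using the open-source deeplake Python package is shown below; the bucket path is a placeholder, and exact argument names may vary between releases.

```python
import deeplake

# Create an empty Deep Lake dataset in object storage (placeholder path).
ds = deeplake.empty("s3://my-bucket/animal-classification")

with ds:  # batches writes for efficiency
    # Raw images, stored with per-sample JPEG compression.
    ds.create_tensor("images", htype="image", sample_compression="jpeg")
    # Categorical labels, stored with LZ4 chunk compression.
    ds.create_tensor("labels", htype="class_label", chunk_compression="lz4")
```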
After declaring the tensors, data can be appended to the dataset, as sketched below. If the raw image compression matches the tensor's sample compression, the binary is copied directly into chunks without additional decoding. Label data is extracted from a SQL query or CSV table into categorical integers and appended to the label tensor. All Deep Lake data is stored in the bucket and is self-contained. Once stored, the data can be accessed through a NumPy-style interface or a streamable deep learning data loader. A model running on a compute machine then iterates over the stream of image tensors and stores its output in a new tensor called predictions. Below, we discuss how to train on, version, query, and inspect the quality of Deep Lake datasets.
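Continuing the sketch above, ingestion and NumPy-style access might look as follows; the path and file name are hypothetical.

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")  # placeholder path

with ds:
    # deeplake.read() preserves the original JPEG bytes, so when the file
    # compression matches the tensor's sample compression, the binary is
    # copied into chunks without re-encoding.
    ds.images.append(deeplake.read("raw/cat_001.jpeg"))  # hypothetical file
    ds.labels.append(0)  # categorical integer extracted from the metadata

# NumPy-style access decodes a single sample on demand.
print(ds.images[0].numpy().shape, ds.labels[0].numpy())
```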
5.1 Training the Deep Learning Model
Deep learning models are trained at multiple scales within an organization, from exploratory training on personal computers to large-scale training on distributed machines with many GPUs. The time and effort required to move data from long-term storage to the training client often rival the training itself. Deep Lake solves this problem by streaming data at high speed without bottlenecking the downstream training process, avoiding the cost and time required to replicate the data to local storage.
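As a rough illustration, streaming a Deep Lake dataset into a PyTorch training loop might look like the sketch below; the path is a placeholder, and loader options vary by version.

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")  # placeholder path

# Stream batches directly from object storage; no local replica is needed.
loader = ds.pytorch(batch_size=32, num_workers=4, shuffle=True)

for batch in loader:
    images, labels = batch["images"], batch["labels"]
    # ... forward pass, loss computation, optimizer step ...
```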
5.2 Data Lineage and Version Control
Deep learning data constantly evolves as new data is added and existing data is quality-controlled. Analytics and training workloads occur in parallel while the data is being modified, so knowing which data version was used in a particular workload is critical to understanding the relationship between the data and model performance. Deep Lake enables deep learning practitioners to determine which versions of data were used in their analytics workloads and to time travel between those versions if an audit is required. All data is mutable, so it can be edited to meet compliance-related privacy requirements. As with Git for code, Deep Lake also introduces the concept of data branches, allowing you to experiment with and edit data without affecting your colleagues' work.
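A hedged sketch of this Git-like workflow, using commit and checkout calls from the open-source deeplake API; the path is a placeholder, and the in-place update assumes a release that supports item assignment.

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")  # placeholder path

# Snapshot the current state, then branch off to experiment safely.
main_commit = ds.commit("initial ingest of images and labels")
ds.checkout("relabeling-experiment", create=True)

ds.labels[0] = 1  # edit data on the branch without affecting colleagues
ds.commit("fix mislabeled sample 0")

# Time travel: return to the earlier snapshot and inspect history.
ds.checkout(main_commit)
ds.log()
```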
5.3 Querying and Analyzing Data
Deep learning models are rarely trained on all the data an organization has collected for a particular application. Training datasets are often built by filtering raw data based on criteria that will improve model performance. This can include balancing the data, removing redundant data, or selecting data that contains specific features. Deep Lake provides tools to query and analyze the data so that deep learning engineers can create datasets that will produce the most accurate models.
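As one example, the open-source API exposes user-defined-function filtering that returns a lightweight view over the matching indices rather than a copy of the data; a sketch, where the path is a placeholder and class index 0 is assumed to mean "cat":

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")  # placeholder path

# Select only samples labeled as class 0 ("cat" in this hypothetical schema).
cats_only = ds.filter(lambda sample: int(sample.labels.numpy()) == 0)

print(len(cats_only))  # the view can be streamed to training like any dataset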
5.4 Data Inspection and Quality Control
While unsupervised learning is becoming more applicable to real-world use cases, most deep learning applications still rely on supervised learning. The quality of a supervised learning system is bounded by the quality of its data, and high quality is often achieved through thorough manual inspection. Since this process is time-consuming, it is important to give stakeholders tools to inspect vast amounts of data quickly. Deep Lake allows you to inspect deep learning datasets of any size from a browser, with no setup time or data download required. Additionally, the tools can be extended to compare model predictions with ground-truth labels. Combined with querying and versioning, this enables iterative improvements to the data to achieve the best possible model.
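For instance, once model outputs are stored in a predictions tensor (as in the ingestion walkthrough above), disagreements with the labels can be surfaced for manual review; a sketch under the same placeholder schema, assuming predictions has been populated for every sample.

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")  # placeholder path

# Flag samples where the stored model output disagrees with the label,
# so annotators can prioritize them for inspection and correction.
mismatches = ds.filter(
    lambda s: int(s.predictions.numpy()) != int(s.labels.numpy())
)
print(f"{len(mismatches)} samples flagged for review")
```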