Authors:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuade, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.
Abstract
Traditional data lakes provide critical data infrastructure for analytical workloads by enabling time travel, running SQL queries, ingesting data with ACID transactions, and visualizing petabyte-scale datasets on cloud storage. They allow organizations to break down data silos, make data-driven decisions, improve operational efficiency, and reduce costs. However, as deep learning usage grows, traditional data lakes are not well designed for applications such as natural language processing (NLP), audio processing, computer vision, and applications involving non-tabular datasets. This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop.[1][2] Deep Lake maintains the benefits of a vanilla data lake with one key difference: it stores complex data, such as images, videos, annotations, as well as tabular data, in the form of tensors and rapidly streams the data over the network to (a) Tensor Query Language, (b) an in-browser visualization engine, and (c) deep learning frameworks without sacrificing GPU utilization. Datasets stored in Deep Lake can be accessed from PyTorch [58], TensorFlow [25], and JAX [31], and integrate with numerous MLOps tools.
Keywords – Deep Lake, Deep Learning, Data Lake, Lakehouse, Cloud Computing, Distributed Systems
1. Introduction
A data lake is a central repository where organizations store structured, unstructured, and semi-structured data in one place. Data lakes provide a better way to manage, govern, and analyze data; they also break down data silos and surface insights that were previously hidden across disparate data sources. First-generation data lakes traditionally collected data into distributed storage systems such as HDFS [71] or AWS S3 [1]. Because the data was disorganized, these lakes turned into "data swamps," which gave rise to the second generation of data lakes led by Delta, Iceberg, and Hudi [27, 15, 10]. They operate strictly on top of standardized structured formats such as Parquet, ORC, and Avro [79, 6, 20] and provide features like time travel, ACID transactions, and schema evolution. Data lakes directly integrate with query engines such as Presto, Athena, Hive, and Photon [70, 12, 76, 66] to run analytical queries, and connect to frameworks such as Hadoop, Spark, and Airflow [14, 82, 9] for maintaining ETL pipelines. The integration of data lakes and query engines, along with a clear separation of compute and storage, has led to the emergence of systems like Lakehouse [28] that serve as an alternative to data warehouses such as Snowflake, BigQuery, Redshift, and ClickHouse [33, 4, 40, 2].
Over the past decade, deep learning has outpaced traditional machine learning techniques on unstructured and complex data such as text, images, videos, and audio [44, 47, 38, 83, 51, 30, 63, 56]. Deep learning systems have not only surpassed traditional techniques but also achieved super-human accuracy in applications such as cancer detection from X-ray images, anatomical reconstruction of human neurons, playing games, driving cars, unfolding proteins, and generating images [61, 48, 72, 42, 77]. Large language models with transformer-based architectures have achieved state-of-the-art results across translation, reasoning, summarization, and text-completion tasks [78, 36, 81, 32]. Large multi-modal networks embed unstructured data into vectors, enabling cross-modal search [29, 60]. Moreover, they are used to generate photo-realistic images from text [62, 65].
A major contributing factor to the success of deep learning models has been the availability of large datasets such as COCO (330K images), ImageNet (1.2M images), Oscar (a multilingual text corpus), and LAION (400M and 5B images) [49, 34, 74, 68]. However, there is no established data-infrastructure blueprint for deep learning comparable to that of traditional analytical workloads, and the Modern Data Stack (MDS) still lacks the capabilities required to deploy performant deep-learning-based solutions, leading organizations to develop in-house systems.
In this paper, we introduce Deep Lake, a lakehouse specialized for deep learning workloads. What sets it apart from traditional data lakes is that it stores complex data such as images, videos, annotations, and tabular data as tensors, and rapidly streams that data over the network to deep learning frameworks without sacrificing GPU utilization. It also provides native interoperability between deep learning frameworks such as PyTorch, TensorFlow, and JAX [58, 25, 31].
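To make the streaming idea concrete, here is a minimal, hypothetical sketch (not Deep Lake's actual implementation) of a dataloader that prefetches samples on a background thread so the consuming training loop never blocks on network I/O; `fetch_sample` is a stand-in for a real object-storage read and decode step.

```python
import queue
import threading

def fetch_sample(index):
    # Stand-in for fetching and decoding one sample over the network
    # (e.g., reading an image chunk from object storage).
    return {"index": index, "data": [index] * 4}

def stream(indices, prefetch=8):
    """Yield samples while a background thread keeps fetching ahead.

    The bounded queue caps memory use at `prefetch` in-flight samples;
    the sentinel signals end of stream without closing the queue.
    """
    buf = queue.Queue(maxsize=prefetch)
    sentinel = object()

    def producer():
        for i in indices:
            buf.put(fetch_sample(i))
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not sentinel:
        yield item

samples = list(stream(range(5)))
```

A production loader would additionally overlap decompression and user-defined transforms across a pool of workers, but the same producer/consumer structure applies.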
The main technical contributions of this paper are:
• Tensor Storage Format that stores dynamically shaped arrays on object storage;
• Streaming Dataloader that schedules fetching, decompression, and user-defined transformations, optimizing data-transfer throughput to GPUs for deep learning;
• Tensor Query Language that runs SQL-like operations on top of multi-dimensional array data;
• In-browser visualization engine that streams data from object storage and renders it in the browser using WebGL.
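As a rough illustration of the first contribution, the sketch below (a toy model, not the actual TSF layout) stores each dynamically shaped array as a byte blob under a key in a dict that stands in for an object-storage bucket, with a small index recording shape and dtype so samples of differing shapes can be read back.

```python
import numpy as np

class TensorStore:
    """Toy tensor storage: each sample is serialized to bytes under a
    key, with an index mapping sample id -> (key, shape, dtype)."""

    def __init__(self):
        self.objects = {}  # stand-in for an object-storage bucket
        self.index = {}    # sample id -> metadata

    def append(self, sample_id, arr):
        key = f"chunks/{sample_id}"
        self.objects[key] = arr.tobytes()
        self.index[sample_id] = {
            "key": key, "shape": arr.shape, "dtype": str(arr.dtype)
        }

    def read(self, sample_id):
        meta = self.index[sample_id]
        raw = self.objects[meta["key"]]
        return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])

store = TensorStore()
store.append(0, np.zeros((2, 3), dtype=np.uint8))
# Dynamically shaped: the second sample has a different shape.
store.append(1, np.ones((4, 4, 3), dtype=np.uint8))
```

A production format would pack many samples into each chunk and compress them to amortize per-request latency against object storage.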
The rest of the paper is organized as follows. We begin by considering the current challenges of deep learning on unstructured data. We then introduce the Tensor Storage Format (TSF) along with its key concepts, and discuss Deep Lake's capabilities and applications within the ML lifecycle. Next, we present performance experiments and discuss their results. Finally, we review related work, list possible limitations, and conclude.
[1] The source code is available here: https://github.com/activeloopai/deeplake
[2] Documentation is available at https://docs.deeplake.ai.