author:
(1) Sasun Hambarzumian, Activeloop, Mountain View, CA, USA
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA
(3) Levon Ghukasian, Activeloop, Mountain View, CA, USA
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA
(6) David Isayan, Activeloop, Mountain View, CA, USA
(7) Mark McQuaid, Activeloop, Mountain View, CA, USA
(8) Mikhail Harutyunyan, Activeloop, Mountain View, CA, USA
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA
(10) Ivo Stranik, Activeloop, Mountain View, CA, USA
(11) David Buniatian, Activeloop, Mountain View, CA, USA
2. Current Issues
This section discusses the current and past challenges of managing unstructured and complex data.
2.1 Complex Data Types in the Database
Storing binary data, such as images, directly in a database is generally not recommended. Databases are not optimized for storing and serving large files, so doing so causes performance issues and slower load times for users. Binary data also fits poorly into a database's structured format, making it difficult to query and manipulate. Finally, databases are typically more expensive to operate and maintain than other types of storage, such as file systems or cloud object storage, so keeping large amounts of binary data in a database is more costly than alternative storage solutions.
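The usual workaround is to keep the binary payload on a file system or object store and record only a reference plus queryable metadata in the database. The following is a minimal sketch of that pattern using SQLite and a local directory as a stand-in for the object store; file names and the schema are illustrative, not part of any system described here:

```python
import os
import sqlite3
import tempfile

# Illustrative setup: a temp directory plays the role of the object store.
store = tempfile.mkdtemp()
image_path = os.path.join(store, "cat_001.jpg")
with open(image_path, "wb") as f:
    f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 1024)  # stand-in JPEG bytes

# The database holds only a pointer and queryable metadata, not the blob.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, uri TEXT, label TEXT, bytes INTEGER)"
)
db.execute(
    "INSERT INTO images (uri, label, bytes) VALUES (?, ?, ?)",
    (image_path, "cat", os.path.getsize(image_path)),
)

# Queries run against the lightweight metadata; the blob is fetched lazily.
(uri,) = db.execute("SELECT uri FROM images WHERE label = 'cat'").fetchone()
with open(uri, "rb") as f:
    payload = f.read()
print(len(payload))  # 1028
```

The database stays small and fast to query, while the large payload lives on storage optimized for serving it.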
2.2 Complex Data and Tabular Formats
The rise of large-scale analytical and BI workloads has led to the development of compressed structured formats such as Parquet, ORC, and Avro, as well as transient in-memory formats such as Arrow [79, 6, 20, 13]. As tabular formats have gained adoption, attempts to extend them for deep learning have emerged, such as Petastorm [18] and Feather [7], and new formats for deep learning continue to appear. To the best of our knowledge, these formats have not yet been widely adopted. This approach mainly benefits from native integration with the Modern Data Stack (MDS). However, as mentioned above, it requires fundamental changes to upstream tools to adapt to deep learning applications.
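The core advantage of these columnar formats can be illustrated without any of the libraries above. This pure-Python sketch, on made-up data, shows why storing each column contiguously (as Parquet and Arrow do) compresses better than interleaving fields row by row:

```python
import random
import struct
import zlib

random.seed(0)
n = 10_000
ids = list(range(n))                                  # monotonically increasing ints
temps = [random.gauss(20.0, 1.0) for _ in range(n)]   # noisy floats

# Row-oriented layout: the fields of each record are interleaved.
row_bytes = b"".join(struct.pack("<id", i, t) for i, t in zip(ids, temps))
# Column-oriented layout: each column is stored contiguously.
col_bytes = struct.pack(f"<{n}i", *ids) + struct.pack(f"<{n}d", *temps)

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
print(row_size, col_size)  # the columnar layout compresses noticeably better
```

Contiguous columns expose the redundancy within each column (here, the near-sequential ids) to the compressor, which is exactly the property that large images and tensors, with their variable shapes, fail to provide in a tabular row.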
2.3 Object Storage for Deep Learning
The current cloud-native choice for storing large unstructured datasets is object storage such as AWS S3 [1], Google Cloud Storage (GCS) [3], or MinIO [17]. Object storage has three main advantages over distributed network file systems: (a) it is cost-effective, (b) it is scalable, and (c) it acts as a format-agnostic repository. However, cloud object storage is not without drawbacks. First, it incurs latency overhead, which is especially noticeable when iterating over many small files such as text or JSON documents. Second, ingesting unstructured data without metadata control can produce a "data swamp." Third, although object storage has built-in versioning, it is rarely used in data science workflows. Finally, data on object storage is copied to virtual machines before training, incurring storage overhead and additional cost.
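The small-file latency penalty is visible with back-of-the-envelope arithmetic. The numbers below (50 ms per request, 100 MB/s throughput) are illustrative assumptions under a naive serial-download model, not measured figures for any provider:

```python
def fetch_seconds(n_objects, total_mb, latency_s=0.05, throughput_mb_s=100):
    """Naive serial model: pay per-request latency once per object, plus transfer time."""
    return n_objects * latency_s + total_mb / throughput_mb_s

# The same 10 GB dataset fetched as a million small files vs. 1,000 shards.
small = fetch_seconds(n_objects=1_000_000, total_mb=10_000)
sharded = fetch_seconds(n_objects=1_000, total_mb=10_000)
print(round(small), round(sharded))  # 50100 vs. 150 seconds
```

Real clients issue requests in parallel, which shrinks both figures, but the per-request overhead still dominates when object count is high, which is why deep learning datasets are typically sharded or chunked before being placed on object storage.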
2.4 Second Generation Data Lakes
Delta, Iceberg, and Hudi [27, 15, 10] lead the second generation of data lakes. They extend object storage by managing tabular files with the following key properties:
(1) Update operations: inserting or deleting rows on top of a tabular file.
(2) Streaming: Downstream data ingestion with ACID properties and upstream integration with a query engine that exposes a SQL interface.
(3) Schema evolution: Evolve column structures while maintaining backward compatibility.
(4) Time travel and audit-log tracking: preserving historical state with rollback, so that queries can be replayed, and supporting row-level control of data lineage.
(5) Layout optimization: Built-in optimization for file size and data compression with support for custom ordering, significantly improving query speed.
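Properties (1) and (4) can be sketched in a few lines. The toy versioned table below is not any real Delta, Iceberg, or Hudi API; it only illustrates the idea that each commit produces an immutable snapshot, so any past version can be queried or rolled back to:

```python
class VersionedTable:
    """Toy copy-on-write table: each commit records a full immutable snapshot."""

    def __init__(self):
        self.snapshots = [{}]  # version 0 is the empty table

    def commit(self, inserts=(), deletes=()):
        rows = dict(self.snapshots[-1])   # copy-on-write: never mutate old versions
        for key, value in inserts:
            rows[key] = value             # update operation: insert a row
        for key in deletes:
            rows.pop(key, None)           # update operation: delete a row
        self.snapshots.append(rows)
        return len(self.snapshots) - 1    # the new version number

    def read(self, version=None):
        """Time travel: read the table as of any committed version."""
        return self.snapshots[-1 if version is None else version]


table = VersionedTable()
v1 = table.commit(inserts=[(1, "cat"), (2, "dog")])
v2 = table.commit(deletes=[1])
print(table.read())    # latest state: {2: 'dog'}
print(table.read(v1))  # replayed past query: {1: 'cat', 2: 'dog'}
```

Production systems achieve the same effect far more economically by logging file-level deltas in a transaction log rather than copying full snapshots, but the read-as-of-version semantics are the same.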
However, second-generation data lakes remain bound by the limitations of their underlying tabular formats, which, as discussed in Section 2.2, are ill-suited to deep learning. Therefore, in this paper, we extend second-generation data lake capabilities to deep learning use cases by rethinking the format and the upstream features (querying, visualization, and native integration into deep learning frameworks), completing the ML lifecycle as shown in Figure 2.