Operationalizing Lakehouse Table Formats: A Comparative Study of Iceberg, Delta, and Hudi Workloads

Janardhan Reddy Kasireddy

doi:10.15662/IJRPETM.2023.0602002

Published: 2023-03-15

Keywords:

Table format interoperability, Open table formats, Data lakehouse table formats, Apache Iceberg, Delta Lake

Janardhan Reddy Kasireddy

Lead Data Engineer, Info Drive Systems (Finra Contractor), USA

Abstract

Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi have been adopted rapidly alongside the data lakehouse architecture, and each claims ACID compliance and sophisticated transactional support. Nonetheless, organizations often choose among these formats based on market hype rather than fit to their workloads, which results in poor operational efficiency and higher costs. This paper compares how lakehouse table formats can be operationalized, focusing on engineering considerations rather than popularity.

We propose a decision matrix that weighs the dimensions along which table formats are evaluated: Change Data Capture (CDC), upserts and deletes, time travel, compaction strategies, and streaming ingestion. To validate the matrix, we run a standardized hybrid workload: a structured transactional workload based on the TPC-DS benchmark at the 1 TB scale, and a synthetic e-commerce stream of 10,000 events per second with frequent inserts, updates, and deletes. The workloads are run on Iceberg, Delta, and Hudi, and we measure operational load, performance, and overall cost.
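The synthetic streaming workload described above can be sketched as a small generator. This is a minimal illustration, not the generator used in the study: the function name `synthetic_events`, the field names, the 60/30/10 insert/update/delete mix, and the amount range are all assumptions, since the abstract gives only the event rate and the operation types.

```python
import random
from typing import Dict, Iterator, List, Tuple


def synthetic_events(
    n: int,
    seed: int = 0,
    mix: Tuple[float, float, float] = (0.6, 0.3, 0.1),  # assumed insert/update/delete mix
) -> Iterator[Dict]:
    """Yield n hypothetical e-commerce order events.

    Updates and deletes always target an order that currently exists,
    so the stream can be replayed as upserts and deletes against a table.
    """
    rng = random.Random(seed)
    live: List[int] = []  # order ids that have been inserted and not yet deleted
    next_id = 1
    for seq in range(n):
        op = rng.choices(["insert", "update", "delete"], weights=mix)[0]
        if not live:
            op = "insert"  # nothing to update or delete yet
        if op == "insert":
            order_id = next_id
            next_id += 1
            live.append(order_id)
        elif op == "update":
            order_id = rng.choice(live)
        else:
            order_id = live.pop(rng.randrange(len(live)))
        # Deletes carry no payload; the amount range is an arbitrary placeholder.
        amount = None if op == "delete" else round(rng.uniform(1.0, 500.0), 2)
        yield {"seq": seq, "op": op, "order_id": order_id, "amount": amount}


if __name__ == "__main__":
    # One batch of 10,000 events stands in for one second of the stated rate.
    batch = list(synthetic_events(10_000, seed=42))
    counts = {op: sum(e["op"] == op for e in batch) for op in ("insert", "update", "delete")}
    print(counts)
```

Emitting one such batch per second would approximate the stated rate of 10,000 events per second; how each batch is written to Iceberg, Delta, or Hudi depends on the engine and is left out of this sketch.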

Findings show large differences in performance and maintenance overhead depending on the table format and workload characteristics. Iceberg performs better on time-travel queries and advanced streaming scenarios, Delta offers more efficient compaction and simpler CDC, and Hudi is better at incremental updates, at moderate operational complexity. The decision matrix enables engineers to choose a format according to workload patterns, reducing guesswork and improving performance.
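One way such a decision matrix could be applied is as a weighted score per format. The sketch below is an assumption about the mechanics, not the paper's actual matrix: the function `rank_formats`, the 1 to 5 scoring scale, and the placeholder names `format_a` and `format_b` are hypothetical, and the numbers are illustrative inputs rather than measured results.

```python
from typing import Dict, List, Tuple

# The five dimensions named in the abstract.
DIMENSIONS = ["cdc", "upserts_deletes", "time_travel", "compaction", "streaming_ingestion"]


def rank_formats(
    weights: Dict[str, float],
    scores: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Return (format, weighted average score) pairs, best first.

    weights: importance of each dimension for a given workload.
    scores:  per-format score on each dimension (assumed 1-5 scale).
    """
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    ranked = {
        fmt: sum(weights[d] * per_dim[d] for d in DIMENSIONS) / total_weight
        for fmt, per_dim in scores.items()
    }
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical workload that cares mostly about time travel.
    weights = {"cdc": 1, "upserts_deletes": 1, "time_travel": 5,
               "compaction": 1, "streaming_ingestion": 1}
    # Placeholder scores; real values would come from benchmark measurements.
    scores = {
        "format_a": {"cdc": 1, "upserts_deletes": 1, "time_travel": 5,
                     "compaction": 1, "streaming_ingestion": 1},
        "format_b": {d: 3 for d in DIMENSIONS},
    }
    for fmt, score in rank_formats(weights, scores):
        print(f"{fmt}: {score:.2f}")
```

Normalizing by the total weight keeps the result on the same scale as the input scores, so rankings for different workload profiles can be compared directly.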

This work offers practical guidance to engineers and architects working on lakehouses: a data-driven method for choosing table formats that balances performance, operational effort, and cost, facilitating the adoption of transactional data lakes in enterprise settings.

Issue

Vol. 6 No. 2 (2023): International Journal of Research Publications in Engineering, Technology and Management

Section

Articles

How to Cite

Operationalizing Lakehouse Table Formats: A Comparative Study of Iceberg, Delta, and Hudi Workloads. (2023). International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 6(2), 8371-8381. https://doi.org/10.15662/IJRPETM.2023.0602002
