PyHadoopLake: A Python-Native Framework for Building Scalable Lakehouse Architectures on Hadoop
Abstract
This paper presents PyHadoopLake, a Python-native framework designed to simplify and modernize the construction of scalable lakehouse architectures on top of the Hadoop Distributed File System (HDFS). The framework integrates extract, transform, and load (ETL) operations directly within a Python-driven pipeline while leveraging the distributed processing capabilities of Hadoop. The methodology combines PySpark with HDFS storage and the Delta Lake table format to deliver a unified and reproducible data engineering environment. Experimental results show improvements in pipeline throughput and data processing latency compared with traditional Hadoop ETL approaches. The framework offers a practical and extensible solution for organizations seeking to modernize their data infrastructure without abandoning their existing Hadoop investments.
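To make the combination of PySpark, HDFS, and Delta Lake concrete, the following is a minimal sketch of such a pipeline, not PyHadoopLake's actual API. The function names (`delta_on_hdfs_conf`, `hdfs_path`, `run_etl`), the NameNode address, and the directory layout are hypothetical and chosen only for illustration. The sketch also assumes that the `pyspark` and `delta-spark` packages, with their matching JARs, are available on the cluster.

```python
from typing import Dict


def delta_on_hdfs_conf(namenode: str = "hdfs://namenode:8020") -> Dict[str, str]:
    """Spark settings that enable Delta Lake and point Spark at HDFS.

    The NameNode address is a placeholder; substitute your cluster's value.
    """
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.hadoop.fs.defaultFS": namenode,
    }


def hdfs_path(namenode: str, *parts: str) -> str:
    """Join a NameNode URI and path segments into one HDFS URI."""
    return namenode.rstrip("/") + "/" + "/".join(p.strip("/") for p in parts)


def run_etl(raw_path: str, table_path: str, conf: Dict[str, str]) -> None:
    """Extract CSV files from HDFS, apply a small transform, load into a Delta table."""
    # Imported lazily so the helpers above work without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    builder = SparkSession.builder.appName("lakehouse-etl-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Extract: read raw CSV files (hypothetical input format).
    raw = spark.read.option("header", "true").csv(raw_path)

    # Transform: drop exact duplicate rows and stamp each row with its load time.
    cleaned = raw.dropDuplicates().withColumn("ingested_at", F.current_timestamp())

    # Load: append to a Delta table stored on HDFS.
    cleaned.write.format("delta").mode("append").save(table_path)
```

A run might call `run_etl(hdfs_path(nn, "landing", "events"), hdfs_path(nn, "lake", "events"), delta_on_hdfs_conf(nn))`, where `nn` is the cluster's NameNode URI. Keeping the configuration in a plain dictionary lets it be inspected and versioned separately from the Spark job itself.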