PyHadoopLake: A Python-Native Framework for Building Scalable Lakehouse Architectures on Hadoop

Sunil Kumar Mudusu

doi:10.15662/IJRPETM.2022.0505007

Published: 2022-09-14

Keywords:

Data Lakehouse, Hadoop, Python, ETL Pipeline, PySpark, Distributed Computing, HDFS

Sunil Kumar Mudusu

AI Data Engineer | Highmark Health Solutions | Pittsburgh, PA, USA

Abstract

This paper presents PyHadoopLake, a Python-native framework designed to simplify and modernize the construction of scalable lakehouse architectures on top of the Hadoop Distributed File System (HDFS). The framework integrates extract, transform, and load (ETL) operations directly within a Python-driven pipeline while leveraging Hadoop's distributed processing capabilities. The methodology combines PySpark with HDFS and the Delta Lake table format to deliver a unified and reproducible data engineering environment. Experimental results show improved pipeline throughput and lower data processing latency compared with traditional Hadoop ETL approaches. The framework offers a practical and extensible option for organizations seeking to modernize their data infrastructure without abandoning their existing Hadoop investments.
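
To make the combination of PySpark, HDFS, and Delta Lake more concrete, the following is a minimal sketch of an ETL step written with plain PySpark and Delta Lake calls. It is not the PyHadoopLake API, which this page does not show. The `LakePaths` class, the namenode address, the directory names, and the `run_etl` function are hypothetical and chosen only for illustration. Running `run_etl` also assumes a Spark installation with the Delta Lake package available on the classpath.

```python
# Illustrative sketch only: plain PySpark + Delta Lake calls, not the PyHadoopLake API.
from dataclasses import dataclass


@dataclass(frozen=True)
class LakePaths:
    """Hypothetical HDFS layout; the namenode address and directories are assumptions."""
    namenode: str = "hdfs://namenode:8020"
    raw_dir: str = "/lake/raw/events"
    delta_dir: str = "/lake/delta/events"

    @property
    def raw_uri(self) -> str:
        return f"{self.namenode}{self.raw_dir}"

    @property
    def delta_uri(self) -> str:
        return f"{self.namenode}{self.delta_dir}"


def delta_session_conf() -> dict:
    """Spark settings that enable the Delta Lake SQL extension and catalog."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def run_etl(paths: LakePaths) -> None:
    """Extract CSV files from HDFS, apply a small transform, load into a Delta table."""
    # Imported lazily so the helpers above work without a Spark installation.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("lakehouse-etl-sketch")
    for key, value in delta_session_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Extract: read raw CSV files with a header row from HDFS.
    raw = spark.read.option("header", "true").csv(paths.raw_uri)
    # Transform: drop exact duplicate rows and rows where every column is null.
    cleaned = raw.dropDuplicates().na.drop(how="all")
    # Load: overwrite the Delta table stored at the target HDFS path.
    cleaned.write.format("delta").mode("overwrite").save(paths.delta_uri)
    spark.stop()
```

Keeping the path layout and session settings in small pure-Python helpers means they can be checked without a cluster, while `run_etl(LakePaths())` would be the call made on a node that can reach HDFS.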

Issue

Vol. 5 No. 5 (2022): The International Journal of Research Publications in Engineering, Technology and Management

Section

Articles

How to Cite

Sunil Kumar Mudusu. (2022). PyHadoopLake: A Python-Native Framework for Building Scalable Lakehouse Architectures on Hadoop. International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 5(5), 7449-7452. https://doi.org/10.15662/IJRPETM.2022.0505007

