Why Data Engineering, Not Model Scale, Became the True Bottleneck in Generative AI


Samanth Gurram

Abstract

This paper argues that data engineering, not model scale, is the decisive factor in deploying generative AI. While foundation models grew in size and benchmark scores, enterprise data systems did not keep pace. A quantitative analysis of multiple AI projects measures usable data ratios, deployment gaps, a data quality index, bias amplification, and usability skew in serving. The results indicate that the fitness of the data infrastructure explains success in production better than the size of the model does. Retrieval-augmented systems in particular depend heavily on controlled, clean streams of information. The findings show that accountable, scalable AI requires a sound data engineering foundation.


 

How to Cite

Why Data Engineering, Not Model Scale, Became the True Bottleneck in Generative AI. (2023). International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 6(4), 9028-9036. https://doi.org/10.15662/IJRPETM.2023.0604008
