Designing Resilient and Scalable Cloud-Native Frameworks for Generative AI Content Production
Abstract
The rapid adoption of Generative Artificial Intelligence (GenAI) for large-scale content creation has made it urgent to replace fragile, inelastic, and costly deployments with resilient, scalable, and cost-effective cloud-native systems. This study proposes a Resilient and Scalable Cloud-Native Framework (RSCF) tailored to Generative AI content production workloads, combining microservice architecture, container orchestration, event-driven pipelines, and intelligent resource scheduling. The framework is organized into five main layers: (1) an API Gateway and Access Control Layer for secure model invocation, (2) a Model Serving Layer that exposes models through containerized, auto-scaling endpoints, (3) a Data and Feature Engineering Layer for distributed storage and real-time preprocessing, (4) an Orchestration and Workflow Layer for task coordination, and (5) an Observability and Governance Layer for monitoring, logging, compliance, and cost optimization.
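To make the interaction between the API Gateway and Model Serving layers concrete, the following minimal sketch shows a containerized text-generation endpoint. The serving stack (FastAPI with a Hugging Face pipeline), the route name, and the request schema are assumptions for illustration; the framework itself does not prescribe a specific toolchain.

    # Illustrative model-serving endpoint; stack and names are assumed, not prescribed by the framework.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    # Placeholder model; a production deployment would load the model designated by the serving layer.
    generator = pipeline("text-generation", model="gpt2")

    class GenerateRequest(BaseModel):
        prompt: str
        max_new_tokens: int = 64

    @app.post("/v1/generate")
    def generate(req: GenerateRequest):
        # The API Gateway and Access Control Layer authenticates and rate-limits callers
        # before a request reaches this containerized endpoint.
        out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
        return {"text": out[0]["generated_text"]}

In a deployment matching the framework, each such endpoint runs in its own container behind the gateway and is replicated or scaled down by the orchestration layer as demand changes.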
To strengthen resilience, the framework incorporates fault-tolerant design patterns such as circuit breakers, service-mesh traffic management, multi-region deployment strategies, and automated rollback. Scalability is provided through horizontal pod autoscaling, serverless compute integration, and GPU-accelerated high-throughput inference. The architecture is optimized for diverse generative workloads, including text, image, audio, and video synthesis, while maintaining high availability and low latency under fluctuating demand.
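The circuit-breaker pattern referenced above can be illustrated with a short, self-contained sketch; the failure threshold and reset timeout below are arbitrary assumptions rather than parameters reported in the study.

    # Minimal circuit-breaker sketch (thresholds are assumed for illustration).
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold  # consecutive failures before opening
            self.reset_timeout = reset_timeout          # seconds to wait before a half-open retry
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            # While the circuit is open, reject calls until the reset timeout elapses.
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: request rejected")
                self.opened_at = None  # half-open: allow one trial call through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()
                raise
            self.failures = 0  # a successful call closes the circuit
            return result

Wrapping calls from the gateway to a model-serving replica in such a breaker keeps a failing replica from absorbing traffic while the orchestration layer replaces it; a service mesh such as Istio can enforce the same behavior declaratively instead of in application code.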
Experimental validation shows better throughput stability, lower latency variance, and higher infrastructure utilization than monolithic and partially containerized deployments. The proposed framework offers a practical, cloud-agnostic path for enterprises to operationalize Generative AI pipelines with production-grade reliability, governance, and performance guarantees.
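For readers reproducing the latency comparison, the sketch below summarizes per-request latencies with tail percentiles and a coefficient of variation; these particular metrics are assumed as a reasonable way to quantify latency variance and are not the exact methodology of the study.

    # Illustrative latency summary; metric selection is assumed, not taken from the paper.
    import statistics

    def latency_summary(latencies_ms):
        q = statistics.quantiles(latencies_ms, n=100)  # cut points for percentiles 1..99
        mean = statistics.fmean(latencies_ms)
        return {
            "p50_ms": q[49],
            "p95_ms": q[94],
            "p99_ms": q[98],
            "mean_ms": mean,
            # Coefficient of variation: lower values indicate more stable latency.
            "cv": statistics.stdev(latencies_ms) / mean,
        }

    print(latency_summary([120, 135, 128, 410, 131, 125, 139, 122]))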