DESIGNING FAULT-TOLERANT DISTRIBUTED SYSTEMS FOR HIGH-AVAILABILITY CONSUMER INTERNET PLATFORMS
Abstract
Building fault-tolerant distributed systems for consumer internet platforms is arguably one of the most difficult tasks facing software engineers today. Social media, e-commerce, video-streaming, and fintech services must remain available around the clock to millions of simultaneous users in geographically dispersed regions. Degraded performance or an outage leads to financial losses, reputational damage, and customer churn. This paper gives a detailed discussion of the architectures adopted to achieve high availability and fault tolerance in distributed systems. It frames the underlying theory in terms of the CAP theorem and consistency models, and it examines practical techniques such as redundancy, replication, consensus protocols, circuit breakers, service mesh technologies, chaos engineering, and CI/CD release pipelines that limit the blast radius of deployments. Drawing on state-of-the-art research and established best practices, the paper offers a clear picture for practitioners interested in building highly available internet infrastructure.