Chaos engineering
Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.[1]
Concept
In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of service—often generalized as resilience—is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge of the field. Chaos engineering is a technique to meet the resilience requirement.
Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.
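As a minimal sketch of application-level fault injection, the Python below wraps a call so that a configurable fraction of invocations fail, then checks whether a simple retry policy absorbs the failures. All names here (`with_faults`, `call_with_retry`, `fetch_profile`, `InjectedFault`) are hypothetical and chosen for illustration; they do not belong to any particular chaos engineering tool.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_faults(func, failure_rate, rng):
    """Wrap `func` so that roughly `failure_rate` of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """The resilience measure under test: retry up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


def fetch_profile():
    # Hypothetical dependency call; stands in for a network request.
    return {"user": "demo"}


if __name__ == "__main__":
    rng = random.Random(0)
    flaky = with_faults(fetch_profile, failure_rate=0.3, rng=rng)
    try:
        print("result:", call_with_retry(flaky, attempts=5))
    except InjectedFault as err:
        print("retries exhausted:", err)
```

The same wrapper idea could be extended to inject latency instead of errors, which is closer to a network-level experiment.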
Operational readiness using chaos engineering
Estimating how much confidence can be placed in the interconnected, complex systems deployed to a production environment requires operational readiness metrics. Operational readiness can be evaluated using chaos engineering simulations, for example on Kubernetes infrastructure running big data workloads. Improving a platform's operational readiness means strengthening its backup, restore, network file transfer, and failover capabilities, as well as its overall security. Gautam Siwach et al. evaluated the effect of inducing chaos in a Kubernetes environment by terminating random pods handling data from edge devices in data centers while analytics were processed on a big data network, and used the recovery time of the pods to calculate an estimated response time.[2][3]
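The measurement idea described above (terminate random pods, then summarize how long they take to recover) can be sketched as a pure simulation. This is not the cited authors' method and makes no calls to a real Kubernetes cluster: the pod names, the `simulated_restart` function, and the assumed 2 to 10 second restart range are all hypothetical stand-ins for real measurements.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class PodKill:
    """One simulated experiment: which pod was killed and how long recovery took."""
    pod: str
    recovery_seconds: float


def run_pod_kill_experiments(pods, rounds, restart_time, rng):
    """Terminate one random pod per round and record its simulated recovery time.

    `restart_time` is a hypothetical callable standing in for a real measurement,
    such as the time until a replacement pod reports ready.
    """
    results = []
    for _ in range(rounds):
        victim = rng.choice(pods)
        results.append(PodKill(victim, restart_time(victim, rng)))
    return results


def summarize(results):
    """Return the mean and worst-case recovery time across all experiments."""
    times = [r.recovery_seconds for r in results]
    return {"mean": statistics.mean(times), "max": max(times), "count": len(times)}


def simulated_restart(pod, rng):
    # Assumption: restarts take between 2 and 10 seconds, uniformly distributed.
    return rng.uniform(2.0, 10.0)


if __name__ == "__main__":
    rng = random.Random(42)
    pods = [f"analytics-{i}" for i in range(5)]
    print(summarize(run_pod_kill_experiments(pods, 20, simulated_restart, rng)))
```

In a real experiment, `simulated_restart` would be replaced by an observation of the cluster, and the summary would feed an operational readiness metric.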
History
1983 – Apple
While MacWrite and MacPaint were being developed for the first Apple Macintosh computer, Steve Capps created "Monkey", a desk accessory which randomly generated user interface events at high speed, simulating a monkey frantically banging the keyboard and moving and clicking the mouse. It was promptly put to use for debugging by generating errors for programmers to fix, because automated testing was not possible; the first Macintosh had too little free memory space for anything more sophisticated.[4]
2003 – Amazon
While working to improve website reliability at Amazon, Jesse Robbins created "GameDay", an initiative that increases reliability by purposefully creating major failures on a regular basis. Robbins has said GameDay was inspired by firefighter training and by lessons from research in other fields, including complex systems and reliability engineering.[5]
2006 – Google
While at Google, Kripa Krishnan created a similar program to Amazon's GameDay called "DiRT".[5]
2011 – Netflix
While overseeing Netflix's migration to the cloud in 2011, Nora Jones, Casey Rosenthal, and Greg Orzell[6][7][8] expanded the discipline by setting up a tool that would cause breakdowns in the production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to one in which breakdowns were considered inevitable, driving developers to treat built-in resilience as an obligation rather than an option:
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."[9]
By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
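A toy model of this kind of experiment, assuming a service that stays available as long as at least one instance is alive, might look like the following. The `Service` class and `kill_random_instance` function are hypothetical illustrations and are not taken from Chaos Monkey's code.

```python
import random


class Service:
    """A toy redundant service: a request succeeds if any instance is alive."""

    def __init__(self, instance_ids):
        self.alive = {i: True for i in instance_ids}

    def kill(self, instance_id):
        self.alive[instance_id] = False

    def handle_request(self):
        # Stand-in for a load balancer routing around dead instances.
        return any(self.alive.values())


def kill_random_instance(service, rng):
    """Disable one randomly chosen live instance; return its id, or None if none are left."""
    candidates = [i for i, up in service.alive.items() if up]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    service.kill(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(7)
    svc = Service(["i-1", "i-2", "i-3"])
    victim = kill_random_instance(svc, rng)
    print(f"killed {victim}; service still answering: {svc.handle_request()}")
```

With three instances, a single kill leaves the service answering; repeating the kill until `handle_request()` returns `False` shows how much redundancy the toy architecture actually has.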
The concept of chaos engineering is close to that of Phoenix Servers, introduced by Martin Fowler in 2012.[10]
Chaos engineering tools
Chaos Monkey
Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure.[7] It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.
The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.[11][12]
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:[13]
Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.
Simian Army
The Simian Army[12] is a suite of tools developed by Netflix to test the reliability, security, or resilience of its Amazon Web Services infrastructure and includes the following tools:[14]
- At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region".[15] Though rare, loss of an entire region does happen, and Chaos Kong simulates a system's response and recovery to this type of event.
- Chaos Gorilla drops a full Amazon "Availability Zone" (one or more entire data centers serving a geographical region).[16]
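The two failure scopes above differ only in how much of the fleet they remove. The sketch below models that difference on an in-memory list of instances; the `Instance` type, the region and zone labels, and the "at least one survivor" availability rule are assumptions made for illustration, not Netflix's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    region: str
    zone: str


def drop_zone(fleet, region, zone):
    """Chaos Gorilla-style failure: remove every instance in one availability zone."""
    return [i for i in fleet if not (i.region == region and i.zone == zone)]


def drop_region(fleet, region):
    """Chaos Kong-style failure: remove every instance in one region."""
    return [i for i in fleet if i.region != region]


def survives(fleet):
    """The toy service counts as available while at least one instance remains."""
    return len(fleet) > 0


if __name__ == "__main__":
    # Hypothetical fleet spread over two regions with two zones in the first.
    fleet = [
        Instance("web-1", "region-a", "zone-1"),
        Instance("web-2", "region-a", "zone-2"),
        Instance("web-3", "region-b", "zone-1"),
    ]
    print("after zone loss:", survives(drop_zone(fleet, "region-a", "zone-1")))
    print("after region loss:", survives(drop_region(fleet, "region-a")))
```

A fleet confined to a single region would pass the zone test but fail the region test, which is the kind of gap these larger-scope experiments are meant to expose.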
Proofdock chaos engineering platform
Proofdock is a chaos engineering platform built for the Microsoft Azure platform and the Azure DevOps services. Users can inject failures at the infrastructure, platform, and application levels.[17]
Gremlin
Gremlin is a "failure-as-a-service" platform.[18]
Facebook Storm
To prepare for the loss of a data center, Facebook regularly tests the resilience of its infrastructure to extreme events. Known as the Storm Project, the program simulates massive data center failures.[19]
Days of Chaos
Voyages-sncf.com created a "Day of Chaos"[20] in 2017, gamifying the simulation of pre-production failures.[21] They presented their results at the 2017 DevOps REX conference.[22]
See also
- Fault injection
- Fault tolerance
- Fault-tolerant computer system
- Data redundancy
- Error detection and correction
- Fall back and forward
- Grease (networking)
- Resilience (network)
- Robustness (computer science)
Notes and references
- ^ "Principles of Chaos Engineering". principlesofchaos.org. Retrieved 21 October 2017.
- ^ Siwach, Gautam (29 November 2022). Evaluating operational readiness using chaos engineering simulations on Kubernetes architecture in Big Data (pdf). 2022 International Conference on Smart Applications, Communications and Networking (SmartNets). Botswana. pp. 1–7. Retrieved 3 January 2023.
- ^ "Machine Learning Podcast Host and Technology Influencer: Gautam Siwach". LA Weekly. 7 October 2022.
- ^ Hertzfeld, Andy. "Monkey Lives". Folklore. Retrieved 11 September 2023.
- ^ a b Limoncelli, Tom (13 September 2012). "Resilience Engineering: Learning to Embrace Failure". ACM Queue. 10 (9) – via ACM.
- ^ Jones, Nora; Rosenthal, Casey (2020). Chaos Engineering (1st ed.). O'Reilly Media. ISBN 9781492043867. OCLC 1143015464.
- ^ a b "The Netflix Simian Army". Netflix Tech Blog. Medium. 19 July 2011. Retrieved 21 October 2017.
- ^ US 20120072571, Orzell, Gregory S. & Izrailevsky, Yury, "Validating the resiliency of networked applications", published 2012-03-22
- ^ "Netflix Chaos Monkey Upgraded". Netflix Tech Blog. Medium. 19 October 2016. Retrieved 21 October 2017.
- ^ "PhoenixServer". martinFowler.com. Martin Fowler (software engineer). 10 July 2012. Retrieved 14 January 2021.
- ^ "Netflix libère Chaos Monkey dans la jungle Open Source" [Netflix releases Chaos Monkey into the open source jungle]. Le Monde Informatique (in French). Retrieved 7 November 2017.
- ^ a b "SimianArmy: Tools for your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures". Netflix, Inc. 20 October 2017. Retrieved 21 October 2017.
- ^ "Mais qui sont ces singes du chaos ?" [But who are these monkeys of chaos?]. 15marches (in French). 25 July 2017. Retrieved 21 October 2017.
- ^ SemiColonWeb (8 December 2015). "Infrastructure : quelles méthodes pour s'adapter aux nouvelles architectures Cloud ? - D2SI Blog". D2SI Blog (in French). Archived from the original on 21 October 2017. Retrieved 7 November 2017.
- ^ "Chaos Engineering Upgraded". medium.com. 19 April 2017. Retrieved 10 April 2020.
- ^ "The Netflix Simian Army". medium.com. Retrieved 12 December 2017.
- ^ "A chaos engineering platform for Microsoft Azure". medium.com. 25 June 2020. Retrieved 28 June 2020.
- ^ "Gremlin raises $18 million to expand 'failure-as-a-service' testing platform". VentureBeat. 28 September 2018. Retrieved 24 October 2018.
- ^ Hof, Robert (11 September 2016). "Interview: How Facebook's Project Storm Heads Off Data Center Disasters". Forbes. Retrieved 21 October 2017.
- ^ "Days of Chaos". Days of Chaos (in French). Retrieved 18 February 2022.
- ^ "DevOps: feedback from Voyages-sncf.com". Moderator's Blog (in French). 17 March 2017. Retrieved 21 October 2017.
- ^ devops REX (3 October 2017). "[devops REX 2017] Days of Chaos : le développement de la culture devops chez Voyages-Sncf.com à l'aide de la gamification". Retrieved 18 February 2022.
External links
- Principles of Chaos Engineering – The Chaos Engineering manifesto
- Chaos Engineering – Adrian Hornsby
- How Chaos Engineering Practices Will Help You Design Better Software – Mariano Calandra