Fault Injection 21 July 2023

Democratizing Chaos Engineering with Fault Injection Testing

Pablo Chacin

Traditional testing methodologies and tools are inadequate for finding and fixing the new class of failures that result from microservices interacting with each other in complex ways while running on top of intricate infrastructures.

One real-world example is a site-wide failure in www.amazon.com that started with a single server failing to retrieve product information from a catalog. Due to mishandling of the error, the server started returning empty responses. Because it was not actually accessing the catalog, it was returning these empty responses much faster than usual. This caused the load balancer to send more and more traffic to this failing server instead of to the working replicas, eventually taking the whole site down. 1
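The feedback loop in this incident can be reproduced with a small simulation. The sketch below assumes a load balancer that routes each request to the replica with the fewest in-flight requests. The source does not say which routing policy was actually in use, and the replica count and timings are made-up values chosen for illustration.

```python
import heapq


def simulate(num_requests, service_times_ms, arrival_interval_ms=1.0):
    """Count requests per replica under a least-in-flight routing policy.

    The policy and all timings are assumptions for illustration only.
    """
    replicas = range(len(service_times_ms))
    in_flight = [0] * len(service_times_ms)
    handled = [0] * len(service_times_ms)
    completions = []  # min-heap of (finish_time_ms, replica)

    for i in range(num_requests):
        now = i * arrival_interval_ms
        # Release every request that has finished by now.
        while completions and completions[0][0] <= now:
            _, done = heapq.heappop(completions)
            in_flight[done] -= 1
        # Pick the replica that looks least busy (lowest index wins ties).
        target = min(replicas, key=lambda r: in_flight[r])
        in_flight[target] += 1
        handled[target] += 1
        heapq.heappush(completions, (now + service_times_ms[target], target))
    return handled


# Three healthy replicas take 100 ms per request; the broken one returns
# an empty response in 0.5 ms because it never reaches the catalog.
handled = simulate(num_requests=1000, service_times_ms=[100, 100, 100, 0.5])
print(handled)  # the last entry is the broken replica
```

Because the broken replica always looks idle, it ends up handling the vast majority of the requests, which is the effect described above.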

Chaos Engineering is a discipline that emerged as a response to this reality. It builds on the idea of experimenting on a system by injecting different types of faults to uncover its systemic weaknesses.

Chaos Engineering proposes that organizations can build confidence in their ability to withstand turbulent conditions by proactively and continuously subjecting their production systems to faults, instead of waiting for incidents to occur. Later in this article, we will see how Fault Injection builds on top of this concept and allows for granular, controlled, and reproducible chaos testing.

The practice started at large internet companies, Netflix being a notable example. These companies eventually open-sourced their tools and championed its adoption.

The Challenges of Adopting Chaos Engineering

Despite its promises, some problems still need to be addressed for Chaos Engineering to be adopted by most organizations.

Chaos Engineering raises the adoption bar too high by focusing on practices such as testing in production and running planned outages that most organizations are not prepared for.

Adopting Chaos Engineering tools may also be challenging. Most of them are designed for teams with in-depth knowledge of infrastructure and operations, and installing and operating them adds a significant operational burden.

Chaos Engineering’s focus on testing infrastructure-level faults, such as killing or overloading instances or disrupting the network, doesn’t align well with how most developers think about their applications: as services that interact through protocols such as HTTP and gRPC.

Finally, Chaos experiments are hard to reproduce and their results are hard to predict. Injecting faults that disrupt infrastructure resources may affect multiple application components, increasing their blast radius and introducing unexpected side effects.

Therefore, we believe that two fundamental changes are needed to promote the broader adoption of Chaos Engineering.

First, Chaos Engineering must evolve towards lightweight practices that require low up-front investment, deliver tangible short-term benefits, and can grow as the organization matures.

Second, Chaos Engineering tooling needs to be adapted to a development-focused audience and be embedded earlier in the development process, effectively shifting chaos testing left.

To achieve the objectives stated above, we propose the adoption of Fault Injection Testing.

Fault Injection Testing

Fault injection is the software testing technique of introducing faults into a system to validate whether it can endure and recover from those conditions. Its main purpose is to exercise the error-handling code paths.

This is not a novel idea. It has been used extensively in the development of safety-critical systems. However, the challenge for modern applications is to inject the complex error patterns they will experience in their interactions with other components.

These patterns are hard to reproduce using unit tests or even integration tests. Therefore, being able to inject them into a live system is more cost-effective.

Fault injection testing differs from established chaos testing practices in two significant ways.

First, it shifts the emphasis from experimenting to uncover unknown faults to verifying the proper handling of known or expected faults.

This change comes from the realization that many catastrophic failures come from deficiencies in the logic for handling non-critical errors. Examples include cascading errors produced by deficient or nonexistent retry logic, and retry storms produced by poorly tuned timeout and retry settings.
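To see why retry settings matter, consider how retries multiply across a chain of services. The sketch below is an illustrative model, not taken from any real system: `call_chain`, the chain depth, and the number of attempts are all hypothetical. It counts how many calls reach a failing dependency when every service above it retries.

```python
def call_chain(depth, attempts, counter):
    """Call through `depth` services, each making up to `attempts` tries.

    The innermost dependency always fails, the worst case for retries.
    Returns True on success; counter["backend_calls"] tracks the load.
    """
    if depth == 0:
        counter["backend_calls"] += 1
        return False  # the failing dependency
    for _ in range(attempts):
        if call_chain(depth - 1, attempts, counter):
            return True
    return False


counter = {"backend_calls": 0}
call_chain(depth=3, attempts=3, counter=counter)
print(counter["backend_calls"])  # 27 = 3 ** 3 calls for one user request
```

With three layers making three attempts each, a single user request turns into 27 calls against a dependency that is already struggling.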

Therefore, it is important to be able to subject the application to a comprehensive set of well-defined fault scenarios and validate how those faults are handled.

Second, it does not attempt to reproduce the root causes of known incidents, such as resource exhaustion or compute instances becoming unavailable. Instead, it focuses on simulating the effects of these incidents on the applications.

This change comes from the realization that in distributed systems infrastructure faults eventually surface at the protocol level. In particular, they affect the latency and the return codes of requests between services.
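A fault expressed at the protocol level can be as simple as a wrapper that delays requests and replaces some responses with an error code. The following is a minimal sketch under stated assumptions: `HttpFault`, `with_fault`, and the `catalog` handler are hypothetical names introduced for illustration, and a real tool would intercept network traffic rather than wrap a Python function.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class HttpFault:
    """Hypothetical fault description in protocol-level terms."""
    error_rate: float = 0.0       # fraction of requests that fail, 0.0 to 1.0
    error_code: int = 503         # status code returned for injected errors
    added_latency_s: float = 0.0  # delay added to every request


def with_fault(handler, fault, rng=None, sleep=time.sleep):
    """Wrap `handler(request) -> (status, body)` so it exhibits `fault`."""
    if rng is None:
        rng = random.Random()

    def faulty_handler(request):
        if fault.added_latency_s > 0:
            sleep(fault.added_latency_s)
        if rng.random() < fault.error_rate:
            return fault.error_code, ""
        return handler(request)

    return faulty_handler


def catalog(request):
    """Stand-in for a healthy downstream service."""
    return 200, f"product {request}"


# 10% of requests fail with 503 and every request is 50 ms slower.
slow_and_flaky = with_fault(catalog, HttpFault(error_rate=0.1, added_latency_s=0.05))
status, body = slow_and_flaky("42")
print(status, body)
```

The caller sees only higher latency and occasional error codes, regardless of which infrastructure problem would have caused them in production.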

Four Tenets of Fault Injection Testing

We postulate that for Fault Injection Testing to overcome the limitations of Chaos Engineering, it should be based on the following principles:

Incremental adoption

Organizations should be able to incorporate fault injection testing into their existing teams and development processes in an incremental manner in order to increase their understanding of their systems and build confidence in their ability to operate reliably.

Application-centric fault injection

Developers should be able to reproduce in their tests the same fault patterns observed in their applications in familiar terms such as latency and error rates without having to understand the underlying infrastructure.

Fault injection as code

Developers should be able to control all aspects of the fault injection test as part of the test automation code: load generators, test validations, and fault injections.

Controlled tests

Fault injection tests should be reproducible and predictable and have a minimal blast radius. Tests should be possible in shared infrastructures with little or no interference between teams and services.
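As a rough illustration of how these tenets could fit together, the sketch below keeps the fault, the load, and the validations in one test script. Everything in it is hypothetical: `make_faulty`, `pricing_service`, `get_price`, and the `x-fault-test` marker are names invented for this example. The fault is described as an error rate, a seeded random generator makes runs reproducible, and only requests carrying the marker are affected.

```python
import random


def make_faulty(handler, error_rate, seed, only_if):
    """Fail a fraction of the requests selected by `only_if` with a 503.

    A seeded generator makes the fault sequence reproducible, and the
    predicate keeps untargeted traffic untouched (small blast radius).
    """
    rng = random.Random(seed)

    def wrapped(request):
        if only_if(request) and rng.random() < error_rate:
            return 503, None
        return handler(request)

    return wrapped


def pricing_service(request):
    """Hypothetical dependency of the code under test."""
    return 200, 9.99


def get_price(request, dependency, max_attempts=3):
    """Code under test: retry on 5xx, then fall back to a default price."""
    for _ in range(max_attempts):
        status, price = dependency(request)
        if status < 500:
            return price
    return 0.0  # degraded but valid answer


def run_test(seed):
    dependency = make_faulty(
        pricing_service,
        error_rate=0.5,
        seed=seed,
        only_if=lambda req: req.get("x-fault-test") == "1",
    )
    # Load generation: 200 requests tagged as test traffic.
    results = [get_price({"x-fault-test": "1"}, dependency) for _ in range(200)]
    # Validation: every request gets a usable answer despite the faults.
    assert all(price in (9.99, 0.0) for price in results)
    # Untagged traffic must never see an injected fault.
    assert all(dependency({})[0] == 200 for _ in range(200))
    return results


# Same seed, same fault sequence, same results.
assert run_test(seed=7) == run_test(seed=7)
```

Scoping the fault to marked requests is one possible way to share an environment between teams; other designs, such as per-service targeting, would serve the same goal.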

Conclusions

Fault Injection Testing incorporates the principles of chaos engineering early into the development process as an integral part of the testing practices, shifting the emphasis from experimentation to verification, and from uncovering unknown faults to ensuring proper handling of known faults.

Building confidence in the ability to fulfill the reliability expectations of the customers should not be a privilege of the technology elite. Neither should the adoption of a systematic approach for achieving this goal pose a burden to organizations. We believe that Chaos Engineering can be democratized by promoting the adoption of Fault Injection Testing as part of the existing testing practices of the organizations.

1 Challenges with distributed systems, The Amazon Builders' Library
