Finding Faults: CNCF Has Moved Chaos Mesh to Its Incubator

When you’re a big enterprise with a cloud native infrastructure running on thousands of servers across numerous clouds, nipping a problem in the bud isn’t good enough. Chaos Mesh can help you find an issue before it gets triggered and causes a costly outage.

Chaos Mesh icon. Source: CNCF

Last week the Cloud Native Computing Foundation announced that its Technical Oversight Committee has voted to move Chaos Mesh, a project that applies chaos engineering techniques to cloud native deployments, from the entry-level Sandbox to the Incubator.

This indicates that the project has reached a level of maturity that should make it more attractive to enterprises that are leery of deploying software in early stages of development. Even before the status change, however, Chaos Mesh was already being used in production by more than 50 organizations, including DataStax, Percona, and Prudential.

If the high number of companies already using the software isn’t enough to convince you that the project is ready for prime time, consider this: Microsoft Azure Chaos Studio has integrated Chaos Mesh into its software as a service platform, which allows users to inject faults into AKS clusters.

“We built Chaos Mesh with a simple mission – to make chaos engineering easier so that complicated systems can become resilient as they should be,” Cwen Yin, maintainer and co-creator of Chaos Mesh, said in a statement. “We are excited to see Chaos Mesh become an incubating project.”

What is Chaos Mesh?

In the digital world, a system generally works until something unanticipated happens that brings it down, so anticipating the unanticipated becomes crucial to system engineers and developers.

This is why Linux distributions (and even Microsoft Windows) seek volunteers to test release candidates in real-world environments, where the software is exposed to far more hardware and software combinations, and so has more chances to surface unknown conflicts, than would be possible in a closed developer environment.

In the Kubernetes-driven cloud native world there’s even more to go wrong, with containers interacting with everything from microservices to cloud functions to other containers, all of it in a state of flux and subject to constant updates. In such an environment, it’s easy for a routine action in the CI/CD pipeline, such as updating a container, to inadvertently add a failure point whose cascading effect brings down a whole system.

And when a big business has an outage, even one that affects only a small part of the entire system (say, the cash registers in a retail chain), the cost can quickly run to hundreds of thousands of dollars or more.

One of the tools being used not only to help prevent such failures but also to shorten the time to recovery when a failure occurs is Chaos Mesh, which uses chaos engineering techniques to test a system’s ability to tolerate faults.

“At my company, Digital China, we combine Chaos Mesh with our DevOps platform to provide a one-click CI/CD process,” Lei Li wrote in an article for The New Stack in December. Li is not only a senior software engineer for Digital China but also a contributor to PingCAP’s TiDB, the project where Chaos Mesh originated.

“Every time a developer submits a piece of code, it triggers the CI/CD process,” he said. “In this process, the system builds the code and performs unit tests and a SonarQube quality check. It then packages the image and releases it to Kubernetes. At the end of the day, our daily reporting system pulls the latest images of each project and performs chaos engineering on them.

“The simulation doesn’t require any application code change; Chaos Mesh takes care of the hard work,” he added. “It injects all kinds of physical node failures into the system, such as network latency, network loss, and network duplication. It also injects Kubernetes failures, such as Pod or container faults. These faults may reveal vulnerabilities in our application code or the system architecture. When the loopholes surface, we can fix them before they can do real damage in production.”

Chaos Mesh uses Kubernetes CustomResourceDefinitions to define chaos objects and can be tightly integrated with other cloud native projects such as Argo, Grafana, and Prometheus, to make the chaos experience more manageable, customizable, and observable. To help with observability, the software centers on a dashboard with a user-friendly web interface for manipulating and observing chaos experiments.
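
To give a sense of what a chaos object defined through a CustomResourceDefinition might look like, here is a minimal Python sketch that builds a network-latency experiment as a plain manifest and prints it as JSON, which kubectl can apply. The field names follow the NetworkChaos resource in the chaos-mesh.org/v1alpha1 API as commonly documented and should be checked against the installed version; the experiment name, namespace, and app label are hypothetical placeholders, and the helper function is an illustration, not part of Chaos Mesh.

```python
import json


def network_delay_experiment(name, namespace, app_label,
                             latency="10ms", duration="30s"):
    """Build a NetworkChaos manifest as a plain dict.

    Hypothetical helper for illustration; field names are assumed to match
    the chaos-mesh.org/v1alpha1 NetworkChaos resource.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",          # inject network latency
            "mode": "one",              # affect one matching Pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,       # how long the fault stays active
        },
    }


# Placeholder values: "demo-delay", "staging", and "checkout" are made up.
manifest = network_delay_experiment("demo-delay", "staging", "checkout")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would, under these assumptions, create the experiment; because the fault is described declaratively, the application code itself needs no changes.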

According to CNCF, the project has fully mapped the road ahead, and the development team is actively adding new features and functionality while working at improving the overall chaos experience.

“No cloud native deployment is perfect — failures will always happen, so using chaos engineering to establish a culture of resiliency can save organizations time and money,” Chris Aniszczyk, CTO of CNCF, said in a statement. “We are very excited to see how Chaos Mesh can grow as an incubating project and impact the state of the chaos and resiliency engineering space.”
