Exploring the Azure Chaos Studio Preview

⚠ Please note – Chaos Studio is a Preview Service – so expect changes and updates!

Introduction

Chaos Engineering is the process of testing a solution and its ability to withstand problems and faults whilst maintaining the required level of service or uptime. Chaos Testing is carried out to provide a level of comfort that the solution undergoing testing will be able to withstand potential issues during production operation. This has traditionally been a challenge with on-premises infrastructure, where the scope for temporary duplication, migration, or service expansion is often limited by physical hardware. In the Cloud world, however, there is almost unlimited scope for introducing problems and designing solutions to resolve them, and we can utilise additional services, regions, and zones (for example) to mitigate many issues and work around these types of events.

Designing with “Chaos in Mind”, aka Chaos Engineering, has been used by numerous large organizations for a number of years – but the scope for smaller organizations to use this type of tooling has often been limited to performing periodic DR or Resiliency Testing, not true “Chaos Testing”. In many cases, testing specific systems could require having Development or Test environments, and access to the tooling required to run tests, for example. There are, however, a number of well-known tools and use cases that you can read about here.

✅ You can learn more about Chaos Engineering over at Microsoft Learn here: https://learn.microsoft.com/en-gb/azure/architecture/framework/resiliency/chaos-engineering

Azure Chaos Studio

Azure Chaos Studio is designed to provide access to a range of Chaos Engineering tools within Microsoft Azure. Chaos Studio is organised into two main sections: Targets and Experiments. Targets represent Azure Resources we wish to inflict chaos upon, and Experiments define the Chaos we will inflict.

Azure Chaos Studio Overview

Lab Environment for Testing

To allow me to do some testing of Azure Chaos Studio, I have created a sample environment across two Azure Regions, with Traffic Manager, Firewalling, Internal Load Balancers, and Clusters of Web Servers (running IIS). This is designed to simulate a business Web Application, with resiliency across various layers of the application, and a multi-region approach to the design. I’ve used a Custom Script Extension during deployment to install IIS on each Server, and to change the IIS default page to the server name – which allows me to see easily where the request ends up, and confirm I am accessing our test web site during any Chaos Experiments run.

This lab has 3 access points:
    1. The Traffic Manager DNS Record, which is the main entry point for the Web Application and will be something like: http://chaos-demo-########.trafficmanager.net.
    2. An access URL for the Region 1 Public IP (NAT’d out via the Regional Firewall).
    3. An access URL for the Region 2 Public IP (Also NAT’d out via the Regional Firewall).

❓ The goal is to test the resiliency of this application using Chaos Studio. So, if we introduce failures and problems – users should still have access to the application.
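One way to check that goal while an experiment runs is to poll an access point and record which server answered. The following is a minimal sketch, not part of the lab itself: it assumes the page body is just the server name (as set by the Custom Script Extension), and the function names `fetch_server_name` and `probe` are my own for illustration.

```python
import http.client
import urllib.request
from collections import Counter
from typing import Callable, Optional


def fetch_server_name(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the page body (assumed to be the server name), or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace").strip()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


def probe(url: str, attempts: int,
          fetch: Callable[[str], Optional[str]] = fetch_server_name) -> dict:
    """Request the URL `attempts` times and summarise who answered."""
    answers = [fetch(url) for _ in range(attempts)]
    served_by = Counter(name for name in answers if name is not None)
    successes = sum(served_by.values())
    return {
        "availability": successes / attempts if attempts else 0.0,
        "served_by": dict(served_by),
    }
```

In real use you would probably add a pause between requests. Traffic Manager works at the DNS level, so cached DNS answers on the client can also influence which region a request reaches.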

Lab Environment Overview

✅ You can download a copy of this Lab Environment here: https://github.com/jakewalsh90/Terraform-Azure/tree/main/Chaos-Studio-Test. If you need a refresher or introduction to Terraform deployment on Azure – please check out my Blog Series here.

Onboarding Resources before the chaos

Before we can start creating chaos, we need to onboard Resources to Chaos Studio. This is done by selecting “Targets” in Chaos Studio in the Portal:

Chaos Studio Targets

Within my lab environment, I can onboard the following Resources, by selecting the Resources and clicking “Enable Targets”:

Enabling Targets

Upon Clicking “Enable Targets”, two options will be presented:

Target Types

There are two options when enabling a target, which relate to the type of fault we will be testing for:

  • Service-direct Targets – These run directly against an Azure Resource, and do not require any installation or instrumentation.
  • Agent-based Targets – These run inside Virtual Machines or Scale Sets and provide failures within the Guest OS. 

ℹ Note: to enable Agent-based Targets, you will need to create a Managed Identity. When onboarding you also have the option of sending diagnostic information to Application Insights.

See here for further information on fault types: https://learn.microsoft.com/en-gb/azure/chaos-studio/chaos-studio-overview#how-chaos-studio-works 
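As far as I understand the Chaos Studio resource model, enabling a target creates a child resource under the resource being onboarded. The sketch below builds that child resource ID. The target type names (`Microsoft-VirtualMachine`, `Microsoft-NetworkSecurityGroup`, `Microsoft-Agent`) are taken from my reading of the documentation and should be treated as assumptions; the helper name `target_resource_id` is hypothetical.

```python
# Target type names as I understand them from the Chaos Studio docs.
# Treat them as assumptions and check the current documentation.
SERVICE_DIRECT_TARGET_TYPES = {
    "virtualMachines": "Microsoft-VirtualMachine",
    "networkSecurityGroups": "Microsoft-NetworkSecurityGroup",
}
AGENT_TARGET_TYPE = "Microsoft-Agent"


def target_resource_id(resource_id: str, agent_based: bool = False) -> str:
    """Build the ID of the Chaos Studio target that sits under an Azure resource."""
    resource_id = resource_id.rstrip("/")
    if agent_based:
        # Agent-based faults run inside the guest OS of a VM or Scale Set.
        target_type = AGENT_TARGET_TYPE
    else:
        # The second-to-last path segment is the resource kind,
        # e.g. ".../virtualMachines/<name>".
        resource_kind = resource_id.split("/")[-2]
        if resource_kind not in SERVICE_DIRECT_TARGET_TYPES:
            raise ValueError(f"No service-direct target type known for {resource_kind!r}")
        target_type = SERVICE_DIRECT_TARGET_TYPES[resource_kind]
    return f"{resource_id}/providers/Microsoft.Chaos/targets/{target_type}"
```

A single Virtual Machine can therefore carry two targets at once: one service-direct and one agent-based.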

Now we can see the Targets are enabled:

Targets enabled within Chaos Studio

We can also look into the available actions – this shows us the Faults we can apply to the various resources. I’ve shown a Key Vault and Virtual Machine example below:

Key Vault Fault Options
Virtual Machine Fault Options

We are now ready to start designing some Experiments and seeing what chaos we can inflict upon our test infrastructure.

Creating Experiments – bring on the chaos!

To create our Chaos, we need to create an Experiment. An Experiment within Chaos Studio is simply a set of steps that will be taken to introduce problems upon a set of Targets. Experiments are created using an experiment designer that allows us to create various stages and actions. When we start designing the steps, we start with a blank experiment:

Experiment Designer

We can then begin adding steps to the experiment to introduce chaos. I’ve created two experiments:

Test 1 – an NSG change that blocks all Inbound traffic!
Test 2 – Shutdown of a VM!

Once we have added these faults, we can also use additional steps, and include time delays. For this first experiment, I will keep things simple:

An overview of our Chaos Experiment

I’ve created two experiments: firstly, a network block on an NSG that stops all inbound & outbound traffic, and secondly, the shutdown of a Virtual Machine. For my demo environment this means the NSG block applies in Region 1, and the VM shutdown in Region 2. Together, these tests will effectively break 3 out of 4 of the hosting servers within my architecture and leave us with just 1 – a good test to see if the application is still accessible!
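Behind the designer, an experiment is a JSON document made up of selectors (which Targets) and steps, branches, and actions (which faults). The sketch below builds a single-fault experiment of that shape for the VM shutdown. It reflects my understanding of the experiment schema rather than an export from my lab: the fault URN, the `abruptShutdown` parameter, the `northeurope` location, and the function name are all assumptions to verify against the fault library.

```python
import json
from typing import Dict

# Fault identifier as I understand it from the fault library (assumption).
VM_SHUTDOWN_FAULT = "urn:csci:microsoft:virtualMachine:shutdown/1.0"


def build_single_fault_experiment(location: str, fault_urn: str, target_id: str,
                                  duration: str, parameters: Dict[str, str]) -> dict:
    """Return an experiment body with one step, one branch and one fault."""
    return {
        "location": location,
        # The experiment's identity is what gets permissions on the target.
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "selectors": [
                {
                    "id": "Selector1",
                    "type": "List",
                    "targets": [{"type": "ChaosTarget", "id": target_id}],
                }
            ],
            "steps": [
                {
                    "name": "Step 1",
                    "branches": [
                        {
                            "name": "Branch 1",
                            "actions": [
                                {
                                    "type": "continuous",
                                    "name": fault_urn,
                                    "selectorId": "Selector1",
                                    "duration": duration,  # ISO 8601, e.g. PT10M
                                    "parameters": [
                                        {"key": key, "value": value}
                                        for key, value in parameters.items()
                                    ],
                                }
                            ],
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    experiment = build_single_fault_experiment(
        location="northeurope",  # hypothetical
        fault_urn=VM_SHUTDOWN_FAULT,
        target_id="<target resource ID>",  # placeholder, not a real ID
        duration="PT10M",
        parameters={"abruptShutdown": "false"},
    )
    print(json.dumps(experiment, indent=2))
```

The NSG experiment would follow the same shape, with the NSG security rule fault and its own parameters in place of the shutdown fault.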

⚠ Before we can run the experiment, we need to assign our experiment permissions against target resources. This is detailed here: https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-tutorial-service-direct-portal#give-experiment-permission-to-your-target-resource.

The planned outcome of my Chaos Experiment – a big problem in Region 1, and a minor problem in Region 2.

Running the test Experiments:

To run the test experiment, we can browse to the Chaos Studio Portal pane, which will allow us to run the experiment, and view the history:

Experiments within Chaos Studio – time to start the chaos!

From here we can run our experiments, by selecting “Start experiment(s)”. As you can see, the experiments then run:

Chaos in progress…

Whilst the experiments were in progress, I checked the actions requested – and as you can see below, the NSG rule has been added, and the chosen VM shut down:

NSG rule added by Chaos Studio
VM shutdown by Chaos Studio

Testing the Resiliency of the Lab Environment:

During the Chaos Testing experiments, I accessed the test URL via Traffic Manager – and was able to see the page served by VM “vm-neu-a-0” successfully:

A successful test – accessing the page of our only remaining accessible VM.

This is the only VM still accessible – as both of my UK VMs are subject to the NSG rule block, and the other North Europe VM was shut down by Chaos Studio! Chaos successfully inflicted! It is worth noting that at the end of the test Chaos Studio removed the NSG Rule and started the shutdown VM back up.
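The reasoning behind that result can be written down as a tiny model of the lab. This is purely illustrative: only “vm-neu-a-0” is a real name from my environment, and the other three server names and the region labels are made up for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class WebServer:
    name: str
    region: str


def reachable_servers(servers: Iterable[WebServer],
                      blocked_regions: Set[str],
                      stopped_vms: Set[str]) -> List[str]:
    """Names of servers that are neither behind a blocking NSG rule nor shut down."""
    return [
        server.name
        for server in servers
        if server.region not in blocked_regions and server.name not in stopped_vms
    ]


# Hypothetical names, apart from vm-neu-a-0.
LAB = [
    WebServer("vm-uks-a-0", "uk"),
    WebServer("vm-uks-a-1", "uk"),
    WebServer("vm-neu-a-0", "north-europe"),
    WebServer("vm-neu-a-1", "north-europe"),
]

# NSG block in the UK region, plus one North Europe VM shut down.
remaining = reachable_servers(LAB, blocked_regions={"uk"}, stopped_vms={"vm-neu-a-1"})
```

With both faults applied, `remaining` contains a single server, which matches what I saw through Traffic Manager.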

Once the experiment completes, we see the success message, which indicates that all requested actions within our experiments were carried out by Chaos Studio without issue:

Successful experiments!

What other experiment actions can we run?

I’m glad you asked! Chaos Studio currently has a range of actions we can carry out upon Target Resources. These are split between Time Delays (so we can allow experiments to wait between faults) and faults themselves. There is a full list of available faults here: https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library

Below I’ve outlined a few of the faults available and their actions, to give you an idea of what is available:

  • CPU Pressure – increases CPU load on a virtual machine, useful for testing applications under high load.
  • Virtual Memory Pressure – increases Memory consumption on a virtual machine, again useful for high load scenarios.
  • Disk I/O Pressure – adds disk pressure to the Primary Storage of a virtual machine, also great for high load testing.
  • Stop Windows Service – fairly self-explanatory, this one stops the specified Windows service. Useful for testing monitoring solutions, network probes, etc.
  • Kill Process – again fairly self-explanatory, this fault kills the specified process.
  • Network disconnect – this blocks outbound network traffic to a specified IP and port range. This would be useful when testing monitoring and availability, and specific location testing.
  • AKS Chaos Mesh – this is a range of faults, which requires Chaos Mesh to be deployed in the AKS Cluster. You can read more about faults with Chaos Mesh and AKS here.
  • NSG Rules – this fault allows the injection of NSG rules, which is one of the faults I used in my testing above. Useful for simulating network changes or failures.
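These faults map back onto the two Target types described earlier. The lookup below is a rough sketch of that mapping based on my reading of the fault library, so treat the grouping as an assumption; the function name `requires_agent` is hypothetical. It can be handy when planning an experiment, because any agent-based fault means a Managed Identity is needed on the VM.

```python
from typing import Iterable

AGENT_BASED = "agent-based"
SERVICE_DIRECT = "service-direct"

# Grouping based on my reading of the fault library (assumption).
FAULT_TARGET_TYPE = {
    "CPU Pressure": AGENT_BASED,
    "Virtual Memory Pressure": AGENT_BASED,
    "Disk I/O Pressure": AGENT_BASED,
    "Stop Windows Service": AGENT_BASED,
    "Kill Process": AGENT_BASED,
    "Network disconnect": AGENT_BASED,
    "AKS Chaos Mesh": SERVICE_DIRECT,
    "NSG Rules": SERVICE_DIRECT,
}


def requires_agent(faults: Iterable[str]) -> bool:
    """True if any of the named faults runs inside the guest OS."""
    return any(FAULT_TARGET_TYPE[fault] == AGENT_BASED for fault in faults)
```

An unknown fault name raises a `KeyError`, which is deliberate here: it is better to fail loudly than to guess which Target type a fault needs.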

Conclusion

As you’ll hopefully have seen from this post, Chaos Studio provides a defined and structured way to test Azure Resources and Services for resiliency and against potential problems. There are numerous advantages to Chaos Testing, and I’ve outlined a few of the key ones below:

  • Creating better architectures – There’s absolutely no doubt that testing against potential failures helps create better architectures. By testing for common faults, issues, and problems we can design around these, and validate that our architectures do actually withstand these problems.
  • Benchmarking reliability – Chaos Testing allows us to test against potential issues, thus allowing us to change our infrastructures as a result of any testing. This allows us to make our applications and services more resilient. Ultimately, greater resilience in applications = better user experience. In many cases this means greater uptime, more orders placed, or better customer satisfaction, for example.
  • Performing standardised infrastructure tests – Chaos Testing can also be used to perform standard testing against services, both old and new. Organisations can opt to create tests that new designs must pass before they go into production – ensuring a standardised level of testing.
  • Preparing for the worst – Chaos Testing is also a great way to understand where your vulnerabilities lie. Whilst this might lead to uncomfortable truths and changes needed to protect against certain faults, it’s far better to be informed and have this information early – than to wait until an actual issue occurs.
  • Integration into DevOps and IaC methodologies – By integrating Chaos Testing into deployments we can automate the process and validate programmatically that our architectures and code meet our requirements around resiliency and fault tolerance.

✅ I hope this overview of the Azure Chaos Studio Preview has been useful – until next time!
