Adds challenge #1
fuchshuber committed Mar 15, 2021
1 parent b846eed commit 7e2abac
Showing 4 changed files with 222 additions and 3 deletions.
59 changes: 59 additions & 0 deletions challenges/challenge-0/README.md
@@ -61,6 +61,8 @@ First and foremost, let's create a resource group where we will install the clus
$ az group create -n <ResourceGroupName> -l westeurope
```

Please name the resource group according to the naming convention `<your-initials>-aks-rg`.

Next, create the cluster (this will take approximately 5-10min.):

```shell
@@ -74,6 +76,8 @@ $ az aks create \
--zones 1 2 3
```

Please name the AKS cluster according to the naming convention `<your-initials>-aks-cluster`.

The command above will create a Kubernetes cluster in the "West Europe" region and will place our three worker nodes in three different [availability zones](https://docs.microsoft.com/en-us/azure/availability-zones/az-overview).

When the cluster has been created, download the access credentials:
@@ -142,6 +146,16 @@ $ terraform apply \
-var="akscluster=<ClusterName>"
```

Confirm the prompt with `yes` during execution:

```shell
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.

Enter a value: yes
```

After the script has finished (approximately another 10-15 minutes), you will see something like this:

```shell
@@ -158,6 +172,51 @@ nip_hostname = "104-45-73-97.nip.io"

## Smoke Test

### Kubernetes Deployments & Pods

Because all deployments of the demo application take place in the `contactsapp` namespace, we first set this namespace as the new default for kubectl.

```shell
kubectl config set-context --current --namespace=contactsapp
```

Check all deployments. Every deployment must report all of its replicas as ready (`1/1` in the `READY` column):

```shell
$ kubectl get deployments

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
ca-deploy                       1/1     1            1           3h31m
frontend-deploy                 1/1     1            1           3h36m
mssql-deployment                1/1     1            1           3h31m
resources-deploy                1/1     1            1           3h31m
resources-function-deploy       1/1     1            1           3h31m
search-deploy                   1/1     1            1           3h31m
search-function-deploy          1/1     1            1           3h31m
textanalytics-function-deploy   1/1     1            1           3h31m
visitreports-deploy             1/1     1            1           3h31m
```

Check all pods. Every pod must be ready (`1/1` in the `READY` column) and have the status `Running`:

```shell
$ kubectl get pods

NAME                                             READY   STATUS    RESTARTS   AGE
ca-deploy-56d84bcf47-nlzp8                       1/1     Running   0          3h33m
frontend-deploy-5d85979b7b-v2tgk                 1/1     Running   0          3h38m
mssql-deployment-5998699cd8-cp6zv                1/1     Running   0          3h33m
resources-deploy-7f5f968587-6j8cc                1/1     Running   0          3h33m
resources-function-deploy-58744cb66-c8qdf        1/1     Running   0          3h33m
search-deploy-f7789698-vk6fx                     1/1     Running   0          3h33m
search-function-deploy-84b4b6bc84-dn7pr          1/1     Running   0          3h33m
textanalytics-function-deploy-6bc56f6b8c-8pr67   1/1     Running   0          3h33m
visitreports-deploy-5fc8bf9cf5-q97dr             1/1     Running   0          3h33m
```
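
If you prefer not to read the tables by eye, the same readiness check can be scripted. The following is a minimal Python sketch, assuming `kubectl` is on the `PATH` and already points at the cluster. The helper names `not_ready` and `fetch_deployments` are hypothetical and not part of the workshop repository; the sketch only relies on the `spec.replicas` and `status.readyReplicas` fields of the Deployment objects.

```python
import json
import subprocess


def not_ready(deployment_list: dict) -> list[str]:
    """Return the names of deployments with fewer ready replicas than desired."""
    names = []
    for item in deployment_list.get("items", []):
        desired = item["spec"].get("replicas", 1)
        # Kubernetes omits status.readyReplicas while no replica is ready yet.
        ready = item.get("status", {}).get("readyReplicas", 0)
        if ready < desired:
            names.append(item["metadata"]["name"])
    return names


def fetch_deployments(namespace: str = "contactsapp") -> dict:
    """Ask kubectl for all deployments of a namespace as JSON (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "deployments", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `not_ready(fetch_deployments())` in the Cloud Shell should return an empty list when the smoke test passes.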


### Web-UI

You can now copy & paste the value of the variable `nip_hostname` and open the URL in a browser, in this case <http://104-45-73-97.nip.io>. You should now see the SCM Contacts Management application.

![home](./img/app_home.png)
161 changes: 161 additions & 0 deletions challenges/challenge-1/README.md
@@ -0,0 +1,161 @@
# Challenge 1 - Chaos Engineering "hello world"

## Introduction

Our first challenge is the Kubernetes “hello world” of Chaos Testing: what happens if the Search Service fails? The Search Service of our demo application receives all search requests for contacts, queries Azure Cognitive Search, and returns the results as a JSON response. The web UI communicates directly with this service.

## Step #1: Get to know the demo app

![](images/search-mustermann.png)

* Open the demo app in the browser (see [Smoke Test](../challenge-0#smoke-test)) and create a few test contacts.
* Run a few searches using the search box.
* Run the search directly in the Cloud Shell:

```shell
export APP=<NIP_HOSTNAME>

# Quote the URL so the shell does not try to interpret the "?"
curl "http://$APP/api/search/contacts?phrase=mustermann"
```

If you want to format the JSON response, you can pipe it through the JSON processor `jq`:

```shell
curl -s "http://$APP/api/search/contacts?phrase=mustermann" | jq '.'
```
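
The same request can also be issued from a script, which is convenient once search phrases contain spaces or special characters. This is a minimal Python sketch using only the standard library; `search_url` and `search_contacts` are hypothetical helper names, and the only assumption about the API is the `/api/search/contacts?phrase=...` endpoint shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(host: str, phrase: str) -> str:
    """Build the search URL; urlencode escapes spaces and special characters."""
    return f"http://{host}/api/search/contacts?{urlencode({'phrase': phrase})}"


def search_contacts(host: str, phrase: str, timeout: float = 5.0):
    """Call the Search Service and return the decoded JSON body (needs network access)."""
    with urlopen(search_url(host, phrase), timeout=timeout) as response:
        return json.load(response)
```

For example, `search_contacts(os.environ["APP"], "mustermann")` (after `import os`) returns whatever JSON structure the Search Service sends back for that phrase.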

## Step #2: Game Day

Now we start with our first experiment. Admittedly, it is a bit naive, but the first task is about the process and the tooling, not about the complexity of the experiment. In this step, we do not yet use any Chaos Test tooling; we use only `kubectl` and `curl`. This is exactly how one would proceed on a Game Day: you set up a hypothesis with your colleagues and design an experiment based on it.

| | |
|-----------------------|------------------------------|
| Target | Search Service |
| Experiment | Service failure is simulated |
| Hypothesis | Kubernetes detects the failure of a pod in real time and immediately starts a new instance. The downtime is less than 1 second. |
| Steady State | The Search Service responds with HTTP Response Code 200 and an average response time of less than 200ms. |
| Blast Radius | The experiment involves only the Search Service. The Contact App UI and all use cases (except Search Service) are fully available. |
| Technical information | We simulate the failure by directly terminating a Search Service pod instance |

Let's do the experiment.

```shell
kubectl delete pods --force --grace-period=0 -l service=searchapi && \
sleep 1 && \
curl "http://$APP/api/search/contacts?phrase=mustermann"

<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>
```

Our experiment fails. One second after the pod is killed, the ingress returns an HTTP 503 error, and the Search Service does not become available again for another 4-5 seconds. We want to prevent this, and we will run the experiment again with Chaos Toolkit.
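
A single `curl` call only shows that the service was down at one moment. To estimate how long the outage lasts, the request can be repeated at short intervals while the pod is being killed. The following is a minimal Python sketch under that assumption; `is_healthy`, `poll`, and `longest_outage` are hypothetical helper names, and the default interval and timeout values are illustrative choices, not measurements from the workshop.

```python
import time
from urllib.request import urlopen


def is_healthy(url: str, timeout: float = 0.2) -> bool:
    """True if the URL answers with HTTP 200 without a socket timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 503, refused connections, and timeouts.
        return False


def poll(url: str, duration: float = 10.0, interval: float = 0.2) -> list:
    """Probe the URL repeatedly and record (seconds since start, healthy) samples."""
    samples = []
    start = time.monotonic()
    while (now := time.monotonic() - start) < duration:
        samples.append((now, is_healthy(url)))
        time.sleep(interval)
    return samples


def longest_outage(samples: list) -> float:
    """Longest span between the first failed sample of a run and the next success."""
    longest = 0.0
    outage_start = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # The outage was still ongoing when polling stopped.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Start `poll(...)` in one Cloud Shell session, delete the pod in another, and pass the recorded samples to `longest_outage`. The result is only as precise as the polling interval.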

## Step #3: Chaos Toolkit

### Install Chaos Toolkit and Kubernetes extension

```shell
pip install chaostoolkit
pip install chaostoolkit-kubernetes
```

Let's check the first [experiment](kill-search-pod.yaml):

The first part contains the metadata and the configuration block, which reads our `APP` environment variable.

```yaml
version: 1.0.0
title: Validate search service outage
description: Kills search service and validates search api availability

# define azure properties
configuration:
  app:
    type: "env"
    key: "APP"
```

The second part is the experiment's steady-state hypothesis (`steady-state-hypothesis`).

```yaml
# define the steady state hypothesis
steady-state-hypothesis:
  title: Verifying search api remains healthy
  probes:
  - type: probe
    name: search-api-must-still-respond
    tolerance: 200 # http response code 200 is expected
    provider:
      type: http
      url: http://${app}/api/search/contacts?phrase=mustermann
      timeout: 0.2
```
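
To make the `${app}` placeholder concrete: the `configuration` block declares that `app` comes from the environment variable `APP`, and the probe URL is filled in with that value at run time. The sketch below imitates this substitution with Python's `string.Template`. It is an illustration of the idea only, not Chaos Toolkit's actual implementation, and `resolve_configuration` and `render` are hypothetical names.

```python
import os
from string import Template


def resolve_configuration(configuration: dict, environ=os.environ) -> dict:
    """Replace {"type": "env", "key": NAME} entries with the value of NAME."""
    resolved = {}
    for name, spec in configuration.items():
        if isinstance(spec, dict) and spec.get("type") == "env":
            resolved[name] = environ[spec["key"]]  # KeyError if the variable is unset
        else:
            resolved[name] = spec
    return resolved


def render(text: str, configuration: dict, environ=os.environ) -> str:
    """Fill ${name} placeholders in text from the resolved configuration."""
    return Template(text).substitute(resolve_configuration(configuration, environ))
```

With `APP=104-45-73-97.nip.io` exported, rendering the probe URL from the experiment yields `http://104-45-73-97.nip.io/api/search/contacts?phrase=mustermann`. If `APP` is not set, this sketch fails with a `KeyError`, so remember to export it before running the experiment.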

The third part is the experiment's action. Our solution uses the Kubernetes driver extension and terminates exactly one randomly chosen pod in the `contactsapp` namespace that matches the label selector `service=searchapi`. The driver uses your current kubectl context (`~/.kube/config`).

```yaml
# define the experiment's action
method:
- type: action
  name: terminate-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    # Terminates one "searchapi" pod randomly
    arguments:
      label_selector: service=searchapi
      ns: contactsapp
      qty: 1
      rand: true
      grace_period: 0
  pauses:
    after: 1
```

Now we know our experiment. We can easily run it in the Cloud Shell console.

```shell
$ chaos run challenges/challenge-1/kill-search-pod.yaml
```

Like our first Game Day test with `kubectl` and `curl`, this test fails. Restarting the pod takes too long. As a simple solution, we now increase the number of replicas of the Search Service deployment to 2.

```shell
$ kubectl scale --replicas=2 deployment/search-deploy
```

After this change, we can run our experiment again.

```shell
$ chaos run challenges/challenge-1/kill-search-pod.yaml
```

The experiment is now successful. After the action, the steady-state hypothesis still holds.

```shell
[2021-03-08 14:59:23 INFO] Validating the experiment's syntax
[2021-03-08 14:59:26 INFO] Experiment looks valid
[2021-03-08 14:59:26 INFO] Running experiment: Validate search service outage
[2021-03-08 14:59:26 INFO] Steady-state strategy: default
[2021-03-08 14:59:26 INFO] Rollbacks strategy: default
[2021-03-08 14:59:26 INFO] Steady state hypothesis: Verifying search api remains healthy
[2021-03-08 14:59:26 INFO] Probe: search-api-must-still-respond
[2021-03-08 14:59:27 INFO] Steady state hypothesis is met!
[2021-03-08 14:59:27 INFO] Playing your experiment's method now...
[2021-03-08 14:59:27 INFO] Action: terminate-pod
[2021-03-08 14:59:27 INFO] Pausing after activity for 1s...
[2021-03-08 14:59:28 INFO] Steady state hypothesis: Verifying search api remains healthy
[2021-03-08 14:59:28 INFO] Probe: search-api-must-still-respond
[2021-03-08 14:59:29 INFO] Steady state hypothesis is met!
[2021-03-08 14:59:29 INFO] Let's rollback...
[2021-03-08 14:59:29 INFO] No declared rollbacks, let's move on.
[2021-03-08 14:59:29 INFO] Experiment ended with status: completed
```
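
Besides the console log, Chaos Toolkit records each run in a journal file (by default `journal.json` in the working directory), which is handy for checking the outcome in a script. The following is a minimal Python sketch, assuming the journal carries a top-level `status` field (such as `completed`, as in the last log line above) and a boolean `deviated` field. `experiment_passed` and `load_journal` are hypothetical helper names.

```python
import json
from pathlib import Path


def experiment_passed(journal: dict) -> bool:
    """True if the run completed and the steady state did not deviate."""
    return journal.get("status") == "completed" and not journal.get("deviated", False)


def load_journal(path: str = "journal.json") -> dict:
    """Read the journal that `chaos run` wrote (assumed default location)."""
    return json.loads(Path(path).read_text())
```

After a run, `experiment_passed(load_journal())` gives a single boolean that could, for example, gate a pipeline step.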

Congratulations! You have successfully completed the first challenge and discovered the workflow of Chaos Toolkit and its Kubernetes extension. Let's continue with [challenge #2](../challenge-2/README.md).
5 changes: 2 additions & 3 deletions challenges/challenge-1/kill-search-pod.yaml
@@ -8,7 +8,6 @@ configuration:
app:
type: "env"
key: "APP"
azure_app_endpoint: 20-73-198-159.nip.io

# define the steady state hypothesis
steady-state-hypothesis:
@@ -19,10 +18,10 @@ steady-state-hypothesis:
tolerance: 200 # http response code 200 is expected
provider:
type: http
timeout: 2
url: http://${app}/api/search/contacts?phrase=mustermann
timeout: 0.3

# define the experiment
# define the experiment's action
method:
- type: action
name: terminate-pod