fishaffair/chaos-experiment

Chaos experiments

A simple Ansible playbook for running chaos experiments on infrastructure that has been deployed via this project.

The playbook chooses a random host from each of the following inventory groups:

  • etcd_cluster
  • postgres_cluster (patroni and psql)
  • haproxy (balancer)

It then runs several experiments:

  • Network packet loss / delay
  • RAM / CPU / IO load
  • Service stop (etcd, patroni)

Note

You can disable some of these experiments in vars.yml.

After a specified delay, the rollback role restores everything to its original state.
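The random host selection described above can be illustrated outside of Ansible with a small shell helper. This is only a sketch of the idea, not the playbook's own implementation: the function name `pick_random_host` and the host names in the usage line are hypothetical.

```shell
# Read host names on stdin (one per line, blank lines ignored)
# and print exactly one of them, chosen at random.
pick_random_host() {
  awk 'BEGIN { srand() }
       NF    { hosts[++n] = $0 }
       END   { if (n) print hosts[int(rand() * n) + 1] }'
}
```

For example, `printf 'psql-1\npsql-2\npsql-3\n' | pick_random_host` prints one of the three names.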

How to run experiments?

First, clone the repo and do a dry run of the playbook against your hosts:

ansible-playbook chaos.yml -i inventory --check

and run it:

ansible-playbook chaos.yml -i inventory

Warning

The cluster stays available only while at least the following are up:

  • one patroni node
  • two etcd nodes
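The "two etcd" requirement is consistent with the majority rule etcd uses, floor(N/2) + 1, if the cluster has three members. The three-member size is an assumption here, since the README does not state it. A minimal sketch of that arithmetic, with hypothetical function names:

```shell
# Majority quorum for an etcd cluster with N members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Exit status 0 when the healthy member count still reaches quorum.
# Usage: has_quorum TOTAL_MEMBERS HEALTHY_MEMBERS
has_quorum() {
  [ "$2" -ge "$(etcd_quorum "$1")" ]
}
```

With three members, `has_quorum 3 2` succeeds and `has_quorum 3 1` fails, which is why stopping a second etcd node would make the cluster unavailable.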

Main experiments:

1. Patroni failover

1.1 Experiment description: Stop the patroni service on the current patroni leader using the systemd role.

1.2 Expected results: A patroni replica promotes itself to the new leader.

1.3 Real outcomes:

Patroni logs:

INFO: Cleaning up failover key after acquiring leader lock...
INFO:patroni.watchdog.base:Software Watchdog activated with 25 second timeout, timing slack 15 seco>
INFO: Software Watchdog activated with 25 second timeout, timing slack 15 s>
INFO:patroni.__main__:promoted self to leader by acquiring session lock
INFO:patroni.ha:Lock owner: psql-2; I am psql-2
INFO: promoted self to leader by acquiring session lock
INFO:patroni.__main__:updated leader lock during promote
INFO: Lock owner: psql-2; I am psql-2
INFO: updated leader lock during promote

[Image]

1.4 Results analysis: The patroni replica promoted itself to the new leader, as expected.
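To confirm the outcome from the logs rather than by eye, the current lock owner can be pulled out of lines in the "Lock owner: ...; I am ..." format shown above. The helper name `lock_owner` is hypothetical, and the sketch assumes the log lines keep that format.

```shell
# Read patroni log lines on stdin and print the lock owner
# named in each "Lock owner: <name>; I am <name>" line.
lock_owner() {
  sed -n 's/.*Lock owner: \([^;]*\); I am .*/\1/p'
}
```

For the log excerpt above, `echo 'INFO: Lock owner: psql-2; I am psql-2' | lock_owner` prints `psql-2`.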

2. Etcd failover

2.1 Experiment description: Stop the etcd service on the current etcd leader using the systemd role.

2.2 Expected results: The etcd hosts keep a valid quorum with at least two active nodes, so a new leader is elected.

2.3 Real outcomes:

[Image] [Image]

e1f06668267121f5 [term 38] received MsgTimeoutNow from b586ded327f9460d and starts an election to get leadership."
e1f06668267121f5 lost leader b586ded327f9460d at term 39"
e1f06668267121f5 became leader at term 39"
1f06668267121f5 elected leader e1f06668267121f5 at term 39"

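2.4 Results analysis: The remaining etcd nodes kept quorum and elected one of them as the new leader. The MsgTimeoutNow entry in the log suggests that the stopping leader handed leadership over to that node rather than the choice being random.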

3. Network delay

3.1 Experiment description: Introduce network packet loss via tc on the patroni master.

3.2 Expected results: Increased latency of blackbox probes for any API request.
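As an illustration of what such a tc-based experiment typically looks like, the sketch below builds the netem commands for injecting packet loss and for rolling it back. The function names, the interface name and the loss percentage are hypothetical; the values the playbook actually uses are not shown in this README.

```shell
# Print the tc command that adds a netem qdisc with packet loss.
# Usage: netem_loss_cmd INTERFACE LOSS_PERCENT
netem_loss_cmd() {
  echo "tc qdisc add dev $1 root netem loss $2%"
}

# Print the tc command that removes the root qdisc again (the rollback).
# Usage: netem_rollback_cmd INTERFACE
netem_rollback_cmd() {
  echo "tc qdisc del dev $1 root"
}
```

The functions only print the commands, so they are safe to run anywhere. On a test host they could be executed as root with, for example, `sh -c "$(netem_loss_cmd eth0 10)"`, followed later by `sh -c "$(netem_rollback_cmd eth0)"`.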

3.3 Real outcomes: Latency has increased:

[Image] [Image]

3.4 Results analysis: The network degradation had a negative impact on API request latency.

4. CPU load

4.1 Experiment description: Put maximum load on the CPU with stress-ng.

4.2 Expected results: An email alert from alertmanager and a latency bump in the blackbox probe statistics.
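A CPU stress run of this kind might be started as sketched below. The function name and the 60 second duration are hypothetical, and the exact stress-ng options the playbook passes are not shown in this README. With `--cpu 0`, stress-ng starts one CPU worker per online processor.

```shell
# Print a stress-ng command that loads every online CPU
# for the given number of seconds.
# Usage: cpu_stress_cmd DURATION_SECONDS
cpu_stress_cmd() {
  echo "stress-ng --cpu 0 --timeout ${1}s"
}
```

The function only prints the command. To actually run it on a test host: `sh -c "$(cpu_stress_cmd 60)"`.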

4.3 Real outcomes:

[Image]

4.4 Results analysis: The CPU stress test did not show a significant impact on latency for API requests.

5. Alerting test

5.1 Experiment description: Check that the alertmanager alert rules work as expected.

5.2 Expected results: New email alerts from prometheus alertmanager.

5.3 Real outcomes: Email alerts were received due to the abnormal cluster state.

[Image] [Image]

5.4 Results analysis: Experiments that target operating system resources, such as disk and memory, also load the CPU most of the time. As a result, a CPU load notification is received even though alerts are configured for the other resource types as well. A possible solution is to lower the thresholds for the other attack types (disk, memory) so that their alerts fire earlier.
