stress test - keep increasing the bar #529

dougsland · 2023-09-01T22:22:04Z

Describe the bug

Following our previous stress test we have more feedback from owners/developers/users/etc.
Based on the feedback, let's improve tools and tests to generate reports frequently.

Description:

BlueChi extends D-Bus for multi-node environments. This means some D-Bus load from an external system (bluechi-agent) shows up where bluechi (master) is running. A set of basic stress tests could include a measurement of the time until a change on a bluechi-agent is visible on the master node. This measurement could be done under a wide variety of conditions, listed below.
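As a rough sketch of how that measurement could be summarized: assume each state change gets an identifier and is timestamped once on the agent and once when it becomes visible on the master. The function names, the dictionary layout and the event ids are hypothetical and not part of BlueChi. The sketch also assumes both timestamps come from the same or synchronized clocks, which holds when all nodes are containers on one host.

```python
import math
import statistics


def propagation_latencies(agent_events, master_events):
    """Return per-event latencies (seconds) for events seen on both sides.

    Both arguments map a hypothetical event id to a timestamp in seconds.
    Events that never reached the master are skipped here and can be
    counted separately as lost.
    """
    latencies = []
    for event_id, t_agent in agent_events.items():
        t_master = master_events.get(event_id)
        if t_master is not None:
            latencies.append(t_master - t_agent)
    return latencies


def summarize(latencies):
    """Summarize latencies with count, mean, max and nearest-rank p95."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "p95": ordered[p95_index],
    }


if __name__ == "__main__":
    # Illustrative timestamps only, not measured data.
    agent = {"a": 1.0, "b": 2.0, "c": 3.0}
    master = {"a": 1.5, "b": 2.25}
    lat = propagation_latencies(agent, master)
    print(summarize(lat), "lost:", len(agent) - len(lat))
```

Running the same summary for each condition would make the reports comparable from run to run.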

Before we start, I would recommend addressing this issue first, so we can be faster: containers/qm#164

number of bluechi-agents attached to a master
number of state changes per time interval per bluechi-agent
various NIC port speeds and loads (try causing congestion with iperf, utilizing close to 100% of the network bandwidth)
various CPU/memory/disk/cache utilizations (what happens if CPU load is high? Does it mean that state changes are reflected later, and if so, what delay does it cause?)
various base loads on D-Bus (simulate a large number of messages being sent and received to measure the system's performance and responsiveness)
The same measurement can be done vice versa: what happens if these conditions materialize on a bluechi-agent while the master node is idling?
Another test could be fault injection in the network layer, where you introduce random packet loss and measure the error rate (number of erroneous states, mean time until the system recovers from failure)
Heartbeat interval N (see https://github.com/containers/bluechi/blob/main/config/agent/agent.conf#L30)
Each agent will emit a small signal on the peer D-Bus to the controller every N ms. This means the "idle" system alone generates a small amount of traffic. There was a bug where a missing or 0 value would lead the agent to spam as many signals as possible. Since the controller basically does nothing on the signal, this should affect only the agent. So what happens if it is set to 1 ms? Does having 100 or 1000 agents spamming a signal every millisecond to the controller really not affect it?
Monitors (see https://github.com/containers/bluechi/blob/main/data/org.eclipse.bluechi.Monitor.xml)
When a monitor with a subscription is registered, the state changes of systemd units are forwarded from the agent to the controller (which forwards them to the monitor). So monitors and subscriptions can be used to increase the traffic as well as the workload on the controller (as the agent keeps track of the unit state changes anyway).
For example, starting units in a short period of time on 100 agents while a monitor subscription watching all nodes and all units is active would result in a huge peak load on the controller.
Side note: The monitor can easily be set up via the BlueChi Python bindings.
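To put the heartbeat question into numbers, here is a small back-of-the-envelope sketch. It assumes every agent emits exactly one signal per interval and ignores jitter and reconnects. The function name is hypothetical, and the 1000 ms interval is only an illustrative comparison value, not a claim about the configured default.

```python
def heartbeat_signals_per_second(agents: int, interval_ms: float) -> float:
    """Aggregate heartbeat signals per second arriving at the controller.

    Assumes one signal per agent per interval. A missing or 0 interval is
    rejected because the rate is undefined for it (that case was the
    spamming bug mentioned above).
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return agents * 1000.0 / interval_ms


if __name__ == "__main__":
    for agents in (100, 1000):
        for interval_ms in (1, 1000):
            rate = heartbeat_signals_per_second(agents, interval_ms)
            print(f"{agents} agents, {interval_ms} ms interval: {rate:,.0f} signals/s")
```

With a 1 ms interval, 100 agents already amount to 100,000 signals per second at the controller, and 1000 agents to 1,000,000 per second, so even a handler that does almost nothing is worth measuring at that rate.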

How was the previous stress test run?

Most of the description of the load that agents put on the controller comes from the initial stress test execution (steps below). However, we must keep working on the items listed above.

Steps:

git clone https://github.com/containers/qm && cd qm/tests/e2e
./tools/remove-containers (removes any previously created containers/images, so the time spent removing the old environment is not counted)
./run-test-e2e --number-of-nodes=500 &> output-500-nodes.txt
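To raise the bar step by step, the steps above could be wrapped in a small driver that repeats them for several node counts and records how long each run takes. This is only a sketch: it assumes it is started from qm/tests/e2e, the helper names are hypothetical, and every node count other than 500 is an arbitrary example. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
import time


def build_command(nodes: int) -> list[str]:
    """Build the run-test-e2e invocation from the steps for a node count."""
    return ["./run-test-e2e", f"--number-of-nodes={nodes}"]


def run_sweep(node_counts, dry_run=True):
    """Run the e2e test once per node count.

    Returns a dict mapping node count to (exit code, wall seconds), or to
    None for each entry when dry_run is set.
    """
    results = {}
    for nodes in node_counts:
        cmd = build_command(nodes)
        log_name = f"output-{nodes}-nodes.txt"
        if dry_run:
            print(shlex.join(cmd), "&>", log_name)
            results[nodes] = None
            continue
        # Clean up first so removing the old environment is not timed.
        subprocess.run(["./tools/remove-containers"], check=False)
        start = time.monotonic()
        with open(log_name, "w") as log:
            proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
        results[nodes] = (proc.returncode, time.monotonic() - start)
    return results


if __name__ == "__main__":
    run_sweep([100, 250, 500])
```

Calling `run_sweep([...], dry_run=False)` would execute the runs and write one log file per node count, named like the output file in the steps.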

dougsland added the bug Something isn't working label Sep 1, 2023

engelmi added testing This issue adds or improves the testing and removed bug Something isn't working labels Sep 14, 2023

mkemel added the backlog This is next up in priority label Nov 21, 2023
