Add alerts to notify vertical or horizontal scaling #2866

aruniiird · 2024-10-22T08:35:55Z

CPU usage high alerts are now split into TWO different categories:
First category: high CPU usage caused by a high MDS request rate. At this point we need to scale horizontally by adding more MDS pods.
Second category: high CPU usage alone. At this point we need to scale vertically by adding more resources (CPU, memory) to the pods.

openshift-ci · 2024-10-22T08:36:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aruniiird
Once this PR has been reviewed and has the lgtm label, please assign agarwal-mudit for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aruniiird · 2024-10-22T08:38:48Z

Converting this PR to draft, as we have to add/update the runbook links in the https://github.com/openshift/runbooks repo.

aruniiird · 2024-10-22T13:50:51Z

Created a PR: openshift/runbooks#217, to add the new files to the runbooks repo

umangachapagain · 2024-11-14T05:47:27Z

metrics/deploy/prometheus-ocs-rules.yaml

+        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
+        severity_level: warning
+      expr: |
+        (label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000)


Can you add a bit of explanation about the expression?

Why are we comparing with 0.67? 1000?

Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?

A high request rate leading to high CPU usage can be helped by offloading it to multiple MDS instances, while a low request rate with high CPU usage might be due to a lack of resources. That's what I understand from the alert.

@umangachapagain, @weirdwiz, we have made a change: we no longer alter the MDSCPUUsage alert expression (except for a minor cosmetic change), but we do change the description and runbook_url link according to the MDS request load.

A high request rate leading to high CPU usage can be helped by offloading it to multiple MDS instances, while a low request rate with high CPU usage might be due to a lack of resources. That's what I understand from the alert.

@weirdwiz, yes, you are absolutely right.

@umangachapagain,

Why are we comparing with 0.67? 1000?

I'm not really sure about the 0.67, as it was already part of the existing MDSCPUUsageHigh alert, which we never changed. My logical conclusion is that using more than roughly 70% of the CPU for the past 6 hours is considered a sign of high CPU usage.

Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?

With 67% as our CPU threshold, the number 1000 came from testing: when approximately 1000 or more requests per second were hitting the MDS, we saw CPU usage rise gradually and reach the CPU threshold within a 1-hour window.
That means that if the rate of MDS requests is approximately 1000 reqs/sec for an hour, CPU usage crosses the 67% threshold.

PS: please see the new changes. We are not modifying the expression much, but we are changing the description and runbook_url text according to the query (the rate over the past 6 hours: rate(ceph_mds_request[6h])).
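To make the split concrete, the decision described in this thread can be sketched in Python. This is only an illustration under stated assumptions, not the rule itself: the function name `scaling_advice` and the constant names are hypothetical, the values 0.67 and 1000 are taken from the expression under review, and strict greater-than comparisons are assumed at both boundaries.

```python
from typing import Optional

# Thresholds quoted in this thread; the names are hypothetical.
CPU_USAGE_THRESHOLD = 0.67           # fraction of the pod's CPU limit
MDS_REQUEST_RATE_THRESHOLD = 1000.0  # MDS requests per second


def scaling_advice(cpu_usage_ratio: float, mds_request_rate: float) -> Optional[str]:
    """Return 'horizontal', 'vertical', or None when CPU usage is not high."""
    if cpu_usage_ratio <= CPU_USAGE_THRESHOLD:
        # CPU usage is below the threshold, so no alert and no advice.
        return None
    if mds_request_rate > MDS_REQUEST_RATE_THRESHOLD:
        # High CPU together with a high request rate: add more MDS pods.
        return "horizontal"
    # High CPU without a high request rate: give the pods more resources.
    return "vertical"
```

The same CPU condition triggers both outcomes; only the request rate decides which description and runbook link would apply.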

umangachapagain · 2024-11-14T05:49:17Z

metrics/deploy/prometheus-ocs-rules.yaml

      annotations:
        description: |-
          Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage.
          Please consider increasing the CPU request for the {{ $labels.pod }} pod as described in the runbook.
+          This may help to process more requests and thus evict more items from cache.


We should either remove this statement or word it with more certainty. "This may help" is not a good response to an alert IMO.

I think incidental effects should be consolidated in the runbooks.

weirdwiz · 2024-11-14T08:31:16Z

metrics/deploy/prometheus-ocs-rules.yaml

+        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
+        severity_level: warning
+      expr: |
+        (label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000)


We should think about the delta used for calculating the rate. Are we accounting for short bursts of high requests followed by silence, or are we looking at a scenario where the request rate is consistently high?

If it's the latter, the delta will work appropriately.

Here we are looking for a consistently high request rate.
The delta has now been increased to 6 hours (the alert's waiting period). We moved the MDS request rate query to the annotations, so that when the alert fires (that is, after 6 hours) the rate selects the appropriate description and runbook_url link. As you mentioned (the larger the delta, the lower the jitter/error rate), a 6h delta should not produce unnecessary variance.
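The effect of the rate window can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `average_rate` and `burst_counter` are hypothetical helpers, the rate is simply the counter increase divided by elapsed time (ignoring counter resets and the extrapolation that PromQL's `rate()` performs), and the burst profile is invented for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def average_rate(samples: List[Sample], window_seconds: float, now: float) -> float:
    """Per-second counter increase over the trailing window (simplified)."""
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return 0.0
    (t_first, v_first), (t_last, v_last) = in_window[0], in_window[-1]
    return (v_last - v_first) / (t_last - t_first)


def burst_counter(t: float) -> float:
    """Invented workload: 5000 req/s for 20 minutes from the 5-hour mark, idle otherwise."""
    burst_start, burst_end, burst_rate = 18_000.0, 19_200.0, 5_000.0
    return burst_rate * (min(max(t, burst_start), burst_end) - burst_start)


now = 21_600.0  # six hours after the first sample
# One sample per minute, sorted by timestamp.
samples = [(float(t), burst_counter(float(t))) for t in range(0, 21_601, 60)]

rate_1h = average_rate(samples, 3_600.0, now)
rate_6h = average_rate(samples, 21_600.0, now)
print(f"1h window: {rate_1h:.1f} req/s, 6h window: {rate_6h:.1f} req/s")
```

With this invented burst, the 1-hour window reports a rate above 1000 req/s while the 6-hour window stays well below it, so a short burst alone would not select the horizontal-scaling text.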

Now CPU usage high alerts are categorized into TWO different scenarios.
First scenario: high CPU usage due to a high rate of incoming MDS requests. Solution: at this point we need to scale horizontally.
Second scenario: only CPU usage is high. Solution: at this point we need to add more resources to the existing MDS pods, thus scaling vertically.
Signed-off-by: Arun Kumar Mohan <[email protected]>

aruniiird · 2024-11-14T14:00:53Z

Screenshots of sample alerts showing how the description and runbook_url link are displayed:

Vertical scaling example

Horizontal scaling example

aruniiird marked this pull request as draft October 22, 2024 08:36

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Oct 22, 2024

aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 75b3bc3 to 4afc8a5 Compare November 11, 2024 19:26

aruniiird marked this pull request as ready for review November 13, 2024 08:45

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 13, 2024

agarwal-mudit requested review from umangachapagain and iamniting November 13, 2024 13:51

umangachapagain reviewed Nov 14, 2024

weirdwiz reviewed Nov 14, 2024

aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 4afc8a5 to 418a011 Compare November 14, 2024 09:38

agarwal-mudit requested review from weirdwiz and umangachapagain November 14, 2024 12:45

aruniiird force-pushed the add-new-CPU-usage-alerts-for-vertical-and-horizontal-scaling branch from 418a011 to 377ceb8 Compare November 14, 2024 13:57
