-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alerts to notify vertical or horizontal scaling #2866
base: main
Are you sure you want to change the base?
Add alerts to notify vertical or horizontal scaling #2866
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aruniiird The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Converting this PR to draft, as we have to add/update the runbooks links at |
Created a PR: openshift/runbooks#217, to add the new files to the runbooks repo |
75b3bc3
to
4afc8a5
Compare
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md | ||
severity_level: warning | ||
expr: | | ||
(label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a bit of explanation about the expression?
- Why are we comparing with 0.67? 1000?
- Why does
>1000
mean HorizontalScaling and<1000
mean VerticalScaling for the same expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High request leading to high cpu usage can be helped with offloading the it to multiple mds, and low requests but still high CPU usage might be because of lack of resources. That's what I understand from the alert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umangachapagain , @weirdwiz , we have made a change where we are not changing the MDSCPUUsage
alert expression (except for a minor cosmetic change), but we are changing the description
and runbook_url
link according to the mds request load.
High request leading to high cpu usage can be helped with offloading the it to multiple mds, and low requests but still high CPU usage might be because of lack of resources. That's what I understand from the alert
@weirdwiz , yes you are absolutely right.
Why are we comparing with 0.67? 1000?
About the 0.67
not really sure, as this was already there for the existing MDSCPUUsageHigh
alert, which we never changed. A logical conclusion I draw here is, if you are using more than ≈70%-ntage of CPU for past 6 hours, then it is considered as a sign of high CPU usage.
Why does >1000 mean HorizontalScaling and <1000 mean VerticalScaling for the same expression?
Now by keeping 67%
as our CPU threshold, the number 1000
was reached during testing, when (approx) 1000 or more requests were hitting the MD server, we saw a gradual CPU usage rise and in a 1hr window frame it reaches the CPU threshold.
That means if the rate of mds-requests is approx 1000 reqs / sec for an hour we see CPU usage crosses 67%
threshold.
PS: please see the new changes, here we are not doing much modification to the expression, but making description
and runbook_url
text changes according to the query (for past 6hrs rate query: rate(ceph_mds_request[6h])
)
annotations: | ||
description: |- | ||
Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage. | ||
Please consider increasing the CPU request for the {{ $labels.pod }} pod as described in the runbook. | ||
This may help to process more requests and thus evict more items from cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should either remove this statement, or word it with more assurity. "This may help" is not a good response to an alert IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think incedental affects should be consolidated to the runbooks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md | ||
severity_level: warning | ||
expr: | | ||
(label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67) and on (ceph_daemon, namespace, managedBy) (rate(ceph_mds_request[1h]) > 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should think about the delta for calculating the rate, are we countering for short bursts of high requests and then silence, or are we looking at the scenario where there is consistently high request rate?
If it's the latter, the delta will work appropriately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are looking for a consistent high requests rate.
Now the delta is brought up to 6 hrs (the time of waiting period). We now moved the mds-request rate query to the annotation
part, so that at the time of it being fired (that is after 6hrs) the rate will give appropriate description
and runbook_url
link. As you have mentioned (about higher the delta lower the jitter/error-rate), through a 6h delta span we should not have unnecessary variances.
4afc8a5
to
418a011
Compare
Now CPU usage high alerts are categorized to TWO different scenarios, First scenario: where we have high CPU usage due to high rate of mds requests coming in: Solution: at this point we need to scale horizontally Second section: where we have only CPU usage high: Solution: at this point we need to add more resources to the existing mds pods, thus scaling vertically. Signed-off-by: Arun Kumar Mohan <[email protected]>
418a011
to
377ceb8
Compare
Now CPU usage high alerts are categorized to TWO different sections,
First section: where we have high CPU usage due to high MDS requests rate: at this point we need to scale horzontally by adding more mds pods.
Second section: where we have only CPU usage high: at this point we need to add scale vertically by adding more resources (CPU, memory) to the pods.