Commit

Add alerts to notify vertical or horizontal scaling
CPU usage high alerts are now split into two scenarios.
First scenario: CPU usage is high because of a high rate of incoming
MDS requests.
Solution: scale horizontally.
Second scenario: only CPU usage is high.
Solution: add more resources to the existing MDS pods, that is, scale
vertically.

Signed-off-by: Arun Kumar Mohan <[email protected]>
aruniiird committed Nov 14, 2024
1 parent 201c936 commit 377ceb8
Showing 1 changed file with 10 additions and 4 deletions.
14 changes: 10 additions & 4 deletions metrics/deploy/prometheus-ocs-rules.yaml
@@ -407,13 +407,19 @@ spec:
  - alert: MDSCPUUsageHigh
  annotations:
  description: |-
- Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage.
- Please consider increasing the CPU request for the {{ $labels.pod }} pod as described in the runbook.
+ Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage
+ {{if query "rate(ceph_mds_request[6h]) >= 1000"}} and cannot cope
+ up with the current rate of mds requests. Please consider Horizontal
+ scaling, by adding another MDS pod{{else}}. Please consider Vertical
+ scaling, by adding more resources to the existing MDS pod{{end}}.
+ Please see 'runbook_url' for more details.
  message: Ceph metadata server pod ({{ $labels.pod }}) has high cpu usage
- runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md
+ runbook_url: '{{if query "rate(ceph_mds_request[6h]) >= 1000"}}https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsHorizontalScaling.md
+ {{else}}https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHighNeedsVerticalScaling.md
+ {{end}}'
  severity_level: warning
  expr: |
- pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"} > 0.67
+ label_replace(pod:container_cpu_usage:sum{pod=~"rook-ceph-mds.*"}/ on(pod, namespace) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-mds.*"}, "ceph_daemon", "mds.$1", "pod", "rook-ceph-mds-(.*)-(.*)") + on (ceph_daemon, namespace) group_left(managedBy) (0 * (ceph_mds_metadata ==1)) > 0.67
  for: 6h
  labels:
  severity: warning
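
The two-way decision encoded in the annotation templates above can be sketched in Python. This is only an illustration of the branching, not part of the commit: the function name `scaling_advice` and its parameters are hypothetical, the thresholds are copied from the rule (`> 0.67` in `expr`, `>= 1000` in the template query), and the `for: 6h` hold time is not modelled.

```python
from typing import Optional

# Thresholds taken from the rule in this commit.
CPU_RATIO_THRESHOLD = 0.67      # expr: CPU usage / CPU limit > 0.67
REQUEST_RATE_THRESHOLD = 1000   # template: rate(ceph_mds_request[6h]) >= 1000


def scaling_advice(cpu_usage: float, cpu_limit: float,
                   mds_request_rate: float) -> Optional[str]:
    """Return the suggested scaling direction, or None if the alert would not fire.

    Hypothetical helper: all inputs are plain numbers supplied by the caller,
    not values read from Prometheus.
    """
    if cpu_limit <= 0:
        raise ValueError("cpu_limit must be positive")
    # The alert only fires when usage exceeds 67% of the CPU limit.
    if cpu_usage / cpu_limit <= CPU_RATIO_THRESHOLD:
        return None
    # High CPU together with a high request rate: add another MDS pod.
    if mds_request_rate >= REQUEST_RATE_THRESHOLD:
        return "horizontal"
    # High CPU alone: give the existing MDS pod more resources.
    return "vertical"
```

For example, `scaling_advice(0.9, 1.0, 1500)` takes the horizontal branch, while `scaling_advice(0.9, 1.0, 200)` takes the vertical one.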
