Commit 1ce21b2

Add an alert to catch mismatch of sts replicas vs expected/ready (#654)
1 parent 74df668 commit 1ce21b2

4 files changed: +74 -0 lines changed
Diff for: docs/sop/observatorium.md

+34
@@ -31,6 +31,7 @@
   * [ObservatoriumNoRulesLoaded](#observatoriumnorulesloaded)
   * [ObservatoriumPersistentVolumeUsageHigh](#observatoriumpersistentvolumeusagehigh)
   * [ObservatoriumPersistentVolumeUsageCritical](#observatoriumpersistentvolumeusagecritical)
+  * [ObservatoriumExpectedReplicasUnavailable](#observatoriumexpectedreplicasunavailable)
 * [Observatorium Gubernator Alerts](#observatorium-gubernator-alerts)
   * [GubernatorIsDown](#gubernatorisdown)
 * [Observatorium Obsctl Reloader Alerts](#observatorium-obsctl-reloader-alerts)
@@ -866,6 +867,39 @@ One or more PVCs are filled to more than 95%. The remaining free space does not
 - Locate the affected deployment in the [AppSRE Interface](https://gitlab.cee.redhat.com/service/app-interface/-/tree/master/data/services/rhobs), depending on which namespace the alert is coming from
 - Increase the size of the PVC by adjusting the relevant parameter in one of the `saas.yaml` files
 
+## ObservatoriumExpectedReplicasUnavailable
+
+### Impact
+
+A StatefulSet belonging to the RHOBS service is not running the expected number of replicas for a prolonged period of time.
+This may impact the metric query or ingest performance of the system.
+
+### Summary
+
+A StatefulSet has an undesired amount of replicas. This may be caused by a number of reasons, including:
+
+1. Pod stuck in a terminating state.
+2. Pod unable to be scheduled due to constraints on the cluster such as node capacity or resource limits.
+
+### Severity
+
+`critical`
+
+### Access Required
+
+- Console access to the cluster that runs Observatorium.
+- Edit access to the Observatorium namespaces:
+  - `observatorium-metrics-stage`
+  - `observatorium-metrics-production`
+  - `observatorium-mst-stage`
+  - `observatorium-mst-production`
+
+### Steps
+
+- Check the alert and establish which component is the one affected.
+- Determine the reason for the missing replica(s).
+- Act on the above information to address the issue.
+
 # Observatorium Gubernator Alerts
 
 ## GubernatorIsDown
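
Reviewer sketch (not part of the commit): the SOP step "Determine the reason for the missing replica(s)" could be approximated as below. This is a minimal sketch that assumes the input is a list of pod objects shaped like the `items` of `kubectl get pods -o json`. The function names `classify_pod` and `summarize` are hypothetical, and only the two causes named in the SOP summary are distinguished.

```python
from typing import Any


def classify_pod(pod: dict[str, Any]) -> str:
    """Return a coarse reason why a pod may not count as a ready replica."""
    metadata = pod.get("metadata", {})
    status = pod.get("status", {})

    # A pod that is being deleted carries metadata.deletionTimestamp.
    # If it lingers in this state, it is stuck terminating (SOP cause 1).
    if metadata.get("deletionTimestamp"):
        return "terminating"

    # Index the pod conditions by type for easy lookup.
    conditions = {c.get("type"): c for c in status.get("conditions", [])}

    # The scheduler reports PodScheduled=False with reason Unschedulable
    # when no node satisfies the pod's constraints (SOP cause 2).
    scheduled = conditions.get("PodScheduled", {})
    if scheduled.get("status") == "False" and scheduled.get("reason") == "Unschedulable":
        return "unschedulable"

    if conditions.get("Ready", {}).get("status") == "True":
        return "ready"
    return "not-ready"


def summarize(pods: list[dict[str, Any]]) -> dict[str, int]:
    """Count pods per coarse reason."""
    counts: dict[str, int] = {}
    for pod in pods:
        reason = classify_pod(pod)
        counts[reason] = counts.get(reason, 0) + 1
    return counts
```

Anything reported as `not-ready` falls outside the two listed causes and would still need manual inspection of pod events and logs.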

Diff for: observability/prometheusrules.jsonnet

+14
@@ -357,6 +357,20 @@ local renderAlerts(name, environment, mixin) = {
           severity: 'critical',
         },
       },
+      {
+        alert: 'ObservatoriumExpectedReplicasUnavailable',
+        annotations: {
+          description: 'The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.',
+          summary: 'One or more workloads in Observatorium persistently have less replicas in a ready state than expected for an extended period.',
+        },
+        expr: |||
+          kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
+        |||,
+        'for': '20m',
+        labels: {
+          severity: 'critical',
+        },
+      },
     ],
   },
 ],
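
Reviewer sketch (not part of the commit): the rule above subtracts ready replicas from desired replicas and fires only once the difference has stayed positive for the whole `for` window of 20 minutes. The Python below is a simplified model of that behaviour under stated assumptions: evaluation samples arrive in timestamp order, and the names `Sample` and `is_firing` are hypothetical. It is not how Prometheus is implemented.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp_s: int  # evaluation time, in seconds
    desired: int      # value of kube_statefulset_replicas
    ready: int        # value of kube_statefulset_status_replicas_ready


def is_firing(samples: list[Sample], for_s: int = 20 * 60) -> bool:
    """Return True if desired - ready > 0 has held without interruption
    for at least for_s seconds up to the last sample."""
    pending_since = None
    for s in samples:
        if s.desired - s.ready > 0:
            # Condition is true: remember when the current streak started.
            if pending_since is None:
                pending_since = s.timestamp_s
        else:
            # Condition is false: any pending state is reset.
            pending_since = None
    if pending_since is None:
        return False
    return samples[-1].timestamp_s - pending_since >= for_s
```

A single evaluation at which all replicas are ready resets the window, so a StatefulSet that flaps faster than 20 minutes would not trigger this alert.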

Diff for: resources/observability/prometheusrules/observatorium-custom-metrics-production.prometheusrules.yaml

+13
@@ -67,3 +67,16 @@ spec:
       labels:
         service: telemeter
         severity: critical
+    - alert: ObservatoriumExpectedReplicasUnavailable
+      annotations:
+        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/observatorium-metrics?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
+        description: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
+        message: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
+        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#observatoriumexpectedreplicasunavailable
+        summary: One or more workloads in Observatorium persistently have less replicas in a ready state than expected for an extended period.
+      expr: |
+        kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
+      for: 20m
+      labels:
+        service: telemeter
+        severity: critical

Diff for: resources/observability/prometheusrules/observatorium-custom-metrics-stage.prometheusrules.yaml

+13
@@ -67,3 +67,16 @@ spec:
       labels:
         service: telemeter
         severity: high
+    - alert: ObservatoriumExpectedReplicasUnavailable
+      annotations:
+        dashboard: https://grafana.app-sre.devshift.net/d/no-dashboard/observatorium-metrics?orgId=1&refresh=10s&var-datasource={{$externalLabels.cluster}}-prometheus&var-namespace={{$labels.namespace}}&var-job=All&var-pod=All&var-interval=5m
+        description: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
+        message: The StatefulSet {{ $labels.statefulset }} in namespace {{ $labels.namespace }} has a mismatch between the expected and ready replicas.
+        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#observatoriumexpectedreplicasunavailable
+        summary: One or more workloads in Observatorium persistently have less replicas in a ready state than expected for an extended period.
+      expr: |
+        kube_statefulset_replicas - kube_statefulset_status_replicas_ready > 0
+      for: 20m
+      labels:
+        service: telemeter
+        severity: high
