Commit e92d5f5

Changes on Kubernetes monitors (#62)
* Add Kubernetes monitors
* typoe
* replace default apiserver by heartbeat
* add explanation on apiserver
* Add nginx is down monitor
* add vars on titles
* fix metric
1 parent 9f38952 commit e92d5f5

20 files changed, +784 -93 lines changed

caas/kubernetes/cluster/README.md

+11 -2
@@ -17,7 +17,8 @@ module "datadog-monitors-caas-kubernetes-cluster" {

 Creates DataDog monitors with the following checks:

-- Kubernetes API server does not respond
+- Kubernetes API server does not respond on {{kube_cluster_name}} (disabled by default)
+- Kubernetes cluster heartbeat alert on {{kube_cluster_name}}

 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -44,12 +45,13 @@ Creates DataDog monitors with the following checks:
 | Name | Type |
 |------|------|
 | [datadog_monitor.apiserver](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.heartbeat](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |

 ## Inputs

 | Name | Description | Type | Default | Required |
 |------|-------------|------|---------|:--------:|
-| <a name="input_apiserver_enabled"></a> [apiserver\_enabled](#input\_apiserver\_enabled) | Flag to enable API server monitor | `string` | `"true"` | no |
+| <a name="input_apiserver_enabled"></a> [apiserver\_enabled](#input\_apiserver\_enabled) | Flag to enable API server monitor (do not work on some clusters, see https://docs.datadoghq.com/containers/kubernetes/control_plane/?tab=datadogoperator#ManagedServices) | `string` | `"false"` | no |
 | <a name="input_apiserver_extra_tags"></a> [apiserver\_extra\_tags](#input\_apiserver\_extra\_tags) | Extra tags for API server monitor | `list(string)` | `[]` | no |
 | <a name="input_apiserver_message"></a> [apiserver\_message](#input\_apiserver\_message) | Custom message for API server monitor | `string` | `""` | no |
 | <a name="input_apiserver_no_data_timeframe"></a> [apiserver\_no\_data\_timeframe](#input\_apiserver\_no\_data\_timeframe) | Number of minutes before reporting no data | `string` | `10` | no |
@@ -60,6 +62,12 @@ Creates DataDog monitors with the following checks:
 | <a name="input_filter_tags_custom_excluded"></a> [filter\_tags\_custom\_excluded](#input\_filter\_tags\_custom\_excluded) | Tags excluded for custom filtering when filter\_tags\_use\_defaults is false | `string` | `""` | no |
 | <a name="input_filter_tags_separator"></a> [filter\_tags\_separator](#input\_filter\_tags\_separator) | Set the filter tags separator (, or AND) | `string` | `","` | no |
 | <a name="input_filter_tags_use_defaults"></a> [filter\_tags\_use\_defaults](#input\_filter\_tags\_use\_defaults) | Use default filter tags convention | `string` | `"true"` | no |
+| <a name="input_heartbeat_enabled"></a> [heartbeat\_enabled](#input\_heartbeat\_enabled) | Flag to enable heartbeat monitor | `string` | `"true"` | no |
+| <a name="input_heartbeat_extra_tags"></a> [heartbeat\_extra\_tags](#input\_heartbeat\_extra\_tags) | Extra tags for heartbeat monitor | `list(string)` | `[]` | no |
+| <a name="input_heartbeat_message"></a> [heartbeat\_message](#input\_heartbeat\_message) | Custom message for heartbeat monitor | `string` | `""` | no |
+| <a name="input_heartbeat_no_data_timeframe"></a> [heartbeat\_no\_data\_timeframe](#input\_heartbeat\_no\_data\_timeframe) | Number of minutes before reporting no data | `string` | `20` | no |
+| <a name="input_heartbeat_time_aggregator"></a> [heartbeat\_time\_aggregator](#input\_heartbeat\_time\_aggregator) | Time aggregator for heartbeat monitor | `string` | `"min"` | no |
+| <a name="input_heartbeat_timeframe"></a> [heartbeat\_timeframe](#input\_heartbeat\_timeframe) | Timeframe for heartbeat monitor | `string` | `"last_30m"` | no |
 | <a name="input_message"></a> [message](#input\_message) | Message sent when a monitor is triggered | `any` | n/a | yes |
 | <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
 | <a name="input_new_host_delay"></a> [new\_host\_delay](#input\_new\_host\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
@@ -74,6 +82,7 @@ Creates DataDog monitors with the following checks:
 | Name | Description |
 |------|-------------|
 | <a name="output_apiserver_id"></a> [apiserver\_id](#output\_apiserver\_id) | id for monitor apiserver |
+| <a name="output_heartbeat_id"></a> [heartbeat\_id](#output\_heartbeat\_id) | id for monitor heartbeat |
 <!-- END_TF_DOCS -->
 ## Related documentation

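
Reviewer note: after this change the API server monitor is opt-in and the heartbeat monitor is on by default. A minimal sketch of a caller that opts back in to the API server check might look like the following. The `source` path, the `environment` value, and the notification handle are assumptions for illustration, and the module may require other inputs that this excerpt does not show.

```hcl
# Sketch only: `source`, `environment`, and the notification handle are hypothetical.
module "datadog-monitors-caas-kubernetes-cluster" {
  source = "./caas/kubernetes/cluster" # assumption: local checkout of this repository

  environment = "prod"              # interpolated into monitor names via var.environment
  message     = "@pagerduty-oncall" # hypothetical notification handle

  # Off by default after this commit; enable only on clusters where the
  # control plane check is actually reported.
  apiserver_enabled = "true"

  # Heartbeat defaults from inputs.tf, written out explicitly.
  heartbeat_enabled           = "true"
  heartbeat_no_data_timeframe = 20
  heartbeat_timeframe         = "last_30m"
}
```

The flags are strings (`"true"` / `"false"`), not booleans, because the resources compare them with `== "true"`.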

caas/kubernetes/cluster/inputs.tf

+39 -3
@@ -66,11 +66,11 @@ variable "apiserver_no_data_timeframe" {
 }

 # Datadog monitors variables
-
+## API server monitor variables
 variable "apiserver_enabled" {
-  description = "Flag to enable API server monitor"
+  description = "Flag to enable API server monitor (do not work on some clusters, see https://docs.datadoghq.com/containers/kubernetes/control_plane/?tab=datadogoperator#ManagedServices)"
   type        = string
-  default     = "true"
+  default     = "false"
 }

 variable "apiserver_extra_tags" {
@@ -91,3 +91,39 @@ variable "apiserver_threshold_warning" {
   default     = 3
 }

+## Heartbeat monitor variables
+variable "heartbeat_enabled" {
+  description = "Flag to enable heartbeat monitor"
+  type        = string
+  default     = "true"
+}
+
+variable "heartbeat_message" {
+  description = "Custom message for heartbeat monitor"
+  type        = string
+  default     = ""
+}
+
+variable "heartbeat_no_data_timeframe" {
+  description = "Number of minutes before reporting no data"
+  type        = string
+  default     = 20
+}
+
+variable "heartbeat_time_aggregator" {
+  description = "Time aggregator for heartbeat monitor"
+  type        = string
+  default     = "min"
+}
+
+variable "heartbeat_timeframe" {
+  description = "Timeframe for heartbeat monitor"
+  type        = string
+  default     = "last_30m"
+}
+
+variable "heartbeat_extra_tags" {
+  description = "Extra tags for heartbeat monitor"
+  type        = list(string)
+  default     = []
+}

caas/kubernetes/cluster/monitors-k8s-cluster.tf

+30 -3
@@ -1,12 +1,12 @@
 resource "datadog_monitor" "apiserver" {
   count   = var.apiserver_enabled == "true" ? 1 : 0
-  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes API server does not respond"
+  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes API server does not respond on {{kube_cluster_name}}"
   message = coalesce(var.apiserver_message, var.message)

   type = "service check"

   query = <<EOQ
-    "kube_apiserver_controlplane.up"${module.filter-tags.service_check}.last(6).count_by_status()
+    "kube_apiserver_controlplane.up"${module.filter-tags.service_check}.by("kube_cluster_name").last(6).count_by_status()
 EOQ

   monitor_thresholds {
@@ -16,7 +16,7 @@ EOQ

   new_host_delay      = var.new_host_delay
   new_group_delay     = var.new_group_delay
-  notify_no_data      = var.notify_no_data
+  notify_no_data      = false
   no_data_timeframe   = var.apiserver_no_data_timeframe
   renotify_interval   = 0
   notify_audit        = false
@@ -26,3 +26,30 @@ EOQ

   tags = concat(local.common_tags, var.tags, var.apiserver_extra_tags)
 }
+
+resource "datadog_monitor" "heartbeat" {
+  count   = var.heartbeat_enabled == "true" ? 1 : 0
+  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Kubernetes cluster heartbeat alert on {{kube_cluster_name}}"
+  message = coalesce(var.heartbeat_message, var.message)
+  type    = "metric alert"
+
+  query = <<EOQ
+    ${var.heartbeat_time_aggregator}(${var.heartbeat_timeframe}):
+    sum:kubernetes.pods.running${module.filter-tags.query_alert} by {kube_cluster_name} > 1000000
+EOQ
+
+  monitor_thresholds {
+    critical = 1000000 # high threshold to handle no data only
+  }
+
+  new_group_delay     = var.new_group_delay
+  notify_no_data      = true
+  no_data_timeframe   = var.heartbeat_no_data_timeframe
+  renotify_interval   = 0
+  notify_audit        = false
+  timeout_h           = var.timeout_h
+  include_tags        = true
+  require_full_window = true
+
+  tags = concat(local.common_tags, var.tags, var.heartbeat_extra_tags)
+}
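
Reviewer note: the heartbeat monitor sets a threshold of 1000000 running pods so that, in practice, only the no-data condition fires (`notify_no_data = true` with `no_data_timeframe`). The stand-alone sketch below shows how the query string is assembled from the variable defaults. The `query_filter` value is an assumed stand-in for `module.filter-tags.query_alert`, whose real output is not shown in this diff, and the pieces are joined with spaces where the heredoc uses newlines.

```hcl
# Stand-alone sketch: needs no provider, only locals and an output.
locals {
  heartbeat_time_aggregator = "min"      # default of var.heartbeat_time_aggregator
  heartbeat_timeframe       = "last_30m" # default of var.heartbeat_timeframe
  query_filter              = "{*}"      # assumption: stand-in for module.filter-tags.query_alert

  # Same pieces as the heredoc in the resource, joined with spaces.
  heartbeat_query = join(" ", [
    "${local.heartbeat_time_aggregator}(${local.heartbeat_timeframe}):",
    "sum:kubernetes.pods.running${local.query_filter} by {kube_cluster_name}",
    "> 1000000",
  ])
}

output "heartbeat_query" {
  # Renders as:
  # min(last_30m): sum:kubernetes.pods.running{*} by {kube_cluster_name} > 1000000
  value = local.heartbeat_query
}
```

Grouping by `kube_cluster_name` is what lets the monitor title use the `{{kube_cluster_name}}` template variable.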

caas/kubernetes/cluster/outputs.tf

+5
@@ -3,3 +3,8 @@ output "apiserver_id" {
   value       = datadog_monitor.apiserver.*.id
 }

+output "heartbeat_id" {
+  description = "id for monitor heartbeat"
+  value       = datadog_monitor.heartbeat.*.id
+}
+
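
Reviewer note: because the resource uses `count`, the splat expression `datadog_monitor.heartbeat.*.id` yields a list with zero or one element, so `heartbeat_id` is a list rather than a single ID. A caller that wants a scalar could unwrap it with Terraform's built-in `one()` function (available from Terraform 0.15). The sketch below uses hypothetical local values in place of the real module output.

```hcl
# Stand-alone sketch: the ID below is hypothetical, not a real monitor.
locals {
  heartbeat_id_when_enabled  = ["1234567"] # shape of the output when heartbeat_enabled = "true"
  heartbeat_id_when_disabled = []          # shape of the output when heartbeat_enabled = "false"

  # one() returns the single element, or null for an empty collection.
  enabled_id  = one(local.heartbeat_id_when_enabled)
  disabled_id = one(local.heartbeat_id_when_disabled)
}

output "enabled_id" {
  value = local.enabled_id
}
```

Against the real module this would read `one(module.<name>.heartbeat_id)`, where `<name>` is whatever label the caller gave the module block.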

caas/kubernetes/ingress/vts/README.md

+10
@@ -19,6 +19,7 @@ Creates DataDog monitors with the following checks:

 - Nginx Ingress 4xx errors
 - Nginx Ingress 5xx errors
+- Nginx Ingress {{kube_replica_set}} is down on {{kube_cluster_name}}

 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -46,6 +47,7 @@ Creates DataDog monitors with the following checks:

 | Name | Type |
 |------|------|
+| [datadog_monitor.nginx_ingress_is_down](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.nginx_ingress_too_many_4xx](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.nginx_ingress_too_many_5xx](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |

@@ -74,6 +76,13 @@ Creates DataDog monitors with the following checks:
 | <a name="input_ingress_5xx_threshold_warning"></a> [ingress\_5xx\_threshold\_warning](#input\_ingress\_5xx\_threshold\_warning) | 5xx warning threshold in percentage | `string` | `"10"` | no |
 | <a name="input_ingress_5xx_time_aggregator"></a> [ingress\_5xx\_time\_aggregator](#input\_ingress\_5xx\_time\_aggregator) | Monitor aggregator for Ingress 5xx errors [available values: min, max or avg] | `string` | `"min"` | no |
 | <a name="input_ingress_5xx_timeframe"></a> [ingress\_5xx\_timeframe](#input\_ingress\_5xx\_timeframe) | Monitor timeframe for Ingress 5xx errors [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
+| <a name="input_ingress_down_enabled"></a> [ingress\_down\_enabled](#input\_ingress\_down\_enabled) | Flag to enable Nginx Ingress is down monitor | `string` | `"true"` | no |
+| <a name="input_ingress_down_extra_tags"></a> [ingress\_down\_extra\_tags](#input\_ingress\_down\_extra\_tags) | Extra tags for Nginx Ingress is down monitor | `list(string)` | `[]` | no |
+| <a name="input_ingress_down_message"></a> [ingress\_down\_message](#input\_ingress\_down\_message) | Message sent when an alert is triggered | `string` | `""` | no |
+| <a name="input_ingress_down_threshold_critical"></a> [ingress\_down\_threshold\_critical](#input\_ingress\_down\_threshold\_critical) | Nginx Ingress is down critical threshold in percentage | `number` | `0.3` | no |
+| <a name="input_ingress_down_threshold_warning"></a> [ingress\_down\_threshold\_warning](#input\_ingress\_down\_threshold\_warning) | Nginx Ingress is down warning threshold in percentage | `number` | `0.7` | no |
+| <a name="input_ingress_down_time_aggregator"></a> [ingress\_down\_time\_aggregator](#input\_ingress\_down\_time\_aggregator) | Monitor aggregator for Nginx Ingress is down [available values: min, max or avg] | `string` | `"avg"` | no |
+| <a name="input_ingress_down_timeframe"></a> [ingress\_down\_timeframe](#input\_ingress\_down\_timeframe) | Monitor timeframe for Nginx Ingress is down [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_10m"` | no |
 | <a name="input_message"></a> [message](#input\_message) | Message sent when an alert is triggered | `any` | n/a | yes |
 | <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
 | <a name="input_new_host_delay"></a> [new\_host\_delay](#input\_new\_host\_delay) | Delay in seconds before monitor new resource | `number` | `300` | no |
@@ -87,6 +96,7 @@ Creates DataDog monitors with the following checks:

 | Name | Description |
 |------|-------------|
+| <a name="output_nginx_ingress_is_down_id"></a> [nginx\_ingress\_is\_down\_id](#output\_nginx\_ingress\_is\_down\_id) | id for monitor nginx\_ingress\_is\_down |
 | <a name="output_nginx_ingress_too_many_4xx_id"></a> [nginx\_ingress\_too\_many\_4xx\_id](#output\_nginx\_ingress\_too\_many\_4xx\_id) | id for monitor nginx\_ingress\_too\_many\_4xx |
 | <a name="output_nginx_ingress_too_many_5xx_id"></a> [nginx\_ingress\_too\_many\_5xx\_id](#output\_nginx\_ingress\_too\_many\_5xx\_id) | id for monitor nginx\_ingress\_too\_many\_5xx |
 <!-- END_TF_DOCS -->

caas/kubernetes/ingress/vts/inputs.tf

+44 -2
@@ -59,8 +59,8 @@ variable "filter_tags_separator" {
   default     = ","
 }

-#Ingress
-
+# Nginx Ingress
+## Nginx Ingress 5xx errors monitor
 variable "ingress_5xx_enabled" {
   description = "Flag to enable Ingress 5xx errors monitor"
   type        = string
@@ -102,6 +102,7 @@ variable "ingress_5xx_threshold_warning" {
   description = "5xx warning threshold in percentage"
 }

+## Nginx Ingress 4xx errors monitor
 variable "ingress_4xx_enabled" {
   description = "Flag to enable Ingress 4xx errors monitor"
   type        = string
@@ -148,3 +149,44 @@ variable "artificial_requests_count" {
   description = "Number of false requests used to mitigate false positive in case of low trafic"
 }

+## Nginx Ingress is down monitor
+variable "ingress_down_enabled" {
+  type        = string
+  default     = "true"
+  description = "Flag to enable Nginx Ingress is down monitor"
+}
+
+variable "ingress_down_message" {
+  default     = ""
+  description = "Message sent when an alert is triggered"
+}
+
+variable "ingress_down_time_aggregator" {
+  type        = string
+  default     = "avg"
+  description = "Monitor aggregator for Nginx Ingress is down [available values: min, max or avg]"
+}
+
+variable "ingress_down_timeframe" {
+  type        = string
+  default     = "last_10m"
+  description = "Monitor timeframe for Nginx Ingress is down [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`]"
+}
+
+variable "ingress_down_threshold_critical" {
+  type        = number
+  default     = 0.3
+  description = "Nginx Ingress is down critical threshold in percentage"
+}
+
+variable "ingress_down_threshold_warning" {
+  type        = number
+  default     = 0.7
+  description = "Nginx Ingress is down warning threshold in percentage"
+}
+
+variable "ingress_down_extra_tags" {
+  type        = list(string)
+  default     = []
+  description = "Extra tags for Nginx Ingress is down monitor"
+}

caas/kubernetes/ingress/vts/monitors-ingress.tf

+28
@@ -60,3 +60,31 @@ EOQ
   tags = concat(local.common_tags, var.tags, var.ingress_4xx_extra_tags)
 }

+resource "datadog_monitor" "nginx_ingress_is_down" {
+  count   = var.ingress_down_enabled == "true" ? 1 : 0
+  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] Nginx Ingress {{kube_replica_set}} is down on {{kube_cluster_name}}"
+  message = coalesce(var.ingress_down_message, var.message)
+  type    = "query alert"
+
+  query = <<EOQ
+    ${var.ingress_down_time_aggregator}(${var.ingress_down_timeframe}):
+    avg:nginx_ingress.nginx_up${module.filter-tags.query_alert} by {kube_replica_set,kube_cluster_name}
+    <= ${var.ingress_down_threshold_critical}
+EOQ
+
+  monitor_thresholds {
+    warning  = var.ingress_down_threshold_warning
+    critical = var.ingress_down_threshold_critical
+  }
+
+  evaluation_delay    = var.evaluation_delay
+  new_group_delay     = var.new_group_delay
+  notify_no_data      = true
+  renotify_interval   = 0
+  notify_audit        = false
+  timeout_h           = var.timeout_h
+  include_tags        = true
+  require_full_window = true
+
+  tags = concat(local.common_tags, var.tags, var.ingress_down_extra_tags)
+}
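
Reviewer note: the query compares with `<=`, so the warning threshold has to sit above the critical one, as the defaults (0.7 and 0.3) do. Those defaults also suggest the thresholds are a 0 to 1 average of `nginx_ingress.nginx_up` rather than a 0 to 100 percentage, despite the variable descriptions. A minimal sketch of a caller tightening both thresholds follows. The module label, `source` path, `environment` value, and notification handle are assumptions for illustration, and the module may require inputs not shown in this excerpt.

```hcl
# Sketch only: module label, `source`, `environment`, and handle are hypothetical.
module "datadog-monitors-caas-kubernetes-ingress-vts" {
  source = "./caas/kubernetes/ingress/vts" # assumption: local checkout of this repository

  environment = "prod"       # interpolated into monitor names via var.environment
  message     = "@slack-ops" # hypothetical notification handle

  ingress_down_time_aggregator = "avg"
  ingress_down_timeframe       = "last_10m"

  # "Below" comparison: warning must be greater than critical.
  ingress_down_threshold_warning  = 0.8
  ingress_down_threshold_critical = 0.5
}
```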

caas/kubernetes/ingress/vts/outputs.tf

+5
@@ -1,3 +1,8 @@
+output "nginx_ingress_is_down_id" {
+  description = "id for monitor nginx_ingress_is_down"
+  value       = datadog_monitor.nginx_ingress_is_down.*.id
+}
+
 output "nginx_ingress_too_many_4xx_id" {
   description = "id for monitor nginx_ingress_too_many_4xx"
   value       = datadog_monitor.nginx_ingress_too_many_4xx.*.id

caas/kubernetes/node/README.md

+9 -16
@@ -17,16 +17,15 @@ module "datadog-monitors-caas-kubernetes-node" {

 Creates DataDog monitors with the following checks:

-- Kubernetes Node Disk pressure
-- Kubernetes Node Frequent unregister net device
-- Kubernetes Node Kubelet API does not respond
-- Kubernetes Node Kubelet sync loop that updates containers does not work
-- Kubernetes Node Memory pressure
-- Kubernetes Node not ready
-- Kubernetes Node Out of disk
-- Kubernetes Node unschedulable
-- Kubernetes Node volume inodes usage
-- Kubernetes Node volume space usage
+- Kubernetes Node {{kube_node}} disk pressure on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} frequent unregister net device
+- Kubernetes Node {{kube_node}} Kubelet API does not respond on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} Kubelet sync loop that updates containers does not work on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} memory pressure on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} not ready on {{kube_cluster_name}}
+- Kubernetes Node {{kube_node}} unschedulable on {{kube_cluster_name}}
+- Kubernetes Node volume {{persistentvolumeclaim}} inodes usage
+- Kubernetes Node volume {{persistentvolumeclaim}} space usage

 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -53,7 +52,6 @@ Creates DataDog monitors with the following checks:

 | Name | Type |
 |------|------|
-| [datadog_monitor.disk_out](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.disk_pressure](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.kubelet_ping](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.kubelet_syncloop](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
@@ -68,10 +66,6 @@ Creates DataDog monitors with the following checks:

 | Name | Description | Type | Default | Required |
 |------|-------------|------|---------|:--------:|
-| <a name="input_disk_out_enabled"></a> [disk\_out\_enabled](#input\_disk\_out\_enabled) | Flag to enable Out of disk monitor | `string` | `"true"` | no |
-| <a name="input_disk_out_extra_tags"></a> [disk\_out\_extra\_tags](#input\_disk\_out\_extra\_tags) | Extra tags for Out of disk monitor | `list(string)` | `[]` | no |
-| <a name="input_disk_out_message"></a> [disk\_out\_message](#input\_disk\_out\_message) | Custom message for Out of disk monitor | `string` | `""` | no |
-| <a name="input_disk_out_threshold_warning"></a> [disk\_out\_threshold\_warning](#input\_disk\_out\_threshold\_warning) | Out of disk monitor (warning threshold) | `string` | `3` | no |
 | <a name="input_disk_pressure_enabled"></a> [disk\_pressure\_enabled](#input\_disk\_pressure\_enabled) | Flag to enable Disk pressure monitor | `string` | `"true"` | no |
 | <a name="input_disk_pressure_extra_tags"></a> [disk\_pressure\_extra\_tags](#input\_disk\_pressure\_extra\_tags) | Extra tags for Disk pressure monitor | `list(string)` | `[]` | no |
 | <a name="input_disk_pressure_message"></a> [disk\_pressure\_message](#input\_disk\_pressure\_message) | Custom message for Disk pressure monitor | `string` | `""` | no |
@@ -137,7 +131,6 @@ Creates DataDog monitors with the following checks:

 | Name | Description |
 |------|-------------|
-| <a name="output_disk_out_id"></a> [disk\_out\_id](#output\_disk\_out\_id) | id for monitor disk\_out |
 | <a name="output_disk_pressure_id"></a> [disk\_pressure\_id](#output\_disk\_pressure\_id) | id for monitor disk\_pressure |
 | <a name="output_kubelet_ping_id"></a> [kubelet\_ping\_id](#output\_kubelet\_ping\_id) | id for monitor kubelet\_ping |
 | <a name="output_kubelet_syncloop_id"></a> [kubelet\_syncloop\_id](#output\_kubelet\_syncloop\_id) | id for monitor kubelet\_syncloop |
