Commit 31fb26d

Add snapshot monitors for elasticsearch (#60)
* add snapshot monitors for elasticsearch
* fix missing default on variables
1 parent c6d6293 commit 31fb26d

4 files changed, +203 -35 lines changed

database/elasticsearch/README.md

+31 -13
@@ -23,22 +23,24 @@ Creates DataDog monitors with the following checks:
 - Elasticsearch average search fetch latency
 - Elasticsearch average search query latency
 - Elasticsearch average Young-generation garbage collections latency
-- Elasticsearch change alert on the average time spent by tasks in the queue
-- Elasticsearch change alert on the number of currently active queries
-- Elasticsearch change alert on the number of query cache evictions
-- Elasticsearch change alert on the number of request cache evictions
-- Elasticsearch change alert on the number of search fetches currently running
-- Elasticsearch change alert on the total number of evictions from the fielddata cache
-- ElasticSearch Cluster has unassigned shards
-- ElasticSearch Cluster is initializing shards
-- ElasticSearch Cluster is relocating shards
-- ElasticSearch Cluster status not green
-- ElasticSearch does not respond
-- ElasticSearch free space < 10%
+- Elasticsearch change alert on the average time spent by tasks in the queue on {{cluster_name}}
+- Elasticsearch change alert on the number of currently active queries on {{cluster_name}}
+- Elasticsearch change alert on the number of query cache evictions on {{node_name}}
+- Elasticsearch change alert on the number of request cache evictions on {{node_name}}
+- Elasticsearch change alert on the number of search fetches currently running on {{cluster_name}}
+- Elasticsearch change alert on the total number of evictions from the fielddata cache on {{node_name}}
+- ElasticSearch Cluster has unassigned shards on {{cluster_name}}
+- ElasticSearch Cluster is initializing shards on {{cluster_name}}
+- ElasticSearch Cluster is relocating shards on {{cluster_name}}
+- ElasticSearch Cluster status not green on {{cluster_name}}
+- ElasticSearch does not respond on {{server}}:{{port}}
+- ElasticSearch free space < 10% on {{node_name}}
 - Elasticsearch JVM HEAP memory usage
 - Elasticsearch JVM memory Old usage
 - Elasticsearch JVM memory Young usage
-- Elasticsearch number of current open HTTP connections anomaly detected
+- Elasticsearch number of current open HTTP connections anomaly detected on {{node_name}}
+- Elasticsearch {{policy}} snapshot deletion failure on {{cluster_name}}
+- Elasticsearch {{policy}} snapshot failed on {{cluster_name}}
 
 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -85,6 +87,8 @@ Creates DataDog monitors with the following checks:
 | [datadog_monitor.request_cache_evictions_change](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.search_query_change](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.search_query_latency](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.slm_snapshot_deletion_failures](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.slm_snapshots_failed](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.task_time_in_queue_change](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 
 ## Inputs
@@ -258,6 +262,18 @@ Creates DataDog monitors with the following checks:
 | <a name="input_search_query_latency_threshold_warning"></a> [search\_query\_latency\_threshold\_warning](#input\_search\_query\_latency\_threshold\_warning) | Cluster Status warning threshold | `string` | `10` | no |
 | <a name="input_search_query_latency_time_aggregator"></a> [search\_query\_latency\_time\_aggregator](#input\_search\_query\_latency\_time\_aggregator) | Time aggregator for the Cluster Status monitor | `string` | `"avg"` | no |
 | <a name="input_search_query_latency_timeframe"></a> [search\_query\_latency\_timeframe](#input\_search\_query\_latency\_timeframe) | Timeframe for the Cluster Status monitor | `string` | `"last_15m"` | no |
+| <a name="input_slm_snapshot_deletion_failures_enabled"></a> [slm\_snapshot\_deletion\_failures\_enabled](#input\_slm\_snapshot\_deletion\_failures\_enabled) | Flag to enable SLM Snapshot deletion failures monitor | `string` | `"true"` | no |
+| <a name="input_slm_snapshot_deletion_failures_extra_tags"></a> [slm\_snapshot\_deletion\_failures\_extra\_tags](#input\_slm\_snapshot\_deletion\_failures\_extra\_tags) | Extra tags for SLM Snapshot deletion failures monitor | `list(string)` | `[]` | no |
+| <a name="input_slm_snapshot_deletion_failures_message"></a> [slm\_snapshot\_deletion\_failures\_message](#input\_slm\_snapshot\_deletion\_failures\_message) | Custom message for SLM Snapshot deletion failures monitor | `string` | `""` | no |
+| <a name="input_slm_snapshot_deletion_failures_threshold_critical"></a> [slm\_snapshot\_deletion\_failures\_threshold\_critical](#input\_slm\_snapshot\_deletion\_failures\_threshold\_critical) | SLM Snapshot deletion failures critical threshold | `string` | `1` | no |
+| <a name="input_slm_snapshot_deletion_failures_time_aggregator"></a> [slm\_snapshot\_deletion\_failures\_time\_aggregator](#input\_slm\_snapshot\_deletion\_failures\_time\_aggregator) | Time aggregator for SLM Snapshot deletion failures monitor | `string` | `"avg"` | no |
+| <a name="input_slm_snapshot_deletion_failures_timeframe"></a> [slm\_snapshot\_deletion\_failures\_timeframe](#input\_slm\_snapshot\_deletion\_failures\_timeframe) | SLM Snapshot deletion failures timeframe | `string` | `"last_5m"` | no |
+| <a name="input_slm_snapshots_failed_enabled"></a> [slm\_snapshots\_failed\_enabled](#input\_slm\_snapshots\_failed\_enabled) | Flag to enable SLM Snapshots Failed monitor | `string` | `"true"` | no |
+| <a name="input_slm_snapshots_failed_extra_tags"></a> [slm\_snapshots\_failed\_extra\_tags](#input\_slm\_snapshots\_failed\_extra\_tags) | Extra tags for SLM Snapshots Failed monitor | `list(string)` | `[]` | no |
+| <a name="input_slm_snapshots_failed_message"></a> [slm\_snapshots\_failed\_message](#input\_slm\_snapshots\_failed\_message) | Custom message for SLM Snapshots Failed monitor | `string` | `""` | no |
+| <a name="input_slm_snapshots_failed_threshold_critical"></a> [slm\_snapshots\_failed\_threshold\_critical](#input\_slm\_snapshots\_failed\_threshold\_critical) | SLM Snapshots Failed critical threshold | `string` | `1` | no |
+| <a name="input_slm_snapshots_failed_time_aggregator"></a> [slm\_snapshots\_failed\_time\_aggregator](#input\_slm\_snapshots\_failed\_time\_aggregator) | Time aggregator for SLM Snapshots Failed monitor | `string` | `"avg"` | no |
+| <a name="input_slm_snapshots_failed_timeframe"></a> [slm\_snapshots\_failed\_timeframe](#input\_slm\_snapshots\_failed\_timeframe) | SLM Snapshots Failed timeframe | `string` | `"last_5m"` | no |
 | <a name="input_tags"></a> [tags](#input\_tags) | Global variables | `list(string)` | <pre>[<br> "type:database",<br> "provider:elasticsearch",<br> "resource:elasticsearch"<br>]</pre> | no |
 | <a name="input_task_time_in_queue_change_enabled"></a> [task\_time\_in\_queue\_change\_enabled](#input\_task\_time\_in\_queue\_change\_enabled) | Flag to enable Cluster Status monitor | `string` | `"true"` | no |
 | <a name="input_task_time_in_queue_change_extra_tags"></a> [task\_time\_in\_queue\_change\_extra\_tags](#input\_task\_time\_in\_queue\_change\_extra\_tags) | Extra tags for Cluster Status monitor | `list(string)` | `[]` | no |
@@ -295,6 +311,8 @@ Creates DataDog monitors with the following checks:
 | <a name="output_request_cache_evictions_change_id"></a> [request\_cache\_evictions\_change\_id](#output\_request\_cache\_evictions\_change\_id) | id for monitor request\_cache\_evictions\_change |
 | <a name="output_search_query_change_id"></a> [search\_query\_change\_id](#output\_search\_query\_change\_id) | id for monitor search\_query\_change |
 | <a name="output_search_query_latency_id"></a> [search\_query\_latency\_id](#output\_search\_query\_latency\_id) | id for monitor search\_query\_latency |
+| <a name="output_slm_snapshot_deletion_failures_id"></a> [slm\_snapshot\_deletion\_failures\_id](#output\_slm\_snapshot\_deletion\_failures\_id) | id for monitor slm\_snapshot\_deletion\_failures |
+| <a name="output_slm_snapshots_failed_id"></a> [slm\_snapshots\_failed\_id](#output\_slm\_snapshots\_failed\_id) | id for monitor slm\_snapshots\_failed |
 | <a name="output_task_time_in_queue_change_id"></a> [task\_time\_in\_queue\_change\_id](#output\_task\_time\_in\_queue\_change\_id) | id for monitor task\_time\_in\_queue\_change |
 <!-- END_TF_DOCS -->
 ## Related documentation

database/elasticsearch/inputs.tf

+79
@@ -1110,3 +1110,82 @@ variable "not_responding_extra_tags" {
   default = []
 }
 
+#
+# SLM - Snapshots Failed
+#
+
+variable "slm_snapshots_failed_enabled" {
+  description = "Flag to enable SLM Snapshots Failed monitor"
+  type = string
+  default = "true"
+}
+
+variable "slm_snapshots_failed_message" {
+  description = "Custom message for SLM Snapshots Failed monitor"
+  type = string
+  default = ""
+}
+
+variable "slm_snapshots_failed_time_aggregator" {
+  description = "Time aggregator for SLM Snapshots Failed monitor"
+  type = string
+  default = "avg"
+}
+
+variable "slm_snapshots_failed_timeframe" {
+  description = "SLM Snapshots Failed timeframe"
+  type = string
+  default = "last_5m"
+}
+
+variable "slm_snapshots_failed_threshold_critical" {
+  description = "SLM Snapshots Failed critical threshold"
+  type = string
+  default = 1
+}
+
+variable "slm_snapshots_failed_extra_tags" {
+  description = "Extra tags for SLM Snapshots Failed monitor"
+  type = list(string)
+  default = []
+}
+
+#
+# SLM - Snapshot deletion failures
+#
+
+variable "slm_snapshot_deletion_failures_enabled" {
+  description = "Flag to enable SLM Snapshot deletion failures monitor"
+  type = string
+  default = "true"
+}
+
+variable "slm_snapshot_deletion_failures_message" {
+  description = "Custom message for SLM Snapshot deletion failures monitor"
+  type = string
+  default = ""
+}
+
+variable "slm_snapshot_deletion_failures_time_aggregator" {
+  description = "Time aggregator for SLM Snapshot deletion failures monitor"
+  type = string
+  default = "avg"
+}
+
+variable "slm_snapshot_deletion_failures_timeframe" {
+  description = "SLM Snapshot deletion failures timeframe"
+  type = string
+  default = "last_5m"
+}
+
+variable "slm_snapshot_deletion_failures_threshold_critical" {
+  description = "SLM Snapshot deletion failures critical threshold"
+  type = string
+  default = 1
+}
+
+variable "slm_snapshot_deletion_failures_extra_tags" {
+  description = "Extra tags for SLM Snapshot deletion failures monitor"
+  type = list(string)
+  default = []
+}
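
For reviewers who want to see how the new inputs would be consumed, here is a minimal sketch of a module call that overrides a few of the SLM variables added in this commit. The module label `elasticsearch_monitors`, the relative `source` path, and the `team:search` tag are illustrative assumptions, not taken from the commit. The module may also require other inputs that this diff does not show.

```hcl
# Hypothetical module call: the label and the source path are assumptions.
module "elasticsearch_monitors" {
  source = "./database/elasticsearch"

  # Other inputs the module may require are not visible in this diff
  # and are omitted here.

  # SLM - Snapshots Failed: widen the evaluation window and add a tag.
  slm_snapshots_failed_enabled            = "true"
  slm_snapshots_failed_timeframe          = "last_15m"
  slm_snapshots_failed_threshold_critical = "1"
  slm_snapshots_failed_extra_tags         = ["team:search"] # illustrative tag

  # SLM - Snapshot deletion failures: switch this monitor off.
  slm_snapshot_deletion_failures_enabled = "false"
}
```

Every variable in the sketch has a default in `inputs.tf`, so any of these lines can be dropped to fall back to the values shown in the diff above (`"true"`, `"last_5m"`, `1`, `[]`).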
