Commit bfb3e0a

JakeSCahill, pgellert, and Feediver1 authored
DOC-975 crash_loop_sleep_sec broker config (#966)
Co-authored-by: Gellért Peresztegi-Nagy <[email protected]>
Co-authored-by: Joyce Fee <[email protected]>
1 parent d10ccdf commit bfb3e0a

4 files changed

+48
-5
lines changed

modules/get-started/pages/whats-new.adoc

+7-1
@@ -112,7 +112,7 @@ The following `rpk` commands are new in this version:
112112
* xref:reference:rpk/rpk-cluster/rpk-cluster-storage-status-mount.adoc[`rpk cluster storage status mount`]
113113
* xref:reference:rpk/rpk-cluster/rpk-cluster-storage-unmount.adoc[`rpk cluster storage unmount`]
114114

115-
== New properties
115+
== New cluster properties
116116

117117
The following cluster properties are new in this version:
118118

@@ -133,3 +133,9 @@ The following cluster properties are new in this version:
133133
* xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_request_timeout_ms[`iceberg_rest_catalog_request_timeout_ms`]
134134
* xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_token[`iceberg_rest_catalog_token`]
135135
* xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_trust_file[`iceberg_rest_catalog_trust_file`]
136+
137+
== New broker properties
138+
139+
The following broker properties are new in this version:
140+
141+
* xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`]

modules/manage/pages/cluster-maintenance/configure-availability.adoc

+3-1
@@ -46,7 +46,7 @@ See also: xref:develop:produce-data/configure-producers.adoc[Configure Producers
4646

4747
A Redpanda broker may create log segments at startup. If a broker crashes after startup, and if it gets stuck in a crash loop, it could produce progressively more stored state that uses more disk space and takes more time for each restart to process.
4848

49-
To prevent infinite crash loops, the Redpanda node property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions:
49+
To prevent infinite crash loops, the Redpanda broker property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions:
5050

5151
* The `redpanda.yaml` configuration file is updated.
5252
* The `startup_log` file in the broker's xref:reference:node-properties.adoc#data_directory[data_directory] is manually deleted.
@@ -58,3 +58,5 @@ To prevent infinite crash loops, the Redpanda node property xref:reference:node-
5858
* The `crash_loop_limit` property is disabled by default. You must manually enable it by setting it to a non-zero value.
5959
* If the limit is less than two, the broker is blocked from restarting after every crash, until one of the reset conditions is met.
6060
====
61+
62+
To facilitate debugging in environments where a broker is stuck in a crash loop, set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration]. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The window during which the broker remains available allows you to troubleshoot the issue. This setting is most useful when xref:troubleshoot:errors-solutions/k-resolve-errors.adoc[troubleshooting in Kubernetes environments].
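
The crash-counter behavior described in this file can be sketched as a small model. This is an illustration only, not Redpanda's implementation: the `CrashTracker` class and its method names are hypothetical, `None` stands in for the disabled default, and the exact comparison against the limit is an assumption chosen so that a limit of 1 blocks the restart after every crash, as the note above states.

```python
from dataclasses import dataclass
from typing import Optional

ONE_HOUR_SEC = 3600  # crashes at least this far apart are not consecutive


@dataclass
class CrashTracker:
    """Hypothetical model of the consecutive-crash counter."""

    crash_loop_limit: Optional[int] = None  # None or 0 models "disabled"
    consecutive_crashes: int = 0
    last_crash_ts: Optional[float] = None

    def record_crash(self, now: float) -> None:
        # An hour without a crash resets the counter before counting this one.
        if self.last_crash_ts is not None and now - self.last_crash_ts >= ONE_HOUR_SEC:
            self.consecutive_crashes = 0
        self.consecutive_crashes += 1
        self.last_crash_ts = now

    def reset(self) -> None:
        # Models the manual reset conditions: redpanda.yaml is updated
        # or the startup_log file is deleted.
        self.consecutive_crashes = 0
        self.last_crash_ts = None

    def may_restart(self) -> bool:
        # Assumption: the broker is blocked once the counter reaches the limit.
        if not self.crash_loop_limit:
            return True
        return self.consecutive_crashes < self.crash_loop_limit
```

For example, a tracker created with `crash_loop_limit=5` still allows a restart after four crashes in quick succession, blocks it after the fifth, and allows it again after `reset()`.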

modules/reference/pages/properties/broker-properties.adoc

+23-1
@@ -53,7 +53,7 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo
5353
* The broker shuts down cleanly.
5454
* One hour passes since the last crash.
5555
* The `redpanda.yaml` broker configuration file is updated.
56-
* The `startup_log` file in the broker's <<data_directory,data_directory>> is manually deleted.
56+
* The `startup_log` file in the broker's <<data_directory, data_directory>> is manually deleted.
5757

5858
*Unit*: number of consecutive crashes of a broker
5959

@@ -67,6 +67,28 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo
6767

6868
---
6969

70+
=== crash_loop_sleep_sec
71+
72+
*Introduced in v24.3.4*
73+
74+
The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<<crash_loop_limit, `crash_loop_limit`>>) is reached. This property provides a debugging window for you to access the broker before it terminates, and is particularly useful in Kubernetes environments.
75+
76+
If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit.
77+
78+
For information about how to reset the consecutive crash counter, see <<crash_loop_limit, `crash_loop_limit`>>.
79+
80+
*Unit:* seconds
81+
82+
*Visibility:* `user`
83+
84+
*Type:* integer or null
85+
86+
*Accepted values:* [`0`, `4294967295`] or `null`
87+
88+
*Default:* `null`
89+
90+
---
91+
7092
=== data_directory
7193

7294
Path to the directory for storing Redpanda's streaming data files.
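
The `crash_loop_sleep_sec` semantics added in this file (a `null` default that means immediate termination, and an accepted range of 0 to 4294967295 seconds) can be sketched as a validation helper. The function name `sleep_before_terminate` is hypothetical, and the code only models the documented behavior. It is not taken from Redpanda's source.

```python
from typing import Optional

UINT32_MAX = 4294967295  # upper bound of the documented accepted values


def sleep_before_terminate(crash_loop_sleep_sec: Optional[int]) -> int:
    """Return the seconds the modeled broker sleeps before terminating."""
    # null (None) disables the property: terminate immediately.
    if crash_loop_sleep_sec is None:
        return 0
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(crash_loop_sleep_sec, bool) or not isinstance(crash_loop_sleep_sec, int):
        raise TypeError("crash_loop_sleep_sec must be an integer or None")
    if not 0 <= crash_loop_sleep_sec <= UINT32_MAX:
        raise ValueError("crash_loop_sleep_sec is outside the accepted range")
    return crash_loop_sleep_sec
```

Under this model, `sleep_before_terminate(60)` returns 60, which is the length of the debugging window in seconds, and `sleep_before_terminate(None)` returns 0.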

modules/troubleshoot/partials/errors-and-solutions.adoc

+15-2
@@ -397,9 +397,22 @@ endif::[]
397397
ifdef::env-kubernetes[]
398398
=== Crash loop backoffs
399399

400-
If a broker crashes after startup, or gets stuck in a crash loop, it could produce progressively more stored state that uses additional disk space and takes more time for each restart to process.
400+
If a broker crashes after startup, or gets stuck in a crash loop, it can accumulate an increasing amount of stored state. This accumulated state not only consumes additional disk space but also prolongs the time required for each subsequent restart to process it.
401401

402-
To prevent infinite crash loops, the Redpanda Helm chart sets the `crash_loop_limit` node property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. After Redpanda reaches this limit, it will not start until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero.
402+
To prevent infinite crash loops, the Redpanda Helm chart sets the xref:reference:properties/broker-properties.adoc#crash_loop_limit[`crash_loop_limit`] broker configuration property to `5`. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. By default, the broker terminates immediately after hitting the `crash_loop_limit`. The Pod running Redpanda remains in a `CrashLoopBackOff` state until its internal consecutive crash counter is reset to zero.
403+
404+
To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`] broker configuration property. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The delay provides a window during which the Pod remains available, so that you can open a shell in it (for example, with `kubectl exec`) and troubleshoot the issue.
405+
406+
Example configuration:
407+
408+
```yaml
409+
config:
410+
  node:
411+
    crash_loop_limit: 5
412+
    crash_loop_sleep_sec: 60
413+
```
414+
415+
In this example, when the broker hits the `crash_loop_limit` of 5, it will sleep for 60 seconds before terminating the process. This delay allows administrators to access the Pod and troubleshoot.
403416

404417
To troubleshoot a crash loop backoff:
405418
