(spot-retries-page)=

# Spot Instance failures and retries

This page describes changes in how Nextflow handles Spot Instance failures and retries on AWS and Google Cloud, the impact of those changes, and how to configure Spot retry behavior for your pipelines. These changes apply to Nextflow 24.10 and later.

## Retry behavior

Before version 24.10, Nextflow would silently retry Spot Instance failures up to `5` times when using AWS Batch or Google Batch. These retries were controlled by cloud-specific configuration parameters (e.g., `aws.batch.maxSpotAttempts`) and happened in cloud infrastructure without explicit visibility to Nextflow.

<h3>Before Nextflow 24.10</h3>

By default, Nextflow would instruct AWS and Google to automatically retry jobs lost to Spot reclamation up to `5` times. Retries were handled by the cloud provider _within_ a Nextflow task, and it was often unclear that tasks had been restarted because there was no explicit message. Task runtimes and associated cloud costs increased because they included the runtime of the reclaimed and retried attempts. Due to the high likelihood of reclamation before completion, long-running tasks on Spot Instances frequently required retries, leading to inefficient resource allocation and higher costs.

<h3>After Nextflow 24.10</h3>

The default Spot reclamation retry setting changed to `0` on AWS and Google. By default, no _internal_ retries are attempted on these platforms. Spot reclamations lead to an immediate failure, exposed to Nextflow in the same way as other generic failures (for example, returning exit code `1` on AWS). Nextflow treats these failures like any other job failure unless a retry strategy is configured.

## Impact on existing workflows

If you rely on silent Spot retries (the previous default behavior), you may now see more tasks fail with the following characteristics:

- **AWS**: Generic failure with exit code `1`. You may see messages indicating that the host machine was terminated.
- **Google**: Spot reclamation typically produces a specific exit code, but it is now surfaced as a recognizable task failure in Nextflow logs.

Since the default for Spot retries is now `0`, you must actively enable a retry strategy if you want Nextflow to handle reclaimed Spot Instances automatically.

## Possible actions

There are four possible actions.

### Do nothing

If you do not configure anything, you will observe more pipeline failures when Spot Instances are reclaimed. This approach provides clearer visibility into failures, and failed tasks can be re-run with the `-resume` option. However, frequent reclamation may lead to a higher failure rate, and each retry requires manual intervention.

:::{note}
If you resume the pipeline with the `-resume` option, it picks up at the point where the pipeline was interrupted and starts by retrying the failed task.
:::
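
For example, a failed run can be resumed from the command line (the pipeline script name here is illustrative):

```
nextflow run my-pipeline.nf -resume
```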

### Re-enable Spot retries

You can re-enable Spot retries at the provider level in your Nextflow configuration:

```
// nextflow.config
aws {
    batch {
        maxSpotAttempts = 5
    }
}

google {
    batch {
        maxSpotAttempts = 5
    }
}
```

The above example sets the maximum number of Spot retries to `5` for both AWS and Google.

### Make Spot failures visible and retry them

You can set the `errorStrategy` and `maxRetries` process directives to enable Nextflow-level retries for any failure:
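
```
// nextflow.config
process {
    errorStrategy = 'retry'
    maxRetries = 5
}
```

The above example sets retries to `5` for any failures, not just failures at the cloud provider level.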

### Use Fusion Snapshots (AWS Batch only)

If you have long-running tasks where progress lost due to Spot reclamation is costly, consider [Fusion Snapshots](https://docs.seqera.io/fusion/guide/snapshots) (if supported by your environment). Fusion Snapshots allow you to resume a partially completed task on a new machine if a Spot Instance is reclaimed, thereby reducing wasted compute time.

Key features of Fusion Snapshots:

- Resume partially completed tasks on a new machine when a Spot Instance is reclaimed.
- Reduce the compute time wasted by reclamation, which is especially valuable for long-running tasks.

See [Fusion Snapshots for AWS Batch](https://docs.seqera.io/fusion/guide/snapshots) for more information.

## Best practices

Best practices for Spot Instance failures and retries:

- **Evaluate job duration**: If your tasks are very long (multi-hour or multi-day), Spot Instances can cause repeated interruptions. Consider using On-Demand Instances or Fusion Snapshots.
- **Set sensible retry limits**: If you enable Spot retries, choose a retry count that balances the cost savings of Spot usage against the overhead of restarting tasks.
- **Monitor logs and exit codes**: Failures due to Spot reclamation now appear in Nextflow logs. Monitor failures and fine-tune your retry strategy.
- **Consider partial usage of Spot**: Some workflows may mix On-Demand Instances for critical or long tasks with Spot Instances for shorter, less critical tasks, as sketched below. This can optimize cost while minimizing wasted compute time.
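
A minimal sketch of this mixed approach, assuming two AWS Batch queues backed by On-Demand and Spot compute environments (the queue names and process labels below are illustrative):

```
// nextflow.config
process {
    // Long or critical tasks: submit to a queue backed by On-Demand Instances
    withLabel: 'long_running' {
        queue = 'on-demand-queue'
    }
    // Short, interruption-tolerant tasks: submit to a queue backed by Spot Instances
    withLabel: 'short_lived' {
        queue = 'spot-queue'
    }
}
```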