Commit 29c0cde

[EDU-631] Fix spot instances (#5883)
---------

Signed-off-by: Christopher Hakkaart <[email protected]>
Co-authored-by: Ben Sherman <[email protected]>
1 parent f0cb154 commit 29c0cde

3 files changed, +22 -22 lines changed

docs/google.md

+2 -2
@@ -205,9 +205,9 @@ Any input data **not** stored in a Google Storage bucket will automatically be t
 The Google Storage path needs to contain at least sub-directory. Don't use only the bucket name e.g. `gs://my-bucket`.
 :::

-### Spot instances
+### Spot Instances

-Spot instances are supported adding the following setting in the Nextflow config file:
+Spot Instances are supported by adding the following setting in the Nextflow config file:

 ```groovy
 google {

docs/reference/config.md

+2 -2
@@ -157,7 +157,7 @@ The following settings are available:
 : :::{versionchanged} 24.08.0-edge
 The default value was changed from `5` to `0`.
 :::
-: Max number of execution attempts of a job interrupted by a EC2 spot reclaim event (default: `0`)
+: Max number of execution attempts of a job interrupted by a EC2 Spot reclaim event (default: `0`)

 `aws.batch.maxTransferAttempts`
 : Max number of downloads attempts from S3 (default: `1`).
@@ -854,7 +854,7 @@ The following settings are available for Google Cloud Batch:
 : :::{versionchanged} 24.08.0-edge
 The default value was changed from `5` to `0`.
 :::
-: Max number of execution attempts of a job interrupted by a Compute Engine spot reclaim event (default: `0`).
+: Max number of execution attempts of a job interrupted by a Compute Engine Spot reclaim event (default: `0`).
 : See also: `google.batch.autoRetryExitCodes`

 `google.batch.network`

docs/updating-spot-retries.md

+18 -18
@@ -1,45 +1,45 @@
 (spot-retries-page)=

-# Spot instance failures and retries
+# Spot Instance failures and retries

-This page describes changes in how Nextflow handles spot instance failures and retries on AWS and Google Cloud, the impact of those changes, and how to configure spot retry behavior for your pipelines. These changes apply to Nextflow 24.10 and later.
+This page describes changes in how Nextflow handles Spot Instance failures and retries on AWS and Google Cloud, the impact of those changes, and how to configure spot retry behavior for your pipelines. These changes apply to Nextflow 24.10 and later.

 ## Retry behavior

-Up to version 24.10, Nextflow would silently retry spot instance failures up to `5` times when using AWS Batch or Google Batch. These retries were controlled by cloud-specific configuration parameters (e.g., `aws.batch.maxSpotAttempts`) and happened in cloud infrastructure without explicit visibility to Nextflow.
+Up to version 24.10, Nextflow would silently retry Spot Instance failures up to `5` times when using AWS Batch or Google Batch. These retries were controlled by cloud-specific configuration parameters (e.g., `aws.batch.maxSpotAttempts`) and happened in cloud infrastructure without explicit visibility to Nextflow.

 <h3>Before Nextflow 24.10</h3>

-By default, Nextflow would instruct AWS and Google to automatically retry jobs lost to spot reclamation up to `5` times. Retries were handled by the cloud provider _within_ a Nextflow task. It was often unclear that tasks were restarted as there was no explicit message. Task runtimes and associated cloud costs were increased because they included the runtime of the reclaimed and retried tasks. Due to the high likelihood of reclamation before completion, long-running tasks running on spot instances frequently required retries, leading to inefficient allocation of resources and higher costs.
+By default, Nextflow would instruct AWS and Google to automatically retry jobs lost to Spot reclamation up to `5` times. Retries were handled by the cloud provider _within_ a Nextflow task. It was often unclear that tasks were restarted as there was no explicit message. Task runtimes and associated cloud costs were increased because they included the runtime of the reclaimed and retried tasks. Due to the high likelihood of reclamation before completion, long-running tasks running on Spot Instances frequently required retries, leading to inefficient allocation of resources and higher costs.

 <h3>After Nextflow 24.10</h3>

-The default spot reclamation retry setting changed to `0` on AWS and Google. By default, no _internal_ retries are attempted on these platforms. Spot reclamations lead to an immediate failure, exposed to Nextflow in the same way as other generic failures (for example, returning `exit code 1` on AWS). Nextflow treats these failures like any other job failure, unless a retry strategy is configured.
+The default Spot reclamation retry setting changed to `0` on AWS and Google. By default, no _internal_ retries are attempted on these platforms. Spot reclamations lead to an immediate failure, exposed to Nextflow in the same way as other generic failures (for example, returning `exit code 1` on AWS). Nextflow treats these failures like any other job failure, unless a retry strategy is configured.

 ## Impact on existing workflows

-If you rely on silent spot retries (the previous default behavior), you may now see more tasks fail with the following characteristics:
+If you rely on silent Spot retries (the previous default behavior), you may now see more tasks fail with the following characteristics:

 - **AWS**: Generic failure with `exit code 1`. You may see messages indicating the host machine was terminated.
 - **Google**: Spot reclamation typically produces a specific code, but is now surfaced as a recognizable task failure in Nextflow logs.

-Since the default for spot retries is now `0`, you must actively enable a retry strategy if you want Nextflow to handle reclaimed spot instances automatically.
+Since the default for Spot retries is now `0`, you must actively enable a retry strategy if you want Nextflow to handle reclaimed Spot Instances automatically.

 ## Possible actions

 There are four possible actions.

 ### Do nothing

-If you do not configure anything, you will observe more pipeline failures when spot instances are reclaimed. This approach provides clearer visibility into failures. Failed tasks can be re-run with the `-resume` option. However, frequent task reclamation may lead to a higher failure rate and each retry requires manual intervention.
+If you do not configure anything, you will observe more pipeline failures when Spot Instances are reclaimed. This approach provides clearer visibility into failures. Failed tasks can be re-run with the `-resume` option. However, frequent task reclamation may lead to a higher failure rate and each retry requires manual intervention.

 :::{note}
 If you resume the pipeline using the resume option, it will pick up at the point the pipeline was interrupted and start with a retry of that task.
 :::

-### Re-enable spot retries
+### Re-enable Spot retries

-You can re-enable spot retries at the provider level in your Nextflow configuration:
+You can re-enable Spot retries at the provider level in your Nextflow configuration:

 ```
 // nextflow.config
@@ -56,9 +56,9 @@ google {
 }
 ```

-The above example sets the maximum number of spot retries to `5` for both AWS and Google.
+The above example sets the maximum number of Spot retries to `5` for both AWS and Google.
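
The configuration block this sentence refers to is mostly outside the hunks shown here. For orientation only, a minimal sketch of a provider-level setting might look like the following. It assumes the `aws.batch.maxSpotAttempts` option named earlier in this file and a Google Batch counterpart, `google.batch.maxSpotAttempts`, whose name does not appear in this diff. It is not the file's verbatim content.

```groovy
// nextflow.config (illustrative sketch, not the committed example)
aws {
    batch {
        // Provider-level retries after an EC2 Spot reclaim event
        maxSpotAttempts = 5
    }
}

google {
    batch {
        // Assumed counterpart for Compute Engine Spot reclaim events
        maxSpotAttempts = 5
    }
}
```

With settings like these, retries would again happen inside the cloud provider and would not be visible to Nextflow as separate task attempts.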

-### Make spot failures visible and retry them
+### Make Spot failures visible and retry them

 You can set `maxRetries` to enable Nextflow-level retries for any failure:
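
The example that follows this sentence in the file falls between hunks and is not shown. As a rough sketch, Nextflow-level retries could be configured as below. The `errorStrategy` line is an assumption of this sketch, because the diff mentions only `maxRetries`. The value `5` matches the description in the next hunk header.

```groovy
// nextflow.config (illustrative sketch, not the committed example)
process {
    // Assumption: maxRetries only takes effect with the 'retry' error strategy
    errorStrategy = 'retry'
    // Resubmit a failed task up to 5 times, whatever the cause of the failure
    maxRetries    = 5
}
```

Unlike provider-level Spot retries, each attempt here is a separate task submission that appears in the Nextflow logs.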

@@ -73,7 +73,7 @@ The above example sets retries to `5` for any failures, not just failures at the

 ### Use Fusion Snapshots (AWS Batch only)

-If you have long-running tasks where progress lost due to spot reclamation is costly, consider [Fusion Snapshots](https://docs.seqera.io/fusion/guide/snapshots) (if supported by your environment). Fusion Snapshots allow you to resume a partially completed task on a new machine if a spot instance is reclaimed, thereby reducing wasted compute time.
+If you have long-running tasks where progress lost due to Spot reclamation is costly, consider [Fusion Snapshots](https://docs.seqera.io/fusion/guide/snapshots) (if supported by your environment). Fusion Snapshots allow you to resume a partially completed task on a new machine if a Spot Instance is reclaimed, thereby reducing wasted compute time.

 Key features of Fusion Snapshots:

@@ -85,9 +85,9 @@ See [Fusion Snapshots for AWS Batch](https://docs.seqera.io/fusion/guide/snapsho

 ## Best practices

-Best practices for spot instance failures and retries:
+Best practices for Spot Instance failures and retries:

-- **Evaluate job duration**: If your tasks are very long (multi-hour or multi-day), spot instances can cause repeated interruptions. Consider using on-demand instances or Fusion Snapshots.
-- **Set sensible retry limits**: If you enable spot retries, choose a retry count that balances the cost savings of spot usage against the overhead of restarting tasks.
-- **Monitor logs and exit codes**: Failures due to spot reclamation will now appear in Nextflow logs. Monitor failures and fine-tune your strategy.
-- **Consider partial usage of spot**: Some workflows may mix on-demand instances for critical or long tasks and spot instances for shorter, less critical tasks. This can optimize cost while minimizing wasted compute time.
+- **Evaluate job duration**: If your tasks are very long (multi-hour or multi-day), Spot Instances can cause repeated interruptions. Consider using on-demand instances or Fusion Snapshots.
+- **Set sensible retry limits**: If you enable Spot retries, choose a retry count that balances the cost savings of Spot usage against the overhead of restarting tasks.
+- **Monitor logs and exit codes**: Failures due to Spot reclamation will now appear in Nextflow logs. Monitor failures and fine-tune your strategy.
+- **Consider partial usage of Spot**: Some workflows may mix on-demand instances for critical or long tasks and Spot Instances for shorter, less critical tasks. This can optimize cost while minimizing wasted compute time.
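
The last practice in the list, mixing on-demand and Spot capacity, can be sketched in configuration. The example assumes AWS Batch with two job queues, one backed by Spot capacity and one by on-demand capacity. The queue names and the `long_running` label are hypothetical and do not come from this commit.

```groovy
// nextflow.config (illustrative sketch; queue and label names are hypothetical)
process {
    executor = 'awsbatch'

    // Default: short, less critical tasks run on the Spot-backed queue
    queue = 'spot-queue'

    // Tasks labeled 'long_running' go to the on-demand queue to avoid reclamation
    withLabel: 'long_running' {
        queue = 'on-demand-queue'
    }
}
```

A process would opt in by declaring `label 'long_running'` in its definition, and all other processes would keep the Spot default.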
