[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853

jamiedemaria · 2024-11-11T20:31:25Z

Summary & Motivation

The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in-progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding which partitions to materialize in the next iteration.

This PR addresses the second point: updating the backfill daemon to take the results of retried runs into account when deciding which partitions to materialize in the next iteration.

Currently, the backfill gets the list of successfully materialized assets by looking at the materialization events for each asset. It determines which assets failed by looking at the failed runs launched by the backfill and pulling the asset partition information from those runs. The backfill will not launch any assets downstream of those failed assets.

Now that we want the backfill daemon to account for run retries, we need to slightly modify this logic. Because a run can be retried, an asset can have a successful materialization AND be a failed asset in a failed run. This means that when we determine which assets are failed, we need to cross-check against the assets that have been successfully materialized and remove any that are in the materialized list.
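To make the cross-check concrete, here is a minimal sketch using plain Python sets. It is not the code in this PR: the `AssetPartition` tuple and the `still_failed` helper are hypothetical names chosen for illustration, and the real implementation works on Dagster's own subset types.

```python
from typing import NamedTuple, Set


class AssetPartition(NamedTuple):
    """Illustrative stand-in for an (asset, partition) pair."""

    asset_key: str
    partition_key: str


def still_failed(
    failed_from_runs: Set[AssetPartition],
    materialized: Set[AssetPartition],
) -> Set[AssetPartition]:
    """Drop any failed partition that also has a successful materialization.

    A partition can show up in both sets when its first run failed and a
    retry of that run later succeeded.
    """
    return failed_from_runs - materialized


# Partitions pulled from failed runs launched by the backfill.
failed_from_runs = {
    AssetPartition("a", "2024-01-01"),
    AssetPartition("a", "2024-01-02"),
}
# Partitions with a materialization event (the retry of 2024-01-01 succeeded).
materialized = {AssetPartition("a", "2024-01-01")}

remaining = still_failed(failed_from_runs, materialized)
```

In this example only the `2024-01-02` partition remains failed, so only its downstream partitions would be held back.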

How I Tested These Changes

Changelog


jamiedemaria · 2024-11-11T20:31:41Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

[backfill daemon run retries 3/n] retries of runs in completed backfills should not be considered part of the backfill #25900
[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853 👈
[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771
master

graphite-app · 2024-11-11T21:55:03Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

            for asset_key in failed_asset_keys:
-                result.extend(
-                    asset_graph.get_partitions_in_range(
-                        asset_key, partition_range, instance_queryer
-                    )
+                asset_partition_candidates = asset_graph.get_partitions_in_range(
+                    asset_key, partition_range, instance_queryer
                )


There appears to be a scoping issue with asset_partition_candidates. The code creates these candidates for partition ranges but never uses them because the result.extend(asset_partitions_still_failed) call is outside both branches of the if/else. To fix this:

Move the asset_partition_candidates assignment inside the for loop in the first branch

Move result.extend(asset_partitions_still_failed) inside each branch of the if/else

This ensures both partition ranges and single partitions are properly filtered against the materialized subset before being added to the result.
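A rough sketch of the shape this suggestion describes, with the filter and the `result.extend(...)` applied inside each branch. This is not the actual `asset_backfill.py` code: `expand_range` is a hypothetical stand-in for `asset_graph.get_partitions_in_range`, and partitions are modelled as plain `(asset_key, partition_key)` tuples.

```python
from typing import List, Optional, Sequence, Set, Tuple

AssetPartition = Tuple[str, str]  # (asset_key, partition_key), illustrative only


def expand_range(
    asset_key: str, start: str, end: str, all_partitions: Sequence[str]
) -> List[AssetPartition]:
    """Hypothetical stand-in for asset_graph.get_partitions_in_range."""
    lo, hi = all_partitions.index(start), all_partitions.index(end)
    return [(asset_key, p) for p in all_partitions[lo : hi + 1]]


def failed_partitions_for_run(
    failed_asset_keys: Sequence[str],
    partition_range: Optional[Tuple[str, str]],
    partition_key: Optional[str],
    all_partitions: Sequence[str],
    materialized: Set[AssetPartition],
) -> List[AssetPartition]:
    result: List[AssetPartition] = []
    if partition_range is not None:
        start, end = partition_range
        for asset_key in failed_asset_keys:
            # Candidates are computed and filtered inside the loop, so every
            # asset key in the range branch contributes to the result.
            candidates = expand_range(asset_key, start, end, all_partitions)
            result.extend(ap for ap in candidates if ap not in materialized)
    elif partition_key is not None:
        # Single-partition branch: same filter, applied in this branch.
        candidates = [(key, partition_key) for key in failed_asset_keys]
        result.extend(ap for ap in candidates if ap not in materialized)
    # Runs with neither a range nor a partition key are out of scope here.
    return result


all_partitions = ["p1", "p2", "p3"]
materialized = {("a", "p2")}
range_result = failed_partitions_for_run(
    ["a", "b"], ("p1", "p2"), None, all_partitions, materialized
)
single_result = failed_partitions_for_run(
    ["a"], None, "p2", all_partitions, materialized
)
```

Here `("a", "p2")` is dropped in both branches because it has already been materialized, while the other candidates are kept.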

Spotted by Graphite Reviewer


jamiedemaria · 2024-11-12T20:47:31Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

        asset_graph,
    )
    updated_backfill_data = AssetBackfillData(
        target_subset=asset_backfill_data.target_subset,
        latest_storage_id=asset_backfill_data.latest_storage_id,
        requested_runs_for_target_roots=asset_backfill_data.requested_runs_for_target_roots,
        materialized_subset=updated_materialized_subset,
-        failed_and_downstream_subset=asset_backfill_data.failed_and_downstream_subset
-        | failed_subset,
+        failed_and_downstream_subset=failed_subset,


Want to confirm that this is a safe change to make. From my reading, _get_failed_asset_partitions returns the full list of failed partitions, not just the partitions that failed since the last tick. So in the version of this code before this PR, ORing the two subsets is a no-op, since asset_backfill_data.failed_and_downstream_subset would be a subset of failed_subset.

With the change to have _get_failed_asset_partitions account for retries, ORing with asset_backfill_data.failed_and_downstream_subset would produce inaccurate data: a failed partition in asset_backfill_data.failed_and_downstream_subset could have been successfully retried and so no longer be in failed_subset, but it would still be included because of the OR operation.

Ah ok, I forgot that we call a second function to get the downstream of the failed subset, so this isn't a simple replacement, since failed_subset doesn't include downstream assets. I will update.
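To illustrate the concern, here is a small sketch with plain sets. It is only one reading of the discussion above, not the final code in this PR: `with_downstream` is a hypothetical helper, the `children` mapping is a made-up asset graph, and the sketch assumes each downstream asset uses the same partition key as its parent.

```python
from typing import Dict, List, Set, Tuple

AssetPartition = Tuple[str, str]  # (asset_key, partition_key), illustrative only


def with_downstream(
    failed: Set[AssetPartition], children: Dict[str, List[str]]
) -> Set[AssetPartition]:
    """Return the failed partitions plus everything downstream of them.

    Assumes identity partition mapping between parent and child assets.
    """
    seen = set(failed)
    stack = list(failed)
    while stack:
        asset_key, partition_key = stack.pop()
        for child in children.get(asset_key, []):
            candidate = (child, partition_key)
            if candidate not in seen:
                seen.add(candidate)
                stack.append(candidate)
    return seen


children = {"a": ["b"], "b": ["c"]}  # a -> b -> c

# Previous tick: a/p1 failed, so b/p1 and c/p1 were held back as downstream.
previous_failed_and_downstream = with_downstream({("a", "p1")}, children)

# This tick: the retry of a/p1 succeeded, so nothing is failed any more.
failed_now: Set[AssetPartition] = set()

# ORing with the stored subset keeps the stale entries around.
stale = previous_failed_and_downstream | with_downstream(failed_now, children)

# Recomputing from the current failed set alone does not.
fresh = with_downstream(failed_now, children)
```

With the OR, `a/p1` and its downstream partitions stay marked as failed even though the retry succeeded. Recomputing from the current failed set releases them, but the downstream expansion still has to be applied, which is why `failed_subset` alone is not a drop-in replacement.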

jamiedemaria mentioned this pull request Nov 11, 2024

[backfill daemon run retries 1/n] update how we determine backfill completion to account for retried runs #25771

Open

jamiedemaria mentioned this pull request Nov 11, 2024

add test utils to mark a run successful or failed #25791

Closed

jamiedemaria changed the title ~~backfill daemon incorporates retries runs~~ backfill daemon incorporates retries runs when launching new runs Nov 11, 2024

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 0493541 to bf580e4 Compare November 11, 2024 20:52

jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch 3 times, most recently from 8a2005d to 6ce3f62 Compare November 11, 2024 21:54

graphite-app bot reviewed Nov 11, 2024


jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 6ce3f62 to 721e0a4 Compare November 12, 2024 15:24

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from cd3cc4a to 1fb8fc2 Compare November 12, 2024 15:48

jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 721e0a4 to 514d595 Compare November 12, 2024 15:48

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 1fb8fc2 to 436cffd Compare November 12, 2024 18:57

jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from eb96c61 to a691a45 Compare November 12, 2024 18:57

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 436cffd to 6dbf15a Compare November 12, 2024 20:41

jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch 2 times, most recently from c93546d to 5b52878 Compare November 12, 2024 20:44

jamiedemaria commented Nov 12, 2024


jamiedemaria marked this pull request as ready for review November 12, 2024 20:49

jamiedemaria requested review from gibsondan and clairelin135 November 12, 2024 20:49

jamiedemaria added 6 commits November 13, 2024 11:03

backfill daemon incorporates retries runs

df9bc4b

small fixes

85a7a47

add back

d5d1b0a

small

b130c25

small

80ae40f

update canceling logic

4a82253

jamiedemaria mentioned this pull request Nov 13, 2024

[backfill daemon run retries 3/n] retries of runs in completed backfills should not be considered part of the backfill #25900

Draft

jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 6dbf15a to 6f60763 Compare November 13, 2024 16:03

jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 389f553 to 46e48b3 Compare November 13, 2024 16:03

jamiedemaria changed the title ~~backfill daemon incorporates retries runs when launching new runs~~ [backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs Nov 13, 2024

update canceling logic for failed subset

eade146

jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 46e48b3 to eade146 Compare November 13, 2024 16:40
