-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853
base: jamie/backfill-daemon-termination-change
Are you sure you want to change the base?
Conversation
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @jamiedemaria and the rest of your teammates on Graphite |
0493541
to
bf580e4
Compare
8a2005d
to
6ce3f62
Compare
for asset_key in failed_asset_keys: | ||
result.extend( | ||
asset_graph.get_partitions_in_range( | ||
asset_key, partition_range, instance_queryer | ||
) | ||
asset_partition_candidates = asset_graph.get_partitions_in_range( | ||
asset_key, partition_range, instance_queryer | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There appears to be a scoping issue with asset_partition_candidates
. The code creates these candidates for partition ranges but never uses them because the result.extend(asset_partitions_still_failed)
call is outside both branches of the if/else. To fix this:
- Move the
asset_partition_candidates
assignment inside thefor
loop in the first branch - Move
result.extend(asset_partitions_still_failed)
inside each branch of the if/else
This ensures both partition ranges and single partitions are properly filtered against the materialized subset before being added to the result.
Spotted by Graphite Reviewer
Is this helpful? React 👍 or 👎 to let us know.
6ce3f62
to
721e0a4
Compare
cd3cc4a
to
1fb8fc2
Compare
721e0a4
to
514d595
Compare
1fb8fc2
to
436cffd
Compare
eb96c61
to
a691a45
Compare
436cffd
to
6dbf15a
Compare
c93546d
to
5b52878
Compare
asset_graph, | ||
) | ||
updated_backfill_data = AssetBackfillData( | ||
target_subset=asset_backfill_data.target_subset, | ||
latest_storage_id=asset_backfill_data.latest_storage_id, | ||
requested_runs_for_target_roots=asset_backfill_data.requested_runs_for_target_roots, | ||
materialized_subset=updated_materialized_subset, | ||
failed_and_downstream_subset=asset_backfill_data.failed_and_downstream_subset | ||
| failed_subset, | ||
failed_and_downstream_subset=failed_subset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to confirm that this is a safe change to make. From my reading of _get_failed_asset_partitions
it returns the full list of failed partitions, not a list of partitions that failed since the last tick, so in the version of this code before this PR the ORing of the two subsets is a no-op since asset_backfill_data.failed_and_downstream_subset
would be a subset of failed_subset
with the change to have _get_failed_asset_partitions
account for retries, ORing with asset_backfill_data.failed_and_downstream_subset
would result in inaccurate data because a failed partition in asset_backfill_data.failed_and_downstream_subset
could have been successfully retried and no longer in failed_subset
but would still be included because of the OR operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok i forgot that we do the second function to get the downstream of the failed subset, so this isn't a simple replacement since failed_subset
doesn't include downstream assets. i will update
6dbf15a
to
6f60763
Compare
389f553
to
46e48b3
Compare
46e48b3
to
eade146
Compare
Summary & Motivation
The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.
This PR addresses the second point, updating the backfill daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.
Currently the backfill gets a list of the successfully materialized assets for the backfill by looking at the materialization events for the asset. It determines which assets failed by looking at the failed runs launched by the backfill and pulling the asset partition information from those runs. Any assets downstream of those failed assets will not be launched by the backfill
Now that we want the backfill daemon to account for run retries we need to slightly modify this logic. Since a run can be retried it is possible that an asset can have a successful materialization AND be a failed asset in a failed run. This means that when we determine which assets are failed, we need to cross check with the assets that have been successfully materialized and remove any that are in the materialized list
How I Tested These Changes
Changelog