[backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs #25853

Open
wants to merge 7 commits into
base: jamie/backfill-daemon-termination-change
from

Conversation

jamiedemaria
Contributor

@jamiedemaria jamiedemaria commented Nov 11, 2024

Summary & Motivation

The backfill daemon doesn't account for run retries. See https://github.com/dagster-io/internal/discussions/12460 for more context. We've decided that we want the daemon to account for automatic and manual retries of runs that occur while the backfill is still in progress. This requires two changes: ensuring the backfill isn't marked completed if there is an in progress run or a failed run that will be automatically retried; and updating the daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.

This PR addresses the second point, updating the backfill daemon to take the results of retried runs into account when deciding what partitions to materialize in the next iteration.

Currently, the backfill determines which assets it has successfully materialized by looking at the materialization events for each asset. It determines which assets failed by looking at the failed runs launched by the backfill and pulling the asset partition information from those runs. Any assets downstream of those failed assets will not be launched by the backfill.

Now that we want the backfill daemon to account for run retries, we need to slightly modify this logic. Since a run can be retried, it is possible for an asset to have a successful materialization AND be a failed asset in a failed run. This means that when we determine which assets are failed, we need to cross-check against the assets that have been successfully materialized and remove any that are in the materialized list.
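
A minimal sketch of the cross-check described above, using plain Python sets of `(asset_key, partition_key)` tuples rather than Dagster's actual subset types. The function name `still_failed_partitions` and the sample data are hypothetical and only illustrate the idea; they are not taken from this PR.

```python
from typing import Set, Tuple

# Hypothetical stand-in for an asset partition: (asset_key, partition_key).
AssetPartition = Tuple[str, str]


def still_failed_partitions(
    failed_in_runs: Set[AssetPartition],
    materialized: Set[AssetPartition],
) -> Set[AssetPartition]:
    """Return partitions that appear in a failed run and have no successful materialization."""
    return failed_in_runs - materialized


# asset_a/2024-11-10 failed once and was then retried successfully,
# so it shows up in both sets.
failed_in_runs = {("asset_a", "2024-11-10"), ("asset_a", "2024-11-11")}
materialized = {("asset_a", "2024-11-10")}

remaining = still_failed_partitions(failed_in_runs, materialized)
```

Here only `("asset_a", "2024-11-11")` remains failed, so only its downstream partitions would be held back.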

How I Tested These Changes

Changelog

Insert changelog entry or delete this section.

Contributor Author

jamiedemaria commented Nov 11, 2024

@jamiedemaria jamiedemaria changed the title backfill daemon incorporates retries runs backfill daemon incorporates retries runs when launching new runs Nov 11, 2024
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 0493541 to bf580e4 Compare November 11, 2024 20:52
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch 3 times, most recently from 8a2005d to 6ce3f62 Compare November 11, 2024 21:54
Comment on lines 1825 to 1837
  for asset_key in failed_asset_keys:
-     result.extend(
-         asset_graph.get_partitions_in_range(
-             asset_key, partition_range, instance_queryer
-         )
+     asset_partition_candidates = asset_graph.get_partitions_in_range(
+         asset_key, partition_range, instance_queryer
+     )

There appears to be a scoping issue with asset_partition_candidates. The code creates these candidates for partition ranges but never uses them because the result.extend(asset_partitions_still_failed) call is outside both branches of the if/else. To fix this:

  1. Move the asset_partition_candidates assignment inside the for loop in the first branch
  2. Move result.extend(asset_partitions_still_failed) inside each branch of the if/else

This ensures both partition ranges and single partitions are properly filtered against the materialized subset before being added to the result.

Spotted by Graphite Reviewer
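
The restructuring suggested in the review comment above could look roughly like the following sketch. It is a simplified illustration under stated assumptions, not the code in this PR: `collect_still_failed`, its parameters, and the tuple representation of asset partitions are all hypothetical, and the partition range is modeled as a plain list of partition keys.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical stand-in for an asset partition: (asset_key, partition_key).
AssetPartition = Tuple[str, Optional[str]]


def collect_still_failed(
    failed_asset_keys: Iterable[str],
    partition_range: Optional[List[str]],
    single_partition: Optional[str],
    materialized: Set[AssetPartition],
) -> List[AssetPartition]:
    """Filter candidates against the materialized set inside each branch."""
    result: List[AssetPartition] = []
    if partition_range is not None:
        # Range branch: build and filter candidates per asset key, inside the loop.
        for asset_key in failed_asset_keys:
            candidates = [(asset_key, p) for p in partition_range]
            result.extend(c for c in candidates if c not in materialized)
    else:
        # Single-partition branch: same filtering, done within this branch.
        candidates = [(k, single_partition) for k in failed_asset_keys]
        result.extend(c for c in candidates if c not in materialized)
    return result


range_result = collect_still_failed(["a", "b"], ["p1", "p2"], None, {("a", "p1")})
single_result = collect_still_failed(["a", "b"], None, "p1", {("b", "p1")})
```

Because the `extend` call lives inside each branch, candidates from every asset key in the range case are filtered and added, rather than only those left over from the last loop iteration.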

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 6ce3f62 to 721e0a4 Compare November 12, 2024 15:24
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from cd3cc4a to 1fb8fc2 Compare November 12, 2024 15:48
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 721e0a4 to 514d595 Compare November 12, 2024 15:48
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 1fb8fc2 to 436cffd Compare November 12, 2024 18:57
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from eb96c61 to a691a45 Compare November 12, 2024 18:57
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-termination-change branch from 436cffd to 6dbf15a Compare November 12, 2024 20:41
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch 2 times, most recently from c93546d to 5b52878 Compare November 12, 2024 20:44
      asset_graph,
  )
  updated_backfill_data = AssetBackfillData(
      target_subset=asset_backfill_data.target_subset,
      latest_storage_id=asset_backfill_data.latest_storage_id,
      requested_runs_for_target_roots=asset_backfill_data.requested_runs_for_target_roots,
      materialized_subset=updated_materialized_subset,
-     failed_and_downstream_subset=asset_backfill_data.failed_and_downstream_subset
-     | failed_subset,
+     failed_and_downstream_subset=failed_subset,
Contributor Author

@jamiedemaria jamiedemaria Nov 12, 2024


I want to confirm that this is a safe change to make. From my reading of _get_failed_asset_partitions, it returns the full list of failed partitions, not just the partitions that failed since the last tick. So in the version of this code before this PR, ORing the two subsets is a no-op, since asset_backfill_data.failed_and_downstream_subset would already be a subset of failed_subset.

With the change to have _get_failed_asset_partitions account for retries, ORing with asset_backfill_data.failed_and_downstream_subset would produce inaccurate data: a failed partition in asset_backfill_data.failed_and_downstream_subset could have been successfully retried and so no longer be in failed_subset, but it would still be included because of the OR operation.

Contributor Author


Ah ok, I forgot that we call the second function to get everything downstream of the failed subset, so this isn't a simple replacement: failed_subset doesn't include downstream assets. I will update.
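
To make the two points in this thread concrete, here is a small sketch that uses plain string node names in place of asset partitions and a hypothetical dependency map. The helper `with_downstream` and all sample data are assumptions for illustration; they are not Dagster APIs or the code in this PR.

```python
from typing import Dict, List, Set


def with_downstream(failed: Set[str], downstream_of: Dict[str, List[str]]) -> Set[str]:
    """Return `failed` plus every node reachable downstream of it."""
    seen = set(failed)
    stack = list(failed)
    while stack:
        node = stack.pop()
        for child in downstream_of.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical graph: a -> b -> c, and x -> y.
downstream_of = {"a": ["b"], "b": ["c"], "x": ["y"]}

# On the previous tick, both a and x had failed.
previous_failed_and_downstream = {"a", "b", "c", "x", "y"}

# Since then, a was retried successfully; only x is still failed.
failed_now = {"x"}

# ORing with the stored subset keeps a, b, and c even though a has recovered.
stale = previous_failed_and_downstream | failed_now

# Using failed_now alone drops the stale entries but also loses downstream y.
# Recomputing the downstream closure from failed_now gives the accurate subset.
recomputed = with_downstream(failed_now, downstream_of)
```

In this toy setup `stale` still blocks `a`, `b`, and `c`, while `recomputed` contains only `x` and `y`.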

@jamiedemaria jamiedemaria marked this pull request as ready for review November 12, 2024 20:49
@jamiedemaria jamiedemaria changed the title backfill daemon incorporates retries runs when launching new runs [backfill daemon run retries 2/n] backfill daemon incorporates retries runs when launching new runs Nov 13, 2024
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-daemon-accounts-for-retries branch from 46e48b3 to eade146 Compare November 13, 2024 16:40