Batch up all deletes/updates in an Update API call #111

Draft · wants to merge 1 commit into main
Conversation

@thegedge (Collaborator) commented Mar 20, 2025

Closes #106

Previously, when an update call was issued to dateilager, we would run a set of queries for every object being updated. This resulted in A LOT of DB chatter. For example, during application creation in Gadget:

[Screenshot: a trace of an update call, with thousands of pgx query spans nested under update-objects and update-packed-objects]

You can see 1000s of pgx spans nested under update-objects and update-packed-objects. This PR fixes that by bulk inserting. Since we have a dynamic number of objects to insert, we can't do a regular INSERT due to limits on the number of bind parameters. Instead, we use another trick: create a temporary table (dropped when the transaction completes) where we stage all of the updates, and use COPY to efficiently load the data into it.
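In case it helps to see the shape of it, here's a minimal sketch of the temp-table + COPY pattern with pgx (v5). The schema here (a tmp_objects table with project/path/hash columns and a unique index on (project, path)) is made up for illustration, not the real dateilager schema:

	package db

	import (
		"context"

		"github.com/jackc/pgx/v5"
	)

	// stagedObject holds one row to upsert; fields are illustrative.
	type stagedObject struct {
		project int64
		path    string
		hash    []byte
	}

	func bulkUpsert(ctx context.Context, conn *pgx.Conn, objects []stagedObject) error {
		tx, err := conn.Begin(ctx)
		if err != nil {
			return err
		}
		defer tx.Rollback(ctx) // no-op once Commit succeeds

		// Stage everything in a temp table that vanishes at commit, so a
		// single INSERT never has to carry thousands of bind parameters.
		_, err = tx.Exec(ctx, `
			CREATE TEMP TABLE tmp_objects (
				project bigint,
				path    text,
				hash    bytea
			) ON COMMIT DROP
		`)
		if err != nil {
			return err
		}

		// COPY the rows in: one efficient round trip instead of one per object.
		rows := make([][]any, len(objects))
		for i, o := range objects {
			rows[i] = []any{o.project, o.path, o.hash}
		}
		_, err = tx.CopyFrom(ctx, pgx.Identifier{"tmp_objects"},
			[]string{"project", "path", "hash"}, pgx.CopyFromRows(rows))
		if err != nil {
			return err
		}

		// Apply the staged rows in a single statement (assumes a unique index
		// on (project, path)). Deletes could be handled the same way, staging
		// paths and issuing one DELETE ... USING.
		_, err = tx.Exec(ctx, `
			INSERT INTO objects (project, path, hash)
			SELECT project, path, hash FROM tmp_objects
			ON CONFLICT (project, path) DO UPDATE SET hash = EXCLUDED.hash
		`)
		if err != nil {
			return err
		}

		return tx.Commit(ctx)
	}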

The end result is pretty good!

[Screenshot: the same trace after this change, with only a handful of query spans]

Ignoring the packed objects update (discussion below), the regular update call goes from 1k+ queries taking 289ms to 6 calls taking 44ms. I still need to benchmark more thoroughly, but on average I think this could be close to an order of magnitude improvement. It's hard to be specific, but it's also likely to be a big win for the production DB, since it has to deal with far less chatter.

Topics of conversation

First, the regular update works really well, but I still need to investigate what's making the packed objects version so slow. It's specifically the CopyFrom call, but I would have expected that to be significantly faster.

Second, there were some comments about doing certain things outside of a transaction to avoid deadlocks. I no longer do that, but I suspect we'll be okay because everything now happens in single statements. The comments mentioned that deadlocks were observed in tests, but I didn't observe any issues with a Gadget-side PR using a DL built from this branch for testing.

Finally, we likely now need to load a lot more into memory at once. Assuming I can get the packed objects query to see gains similar to the regular update, how do we feel about that? If we're feeling a bit uncomfy, I can set it up to batch the CopyFrom calls so that we stay within some memory budget per call, as sketched below.
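If we go that route, it could look roughly like this (reusing the stagedObject type and imports from the sketch above; batchSize is a made-up tuning knob):

	// Stage rows in fixed-size chunks so each CopyFrom call holds at most
	// batchSize rows in memory at once.
	const batchSize = 10_000

	func copyInBatches(ctx context.Context, tx pgx.Tx, objects []stagedObject) error {
		for start := 0; start < len(objects); start += batchSize {
			end := min(start+batchSize, len(objects)) // min builtin (Go 1.21+)
			rows := make([][]any, 0, end-start)
			for _, o := range objects[start:end] {
				rows = append(rows, []any{o.project, o.path, o.hash})
			}
			if _, err := tx.CopyFrom(ctx, pgx.Identifier{"tmp_objects"},
				[]string{"project", "path", "hash"}, pgx.CopyFromRows(rows)); err != nil {
				return err
			}
		}
		return nil
	}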

@@ -594,28 +594,43 @@ func (f *Fs) Update(stream pb.Fs_UpdateServer) error {
	nextVersion = latestVersion + 1
	logger.Info(ctx, "FS.Update[Init]", key.Project.Field(project), key.Version.Field(nextVersion))

	var deletedPaths []string
	var objectsToUpdate []*pb.Object
	for _, object := range objectBuffer {
Collaborator:
Maybe not for this PR, but I wonder if we can fold this loop into the one above. We currently receive all objects and then write them to the DB, which made sense when we didn't want to block the RPC stream on DB calls. But now that we're splitting and appending to multiple arrays here, maybe we do the split as we receive the objects and then do the bulk update/deletes?

@thegedge (Author):

Yep, no problem at all to do that!
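For reference, a rough sketch of what folding the loops together might look like. The req.Object, object.Deleted, and object.Path accessors are assumptions based on the diff context, not verified against the generated proto API; deletedPaths and objectsToUpdate are the slices declared in the diff above, and io is the stdlib package:

	for {
		req, err := stream.Recv()
		if err == io.EOF {
			break // client finished streaming objects
		}
		if err != nil {
			return err
		}

		// Split as we receive, instead of buffering everything and looping again.
		object := req.Object
		if object.Deleted {
			deletedPaths = append(deletedPaths, object.Path)
		} else {
			objectsToUpdate = append(objectsToUpdate, object)
		}
	}

	// ...then issue the bulk delete and bulk update once the stream is drained.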

Successfully merging this pull request may close: Improve performance of large updates (#106).