Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

THREESCALE-2989 - simplify background deletion #3958

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

akostadinov
Copy link
Contributor

@akostadinov akostadinov commented Jan 10, 2025

What this PR does / why we need it:

Presently background deletion relies on Sidekiq batches which don't work well with ActiveJob. Specifically failing a job registers triggers callbacks for successful job if I remember correctly.

Moreover there was a bug that calling #delete was calling a state machine method instead of the activerecord method.

These two have the potential to cause indefinite rescheduling and multiplication of delete jobs eventually exhausting server performance and sidekiq redis memory.

With the new implementation, we only use a single ActiveRecord job, do not rely on Sidekiq batches and basically assuring order of operations, the logic can be much better understood. Runoffs are hopefully impossible but should be much easier to contain because now only the :deletion queue is used and stopping this queue should also freeze the problem until it can be properly debugged and fixed.

Which issue(s) this PR fixes

https://issues.redhat.com/browse/THREESCALE-2989

Verification steps

  1. create a big tenant with many users, services, etc.
  2. run DeleteObjectHierarchyWorker.delete_later(myprovider)
  3. verify all objects are gone

For upgrade

  1. schedule a big tenant deletion
  2. stop server and sidekiq
  3. upgrade
  4. start server and sidekiq
  5. verify all objects get deleted

Special notes for your reviewer:
This is just an early preview to gather feedback. It is very much untested in most regards except for a simple new test.

As for review, I think you better initially ignore old code and just read the new implementation of the Worker as new code. Then see if there are areas in the old code that have not been addressed by the new code. I know it is a pain.

It remains to test compatibility, self-rescheduling, other existing users of the worker (except for the service deletion which is the only one tested atm) and who knows what else.

@akostadinov akostadinov marked this pull request as draft January 10, 2025 23:01
Comment on lines 94 to 112
ar_object.destroyed_by_association = DummyDestroyedByAssociationReflection.new(entry) if arguments.first != entry
ar_object.background_deletion_method_call
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one difference with previous logic. Previously #destroy without ! was used when destroyed by association. But I don't see why should this be done and so far haven't observed any negative effects. I think it is correct that the job should fail/retry if the #destroy! method failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, OK, this is the response to my previous comment 😅

By the way, do you think background_deletion_method_call method is necessary? I think it causes more confusion, and I would rather just do ar_object.send background_deletion_method here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be something like

ar_object.public_send(ar.object.background_deletion_method)

which is IMO ugly. Or do I miss something?

Comment on lines +61 to +62
payment_gateway_setting
buyer_accounts
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that these two are highlighted. Reading the original logic, all the reason for separate DeleteAccountHierarchyWorker and DeletePaymentSettingHierarchyWorker appears to be enforcing and order of executing the deletions. Since we do deleting serially here, the order is enforced by the order of entries put in background_deletion so I removed all that and just arranged the entries like this.

I must say that I cannot definitively claim that I understood all implications of the original code. More like I vaguely understood it. I found it more important to understand the intentions of the original authors because the code was misbehaving anyways and is heavily been refactored.

Although further testing may lead to a realization of details that I missed. So don't blindly trust me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I blindly trust you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned about the fact that many of these associations are applicable only for providers. I am not sure how it used to work in the previous implementation, and whether the buyer accounts were also deleted through background deletion (when being part of the providers's hierarchy), but when trying to schedule a buyer deletion on this branch, I eventually received:

[ActiveJob] [DeleteObjectHierarchyWorker] [e95bf17b-f675-4e05-90b0-2247188a1221] Error performing DeleteObjectHierarchyWorker (Job ID: e95bf17b-f675-4e05-90b0-2247188a1221) from Sidekiq(deletion) in 5088.16ms: DeleteObjectHierarchyWorker::DoNotRetryError (seems like unexpectedly broken delete hierarchy entry: Association-Account-6:payment_gateway_setting):

Maybe I was doing something wrong... will need to check further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They must have been deleted by background deletion in the past. I see no reason to believe otherwise.

But your question has a point. Are you able to write a test where this is reproduced? Or add to the existing test a gateway settings object and see if it reproduces.

I see special handling of payment gateway settings in the past. I thought only order should be enforced. But maybe I didn't fully understand the implications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, speaking of tests, they need a serious rework, I'm afraid... 🥲

I will try to reproduce. But in general it looks like most of these discrepancies are fixed by the suggestion here: https://github.com/3scale/porta/pull/3958/files#r1946512669

So, for most of the associations in this list, if called for a buyer, they will be either empty or nil.

However, I can't say I like it, because we already know that buyers will not have any of the associations in this list, probably apart from payment_detail, users and contracts, so I think it's a bit waste of resources to query all the rest of the associations for them (especially as some providers have maaaaany buyers, so it can take quite a few queries to get all those non-existing associations).

Maybe we can make background_deletion an instance method instead? That could take the value from a class constant for all of the classes, except Account. And for account, we'd check whether it's provider? or not, and filter the list based on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to reproduce. But in general it looks like most of these discrepancies are fixed by the suggestion here:

So with the change you see a difference in behavior? Original code should handle generated objects. Perhaps need to add a nil check just.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added 3612c2f

Nice catch!

Copy link
Contributor

@jlledom jlledom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a basic question I have here: what is the background deletion about?

Is it to replace dependent: destroy for something the does the same but in background jobs? is it that?

@@ -113,6 +113,8 @@ class StaticPage < CMS::Builtin
# CMS::Builtin::Page
class Page < CMS::Builtin

validates :user_id, :forum_id, presence: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this related to this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either somehow miraculously appeared out of nowhere or some confusion of github related to some other recent commits to master. Not part of this definitely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if this doesn't belong here, let's remove it to avoid confusion 🙃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how it appears here. I don't see this locally 🤔
Will check again after final rebase.

# @param hierarchy [Array<String>] something like ["Plain-Service-1234", "Association-Service-1234:plans" ...]
# @note processes deleting a hierarchy of objects and reschedules itself at current progress after a time limit
def perform(*hierarchy)
return compatibility(hierarchy) unless hierarchy.first.is_a?(String)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this method for? apparently the hierarchy parameter, called _object inside the method, is just being ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I don't care about the object. The idea is to take the root object of the original operation and schedule it anew with the new worker logic. The job locking should take care of discarding the pile of duplicates that could be present with the old logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made me think also...
I think the issue is in the hierarchydocumentation comment.

hierarchy is actually something like "gid://system/User/17", ["Hierarchy-Account-12"], "destroy", then it makes sense that the GID refers to the _object variable, and the second argument is an array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mayorova. So this is how hierarchy was before this PR. But in that case, is "gid://system/User/17" not a String? Because the condition to call compatibility is unless hierarchy.first.is_a?(String)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here variables have been set previously:

@object, workers_hierarchy, @background_destroy_method = job.arguments

#object is just a reader of the instance variable (

attr_reader :object, :caller_worker_hierarchy
)

These are examples how it was called previously:

So the first argument is an AR object and what you see is the serialization that ActiveJob performs on such arguments.

P.S. If object is removed somehow before the job runs, the we get the ActiveJob::DeserializationError that we speak about in another thread above. So once we stop calling the job with an object, that exception no longer needs to be handled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for clarifying. One more question: in the case we have a job which was enqueued with only an object as param, like from https://github.com/3scale/porta/blob/master/app/models/backend_api.rb#L109, then the compatibility will do nothing and return, as I understand it:

def compatibility(_object, caller_worker_hierarchy = [], _background_destroy_method = 'destroy')
# maybe requeue first object from hierarchy would be adequate and uniqueness should deduplicate jobs
hierarchy_root = Array(caller_worker_hierarchy).first
return unless hierarchy_root

Then we are losing that job, or am I wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, I had the same question actually.
If the job has just been scheduled (with the old code), but not processed yet (so, no hierarchy has not been populater), I think we need to still process it, based on the first argument (object).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Although this should be a super low chance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed with 4395034

@akostadinov
Copy link
Contributor Author

There's a basic question I have here: what is the background deletion about?

My understanding is that we want to split a large delete operation into smaller ones that can run in the background without DoS-ing the server. Also if we try to delete in one go, on a busy server it is likely to hit a transaction deadlock or whatever that will make the delete always fail.

Is it to replace dependent: destroy for something the does the same but in background jobs? is it that?

I wouldn't say it is to replace. I see it more like a way to define a reasonable hierarchy that practically works with tenants with a lot of related objects. In general we still want to delete all dependent objects. And since the hierarchy definition is not complete, some objects will deceive destroy! and their dependents will still be deleted by means of dependent: destroy or delete_all or whatever was defined.

Walking the hierarchy from bottom to top, we make sure that there are not too many objects for dependent: destroy to catch so that the transaction doesn't grow to an unsustainable size.

I hope this makes sense.

def batch_description
"Deleting #{object.class.name} [##{object.id}]"
end
if ar_object.is_a?(Account) && !ar_object.should_be_deleted?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe?

Suggested change
if ar_object.is_a?(Account) && !ar_object.should_be_deleted?
if ar_object.is_a?(Account) && ar_object.should_not_be_deleted?

Also, I'm trying to find the definition of Account#scheduled_for_deletion? and couldn't find it, do you know where is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think should_be_deleted? should only be checked for provider, because buyer accounts are not marked for deletion.

Copy link
Contributor Author

@akostadinov akostadinov Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But buyer accounts inherit should be deleted status from their provider

    def will_be_deleted?
      scheduled_for_deletion? || (buyer? && provider_account.try(:should_be_deleted?))
    end

So no issue with that check. Except that background deleting a buyer without deleting a provider will not be ever possible.

Should it be possible?

P.S. I believe in the past is wasn't. While a little inconsistent, I it might be a reasonable safety measure.

Copy link
Contributor

@jlledom jlledom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments:

  1. When deleting an account, when is the first DeleteObjectHierarchyWorker scheduled? I couldn't find the place?
  2. I miss some unit tests for hierarchy_entries_for which is crucial to be sure it works fine.

@akostadinov
Copy link
Contributor Author

When deleting an account, when is the first DeleteObjectHierarchyWorker scheduled? I couldn't find the place?

There is no such place I'm aware of. The idea is to have a cron job so something to clear old accounts regularly. Without such, scheduled for deletion accounts stay in this state indefinitely.

I miss some unit tests for hierarchy_entries_for which is crucial to be sure it works fine.

Many tests need to be changed/added, this is a draft to get comments before investing too much into the details.

# self.background_deletion_scope_name = :non_deleted
#
# scope :non_deleted, -> { where.not(state: :deleted) }
alias delete_truly delete # keep reference to the original AR #delete method to use for background deletion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than aliasing delete as delete_truly, I would really rather just change the method that changes the alert state:

    event :mark_as_deleted do
      transition [:unread, :read] => :deleted
    end

And then, in both controllers that are currently calling alert.delete! change this invocation to alert.mark_as_deleted!. I think that would be much more clear.

Indeed, I think that the alerts not being deleted when the corresponding cinstance (or the account owning it) is deleted is just an oversight.
Actually, the fact that the alerts are just marked as deleted and not actually deleted on UI actions is also questionable, I think. But I guess this is because there are two "consumers" of the alerts - the provider and the buyer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I'd like it to be a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about first doing this refactoring in a separate PR, confirming that nothing is broken (as per the tests), and then rebase this PR on top of it?

I think it's more confusing to add these aliases, and then to remove them...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really enthusiastic. I'll check if it is easy and give up if not. ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seemed fairly easy to me, like 3 small changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it would be nice to rebase anyway, because the current branch seems to be based on Rails 6.1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #3959

I'm a little concerned whether that method was ever used in drops. See
https://github.com/3scale/porta/blob/1cb77853cb82d3b66d3012137d2b621bb3e24643/doc/liquid/drops.md#alert-drop

@@ -5,32 +5,17 @@ module BackgroundDeletion

included do
class_attribute :background_deletion, default: [], instance_writer: false
class_attribute :background_deletion_method, default: :destroy!, instance_writer: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the original default method was destroy, while now it's destroy!. I guess that was a deliberate choice?..
I'm wondering what are the potential consequences though. I mean, probably it's better to just abort the deletion if there is an issue, rather than let some intermediate deletes silently fail... But just wanted to ask anyway :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my original comment. Originally the default was destroy for "by association" and destroy! for the main object.

See

def destroy_method(bang_if_possible: false)
object_destroy_method = @destroy_method.presence || 'destroy'
(bang_if_possible && object_destroy_method == 'destroy') ? 'destroy!' : object_destroy_method
end

should_destroy_by_association? ? destroy_by_association : object.public_send(destroy_method(bang_if_possible: true))

No reason I see to avoid bang. I hope testing in stg will reveal any issues.

@@ -5,32 +5,17 @@ module BackgroundDeletion

included do
class_attribute :background_deletion, default: [], instance_writer: false
class_attribute :background_deletion_method, default: :destroy!, instance_writer: false
class_attribute :background_deletion_scope_name, default: :all, instance_writer: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this? Seems that the only place it is currently used in is the alerts, but it's commented out...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you say, I will remove. I wanted in the PoC to show the two approaches. But it seems we are on the same page that it was an overlook originally so we just remove any objects, not change state.

return unless (tenant_id = object.tenant_id)
caller_worker_hierarchy.include?("Hierarchy-Account-#{tenant_id}")
def hierarchy_entries_for(...)
self.class.send(:hierarchy_entries_for, ...)
Copy link
Contributor

@mayorova mayorova Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it probably be substituted by delegate :hierarchy_entries_for, to: :class, private: true ? Seems cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private: true makes the delegated method private, it doesn't allow you to call a private method. This was also my original thought though.

rescue ActiveRecord::UnknownPrimaryKey => exception
Rails.logger.info "DeleteObjectHierarchyWorker#perform raised #{exception.class} with message #{exception.message}"
end
self.class.delete_later object_class.constantize.find(id.to_i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we probably throw DoNotRetryError if the object is not found? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good catch, actually better do not queue if object is not found

def destroyable_association?(_association)
true
# we can just use job.arguments.first but for compatibility mode we want it not to lock
def lock_key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I don't quite get it - is this locking used?... Is this method used by activejob-uniqueness? If so, let's leave a comment, because I think it's a bit confusing :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how activejob-uniqueness works standard. If you think it is needed, suggest a comment and I'll add.


ReflectionDestroyer.new(object, reflection, caller_worker_hierarchy).destroy_later
started = now
while hierarchy.present? && now - started < WORK_TIME_LIMIT_SECONDS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, to make sure I understand correctly:
we process the array of hierarchy for N seconds (5 at the moment), and when the time is up we reschedule whatever is still left?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

case reflection.macro
when :has_many
# here we keep original hierarchy entry if we still find an associated object
dependent = ar_object.public_send(association).public_send(:background_deletion_scope).take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this .take deliberate? Is this to add just one object to the tree on each iteration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, one at a time

dependent ? [hierarchy_association_string, *hierarchy_entries_for(dependent)] : []
when :has_one
# maximum of one associated so we never keep the original hierarchy entry
hierarchy_entries_for ar_object.public_send(association)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hierarchy_entries_for ar_object.public_send(association)
dependent = ar_object.public_send(association)
dependent ? hierarchy_entries_for(dependent) : []

There is an issue with :profile association on Account, it is initialized lazily, so for some providers will not exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how comment is related to precisely this point of the PR. But there is an additional check somewhere

https://github.com/3scale/porta/pull/3958/files#diff-213ad1bbfcef53e615f3a15cdb64e4aeb8de33bc817fb080a90f235642831f6cR51

So if object has just been initialized, I expect it would be skipped.

@mayorova
Copy link
Contributor

I am a bit concerned about the fact that we can only (apparently) schedule the deletion via delete_later, and not perform_later.
We currently have FindAndDeleteScheduledAccountsWorker that does

Account.deleted_since.find_each(&DeleteAccountHierarchyWorker.method(:perform_later))

and I think with the current implementation it won't work.

@akostadinov
Copy link
Contributor Author

We currently have FindAndDeleteScheduledAccountsWorker that does ... and I think with the current implementation it won't work.

Users are not yet fixed. This is next step. Once you have generally approved the design.

@jlledom
Copy link
Contributor

jlledom commented Feb 10, 2025

When deleting an account, when is the first DeleteObjectHierarchyWorker scheduled? I couldn't find the place?

There is no such place I'm aware of. The idea is to have a cron job so something to clear old accounts regularly. Without such, scheduled for deletion accounts stay in this state indefinitely.

So there's no such cron job currently?, does that mean background deletion is currently dead code?

@mayorova
Copy link
Contributor

When deleting an account, when is the first DeleteObjectHierarchyWorker scheduled? I couldn't find the place?

There is no such place I'm aware of. The idea is to have a cron job so something to clear old accounts regularly. Without such, scheduled for deletion accounts stay in this state indefinitely.

So there's no such cron job currently?, does that mean background deletion is currently dead code?

Yes, there is.

[FindAndDeleteScheduledAccountsWorker, :perform_async],

It's just disabled for SaaS:

return unless ThreeScale.config.onpremises

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants