Add capability to discard duplicate jobs with concurrency configuration #523

Open
joelzwarrington wants to merge 6 commits into main

Conversation


@joelzwarrington joelzwarrington commented Feb 25, 2025

Closes #176

Adds the ability to discard duplicate jobs instead of blocking them. When concurrency_limits is configured with at_limit: :discard, jobs enqueued above the concurrency limit are discarded and never executed (a usage sketch follows the implementation notes below).

Implementation Details

  • Adds a class_attribute concurrency_at_limit and an at_limit option to concurrency_limits
  • Adds methods to Semaphore to detect when a job would exceed the concurrency limit
  • Adds logic to discard a job on dispatch when it is at the concurrency limit
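A hypothetical usage sketch of the proposed option, assuming it is exposed through the existing limits_concurrency macro (the job class and concurrency key below are illustrative, not part of this PR):

class DeliverNotificationJob < ApplicationJob
  # Proposed behaviour: jobs enqueued above the concurrency limit are
  # discarded instead of being blocked.
  limits_concurrency to: 1, key: ->(user) { user.id }, at_limit: :discard

  def perform(user)
    # ...
  end
end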

@@ -66,6 +66,8 @@ def prepare_for_execution

def dispatch
  if acquire_concurrency_lock then ready
  elsif discard_concurrent?
    discard
@rosa (Member)

I don't think this is the way 🤔 We're in the middle of a transaction here, and the job hasn't even been committed to the DB. It makes no sense to delete a record in the same transaction in which you're creating it. It'd make sense to roll that transaction back instead.

@joelzwarrington (Author)

Perhaps I don't fully understand how we're getting into this code path, but from my investigation, it looks as though there isn't an open transaction at this time.

I tried the following while running the test suite and didn't hit any open transactions:

raise "open transactions" if ApplicationRecord.connection.open_transactions.positive?

I still believe we'd want to discard here. Let me know your thoughts.

@@ -39,6 +43,14 @@ def initialize(job)
  @job = job
end

def at_limit?
  if semaphore = Semaphore.find_by(key: key)
    semaphore.value.zero?
@rosa (Member)

This is vulnerable to race conditions, which is the reason we don't check it this way when blocking jobs. If two concurrent jobs are claiming the semaphore, both of them will see it open and neither will be discarded. Then both will run together, because we won't block them either.
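For contrast, a rough sketch of the difference (paraphrased, not copied from the actual Semaphore implementation): a check-then-act read is racy, while a single atomic UPDATE style of claim is not:

# Racy check-then-act: two concurrent enqueues can both read an open
# semaphore before either one claims it.
semaphore = Semaphore.find_by(key: key)
at_limit = semaphore ? semaphore.value.zero? : false

# Atomic claim: a single UPDATE means only one of the concurrent enqueues
# sees a non-zero affected-row count, so only one proceeds.
claimed = Semaphore.where(key: key).where("value > 0")
                   .update_all("value = value - 1") > 0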

@joelzwarrington (Author)

You raise an excellent point. Do you have a suggestion for how we might avoid this race condition?

Are you thinking pessimistic locking might help?

@rosa (Member)

Yes, I was thinking that perhaps we could rely on the same check we're already doing to block the job, but instead of blocking the job, we'd roll back the transaction 🤔 What I'm not sure about is whether this should actually raise some exception to indicate the job hasn't been enqueued. Maybe that's not necessary, but at least it should set the successfully_enqueued attribute on the active job to false.
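One possible shape of that suggestion (a sketch only; whether ActiveRecord::Rollback is the right mechanism, versus raising a dedicated error, is exactly the open question above):

def dispatch
  if acquire_concurrency_lock then ready
  elsif discard_concurrent?
    # Assumption: aborting the enclosing transaction keeps the job record
    # from being committed; the enqueue adapter would then set
    # successfully_enqueued = false on the active job.
    raise ActiveRecord::Rollback
  else
    block
  end
end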

@joelzwarrington (Author)

Thanks for the review @rosa ❤️, I'll review your comments more thoroughly and make some changes. 😊

@joelzwarrington (Author) left a comment

Hi @rosa, I've made a few changes, and I believe I've run into some odd interactions with bulk enqueuing that might be a bug. Let me know if you'd like those changes split out into another PR.

active_jobs.each do |active_job|
  job = enqueued_jobs_by_active_job_id[active_job.job_id]
  active_job.provider_job_id = job&.id
  active_job.successfully_enqueued = job.present?
end
@joelzwarrington (Author)

I've reorganized this method so that we can efficiently set successfully_enqueued for both enqueued/blocked and discarded jobs.

@rosa (Member)

🤔 I think the previous implementation did that as well (only setting successfully_enqueued = true for jobs that had actually been enqueued); it should work without doing anything for discarded jobs, as long as discarded jobs don't create a job record. I must be missing something here: why do you think the previous implementation didn't set that correctly?

@joelzwarrington (Author)

@rosa the previous implementation would only set successfully_enqueued = true; it wouldn't set successfully_enqueued = false when a job wasn't enqueued.

If you're okay with that quirk, then I can revert this change.

@@ -49,7 +52,7 @@ def create_from_active_job(active_job)
 def create_all_from_active_jobs(active_jobs)
   job_rows = active_jobs.map { |job| attributes_from_active_job(job) }
   insert_all(job_rows)
-  where(active_job_id: active_jobs.map(&:job_id))
+  where(active_job_id: active_jobs.map(&:job_id)).order(id: :asc)
@joelzwarrington (Author)

I think this may have been an unintentional bug, or at least something worth doing going forward as more concurrency configuration is added.

Before this order was added, jobs could be dispatched in a different order than they were provided in. That isn't really a problem when a job has no concurrency controls, but with concurrency controls it might have unintended consequences.

So if you were to do:

jobs = [
  job_1 = DiscardedNonOverlappingJob.new(@result, name: "A"),
  job_2 = DiscardedNonOverlappingJob.new(@result, name: "B")
]

enqueued_jobs_count = SolidQueue::Job.enqueue_all(jobs)

There wasn't any guarantee that job_1 would be dispatched before job_2. Now there should be: since these jobs are bulk-inserted with auto-incrementing IDs, job_1 will have the lower id.

@rosa (Member)

Concurrency controls don't guarantee anything about the order, though 🤔 I think when you're enqueuing two concurrency-controlled jobs together (in bulk), there shouldn't be any guarantee that one runs before the other.

@joelzwarrington (Author)

Intuitively, if I were using bulk enqueue and concurrency controls, I would want it to enqueue the jobs in the order they were created, no?

If you don't want to make that guarantee, I can remove these changes and instead add a more prominent call-out in the bulk enqueue documentation that ordering is not guaranteed.

@@ -66,6 +66,7 @@ def prepare_for_execution

def dispatch
  if acquire_concurrency_lock then ready
  elsif discard_concurrent? then discard
@joelzwarrington (Author)

I don't believe discard does anything here, since it just deletes the execution, but at this point we won't have an execution yet, will we?

This isn't deleting the job as I once thought it was; it's effectively doing nothing.

@rosa (Member)

> I don't believe discard does anything here, since it just deletes the execution, but at this point we won't have an execution yet, will we?

That's right, we won't have an execution yet. However, we need to ensure we don't create the job, and I think this will result in the job record being created. That's why I think we need to roll back the current transaction. There should be a transaction open, since this is called as an after_create callback on the job.

@joelzwarrington (Author)

@rosa ah, thanks for pointing out the callback. Because dispatch is also used in bulk enqueuing, should we roll back all jobs in the bulk enqueue if one failed to be enqueued due to concurrency controls?

@rajeevriitm

I am really in need of this feature. As it's not yet available, could you help me implement this? I have a ContinuousSearchJob that's run with recurring.yml. Sometimes the job takes longer, and I don't want another job to be added to the queue until the existing one has finished. Could you help me with the best way to implement this currently?

@joelzwarrington (Author)

@rajeevriitm if you have a look at the README, you should find that concurrency controls are already available for you to use.

The only difference is that jobs will always be blocked rather than discarded, so you may end up with several jobs awaiting execution.
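For reference, a minimal sketch of the concurrency controls already documented in the Solid Queue README (the job class, key, and duration here are illustrative):

class ContinuousSearchJob < ApplicationJob
  # Allow only one execution at a time; further enqueues are blocked until
  # the running job releases the semaphore or the duration expires.
  limits_concurrency to: 1, key: ->(*) { "continuous_search" }, duration: 30.minutes

  def perform
    # ...
  end
end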
