
Active record poisoned transaction with postgres #881

Merged
merged 8 commits into from
Sep 3, 2024

Conversation

jnunemaker
Collaborator

@jnunemaker jnunemaker commented Aug 27, 2024

The issue in #866 is tricky. For some reason when using the pg adapter, active record is poisoning the transaction on unique error even though that is specifically rescued. The error does not occur with sqlite or mysql. 🤷‍♂️

I asked copilot how to let AR and company know that the transaction is not poisoned and they suggested save points. And guess what... it worked! Haha. I don't love these two extra network calls but it does fix the issue for now.

If anyone smarter or with more time can find a slicker solution, or dig into why pg handles this differently than sqlite/mysql, I'm all ears.

Initially I thought the problem was the connection checkout happening before the rescue, but even moving the rescue inside the with_connection block didn't fix the issue for pg.

refs #867
fixes #866

Unsure if this is right approach but it does work. Oddly only postgres was complaining about the poisoned transaction.
silence should return the result of yield. Added a new method, capture_output, to better describe capturing the output.

```
 DEPRECATION WARNING: config.active_support.remove_deprecated_time_with_zone_name is deprecated and will be removed in Rails 7.2. (called from block (2 levels) in <top (required)> at /home/runner/work/flipper/flipper/spec/flipper/engine_spec.rb:32)
 ```
@jnunemaker jnunemaker self-assigned this Aug 27, 2024
@jnunemaker jnunemaker changed the title Active record poisoned transaction Active record poisoned transaction with postgres Aug 27, 2024
```ruby
begin
  connection.transaction do
    savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
    connection.execute("SAVEPOINT #{savepoint_name}")
```
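The excerpt above is only the opening half of the pattern. A rough sketch of the full savepoint dance (a hedged reconstruction, not the exact PR code; `connection` is assumed to respond to `#execute` like an ActiveRecord connection, and the helper name is made up):

```ruby
require "securerandom"

# Wrap a risky insert in an explicit savepoint so that a unique-violation
# rollback only discards the savepoint, leaving the enclosing transaction
# usable instead of "poisoned".
def with_savepoint(connection)
  savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
  connection.execute("SAVEPOINT #{savepoint_name}")
  begin
    result = yield
    connection.execute("RELEASE SAVEPOINT #{savepoint_name}")
    result
  rescue => e
    # Roll back only to the savepoint; the outer transaction stays open.
    connection.execute("ROLLBACK TO SAVEPOINT #{savepoint_name}")
    raise e
  end
end
```

These are the "two extra network calls" mentioned above: one `SAVEPOINT` and one `RELEASE` (or `ROLLBACK TO`) round trip per write.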
Contributor

Noting that PostgreSQL transaction savepoints can cause performance degradation.

GitLab has blogged about this and how they removed them from their app: https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/

Collaborator Author


@camallen yep, I'm aware. This is a very low throughput method. Even at a large company/app I'd be surprised if it was executed more than 100 times in a day. I think at that level of throughput this should be fine. Knowing that, are you still worried?

If anyone knows another way to fix it, I'm all ears but after hacking around on it for 30 minutes this was the first thing that worked.

Contributor


I think at that level of throughput this should be fine. Knowing that, are you still worried?

I am still a little worried about this change, perhaps unduly.

From my reading, the GitLab investigation showed that any savepoint use on the primary can impact all replicas: the replica has to record all active transactions after the use of a savepoint, and this tracking becomes a problem during any 'long' running queries. On a very active system the duration of a 'long' query will vary, as it depends on the number of active transactions. If there are enough, they will saturate the pg server cache and spill from memory to disk, which then hurts query performance due to slower IO on hot code paths.

My concern is around the unpredictable nature of this issue, due to the interplay of the active transaction count and the duration of any queries that impact the replicas. It'll appear as random performance degradation that (hopefully) eventually self-heals after the queries complete.

Teams that use replicas most likely are already experiencing a high transaction load on the primary and this change then might open them up to this situation.

One of the proposed solutions (that may have landed in new pg version) was to allow a configurable PG server cache size to avoid spilling over to disk and slow IO on a hot code path. However by default that isn't available on older pg versions and may not be configurable on pg cloud variants.

Again, I may be reading it wrong. I'm wary because I've been bitten in the past by long-running queries impacting replica query performance, and by the random nature of these issues.

If anyone knows another way to fix it, I'm all ears but after hacking around on it for 30 minutes this was the first thing that worked.

AR's postgres adapter has a private in_transaction? method, backed by open_transactions, to detect if we're in a transaction. We might be able to introspect the connection adapter's open_transactions count to determine whether we're in a transaction and actually need a savepoint SQL directive, instead of opting everyone into savepoints by default.

I'm not sure if there is a better way to introspect the AR connection's transaction stack state to determine whether we can opt in to nested transactions. Maybe @byroot has thoughts.
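A minimal sketch of that suggestion (hypothetical helper name; `connection` is assumed to quack like an ActiveRecord connection adapter):

```ruby
# Only request a real subtransaction (SAVEPOINT) when we are actually
# nested inside a caller's transaction; at the top level a plain
# BEGIN/COMMIT avoids the subtransaction overhead entirely.
def transaction_with_optional_savepoint(connection, &block)
  if connection.open_transactions > 0
    # Already inside a transaction: requires_new emits a SAVEPOINT.
    connection.transaction(requires_new: true, &block)
  else
    # Top level: an ordinary transaction, no savepoint involved.
    connection.transaction(&block)
  end
end
```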

Contributor


Maybe @byroot has thoughts.

I don't know the code, so it's hard to say. But it looks like it's trying to insert a record and ignore the error if the record already exists? If so, then ON DUPLICATE KEY IGNORE wouldn't require a subtransaction.
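On PostgreSQL the equivalent of MySQL-style insert-and-ignore is `ON CONFLICT DO NOTHING`. A sketch of the SQL shape (illustrative only; the helper name is made up and the value quoting is deliberately naive; real code would use the adapter's quoting or AR's `insert_all` with `:unique_by`):

```ruby
# Build an "insert, skipping duplicates" statement. With ON CONFLICT DO
# NOTHING, a duplicate row is silently skipped instead of raising a unique
# violation, so no subtransaction is needed to survive it.
def insert_ignore_sql(table, attributes)
  columns = attributes.keys.join(", ")
  values  = attributes.values.map { |v| "'#{v}'" }.join(", ")
  "INSERT INTO #{table} (#{columns}) VALUES (#{values}) ON CONFLICT DO NOTHING"
end
```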

Contributor


Ref to a similar discussion when we used a savepoint in find_or_create: rails/rails#51052 (comment).

Collaborator Author


@camallen this isn't a hot code path and shouldn't ever be.

If it is, then you aren't using flipper correctly. Individual actors are for enabling a feature for a few people here and there per feature. Anything that requires 100+ should be a group or % rollout, as detailed in the docs.

Any write in flipper is a rare thing compared to reads and that is what flipper is optimized for.

Contributor
@camallen camallen Sep 4, 2024


this isn't a hot code path and shouldn't ever be.

Apologies for the confusion; when I used the term 'hot code path' I was referring to the PostgreSQL server, where a savepoint used on the primary forces the replica to start tracking all known transactions. I wasn't referring to flipper configuring a feature flag, though flipper could be the trigger when it issues a savepoint transaction.

I still think there is a risk with this change. If the GitLab investigation is correct then this change introduces a ticking bomb that may at some point explode for busy PostgreSQL systems that have replicas enabled. Similar to the concerns raised in rails/rails#51052

Would you be open to a PR that attempts to use the upsert on conflict do nothing approach that byroot mentioned? I'd be up for working on this if it would be considered to land.

A naive implementation would run the model validations and, if the model is valid, use AR's insert / insert_all methods combined with :unique_by to skip the duplicate records and thus not poison the outer transaction.

A caveat: the insert-on-conflict feature is only available on PG 9.5+, while active record supports PG versions 9.3+. It appears that flipper supports active record 4.2+, so any implementation would require branching based on rails' pg server feature availability detection.

If the server doesn't support ON CONFLICT DO NOTHING SQL then the current requires_new savepoint would be used.
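A hypothetical sketch of that branching (parameterized for illustration; AR adapters do expose `supports_insert_on_duplicate_skip?`, which is true on PG 9.5+, but the helper name here is made up and `model` is assumed to be an ActiveRecord model like flipper's Gate):

```ruby
# Pick between ON CONFLICT DO NOTHING and the savepoint fallback based on
# what the connected server supports.
def insert_skipping_duplicates(connection, model, attributes)
  if connection.supports_insert_on_duplicate_skip?
    # insert emits ON CONFLICT DO NOTHING, so duplicates are skipped
    # without raising and without any subtransaction.
    model.insert(attributes)
  else
    # Older servers: fall back to a real subtransaction (SAVEPOINT) so a
    # unique violation doesn't poison the outer transaction.
    model.transaction(requires_new: true) do
      model.create!(attributes)
    end
  end
end
```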

Contributor


Looks like the configurable cache sizes have landed in PG 17: https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache

Collaborator Author


@camallen happy to take a look at a PR. Again, this is super low throughput so I don't believe it'll be an issue.

```ruby
begin
  connection.transaction do
    savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
    connection.execute("SAVEPOINT #{savepoint_name}")
```
Contributor


And that's also probably why you are "poisoning" the connection. conn.transaction { conn.transaction { } } doesn't create a nested transaction. You need requires_new: true for that.
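A sketch of the distinction (hedged reconstruction with a made-up helper name; `conn` is assumed to be an ActiveRecord connection). With the default options the inner transaction call simply joins the outer one, so a rescued unique violation still marks the whole transaction for rollback on PostgreSQL; with requires_new the inner block becomes a SAVEPOINT:

```ruby
def insert_ignoring_duplicates(conn, sql)
  conn.transaction do
    begin
      # requires_new: true makes this a real subtransaction (a SAVEPOINT
      # under the hood) instead of joining the outer transaction.
      conn.transaction(requires_new: true) do
        conn.execute(sql)
      end
    rescue ActiveRecord::RecordNotUnique
      # Only the savepoint was rolled back; the outer transaction is
      # still usable, so other work in it can continue and commit.
    end
  end
end
```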

Collaborator Author


Awesome! I'll try that out. Thanks for taking the time to teach!

Collaborator Author


1c9e0a2 did the trick.

Still odd to me that mysql and sqlite don't raise about the poisoned transaction but the pg gem does. Perhaps it is something in the pg gem and not the AR adapter for Postgres?

requires_new: true is definitely cleaner than the savepoint junk I had, so I'll go with that for now.

@jnunemaker jnunemaker merged commit 18cccb9 into main Sep 3, 2024
72 checks passed
@jnunemaker jnunemaker deleted the active-record-poisoned-transaction branch September 3, 2024 18:46