
Active record poisoned transaction with postgres #881

Merged
merged 8 commits into from
Sep 3, 2024

Conversation

jnunemaker
Collaborator

@jnunemaker jnunemaker commented Aug 27, 2024

The issue in #866 is tricky. For some reason when using the pg adapter, active record is poisoning the transaction on unique error even though that is specifically rescued. The error does not occur with sqlite or mysql. 🤷‍♂️

I asked copilot how to let AR and company know that the transaction is not poisoned and they suggested save points. And guess what... it worked! Haha. I don't love these two extra network calls but it does fix the issue for now.

If anyone smarter or with more time can find a slicker solution, or dig into why pg handles this differently than sqlite/mysql, I'm all ears.

Initially I thought the problem was the connection checkout happening before the rescue, but even moving the rescue inside the with_connection block didn't fix the issue for pg.

refs #867
fixes #866

Unsure if this is right approach but it does work. Oddly only postgres was complaining about the poisoned transaction.
silence should return the result of yield. Added a new method, capture_output, to better describe capturing the output.

```
 DEPRECATION WARNING: config.active_support.remove_deprecated_time_with_zone_name is deprecated and will be removed in Rails 7.2. (called from block (2 levels) in <top (required)> at /home/runner/work/flipper/flipper/spec/flipper/engine_spec.rb:32)
 ```
@jnunemaker jnunemaker self-assigned this Aug 27, 2024
@jnunemaker jnunemaker changed the title Active record poisoned transaction Active record poisoned transaction with postgres Aug 27, 2024
```ruby
begin
  connection.transaction do
    savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
    connection.execute("SAVEPOINT #{savepoint_name}")
```
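The excerpt above is only the opening half of the pattern. A rough sketch of the full savepoint dance (a hedged reconstruction, not the exact PR code; `connection` is assumed to respond to `#execute` like an ActiveRecord connection, and the helper name is made up):

```ruby
require "securerandom"

# Wrap a risky insert in an explicit savepoint so that a unique-violation
# rollback only discards the savepoint, leaving the enclosing transaction
# usable instead of "poisoned".
def with_savepoint(connection)
  savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
  connection.execute("SAVEPOINT #{savepoint_name}")
  begin
    result = yield
    connection.execute("RELEASE SAVEPOINT #{savepoint_name}")
    result
  rescue => e
    # Roll back only to the savepoint; the outer transaction stays open.
    connection.execute("ROLLBACK TO SAVEPOINT #{savepoint_name}")
    raise e
  end
end
```

These are the "two extra network calls" mentioned above: one `SAVEPOINT` and one `RELEASE` (or `ROLLBACK TO`) round trip per write.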
Contributor

Noting that PostgreSQL transaction savepoints can cause performance degradation.

GitLab has blogged about this and how they removed them from their app: https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/

Collaborator Author


@camallen yep, I'm aware. This is a very low throughput method. Even at a large company/app I'd be surprised if it was executed more than 100 times in a day. I think at that level of throughput this should be fine. Knowing that, are you still worried?

If anyone knows another way to fix it, I'm all ears but after hacking around on it for 30 minutes this was the first thing that worked.

Contributor


I think at that level of throughput this should be fine. Knowing that, are you still worried?

I am still a little worried about this change, perhaps unduly.

From my reading, the GitLab investigation showed that any savepoint use on the primary can impact all replicas: the replica has to record all active transactions after the use of a savepoint, and this tracking becomes a problem during any 'long' running queries. On a very active system the duration of a 'long' query will vary, as it depends on the number of active transactions. If there are enough, they will saturate the pg server cache and spill from memory to disk, which then hurts query performance due to slower IO on hot code paths.

My concern is around the unpredictable nature of this issue, due to the interplay of the active transaction count and the duration of any queries that impact the replicas. It'll appear as random performance degradation that (hopefully) eventually self-heals after the queries complete.

Teams that use replicas most likely are already experiencing a high transaction load on the primary and this change then might open them up to this situation.

One of the proposed solutions (that may have landed in new pg version) was to allow a configurable PG server cache size to avoid spilling over to disk and slow IO on a hot code path. However by default that isn't available on older pg versions and may not be configurable on pg cloud variants.

Again, I may be reading it wrong. I'm wary because I've been bitten in the past by long-running queries impacting replica query performance, and by the random nature of these issues.

If anyone knows another way to fix it, I'm all ears but after hacking around on it for 30 minutes this was the first thing that worked.

AR's postgres adapter has a private in_transaction? method, backed by open_transactions, to detect if we're in a transaction. We might be able to introspect the connection adapter's open_transactions count to determine whether we're in a transaction and actually need a savepoint SQL directive, instead of opting everyone into savepoints by default.

I'm not sure if there is a better way to introspect the AR connection's transaction stack state to determine whether we can opt in to nested transactions. Maybe @byroot has thoughts.
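A minimal sketch of that suggestion (hypothetical helper name; `connection` is assumed to quack like an ActiveRecord connection adapter):

```ruby
# Only request a real subtransaction (SAVEPOINT) when we are actually
# nested inside a caller's transaction; at the top level a plain
# BEGIN/COMMIT avoids the subtransaction overhead entirely.
def transaction_with_optional_savepoint(connection, &block)
  if connection.open_transactions > 0
    # Already inside a transaction: requires_new emits a SAVEPOINT.
    connection.transaction(requires_new: true, &block)
  else
    # Top level: an ordinary transaction, no savepoint involved.
    connection.transaction(&block)
  end
end
```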

Contributor


Maybe @byroot has thoughts.

I don't know the code, so it's hard to say. But it looks like it's trying to insert a record and ignore the error if the record already exists? If so, then ON DUPLICATE KEY IGNORE wouldn't require a subtransaction.
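On PostgreSQL the equivalent of MySQL-style insert-and-ignore is `ON CONFLICT DO NOTHING`. A sketch of the SQL shape (illustrative only; the helper name is made up and the value quoting is deliberately naive; real code would use the adapter's quoting or AR's `insert_all` with `:unique_by`):

```ruby
# Build an "insert, skipping duplicates" statement. With ON CONFLICT DO
# NOTHING, a duplicate row is silently skipped instead of raising a unique
# violation, so no subtransaction is needed to survive it.
def insert_ignore_sql(table, attributes)
  columns = attributes.keys.join(", ")
  values  = attributes.values.map { |v| "'#{v}'" }.join(", ")
  "INSERT INTO #{table} (#{columns}) VALUES (#{values}) ON CONFLICT DO NOTHING"
end
```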

Contributor


Ref to a similar discussion when we used a savepoint in find_or_create: rails/rails#51052 (comment).

Collaborator Author


@camallen this isn't a hot code path and shouldn't ever be.

If it is, then you aren't using flipper correctly. Individual actors are for enabling a feature for a few people here and there per feature. Anything that requires 100+ should be a group or % rollout, as detailed in the docs.

Any write in flipper is a rare thing compared to reads and that is what flipper is optimized for.

Contributor
@camallen camallen Sep 4, 2024


this isn't a hot code path and shouldn't ever be.

Apologies for the confusion; when I used the term 'hot code path' I was referring to the PostgreSQL server, where a savepoint used on the primary forces the replica to start tracking all known transactions. I wasn't referring to flipper configuring a feature flag, though flipper could be the trigger when it issues a savepoint transaction.

I still think there is a risk with this change. If the GitLab investigation is correct then this change introduces a ticking bomb that may at some point explode for busy PostgreSQL systems that have replicas enabled. Similar to the concerns raised in rails/rails#51052

Would you be open to a PR that attempts to use the upsert on conflict do nothing approach that byroot mentioned? I'd be up for working on this if it would be considered to land.

A naive implementation would run the model validations and, if the model is valid, use AR's insert / insert_all methods combined with :unique_by to skip the duplicate records and thus not poison the outer transaction.

A caveat: the insert-on-conflict feature is only available on PG 9.5+, while active record supports PG versions 9.3+. It appears that flipper supports active record 4.2+, so any implementation would require branching based on rails' pg server feature availability detection.

If the server doesn't support ON CONFLICT DO NOTHING SQL then the current requires_new savepoint would be used.
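A hypothetical sketch of that branching (parameterized for illustration; AR adapters do expose `supports_insert_on_duplicate_skip?`, which is true on PG 9.5+, but the helper name here is made up and `model` is assumed to be an ActiveRecord model like flipper's Gate):

```ruby
# Pick between ON CONFLICT DO NOTHING and the savepoint fallback based on
# what the connected server supports.
def insert_skipping_duplicates(connection, model, attributes)
  if connection.supports_insert_on_duplicate_skip?
    # insert emits ON CONFLICT DO NOTHING, so duplicates are skipped
    # without raising and without any subtransaction.
    model.insert(attributes)
  else
    # Older servers: fall back to a real subtransaction (SAVEPOINT) so a
    # unique violation doesn't poison the outer transaction.
    model.transaction(requires_new: true) do
      model.create!(attributes)
    end
  end
end
```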

Contributor


Looks like the configurable cache sizes have landed in PG 17: https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache

Collaborator Author


@camallen happy to take a look at a PR. Again, this is super low throughput so I don't believe it'll be an issue.

```ruby
begin
  connection.transaction do
    savepoint_name = "flipper_savepoint_#{SecureRandom.hex(8)}"
    connection.execute("SAVEPOINT #{savepoint_name}")
```
Contributor


And that's also probably why you are "poisoning" the connection. conn.transaction { conn.transaction { } } doesn't create a nested transaction. You need requires_new: true for that.
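A sketch of the distinction (hedged reconstruction with a made-up helper name; `conn` is assumed to be an ActiveRecord connection). With the default options the inner transaction call simply joins the outer one, so a rescued unique violation still marks the whole transaction for rollback on PostgreSQL; with requires_new the inner block becomes a SAVEPOINT:

```ruby
def insert_ignoring_duplicates(conn, sql)
  conn.transaction do
    begin
      # requires_new: true makes this a real subtransaction (a SAVEPOINT
      # under the hood) instead of joining the outer transaction.
      conn.transaction(requires_new: true) do
        conn.execute(sql)
      end
    rescue ActiveRecord::RecordNotUnique
      # Only the savepoint was rolled back; the outer transaction is
      # still usable, so other work in it can continue and commit.
    end
  end
end
```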

Collaborator Author


Awesome! I'll try that out. Thanks for taking the time to teach!

Collaborator Author


1c9e0a2 did the trick.

Still odd to me that mysql and sqlite don't raise about the poisoned transaction but the pg gem does. Perhaps it is something in the pg gem and not the AR adapter for Postgres?

requires_new: true is definitely cleaner than the savepoint junk I had, so I'll go with that for now.

@jnunemaker jnunemaker merged commit 18cccb9 into main Sep 3, 2024
72 checks passed
@jnunemaker jnunemaker deleted the active-record-poisoned-transaction branch September 3, 2024 18:46