Introduction of back pressure and circuit breaker #534

Open
sdegroot opened this issue Feb 25, 2025 · 1 comment

sdegroot commented Feb 25, 2025

Theme

Objecten API

Description

Several ZGW-services communicate with each other. The most common (and potentially problematic) example: open-zaak / objecten-api calling open-notificaties.

I've seen several times now that things start to break down when there is a misconfiguration or a semi-large load. Response times increase, REST calls fail, and database usage maxes out.

Recently, I experienced a misconfiguration of open-zaak where the authentication for open-notificaties was incorrect. This resulted in an HTTP 403 for every notification being sent. Open-zaak kept trying to send notifications at about 10K calls per minute. In turn, this led to database usage sitting at 100% continuously (not sure why; probably each call triggers some sort of database check?). All systems started to degrade in performance and functionality.

I suggest that all ZGW components introduce some kind of back pressure and ideally even a circuit breaker to prevent the overload of components. Having a circuit breaker will also help with alerting on problems before the impact becomes so large that end-users start to complain.
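
To illustrate the circuit breaker idea, here is a minimal Python sketch. It is not taken from any ZGW component: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and their default values are all assumptions chosen for the example. After a number of consecutive failures the breaker rejects calls immediately, and after a cool-down it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker; thresholds are illustrative, not tuned values."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cool-down elapsed, allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of sending yet another request downstream.
            raise CircuitOpenError("downstream service marked unavailable")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call in half-open state re-opens the circuit at once.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def _record_success(self):
        self.failures = 0
        self.opened_at = None
```

In this sketch the sending side would wrap its notification call in `breaker.call(...)`. The `state` property could also be exported as a metric, which is where the alerting benefit would come from.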

Added value

  • better monitoring capabilities
  • preventing cascading failures after a component fails

Additional context

No response

sdegroot added the enhancement and triage labels Feb 25, 2025

alextreme (Member) commented

Discussed. This involves communicating performance across 4 processes (oz-uwsgi -> oz-worker -> on-uwsgi -> on-worker), and oz-uwsgi needs to know whether on-worker is able to keep up. That requires quite a bit of coordination between the various services in order to deal with memory exhaustion at each step.

We could communicate this via an HTTP header from ON to OZ (and Objects API). The alternative is that OZ throttles incoming calls if it detects that multiple notifications to ON are throwing errors.
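
As a rough sketch of the header-based option: the thread does not say which header would be used, so this example assumes ON answers with a 429 or 503 status and the standard `Retry-After` header expressed in seconds. The class name `BackpressureGate` and its methods are hypothetical and exist only for illustration.

```python
import time


class BackpressureGate:
    """Tracks a 'do not send before' deadline derived from response headers."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.resume_at = 0.0

    def observe(self, status_code, headers):
        """Record a backpressure signal from a response, if one is present."""
        if status_code not in (429, 503):
            return
        # Assumes a mapping keyed exactly "Retry-After"; a plain dict lookup
        # is case-sensitive, unlike the header mappings of most HTTP clients.
        value = headers.get("Retry-After", "")
        # Retry-After may also be an HTTP date; this sketch only handles seconds.
        if value.isdigit():
            self.resume_at = max(self.resume_at, self.clock() + int(value))

    def seconds_to_wait(self):
        """How long the sender should hold off before the next call."""
        return max(0.0, self.resume_at - self.clock())
```

The sender would call `observe()` after every response and check `seconds_to_wait()` before sending the next notification, for example by rescheduling the task instead of sending immediately.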

Steven mentions that retrying on an HTTP 403, as described by Sander, doesn't make sense: a Forbidden response won't succeed no matter how often you retry, so in that case the authentication needs to be fixed.

We'll keep this in triage and come up with a few ideas, maybe a PoC to see what would work in which situations. There isn't an easy fix for this in my eyes.
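
The point about not retrying a 403 could be expressed as a small retry policy. This is a sketch under assumptions, not how open-zaak currently decides: the set of retryable status codes, the function names, and the backoff parameters are all chosen for illustration.

```python
# Status codes that usually indicate a temporary condition (an assumption;
# adjust to taste). 401 and 403 are deliberately absent: retrying cannot
# fix bad credentials or missing permissions.
RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}


def should_retry(status_code, attempt, max_attempts=5):
    """Return True only for transient failures within the attempt budget."""
    if attempt >= max_attempts:
        return False
    return status_code in RETRYABLE_STATUS_CODES


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff in seconds, capped so delays stay bounded."""
    return min(cap, base * (2 ** attempt))
```

With a policy like this, a misconfigured credential would produce one failed call per notification rather than a stream of retries.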
