Several ZGW-services communicate with each other. The most common (and potentially problematic) example: open-zaak / objecten-api calling open-notificaties.
I've seen several times now that things start to break down when there is a misconfiguration or a semi-large load: response times increase, REST calls fail, and database usage maxes out.
Recently, I experienced a misconfiguration of open-zaak where the authentication for open-notificaties was incorrect, resulting in an HTTP 403 for every notification being sent. Open-zaak kept trying to send notifications at about 10K calls per minute. In turn, this led to database usage sitting at 100% continuously (not sure why; probably each call triggers some sort of database check?). All systems started to degrade in performance and functionality.
I suggest that all ZGW components introduce some kind of back pressure and ideally even a circuit breaker to prevent the overload of components. Having a circuit breaker will also help with alerting on problems before the impact becomes so large that end-users start to complain.
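To make the circuit breaker idea a bit more concrete, here is a minimal sketch in Python. It is not taken from open-zaak or any other ZGW component; the class name, thresholds and the injectable `clock` are all assumptions for illustration. After a number of consecutive failures the breaker "opens" and rejects calls immediately, and after a timeout it lets a trial call through again.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker around outgoing notification calls (not thread-safe)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of hitting the struggling service again.
            raise CircuitOpenError("notification endpoint is marked unavailable")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success: reset the breaker completely.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call in half-open state re-opens the circuit at once.
        if self._failures >= self.failure_threshold or self._opened_at is not None:
            self._opened_at = self._clock()
```

The `state` property is also a natural hook for alerting: a transition to "open" could be logged or exported as a metric before end-users notice anything.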
Toegevoegde waarde / Added value
better monitoring capabilities
preventing cascading failures after a component fails
Aanvullende opmerkingen / Additional context
No response
We discussed this. Because we're talking about communicating performance between 4 processes (oz-uwsgi -> oz-worker -> on-uwsgi -> on-worker), and you want to know on the oz-uwsgi side whether on-worker is able to keep up, this requires quite a bit of coordination between the various services in order to deal with memory exhaustion at the various steps.
We could communicate this via an HTTP header from ON to OZ (and the Objects API). The alternative is that OZ throttles incoming calls if it detects that multiple notifications to ON are throwing errors.
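As a rough illustration of the header option, the sketch below assumes ON would answer with a 429 or 503 plus the standard `Retry-After` header when it cannot keep up. That is an assumption for the sake of the example, not something ON does today as far as this thread states; the function name and default values are hypothetical too.

```python
from typing import Mapping, Optional


def backoff_delay_from_response(
    status_code: int,
    headers: Mapping[str, str],
    default_delay: float = 5.0,
    max_delay: float = 300.0,
) -> Optional[float]:
    """Return how many seconds the sender should pause, or None for no throttling."""
    if status_code not in (429, 503):
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after", "").strip()
    if raw.isdigit():
        # Delay given in seconds; cap it so a bad value cannot stall us forever.
        return min(float(raw), max_delay)
    # Missing header, or the HTTP-date form which this sketch does not parse.
    return default_delay
```

The sending side (OZ or the Objects API) could feed the returned delay into whatever schedules the next notification attempt.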
Steven mentions that retrying on an HTTP 403, as described by Sander, doesn't make sense: a Forbidden response won't succeed no matter how often you retry. In that case the authentication needs to be fixed.
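A retry policy along those lines could look roughly like this. It is only a sketch: the set of retryable status codes, the function name and the backoff parameters are assumptions, not the current behaviour of any ZGW component. A 403 is never retried, while transient errors get an exponentially growing delay.

```python
from typing import Optional

# Assumed set of status codes that are worth retrying; 403 is deliberately absent.
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})


def next_retry_delay(
    status_code: int,
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: int = 5,
) -> Optional[float]:
    """Return the delay before the next attempt, or None to give up.

    `attempt` is the number of attempts already made (1 after the first call).
    """
    if status_code not in RETRYABLE_STATUS or attempt >= max_attempts:
        return None
    # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
    return min(base_delay * 2 ** (attempt - 1), max_delay)
```

With these defaults, `next_retry_delay(403, 1)` gives up immediately, so a broken credential produces one failed call per notification instead of an endless stream.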
We'll keep this in triage and come up with a few ideas, maybe a PoC to see what would work in which situations. There isn't an easy fix for this in my eyes.
Thema / Theme
Objecten API