Tibber: recover from disconnect #18504

GrimmiMeloni · 2025-01-30T20:42:00Z

@andig I have no Tibber to actually test this, but the logic is simple enough...

andig · 2025-01-31T09:27:02Z

meter/tibber-pulse.go

+			select {
+			case <-tick:
+			case <-ctx.Done():
+				return
+			}


Isn't that really

Suggested change

-			select {
-			case <-tick:
-			case <-ctx.Done():
-				return
-			}
+			select {
+			case <-ctx.Done():
+				return
+			default:
+			}

Not really. The tick is there to ensure we don't hammer the Pulse in case it just asked to disconnect.

Doesn't the loop already do that?

No, the loop only initializes the tick once. The waiting then comes from the select on the tick, or have I completely misunderstood this?

Ah, right. Tricky. The waiting could also be done in the for statement.
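To make the difference between the two select variants concrete, here is a minimal sketch, not the code from this PR. The helper names (`waitTick`, `pollDone`, `iterations`) and the short durations are assumptions chosen for illustration. The first variant blocks until the next tick, so it throttles the loop; the suggested variant with `default` only checks for cancellation and returns immediately.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitTick blocks until the next tick or until ctx is cancelled.
// It returns false when the caller should stop.
func waitTick(ctx context.Context, tick <-chan time.Time) bool {
	select {
	case <-tick:
		return true
	case <-ctx.Done():
		return false
	}
}

// pollDone only checks for cancellation and never waits.
func pollDone(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return false
	default:
		return true
	}
}

// iterations counts how often a loop body runs within 100ms,
// capped at 1000 so the non-blocking variant terminates quickly.
func iterations(blocking bool) int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()

	const limit = 1000
	n := 0
	for n < limit {
		if blocking {
			if !waitTick(ctx, ticker.C) {
				break
			}
		} else if !pollDone(ctx) {
			break
		}
		n++
	}
	return n
}

func main() {
	fmt.Println("blocking select on tick:", iterations(true))
	fmt.Println("non-blocking select with default:", iterations(false))
}
```

With the blocking variant the loop body runs only a handful of times, once per tick. Without the tick the loop spins as fast as it can, which is the hammering the tick is meant to prevent.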

andig · 2025-01-31T09:29:22Z

I don't understand this PR. It will exit the loop on any client error. Imho it also doesn't fix the stuck clients, these would accumulate.

GrimmiMeloni · 2025-01-31T14:29:52Z

I don't understand this PR. It will exit the loop on any client error.

That's not my understanding. On errors, the client's internal logic will do a resubscribe, so it never gets here.

Imho it also doesn't fix the stuck clients, these would accumulate.

Well that's the thing. The clients are not stuck, they just exit. I posted my analysis here.

And this is exactly what this change is about. The specific scenario is that the Pulse denies the subscription (although the data given for the subscription was known to be valid just a few seconds earlier). When it does this, it sends the client a request to shut down properly, which the client library obeys. That is exactly when you land back in the goroutine I changed, but with err being nil. So far the code then just ended the goroutine, and that is why evcc never recovers and never reestablishes the connection with the Pulse. Now it does.

So in a nutshell: we follow the request of the server (i.e. the Pulse) to shut down the websocket gracefully, and then after 10s we bring it back up anew.

andig · 2025-01-31T14:47:46Z

They are clearly stuck because the error channel is blocked, as the pprof shows?

GrimmiMeloni · 2025-01-31T19:07:33Z

They are clearly stuck because the error channel is blocked, as the pprof shows?

Not to my understanding. The routine that waits on the error channel inside subscription.go is fine to block in the pprof that was provided. It selects on both the error channel and the ctx. So my understanding is that this goroutine is expected to be blocking most of the time. It will be woken up either by an error or when the context is closed due to client shutdown. Otherwise it will just sit there.

The actual issue (based on my understanding of the pprof) is that the "main" routine of the client exits, and so does the evcc routine that called Run() to start everything. This leaves the error handler routine orphaned, which is potentially an additional bug in the wsClient library. But based on the pprof shared, as well as the augmented logs we got from your patch, I think we have confirmation that the main wsClient routine we trigger from evcc in tibber-pulse.go simply ends silently (without error) and never gets restarted.

If you are not convinced, I would suggest we patch tibber-pulse.go in the nightly to log something when the goroutine that starts Run() ends (regardless of whether an error was returned). I am pretty sure we will see this log message as the final message from tibber-pulse.go whenever users report that things don't work.

andig · 2025-01-31T21:29:23Z

Long story short: someone needs to try out this PR 👍🏻

dustin-ha · 2025-01-31T21:56:39Z

Long story short: someone needs to try out this PR 👍🏻

Is this already in the nightly Docker build? Then I could try it right away.

andig · 2025-02-01T11:09:11Z

Not to my understanding. The routine that waits on the error channel inside subscription.go is fine to block in the pprof that was provided

@GrimmiMeloni it blocks while writing to (not reading from!) the error channel. So the client does hang. Please take another look at the pprof.

GrimmiMeloni · 2025-02-01T11:46:28Z

Not to my understanding. The routine that waits on the error channel inside subscription.go is fine to block in the pprof that was provided

@GrimmiMeloni it blocks while writing to (not reading from!) the error channel. So the client does hang. Please take another look at the pprof.

Yes, I mixed up the threads in my explanation, silly of me. Sorry for the confusion. The analysis still holds, though.

You are right that this writer blocks on the channel and therefore hangs.
But it should not exist anymore at all. As mentioned before, I rather see a potential bug in go-graphql-client here.
This is supported, among other things, by the pprof, which shows that the reader routine (specifically: goroutine 93) no longer exists. The blocked writer (goroutine 3194046) is therefore a follow-on error.

From the logs we can also see that a message to shut down the wsClient connection is received. So we pass through here, which terminates the reader (and with it the only consumer of the error channel):

https://github.com/hasura/go-graphql-client/blob/47ee315bef0dc3e83d30f4f504d0c94608c7429f/subscription.go#L859

As I understand it, the close(subContext) should make sure that the writer is terminated as well. Apparently it does not, hence the goroutine leak. I suspect the bug is in graphql-client, but I am not familiar enough with Context and the "SubContext" concept. Apart from the leaked goroutine, however, that is irrelevant to our problem.

Because, suppose the reader did terminate the writer cleanly. Then nothing related to Tibber would remain in the pprof at all.
The graphql-client simply exits, and with it the goroutine in tibber-pulse.go that started the client in the first place.

Put differently: in every case where we leave subscription.go via a return that does NOT call sc.Run(), our Pulse implementation is left dead.
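The blocked-writer situation described above can be reproduced in isolation. The following is a minimal sketch and not the go-graphql-client code; the function name `writerFinishes` and the error text are made up for illustration. It assumes an unbuffered error channel whose only reader has already exited, and compares a plain send with a send that also watches the context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// writerFinishes starts a writer that reports an error on errCh after the
// only reader has already gone away. It returns true if the writer exits
// within a short grace period.
func writerFinishes(guarded bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error) // unbuffered: a send needs a waiting receiver
	done := make(chan struct{})

	// Simulates the reader shutting down: nobody will receive from errCh.
	cancel()

	go func() {
		defer close(done)
		err := errors.New("subscription failed")
		if guarded {
			select {
			case errCh <- err:
			case <-ctx.Done(): // give up once the context is closed
			}
			return
		}
		errCh <- err // blocks forever because there is no receiver
	}()

	select {
	case <-done:
		return true
	case <-time.After(50 * time.Millisecond):
		return false // the writer is still parked on the send
	}
}

func main() {
	fmt.Println("plain send finishes:", writerFinishes(false))
	fmt.Println("guarded send finishes:", writerFinishes(true))
}
```

The plain send leaves a goroutine parked on the channel for the lifetime of the process, which would match a blocked writer showing up in a pprof dump after the reader is gone. Whether the library's writer is missing such a guard is not confirmed here.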

andig · 2025-02-01T12:07:53Z

The blocked writer (goroutine 3194046) is therefore a follow-on error.

Yes. But that doesn't change the fact that it hangs and keeps consuming memory. If need be, so be it.

That's not my understanding. On errors, the client's internal logic will do a resubscribe, so it never gets here.

I still don't understand this. Run terminates. The whole point of your for loop is to restart it. So the only question is whether it exits with an error. If it does, we must not return in the error case either, otherwise it's the same as before ;)

GrimmiMeloni · 2025-02-01T12:42:41Z

I still don't understand this. Run terminates. The whole point of your for loop is to restart it. So the only question is whether it exits with an error. If it does, we must not return in the error case either, otherwise it's the same as before ;)

In the scenario discussed here it does not exit with an error. From the client lib's point of view, a clean request comes through that asks for the connection to be closed at the ws protocol level.
The actual bug is in the Pulse (or at Tibber): they reject the subscription (although it is valid) and then send exactly the response that causes clients to terminate.

I would not necessarily conclude from this that restarting the client in the error case is the right (or more correct) thing to do. On the other hand, evcc has no other way of catching this.

So, proposal: remove the return and always restart (in tibber-pulse.go).
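A minimal sketch of the "always restart" proposal, under stated assumptions: `runner` stands in for the client's blocking Run call, and `keepRunning` and `demo` are hypothetical names, not the merged code in tibber-pulse.go. The loop restarts the client whether it returned an error or nil, waits between attempts, and stops only when the context is cancelled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runner stands in for the subscription client's blocking Run method.
type runner func() error

// keepRunning starts run again whenever it returns, with or without an
// error, waiting for delay between attempts. It stops only when ctx is
// cancelled and returns how many times run was started.
func keepRunning(ctx context.Context, run runner, delay time.Duration) int {
	starts := 0
	for {
		starts++
		if err := run(); err != nil {
			fmt.Println("client exited with error:", err)
		} else {
			fmt.Println("client exited gracefully")
		}

		// Wait before reconnecting so the server is not hammered.
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return starts
		}
	}
}

// demo simulates a client that exits gracefully, then with an error,
// and finally cancels the context during its third run.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0
	run := func() error {
		calls++
		switch calls {
		case 1:
			return nil // graceful shutdown requested by the server
		case 2:
			return errors.New("connection lost")
		default:
			cancel()
			return nil
		}
	}
	// The thread mentions a 10s wait; a short delay keeps the demo fast.
	return keepRunning(ctx, run, 10*time.Millisecond)
}

func main() {
	fmt.Println("starts:", demo())
}
```

Treating both exit paths the same keeps the loop simple: the only way out is cancellation of the surrounding context.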

meter/tibber-pulse.go

Co-authored-by: Michael Heß <[email protected]>

restart client upon graceful exit

cae6f54

GrimmiMeloni requested a review from andig January 30, 2025 20:43

GrimmiMeloni self-assigned this Jan 30, 2025

GrimmiMeloni added the bug Something isn't working label Jan 30, 2025

andig reviewed Jan 31, 2025


Andi1887 mentioned this pull request Feb 1, 2025

Tibber: evcc hängt nach kurzzeitigem Ausfall von Tibber #17925

Open

1 task

GrimmiMeloni commented Feb 1, 2025


Update meter/tibber-pulse.go

5c174bc

Co-authored-by: Michael Heß <[email protected]>

andig merged commit 1bfcc19 into evcc-io:master Feb 1, 2025
6 checks passed
