-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(snow/vm): health check #1898
Conversation
snow/health.go
Outdated
return &VMReadiness{isReady: isReady} | ||
} | ||
|
||
func (v *VMReadiness) HealthCheck(_ context.Context) (interface{}, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR makes common use of this but after go1.18+ I think it's idiomatic to just use the any
alias which reads nicer in place of interface{}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I replaced interface{}
with any
.
snow/health.go
Outdated
func NewUnresolvedBlocksCheck[I Block]() *UnresolvedBlocksCheck[I] { | ||
return &UnresolvedBlocksCheck[I]{ | ||
unresolvedBlocks: set.Set[ids.ID]{}, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this constructor is needed because the zero values for set.Set
and sync.RWMutex
are safe to use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it's not. Updated, thanks.
snow/health.go
Outdated
) | ||
|
||
func (v *VM[I, O, A]) RegisterHealthChecker(name string, healthChecker health.Checker) error { | ||
if _, loaded := v.healthCheckers.LoadOrStore(name, healthChecker); loaded { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Use ok
for these type of bool names where you're checking if something is present, I've also seen has
but ok
seems most common in Go
snow/health.go
Outdated
} | ||
|
||
func (v *VM[I, O, A]) HealthChecker(name string) (interface{}, error) { | ||
value, ok := v.healthCheckers.Load(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I just typed out the previous comment but it looks like we ended up doing this later. We should try to be consistent w/ naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, thanks for pointing it out.
snow/health.go
Outdated
VMReadinessHealthChecker = "snowVMReadiness" | ||
UnresolvedBlocksHealthChecker = "snowUnresolvedBlocks" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I need to check what our naming convention is for this, but I don't think it's camelCasing?
- These don't need to be exported/exposed as part of the package api AFAICT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re on point 1.
The naming convention here is flatcase. I suggest using camelCase since heath checks responses typically get converted to JSON for structured logging in monitoring systems. AvalancheGo is also following camelCase naming convention.
Re on point 2.
Updated.
vm/statesync.go
Outdated
err = vm.snowApp.RegisterHealthChecker(StateSyncNamespace, vm.SyncClient) | ||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err = vm.snowApp.RegisterHealthChecker(StateSyncNamespace, vm.SyncClient) | |
if err != nil { | |
return err | |
} | |
return vm.snowApp.RegisterHealthChecker(StateSyncNamespace, vm.SyncClient) |
snow/health.go
Outdated
lock sync.RWMutex | ||
} | ||
|
||
func NewUnresolvedBlocksCheck[I Block]() *UnresolvedBlocksCheck[I] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these types are necessary as part of the exported package api
// accepted during state sync will eventually be rejected in favor of valid blocks. | ||
type UnresolvedBlocksCheck[I Block] struct { | ||
unresolvedBlocks set.Set[ids.ID] | ||
lock sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be safe against concurrent use? I thought that the event types had Notify
called in serial... but I don't recall for HealthCheck
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do need sync primitive: https://github.com/ava-labs/avalanchego/tree/master/api/health#health-check-worker
snow/health.go
Outdated
func (v *VMReadiness) HealthCheck(_ context.Context) (interface{}, error) { | ||
ready := v.isReady() | ||
if !ready { | ||
return nil, errors.New("vm is not ready") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return this map in both cases but change the value to false
if it's not ready?
snow/statesync.go
Outdated
return err | ||
} | ||
|
||
unresolvedBlkHealthChecker, typeConvOK := healthChecker.(*UnresolvedBlocksCheck[I]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we just call this ok
? This is idiomatic in Go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been removed.
@@ -143,16 +144,15 @@ type VM[I Block, O Block, A Block] struct { | |||
tracer trace.Tracer | |||
|
|||
shutdownChan chan struct{} | |||
|
|||
healthCheckers sync.Map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to figure out if we need a sync.Map
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at AvalancheGo health package doc it seems like we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link you described says that health checks run in their own goroutine (and this is true, ref). AFAICT, there aren't multiple concurrent health checks for a single type from what I can tell... it just means that multiple vms can have their health checks run concurrently, not that a vm will have multiple health checks running concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enables us to register health checks when AvalancheGo may call HealthCheck
concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think avalanchego calls HealthCheck
concurrently in the way that this code is trying to guard against, refer to code I linked. VM health checks are concurrent (i.e multiple vms can run health checks concurrently), but each VM's health check will only run once the previous one has completed.
Closes: #1878
This patch set introduces a new health check mechanism for the Snow VM, which monitors VM readiness and tracks vacuously verified blocks.