Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(snow/vm): health check #1898

Merged
merged 6 commits into from
Jan 30, 2025
Merged

feat(snow/vm): health check #1898

merged 6 commits into from
Jan 30, 2025

Conversation

Elvis339
Copy link
Contributor

Closes: #1878

This patch set introduces a new health check mechanism for the Snow VM, which monitors VM readiness and tracks vacuously verified blocks.

@Elvis339 Elvis339 self-assigned this Jan 29, 2025
snow/health.go Outdated
return &VMReadiness{isReady: isReady}
}

func (v *VMReadiness) HealthCheck(_ context.Context) (interface{}, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR makes common use of this but after go1.18+ I think it's idiomatic to just use the any alias which reads nicer in place of interface{}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I replaced interface{} with any.

snow/health.go Outdated
Comment on lines 93 to 97
func NewUnresolvedBlocksCheck[I Block]() *UnresolvedBlocksCheck[I] {
return &UnresolvedBlocksCheck[I]{
unresolvedBlocks: set.Set[ids.ID]{},
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this constructor is needed because the zero values for set.Set and sync.RWMutex are safe to use

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's not. Updated, thanks.

snow/health.go Outdated
)

func (v *VM[I, O, A]) RegisterHealthChecker(name string, healthChecker health.Checker) error {
if _, loaded := v.healthCheckers.LoadOrStore(name, healthChecker); loaded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use ok for these type of bool names where you're checking if something is present, I've also seen has but ok seems most common in Go

snow/health.go Outdated
}

func (v *VM[I, O, A]) HealthChecker(name string) (interface{}, error) {
value, ok := v.healthCheckers.Load(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I just typed out the previous comment but it looks like we ended up doing this later. We should try to be consistent w/ naming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, thanks for pointing it out.

snow/health.go Outdated
Comment on lines 20 to 21
VMReadinessHealthChecker = "snowVMReadiness"
UnresolvedBlocksHealthChecker = "snowUnresolvedBlocks"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I need to check what our naming convention is for this, but I don't think it's camelCasing?
  2. These don't need to be exported/exposed as part of the package api AFAICT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re on point 1.
The naming convention here is flatcase. I suggest using camelCase since heath checks responses typically get converted to JSON for structured logging in monitoring systems. AvalancheGo is also following camelCase naming convention.

Re on point 2.
Updated.

vm/statesync.go Outdated
Comment on lines 147 to 150
err = vm.snowApp.RegisterHealthChecker(StateSyncNamespace, vm.SyncClient)
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err = vm.snowApp.RegisterHealthChecker(StateSyncNamespace, vm.SyncClient)
if err != nil {
return err
}
return vm.snowApp.RegisterHealthChecker(StateSyncNamespace, vm.SyncClient)

snow/health.go Outdated
lock sync.RWMutex
}

func NewUnresolvedBlocksCheck[I Block]() *UnresolvedBlocksCheck[I] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these types are necessary as part of the exported package api

// accepted during state sync will eventually be rejected in favor of valid blocks.
type UnresolvedBlocksCheck[I Block] struct {
unresolvedBlocks set.Set[ids.ID]
lock sync.RWMutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be safe against concurrent use? I thought that the event types had Notify called in serial... but I don't recall for HealthCheck...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snow/health.go Outdated
func (v *VMReadiness) HealthCheck(_ context.Context) (interface{}, error) {
ready := v.isReady()
if !ready {
return nil, errors.New("vm is not ready")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return this map in both cases but change the value to false if it's not ready?

return err
}

unresolvedBlkHealthChecker, typeConvOK := healthChecker.(*UnresolvedBlocksCheck[I])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we just call this ok? This is idiomatic in Go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been removed.

snow/vm.go Outdated Show resolved Hide resolved
@@ -143,16 +144,15 @@ type VM[I Block, O Block, A Block] struct {
tracer trace.Tracer

shutdownChan chan struct{}

healthCheckers sync.Map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to figure out if we need a sync.Map here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at AvalancheGo health package doc it seems like we need it.

Copy link
Contributor

@joshua-kim joshua-kim Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link you described says that health checks run in their own goroutine (and this is true, ref). AFAICT, there aren't multiple concurrent health checks for a single type from what I can tell... it just means that multiple vms can have their health checks run concurrently, not that a vm will have multiple health checks running concurrently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enables us to register health checks when AvalancheGo may call HealthCheck concurrently.

Copy link
Contributor

@joshua-kim joshua-kim Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think avalanchego calls HealthCheck concurrently in the way that this code is trying to guard against, refer to code I linked. VM health checks are concurrent (i.e multiple vms can run health checks concurrently), but each VM's health check will only run once the previous one has completed.

@aaronbuchwald aaronbuchwald merged commit 3963552 into main Jan 30, 2025
17 checks passed
@aaronbuchwald aaronbuchwald deleted the vm-health-check branch January 30, 2025 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add State Sync Transition Health Check to Snow VM
3 participants