Scheduler leader election #841
Labels
- `a/reliability` (Area: relates to reliability of the service)
- `c/autoscaling/scheduler` (Component: autoscaling: k8s scheduler)
- `t/feature` (Issue type: feature, for new features or requests)
Problem description / Motivation
Similar to #762, we only run a single instance of the scheduler at a time, which means we're vulnerable to extended outages if a node goes down. A "simple" way to fix this is via leader election.
Currently, running multiple scheduler instances at once is unsound, and is unlikely to work correctly.
Feature idea(s) / DoD
Scheduler supports leader election, for high availability in case of single node failure.
Scheduler should probably also have anti-affinity with itself, so replicas land on different nodes (not sure if that's already provided by ReplicaSet / Deployment).
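For reference: Deployments / ReplicaSets don't add anti-affinity by default, so it would need to be set explicitly in the pod template. A sketch, assuming a hypothetical `app: autoscale-scheduler` pod label:

```yaml
# In the scheduler Deployment's pod template spec (label name is assumed):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: autoscale-scheduler
        topologyKey: kubernetes.io/hostname
```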
Implementation ideas
In addition to the changes to the deployment YAML, we should also adapt the scheduler plugin in some way so that its state is discarded when it's no longer the leader. Not sure how much work that is, or how we can get that signal.
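One possible source for that signal: client-go's `leaderelection` package takes an `OnStoppedLeading` callback that fires when the lease is lost. A minimal, self-contained sketch of wiring that callback to a state reset (`schedulerState` and its fields are hypothetical stand-ins for the plugin's real state):

```go
package main

import "fmt"

// schedulerState stands in for the plugin's cached view of the cluster
// (hypothetical type; the real plugin state is more involved).
type schedulerState struct {
	reservedCPU map[string]int // illustrative: per-node reserved CPU
}

// reset discards everything learned while leading; it may be stale by
// the time this replica becomes leader again.
func (s *schedulerState) reset() {
	s.reservedCPU = make(map[string]int)
}

func main() {
	state := &schedulerState{reservedCPU: map[string]int{"node-1": 4}}

	// client-go's leaderelection.LeaderCallbacks delivers exactly this
	// signal via OnStoppedLeading; here we just invoke it directly.
	onStoppedLeading := func() { state.reset() }

	onStoppedLeading() // simulate losing the lease
	fmt.Println(len(state.reservedCPU)) // prints 0
}
```

On becoming leader again, the replica would re-list pods/VMs/nodes before resuming decisions, which is where the startup-cost concern below comes in.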
Alternatively, if the pod/VM/node listing on startup is too expensive, we can modify the plugin so that decisions made without its input are actually sound (within reason).
We also need to adapt the autoscaler-agent to handle multiple scheduler instances, or expose a connection to the current leader via a k8s Service, or something similar. Not sure if that's possible.
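The Service route is possible in principle: a Service only sends traffic to Ready pods, so if only the current leader reports Ready, the Service's endpoints always contain just the leader, and agents keep dialing one stable name. A sketch with assumed names and port:

```yaml
# Assumed label/port values; follower replicas would fail their
# readinessProbe until they win the election, keeping them out of
# the Service's endpoints.
apiVersion: v1
kind: Service
metadata:
  name: autoscale-scheduler
spec:
  selector:
    app: autoscale-scheduler
  ports:
    - name: plugin-api
      port: 10570
      targetPort: 10570
```

Whether the plugin can cheaply answer "am I the leader?" for a readiness probe is part of the open question above.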