
Scheduler subsystem high level design proposal #603

Open

smarterclayton wants to merge 1 commit into kubernetes-sigs:main from smarterclayton:docs

+150 −0

Contributor

smarterclayton commented Mar 28, 2025

This proposal sets down the basic design principles of the current gateway scheduler. It also identifies the users we are targeting, explains why we prioritize the current approach, and selects standard terminology for scheduling that the implementation should adopt.

This is a high level design and thus sets general scope, without expecting to fully address all problems.

k8s-ci-robot requested review from danehans and Jeffwan

March 28, 2025 17:44

k8s-ci-robot added the cncf-cla: yes label

Contributor

k8s-ci-robot commented Mar 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smarterclayton
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the size/L label

netlify bot commented Mar 28, 2025

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`5826cec`
🔍 Latest deploy log	https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67e6e62626758b000876dc66
😎 Deploy Preview	https://deploy-preview-603--gateway-api-inference-extension.netlify.app


          Scheduler subsystem high level design proposal

5826cec

smarterclayton force-pushed the docs branch from 37eec8d to 5826cec Compare

March 28, 2025 18:10

kfswain reviewed

docs/proposals/006-scheduler/README.md


		#### Replacement Scheduler

		The replacement scheduler will be a low-latency mechanism for out-of-process execution of the core endpoint selection option. The replacement scheduler will accept one or more requests to schedule, a list of endpoints, and optionally the associated informer state for those endpoints. The replacement scheduler will return one or zero endpoints per request.

Collaborator

kfswain Mar 28, 2025

You mention above:

the scheduler will consult a list of configured scorers to score the matches into a prioritized list of endpoints.

Should that just be the expected output of any scheduler?

It could allow the Personas to do cool parity comparisons, e.g. run these 3 schedulers in parallel, but treat the reference scheduler as the source of truth and only block on its response. That could be a useful/safe way to roll out algorithms on production traffic.

Contributor Author

smarterclayton Mar 28, 2025

Yeah, the layering is that filtering, scoring, and prioritization are three separate phases.
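To make the three phases concrete, here is a minimal sketch, assuming a toy `Endpoint` type with two hypothetical metrics (`queue_depth`, `kv_cache_util`). None of these names come from the proposal; the point is only that the output of any scheduler can be a prioritized list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Endpoint:
    name: str
    queue_depth: int      # hypothetical metric: requests waiting on the server
    kv_cache_util: float  # hypothetical metric: 0.0 (empty) to 1.0 (full)


Filter = Callable[[Endpoint], bool]
Scorer = Callable[[Endpoint], float]


def schedule(endpoints: List[Endpoint],
             filters: List[Filter],
             scorers: List[Scorer]) -> List[Endpoint]:
    # Phase 1 (filtering): drop endpoints rejected by any filter.
    candidates = [e for e in endpoints if all(f(e) for f in filters)]
    # Phase 2 (scoring): sum the configured scorers for each candidate.
    scored = [(sum(s(e) for s in scorers), e) for e in candidates]
    # Phase 3 (prioritization): highest score first, name as a stable tie-break.
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [e for _, e in scored]
```

A caller that needs a single endpoint can take the head of the list, while a comparison harness can look at the whole ordering.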

Being able to run a shadow schedule is indeed valuable and aligns to the rollout use cases.
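A rough sketch of what a shadow schedule could look like, assuming each scheduler is a plain callable `(request, endpoints) -> endpoint` and that `on_compare` is a hypothetical hook for recording parity results. The caller blocks only on the reference scheduler; shadow results are compared whenever they finish.

```python
from concurrent.futures import ThreadPoolExecutor


def shadow_schedule(request, endpoints, reference, shadows, executor, on_compare):
    """Return the reference pick; report shadow picks via on_compare."""
    # Start the shadow schedulers first so they run alongside the reference.
    pending = {name: executor.submit(fn, request, endpoints)
               for name, fn in shadows.items()}

    # The reference scheduler is the source of truth and the only blocking call.
    expected = reference(request, endpoints)

    for name, future in pending.items():
        def compare(done, name=name):
            err = done.exception()
            actual = None if err else done.result()
            on_compare(name, expected, actual, err)
        # Runs in a worker thread, or immediately if the future is already done.
        future.add_done_callback(compare)

    return expected, list(pending.values())
```

A failing or slow shadow never affects the returned endpoint; it only shows up in whatever `on_compare` records.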

docs/proposals/006-scheduler/README.md


		A benchmarking harness will be provided to capture and reproduce a production trace, primarily to aid algorithmic contributors. A small but diverse set of production traces will be used initially to anchor expectations, and scaling both the number of supported traces and efficient regression testing at scale will be critical.

		We anticipate that accelerator availability will limit the scale of e2e testing and contribution. We will develop a model server stub that can emulate the behavior of the core expected algorithm for model servers and does not require accelerators. We will support both time-accurate and configurable ratio emulation to allow fast execution.

Collaborator

kfswain Mar 28, 2025

++
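For illustration, a minimal sketch of the stub idea, assuming a simple linear cost model (per-token prefill and decode times) that is purely hypothetical and not taken from the proposal. `time_ratio=1.0` stands in for time-accurate emulation, and smaller values compress wall-clock time for fast runs.

```python
import time
from dataclasses import dataclass


@dataclass
class StubModelServer:
    # Hypothetical cost model: seconds per prompt token and per output token.
    prefill_s_per_token: float = 0.0002
    decode_s_per_token: float = 0.02
    # 1.0 emulates real time; 0.01 runs the same trace 100x faster.
    time_ratio: float = 1.0

    def modeled_latency(self, prompt_tokens: int, output_tokens: int) -> float:
        return (prompt_tokens * self.prefill_s_per_token
                + output_tokens * self.decode_s_per_token)

    def handle(self, prompt_tokens: int, output_tokens: int,
               sleep=time.sleep) -> float:
        modeled = self.modeled_latency(prompt_tokens, output_tokens)
        # Only the wall-clock wait is scaled; the modeled latency is reported as-is.
        sleep(modeled * self.time_ratio)
        return modeled
```

Passing a fake `sleep` makes the stub usable in unit tests without waiting at all.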

kfswain reviewed

docs/proposals/006-scheduler/README.md

+              ## Non-Goals
+              - Dynamic reconfiguration of the reference scheduler algorithms at runtime
+              - Being a general scheduler framework for load balancing

Collaborator

kfswain Mar 28, 2025

I agree with this. I'm trying to think through how we enforce this and where we draw the line. What do we consider in scope? There will likely be some overlap, but any addition/improvement should probably have an inference-specific justification.

kfswain reviewed

docs/proposals/006-scheduler/README.md

+              - The scheduler should be educatable - extending the [model server protocol](../003-model-server-protocol/) with new metrics or adding a new source of data should be minimally invasive
+              - The scheduler should be replaceable - the reference endpoint picker implementation should support delegating scheduling decisions per pool to an alternative **replacement scheduler**
+              ## Non-Goals

Collaborator

kfswain Mar 28, 2025

I think we should specify that a non-goal would be the concept of:

How to graduate ideas from: fork -> included in reference scheduler -> exposed as EPP config -> config expressed in our API

Something we should figure out, but not in this doc.

ahg-g reviewed

docs/proposals/006-scheduler/README.md


		We desire the following outcomes from the reference scheduler:

		1. Keep model servers optimally utilized without saturating

Contributor

ahg-g Mar 28, 2025

This is mostly dictated by the QPS though; the algorithm can't ensure that the model servers don't saturate.

Contributor Author

smarterclayton Mar 28, 2025

Yeah, should be “allow model servers to more predictably approach saturation” instead

docs/proposals/006-scheduler/README.md

+ 1. Keep model servers optimally utilized without saturating
+ 2. Make user-visible request latency more predictable
+ 3. Provide isolation between multiple workloads on the same model servers before saturation

Contributor

ahg-g Mar 28, 2025

I recommend defining saturation in the proposal.

Contributor Author

smarterclayton Mar 28, 2025

Good point, might as well open that doc up

docs/proposals/006-scheduler/README.md


		#### Replacement Scheduler

		The replacement scheduler will be a low-latency mechanism for out-of-process execution of the core endpoint selection option. The replacement scheduler will accept one or more requests to schedule, a list of endpoints, and optionally the associated informer state for those endpoints. The replacement scheduler will return one or zero endpoints per request.

Contributor

ahg-g Mar 28, 2025

Why does this look like a batch scheduler?

Contributor Author

smarterclayton Mar 28, 2025

1k QPS will require either streaming or batching. Fair point, though: we don't want to constrain that design yet.
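To show why the wording reads as a batch interface, here is a minimal sketch of one possible shape, assuming a hypothetical `free_slots` map as the optional informer state. It takes several requests at once and returns one or zero endpoints per request; a streaming design would expose the same per-request result differently.

```python
from typing import Dict, List, Optional


def schedule_batch(request_ids: List[str],
                   endpoints: List[str],
                   free_slots: Optional[Dict[str, int]] = None
                   ) -> Dict[str, Optional[str]]:
    if free_slots is None:
        # No informer state supplied: fall back to round robin.
        return {rid: (endpoints[i % len(endpoints)] if endpoints else None)
                for i, rid in enumerate(request_ids)}

    # Work on a copy so the caller's state is not mutated.
    remaining = {e: free_slots.get(e, 0) for e in endpoints}
    picks: Dict[str, Optional[str]] = {}
    for rid in request_ids:
        best = max(remaining, key=remaining.get, default=None)
        if best is None or remaining[best] <= 0:
            picks[rid] = None  # zero endpoints: nothing has capacity left
        else:
            remaining[best] -= 1
            picks[rid] = best
    return picks
```

Returning `None` for a request keeps the "one or zero endpoints" contract explicit and leaves the caller to decide whether to queue or reject.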

docs/proposals/006-scheduler/README.md


		Given that we anticipate a significant amount of future work to integrate heterogeneous hardware (different generations / topologies) and heterogeneous server roles (prefill-heavy, prefill/decode split, latency objectives), we expect that there will be an assignment informer that partitions the candidate endpoints over multiple dimensions for the scheduler. This will decouple the scheduling algorithm from the process of determining the capacity and suitability of different model servers to different dimensions of request cost.

		#### Replacement Scheduler

Contributor

ahg-g Mar 28, 2025

I am not following what this section is proposing. What do we mean by replacement? Replacing the reference scheduler? Why is this proposal trying to define that?

Contributor Author

smarterclayton Mar 28, 2025

It’s kind of defined above, but implementing a whole EPP is a lot, and I don’t want to duplicate everything in another language.

Contributor

ahg-g Mar 28, 2025

I guess my question is: are we proposing to build a second reference scheduler in this repo?

Collaborator

kfswain Mar 28, 2025

Just the tools to do so: I'm gonna clean up & finish implementing: https://github.com/kfswain/go-py-interface/tree/main (it's hideous and incomplete, which is why I've been so cagey about it)

So the Replacement Scheduler could be a Python based scheduler, that is called via an EPP fork. If we create a simple ingress interface:

- endpoint map w/metrics
- config params

And a simple egress interface:

- scored endpoints
- maybe specify how many endpoints to duplicate to? (was mentioned as a potential need by prodstack)

implementing a new algo in Python should be straightforward
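As a sketch only, the ingress/egress pair could look something like the following on the Python side, assuming JSON as the wire format between the EPP fork and the scheduler. The field names (`endpoints`, `config`, `scores`, `fan_out`) are hypothetical and are not taken from the linked repo.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict


@dataclass
class SchedulingInput:
    endpoints: Dict[str, dict]                  # ingress: endpoint name -> metrics
    config: dict = field(default_factory=dict)  # ingress: config params


@dataclass
class SchedulingOutput:
    scores: Dict[str, float]  # egress: scored endpoints
    fan_out: int = 1          # egress: how many endpoints to duplicate to


def handle(raw: str, score_fn: Callable[[dict, dict], float]) -> str:
    """Decode one request, score every endpoint, and encode the reply."""
    msg = SchedulingInput(**json.loads(raw))
    scores = {name: score_fn(metrics, msg.config)
              for name, metrics in msg.endpoints.items()}
    out = SchedulingOutput(scores=scores,
                           fan_out=int(msg.config.get("fan_out", 1)))
    return json.dumps(asdict(out), sort_keys=True)
```

With this shape, a new algorithm is just a different `score_fn`; the encoding and decoding stay the same.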

Labels

cncf-cla: yes size/L