[node-labeler] Refactor node labeling to use taints instead of labels #20652

Merged
merged 5 commits into from
Mar 11, 2025

@iQQBot iQQBot commented Mar 5, 2025

Description

[node-labeler] Refactor node labeling to use taints instead of labels

Note: The taint is not long-lived, so we don't need to worry too much about it interfering with the scheduling of other DaemonSets. I've also checked our DaemonSets on GCP and AWS, and the vast majority of them tolerate the NoSchedule taint.
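
To make the toleration point concrete, here is a minimal sketch of how a toleration is matched against a taint. It mirrors the matching rules documented for Kubernetes, but `Taint`, `Toleration`, `tolerates`, and `untolerated` are simplified, hypothetical stand-ins written for illustration, not the types or helpers used by node-labeler.

```go
package main

// Taint and Toleration are simplified, hypothetical stand-ins for the
// Kubernetes core/v1 types; only the fields needed for matching are kept.
type Taint struct {
	Key    string
	Value  string
	Effect string // e.g. "NoSchedule"
}

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string // empty matches every effect
}

// tolerates reports whether a single toleration matches the given taint.
func tolerates(tol Toleration, taint Taint) bool {
	// A non-empty effect on the toleration must equal the taint's effect.
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	// A non-empty key must equal the taint's key. An empty key combined
	// with the Exists operator therefore matches every taint key.
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == taint.Value
	default:
		return false
	}
}

// untolerated returns the taints that none of the tolerations match.
func untolerated(tols []Toleration, taints []Taint) []Taint {
	var out []Taint
	for _, taint := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, taint) {
				matched = true
				break
			}
		}
		if !matched {
			out = append(out, taint)
		}
	}
	return out
}
```

A DaemonSet whose pod template carries a toleration with only `Operator: "Exists"` matches every taint in this sketch, which is the usual pattern for infrastructure DaemonSets; one with no tolerations would be left with the NoSchedule taint in the `untolerated` result.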

Related Issue(s)

Fixes CLC-1032

How to test

The best way to test this is with a cell or an ephemeral workspace cluster.

  1. Deploy this node-labeler image to a cell or an ephemeral workspace cluster.
  2. Delete the ws-daemonset / registry-facade pod and check how that affects the corresponding node's taints.
  3. Observe how a newly added workspace node handles these taints. (Other PRs are also needed so that the ASG creates the node with this taint.)
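
The behavior that step 2 exercises can be sketched roughly as follows: the taint for a component is dropped once that component's pod is ready on the node, and put back when the pod goes away. This is an illustration under stated assumptions, not the code in this PR; the `Taint` type, the taint keys, and `reconcileTaint` are hypothetical names.

```go
package main

// Taint is a simplified, hypothetical stand-in for the Kubernetes taint type.
type Taint struct {
	Key    string
	Effect string
}

// Hypothetical taint keys, one per component that must be ready on a node
// before workspaces are scheduled there. The real keys may differ.
const (
	wsDaemonTaintKey       = "example.com/ws-daemon-not-ready"
	registryFacadeTaintKey = "example.com/registry-facade-not-ready"
)

// reconcileTaint returns the taints a node should carry after observing
// whether the component guarded by key is ready, and whether that differs
// from the input (so the caller knows if the node needs an update).
func reconcileTaint(taints []Taint, key string, ready bool) ([]Taint, bool) {
	out := make([]Taint, 0, len(taints)+1)
	found := false
	for _, t := range taints {
		if t.Key == key {
			found = true
			if ready {
				continue // component is ready: drop its taint
			}
		}
		out = append(out, t)
	}
	if !ready && !found {
		// Component is not ready and the node is not yet tainted: add it.
		out = append(out, Taint{Key: key, Effect: "NoSchedule"})
	}
	changed := ready == found
	return out, changed
}
```

Returning a `changed` flag keeps the operation idempotent from the caller's point of view: repeated reconciles with the same readiness state produce no further node updates.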

I ran loadgen in my testing cell, and it works without problems.


Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

/hold

mgr.GetClient(),
}

err = ctrl.NewControllerManagedBy(mgr).
Ideally this would be handled by the ASG, so that a newly created node already has the corresponding taint.
However, that may require upgrading the infra CF. The approach here is therefore a supplement: as soon as a workspace node is created, we check it and add the corresponding label and taint.
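
A rough sketch of that on-create check, for illustration only. It assumes workspace nodes can be recognized by a label and that the set of startup taints is fixed; `Node`, `workspaceNodeLabel`, `startupTaintKeys`, and `ensureStartupTaints` are hypothetical names, and label handling is left out.

```go
package main

// Taint and Node are simplified, hypothetical stand-ins for the Kubernetes types.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name   string
	Labels map[string]string
	Taints []Taint
}

// workspaceNodeLabel is a hypothetical label marking workspace nodes.
const workspaceNodeLabel = "example.com/workspace-node"

// startupTaintKeys are hypothetical keys for the taints a fresh workspace
// node should carry until the matching components report ready.
var startupTaintKeys = []string{
	"example.com/ws-daemon-not-ready",
	"example.com/registry-facade-not-ready",
}

func hasTaint(taints []Taint, key string) bool {
	for _, t := range taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

// ensureStartupTaints adds any missing startup taint to a workspace node and
// reports whether the node was modified. Non-workspace nodes are left alone.
func ensureStartupTaints(n *Node) bool {
	if n.Labels[workspaceNodeLabel] != "true" {
		return false
	}
	changed := false
	for _, key := range startupTaintKeys {
		if !hasTaint(n.Taints, key) {
			n.Taints = append(n.Taints, Taint{Key: key, Effect: "NoSchedule"})
			changed = true
		}
	}
	return changed
}
```

In a controller this would run from the node create event handler, followed by an update of the node only when the function reports a change. There is still a short window between node registration and the update in which the node is untainted, which is why having the ASG create nodes with the taint remains the preferred end state.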

@iQQBot iQQBot requested a review from a team as a code owner March 7, 2025 09:05
@kylos101

and the vast majority of them tolerate the NoSchedule taint

Which ones do not tolerate the NoSchedule taint?

@iQQBot

iQQBot commented Mar 11, 2025

A DaemonSet that does not tolerate the taint will keep waiting until the taint is removed.

@kylos101 kylos101 left a comment

Adding ✔️ to unblock.

Please see this question: #20652 (comment)

Depending on which workloads are impacted (metrics, logs), we should consider a follow-on PR for prereqs.

@iQQBot

iQQBot commented Mar 11, 2025

Depending on which workloads are impacted (metrics, logs), we should consider a follow-on PR for prereqs.

Everything you mentioned tolerates the taint; only some of the customized optional add-ons do not.

@iQQBot

iQQBot commented Mar 11, 2025

/unhold

@roboquat roboquat merged commit 52a7727 into main Mar 11, 2025
18 checks passed
@roboquat roboquat deleted the pd/CLC-1032 branch March 11, 2025 07:30
3 participants